Building a File Observatory: Making sense of PDFs in the Wild

Tim Allison, NASA


In this talk, I’ll report on the progress my team at NASA’s Jet Propulsion Laboratory (Caltech) has made towards building a file observatory to support researchers on DARPA’s SafeDocs program. The goal of the SafeDocs program is to develop provably secure parsers for static and streaming file formats, with an initial focus on PDF. For this observatory, we’ve gathered millions of PDFs from Common Crawl and run them through around 20 open-source parsers to assess how error codes and exception handling vary among the tools. We have also extracted the files’ metadata and elements of their underlying structures and made them searchable, to help researchers identify, understand and quantify structural features that are out of alignment with the PDF specification. We believe that some of our tools and techniques will be of use to digital preservationists working to understand file features and risk at scale.


Registration is now closed.
