Coming to “Preserving PDF – identify, validate, repair” in Hamburg?

The OPF is holding a PDF event in Hamburg on 1st-2nd September 2014, where we'll be taking an in-depth look at the PDF format, its sub-flavours like PDF/A, and open source tools that can help. This is a quick list of things you can do to prepare if you're attending and want to get the most out of the event.

Pre-reading

The Wikipedia entry on PDF provides a readable overview of the format's history with some technical details. Adobe provide a brief PDF 101 post that avoids technical detail.

Johan van der Knijff's OPF blog has a few interesting posts on PDF preservation risks:

This MacTech article is still a reasonable introduction to PDF for developers. Finally, if you really want a detailed look you could try the Adobe specification page, but it's heavyweight reading.

Tools

Below are brief details of the main open source tools we'll be working with. It's not essential that you download and install these tools. They all require Java and none of them have user-friendly install procedures. We'll be looking at ways to improve that at the event. We'll also be providing a pre-configured virtual environment so that you can experiment in a friendly, throwaway setting. See the Software section a little further down.

JHOVE

JHOVE is an open source tool that performs format specific identification, characterisation and validation of digital objects. JHOVE can identify and validate PDF files against the PDF specification while extracting technical and descriptive metadata. JHOVE recognises PDFs that state that they conform to the PDF/A profile, but it can't then validate that a PDF conforms to the PDF/A specification.
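The gap between a file that states it is PDF/A and a file that has been validated as PDF/A can be illustrated with a minimal sketch. This is not how JHOVE works internally. It is a rough Python illustration that assumes the PDF header sits within the first 1024 bytes and that the PDF/A claim is stored as an uncompressed, 8-bit-encoded `pdfaid:part` property in the XMP metadata. The function name `sniff_pdf` is hypothetical.

```python
import re

# Matches the header comment that opens a PDF file, e.g. b"%PDF-1.4".
HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")

# Matches the XMP property in which a file declares its PDF/A part,
# written either as an element (<pdfaid:part>1</pdfaid:part>)
# or as an attribute (pdfaid:part="1").
PDFA_PART_RE = re.compile(rb"pdfaid:part\s*(?:>|=\s*[\"'])\s*(\d)")


def sniff_pdf(data: bytes) -> dict:
    """Report what a file claims about itself, without validating anything."""
    # Assumption: the header appears somewhere in the first 1024 bytes.
    header = HEADER_RE.search(data[:1024])
    # Assumption: the XMP packet is stored uncompressed in an 8-bit encoding.
    claim = PDFA_PART_RE.search(data)
    return {
        "is_pdf": header is not None,
        "version": header.group(1).decode("ascii") if header else None,
        "claims_pdfa_part": int(claim.group(1)) if claim else None,
    }


if __name__ == "__main__":
    sample = b"%PDF-1.4\n<x:xmpmeta><pdfaid:part>1</pdfaid:part></x:xmpmeta>"
    print(sniff_pdf(sample))
```

A non-empty `claims_pdfa_part` only means the file says it is PDF/A. Whether it actually meets the PDF/A requirements is a separate question that needs a dedicated validator.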

Apache Tika

The Apache Software Foundation's Tika project is an application / toolkit that can be used to identify and parse many file formats, and to extract metadata and content from them.

Apache PDFBox

Written in Java, Apache PDFBox is an open source library for working with PDF documents. It's primarily aimed at developers but has some basic command line apps. PDFBox also contains a module, with its own command line utility, that verifies PDF/A-1 documents.

These libraries are of particular interest to Java developers, who can incorporate them into their own programs. Apache Tika uses the PDFBox libraries for PDF parsing.

Test Data

These test data sets were chosen because they're freely available. Again it's not necessary to download them before attending but they're good starting points for testing some of the tools or your code:

PDFs from GovDocs selected dataset

The original GovDocs corpus is a test set of nearly 1 million files and is nearly half a terabyte in size. David Tarrant reduced the corpus by removing similar items, as described in this post. The remaining data set is still large at around 17GB and can be downloaded here.

Isartor PDF/A test suite

The Isartor test suite is published by the PDF Association's PDF/A competency centre. In their own words:

This test suite comprises a set of files which can be used to check the conformance of software regarding the PDF/A-1 standard. More precisely, the Isartor test suite can be used to “validate the validators”: It deliberately violates the requirements of PDF/A-1 in a systematic way in order to check whether PDF/A-1 validation software actually finds the violations.

More information about the suite can be found on the PDF Association's website along with a download link.

PDFs from OPF format corpus

The OPF has a GitHub repository where members can upload files that represent preservation risks / problems. This has a couple of sub-collections of PDFs: these show problem PDFs from the GovDocs corpus, and this is a collection of PDFs with features that are "undesirable" in an archive setting.

Software

If you'd like the chance to get hands-on with the software tools at the event and try some interactive demonstrations / exercises, we'll be providing lightweight virtualised demonstration environments using VirtualBox and Vagrant. It's not essential that you install the software to take part, but it does offer the best way to try things for yourself, particularly if you're not a techie. Both are available for Windows, Mac, and Linux and should run on most people's laptops. Download links are shown below.

Vagrant downloads page:

Oracle VirtualBox downloads page:

Be sure to install the VirtualBox extensions as well. It's the same download for all platforms.

What next?

I'll be writing another post for Monday 18th August that will take a look at using some of the tools and test data together with a brief analysis of the results. This will be accompanied by a demonstration virtual environment that you can use to repeat the tests and experiment yourself.

1 Comment

  1. johan
    August 12, 2014 @ 10:56 am CEST

    Small addition to the test data listed in the blog post: there are some really interesting test files (and lots of them as well) over at the Adobe Acrobat Engineering site. Check out the subcategories of the PDF Test Suites page:

    http://acroeng.adobe.com/wp/?page_id=10
