Coming to “Preserving PDF – identify, validate, repair” in Hamburg?
The OPF is holding a PDF event in Hamburg on 1st-2nd September 2014 where we'll be taking an in-depth look at the PDF format, its sub-flavours like PDF/A, and open source tools that can help. This is a quick list of things you can do to prepare if you're attending and want to get the most out of the event.
Johan van der Knijff's OPF blog has a few interesting posts on PDF preservation risks:
- PDF – Inventory of long-term preservation risks links to a report on the same subject. The report is written from a preservation point of view and, although Johan admits it's incomplete (see the blog post), it's still a good overview of the format and its associated preservation issues.
- Identification of PDF preservation risks with Apache Preflight: a first impression examines the use of Apache PDFBox's Preflight module to detect preservation risks. Again, it links to a report written as part of the SCAPE project. PDFBox is one of the tools we'll be looking at in Hamburg; see the Tools section later in this post.
Below are brief details of the main open source tools we'll be working with. It's not essential that you download and install these tools. They all require Java and none of them has a user-friendly install procedure. We'll be looking at ways to improve that at the event. We'll also be providing a pre-configured virtual environment so you can experiment in a friendly, throwaway setting. See the Software section a little further down.
JHOVE is an open source tool that performs format specific identification, characterisation and validation of digital objects. JHOVE can identify and validate PDF files against the PDF specification while extracting technical and descriptive metadata. JHOVE recognises PDFs that state that they conform to the PDF/A profile, but it can't then validate that a PDF conforms to the PDF/A specification.
- Official Website: JHOVE website on SourceForge
- Licensing: LGPL v2.1
- Version: v1.11 released 09/2013
- Download: SourceForge
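To make the difference between "states that it conforms" and "actually conforms" concrete, here is a minimal Python sketch. It is not JHOVE's implementation, and the function name `claimed_pdfa_level` is made up for illustration. It assumes the PDF/A claim is carried in the file's XMP metadata as `pdfaid:part` and `pdfaid:conformance`, and that the metadata packet is stored uncompressed so a plain byte search can find it.

```python
import re

# The XMP claim can be serialised as an element (<pdfaid:part>1</pdfaid:part>)
# or as an attribute (pdfaid:part="1"); these patterns accept both forms.
_PART = re.compile(rb"pdfaid:part\s*(?:>|=\s*[\"'])\s*(\d)")
_CONF = re.compile(rb"pdfaid:conformance\s*(?:>|=\s*[\"'])\s*([A-Za-z])")


def claimed_pdfa_level(data: bytes):
    """Return the PDF/A level the file claims (e.g. '1b'), or None.

    This only reads the claim. It performs no validation at all.
    """
    part = _PART.search(data)
    if part is None:
        return None
    level = part.group(1).decode("ascii")
    conf = _CONF.search(data)
    if conf is not None:
        level += conf.group(1).decode("ascii").lower()
    return level


if __name__ == "__main__":
    sample = (
        b"%PDF-1.4\n"
        b"<pdfaid:part>1</pdfaid:part>"
        b"<pdfaid:conformance>B</pdfaid:conformance>"
    )
    print(claimed_pdfa_level(sample))
```

A file that returns a level here has only declared itself to be PDF/A. Checking that the declaration is true needs a dedicated PDF/A validator.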
The Apache Software Foundation's Tika project is an application / toolkit that can be used to identify and parse many file formats and to extract their metadata and content.
- Official Website: Apache Tika Home
- Licensing: Apache License v2.0
- Version: v1.5 released 02/2014
- Download: Apache Tika download page
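As a rough illustration of what identification means in practice, the Python sketch below matches the first bytes of a file against a small table of well-known signatures. It is not Tika's code or API: the `SIGNATURES` table and the `identify` function are assumptions made for this example, and real tools use much larger signature databases plus other hints such as file names.

```python
# A tiny, hand-picked signature table (illustrative only).
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),
]


def identify(data: bytes) -> str:
    """Guess a MIME type from the leading bytes of a file."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    # Fall back to the generic binary type when nothing matches.
    return "application/octet-stream"


if __name__ == "__main__":
    print(identify(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"))
    print(identify(b"plain text"))
```

The sketch is deliberately strict and expects the signature at byte zero. Some PDFs found in the wild carry junk before the `%PDF-` header, which is one of the reasons identification results can differ between tools.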
Written in Java, Apache PDFBox is an open source library for working with PDF documents. It's primarily aimed at developers but has some basic command line apps. PDFBox also contains a module, Preflight, that verifies PDF/A-1 documents and comes with its own command line utility.
These libraries are of particular interest to Java developers, who can incorporate them into their own programs. Apache Tika uses the PDFBox libraries for PDF parsing.
- Official Website: Apache PDFBox Home
- Licensing: Apache License v2.0
- Version: v1.8.6 released 06/2014
- Download: Apache PDFBox download page
These test data sets were chosen because they're freely available. Again, it's not necessary to download them before attending, but they're good starting points for testing some of the tools or your own code:
PDFs from GovDocs selected dataset
The original GovDocs corpus is a test set of nearly 1 million files and is nearly half a terabyte in size. David Tarrant reduced the corpus by removing similar items, as described in this post. The remaining data set is still large at around 17GB and can be downloaded here.
Isartor PDF/A test suite
The Isartor test suite is published by the PDF Association's PDF/A Competence Center. In their own words:
This test suite comprises a set of files which can be used to check the conformance of software regarding the PDF/A-1 standard. More precisely, the Isartor test suite can be used to “validate the validators”: It deliberately violates the requirements of PDF/A-1 in a systematic way in order to check whether PDF/A-1 validation software actually finds the violations.
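The "validate the validators" idea can be sketched in a few lines of Python. Because every file in such a suite is deliberately invalid, any file a validator accepts is a violation it missed. The function name `missed_violations`, the file names and the toy validator below are all hypothetical and stand in for whichever real tool and files you want to test.

```python
from typing import Callable, Iterable, List


def missed_violations(
    files: Iterable[str], is_valid: Callable[[str], bool]
) -> List[str]:
    """Return the known-bad files that the validator wrongly accepted."""
    return [name for name in files if is_valid(name)]


if __name__ == "__main__":
    # Hypothetical file names; a real run would list the suite's PDFs.
    suite = ["bad-header.pdf", "bad-fonts.pdf", "bad-metadata.pdf"]

    # A toy validator that only notices header problems, so it
    # accepts everything else (which is exactly the failure we want to expose).
    def toy_validator(name: str) -> bool:
        return "header" not in name

    for name in missed_violations(suite, toy_validator):
        print("validator missed:", name)
```

In a real test, `is_valid` would wrap a call to the validator under examination and return whether it reported the file as conforming.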
PDFs from OPF format corpus
The OPF has a GitHub repository where members can upload files that represent preservation risks / problems. This has a couple of sub-collections of PDFs: one shows problem PDFs from the GovDocs corpus and the other is a collection of PDFs with features that are "undesirable" in an archive setting.
If you'd like the chance to get hands-on with the software tools at the event and try some interactive demonstrations / exercises, we'll be providing lightweight virtualised demonstration environments using VirtualBox and Vagrant. It's not essential that you install the software to take part, but it does offer the best way to try things for yourself, particularly if you're not a techie. Both are available for Windows, Mac, and Linux and should run on most people's laptops. Download links are shown below.
Be sure to install the VirtualBox extensions as well; it's the same download for all platforms.
I'll be writing another post for Monday 18th August that will take a look at using some of the tools and test data together with a brief analysis of the results. This will be accompanied by a demonstration virtual environment that you can use to repeat the tests and experiment yourself.