PDF Eh? – Another Hackathon Tale

“Characterization” can mean many things (I’m particularly fond, especially in this context, of the OED’s “creation of a fictitious character or fictitious characters”). Back in October Paul Wheatley suggested that digital preservation practitioners needed “better characterisation” and defined this as enabling them to determine the condition, content and value of digital records prior to ingest (computer-aided appraisal if you will). To this end SPRUCE organised a Unified Characterisation Hackathon with the intent of “unifying our community’s approach to characterisation by coordinating existing toolsets and improving their capabilities”. With FITS, DROID and JHOVE2 all well represented this promised to be a good event, and good it was!

After the usual chaos of any open agenda event, we quickly settled into groups covering four issues: Integrating Tika and C3PO, Integrating Tika and FITS, comparing and contrasting Tika’s and DROID’s approaches to file format signature definition, and identifying preservation risks in PDFs. I worked on this last one in the company of some great people who may well have their own takes on how it went, but here are my thoughts.

We began with a discussion about possible risks for PDFs. The answer wasn’t simple. For example, an encrypted PDF may be at risk, but Andy Jackson pointed out that some encrypted PDFs have empty passwords and argued that this was a safe kind of encryption (assuming people remember to try empty passwords when opening them). Tangled into this discussion was the question of validity. It was suggested that the PDF/A specification could be used as a checklist for PDF preservation risks. However, a simple conformance check – valid or invalid – was not enough. The specification might disallow things that we, as the keepers of that content, have decided are risks we are willing to take. Some issues may be too expensive to solve, may require an even riskier migration, or may be subject to external factors (commercial agreements, for example) that determine their importance. In short, we needed a tool that would not just answer with a Boolean, but would empower the user to make an informed choice about their content.

Johan van der Knijff steered the group towards Apache’s PDFBox and its Preflight component – a tool that tests PDFs for conformance to the PDF/A-1b specification (and by implication the PDF specification). It seemed a good starting point and Will Palmer quickly set to work. Very soon we had Preflight built, running and outputting XML instead of its usual unstructured text (see some example outputs and the changes made). Each divergence from the spec was interpreted as a possible preservation risk to be reported to the user. Will has since contacted the PDFBox developers to get this patch included in the Preflight release.

Armed with an XML statement showing which parts of the PDF/A specification a given document failed on, we now needed a way to let the user say which of these were of interest and to present only those in the final report. Sheila Morrissey came up with an elegant solution. She created a policy file that lets practitioners say, for each of the errors Preflight reports, how it should be treated: flagged as a warning, ignored completely, or counted as failing the entire validation process. She then defined an XSLT stylesheet that filters Preflight’s output to create a report showing only those risks.
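To make the policy idea concrete, here is a minimal sketch of the same filtering step written in Python rather than XSLT. It is not Sheila's stylesheet: the report layout (`preflight` and `error` elements with a `code` attribute), the error codes, the `POLICY` dictionary and the `apply_policy` function are all assumptions made up for illustration, and the real patched Preflight output may look quite different.

```python
import xml.etree.ElementTree as ET

# Hypothetical policy: error code -> action ("ignore", "warn" or "fail").
# The codes and their meanings are invented for this sketch.
POLICY = {
    "1.0": "fail",     # assumed: syntax error, reject the file
    "2.4.3": "warn",   # assumed: colour space issue, report only
    "7.1": "ignore",   # assumed: metadata issue, a risk we accept
}
DEFAULT_ACTION = "warn"  # errors not named in the policy are reported

# Hypothetical report layout, standing in for the XML-enabled Preflight output.
SAMPLE_REPORT = """
<preflight file="example.pdf">
  <error code="2.4.3">Invalid colour space</error>
  <error code="7.1">Missing XMP metadata</error>
  <error code="9.9">Unlisted problem</error>
</preflight>
"""


def apply_policy(report_xml, policy, default=DEFAULT_ACTION):
    """Filter a validation report so only policy-relevant errors remain."""
    root = ET.fromstring(report_xml)
    warnings, failures = [], []
    for error in root.iter("error"):
        code = error.get("code")
        entry = (code, (error.text or "").strip())
        action = policy.get(code, default)
        if action == "fail":
            failures.append(entry)
        elif action == "warn":
            warnings.append(entry)
        # Any other action (i.e. "ignore") drops the error from the report.
    return {"passed": not failures, "warnings": warnings, "failures": failures}


if __name__ == "__main__":
    print(apply_policy(SAMPLE_REPORT, POLICY))
```

The point of the three-way outcome is the one made above: the file is not simply valid or invalid, and the institution decides which divergences matter.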

To show how this could all fit together we then created a GUI – PDF Eh? (a name I'm particularly proud of!) – that enables a user to run Preflight over a directory of files, applies the policy via the XSL transform and reports the results. Lynn Marwood created a parser for the rules file to enable the end-user to define their policy in the UI and re-run validation tests. Currently these rules are displayed but turning them on or off does not change the output. The GUI runs the validation too, so it is slow to respond when called on large directories. While a basic proof-of-concept, this code provides a useful framework and shows that with a few tweaks of an existing tool and some neat XSLT we can identify preservation risks and incorporate policies very succinctly.

Code for the GUI including the XSL is on GitHub.

We also felt this approach could be used in other contexts – a SCAPE component or a plug-in for (insert favourite preservation/repository solution). To prove this point Maurice de Rooij integrated the XML-enabled Preflight with FIDO and used it to show how a validation check can augment file identification, particularly in the case where the magic is intact but the file is otherwise broken (streams of zeros after the end-of-file marker, for instance). While this extra step may add time and complexity to file ID via magic numbers, Maurice demonstrated that further characterisation can help provide a definitive answer for awkward edge cases.
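The edge case mentioned above is easy to picture with a small sketch. This is not how FIDO or Preflight work internally; it is a hypothetical `check_pdf_bytes` helper that looks for the `%PDF-` header and the `%%EOF` marker and then inspects whatever follows the last marker. Treating anything other than spaces, tabs and line endings after the marker as suspicious is an assumption of this sketch, not a rule taken from either tool.

```python
def check_pdf_bytes(data):
    """Return a list of problems found; an empty list means none detected."""
    problems = []

    # The "magic": a PDF is expected to start with this header.
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")

    # Find the last end-of-file marker and look at what follows it.
    marker = b"%%EOF"
    eof_at = data.rfind(marker)
    if eof_at == -1:
        problems.append("missing %%EOF marker")
    else:
        trailing = data[eof_at + len(marker):]
        # Assumption: only spaces, tabs and line endings are acceptable here.
        if trailing.strip(b"\r\n \t"):
            problems.append(
                f"{len(trailing)} bytes after %%EOF, not all whitespace"
            )
    return problems


if __name__ == "__main__":
    good = b"%PDF-1.4\n(body omitted)\n%%EOF\n"
    padded = good + b"\x00" * 16  # magic intact, zeros after the marker
    print(check_pdf_bytes(good))
    print(check_pdf_bytes(padded))
```

A magic-number check alone would accept both byte strings; only the second pass notices the padding.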

As an aside, using XML output raised the question of what we should use for the tags. I’m not sure we ever came to an answer. Maurice de Rooij argued for XCDL; I wondered about aligning with jpylyzer’s XML output. Standardising could make it easier to create further processing tools (comparisons for quality assurance, for example).

It was a good couple of days. We talked to other people, learnt new things and shared approaches, and if those aren’t the first steps to a unified approach I don’t know what is! My only regret was getting so caught up that I didn’t spend more time in the other groups, digging into Tika or FITS or DROID or C3PO. But there will be other days.

