PDF/A and Long Term Preservation

I wasn’t at iPres this year, where the Foundation was represented by our new Executive Director, Martin Wrigley. One paper I was interested in was “PDF/A considered harmful for digital preservation” by Marco Klindt of Zuse Institute Berlin (ZIB). If you’re interested in PDF/A and haven’t read it, you should: it’s well researched and presents a good argument. PDF/A is not a standard you’d have designed with long-term preservation in mind. It’s also significantly more complex than necessary in many cases. For all its complexity, though, PDF/A is a restricted subset of the PDF standard and does improve the outlook for long-term preservation. Ensuring that fonts are embedded, images are in specific formats and metadata is consistent increases the likelihood that people will be able to access a document in the future.

I have to declare an interest insofar as the Open Preservation Foundation is part of the veraPDF consortium. Much of the last three years of my life has been spent working on veraPDF, an open-source validator for the PDF/A standards developed as part of PREFORMA. I’m an advocate of open-source reference validators rather than of the PDF format itself. While we’re really happy with the veraPDF software, Marco’s article makes the limitations of PDF/A validation clear. veraPDF doesn’t perform accessibility tests, nor does it attempt any semantic validation of a document or its tags. To be clear, I think that PDF/A validation is still an important part of any preservation workflow that processes the format. It’s even more useful when carried out by content creators, who are in a better position to address any issues highlighted by validation.

Regardless of opinions regarding the format, a major consideration for memory institutions with a mandate for preservation is pragmatism. Governmental and commercial organisations currently make wide use of the PDF format and there seems little prospect of that changing in the short to medium term. It’s often the case that an institution cannot dictate the formats it must collect and preserve. Then there’s the question of an institution’s existing holdings. The reality is that many already have large collections of PDFs and they’ll continue to receive PDF and PDF/A submissions for the foreseeable future.

Another reality for these institutions is that they’re often asked to preserve material received long after its creation. Web harvests and national archives are both good examples. In these cases, it’s not possible to get producers to fix problems with submissions. Often the best that can be done is to validate the documents, then record and analyse any failures. Even this can present a challenge: I confess that I don’t understand every check failure message produced by veraPDF. The Foundation is working on ways to help our software users interpret the results returned and understand the preservation implications of those results. We’ve already published a linkable Wiki of veraPDF error codes and messages. The next JHOVE release will include error message IDs and a similar error message Wiki. Watch this space.
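
As a rough illustration of what recording and analysing failures might look like in practice, here’s a minimal Python sketch. It assumes the validation results have already been reduced to simple (file name, rule ID) pairs; the file names and rule IDs below are invented placeholders, not real veraPDF output or identifiers, and the `summarise` helper is hypothetical.

```python
from collections import Counter

# Hypothetical failure records: (file name, failed rule ID) pairs.
# The rule IDs are invented placeholders, not real veraPDF identifiers.
failures = [
    ("report-2001.pdf", "rule-A"),
    ("report-2001.pdf", "rule-B"),
    ("minutes-1999.pdf", "rule-A"),
    ("survey-2005.pdf", "rule-A"),
]


def summarise(records):
    """Count how many distinct files fail each rule."""
    files_per_rule = {}
    for filename, rule_id in records:
        # A set ensures a file is counted once per rule, however often it fails it.
        files_per_rule.setdefault(rule_id, set()).add(filename)
    return Counter({rule: len(files) for rule, files in files_per_rule.items()})


summary = summarise(failures)

# List the rules affecting the most files first.
for rule_id, count in summary.most_common():
    print(f"{rule_id}: {count} file(s)")
```

Ranking rules by the number of files they affect is only one way to slice the results, but it gives a collection manager a starting point for deciding which failures are worth understanding first.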
