In this blog post I'll be dusting off some old stuff for a change. The occasion for this is the following question, posted by Paul Wheatley on the Libraries and Information Science Stack Exchange website a few days ago:
This reminded me of a report I wrote on this very subject back in 2009. (Incidentally this was my very first foray into the wacky world of digital preservation, but that's another story.) Originally this document was intended for internal use at the KB, but looking at it again, I think it may be of interest to a wider audience. It also aligns quite nicely with the upcoming work on a knowledge base of file-format-related risks that will be done as part of the SCAPE project. The main idea here is to take a file format, identify its main (preservation-related) risks, and describe how "risky" features can be detected by existing (characterisation) tools. In fact I was envisaging something along these lines when I wrote the PDF report in 2009, but other things got in the way, and I never got round to the final step. The SCAPE work should finally make this happen.
Although the work on the knowledge base is still in its early stages, some first results can be found here. The initial focus will be on JPEG 2000 (JP2/JPX) and PDF.
As for the report, I should add that some of it is a little rough around the edges, and you may note some gaps and not-quite-finished bits. This is also why we never released it the first time around. Also, one aspect that is not well covered is PDF's potential for transmitting viruses and other malware. Nevertheless, as a general introduction to the format and an overview of its main risks I think it's not too shabby, but I'll let you be the judge of that! As always, feel free to use the comment fields for your feedback and suggestions.
Link to report
Johan van der Knijff
KB / National Library of the Netherlands