In this blog post I'll be dusting off some old stuff for a change. The occasion for this is the following question, posted by Paul Wheatley on the Libraries and Information Science Stack Exchange website a few days ago:
What preservation risks are associated with the PDF file format?
This reminded me of a report I wrote on this very subject back in 2009. (Incidentally this was my very first foray into the wacky world of digital preservation, but that's another story.) Originally this document was intended for internal use at the KB, but looking at it again, I think it may be of interest to a wider audience. It also aligns quite nicely with the upcoming work on a knowledge base of file-format related risks that will be done as part of the SCAPE project. The main idea here is to take a file format, identify its main (preservation-related) risks, and describe how "risky" features can be detected by existing (characterisation) tools. In fact I was envisaging something along these lines when I wrote the PDF report in 2009, but other things got in the way, and I never got round to the final step. The SCAPE work should finally make this happen.
Although the work on the knowledge base is still in its early stages, some very first results can be found here. The initial focus will be on JPEG 2000 (JP2/JPX) and PDF.
As for the report, I should add that some of it is a little rough around the edges, and you may note some gaps and not-quite-finished bits. This is also why we never released it the first time around. Also, one aspect that is not well covered is PDF's potential for transmitting viruses and other malware. Nevertheless, as a general introduction to the format and an overview of its main risks I think it's not too shabby, but I'll let you be the judge of that! As always, feel free to use the comment fields for your feedback and suggestions.
Link to report
Johan van der Knijff
KB / National Library of the Netherlands
johan
July 30, 2012 @ 1:34 pm CEST
Hi Euan,
First of all, I largely agree with your proposed definition of preservation risks, although I’m a bit wary of restricting the scope of preservation to just rendering. The reason for this is that it implies that the objective of preservation is restricted to being able to generate a visual representation of a file object (e.g. display the pages of a PDF on-screen). This doesn’t cover additional functionality, such as being able to do text searches, or re-using content (e.g. simple copy/paste operations on selected text and images). Moreover, not all types of digital objects are actually designed to “render” at all; many kinds of scientific research data are an example. As this is worth a discussion in its own right, I’ll limit myself to PDF here. For the sake of convenience, for the remainder of my response I will also restrict myself to the ability to render a PDF, so we can largely ignore that discussion!
Main points
The central point of your response is that the report describes the structure of PDF format-standard components, but that these components don’t affect the ability to render content (so it’s not really about risks at all).
First of all, you’re right in that the report gives a description of the general structure of PDF as well as of a selected number of components that make up the standard. These components were all selected because they are associated with one or more preservation-related risks. So it was certainly not my intention to provide an exhaustive description of every single feature in PDF!
Second, much of your response appears to be based on the assumption that as long as a PDF is compatible with the standard, a compliant reader will render it correctly. I think this view is a bit overly optimistic. In the following sections I will try to illustrate this.
Complexity and size of standard
The first problem here is that the PDF specification is huge: well over 700 pages for the ISO 32000-1 version. Not all readers support every single feature of the specification, and to make matters worse some features are described in rather vague terms. A good example of this is the format’s support of multimedia features. As I pointed out in the report, especially the older PDF specifications (before version 1.5) were pretty non-specific about the exact nature of embedded multimedia content. Essentially they just described a general mechanism for embedding audio and video, without saying anything about the actual format or encoding. This means that pretty much anything was allowed, which I think implies a real risk in case of any archaic and/or exotic embedded video formats (because you cannot expect a viewer to handle just any format).
Rendering: environment matters as well
Secondly, rendering doesn’t solely depend on the reader software, but also on the environment. Fonts are a good example of this. If a PDF uses fonts that are not embedded, its appearance/rendering may be very different from its intended appearance. This has nothing to do with conformance to the standard (as the standard doesn’t oblige you to embed fonts). And I have a (slightly embarrassing) example to demonstrate this as well. A few years ago I did an internal presentation on my PDF report for some publishers who were visiting the KB at the time. Unbeknownst to myself, a colleague then took the original PowerPoint file, converted it to PDF and put it on the web. Here’s the link:
http://www.kb.nl/hvp/congressen/e-depot2009/uitgeversdag16112009JohanvanderKnijff.pdf
However, my colleague (to be honest I don’t even know who this was!) didn’t use any of the font embedding options. If I open the file in Acrobat on my Windows PC (on which I created the original PowerPoint) the first slide looks like this:
Here’s the same slide on my other machine, which is running Ubuntu (using Acrobat reader):
Now where are these differences coming from? Very simple: the PDF uses some default Windows fonts. Since these fonts are not included with Ubuntu, they are substituted by an alternative font. This doesn’t even have anything to do with the reader software, but is simply related to the environment (OS). I think you’ll agree with me that the ability to render the content is affected by the use of the font embedding options!
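To make the font check a bit more concrete, here is a minimal Python sketch of how non-embedded fonts could be flagged. It is only an illustration under stated assumptions: the font dictionaries are assumed to have been extracted already by some PDF parser and handed over as plain Python dicts with indirect references resolved, and the function names `is_embedded` and `fonts_at_risk` are made up for this example. The one thing taken from the PDF specification is that an embedded font program is referenced from the font descriptor through the `/FontFile`, `/FontFile2` or `/FontFile3` key.

```python
# Keys under which a font descriptor references an embedded font program.
FONT_PROGRAM_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def is_embedded(font: dict) -> bool:
    """Return True if the font carries its own glyph data inside the PDF.

    `font` is assumed to be a font dictionary already parsed into a plain
    dict, with indirect references resolved (an assumption of this sketch).
    """
    subtype = font.get("/Subtype")
    if subtype == "/Type3":
        # Type 3 glyphs are defined by content streams stored in the PDF.
        return True
    if subtype == "/Type0":
        # Composite fonts delegate to their descendant fonts.
        descendants = font.get("/DescendantFonts", [])
        return bool(descendants) and all(is_embedded(d) for d in descendants)
    # Simple fonts: look for a font program in the descriptor (if any).
    descriptor = font.get("/FontDescriptor") or {}
    return any(key in descriptor for key in FONT_PROGRAM_KEYS)


def fonts_at_risk(fonts: list) -> list:
    """Return the sorted names of all fonts that are not embedded."""
    names = {f.get("/BaseFont", "(unnamed)") for f in fonts if not is_embedded(f)}
    return sorted(names)


if __name__ == "__main__":
    # Hypothetical input, made up for illustration.
    example = [
        {"/Subtype": "/TrueType", "/BaseFont": "/Arial",
         "/FontDescriptor": {"/FontName": "/Arial"}},
        {"/Subtype": "/TrueType", "/BaseFont": "/ABCDEF+Calibri",
         "/FontDescriptor": {"/FontFile2": b"...font program..."}},
    ]
    print(fonts_at_risk(example))
```

Any font this reports depends on whatever the rendering environment happens to have installed, which is exactly the situation in the slide example above.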
Fonts aside, PDFs may also contain a host of other external dependencies (which are addressed in the report), all of which have some relevance to preservation.
Passwords
In your response you do acknowledge that passwords are of some possible concern, but you then say:
I’m really a bit puzzled by this: it assumes that you know beforehand which PDFs are password-protected (but maybe you already check this upon ingest?), that you know the password (and that it is somehow available to the user or some access application), that you know what was the creator application, and that you also have a functioning copy of it up and running. That would be pretty impressive, but it makes me wonder how close you really are to this “ideal” situation?
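As a rough illustration of what a check upon ingest could look like, here is a minimal Python sketch. It relies on the fact that an encrypted PDF references an encryption dictionary through an `/Encrypt` entry in its trailer; everything else is an assumption of this sketch, including the function name `looks_encrypted` and the crude approach of scanning the raw bytes instead of parsing the file properly. A real characterisation tool should be preferred for anything beyond a first screening.

```python
import re

# Match "/Encrypt" as a complete name, so that the related key
# "/EncryptMetadata" does not trigger a match.
ENCRYPT_KEY = re.compile(rb"/Encrypt(?![A-Za-z0-9])")


def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristically report whether the raw PDF bytes mention /Encrypt.

    This is a screening heuristic, not a parser: it can give a false
    positive if the name happens to occur elsewhere in the file.
    """
    return ENCRYPT_KEY.search(pdf_bytes) is not None


if __name__ == "__main__":
    # Hypothetical trailer fragments, made up for illustration.
    print(looks_encrypted(b"trailer\n<< /Size 6 /Root 1 0 R /Encrypt 5 0 R >>"))
    print(looks_encrypted(b"trailer\n<< /Size 6 /Root 1 0 R >>"))
```

Note that this only tells you that some form of encryption is present. It does not tell you whether a password is needed to open the file or whether only certain operations are restricted, and it certainly does not give you the password.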
I hope I managed to answer your questions to some degree. There’s much more to say about this, but the length of this reply is getting ridiculous, so for the moment I’ll leave it at this!
Cheers,
Johan
ecochrane
September 6, 2012 @ 2:39 pm CEST
Hi again Johan,
Sorry it’s taken so long to follow up on this. I finally found some time so here I go:
Covering your first paragraph: I have recently come to dislike the term “rendering” partly because of your comments. I think I should have said “interact with” or something similar. That should cover both static objects where the interacting is just reading or listening, and more dynamic objects that respond to active or passive inputs.
In response to most of the rest, you started by assuming that I was making an assumption:
“Second, much of your response appears to be based on the assumption that as long as a PDF is compatible with the standard, a compliant reader will render it correctly. I think this view is a bit overly optimistic. In the following sections I will try to illustrate this.”
Sorry, I didn’t communicate well; that was not what I was trying to say. What I was assuming was that as long as we can recreate a representative version of the software environment that we know was compatible with the object (and we know there was one, as otherwise the object would never have been useful) and maintain it indefinitely, then the particular components of the pdf standard that are used in the file are really unimportant from a risk perspective.
At some stage we had a software environment that was used to interact with the contents of the file. The file was always compatible/compliant with that environment. It is somewhat irrelevant how closely the file adheres to the PDF standard in that context as the standard doesn’t matter once/as long as we have the file and its interaction environment. For most pdfs, that environment is usually a standard pc desktop from the era with the most recent or second-to-most-recent version of acrobat reader installed on it.
You bring up the important point of dependencies that are part of the wider environment. In your example you referred to the issue with needing fonts from a particular operating system. This is definitely a challenge but it does allow for some speculation on the concept I referred to but did not discuss above: a “representative” interaction environment.
My idea here is that in many contexts the most a digital preservation practitioner should be expected to provide is a few representative interaction environments for interacting with digital objects. By these I mean environments that represent typical environments from the era in which the objects were being interacted with and exchanged. The reason why I suggest all that is needed is a “representative” environment (or a few) and not the actual original is because that is all that we expect from users now in most cases. In any office users may be interacting with objects using many different configurations of hardware and software but that is assumed by everyone creating and disseminating digital objects. There is no expectation of a particular environment being used to interact with the objects and there is an expectation that there may be minor variations in experiences as a result. When there is an expectation that specific environmental components should be used for interacting with objects this is usually specified somewhere (e.g. best viewed using acrobat reader 6.0 etc). For these reasons digital preservation practitioners should not be expected (in most cases) to provide the original environment and one (or a few) representative environments should be ok.
Getting back to your example with the fonts, I suspect that if future users were given the ability to interact with that pdf using three representative environments covering the standard contemporary operating systems (i.e. a standard windows, linux and OSX environment) then you would probably be comfortable with the idea that the object/experience had been preserved (but I’m putting words in your mouth now!).
There may be cases where a close copy of the original environment is needed, and as a community we need to develop more tools to enable easy identification of those cases. But I think that what is more important is to have some tools that enable us to identify the original creating environments for files. The reason being that the original creating environment is most often a good starting point for identifying a representative interaction environment.
Addressing the questions about my comments on the passwords: again, I was not clear, sorry. Basically all I was trying to say was that if we can provide the ability to interact with the object then there is no preservation risk. If the object then also has a password on it then that could be argued to be outside of the preservation scope and merely another artifact of the object. It would then be up to someone else to be aware of that and to document the password somewhere. This is a very tight understanding of digital preservation though and is a bit disingenuous so I think I should concede to your criticism here.
ecochrane
July 29, 2012 @ 12:13 pm CEST
Hi Johan,
I am still perplexed by the idea of risks in digital preservation and unfortunately your report has confused me even more.
Wikipedia defines risk as follows:
“Risk is the potential that a chosen action or activity (including the choice of inaction) will lead to a loss (an undesirable outcome). The notion implies that a choice having an influence on the outcome exists (or existed). Potential losses themselves may also be called “risks”. Almost any human endeavor carries some risk, but some are much more risky than others.”
Now, assuming our objective in digital preservation is maintaining the ability to render content across time (which you may well disagree with and if so then ignore the rest of this comment) then as far as I can tell (and the purpose of this comment is to hopefully be corrected) the risks that exist in digital preservation are those things that increase the probability that we will be unable to render content at a particular point in time in the future.
Your report is really interesting and valuable, but I’d suggest it’s been misnamed. It seems to be a description of the structure of various PDF format-standard components rather than an inventory of risks.
These components shouldn’t affect our ability to render content*: either the software we associate with PDF files can render them all or it is not compatible with the standard and shouldn’t be associated with those files. If a particular file uses particular components and not others it shouldn’t matter so long as the software associated with them for rendering is compatible with all aspects of those files.
The components described in your excellent report may reduce our ability to migrate content from files with those components however that is not necessarily the outcome we are trying to achieve. Assuming we are trying to preserve our ability to render the content from pdf files then this report seems to show that using migration to do so for PDFs will be very complicated and potentially impossible to do on a large scale while maintaining verifiable content integrity and that therefore we should look for other options.
In other words the description of these components of PDFs as ‘risks’ seems to assume migration as the strategy we will employ to maintain our ability to render the content in the pdfs, as it is only when assuming migration that the existence of such components in files might affect our ability to render the content in the files in the future (by making the content difficult to migrate out to new technology). On the other hand if we employ emulation and use the original rendering software to render the content from files forever, then these components will never constitute risks to that ability*. The reason being that if we can assume the files could ever be rendered with that original software then the existence or not of those components should not change our ability to use the same software to render them in the future (other things might of course, just not most of the things identified in the report).
So, hopefully I made some sense, and as I said, if you disagree with those assumptions above then the rest won’t be very worthwhile. However if you do disagree perhaps you could counter with your understanding of what a risk is and how the things outlined in your report are examples of those and what you consider our purpose to be in this digital preservation endeavour.
Also, sorry if this has come across as particularly pedantic and simply an issue of semantics. It definitely is an issue of semantics however it seems to be quite an important one that is at the heart of what we are trying to achieve in digital preservation.
Thanks,
Euan
*There is a possible exception in the use of passwords and other restrictions in pdfs however these shouldn’t necessarily be considered risks to our ability to render content as using the original software we will still be able to render the content but will then have to also ensure we have the passwords etc.