The recent announcements about file format registries have prompted some excellent discussion between my good colleagues Andy Jackson, Chris Rusbridge and Andrew Wilson. This is welcome, as I think we really need to debate our way through some pretty complex issues if we’re to work out how we solve this thing we call digital preservation. In my last blog post I questioned whether we really know what problems we’re trying to solve with format registries. It’s all very well shouting about it, but I thought I’d better have a go at working out what those problems might actually be.
In his latest post, Chris used the principles of object oriented computing to help describe the fundamentals that we need in order to make sense of our data now and in the future. Comments from Andrew and Andy expanded this a bit further into the performance space and began to tackle the somewhat fuzzy relationship between the information stored up in a file and the meaning or intellectual content that emerges at the point of rendering or use. In other words, the combination of data and software that results in a performance for the user. Chris concludes that we’ve not focussed sufficiently on the “methods” that help us do this.
I’m going to take a slightly different approach, so here’s an (almost) real life use case I’d like to work through. It’s not entirely dissimilar to some unpublished work I did with former colleagues at the BL. Yes it’s one file, one format and one use case, so it’s not particularly representative. But it’ll have to do for now. Here we go: A new PDF file shows up at our all-singing all-dancing long-term-proofed digital repository. What do we want to do to ensure we can serve this PDF to our repository users now, and in the future?
Let’s start with the preservation challenges, and I’ll rank them in order of risk scariness(TM). First = biggest risk:
- Is the PDF bit perfect in relation to how it was first created?
- Is the PDF password or DRM protected, preventing some or all modes of rendering/re-use?
- Does the PDF depend on one or more resources external to the file itself (e.g. an obscure but critical font)?
- Is the PDF created by a particular application that results in bad PDFs that don’t render properly or are missing some critical information?
- Does the PDF render with a double click on the PCs in the repository reading room (or some definition of a “typical user’s computer”)?
- Are we confident of rendering the PDF in 100 years’ time?
I’d like to neatly (cowardly?) side-step the issue of whether good old fashioned file format obsolescence is really a major worry, and instead just assume that it isn’t. I’ll just lazily point to Messrs Rusbridge and Rosenthal. What I’ve prioritised in the above has a more practical focus on imminent problems. The file might render ok, but maybe imperfectly. Or the file might not strictly be obsolete, but it might be “institutionally obsolete” if the reading room doesn’t have the appropriate software installed.
So what are the processes we’d ideally like to have automatically running over this PDF in order to mitigate the problems listed above?
- Calculate some checksums and see if we have an unbroken file. If we don’t have checksums, things get a whole lot more complicated and I’m not going there now.
- File format ID. Let’s start by seeing if it’s likely to be a PDF. This will let us throw the right tools at the file in the next stages.
- Scan the file for DRM / password protection
- Perform external dependency test to see if we’re missing anything important that’s not yet in our emerging AIP.
- Attempt to identify the creating application. If this is possible, use it to look up likely issues with that application and try to check for them. Otherwise search for common PDF build problems.
- Perform automated render test, or look up the format to see if it’s on our “supported in the reading room” list.
- See if we have some source code for a good quality PDF renderer, the source for a reference implementation for a PDF renderer and the spec for the PDF format.
- If we don’t like the answers provided by the above, we might want to go back to the source of the PDF file for another copy, or even contemplate repair/migration.
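The first few steps in that list are simple enough to sketch. What follows is a rough illustration in Python, not a real characterisation tool: the function names are made up for this post, and the DRM and creating-application checks are crude byte-level heuristics. They assume the `/Encrypt` key and the `/Creator` and `/Producer` entries of the document information dictionary are sitting in the file uncompressed, which is often not true of modern PDFs (and in an encrypted file those strings are themselves encrypted).

```python
import hashlib
import re


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the file contents as a hex string."""
    return hashlib.sha256(data).hexdigest()


def looks_like_pdf(data: bytes) -> bool:
    # Format "magic": a PDF carries the marker "%PDF-" at its start.
    # Searching the first 1024 bytes (rather than byte 0 only) is an
    # assumption made here to tolerate a little leading junk.
    return b"%PDF-" in data[:1024]


def has_encrypt_entry(data: bytes) -> bool:
    # Crude heuristic: an /Encrypt key in the trailer signals that the
    # document uses encryption. This misses files whose trailer lives in
    # a compressed cross-reference stream.
    return re.search(rb"/Encrypt\b", data) is not None


def creating_application(data: bytes) -> dict:
    # Crude heuristic: pull literal-string values of /Creator and /Producer.
    # It does not handle hex strings, escaped parentheses or XMP metadata.
    found = {}
    for key in (b"Creator", b"Producer"):
        match = re.search(rb"/" + key + rb"\s*\(([^)]*)\)", data)
        if match:
            found[key.decode("ascii")] = match.group(1).decode("latin-1")
    return found


def triage(data: bytes, expected_sha256=None) -> dict:
    """Run the checks in order and collect the answers in one report."""
    # fixity_ok is None when we were never given a checksum to compare with.
    fixity = None if expected_sha256 is None else sha256_of(data) == expected_sha256
    return {
        "fixity_ok": fixity,
        "is_pdf": looks_like_pdf(data),
        "encrypted": has_encrypt_entry(data),
        "creating_application": creating_application(data),
    }
```

Calling `triage(open(path, "rb").read(), expected_sha256=...)` gives one small report per file. Every line of it leans on a fact that has to come from somewhere: the magic bytes, the name of the encryption key, the place where the creating application is recorded.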
So which of these processes needs to be supported by information from some kind of “registry”? These bullets align, one for one, with the list of processes above.
- None if we have checksums, otherwise we could try renderers that report errors if files are broken…?
- We need the relevant file format magic
- We need to know what kinds of DRM/password protection a PDF has the potential to annoy us with, and we need to know what tool+configuration will identify said DRM/password protection for us
- We need to know a tool or process that will let us identify dependencies
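- We need to know where in the PDF the creating application is recorded (typically the Creator and Producer entries of the document information dictionary, or the XMP metadata, rather than the header itself), and have a tool to extract this for us. We also need to know common PDF creation problems with particular creating applications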
- Possibly none. We might benefit from some information about high quality PDF renderers
- Some source code and some format spec docs.
- This is complicated, but it will certainly require knowledge of PDF editors, migration tools and QA type processes.
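To see the shape of that list more clearly, here is one way it could be written down as data. This is only a sketch: the `RegistryNeed` class, its field names and the way I’ve sorted each item into “facts about the format” versus “facts about tools” are my own reading of the bullets above, not any existing registry schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryNeed:
    process: str                 # the repository process being supported
    format_facts: tuple = ()     # things we need to know about PDF itself
    tool_facts: tuple = ()       # things we need to know about tools


# One entry per bullet, in the same order as the two lists above.
NEEDS = [
    RegistryNeed("fixity check"),
    RegistryNeed("format identification", format_facts=("format magic",)),
    RegistryNeed(
        "DRM/password scan",
        format_facts=("kinds of protection PDF supports",),
        tool_facts=("tool+configuration that detects protection",),
    ),
    RegistryNeed(
        "external dependency test",
        tool_facts=("tool or process that identifies dependencies",),
    ),
    RegistryNeed(
        "creating application check",
        format_facts=("where the creating application is recorded",),
        tool_facts=("tool that extracts it", "known problems per creating application"),
    ),
    RegistryNeed("render test", tool_facts=("high quality renderers",)),
    RegistryNeed(
        "long-term rendering",
        format_facts=("format specification",),
        tool_facts=("renderer source code",),
    ),
    RegistryNeed(
        "repair/migration",
        tool_facts=("PDF editors", "migration tools", "QA processes"),
    ),
]


def tally(needs):
    """Count how many needed facts concern the format and how many concern tools."""
    format_count = sum(len(n.format_facts) for n in needs)
    tool_count = sum(len(n.tool_facts) for n in needs)
    return format_count, tool_count
```

With this particular sorting, `tally(NEEDS)` comes out at 4 facts about the format against 9 about tools. Someone else might draw the line differently, but I’d be surprised if the balance flipped.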
So to sum up, we aren’t after much general information about PDFs, although we do need to know enough to be able to apply tools in a useful way. We are after a few very select bits of information about PDF to support some of those processes. We also need to know about tools for rendering, identifying DRM, identifying external dependencies and extracting some very specific metadata. And we need to know which are the best tools to do these operations, ideally based on the experiences of others (even better, based on cold hard data of running one tool against another on test data).
Most of the information we need seems to be about tools. And it’s a bit more than straight up factual stuff like tools X, Y and Z identify DRM in format A. What we really want to see is some kind of evidence for why tool X is rubbish, tool Y is ok, and tool Z does a great job. In an ideal world we would have evidence that tool Z identifies all 876 instances of PDFs with DRM in our test set of 10,000 PDFs. This is very much SCAPE/REPEL territory.
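The kind of evidence I have in mind boils down to some very simple arithmetic once you have a test set where the right answers are known. Here’s a minimal sketch, assuming each tool’s output can be reduced to the set of file identifiers it flagged as DRM-protected. The `score_tool` function and its field names are hypothetical, and any numbers you feed it in a toy run are illustrative rather than real results.

```python
def score_tool(flagged: set, truth: set) -> dict:
    """Compare the files a tool flagged against the files known to have DRM."""
    true_positives = len(flagged & truth)    # protected and correctly flagged
    missed = len(truth - flagged)            # protected but not flagged
    false_alarms = len(flagged - truth)      # flagged but not protected
    # Recall: what share of the protected files did the tool find?
    recall = true_positives / len(truth) if truth else 1.0
    # Precision: what share of the tool's flags were correct?
    precision = true_positives / len(flagged) if flagged else 1.0
    return {
        "true_positives": true_positives,
        "missed": missed,
        "false_alarms": false_alarms,
        "recall": recall,
        "precision": precision,
    }


def rank_tools(results: dict, truth: set) -> list:
    """Return tool names ordered best first, by recall and then precision."""
    scores = {name: score_tool(flagged, truth) for name, flagged in results.items()}
    return sorted(
        scores,
        key=lambda name: (scores[name]["recall"], scores[name]["precision"]),
        reverse=True,
    )
```

A tool that found all 876 protected files in the test set would score a recall of 1.0. Ranking by recall first is a design choice on my part: for this use case a missed DRM-protected file seems worse than a false alarm, but a registry could just as well publish the raw counts and let people decide.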
So to recap. This is of course one use case. The result will differ if the specifics are changed. And there are clearly more use cases out there. But I think we can still draw some broad conclusions.
By working through the practical challenges we face on the ground, I think that we can have a good go at working out what specific knowledge we need to be capturing in our format registry/wiki/knowledgebase/whatever. We should not be trying to capture as much information about each file format as possible. But with some clear use cases on the table, we can home in on the specific information that helps us answer the practical preservation questions we have. My guess is that most of these specifics are about what the tools do and how well they do it. This is broadly speaking the “methods” that Chris discussed. Do we need a bunch of other information about each format? Maybe not.