The file format registry announcements of recent days have prompted some excellent discussion between my good colleagues, Andy Jackson, Chris Rusbridge and Andrew Wilson. This is just what we need: I think we really have to debate our way through some pretty complex issues if we’re to work out how we solve this thing we call digital preservation. In my last blog post I questioned whether we really knew what problems we’re trying to solve with format registries. It’s all right shouting about it, but I thought I’d better have a go at working out what those problems might actually be.
In his latest post, Chris used the principles of object-oriented computing to help describe the fundamentals we need in order to make sense of our data now and in the future. Comments from Andrew and Andy expanded this a bit further into the performance space and began to tackle the somewhat fuzzy relationship between the information stored in a file and the meaning or intellectual content that emerges at the point of rendering or use: in other words, the combination of data and software that results in a performance for the user. Chris concludes that we’ve not focussed sufficiently on the “methods” that help us do this.
I’m going to take a slightly different approach, so here’s an (almost) real life use case I’d like to work through. It’s not entirely dissimilar to some unpublished work I did with former colleagues at the BL. Yes it’s one file, one format and one use case, so it’s not particularly representative. But it’ll have to do for now. Here we go: A new PDF file shows up at our all-singing all-dancing long-term-proofed digital repository. What do we want to do to ensure we can serve this PDF to our repository users now, and in the future?
Let’s start with the preservation challenges, and I’ll rank them in order of risk scariness(TM). First = biggest risk:
- Is the PDF bit perfect in relation to how it was first created?
- Is the PDF password or DRM protected, preventing some or all modes of rendering/re-use?
- Does the PDF depend on one or more resources external to the file itself (e.g. an obscure but critical font)?
- Was the PDF created by a particular application known to produce bad PDFs that don’t render properly or are missing some critical information?
- Does the PDF render with a double click on the PCs in the repository reading room (or some definition of a “typical user’s computer”)?
- Are we confident of rendering the PDF in 100 years’ time?
I’d like to neatly (cowardly?) side-step the issue of whether good old fashioned file format obsolescence is really a major worry, and instead just assume that it isn’t. I’ll just lazily point to Messrs Rusbridge and Rosenthal. What I’ve prioritised in the above has a bit more practical focus on imminent problems. The file might render ok, but maybe imperfectly. Or the file might not strictly be obsolete, but it might be “institutionally obsolete” if the reading room doesn’t have the appropriate software installed.
So what are the processes we’d ideally like to have automatically running over this PDF in order to mitigate the problems listed above?
- Calculate some checksums and see if we have an unbroken file. If we don’t have checksums, things get a whole lot more complicated and I’m not going there now.
- File format ID. Let’s start by seeing if it’s likely to be a PDF. This will let us throw the right tools at the file in the next stages.
- Scan the file for DRM / password protection
- Perform external dependency test to see if we’re missing anything important that’s not yet in our emerging AIP.
- Attempt to identify the creating application. If this is possible, use it to look up likely issues with that application and check for them. Otherwise search for common PDF build problems.
- Perform automated render test, or look up the format to see if it’s on our “supported in the reading room” list.
- See if we have some source code for a good quality PDF renderer, the source for a reference implementation for a PDF renderer and the spec for the PDF format.
- If we don’t like the answers provided by the above, we might want to go back to the source of the PDF file for another copy, or even contemplate repair/migration.
So which of these processes needs to be supported by information from some kind of “registry”? These bullets align with the ones directly above.
- None if we have checksums, otherwise we could try renderers that report errors if files are broken…?
- We need the relevant file format magic
- We need to know what kinds of DRM/password protection the PDF has the potential to annoy us with, and which tool+configuration will identify said DRM/password protection for us
- We need to know a tool or process that will let us identify dependencies
- We need to know where in the PDF the creating application is recorded, and have a tool to extract this for us. We also need to know the common PDF creation problems associated with particular creating applications
- Possibly none. We might benefit from some information about high quality PDF renderers
- Some source code and some format spec docs.
- This is complicated, but it will certainly require knowledge of PDF editors, migration tools and QA type processes.
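Pulling those bullets together, a registry entry that supported exactly these processes (and no more) might be surprisingly small. The sketch below is a hypothetical shape: every field and tool name is invented for illustration, and no real registry uses this schema:

```python
# A hypothetical registry record for PDF, holding just the facts the
# processes above actually consume. All field and tool names are
# invented; "ISO 32000-1" is the real PDF 1.7 specification.
pdf_registry_entry = {
    "format": "PDF",
    "magic": [{"offset": 0, "bytes": "25504446"}],  # "%PDF" in hex
    "drm_detection": {
        "signals": ["/Encrypt dictionary present"],
        "tools": ["hypothetical-drm-scanner"],
    },
    "dependency_checks": ["non-embedded fonts", "external streams"],
    "creating_app_issues": {
        # Tool-specific gotchas, keyed by /Producer string pattern.
        "ExampleWriter 1.*": ["malformed xref table"],
    },
    "renderers": ["hypothetical-open-renderer"],
    "spec_refs": ["ISO 32000-1"],
}

def registry_magic_matches(entry, data):
    """Check a file's leading bytes against the registry's magic signature."""
    sig = entry["magic"][0]
    expected = bytes.fromhex(sig["bytes"])
    return data[sig["offset"]:sig["offset"] + len(expected)] == expected
```

Note how much of the record is pointers to tools and known problems rather than descriptive facts about the format itself.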
So to sum up, we aren’t after much general information about PDFs, although we do need to know enough to be able to apply tools in a useful way. We are after a few very select bits of information about PDF to support some of those processes. We also need to know about tools for rendering, identifying DRM, identifying external dependencies and extracting some very specific metadata. And we need to know which are the best tools to do these operations, ideally based on the experiences of others (even better, based on cold hard data of running one tool against another on test data).
Most of this information that we need seems to be about tools. And it’s a bit more than straight up factual stuff like tools X,Y and Z identify DRM in format A. What we really want to see is some kind of evidence for why tool X is rubbish, tool Y is ok, and tool Z does a great job. In an ideal world we would have evidence that tool Z identifies all 876 instances of PDFs with DRM in our test set of 10000 PDFs. This is very much SCAPE/REPEL territory.
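That “cold hard data” could be as simple as hit/miss counts from running a tool over a hand-checked test corpus. A minimal sketch of the kind of evidence a registry might publish alongside a tool entry (the function and field names are my own invention):

```python
def tool_evidence(ground_truth, tool_flags):
    """Score a DRM-detection tool against a hand-checked test set.

    ground_truth / tool_flags map a file id to True (has DRM) or
    False. Returns the simple evidence counts a registry could
    publish alongside the tool's entry.
    """
    tp = sum(1 for f, truth in ground_truth.items() if truth and tool_flags.get(f))
    fn = sum(1 for f, truth in ground_truth.items() if truth and not tool_flags.get(f))
    fp = sum(1 for f, truth in ground_truth.items() if not truth and tool_flags.get(f))
    return {
        "found": tp,           # DRM'd files the tool caught
        "missed": fn,          # DRM'd files the tool missed
        "false_alarms": fp,    # clean files wrongly flagged
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }
```

In the ideal-world example above, tool Z would report `found: 876, missed: 0` over the 10,000-file test set, and that record, not an unsupported assertion of quality, is what the registry would hold.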
So to recap. This is of course one use case. The result will differ if the specifics are changed. And there are clearly more use cases out there. But I think we can still draw some broad conclusions.
By working through the practical challenges we face on the ground, I think that we can have a good go at working out what specific knowledge we need to be capturing in our format registry/wiki/knowledgebase/whatever. We should not be trying to capture as much information about each file format as possible. But with some clear use cases on the table, we can home in on the specific information that helps us answer the practical preservation questions we have. My guess is that most of these specifics are about what the tools do and how well they do it. This is broadly speaking the “methods” that Chris discussed. Do we need a bunch of other information about each format? Maybe not.
July 17, 2012 @ 9:05 pm CEST
We wrote up a paper last year that looked at what shape the format libraries might take in future iterations of the Rosetta product. The outcome of this work was a description of a desirable set of libraries, ‘catalogues’ and the various technical objects required to link the internal Rosetta world with the external ‘Library’ world, and further out to other global resources.
Some of the issues we sought to address were very specific to the Rosetta product, and others have a much more global feel. These issues were distilled down to a set of key principles that I will briefly describe:
1) Modularity – Regardless of what elements make up a specific “format library”, these elements probably need to be modular – supported by suitable system interfaces/data models. This modularity allows common parts to be jointly owned/developed by interested parties, and bespoke elements (of limited interest to the broader community) to be managed locally.
An example of this might be having a common “format library” (e.g. based on the PRONOM data) and a Rosetta-centric “rules catalogue” (that captures any format-related rules used by a Rosetta system).
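The modular split described above can be sketched as two data types joined by a shared identifier. All class and field names here are invented for illustration; the PUIDs are the kind of identifier PRONOM publishes, used illustratively:

```python
from dataclasses import dataclass, field

@dataclass
class FormatRecord:
    """Jointly owned, community-maintained format data."""
    puid: str                # e.g. a PRONOM persistent unique identifier
    name: str
    signatures: list = field(default_factory=list)

@dataclass
class LocalRule:
    """Bespoke, locally managed rule - of no interest beyond one system."""
    format_puid: str         # links back to the shared record
    action: str              # e.g. "route to reading-room renderer"

def rules_for(format_rec, catalogue):
    """Join the local rules catalogue onto the shared library by PUID."""
    return [r for r in catalogue if r.format_puid == format_rec.puid]
```

The common record can then be updated from the community source without touching the local rules, and vice versa, which is the point of the modularity principle.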
2) Transparency (History and Operation) – This is essentially the audit of the informational records used to hold the format-related data together. The user of a format library should want to know what the format library said about a specific format at any point in its history, as it’s very likely that either decisions have been made using it, or digital objects have been “steered” in a particular direction by the information contained within it.
3) Exportability – to a degree this represents a move towards some standardisation. It is desirable to share various informational objects between peers/partners. It’s also not ideal to build a complex knowledge base that is not portable/exportable. This adds further fuel to the argument for coming up with a series of common data models that allow DP communities to confidently add data into their local knowledge bases, knowing that they can share/see/interact with similar objects, regardless of what system they have bought/are developing.
4) Governance – There should be some central entities responsible for governing these common data objects, ensuring that they develop in concert with related informational objects, and in a way that is agreed by the broad user base.
This seeks to ensure that common informational objects are dependable, and that any development deviation from the expected (agreed) norms results in changes that are manageable and low-impact for any user.
I suspect this will become more important as time progresses: the more we depend on external data sources, the greater the risk we expose our content to. An example of this is PRONOM. We (NDHA) have been using PRONOM as our primary source of format-related information for over 5 years. We also know that changes made to the PRONOM data have resulted in our digital objects being identified as different object types at various times, which in turn complicates and muddies our format-related view of the content.
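The transparency principle above, and the PRONOM re-identification problem, amount to a requirement for versioned records: being able to ask “what did the registry say about this format on date D?”. A minimal sketch, with an invented structure (real systems would version far richer records than a single identifier string):

```python
from datetime import date

class VersionedRecord:
    """Keep every dated revision of a format record; never overwrite."""

    def __init__(self):
        self._revisions = []  # (effective_date, record) pairs

    def record_change(self, when, record):
        self._revisions.append((when, record))
        self._revisions.sort(key=lambda pair: pair[0])

    def as_of(self, when):
        """Return the record as it stood on a given date, or None
        if the registry said nothing about this format yet."""
        current = None
        for effective, record in self._revisions:
            if effective <= when:
                current = record
        return current
```

With this kind of history an institution could reconstruct exactly which registry state “steered” a given object at ingest time, even after the upstream data has changed.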
5) Open Structure – this primarily relates to (2) and (3) but is a common theme throughout all these key principles. It seeks to ensure that any data models / APIs used are open, and accessible to any implementer who wants to use these common informational objects.
These points are largely idealistic, and speak to issues we have encountered or envisage encountering from our long-term use of the Rosetta system.
I am happy to share this paper on the basis that readers (1) appreciate that it comes from a very Rosetta centric view of the world, and (2) it’s a work in progress, and therefore could not be considered complete at this time.
Please email me: [email protected] if it is of interest.