Don’t panic!: What we might need format registries for

The file format registry announcements of recent days have prompted some excellent discussion between my good colleagues Andy Jackson, Chris Rusbridge and Andrew Wilson. That is welcome, as I think we really need to debate our way through some pretty complex issues if we’re to work out how to solve this thing we call digital preservation. In my last blog post I questioned whether we really know what problems we’re trying to solve with format registries. It’s all right shouting about it, but I thought I’d better have a go at working out what those problems might actually be.

In his latest post, Chris used the principles of object-oriented computing to help describe the fundamentals that we need in order to make sense of our data now and in the future. Comments from Andrew and Andy expanded this a bit further into the performance space and began to tackle the somewhat fuzzy relationship between the information stored in a file and the meaning or intellectual content that emerges at the point of rendering or use: in other words, the combination of data and software that results in a performance for the user. Chris concludes that we’ve not focussed sufficiently on the “methods” that help us do this.

I’m going to take a slightly different approach, so here’s an (almost) real-life use case I’d like to work through. It’s not entirely dissimilar to some unpublished work I did with former colleagues at the BL. Yes, it’s one file, one format and one use case, so it’s not particularly representative. But it’ll have to do for now. Here we go: a new PDF file shows up at our all-singing all-dancing long-term-proofed digital repository. What do we want to do to ensure we can serve this PDF to our repository users now, and in the future?

Let’s start with the preservation challenges, and I’ll rank them in order of risk scariness(TM). First = biggest risk:

  • Is the PDF bit perfect in relation to how it was first created?
  • Is the PDF password or DRM protected, preventing some or all modes of rendering/re-use?
  • Does the PDF depend on one or more resources external to the file itself (e.g. an obscure but critical font)?
  • Is the PDF created by a particular application that results in bad PDFs that don’t render properly or are missing some critical information?
  • Does the PDF render with a double click on the PCs in the repository reading room (or some definition of a “typical user’s computer”)?
  • Are we confident of rendering the PDF in 100 years’ time?

I’d like to neatly (cowardly?) side-step the issue of whether good old-fashioned file format obsolescence is really a major worry, and instead just assume that it isn’t. I’ll just lazily point to Messrs Rusbridge and Rosenthal. What I’ve prioritised in the list above has a more practical focus on imminent problems. The file might render OK, but maybe imperfectly. Or the file might not strictly be obsolete, but it might be “institutionally obsolete” if the reading room doesn’t have the appropriate software installed.

So what are the processes we’d ideally like to have automatically running over this PDF in order to mitigate the problems listed above?

  1. Calculate some checksums and see if we have an unbroken file. If we don’t have checksums, things get a whole lot more complicated and I’m not going there now.
  2. File format ID. Let’s start by seeing if it’s likely to be a PDF. This will let us throw the right tools at the file in the next stages.
  3. Scan the file for DRM / password protection
  4. Perform external dependency test to see if we’re missing anything important that’s not yet in our emerging AIP.
  5. Attempt to identify the creating application. If this is possible, use it to look up likely issues with that application and try to check for them. Otherwise search for common PDF build problems.
  6. Perform automated render test, or look up the format to see if it’s on our “supported in the reading room” list.
  7. See if we have some source code for a good quality PDF renderer, the source for a reference implementation for a PDF renderer and the spec for the PDF format.
  8. If we don’t like the answers provided by the above, we might want to go back to the source of the PDF file for another copy, or even contemplate repair/migration.
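
To make the first three steps a little more concrete, here is a minimal sketch in Python. It is only an illustration under some loud assumptions: the function names are mine, the fixity check presumes a SHA-256 value was supplied with the file, the format ID is nothing more than a search for the “%PDF-” signature near the start of the file, and the encryption check is a crude byte search for “/Encrypt” that says nothing about whether the protection actually gets in the way of rendering or re-use.

```python
import hashlib
from typing import Optional


def sha256_of(data: bytes) -> str:
    """Step 1: fixity value to compare with a checksum supplied at deposit."""
    return hashlib.sha256(data).hexdigest()


def looks_like_pdf(data: bytes) -> bool:
    """Step 2: crude format ID using the '%PDF-' signature."""
    # Assumption: search the first 1024 bytes rather than byte 0 only,
    # since some files carry a little junk before the header.
    return b"%PDF-" in data[:1024]


def mentions_encryption(data: bytes) -> bool:
    """Step 3 (first pass only): is an /Encrypt key present anywhere?"""
    # A hit means encryption is declared, not that it is a problem.
    # A miss is not proof of absence either.
    return b"/Encrypt" in data


def triage(data: bytes, expected_sha256: Optional[str] = None) -> dict:
    """Run the three checks and report the answers without judging them."""
    fixity = None
    if expected_sha256 is not None:
        fixity = sha256_of(data) == expected_sha256
    return {
        "fixity_ok": fixity,  # None means we had no checksum to compare
        "looks_like_pdf": looks_like_pdf(data),
        "mentions_encryption": mentions_encryption(data),
    }
```

Calling triage(data, checksum) on the bytes of the incoming file gives a small report that later steps can act on. Everything from step 4 onwards needs real tools rather than byte searches, which is rather the point of this post.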

So which of these processes needs to be supported by information from some kind of “registry”? The numbered points below line up with the ones directly above.

  1. None if we have checksums, otherwise we could try renderers that report errors if files are broken…?
  2. We need the relevant file format magic
  3. We need to know what kinds of DRM/password protection a PDF has the potential to annoy us with, and we need to know what tool+configuration will identify said DRM/password protection for us
  4. We need to know a tool or process that will let us identify dependencies
  5. We need to know where in the PDF header the creating application appears, and have a tool to extract this for us. We also need to know common PDF creation problems with particular creating applications
  6. Possibly none. We might benefit from some information about high quality PDF renderers
  7. Some source code and some format spec docs.
  8. This is complicated, but it will certainly require knowledge of PDF editors, migration tools and QA type processes.
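
As a rough picture of how little format-specific information this list asks for, the needs above could be held in a record like the one sketched below. This is purely hypothetical: the class names, field names and step labels are invented for illustration, and “tool-X”, “tool-Y” and “tool-Z” are placeholders rather than real software. The only fact about PDF in it is the “%PDF-” signature.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToolPointer:
    name: str           # placeholder name, not a real tool
    purpose: str        # what we would use it for
    evidence: str = ""  # where test results for this tool live, if anywhere


@dataclass
class FormatEntry:
    format_name: str
    magic: bytes  # point 2: signature used for format ID
    # Points 3 to 6 and 8: tools, keyed by the process they support.
    tools_by_step: Dict[str, List[ToolPointer]] = field(default_factory=dict)
    # Point 5: known problems, keyed by creating application.
    known_creator_issues: Dict[str, List[str]] = field(default_factory=dict)
    # Point 7: pointers to renderer source code and format spec documents.
    spec_and_source: List[str] = field(default_factory=list)

    def tools_for(self, step: str) -> List[ToolPointer]:
        """Return the tools registered for a step, or an empty list."""
        return self.tools_by_step.get(step, [])


pdf_entry = FormatEntry(
    format_name="PDF",
    magic=b"%PDF-",
    tools_by_step={
        "drm_check": [ToolPointer("tool-X", "detect DRM / password protection")],
        "dependency_check": [ToolPointer("tool-Y", "list external dependencies")],
        "creator_extraction": [ToolPointer("tool-Z", "extract creating application")],
    },
)
```

Most of the fields point at tools rather than describing the format, which matches where the list above ends up.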

So to sum up, we aren’t after much general information about PDFs, although we do need to know enough to be able to apply tools in a useful way. We are after a few very select bits of information about PDF to support some of those processes. We also need to know about tools for rendering, identifying DRM, identifying external dependencies and extracting some very specific metadata. And we need to know which are the best tools to do these operations, ideally based on the experiences of others (even better, based on cold hard data of running one tool against another on test data).

Most of this information that we need seems to be about tools. And it’s a bit more than straight up factual stuff like tools X,Y and Z identify DRM in format A. What we really want to see is some kind of evidence for why tool X is rubbish, tool Y is ok, and tool Z does a great job. In an ideal world we would have evidence that tool Z identifies all 876 instances of PDFs with DRM in our test set of 10000 PDFs. This is very much SCAPE/REPEL territory.
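
The kind of evidence I mean can be produced with very little machinery once a labelled test set exists. The sketch below assumes we have a set of file names known to carry DRM and the set of file names a given tool flagged. The function name and the sample data are made up for illustration and are not results from any real tool or corpus.

```python
from typing import Dict, Set


def score_detector(truth: Set[str], flagged: Set[str]) -> Dict[str, float]:
    """Compare what a tool flagged with what we know to be true."""
    hits = truth & flagged          # DRM files the tool found
    missed = truth - flagged        # DRM files the tool did not find
    false_alarms = flagged - truth  # files flagged that have no DRM
    return {
        "found": len(hits),
        "missed": len(missed),
        "false_alarms": len(false_alarms),
        # Share of the real DRM files that the tool found.
        "recall": len(hits) / len(truth) if truth else 1.0,
        # Share of the tool's flags that were correct.
        "precision": len(hits) / len(flagged) if flagged else 1.0,
    }


# Invented example data: four files known to carry DRM.
known_drm = {"a.pdf", "b.pdf", "c.pdf", "d.pdf"}
tool_x_flags = {"a.pdf", "e.pdf"}  # finds one, raises one false alarm
tool_z_flags = set(known_drm)      # finds all four, no false alarms

tool_x_score = score_detector(known_drm, tool_x_flags)
tool_z_score = score_detector(known_drm, tool_z_flags)
```

In the ideal case described above, tool Z would score found = 876 and missed = 0 against the 876 DRM-protected PDFs in the test set.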

So to recap. This is of course one use case. The result will differ if the specifics are changed. And there are clearly more use cases out there. But I think we can still draw some broad conclusions.

By working through the practical challenges we face on the ground, I think that we can have a good go at working out what specific knowledge we need to be capturing in our format registry/wiki/knowledgebase/whatever. We should not be trying to capture as much information about each file format as possible. But with some clear use cases on the table, we can home in on the specific information that helps us answer the practical preservation questions we have. My guess is that most of these specifics are about what the tools do and how well they do it. This is broadly speaking the “methods” that Chris discussed. Do we need a bunch of other information about each format? Maybe not.

6 Comments

  1. andy jackson
    July 17, 2012 @ 1:00 pm CEST

    I would argue that any person who is charged with making such an important decision is a ‘preservation person’, whether they like it or not! All of the information you identify is important, but I’m not convinced that these are really two separate audiences.

  2. paul
    July 17, 2012 @ 11:25 am CEST

    Barbara, thanks for the comment. This is a very useful observation, and I’d not really considered the use case you’re describing. A lot of the information required to meet that use case is quite contextual. It’s about the usage and support for the format, rather than detail about the format itself. I think this again underlines the need for us as a community to carefully identify what our needs are and then work out how to meet them.

  3. Barbara Sierman
    July 17, 2012 @ 10:13 am CEST

    “Do we need a bunch of other information about each format?” If by “we” you mean digital preservation people, you might be right, but in my opinion there might be a reason for the complete picture of the format. Before objects in that specific format arrive in your repository, there might be a moment to decide whether or not you will accept that format. The people in charge of this decision might not be the preservation people. But their decision process will certainly be better if they have a concise overview of the ins and outs of the format plus, of course, the risks you mentioned. For example: how widely used is the format, how long has it existed, is there support for it, how are its versions related to each other, are other repositories preserving it, etc. So ideally the information about a file format should serve different audiences. Work to be done!

  4. paul
    July 6, 2012 @ 9:40 am CEST

    Some good points there, thanks Andy. This comes back to our observation from recent times that strict across the board compliance of a file to its file format specification is neither here nor there in most cases. But where we have identified specific problems we need some really thorough characterisation in these particular areas. As you describe this often gets closer to implementing aspects of rendering the actual file, rather than simply comparing data in the file with its specification. At the end of the day, we want to know if we can safely render a file on most current applicable renderers.

    I believe REPEL is Dave’s latest name for the REF work, but I couldn’t find a good link.

  5. andy jackson
    July 5, 2012 @ 9:20 pm CEST

    For me, one of the most interesting aspects of this bit of work was the realisation that the tools we had were rather missing the point. Two of the processes you mention, spotting DRM and determining missing fonts, might be called characterisation. However, because tools like JHOVE focus on enumerating the data that is in the file, they tend not to identify what is absent. For the fonts, the ‘significant property’ is actually the result of a rather awkward algorithm rather than being an explicit property (are the fonts or glyphs required to render the text of each font that is used in this document included in the relevant tables). The case of DRM is even more cumbersome. JHOVE can tell that the PDF is encrypted, but PDF uses a standard password for content that only has minor rights restrictions which all implementations know, and therefore the only way to tell if the encryption is problematic is to actually implement the PDF decryption algorithm. Again, we are talking about outcomes of processes that are applied to the data as they would be on access, rather than the explicit ‘properties’ of the data itself.

    BTW, what’s REPEL?
