Rethinking the file format registry


Content is King.  The key to a good file format registry is not software; it’s not user interface; it’s not governance. The key is content, content, content.  We will all win if we have a registry whose content is usable, accurate, and comprehensive.

I have a challenge for developers in the digital preservation community: can we build a file format registry without building any new software systems at all? 

Let’s take Pronom as an exemplar.  Pronom consists of about 700 small XML documents with some cross references.  There is a modest community of people who may look at them and point out errors (bugs) or omissions (new features).  Some of these may even email in corrections.   Of course, not every member of the community is trusted to make changes to the underlying data!  There is a special subset of community members that validate these changes and actually commit them.

How often do these changes occur? I don’t think any of us knows precisely, but I’ll suggest with confidence that it will be less than once a second. Actually, I bet it will average less than once a week.

How much format data is there? I don’t know precisely, but the entire current Pronom XML data fits into a single zip file of 680KB. I would be willing to bet that an active community will not grow this by more than a factor of 100 over the next few years. That would be about 68MB, so I’ll estimate much less than 700MB. Today’s size and rate of change would be comfortably handled via email! I receive many larger documents each day. It certainly does not require a complex database.

So let’s consider this profile. There is a community that consists of a few committers, tens of active members, and perhaps hundreds of end users.  They have a process to manage patches and releases.  The community maintains a few thousand objects constituting a few megabytes.  

Does this sound like a process and structure that we are familiar with? To me, this sounds like any of hundreds of modest open-source community-driven software development projects. Is this bad news? Do they have to invest hundreds of thousands of euros to set up a special infrastructure to manage all of this complicated stuff and distribute their data? Well, no! The infrastructure is already there. It is well established; there are many providers; and it is mostly free. Most importantly, it lets communities get on with their real work.

My second challenge to the developers in the digital preservation community is this: suppose that you had two weeks to set up a functional file format registry. Could you do it? Could you manage some directories full of XML and also generate some HTML files (or just use a style sheet)? Would you really need to write a single line of code? And the really big question: could you take the second week off?

Someone else can do the hard work while you are on vacation!  Specifying good signatures for common file formats would be a great next step.
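
To put a rough size on the "directories full of XML plus some HTML" idea: if a style sheet alone turned out not to be enough, the fallback is still tiny. Here is a minimal sketch in Python, using only the standard library. The layout is an assumption for illustration, not Pronom's actual schema: each record is an XML file whose root element has a `<name>` child, and `build_index` is a made-up helper name.

```python
import html
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def build_index(xml_dir: Path) -> str:
    """Return an HTML page that links to every XML record under xml_dir."""
    rows = []
    for path in sorted(xml_dir.rglob("*.xml")):
        root = ET.parse(path).getroot()
        # Hypothetical schema: a <name> child holds the format's name.
        # Fall back to the file name if the record has no <name>.
        name = root.findtext("name", default=path.stem)
        link = html.escape(path.relative_to(xml_dir).as_posix(), quote=True)
        rows.append(f'<li><a href="{link}">{html.escape(name)}</a></li>')
    items = "\n".join(rows)
    return f"<html><body><h1>File formats</h1>\n<ul>\n{items}\n</ul></body></html>"


if __name__ == "__main__":
    # Demo on two throwaway records; a real registry would point at its
    # version-controlled directory of XML instead.
    with tempfile.TemporaryDirectory() as tmp:
        d = Path(tmp)
        (d / "png.xml").write_text(
            "<format><name>Portable Network Graphics</name></format>",
            encoding="utf-8",
        )
        (d / "tiff.xml").write_text(
            "<format><name>Tagged Image File Format</name></format>",
            encoding="utf-8",
        )
        print(build_index(d))
```

The XML files themselves would live in an ordinary hosted version-control repository, where patches, review and releases come for free. A script like this is the only part anyone might have to write, and arguably not even that.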

Let’s think radically.  Let’s make our problems so easy that it’s almost embarrassing to solve them!


