I’ve just started a small assignment for the OPF to investigate the options for a new file format registry, part of the toolbox needed for long-term preservation of digital material by archives, libraries and other memory institutions. This initiative was kicked off and sponsored by the National Archives of the Netherlands, and is now in progress under the auspices of the OPF.
Most readers of this blog will know very well what a file format registry is and why you need one. But for people new to the world of digital preservation, I’ll very briefly explain: in order to make sure you can still access all your files in 10, 50 or 100 years, you first need to know what kinds of file formats you’ve got and what tools are available to work with them. For institutions with a responsibility to look after our records of government and cultural history, this is a high priority. So step 1 is to systematically track file formats and information about them.
Registries of this kind already exist, notably the UK National Archives’ PRONOM system, the GDFR system from Harvard University Library and the PLANETS Core Registry (which was itself closely based on PRONOM).
So why do we need another one? The field of digital preservation research is still only a decade or two old and many lessons are still being learned: the first generation of registries has done a great job in many respects but has also highlighted new requirements.
An important issue arises from the large amount of ongoing research effort required to keep on top of the wide range of file formats in use, with new types of digital material and new software appearing all the time. This is not a job that any one institution can afford to do by itself, so sharing of information is essential, between archives, libraries, universities, software vendors and individual experts. Also, the information you need is a mixture of facts and policy choices. The specification for PDF 1.4 may not be open for argument, but choices on how to manage PDF files over the long term and what tools to use may vary from one organization to another.
The problem is in many ways one of distributed web publishing, with the need for unambiguous shared identifiers, so everyone knows when they are talking about the same thing. The information to be stored about file formats is complex and a precisely defined shared vocabulary for format descriptions is essential for effective information sharing. So it’s a very natural fit for Linked Data. That’s one of my main professional interests and laying out how it could be usefully applied to this problem is one of my tasks.
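To make the idea a little more concrete, here is a minimal sketch of why shared identifiers matter. It is not a proposal for the registry's actual data model: the URIs, the vocabulary terms, the publisher names and the tool names are all made up for illustration, and plain Python tuples stand in for real Linked Data tooling. Two institutions publish statements about the same format identifier; because they use the same identifier and vocabulary, the statements can be merged mechanically, and the result shows where they agree (facts) and where they differ (policy choices).

```python
from collections import defaultdict

# Hypothetical identifiers; a real registry would publish its own URIs.
PDF_1_4 = "http://example.org/format/pdf-1.4"
NAME = "http://example.org/vocab/name"
VERSION = "http://example.org/vocab/version"
PREFERRED_TOOL = "http://example.org/vocab/preferredTool"

# Each publisher contributes (subject, predicate, object) statements.
archive_a = {
    (PDF_1_4, NAME, "Portable Document Format"),
    (PDF_1_4, VERSION, "1.4"),
    (PDF_1_4, PREFERRED_TOOL, "tool-x"),  # policy choice of archive_a
}
library_b = {
    (PDF_1_4, NAME, "Portable Document Format"),
    (PDF_1_4, PREFERRED_TOOL, "tool-y"),  # policy choice of library_b
}


def merge(sources):
    """Group statements by (subject, predicate), recording who said what."""
    merged = defaultdict(lambda: defaultdict(set))
    for publisher, triples in sources.items():
        for subject, predicate, obj in triples:
            merged[(subject, predicate)][obj].add(publisher)
    return merged


def values(merged, subject, predicate):
    """Return each stated value with the sorted list of publishers asserting it."""
    return {obj: sorted(pubs) for obj, pubs in merged[(subject, predicate)].items()}


merged = merge({"archive_a": archive_a, "library_b": library_b})
print(values(merged, PDF_1_4, NAME))
print(values(merged, PDF_1_4, PREFERRED_TOOL))
```

Both publishers agree on the name, so it appears once with two sources, while the preferred tool appears twice, each value attributed to the institution that chose it. Without the shared identifier for PDF 1.4 there would be no reliable way to tell that the two sets of statements describe the same thing.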
But the first priority is to set out the issues we need to tackle. Over the next couple of months, I’ll be pulling together an outline of the concept of how such a distributed registry could work and aiming to narrow that down to an initial set of requirements.
For an idea of where that is going, see the report “A New Registry for Digital Preservation: Outline Proposal for Discussion” (204 kB PDF). I’d certainly welcome input from the well-informed and opinionated (in a good way 🙂 ) readers of the OPF blogs. Please begin your rants in the comments to this post!
It will also be one of the topics of discussion at the forthcoming OPF workshop and hackathon in Amsterdam. I hope to talk to many of you about it there. Registrations close on Friday 5 November, so if you’re not registered yet, be quick!