Digital Preservation Summit: workshop on format registries

At the Goportis Digital Preservation Summit in Hamburg last week, I had the pleasure on behalf of the Open Planets Foundation of chairing a very productive workshop on the ‘Format Registry Ecosystem’.

I had the chance to present the recent work of the OPF and the National Archives of the Netherlands in this area (see my several previous posts on this blog), but the bulk of our time was spent in a group discussion of the most pressing issues and how we should solve them.

There was a great group of people there, including: Adrian Brown, who in his time at TNA drove the development of PRONOM and DROID; Steve Knight from the National Library of New Zealand, which has been a key innovator in the use of representation information; Angela Dappert, who (while at BL) was one of the main contributors to characterisation and data modelling work during the Planets project; Barbara Sierman from the Dutch KB, who was very active in the Planets Core Registry work; and a strong contingent from the University of Portsmouth, who are currently active in a range of digital preservation activities and have created the ‘TOTEM’ registry of technical environment information.

There were many others! Sorry there isn’t space to list everyone, but thanks to all for their contribution to what was a very active discussion.

There were a handful of notable absences: it would have been great to hear directly from TNA and their current developments on a Linked Data version of PRONOM; none of the UDFR team was able to make the long trip from California; and Maurice van den Dobbelsteen was double-booked with an important meeting at the European Commission in Luxembourg.  Maurice has been instrumental in kicking off and steering the current OPF activities on format registries, contributing some of the main concepts (like the ‘ecosystem’, ‘dashboard’ and ‘collections’) and recognising the need to build on the work of Planets to tackle some of the key problems we still face in the area of format registries.

But enough on who was and wasn’t there. More important is what we decided.

To set the scene a little and define some terms, the ‘Format Registry Ecosystem’ is envisaged to consist of:

  • information on file formats and the software and hardware required to access objects in those formats,
  • compiled and created by diverse organisations and shared for others to use,
  • together with the means for users to select, gather and manage the data they need.

And a ‘format registry’ (or ‘representation information registry’) is a collection of such information held in a computer system which provides access to it in convenient ways.

Outcomes of the workshop

During the workshop we agreed the following points:

1) Format registries are essential for the task of managing a digital repository and ensuring the material remains accessible in the long term.

2) The essential core of a registry is a set of identifiers (for file formats, software applications etc) together with sufficient information to make it clear what the identifier refers to. However, our objective should be to create a much broader information resource, describing in detail the entities of interest and the relationships between them: to make an interlinked and structured ‘encyclopedia’ of format information. It is also of interest to share information on the choices of individual organisations on how to manage their digital material – what we described as ‘policy procedures’.

3) We need to share the burden of gathering and organising the content for registries.

4) Our focus should be on the content and we should ensure the content can be separated from the software used to make it available. We propose the use of web standards (HTTP, RDF and associated standards) for representation and distribution of the data. As well as making it easily shareable, this should help to ensure that the content remains available in the long term.

5) With diverse sources and creators of format registry content, adequate provenance information will be a key element for making judgements on trustworthiness.

6) To make this process work, we need to agree on what we call the guidelines: a core data model, a common exchange format and a common approach to identifiers. That model should be extensible to meet the needs of specific cases.

7) We agreed that the OPF should establish a working group to develop this core data model. The working group will aim to make its recommendations within 3 months. Membership of the group will be open, but all members will be expected to contribute actively. No face-to-face meetings of the group are planned: the group will communicate by email, wiki and Skype or phone. Several participants in the workshop volunteered to take part in the working group. Invitations to others already active in this field will be sent out in the next few days.
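To make points 2, 4 and 5 a little more concrete, here is a rough sketch of what "identifiers plus statements plus provenance" might look like in code. It is purely illustrative and is not the data model the working group will produce: the `example.org` namespace, the property names, the organisation labels and the class and function names are all invented for this example. Every object is written out as a plain literal, which is a simplification of what RDF allows.

```python
from dataclasses import dataclass, field

# Hypothetical namespace, used only for illustration; not a real registry URI scheme.
BASE = "http://example.org/registry/"


@dataclass(frozen=True)
class Statement:
    subject: str    # identifier (URI) of the entity being described
    predicate: str  # URI of a hypothetical property
    obj: str        # value, treated here as a plain literal
    source: str     # provenance: which organisation asserted this


@dataclass
class Registry:
    statements: list = field(default_factory=list)

    def add(self, subject, predicate, obj, source):
        """Record one statement, expanding local names into full URIs."""
        self.statements.append(
            Statement(BASE + subject, BASE + predicate, obj, source)
        )

    def describe(self, subject, trusted=None):
        """Return statements about a subject, optionally filtered by source."""
        uri = BASE + subject
        return [
            s for s in self.statements
            if s.subject == uri and (trusted is None or s.source in trusted)
        ]


def to_ntriples(statement):
    """Serialise one statement as an N-Triples line with a literal object."""
    escaped = statement.obj.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{statement.subject}> <{statement.predicate}> "{escaped}" .'


# Two organisations describe the same (made-up) format.
registry = Registry()
registry.add("format/example-1", "name", "Example Format 1", source="org-a")
registry.add("format/example-1", "name", "Example Fmt v1", source="org-b")

# A user who only trusts org-a gathers just that organisation's statements.
for s in registry.describe("format/example-1", trusted={"org-a"}):
    print(to_ntriples(s))
```

The point of the sketch is that the content (the statements) is independent of the software holding it, and that each statement carries its source, so a user can decide whose information to rely on.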

All of us recognised the importance of a good solution to this problem. We were aware of (and in many cases involved in) a range of initiatives that already tackle aspects of it, so it wasn’t the first time that any of us had mulled over the issues of format registries!  But it feels like the concrete collaboration via the new working group has the potential to enable many strands of work around the world to be intertwined. We have high hopes of making something of real benefit to the whole community.

