OK, not really the answers, just more questions! But a few weeks back I sent round a draft of a registry guidelines document, asking for feedback. I had included in that document a list of questions that we needed to consider and this post presents some of the answers I received, plus some answers I have tried to supply myself.
Please let me know which answers you agree or disagree with! (Hope you don’t mind yet another post from me on registries, but I’m finding this ‘crowdsourcing’ approach to research very effective 🙂 My previous posts on this blog have generated some very useful suggestions, discussion and feedback. Please keep it coming!)
1) Do you agree that RDF is the best way to represent the information?
Answer: yes, it’s the best way (or at least no worse than any reasonable alternative). I had a couple of positive votes for this and no objections.
2) What is the best process for agreeing a data model and representation of this as an ontology?
Once we have worked out the data model, representing this as an ontology is relatively straightforward. There will no doubt be some design choices to make, but essentially it’s a case of encoding the data model in RDFS and OWL. Agreeing the data model is likely to be much harder.
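To make that concrete, here’s a minimal sketch (Python with rdflib) of what encoding a small fragment of a data model in RDFS and OWL might look like. The class and property names are purely illustrative, not a proposal:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

# Hypothetical registry namespace: the names are illustrative only.
REG = Namespace("http://example.org/registry/ontology#")

g = Graph()
g.bind("reg", REG)

# A tiny fragment of a possible data model: file formats, software,
# and the relationship between them.
g.add((REG.FileFormat, RDF.type, OWL.Class))
g.add((REG.FileFormat, RDFS.label, Literal("File format", lang="en")))
g.add((REG.Software, RDF.type, OWL.Class))

g.add((REG.renders, RDF.type, OWL.ObjectProperty))
g.add((REG.renders, RDFS.domain, REG.Software))
g.add((REG.renders, RDFS.range, REG.FileFormat))

print(g.serialize(format="turtle"))
```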
Between PRONOM, PLANETS, GDFR and UDFR there has been a fair bit of work already on data models for representation information. Obviously starting by looking at what we’ve already got seems sensible.
As for a process: the data model and its representation need wide acceptance if this effort is to be useful, so what’s the best way to get agreement? A working group of some sort seems the most likely route. Its main remit could be to agree a set of guidelines for publishing representation information, of which agreeing the data model would probably be a significant part. How would such a working group be organized? Who should take part? Does it need face-to-face meetings, or can we do it all remotely? Does it need any central organization, and if so, would the OPF be an appropriate ‘host’?
3) How do we manage changes in the ontology over time?
I don’t think this is too difficult to manage – we can have new versions as required.
4) How do we enable the ontology to be extended to suit individual or specialist needs?
One of the strengths of the RDF approach is the ease with which people can extend a core ontology to suit their own specialist needs.
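For example, an institution could declare its own specialist terms as refinements of the core ones, without touching the core ontology at all. A sketch, reusing the hypothetical REG namespace from the earlier example:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

# REG is the (hypothetical) shared core ontology; LOCAL is an
# institution's own namespace for its specialist terms.
REG = Namespace("http://example.org/registry/ontology#")
LOCAL = Namespace("http://archive.example.ac.uk/terms#")

g = Graph()
# A specialist subclass of the core FileFormat class...
g.add((LOCAL.CadFormat, RDF.type, OWL.Class))
g.add((LOCAL.CadFormat, RDFS.subClassOf, REG.FileFormat))
# ...and a specialist refinement of a core property.
g.add((LOCAL.rendersHeadless, RDFS.subPropertyOf, REG.renders))
```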
5) What should be the scope of the ontology?
It needs to support the kinds of information we need to exchange: information about file formats; about software applications and technical environments, and how they relate to file formats; and about institutional preferences and policies concerning file formats and technical environments.
6) Should we follow the Linked Data approach of using HTTP URIs as the primary identifier of the entities we want to describe?
The main feedback I received was: yes, let’s do this with Linked Data. Furthermore, two key initiatives have committed themselves to the Linked Data approach: TNA’s PRONOM and UDFR.
7) Should we require (Linked Data style) that our URIs are dereferenceable? This will tend to lead to multiple equivalent identifiers.
The main feedback was: preferably yes, as dereferenceable identifiers would be useful.
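Dereferenceable identifiers mean a client can look something up just by resolving its URI. A sketch of what that looks like from the client side, assuming a hypothetical format URI that serves RDF when dereferenced:

```python
from rdflib import Graph, URIRef

# Hypothetical format identifier; a real registry would serve RDF
# (e.g. via content negotiation) when the URI is dereferenced.
fmt = URIRef("http://example.org/registry/format/42")

g = Graph()
g.parse(fmt)  # fetch the URI and parse whatever RDF comes back

# List everything the publisher says about this format.
for predicate, obj in g.predicate_objects(subject=fmt):
    print(predicate, obj)
```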
8) What minimum information is required to make it clear what an identifier refers to?
This is quite a complex question, because it’s not always clear what constitutes a file format. Not all file formats have clear specifications. Sometimes a file can be a valid instance of more than one format. And if a file doesn’t exactly follow a specification, but can still be read by viewers of a given format, should it still be identified as an example of that format?
One approach might be to use a ‘test-driven’ definition – an object is an example of a format if it matches the signature associated with that format in our registry.
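To illustrate the idea in its simplest form, here is a sketch of testing a file against a magic-number-style byte signature (real registry signatures are rather more elaborate than this):

```python
def matches_signature(path: str, signature: bytes, offset: int = 0) -> bool:
    """Test-driven format check: does the file carry this byte signature?"""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(len(signature)) == signature

# PDF files start with the ASCII bytes '%PDF' at offset 0.
print(matches_signature("example.pdf", b"%PDF"))
```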
9) What guidelines should we follow to set a common pattern for new identifiers?
I’ll attempt to draft something for the next version of the guidelines. Issues include opaque versus human-readable identifiers, how to avoid ‘collisions’ between identifiers, and how to match up two different identifiers for the same thing.
10) Distributing data: What do we recommend or require from registry publishers? Full Linked Data approach?
Based on feedback, the preference would be for a full Linked Data approach, but a simpler option of just providing a downloadable RDF file would also be useful.
11) What provenance information do we want to record?
I suggest keeping this simple to start with: for each item of information we want to know who published it and when, and where to go to find out more.
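In RDF that could be as little as three statements per item, e.g. using Dublin Core and rdfs:seeAlso (a sketch; the URIs are hypothetical):

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, RDFS, XSD

g = Graph()
# Hypothetical record URI, not a real registry identifier.
record = URIRef("http://example.org/registry/format/42")

# Who published it, when, and where to go to find out more.
g.add((record, DCTERMS.publisher, URIRef("http://example.org/org/some-institution")))
g.add((record, DCTERMS.issued, Literal("2011-05-01", datatype=XSD.date)))
g.add((record, RDFS.seeAlso, URIRef("http://example.org/notes/format-42")))
```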
12) What versioning information do we want to record?
The main difficulty here is deciding the granularity of versioning information, so I’ll combine this question with the next one.
13) What is the optimum granularity of provenance and versioning information?
This depends a bit on the approach to the registry and its users. UDFR is planning a multi-user registry, where different information comes from different institutions and individuals, and it intends to provide provenance information at the level of the individual triple/statement. However, it could still be useful to be able to group information with similar provenance (e.g. show me everything that Institution X has to say about format Y).
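One way to support both levels at once is named graphs: each contributor’s statements live in their own graph, so per-statement provenance and ‘everything Institution X says’ queries both fall out naturally. A sketch with rdflib, using hypothetical names:

```python
from rdflib import Dataset, Literal, Namespace, URIRef

# Hypothetical namespaces and URIs, purely for illustration.
REG = Namespace("http://example.org/registry/")

ds = Dataset()

# Everything Institution X asserts lives in X's own named graph,
# so the graph URI itself carries the provenance.
gx = ds.graph(URIRef("http://example.org/graphs/institution-x"))
gx.add((REG["format/y"], REG.preservationRisk, Literal("low")))

# "Show me everything that Institution X has to say about format Y":
for s, p, o in gx.triples((REG["format/y"], None, None)):
    print(p, o)
```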
For TNA’s PRONOM, all published information must be approved by TNA, so at one level the provenance is simpler. But it could still be important to know how each item of information was arrived at, what validation was carried out, and so on (more on validation later).
The approach to this should probably be driven by use cases: as users of a registry, how do we want to use provenance information to filter, search or select the available information?
14) What is the best mechanism for representing provenance and versioning information?
Once we have decided on granularity, representing versioning information is fairly straightforward: we can use Dublin Core metadata terms for that.
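For example, Dublin Core already provides terms such as dcterms:modified, dcterms:replaces and dcterms:isReplacedBy, which cover the basics (a sketch; the URIs are hypothetical):

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, XSD

g = Graph()
# Hypothetical URIs for two versions of the same format record.
v1 = URIRef("http://example.org/registry/format/42/v1")
v2 = URIRef("http://example.org/registry/format/42/v2")

g.add((v2, DCTERMS.modified, Literal("2011-06-01", datatype=XSD.date)))
g.add((v2, DCTERMS.replaces, v1))        # v2 supersedes v1...
g.add((v1, DCTERMS.isReplacedBy, v2))    # ...and v1 points forward to v2
```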
Provenance can be more complicated, and there are a few alternative provenance data models around. How much machinery we need depends on how much we want to say about how the information was created and checked, and on whether that needs to be structured and machine-readable or can just be a pointer to, say, a report.
A likely candidate is the Open Provenance Model Vocabulary (OPMV). There is also a W3C working group on provenance on the web; I haven’t looked at their work in detail yet, but it’s on my to-do list.
15) What is the best way of notifying users of updates?
To start with, just ensure we have the last-modified date in the graph metadata; clients can then check back regularly to see if there is anything new. Later, it would be useful to add some kind of subscription-based update feed, probably using Atom or RSS.
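A polling client would then only need to compare the published last-modified date with the one it saw last time, along these lines (a sketch; the graph URI is hypothetical, and is assumed to serve RDF when dereferenced):

```python
from rdflib import Graph, URIRef
from rdflib.namespace import DCTERMS

# Hypothetical URI of the registry's published RDF graph.
REGISTRY_GRAPH = URIRef("http://example.org/registry/all")

def registry_changed_since(last_seen: str) -> bool:
    """Fetch the graph and compare its dcterms:modified date with ours."""
    g = Graph()
    g.parse(REGISTRY_GRAPH)
    modified = g.value(REGISTRY_GRAPH, DCTERMS.modified)
    # ISO dates compare correctly as strings, e.g. "2011-06-01" > "2011-05-01"
    return modified is not None and str(modified) > last_seen

print(registry_changed_since("2011-05-01"))
```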
Other questions: governance and validation
Andy Jackson and Barbara Sierman have raised questions in the past about governance of publishing processes and validation of the contents of representation information registries. These are essentially questions about trust: how does a user of a registry decide whether or not to trust the information?
The main focus of this discussion is on how to agree some publishing guidelines, so some aspects of governance are out of scope, but the decision about what information to publish does have a bearing on trust. What information should be included about how a particular piece of information was arrived at? What process of testing was applied? What information might users need to reproduce the tests themselves?
As Andy points out, there is a big difference between creating trust and transmitting trust. Creating trust requires evidence that the published information has been validated. With a registry we can mainly transmit rather than create trust, but we could design guidelines such as these to encourage thorough validation to be carried out and made available.
How best to actually do that validation is a complicated question – let’s leave it for another day.