As promised yesterday, this is the follow-up post on the refactor of my original DROID SQLite Analysis work. The new version now allows you to produce reports from the format identification tool Siegfried.
In this blog I wanted to talk about a small number of other details that can be a bit harder to pull apart in a long meandering narrative. Let’s look at these below:
A tool to help develop a consistent dialect across tools in the digital preservation community
A few months ago at Archives New Zealand my colleagues and I sat down with the archivists to create definitions for each of the main statistics output by the tool. The majority of these definitions are under the ‘More Detail’ drop down. A number of alt-attributes are used where it is harder to add this information more organically.
The rationale behind these definitions is so that when we create digital preservation statistics about our collections we can talk consistently about them within our own organisations and within the community. We can give these definitions to colleagues who are new to the discipline, and they immediately have knowledge at their fingertips about the various charts and figures they’re seeing, plus information about core digital preservation terminology such as PUID and file format signature.
While the project layout lends itself to many different versions of definitions, I believe we may be able to make definitions consistent across sectors (GLAM, eDiscovery, etc.). If there are any definitions you’d like to see amended or corrected, please contribute submissions to the project’s issues page: https://github.com/exponential-decay/droid-sqlite-analysis/issues
Creating a multilingual tool
The strings used by the tool have all been isolated inside a single class here: https://github.com/exponential-decay/droid-sqlite-analysis/blob/9c050829328a02c85e1beeda07c59adcdd03356c/libs/internationalstrings.py. The hope is that, with the aid of international digital preservation colleagues, these can be converted to language-specific strings to make our dialect even more consistent.
This is a future piece of work that I will try to push on with, but I am always open to integrating internationalized versions of these strings sooner. If you forward me a pull request (or an attachment to a GitHub issue), I can use that to create new command line flags and configuration options so the tool can be used in your language.
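As a rough illustration of how a language switch over an isolated strings class might look, here is a minimal sketch. It is not the contents of internationalstrings.py: the class names, the attribute names, the German wording and the `--lang` flag are all hypothetical, chosen only to show the idea of a translated class that falls back to English for anything not yet translated.

```python
import argparse


class AnalysisStringsEN:
    """English strings. Attribute names here are hypothetical examples."""

    HEADING_SUMMARY = "Summary Statistics"
    LABEL_TOTAL_FILES = "Total files"
    DESC_PUID = "PRONOM Unique Identifier"


class AnalysisStringsDE(AnalysisStringsEN):
    """German overrides (illustrative wording).

    Anything not overridden here is inherited from the English class,
    so a partial translation still produces a complete report.
    """

    HEADING_SUMMARY = "Zusammenfassende Statistik"
    LABEL_TOTAL_FILES = "Dateien insgesamt"


# Map of language codes to string classes; contributors would add entries.
LANGUAGES = {"en": AnalysisStringsEN, "de": AnalysisStringsDE}


def get_strings(lang_code):
    """Return the strings class for lang_code, defaulting to English."""
    return LANGUAGES.get(lang_code.lower(), AnalysisStringsEN)


def strings_from_args(argv):
    """Pick a strings class from a hypothetical --lang command line flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--lang", default="en", help="report language code")
    args = parser.parse_args(argv)
    return get_strings(args.lang)


if __name__ == "__main__":
    strings = strings_from_args(["--lang", "de"])
    print(strings.HEADING_SUMMARY)
    print(strings.DESC_PUID)  # not translated yet, so the English text is used
```

Subclassing the English class is just one possible design; it means a pull request containing only a handful of translated strings would still be usable straight away.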
More statistics… Text, XML, Filename…
Siegfried integration allows us to look at one-to-many namespaces, that is, a single file can be identified under more than one namespace at once. This alone should bring new insight to the work we’re doing in digital preservation. Gaps may reveal themselves, and others may be plugged.
A namespace in Siegfried is any custom identifier made from the set of PRONOM or Freedesktop.org/Tika-based signatures. For example, a namespace may just cover all image formats in PRONOM. It’s possible to customize a Siegfried signature file with the full gamut and everything in between.
On top of this, Siegfried helps us to leverage MIMEInfo’s alternate identification engines. The types of signature we can use are no longer limited to the Standard and Container signatures in PRONOM (that is, regular file format signatures, and file format signatures that can examine the contents of ZIP and OLE2 objects). They now extend to XML and Filename type signatures, plus the ability to identify character encodings and ‘text based’ formats.
The SQLite analysis tool will hopefully aid in the exploration of these additional dimensions of format identification. I hope that comes through in the current configuration, but feedback is crucial, and the tool is completely extensible and open to users’ requests. Help me keep track of what’s needed by commenting, and of what’s missing by logging it on the project’s issues page: https://github.com/exponential-decay/droid-sqlite-analysis/issues
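To make the one-to-many idea a little more concrete, here is a small sketch of how gaps between namespaces could be spotted. It is not how the analysis tool does it. The sample records are hand-written and only loosely shaped like Siegfried's JSON output (a list of matches per file, each carrying a namespace and an identifier); the namespace names "pronom" and "tika", the "UNKNOWN" marker, the filenames and the identifiers are assumptions for illustration rather than real results.

```python
from collections import defaultdict

# Hand-written sample data, illustrative only (not real Siegfried output).
SAMPLE_FILES = [
    {"filename": "report.doc", "matches": [
        {"ns": "pronom", "id": "fmt/40"},
        {"ns": "tika", "id": "application/msword"},
    ]},
    {"filename": "notes.txt", "matches": [
        {"ns": "pronom", "id": "x-fmt/111"},
        {"ns": "tika", "id": "text/plain"},
    ]},
    {"filename": "data.bin", "matches": [
        {"ns": "pronom", "id": "UNKNOWN"},
        {"ns": "tika", "id": "application/x-example"},
    ]},
]


def ids_by_namespace(files):
    """Map each filename to a {namespace: identifier} dictionary."""
    table = {}
    for entry in files:
        table[entry["filename"]] = {m["ns"]: m["id"] for m in entry["matches"]}
    return table


def find_gaps(files, unknown="UNKNOWN"):
    """List, per namespace, files it missed that another namespace identified."""
    gaps = defaultdict(list)
    for name, ids in ids_by_namespace(files).items():
        # Only count a miss as a gap if at least one namespace succeeded.
        someone_identified = any(i != unknown for i in ids.values())
        for namespace, identifier in ids.items():
            if identifier == unknown and someone_identified:
                gaps[namespace].append(name)
    return dict(gaps)


if __name__ == "__main__":
    for namespace, missed in find_gaps(SAMPLE_FILES).items():
        print(namespace, "missed:", ", ".join(missed))
```

In this made-up sample the only gap is a file that the "pronom" namespace cannot name but the "tika" namespace can, which is exactly the kind of difference a side-by-side report should surface.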
No more PRONOM only shops!
Refactoring my original DROID analysis to incorporate Siegfried has given me a number of new insights – not least into some of the challenges we give our digital preservation system vendors when they try to integrate the output from many different types of tool. Trying to understand how equal an identifier is compared to another identifier has been a curious one. Simply trying to take advantage of two tools instead of one is another.
To clarify though, this was a challenge in the paradigm with which I started the original project, when I was only thinking of PRONOM because that’s what our digital preservation system still uses as the gold standard. PRONOM is a good standard, but I hope some of the ways we might continue to break down the output of tools like Siegfried will demonstrate the importance of keeping our options open (done cleverly in Siegfried through namespaces) and might lead to more extensible data models in all aspects of digital preservation, not just format identification.
What’s in a name?
The GitHub repository is still called (rather snappily!) ‘droid-sqlite-analysis’ – this is no longer accurate. We can now handle DROID, and Siegfried, and soon Fido will fit right in. That’s really all thanks to viewing format identification through a different (wide-angle) lens.
For future users of the tool, I’d really like something actually snappy that might be easier to talk to the community about. If you’ve any ideas for a name – please comment below, or on Twitter (@beet_keeper), or on the GitHub issues log. Thank you!