Jenny Mitcham, Digital Archivist at the University of York, started a nice snowball rolling last week when she asked “Research data – what does it really look like?”
Paul Young at The National Archives, UK, was one of those to respond, showing that perhaps the snowball has been gathering momentum for a number of years now, thanks largely to colleagues and friends there, and to a lot of community collaboration.
Perhaps now is the time to use this momentum and take the snowball right the way to the bottom. Given that, I wanted to contribute what I could to try and convince others to give signature development a go.
Euan Cochrane asked on Twitter if he would one day see the format identification gaps in one of his collections at less than 5%:
My response: Of course! As long as you can contribute signatures:
One of the neat things about DROID when it was created was that it lent itself to being programmed, by us! And I find that pretty liberating. In fact, if anyone out there reading this goes away to develop signatures, you will see that it can be programmed to perform regular expression operations on any arbitrary byte stream. See my blog, Hacking the DROID signature file for characterization: http://exponentialdecay.co.uk/blog/hacking-the-droid-signature-file-for-characterization/
As Andy Jackson points out in his blog above, developing identification techniques for text formats is still an important frontier in identification. This is because DROID has its limitations. DROID’s current strength is in looking at neat strings of binary data that we can write in hexadecimal, and it’s only doing pattern matching. Heuristic analysis of files to discover encoding, or other pieces of information that might help us to identify textual formats (and some less textual formats), is more difficult for the tool.
But that doesn’t hold us back.
Part of what we’ll achieve through DROID signature development is to generate enough evidence and community understanding about which formats do not suit DROID’s particular style of identification. And for every file that is textual, there is still plenty of binary to go around.
As such, the contribution I’d like to make for folks like Jenny, or Rachel, who might want to dip their toes into developing binary signatures, is to offer five principles of file format signature development:
The five principles
- Tell the community about your identification gaps
- Share sample files
- Develop basic signatures
- Where feasible, engage openly with the community
- Seek supporting evidence
The five principles expanded:
1) Tell the community about your identification gaps
Andy talks about sharing profiles. Profiles overall are pretty interesting, but in Max’s and Jenny’s profiles, it is the 2% and 65% values that are most interesting. These represent the formats that we immediately need to work on, and someone else might have the same gaps that need filling. Tell the community about your identification gaps and there’s a chance someone can pick up the format and work with it where you might not have the resources.
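To put a number on a gap like that, here is a minimal Python sketch. It assumes a DROID-style CSV export in which each row has a TYPE column (with the value File for files) and a PUID column that is left empty when nothing was identified. Those column names and the sample rows are assumptions for illustration, so check them against your own export before relying on the result.

```python
import csv
import io


def identification_gap(csv_text: str) -> float:
    """Return the percentage of file rows that have no PUID.

    Assumes (hypothetically) TYPE and PUID columns, with PUID empty
    for anything the tool could not identify.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # Only count files; folders never receive an identification.
    files = [row for row in reader if row.get("TYPE") == "File"]
    if not files:
        return 0.0
    unidentified = sum(1 for row in files if not (row.get("PUID") or "").strip())
    return 100.0 * unidentified / len(files)


# Made-up sample rows: four files, one of them unidentified.
SAMPLE = """NAME,TYPE,PUID
docs,Folder,
a.tif,File,fmt/353
b.pdf,File,fmt/18
c.png,File,fmt/11
d.dat,File,
"""

print(identification_gap(SAMPLE))  # 25.0
```

The figure that comes out is the kind of value worth sharing, along with a list of the files behind it.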
2) Share sample files
If you have gaps but haven’t the time to work on signature development just yet, sharing those files now might help someone else sooner, and it certainly reduces the gap in time between The National Archives, UK, releasing a signature file and the submission of those objects to your repository.
In fact, sharing as much as possible, whatever the urgency, can help put more eyes on a problem. There’s no guarantee that working in the open community like this will give you results immediately, but the information is there to be found and referenced.
Ideally, we have a place we can put samples, or at the very least, share links; where subscribed users can receive updates about new submissions. I’ve set up a GitHub repository that might work. I’ve tried to crack licensing too, but please take a look, critique, and use, as required; collaborators can be added to provide many eyes: https://github.com/exponential-decay/community-ff-signature-development-repository
3) Develop basic signatures
The first part of signature development is finding the commonalities between examples of a format. When I showed Jenny that a TIF signature starts with just four consistent bytes that are always there, it showed that, at least at the very beginning, it is that simple.
Paul noted that some signatures are more complex, and this is true, but a lot of that complexity comes from trying to iron out things such as false positives; see Fetherston and Gollins (2012): http://www.ijdc.net/index.php/ijdc/article/view/201.
But we have tools to help there too: http://exponentialdecay.co.uk/blog/tag/skeleton-test-corpus/ and The National Archives, UK, plus other colleagues, will help to ensure the veracity of a signature as part of their test procedures.
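To show how simple that first step is, here is a small Python sketch of a four-byte check. The two byte sequences are the standard TIFF headers for little-endian and big-endian files. The function name is my own, and this is an illustration of the idea rather than how DROID or PRONOM implement the match.

```python
# The two standard TIFF headers, written in hexadecimal as you would
# see them in a hex editor.
TIFF_HEADERS = (
    bytes.fromhex("49492A00"),  # "II*\0": little-endian byte order
    bytes.fromhex("4D4D002A"),  # "MM\0*": big-endian byte order
)


def looks_like_tiff(data: bytes) -> bool:
    """Return True if the first four bytes match a TIFF header."""
    return data[:4] in TIFF_HEADERS


# The first few bytes of a little-endian TIFF, and of a PDF.
print(looks_like_tiff(bytes.fromhex("49492A0008000000")))  # True
print(looks_like_tiff(b"%PDF-1.4"))  # False
```

A match on four bytes alone is only a starting point; testing against other formats is what tells you whether the signature needs to be made stricter.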
You only need a Hex Editor: https://mh-nexus.de/en/hxd/
4) Where feasible, engage openly with the community
Share, share, and share! Blog! Tweet! And especially use resources that we have, like the droid-list, where the communication is transparent and can be rediscovered by others. Engage by testing others’ work, and let them know how you get on. Sometimes even pointing others to further information is enough to spur research.
5) Seek supporting evidence
Seek files outside of your organisation, look for specifications, or reverse engineering efforts, and share this information on format pages on the Just Solve It wiki: http://fileformats.archiveteam.org/wiki/Main_Page
If the baseline is commonalities between examples of formats from heterogeneous sets of collections, then we can see that perfection is hard to reach. A strong set of bytes to match against and a good level of testing will often be enough, but if we do have additional information, then further down the line we might be able to use it to strengthen a signature if we find faults, such as false positives when looking at other collections.
It might not always be possible to follow all five principles; the first four are the ones I find most crucial, and the fifth helps us greatly in future-proofing and then extending our work into characterization. Fifth aside, in picking apart what we can with file format signatures, we’re still reverse engineering! And that’s pretty cool!
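One way to look for those commonalities across a set of samples is to compare their opening bytes. The Python sketch below finds the longest run of leading bytes shared by every sample and prints it as hexadecimal. The function names and the sample bytes are made up for illustration; real signatures may also sit at the end of a file or at variable offsets, which this simple approach will not find.

```python
from typing import Sequence


def common_prefix(samples: Sequence[bytes]) -> bytes:
    """Return the longest run of leading bytes shared by all samples."""
    if not samples:
        return b""
    shortest = min(samples, key=len)
    for i, byte in enumerate(shortest):
        # Stop at the first position where any sample disagrees.
        if any(sample[i] != byte for sample in samples):
            return shortest[:i]
    return shortest


def to_hex(data: bytes) -> str:
    """Render bytes as upper-case hexadecimal, as in a hex editor."""
    return data.hex().upper()


# Made-up samples that share their first four bytes.
samples = [
    bytes.fromhex("CAFED00D0001FFFF"),
    bytes.fromhex("CAFED00D0002ABCD"),
    bytes.fromhex("CAFED00D10"),
]
print(to_hex(common_prefix(samples)))  # CAFED00D
```

The more heterogeneous the samples, the more trustworthy the shared bytes become, which is another reason to seek files from outside your own organisation.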
On Twitter I linked to my own ‘toolkit’ for signature development; blogs available on this site: https://openpreservation.org/knowledge/blogs/topic/signature-development/
But I’ll link finally back to Paul at The National Archives whose blog contains interesting information about contributions to PRONOM plus additional important links about how to contribute, and information about the fundamentals of signature development also: http://blog.nationalarchives.gov.uk/blog/identifying-digital-file-formats-collaborative-effort/
Can you become a five-star file format signature developer?!
For anyone that gives it a go – follow Jenny’s example, and blog – let the community know what you find!