Using siegfried tooling for signature development for #PRONOM2019


So, it’s nearly PRONOM Research Week 2019 and you want to get involved by creating a new file format signature (sign up here). In this post, I’ll outline how the two siegfried tools – sf and roy – can help in your signature development workflow.

If you haven’t used siegfried before, please follow the Getting started guide to get set up. If you haven’t written a file format signature before, check out some of the guides and blogs listed here.

OK. So the imaginary file format that I’m going to write a signature for is the café format (for exchanging coffee orders over Bluetooth… I can walk into any café with my phone and voilà!). Here’s a sample file that I’ve opened in a hex editor:

Hmmm. It’s mostly ASCII text but the first two bytes look promising: 0xCA and 0xFE. That’s CAFE in hexadecimal, so it must be the magic header for this format! Let’s use the PRONOM: Signature Development Utility to create a PRONOM signature using those two bytes at offset 0:
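To make the idea of a byte sequence anchored at offset 0 concrete, here is a minimal Python sketch. It is not part of siegfried or PRONOM; the helper names (`has_magic`, `write_sample`) and the ASCII body of the sample file are made up for illustration. Only the two magic bytes come from the post.

```python
import os
import tempfile

# The two magic bytes from the post, expected at offset 0 (beginning of file).
CAFE_MAGIC = bytes([0xCA, 0xFE])


def has_magic(path, magic=CAFE_MAGIC, offset=0):
    """Return True if `magic` appears at `offset` in the file at `path`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(len(magic)) == magic


def write_sample(path, body=b"1 x flat white\n"):
    """Write a made-up cafe file: the magic bytes followed by ASCII text.

    The body is a hypothetical stand-in for the real sample file.
    """
    with open(path, "wb") as f:
        f.write(CAFE_MAGIC + body)


# Try it out in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "cafe.cf")
    write_sample(sample)
    print(has_magic(sample))
```

A signature this short is cheap to write but, as with any two-byte match, it says very little about the rest of the file.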

If you save that as a file (I’m using the name “cafe.xml”) into a “custom” directory within your siegfried home directory (which you can identify by running sf -v), then you can build a signature file with it using the roy tool.

The “-extend” flag allows you to add one or more extension signatures to a signature file. So, for example, the command roy build -extend cafe.xml custom.sig creates a new signature file named “custom.sig” that contains PRONOM signatures plus my own custom one for the café format. To use that signature file when you run the sf tool, you need to use the “-sig” flag to tell sf to load that custom signature file: e.g. sf -sig custom.sig cafe.cf. Both commands are demonstrated below:
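If you would rather script those two steps, the sketch below wraps them in Python. The commands and flags are exactly the ones quoted above; the function names are hypothetical, and the file names (cafe.xml, custom.sig, cafe.cf) are simply the ones used in this post. It assumes roy and sf are on your PATH and returns None if they are not.

```python
import shutil
import subprocess


def roy_build_cmd(extension="cafe.xml", sig_name="custom.sig"):
    """Arguments for: roy build -extend <extension> <sig_name>"""
    return ["roy", "build", "-extend", extension, sig_name]


def sf_identify_cmd(target, sig_name="custom.sig"):
    """Arguments for: sf -sig <sig_name> <target>"""
    return ["sf", "-sig", sig_name, target]


def run_if_available(cmd):
    """Run the command if its executable is on PATH; otherwise return None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, capture_output=True, text=True, check=False)
```

Calling `run_if_available(roy_build_cmd())` followed by `run_if_available(sf_identify_cmd("cafe.cf"))` would run the build and then the identification; the `stdout` attribute of the second result holds whatever sf printed.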

You can see in the result that my new signature correctly identified the cafe.cf file. The “basis” field tells you exactly what bytes matched. In the provenance block at the top of the results, my extension file is now listed as part of the “details” field for the identifier: this information is a gift to the future me so that I can work out exactly why I got the results I did.

Note: you can also use the roy tool to build signature files with custom PRONOM container signatures. This is a little more complicated, as it requires both the “-extendc” flag (with the name of your custom container signature file) and an “-extend” flag (with a regular custom extension file that contains the additional metadata – PUID, relationships to other formats, etc. – that a PRONOM container signature file doesn’t store).

So… there is something fishy with my café format that I’m sure some of you picked up on. The two bytes 0xCA and 0xFE that identify my format aren’t unique to it: they are also part of the signature for compiled Java class files (x-fmt/415), which begin with 0xCAFEBABE. This might cause problems identifying Java programs in the future (meh, no great loss). You can use the roy inspect command to identify overlapping signatures like this. Siegfried actually uses this functionality to build inferred relationships so that, even if they are not declared formally, such overlaps won’t prevent the longer signature from being matched.
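The overlap can be illustrated with a small Python sketch. This is not how siegfried implements matching or priorities internally; it is a toy model, assuming both signatures are plain byte sequences anchored at the beginning of the file, and the function names are made up. dev/1 is the PUID given to the custom format later in this post.

```python
# Byte sequences anchored at the beginning of the file (a simplification).
SIGNATURES = {
    "dev/1": bytes.fromhex("CAFE"),          # the custom cafe format
    "x-fmt/415": bytes.fromhex("CAFEBABE"),  # Java class file magic number
}


def overlaps(signatures):
    """Return (shorter, longer) PUID pairs where one sequence prefixes the other."""
    pairs = []
    for a, seq_a in signatures.items():
        for b, seq_b in signatures.items():
            if a != b and len(seq_a) < len(seq_b) and seq_b.startswith(seq_a):
                pairs.append((a, b))
    return pairs


def identify(data, signatures, priorities):
    """Return matching PUIDs, dropping matches superseded by a longer match."""
    hits = {puid for puid, seq in signatures.items() if data.startswith(seq)}
    superseded = {short for short, longer in priorities if longer in hits}
    return sorted(hits - superseded)


inferred = overlaps(SIGNATURES)
java_class = bytes.fromhex("CAFEBABE00000034")
cafe_file = b"\xca\xfe1 x flat white\n"  # hypothetical cafe file content

print(identify(java_class, SIGNATURES, []))        # no priorities: both match
print(identify(java_class, SIGNATURES, inferred))  # longer signature wins
print(identify(cafe_file, SIGNATURES, inferred))   # only the short one matches
```

Without the inferred relationship, a Java class file matches both signatures; with it, the shorter match gives way to the longer one.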

The basic usage of roy inspect is a command like roy inspect fmt/1. This will print out a short description of that format. I can print out a description of my custom format (to which I gave the PUID dev/1) with the command roy inspect -extend cafe.xml dev/1. To identify overlapping signatures, use the command roy inspect missing-priorities. Again, this can be combined with my extension file by doing roy inspect -extend cafe.xml missing-priorities. The output looks like this:

The output of roy inspect missing-priorities is in graphviz’s DOT format. If you have graphviz installed, you can make a nice graph with it. There are also online tools to do this, such as sketchviz. Uploading that output to the sketchviz website gives me this graph (cropped to remove some of the overlaps):
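If you haven’t met DOT before, the following Python sketch shows roughly what a directed graph looks like in that format. It does not reproduce roy’s actual output: the graph name, the edge direction and the `to_dot` helper are assumptions made for illustration, using the two PUIDs from this post.

```python
def to_dot(edges, name="overlaps"):
    """Render (source, target) pairs as a DOT digraph with quoted node names."""
    lines = [f"digraph {name} {{"]
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


# One illustrative edge: the short custom signature overlaps the longer one.
print(to_dot([("dev/1", "x-fmt/415")]))
```

With graphviz installed, saving text like this to a file and running `dot -Tpng overlaps.dot -o overlaps.png` renders it as an image.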

In this picture, I can see clearly the overlap between my new signature and the existing one for x-fmt/415. Resolving this overlap may require further work on the signature, or it might just be a matter of adding in a relationship to that other format.

Thanks for reading this far. I hope it was fun, and please take part in PRONOM Research Week 2019!
