So, it’s nearly PRONOM Research Week 2019 and you want to get involved by creating a new file format signature (sign up here). In this post, I’ll outline how the two siegfried tools – sf and roy – can help in your signature development workflow.
OK. So the imaginary file format that I’m going to write a signature for is the café format (for exchanging coffee orders over bluetooth… I can walk into any café with my phone and voila!). Here’s a sample file that I’ve opened in a hex editor:
Hmmm. It’s mostly ASCII text but the first two bytes look promising: 0xCA and 0xFE. That’s CAFE in hexadecimal, so it must be the magic header for this format! Let’s use the PRONOM: Signature Development Utility to create a PRONOM signature using those two bytes at offset 0:
If you save that as a file (I’m using the name “cafe.xml”) into a “custom” directory within your siegfried home directory (which you can identify by running sf -v), then you can build a signature file with it using the roy build command and its “-extend” flag.
The “-extend” flag allows you to add one or more extension signatures to a signature file. So, for example, the command roy build -extend cafe.xml custom.sig creates a new signature file named “custom.sig” that contains the PRONOM signatures plus my own custom one for the café format. To use that signature file when you run the sf tool, use the “-sig” flag to tell sf to load it: e.g. sf -sig custom.sig cafe.cf. Both commands are demonstrated below:
You can see in the result that my new signature correctly identified the cafe.cf file. The “basis” field tells you exactly what bytes matched. In the provenance block at the top of the results, my extension file is now listed as part of the “details” field for the identifier: this information is a gift to the future me so that I can work out exactly why I got the results I did.
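If you find yourself rebuilding and re-testing a lot during signature development, the build-then-identify sequence described above can be scripted. Here is a rough Python sketch: the function names are my own invention, the file names are the ones used in this post, and it assumes roy and sf are on your PATH and that cafe.xml sits in your “custom” directory.

```python
import subprocess
from typing import List


def roy_build_cmd(extension: str, sig_name: str) -> List[str]:
    """Arguments for: roy build -extend <extension> <sig_name>"""
    return ["roy", "build", "-extend", extension, sig_name]


def sf_identify_cmd(sig_name: str, target: str) -> List[str]:
    """Arguments for: sf -sig <sig_name> <target>"""
    return ["sf", "-sig", sig_name, target]


def run_workflow(extension: str = "cafe.xml",
                 sig_name: str = "custom.sig",
                 target: str = "cafe.cf") -> None:
    """Rebuild the signature file, then identify the sample with it.

    Hypothetical helper: needs roy and sf installed, so it is only
    defined here and not called.
    """
    for cmd in (roy_build_cmd(extension, sig_name),
                sf_identify_cmd(sig_name, target)):
        subprocess.run(cmd, check=True)  # stop if either step fails


# Print the two commands rather than running them.
print(" ".join(roy_build_cmd("cafe.xml", "custom.sig")))
print(" ".join(sf_identify_cmd("custom.sig", "cafe.cf")))
```

Calling run_workflow() after each edit to cafe.xml gives a quick feedback loop while you refine the signature.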
Note: you can use the roy tool to build signature files with custom PRONOM container signatures also. This is a little more complicated as it requires using both the “-extendc” flag (with the name of your custom container signature file) and an “-extend” flag (with a regular custom extension file that contains the additional metadata – PUID, relationships to other formats, etc. – that a PRONOM container signature file doesn’t store).
So… there is something fishy with my café format that I’m sure some of you picked up on. The two bytes 0xCA and 0xFE that identify my format aren’t unique to it: they are also part of the signature for compiled Java files (x-fmt/415). This might cause problems identifying Java programs in the future (meh, no great loss). You can use the roy inspect command to identify overlapping signatures like this. Siegfried actually uses this functionality to build inferred relationships so that, even if not declared formally, such overlaps won’t prevent the longer signature being matched.
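To see why this counts as an overlap, here is a small Python sketch. It assumes the x-fmt/415 signature begins with the well-known Java class file magic number 0xCAFEBABE (the real PRONOM signature may contain more than that), and the function name is made up for illustration. It is not how siegfried detects overlaps internally.

```python
CAFE_SIG = bytes.fromhex("CAFE")            # my café signature (dev/1)
JAVA_CLASS_SIG = bytes.fromhex("CAFEBABE")  # Java class file magic number


def overlaps_at_bof(shorter: bytes, longer: bytes) -> bool:
    """True if `shorter` is a prefix of `longer`.

    When both signatures are anchored at offset 0, every file that
    matches `longer` will then also match `shorter`.
    """
    return len(shorter) <= len(longer) and longer.startswith(shorter)


print(overlaps_at_bof(CAFE_SIG, JAVA_CLASS_SIG))  # True
print(overlaps_at_bof(JAVA_CLASS_SIG, CAFE_SIG))  # False
```

Every compiled Java file would therefore also match the café signature, which is why a relationship between the two formats is needed to let the longer signature win.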
The basic usage of roy inspect is a command like roy inspect fmt/1. This will print out a short description of that format. I can print out a description of my custom format (to which I gave the PUID dev/1) with the command roy inspect -extend cafe.xml dev/1. To identify overlapping signatures, use the command roy inspect missing-priorities. Again, this can be combined with my extension file by running roy inspect -extend cafe.xml missing-priorities. The output looks like this:
The output of roy inspect missing-priorities is in graphviz’s DOT format. If you have graphviz installed, you can make a nice graph with it. There are also online tools to do this, such as sketchviz. Uploading that output to the sketchviz website gives me this graph (cropped to remove some of the overlaps):
In this picture, I can see clearly the overlap between my new signature and the existing one for x-fmt/415. Resolving this overlap may require further work on the signature, or it might just be a matter of adding in a relationship to that other format.
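If you’d rather have a plain list of overlaps than a picture, the edges can be pulled out of the DOT text with a few lines of Python. This is only a sketch: the sample DOT below is invented for illustration, and both the exact layout of the real roy output and the direction of its edges are assumptions on my part.

```python
import re

# Hypothetical DOT text; real roy inspect missing-priorities output
# may be laid out differently.
SAMPLE_DOT = '''digraph {
  "dev/1" -> "x-fmt/415";
}'''

# Matches edges such as "dev/1" -> "x-fmt/415" (quotes optional).
EDGE = re.compile(r'"?([\w/.-]+)"?\s*->\s*"?([\w/.-]+)"?')


def dot_edges(dot_text: str):
    """Return a list of (from, to) pairs, one per edge in the DOT text."""
    return EDGE.findall(dot_text)


for source, target in dot_edges(SAMPLE_DOT):
    print(source, "overlaps with", target)
```

Filtering that list for your own PUID (dev/1 in my case) is a quick way to check whether a new signature collides with anything already in PRONOM.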
Thanks for reading this far, I hope it was fun, and please take part in PRONOM Research Week 2019!