DROID Container Signature Files: What they are and how to create them: A template and an example, or few…

This is the second blog inspired by my visit to colleagues at the National Library of Australia last August. The first discusses a federated approach to better incorporating custom signatures into the PRONOM signature base without modifying PRONOM. The essence of this blog, however, still centers on how the community can create signatures for itself, and make use of them, while The National Archives focuses on its own work and releases larger PRONOM updates as frequently as it needs to.

The first blog in the series links to guidance that helps users understand and create a basic signature for a file format that DROID can recognize, so we won’t go over that ground again here. What we will cover is the concept, and creation, of a container signature: a mechanism for identifying, with greater precision, the contents of container file formats based on ZIP or OLE2. The most popular examples are the early Microsoft Office formats, often found in OLE2 containers, and the later (post-2007) Microsoft Office formats, now often found in ZIP containers.

I will briefly introduce container signatures, before discussing how to develop them. If you’re comfortable with differentiating between what DROID is doing with its standard identification mechanism vs. container identification, you can skip forward here: Developing a Container Signature for DROID

Introduction to Container Signatures

Container signatures were introduced in DROID 6. They enable DROID to make additional decisions about how it will process a file. If standard identification returns a trigger PUID, the file is routed to container identification; otherwise the standard result stands:

Format Identification Flow Diagram

Container identification works largely the same way as standard format identification, but also enables the identification of paths within containers if a byte sequence (magic number) does not exist or cannot be divined from a sample of objects.
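As a rough illustration of that routing, here is a minimal Python sketch. It is not DROID's code: the function name `route` and the `container_match` callback are hypothetical, and the only facts taken from PRONOM are the two trigger PUIDs, fmt/111 (OLE2) and x-fmt/263 (ZIP).

```python
# Trigger PUIDs: fmt/111 is the generic OLE2 format and x-fmt/263 the
# generic ZIP format in PRONOM.
TRIGGER_PUIDS = {"fmt/111": "OLE2", "x-fmt/263": "ZIP"}


def route(standard_puid, container_match=None):
    """Return the PUID reported after the optional container step.

    standard_puid:   result of standard (byte sequence) identification.
    container_match: hypothetical callable that takes a container type
                     ("OLE2" or "ZIP") and returns a more specific PUID,
                     or None when no container signature matches.
    """
    container_type = TRIGGER_PUIDS.get(standard_puid)
    if container_type is None or container_match is None:
        # Not a trigger PUID: the standard result stands.
        return standard_puid
    refined = container_match(container_type)
    # Fall back to the generic container PUID when nothing matches.
    return refined if refined is not None else standard_puid
```

For example, `route("x-fmt/263", lambda kind: "dev/1")` models a ZIP file whose contents match a container signature mapped to the PUID dev/1.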

Container identification in DROID also places the tool radically ahead of its elder sibling, PRONOM: the public-facing PRONOM data model has not (yet, at least) been expanded to enable modelling of container signatures, so the signatures are modelled externally as a separate but complementary data model extension. The collection of identification rules for containers can be downloaded as individual XML files here: http://www.nationalarchives.gov.uk/aboutapps/pronom/droid-signature-files.htm

Container signatures
14 January 2011
4 February 2011
11 June 2012
28 August 2012
18 December 2012
26 February 2013
1 May 2013
12 November 2013
27 February 2014
17 July 2014
23 September 2014
18 February 2015
27 March 2015
17 December 2015

For DROID’s sake, each filename encodes the date the file was created. The date acts as a unique ID, enables sorting within DROID’s preferences dialog, and lets users select a different file if they want an alternative to the default.
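To show how a date-stamped filename supports sorting, here is a small Python sketch. It assumes the date is written as eight digits (YYYYMMDD) directly before the .xml extension, as in the example filename opf-dev1-signaturefile-20160107.xml used later in this post. The function names are hypothetical, and this is not how DROID itself parses the name.

```python
import re
from datetime import date

# Assumption: eight digits (YYYYMMDD) immediately before ".xml".
_DATE_SUFFIX = re.compile(r"(\d{4})(\d{2})(\d{2})\.xml$")


def signature_file_date(filename):
    """Return the date encoded at the end of the filename, or None."""
    match = _DATE_SUFFIX.search(filename)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Eight digits that do not form a real calendar date.
        return None


def newest(filenames):
    """Return the filename carrying the most recent date, or None."""
    dated = [(signature_file_date(name), name) for name in filenames]
    dated = [(when, name) for when, name in dated if when is not None]
    return max(dated)[1] if dated else None
```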

Differences between standard signature files and container signature files

While PRONOM does model the earliest form of DROID signature, it is converted into a DROID compatible signature file before it is used in the tool – I often call this DROID byte code. The output is a single file that indexes the PRONOM corpus and the converted byte codes.

Container signatures are modelled in a different but complementary style to the basic DROID signature file. We can see the difference in format by comparing the two XML structures side by side. I have used block diagrams below to document the main structures used:

DROID Signature File Structures Side-by-side

For a container file to be identified, a reference to it must exist in the standard signature file. This will normally take the form of a title/description, a file format extension, and a PRONOM unique identifier (PUID).

Provided those exist, similar data will be repeated in the container signature file, partly for documentation purposes, but also partly to link the two models together. The majority of the work will be done in linking the two mechanisms through the trigger PUIDs. As the earlier flow chart describes, once a PUID triggers container identification, DROID passes the byte stream as input to the container identification mechanism. If a match is discovered, the canonical reference information in PRONOM will be returned to the user through the DROID user interface.

To balance detail on the mechanics against practical guidance that will help users, I will use the remainder of this blog to describe what you need to do to create your own container signatures, in the hope that it will help to boost community development of PRONOM data.

Developing a Container Signature for DROID

To develop a container signature for DROID you need to update three sections between the two signature files:

  • Standard Signature File: Add File Format in File Format Collection
  • Container Signature File: Add File Format Mapping in File Format Mappings
  • Container Signature File: Add Container Signature in Container Signatures

To make that easier, I have created two templates on GitHub here: https://github.com/exponential-decay/droid-signature-files/tree/master/signature-file-templates. Simply edit the commented-out parts of the signature files to begin creating your own container signature.

To illustrate how we then add data to these sections to create a working container identification mechanism, I will create an example based on a theoretical ZIP based file format that is documented, and can be downloaded here: https://github.com/exponential-decay/droid-signature-files/tree/master/openpreserve-container-blog/dev1-sample-container-file

Aside from the container type, the other important pieces of information I have specified are the structure (folders and files), and two byte sequences that make up the content of the files. We will ask DROID to match the sample file based on being a ZIP based container, with a *.OPF extension, containing all the folders specified and all the byte sequences within the files specified.

The structure is copied from my GitHub and shown below:

   ZIP: dev1-sample-container-format.opf
       |
       |
       +--+ DIR: path-to-find-one\
       |
       |
       +--+ DIR: path-to-find-two\
       |      +
       |      |
       |      |
       |      +--+ FILE: file-to-read-two
       |
       |
       +--+ FILE: file-to-read-one

The byte sequences we will match have been described in the repository also:

  • file-to-read-one, expressed as a PRONOM regex: DE C0 DE F1 1E
  • file-to-read-two, expressed as a PRONOM regex: DE C0 DE F1 1E {9} DA 7A

Within a container signature you can match both file paths and internal byte sequences of files. We can define a byte sequence using standard PRONOM signature syntax. We will use both of these techniques.
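If you want to experiment without downloading the sample, a file with the same structure can be approximated in a few lines of Python. This is only a sketch of the layout described above: `build_sample` is a hypothetical helper, and the nine filler bytes standing in for {9} are set to 0x20 (spaces) to agree with the Sequence element used in the container signature in this post.

```python
import io
import zipfile

SEQ_ONE = bytes.fromhex("DE C0 DE F1 1E")
# {9} in PRONOM syntax means nine bytes of any value; 0x20 is the filler
# chosen here so the file agrees with the signature's Sequence element.
SEQ_TWO = SEQ_ONE + b"\x20" * 9 + bytes.fromhex("DA 7A")


def build_sample():
    """Return the bytes of a ZIP laid out like the dev/1 sample file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # A name ending in "/" is stored as a directory entry.
        archive.writestr("path-to-find-one/", b"")
        archive.writestr("path-to-find-two/file-to-read-two", SEQ_TWO)
        archive.writestr("file-to-read-one", SEQ_ONE)
    return buffer.getvalue()
```

Writing the returned bytes to a file named with the .opf extension gives a test object with both the paths and the byte sequences in place.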

The result, within the specific sections of XML we need to modify, looks like:

Standard Signature File: Add File Format in File Format Collection

<FileFormat ID="1" Name="opf-sample-format" PUID="dev/1" Version="1.0" MIMEType="x-application/opf-sample-file">
   <Extension>opf</Extension>
</FileFormat>

Container Signature File: Add File Format Mapping in File Format Mappings

 <!-- Sample OPF Container File (ZIP) -->
 <FileFormatMapping signatureId="1000" Puid="dev/1"/>

Container Signature File: Add Container Signature in Container Signatures

<ContainerSignature Id="1000" ContainerType="ZIP">
   <Description>opf-sample-container-format</Description>
   <Files>
       <File>
           <Path>path-to-find-one/</Path>
       </File>
       <File>
           <Path>path-to-find-two/file-to-read-two</Path>
           <BinarySignatures>
               <InternalSignatureCollection>
                  <InternalSignature ID="300">
                      <ByteSequence Reference="BOFoffset">
                          <SubSequence Position="1" SubSeqMinOffset="0" SubSeqMaxOffset="0">
                              <Sequence>DE C0 DE F1 1E 20 20 20 20 20 20 20 20 20 DA 7A</Sequence>
                          </SubSequence>
                      </ByteSequence>
                  </InternalSignature>
              </InternalSignatureCollection>
           </BinarySignatures>
       </File>
       <File>
           <Path>file-to-read-one</Path>
           <BinarySignatures>
               <InternalSignatureCollection>
                  <InternalSignature ID="301">
                      <ByteSequence Reference="BOFoffset">
                          <SubSequence Position="1" SubSeqMinOffset="0" SubSeqMaxOffset="0">
                              <Sequence>DE C0 DE F1 1E</Sequence>
                          </SubSequence>
                      </ByteSequence>
                  </InternalSignature>
              </InternalSignatureCollection>
           </BinarySignatures>
       </File>
   </Files>
</ContainerSignature>

The complete signature files can be seen here: https://github.com/exponential-decay/droid-signature-files/tree/master/openpreserve-container-blog/dev1-sample-container-signature
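Before loading the signature into DROID, it can help to confirm independently that a sample really contains what the signature expects. The Python sketch below mirrors the ContainerSignature element above: every listed path must exist, and files with a byte sequence must begin with it. It is an independent check under those assumptions, not a reimplementation of DROID's matcher, and `matches` and `EXPECTED` are hypothetical names.

```python
import io
import zipfile

# Path -> bytes expected at the beginning of the file (None = path only).
# Values transcribed from the ContainerSignature element above.
EXPECTED = {
    "path-to-find-one/": None,
    "path-to-find-two/file-to-read-two": bytes.fromhex(
        "DEC0DEF11E" + "20" * 9 + "DA7A"
    ),
    "file-to-read-one": bytes.fromhex("DEC0DEF11E"),
}


def matches(zip_bytes, expected=EXPECTED):
    """Return True if every expected path and byte prefix is present."""
    try:
        archive = zipfile.ZipFile(io.BytesIO(zip_bytes))
    except zipfile.BadZipFile:
        return False
    with archive:
        names = set(archive.namelist())
        for path, prefix in expected.items():
            if path not in names:
                return False
            if prefix is not None and not archive.read(path).startswith(prefix):
                return False
    return True
```

If this check fails on your sample, the signature is unlikely to match in DROID either, which narrows down whether a problem lies in the file or in the XML.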

When used with DROID, and the sample file, we see the following result:

Sample File Identified in DROID

Success!

A point to note: signature files are stored in, and will need to be placed in, the .droid6 configuration folder on your system. On Windows, this will be under your user profile:

C:\Users\spencero\.droid6

And on Linux this should be under your home directory:

$HOME/.droid6 (for example, /home/spencero/.droid6)

Within the .droid6 folder, before running DROID with the new signature files in place, I will also delete the folders ‘profile_templates’ and ‘profiles’. This will ensure that when DROID loads it will re-build the signature definitions correctly – extensive testing and re-testing of signatures can often confuse DROID at this point.
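When testing repeatedly, that clean-up step can be scripted. The following Python sketch removes the two folders if they exist. The helper name `clear_droid_profiles` is hypothetical, and the default location assumes the .droid6 folder sits directly in your home directory. Note that deleting ‘profiles’ discards any saved DROID profiles, so use it on a test setup only.

```python
import shutil
from pathlib import Path


def clear_droid_profiles(droid_home=None):
    """Delete 'profile_templates' and 'profiles' from the DROID folder.

    droid_home: path to the .droid6 folder; defaults to ~/.droid6,
                which is an assumption about where DROID keeps it.
    Returns the names of the folders that were actually removed.
    """
    home = Path(droid_home) if droid_home else Path.home() / ".droid6"
    removed = []
    for name in ("profile_templates", "profiles"):
        target = home / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name)
    return removed
```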

Development Comments

Signature development will take a certain amount of trial and error and DROID is not yet built to support community development to its fullest. While writing this blog I made note of the following things I had to be aware of:

  • Curly bracket syntax, {x}, for skipping x bytes is not supported in container signatures. It is supported for standard signatures only.
  • To describe a folder in the Path field, the string must be terminated with a forward slash, ‘/’. Note: a backslash wasn’t tested.
  • Other syntax is supported in the container model that you will not see in the standard signature file at present, namely, you can mix bytes and strings in the same structure, e.g. 00 00 00 ‘Microsoft Works’ 00
  • The container signature filename does require a date for DROID to even see it, and this date must be the last part of the string before the .xml extension, e.g. opf-dev1-signaturefile-20160107.xml
  • Over time I have noticed other issues, and will document them, or ask for guidance on the droid-list Google Group: https://groups.google.com/forum/#!forum/droid-list
  • Start with simple byte sequences in container signatures and then increase complexity. The more detailed the expression the more issues and potential bugs you may uncover. Pushing the system will be useful, but I do recommend making your signatures work first and then optimizing.
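Because the curly bracket shorthand is unavailable, the example signature spells out every byte literally. One way to avoid typing errors is to print the leading bytes of a sample file in spaced hex and paste them into the Sequence element. The helper below is a hypothetical convenience, not part of DROID or PRONOM tooling.

```python
def to_pronom_hex(data, length=None):
    """Format bytes as space-separated upper-case hex pairs.

    data:   bytes read from the start of a sample file.
    length: optional number of leading bytes to include.
    """
    chunk = data if length is None else data[:length]
    return " ".join(f"{byte:02X}" for byte in chunk)
```

For instance, `to_pronom_hex(open(path, "rb").read(), 16)` would give the first sixteen bytes of the file at a path of your choosing.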

To understand container signatures more, it is a good idea to look at the most recent index provided by The National Archives, UK, based on date, above: 17 December 2015. This will give you a good idea about the techniques and standards The National Archives, UK, adopts.

You can also see my basic container signatures here: https://github.com/exponential-decay/droid-signature-files. There is a mix of OLE2 and ZIP-based signatures, and they will likely grow over time. These may provide a good start for you to get into container development.

How do you begin custom development?

I recommend trying to download the samples from my GitHub, reviewing the documentation here as a tutorial, reviewing the contents of the XML, and trying to observe the structure of the sample file. I’d then recommend trying to get them to work in conjunction with your version of DROID so that you can see how DROID puts everything together and how further development might practically work for you.

The example described can be manipulated easily for ZIP-based objects, but OLE2 objects work, and can be viewed, in much the same way. Once you have a candidate set of files you want to create a signature from, a tool such as 7-Zip can be used to extract the structure from the object; it handles an OLE2 object much like a ZIP object. The directory listing and byte sequence specification for DROID will look exactly the same.

OLE2 Structure in 7-ZIP

It can sometimes help to make a working copy of a file and rename it with a .ZIP extension to be able to open it more easily.

Also note that some OLE2 files are more ‘stubborn’ than others: they are likely to be an earlier version than 7-Zip can handle, and so will not be straightforward to open, or 7-Zip will complain. In that instance, a purer implementation of an OLE2 library might be needed to extract the object. I have a Jython wrapper that can be used in conjunction with Apache POI to extract the objects from most OLE2 files here: https://github.com/exponential-decay/ole2-re-combiner

To identify objects that might be candidates for container signatures, it is best to focus on finding the ZIP and OLE2 magic numbers rather than the OOXML described in the first flow chart. You don’t need the entire signature: there are enough clues in the sequence at the beginning of file (BOF). The beginning of these objects will always look like this in a hex editor:

Magic Numbers Shown in Hex Editor

The top sequence belongs to OLE2, the second, to ZIP.

Without a hex editor, if your file is identified as fmt/111 or x-fmt/263 in DROID, but you were expecting confirmation of a different format then this will be a cue for further analysis within a hex editor and with tools other than DROID.
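The same leading bytes can also be checked with a few lines of Python. The sketch below tests for the OLE2 magic number (D0 CF 11 E0 A1 B1 1A E1) and the ZIP local file header (50 4B 03 04, "PK" followed by 03 04). The function name `container_type` is hypothetical, and checking only the beginning of the file is a simplification of what an identification tool does.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ZIP_MAGIC = b"PK\x03\x04"


def container_type(data):
    """Return "OLE2", "ZIP", or None based on the leading bytes."""
    if data.startswith(OLE2_MAGIC):
        return "OLE2"
    if data.startswith(ZIP_MAGIC):
        return "ZIP"
    return None
```

Reading the first eight bytes of a file is enough input for this check.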

Once you’ve started to look at the files you can start to spot the patterns that DROID will need to understand to identify your files moving forward. The Google Group droid-list is my first stop for talking about new signatures, and [email protected] is my next stop for submitting draft/nearly final versions of signature to The National Archives, UK.

Concurrent Different Signature Files

In the companion blog to this, I discuss a mechanism offered by Siegfried’s helper tool Roy that enables the combination of signatures from multiple sources. Right now, with DROID, we have two options: test signatures in isolation, as we are doing above, or manipulate both officially released signature files by hand to incorporate our additions. I’d recommend the former for one-off testing. I’d recommend Siegfried for more complex testing and, given enough confidence in development, potential production use within your own system.

ROY tips can be found here: https://github.com/richardlehane/siegfried/wiki/Building-a-signature-file-with-ROY

The specific example needed to incorporate signature extensions to a standard DROID signature file is as follows:

roy build -extend custom-fmt1.xml,custom-fmt2.xml

(Add custom signatures in DROID format, e.g. using this utility. The custom signatures should be placed in a custom directory within your home directory.)

Wrapping Up

I haven’t been able to find time to rework and extend my signature development utility to incorporate container signatures, but then it is not yet clear that the community requires it. As a first step, until then, or until The National Archives offers its own tools, I hope I’ve managed to describe the creation of container signature files adequately enough to see a little more community development.

In combination with my first blog, we may see the appearance of a useful (temporary?) federated approach using Siegfried, and hopefully DROID, that doesn’t necessarily require PRONOM: mechanisms that enable us to make use of signatures and community signature file extensions sooner, taking some of the work of validating community submissions away from The National Archives, UK. If not, the increased number of submissions to The National Archives, and increased numbers of developers, will undoubtedly strengthen the data at the heart of PRONOM.

If my descriptions above are not clear enough or useful enough, then please leave your feedback in the comments and I’ll seek to clarify and add more detail, e.g. to GitHub and my tools, as much as I can. And please, anyone with greater knowledge of some of these mechanisms, leave your comments too and we can add to the information base above, correcting anything as appropriate. Let’s push container signature development further.
