Identifying file formats – taking a closer look at Pronom and Droid

Pronom and Droid, developed primarily at the National Archives (TNA) of the United Kingdom, have been a key contribution to the digital preservation community. Pronom is a registry of information about file formats. The TNA provides access to the Pronom registry on-line at http://www.nationalarchives.gov.uk/PRONOM and maintains the information. Droid is a software application that uses some of the file format information to identify the type of specific digital objects. Droid is available on-line through SourceForge at http://droid.sourceforge.net/ and is managed as an open source project.

In October, I spent some time looking closely at both Pronom and Droid to get a better understanding of them and evaluate ways that they could be improved. In this series of blog posts, I’ll be reporting on the results of this investigation – which brought several surprises along with it.

The TNA have built a lovely web interface to interact with Pronom, whose contents are stored in a rather complex database. This is a great way to look at the information they have on a specific file format. You can search for the format you want and read through all of the information. As of 22-Oct-2010, Pronom has information about 731 file formats.

The most important benefit that Pronom provides is to give each of these formats a persistent unique identifier. This is a truly important contribution. There is no other registry in the world that provides persistent unique identifiers to digital object formats without regard to origin at an appropriate level of detail for digital preservation. We can contrast Pronom and Droid with other initiatives. IANA manages the mimetype names. Mimetype is heavily used by browsers and email applications to decide how to display files that are downloaded, for example. The mimetype is, however, very coarse-grained. For example, there is only one mimetype for all versions of PDF, even though the different PDF versions have substantially different features as PDF has developed over the decades. Mimetype also covers a relatively small number of format types, and provides no method to recognise the formats. The Unix ‘file’ command provides a fast and robust method to identify many types of digital objects, but it does not provide a persistent unique identifier for each type that it recognises. The person who implements the recognition routine for each type is free to print out whatever seems useful and there is no guarantee that the output format will be persistent.

For me and some other users of the information, however, Pronom has two major shortcomings. First, I don’t have much need for a handsome web interface to access the information about a single format at a time! I need to use the information in an automated way – and for as many formats as possible. Second, the coverage is much more limited than it looks at first. Most of the registered formats have only outline descriptions. This means that they provide a name and identifier, and some useful textual information about the format, but no method for recognising an instance of a format.

The Droid 5.0 application provides a nice user interface that enables a user to point to a set of files or a directory, identify the likely file types, and explore them. The identification results are placed into a database and the user can run reports, filter, sort, and so on. One can also export the results as a CSV file that can be imported into a spreadsheet application for further analysis. In addition, there is a command-line version of the tools to support automated processes.

In order to identify file formats, Droid uses a Signature File. This is an XML file that contains a substantial subset of the information in Pronom. It contains an element for each of the file formats known to Pronom.

I am particularly interested in how Droid recognises specific file formats. At the British Library, we need this to be fairly efficient, accurate, and comprehensive.

The Droid Signature File was the starting point for my exploration. Both it and the underlying Pronom XML (more on this later) are very clearly documented in http://www.nationalarchives.gov.uk/aboutapps/fileformat/pdf/automatic_format_identification.pdf. This important paper lays out the signature language, the Droid algorithms, and more with considerable precision.

The Signature file includes several key pieces of information in addition to the format name and identifier. First, it includes the typical file extensions for the format. For example, PDF files typically end with a ‘pdf’ extension. Pronom calls these ‘external signatures’. Second, it includes patterns that can be used to recognise a file format. For example, a PDF file starts with %PDF and ends with %%EOF. Pronom calls these ‘internal signatures’. Third, it includes some relationships between formats. For example, PDF is a supertype of PDF 1.1, 1.2, and so on. This means that any object that is an instance of the PDF 1.1 format is also an instance of PDF.
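To make these three kinds of information concrete, here is a minimal sketch in Python. It is not how Droid or Pronom store their data: the `FormatRecord` class, its field names, and the tiny `FORMATS` table are all hypothetical, and the internal signature is reduced to a fixed prefix and suffix, whereas real signatures allow offsets and more elaborate patterns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatRecord:
    """Hypothetical, simplified stand-in for one Signature file entry."""
    name: str
    extensions: tuple = ()    # 'external signatures'
    bof_pattern: bytes = b""  # simplified 'internal signature' at start of file
    eof_pattern: bytes = b""  # simplified 'internal signature' at end of file
    supertypes: tuple = ()    # names of more general formats


# Illustrative table only; the real registry uses persistent identifiers.
FORMATS = {
    "PDF": FormatRecord("PDF", extensions=("pdf",)),
    "PDF 1.0": FormatRecord("PDF 1.0", ("pdf",), b"%PDF-1.0", b"%%EOF", ("PDF",)),
    "PDF 1.1": FormatRecord("PDF 1.1", ("pdf",), b"%PDF-1.1", b"%%EOF", ("PDF",)),
}


def is_instance_of(name, ancestor, formats=FORMATS):
    """True if `name` is `ancestor` or has it among its (transitive) supertypes."""
    if name == ancestor:
        return True
    return any(is_instance_of(s, ancestor, formats) for s in formats[name].supertypes)


def matches_external(record, filename):
    """Check the file extension against the record's external signatures."""
    return filename.lower().rsplit(".", 1)[-1] in record.extensions


def matches_internal(record, data):
    """Check the simplified start and end patterns, ignoring trailing newlines."""
    if not record.bof_pattern:
        return False  # an outline record with no internal signature
    trimmed = data.rstrip(b"\r\n")
    return data.startswith(record.bof_pattern) and trimmed.endswith(record.eof_pattern)
```

With this table, `is_instance_of("PDF 1.1", "PDF")` holds while the reverse does not, which mirrors the supertype relationship described above.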

When I looked more closely at the Signature file, I had two surprises. First, the Signature file included patterns to recognise only 208 of the 731 formats – fewer than a third. This means that the effective coverage of Droid is much smaller than I had first expected. Second, I couldn’t make any sense of the patterns! I was expecting to see something like

 %PDF-1.0

Instead, I encountered:

        <InternalSignature ID="123" Specificity="Specific">
            <ByteSequence Reference="BOFoffset">
                <SubSequence MinFragLength="0" Position="1"
                    SubSeqMaxOffset="0" SubSeqMinOffset="0">
                    <Sequence>255044462D312E30</Sequence>
                    <DefaultShift>9</DefaultShift>
                    <Shift Byte="30">1</Shift>
                    <Shift Byte="2E">2</Shift>
                    <Shift Byte="31">3</Shift>
                    <Shift Byte="2D">4</Shift>
                    <Shift Byte="46">5</Shift>
                    <Shift Byte="44">6</Shift>
                    <Shift Byte="50">7</Shift>
                    <Shift Byte="25">8</Shift>
                </SubSequence>
            </ByteSequence>
            <ByteSequence>
                <SubSequence MinFragLength="0" Position="1"
                         SubSeqMinOffset="0">
                    <Sequence>2525454F46</Sequence>
                    <DefaultShift>6</DefaultShift>
                    <Shift Byte="46">1</Shift>
                    <Shift Byte="4F">2</Shift>
                    <Shift Byte="45">3</Shift>
                    <Shift Byte="25">4</Shift>
                </SubSequence>
            </ByteSequence>
        </InternalSignature>

This was substantially more complicated than I had anticipated and it sent me back to the definition of the Droid Signature language! It turns out that the internal signatures in this XML document are not the patterns as held in Pronom. Instead, they are the result of compiling those patterns into a form that can be used for efficient pattern matching. You have to go back to Pronom to find the original pattern.
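The Shift values in the listing above look like a precomputed skip table of the kind used by Boyer-Moore style searches. The sketch below is one plausible reading of the numbers, not a description of Droid's actual compiler: it assumes that each byte's shift is its distance from the end of the sequence (taking the rightmost occurrence) and that the default shift is the sequence length plus one. The function name `shift_table` is made up for illustration.

```python
def shift_table(hex_sequence):
    """Return (default_shift, {byte_hex: shift}) for a hex-encoded byte sequence.

    Assumed rule, inferred from the listing rather than from Droid's source:
    a byte at 0-based index i in a pattern of length n gets shift n - i, the
    rightmost occurrence wins, and bytes not in the pattern get n + 1.
    """
    pattern = bytes.fromhex(hex_sequence)
    n = len(pattern)
    shifts = {}
    for i, byte in enumerate(pattern):
        # Later (more rightward) occurrences overwrite earlier ones.
        shifts[f"{byte:02X}"] = n - i
    return n + 1, shifts


if __name__ == "__main__":
    for seq in ("255044462D312E30", "2525454F46"):
        default, table = shift_table(seq)
        print(seq, "DefaultShift", default, table)
```

Under these assumptions the function reproduces the DefaultShift and Shift values shown for both sequences in the listing, which suggests that the extra elements carry no information beyond the pattern itself.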

In this very simple case, the pattern is:

    255044462D312E30

Again, this is not quite what I was expecting. I needed to go back to the documentation again to learn that this is a sequence of bytes coded as pairs of hex digits:

    25 50 44 46 2D 31 2E 30

We can use a table of character encodings to recognise this as:

    %  P  D  F  -  1  .  0

This is great, but the InternalSignature specification seems like a very complicated way of saying “look for ‘%PDF-1.0’ at the start of the file”.
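The decoding steps above take only a couple of lines of Python. This sketch uses just the standard library; the helper names `decode_signature` and `starts_with_signature` are made up for illustration, and the check covers only the simplest case of a fixed byte sequence at offset zero.

```python
def decode_signature(hex_sequence):
    """Turn a string of hex digit pairs into the bytes it stands for."""
    return bytes.fromhex(hex_sequence)


def starts_with_signature(data, hex_sequence):
    """True if `data` begins with the byte sequence encoded in `hex_sequence`."""
    return data.startswith(decode_signature(hex_sequence))


if __name__ == "__main__":
    print(decode_signature("255044462D312E30").decode("ascii"))  # %PDF-1.0
    print(decode_signature("2525454F46").decode("ascii"))        # %%EOF
```

To test a real file, one could read its first few bytes in binary mode and pass them to `starts_with_signature`.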

After reviewing the Droid signature file, I was convinced that I needed to go back to the source in Pronom. The problem was how to extract all of the signatures in a form that I could work with. While it is not obvious how to accomplish this, the engineers who developed Pronom have made it possible. In the next post, I’ll show how to get every bit of information out of Pronom in XML format and we’ll take a closer look at some of the signatures.
