While thinking about the Dev8D challenge (which I cannot compete in 🙁), I got to thinking about the way we do file characterisation.
I am not old enough to know the history of this field, but it seems that the grand old tool is file(1) from Unix. When file was developed, files were expected to contain a few magic bytes in the header to help identification tools. We still see this pattern:
- PDF files start with %PDF-1.
- Shell scripts start with #!/bin/sh
- XML files start with <?xml
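To make that concrete, here is a minimal sketch of that style of sniffing. The prefix table covers only the three examples above (it is not file(1)'s real magic database), and the file name is made up:

```python
# Minimal magic-byte sniffing in the spirit of file(1).
# The prefix table covers only the three examples above.
MAGIC_PREFIXES = [
    (b"%PDF-1.", "PDF document"),
    (b"#!", "script with a shebang line"),
    (b"<?xml", "XML document"),
]

def sniff(path):
    with open(path, "rb") as f:
        header = f.read(16)  # the first few bytes are enough for these signatures
    for prefix, label in MAGIC_PREFIXES:
        if header.startswith(prefix):
            return label
    return "unknown"

print(sniff("report.pdf"))  # hypothetical file name
```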
So, for quick identification, this is sufficient. Combined with the filename suffix, you can get a good idea about the kind of file you're dealing with. But this design came from a time when binary files were the norm. This is no longer the case. Increasingly, we use XML files and semi-binary formats like PDF. I use XML files as config files for everything, and as data files for most things. Having my characteriser tell me that a file is XML is anti-informative; I am being given no information about what kind of file it is. It is comparable to being told that your data file is binary.
Alternatively, we use zip files. It is easy to identify a zip file, but these are, from memory, the kinds of files that will be identified as zip files:
- docx
- zip
- jar
- war
Now, for characterisation purposes, we do not just want to be told that the file is a zip; we want to be told more.
If we move towards more advanced signatures, we can get more information, mainly because zip files store an uncompressed file list. More heavily compressed formats, or gzip/tar archives, will prove a lot harder to characterise.
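As a rough illustration of what that could look like for zip containers, the sketch below opens the archive and looks for tell-tale entry names. The markers used ([Content_Types].xml for OOXML, META-INF/MANIFEST.MF for jar/war) are my own assumptions, not an exhaustive rule set:

```python
import zipfile

def refine_zip(path):
    """Guess a more specific type for a zip container from its (uncompressed) entry list."""
    with zipfile.ZipFile(path) as z:
        names = set(z.namelist())
    if "[Content_Types].xml" in names:   # OOXML marker (docx/xlsx/pptx)
        return "OOXML document"
    if "META-INF/MANIFEST.MF" in names:  # jar/war style Java archive
        return "Java archive (jar/war)"
    return "plain zip"

print(refine_zip("mystery.zip"))  # hypothetical file name
```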
So, in this day and age, binary file signatures are not sufficient. Instead, we need to understand the serialization formats better, for that is what files are: serialization formats. The content of the file is serialized into the bytes of the file. For many advanced uses, there are multiple ways to serialize the very same content. But if we understand the serialization, we can read the file contents and perform our characterisation on those contents.
To prove this concept, I wrote a little tool, currently called DocumentIdentifier. It will be released soon, but really, it's so simple. Given a file, it loads it as XML, if able, and outputs the namespace of the root element. It then runs through the file and outputs the namespaces used in it.
So, running it on a METS XML file, I would get something like this:
Root element namespace: "http://www.loc.gov/METS/"
Other namespaces used: "http://www.loc.gov/mix/" "http://www.loc.gov/ndnp" "http://www.loc.gov/standards/premis" "http://www.loc.gov/mods/v3" "http://www.w3.org/2001/XMLSchema-instance" "http://www.w3.org/1999/xlink" "http://www.w3.org/2000/09/xmldsig#"
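DocumentIdentifier itself is not released yet, so the following is only a rough sketch of the same idea (not the actual tool), using Python's ElementTree namespace events; the file name is made up:

```python
import xml.etree.ElementTree as ET

def xml_namespaces(path):
    """Return the root element's namespace and every namespace declared in the file."""
    root_ns = None
    root_seen = False
    declared = set()
    for event, payload in ET.iterparse(path, events=("start", "start-ns")):
        if event == "start-ns":
            prefix, uri = payload
            declared.add(uri)
        elif event == "start" and not root_seen:
            root_seen = True
            # ElementTree reports namespaced tags as "{uri}localname"
            if payload.tag.startswith("{"):
                root_ns = payload.tag[1:].split("}", 1)[0]
    return root_ns, declared

root_ns, declared = xml_namespaces("mets.xml")  # hypothetical METS file
print('Root element namespace: "%s"' % root_ns)
print("Other namespaces used:", " ".join('"%s"' % ns for ns in sorted(declared - {root_ns})))
```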
The next step for me will be, thanks to David Tarrant, to have it validate XML blobs, if they are from a namespace the tool recognises.
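A rough sketch of how that validation step could look, assuming lxml is available and a hand-maintained mapping from recognised namespaces to local schema copies (the schema paths are hypothetical, and this validates the whole document rather than individual embedded blobs):

```python
from lxml import etree

# Hypothetical mapping from recognised namespaces to local copies of their schemas.
SCHEMAS = {
    "http://www.loc.gov/METS/": "schemas/mets.xsd",
    "http://www.loc.gov/mods/v3": "schemas/mods.xsd",
}

def validate_if_recognised(xml_path):
    doc = etree.parse(xml_path)
    root_ns = etree.QName(doc.getroot()).namespace
    schema_path = SCHEMAS.get(root_ns)
    if schema_path is None:
        return None  # namespace not recognised, nothing to validate against
    schema = etree.XMLSchema(etree.parse(schema_path))
    return schema.validate(doc)  # True if the document is valid against the schema

print(validate_if_recognised("mets.xml"))  # hypothetical METS file
```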
Maurice van den Dobbelsteen
February 18, 2011 @ 1:32 pm CET
I’m happy with this discussion and with the outlook for tools dealing with this problem. Increasingly, we have use cases for better identification of container formats.
An example would be a scanned document showing up as PDF 1.4, whereas we want to know that it uses JPEG2000 (or TIFF) for the scanned images and includes OCR info, and/or that it is a particular kind of PDF/A or PDF/E. Think also of email, Word docs with a movie and an image embedded, AV formats like MPEG-4, webpages, etc.