Inspired by Jenny Mitcham’s blog post about developing her first file format signature, I thought it would be fun to take a crack at creating one myself. I had previously dipped my toe into the world of contributing to PRONOM by looking at a few mis-identifications and multi-identifications, but I had yet to create a file format signature.
Recently, my colleague at Archives New Zealand, Ross Spencer, conducted some performance analysis using the govdocs corpus, an open collection of about 1 million government documents that represent a broad range of formats and time periods. In the process, Ross found that only 70% of the govdocs files can be identified using the latest PRONOM signature release! This is a great opportunity for people like me who want to start researching and creating file format signatures.
How I started
30% of a million is a lot of files to start identifying! I narrowed my scope by just looking at PDF files. I previously spent hours of enjoyment troubleshooting dodgy PDFs, so it seemed like a good place to start. I separated the PDFs from the main corpus, and ran our Rouges tool to list the unidentified PDFs. Then I isolated the directories that included unidentified PDFs to begin my analysis.
Is this Broken or What?
It’s easy when you’re looking at a widely distributed format like PDF to assume that when DROID can’t identify a file, then something must be wrong with the file. That was my first assumption when looking at a group of PDFs where the PDF header wasn’t at the absolute beginning of the file.
My normal method of examining PDFs is to open up the file in a hex editor (I use HxD) and compare the file to the documentation Adobe makes available. For these purposes, I also compared the PDF to the signatures in PRONOM. A PDF always begins with a header that identifies the file as whatever version of PDF the file conforms to. When opened up in a Hex Editor, a PDF starts like this:
The first part of the signature for all PDFs in PRONOM always starts with the %PDF header bytes at the absolute beginning of the file. This text string translates to the hex values 0x25504446. This is followed by a hyphen and the version number, which in this case is 1.2, or 0x2D312E32. If a file begins with any other bytes besides 0x25504446, it’s not a valid PDF, even if it opens up in a PDF reader.
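As a rough illustration of the check described above, here is a minimal Python sketch that tests whether a byte string begins with the %PDF header and pulls out the version number. The function name pdf_version_at_start is my own invention for this example, and the check mirrors the strict "header at offset 0" rule described here, not DROID's actual matching code.

```python
import re

# "%PDF-" followed by a major.minor version, e.g. %PDF-1.2
# (hex 25 50 44 46 2D 31 2E 32).
PDF_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def pdf_version_at_start(data: bytes):
    """Return the version string if data begins with a PDF header, else None."""
    match = PDF_HEADER.match(data)  # match() only matches at offset 0
    return match.group(1).decode("ascii") if match else None


sample = bytes.fromhex("255044462D312E32") + b"\n"
print(pdf_version_at_start(sample))  # 1.2
```

A file with anything in front of the header, even a single byte, makes this function return None, which is the behaviour that puzzled me with the files discussed next.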
I scratched my head over a particular folder of PDFs, where the PDF header wasn’t at the absolute beginning of any of the files. In all cases, the header followed hundreds of bytes of data. I initially thought that these had been saved in some strange way because of a bug in the program that created them. But further digging in the bytes showed they were created by different programs at different times and by different agencies. So what gives? Was this a strange anomaly only affecting these ten files? I mentioned to Ross that it looked like the PDFs were just malformed, and flicked them over to him. He responded quickly with a link to an RFC.
Confounded by this wizardry, I asked him how he had found a document that put me on the right track to figuring out what the deal was with these PDFs.
Pattern Matching and Google
When Ross looked at the PDFs, he noticed they all started with the same bytes:
What, to me, looked like a lot of dots was actually some data. Ross did a web search for the first bytes, scaling back byte by byte until he was searching for just the first four bytes, 00 05 16 00, and was rewarded with RFC 1740.
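Spotting the shared leading bytes can also be done programmatically. The sketch below computes the longest common prefix of several byte strings; the two sample values are made up for illustration (they are not the real govdocs files), and common_prefix is a hypothetical helper name.

```python
def common_prefix(blobs):
    """Return the longest run of leading bytes shared by every item in blobs."""
    if not blobs:
        return b""
    shortest = min(blobs, key=len)
    for i in range(len(shortest)):
        # Stop at the first position where any blob disagrees.
        if any(blob[i] != shortest[i] for blob in blobs):
            return shortest[:i]
    return shortest


# Made-up stand-ins for the first bytes of two files.
first = bytes.fromhex("0005160000020000") + b"AAAA"
second = bytes.fromhex("0005160000020000") + b"BBBB"
print(common_prefix([first, second]).hex())  # 0005160000020000
```

In practice you would read the first few hundred bytes of each unidentified file and pass those in, then search for the shared prefix the way Ross did.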
So, what IS it?
The RFC describes two formats (AppleSingle and AppleDouble) that Apple created to preserve a file’s attributes when it is moved to a file system that does not support the same attributes as the file’s home system. For instance, if you created a PDF on an Apple sometime in the late 90s, and then wanted to use that same PDF on an IBM, the original file data from the PDF would be ‘wrapped’ in one of these formats. Reading further into the RFC, it became clear I was looking at AppleSingle files.
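To make the 'wrapping' concrete, here is a minimal Python sketch of reading an AppleSingle header and pulling out the original file data. It assumes the layout as I read it in RFC 1740: a 4-byte magic number, a 4-byte version, 16 bytes of filler, a 2-byte entry count, then 12-byte entry descriptors (entry ID, offset, length), all big-endian, with entry ID 1 being the data fork. The function names are hypothetical, and the sample is synthetic, not one of the govdocs files.

```python
import struct

APPLESINGLE_MAGIC = 0x00051600
DATA_FORK_ENTRY_ID = 1                # assumed from RFC 1740
HEADER = struct.Struct(">II16xH")     # magic, version, 16 filler bytes, entry count
ENTRY = struct.Struct(">III")         # entry ID, offset, length


def read_entries(data: bytes):
    """Return (version, {entry_id: (offset, length)}) for an AppleSingle blob."""
    if len(data) < HEADER.size:
        raise ValueError("too short to be an AppleSingle file")
    magic, version, count = HEADER.unpack_from(data, 0)
    if magic != APPLESINGLE_MAGIC:
        raise ValueError("AppleSingle magic number not found")
    entries = {}
    for i in range(count):
        entry_id, offset, length = ENTRY.unpack_from(data, HEADER.size + i * ENTRY.size)
        entries[entry_id] = (offset, length)
    return version, entries


def extract_data_fork(data: bytes) -> bytes:
    """Return the wrapped file's original bytes (the data fork)."""
    _, entries = read_entries(data)
    offset, length = entries[DATA_FORK_ENTRY_ID]
    return data[offset:offset + length]


# Synthetic example: one entry, a tiny "PDF" as the data fork.
payload = b"%PDF-1.2\n"
sample = (
    HEADER.pack(APPLESINGLE_MAGIC, 0x00020000, 1)
    + ENTRY.pack(DATA_FORK_ENTRY_ID, HEADER.size + ENTRY.size, len(payload))
    + payload
)
print(extract_data_fork(sample))  # b'%PDF-1.2\n'
```

This is why the PDF header in my files sat hundreds of bytes in: the AppleSingle header and any other entries come first, and the PDF itself is just one entry further along.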
Doing some further searching, I found that AppleSingle is not yet represented in PRONOM. I was excited that I would be able to create my first file format signature!
Creating the Signature
I started off by reading a couple of blog posts: David Clipsham’s A week of file format research and Ross’s Five Star File Format Signature Development, as well as The National Archives’ How to Research and develop signatures for file format identification. These are great guides to developing signatures, from what a file format signature actually is, to doing research, and sharing with the community. I felt confident enough to do more in-depth research on AppleSingle with the goal of creating a signature in mind.
Just Solve the File Format Problem is a solid jumping-off point for researching formats. It’s also good to find technical documentation aimed at developers. Because the audience of such documentation is people who are making applications that create, read or otherwise interact with the format, technical documentation should tell you exactly how the format is structured and how to identify it. The documentation for AppleSingle said the file header, or the beginning of the file, starts with a 4-byte magic number (0x00051600) followed by a 4-byte version number (0x00020000 in the case of the version 2 files I had). Would 8 bytes be enough to correctly identify AppleSingle and avoid collisions with other formats? Well!
It’s been a long time since I took math, but 1.8 x 10^19 seems like a rather big number!
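For anyone who wants to check the arithmetic: each byte can take 256 values, so an 8-byte sequence has 256^8 (that is, 2^64) possible values. A quick Python sketch, using the magic number and version 2 bytes quoted above:

```python
# The 8 signature bytes: magic number followed by the version 2 number.
signature = bytes.fromhex("0005160000020000")

# Each byte has 256 possible values, so count every 8-byte combination.
combinations = 256 ** len(signature)

print(combinations)           # 18446744073709551616
print(f"{combinations:.1e}")  # 1.8e+19
```

This only counts the possible byte combinations. It says nothing about how real formats are distributed across them, which is why testing against a corpus still matters.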
I knew what bytes I wanted to use, so now it was time to actually write the signature and test it. TNA has a great Signature Development Utility tool that wraps the signature into PRONOM’s preferred XML format. It’s really easy to use and available over the web, so there’s nothing to download except your test signatures.
Testing
Once I saved the signature output from the utility, I installed my signature file in DROID for testing (go to Tools > Install Signature File, and DROID will let you browse your file system to select your signature file).
Then, I went into preferences (Tools > Preferences) to ensure my signature file was selected from the Binary Signature File drop down menu. I had to quit DROID and open it again to make the changes take effect.
I was ready to test my signature against my directory of AppleSingle files, the entire govdocs corpus and finally, Ross’ latest Skeleton Suite, a corpus of automatically generated digital objects based on the signatures in the PRONOM database, for collision testing. After I finished my own testing, I was ready to share my findings with the community and ask for further testing and guidance.
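The idea behind a skeleton file, a small generated object that contains a signature's bytes and little else, can be sketched in a few lines of Python. This is only an illustration of the concept under my own assumptions, not how the Skeleton Suite is actually built; write_skeleton, the zero-byte padding and the file name are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_skeleton(hex_signature: str, path, padding: int = 16) -> bytes:
    """Write a file made of the signature bytes plus some zero padding."""
    data = bytes.fromhex(hex_signature) + b"\x00" * padding
    Path(path).write_bytes(data)
    return data


with tempfile.TemporaryDirectory() as tmp:
    skeleton = Path(tmp) / "applesingle-v2.skeleton"
    write_skeleton("0005160000020000", skeleton)
    print(skeleton.read_bytes()[:8].hex())  # 0005160000020000
```

Running an identification tool over files like this shows whether a new signature also matches objects generated from other formats' signatures, or the other way around.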
Sharing!
In order to get more of the file format community involved in signature creation, it’s important to create open processes. PRONOM and DROID enable open communication with their Google group and GitHub. I posted my proposed signature and methodology along with some sample files on the Google group, asking for:
- Community testing of the signature
- More sample files of AppleSingle (both version 1 and version 2) to test against
- Guidance from the PRONOM team on how we should create entries in PRONOM, and if this should be a new kind of container format or not
The last ask was important to me, because I am new to signature development and am in need of guidance! The second part, about whether AppleSingle should be a container, seemed pertinent as well. In the samples I’ve seen, the AppleSingle format is a header that’s added to the original file data. The file keeps its original filename and functionality, so it appears to act like a container to me, and it seems important for identification and preservation purposes to know what kind of file the AppleSingle file actually contains.
Feedback!
The file format community is a really friendly one! Richard Lehane tested the signature in his identification tool, Siegfried. Andy Jackson was kind enough to search the UK Web Archive for the first four bytes of the signature. David Clipsham wrote a thoughtful reply on the DROID list that included guidance on submitting new signatures:
I added more information about AppleSingle to make the PRONOM entry more robust, and with that, you can look forward to identifying AppleSingle files in September’s signature release.
What’s the takeaway? There are tools and a community that make developing file format signatures accessible to anyone with the inclination. Take a look at the unidentified files in your corpora and/or collections, give it a go and contribute!
Andrea Byrne
September 15, 2016 @ 8:57 am CEST
That would be really great to be able to work with your corpus and collaborate with corpus development! I’ll DM you on twitter with better contact information so we can talk about vm access. Thanks!
tallison
September 15, 2016 @ 12:55 am CEST
That’s one option…Individual files are publicly hosted and available, and we could give you access to our vm if you wanted to do processing on the corpus and collaborate on corpus development.
I think it is probably time to re-sample from Common Crawl, with oversampling on binary files and more interesting langs/charsets (see e.g. https://issues.apache.org/jira/browse/TIKA-2038).
Andrea Byrne
September 12, 2016 @ 10:47 pm CEST
Ah, that’s so interesting! Thanks for your compliments and also for sending along your reports, because now it looks like I’ll have to download a few other corpora to take a closer look at those octet-streams.
tallison
September 12, 2016 @ 5:30 pm CEST
Congratulations on making the switch!!! Primarily for the sake of Apache Tika, I ran a comparison of DROID vs Tika vs ‘file’ in April.
http://162.242.228.174/mimes/mime_comparisons.html
If you want to see what other file formats DROID identified as octet-stream, take a look at this report: http://162.242.228.174/mimes/octet_streams_pairwise_rollups.zip
Welcome to contributor status!
Cheers, Tim