When is a PDF not a PDF? Format identification in focus.

In this post I'll be taking a look at format identification of PDF files and highlighting a difference in opinion between format identification tools. Some of the details are a little dry but I'll restrict myself to a single issue and be as light on technical details as possible. I hope I'll show that once the technical details are clear it really boils down to policy and requirements for PDF processing.

Assumptions

I'm considering format identification in its simplest role as first contact with a file that little, if anything, is known about. In these circumstances the aim is to identify the format as quickly and accurately  as possible then pass the file to format specific tools for deeper analysis.

I'll also restrict the approach to magic number identification rather than trust the file extension, more on this a little later.

Software and data

I performed the tests using the selected govdocs corpora (that's a large download BTW) that I mentioned in my last post. I chose four format identification tools to test:

  • the fine free file utility (also known simply as file),
  • DROID,
  • FIDO, and
  • Apache Tika.

I used as up to date versions as possible but will spare the details until I publish the results in full.

So is this a PDF?

So there was plenty of disagreement between the results from the different tools, I'll be showing these in more detail at our upcoming PDF Event. For now I'll focus on a single issue, there are a set of files that FIDO and DROID don't identify as PDFs that file and Tika do. I've attached one example to this post, Google chrome won't open it but my ubuntu based document viewer does. It's a three page PDF about Rumen Microbiology and this was obviously the intention of the creator. I've not systematically tested multiple readers yet but Libre Office won't open it while ubuntu's print preview will. Feel free to try the reader of your choice and comment.

What's happening here?

It appears we have a malformed PDF and this is the case . The issue is caused by a difference in the way that the tools go about identifying PDFs in the first place. This is where it gets a little dull but bear with me. All of these tools use "magic" or "signature" based identification. This means that they look for unique (hopefully) strings of characters in specific positions in the file to work out the format. Here's the Tika 1.5 signature for PDF:

<match value="%PDF-" type="string" offset="0"/>

What this says is look for the string %PDF- (the value) at the start of the file (offset="0") and if it's there identify this as a PDF. The attached file indeed starts: %PDF-1.2 meaning it's a PDF version 1.2. Now we can have a look at the DROID signature (version 77) for the PDF 1.2 sig:

<InternalSignature ID="125" Specificity="Specific">
    <ByteSequence Reference="BOFoffset">
        <SubSequence MinFragLength="0" Position="1"
            SubSeqMaxOffset="0" SubSeqMinOffset="0">
            <Sequence>255044462D312E32</Sequence>
            <DefaultShift>9</DefaultShift>
            <Shift Byte="25">8</Shift>
            <Shift Byte="2D">4</Shift>
            <Shift Byte="2E">2</Shift>
            <Shift Byte="31">3</Shift>
            <Shift Byte="32">1</Shift>
            <Shift Byte="44">6</Shift>
            <Shift Byte="46">5</Shift>
            <Shift Byte="50">7</Shift>
        </SubSequence>
    </ByteSequence>
    <ByteSequence Reference="EOFoffset">
        <SubSequence MinFragLength="0" Position="1"
            SubSeqMaxOffset="1024" SubSeqMinOffset="0">
            <Sequence>2525454F46</Sequence>
            <DefaultShift>-6</DefaultShift>
            <Shift Byte="25">-1</Shift>
            <Shift Byte="45">-3</Shift>
            <Shift Byte="46">-5</Shift>
            <Shift Byte="4F">-4</Shift>
        </SubSequence>
    </ByteSequence>
</InternalSignature>

Which is a little more complex than Tika's signature but what it says is a matching file should start with the string %PDF-1.2, which our sample does. This is in the first <ByteSequence Reference="BOFoffset"> section, a beginning of file offset. Crucially this signature adds another condition, that the file contains the string %EOF within 1024 bytes of the end of the tile. There are two things that are different here.

The start condition change, i.e. Tika's "%PDF-" vs. DROID's "%PDF-1.2%" is to support DROID's capability to identify versions of formats. Tika simply detects that a file looks like a PDF and returns the application/pdf mime type and has a single signature for the job. DROID can distinguish between versions and so has 29 different signatures for PDF. It's also NOT the cause of the problem. The disagreement between the results is caused by DROID's requirement for a valid end of file marker %EOF. A hex search of our PDF confirms that it doesn't contain an %EOF marker.

So who's right?

An interesting question. The PDF 1.3 Reference states:

The last line of the file contains only the end-of-file marker, “`%%EOF“`. (See implementation note 15 in Appendix H.)

The referenced implementation note reads:

3.4.4, “File Trailer”

15. Acrobat viewers require only that the “`%%EOF“` marker appear somewhere within the last 1024 bytes of the file.

So DROID's signature is indeed to the letter of the law plus amendments. It's really a matter of context when using the tools. Does DROID's signature introduce an element of format validation to the identification process? In a way yes, but understanding what's happening and making an informed decision is what really matters.

What's next?

I'll be putting some more detailed results onto GitHub along with a VM demonstrator. I'll tweet and add a short post when this is finished, it may have to wait until next week.

carl
By carl, posted in carl's Blog on 21st Aug 2014 10:40 AM

Comments


Leave a comment