can we talk about fmt/42, fmt/43 and fmt/44?

In a relatively recent signature update, the fmt/44 signature was changed to allow some data after the stated EOF marker (ff d9).

In the case that started this off, a number of fmt/44 jpg files were found that had a couple of bytes after what DROID looks for as an absolute EOF.

I had a look into the specs for jpg, trying to unravel this story: were the extra bytes useful to someone? Were we missing something by ignoring these bytes?

It transpired that these bytes were added by a production workflow and didn’t really add to the informational aspect of the jpg (though one could argue they add some informational aspect to the digital object as an abstract entity…). It also transpired that the EOF marker used by the jpg signature is not described in the same way by the jpg standards. The standards describe an End Of Image (EOI) marker of [ff d9] and do not seem to make reference to any data held in the file after it. The jpg standard doesn’t care: the ‘jpg’ stops at the EOI marker.

If we take a close look at the fmt/42, 43 and 44 signatures, we can see that an absolute EOF marker is expected (apart from the slight offset in the case of fmt/44). That EOF marker is the EOI marker, which can occur at an arbitrary point in the file. Of course it usually does occur at the EOF; that’s very typical and expected. But for jpg files with an offset EOI marker, DROID fails to match the version correctly (or at all) and offers all the PUIDs with jpg extensions as possible matches.
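To make the difference concrete, here is a minimal Python sketch of the two matching strategies. It is not DROID's matching code: the function names and the `max_offset` parameter are assumptions made up for illustration, and the real offset allowed by the fmt/44 signature is not reproduced here.

```python
EOI = b"\xff\xd9"  # End Of Image marker


def eoi_within_eof_window(data: bytes, max_offset: int = 0) -> bool:
    """EOF-anchored check: True if an EOI marker ends within
    max_offset bytes of the end of the data (0 = exactly at EOF)."""
    for gap in range(max_offset + 1):
        end = len(data) - gap
        if end >= 2 and data[end - 2:end] == EOI:
            return True
    return False


def eoi_anywhere(data: bytes) -> bool:
    """Variable-position check: True if an EOI marker occurs anywhere."""
    return EOI in data
```

A file with two stray bytes after the EOI fails the strict check with `max_offset=0`, passes it with `max_offset=2`, and always passes the variable-position check. The second function is roughly the behaviour the proposed signature change would need.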

Over the last few months, I have seen perhaps 30 examples of jpg files (fmt/43 and fmt/44) that have a bunch of bytes after the EOI and therefore fail DROID signature matching. These files can be demonstrated to be valid fmt/43 or fmt/44 files by (1) rendering them in all the jpg viewers (none of which seem to care that there is data after the EOI marker) and (2) stripping the post-EOI bytes from the file and re-running it in DROID.
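The stripping step in (2) could be sketched along these lines in Python. This is only a rough illustration, not the tool I used: the function name is hypothetical, and it assumes the last [ff d9] in the file is the real EOI marker, which will be wrong if the trailing bytes themselves happen to contain ff d9.

```python
SOI = b"\xff\xd8"  # Start Of Image marker
EOI = b"\xff\xd9"  # End Of Image marker


def split_after_eoi(data: bytes) -> tuple[bytes, bytes]:
    """Split data into (image, trailing), cutting just after the
    last EOI marker. Heuristic: assumes the last ff d9 is the EOI."""
    if not data.startswith(SOI):
        raise ValueError("no SOI marker at offset 0")
    pos = data.rfind(EOI)
    if pos == -1:
        raise ValueError("no EOI marker found")
    end = pos + len(EOI)
    return data[:end], data[end:]
```

Writing the first part back out as a new file gives a copy that can be re-run through DROID, while the second part keeps the extra bytes available for inspection rather than throwing them away.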

I would like to propose that the fmt/42, 43 and 44 signatures be changed to support the variable placement of the EOI marker, as per the jpeg specs (and our experience of the files we are seeing).

This proposal has a few issues….

(1) What do we do with this extra data? Should we be scraping it somehow?

(2) Could one argue that a jpg with data post EOI is a different format, as there is clearly an informational aspect encapsulated in these extra bytes (although, of course, if it’s not structured in a standardised way it’s of limited value to the community)?

(3) Is there a need to make this change – it would impact one of the most common format types we have…

31 Comments

  1. andy jackson
    February 7, 2012 @ 1:35 pm CET

    Thanks for this, I’ll pass it on to the SCAPE folks doing tool evaluation and see if they have time to try it.

  2. andy jackson
    February 7, 2012 @ 1:34 pm CET

    Not sure what you mean about the OS tools. None of the ones I’m dealing with perform ‘internal conversions’. That said, it is true that the definition of format in the OS tools is often more loose than for PRONOM (although even PRONOM’s definition is still somewhat slippery). My currently preferred approach is to re-use OS identification algorithms/source code but to extend or replace the supplied signature file with one that matches up the PRONOM IDs etc. These signatures may end up in core Tika, but if they don’t, we can still have a version of the tool with a new signature file that takes our stricter and more fine-grained format definitions into account.

    Finally, of course, EOF markers may be useful for validation (if we capture all live variants), but I don’t want to use them for identification because that is just a prelude to deeper validation. This means I would rather identification produced false positives that my production workflow can sift through than false negatives I have to override manually.

  3. Jay Gattuso
    January 30, 2012 @ 2:10 am CET

    I took out all the EOF patterns. I’ve not tested it, other than to validate the XML, run it up in D6 and fire a few known files at it. Seemed to work OK.

    http://dl.dropbox.com/u/59534857/DROID_SignatureFile_V55%20-%20no%20EOF.xml

  4. Jay Gattuso
    January 30, 2012 @ 1:18 am CET

    I can echo these comments… I’ve been bitten by this more than once. My current method is to completely close and restart DROID every time I reuse a filename – ‘just to make sure’. I’ll look at versioning via the XML internally and see if that fixes things for me.

    Perhaps a D7 requirement is a ‘flush’ function that forces a re-parse of the signature source XML… but that may just be me being lazy and not using proper versioning….

  5. Jay Gattuso
    January 30, 2012 @ 1:12 am CET

    Ok, further digging. What you are missing is that with the vanilla v55 sig file, these files get the dual ID. These files come from a big pile of files that I pulled from our system based on their PUID, ignoring that we actually do some filtering POST DROID ID, so this issue would have been filtered out at that stage, and these objects transparently assigned the fmt/44 PUID.

    The dual ID is expected for these files, and the x-fmt/80 ID occurs as a false positive hit for the x-fmt/80 pattern. I have 20 files that have this match (0x 11 01 @ offset 522). All are from the same producer, so it looks like the MD written by PS7 in this case is triggering this FP.

    Better summary of my tests: adding the [hasPriorityOver] element beneficially refines the fmt/44 signature regardless of any EOF changes.

    The removal of the EOF has no impact on the ID of my test jpg files (with previously asserted PUIDs) but does allow the accurate ID of jpgs with data after the EOI marker.
