can we talk about fmt/42, fmt/43 and fmt/44?

Exploring space savings by removing whitespace in METS files

In a relatively recent signature update, the fmt/44 signature was updated to in allow some data after the stated EOF marker (ff d9).

In the case that started this off, a number of fmt/44 jpg files were found that had a couple of bytes after what DROID looks for as an absolute EOF.

I had a look into the specs for jpg, trying to unravel this story – were the extra bytes useful to someone? were we missing something by ignoring these bytes?

It transpired these bytes were added by a production workflow, and didn’t really add to the informational aspect of the jpg (but one could argue it adds some informational aspect to the digital object as an abstract entity…..) It also transpired that the EOF marker used by the jpg signature is not described in the same way by the jpg standards. The standards describe an End Of Image marker (EOI) of [ff d9] and does not seem to make reference to any data held in the file after the EOI marker, the jpg standard doesn’t care… the ‘jpg’ stops at the EOI marker.

If we take a close look at the fmt/42, 43 and 44 signatures, we can see there is an absolute (apart from the slight offset in the case of fmt/44) EOF marker expected. The EOF marker is the EOI marker, which can occur at an arbitrary point in the file. Of course it does usually occur at the EOF, that’s very typical and expected, but in the case of jpg files with an offset EOI marker DROID fails to match the version correctly (or at all) and offers all the PUIDS with jpg extensions as possible matches.

Over the last few months, I have seen perhaps 30 examples of jpg files (fmt/43 and fmt/44) that have a bunch of bytes after the EOI and therefore fail DROID signature matching. These files can be demonstrated to be valid fmt/43 or 44 files by (1) being rendered in all the jpg viewers – none of which seem to care that there is data after the EOI marker) and (2) by stripping the post EOI bytes from the file and re-running in DROID.

I would like to propose that the fmt/42, 43 and 44 signatures get changed, to support the variable placement of the EOI marker as per the jpeg specs (and experiences of file we are seeing).

This proposal has a few issues….

(1) What do we do with this extra data? Should we be scraping it somehow?

(2) Could one argue that a jpg with data post EOI is a different format, as there is clearly an informational aspect that is encapsulated in these extra bytes (although, of course if it’s not structured in a standardised way it’s of limited value to the community)

(3) Is there a need to make this change – it would impact one of the most common format types we have…

31 Comments

  1. andy jackson
    February 7, 2012 @ 8:45 pm CET

    I think I’ll have to put some more effort into the DROID 7 requirements wiki. I’ll try to spend some time doing that now. FWIW, the think your point about RegEx performance ‘falling of a cliff’ will not turn out to be the case. The Java RegEx engine appears to emply Boyer-Moore as appropriate, and my informal testing indicates that the speed difference between the two is probably negligible. We’re trying to do some more formalised large-scale testing, and the results of that will be published as soon as we can.

Leave a Reply

Join the conversation