can we talk about fmt/42, fmt/43 and fmt/44?

In a relatively recent signature update, the fmt/44 signature was changed to allow some data after the stated EOF marker (ff d9).

In the case that started this off, a number of fmt/44 jpg files were found that had a couple of bytes after what DROID looks for as an absolute EOF.

I had a look into the specs for jpg, trying to unravel this story: were the extra bytes useful to someone? Were we missing something by ignoring these bytes?

It transpired these bytes were added by a production workflow, and didn’t really add to the informational aspect of the jpg (though one could argue they add some informational aspect to the digital object as an abstract entity…). It also transpired that the EOF marker used by the jpg signature is not described in the same way by the jpg standards. The standards describe an End Of Image (EOI) marker of [ff d9] and do not seem to make reference to any data held in the file after the EOI marker. The jpg standard doesn’t care: the ‘jpg’ stops at the EOI marker.

If we take a close look at the fmt/42, 43 and 44 signatures, we can see that an absolute EOF marker is expected (apart from the slight offset allowed in the case of fmt/44). What the signatures treat as the EOF marker is really the EOI marker, which can occur at an arbitrary point in the file. Of course it does usually occur at the EOF, that’s very typical and expected, but in the case of jpg files with an offset EOI marker DROID fails to match the version correctly (or at all) and offers all the PUIDs with jpg extensions as possible matches.
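To make the distinction concrete, here is a minimal Python sketch that reports where the last EOI marker sits relative to the end of a byte stream. It is not DROID's signature syntax or matching logic, and it does not try to tell fmt/42, 43 and 44 apart; the function name `eoi_placement`, its return labels and the `max_trailing` parameter are all made up for illustration.

```python
from typing import Optional

SOI = b"\xff\xd8"  # Start Of Image marker
EOI = b"\xff\xd9"  # End Of Image marker


def eoi_placement(data: bytes, max_trailing: Optional[int] = None) -> str:
    """Classify where the last EOI marker sits relative to the end of the data.

    Hypothetical helper: the labels returned here are illustrative only.
    """
    if not data.startswith(SOI):
        return "not-jpeg"

    # Search from the end: a jpg may contain earlier ff d9 pairs
    # (for example inside an embedded thumbnail).
    last_eoi = data.rfind(EOI)
    if last_eoi == -1:
        return "no-eoi"

    trailing = len(data) - (last_eoi + len(EOI))
    if trailing == 0:
        # What an EOF-anchored signature expects.
        return "eoi-at-eof"
    if max_trailing is not None and trailing > max_trailing:
        # Outside the tolerance a relaxed, offset-limited signature would allow.
        return "eoi-too-far-from-eof"
    return "eoi-before-eof"
```

A file reported as "eoi-before-eof" is the kind that an EOF-anchored signature would miss, even though the image data itself ends cleanly at the EOI marker.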

Over the last few months, I have seen perhaps 30 examples of jpg files (fmt/43 and fmt/44) that have a bunch of bytes after the EOI and therefore fail DROID signature matching. These files can be demonstrated to be valid fmt/43 or 44 files by (1) being rendered in all the jpg viewers (none of which seem to care that there is data after the EOI marker) and (2) stripping the post-EOI bytes from the file and re-running in DROID.
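The stripping step in (2) can be sketched roughly as follows. This is only an illustration in Python, not the tool actually used on these files, and `split_at_last_eoi` is a hypothetical name. It assumes the last ff d9 pair in the file is the real EOI marker, which would be wrong if the trailing bytes themselves happened to contain ff d9.

```python
from typing import Tuple

EOI = b"\xff\xd9"  # End Of Image marker


def split_at_last_eoi(data: bytes) -> Tuple[bytes, bytes]:
    """Split a byte stream into (image up to and including EOI, trailing bytes).

    Assumption: the last ff d9 in the data is the genuine EOI marker.
    """
    last_eoi = data.rfind(EOI)
    if last_eoi == -1:
        raise ValueError("no EOI marker (ff d9) found")

    cut = last_eoi + len(EOI)
    # Keep the trailing bytes rather than discarding them, so they can be
    # inspected or stored alongside the stripped file.
    return data[:cut], data[cut:]
```

Writing the first part to a new file gives something that can be re-run through DROID, while the second part preserves whatever the production workflow appended.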

I would like to propose that the fmt/42, 43 and 44 signatures be changed to support the variable placement of the EOI marker, as per the jpeg specs (and our experience of the files we are seeing).

This proposal raises a few issues:

(1) What do we do with this extra data? Should we be scraping it somehow?

(2) Could one argue that a jpg with data post EOI is a different format, as there is clearly an informational aspect encapsulated in these extra bytes? (Although, of course, if it’s not structured in a standardised way it’s of limited value to the community.)

(3) Is there a need to make this change? It would impact one of the most common format types we have…

31 Comments

  1. andy jackson
    February 12, 2012 @ 2:05 pm CET

    Sorry about that – I didn’t mean to imply that you work at TNA. Should have stuck to a more general ‘YMMV’.

  2. mattpalmer1086
    February 8, 2012 @ 10:18 pm CET

    Well, BMH helps precisely by *not* having to examine all the bytes – it’s a sub-linear search algorithm which skips over bytes that can’t match, without examining them at all. Of course, they still have to be read into a byte buffer, but you don’t have to actually process all of them any further!

    But I’m largely in agreement with you – I would also sacrifice performance if it meant making signatures easier to work with and more sustainable. I don’t know what TNA thinks, as I don’t work there anymore!

  3. andy jackson
    February 8, 2012 @ 8:48 pm CET

    Okay, fair enough. I don’t quite understand how B-M-H is helping here, given that you are having to scan the bytes anyway, which suggests I/O is the limiting factor, but I’m happy to be wrong about that.

    I didn’t really mean to get so distracted by the performance issues. My only point is really that I’m willing to accept a reasonably significant speed and even expressivity loss if we can permit the use of a more widely adopted signature language. Of course, you and the TNA are free to disagree with me!

  4. andy jackson
    February 8, 2012 @ 8:43 pm CET

    Sorry, I didn’t mean to imply that RegEx would be suitable for attacking the text identification problem. It is possible to make some headway like that, but regular languages cannot be used to parse HTML and other higher-order formal languages properly. I’ve noted some ideas on this in the DROID7 wiki.

  5. mattpalmer1086
    February 8, 2012 @ 2:11 pm CET

    I did a lot of signature analysis and profiling during DROID 5 and 6 development. Any DROID signature with more than one sub-sequence involves the equivalent of a .* expression for those sub-sequences. This is a very high proportion of them. Now, some of them find the next subsequence fairly quickly – but many of them actually do a lot of scanning.

    Put it this way, when profiling DROID, the majority of its identification time is spent in the Boyer-Moore-Horspool algorithm scanning along byte streams. So for me, the performance issue is already a known factor – there is a lot of scanning to find the next matching subsequence of a signature.

    I’m looking forward to any other experimental data that might appear.
