can we talk about fmt/42, fmt/43 and fmt/44?

In a relatively recent signature update, the fmt/44 signature was changed to allow some data after the stated EOF marker (ff d9).

In the case that started this off, a number of fmt/44 jpg files were found that had a couple of bytes after what DROID looks for as an absolute EOF.

I had a look into the specs for jpg, trying to unravel this story – were the extra bytes useful to someone? Were we missing something by ignoring these bytes?

It transpired these bytes were added by a production workflow, and didn’t really add to the informational aspect of the jpg (though one could argue they add some informational aspect to the digital object as an abstract entity…). It also transpired that the EOF marker used by the jpg signature is not described in the same way by the jpg standards. The standards describe an End Of Image (EOI) marker of [ff d9] and do not seem to make reference to any data held in the file after the EOI marker. The jpg standard doesn’t care: the ‘jpg’ stops at the EOI marker.

If we take a close look at the fmt/42, 43 and 44 signatures, we can see that an absolute EOF marker is expected (apart from the slight offset allowed in the case of fmt/44). That EOF marker is really the EOI marker, which can occur at an arbitrary point in the file. Of course it does usually occur at the EOF, that’s very typical and expected, but for jpg files where the EOI marker sits before the EOF, DROID fails to match the version correctly (or at all) and offers all the PUIDs with jpg extensions as possible matches.

Over the last few months, I have seen perhaps 30 examples of jpg files (fmt/43 and fmt/44) that have a bunch of bytes after the EOI and therefore fail DROID signature matching. These files can be demonstrated to be valid fmt/43 or 44 files by (1) being rendered in all the jpg viewers (none of which seem to care that there is data after the EOI marker) and (2) stripping the post-EOI bytes from the file and re-running in DROID.
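Check (2) can be sketched roughly as below. This is only an illustration, not the tool used for the files above: the class and method names are made up, and it assumes the last `ff d9` pair in the file is the real EOI marker. That assumption fails if the trailing data happens to contain `ff d9` itself. Searching backwards from the end at least skips any `ff d9` belonging to an embedded thumbnail earlier in the file.

```java
import java.util.Arrays;

// Hypothetical helper, not part of DROID: locates data after the last EOI marker.
final class JpegTrailingData {

    /** Index of the first byte after the last FF D9 pair, or -1 if there is no such pair. */
    static int endOfLastEoi(byte[] data) {
        // Scan backwards so that an FF D9 inside an embedded thumbnail is not picked first.
        for (int i = data.length - 2; i >= 0; i--) {
            if ((data[i] & 0xFF) == 0xFF && (data[i + 1] & 0xFF) == 0xD9) {
                return i + 2;
            }
        }
        return -1;
    }

    /** Copy of the data cut off just after the last EOI marker; a plain copy if none is found. */
    static byte[] stripAfterEoi(byte[] data) {
        int end = endOfLastEoi(data);
        return end < 0 ? data.clone() : Arrays.copyOf(data, end);
    }
}
```

The number of trailing bytes is then `data.length - JpegTrailingData.endOfLastEoi(data)`, and the stripped copy can be written out and re-run through DROID.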

I would like to propose that the fmt/42, 43 and 44 signatures get changed, to support the variable placement of the EOI marker as per the jpeg specs (and our experience of the files we are seeing).

This proposal raises a few issues:

(1) What do we do with this extra data? Should we be scraping it somehow?

(2) Could one argue that a jpg with data post EOI is a different format, as there is clearly an informational aspect that is encapsulated in these extra bytes? (Although, of course, if it’s not structured in a standardised way it’s of limited value to the community.)

(3) Is there a need to make this change – it would impact one of the most common format types we have…

31 Comments

  1. mattpalmer1086
    February 8, 2012 @ 1:47 pm CET

    Patching the regex implementation in java is a nice idea, but it would be a very big “patch”!

    The use of BM would have to be enabled for regular expressions that were not just string literals (so you could specify the character classes in the first place!). This in turn would require the identification of candidate regular expressions that could be searched for in this way, involving classifying sub-components of the expression as “BM-able”. This means changing core behaviour of quite a lot of the underlying engine.

    In point of fact, this is exactly what is on the roadmap for byteseek. Even given that byteseek is being designed with this in mind, it’s not quite as straightforward as the explanation above indicates. Maybe once I’ve worked out the kinks in marrying up automata-based regular expressions (deterministic and non-deterministic automata are also in byteseek, but not used by DROID!) with sub-linear sequence (and multi-sequence!) searching, I’ll turn my attention to working it back into the core Java libraries! There would be additional complications making all of that work with unicode text, rather than just byte sequences.

    Btw: if you enable case insensitivity for Java regexes, you won’t get BM either – it’s explicitly only for case sensitive searching. I guess *that* could be patched up – but that’s a lot of work and effort for very little payoff.

    There are very few binary formats it’s not possible to match right now (other than ones based in containers, which have their own signatures in any case). However, there are a lot of signatures which amount to horrible hacks, to work around the limitations of earlier DROIDs. There are quite a lot of signatures which do more work than they need to, and are much less clear than they could be, as the DROID syntax can’t currently handle quite standard features of normal regular expressions (for example, the absence of optionality). Bringing them closer to standard Java regular expressions would be one way to do this (if they are not replaced by them!).

    You are right to say that the big gaps in DROID identification are indeed around text formats, not binary signatures. There’s a whole proposal (including some detailed text heuristics I developed) on the DROID 7 wiki about this. In my (not so!) humble opinion, reg ex is simply not the way to go for text format identification – but that’s a whole different discussion I’d be happy to have elsewhere, as I think we’ve drifted significantly from the point of this thread!

  2. andy jackson
    February 8, 2012 @ 12:55 pm CET

    Well, that’s interesting. Perhaps we could knock up a patch for Open JDK and then everyone who uses Java 8 could benefit from this advanced implementation?

    As for the additional functionality, that feels a bit like putting the cart before the horse to me. The case-sensitivity seems like a stretch, as a RegEx can be declared case-sensitive or insensitive and so any DROID expression that needs mixed case sensitivity could at worst be matched using two or more RegEx.

    I think we should aim to use the minimum functionality we need, and the big gaps in DROID matching ability seem to be centred around text formats. Are there any binary formats we can’t match right now?

    I’m not entirely convinced about the performance issue either, as I suspect most signatures are literal matches. I’ll wait for the experimental data to lead the way there. Guessing performance makes me nervous.

  3. mattpalmer1086
    February 8, 2012 @ 12:29 pm CET

    Intrigued by the hints of Boyer Moore in the java regular expression classes, I did a little more digging to observe it in action. I was amazed that something so useful wasn’t more widely known. To cut a long story short, it turns out that BM is only enabled for java “regular expressions” if you compile the expression as a literal case sensitive string match, using the Pattern.LITERAL compile flag. I’m afraid this means it’s not a regular expression anymore, just a simple string search.
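    For reference, this is roughly what compiling with that flag looks like. It is a minimal sketch: the class and method names are invented for illustration, and the code only shows the flag's documented effect (metacharacters lose their special meaning). Which search algorithm the JDK uses underneath is an internal detail that may differ between versions.

    ```java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical wrapper for illustration only.
    final class LiteralSearch {

        /** Index of the first occurrence of the literal text, or -1 if it does not occur. */
        static int indexOfLiteral(String literal, CharSequence haystack) {
            // With Pattern.LITERAL the input is treated as plain characters, not as a regex.
            Pattern p = Pattern.compile(literal, Pattern.LITERAL);
            Matcher m = p.matcher(haystack);
            return m.find() ? m.start() : -1;
        }
    }
    ```

    Searching for `a.c` in `abc a.c` this way skips `abc`, because the dot no longer matches any character.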

    The implementation of Boyer Moore Horspool in the byteseek library (therefore, in DROID) is already considerably more advanced than this, in that it can handle character classes (sets of bytes) in positions of the string to be matched, and can handle case insensitivity (although no signatures currently use this I believe – but some could definitely benefit from this).

  4. mattpalmer1086
    February 7, 2012 @ 11:00 pm CET

    Hmmm… a further analysis shows it may be possible to read from as many buffers as necessary by implementing the CharSequence interface (ultimately backed by byte array buffers read from streams or files as necessary). This could work as flexibly as DROID currently does. Maybe my objections were premature. I’ll look forward to any further work done on this.
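    A very rough sketch of the idea, assuming a single in-memory byte array rather than buffers paged in from a stream or file. The class name is hypothetical and none of this is DROID or byteseek code. Each byte is exposed as the char with the same value (0 to 255), so a regex such as `\xFF\xD9` can be run over raw bytes.

    ```java
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical illustration: a read-only CharSequence view over bytes.
    final class ByteCharSequence implements CharSequence {
        private final byte[] bytes;
        private final int start;
        private final int end;

        ByteCharSequence(byte[] bytes) {
            this(bytes, 0, bytes.length);
        }

        private ByteCharSequence(byte[] bytes, int start, int end) {
            this.bytes = bytes;
            this.start = start;
            this.end = end;
        }

        @Override
        public int length() {
            return end - start;
        }

        @Override
        public char charAt(int index) {
            if (index < 0 || index >= length()) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            // Mask to 0..255 so that byte 0xFF becomes char U+00FF, not a negative value.
            return (char) (bytes[start + index] & 0xFF);
        }

        @Override
        public CharSequence subSequence(int from, int to) {
            if (from < 0 || to > length() || from > to) {
                throw new IndexOutOfBoundsException(from + ".." + to);
            }
            return new ByteCharSequence(bytes, start + from, start + to);
        }

        @Override
        public String toString() {
            // ISO-8859-1 maps each byte to the char with the same numeric value.
            return new String(bytes, start, end - start, StandardCharsets.ISO_8859_1);
        }

        /** Offset of the first regex match in the bytes, or -1 if there is none. */
        static int find(String regex, byte[] data) {
            Matcher m = Pattern.compile(regex).matcher(new ByteCharSequence(data));
            return m.find() ? m.start() : -1;
        }
    }
    ```

    A version backed by several buffers would keep the same interface and change only how `charAt` locates the byte.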

  5. mattpalmer1086
    February 7, 2012 @ 10:09 pm CET

    Fascinating. I had no idea that Java regular expressions used Boyer Moore for searching internally. I note they use Boyer Moore which (while theoretically faster than the Horspool variant used in DROID) is usually slower due to its added complexity – but I’m nitpicking here.

    The more serious objection to using native regular expressions is that they are forced to work on char[] buffers (or Strings, or other char sequences). Setting aside the conversion of byte[] to char[] in order to process byte-oriented streams as char arrays, the bigger issue is that these regexes cannot process expressions which would span more than one array. In practice, this means you have to pick a candidate buffer size (e.g. 64Kb), and then you can only identify signatures which fit into this buffer.
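    To make the buffer problem concrete, here is a small sketch with hypothetical names, using ISO-8859-1 purely as a byte-to-char mapping. It searches each fixed-size chunk independently, so a sequence that straddles a chunk boundary is never seen whole.

    ```java
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Pattern;

    // Hypothetical illustration of per-buffer matching, not how DROID searches.
    final class ChunkedRegexSearch {

        /** True if the pattern matches entirely inside at least one chunk of the data. */
        static boolean foundInAnyChunk(byte[] data, int chunkSize, Pattern pattern) {
            for (int off = 0; off < data.length; off += chunkSize) {
                int len = Math.min(chunkSize, data.length - off);
                // Convert only this chunk to chars; matches cannot cross into the next chunk.
                String chunk = new String(data, off, len, StandardCharsets.ISO_8859_1);
                if (pattern.matcher(chunk).find()) {
                    return true;
                }
            }
            return false;
        }
    }
    ```

    With the bytes `00 ff d9 00` and the pattern `\xFF\xD9`, a chunk size of 4 finds the marker, while a chunk size of 2 splits it into `00 ff` and `d9 00` and finds nothing.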

    By contrast, DROID has already been engineered to process its (near) regular expressions across buffers if necessary, allowing signatures to match as long as they need (or as small as you would like them to be), in each case only loading enough to make it worth loading a bit more.
