can we talk about fmt/42, fmt/43 and fmt/44?

In a relatively recent signature update, the fmt/44 signature was changed to allow some data after the stated EOF marker (ff d9).

In the case that started this off, a number of fmt/44 jpg files were found that had a couple of bytes after the marker DROID treats as an absolute EOF.

I had a look into the specs for jpg, trying to unravel this story. Were the extra bytes useful to someone? Were we missing something by ignoring them?

It transpired these bytes were added by a production workflow and didn’t really add to the informational content of the jpg (though one could argue they add some informational aspect to the digital object as an abstract entity). It also transpired that the EOF marker used by the jpg signature is not described in the same way by the jpg standards. The standards describe an End Of Image (EOI) marker of [ff d9] and make no reference to any data held in the file after it. The jpg standard doesn’t care: the ‘jpg’ stops at the EOI marker.

If we take a close look at the fmt/42, 43 and 44 signatures, we can see an absolute EOF marker is expected (apart from the slight offset allowed in the case of fmt/44). That EOF marker is the EOI marker, which can in fact occur at an arbitrary point in the file. It does usually occur at the EOF – that is typical and expected – but for jpg files with an offset EOI marker, DROID fails to match the version correctly (or at all) and offers all the PUIDs with jpg extensions as possible matches.
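The distinction can be sketched in a few lines of Python. This is not DROID’s code – the function names are mine, purely illustrative – but it contrasts what the current signatures effectively check (EOI at absolute EOF) with a check that tolerates an offset EOI:

```python
EOI = b"\xff\xd9"  # JPEG End Of Image marker

def eoi_at_eof(data: bytes) -> bool:
    """What the fmt/42-44 signatures effectively expect: EOI at absolute EOF."""
    return data.endswith(EOI)

def trailing_bytes_after_eoi(data: bytes) -> int:
    """Bytes after the last EOI marker; 0 when the EOI sits at absolute EOF.

    Note: rfind takes the *last* ff d9 in the file, so trailing junk that
    itself happens to contain ff d9 would be miscounted - good enough for
    a sketch, not for a production signature.
    """
    pos = data.rfind(EOI)
    if pos == -1:
        raise ValueError("no EOI marker found - probably not a complete jpg")
    return len(data) - (pos + len(EOI))
```

A file like those described above would return `False` from `eoi_at_eof` but report only a couple of trailing bytes from `trailing_bytes_after_eoi`.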

Over the last few months, I have seen perhaps 30 examples of jpg files (fmt/43 and fmt/44) that have a bunch of bytes after the EOI and therefore fail DROID signature matching. These files can be demonstrated to be valid fmt/43 or fmt/44 files by (1) rendering them in all the jpg viewers (none of which seem to care that there is data after the EOI marker) and (2) stripping the post-EOI bytes from the file and re-running DROID.

I would like to propose that the fmt/42, 43 and 44 signatures get changed to support the variable placement of the EOI marker, as per the jpeg specs (and the files we are seeing in practice).

This proposal has a few issues….

(1) What do we do with this extra data? Should we be scraping it somehow?

(2) Could one argue that a jpg with data post-EOI is a different format, as there is clearly an informational aspect encapsulated in these extra bytes? (Although, of course, if it’s not structured in a standardised way it’s of limited value to the community.)

(3) Is there a need to make this change? It would impact one of the most common format types we have…

31 Comments

  1. andy jackson
    February 7, 2012 @ 9:00 pm CET

    We find most of our workflows are trying to assure that items are renderable, i.e. ‘will this item display okay?’. We therefore want to pass our data to the right tool that we use to estimate whether rendering will work, which may be a fairly simple format validation or may be something more complex. For example, we want to pass our JP2 files to jpylyzer to look for damaged/truncated ones.

    Therefore, if DROID gave us a false negative (as it would if it expects the JP2 EOI marker to be present), we would end up with a big set of files marked ‘unknown format’, mixed up with everything else marked ‘unknown format’, despite the fact that the first set of files are very nearly JP2s. This has to be picked apart manually at present.

    However, if we use a more forgiving, inclusive identification step (which we do), we only risk false positives, and those would be revealed by the format-specific validator. Indeed, this is what happens, and the damaged/truncated files are marked as ‘jp2 but invalid’. Lovely.

    i.e. the reason FN are worse than FP is that we have format-based validation workflows, and therefore FP will be weeded out downstream. Furthermore, in my experience, most FP are usually either renderable-but-non-conforming or are malformed instances of the identified format.

  2. Jay Gattuso
    February 7, 2012 @ 8:16 pm CET

I’m referring to the O/S ID mechanisms as a coarse format ID process, with the ‘Endpoint’ process being some consuming application (e.g. MS Word) that has its own opaque conversion/ID process.

‘This means I would rather identification produced false positives that my production workflow can sift through than false negatives I have to override manually.’

    Interesting point – I’m interested to hear how the two different classes of errors (FP and FN) are weeded out – it sounds like you are suggesting the FN have to be manually worked, but the FP can be systematically addressed?

    I also think that we often (and dangerously) conflate 3 different processes – format classification (lumping things that are the same into a pile), format ID (giving a pile of things a label) and format validation (asserting that the things in the pile are a valid & formal set of things with the previously assigned label). Perhaps this is one of the things that gets muddied in this space – especially as sometimes a PUID can give a high confidence classification, ID and validation, and other PUIDs will only give a low confidence classification…

  3. mattpalmer1086
    February 7, 2012 @ 6:03 pm CET

    I completely agree that sharing signatures across platforms would be great. There is a lot of interest on the DROID 7 development wiki for better signature management / development tools. One of the proposals was to switch to Java regular expressions. I think I shot that down a bit (for mainly technical reasons) – but pointed out that DROID already uses very regular-expression like signatures – but the way they are delivered is very opaque in the DROID XML.

A tool to allow signatures to be specified in their original reg-ex like form would be very welcome here. It would also facilitate signature sharing between other platforms, which can only be a good thing for everyone. You can see the more advanced syntax already supported by DROID if you look at the container signatures in DROID 6.

Very interesting that DROID is hard to deploy on Hadoop. Maybe that should be a proposal for the DROID 7 development wiki? Or “make DROID more stream friendly”? Which brings us nicely back to not having to process EOF signatures!

    I’ve actually been working away for the last year or so on some new byte pattern matching capabilities in the byteseek library, which is *much* more stream friendly. In particular, it doesn’t need to know the length of the stream to match or search (unless you actually want to scan backwards from the end). I hope to get the 1.3 release out in the next couple of months (but have been saying that for the last couple of months!).

  4. andy jackson
    February 7, 2012 @ 5:09 pm CET

    Yes, indeed, all operating systems tend to describe format in very coarse ways (usually file extensions associated with applications), as do many open source applications, but I still don’t quite understand what this has to do with what I was saying originally. The file identification tools I am talking about (file, Tika) work at the ‘common name’ and MIME type levels respectively, but both have expressed some interest in more fine-grained identification.

My main drive here is to understand what features we really need in identification, and whether it is possible to have signature files that can be easily shared across the different tools. If BOF RegEx are sufficient, then we can generate signature files for all these tools from a single data source, and whoever needs to use the signatures can do so easily without having to switch tools or platforms. As you say, different tools suit different contexts, so I’d like to be able to get the same results across different contexts. If this approach works, PRONOM will have more users, and the more users we have, the more help we will have in growing the signature data to cover more formats.

    In other words, this isn’t primarily about which contexts the DROID/PRONOM tools do and don’t fit into, but about sharing signature information and getting that valuable data embedded into tools that are more widely used and supported. However, it is true that if DROID was easy to deploy on Hadoop, then this would not be so pressing. Attempting to do so revealed a number of issues, e.g. usual DROID usage requiring a File while Hadoop only provides an InputStream, but they all boil down to DROID being optimised for desktop usage and Tika being optimised for batch execution (and indeed Map-Reduce tasks).

  5. mattpalmer1086
    February 7, 2012 @ 3:07 pm CET

I don’t think it’s the OS tools being referred to here. It’s the end user applications which open, for example, any kind of Word file when the OS only identifies it as a Word file. In other words, OS level file identification is normally too coarse-grained.

    Interesting you are re-using OS identification algorithms and extending their signatures to give PUID-equivalent matching. Is there a particular reason or set of reasons driving you to doing this work? I completely get that DROID/PRONOM aren’t suitable for all contexts or workflows, but I am interested in understanding the contexts where they don’t fit and why.
