In a relatively recent signature update, the fmt/44 signature was changed to allow some data after the stated EOF marker (ff d9).
In the case that started this off, a number of fmt/44 jpg files were found that had a couple of bytes after what DROID looks for as an absolute EOF.
I had a look into the specs for jpg, trying to unravel this story: were the extra bytes useful to someone? Were we missing something by ignoring these bytes?
It transpired that these bytes were added by a production workflow, and didn’t really add to the informational aspect of the jpg (though one could argue they add some informational aspect to the digital object as an abstract entity…). It also transpired that the EOF marker used by the jpg signature is not described in the same way by the jpg standards. The standards describe an End Of Image (EOI) marker of [ff d9] and do not seem to make reference to any data held in the file after the EOI marker. The jpg standard doesn’t care: the ‘jpg’ stops at the EOI marker.
If we take a close look at the fmt/42, 43 and 44 signatures, we can see there is an absolute (apart from the slight offset in the case of fmt/44) EOF marker expected. The EOF marker the signatures look for is in fact the EOI marker, which can occur at an arbitrary point in the file. Of course it does usually occur at the EOF, that’s very typical and expected, but in the case of jpg files with an offset EOI marker DROID fails to match the version correctly (or at all) and offers all the PUIDs with jpg extensions as possible matches.
Over the last few months, I have seen perhaps 30 examples of jpg files (fmt/43 and fmt/44) that have a bunch of bytes after the EOI and therefore fail DROID signature matching. These files can be demonstrated to be valid fmt/43 or 44 files by (1) being rendered in all the jpg viewers (none of which seem to care that there is data after the EOI marker) and (2) by stripping the post EOI bytes from the file and re-running in DROID.
I would like to propose that the fmt/42, 43 and 44 signatures get changed, to support the variable placement of the EOI marker as per the jpeg specs (and our experience of the files we are seeing).
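The stripping step in (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_at_last_eoi` is made up for this example, and treating the last ff d9 in the file as the real EOI is a heuristic that will be wrong if the trailing bytes themselves happen to contain ff d9.

```python
from typing import Tuple

SOI = b"\xff\xd8"  # Start Of Image marker
EOI = b"\xff\xd9"  # End Of Image marker


def split_at_last_eoi(data: bytes) -> Tuple[bytes, bytes]:
    """Return (image_part, trailing_part), splitting just after the last EOI.

    Heuristic only: the last ff d9 is assumed to be the real EOI, which is
    wrong if the trailing data itself happens to contain ff d9.
    """
    if not data.startswith(SOI):
        raise ValueError("data does not start with the SOI marker ff d8")
    pos = data.rfind(EOI)
    if pos == -1:
        raise ValueError("no EOI marker ff d9 found")
    end = pos + len(EOI)
    return data[:end], data[end:]


# Tiny synthetic example, not a real image: SOI, filler, EOI, two extra bytes.
sample = SOI + b"\x00\x01\x02" + EOI + b"\x0d\x0a"
image_part, trailing = split_at_last_eoi(sample)
print(len(image_part), trailing.hex())  # prints: 7 0d0a
```

Writing `image_part` to a new file gives the version to re-run through DROID, while `trailing` holds the extra bytes that could be kept for inspection.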
This proposal has a few issues….
(1) What do we do with this extra data? Should we be scraping it somehow?
(2) Could one argue that a jpg with data post EOI is a different format, as there is clearly an informational aspect that is encapsulated in these extra bytes (although, of course, if it’s not structured in a standardised way it’s of limited value to the community)?
(3) Is there a need to make this change – it would impact one of the most common format types we have…
mattpalmer1086
January 27, 2012 @ 6:04 pm CET
If you do edit a signature file to produce another version, and manually upload it into DROID, remember to also change the signature version in the file header, and in the file name, to a different number. It is probably best to pick a low number (e.g. version 2), so it won’t conflict with any higher versions that may appear, and won’t become the default highest version available to DROID. DROID may become confused about which file to run profiles with if this isn’t done. And it becomes harder to remember which signature file was run over which set of files!
andy jackson
January 27, 2012 @ 5:57 pm CET
I know some of the folks in the SCAPE project are currently running DROID over the govdocs1 corpus, so if we can construct a version of DROID that does not use EOF signatures then they’ll probably be able to run a suitable test. I’ll follow that up.
If that looks good, then adding a requirement to the DROID 7 wiki is an excellent idea. Thanks!
mattpalmer1086
January 27, 2012 @ 3:40 pm CET
I don’t much like the EOF components of signatures either. I suspect they don’t add much accuracy, but do involve scanning at the end of a file or stream. In the case where DROID has to work from compressed files, it only has a stream to work with, so it must read the entire stream in order to get to the end, just to read these EOF markers! Well, there are also some annoying signatures which only involve a variable length scan (no BOF or EOF offset) – which potentially can force a scan of the entire stream too (but there used to be only about 2 of those, for fairly uncommon formats, so it would be nice to be able to disable those if required).
I suspect that very frequently, they do not add any identification accuracy – and as we have seen, sometimes decrease it!
It would be very interesting to strip out the <ByteSequence Reference="EOFoffset"> elements of the signatures from a signature file, and compare results with the original signature file, running over a fairly large corpus, of course.
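A minimal sketch of that stripping step, using Python's standard `xml.etree.ElementTree`, might look like the following. Only the `ByteSequence` element and its `Reference="EOFoffset"` attribute are taken from the discussion; the rest of the sample XML, and the helper names `local_name` and `strip_eof_sequences`, are assumptions made up for illustration, so a real signature file should be checked before relying on this.

```python
import xml.etree.ElementTree as ET

# Assumption: a cut-down stand-in for a DROID signature file. Only the
# ByteSequence element and its Reference attribute come from the discussion;
# the other element names, attributes and values are illustrative.
SAMPLE = """<FFSignatureFile Version="55">
  <InternalSignatureCollection>
    <InternalSignature ID="1">
      <ByteSequence Reference="BOFoffset"><SubSequence/></ByteSequence>
      <ByteSequence Reference="EOFoffset"><SubSequence/></ByteSequence>
    </InternalSignature>
  </InternalSignatureCollection>
</FFSignatureFile>"""


def local_name(tag: str) -> str:
    """Drop any '{namespace}' prefix ElementTree puts on a tag name."""
    return tag.rsplit("}", 1)[-1]


def strip_eof_sequences(root: ET.Element) -> int:
    """Remove every ByteSequence with Reference="EOFoffset"; return the count."""
    removed = 0
    for parent in list(root.iter()):
        for child in list(parent):
            if (local_name(child.tag) == "ByteSequence"
                    and child.get("Reference") == "EOFoffset"):
                parent.remove(child)
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed = strip_eof_sequences(root)
remaining = [e.get("Reference") for e in root.iter()
             if local_name(e.tag) == "ByteSequence"]
print(removed, remaining)  # prints: 1 ['BOFoffset']
```

Matching on the local tag name keeps the sketch working whether or not the file declares an XML namespace. When writing a real file back out, `ET.register_namespace` may be needed to avoid generated prefixes, and any signature left with no byte sequences at all would need handling separately.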
If it turns out that the EOF byte sequences don’t really affect identification accuracy, it would be an incredibly simple change for DROID to turn on or off running the EOF byte sequences (it already sorts the sequences to run all the BOF parts first, followed by the EOF parts for each signature).
This could be suggested for incorporation into DROID 7, by adding a new requirement to the wiki: http://droid7.wikispaces.com/
andy jackson
January 27, 2012 @ 2:58 pm CET
We’ve seen some similar problems, both with PDFs and JPEG2000s. In the former case, there is some variation in how the PDFs are closed, but this variation has no effect on the interpretation of the item (as in your JPG case here) – I think the DROID signatures were modified to take the variation into account. In the latter case, we had JP2 files that had been accidentally damaged in such a way that most of the damaged files would fail to be identified as JP2 if the end-of-file marker was required.
In the PDF case, the EOF signature causes us to waste time dealing with exceptions that are completely harmless. In the JP2 case, we only want to know whether any given file is ‘probably intended to be a JP2’, so that we can run deeper analysis and validation upon it, and so the EOF signature gets in the way of this workflow.
Furthermore, as far as I can tell from manually inspecting the DROID signatures, I am aware of no cases where the EOF signature tells us anything more than the BOF signature, i.e. ignoring the EOF signature when a BOF signature is present does not alter the result. Neither am I aware of any cases where a format can only be identified using EOF signatures (certainly, there are no such signatures in v55 of the DROID signature file). Finally, it is interesting to note that two of the most widely-used identification tools, file and Apache Tika, only allow BOF signatures (and in fact, use just an 8K chunk from the start of the file). As they don’t seem to be required, and in fact cause a range of problems, I currently consider EOF signatures to be actively harmful, and would rather we simply stopped using them.
Unless, of course, there are cases where EOF signatures are really needed?
mattpalmer1086
January 27, 2012 @ 9:40 am CET
Since the format is agnostic on any data following the end of format marker, it would seem to be a good idea to make it a wildcard * search from the end. This would have almost no performance impact, as it would only be triggered if the rest of the signature had already matched, but it would improve accuracy. I wouldn’t say it’s a different format to jpg, since jpg allows this data to appear. Whether the rest of the data should be scraped or not is another matter. It might be interesting to treat the rest of the data as a new stream to be run through format identification. However, I expect (but don’t know) that this data would be proprietary info recorded by some software. I doubt DROID would ever identify it… It might be interesting to flag that there was additional data, but that opens a can of worms to do it generically, not just for jpg…
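To make the difference between an absolute EOF match and a wildcard search from the end concrete, here is a rough Python sketch. It is not how DROID's matching engine works; the function `eoi_matches` and its `max_trailing` parameter are hypothetical names chosen for this illustration.

```python
from typing import Optional

EOI = b"\xff\xd9"  # End Of Image marker


def eoi_matches(data: bytes, max_trailing: Optional[int] = 0) -> bool:
    """Check for an EOI marker at or near the end of data.

    max_trailing=0    -> marker must be the final two bytes (absolute EOF).
    max_trailing=n    -> up to n bytes may follow the marker.
    max_trailing=None -> any amount of data may follow (wildcard).
    """
    # The last occurrence has the fewest trailing bytes, so it is the only
    # one that needs checking against the limit.
    pos = data.rfind(EOI)
    if pos == -1:
        return False
    trailing = len(data) - (pos + len(EOI))
    return max_trailing is None or trailing <= max_trailing


# Synthetic byte strings, not real images.
clean = b"\xff\xd8\x00" + EOI
padded = clean + b"\x00" * 5

print(eoi_matches(clean), eoi_matches(padded),
      eoi_matches(padded, 8), eoi_matches(padded, None))
# prints: True False True True
```

The strict setting rejects the padded file, a small fixed allowance accepts it only while the extra data stays short, and the wildcard setting accepts any amount of trailing data, which is the behaviour suggested above.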