In a relatively recent signature update, the fmt/44 signature was changed to allow some data after the stated EOF marker (ff d9).
In the case that started this off, a number of fmt/44 jpg files were found that had a couple of bytes after what DROID looks for as an absolute EOF.
I had a look into the specs for jpg, trying to unravel this story. Were the extra bytes useful to someone? Were we missing something by ignoring these bytes?
It transpired these bytes were added by a production workflow, and didn’t really add to the informational aspect of the jpg (though one could argue they add some informational aspect to the digital object as an abstract entity). It also transpired that the EOF marker used by the jpg signature is not described in the same way by the jpg standards. The standards describe an End Of Image (EOI) marker of [ff d9] and do not seem to make reference to any data held in the file after the EOI marker. The jpg standard doesn’t care: the ‘jpg’ stops at the EOI marker.
If we take a close look at the fmt/42, 43 and 44 signatures, we can see that an absolute EOF marker is expected (apart from the slight offset in the case of fmt/44). What the signatures treat as the EOF marker is really the EOI marker, which can occur at an arbitrary point in the file. Of course it does usually occur at the EOF, that’s very typical and expected, but for jpg files where the EOI marker is offset from the EOF, DROID fails to match the version correctly (or at all) and offers all the PUIDs with jpg extensions as possible matches.
Over the last few months, I have seen perhaps 30 examples of jpg files (fmt/43 and fmt/44) that have a bunch of bytes after the EOI and therefore fail DROID signature matching. These files can be demonstrated to be valid fmt/43 or 44 files by (1) being rendered in all the jpg viewers (none of which seem to care that there is data after the EOI marker) and (2) stripping the post-EOI bytes from the file and re-running it in DROID.
I would like to propose that the fmt/42, 43 and 44 signatures get changed, to support the variable placement of the EOI marker as per the jpeg specs (and our experience of the files we are seeing).
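As a rough illustration of step (2), here is a minimal Python sketch that splits a file's bytes at the EOI marker. The function name `split_at_last_eoi` is made up for this example, and it assumes the last occurrence of ff d9 is the real EOI. That is only a heuristic: if the trailing bytes themselves happen to contain ff d9, the split will land in the wrong place, and a robust tool would walk the jpg segments instead.

```python
SOI = b"\xff\xd8"  # JPEG Start Of Image marker
EOI = b"\xff\xd9"  # JPEG End Of Image marker


def split_at_last_eoi(data: bytes):
    """Split data into (image_bytes, trailing_bytes) after the last EOI marker.

    Heuristic only: assumes the last ff d9 in the file is the true EOI.
    If no EOI marker is found, the data is returned unchanged with no trailer.
    """
    pos = data.rfind(EOI)
    if pos == -1:
        return data, b""
    end = pos + len(EOI)
    return data[:end], data[end:]


# Usage with a made-up byte string standing in for a real file:
sample = SOI + b"\x00\x01\x02" + EOI + b"\x0d\x0a"
image, extra = split_at_last_eoi(sample)
print(len(extra), "byte(s) after EOI:", extra.hex())
```

Writing `image` back out to a new file gives a copy with the post-EOI bytes stripped, which can then be re-run through DROID.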
This proposal raises a few issues:
(1) What do we do with this extra data? Should we be scraping it somehow?
(2) Could one argue that a jpg with data post EOI is a different format, as there is clearly an informational aspect encapsulated in these extra bytes? (Although, of course, if it’s not structured in a standardised way it’s of limited value to the community.)
(3) Is there a need to make this change – it would impact one of the most common format types we have…
Jay Gattuso
January 30, 2012 @ 12:31 am CET
Yup – that was the issue – thanks – I had the wrong ID associated.
Here is the amended XML: http://dl.dropbox.com/u/59534857/DROID_SignatureFile_V55_no_jpeg_EOF_v5.xml
mattpalmer1086
January 30, 2012 @ 12:25 am CET
Are you saying that without EOF markers, you get x-fmt/80 matches, but with them you don’t? Because that doesn’t make any sense to me, given how the DROID algorithm works (or is supposed to work).
DROID checks each file against all the signatures, recording if any of them matched. Matches are only ever removed if a higher priority file format is also detected. Given there were originally no priority relationships with x-fmt/80, if x-fmt/80 could possibly match, it should have already appeared, regardless of whether the fmt/42, 43 and 44 signatures also matched, or indeed were present at all.
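To make that reasoning concrete, here is a small Python sketch of the matching model as described in this comment. It is a simplified model for illustration, not DROID's actual code, and the names `resolve_matches` and `priority_over` are invented for the example.

```python
def resolve_matches(matched, priority_over):
    """Drop any matched format that another matched format has priority over.

    matched:       set of format identifiers whose signatures matched the file
    priority_over: dict mapping a format to the set of formats it outranks
    """
    suppressed = set()
    for fmt in matched:
        # A format can only suppress others if it matched itself.
        suppressed |= priority_over.get(fmt, set()) & matched
    return matched - suppressed


both = {"fmt/44", "x-fmt/80"}

# No priority relationship: both matches are reported.
print(sorted(resolve_matches(both, {})))

# With a priority relationship, the lower priority match is removed.
print(sorted(resolve_matches(both, {"fmt/44": {"x-fmt/80"}})))
```

In this model a signature's match never depends on whether some other signature matched; priorities only remove matches afterwards. That is why an x-fmt/80 hit appearing only after the EOF markers were removed looks surprising.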
What am I missing here?
mattpalmer1086
January 30, 2012 @ 12:16 am CET
You are using the wrong id in <HasPriorityOverFileFormatID>467</HasPriorityOverFileFormatID>!
You should be using the file format id, not a signature id. “467” is a signature id of x-fmt/80, but its file format id is actually “122”. If you change this, the erroneous matches should disappear.
Just to clarify my understanding of your results, are you saying that (discounting the erroneous x-fmt/80 matches), all jpgs were identified correctly without EOF markers?
Jay Gattuso
January 29, 2012 @ 11:45 pm CET
All good points / questions, Andy.
I would be very interested in creating a version of the whole sigfile sans EOF markers and seeing what the difference is between the as-is sigfile and this amended one.
I will get round to it (unless someone beats me to it..) but I’m currently swamped in another set of tests that is looking at the longer term changes to sigs over time – all this data will be useful, and I am starting to wonder how we can best share (1) source data for testing and (2) results from these kinds of tests.
The minimalistic approach adopted by O/S based file ID methods is compelling, but I suspect it is somewhat skewed by the complex overhead of end point applications dealing with internal conversions transparently to the user / OS, which still potentially leaves us somewhat in the dark about the exact nature of the files we are looking at.
Jay Gattuso
January 29, 2012 @ 11:39 pm CET
I stripped out the EOF for fmt/42, 43 and 44. I then tested the 3000(ish) files that we have previously ID’ed as fmt/41, 42, 43 and 44 (500 x fmt/41, 110 x fmt/42, 500 x fmt/43 and 500 x fmt/44, then all these files again with no file extension, making 1610 unique signature comparisons and 3220 file comparisons). I left the fmt/41 files in as kind-of ground truth.
The changes to the signature led me down a bit of a rabbit hole: there is an implication for x-fmt/80. This is problematic to resolve, as the x-fmt/80 sig (http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=122&strPageToDisplay=signatures) is pretty weak. It is a short and not hugely specific string match at an absolute offset from the BOF. I have included a has-priority statement for the three IDs, to allow the jpg PUIDs to be matched preferentially. In this case the x-fmt/80 match is clearly a false positive, but as I don’t have any examples of x-fmt/80 I can’t test how this works across the complete set of related files (fmt/42, 43 and 44 and x-fmt/80). Even with the has-priority statement included, both fmt/44 and x-fmt/80 are offered as hits, so I’m not sure how to weed out the x-fmt/80 hits without making the x-fmt/80 signature more specific.
Sig file with no <HasPriorityOverFileFormatID>467</HasPriorityOverFileFormatID> clause: http://dl.dropbox.com/u/59534857/DROID_SignatureFile_V55_no_jpeg_EOF.xml
Sig file with <HasPriorityOverFileFormatID>467</HasPriorityOverFileFormatID> clause: http://dl.dropbox.com/u/59534857/DROID_SignatureFile_V55_no_jpeg_EOF_v3.xml
I also ran the new signatures over the ~40 fmt/43 and fmt/44 files I have collected that have data after the EOF – and these ID’ed as the expected PUIDs.
In summary of my first basic tests: removing the EOF pattern for fmt/42, 43 and 44 led to inaccurate results for ~20 of my test files, each of which picked up an erroneous x-fmt/80 match. This could potentially be resolved by making the x-fmt/80 signature tighter, assuming this is possible. Otherwise, the remaining 1600 signature-based IDs gave the expected results.
I’m not saying we should dump EOF out of hand, but in this case my limited testing has not highlighted any issue with removing the EOF aspect of the fmt/42,43 and 44 signatures.
I would be very interested in anyone else’s experiences. Feel free to have a play and let us know how you get on.
Alternatively, I will have a look at making a wildcarded EOF pattern for the same PUIDs. There is a whole bunch more testing needed before this is committed, but I’d like to hear from others before I jump in and push this any further…