Investigating PRONOM EOF patterns and DROID ‘Fast’ Scanning

9 February 2012

Following on from the interesting discussion in my last post about the JPEG signatures, I undertook some quick testing on the impact of including or omitting the EOF sections of a DROID signature file.

I previously posted this signature file here: http://dl.dropbox.com/u/59534857/DROID_SignatureFile_V55%20-%20no%20EOF.xml
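For readers who want to try something similar, here is a minimal Python sketch of how EOF patterns might be stripped from a signature file. It is not necessarily how the file above was produced. It assumes that EOF-anchored patterns appear as ByteSequence elements carrying a Reference="EOFoffset" attribute; the function name and the tiny sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: EOF-anchored patterns are marked with this attribute value.
EOF_REFERENCE = "EOFoffset"


def strip_eof_sequences(xml_text):
    """Return (xml_without_eof_sequences, number_removed)."""
    root = ET.fromstring(xml_text)
    removed = 0
    # Snapshot the element list so removals do not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            # Ignore any XML namespace prefix on the tag name.
            local_name = child.tag.rsplit("}", 1)[-1]
            if local_name == "ByteSequence" and child.get("Reference") == EOF_REFERENCE:
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed


# Hypothetical, heavily simplified sample; a real signature file is far larger.
SAMPLE = """<FFSignatureFile>
  <InternalSignatureCollection>
    <InternalSignature ID="1">
      <ByteSequence Reference="BOFoffset"/>
      <ByteSequence Reference="EOFoffset"/>
    </InternalSignature>
  </InternalSignatureCollection>
</FFSignatureFile>"""

stripped, removed = strip_eof_sequences(SAMPLE)
print(removed, "EOF sequence(s) removed")
```

Note that a signature defined only by an EOF pattern would be left with no byte sequences at all under this approach, so such cases would need separate handling.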

I have compiled my results and written a brief paper that is available here: http://dl.dropbox.com/u/59534857/Comparing%20DROID%20signature%20files.pdf

I have included the summary section here:

“This short paper presents the results found when the End Of File (EOF) section is removed from the current Droid Signature (v55).

A second test is completed, in which the ‘Maximum Scan Bytes’ function is tested with the same signature files.

A recommendation is made to repeat these tests, and extend the scope of the source files.

A recommendation is made to consider the case for removing some EOF markers if there is an efficiency gain in doing so.

A recommendation is made to assess the impact of using the ‘Fast’ mode of Droid v6.”

Thoughts, comments and questions are welcome as ever.

Characterisation Identification

7 Comments

Jay Gattuso
February 12, 2012 @ 8:31 pm CET

Hey Nir,

DROID in ‘fast’ mode is significantly faster. The point here is that if it gives different results, it’s not the same tool…

Using a workflow (with automated decision making) to steer these results back into something consistent is a possibility, but it shouldn’t be necessary, the expectation being that DROID performs consistently. If DROID offers different PUIDs in different modes, it raises some very big questions about a suitable performance baseline, accuracy and consistency.
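One way to check that consistency is to compare the PUIDs each mode reports for the same files. The sketch below is in Python and assumes the results of each run have already been loaded into a dictionary mapping file path to PUID; the function name, paths and PUID values are placeholders for illustration, not test results.

```python
def diff_puids(full_scan, fast_scan):
    """Return {path: (full_puid, fast_puid)} for every file where the two runs disagree.

    A file missing from one run is reported with None on that side.
    """
    disagreements = {}
    for path in sorted(set(full_scan) | set(fast_scan)):
        full_puid = full_scan.get(path)
        fast_puid = fast_scan.get(path)
        if full_puid != fast_puid:
            disagreements[path] = (full_puid, fast_puid)
    return disagreements


# Placeholder data only: these are not real identification results.
full_results = {"a.pdf": "fmt/18", "b.jpg": "fmt/43", "c.bin": None}
fast_results = {"a.pdf": "fmt/18", "b.jpg": "fmt/44", "c.bin": None}

for path, (full_puid, fast_puid) in diff_puids(full_results, fast_results).items():
    print(path, "full:", full_puid, "fast:", fast_puid)
```

An empty result would mean the two modes agreed on every file in the sample.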

There is undoubtedly a performance gain for using the fast mode, and in the extremely limited tests I completed there are clearly some formats that can be safely ID’ed via the fast mode.

Any logic used to decide which files require a 2nd ‘deeper’ parsing would be much better placed inside the DROID signature than handled by an external tool/process. How would community users know that they are making the same workflow decisions?

This logic would also require a complete look-up covering all the formats we encounter, and it would require DROID’s ‘fast’ mode to offer a PUID assertion suitable for use as a trigger for a 2nd parsing of the file. I’m not sure that ‘fast’ performance is suitably baselined at this time to make that look-up, or to confidently create external workflow rules triggered by DROID ‘fast’ behaviour.

nir.sherwinter
February 12, 2012 @ 2:16 pm CET

If using less data for identification produces more false positives, we must think of a mechanism for deciding whether the mode should be used or not. For example, if the results from more tests like Jay’s show that specific formats are identified faster by this mode, we can use the extension as a decision point (all PDFs will be routed to the fast mode first…). If we can’t find any mechanism or decision pathway, this mode is useless.
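As a rough illustration of that decision point, here is a minimal Python sketch. The set of ‘fast-safe’ extensions is entirely hypothetical and would have to come from further testing; the function name is made up for this example.

```python
import os

# Hypothetical: extensions that testing has shown to be safe for fast mode.
FAST_SAFE_EXTENSIONS = {".pdf"}


def choose_mode(filename, fast_safe=FAST_SAFE_EXTENSIONS):
    """Return "fast" when the extension is on the safe list, otherwise "full"."""
    extension = os.path.splitext(filename)[1].lower()
    return "fast" if extension in fast_safe else "full"


for name in ("report.PDF", "photo.jpg", "README"):
    print(name, "->", choose_mode(name))
```

Anything not on the list, including files with no extension, falls back to the full scan, so an incomplete list costs speed rather than accuracy.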
