Investigating PRONOM EOF patterns and DROID ‘Fast’ Scanning

Following on from the interesting discussion in my last post about the JPEG signatures, I ran some quick tests on the impact of including or omitting the EOF sections of a DROID signature file.

I previously posted the modified (no-EOF) signature file here: http://dl.dropbox.com/u/59534857/DROID_SignatureFile_V55%20-%20no%20EOF.xml

 

I have compiled my results and written a brief paper that is available here: http://dl.dropbox.com/u/59534857/Comparing%20DROID%20signature%20files.pdf

 

I have included the summary section here:

“This short paper presents the results found when the End Of File (EOF) section is removed from the current Droid Signature (v55).

A second test is completed, in which the ‘Maximum Scan Bytes’ function is tested with the same signature files.

A recommendation is made to repeat these tests, and extend the scope of the source files.

A recommendation is made to consider the case for removing some EOF markers if there is an efficiency gain in doing so.

A recommendation is made to assess the impact of using the ‘Fast’ mode of Droid v6.”

 

Thoughts, comments and questions welcome as ever.

7 Comments

  1. andy jackson
    February 12, 2012 @ 2:00 pm CET

    Hm, not sure that will work. ‘Fast mode’ is really ‘guess-using-less-data mode’, and because there are multiple matching expressions for each format it is possible that using less data will cause more false positives, not just false negatives (as you suggest). This may not be terribly likely, but it is difficult to discount a priori.

  2. nir.sherwinter
    February 12, 2012 @ 12:51 pm CET

    1. I assume that using fast mode will identify the files much faster.
    2. For the best-tuned ingest workflow (in your case using Rosetta), I would think of a flow that first uses fast mode and then uses regular mode on the files that fast mode failed to identify (something like a two-phase identification mode).
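
    The two-phase flow suggested above could be sketched roughly as follows. This is only an illustration: `fast` and `full` are hypothetical callables standing in for a fast-mode and a full-scan identification step (they are not DROID or Rosetta APIs), and each is assumed to return a PUID string, or `None` when the file is not identified.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical identifier type: takes a file path, returns a PUID or None.
Identifier = Callable[[str], Optional[str]]


def two_phase_identify(
    paths: Iterable[str], fast: Identifier, full: Identifier
) -> Dict[str, Tuple[Optional[str], str]]:
    """Try the fast identifier first; fall back to the full scan on a miss."""
    results: Dict[str, Tuple[Optional[str], str]] = {}
    for path in paths:
        puid = fast(path)
        mode = "fast"
        if puid is None:
            # Only files the fast pass could not identify pay for a full scan.
            puid = full(path)
            mode = "full"
        results[path] = (puid, mode)
    return results
```

    Note that this only retries files the fast pass left unidentified. It would not catch a case where the fast pass confidently returns a different PUID from the full scan.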

  3. andy jackson
    February 9, 2012 @ 11:43 am CET

    With fmt/101, I can kind of believe that there is some other format that is XML-based but is effectively discounted because its EOF signature does not match a file. In that case, removing the EOF sig may change the result (although this could probably be fixed with a HasPriorityOver relationship). In such cases, knowing what PUID you got instead of fmt/101 would probably resolve the issue.

    However, I can think of no such argument for the case of fmt/144. Perhaps the fall-back to file extension is matching all *.pdf PUIDs for some PDFs you have that do not match the EOF markers DROID expects? If so, this difference in results would be an argument in favour of dropping those EOF signatures, as using more information to match a file would have been shown to be giving a less precise result!

    I know some folks in SCAPE are doing very similar work, using the GovDocs1 corpus and its associated ‘ground truth’ (which I put in quotes here, as they are in fact the results from FITools and have not been verified manually, as far as I know). I believe the results will be published over the next few months, and hopefully in a way that lets the raw data be shared too.

  4. Jay Gattuso
    February 9, 2012 @ 10:51 am CET

    The results are relatively mystifying on some occasions… I’ll try and answer your questions as fully as I can from here, and I’ll dig into some of the detail in the office tomorrow.

    1) ‘Fast mode’ is my abbreviated way of describing the ‘max size byte scanning’ mode (the ‘Maximum Scan Bytes’ setting) that was built into v6. As I recall, the default setting is to scan only the first and last 64 KB of a file. You can switch this off by setting the variable to a negative number, which forces DROID to scan the whole file (as in previous versions). The option is in one of the preferences tabs. It’s worth noting that running these tests in fast mode took perhaps a couple of hours, whereas in full (slow) mode it took at least 12 hours to churn through the same source set of files.
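
    To make the idea concrete, here is a minimal sketch of what ‘scan only the first and last N bytes’ means. It is not DROID’s implementation: the function name `read_scan_windows` and the parameter `max_scan_bytes` are my own, and the 65536 default simply mirrors the 64 KB figure recalled above.

```python
import os
import tempfile


def read_scan_windows(path, max_scan_bytes=65536):
    """Return (head, tail): the byte windows a limited scan would look at.

    A negative max_scan_bytes means "no limit", so the whole file is
    returned for both BOF and EOF matching.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if max_scan_bytes < 0 or size <= max_scan_bytes:
            data = f.read()
            return data, data
        head = f.read(max_scan_bytes)      # window for BOF patterns
        f.seek(size - max_scan_bytes)
        tail = f.read(max_scan_bytes)      # window for EOF patterns
    return head, tail


if __name__ == "__main__":
    # Made-up test file: a header, a large block of padding, a trailer.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"%PDF-1.4" + b"\0" * 200_000 + b"%%EOF")
    head, tail = read_scan_windows(tmp.name)
    print(len(head), len(tail), head[:8], tail[-5:])
    os.remove(tmp.name)
```

    Anything lying between the two windows is never examined, which is one way a limited scan and a full scan could disagree for patterns that sit deep inside a large file.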

     

    2) One of the reasons for the confusing results will undoubtedly be extension matching – I can re-run the data and show the various hits with and without extensions if that would be useful. I decided not to for the first showing of this data because it’s an added layer of complexity that’s not immediately needed to see what the main issue is.

     

    3) Example of fmt/101 – as it’s of interest, I’ll drill into this example and show what I can see from the four tests. You should bear in mind that there is not always a correlation between the trigger file for a PUID hit that occurs in all the result sets – I need to shuffle the data a bit to get down to that level of analysis. The important thing for now is that a series of unknown files behave differently depending on the scanning setting and on the use of the EOF pattern in the sig. If there are any other specific PUIDs you would like some deeper analysis on, let me know. (I’ll look at fmt/144.)

     

    As to what’s going on? Very good question. I am in the middle of doing some deeper analysis with the same source files, using DROID 3, DROID 5 and DROID 6, and a set of 5 sig files (off the top of my head v13, v37, v45, v49 and v50, as they relate to sig files we use(d) in our Rosetta system). I suspect these results will help us shed some more light on what’s happening, but from what I can see from my first pass of the data, different versions of DROID can result in inconsistent PUID assertions, different signature files obviously can, and this fast mode also changes the results.

     

    I’d really like someone else to repeat my tests with another big lump of files – it’s not that difficult to pull the results into a db – I can share my full method & tables etc. if there is an interest. I have already shared the modified sig file, and the fast mode is a built-in function. We also need to be careful to address the two separate issues I’m flagging here – the fast mode issue, and the experiment with the EOF patterns.
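
    For anyone wanting to try the ‘pull the results into a db’ step, a rough sketch using SQLite might look like this. It is not my actual method or table layout: the `results` table, the run labels, and the assumption of one (path, PUID) row per file per run are all illustrative, and the sample rows in the usage note are made up.

```python
import sqlite3


def load_run(conn, run_name, rows):
    """Store one identification run as (run, path, puid) rows.

    rows is an iterable of (path, puid) pairs; puid may be None for
    files that were not identified.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (run TEXT, path TEXT, puid TEXT)"
    )
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?)",
        [(run_name, path, puid) for path, puid in rows],
    )


def differing_files(conn, run_a, run_b):
    """Return (path, puid_a, puid_b) for files whose PUID differs between runs."""
    query = """
        SELECT a.path, a.puid, b.puid
        FROM results AS a
        JOIN results AS b ON a.path = b.path
        WHERE a.run = ? AND b.run = ?
          AND a.puid IS NOT b.puid   -- NULL-safe comparison in SQLite
        ORDER BY a.path
    """
    return conn.execute(query, (run_a, run_b)).fetchall()
```

    Usage would be along the lines of `conn = sqlite3.connect(":memory:")`, then `load_run(conn, "v55", rows_a)` and `load_run(conn, "v55-no-EOF", rows_b)`, followed by `differing_files(conn, "v55", "v55-no-EOF")`. Files with multiple PUID hits (for example from extension matching) would need a slightly richer schema than this.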

     

  5. andy jackson
    February 9, 2012 @ 10:20 am CET

    Thanks for the analysis, but your results leave me utterly mystified! Firstly, what on earth is ‘fast mode’!? Can’t find it in the UI, CLI or Help PDF file. Why on earth is it giving different results!? Extension matching only?

    More to the main point, when I look at the v55 versus v55-no-EOF results, I cannot make any sense of the output. For example, looking at fmt/101 (XML), which apparently had no matches under the no-EOF results, it is difficult to see how the EOF could matter. It is identified using a straightforward BOF signature, so perhaps some other signature is taking precedence, somehow, although I have difficulty seeing how removing signatures could make that happen.

    The results for fmt/144 are even more mysterious. This is a PDF variant that has no internal signature of any kind, in either signature file! How on earth can that ever match?

    What am I missing? Any ideas?!
