Digital Preservation Stage Boss One: The Performance of File Format Identification Tools vs. Checksum Generation Tools

At Archives New Zealand we found that ‘WAVE’ files were becoming a bottleneck in one of our ingest processes. The result initially looked odd to me because I had understood that file format identification should not take longer than generating a checksum. My rationale was that to identify a file, DROID, or an equivalent tool, would rarely need to read the whole file; rather, it should be able to read signatures from offsets relative to the beginning or end of the file.

On top of that, I assumed that even if the identification tool was ‘slow’ then it would only take as long as checksum generation, never longer. My rationale was that reading from disk is the slowest part of the process – once the file’s contents were in memory, the tool working with the file would continue its computations in-memory.
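The two access patterns behind that assumption can be sketched in a few lines of Python. This is only an illustration: the signature table, the function names and the padded test file are hypothetical, and it is not how DROID or PRONOM signatures are actually implemented.

```python
import hashlib
import os
import tempfile

# Hypothetical signature: magic bytes anchored at fixed offsets from the
# beginning of file. RIFF/WAVE files start with b"RIFF", a 4-byte size
# field, then b"WAVE".
SIGNATURE = [(0, b"RIFF"), (8, b"WAVE")]


def identify_by_offsets(path):
    """Return (matched, bytes_read), reading only the bytes the signature needs."""
    bytes_read = 0
    with open(path, "rb") as handle:
        for offset, magic in SIGNATURE:
            handle.seek(offset)
            chunk = handle.read(len(magic))
            bytes_read += len(chunk)
            if chunk != magic:
                return False, bytes_read
    return True, bytes_read


def checksum(path, block_size=65536):
    """Return (MD5 hex digest, bytes_read); a checksum must read every byte."""
    digest = hashlib.md5()
    bytes_read = 0
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
            bytes_read += len(block)
    return digest.hexdigest(), bytes_read


def make_fake_wave(size=1_000_000):
    """Write a RIFF/WAVE header followed by zero padding (not valid audio)."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    with os.fdopen(fd, "wb") as handle:
        handle.write(b"RIFF" + (size - 8).to_bytes(4, "little") + b"WAVE")
        handle.write(b"\x00" * (size - 12))
    return path


if __name__ == "__main__":
    sample = make_fake_wave()
    print("identification:", identify_by_offsets(sample))
    print("checksum bytes read:", checksum(sample)[1])
    os.remove(sample)
```

Under these assumptions the identification step touches eight bytes regardless of file size, while the checksum reads the whole file. The rest of this post is about the cases where that picture breaks down.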

It seems that the word ‘rarely’ may have been the error in my original assumptions. Finding those assumptions challenged, I was eager to build an empirical picture of the performance of format identification vs. checksum generation, and created a handful of experiments to do this.

The experiments were created to look at performance over the Govdocs Select corpus, a simulant of that corpus, and the WAVE files we were having trouble with.

The results demonstrated that over a corpus that ‘looks like’ the Govdocs Select one, DROID will outperform the checksum generation tools. Siegfried will be quicker on average than DROID.

As we start to modify the maximum length of file each tool is configured to scan, for example, asking DROID not to read more than 65535 bytes (its default setting), we see an increase in speed. We don’t see a major drop-off in the number of files supposedly identified by either tool, but we do start to see differences in what those results are. This indicates less precise identification results and potential false positives.
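A rough Python sketch of why a scan limit can change results: if a tool only looks at a window at the start and at the end of a file, any byte sequence that sits outside those windows cannot contribute to the match. The window logic, the function names and the ‘MARKER’ sequence below are assumptions made for illustration, not DROID’s or Siegfried’s actual behaviour.

```python
import os
import tempfile


def read_scan_windows(path, max_bytes):
    """Return the (head, tail) byte windows a length-limited scan would see.

    A negative max_bytes means 'no limit' in this sketch (an assumption).
    """
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        if max_bytes < 0 or max_bytes >= size:
            data = handle.read()
            return data, data
        head = handle.read(max_bytes)
        handle.seek(size - max_bytes)
        tail = handle.read(max_bytes)
    return head, tail


def pattern_found(path, pattern, max_bytes):
    """Report whether pattern occurs inside either scan window."""
    head, tail = read_scan_windows(path, max_bytes)
    return pattern in head or pattern in tail


def make_sample(size=300_000, marker=b"MARKER", marker_offset=150_000):
    """Write zero bytes with a hypothetical marker placed mid-file."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as handle:
        handle.write(b"\x00" * marker_offset)
        handle.write(marker)
        handle.write(b"\x00" * (size - marker_offset - len(marker)))
    return path


if __name__ == "__main__":
    sample = make_sample()
    print("limit 65535:", pattern_found(sample, b"MARKER", 65535))
    print("no limit:   ", pattern_found(sample, b"MARKER", -1))
    os.remove(sample)
```

With the marker in the middle of a 300,000 byte file, the limited scan misses it and the unlimited scan finds it. A real tool would then fall back to whichever less specific signature still matches, which is one way the identification can become less precise without the file becoming ‘unidentified’.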

For the collection of WAVE files we had at hand, DROID ran 16x slower than the checksum generation tools, and when we reduced DROID’s scan length it was still measurably slower. The number of wildcards in the file format signatures for WAVE-based formats may be causing the problem, because wildcards create the potential for DROID to scan through the entire file for each matching signature. Other things could account for this, for example, code that could still be optimized. The report presents solutions which may allow us to improve on the results we see today.
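The wildcard hypothesis can be sketched in Python as follows. A signature made of a fixed header, an unbounded wildcard, and a second byte sequence can only be rejected once the scanner reaches the end of the file, and that cost repeats for every such signature whose header matches. The chunk identifiers used as candidates here are illustrative stand-ins, not the actual PRONOM signatures, and the byte accounting is approximate.

```python
def match_with_wildcard(data, first, second):
    """Match `first` at offset 0, then `second` anywhere after it.

    Models a signature of the form first, unbounded wildcard, second.
    Returns (matched, bytes_examined), where bytes_examined is an
    approximate count of how far into the data the scan had to go.
    """
    if not data.startswith(first):
        return False, min(len(data), len(first))
    position = data.find(second, len(first))
    if position == -1:
        # The second sequence never appears: the scan ran to end of file.
        return False, len(data)
    return True, position + len(second)


def bytes_examined_for_signatures(data, first, candidates):
    """Try each wildcard signature in turn; return (hits, total bytes examined)."""
    hits = []
    total = 0
    for second in candidates:
        matched, examined = match_with_wildcard(data, first, second)
        total += examined
        if matched:
            hits.append(second)
    return hits, total


if __name__ == "__main__":
    # A RIFF-like header followed by a megabyte of padding.
    data = b"RIFF" + b"\x00" * 4 + b"WAVE" + b"\x00" * 1_000_000
    # Hypothetical trailing sequences, none of which occur in the data.
    candidates = [b"bext", b"cue ", b"LIST"]
    hits, total = bytes_examined_for_signatures(data, b"RIFF", candidates)
    print("hits:", hits)
    print("file size:", len(data), "bytes examined:", total)
```

In this sketch three non-matching wildcard signatures cost three full passes over the file, which is more reading than a single checksum pass needs.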

By way of control, the ‘simulant’ corpus (26,124 files populated with random data, totalling 31.4GB) was used to demonstrate that neither DROID nor Siegfried needed much time to reach a conclusion of ‘fmt/UNKNOWN’. DROID was noticeably quicker than Siegfried, but neither tool took over three minutes to get through the amount of data it was given.

The results are presented in the experiment’s full report, here: https://github.com/exponential-decay/digital-preservation-stage-boss-one/blob/master/final-report/digital-preservation-stage-boss-one.pdf [PDF]

All of the work involved in this experiment can be found in my GitHub repository Digital Preservation Stage Boss One.

Please take your time and enjoy the results and take a look through the repository. All thoughts and comments are appreciated.

[Edit] 2016-08-22: Post-publication Richard Lehane noticed a discrepancy in my thinking re: DROID prioritization of WAVE identification results. My correction is reflected in the report text and committed to the repository by way of source control/versioning. Please see GitHub for the previous reading.

[Edit] 2016-09-21: Surfacing the wildcard signature listing in PRONOM from the report itself: https://github.com/exponential-decay/digital-preservation-stage-boss-one/tree/master/wildcard-signature-information

2 Comments

ross-spencer
August 29, 2016 @ 3:23 am CEST

Thanks Andy.

It makes sense to me.

I’m not sure how to report that to the DROID team at the moment, but perhaps I can just point them at this thread on GitHub?

With a steer from Richard, I was able to use Siegfried’s (SF) companion tool Roy to change the signature file SF uses to simulate DROID’s behavior, that is, to ignore the priorities that help SF to return quicker (Roy uses a flag (-multi) with a setting of 3 (comprehensive scan) to do this).

Here are the results of SF’s analysis of the WAVS:
Run                                 MEAN SECS  SDEV SECS  MEAN MINS  SDEV MINS
droid container NOLIMIT              123.2327     0.6672     2.0539     0.0111
sf container NOLIMIT STANDARD          2.0390     0.0186     0.0340     0.0003
sf container NOLIMIT NOPRIORITIES     52.7807     0.2720     0.8797     0.0045

If nothing else, it might suggest that there are performance gains to be had if they look at that part of their algorithm, maybe passing buffered input streams instead…

Andy Jackson
August 23, 2016 @ 9:27 pm CEST

Reading a file byte-by-byte can be very slow if the input stream is not buffered carefully. I think this is because file system block-sized reads are much more efficient than going back to the disk all the time. As far as I can tell, DROID’s ByteReader classes are passed raw FileInputStreams rather than e.g. ones wrapped in BufferedInputStreams. This may be why DROID is relatively slow.
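The buffering effect is easy to reproduce outside DROID. The sketch below uses Python rather than Java, with Python’s unbuffered and buffered file objects standing in, by analogy only, for a raw FileInputStream and a BufferedInputStream. The function name and file size are arbitrary choices for the illustration, and timings will vary by machine.

```python
import os
import tempfile
import time


def read_byte_by_byte(path, buffered):
    """Read a file one byte at a time; return the number of bytes read."""
    # buffering=0 gives a raw, unbuffered file object (one OS read per call);
    # buffering=-1 gives Python's default buffered reader.
    buffering = -1 if buffered else 0
    count = 0
    with open(path, "rb", buffering=buffering) as handle:
        while handle.read(1):
            count += 1
    return count


if __name__ == "__main__":
    fd, sample = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as handle:
        handle.write(os.urandom(200_000))

    for buffered in (False, True):
        start = time.perf_counter()
        total = read_byte_by_byte(sample, buffered)
        elapsed = time.perf_counter() - start
        print(f"buffered={buffered}: {total} bytes in {elapsed:.3f}s")

    os.remove(sample)
```

Both runs read the same bytes; the only difference is whether each one-byte read goes to the operating system or is served from an in-memory buffer.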
