Fido – a high performance format identifier for digital objects

Fido is a simple format identification tool for digital objects that uses Pronom signatures. It converts signatures into regular expressions and applies them directly. Fido is free, Apache 2.0 licensed, easy to install, and runs on Windows and Linux.  Most importantly, Fido is very fast.
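To make the "signatures as regular expressions" idea concrete, here is a minimal sketch in Python 3. It is not Fido's actual code: the identifiers, the patterns, and the function names `identify` and `identify_file` are hypothetical, and the patterns are simple magic-number checks written by hand rather than converted from Pronom signatures.

```python
import re

# Hypothetical signature table: identifier -> compiled byte-level pattern.
# These entries are illustrative only and are not taken from Pronom.
SIGNATURES = {
    "example/pdf": re.compile(rb"\A%PDF-\d\.\d"),
    "example/png": re.compile(rb"\A\x89PNG\r\n\x1a\n"),
    "example/gif": re.compile(rb"\AGIF8[79]a"),
}


def identify(data):
    """Return the identifiers whose pattern matches the given bytes."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(data)]


def identify_file(path, buffer_size=4096):
    """Read only the first buffer_size bytes of a file and identify them."""
    with open(path, "rb") as handle:
        return identify(handle.read(buffer_size))
```

Calling `identify(b"GIF89a...")` returns `["example/gif"]`, and input that matches no pattern returns an empty list. All the real work is done by the regular expression engine, which is why such a tool can stay small.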

In a subsequent post, I’ll describe the implementation in more detail.  For the moment, I would just like to highlight that the implementation was done by a rusty programmer in the evenings during October.  The core is a couple of hundred lines of code in three files.  It is shorter than these blog posts!

I was stunned by Fido’s performance.  Its memory usage is very small.  Under XP, it consumes less than 5MB whether it identifies 5 files or 5000 files.

I have benchmarked Fido 0.7.1 under Python 2.6 on a Dell D630 laptop with a 2 GHz Intel Core Duo processor under Windows XP.  In this configuration, Fido chews through a mixed collection of about 5000 files on an external USB drive at the rate of 60 files per second.

As a point of comparison, I also benchmarked the file command (Cygwin 5.0.4 implementation) in the same environment against the same set of 5000 files.  File does a job similar to Droid or Fido: it identifies types of files, but more from the perspective of a Unix system administrator than a preservation expert (e.g., it is very good with compiled programmes, but not so good with types of Office documents).  I invoked file as follows:

       time find . -type f | file -k -i -f - > file.out

This reports 1m24s or 84 seconds.  I compared this against:

       time python -m fido.run -q -r . > fido.csv

This reports 1m18s or 78 seconds.

In my benchmark environment, Fido 0.7.1 is about the same speed as file.  This is an absolute shock.  Neither Fido nor the Pronom signature patterns have been optimised, whereas file is a mature and well established tool.  Memory usage is rock solid and tiny for both Fido and file.

Meanwhile, Maurice de Rooij at the National Archives of the Netherlands has done his own benchmarking of Fido 0.7.1 in a setting that is more reflective of a production environment (Machine: Ubuntu 10.10 Server running on Oracle VirtualBox; CPU: Intel Core Duo CPU E7500 @ 2.93 GHz (1 of 2 CPUs used in virtual setup); RAM: 1 GB).  He observed Fido devour a collection of about 34000 files at a rate of 230 files per second.

Fido’s speed comes from the mature and highly optimised libraries for regular expression matching and file I/O – not clever coding.

For me, performance in this range is a surprise, a relief, and an important step forward.  It means that we can build precise file format identification into automated workflows that deal with large-scale digital collections.  A rate of 200 files per second is equivalent to 17.28 million files in a day, on a single processor. Fido 0.7 is already fast enough for most current collections.
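The arithmetic behind these rates is easy to check. The sketch below (Python 3; the function names `throughput` and `files_per_day` are made up for illustration) derives files per second from the timings reported above and projects a sustained rate over a full day. It assumes exactly 5000 files, whereas the benchmark collection was only "about 5000", so the per-second figures are approximate.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def throughput(file_count, elapsed_seconds):
    """Files identified per second for a timed run."""
    return file_count / elapsed_seconds


def files_per_day(rate):
    """Files processed in 24 hours at a sustained rate (files per second)."""
    return rate * SECONDS_PER_DAY


# Timings reported in the post, assuming exactly 5000 files.
fido_rate = throughput(5000, 78)  # roughly 64 files per second
file_rate = throughput(5000, 84)  # roughly 60 files per second

# Projection used in the text: 200 files per second for a whole day.
daily_total = files_per_day(200)  # 17280000
```

The projection assumes the rate holds steady for 24 hours on one processor, with no allowance for larger files or slower storage.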

Good quality format identification along with a registry of standard format identifiers is an important element for any digital archive.  Now that we have the overall performance that we need, I believe that the next step is to correct, optimise, and extend the Pronom format information.

Fido is available under the Apache 2.0 Open Source License and is hosted by GitHub at http://github.com/openplanets/fido. It is easy to install and runs on Windows and Linux.  It is still beta code, so we welcome your comments, feedback, ideas, bug reports, and contributions!

By Adam Farquhar, posted in Adam Farquhar's Blog

3rd Nov 2010  7:57 AM
