File Characterisation Tools – A Report on a Testing Project Conducted at the National Library of Australia

The National Library of Australia has just completed a small project to investigate and test a number of software tools of interest to digital preservation activities. The result of this project was an internal report describing the tests and the results, and giving some recommendations about the potential for using these tools in a planned replacement of the Library’s infrastructure for managing digital content.

Although written as an internal report, it contains material that will be of interest to the general digital preservation community, and so I am posting it here for your reference. The report is attached.

The project tested file characterisation tools, in particular file format identification tools and metadata extraction tools. A few other recently reported projects have covered similar ground (links are included in the report). I tried to make this project complementary to those by selecting a partly different set of tools (with some in common), and by analysing the results in terms of the comparative usefulness of the outputs rather than computational performance. This makes the results somewhat subjective, but at the same time illuminates many of the issues that make using these tools challenging.

The tools tested for file format identification were:

  • File Investigator Engine
  • Outside-In File ID
  • FIDO
  • Unix file / libmagic

The tools tested for metadata extraction were:

  • File Investigator Engine
  • Exiftool
  • MediaInfo
  • pdfinfo from the Xpdf toolkit
  • Apache Tika

Like all projects of this sort, this one was time and resource constrained, and I didn’t get to cover the metadata extraction tools in as much depth as I would have liked, but the results do give at least an introduction to the capabilities of these tools.

The tests used a data set collated from both publicly available and internal Library sources, including the Govdocs1 corpus.

I hope you find this report useful. The National Library of Australia welcomes comments and discussion about it.
(I personally am moving on; my work here is done!)


Matthew Hutchins