Evaluation of identification tools: first results from SCAPE

As I briefly mentioned in a previous blog post, one of the objectives of the SCAPE project is to develop an architecture that will enable large-scale characterisation of digital file objects. As a first step, we are evaluating existing characterisation tools. The overall aim of this work is twofold. First, we want to establish which tools are suitable candidates for inclusion in the SCAPE architecture. Second, since the enhancement of existing tools is another goal of SCAPE, we want to get a better idea of the specific strengths and weaknesses of each individual tool. This will help us decide which modifications and improvements are needed. Many of these tools are also widely used outside the SCAPE project, so the results will most likely be relevant to a wider audience (including the original tool developers).

Evaluation of identification tools

Over the last few months, this work has focused on format identification tools. The result is a report, which is attached to this blog post. We have evaluated the following tools:

All tools were evaluated against a set of 22 criteria. Extensive testing using real data has been a key part of the work. One area that, I think, we haven’t been able to tackle sufficiently so far is the accuracy of the tools. Measuring accuracy is difficult, because it requires a test corpus in which the format of each file object is known a priori. In most large data sets this information will have been derived from the very same tools that we are trying to test, so we need to see whether we can say anything meaningful about accuracy in a follow-up.
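To make the accuracy problem more concrete, here is a minimal Python sketch of how identification results could be scored once a corpus with independently known formats is available. Everything in it is an assumption for illustration only: the function name accuracy_by_format, the use of MIME-type strings as format identifiers, and the three sample files are hypothetical and are not taken from the report.

```python
from collections import defaultdict


def accuracy_by_format(ground_truth, tool_results):
    """Return {format: (correct, total)} for files whose true format is known.

    ground_truth: dict mapping file name -> known format identifier
    tool_results: dict mapping file name -> identifier reported by the tool,
                  or None when the tool reported no match
    """
    counts = defaultdict(lambda: [0, 0])  # format -> [correct, total]
    for name, true_format in ground_truth.items():
        counts[true_format][1] += 1
        # A missing or None result never equals a known format,
        # so it is counted as a failed identification.
        if tool_results.get(name) == true_format:
            counts[true_format][0] += 1
    return {fmt: (correct, total) for fmt, (correct, total) in counts.items()}


# Hypothetical, hand-labelled sample; NOT data from the report.
ground_truth = {
    "a.pdf": "application/pdf",
    "b.pdf": "application/pdf",
    "c.txt": "text/plain",
}
tool_results = {
    "a.pdf": "application/pdf",
    "b.pdf": "application/postscript",  # wrong identification
    "c.txt": None,                      # tool reported no match
}

scores = accuracy_by_format(ground_truth, tool_results)
for fmt, (correct, total) in sorted(scores.items()):
    print(f"{fmt}: {correct}/{total} correct")
```

The scoring itself is trivial; the hard part is the ground_truth mapping. If it is produced by one of the tools under test, the score only measures how well the tools agree with each other, not how accurate they are.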

Involvement of tool developers

Over the past few months we’ve been sending earlier drafts of this document to the developers of DROID, FIDO, FITS and JHOVE2, and we have received a lot of feedback on them. In the case of FIDO, a new version is under way that should correct most (if not all) of the problems mentioned in the report. For the other tools we have also received confirmation that some of the issues we found will be fixed in upcoming releases.

Status of the report and future work

The attached report should be seen as a living document. There will probably be one or more updates at a later point, and we may decide to include more tests using additional data. Meanwhile, as always, we appreciate any feedback you may have!

Link to report

Evaluation of characterisation tools – Part 1: Identification

Johan van der Knijff

KB / National Library of the Netherlands
