Tika File Mime Type Identification and the Importance of Metadata
An evaluation was recently carried out to determine how well Apache Tika was able to identify the mime types of a corpus of test files, described in the ‘Data Set’ section. The purpose of the evaluation was to determine:
1. whether the performance* of Tika has changed between version 1.0 and the current version, 1.3, and
2. how the provision of metadata, in the form of the file name, affects the performance of Tika.
In order to address the first point, the evaluation was carried out four times, once for each of the four available versions of Tika: 1.0, 1.1, 1.2 and 1.3.
The second point was addressed by running the evaluation twice for each version of Tika: the first test passed only a file input stream to the Tika ‘detect’ method; the second test passed both the file input stream and the file name.
In total, eight tests were carried out; the results are shown in the Results section below.
* For the purposes of this evaluation, the performance of Tika is measured by the number of file mime types identified correctly when compared against a ground truth, described in the ‘Data Set’ section.
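For concreteness, the metric can be sketched in a few lines of Python (the counts used here are the Test 1 figures reported in Table 1 below):

```python
# Correctness metric used throughout this evaluation: the percentage of
# files whose Tika-detected mime type matches the ground truth.
def correctness(identified_correctly, files_processed):
    return round(100.0 * identified_correctly / files_processed, 3)

# Test 1 (Tika 1.0, input stream only): 757326 of 973693 files correct.
print(correctness(757326, 973693))  # 77.779
```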
Data Set
The set of test files consists of the Govdocs corpus of almost 1 million files, freely available from http://digitalcorpora.org/corpora/files. The ground truth for these files has been provided by Forensic Innovations, Inc., available from http://digitalcorpora.org/corp/files/govdocs1/groundtruth-fitools.zip.
Platform
The evaluation was run as a Cloudera Hadoop map/reduce process on an HP ProLiant DL 385p Gen8 host with 32 CPUs, 224 GB of RAM and a clock rate of 2.295 GHz, using ESXi to run 32 virtual machines. The Hadoop configuration is a Cloudera (cdh4.2.0) Hadoop 30-node cluster consisting of a manager node, a master node and 28 slave nodes, located at the British Library in the UK. Each node runs on its own virtual machine with 1 core, 500 GB of storage and 6 GB of RAM.
Results
In total the evaluator process was run eight times on the Govdocs corpus.
The table below shows the number of files processed by Apache Tika, the number correctly identified, the number incorrectly identified, and the percentage identified correctly. The ‘Filename Used?’ column indicates whether the Tika detect method was passed only the file input stream (‘N’) or passed both the file input stream and the file name (‘Y’).
Test | Tika Version | Files Processed | Files Identified Correctly | Files Identified Incorrectly | Files Correctly Identified (%) | Filename Used?
---- | ------------ | --------------- | -------------------------- | ---------------------------- | ------------------------------ | --------------
1 | 1.0 | 973693 | 757326 | 216367 | 77.779 | N
2 | 1.1 | 973693 | 757240 | 216453 | 77.770 | N
3 | 1.2 | 973693 | 758549 | 215144 | 77.904 | N
4 | 1.3 | 973693 | 758557 | 215136 | 77.905 | N
5 | 1.0 | 973693 | 945555 | 28138 | 97.110 | Y
6 | 1.1 | 973693 | 945516 | 28177 | 97.106 | Y
7 | 1.2 | 973693 | 938138 | 35555 | 96.348 | Y
8 | 1.3 | 973693 | 938148 | 35545 | 96.349 | Y
Table 1 – File mime types identified correctly/incorrectly by Apache Tika
Observations
The results in Table 1 show that, when used with a file input stream only, the performance of Tika improves slightly between versions 1.0 and 1.3. However, when Tika is used with both a file name and a file input stream the performance degrades between versions 1.0 and 1.3.
Further investigation shows that the files that were identified correctly by Tika version 1.0 but incorrectly by version 1.3 were of the following types:
Tika v1.3 Mime Type | Number of Files
------------------- | ---------------
application/msword | 1
application/octet-stream | 61
application/rss+xml | 4
application/x-tika-msworks-spreadsheet | 2
application/zip | 2
message/x-emlx | 10
text/plain | 6
text/x-log | 8107
text/x-matlab | 2
text/x-perl | 2
text/x-python | 4
text/x-sql | 295
Table 2 – Number of files identified correctly in version 1.0 but incorrectly in version 1.3
Further investigation into the files identified by Tika 1.3 as ‘text/x-log’ shows that these are text files with a file extension of ‘.log’. These files were identified by Tika versions 1.0 and 1.1 as having a mime type of ‘text/plain’, which matches the ground truth mime type. Similarly, Tika versions 1.2 and 1.3, when used with just an input stream, also identified these files as ‘text/plain’, again matching the ground truth.
However, when Tika versions 1.2 and 1.3 were provided with the filename, they identified .log files as having a mime type of ‘text/x-log’. As the ‘text/plain’ group of files encompasses a large and diverse set of file types, including logs, source code, properties/config files, data files, etc., this could be considered an improvement, as it provides greater differentiation between the different file types.
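One way to treat such specialisms as correct during scoring is to walk a parent map of mime types before declaring a mismatch. The sketch below uses a small hand-built map covering only the types discussed here (an assumption for illustration; Tika’s own MediaTypeRegistry holds the real hierarchy):

```python
# Hand-built parent map: each mime type points to its more general parent.
# This is an illustrative assumption, not Tika's actual registry.
PARENTS = {
    "text/x-log": "text/plain",
    "text/plain": "application/octet-stream",
}

def matches(detected, ground_truth):
    """True if the detected type equals the ground truth, or is a
    specialism of it according to the parent map."""
    t = detected
    while t is not None:
        if t == ground_truth:
            return True
        t = PARENTS.get(t)
    return False

print(matches("text/x-log", "text/plain"))  # True: a specialism counts
print(matches("text/plain", "text/x-log"))  # False: generalisation does not
```

Under such a scheme the 8107 ‘text/x-log’ results would count as correct rather than incorrect.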
Possible Future Work
The results of the tests show that Apache Tika relies heavily on the filename when carrying out file identification. In the future, this work could be extended to investigate how easily Tika can be fooled into identifying a file wrongly when provided with an incorrect or misleading file extension as part of the filename.
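As a starting point for such an investigation, a corpus of deliberately mislabelled files could be generated by copying test files under a wrong extension. The sketch below uses only the standard library; the function name and the choice of ‘.pdf’ as the misleading extension are illustrative assumptions:

```python
import shutil
from pathlib import Path

def make_misleading_copies(src_dir, dest_dir, wrong_ext=".pdf"):
    """Copy every file in src_dir into dest_dir under a misleading
    extension, returning a mapping back to the original names so the
    detection results can be scored later."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    mapping = {}
    for src in Path(src_dir).iterdir():
        if src.is_file():
            target = dest / (src.stem + wrong_ext)
            shutil.copyfile(src, target)
            mapping[target.name] = src.name
    return mapping
```

Running Tika’s detect over the copies, with and without the (misleading) filename, and scoring against the mapping would then quantify how much a wrong extension degrades identification.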
andy jackson
May 23, 2013 @ 10:45 am CEST
Wonderful. Thank you.
marwoodls
May 23, 2013 @ 10:23 am CEST
The raw results can be found at:
https://github.com/openplanets/cc-benchmark-tests
These are in the form of CSV files which show, for each of the test files in the Govdocs1 corpus, the mime type as identified by Tika, and the mime type(s) according to the ground truth.
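For anyone wanting to re-tally those CSVs, a minimal sketch with the standard library (the column names ‘tika_mime’ and ‘groundtruth_mime’ are assumptions here, not necessarily the repository’s actual headers; the ground truth column is assumed to hold one or more ‘;’-separated types):

```python
import csv

def tally(csv_path):
    """Count agreements between Tika's detected mime type and the
    ground truth column(s) in a results CSV."""
    correct = incorrect = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["tika_mime"] in row["groundtruth_mime"].split(";"):
                correct += 1
            else:
                incorrect += 1
    return correct, incorrect
```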
andy jackson
May 23, 2013 @ 10:04 am CEST
It would be really great to have the detailed results, so that the anomalies can be examined and the tools can be improved. Is there any chance of publishing the raw data, if only for the 'disagreements', for each Tika version?
Note also that, from my understanding of the source code, Tika uses the file extensions but the extension-based result is always overridden by a 'magic number' or container match. i.e. a wrong extension should only confuse it when the extension is all it has to go on. However, as you say, having a corpus-based test for this would be a great idea.
andy jackson
May 23, 2013 @ 9:57 am CEST
Yes, agree very much with you Peter. The text/x-log case is an improvement, in my opinion, and I feel a hierarchy is natural. For example, a text/x-log can also be interpreted as a text/plain, which can also be interpreted as application/octet-stream.
If we can agree on this approach, I think we should look at these anomalies and consider revising the 'ground truth' accordingly.
pmay
May 23, 2013 @ 9:37 am CEST
My feeling is that this says much more about the ground truths and the methodology we use to measure accuracy.
Assuming "text/x-log" is a specialism of "text/plain", and as such could be considered "correct", then Tika 1.2 using filenames achieves a correctness of 97.181% ((938138+8107)/973693), a very slight improvement on Tika 1.0's 97.110%. Other specialisms would show further improvement in the performance as measured by "correctness" accuracy.
Perhaps what is needed is more reliance on a hierarchy of mime-types to enable hierarchical groundtruth matching – fuzzy matching, if you will?
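That recalculation checks out:

```python
# Counting the 8107 'text/x-log' files as correct for Tika 1.2 with filenames:
adjusted = (938138 + 8107) / 973693 * 100
print(round(adjusted, 3))  # 97.181
```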