Tika File Mime Type Identification and the Importance of Metadata
An evaluation was recently carried out to determine how well Apache Tika was able to identify the mime types of a corpus of test files, described in the ‘Data Set’ section. The purpose of the evaluation was to determine:
1. whether the performance* of Tika has changed between version 1.0 and the current version, 1.3, and
2. how the provision of metadata, in the form of the file name, affects the performance of Tika.
In order to address the first point, the evaluation was carried out four times, once for each of the four available versions of Tika: 1.0, 1.1, 1.2 and 1.3.
The second point was addressed by running the evaluation twice for each version of Tika: the first test passed only a file input stream to the Tika ‘detect’ method; the second test passed both the file input stream and the file name.
In total, eight tests were carried out; the results are shown in the Results section below.
* For the purposes of this evaluation, the performance of Tika is measured by the number of file mime types identified correctly when compared against a ground truth, described in the ‘Data Set’ section.
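For concreteness, the metric can be sketched in a few lines of Python (the counts used here are the Test 1 figures reported in Table 1 below):

```python
# Correctness metric used throughout this evaluation: the percentage of
# files whose Tika-detected mime type matches the ground truth.
def correctness(identified_correctly, files_processed):
    return round(100.0 * identified_correctly / files_processed, 3)

# Test 1 (Tika 1.0, input stream only): 757326 of 973693 files correct.
print(correctness(757326, 973693))  # 77.779
```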
Data Set
The set of test files consists of the Govdocs corpus of almost 1 million files, freely available from http://digitalcorpora.org/corpora/files. The ground truth for these files has been provided by Forensic Innovations, Inc., available from http://digitalcorpora.org/corp/files/govdocs1/groundtruth-fitools.zip.
Platform
The evaluation was run as a Cloudera Hadoop map/reduce process on an HP ProLiant DL 385p Gen8 host with 32 CPUs, 224 GB of RAM and a clock rate of 2.295 GHz, using ESXi to run 32 virtual machines. The Hadoop configuration is a Cloudera (cdh4.2.0) Hadoop 30-node cluster consisting of a manager node, a master node and 28 slave nodes, located at the British Library in the UK. Each node runs on its own virtual machine with 1 core, 500 GB of storage and 6 GB of RAM.
Results
In total the evaluator process was run eight times on the Govdocs corpus.
The table below shows the number of files processed by Apache Tika, the number correctly identified, the number incorrectly identified, and the percentage identified correctly. The ‘Filename Used?’ column indicates whether the Tika detect method was passed only the file input stream (‘N’) or passed both the file input stream and the file name (‘Y’).
Test | Tika Version | Files Processed | Files Identified Correctly | Files Identified Incorrectly | Files Correctly Identified (%) | Filename Used?
---- | ------------ | --------------- | -------------------------- | ---------------------------- | ------------------------------ | --------------
1 | 1.0 | 973693 | 757326 | 216367 | 77.779 | N
2 | 1.1 | 973693 | 757240 | 216453 | 77.770 | N
3 | 1.2 | 973693 | 758549 | 215144 | 77.904 | N
4 | 1.3 | 973693 | 758557 | 215136 | 77.905 | N
5 | 1.0 | 973693 | 945555 | 28138 | 97.110 | Y
6 | 1.1 | 973693 | 945516 | 28177 | 97.106 | Y
7 | 1.2 | 973693 | 938138 | 35555 | 96.348 | Y
8 | 1.3 | 973693 | 938148 | 35545 | 96.349 | Y
Table 1 – File mime types identified correctly/incorrectly by Apache Tika
Observations
The results in Table 1 show that, when used with a file input stream only, the performance of Tika improves slightly between versions 1.0 and 1.3. However, when Tika is used with both a file name and a file input stream the performance degrades between versions 1.0 and 1.3.
Further investigation shows that the files that were identified correctly by Tika version 1.0 but incorrectly by version 1.3 were of the following types:
Tika v1.3 Mime Type | Number of Files
------------------- | ---------------
application/msword | 1
application/octet-stream | 61
application/rss+xml | 4
application/x-tika-msworks-spreadsheet | 2
application/zip | 2
message/x-emlx | 10
text/plain | 6
text/x-log | 8107
text/x-matlab | 2
text/x-perl | 2
text/x-python | 4
text/x-sql | 295
Table 2 – Number of files identified correctly in version 1.0 but incorrectly in version 1.3
Further investigation into the files identified by Tika 1.3 as ‘text/x-log’ shows that these are text files with a file extension of ‘.log’. These files were identified by Tika versions 1.0 and 1.1 as having a mime type of ‘text/plain’, which matches the ground truth mime type. Similarly, Tika versions 1.2 and 1.3, when used with just an input stream, also identified these files as ‘text/plain’, again matching the ground truth.
However, when Tika versions 1.2 and 1.3 were provided with the filename, they identified .log files as having a mime type of ‘text/x-log’. As the ‘text/plain’ group of files encompasses a large and diverse set of file types, including logs, source code, properties/config files, data files, etc., this could be considered an improvement, as it provides greater differentiation between the different file types.
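One way to treat such specialisms as correct during scoring is to walk a parent map of mime types before declaring a mismatch. The sketch below uses a small hand-built map covering only the types discussed here (an assumption for illustration; Tika’s own MediaTypeRegistry holds the real hierarchy):

```python
# Hand-built parent map: each mime type points to its more general parent.
# This is an illustrative assumption, not Tika's actual registry.
PARENTS = {
    "text/x-log": "text/plain",
    "text/plain": "application/octet-stream",
}

def matches(detected, ground_truth):
    """True if the detected type equals the ground truth, or is a
    specialism of it according to the parent map."""
    t = detected
    while t is not None:
        if t == ground_truth:
            return True
        t = PARENTS.get(t)
    return False

print(matches("text/x-log", "text/plain"))  # True: a specialism counts
print(matches("text/plain", "text/x-log"))  # False: generalisation does not
```

Under such a scheme the 8107 ‘text/x-log’ results would count as correct rather than incorrect.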
Possible Future Work
The results of the tests show that Apache Tika relies heavily on the filename when carrying out file identification. In the future, this work could be extended to investigate how easily Tika can be fooled into identifying a file wrongly when provided with an incorrect or misleading file extension as part of the filename.
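As a starting point for such an investigation, a corpus of deliberately mislabelled files could be generated by copying test files under a wrong extension. The sketch below uses only the standard library; the function name and the choice of ‘.pdf’ as the misleading extension are illustrative assumptions:

```python
import shutil
from pathlib import Path

def make_misleading_copies(src_dir, dest_dir, wrong_ext=".pdf"):
    """Copy every file in src_dir into dest_dir under a misleading
    extension, returning a mapping back to the original names so the
    detection results can be scored later."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    mapping = {}
    for src in Path(src_dir).iterdir():
        if src.is_file():
            target = dest / (src.stem + wrong_ext)
            shutil.copyfile(src, target)
            mapping[target.name] = src.name
    return mapping
```

Running Tika’s detect over the copies, with and without the (misleading) filename, and scoring against the mapping would then quantify how much a wrong extension degrades identification.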
andy jackson
May 23, 2013 @ 10:45 am CEST
Wonderful. Thank you.
marwoodls
May 23, 2013 @ 10:23 am CEST
The raw results can be found at:
https://github.com/openplanets/cc-benchmark-tests
These are in the form of CSV files which show, for each of the test files in the Govdocs1 corpus, the mime type as identified by Tika, and the mime type(s) according to the ground truth.
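For anyone wanting to re-tally those CSVs, a minimal sketch with the standard library (the column names ‘tika_mime’ and ‘groundtruth_mime’ are assumptions here, not necessarily the repository’s actual headers; the ground truth column is assumed to hold one or more ‘;’-separated types):

```python
import csv

def tally(csv_path):
    """Count agreements between Tika's detected mime type and the
    ground truth column(s) in a results CSV."""
    correct = incorrect = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["tika_mime"] in row["groundtruth_mime"].split(";"):
                correct += 1
            else:
                incorrect += 1
    return correct, incorrect
```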
andy jackson
May 23, 2013 @ 10:04 am CEST
It would be really great to have the detailed results, so that the anomalies can be examined and the tools can be improved. Is there any chance of publishing the raw data, if only for the 'disagreements', for each Tika version?
Note also that, from my understanding of the source code, Tika uses the file extensions but the extension-based result is always overridden by a 'magic number' or container match. i.e. a wrong extension should only confuse it when the extension is all it has to go on. However, as you say, having a corpus-based test for this would be a great idea.
andy jackson
May 23, 2013 @ 9:57 am CEST
Yes, agree very much with you Peter. The text/x-log case is an improvement, in my opinion, and I feel a hierarchy is natural. For example, a text/x-log can also be interpreted as a text/plain, which can also be interpreted as application/octet-stream.
If we can agree on this approach, I think we should look at these anomalies and consider revising the 'ground truth' accordingly.
pmay
May 23, 2013 @ 9:37 am CEST
My feeling is that this says much more about the ground truths and the methodology we use to measure accuracy.
Assuming "text/x-log" is a specialism of "text/plain", and as such could be considered "correct", then Tika 1.2 using filenames achieves a correctness of 97.181% ((938138+8107)/973693), a very slight improvement on Tika 1.0's 97.110%. Other specialisms would show further improvement in the performance as measured by "correctness" accuracy.
Perhaps what is needed is more reliance on a hierarchy of mime-types to enable hierarchical groundtruth matching – fuzzy matching, if you will?
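That recalculation checks out:

```python
# Counting the 8107 'text/x-log' files as correct for Tika 1.2 with filenames:
adjusted = (938138 + 8107) / 973693 * 100
print(round(adjusted, 3))  # 97.181
```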