Apache Tika’s Regression Corpus (TIKA-1302)

BACKGROUND

Nearly two and a half years ago, with TIKA-1302, I started an effort to help improve the robustness of Apache Tika™.  Apache Tika™ is an umbrella/wrapper project that “detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).”

I documented some of the early work in an ApacheCon presentation in 2015, but I thought it would be useful to blog about work since then and invite OPF to collaborate in a few areas.

Before I start, I should say that some of this work was motivated by Peter May’s Spruce Mashup, William Palmer’s Tika to Ride and other work carried out by OPF (see the ApacheCon slides).  The straw that broke the camel’s back was an issue opened by Luis Filipe Nassif: an “upgrade” in Apache PDFBox caused an exception in roughly 20% of the PDF files in his test corpus.  This, together with the work of OPF, brought home how much we needed a shared, public, large-scale regression corpus to run our code against before we release new versions.

Thanks to the generosity of Rackspace for hosting the VM, and thanks to contributions from Julien Nioche, Chris Mattmann and Dominik Stadler, we now have ~3 million files comprising ~1TB in our regression corpus.  The intro to the site is available here.  Thanks, too, of course, go to govdocs1 and Common Crawl, which form the basis of the corpus.  Through this effort:

  • We have identified serious issues in Tika and its dependencies before releases.  Apache POI (thanks to Dominik Stadler) now regularly runs against its own regression corpus, and I’ve been supporting Apache PDFBox with this corpus, e.g. PDFBOX-3058.  Through collaboration with Tilman Hausherr and others on Apache PDFBox, I’ve made some critical improvements to our (still nascent) evaluation/comparison metrics.  For an example of the current reports, see this.
  • We added tika-batch, a capability now available in tika-app for running Tika robustly, in many threads, from one file share to another.
  • We integrated Jukka Zitting’s and Nick Burch’s code to maintain the metadata of embedded documents via the RecursiveParserWrapper, the -J option in tika-app, or the /rmeta endpoint in tika-server (a Java sketch of this follows below).
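
For those who want to experiment with the recursive metadata handling from Java, here is a minimal sketch.  It assumes the Tika 1.x RecursiveParserWrapper API (a constructor that takes a Parser and a ContentHandlerFactory); signatures may differ in other versions, and the file name is only a placeholder.

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.RecursiveParserWrapper;
    import org.apache.tika.sax.BasicContentHandlerFactory;
    import org.xml.sax.helpers.DefaultHandler;

    public class RecursiveMetadataSketch {
        public static void main(String[] args) throws Exception {
            // Wrap AutoDetectParser so that text and metadata are captured
            // for the container file and for every embedded document.
            RecursiveParserWrapper wrapper = new RecursiveParserWrapper(
                    new AutoDetectParser(),
                    new BasicContentHandlerFactory(
                            BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));

            Metadata metadata = new Metadata();
            try (InputStream is = Files.newInputStream(Paths.get("example.docx"))) {
                wrapper.parse(is, new DefaultHandler(), metadata, new ParseContext());
            }

            // One Metadata object per document: the container first,
            // then each embedded file.
            List<Metadata> metadataList = wrapper.getMetadata();
            for (Metadata m : metadataList) {
                System.out.println(m);
            }
        }
    }

The JSON returned by tika-app’s -J option and tika-server’s /rmeta endpoint is essentially the serialized form of this same list of Metadata objects.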

I am very proud of what we, as a community, have been able to accomplish by this effort.  However, there’s still more we can do.

HELP WANTED

While OPF helped motivate some of this work, and while some members (esp. Andy Jackson and William Palmer) collaborate with Tika via our JIRA, there are many ways in which we can enrich our collaboration.

  • Evaluation code — I have some draft code.  Any feedback on presentation/metrics for profiling the output of a text/metadata extractor or for comparing the output of two text/metadata extractors would be welcome.  Again, this is the current state of the reports.  I plan to integrate this code into a new module for Tika (tika-eval) at some point.  A toy sketch of one possible comparison metric appears after this list.
  • File identification — in the spring, I published a comparison of DROID, ‘file’ and Apache Tika against our regression corpus.  This allowed us to see where the tools conflicted and to identify file types missing from Tika: if Tika said ‘octet-stream’, what did DROID say?  We added a few handfuls of MIME types to our detection definitions via this method.  Can we make the comparison data more useful to support PRONOM in adding/updating its definitions?  A sketch of the Tika side of such a comparison also appears after this list.
  • Corpus 2.0 development — the current version of the corpus has some areas for improvement.  If we’re limited to a single server, I’ve found that 3 million files/1TB is a manageable size; to go beyond that, we’d want to head to Hadoop/Spark.  If anyone would like to chip in on the design or selection procedures, that’d be great!  Ideally:
    • We’d like to be able to gather files from Common Crawl with heavy oversampling of binary formats and non-English language docs (if I see one more UTF-8 web site in English…:) ).
    • We’d like an ingest workflow that can identify truncated documents (Common Crawl truncates at 1MB) and re-pull the original document (if still available).  I want to keep the truncated docs, but also add the full docs back in.
    • I want to store the HTTP header info, as well as the MIME types identified by Tika, DROID and ‘file’, in a database.
    • How can we make this corpus useful to OPF?
    • Many other areas for improvement.  Please help…
  • Other ideas for collaboration?
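
To make the “comparing two extractors” idea a bit more concrete, here is a toy sketch of one possible metric: the fraction of one tool’s unique tokens that also appear in the other tool’s output.  This is not the actual evaluation code (which does far more); the class and method names are purely illustrative.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Locale;
    import java.util.Set;

    public class ToyExtractComparison {
        // Crude whitespace tokenization; real evaluation code would normalize
        // tokens far more carefully.
        static Set<String> tokens(String text) {
            return new HashSet<>(Arrays.asList(
                    text.toLowerCase(Locale.ROOT).split("\\s+")));
        }

        // Fraction of tool A's unique tokens that also appear in tool B's output.
        static double overlap(String extractA, String extractB) {
            Set<String> a = tokens(extractA);
            Set<String> b = tokens(extractB);
            if (a.isEmpty()) {
                return 0.0;
            }
            long shared = a.stream().filter(b::contains).count();
            return (double) shared / a.size();
        }
    }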
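
For the file identification comparison, the Tika side can be as simple as running Tika’s default Detector over each file and recording the answer; the files that come back as application/octet-stream are the ones worth cross-checking against DROID and ‘file’.  A rough sketch along those lines (class name and argument handling are only illustrative):

    import java.io.InputStream;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    import org.apache.tika.config.TikaConfig;
    import org.apache.tika.detect.Detector;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;

    public class DetectOneFile {
        public static void main(String[] args) throws Exception {
            Detector detector = TikaConfig.getDefaultConfig().getDetector();
            Path path = Paths.get(args[0]);

            Metadata metadata = new Metadata();
            // Passing the file name lets Tika fall back on extension-based hints.
            metadata.set(Metadata.RESOURCE_NAME_KEY, path.getFileName().toString());

            MediaType type;
            try (InputStream is = TikaInputStream.get(path)) {
                type = detector.detect(is, metadata);
            }

            // Files that Tika can only call application/octet-stream are the
            // ones to compare against DROID's and 'file's identifications.
            System.out.println(path + "\t" + type);
        }
    }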

Thank you, all, again.

Cheers,

Tim