A Tika to ride; characterising web content with Nanite

This post covers two related topics: characterising web content with Nanite, and how I integrated the Tika parsers with Nanite.

Introducing Nanite

Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects:

  • Nanite-Core: an API for Droid   
  • Nanite-Hadoop: a MapReduce program for characterising web archives that makes use of Nanite-Core, Apache Tika and libmagic-jna-wrapper  (the last one here essentially being the *nix `file` tool wrapped for reuse in Java)

Nanite-Hadoop makes use of UK Web Archive Record Readers for Hadoop to process ARC and WARC files directly from HDFS, without an intermediate processing step.  The first part of a Nanite-Hadoop run is a test that checks the input files are valid gz files.  This is very quick (it takes seconds) and ensures that no invalid file can crash the format profiler after it has run for several hours.  More checks on the input files could potentially be added.
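As an illustration of what a quick gz validity test can look like, here is a minimal sketch using only the JDK.  It is not Nanite's actual check: the class and method names are hypothetical, and whether Nanite inspects only the header or reads further is an assumption on my part.  The header-only variant is the cheap one; the full-read variant catches truncated files but costs a complete decompression pass.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper, not part of Nanite: two levels of gz validation. */
final class GzipCheck {

    /** Cheap check: the GZIPInputStream constructor reads and validates the gzip header. */
    static boolean hasGzipHeader(InputStream in) {
        try (GZIPInputStream gz = new GZIPInputStream(in)) {
            return true;
        } catch (IOException e) {
            // Wrong magic bytes or a stream too short to contain a header.
            return false;
        }
    }

    /** Expensive check: decompress to the end so truncated or corrupt data is detected. */
    static boolean decompressesCleanly(InputStream in) {
        byte[] buffer = new byte[8192];
        try (GZIPInputStream gz = new GZIPInputStream(in)) {
            while (gz.read(buffer) != -1) {
                // Discard the data; only readability matters here.
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

In a Hadoop job the `InputStream` would come from the HDFS file system API rather than from a local file.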

We have been working on Nanite to add different characterisation libraries and to improve them and their coverage.  As the tools used are all Java, or use native library calls, Nanite-Hadoop is fast.  Retrieving a mimetype from Droid and Tika for all 93 million files in 1TB (compressed size) of WARC files took 17.5hrs on our Hadoop cluster.  This is less than 1ms/file (17.5hrs is 63,000 seconds, or roughly 0.68ms per file).  Libraries can be turned on/off relatively easily by editing the source or FormatProfiler.properties in the jar.

That time does not include any characterisation, so I began to add support for characterisation using Tika’s parsers.  The process I followed to add this characterisation is described below.

(Un)Intentionally stress testing Tika’s parsers

In hindsight, sending 93 million files harvested from the open web directly to Tika’s parsers and expecting everything to be ok was optimistic at best.  There were bound to be files in that corpus that were corrupt or otherwise broken and that would cause crashes in Tika or its dependencies.

Carnet let you do that; crashing/hanging the Hadoop JVM

Initially I began by using the Tika Parser interface directly.  This was ok until I noticed that some parsers (or their dependencies) were crashing or hanging.  As that was rather undesirable I began to disable the problematic parsers at runtime (with the aim of submitting bug reports back to Tika).  However, it soon became apparent that the files contained in the web archive were stressing the parsers to the point I would have had to disable ever increasing numbers of them.  This was really undesirable as the logic was handcrafted and relied on the state of the Tika parsers at that particular moment.  It also meant that the existence of one bad file of a particular format meant that no characterisation of that format could be carried out.  The logic to do this is still in the code, albeit not currently used.

Timing out Tika considered harmful; first steps

The next step was to error-proof the calls to Tika.  Firstly I ensured that any Exceptions/Errors/etc were caught.  Then I created a TimeoutParser that parsed the files in a background Thread and forcibly stopped the Tika parser once a time limit had been exceeded.  This worked ok; however, it made use of Thread.stop(), a deprecated API call for stopping a Java Thread.  Use of this API call is strongly discouraged, as it may corrupt the internal state of the JVM or produce other undesired effects.  Details about this can be read in an issue on the Tika bug tracker.  Since I did not want to risk corrupting the JVM I did not pursue this further.

I should note that it has subsequently been suggested that an alternative to using Thread.stop() is to simply abandon the hung Thread, leaving it for the JVM to deal with, and create a new Thread.  This is a valid way of dealing with the problem, given the numbers of files involved (see later), but I have not tested it.
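The "abandon the thread" alternative can be sketched with the standard java.util.concurrent API.  This is an untested illustration of the idea rather than the TimeoutParser code: the class name, method name and fallback-value convention are all assumptions.  The worker threads are daemon threads so that an abandoned, hung worker cannot keep the JVM alive, although it will still hold on to its memory and possibly a CPU core.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: run a task with a time limit, never calling Thread.stop(). */
final class TimeLimitedCall {

    // Cached pool: if a worker hangs forever, the next call simply gets a fresh thread.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(runnable -> {
        Thread worker = new Thread(runnable, "parser-worker");
        worker.setDaemon(true); // an abandoned worker must not block JVM shutdown
        return worker;
    });

    /** Returns the task's result, or the fallback if it times out or throws. */
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit, T fallback) {
        Future<T> future = POOL.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts; code that ignores interrupts keeps running.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The task itself threw (Exceptions and Errors both arrive wrapped here).
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }
}
```

A parse call would be wrapped as the `Callable`, with the fallback standing in for "no metadata extracted".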

The whole Tika, and nothing but the Tika; isolating the Tika process

Following a suggestion by a commenter in the Tika issue, linked above, I produced a library that abstracted a Tika-server as a separate operating system process, isolated from the main JVM: ProcessIsolatedTika.  This means that if Tika crashes it is the operating system’s responsibility to clean up the mess and it won’t affect the state of the main JVM.  The new library controls restarting the process after a crash, or after processing times out (in case of a hang).  An API similar to a normal Tika parser is provided so it can be easily reused.  Communication by the library with the Tika-server is via REST, over the loopback network interface.  There may be issues if more than BUFSIZE bytes (currently 20MB) are read, although such errors should be logged by Nanite in the Hadoop Reducer output.
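The restart behaviour described above can be sketched with java.lang.Process.  This is an illustration of the general pattern only, not the ProcessIsolatedTika source: the class, its methods and the 10 second grace period are hypothetical, and the command used to launch the server is left to the caller.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical supervisor: keeps one child process available, restarting it when needed. */
final class ChildProcessSupervisor {
    private final List<String> command;
    private Process child;
    private int restarts;

    ChildProcessSupervisor(List<String> command) {
        this.command = List.copyOf(command);
    }

    /** Starts the child on first use, and again if it has crashed or exited. */
    synchronized void ensureRunning() throws IOException {
        if (child != null && child.isAlive()) {
            return;
        }
        if (child != null) {
            restarts++; // a previous child existed, so this start is a restart
        }
        child = new ProcessBuilder(command)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
    }

    /** Kills the child (for example after a request timed out) and starts a new one. */
    synchronized void restart() throws IOException, InterruptedException {
        if (child != null) {
            child.destroyForcibly();
            // Give the operating system a moment to reap the process.
            child.waitFor(10, TimeUnit.SECONDS);
        }
        ensureRunning();
    }

    synchronized int restartCount() {
        return restarts;
    }
}
```

The caller would invoke `ensureRunning()` before each request and `restart()` whenever a request fails or exceeds its time limit; because the child is a separate operating system process, killing it cannot damage the state of the calling JVM.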

Although the main overhead of this approach is having a separate process and JVM per WARC file, that is mitigated somewhat by the length of time each process is used for.  Aside from the cost of transferring files to the Tika-server, the overhead is a larger jar file, a longer initial start-up time for Mappers and additional time for restarts of the Tika-server on failed files.  Given that the average runtime per WARC is slightly over 5 minutes, the few additional seconds needed for a process-isolated Tika are not a great deal extra.

The output from the Tika parsers is kept in a sequence file in HDFS (one per input (W)ARC) – i.e. 1000 WARCs == 1000 Tika parser sequence files.  This output is in addition to the output from the Reducer (mimetypes, server mimetypes and extension).

To help the Tika parsers with the file, Tika detect() is first run on the file and the resulting mimetype is passed to the parsers via an HTTP header.  A Metadata object cannot be passed to the parsers via REST as it would be if we called them directly from Java code.
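To make the header hand-off concrete, here is a minimal sketch of building such a request with the JDK's java.net.http client (Java 11 or later).  It is not the ProcessIsolatedTika code.  The `/meta` resource and the use of `Content-Type` as the type hint reflect my understanding of tika-server and should be checked against the server version in use; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical helper for talking to a Tika-server on the loopback interface. */
final class TikaServerRequest {

    /** Builds a PUT request carrying the file bytes and the previously detected mimetype. */
    static HttpRequest metadataRequest(URI serverBase, byte[] payload,
                                       String detectedMimeType, Duration timeout) {
        return HttpRequest.newBuilder(serverBase.resolve("/meta")) // assumed endpoint
                .timeout(timeout) // bounds the wait if the parser hangs
                .header("Content-Type", detectedMimeType) // hint from detect()
                .PUT(HttpRequest.BodyPublishers.ofByteArray(payload))
                .build();
    }

    /** Sends the request and returns the response body, failing on any non-200 status. */
    static String send(HttpClient client, HttpRequest request)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

A request timeout surfaces as `java.net.http.HttpTimeoutException`, a subclass of `IOException`, which is the natural trigger for restarting the server process.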

Another approach could have been to use Nailgun as described by Ross Spencer in a previous blog post here.  I did not take that approach as I did not want to set up a Nailgun server on each Hadoop node (we have 28 of them) and if a Tika parser crashed or caused the JVM to hang then it may corrupt the state of the Nailgun JVM in a similar way to the TimeoutParser above.  Finally, with my current test data each node handles ~3m files – much more than the 420k calls that caused Nailgun to run out of heap space in Ross’ experiment.

Express Tika; initial benchmarks

I ran some initial benchmarks on 1000 WARC files using our test Hadoop cluster (28 nodes with 1 cpu/map slot per node).  The results are as follows:

  • Identification tools used: Nanite-core (Droid); Tika detect() (mimetype only); ProcessIsolatedTika parsers
  • WARC files: 1000
  • Total WARC size: 59.4GB (63,759,574,081 bytes)
  • Total files in WARCs (# input records): [value missing]
  • Runtime (hh:mm:ss): [value missing]
  • Total Tika parser output size (compressed): 765MB (801,740,734 bytes)
  • Tika parser failures/crashes: [value missing]
  • Misc failures: Malformed records: 122; IOExceptions*: 3224; Other Exceptions: 430; Total: 3776

*This may be due to files being larger than the buffer – to be investigated.

The output has not been fully verified but should give an initial indication of speed.

Conceivably the information from the Tika parsers could be loaded into c3po but I have not looked into that.

Conclusion; if the process isolation FITS, where is it?

We are now able to use Tika parsers for characterisation without being concerned about crashes in Tika.  This research will also allow us to identify files that Tika’s parsers cannot handle so we can submit bug reports/patches back to Tika.  When Tika 1.6 comes out it will include detailed PDF version detection within the PDF parser.

As an aside: if FITS offered a REST interface then the ProcessIsolatedTika code could easily be modified to replace Tika with FITS.  This is worth considering, if there were interest and someone were to create such a REST interface.

Apologies for the puns.
