Running Apache Tika over ARC files using Apache Hadoop

In the context of the SCAPE project, we have recently been running a series of experiments on identifying the content files inside ARC.GZ web archive containers. Why? Because you will presumably want to know which file formats are stored in your archive containers and how many files there are of each type.

The second reason is that if you work with web archive containers, you will most likely have to analyze a huge number of them. And this is where a server cluster based on Apache Hadoop comes in.

So the main question for us was: How much ARC.GZ web archive data will we be able to identify per minute on an experimental cluster using Apache Tika as the file identification tool?

 

Purpose of the experiments
The experiments have been run on a development cluster in pseudo-distributed mode and on a “real” cluster (distributed mode) to find out:

  • how long it takes to process ARC.GZ files with a native map/reduce program using the Apache Tika detector API on Apache Hadoop
  • how performance is influenced by the amount of data

ARC / Apache Tika map/reduce application
The map/reduce application written for the tests consists of two parts. The first is a custom ARC record reader that lets Hadoop natively read the ARC web archive containers, record by record; a record is one object stored in the ARC file, e.g. an image, a text or another file. The second part uses the Apache Tika 1.0 API to detect the MIME type of each record. The output of the application is a table listing all detected MIME types together with a count of how often each one was detected, which gives you an overview of what is inside your archives. The whole thing has been implemented as a native map/reduce application.
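To make this more concrete, here is a minimal sketch of how such a job can be wired together: a mapper that runs Tika's detection on each ARC record payload and emits (MIME type, 1) pairs, and a reducer that sums the counts per type. The ArcInputFormat named below stands in for the custom ARC record reader described above and is assumed rather than shown, as is the choice of the record URL as map input key; the Tika and Hadoop calls are standard API usage, but treat this as an illustration, not the exact code used in the experiments.

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.tika.Tika;

public class ArcTikaIdentify {

    // Mapper: called once per ARC record; the (hypothetical) ArcInputFormat is
    // assumed to deliver the record URL as key and the record payload as value.
    public static class DetectMapper
            extends Mapper<Text, BytesWritable, Text, LongWritable> {

        private static final LongWritable ONE = new LongWritable(1);
        private final Tika tika = new Tika();

        @Override
        protected void map(Text recordUrl, BytesWritable payload, Context context)
                throws IOException, InterruptedException {
            // Tika inspects the leading bytes of the payload to detect the MIME type.
            byte[] bytes = Arrays.copyOf(payload.getBytes(), payload.getLength());
            String mimeType = tika.detect(bytes);
            context.write(new Text(mimeType), ONE);
        }
    }

    // Reducer (also used as combiner): sums the counts per detected MIME type.
    public static class CountReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text mimeType, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable count : counts) {
                sum += count.get();
            }
            context.write(mimeType, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "arc-tika-identify");
        job.setJarByClass(ArcTikaIdentify.class);
        // ArcInputFormat is a placeholder for the custom ARC record reader
        // described above; it splits ARC.GZ containers into individual records.
        job.setInputFormatClass(ArcInputFormat.class);
        job.setMapperClass(DetectMapper.class);
        job.setCombinerClass(CountReducer.class);
        job.setReducerClass(CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With the reducer also registered as a combiner, most of the counting already happens on the map side, so the job output is exactly the MIME type / count table described above.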

Running experiments on the “real” cluster
The “real” cluster is an experimental cluster consisting of 6 dedicated servers:

NETWORK infrastructure:
The CONTROLLER and the NODEs are connected to the internal network through a GBit high-performance network switch that guarantees the full GBit bandwidth on each port.

1 xCONTROLLER

  • CPU: 2 x 2.40GHz Quadcore CPU (16 HT cores)
  • RAM: 24GB
  • NIC: 2 x GBit Ethernet (1 used)
  • DISK: 3 x 1TB DISKs; RAID5 (redundancy) => 2TB effective disk space

5 x NODE

  • CPU: 1 x 2.53GHz Quadcore CPU (8 HT cores)
  • RAM: 16GB
  • NIC: 2 x GBit Ethernet (1 used)
  • DISK: 2 x 1TB DISKs; JBOD (performance) => 2TB effective disk space

 

Results from the experiments on the described cluster:

amount of data    throughput (GB/min)
10 GB             11.76
100 GB            16.17
1 TB              18.01

Our conclusion:
The results gave us a good impression of the performance that can be expected when identifying web content in a map/reduce environment. Very roughly speaking: the more data we process, the less the Hadoop task-startup overhead weighs in. During the experimentation phase we measured a throughput of up to 18 GB of web archive data per minute, which is much better than we had expected.

Related topic:
If you are interested in what an identification result can look like, please continue reading: http://www.openpreservation.org/blogs/2012-08-17-analysing-formats-uk-web-archive


1 Comment

  1. ravikchikkabd
    May 30, 2017 @ 2:47 pm CEST

    Hello sir, Is it possible to share the code for this experiment, please?
    Thanks,
    Ravi
