Running Apache Tika over ARC files using Apache Hadoop

In the context of the SCAPE project, we have recently been running a series of experiments on identifying the content files inside ARC.GZ web archive containers. Why? Because you will presumably want to know which file formats are stored in your archive containers and how many files there are of each type.

The second reason is that if you work with web archive containers, you will most likely have to analyze a huge number of them. And this is where a server cluster based on Apache Hadoop comes in.

So the main question for us was: How much ARC.GZ web archive data will we be able to identify per minute on an experimental cluster using Apache Tika as the file identification tool?

 

Purpose of the experiments
The experiments have been run on a development cluster in pseudo-distributed mode and on a “real” cluster (distributed mode) to find out:

  • how long it takes to process ARC.GZ files with a native map/reduce program using the Apache Tika detector API on Apache Hadoop
  • how performance is influenced by the amount of data

ARC / Apache Tika map/reduce application
The map/reduce application written for the tests consists of two parts. The first is a custom ARC record reader that lets Hadoop natively read the ARC web archive containers, record by record; a record is one object stored in the ARC file, e.g. an image, a text or another file. The second part uses the Apache Tika 1.0 API to detect the MIME type of each record. The output of the application is a table listing all detected MIME types together with a count of how often each one was detected, which gives you an overview of what is inside your archives. The whole thing has been implemented as a native map/reduce application.
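To make this more concrete, here is a minimal sketch of how such a job can be wired together: a mapper that runs Tika's detection on each ARC record payload and emits (MIME type, 1) pairs, and a reducer that sums the counts per type. The ArcInputFormat named below stands in for the custom ARC record reader described above and is assumed rather than shown, as is the choice of the record URL as map input key; the Tika and Hadoop calls are standard API usage, but treat this as an illustration, not the exact code used in the experiments.

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.tika.Tika;

public class ArcTikaIdentify {

    // Mapper: called once per ARC record; the (hypothetical) ArcInputFormat is
    // assumed to deliver the record URL as key and the record payload as value.
    public static class DetectMapper
            extends Mapper<Text, BytesWritable, Text, LongWritable> {

        private static final LongWritable ONE = new LongWritable(1);
        private final Tika tika = new Tika();

        @Override
        protected void map(Text recordUrl, BytesWritable payload, Context context)
                throws IOException, InterruptedException {
            // Tika inspects the leading bytes of the payload to detect the MIME type.
            byte[] bytes = Arrays.copyOf(payload.getBytes(), payload.getLength());
            String mimeType = tika.detect(bytes);
            context.write(new Text(mimeType), ONE);
        }
    }

    // Reducer (also used as combiner): sums the counts per detected MIME type.
    public static class CountReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text mimeType, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable count : counts) {
                sum += count.get();
            }
            context.write(mimeType, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "arc-tika-identify");
        job.setJarByClass(ArcTikaIdentify.class);
        // ArcInputFormat is a placeholder for the custom ARC record reader
        // described above; it splits ARC.GZ containers into individual records.
        job.setInputFormatClass(ArcInputFormat.class);
        job.setMapperClass(DetectMapper.class);
        job.setCombinerClass(CountReducer.class);
        job.setReducerClass(CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With the reducer also registered as a combiner, most of the counting already happens on the map side, so the job output is exactly the MIME type / count table described above.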

Running experiments on the “real” cluster
The “real” cluster is an experimental cluster consisting of 6 dedicated servers:

NETWORK infrastructure:
The CONTROLLER and the NODEs are connected to the internal network through a GBit high-performance network switch that guarantees the full GBit bandwidth on each port.

1 xCONTROLLER

  • CPU: 2 x 2.40GHz Quadcore CPU (16 HT cores)
  • RAM: 24GB
  • NIC: 2 x GBit Ethernet (1 used)
  • DISK: 3 x 1TB DISKs; RAID5 (redundancy) => 2TB effective disk space

5 x NODE

  • CPU: 1 x 2.53GHz Quadcore CPU (8 HT cores)
  • RAM: 16GB
  • NIC: 2 x GBit Ethernet (1 used)
  • DISK: 2 x 1TB DISKs; JBOD (performance) => 2TB effective disk space

 

Results from the experiments on the described cluster:

amount of data    throughput (GB/min)
10 GB             11.76
100 GB            16.17
1 TB              18.01

Our conclusion:
The results gave us a good impression of the performance that can be expected when identifying web content in a map/reduce environment. Very roughly speaking: the more data we process, the less the Hadoop task-startup overhead weighs in. During the experimentation phase we measured a throughput of up to 18 GB of web archive data per minute, which is much better than we had expected.

Related topic:
If you are interested in what an identification result can look like, please continue reading: http://www.openpreservation.org/blogs/2012-08-17-analysing-formats-uk-web-archive


1 Comment

  1. ravikchikkabd
    May 30, 2017 @ 2:47 pm CEST

    Hello sir, Is it possible to share the code for this experiment, please?
    Thanks,
    Ravi
