Impressions of the ‘Hadoop-driven digital preservation Hackathon’ in Vienna

More than 20 developers attended the ‘Hadoop-driven digital preservation Hackathon’ in Vienna, which took place in the baroque "Oratorium" room of the Austrian National Library from 2 to 4 December 2013. It was really exciting to hear people talking animatedly about Hadoop, Pig, Hive and HBase, followed by quiet phases of concentrated coding accompanied by the background noise of mouse clicks and keyboard typing.

There were Hadoop newbies, people from the SCAPE Project with some knowledge of Apache Hadoop related technologies, and, finally, Jimmy Lin, who is currently an associate professor at the University of Maryland and previously worked as a research scientist at Twitter. There is no doubt that his profound knowledge of using Hadoop in an ‘industrial’ big data context gave this event that certain something.

The topic of this Hackathon was large-scale digital preservation in the web archiving and digital books quality assurance domains. People from the Austrian National Library presented application scenarios and challenges and introduced the sample data for both areas, which was provided on a virtual machine together with a pseudo-distributed Hadoop installation and some other useful tools from the Apache Hadoop ecosystem.

I am sure that Jimmy’s talk about Hadoop was the reason why so many participants became curious about Apache Pig, a powerful tool which was humorously characterised by Jimmy as the tool for lazy pigs aiming for hassle-free MapReduce. Jimmy gave a live demo, running some Pig scripts on the cluster at his university and explaining how Pig can be used to find out which links point to each web page in a web archive data sample from the Library of Congress. When I asked Jimmy for his opinion on Pig and Hive as two alternatives for data scientists to choose from, I found it interesting that he did not seem to have a strong preference for Pig. If an organisation has a lot of experienced SQL experts, he said, Hive is a very good choice. On the other hand, from the perspective of the data scientist, Pig offers a more flexible, procedural approach to manipulating and analysing data.
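
To get a feeling for how much boilerplate Pig hides, here is a minimal sketch of the same inbound-link counting idea written as a plain Java MapReduce job. It assumes the links have already been extracted into simple "source<TAB>target" text lines (the actual demo worked directly on web archive data), so the class name and input format are illustrative only; in Pig the whole job collapses to a GROUP BY and a COUNT.

```java
// Hypothetical sketch: counts inbound links per target URL with plain MapReduce,
// assuming link pairs were already extracted into "source<TAB>target" text lines.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InboundLinkCount {

  // Emits (target, 1) for every "source<TAB>target" line.
  public static class LinkMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text target = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length == 2) {
        target.set(fields[1]);
        context.write(target, ONE);
      }
    }
  }

  // Sums the counts per target URL.
  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      context.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "inbound link count");
    job.setJarByClass(InboundLinkCount.class);
    job.setMapperClass(LinkMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```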

Towards the end of the first day, we gathered ideas in a brainstorming session which eventually led to several working groups:

·  Cropping error detection

·  Full-text search on top of WarcBase

·  Hadoop-based Identification and Characterisation

·  OCR Quality

·  Pig User Defined Functions to operate on extracted web content

·  Pig User Defined Functions to operate on METS

Many participants took their first steps in Pig scripting during the event, so one cannot expect code that is ready to be used in a production environment, but the results offer many starting points for planning projects with similar requirements.

On the second day, there was another talk by Jimmy about HBase and his project WarcBase, which looks like a very promising approach: a scalable HBase storage backend combined with a very responsive user interface that offers the basic functionality of the Wayback Machine for rendering ARC and WARC web archive container files. In my opinion, the upside of his talk was seeing HBase as a tremendously powerful database on top of Hadoop’s distributed file system (HDFS), with Jimmy brimming over with ideas about possible use cases for scalable content delivery using HBase. The downside was hearing his experiences of how complex the administration of a large HBase cluster can become. In addition to the usual Hadoop administration tasks, it is necessary to keep further daemons (ZooKeeper, RegionServer) up and running, and he explained how the need to compact data stored in HFiles, just when you believe the HBase cluster is well balanced, can lead to what the community calls a “compaction storm” that blows up your cluster – luckily this only manifests itself as endless Java stack traces.
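
For readers who have not used HBase’s client API, the following is a rough sketch of what URL-keyed content delivery looks like from the Java side (0.94-era client API). The table name, column family and row-key scheme are assumptions made purely for illustration and are not WarcBase’s actual schema.

```java
// Hypothetical sketch of URL-keyed content retrieval from HBase.
// Table name, column family and row-key scheme are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ArchiveLookup {

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
    HTable table = new HTable(conf, "webarchive");       // hypothetical table name
    try {
      // Row key: the archived URL; real systems often reverse the host for locality.
      Get get = new Get(Bytes.toBytes("http://www.onb.ac.at/"));
      get.setMaxVersions(10);                             // e.g. one version per crawl date
      Result result = table.get(get);
      // Hypothetical column family "c", qualifier "payload" holding the record bytes.
      byte[] payload = result.getValue(Bytes.toBytes("c"), Bytes.toBytes("payload"));
      if (payload != null) {
        System.out.println("Fetched " + payload.length + " bytes");
      }
    } finally {
      table.close();
    }
  }
}
```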

One group built a full-text search for WarcBase. They picked up the core ideas from the developer groups and presentations to build a cutting-edge environment in which the web archive content was indexed by the Terrier search engine and the index was enriched with metadata from Apache Tika’s MIME type and language detection. There were two ways to add metadata to the index: the first was to run a pre-processing step that uses a Pig user defined function to output the metadata of each document; the second was to use Apache Tika during indexing to detect both the MIME type and the language. In my view, this group won the prize for the fanciest set-up, sharing resources and daemons running across their laptops.
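
As a rough illustration of the second option, the snippet below shows the kind of Tika 1.x calls involved in detecting a document’s MIME type and language before the values are attached to an index entry. The class name and wiring are illustrative, not the group’s actual indexing code.

```java
// Minimal sketch: enrich a document with a MIME type and a language code using Tika 1.x.
import java.io.File;

import org.apache.tika.Tika;
import org.apache.tika.language.LanguageIdentifier;

public class DocumentEnricher {

  private static final Tika TIKA = new Tika();

  public static void main(String[] args) throws Exception {
    File doc = new File(args[0]);

    // MIME type detection looks at the file name and the first bytes of content.
    String mimeType = TIKA.detect(doc);

    // Language detection works on extracted plain text (n-gram based identifier).
    String text = TIKA.parseToString(doc);
    String language = new LanguageIdentifier(text).getLanguage();

    System.out.println(mimeType + "\t" + language);
  }
}
```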

I was impressed by how dynamically outcomes were shared between developers in the largest working group: one developer implemented a Pig user defined function (UDF) making use of Apache Tika’s language detection API (see the section on MIME type detection), which the next developer used in a Pig script for MIME type and language detection. Alan Akbik, SCAPE project member, computational linguist and Hadoop researcher from the University of Berlin, also reused building blocks from this group to develop Pig scripts for analysing old German text, using dictionaries as a means to determine the quality of noisy OCRed text. As an experienced Pig scripter he produced impressive results and deservedly won the Hackathon’s competition for the best presentation of outcomes.
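
A Pig UDF of the kind described is essentially a small Java class extending EvalFunc; the sketch below wraps Tika’s language identification so that a Pig script can REGISTER the jar and call the function in a FOREACH … GENERATE statement. Class and method names here are illustrative rather than the hackathon code.

```java
// Sketch of a Pig UDF wrapping Tika language identification (illustrative, not the
// hackathon implementation). Takes a chararg containing extracted text, returns a
// language code.
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.tika.language.LanguageIdentifier;

public class DetectLanguage extends EvalFunc<String> {

  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    String text = input.get(0).toString();
    // Returns an ISO 639-1 code such as "de" or "en" (Tika 1.x n-gram identifier).
    return new LanguageIdentifier(text).getLanguage();
  }
}
```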

The last group was experimenting with the functionality of classical digital preservation tools for file format identification, such as Apache Tika, DROID and Unix file, and looking into ways to improve their performance on the Hadoop platform. It is worth highlighting that digital preservation guru Carl Wilson found a way to replace the command-line invocation of Unix file in FITS with a Java API invocation, which proved to be far more efficient.
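
I cannot reproduce Carl’s actual patch here, but the general point is easy to illustrate: spawning one external process per file is expensive compared with calling an identification library inside the JVM. The sketch below uses Apache Tika’s detector purely as a stand-in for an in-JVM identifier and times it against shelling out to the Unix file command; it is not the FITS change described above.

```java
// Illustrates why an in-JVM call beats spawning a process per file:
// 'file --mime-type' via ProcessBuilder vs. Tika's detector in the same JVM.
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;

public class IdentificationOverhead {

  // One external process per file: fork/exec dominates the runtime.
  static void identifyWithFileCommand(File f) throws IOException, InterruptedException {
    Process p = new ProcessBuilder("file", "--mime-type", "-b", f.getAbsolutePath()).start();
    p.waitFor();
  }

  // One long-lived detector shared across files: no process start-up cost.
  static void identifyWithTika(Tika tika, File f) throws IOException {
    tika.detect(f);
  }

  public static void main(String[] args) throws Exception {
    File[] files = new File(args[0]).listFiles();
    if (files == null) {
      return;
    }
    Tika tika = new Tika();

    long t0 = System.currentTimeMillis();
    for (File f : files) identifyWithFileCommand(f);
    long t1 = System.currentTimeMillis();
    for (File f : files) identifyWithTika(tika, f);
    long t2 = System.currentTimeMillis();

    System.out.println("file command: " + (t1 - t0) + " ms, Tika in-JVM: " + (t2 - t1) + " ms");
  }
}
```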

Finally, Roman Graf, researcher and software developer at the Austrian Institute of Technology, took images from the Austrian Books Online project and developed Python scripts which detect page cropping errors and which were specifically designed to run on a Hadoop platform.

On the last day, we had a panel session in which people talked about their experiences of day-to-day work with Hadoop clusters and their plans for the future of their cluster infrastructure.

I really enjoyed these three days and I was impressed by the knowledge and ideas that people brought to this event.
