In December last year I attended a Hadoop Hackathon in Vienna, an event that other participants have already written about: Sven Schlarb's Impressions of the ‘Hadoop-driven digital preservation Hackathon’ in Vienna and Clemens and René's The Elephant Returns to the Library…with a Pig!. Like them, I came home from this event with a lot of enthusiasm and fingers itching to continue the work I started there.
As Clemens and René write in their blog post on this event, collaboration had, without ever really being stated explicitly, taken centre stage, and that in itself was a good experience.
For the hackathon Jimmy Lin from the University of Maryland had been invited to present Hadoop and some of the adjoining technologies to the participants. We all came with the hope of seeing cool and practical uses of Hadoop in digital preservation. He started his first session by surprising us all with a talk titled: Never Write Another MapReduce Job. It later became clear that Jimmy enjoys this kind of gentle provocation, as in his 2012 article titled If all you have is a hammer, throw away everything that is not a nail. Jimmy, of course, did not want us to throw away Hadoop. Instead he gave a talk on how to get rid of all the tediousness and boilerplate necessary when writing MapReduce jobs in Java. He showed us how to use Pig Latin, a language that can be described as an imperative, SQL-like DSL for manipulating data structured as lists. It is very concise and expressive, and it soon became a new shiny tool for us developers.
During the past year or so Jimmy had been developing a Hadoop-based tool for harvesting web sites into HBase. This tool also has its own piggy bank, which is what you call a library of user-defined functions (UDFs) for Pig Latin. So, to cut a corner, those of us who wanted to hack in Pig Latin cloned that tool from Github: warcbase. As a bonus the tool also had a UDF for reading ARC files, which was nice as we had a lot of test data in that format, some provided by ONB and some brought from home.
As an interesting side-note, the warcbase tool actually leverages another recently developed digital preservation tool, namely JWAT, developed at the Danish Royal Library.
As Clemens and René write in their blog post, they created two UDFs using Apache Tika: one UDF for detecting which language a given text-based ARC record was written in and another for identifying which MIME type a given record had. Meanwhile another participant, Alan Akbik from Technische Universität Berlin, showed Jimmy how to easily add Pig Latin unit tests to a project. This resulted in an actual commit to warcbase during the hackathon, adding unit tests to the previously implemented UDFs.
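To give an impression of what such a Tika-based UDF looks like, here is a minimal sketch of a MIME type detection UDF. It is only an illustration under my own assumptions: the class name is made up, and the actual UDFs Clemens and René committed to warcbase may differ in naming and detail. It does, however, follow the same EvalFunc pattern as the magic UDF shown later in this post.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.tika.Tika;

// Illustrative sketch only, not the actual warcbase UDF:
// detect the MIME type of a record's content with Tika's detection facade.
public class DetectMimeTypeTikaSketch extends EvalFunc<String> {
  private final Tika tika = new Tika();

  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return "N/A";
    }
    String content = (String) input.get(0);
    if (content.isEmpty()) return "EMPTY";
    // Tika inspects the leading bytes (and, for text, the content itself)
    return tika.detect(content.getBytes());
  }
}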
Given those unit tests I could then implement such tests for the two Tika UDFs that Clemens and René had written. These days unit tests are almost ubiquitous when collaborating on software. Apart from their primary role of ensuring the continued correctness of refactored code, they have another advantage. For years I've preferred an exploratory development style using REPL-like environments. This is hard to do in Java, but the combination of unit tests and a good IDE gives you a little of that dynamic feeling.
With all the above in place I decided to write a new UDF. This UDF should use the UNIX file tool to identify records in an ARC file. The task would bring together the ARC reader UDF by Jimmy, the Pig unit tests by Alan and, lastly, a Java/JNA library written by Carl Wilson, who adapted it from another digital preservation tool called JHOVE2. This library is available as libmagic-jna-wrapper. I would, of course, also rely heavily on the two Tika UDFs by Clemens and René and the unit tests I wrote for those.
Old Magic
The "file" tool and its accompanying library "libmagic" are used in every Linux and BSD distribution on the planet. Born in 1987, "file" is still the most widely used file format identification tool. It would be sensible to employ such a robust and widespread tool in any file identification environment, especially as it is still under active development. As of this writing, the latest commit to "file" was five days ago!
The "file" tool is available on Github as glenc/file.
"file" and the "libmagic" library are developed in C. To use them from Java we therefore need a JNA binding, and that is exactly what Carl finished during the hackathon.
Maven makes it easy to use that library:
<dependency>
<groupId>org.opf-labs</groupId>
<artifactId>lib-magic-wrapper</artifactId>
<version>0.0.1-SNAPSHOT</version>
</dependency>
which gives access to "libmagic" from a Java program:
import org.opf_labs.LibmagicJnaWrapper;
...
LibmagicJnaWrapper jnaWrapper = new LibmagicJnaWrapper();
String magicFile = "/usr/share/file/magic.mgc";
jnaWrapper.load(magicFile);
String mimeType = jnaWrapper.getMimeType(is);
...
There is one caveat in using a C library like this from Java: it often requires platform-specific configuration, in this case the full path to the "magic.mgc" file. This file contains the signatures (byte sequences) used when identifying the formats of unknown files. In this implementation the UDF takes this path as a parameter to the constructor of the UDF class.
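The snippet above hard-codes one possible location. To make an experiment slightly more portable, one could probe a few candidate locations for the compiled magic file before falling back to an explicit path. The helper below is purely hypothetical, not part of warcbase or the JNA wrapper, and simply tries the example paths mentioned elsewhere in this post.
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of warcbase: probe a few candidate locations
// for the compiled magic file and fail loudly if none is found.
public final class MagicFileLocator {
  private static final List<String> CANDIDATES = Arrays.asList(
      "/usr/share/file/magic.mgc",                              // path used in the snippet above
      "/usr/local/Cellar/libmagic/5.16/share/misc/magic.mgc");  // Homebrew path from the Pig script below

  private MagicFileLocator() {}

  public static String find() {
    for (String path : CANDIDATES) {
      if (new File(path).isFile()) {
        return path;
      }
    }
    throw new IllegalStateException(
        "No magic.mgc found in the candidate locations; pass the path explicitly");
  }
}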
Magic UDF
With the above in place it is very easy to implement the UDF, which in its entirety is as simple as:
package org.warcbase.pig.piggybank;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.opf_labs.LibmagicJnaWrapper;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DetectMimeTypeMagic extends EvalFunc<String> {
  // Path to the compiled magic file, passed in via the UDF constructor (see the Pig define below)
  private static String MAGIC_FILE_PATH;

  public DetectMimeTypeMagic(String magicFilePath) {
    MAGIC_FILE_PATH = magicFilePath;
  }

  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return "N/A";
    }
    String content = (String) input.get(0);
    if (content.isEmpty()) return "EMPTY";
    InputStream is = new ByteArrayInputStream(content.getBytes());
    // Let libmagic identify the MIME type from the record's bytes
    LibmagicJnaWrapper jnaWrapper = new LibmagicJnaWrapper();
    jnaWrapper.load(MAGIC_FILE_PATH);
    return jnaWrapper.getMimeType(is);
  }
}
Github: DetectMimeTypeMagic.java
Magic Pig Latin
Below is a Pig Latin script utilising the new magic UDF on an example ARC file. The script measures the distribution of MIME types in the input files.
register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
-- The '50' argument is explained in the last section
define ArcLoader50k org.warcbase.pig.ArcLoader('50');
-- Detect the mime type of the content using magic lib
-- On MacOS X using Homebrew the magic file is located at
-- /usr/local/Cellar/libmagic/5.16/share/misc/magic.mgc
define DetectMimeTypeMagic org.warcbase.pig.piggybank.DetectMimeTypeMagic('/usr/local/Cellar/libmagic/5.16/share/misc/magic.mgc');
-- Load arc file properties: url, date, mime, and 50kB of the content
raw = load 'example.arc.gz' using ArcLoader50k() as (url: chararray, date:chararray, mime:chararray, content:chararray);
a = foreach raw generate url,mime, DetectMimeTypeMagic(content) as magicMime;
-- magic lib includes "; <char set>" which we are not interested in
b = foreach a {
magicMimeSplit = STRSPLIT(magicMime, ';');
GENERATE url, mime, magicMimeSplit.$0 as magicMime;
}
-- bin the results
magicMimes = foreach b generate magicMime;
magicMimeGroups = group magicMimes by magicMime;
magicMimeBinned = foreach magicMimeGroups generate group, COUNT(magicMimes);
store magicMimeBinned into 'magicMimeBinned';
This script can be modified a bit for use with the following unit test:
@Test
public void testDetectMimeTypeMagic() throws Exception {
  String arcTestDataFile = Resources.getResource("arc/example.arc.gz").getPath();
  String pigFile = Resources.getResource("scripts/TestDetectMimeTypeMagic.pig").getPath();
  String location = tempDir.getPath().replaceAll("\\\\", "/"); // make it work on Windows?

  PigTest test = new PigTest(pigFile, new String[] {
      "testArcFolder=" + arcTestDataFile, "experimentfolder=" + location });

  Iterator<Tuple> ts = test.getAlias("magicMimeBinned");
  while (ts.hasNext()) {
    Tuple t = ts.next(); // t = (mime type, count)
    String mime = (String) t.get(0);
    System.out.println(mime + ": " + t.get(1));
    if (mime != null) {
      switch (mime) {
        case "EMPTY":                         assertEquals(  7L, (long) t.get(1)); break;
        case "text/html":                     assertEquals(139L, (long) t.get(1)); break;
        case "text/plain":                    assertEquals( 80L, (long) t.get(1)); break;
        case "image/gif":                     assertEquals( 29L, (long) t.get(1)); break;
        case "application/xml":               assertEquals( 11L, (long) t.get(1)); break;
        case "application/rss+xml":           assertEquals(  2L, (long) t.get(1)); break;
        case "application/xhtml+xml":         assertEquals(  1L, (long) t.get(1)); break;
        case "application/octet-stream":      assertEquals( 26L, (long) t.get(1)); break;
        case "application/x-shockwave-flash": assertEquals(  8L, (long) t.get(1)); break;
      }
    }
  }
}
Github: TestArcLoaderPig.java
The modified Pig Latin script is at TestDetectMimeTypeMagic.pig
¡Hasta la Vista!
During this event we had a lot of synergy through collaboration: shouting over the tables, showing code to each other, running each other's code on non-public test data, presenting results on projectors, and so on. Even the late-night discussions added significant energy. None of this is possible without people actually meeting each other face to face for a couple of days, showing up with a genuine intention to share, learn and teach.
So, I do hope to see you all soon somewhere in Europe for some great hacking.
Epilogue: Out of heap space
A couple of weeks ago I was more or less done with all of the above, including this blog post. Then something happened that required us to upgrade our version of Cloudera to 4.5. This in turn made us change the basic cluster architecture, and then the UDFs stopped working due to out-of-heap-space errors. I traced those out-of-memory errors to the ArcLoader class, which is why I implemented the "READ_SIZE" class field. The field is set to some reasonable number of kB when the class is instantiated. It forces the ArcLoader to read only a limited amount of payload data, just enough for Tika and libmagic to complete their format identification while ensuring we don't pass hundreds-of-megabytes-sized strings around.
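To illustrate the idea, here is a small sketch of such a truncated read. It is only an illustration under my own assumptions, not the actual warcbase ArcLoader code; the class and method names are made up, and the real loader's decoding of bytes to characters may differ.
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the READ_SIZE idea, not the warcbase ArcLoader itself.
public class TruncatedPayloadReader {
  private final int readSizeKb; // maximum amount of payload to read, in kB

  public TruncatedPayloadReader(int readSizeKb) {
    this.readSizeKb = readSizeKb;
  }

  // Read at most readSizeKb kilobytes of the record payload: enough for Tika and
  // libmagic to identify the format, without materialising the full payload in memory.
  public String readPrefix(InputStream in) throws IOException {
    byte[] buffer = new byte[readSizeKb * 1024];
    int offset = 0;
    int n;
    while (offset < buffer.length
        && (n = in.read(buffer, offset, buffer.length - offset)) != -1) {
      offset += n;
    }
    // ISO-8859-1 maps bytes one-to-one to chars; a real loader might decode differently.
    return new String(buffer, 0, offset, StandardCharsets.ISO_8859_1);
  }
}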
This doesn't explain why it worked before and doesn't now. It also doesn't address the loss of generality: the ArcLoader can no longer provide an ARC container format abstraction in every case; it only works when the job can make do with part of the payload of the ARC records. For example, this solution would not work for a Pig script that needs to extract the audio tracks from movie files stored as ARC records.
As this work has primarily been a learning experience I will stop here — for now. Still, I'm certain that I'll revisit these issues somewhere down the road as they are both interesting and the solutions will be relevant for our work.