SPRUCE Hackathon Leeds: extending C3PO to support Apache Tika

PDF Eh? – Another Hackathon Tale

The SPRUCE Unified Characterisation Hackathon in Leeds brought together a group of developers to discuss the digital preservation community's approach to characterisation and to consolidate and improve existing toolsets.

Developed by Petar Petrov as part of the SCAPE project, C3PO is a tool for profiling digital collections based on FITS characterisation metadata. I had recently experimented with using FITS and C3PO to carry out a digital collections audit that formed part of a SPRUCE-funded project at Bishopsgate Institute.

At the hackathon, Petar, Per Møldrup-Dalum and I worked to extend C3PO to support the creation of collection profiles using metadata extracted with Apache Tika. Tika extracts a range of metadata, along with text content, from different media types. Like FITS, it can therefore be used for detailed characterisation and profiling of digital collections. A key advantage of Tika is its performance, which is particularly important to practitioners dealing with large datasets, such as the 300TB web archive that Per has been working with at the State and University Library in Denmark.

To add support for Tika to C3PO, Per wrote a parser for Tika's metadata output; Petar then implemented an adapter to enable C3PO to understand this output. We generated a test dataset of Tika output files and were able to use C3PO to ingest and analyse this metadata. We were also able to get C3PO running in Apache Tomcat. Petar hopes to release a Tika-enabled version of C3PO shortly.
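To give a feel for what parsing Tika's text metadata output involves, here is a minimal sketch in Python. It is not Per's parser: it simply assumes the output consists of one `key: value` pair per line, and the function name and sample lines are made up for illustration.

```python
from collections import defaultdict


def parse_tika_metadata(text):
    """Parse 'key: value' lines into a dict mapping each key to a list of values.

    Assumption: one property per line, with key and value separated by ": ".
    """
    metadata = defaultdict(list)
    for line in text.splitlines():
        # Split on the first ": " only, so that namespaced keys such as
        # "dc:title" are kept intact.
        key, sep, value = line.partition(": ")
        if not sep or not key.strip():
            continue  # skip blank or malformed lines
        metadata[key.strip()].append(value.strip())
    return dict(metadata)


# Illustrative sample, not real output from the hackathon dataset.
sample = """Content-Type: application/pdf
dc:title: Annual Report
xmpTPg:NPages: 12
"""
profile = parse_tika_metadata(sample)
```

Keeping a list per key allows for properties that occur more than once. A value that itself spans several lines would defeat this line-based approach, which is one reason a structured format would be easier to consume.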

Some challenges still remain. Parsing Tika's text metadata output proved awkward, and it would be more convenient to parse XML output if this can be provided without also extracting a document's text content (which is more expensive). Petar aims to modify C3PO so that it is able to ingest metadata from both FITS and Tika for the same dataset. This poses the problem of how to reconcile the two sets of metadata in the absence of unique identifiers for the files they describe. More problematic still is how to map the wide variety of metadata properties extracted by Tika, which are not well documented, onto those provided by FITS.
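One possible way to approach the reconciliation problem, sketched here purely as an illustration, is to pair records on a normalised file path. This assumes both tools were run over the same directory tree and that each record can be keyed by its path; the function names and the record layout are hypothetical and do not reflect how C3PO works.

```python
import posixpath


def normalise_path(path):
    """Reduce a path to a comparable key: forward slashes, lower case."""
    return posixpath.normpath(path.replace("\\", "/")).lower()


def reconcile(fits_records, tika_records):
    """Pair FITS and Tika records that share a normalised path.

    Both arguments are assumed to be dicts mapping a file path to that
    file's metadata. Returns the merged records and the FITS paths that
    had no Tika counterpart.
    """
    tika_by_path = {normalise_path(p): meta for p, meta in tika_records.items()}
    merged, unmatched = {}, []
    for path, fits_meta in fits_records.items():
        key = normalise_path(path)
        if key in tika_by_path:
            merged[key] = {"fits": fits_meta, "tika": tika_by_path[key]}
        else:
            unmatched.append(path)
    return merged, unmatched
```

Lower-casing paths can wrongly merge distinct files on a case-sensitive file system, so a checksum would be a safer key wherever both tools record one. Mapping Tika's property names onto those used by FITS remains a separate problem that this sketch does not address.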

Despite these challenges, the work at the hackathon has extended the capabilities of C3PO and opened up exciting possibilities for future work. C3PO with Tika support is another useful option for collection owners looking to build up a detailed profile of their collections to assist with preservation planning.

1 Comment

  1. peshkira
    March 23, 2013 @ 11:12 am CET

    Hey You!

    If you are interested in c3po and the info in this blog post, I have just released a new version of c3po (v0.3.0) that adds this new functionality and fixes some bugs. Note that the Tika support is experimental and only a few properties are supported for now, but this will change with the next releases.

    If you want to try c3po out, you can find some more info here: https://github.com/peshkira/c3po and download it here: https://github.com/peshkira/c3po/wiki/Downloads
