Hackathon: Practical Tools for Digital Preservation

I participated in the Open Planets Foundation / Digital Preservation Coalition hackathon in York two weeks ago (27/09 – 29/09). This is the third event of this type that I’ve attended this year, and so far each has been better than the last.

The event was well attended, with 30+ people over the three days, all in the blaze of a rare English Indian summer; indeed, it was a little too warm indoors. The format was built upon the successful AQuA hackathons, which bring together content owners (COs), who arrive with collection samples and preservation problems, and developers working in the digital preservation field. Following a set of two-minute lightning presentations by both COs and devs, developers and content owners are paired up and spend the next two and a half days developing a solution, while documenting both the problem and the solution details on the wiki. This is done against a backdrop of presentations aimed at both devs and COs, creating a busy, productive environment.

Speaking as a dev, the pressure of time is ever present. There are only two full days for coding, and it’s surprising what can be achieved in such a short time. Getting away from the regular desk and focussing on a single problem without distractions speeds up prototype development, while having the user on hand means iterative improvements can be quickly tested and then incorporated or abandoned.

People brought a diverse set of content issues and samples; the full list is on the event wiki along with the solution details. I was paired with Jenny Mitcham from the Archaeology Data Service in York, who actively migrate the reports they receive to the PDF/A format, though the process is not without problems. Initially I was looking for a way of flagging potentially problematic PDFs using a PDF characterisation tool, based upon the Apache PDFBox libraries, that I’ve been developing for a while. A solution proved too complex to develop in the allotted time (PDF/A validators aren’t easy to write), so I ended up investigating one or two commercial tools. The PDFTron PDF/A validation and conversion tools proved to be good, though not perfect, and fairly cost effective.

Other problems and solutions that caught the eye were a tool that extracted and characterised objects embedded within MS Word docx files, a prototype solution for web-based email harvesting, and a neat use of the Ohloh code profiling tools to identify source code text files. There was also some good progress made on the British Library’s truncated JPEG2000 and shifted-crop corruption issues, which I’ll be using and developing further back in the office.
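
For readers curious about the docx idea, here is a minimal sketch of the first step only. It is not the tool built at the event, whose details are on the wiki. It assumes the usual Office Open XML layout, in which a docx file is a ZIP package and embedded objects are stored under word/embeddings/. The function name list_embedded_objects is hypothetical, and characterising each object (for example with a format identification tool) is left out.

```python
import zipfile

# Assumption: embedded objects live under this prefix inside the docx package.
EMBEDDINGS_PREFIX = "word/embeddings/"


def list_embedded_objects(docx):
    """Return (part name, uncompressed size) for each embedded object.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as package:
        return [
            (info.filename, info.file_size)
            for info in package.infolist()
            if info.filename.startswith(EMBEDDINGS_PREFIX)
            and not info.filename.endswith("/")  # skip directory entries
        ]
```

Calling list_embedded_objects("report.docx") on a document with no embedded objects returns an empty list; each returned part name can then be passed to package.read() to pull out the bytes for characterisation.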

The event had a really positive, friendly atmosphere. One real strength of the format is that it demands genuine interaction between the participants, which helps to build an active digital preservation community. I’m a real convert to this style of event, and would encourage any curious or skeptical readers to come along to a future one; I’m sure that this won’t be the last.

