We’ve been working away on defining and solving preservation problems at the three-day AQuA Mashup event in Leeds. Today we wrapped up some new technical solutions and documented what we learnt. I think it’s fair to say that most of the preservation issues and challenges we looked at centred on pretty straightforward problems.
Increasingly we seem to be coming to the conclusion that the most pressing preservation needs are for relatively simple, and ideally easy to use, toolsets. Furthermore, the core of the functionality we need is often already out there in the form of open source tools that were typically not developed with digital preservation in mind. That’s really the central theme of the AQuA project: to exploit available tools and solve some small but pressing preservation challenges.
Of course, investigating these questions with specific content quickly took our techies into some pretty complex technical challenges, and pushed our understanding around file formats and the capabilities of the tools and techniques we were trying out. Despite this, excellent progress was made. In fact, the learning and sharing of ideas and expertise on tools and formats was a recurring theme in the positive event feedback we received from attendees.
Some fascinating stuff was developed by our techies, guided in part by our content experts who had worked to describe and document their preservation issues. A lack of transparency around PDF content was a common theme identified by our content experts, and so the investigation into embedded images, fonts and external links was particularly enlightening. Carl Wilson (BL) reported back to the group on the lengths to which a diligent PDF/A validator has to go with respect to fonts. It needs to confirm that every glyph (character) used in the PDF file is represented within the font file, and it also needs to validate the font file itself. I’m intrigued to know whether the various PDF/A validators out there do a decent job of this.
Fingerprinting techniques cropped up in a couple of our solution developments, making use of tools for both audio and image matching. This is (not surprisingly) a challenging area and a lot more work is required. But as Roger (BL) and Maurice (Nationaal Archief/OPF) demonstrated, there appears to be significant potential in exploiting this approach to meet digital preservation challenges.
Sven Schlarb of the Austrian National Library demonstrated the value of reusable tools and a flexible workflow design capability by taking IMPACT Project outputs (originally focused on Optical Character Recognition to enable search and retrieval of digitised texts) and applying them to digital preservation QA problems. Excellent stuff, and applicable for lots of content! It looks like Taverna will pay dividends on the new SCAPE Project (watch this web space).
Frank Feng (University of York) worked on consistency-checking packaged METS content. His Java code cross-checks file references between the various metadata, OCR, master data and service data from a mass digitisation project. This double check, or second opinion if you like, is particularly useful for picking up bugs in content or metadata generation (or in subsequent processing). With some added intelligence and a bit more adaptability, this could be exploited for more generic cross-checking of similar but not identically structured content.
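To give a flavour of what this kind of cross-check involves, here is a minimal sketch in Python. It is not Frank’s code, and it only covers one simple case: comparing the file IDs declared in a METS file section against the IDs referenced from the structural map. The function name `cross_check_mets` and the sample IDs are made up for illustration, and a real mass digitisation package would need further checks (for example, that the referenced files actually exist on disk).

```python
import xml.etree.ElementTree as ET

# Standard METS namespace, in ElementTree's "{uri}tag" notation.
METS_NS = "{http://www.loc.gov/METS/}"

# Hypothetical sample document: three declared files, three references,
# with one mismatch in each direction.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="master">
      <mets:file ID="MASTER_0001"><mets:FLocat LOCTYPE="URL" xlink:href="master/0001.tif"/></mets:file>
      <mets:file ID="MASTER_0002"><mets:FLocat LOCTYPE="URL" xlink:href="master/0002.tif"/></mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="ocr">
      <mets:file ID="OCR_0001"><mets:FLocat LOCTYPE="URL" xlink:href="ocr/0001.xml"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap>
    <mets:div TYPE="page" ORDER="1">
      <mets:fptr FILEID="MASTER_0001"/>
      <mets:fptr FILEID="OCR_0001"/>
    </mets:div>
    <mets:div TYPE="page" ORDER="2">
      <mets:fptr FILEID="OCR_0002"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def cross_check_mets(mets_xml: str) -> dict:
    """Compare file IDs declared in the fileSec with those used by fptr elements."""
    root = ET.fromstring(mets_xml)

    # IDs declared by <mets:file ID="..."> elements.
    declared = {f.get("ID") for f in root.iter(METS_NS + "file") if f.get("ID")}

    # IDs referenced by <mets:fptr FILEID="..."> elements.
    referenced = {
        p.get("FILEID") for p in root.iter(METS_NS + "fptr") if p.get("FILEID")
    }

    return {
        # Referenced from the structure but never declared.
        "dangling": sorted(referenced - declared),
        # Declared but never referenced from the structure.
        "unreferenced": sorted(declared - referenced),
    }


if __name__ == "__main__":
    print(cross_check_mets(SAMPLE))
    # {'dangling': ['OCR_0002'], 'unreferenced': ['MASTER_0002']}
```

Here page 2 points at an OCR file that was never declared, and the second master image is declared but never used, which is exactly the sort of generation bug a second-opinion check is meant to surface.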
Finally, Andy Jackson (BL) worked on a number of preservation issues, including QA of damaged FLV video files and some rooting around in binary office files that yielded some interesting results to take forward in future characterisation and format identification work.
I’m sure these guys will be blogging in more detail about some of this work over the next few days.
I’ve not given much mention to our non-techie attendees, but they did a fabulous job of capturing our preservation issues on the project wiki. The “Collections – Issues – Solutions” page has all the results, including preservation challenges that we identified but didn’t have time to explore further. Some tidying up work will be happening over the next few weeks but most of our results are up there ready to discover. Comments, suggestions and registrations for the London mashup are of course very welcome!
The next challenge is to scale up the approach and schedule we used in Leeds so that they work with at least double the numbers at our London mashup, scheduled for June 13-15th.