Project to identify files with linked dependencies

Many office suites and other applications allow information to be embedded in a document via a link to another file. Linked spreadsheets are common amongst data-intensive agencies, and large documents are often managed by linking multiple office documents to form a single final product.

Currently we have only anecdotal evidence of how prevalent linked files are in the digital universe. It would be very useful to understand the scale of the issue and to measure the prevalence of linked files in the material that we ingest. Archives New Zealand and Victoria University have recently initiated a project that we hope will go some way towards achieving this.

A student from the School of Engineering and Computer Science at Victoria University recently started work on a new summer project at Archives New Zealand. The student, Niklas Rehfeld, is funded through a summer scholarship jointly provided by Archives New Zealand and Victoria University.

Over the next 10 weeks Niklas will investigate linked files and build a tool to identify them. Specifically, the aim of this project is to develop a prototype tool that identifies when files in the Microsoft Office 1997-2003 formats link to other files, and which files they link to, in order to identify the component files that make up the complex digital object.
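
To make the idea of "component files" a little more concrete, here is a minimal sketch of how the set of files behind a root document could be collected once some link extractor exists. It is only an illustration, not the design of the tool: the `collect_components` function, the `extract_links` callback and the file names in the example are all hypothetical, and the real work of reading links out of Office files is deliberately left out.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def collect_components(
    root: str,
    extract_links: Callable[[str], Iterable[str]],
) -> Dict[str, List[str]]:
    """Map every file reachable from `root` to the files it links to.

    `extract_links` is a hypothetical callback standing in for a real
    format parser; it returns the link targets of one file.
    """
    graph: Dict[str, List[str]] = {}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        if current in graph:
            continue  # already visited; this also guards against circular links
        targets = list(extract_links(current))
        graph[current] = targets
        queue.extend(targets)
    return graph


# Hypothetical link table used in place of real documents.
example_links = {
    "report.doc": ["budget.xls", "chapter2.doc"],
    "budget.xls": ["rates.xls"],
    "chapter2.doc": ["budget.xls"],
    "rates.xls": [],
}

components = collect_components(
    "report.doc", lambda name: example_links.get(name, [])
)
print(sorted(components))
```

Keeping the traversal separate from the link extraction means the same walk could be reused for each document format that a parser is written for.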

The technical work will involve the following:

  1. Analysis of the Microsoft specifications to determine how document linking, and other metadata that may be of use for preservation purposes, are implemented in Word, Excel and PowerPoint documents for the period 1997-2003.
  2. Review of existing frameworks and related tools such as the open source “format identification, validation, and characterization” tool JHOVE.
  3. Writing a specification for a modular tool for identifying linked documents given a root Microsoft Office document. The specification will include an evaluation of the feasibility of extending an existing tool versus creating a standalone implementation from scratch.
  4. Implementation of a prototype tool for at least one document format. Time permitting, the tool will be extended either to handle a wider range of document formats or a wider range of preservation metadata.
  5. Testing of the tool against a selection of files supplied by National Archives.
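
As a small illustration of where such a tool might start, the sketch below checks whether a file begins with the OLE2 compound file signature, the container format used by the binary Word, Excel and PowerPoint formats of 1997-2003. It is an assumption-laden first filter rather than part of the planned tool: the function name `looks_like_ole2` is hypothetical, the signature alone does not say which Office application wrote the file, and other applications use the same container.

```python
# Header signature of an OLE2 compound file (Compound File Binary format).
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole2(path: str) -> bool:
    """Return True if the file at `path` starts with the OLE2 signature.

    This only identifies the container; it does not tell Word, Excel and
    PowerPoint files apart, and it says nothing about links inside them.
    """
    with open(path, "rb") as handle:
        return handle.read(len(OLE2_SIGNATURE)) == OLE2_SIGNATURE
```

A file that passes this check would still need to be parsed against the relevant Microsoft specification to find any links it contains.
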

This project is a research project first and foremost, and there is no guarantee that a working tool will be produced from it. However, if a useful tool is produced, the intention is to release it as an open source product that anyone can incorporate into their preservation workflows.

Niklas's first steps will include looking at existing tools that could be extended to perform the function outlined above. These include JHOVE, JHOVE2, DROID and the National Library of New Zealand Metadata Extraction Tool. In addition, he will be investigating the available Java libraries that he may be able to use for this purpose.

If anyone in the OPF community has advice on the best places to start with this project, or any other advice they would like to offer, we would greatly appreciate it. We have a number of good leads already but would welcome any help the community could offer.

By Euan Cochrane, posted in Euan Cochrane's Blog

21st Nov 2011  4:23 AM

