Project to Identify files with linked dependencies

Many office suites and other applications allow information to be embedded in a document via a link to another file. The use of linked spreadsheets is common amongst data-intensive agencies, and large documents are often managed by linking multiple office documents to form a single final product.

Currently we have only anecdotal evidence of the prevalence of linked files in the digital universe. It would be useful to understand the scale of the issue and to measure how common linked files are in the material that we ingest. Archives New Zealand and Victoria University have recently initiated a project that we hope will go some way towards achieving this.

A student from the School of Engineering and Computer Science at Victoria University recently started work on a new summer project at Archives New Zealand. The student, Niklas Rehfeld, is funded through a summer scholarship jointly provided by Archives New Zealand and Victoria University.

Over the next 10 weeks Niklas will be working on a project to investigate linked files and build a tool to identify them. Specifically, the aim of this project is to develop a prototype tool to identify when computer files formatted in the Microsoft Office 1997-2003 formats link to other computer files and which files they link to (in order to identify the component files that make up the complex digital object).
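
As a hedged illustration of where such a tool might start: the Office 97-2003 binary formats are stored in OLE2 Compound File containers, which begin with a fixed eight-byte signature. The sketch below (Python, standard library only; the function names are made up for illustration and are not part of the project) checks for that signature before any link analysis is attempted.

```python
# Signature that opens every OLE2 Compound File, the container used by
# the Office 97-2003 binary formats (.doc, .xls, .ppt).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole2(header):
    """Return True if `header` (the first bytes of a file) starts with the signature."""
    return header[:len(OLE2_MAGIC)] == OLE2_MAGIC


def is_ole2_file(path):
    """Read only the first eight bytes of `path` and test them."""
    with open(path, "rb") as handle:
        return looks_like_ole2(handle.read(len(OLE2_MAGIC)))
```

A positive result only says the file is a compound file; it does not say which Office application wrote it or whether it contains links.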

The technical work will involve the following:

- Analysis of the Microsoft specifications to determine how document linking, and other metadata that may be of use for preservation purposes, is implemented for Word, Excel and PowerPoint documents for the period 1997-2003.
- Review of existing frameworks and related tools such as the open source “format identification, validation, and characterization” tool JHOVE.
- Writing a specification for a modular tool for identifying linked documents given a root Microsoft Office document. The specification will include an evaluation of the feasibility of extending an existing tool versus creating a standalone implementation from scratch.
- Implementation of a prototype tool for at least one document format. Time permitting, the tool will be extended to handle either a wider range of document formats or a wider range of preservation metadata.
- Testing of the tool against a selection of files supplied by National Archives.
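
The idea of starting from a root document and identifying everything it depends on can be sketched as a graph walk. This is only an illustration under stated assumptions: `find_links` is a hypothetical callable that returns the link targets of one file, and the link table is invented sample data standing in for a real parser.

```python
from collections import deque


def collect_dependencies(root, find_links):
    """Breadth-first walk from `root`; return every file reachable via links.

    `find_links` is a hypothetical callable: path -> iterable of linked paths.
    Files are returned in the order they are first discovered.
    """
    seen = {root}
    queue = deque([root])
    discovered = []
    while queue:
        current = queue.popleft()
        for target in find_links(current):
            if target not in seen:  # guards against circular links
                seen.add(target)
                discovered.append(target)
                queue.append(target)
    return discovered


# Invented sample data, not taken from any real collection.
EXAMPLE_LINKS = {
    "report.doc": ["budget.xls", "slides.ppt"],
    "budget.xls": ["rates.xls"],
    "slides.ppt": ["budget.xls"],
    "rates.xls": ["budget.xls"],  # circular link back to budget.xls
}


def example_find_links(path):
    """Stand-in for a real link extractor."""
    return EXAMPLE_LINKS.get(path, [])
```

Calling `collect_dependencies("report.doc", example_find_links)` returns the three component files once each, even though two of them link to each other.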

This project is a research project first and foremost. There is no guarantee that a working tool will be produced from it. However, if a useful tool is produced, the intention is to release it as an open source product that anyone can incorporate into their preservation workflows.

Niklas’s first steps will include looking at existing tools that could be extended to perform the function outlined above. These include JHOVE, JHOVE2, DROID and the National Library of New Zealand Metadata Extraction Tool. In addition, he will be investigating the available Java libraries that he may be able to use for this purpose.

If anyone in the OPF community has advice on the best places to start with this project, or any other advice they would like to offer, we would greatly appreciate it. We have a number of good leads already but would welcome any help the community could offer.

1 Comment

nik
January 17, 2012 @ 2:36 am CET
Hi everyone again,

Another quick update on the status of the project.

As could be expected, immediately after posting the last update, it turned out that there are a whole lot of other ways of linking that had not been considered… So now implementations have been written for the following link types:
- Formulas, Pivot Tables, Chart Source Ranges in XLS
- Normal linked documents in PPT, XLS and DOC files
- Text and Shape links in PPT files.
There are a couple more that I have found out about, that hopefully will be implemented in the next couple of weeks.
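
As a rough illustration of the formula case only (this is not the project's code): when an Excel formula is displayed as text, a reference to another workbook puts the workbook name in square brackets, optionally preceded by a folder inside single quotes. The sketch below pulls those names out of formula text. It is an assumption-laden simplification: the binary XLS file stores these references differently, and sheet names containing quotes are not handled.

```python
import re

# Two textual forms of an external reference:
#   'C:\folder\[Book.xls]Sheet1'!A1   (quoted, may carry a folder)
#   [Book.xls]Sheet1!A1               (unquoted, same folder)
EXTERNAL_REF = re.compile(
    r"'(?P<folder>[^'\[\]]*)\[(?P<quoted_book>[^\[\]]+)\][^']*'!"
    r"|\[(?P<book>[^\[\]]+)\][^'!\s]*!"
)


def external_workbooks(formula):
    """Return the workbooks referenced in `formula`, in order, without repeats."""
    found = []
    for match in EXTERNAL_REF.finditer(formula):
        if match.group("quoted_book") is not None:
            name = match.group("folder") + match.group("quoted_book")
        else:
            name = match.group("book")
        if name not in found:
            found.append(name)
    return found
```

For example, the formula text `='C:\Reports\[Budget.xls]Sheet1'!A1+[Rates.xls]Q1!B2` yields `C:\Reports\Budget.xls` and `Rates.xls`.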

I have also been working on splitting the code into two different parts, a user program and an API, so that extending and integrating it will be a bit easier.
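
A minimal sketch of what such a split can look like, assuming hypothetical names throughout (`list_links`, `format_report` and `main` are illustrative, not the tool's actual interface): the API function returns plain data, and the user program only handles arguments and printing.

```python
import argparse


def list_links(path):
    """API layer: return the link targets found in one document.

    Hypothetical stub: a real implementation would parse the file here.
    """
    return []


def format_report(path, links):
    """Presentation helper used only by the command-line front end."""
    if not links:
        return f"{path}: no links found"
    lines = [f"{path}: {len(links)} link(s)"]
    lines.extend(f"  -> {target}" for target in links)
    return "\n".join(lines)


def main(argv=None):
    """User program: parse arguments, call the API, print a report."""
    parser = argparse.ArgumentParser(description="List linked files (sketch).")
    parser.add_argument("documents", nargs="+", help="documents to inspect")
    args = parser.parse_args(argv)
    for document in args.documents:
        print(format_report(document, list_links(document)))
    return 0
```

Because `list_links` returns a list rather than printing, another tool could call it directly and ignore the command-line wrapper entirely.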

I will try and get it up on the web some time soon, once I have checked the documentation and fixed a couple of bugs that I know about, so that if anyone is interested in trying/testing it out they can.

Nik
