Archiving Complete Environments for Complex Objects

Digital objects are often more complex than the common perception of them as individual files or small sets of files. Standard digital preservation methods can lose important parts of digital objects, or their context. Interestingly, thousands of miles apart, Maurice van den Dobbelsteen (The Netherlands) and Euan Cochrane (New Zealand) simultaneously proposed a new approach to cope with the special requirements of their National Archives in dealing with different types of complex objects. They envisage preserving the whole original environment in which these objects were made. Instead of bothering with fine-grained detection of file formats and properties, they propose to preserve the prototypical office desktop machine of an era, as used in public and governmental offices.

The system imaging intervals are defined by what could be called “technology generations”: combinations of computer architecture, operating system and standard applications. In the early 1990s this could have been a 386 or 486 PC running Windows 3.x or NT with an office package such as Microsoft Excel and WinWord, or AMI Pro and Lotus 1-2-3, installed. Alternatively, the company or institution might have used an Apple Macintosh-based office network with Mac OS 6 or 7 running on Quadra, Centris or Performa machines based on the Motorola 68000 line of CPUs. Digital workplaces were usually upgraded at intervals of 3 to 5 years, depending on the organization or company. To capture all relevant changes you might want to image yearly to be on the safe side. Translated into today's context, this means starting by preserving a standard desktop environment: a 32-bit AMD or Intel machine, Windows XP, Office 2003, Internet Explorer 7 or Firefox, and the most current plugins and readers. After that, a two-pronged strategy should be followed: keep making yearly images of current environments, and start working backwards to older environments when funding allows. This approach would complement many web archiving strategies very well, as it captures the prototypical web browsing environment of a certain era, containing all the relevant codecs and media players.

The majority of today’s digital objects consist of individual files, but most of those files are not self-contained. The objects are often more complex, especially when dealing with special-purpose software on office and scientific desktops, with electronic publications that have primary research data attached, or with dynamic and interactive objects such as computer games or educational systems for online learning. In every case a certain digital ecosystem is required to access or run them. The easiest way to preserve the entirety of the information is to replicate and preserve the complete original rendering environment. Besides practical considerations, there are further reasons to preserve complete original environments:

  • To provide researchers the ability to experience the old information environments of individual users, such as politicians, artists, scientists and other famous persons.
  • To provide researchers the ability to re-run representative users’ working environments, such as the aforementioned prototypical office desktop from a particular time period.
  • To preserve complex digital objects in an inexpensive and efficient way by enabling the automation of their preservation.
  • To produce permanent “viewers” for digital objects that can easily be maintained over the long term and are known to be compatible with the objects.

All of these cases benefit from shifting our understanding of digital objects up from single files or small groups of files, as they are currently conceived, to full computer systems.

The concept is not completely new. It was demonstrated successfully by the team at Emory University’s Manuscript, Archives, and Rare Book Library (MARBL), where they have preserved an image of the hard disk from Salman Rushdie’s early 1990s Macintosh desktop and use an emulator to access it. A group of researchers from the Universities of Maryland, Albany, Texas at Austin and Emory recommends disk imaging of comprehensive systems for future scholarship and research. To demonstrate the versatility of the emulation approach for my PhD thesis, in 2007 I preserved a complete x86 Linux machine running a MySQL database and a CMS by dumping it and re-running it in a virtual machine. It would not have been possible to extract the database information and the website rendering application and preserve them by migration in a meaningful way.

Unfortunately, imaging and maintaining access to old computer desktops is still a niche endeavor for a number of reasons, including its perceived complexity and difficulty for average preservation practitioners without technical training. To prove the opposite, we ran a number of system imaging experiments on different Microsoft operating systems running on a wide range of x86 machines at the National Archives of New Zealand. The workflow derived from these experiments is presented at this year’s iPRES. It includes numerous steps that could be highly automated to make the process easily manageable by an average archivist, librarian or other digital preservation practitioner.
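
One step that lends itself to automation is capturing the raw disk image together with a checksum for later fixity checks. The following Python sketch is only an illustration under stated assumptions, not the workflow presented at iPRES: the function names `image_disk` and `verify_image`, the choice of SHA-256 and the chunk size are all hypothetical. On a real machine the source would be a block device read with appropriate privileges (ideally through a write blocker); here any readable file works.

```python
import hashlib


def image_disk(source: str, target: str, chunk_size: int = 1024 * 1024) -> str:
    """Copy `source` byte for byte into `target` and return its SHA-256 digest.

    Hypothetical helper: `source` may be a block device path or a plain file.
    """
    digest = hashlib.sha256()
    with open(source, "rb") as src, open(target, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # end of device or file reached
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


def verify_image(target: str, expected_digest: str, chunk_size: int = 1024 * 1024) -> bool:
    """Re-read the stored image and compare its digest with the recorded one."""
    digest = hashlib.sha256()
    with open(target, "rb") as img:
        for chunk in iter(lambda: img.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_digest
```

The digest returned by `image_disk` would be stored alongside the image, so that `verify_image` can later confirm the archived copy is unchanged before it is handed to an emulator.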

The focus then shifts from characterizing objects in an attempt to make them reproducible in completely different digital ecosystems to preserving the whole original environment in which the objects were created, managed or viewed. With the system imaging approach, no specific knowledge of the object or of the application that created it is required. As emulation and virtualization of the x86 architecture are well established, the described method might be used to ease certain preservation workflows. In this way, the diverse methods currently used to handle different types of digital objects might be simplified into a standard procedure for preserving a whole computer.

Nevertheless, additional information and metadata have to be gathered to make the strategy work. Licenses to run the installed software packages need to be handed down to, or obtained separately by, the memory institutions. In addition, a software archive of all relevant software components is required, so that the hardware drivers needed to make imaged systems bootable again are available. The information on computer architectures, imaged systems and software eras, as well as their dependencies, should be maintained in a proper tool registry. Such a registry could be used to “tag” (timestamp) the system images and to look up which image should be loaded for a specific digital artefact. The approach can be applied to many preservation scenarios and policies, and even offers potential for addressing issues around web archiving.
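
To make the registry idea more concrete, here is a minimal Python sketch of tagging system images with an era and looking up the image to load for a given artefact. Everything in it is an assumption made for illustration: the `SystemImage` record, the `find_image` function, the image identifiers and the era dates are hypothetical and do not describe an existing registry. Only the example software combinations are taken from the text above.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SystemImage:
    """One registry entry: an imaged system tagged with its technology era."""
    image_id: str
    architecture: str
    operating_system: str
    applications: Tuple[str, ...]
    era_start: date  # illustrative tag, not a historical claim
    era_end: date


# Hypothetical registry contents with made-up identifiers and era boundaries.
REGISTRY: List[SystemImage] = [
    SystemImage("img-win3x", "x86 (486)", "Windows 3.x",
                ("Lotus 1-2-3", "AMI Pro"),
                date(1991, 1, 1), date(1995, 12, 31)),
    SystemImage("img-winxp", "x86 (32-bit)", "Windows XP",
                ("Office 2003", "Internet Explorer 7"),
                date(2003, 1, 1), date(2009, 12, 31)),
]


def find_image(created: date, application: str,
               registry: List[SystemImage] = REGISTRY) -> Optional[SystemImage]:
    """Return the first image whose era covers `created` and which has
    `application` installed, or None if no registered image matches."""
    for image in registry:
        in_era = image.era_start <= created <= image.era_end
        if in_era and application in image.applications:
            return image
    return None
```

For example, `find_image(date(2006, 5, 1), "Office 2003")` selects the hypothetical `img-winxp` entry, while an artefact dated outside every registered era yields `None`, signalling that no suitable environment has been imaged yet.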
