System imaging, that is, dumping the permanent storage of a computer system in order to re-run it in an emulator, is a viable option for preserving complete digital environments. These include complex digital artefacts, the machines of famous persons, electronic lab books from research in the natural sciences, and the software and hardware development environments of software companies, to mention just a few. These artefacts or complex environments (referred to below as “preservation targets”) are typically built for special purposes and often include highly customised configuration or unique user programming. Thus, they often cannot easily be migrated into a preservable format without risking the loss of significant information or of the context provided by the look and feel of the system and the user’s interaction with it. When migration is needed, e.g. because of system obsolescence or organizational changes, significant resources and associated costs are often required to confirm that all information and context is preserved post-migration.
With the ongoing development of emulation as a preservation and access strategy, the number of successfully preserved complete systems is rising. After a couple of simple feasibility studies, more advanced systems were tackled and more knowledge about the required pre- and post-imaging processes was acquired. The process involves some curatorial challenges, but with a tailored set of tools, system preservation can be carried out by ordinary archivists and practitioners in the field.
The series of experiments started in Freiburg during the PLANETS project with the preservation of a Linux database system that served as the back-end for the CMS of a university’s web site. Another, by now famous, example is Salman Rushdie’s Apple Macintosh, preserved by MARBL. During a research stay at Archives New Zealand, the strategy was tested and extended to several Windows systems dating from the early 1990s to the mid-2000s. Further experiments widened the scope to include systems such as a SUN Sparc Station Classic, a networked IBM DB2 system running on OS/2 2.1 and a late 1990s Apple iMac. Some of the experiments were straightforward and produced directly usable results, while others needed more tweaking.
The Imaging Process
The procedure of imaging the permanent storage (usually a hard disk) of a machine in order to re-run it in a virtual machine has been implemented for several commercial virtual machines, for example by the VMware Converter for VMware. It is limited to some popular systems. However, it demonstrates the steps needed to prepare the system to be imaged for the transfer into a virtualized hardware environment. More generally, this approach was outlined during the Archives NZ experiments, published at last year’s iPres. Depending on the actual machine, different methods are available to dump the hard disk. The in-system, non-intrusive method boots a networked Mini Linux operating system on the preservation target. The intrusive method involves removing the hard disk and connecting it to a suitable imaging system. Such a system would provide the appropriate connectors, like SCSI or IDE, if the disk cannot be attached via USB or another external connector. The steps involved were first formalized during the Archives NZ Windows system preservation experiments and were refined during a bachelor project run jointly by Victoria University and the Archives. At the same time, the bwFLA project is developing workflows to automate system imaging for different hardware architectures.
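The core of either dumping method is a block-wise copy of the source device into an image file. The following is only a minimal sketch of that step, not the tooling used in the experiments: the function name dump_device, the default block size and the choice of SHA-256 as a fixity checksum are assumptions made for illustration, and the sketch does not recover from read errors the way dd_rescue does.

```python
import hashlib


def dump_device(source_path, image_path, block_size=512 * 1024):
    """Copy source_path to image_path block by block.

    Returns (bytes_copied, sha256_hex). The checksum is an illustrative
    addition so the image can be verified after later transfers.
    """
    digest = hashlib.sha256()
    copied = 0
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:  # end of device or file reached
                break
            dst.write(block)
            digest.update(block)
            copied += len(block)
    return copied, digest.hexdigest()
```

On a Linux imaging host, source_path would typically be a block device node (for example a hypothetical /dev/sdb) opened with sufficient privileges, and the source filesystem should not be mounted writable while the dump runs.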
Success and Partial Success Stories
The first experiment, run during the PLANETS project in 2007, looked at a MySQL database system running on SuSE Linux 8.0 on 32-bit X86 hardware with a RAID over three SCSI disks. To avoid disturbing filesystem I/O during dumping, the filesystem was re-mounted read-only. The dump ran without problems from the machine over the network onto a receiving machine. In this way, the existing RAID configuration was used to produce a single block device image. The resulting image was converted to the VMware image format using the qemu-img tool. Loaded into the virtual machine, the VMDK image booted only partially, as the virtual SCSI subsystem used a different controller. To fix this, the install CD of the original operating system was booted and the initial RAM filesystem was rebuilt for the different controller, omitting any RAID setup. Finally, the system ran successfully in virtualized form.
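The conversion step can be illustrated with a short wrapper around qemu-img. The exact invocation used in the experiment is not documented here, so the sketch below assumes a raw input image and uses the standard convert subcommand with -f (input format) and -O (output format); the function names and file names are hypothetical.

```python
import subprocess


def qemu_img_convert_command(raw_image, vmdk_image):
    """Build the qemu-img call that converts a raw dump into a VMDK image."""
    return ["qemu-img", "convert", "-f", "raw", "-O", "vmdk",
            raw_image, vmdk_image]


def convert_to_vmdk(raw_image, vmdk_image):
    """Run the conversion; raises CalledProcessError if qemu-img fails."""
    subprocess.run(qemu_img_convert_command(raw_image, vmdk_image), check=True)
```

A call such as convert_to_vmdk("server.img", "server.vmdk") requires qemu-img to be installed on the imaging host. Keeping the command construction separate from its execution makes it easy to log the exact command as part of the preservation metadata.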
Another famous case was the reproduction of an Apple Macintosh used by the well-known author Salman Rushdie: a Performa 5400 all-in-one machine equipped with a PowerPC 603e processor. MARBL at Emory University acquired Rushdie’s archives and wanted to make the author’s desktop available to researchers as completely as possible.
The Archives NZ experiments on Windows preservation targets were comparatively easy. Several computers from the mid-1990s to the mid-2000s were set up with Windows 3.11 to Windows 98. The disks were imaged in two ways: on a separate dumping machine, using a USB adaptor where appropriate, and by booting a Mini Linux to dump the disk content over the local network. All resulting images were bootable, except for the Windows 98 image. The latter required a new MBR to be written and the system files to be transferred again (sys c:) before it became bootable. In all images, the drivers for graphics, networking and audio had to be exchanged in order to get screen resolutions higher than 640 by 480 pixels, audio output and a working network connection. Windows 95 and 98 also required a couple of standard drivers for the mainboard and IDE to be re-initialized. The required driver sets depend on the emulator or virtual machine used.
The X86 emulator QEMU, for example, offers Cirrus Logic and other types of VGA adaptors, different types of audio cards, including the well-known Sound Blaster series, and a couple of different network cards, emulating AMD PCnet, RTL8029, RTL8139 or NE2000. The Windows 3.11 system was too old for VMware and VirtualBox, so a generic SVGA driver (the original Microsoft SVGA driver patched to support the VESA specification) had to be used to obtain a resolution of 1024×768 with a 256-color palette. Higher resolutions or greater color depths are not possible in these setups. With full Microsoft networking enabled, the driver causes a partial crash during Windows boot in VMware, but not in VirtualBox. Another problem occurs after installing the Sound Blaster 16 audio driver to replace the driver for the original ESS hardware: everything works well within Windows, but the system can no longer shut down properly.
To widen the list of architectures, a later experiment tried to preserve a Sparc Station Classic with a SCSI disk, running Solaris 5.5.1. The disk was attached to an Adaptec controller in a Linux host and the dump ran smoothly. In the Sparc variant of QEMU on Ubuntu 10.04, the system started and produced a couple of lines of output. However, just after the version message of the Solaris operating system became readable, the emulator crashed because of an illegal CPU instruction. The emulator itself ran successfully with a Linux variant compiled for Sparc Stations. Unfortunately, Sparc is no longer a very popular platform, so support for the architecture is poorer than for more popular ones. Fixing the emulator should, however, fix the boot process of the imaged system.
A larger project, because it involved another less popular system running in a client-server configuration, was the reproduction of a networked OS/2 database server connected to a couple of clients via TCP/IP over Token Ring. The server had a SCSI disk, while the client machines were equipped with IDE disks. All machines had been installed in 1994. An intrusive method was used to dump the disks, as neither Ethernet adaptors nor an easily bootable device such as a CD-ROM drive were available. The SCSI disk was easily dumped using the procedure applied to the Sparc disk.
Another experiment at Archives NZ was the virtualization of a Windows 2000 machine running an MS-SQL database. It was part of a research project at the Archives to investigate the best approaches for appraising, transferring (where relevant) and preserving databases. The database had a custom HTML front-end and consisted of a digital index to a set of paper records transferred earlier from an agency. The migrated and virtualised (or emulated) database may now be used in the Archives’ reading room(s) to aid users in discovering and accessing the transferred paper records.
IDE Disk Challenges for Older Drives
In a couple of experiments, the machines had to be disassembled to remove the drives, because they could not be booted with a suitable Mini Linux or no suitable network connection was available. Contrary to expectations, since most people would assume IDE to be the more widespread and compatible interface, dumping parallel IDE (PATA) disks proved more challenging than dumping their SCSI counterparts. Usually, a USB IDE adaptor, taken either from an external disk case or in the form of a special “adaptor cable” that typically comes with an external power supply, is a convenient solution. In this case, the disk can be attached directly to the imaging machine.
If the disk is really old, meaning from before the mid-1990s, the USB adaptor does not communicate properly with the disk to obtain the capacity information. In the Archives NZ experiments, a couple of older laptop disks failed in this way, as did the 240 MByte disks of the OS/2 clients. Even IDE channels directly on system motherboards did not recognize these disks properly. Typical signs of such a failure are a reported capacity of 2 TByte on USB or disks that remain undetected by the BIOS on mainboard channels. Finally, a machine with an older Intel 865 chipset was able to detect the disks properly. There seem to be firmware issues, as all newer disks (usually of one gigabyte upwards) are properly detected both on USB and on internal IDE channels. Unfortunately, there was no IDE adaptor card available, comparable to the SCSI adaptor, that would circumvent the problem of having to preserve a complete machine for dumping early IDE disks.
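Because a misdetected disk may still appear as a device, it can help to check the reported capacity before starting a dump. The sketch below is an assumption-laden illustration rather than part of the original workflow: reported_size and capacity_is_plausible are hypothetical helper names, and the upper bound has to come from what is known about the drive, for example its label.

```python
import os


def reported_size(path):
    """Return the size in bytes that the OS reports for a device or file."""
    with open(path, "rb") as dev:
        return dev.seek(0, os.SEEK_END)


def capacity_is_plausible(size_bytes, expected_max_bytes):
    """Reject zero sizes and sizes above what the drive label allows."""
    return 0 < size_bytes <= expected_max_bytes
```

For a 240 MByte disk, a reported size in the 2 TByte range would fail such a check with any sensible upper bound and indicates that the adaptor did not read the drive geometry correctly.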
OS/2 Imaging Oddness
For reasons not yet understood, the system imaging of OS/2 2.1 IDE disks did not work completely as expected. Several rather similar clients were available for testing, and the problem was exactly the same with each of the client disks dumped. The issue with the dumped disk images seems to be a structural one, as different partition layouts resulted in the same kind of problem: the partition table could not be read completely, and the extended partition information in particular was garbled. A couple of experiments were run, such as copying the disk 1:1 to a much larger IDE disk (13 GByte in size). The results were as expected: the partition table listed exactly the same layout and every partition was mountable in the Linux system used, although some filesystem errors, which should not be there, were reported for the first partition. To investigate the imaging problem, the procedure was re-run with dd_rescue (without specifying a block size). As could be expected, this produced exactly the same image as dd with a block size of 512 bytes. A test run with a block size of just 1 byte resulted in an image that contained a readable partition table and behaved in exactly the same way as the image taken the normal way (512-byte block size) from the 13 GByte disk. This image differed from the other ones, as a comparison with the diff utility showed. The results are somewhat disturbing, as one would usually expect a proper image from the usual dd or dd_rescue process. Nevertheless, no similar problem occurred when imaging the SCSI disk of the same operating system.
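Two checks from this investigation are easy to script: reading the primary partition table from the first sector of an image, and locating the first byte at which two images differ. The following is a sketch under the assumption of a classic MBR layout (four 16-byte entries at offset 446, signature 0x55AA at offset 510); the function names are hypothetical, and the chain of logical partitions inside an extended partition is not followed.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446           # start of the four primary partition entries
ENTRY_SIZE = 16
EXTENDED_TYPES = {0x05, 0x0F, 0x85}


def read_primary_partitions(image_path):
    """Return the non-empty primary partition entries of an MBR image."""
    with open(image_path, "rb") as img:
        sector = img.read(SECTOR_SIZE)
    if len(sector) < SECTOR_SIZE or sector[510:512] != b"\x55\xaa":
        raise ValueError("no valid MBR signature in first sector")
    partitions = []
    for index in range(4):
        start = TABLE_OFFSET + index * ENTRY_SIZE
        entry = sector[start:start + ENTRY_SIZE]
        # status byte, 3 CHS bytes, type byte, 3 CHS bytes, LBA start, length
        status, ptype, first_lba, count = struct.unpack("<B3xB3xII", entry)
        if ptype == 0:       # unused table slot
            continue
        partitions.append({
            "index": index,
            "bootable": status == 0x80,
            "type": ptype,
            "extended": ptype in EXTENDED_TYPES,
            "first_lba": first_lba,
            "sectors": count,
        })
    return partitions


def first_difference(path_a, path_b, block_size=4096):
    """Return the first differing byte offset, or None if files are equal."""
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a, block_b = a.read(block_size), b.read(block_size)
            if block_a != block_b:
                for i, (x, y) in enumerate(zip(block_a, block_b)):
                    if x != y:
                        return offset + i
                # one file is a prefix of the other
                return offset + min(len(block_a), len(block_b))
            if not block_a:
                return None
            offset += len(block_a)
```

Knowing the offset of the first difference, rather than only that two images differ, indicates whether the discrepancy lies in the partition table sector or further into the disk.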
Emulator Hardware Issues
The 16-bit version of Windows from the early 1990s, running on top of DOS, is just 25 years old, but is already neglected by the major virtual machines such as VirtualBox and VMware. Both feature a custom-made VGA adaptor, and thus no direct driver support is available for the OS. The only option to get beyond a resolution of 640 by 480 in 16 colors, which might not be enough for a number of artefacts, is to binary-patch the Microsoft SVGA driver for that system. The patch adds VESA 2.0 support for a range of graphics adaptors and allows resolutions of up to 1024 by 768 with 256 colors. Interestingly, the SVGA driver interferes with Microsoft networking (SMB) if enabled in VMware and prevents the machine from booting into Windows normally. This problem does not occur in VirtualBox, though. The situation is much better in QEMU or DOSBox, as both replicate real hardware such as a Cirrus or an S3 video adaptor.
As both types of machines still offer a Sound Blaster 16 and an AMD PCnet network adaptor, direct driver support is still available. For some reason, installing the SB16 sound drivers into the imaged Windows 3.11 installation prevents the system from shutting down properly from then on. The installation had ESS drivers installed when it ran on the real hardware. There was no such issue when adding SB16 to a fresh install.
A similar VGA problem appeared with an imaged OS/2 2.1 installation. The originally installed SVGA driver prevented the machine from starting properly. It had to be reset to standard VGA for the system to boot and display the graphical user interface. This operating system has networking issues in QEMU, as the interrupt handling of the emulated network card does not cooperate properly with the OS driver. It works in VirtualBox, however. VMware does not support OS/2 at all.
Hardware Replication
How closely the original hardware can be replicated depends very much on the preservation target’s architecture. Platforms such as the various home computers, Apple Macintoshes or SUN Sparcs were rather limited in their hardware variance. Thus, for well-defined hardware with limited variability it was comparatively easy to provide an emulator matching the original system. Where the emulation is complete, as for Apple PPC, the system ran directly in the emulator without modification. An open issue is incomplete emulation, for instance for the Sparc architecture. QEMU is able to run a Linux compiled for Sparc, but an imaged version of an original Solaris installation crashes in the early stages of booting. The varying popularity of platforms often results in varying quality of emulation.
A direct hardware replication for X86 will be impossible in most cases, as the variety of real hardware is much wider than the set of emulated devices. X86 is one of the longest-lived architectures, having evolved over more than 30 years by now. Step by step, new components were introduced and existing ones improved. Thus, an X86 machine of the early 1990s is broadly compatible with a machine of today, but Windows 3.11 or Windows 9x will most probably no longer run properly on the latter. The drivers are only part of the problem.
In general, there might be particularly challenging setups that required a lot of CPU power and RAM when originally running. However, for many large systems the future access requirements are usually lower than those of systems in production. It would usually be acceptable to wait longer for a query to complete on a preserved database than on one running in production. In any case, CPU and memory resources for original environments are less of a problem and can often be allocated generously.
Preliminary Conclusions
The range of experiments produced a lot of knowledge about the setup and configuration particularities of the different systems. Much of it should be reusable in similar tasks, simplifying the imaging of systems of the same type. The preservation hardware environment is fixed and limited in most cases, as often only one emulator is available, as for the SUN Sparc or the iMac. For certain hardware architectures in particular, the range of hardware components sold was very limited, which makes emulation much easier and more predictable for emulator programmers. As a result, the target emulator is usually well defined. The preservation target can therefore be prepared for the transfer in advance to make the transition smoother, e.g. by adding drivers for the emulated hardware, removing conflicting drivers from the original setup, resetting the screen resolution and similar steps. A student project at Victoria University and the bwFLA project in Freiburg were run to define standard workflows for the preparation of preservation targets. This includes the gathering of metadata on the original system.
Besides the automatically collectible metadata, information such as software licenses and account passwords has to be acquired, unless the access restrictions are removed (e.g. by cracking or resetting passwords).
From the wide range of experiments run, it can be concluded that some systems are better preservable than others. Operating systems that are prepared for and supported by virtual machines should be preferred where possible. Generally, the procedure can be expected to get easier with newer systems, as they are all network-capable. Nevertheless, heavily secured and proprietary platforms like the Apple iPad may pose new problems here.
Open Research Questions
Even if the issues above, such as incomplete emulation, are solved, several open research questions remain. Which modifications to the preservation target are permissible if the emulated result is still to be considered authentic? How can the success of such a preservation workflow be measured? For migration scenarios, research has produced a couple of answers, such as the concept of significant properties or object description languages for comparing the result against the original.
As the object considered in system imaging is much more complex, it is not yet clear how to measure the success and completeness of the workflows. The process can be seen as a kind of migration affecting certain parts and aspects of the object. Usually, lower-layer components of the hardware-software stack, such as the CPU or drivers, are affected. These changes should not, but definitely can, affect the object of interest, which is usually a certain application such as a piece of digital art, a computer game, a scientific workflow or a business process. As the concepts of traditional migration can most probably be applied only in part, new metrics have to be found that describe complete systems in a way that allows proper (automatic) verification. A subset of the questions concerns acceptable transformations of the original filesystems, such as the cleanup of privacy-related content or changes made to configuration files.