We have recently started some research at Archives New Zealand to investigate the best approaches for appraising, transferring (where relevant) and preserving databases.
As part of this research we will be undertaking case studies of a number of databases. The case studies will cover a number of aspects including, where possible, testing one or more preservation approaches on each database. We hope to publish the results of this research at the end of the project.
As part of one of the case studies we recently migrated an entire Windows 2000 Server machine to both virtualised and emulated hardware. The motherboard and other components failed during transport, so the process proved particularly challenging, but it was ultimately successful.
The machine held an MSSQL database with a custom HTML front-end. The database consisted of a digital index to some paper records. The paper records had been transferred to us earlier from the agency (LINZ) that gave us the database to do research on. The migrated and virtualised (or emulated) database may now be used in our reading room(s) to aid users in discovering and accessing the transferred paper records.
Attached to this post is some relatively brief documentation of the process. This version has had the passwords that were recovered from the machine removed from it.
This example will be included in the case study for this database with a discussion of the value of the process versus the resources required etc. Those aspects are possibly the most interesting for this community but hopefully there is some value in this process documentation also.
April 26, 2012 @ 9:16 pm CEST
Having tried out the Emulation Framework recently, it’s worth pointing out that the QEMU version of this database server could be managed and run via it. The disk images could also potentially be held on a network share and emulated remotely via QEMU and the Emulation Framework.
April 26, 2012 @ 8:43 am CEST
In this particular case the object may not officially be of long-term value and has been migrated to virtual/emulated hardware for research purposes. The database does actually have some short-to-medium-term value as an index to some of the paper records we hold.
The database that is of value is relatively easy to describe and evaluate in relation to disk resizing options. However, your point is quite clear and important: where we do not know what may or may not be of value, we probably shouldn’t start removing things just to make our life easier. So, in cases where this approach may be used, its value will have to be assessed against the cost of preserving such large objects (the disk images). I suspect that in many cases the cost to understand, document, and migrate the database, along with the costs of providing meaningful access to the migrated databases without the custom GUI, may make this emulation/virtualisation approach quite attractive in comparison.
This is not the only possible application of emulation/virtualisation to databases, however, and others may not have this consideration. For example, Microsoft Access databases with customisations in them could be rendered using standard MS-Windows environments with standard MS-Access installations on them. Those standard rendering environments could be preserved once and used to render many millions of Access databases, reducing the per-object storage cost to a minimal value. In other words, if, as you say, we can isolate the object to be rendered from the rendering environment, then we can use standard rendering environments for many different objects and reduce the per-unit cost of storing any “rendered object” to a minimal value. That is a big if for complex databases but should be straightforward for the likes of “stand-alone” MS-Access style databases.
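The per-object argument above can be made concrete with a little arithmetic. The following is only a rough sketch: the function name and the sizes used in the example are illustrative assumptions, not figures measured in this case study. It simply spreads the storage for one shared rendering environment evenly across every object that relies on it.

```python
def per_object_storage_gb(environment_gb, object_gb, object_count):
    """Storage attributable to one object when a single rendering
    environment image is shared by `object_count` objects.

    All three inputs are illustrative assumptions supplied by the caller.
    """
    if object_count < 1:
        raise ValueError("object_count must be at least 1")
    # Each object carries its own size plus an equal share of the environment.
    return object_gb + environment_gb / object_count


if __name__ == "__main__":
    # Hypothetical sizes: a 40 GB environment image and 0.2 GB databases.
    for count in (1, 100, 1_000_000):
        share = per_object_storage_gb(40, 0.2, count)
        print(f"{count:>9} objects: {share:.4f} GB per object")
```

With one object the environment dominates the cost; as the number of objects sharing the environment grows, the per-object figure approaches the size of the object itself.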
On the other hand, if we can’t preserve all the important information through migration then the cost of preserving the disk images will have to be weighed against the value of the important information.
Dirk von Suchodoletz
April 26, 2012 @ 8:24 pm CEST
Regarding the image size reduction, a couple of questions need to be discussed: What exactly is the object which is to be preserved authentically? Are the numerous operations to reduce the original disk image size identity transformations? Could this be proven in an automated way? How much change to the original image at the block level is acceptable? Most probably it depends on the expectations of the donor, the memory institutions and the complexity of the object itself. At the very least, the harmlessness of some standard transformations like defragmentation and image shrinking should be provable, e.g. by counting and fingerprinting the contained files before and after the transformation, including the standard file metadata such as the time of last access. Nevertheless, it might be desirable to remove any additional software components, like an installed office software package, as they are of no interest in a case like this. But this would definitely change the object container significantly.
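The counting-and-fingerprinting check suggested above could be sketched roughly as follows. This is a minimal illustration rather than a tested preservation tool: it assumes the file system inside the disk image has been mounted (ideally read-only) at some directory before and after the transformation, and the function names are made up for the example. It records modification time rather than last-access time, because reading a file to hash it can itself update the access time unless the volume is mounted read-only.

```python
import hashlib
import os


def build_manifest(root):
    """Map each file's path (relative to `root`) to a tuple of
    (SHA-256 hex digest, size in bytes, modification time in whole seconds)."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            info = os.stat(full_path)  # read metadata before touching content
            digest = hashlib.sha256()
            with open(full_path, "rb") as handle:
                # Hash in 1 MiB blocks so large files do not fill memory.
                for block in iter(lambda: handle.read(1024 * 1024), b""):
                    digest.update(block)
            rel_path = os.path.relpath(full_path, root)
            manifest[rel_path] = (digest.hexdigest(), info.st_size,
                                  int(info.st_mtime))
    return manifest


def compare_manifests(before, after):
    """Report files that disappeared, appeared, or changed."""
    common = set(before) & set(after)
    return {
        "missing": sorted(set(before) - set(after)),
        "added": sorted(set(after) - set(before)),
        "changed": sorted(p for p in common if before[p] != after[p]),
    }
```

If all three lists in the comparison are empty, and the file counts match, the transformation left the file-level content unchanged, even though the image differs at the block level.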
April 25, 2012 @ 11:53 pm CEST
I have just taken a look at the size of the data on the 40GB drive. I was going to investigate creating a smaller disk image and copying all the content from that drive to the new drive image.
Unfortunately I found that there were at least 18 GB in use on that drive. So although I could go through with this process it may only save up to 10 GB (the 40 GB drive image is around 30GB as a vmdk file). So the disk image would still be up to 20GB.
It does solve another mystery, however. The search queries on the database were quite slow; this can now probably be explained by the volume of data that the database is searching across.
I’m not sure that this is a unique problem for emulation however. The database involved in this example is large. Whether it is preserved using emulation or a different strategy it will still be very large. Also, were the database smaller there are a number of options for reducing the size of the image that were not investigated in this process. These include automatable options (there are ways to clear blank space on the image file to ensure that the compacting can be done comprehensively).
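One common way to clear blank space before compacting is to overwrite the free space with zeros, so that the compacting tool can recognise those blocks as empty. The sketch below is an assumption about the kind of technique meant here, not the method used in this case study. It would be run against the mounted file system of a working copy of the image, and the function name and the temporary file name are hypothetical.

```python
import os


def zero_fill_free_space(directory, max_bytes=None, chunk_size=1024 * 1024):
    """Write zeros to a temporary file in `directory`, then delete it.

    With max_bytes=None the file grows until the volume is full.
    Returns the number of zero bytes written. Only run this against a
    working copy, never against the sole copy of an image.
    """
    temp_path = os.path.join(directory, "zerofill.tmp")  # hypothetical name
    chunk = b"\0" * chunk_size
    written = 0
    try:
        # Unbuffered, so a full disk is reported by write() itself.
        with open(temp_path, "wb", buffering=0) as handle:
            while max_bytes is None or written < max_bytes:
                size = chunk_size
                if max_bytes is not None:
                    size = min(chunk_size, max_bytes - written)
                try:
                    count = handle.write(chunk[:size])
                except OSError:
                    break  # typically "no space left on device"
                if not count:
                    break
                written += count
    finally:
        # Removing the file returns the now-zeroed blocks to free space.
        if os.path.exists(temp_path):
            os.remove(temp_path)
    return written
```

The `max_bytes` cap exists mainly so the function can be tried out safely on a small scale before letting it fill a whole volume.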
Regarding your comments about automation: there is a lot of room for automation here. The password extraction and the changes that are made by the mergeIDE program could presumably be made using a single Linux distribution. It may even be possible to add drivers using scripts that change the system files on the hard drive image. The only possibly non-automatable part was finding the right location for the front-end HTML application after the migration process. However, the issues there may have been due to my poor understanding of the MSSQL database software and MS Windows 2000 Server rather than a genuine need to change the default configuration. Such systems are fairly standard, and were this to be a solution that was regularly implemented across the digital preservation community, then I’m sure standard documentation and processes would be produced to remove this problem.
Thank you very much for the comments.
Dirk von Suchodoletz
April 26, 2012 @ 7:30 am CEST
Great job! It nicely complements the study on the “simpler” Microsoft operating systems done earlier. It demonstrates the challenges with the newer systems when exchanging the (virtual) hardware components. Those challenges seem to be manageable and the process could become a standard technique to preserve dynamic, interactive environments. This could apply to the FoxPro database experiment reported earlier at this site too and adds to the discussion on strategy options for database preservation.
For the future it would be desirable to streamline the process by automating it. A major issue might still be the gross size of the preserved object. The database itself might be just a couple of hundred megabytes, compared to the preserved system image of 40 gigabytes. Another issue is the knowledge required to make the object conveniently accessible. Here, it would be desirable to cooperate with the donors of the object to acquire the additional knowledge needed to finalize the object for preservation and access.
The procedure showed that a certain “cooperation” from the original operating system is required for the system to be properly preservable. Another issue might be the handover of the licenses for the OS and the database engine. Both should be taken into consideration for objects which should be preservable in the future, and could be made a requirement when implementing a project or putting out a tender.