Update: Digital Archaeology and Forensics

At the beginning of this year we reported on the first results of a joint Archives New Zealand and University of Freiburg project to recover data from a set of 5.25 inch floppy disks from the early 1990s. After the raw bitstreams had been recovered from the floppy disks with a special hardware device, the resulting image files were sent to Freiburg for further analysis. Having established the file lists contained on each floppy, it is now possible to extract individual files.

Filesystem Interpreter

Of course it was already possible to peek at the probable file contents before by opening a floppy image file in a hex editor. However, this makes it very complicated, especially for non-text files, to distinguish file boundaries. Depending on the filesystem used, a file is not necessarily stored in consecutive blocks on the storage medium.
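To give an impression of what such a manual inspection looks like, here is a small hex-dump helper in Python (a sketch only, not a tool used in the project; the default offset and length are arbitrary) that prints the classic offset, hex and ASCII columns for a slice of a raw image:

import sys

def hexdump(path, offset=0, length=256, width=16):
    """Print an offset / hex / ASCII view of a slice of a raw floppy image."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    for i in range(0, len(data), width):
        chunk = data[i:i + width]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        asciipart = "".join(chr(b) if 0x20 <= b < 0x7f else "." for b in chunk)
        print(f"{offset + i:08x}  {hexpart:<{width * 3}} {asciipart}")

if __name__ == "__main__":
    hexdump(sys.argv[1])

Running something like this against an image quickly reveals readable text fragments, but, as noted above, it says nothing about where one file ends and the next begins.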

For the purposes of the Archive and the public institution donating the data it is not necessary to re-implement the filesystem driver of the old platform on a recent one, as most probably nobody wants to write files to floppy disks for this architecture again. Nevertheless, a thorough understanding of the old filesystem is required in order to write a tool that can perform at least basic filesystem operations such as listing the contents of a directory and reading a specific file. For fast prototyping, and because processing speed and efficiency are not an issue here, the Python scripting language was chosen by the student undertaking this task for his thesis. After the first implementation step of reading the directory contents, the second step of reading the actual files was achieved.
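The actual on-disk structures were reconstructed from the documentation mentioned below, so the following Python fragment is only a minimal sketch of the approach: the field names, offsets, directory layout and block size are illustrative assumptions, not the real structures parsed in the thesis. It shows the two steps in question, listing directory entries and extracting a file whose data may be spread over non-consecutive blocks:

import struct

BLOCK_SIZE = 512  # assumed block size, purely illustrative

def read_image(path):
    """Load the complete raw floppy image into memory (640 KByte fits easily)."""
    with open(path, "rb") as f:
        return f.read()

def list_directory(image, dir_offset, entry_size=32, max_entries=64):
    """Parse fixed-size directory entries: name, start block, length (assumed layout)."""
    entries = []
    for i in range(max_entries):
        raw = image[dir_offset + i * entry_size : dir_offset + (i + 1) * entry_size]
        name = raw[:16].rstrip(b"\x00").decode("ascii", errors="replace")
        if not name:
            continue  # unused directory slot
        start_block, length = struct.unpack_from("<HI", raw, 16)
        entries.append({"name": name, "start": start_block, "length": length})
    return entries

def extract_file(image, entry, block_map):
    """Follow the (possibly non-consecutive) chain of blocks and concatenate the data."""
    data = bytearray()
    block = entry["start"]
    while block is not None and len(data) < entry["length"]:
        offset = block * BLOCK_SIZE
        data += image[offset : offset + BLOCK_SIZE]
        block = block_map.get(block)  # next block in the chain, None at the end
    return bytes(data[: entry["length"]])

The real implementation has to know where the File Header Blocks and the allocation information live on the disk; once that is understood, listing and extraction reduce to straightforward byte slicing along these lines.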

Fortunately the project was started early enough that all relevant information coming from one specific site on the net (www.ctosfaq.com) was recovered in time. This site has since gone down and left no relevant traces either in the Internet Archive or in the publicly accessible caches of the search engines. This is a telling example of the challenges digital archaeologists face, and it suggests a recommendation for the future: store all relevant information on a past computer architecture within the memory institutions themselves and do not rely on the net too much.

Preliminary Results

The recovery experiment was run on 62 disk images created by the team in New Zealand. In three of those 62 images the File Header Block was unreadable. Two of the failing images were only half the size of the rest, 320 KByte instead of 640 KByte. This issue led to missing file information such as the address of a file on the image and the file length. For the third failing case it is still somewhat unclear why the File Header Block is unreadable. This leaves a total of 59 readable images containing 1332 identifiable files. The text content of the failing disk images was transferred into a single text file per image. At the moment these issues are being investigated together with the manufacturer of the reading device. It might be possible to tweak the reading process and extract more information that way to recover the missing pieces of the failing images. This might also lead to some deeper insight into the procedure and some best practice recommendations.
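The post does not describe how the text content of the failing images was dumped; one plausible way, assuming a strings-like extraction of printable ASCII runs from the raw image, is sketched below (the function and argument names are ours, not from the project):

import re
import sys

def salvage_text(image_path, out_path, min_len=4):
    """Write all runs of printable ASCII from a raw image to a single text file."""
    with open(image_path, "rb") as f:
        data = f.read()
    # Runs of at least min_len printable characters, similar to the Unix strings tool.
    runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)
    with open(out_path, "w", encoding="ascii", errors="replace") as out:
        for run in runs:
            out.write(run.decode("ascii") + "\n")

if __name__ == "__main__":
    salvage_text(sys.argv[1], sys.argv[2])

Such a dump does not recover file names or boundaries, but it preserves whatever human-readable content survives even when the File Header Block cannot be interpreted.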

9 Comments

  1. ecochrane
    April 8, 2012 @ 7:59 am CEST

    Stanford University’s VLSI Research Group have just released a new database of information on processors. It is available to view and download (as multiple csv tables) here.

  2. Dirk von Suchodoletz
    April 3, 2012 @ 9:26 am CEST

    There is quite a lot of metadata required to describe a system completely. Much of this information is already out there, available from different sources and in different formats and representations: the aforementioned web sites, linked data of any kind, the OPF Registry suggested by Maurice van den Dobbelsteen and Bill Roberts, or existing metadata standards like PREMIS or TOTEM. The information needs to be machine readable and somehow standardized so that it can either be used directly or imported (semi-)automatically into e.g. TOTEM. Most probably new requirements will arise from ongoing projects like bwFLA, which might require additional information fields. Nevertheless, it would be desirable to follow a (moderated) wiki approach so that numerous sources could contribute, and it would make sense to re-use existing initiatives rather than starting just the next service of this kind. OPF would be a good place for hosting and moderating.

  3. ecochrane
    March 28, 2012 @ 8:39 pm CEST

     The IT History Society have just launched a new IT hardware database that is available here:

     http://www.ithistory.org/hardware/hardware-name.php

    It is intended to be “the most comprehensive and exhaustive database of IT Hardware”.

  4. bram van der werf
    March 18, 2012 @ 4:28 pm CET

    I would very much agree that having such an independent body would be beneficial. Looks like we have a reasonable understanding of the challenge and a clear view on a potential solution. Documentation for a process is like metadata for an object: even though you lose the actual outcome of a process, you will still be able to recreate or regenerate something of the original, and in software actually something very close to the original. Your independent body would provide a similar service to commercial escrow services, only a bit more complex and with a wider scope.

    Preserving, managing and maintaining this type of information/documentation is one step towards a solution. Creating “bodies” that actually do manage and maintain standards and registries will only work if stakeholders are willing to invest money. Otherwise it will become exactly what you pointed out (just another website or wiki with some information).

    For many years already the digital preservation community has been having a chicken-or-egg type of conversation on services, registries and other reliable information sources that should be managed and maintained by a “body”.

    Over the past decades it has become clear that the commercial sector perceives limited to zero value in financing these types of services and registries. If anyone besides memory institutes had been prepared to pay for these services, they would already exist. This leaves memory institutes as the main stakeholders and potential users of these services. I think you need to mobilize at least 20-30 other stakeholders with long-term commitment to establish the business case for such an archiving service, else there will be no critical mass to sustain that software documentation body. Once you have this critical mass (budget), you should try to find an administrative, operational and legal home for it.

  5. ecochrane
    March 15, 2012 @ 1:47 am CET

    We at Archives New Zealand are really happy with this result.

    Because of this work we have been able to ascertain that the data on the disks is not particularly valuable. It turned out that in this case the data on the disks was a duplicate of print-outs that the agency still has. This is a great result because it enables the agency to make a justified decision to destroy the disks without any risk of loss. It will also help us to process similar disks in the future. 

    The most troublesome part of this process was that the only way the student was able to understand the file system was to use documentation that has subsequently disappeared from where we found it on the internet. As Dirk identifies above, that highlights the need for some “body” to independently (not just on some site on the internet) preserve and make available all the system and software documentation for old digital technologies. Without that documentation this kind of work would be much more difficult if not impossible, and this documentation is rapidly disappearing.

