We’ve been doing legacy disk extracts at Archives New Zealand for a number of years with much of the effort enabling us to do this work being done by colleague Mick Crouch, and former Archives New Zealand colleague Euan Cochrane – earlier this year, we received some disks from New Zealand’s Department of Conservation (DoC) which we successfully imaged and extracted what was needed by the department. While it was a pretty straightforward exercise, there was enough about it that was cool enough to warrant that this blog be an opportunity to document another facet of the digital preservation work we’re doing, especially in the spirit of being another war story that other’s in the community can refer to. We do conclude with a few thoughts about where we still relied on a little luck, and we’ll have to keep that in mind moving forward.
We received 32 180kb 5.25 inch disks from DoC. Maxell MD1-D, single sided, double-density, containing what we expected to be Survey Data circa 1984/1985.
Our goal with these disks, as with any that we are finding outside of a managed records system, is to transfer the data to a more stable medium, as disk images, and then extract the objects on the imaged file system to enable further appraisal. From there a decision will be made about how much more effort should be put into preserving the content and making suitable access copies of whatever we have found – a triage.
For agencies with 3.5-inch floppy disks, we normally help to develop a workflow within that organisation that enables them to manage this work for themselves using more ubiquitous 3.5-inch USB disk drives. With 5.25-inch disks it is more difficult to find suitable floppy disk drive controllers so we try our best at Archives to do this work on behalf of colleagues using equipment we’ve set up using the KryoFlux Universal USB floppy disk controller. The device enables the write-blocked reading, and imaging of legacy disk formats at a forensic level, using modern PC equipment.
We create disk images of the floppies using the KryoFlux and continue to use those images as a master copy for further triage. A rough outline of the process we tend to follow, plus some of its rationale is documented by Euan Cochran in his Open Planets Foundation blog: Bulk disk imaging and disk-format identification with KryoFlux.
Through a small amount of trial and error we discovered that the image format with which we were capable of reading the most sectors without error was MFM (Modified Frequency Modulation / Magnetic Force Microscopy) with the following settings:
Image Type: MFM Sector Image Start Track: At least 0 End Track: At most 83 Side Mode: Side 0 Sector Size: 256 Bytes Sector Count: Any Track Distance: 40 Tracks Target RPM: By Image type Flippy Mode: Off
We didn’t experiment to see if these settings could be further optimised as we found a good result. The non-default settings in the case of these disks were side mode zero, sector size 256 bytes, track distance at 40, and flippy mode was turned off.
Taken away from volatile and unstable media, we have binary objects that we can now attach fixity to, and treat using more common digital preservation workflows. We managed to read 30 out of 32 disks.
Exploding the Disk Images
Successful imaging alone doesn’t guarantee ease of mounting. We still needed to understand the underlying file system.
The images that we’ve seen before have been FAT12 and mount with ease in MS-DOS or Linux. These disks did not share the same identifying signatures at the beginning of the bitstream. We needed a little help in identifying them and fortunately through forensic investigation, and a little experience demonstrated by a colleague, it was quite clear the disks were CP/M formatted; the following ASCII text appearing as-is in the bitstream:
************************* * MIC-501 V1.6 * * 62K CP/M VERS 2.2 * ************************* COPYRIGHT 1983, MULTITECH BIOS VERS 1.6
CP/M (Control Program for Microcomputers) is a 1970’s early 1980’s operating system for early Intel microcomputers. The makers of the operating system were approached by IBM about licensing CP/M for their Personal Computer product, but talks failed, and the IBM went with MS-DOS from Microsoft; the rest is ancient history…
With the knowledge that we were looking at a CP/M file system we were able to source a mechanism to mount the disks in Windows. Cpmtools is a privately maintained suite of utilities for interacting with CP/M file systems. It was developed for working with CP/M in emulated environments, but works with floppy disks, and disk images equally well. The tool is available in Windows and POSIX compliant systems.
Commands for the different utilities look like the following:
That resulted in a command line to generate a file listing like this:
Creating a directory listing:
C:> cpmls –f bw12 disk-images\disk-one.img
This will list the user number (a CP/M concept), and the files objects belonging to that user.
0: File1.txt File2.txt
Extracting objects based on user number:
C:> cpmcp -f bw12 -p -t disk-images\disk-one.img 0:* output-dir
This will extract all objects collected logically under user 0: and put them into an output directory.
Finding the right commands was a little tricky at first, but once the correct set of arguments were found, it was straightforward to keep repeating them for each of the disks.
One of the less intuitive values supplied to the command line was the ‘bw12’ disk definition. This refers to a definition file, detailing the layout of the disk. The definition that worked best for our disks was the following:
# Bondwell 12 and 14 disk images in IMD raw binary format diskdef bw12 seclen 256 tracks 40 sectrk 18 blocksize 2048 maxdir 64 skew 1 boottrk 2 os 2.2 end
The majority of the disks extracted well. A small, on-image modification we made was the conversion of filenames containing forward slashes. The forward slashes did not play well with Windows, and so I took the decision to change the slashes to hashes in hex to ensure the objects were safely extracted into the output directory.
WordStar and other bits ‘n’ pieces
Content on the disks was primarily WordStar – CP/M’s flavour of word processor. Despite MS-DOS versions of WordStar; almost in parallel with the demise of CP/M, the program eventually lost market share in the 1980’s to WordPerfect. It took a little searching to source a converter to turn the WordStar content into something more useful but we did find something fairly quickly. The biggest issue viewing WordStar content as-is, in a standard text editor is the format’s use of the high-order bits within individual bytes to define word boundaries, as well as being used to make other denotations.
Example text, read verbatim might look like:
thå southerî coasô = the southern coast
At first, I was sure this was a sign of bit-flipping on less stable media. Again, the experience colleagues had with older formats was useful here, and a consultation with Google soon helped me to understand what we were seeing.
Looking for various readers or migration tools led me to a number of dead websites, but with the Internet Archive coming to the rescue to allow us to see them: WordStar to other format solutions.
The tool we ended up using was the HABit WorsStar Converter, with more information on Softpedia.com. It does bulk conversion of WordStar to plain text or HTML. We didn’t have to worry too much about how faithful the representation would be, as this was just a triage, we were more interested in the intellectual value of the content, or data. Rudimentary preservation of layout would be enough. We we’re very happy with plain text output with the option of HTML output too.
Unfortunately, when we approached Henry Bartlett, the developer of the tool, about a small bug in the bulk conversion where the tool neutralises file format extensions on output of the text file, causing naming collisions; we were informed by his wife that he’d sadly passed away. I hoped it would prove to be some reassurance to her to know that at the very least his work was still of great use for a good number of people doing format research, and for those who will eventually consume the objects that we’re working on.
Conversion was still a little more manual than we’d like if we had larger numbers of files, but everything ran smoothly. Each of the deliverables were collected, and taken back to the parent department on a USB stick along with the original 3.25-inch disks.
We await further news from DoC about what they’re planning on doing with the extracts next.
The research to complete this work took a couple of weeks overall. With more dedicated time it might have taken a week.
On completion, and delivery to The Department of Conservation, we’ve since run through the same process on another number of disks. This took a fraction of the time – possibly an afternoon. The process can be refined each further iteration.
The next step is to understand the value in what was extracted. This might mean using the extract to source printed copies of the content and understanding that we can dispose of these disks and their content. An even better result might be discovering that there are no other copies of the material and these digital objects can become records in their own right with potential for long term retention. At the very least those conversations can now begin. In the latter instance, we’ll need to understand what out of the various deliverables, i.e. the disk images; the extracted objects; and the migrated objects, will be considered the record.
Demonstrable value acts like a weight on the scales of digital preservation where we try and balance effort with value; especially in this instance, where the purpose of the digital material is yet unknown. This case study is borne from an air-gap in the recordkeeping process that sees the parent department attempting to understand the information in its possession in lieu of other recordkeeping metadata.
Aside from the value in what was extracted, there is still a benefit to us as an archive, and as a team in working with old technology, and equipment. Knowledge gained here will likely prove useful somewhere else down the line.
Identifying the file system could have been a little easier, and so we’d echo the call from Euan in the aforementioned blog post to have identification mechanisms for image formats in DROID-like tools.
Forensic analysis of the disk images and comparing that data to that extracted by CP/M Tools showed a certain amount of data remanence, that is, data that only exists forensically on the disk. It was extremely tempting to do more work with this, but we settled for notifying our contact at DoC, and thus far, we haven’t been called on to extract it.
We required a number of tools to perform this work. How we maintain the knowledge of this work, and maintain the tools used are two important questions. I haven’t an answer for the latter, while this blog serves in some way as documentation of the former.
What we received from DoC was old, but it wasn’t a problem that it was old. The right tools enabled this work to be done fairly easily – that goes for any organisation willing to put modest tools in the arms of their analysts and researchers such as the KryoFlux, and other legacy equipment. The disks were in good shape too. The curveball in this instance was that some of the pieces of the puzzle that we were interacting with were weirder than expected; a slightly different file system, and a word processing format that encoded data in an unexpected way making 1:1 extract and use a little more difficult. We got around it though. And indeed, as it stands, this wasn’t a preservation exercise; it was a low-cost and pragmatic exercise to support appraisal, continuity, and potential future preservation. The files have been delivered to DoC in its various forms: disk images; extracted objects; and migrated objects. We’ll await a further nod from them to understand where we go next.