Weirder than old: The CP/M File System and Legacy Disk Extracts for New Zealand’s Department of Conservation


We’ve been doing legacy disk extracts at Archives New Zealand for a number of years, with much of the effort that enables this work being done by colleague Mick Crouch and former Archives New Zealand colleague Euan Cochrane. Earlier this year we received some disks from New Zealand’s Department of Conservation (DoC), which we successfully imaged, extracting what the department needed. While it was a pretty straightforward exercise, there was enough about it that was cool to warrant using this blog to document another facet of the digital preservation work we’re doing, especially in the spirit of being another war story that others in the community can refer to. We conclude with a few thoughts about where we still relied on a little luck, which we’ll have to keep in mind moving forward.

We received 32 180 KB 5.25-inch disks from DoC: Maxell MD1-D, single-sided, double-density, containing what we expected to be survey data circa 1984/1985.

Our goal with these disks, as with any that we find outside of a managed records system, is to transfer the data to a more stable medium as disk images, and then extract the objects on the imaged file system to enable further appraisal. From there a decision will be made about how much more effort should be put into preserving the content and making suitable access copies of whatever we have found: a triage.

For agencies with 3.5-inch floppy disks, we normally help to develop a workflow within that organisation that enables them to manage this work for themselves using more ubiquitous 3.5-inch USB disk drives. With 5.25-inch disks it is more difficult to find suitable floppy disk drive controllers, so we try our best at Archives to do this work on behalf of colleagues using equipment we’ve set up around the KryoFlux Universal USB floppy disk controller. The device enables the write-blocked reading and imaging of legacy disk formats at a forensic level, using modern PC equipment.

We create disk images of the floppies using the KryoFlux and continue to use those images as a master copy for further triage. A rough outline of the process we tend to follow, plus some of its rationale, is documented by Euan Cochrane in his Open Planets Foundation blog: Bulk disk imaging and disk-format identification with KryoFlux.

Through a small amount of trial and error we discovered that the image format with which we were capable of reading the most sectors without error was MFM (Modified Frequency Modulation) with the following settings:

Image Type:     MFM Sector Image
Start Track:    At least 0
End Track:      At most 83
Side Mode:      Side 0
Sector Size:    256 Bytes
Sector Count:   Any
Track Distance: 40 Tracks
Target RPM:     By Image type
Flippy Mode:    Off

We didn’t experiment to see if these settings could be further optimised as we found a good result. The non-default settings in the case of these disks were side mode zero, sector size 256 bytes, track distance at 40, and flippy mode was turned off.

Taken away from volatile and unstable media, we have binary objects that we can now attach fixity to, and treat using more common digital preservation workflows. We managed to read 30 out of 32 disks.
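
As a minimal sketch of what attaching fixity might look like, the following computes a SHA-256 checksum for each disk image and writes a simple manifest. The choice of SHA-256, the `.img` extension, and the manifest file name are assumptions for illustration; they are not a record of the workflow we actually used.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(image_dir, manifest_name="manifest-sha256.txt"):
    """Write '<digest>  <filename>' lines for every .img file in image_dir.

    The manifest name and the .img extension are hypothetical choices.
    """
    image_dir = Path(image_dir)
    lines = [f"{sha256_of(p)}  {p.name}" for p in sorted(image_dir.glob("*.img"))]
    (image_dir / manifest_name).write_text("\n".join(lines) + "\n")
    return lines
```

Re-running the same function later and comparing the digests gives a basic check that the master copies have not changed.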

Exploding the Disk Images

With the disk images in hand we found ourselves facing our biggest challenge. The images, although clearly well-formed, i.e. not corrupt, would not mount with Virtual Floppy Disk or mount in Linux.

Successful imaging alone doesn’t guarantee ease of mounting. We still needed to understand the underlying file system.

The images that we’ve seen before have been FAT12 and mount with ease in MS-DOS or Linux. These disks did not share the same identifying signatures at the beginning of the bitstream. We needed a little help in identifying them and fortunately through forensic investigation, and a little experience demonstrated by a colleague, it was quite clear the disks were CP/M formatted; the following ASCII text appearing as-is in the bitstream:

 

*************************
*     MIC-501  V1.6     *
*   62K CP/M  VERS 2.2  *
*************************

COPYRIGHT  1983, MULTITECH BIOS VERS 1.6
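
The kind of inspection that surfaced this banner can be roughly sketched in code. The example below pulls printable ASCII runs out of an image, much as the Unix `strings` utility does, and offers a very rough hint about the file system. The function names and the heuristics are assumptions for illustration: the 0x55 0xAA marker at offset 510 is common on DOS boot sectors but not universal, and the text "CP/M" could equally turn up inside a file on a non-CP/M disk, so this is a prompt for further investigation rather than an identification.

```python
import re


def ascii_runs(data, min_len=8):
    """Return printable-ASCII runs of at least min_len bytes, like `strings`."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def guess_file_system(data):
    """Return a rough hint only: 'cpm?', 'fat?' or 'unknown'."""
    # A boot banner mentioning CP/M is a strong clue, not proof.
    if any("CP/M" in run for run in ascii_runs(data)):
        return "cpm?"
    # Many (not all) DOS-formatted disks end the boot sector with 55 AA.
    if len(data) >= 512 and data[510:512] == b"\x55\xaa":
        return "fat?"
    return "unknown"
```

Calling `guess_file_system(open("disk-one.img", "rb").read())` on one of our images would be expected to return `'cpm?'` because of the banner text.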

 

CP/M (Control Program for Microcomputers) is an operating system from the 1970s and early 1980s for early Intel microcomputers. The makers of the operating system were approached by IBM about licensing CP/M for their Personal Computer product, but talks failed, and IBM went with MS-DOS from Microsoft; the rest is ancient history…

With the knowledge that we were looking at a CP/M file system we were able to source a mechanism to mount the disks in Windows. Cpmtools is a privately maintained suite of utilities for interacting with CP/M file systems. It was developed for working with CP/M in emulated environments, but works with floppy disks, and disk images equally well. The tool is available in Windows and POSIX compliant systems.

Commands for the different utilities look like the following.

Creating a directory listing:

C:> cpmls -f bw12 disk-images\disk-one.img

This will list the user number (a CP/M concept), and the file objects belonging to that user.

E.g.:

0:
   File1.txt
   File2.txt

Extracting objects based on user number:

C:> cpmcp -f bw12 -p -t disk-images\disk-one.img 0:* output-dir

This will extract all objects collected logically under user 0: and put them into an output directory.

Finding the right commands was a little tricky at first, but once the correct set of arguments was found, it was straightforward to keep repeating them for each of the disks.
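
Repeating the commands for each disk could be scripted. The sketch below builds the same `cpmls` and `cpmcp` command lines shown above for every image in a directory and runs them in turn. The directory layout, the `.img` extension, and the one-output-folder-per-image convention are assumptions for illustration; it also assumes the Cpmtools executables are on the PATH and that `cpmcp` expands the `0:*` pattern itself.

```python
import subprocess
from pathlib import Path


def build_commands(image_path, output_root, diskdef="bw12", user=0):
    """Return (output_dir, cpmls command, cpmcp command) for one disk image."""
    image_path = Path(image_path)
    # Hypothetical convention: one output folder per image, named after it.
    output_dir = Path(output_root) / image_path.stem
    listing = ["cpmls", "-f", diskdef, str(image_path)]
    extract = ["cpmcp", "-f", diskdef, "-p", "-t",
               str(image_path), f"{user}:*", str(output_dir)]
    return output_dir, listing, extract


def extract_all(image_dir, output_root):
    """List and extract user 0 from every .img file in image_dir."""
    for image_path in sorted(Path(image_dir).glob("*.img")):
        output_dir, listing, extract = build_commands(image_path, output_root)
        output_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(listing, check=True)   # stops on the first failure
        subprocess.run(extract, check=True)
```

Stopping on the first failure (`check=True`) is a deliberate choice here, so that a disk that will not extract gets looked at by hand rather than silently skipped.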

One of the less intuitive values supplied to the command line was the ‘bw12’ disk definition. This refers to a definition file, detailing the layout of the disk. The definition that worked best for our disks was the following:

# Bondwell 12 and 14 disk images in IMD raw binary format

diskdef bw12
  seclen 256
  tracks 40
  sectrk 18
  blocksize 2048
  maxdir 64
  skew 1
  boottrk 2
  os 2.2
end
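
As a quick sanity check on this definition, the arithmetic below works out what the numbers imply. It assumes the standard 32-byte CP/M 2.2 directory entry, and that the directory starts immediately after the boot tracks; the variable names are ours.

```python
# Values copied from the bw12 disk definition above.
SECLEN = 256       # bytes per sector
TRACKS = 40
SECTRK = 18        # sectors per track
BLOCKSIZE = 2048   # bytes per allocation block
MAXDIR = 64        # directory entries
BOOTTRK = 2        # reserved boot tracks

DIR_ENTRY_SIZE = 32  # assumption: standard CP/M 2.2 directory entry size

track_bytes = SECLEN * SECTRK
image_bytes = track_bytes * TRACKS
directory_offset = track_bytes * BOOTTRK
directory_bytes = MAXDIR * DIR_ENTRY_SIZE
data_blocks = (image_bytes - directory_offset) // BLOCKSIZE

print(image_bytes, directory_offset, directory_bytes, data_blocks)
# prints: 184320 9216 2048 85
```

The total of 184,320 bytes is exactly 180 KB, which matches the capacity of the disks we received, and the 64 directory entries fit in a single 2048-byte block.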

The majority of the disks extracted well. A small, on-image modification we made was the conversion of filenames containing forward slashes. The forward slashes did not play well with Windows, so I took the decision to change the slashes to hashes in hex (byte 0x2F to 0x23) to ensure the objects were safely extracted into the output directory.
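
For anyone wanting to script that kind of change rather than make it by hand, here is a hedged sketch. It works on a copy of the image held in memory, walks the directory entries implied by the bw12 definition, and swaps '/' for '#' in the name and extension fields only. It assumes the standard CP/M 2.2 directory layout (32-byte entries, user number in byte 0, 0xE5 marking an unused entry, name and extension in bytes 1 to 11) and a linear, unskewed image; it is not a record of how we made the change.

```python
DIR_OFFSET = 2 * 18 * 256   # boottrk * sectrk * seclen from the bw12 definition
MAXDIR = 64
ENTRY_SIZE = 32             # assumption: standard CP/M 2.2 directory entry
UNUSED = 0xE5               # marker for an unused/deleted entry


def replace_slashes(image):
    """Return (patched copy of image, number of bytes changed)."""
    patched = bytearray(image)
    changed = 0
    for n in range(MAXDIR):
        start = DIR_OFFSET + n * ENTRY_SIZE
        if start + ENTRY_SIZE > len(patched):
            break
        if patched[start] == UNUSED:
            continue
        # Bytes 1..11 hold the 8-character name and 3-character extension.
        for i in range(start + 1, start + 12):
            if patched[i] & 0x7F == 0x2F:                 # '/'
                # Keep the high bit, which CP/M uses for attribute flags.
                patched[i] = (patched[i] & 0x80) | 0x23   # '#'
                changed += 1
    return bytes(patched), changed
```

Writing the result to a new file, rather than over the master image, keeps the original bitstream and its checksum intact.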

WordStar and other bits ‘n’ pieces

Content on the disks was primarily WordStar, CP/M’s flavour of word processor. Despite the arrival of MS-DOS versions of WordStar, and almost in parallel with the demise of CP/M, the program eventually lost market share in the 1980s to WordPerfect. It took a little searching to source a converter to turn the WordStar content into something more useful, but we did find something fairly quickly. The biggest issue with viewing WordStar content as-is in a standard text editor is the format’s use of the high-order bit within individual bytes to mark word boundaries, as well as to make other denotations.

Example text, read verbatim might look like:

thå  southerî coasô = the southern coast

At first, I was sure this was a sign of bit-flipping on less stable media. Again, the experience colleagues had with older formats was useful here, and a consultation with Google soon helped me to understand what we were seeing.
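
The effect in the example above can be reproduced, and undone, with a simple bitmask. This is only a sketch of the idea: it clears the high-order bit of every byte, which recovers the letters but does nothing about WordStar’s control codes or formatting commands, so it is no substitute for a proper converter. The sample bytes are the example text as it would appear when read as Latin-1.

```python
def strip_high_bit(data):
    """Clear the high-order bit of every byte, leaving 7-bit ASCII."""
    return bytes(b & 0x7F for b in data)


# 'thå  southerî coasô' as raw bytes (å = 0xE5, î = 0xEE, ô = 0xF4).
sample = b"th\xe5  souther\xee coas\xf4"
print(strip_high_bit(sample).decode("ascii"))
# prints: the  southern coast
```

In each case the marked character is the last letter of a word: 0xE5 with the high bit cleared is 0x65 ('e'), 0xEE becomes 0x6E ('n'), and 0xF4 becomes 0x74 ('t').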

Looking for various readers or migration tools led me to a number of dead websites, but the Internet Archive came to the rescue and allowed us to see them: WordStar to other format solutions.

The tool we ended up using was the HABit WordStar Converter, with more information on Softpedia.com. It does bulk conversion of WordStar to plain text or HTML. We didn’t have to worry too much about how faithful the representation would be; as this was just a triage, we were more interested in the intellectual value of the content, or data. Rudimentary preservation of layout would be enough. We were very happy with plain text output, with the option of HTML output too.

Unfortunately, when we approached Henry Bartlett, the developer of the tool, about a small bug in the bulk conversion, where the tool neutralises file format extensions on output of the text file and so causes naming collisions, we were informed by his wife that he had sadly passed away. I hoped it would prove to be some reassurance to her to know that at the very least his work was still of great use to a good number of people doing format research, and to those who will eventually consume the objects that we’re working on.

Conversion was still a little more manual than we’d like if we had larger numbers of files, but everything ran smoothly. Each of the deliverables was collected, and taken back to the parent department on a USB stick along with the original 5.25-inch disks.
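
Part of the manual effort was watching for the naming collisions mentioned above. A small check like the following could flag them before a bulk conversion is run. It assumes the converter replaces each file’s extension with `.txt` and that the file system compares names case-insensitively; the exact behaviour of the tool and the file names in the usage note are assumptions for illustration.

```python
from collections import defaultdict
from pathlib import PurePath


def find_collisions(names, new_suffix=".txt"):
    """Map each output name to its sources, keeping only those that clash."""
    targets = defaultdict(list)
    for name in names:
        # Assumed converter behaviour: drop the extension, add new_suffix.
        targets[PurePath(name).stem.lower() + new_suffix].append(name)
    return {t: srcs for t, srcs in targets.items() if len(srcs) > 1}
```

For example, `find_collisions(["REPORT.DOC", "REPORT.BAK", "NOTES.DOC"])` reports that the first two files would both be written to `report.txt`, so one of them needs renaming or converting separately.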

We await further news from DoC about what they’re planning on doing with the extracts next.

Conclusions

The research to complete this work took a couple of weeks overall. With more dedicated time it might have taken a week.

Since completion and delivery to the Department of Conservation, we’ve run through the same process on a number of other disks. This took a fraction of the time, possibly an afternoon. The process can be refined with each further iteration.

The next step is to understand the value in what was extracted. This might mean using the extract to source printed copies of the content and understanding that we can dispose of these disks and their content. An even better result might be discovering that there are no other copies of the material and these digital objects can become records in their own right with potential for long term retention. At the very least those conversations can now begin. In the latter instance, we’ll need to understand what out of the various deliverables, i.e. the disk images; the extracted objects; and the migrated objects, will be considered the record.

Demonstrable value acts like a weight on the scales of digital preservation where we try and balance effort with value; especially in this instance, where the purpose of the digital material is yet unknown. This case study is born from an air-gap in the recordkeeping process that sees the parent department attempting to understand the information in its possession in lieu of other recordkeeping metadata.

Aside from the value in what was extracted, there is still a benefit to us as an archive, and as a team in working with old technology, and equipment. Knowledge gained here will likely prove useful somewhere else down the line. 

Identifying the file system could have been a little easier, and so we’d echo the call from Euan in the aforementioned blog post to have identification mechanisms for image formats in DROID-like tools.

Forensic analysis of the disk images, and comparison of that data with what was extracted by Cpmtools, showed a certain amount of data remanence, that is, data that only exists forensically on the disk. It was extremely tempting to do more work with this, but we settled for notifying our contact at DoC, and thus far, we haven’t been called on to extract it.
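
One way such a comparison might be approached is sketched below: mask the high-order bit across the whole image, collect the printable text runs, and report those that do not appear anywhere in the extracted files. This is an illustration under simplifying assumptions (text runs of at least 12 characters, a plain substring comparison, extracted files passed in as bytes), not the forensic method we used, and anything it reports is a lead to follow up rather than a finding.

```python
import re


def mask_high_bit(data):
    """Clear the high-order bit of every byte."""
    return bytes(b & 0x7F for b in data)


def text_runs(data, min_len=12):
    """Return the set of printable runs found after masking the high bit."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {m.group() for m in pattern.finditer(mask_high_bit(data))}


def only_in_image(image, extracted_files, min_len=12):
    """Return text runs present in the image but in none of the extracts."""
    extracted = b"\n".join(mask_high_bit(f) for f in extracted_files)
    return sorted(run for run in text_runs(image, min_len)
                  if run not in extracted)
```

Runs that straddle live and stale data will also be reported, so the output would still need a human eye.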

We required a number of tools to perform this work. How we maintain the knowledge of this work, and maintain the tools used are two important questions. I haven’t an answer for the latter, while this blog serves in some way as documentation of the former.

What we received from DoC was old, but it wasn’t a problem that it was old. The right tools enabled this work to be done fairly easily, and that goes for any organisation willing to put modest tools, such as the KryoFlux and other legacy equipment, in the hands of their analysts and researchers. The disks were in good shape too. The curveball in this instance was that some of the pieces of the puzzle that we were interacting with were weirder than expected: a slightly different file system, and a word processing format that encoded data in an unexpected way, making 1:1 extract and use a little more difficult. We got around it though. And indeed, as it stands, this wasn’t a preservation exercise; it was a low-cost and pragmatic exercise to support appraisal, continuity, and potential future preservation. The files have been delivered to DoC in their various forms: disk images; extracted objects; and migrated objects. We’ll await a further nod from them to understand where we go next.

4 Comments

  1. ross-spencer
    September 29, 2014 @ 8:17 am CEST

    Thanks for the links Dirk! Before my time here, I hadn't read them before. 

    An interesting point about the CTOS FAQ website disappearing and you guys being lucky enough to capture all the relevant information before that happened. I think in some cases luck is almost certainly going to be a factor in a successful migration from disks like that.

    I guess we just have to keep identifying the gaps and filling them as and when we can, such as to  enable us to keep doing this work over time. 

    Your CTOS extraction tool is a good example of that. 

    With regard to DROID and File, we also have the ability to influence, and affect changes in those tools. If we can find identifying byte sequences in the bit streams to these disks then we can submit signatures to the team at PRONOM, or the maintainers of File. 

    If there is a question over whether they fit within scope of PRONOM or not, then I've long been a proponent of signature files for different purposes. E.g. a signature file for file formats vs. a signature file for identifying disk images. It's entirely possible, we just need a mechanism to store and output that information.

    There's a lot to do, and finding time to do it is a challenge. When I get an opportunity I should have a look over the CP/M formatted disks we have and think about what a signature for those looks like. We've already seen big clues about that! – It's just a matter of writing it and testing. Perhaps it'll put us in a better position to have a discussion about improving the mechanisms for identifying disk images. 

    One final point I wanted to address is that we really do seem to be a main actor in this kind of discovery. Or perhaps, between us, some of the only ones documenting it. As ever, we need all the war stories we can gather. I guess that in publishing our stories we hope to see others publish theirs too. I'm sure there are plenty more. 

    Thanks once again.

    Ross

  2. Dirk von Suchodoletz
    September 25, 2014 @ 12:21 pm CEST

    Archives New Zealand seems to be the main actor to come up with weird old floppies! I'm not surprised that DROID (or other tools) fail to detect certain ancient file types. In some experiments on older sets of files in 2011, Euan found only a rather moderate detection rate. The same was true for the BTOS/CTOS floppy set we analyzed in 2012 (another rather moderately priced 5.25" floppy controller is the FC5025). We got some really strange results with the Linux file utility, including some formats which did not exist in those days. Another problem for some files and executables was the unavailability of original systems or emulated representations of them.

  3. ross-spencer
    September 24, 2014 @ 3:18 am CEST

    Hi Euan,

    There are a couple of things we'd like to do to improve this process, and validation at that level would be one of them. We did boot up an old copy of MS-DOS WordStar to verify one or two files. From my perspective, that was just validation that what we saw was in fact WordStar, and the characters denoting spaces rendered as expected.

    In terms of validation for 'meaningful' changes, then we didn't feel it was a risk, and I am not sure eyeballing the number of documents we had, this way would have helped us to identify anything. We'd also have to define what we were looking for more specifically. 

    The migration to plain-text was not for preservation, so what we expect to see happen now:

    1. De-facto validation against paper files that DoC may be in possession of. We might spot some odd things in the temporary migration there.

    2. If we haven't paper originals, but the content looks to be significant, then we will be more rigorous in our migration, and validation. We'll talk with DoC about their requirements, and do a lot more research into understanding the expected appearance, and intellectual content of the object. 

    There was an aspect of validation I would like to have done against the disk images rather than the format objects. This is related to the data remanence comment in-text above. I am reasonably confident that there is more data in the images that we're not extracting. This is likely because it has been overwritten, and with it, the offsets to those parts of the disk image. 

    Something I might like to provide in future is a sort of ASCII dump of the disk image, maybe in concert with a bitmask removing the high-order bit mark in any characters with it set. This can be compared to the plain-text extracts of the file-objects, and we can see if we have anything of significance we might think about saving. It might simply help us find more content that was intended to be overwritten, but it might also help us to identify files that haven't been extracted, and that leads us into a different line of investigation, and with it, potentially a different set of tools. 

    How far we go, I feel, depends very much on what the requirements of the content provider are. I think we've gone about as far as we should right now without more in-depth discussions happening about what to do next. Technologically we're a couple of steps further along. In terms of the primary goal of DoC we're one step further along – we know roughly what is on the disks now, and can consider our options accordingly. 

    It's a good point, and thank you for the comment about the war story. Another one for the collection! 

    Ross

  4. ecochrane
    September 23, 2014 @ 1:39 pm CEST

    Fascinating post and a great war story. 

    Did you consider validating/checking the migration of the content against its rendering in a contemporary version of WordStar? I know you have a copy there, though it is not a CP/M version. 

    It would hopefully give you more assurance that the migration was successful and didn't alter the content in a meaningful way. 

     

     
