“#Migration: No one does it for the future; they do it (need to do it) for the now.” – https://twitter.com/beet_keeper/status/327968228276060160
Recently I was asked by a colleague to look at some files he’d been sent by Hutt City Council in New Zealand; an unknown format from a 1995 vintage IBM operating system – a format as yet unidentified by popular format identification tools.
As with most of these attempts to identify a format, we ran the files through DROID, ExifTool and the Unix file command. With none of them identifying the files, the search really begins with a Google search of the file’s magic bytes:
2B 41 2B 56 2B 43 2B +A+V+C+
A single result at the time provided little to go on; it confirmed someone had once asked the same question on a computer graphics forum. A few clues in the bitstream e.g. a potential font size and title, ‘Roman Bold 26’, and a few more Google searches meant that we could say these files were potentially proprietary to an IBM system as opposed to a file with a more open specification. Confirmation with the content provider gave us the original environment as OS/2.
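As a rough illustration of that first identification step, here is a minimal Python sketch that checks whether a file begins with the byte sequence quoted above. The function names are my own, and matching at offset zero only is an assumption made for illustration; this is not how DROID or the file command express their signatures.

```python
# The magic bytes quoted in the post: 2B 41 2B 56 2B 43 2B, i.e. "+A+V+C+".
AVC_MAGIC = bytes.fromhex("2B 41 2B 56 2B 43 2B")


def starts_with_magic(data: bytes, magic: bytes = AVC_MAGIC) -> bool:
    """Return True if the byte string begins with the given magic bytes."""
    return data.startswith(magic)


def file_has_magic(path: str, magic: bytes = AVC_MAGIC) -> bool:
    """Read only as many bytes as the magic is long and compare them."""
    with open(path, "rb") as handle:
        return handle.read(len(magic)) == magic
```

A match on seven bytes from a single source is a hint, not an identification; it only tells us which files are worth looking at more closely.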
But that was it. We were staring at an obsolete format; definition: “A format, which, within our limited resourced world view at the time, we could no longer use.”
Our final port of call was to see if we could put the format back into its original environment to observe it in its natural state.
From here, the process became much simpler. As it turned out, an OS/2 installation running on VirtualBox knew what to do with these files. It was able to render them natively in an application for handling IBM AVC (Audio Visual Connection) content. Even better than that, the context menu for these images gave us the option ‘Convert To’ with the following options available:
- BMP (OS/2 Bitmap)
- DIB (RIFF DIB Image)
- GIF (GIF Image Compressed)
- JPG (Baseline JPG)
- PCX (PCX Image Compressed)
- TGA (Truevision TARGA)
- TIF (Tag Image File Format)
- VID (IBM MMotion Still Video Image)
Variants existed under BMP, TGA and TIFF, for example OS/2 1.3 and 2.0 BMP and Motorola or Intel, Compressed or Uncompressed TIFF.
The context menu option also allowed for the bulk conversion of these images, so a single click gave us uncompressed TIFF images suitable for export.
Simple is of course a relative term, and although we had the images we wanted, there was a problem retrieving them from the emulated environment. Unable to successfully set up a shared drive to enable our Host OS to interact with VirtualBox, and unable to attach any form of writeable media, we were stuck.
The virtual machine was connected to the Internet, but Netscape was unable to interact with modern websites particularly well. We were also unable to use FTP successfully, at least within the self-imposed timeframe we were working to.
Our final option was email. SMTP saved the day. We zipped the images using the still-available Info-ZIP tool and emailed them from a Gmail account back to itself using the OS-provided Netscape Messenger email client. That let us retrieve the images and immediately made them usable in a modern environment.
And that was it, job done!
But there is still more to this story.
Time Travel
It’s 1996. I boot up my OS/2 Warp 4.2 box. It’s being packed away today, ready for the new Pentium machines running Windows being rolled out by our IT department. Windows… *sigh* but my IT department wax lyrical about the improvements in performance and security. It’s just work, I’ve got a fishing trip at the weekend so I’ve other things to keep my mind off the IBM vs. Microsoft debate. Wait! I’d better make sure I’ve got all my files. Ah, those IM files I was looking at last year. Neat images; could come in handy again. Windows doesn’t support the format though. Hmm, right-click, convert. 300 files; IM to TIF – that’s going to take a few floppy disks! – Should be able to access them in a few applications though. Good!
What we did by grabbing hold of an OS/2 installation and VirtualBox was not create a solution we want to take into the future. It was us stepping back into 1996, for one time only, to create a version of a file we could take into 1997, and beyond, on a different platform. It is 1996 again and we’ve now got 300 TIFF files. As things move forward in 2013 we might start thinking about converting them to a new standard, PNG maybe, to capitalize on the space savings provided by lossless compression and also to make use of them on the web. PNG being an open standard (like TIF) might also help to avoid a situation similar to the one with our IM files in future. Whatever mechanism we choose, it should be lossless and should give us the greatest potential for use moving forward.
Outside of the time travel context, with our images converted and the original provider of the materials happy with the work, we’re left with a success story, but an incomplete solution… an unsatisfying one.
An unsatisfying solution
At the end of this process we’re still left with a file format we don’t fully understand. I can’t migrate this format in a modern environment using modern tools. I can’t render it; I can’t really identify it with complete certainty. I can’t help matters and create a signature for it without really knowing more about where it came from and what its specification looks like. I do have enough examples from a single system to take apart some of the header and look for consistencies but is this precise enough for what we’re attempting to achieve in Digital Preservation? Maybe, for an experimental DROID signature file.
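One way to take apart the header and look for consistencies might be to compare the leading bytes of every sample and keep only what they all share. The Python sketch below does that; the helper names and the 64-byte header window are hypothetical choices of mine, and a shared prefix drawn from a single source system is only a candidate for an experimental signature, not a confirmed one.

```python
from pathlib import Path

HEADER_WINDOW = 64  # assumed number of leading bytes worth comparing


def common_prefix(samples):
    """Return the longest run of leading bytes shared by every sample."""
    if not samples:
        return b""
    shortest = min(samples, key=len)
    for i, byte in enumerate(shortest):
        # Stop at the first offset where any sample disagrees.
        if any(sample[i] != byte for sample in samples):
            return shortest[:i]
    return shortest


def candidate_signature(paths, window=HEADER_WINDOW):
    """Read the first `window` bytes of each file; return the shared prefix as hex."""
    headers = [Path(p).read_bytes()[:window] for p in paths]
    return common_prefix(headers).hex(" ").upper()
```

Run across all the files from the provider, this would show how far the agreement extends past the seven magic bytes, which is the sort of evidence an experimental signature could be built on.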
As for the completed migration, with no validation tools available I can’t look at the internals of this format and guarantee I know what was lost between the conversions from IM to TIFF – I do know I’ve lost something though – what were those references to font? They’re no longer in the TIF output, and what other plain text did I spot in the bitstream that might mean something? A part of the bitstream annotated as ‘TEXT’, another ‘HEAD’ – fields pertaining to the DB2 conversion described by the provider?
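To make that kind of spotting repeatable, a small Python sketch along the lines of the Unix strings utility can list runs of printable ASCII together with their offsets. The function name and the minimum run length of four characters are assumptions of mine, and the sample bytes in the usage note are made up; nothing here reflects the real layout of an IM file.

```python
import re


def printable_runs(data: bytes, min_len: int = 4):
    """Return (offset, text) pairs for runs of printable ASCII in a byte string."""
    # Bytes 0x20 to 0x7E are the printable ASCII range, space through tilde.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]
```

Given an invented fragment such as `b"\x00\x01HEAD\x00\x02Roman Bold 26\xff\x00ab\x00TEXT"`, this reports ‘HEAD’, ‘Roman Bold 26’ and ‘TEXT’ with their byte offsets and skips the two-character run. Comparing those offsets across many files might suggest which markers are structural and which are content.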
In short:
- We can’t validate the success of the conversion beyond the rendered image
- We haven’t isolated a specification for this format
- We haven’t an ability to express a signature in current production identification systems
- We cannot render IM files in a modern environment
- The mechanism of transfer from the emulated environment to our Host OS was certainly not a preferred route
As many a school report might say – could do better. The end result of this process is that we have some images that can now be reused by the original content provider. We can also say with a little confidence that we know what format these images were originally: IBM AVC Still Video Image. I’ll leave it up to the comments section of this blog to suggest ways forward from here. The main message for me, however, is that for this to be considered a satisfactory result for digital preservation, one or more of these issues would need to have been solved as part of the process – a file format signature would be something, some idea about what the header says would be good, and a deeper analysis of the format structure even better. What would be really nice is an understanding of whether it might be possible to create a migration tool for this format in future, with some idea of what the original specification suggests about the feasibility of doing that.
Other Solutions
Before I conclude, we did consider two other options which, with further investigation, might help us in the short term.
- eComStation is a modern operating system, based on OS/2. In an emulated environment this might give us better methods of extracting the files, for example USB support, better access to file upload websites, and even the opportunity to set up a shared drive between it and the Host OS. We did try to convert the images using eComStation and found that it worked, and even provided PNG as an export format – what had been lost in translation, however, was the bulk processing capability – this left us wondering whether we’d need to create an MS-DOS-style batch script to do this routine, or even use REXX – IBM’s own interpreted programming language native to the environment.
- Exporting the OS/2 executable for the native image viewer or even converter into Windows may have worked providing they had originally been written to be compatible with Windows and not just OS/2. Highly unlikely but we did have success running an Aldus PhotoStyler executable found in the user directories sitting alongside the original image files.
Migration for the Now
This was an interesting use case. It was nice to have the time to look at a problem outside of the context of the government records we’re expected to look after at Archives New Zealand. There was no expectation of this result, just some files to play around with and see what we could do.
There were a number of lessons alluded to above – goals that we should strive for in digital preservation.
For me, despite this solution relying wholly on emulation, what I really learned was the value of migration. Stepping back into 1996 allowed me to migrate my files to a format I could still use in 2013. I believe the same of file formats now. Any file formats that I have a doubt about, whether because they are proprietary, because of some measure (objective or otherwise) of over-complexity, or simply because they are not widely adopted, are ones I should be thinking about migrating. It might be the difference between a future Digital Preservation Analyst having to emulate my XP environment and find an obscure way to transfer files from it, and the alternative of simply being able to render them natively within their own modern OS.
Isgandar Valizada
June 3, 2013 @ 5:43 pm CEST
Hi. You might want to take a look at the OPF blog post introducing the "bwFLA Project", which deals with the development of an emulation framework you might find interesting regarding your tasks: http://www.openpreservation.org/blogs/2013-03-18-bwfla-demo-emulation-service-eaas-and-digital-art-curation
ross-spencer
May 16, 2013 @ 5:18 am CEST
Dear Euan,
Thank you for your comments.
With regard to some of the detail, I think you may be at risk of drawing an unnecessarily one-dimensional and potentially flawed conclusion from a solution that was multi-faceted.
Yes, emulation worked well and it complemented our migration approach but drawing a distinction between the two isn't necessary. Similar solutions will undoubtedly be used again in the future to either migrate or allow us to view files in an original environment.
The experience helped me to articulate a need which I believe we have: to think about using the right formats now, or at a point when we are still able to. The narrative in the middle of the blog serves to demonstrate the idea of a user understanding the value of his information and the need to maintain a continuity that allows it to remain useful.
However, this idea shouldn't (probably shouldn't) exist independently of preservation considerations – in this case a primary concern for me (and if I was a future researcher) would be not losing information.
The emulation didn't help me to understand the format any better. It didn't help to reveal the structure of the file, and for all I know, what it rendered might not even be complete or accurate. There is an image, but is it one layer of a multi-layered document? What if I want to interact with the other layers or the behind the scenes data I alluded to with the font information and the additional annotated elements? – How do I find this out?
There are many potential approaches.
In recognising the value of this case study, you must also recognise certain elements of luck which enabled us to satisfy any part of the content provider's requirements… The ability for OS/2 to recognise this format natively – how likely is this to occur with other formats? And the ability for the virtualized environment to give us a connection to the internet which we could use.
The concluding comments of the blog ask for tools which might allow us to better validate what we're seeing in either the emulated environment or in our current IT environments.
I want to know what this format is; why it contains font information; what other data is in the bitstream that I can't identify. I want to know what its creating application was, and what other applications support it which might allow me to interact with it better.
At the very least, I want to characterize it. This will help me (and others) to better understand how to use it in future.
Ross
ecochrane
May 13, 2013 @ 3:25 am CEST
This case study provides an excellent example of the value of emulation as an "ambulance at the bottom of the cliff" solution. The emulation worked, it enabled interaction with the original files in a representative environment and also enabled users to then make selected parts of the content available for re-use in a modern environment (in this case those parts were whatever was migrated into the TIFF files from the originals). It also worked with little input from the creators/donors. They may well have been able to identify the original software or O/S required to interact with the files, and aside from that no other input was required from them. Furthermore they didn't have to spend the time and money over the years to migrate the records. Instead the cost was borne when the content was required for re-use. It was a just-in-time solution rather than a just-in-case solution, which is another reason why it should generally be the preferred solution.
It would be great if users/creators etc. regularly conducted validated migration, and had the tools on hand to do so. However as digital preservation practitioners that is not what we have to deal with for the most part. We have to deal with the things (like these) that have not been properly migrated. And so we need solutions that will not rely on the users having performed these migrations as it is highly unlikely that they will (and why should they since they aren't likely to see any benefits from it?).
Furthermore unless the regular migration was done thoroughly (and with validation) on a regular basis (which seems highly unlikely) then there will be significant issues with using it as your preferred solution for the old things that do make it to your organisation for preservation:
1. How will you validate the migration? What will you validate it against? (if you don't have the original software to use to interact with the files to see what content they present)
2. How long will you maintain the migration path? e.g. will you still be migrating WordStar files in 20/30 years if they come to your organisation then? If so, is your answer to (1.) still sustainable/available?
And so if you decide that you need to maintain the original software to use in validating future migrations then why not just use it as your preferred solution in the first place?
Thank you for posting this case study Ross. It’s another great example for the community to refer to.