Bringing together the Emulation and Format ID hackathons


Coming up in the next month are two excellent OPF hackathons: the “Emulation, learn from the experts” hackathon and the “CURATEcamp 24 hour worldwide file id hackathon”. One follows the other: the emulation hackathon runs from the 13th to the 15th of November, and the file ID hackathon takes place on Friday the 16th.

This seems like a great opportunity for one hackathon to contribute to the next. This was actually suggested on the format ID page here, but little was said about how the two might relate to each other aside from the off-hand suggestion (or question) that “emulation needs file characterization too?” This led me to wonder how best to answer the question:

How does format identification relate to emulation?

Although I have suggested that format identification is not strictly necessary for successful emulation workflows, I do believe that there are many ways in which the two could contribute to each other. I’ve outlined some of these below:

1. Emulated software environments can provide an alternative format identification method.

As I’ve discussed here, opening a file in a number of old pieces of software can help to identify its format. Often an error message is displayed when the file is opened.

Or the file may obviously fail to render properly. Either result indicates that the file is not in any of the formats that the software supports.

Repeated across different environments that incorporate software contemporary with the file(s) of interest, this method can help to identify the format of the file by dramatically reducing the set of possible formats.
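The elimination approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the application names, format names and the `narrow_candidates` helper are all hypothetical placeholders, and the table of supported formats would in practice have to be compiled from the applications themselves.

```python
def narrow_candidates(candidates, supported_formats, failed_apps):
    """Remove every format supported by an application that failed to open the file.

    candidates        -- set of formats the file might be in
    supported_formats -- dict mapping application name to the set of formats it opens
    failed_apps       -- applications that showed an error or a garbled rendering
    """
    remaining = set(candidates)
    for app in failed_apps:
        # A failure suggests the file is in none of the formats this application supports.
        remaining -= supported_formats.get(app, set())
    return remaining


# Hypothetical data: placeholder application and format names.
SUPPORTED = {
    "app_a": {"fmt_1", "fmt_2"},
    "app_b": {"fmt_2", "fmt_3"},
}

remaining = narrow_candidates(
    {"fmt_1", "fmt_2", "fmt_3", "fmt_4"}, SUPPORTED, ["app_a", "app_b"]
)
print(remaining)
```

Each failed application removes its whole list of supported formats, so a handful of well-chosen environments can shrink the candidate set quickly. The sketch treats every failure as conclusive, which is a simplification, since a damaged file would also fail to open.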

2. File formats can be validated using emulation.

A useful way to validate a file’s format is to test whether the file renders properly in the software environment that is primarily associated with its purported format. When that software is old, it will often need to be run using emulation.

3. Software format compatibility information (captured from emulated environments) can help in identifying formats.

Different applications differentiate between versions of file formats in different ways in their open and save-as (and import/export) parameters. Analysing the logic behind this differentiation may reveal whether or not format variants are significant. For example, Microsoft Word Version 6.0c (running on Windows 3.11) has the following open parameters for Word for MS-DOS files:

 

Word for MS-DOS 3.x – 5.x

Word for MS-DOS 6.0

 

In contrast, WordPerfect 5.2 for Windows (running on Windows 3.11) has these open parameters:

 

MS Word 4.0; 5.0 or 5.5

MS Word for Windows 1.0; 1.1 or 1.1a

MS Word for Windows 2.0; 2.0a; 2.0b

 

The first of these may be referring to MS-DOS versions.

 

Lotus Word Pro 96 Edition for Windows (running on Windows 3.11) has the following open parameter for Word for MS-DOS files:

 

MS Word for DOS 3;4;5;6 (*.doc)

 

And Corel WordPerfect Version 6.1 for Windows (running on Windows 3.11) has these open parameters:

 

MS Word for Windows 1.0; 1.1 or 1.1a

MS Word for Windows 2.0; 2.0a; 2.0b; 2.0c

MS Word for Windows 6.0

 

None of these refer to any MS-DOS variants.

 

This pattern continues through more recent variants of each office suite.

 

The interesting finding here is that the Microsoft suites differentiate between versions 3, 4 and 5 (as a group) and version 6, but not between versions 3, 4 and 5 themselves, while the other suites (when they have a relevant parameter) do not differentiate between any of versions 3, 4, 5 or 6. If every office suite differentiated between the variants in the same way, this would indicate that there were significant differences between them. As they don’t, the evidence is inconclusive in this case. However, as Microsoft wrote the standards in this example, their suites ought to have the most reliable information, and it may therefore be sensible to conclude that version 6 is significantly different from versions 3, 4 and 5.

This pattern also holds for save-as parameters: the Microsoft suites differentiate between version 6 and the group of versions 3, 4 and 5, whereas the other suites do not. Where there is general agreement in both open and save-as parameters across multiple applications, digital preservation practitioners have very good reason to believe that there are significant differences between the formats in question.

This information can then be used to contribute to algorithms for identifying file formats. 
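As a rough illustration of how such compatibility information might feed an algorithm, the sketch below records how each application groups the Word for MS-DOS versions in its open dialog and then reports, for every pair of versions, which applications list them separately. The groupings are transcribed from the two applications above that name MS-DOS versions explicitly; the data structure and the `separation_votes` helper are assumptions made for this example, not an existing tool.

```python
from itertools import combinations

# One tuple per open-dialog entry, transcribed from the lists above.
GROUPINGS = {
    "Microsoft Word 6.0c": [("3", "4", "5"), ("6",)],
    "Lotus Word Pro 96": [("3", "4", "5", "6")],
}


def group_of(groups, version):
    """Return the dialog entry that contains the version, or None if it is not listed."""
    for group in groups:
        if version in group:
            return group
    return None


def separation_votes(groupings):
    """Map each pair of versions to the applications that list them as separate entries."""
    versions = sorted({v for groups in groupings.values() for g in groups for v in g})
    votes = {}
    for a, b in combinations(versions, 2):
        separating = []
        for app, groups in groupings.items():
            group_a, group_b = group_of(groups, a), group_of(groups, b)
            # Only applications that list both versions get a vote.
            if group_a is not None and group_b is not None and group_a != group_b:
                separating.append(app)
        votes[(a, b)] = sorted(separating)
    return votes


votes = separation_votes(GROUPINGS)
for pair, apps in votes.items():
    print(pair, "separated by", apps or "no application")
```

A pair that every application separates would be strong evidence of a significant difference between the variants. A pair separated by only some applications, as with versions 5 and 6 here, is the inconclusive case discussed above.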

 

4. Identifying a file’s format can help in identifying its rendering/interaction environment.

If we know the format of a file, and know which environments could and could not interact with files of that format and/or which environments had that format as a default, then we can dramatically reduce the set of possible rendering/interaction environments. This, along with the age of the file, can be particularly useful in the web archiving space for identifying the right rendering/interaction environment for old web pages.
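A minimal sketch of that narrowing step might look like the following. Everything in it is hypothetical: the `Environment` record, the catalogue entries and the ranking rule are placeholders, and it assumes that an environment released after the file was created cannot have been the one used to create it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    name: str
    opens: frozenset       # formats the environment can open
    default_format: str    # format the environment saves to by default
    released: int          # year the environment became available


def candidate_environments(environments, file_format, file_year):
    """Return environments that could have handled the file, best guess first."""
    candidates = [
        env for env in environments
        if file_format in env.opens and env.released <= file_year
    ]
    # Environments that use the format as their default sort first, then the
    # most recent release that still predates the file.
    return sorted(
        candidates,
        key=lambda env: (env.default_format != file_format, -env.released),
    )


# Hypothetical catalogue; names, formats and years are placeholders.
CATALOGUE = [
    Environment("env_a", frozenset({"fmt_1", "fmt_2"}), "fmt_1", 1992),
    Environment("env_b", frozenset({"fmt_2"}), "fmt_2", 1994),
    Environment("env_c", frozenset({"fmt_2", "fmt_3"}), "fmt_3", 1999),
]

names = [env.name for env in candidate_environments(CATALOGUE, "fmt_2", 1996)]
print(names)
```

The ranking is only one possible heuristic. A real catalogue would need compatibility data documented from the applications themselves, which is where the information described in point 3 comes in.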

If emulation is to be truly successful then it needs to become as automated as possible. Without automatic environment identification (or having the environments identified and documented by the creators), a significant part of that automation is going to be impossible.

 

 

So there are clearly overlaps between the two initiatives, which leads to the next question:

How can the two hackathons contribute to each other?

1. The emulation hackathon could work to produce a file ID workbench and/or an automated rendering/format validation tool that runs files through multiple environments using multiple open-as parameters to identify successful rendering environments by:

    1. Capturing error messages, text content and screenshots after opening the file
    2. Parsing error messages to identify non-compatible applications
    3. Scanning text from opened files to identify odd symbols indicating inappropriate rendering e.g.:

Figure 27: WordPerfect file rendered in Microsoft Office 2007

or

Figure 18: Microsoft Office 2007 rendering of a Microsoft Works 4.0 file with a sentence added and different word count

    4. Comparing screen captures to captures from known-successful renderings to identify rendering failures.

Or that can simply be used to manually open files to visually test their rendering and identify their format.

Such a tool would help a lot with both format identification (for use in implementing a migration strategy) and rendering environment identification (for use in implementing an emulation strategy). It would enable both the validation and the identification of file formats. It is essential to be clear about which of these is being conducted and which data is being used for each, so as to avoid circular validation based on identification using the same method.
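The text-scanning step of such a tool (step 3 in the list above) could start from something as simple as the sketch below. It assumes the text has already been captured from the emulated application, and it uses the share of control, private-use, unassigned and replacement characters as a stand-in for "odd symbols". The category list and the 5% threshold are guesses for illustration, not tested values.

```python
import unicodedata

# Unicode general categories treated as suspicious: control, unassigned,
# private use and surrogate characters.
SUSPECT_CATEGORIES = {"Cc", "Cn", "Co", "Cs"}
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def suspect_ratio(text):
    """Return the fraction of characters that look like rendering debris."""
    if not text:
        return 0.0
    suspect = sum(
        1
        for ch in text
        if ch == "\ufffd"  # the Unicode replacement character
        or (ch not in ALLOWED_CONTROLS and unicodedata.category(ch) in SUSPECT_CATEGORIES)
    )
    return suspect / len(text)


def looks_misrendered(text, threshold=0.05):
    """Flag captured text whose share of suspect characters exceeds the threshold."""
    return suspect_ratio(text) > threshold


print(looks_misrendered("A normal sentence.\n"))
print(looks_misrendered("ab\x00\x01\x02cd"))
```

A check like this only catches renderings that produce obviously foreign characters. Subtler failures, such as the added sentence and changed word count in Figure 18, would need a comparison against a known-good rendering.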

2. The emulation hackathon could produce environments in which to install applications that need their compatibility parameters documented. The file ID hackathon could then document these and use them to contribute to format identification algorithms.

The emulation framework already provides an option for format ID practitioners to start testing some of the approaches outlined above, and hopefully the emulation hackathon will enable or produce some more. So if you are planning on attending either of these events, please bear in mind the options for collaboration and working together to achieve the goals of both valuable ventures.

 
