Research Activities and Open Questions at Archives New Zealand
At Archives New Zealand we are currently working on a number of digital preservation research activities including:
- The collation of a sample set of files for use in testing tools and approaches and other digital preservation experiments.
- Documenting software applications and environments.
- Developing an evidence base of migration/normalization and emulation tests.
The purpose of this post is to raise awareness of our work in the wider community and to get some feedback on the activities we are undertaking. There are also a number of specific questions throughout on which we would particularly appreciate feedback.
Collating a sample set of files
As has been discussed a lot in the community recently, most digital preservation research requires sample files for use in testing. We have also recognised this at Archives NZ and have taken it upon ourselves to produce and make available a set of files that may be of interest to the wider community. We plan on making it available via the internet and hope to have the first set available by the end of June. Currently the sample is made up of files from a number of sources:
- Files from actual archives that we have had transferred to us.
- Files provided by a Crown Research Institute (http://www.sciencenewzealand.org/).
- Personal files from members of our team.
The aim with the sample is to provide items that have varying but real value to individuals, agencies and the wider public. This value component is mostly lacking in other available sample sets, and for good reason: files of real value are often difficult to make publicly available because of confidentiality or intellectual property concerns. This has also been a problem for us and is one of the reasons it is taking us quite a long time to make the set available. The set has to be checked to ensure that it is appropriate for public release.
The set looks like it will be around 550 MB in size, so we have been debating internally what the best option is for making it available. I would like to ask the community for suggestions on this. We have three main questions:
- What would be the best way to provide this information: in a compressed container file (e.g. zip, tar, rar), in an ISO file, as individual files, or in some other form?
- Where should the information be posted? We could potentially make it available via the Archives New Zealand website but are there better places for it to live?
- What information should be included with it to describe the files? This will be limited by what we have available. For example, we obtained much of the material from floppy disks and other portable media, and some of these disks were labelled. Would this label information be of use to the community?
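To make the third question more concrete, here is a minimal sketch of one possible accompanying manifest. It is not a description of how the set will actually be packaged. It assumes the files sit in one folder per source disk, and the function names (`build_manifest`, `write_manifest`), the column names and the idea of keying disk labels by top-level folder are all hypothetical choices made for illustration.

```python
import csv
import hashlib
from pathlib import Path
from typing import Dict, List, Optional

# Hypothetical column names; nothing here reflects a decided format.
FIELDS = ["path", "size_bytes", "sha256", "media_label"]


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path, media_labels: Optional[Dict[str, str]] = None) -> List[dict]:
    """Describe every file under root.

    media_labels maps a top-level folder name (assumed here to stand for
    one source disk) to the text written on that disk's label, if any.
    """
    media_labels = media_labels or {}
    rows = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        parts = path.relative_to(root).parts
        top_folder = parts[0] if len(parts) > 1 else ""
        rows.append({
            "path": "/".join(parts),
            "size_bytes": path.stat().st_size,
            "sha256": sha256_of(path),
            "media_label": media_labels.get(top_folder, ""),
        })
    return rows


def write_manifest(rows: List[dict], destination: Path) -> None:
    """Write the manifest rows to a CSV file."""
    with destination.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

A manifest like this would work with any of the packaging options above, since the checksums let a recipient confirm that files extracted from a zip, tar or ISO match what was published. Write the CSV outside the folder being described so that it does not list itself.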
Documenting Software Applications and Environments
We are currently doing some development of our archival description tool/database, and as part of that we are looking to include the ability to document the creating application/environment and/or the intended rendering application/environment for all digital items we control. Unfortunately neither of these fields is going to be easy to populate: because of the volume, population will have to be automated in most cases, and either no tools are available to infer that information or the tools that are available are not really up to the challenge.
In order to fill this gap and so we can document other experimentation we are doing (more on that below) we have been experimenting with an application/environment documentation database which is intended to document every application we hold, its dependencies, and the various parameters that the applications have which may be useful to know for digital preservation purposes. These parameters are things such as save-as parameters, open parameters, and import and export parameters.
Something like this would be useful for many purposes, but in doing this experimentation we have already learnt quite a lot about the complexity/volume of different sets of code that are used to open or save files captured with different “formatting standards”. For example, across the 33 applications documented so far there are 505 save-as parameters, 77 of which have a .doc extension associated with them. This implies that there are many different sets of code being used to write “.doc” files, which explains some of the changes that are found when files are subsequently opened in different software environments (i.e. the internal structure of the files is different to that expected by the rendering application). A similar example can be given for the open parameters and the volume of different sets of code used to open files structured with the same “format”.
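As a rough illustration of the kind of structure involved, the sketch below stores applications and their parameters in SQLite and counts parameters per extension. It is not our actual database: the table and column names are hypothetical, dependencies are left out for brevity, and the applications in the usage example are invented.

```python
import sqlite3

# Hypothetical schema: one row per application, one row per parameter.
SCHEMA = """
CREATE TABLE application (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    version TEXT NOT NULL
);
CREATE TABLE parameter (
    id INTEGER PRIMARY KEY,
    application_id INTEGER NOT NULL REFERENCES application(id),
    kind TEXT NOT NULL CHECK (kind IN ('save-as', 'open', 'import', 'export')),
    label TEXT NOT NULL,
    extension TEXT NOT NULL
);
"""


def create_database() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(SCHEMA)
    return connection


def add_application(connection, name, version, parameters):
    """Insert an application and its (kind, label, extension) parameters."""
    cursor = connection.execute(
        "INSERT INTO application (name, version) VALUES (?, ?)",
        (name, version),
    )
    application_id = cursor.lastrowid
    connection.executemany(
        "INSERT INTO parameter (application_id, kind, label, extension) "
        "VALUES (?, ?, ?, ?)",
        [(application_id, kind, label, extension.lower())
         for kind, label, extension in parameters],
    )
    return application_id


def count_parameters_by_extension(connection, kind):
    """Return {extension: number of parameters of the given kind}."""
    rows = connection.execute(
        "SELECT extension, COUNT(*) FROM parameter WHERE kind = ? "
        "GROUP BY extension ORDER BY COUNT(*) DESC, extension",
        (kind,),
    )
    return dict(rows.fetchall())


# Invented example data, for illustration only.
db = create_database()
add_application(db, "Example Writer", "1.0", [
    ("save-as", "Word 6.0/95", ".doc"),
    ("save-as", "Word 97-2003", ".doc"),
    ("save-as", "Rich Text", ".rtf"),
    ("open", "Word 97-2003", ".doc"),
])
add_application(db, "Example Office", "2.0", [
    ("save-as", "Word 97-2003", ".DOC"),
])
print(count_parameters_by_extension(db, "save-as"))
```

Counting rows per extension in this way is how a figure like "77 save-as parameters with a .doc extension" could be derived, and the same query with `kind` set to `'open'` would give the equivalent count for open parameters.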
There is a bit of a chicken-and-egg (http://en.wikipedia.org/wiki/Chicken_or_the_egg) problem with application/environment documentation in the digital preservation community at the moment, as there is no equivalent of a PUID for applications/environments. This means that it is difficult to create tools to identify which app/environment is needed to render a file (or set of files), or was used to create them, as there is no standard way of identifying/documenting the app/environment (that I am aware of). [Edit: PRONOM has information about, and PUIDs for, applications, e.g. http://www.nationalarchives.gov.uk/PRONOM/Software/proSoftwareSearch.aspx?status=detailReport&id=14. We will have to synchronise these.]
The database we have been creating is fairly rudimentary at the moment but is helping us to understand what requirements we have in this area. It would be great to see the Open Planets Foundation provide something to fill the gap here, and there is potential for us to share our requirements if such a project were to be undertaken.
Developing a Migration/Normalization and Emulation Evidence Base
The third area of research that we have been investigating at Archives New Zealand has been in the development of a migration/normalization and emulation evidence base for use in making decisions about preservation strategies. This work is currently getting started and involves testing the rendering of digital objects across a number of different rendering environments. For any one object we may test its rendering on the following:
- (What we believe or know to be) the object’s original creating or rendering software running on representative hardware from the era that it was created.
- The object’s original creating or rendering software running on an emulated or virtualised set of hardware (QEMU, VMware or VirtualBox).
- Various current software applications such as OpenOffice, LibreOffice, Microsoft Office 2007, Corel WordPerfect X5, etc., in order to represent the object as migrated using those applications.
We are using a LimeSurvey survey to document each test rendering. The survey asks about the different variables that may change across renderings. It also has conditional questions that depend on the kind of object being tested (it doesn’t ask about slide transitions for spreadsheets, for example). The survey currently has around 140 (mainly yes/no/comment) questions in total, though for any one test only a portion of these will appear to the tester (as few as ~6-10). We are also endeavouring to take screenshots of the process when needed.
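The conditional behaviour can be sketched in a few lines. This is not how LimeSurvey implements conditions, and the question codes, wording and object types below are invented for illustration. It only shows the idea that each question declares which object types it applies to, so a tester sees a small subset of the full list.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Question:
    code: str
    text: str
    # None means the question is asked for every object type.
    applies_to: Optional[FrozenSet[str]] = None


# Invented questions; the real survey has around 140.
QUESTIONS: List[Question] = [
    Question("open", "Did the object open without an error message?"),
    Question("fonts", "Do the fonts appear unchanged?",
             frozenset({"document", "presentation", "spreadsheet"})),
    Question("transitions", "Do the slide transitions play as expected?",
             frozenset({"presentation"})),
    Question("formulas", "Do the cell formulas give the same values?",
             frozenset({"spreadsheet"})),
]


def questions_for(object_type: str,
                  questions: List[Question] = QUESTIONS) -> List[Question]:
    """Return only the questions that apply to the given object type."""
    return [q for q in questions
            if q.applies_to is None or object_type in q.applies_to]


for question in questions_for("spreadsheet"):
    print(question.code, "-", question.text)
```

With these invented questions, a spreadsheet test shows the general, font and formula questions but not the one about slide transitions.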
As you might imagine, this experimentation is quite time consuming and would benefit from being replicated and/or extended by others, as we simply won’t have the time/resources to run these tests across as many objects as we would like. At the conclusion of the testing we intend to publish the results and a description of the methodology so that others can do similar tests.