During and around this year's iPRES a couple of discussions sprang up around the topic of proper software archiving, and it was part of the DP challenges workshop discussions. With services emerging around emulation, such as those developed in the bwFLA project (see e.g. the blog post on EaaS demo or Digital Art curation), proper measures need to be taken to make them sustainable from the software side. There are hardware museums around; something similar for software might be desirable too.
Research data, business processes, digital art and generic digital artefacts often cannot be viewed or handled simply by themselves. Instead they require a specific software and hardware environment to be accessed or executed properly. Software is a necessary mediator for humans to deal with and understand digital objects of any kind. In particular, artefacts based on any one of the many complex and domain-specific formats are often best handled by matching them with the application they were created with. Software can be seen as the ground truth for any file format. It is the software that creates files that truly defines how those files are formatted.
To make old software environments available on an automatable and scalable basis (for example, via Emulation-as-a-Service), proper enterprise-scale software archiving is required. At first glance the task appears huge because of the large amount of software that has been produced in the past. Nevertheless, much of that software is standard software used more or less all over the world, and there is a lot of low-hanging fruit that would be highly beneficial to preserve and make available. If components of software can be uniquely described, deduplication should also reduce the overall workload significantly. For a significant proportion of the software to be covered, licensing might complicate matters considerably, as different licensing variants were deployed in different domains and parts of the world, and current copyright and patent law differs between jurisdictions in how it applies to older software.
Types of Software
Institutions and users have to decide which software needs to be preserved, how and by whom. The answers to these questions will depend on the intended use cases. In simpler cases, all that may be needed to render preserved artefacts in emulated original environments is a few standard office or business environments with standard software. Complex use cases, such as those involving development systems or the preservation of complex business processes, may require very special, custom-made software components from non-standard sources.
Software components required to reproduce original environments for certain (complex) digital objects can be classified in several ways. Firstly, there are the standard software packages like operating systems and off-the-shelf applications sold in (significant) numbers to customers. Secondly, there can be different releases and various localized versions of these packages (the user-facing part of an application is often translated into different languages, as with Microsoft Windows or Adobe products), although the copies are otherwise often exactly the same. In general it does not really matter whether a French, English or German Word Perfect version is used to interact with a document. For the user dealing with it, or for an automated process such as migration-through-emulation, however, the different labelling of menu entries and error messages does matter.
The concept of versions is somewhat different for Open Source or Shareware-like software. Often there are many more "releases" available than with commercial software, as the software usually gets updated regularly and does not necessarily have a distinct release cycle. Also, unlike commercial software, open source packages typically feature full localization in a single release, as they do not need to distinguish between different markets.
In many domains custom-made software and user programming play a significant role. This can be scripts or applications written by scientists to run their analysis on gathered data, run specific computations, or extend existing standard software packages. Or it could be software tools written for governmental offices or companies to produce certain forms or to implement and configure certain business processes. Such software needs to be taken care of and stored alongside the preserved base files of an object in order to ensure that they can be accessed and interacted with in the future. The same applies to complex setups of standard components with lots of very specific configurations.
Where standard software is required, it would make sense to be able to assign each instance a unique identifier. This would help to avoid duplicated effort in storing copies. Even if a memory institution or commercial service maintains its own copy, it does not necessarily need to replicate the actual bits if other copies are already available somewhere. It may simply be able to manage its own licenses and use the bits/software copies provided by a central service. Additionally, it would simplify efforts to reproduce environments in an efficient way.
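The separation of shared bits from locally managed licenses can be pictured as a content-addressed registry. The following is only a minimal sketch under stated assumptions: the class `SoftwareRegistry`, its methods, the institution names and the choice of a SHA-256 digest as the unique identifier are all hypothetical and not part of any existing service.

```python
import hashlib
from collections import defaultdict


class SoftwareRegistry:
    """Hypothetical central store: one copy of the bits per unique identifier."""

    def __init__(self):
        self._blobs = {}                    # identifier -> bytes, stored once
        self._licenses = defaultdict(dict)  # identifier -> {institution: seats}

    def deposit(self, institution, data, seats):
        """Register a holding; the bits are stored only if not yet known."""
        identifier = hashlib.sha256(data).hexdigest()  # assumed ID scheme
        self._blobs.setdefault(identifier, data)
        held = self._licenses[identifier]
        held[institution] = held.get(institution, 0) + seats
        return identifier

    def stored_copies(self):
        """Number of distinct bitstreams actually kept."""
        return len(self._blobs)

    def seats(self, identifier, institution):
        """License seats an institution manages for a given identifier."""
        return self._licenses[identifier].get(institution, 0)


registry = SoftwareRegistry()
image = b"placeholder bytes standing in for an installation disk image"
id_a = registry.deposit("Archive A", image, seats=2)
id_b = registry.deposit("Library B", image, seats=5)
print(id_a == id_b, registry.stored_copies())
```

Both institutions end up referring to the same identifier while the bits are kept once; each still tracks its own license count.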
What Should be Identified?
Some ideas about how to identify and describe software have already been discussed for the upcoming PREMIS 3.0 standard, in particular for the section regarding environments. Suitable persistent identifiers would definitely be helpful for tagging software, something like the ISBNs or ISSNs that describe books and other media (or the DOIs that are becoming ubiquitous for digital artefacts). These tags would be useful for tool registries like TOTEM as well, or could be matched to PRONOM PUIDs. There are three layers of identification that could become relevant:
- On the most abstract layer a software instance is described as a complete package, e.g. Windows 3.11 US Edition, Adobe Page Maker Version X or Command & Conquer II, containing all the relevant installation media, license keys etc. The ID of such a package could be the official product code or be derived from it. However, with such an approach it might be difficult to distinguish between hidden updates. For example, during the software archiving experiment at Archives New Zealand we acquired and identified two different package sets of Word Perfect 6.0. So a more nuanced approach may be required.
- At the layer of the different media (relevant only if it is not just one downloaded installation package) each floppy disk or each optical medium (or USB media) could be distinguished. E.g. Windows 3.11 as well as applications like Word Perfect came with specific disks for just the printer drivers, and the CD (1 or 2) in the Command & Conquer game differentiated which adversary in the game you were assigned to.
- At the individual file layer executables, libraries, helper files like font sets etc. could be distinguished. The number of items in this set is the largest. An approach centred on maintaining a collection of digital signatures of known, traceable software applications is followed e.g. by the NSRL (National Software Reference Library) and may be the most appropriate option for these types of applications.
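The three layers above can be sketched as a small data model in which each identifier is derived from the layer below. This is an illustration only: the class names, the product code `WP-6.0` and the derivation rules are assumptions, and a real scheme would more likely fingerprint the disk image itself than the files it contains.

```python
import hashlib
from dataclasses import dataclass, field


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class Medium:
    label: str
    files: dict  # file name -> content bytes

    def file_ids(self):
        """File layer: one identifier per individual file."""
        return {name: digest(content) for name, content in self.files.items()}

    def medium_id(self):
        """Media layer: derived from the sorted file identifiers."""
        joined = "".join(f"{n}:{h}\n" for n, h in sorted(self.file_ids().items()))
        return digest(joined.encode())


@dataclass
class Package:
    product_code: str  # hypothetical stand-in for an official product code
    media: list = field(default_factory=list)

    def package_id(self):
        """Package layer: product code plus a fingerprint of all media."""
        joined = "".join(sorted(m.medium_id() for m in self.media))
        return f"{self.product_code}-{digest(joined.encode())[:12]}"


# Two package sets sold under the same product code but with different content
first = Package("WP-6.0", [Medium("Disk 1", {"WP.EXE": b"build one"})])
second = Package("WP-6.0", [Medium("Disk 1", {"WP.EXE": b"build two"})])
print(first.package_id() != second.package_id())
```

Deriving the package identifier from the media content, rather than from the product code alone, is one way to tell apart silently updated package sets that carry the same product code.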
Usually it is not trivial to map the installed files in an environment to files on the installation medium, as the files typically get packed (compressed in ‘archive’ files) on the medium and some files get created from scratch during the installation procedure.
Depending on the actual goal, the focus of the IDs will be different. To actually derive what kind of application or operating system is installed on a machine, file-level identifiers will be needed. To just reproduce a particular original environment (e.g. for emulation), package-level identifiers are more relevant. In some cases it may be useful to address a single carrier, e.g. to automate installation processes of standard environments consisting of an operating system and a couple of applications.
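Deriving what is installed from file-level identifiers could look roughly like the following sketch. It assumes a reference collection that maps package names to the hashes of their files as they appear after installation; the function names, the example package names and the coverage threshold are hypothetical. A threshold below 1.0 is used because some installed files are created or modified during installation and will never match a reference hash.

```python
import hashlib
from pathlib import Path


def hash_tree(root):
    """Return the set of SHA-256 digests of all regular files below root."""
    return {
        hashlib.sha256(p.read_bytes()).hexdigest()
        for p in Path(root).rglob("*")
        if p.is_file()
    }


def identify(installed_hashes, reference_sets, threshold=0.8):
    """Report packages whose known file hashes are mostly present."""
    found = {}
    for package, known in reference_sets.items():
        if not known:
            continue  # nothing to compare against
        coverage = len(known & installed_hashes) / len(known)
        if coverage >= threshold:
            found[package] = coverage
    return found


def h(data):
    return hashlib.sha256(data).hexdigest()


# Made-up reference data; in practice hash_tree() would supply `installed`
reference = {
    "Example OS": {h(b"kernel"), h(b"shell")},
    "Example Editor": {h(b"editor"), h(b"fonts")},
}
installed = {h(b"kernel"), h(b"shell"), h(b"editor"), h(b"user.cfg")}
result = identify(installed, reference, threshold=0.75)
print(sorted(result))
```

Here only "Example OS" is reported, since just half of the editor's known files are present in the environment.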
For the description of software and environments it might be useful to investigate what can be learned from commercial software installation handling and lifecycle management. Large institutions and companies have well-defined workflows to create software environments for certain purposes and their approaches may be directly applicable to the long term preservation use case(s).
Software Museum or Archive
What should be archived, who are the stakeholders and users and how can the archive be supported?
A model for near-complete archiving of a domain is the Computer Games Museum in Berlin, which receives a copy of every computer game that requires a USK classification. (USK is the German abbreviation for the Entertainment Software Self-Regulation Body, an organisation voluntarily established by the computer games industry to classify computer games.) The collection is supplemented by donations of a wide range of software (operating systems, popular non-gaming applications) and hardware items (computers, gaming consoles, controllers). Thus, the museum has acquired a nearly complete collection of the domain. An upcoming problem is the rising number of browser and online games which never get a representation on a physical medium. Another unresolved issue is the maintenance of the collection. At the moment the museum does not even have enough funds for bitstream preservation and proper cataloguing of the collection.
Archiving (of standard software) already takes place, for example at the Computer History Museum, the National Library of Australia, Archives New Zealand or the Internet Archive, to mention a few. Unfortunately, the activities are not coordinated. Neither the mostly "dark archives" of memory institutions nor the online sites offering deprecated software of questionable origin are sufficient for a sustainable strategy. Nevertheless, landmark institutions like national libraries and archives could be a good place to archive software in a general way. The archived software is, however, only of any use if it is properly described with standard metadata. Ideally, the software repositories would provide APIs to communicate with a central software archive and attach services to it. The service levels could range from just offering metadata information to offering access to complete software packages.
In addition to the basic services, museums could offer interactive access to selected original environments, as there is a significant difference between having a software package merely bitstream-preserved and having it available to explore and test interactively for a particular purpose. Often, specific, implicit knowledge is required to get a software item up and running, so keeping instances running permanently would be of great benefit. Archiving institutions like museums could try to build online communities around platforms and software packages. Live "exhibition" of software helps community exchange and can attract users with knowledge who would otherwise be difficult to find.
Software museums can help to reduce duplicated effort in archiving and describing standard software. At the very least, not every archive would need to store multiple copies of standard software; it could simply refer to other repositories. Software museums or archives could become brokers for (obsolete) software licenses. They could serve as a place to donate software (from public and private entities), firmware and platform documentation. Such institutions could make it simpler for a software company to take care of its digital legacy. A one-stop institution might be much more attractive to software vendors and archival institutions than the alternative of having multiple parties negotiate license terms of legacy packages with multiple stakeholders. Software companies might have a positive attitude towards such a platform, or lawmakers could be persuaded to push it a bit. Software escrow services (discussed e.g. within the TIMBUS EU project) can complement these activities. A museum can operate in different modes, for example with a not-for-profit branch for public presentation, community building, education etc. and a commercial branch that lends or leases out software to reproduce environments in emulators for commercial customers.
The situation could be totally different for research institutions and users of custom-made software. Such packages do not necessarily make sense in a (public) repository. In such cases the question arises of how the licensing will be handled. If obsolete, the packages could be handed over to the archive managing the primary research data.
Another issue is the handling of software versions. Products are updated until their announced end-of-life. Would it be necessary to keep every intermediate version, or to concentrate on general milestones? An operating system like Windows XP (32-bit) was officially available in several flavors (like Home or Professional) from 2001 till 2014. In many cases a "fuzzy matching" would be acceptable, as a certain software package runs properly in all versions. Other software might require a very specific version to function properly. This needs to be addressable (and could be matched to the appropriate PRONOM environment identifiers). In addition, there are a couple of preservation challenges in the software lifecycle.
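The difference between a "fuzzy" and a very specific version requirement can be sketched as follows. The data model is purely illustrative: the classes `Environment` and `Requirement`, their fields and the sample archive entries are assumptions, not an existing identifier scheme.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Environment:
    product: str
    edition: str
    release: str  # e.g. a service pack or build label


@dataclass(frozen=True)
class Requirement:
    product: str
    edition: Optional[str] = None  # None means "any edition"
    release: Optional[str] = None  # None means "any release"

    def matches(self, env: Environment) -> bool:
        if env.product != self.product:
            return False
        if self.edition is not None and env.edition != self.edition:
            return False
        if self.release is not None and env.release != self.release:
            return False
        return True


archive = [
    Environment("Windows XP", "Home", "SP1"),
    Environment("Windows XP", "Professional", "SP3"),
]
fuzzy = Requirement("Windows XP")
strict = Requirement("Windows XP", edition="Professional", release="SP3")
fuzzy_hits = [e.edition for e in archive if fuzzy.matches(e)]
strict_hits = [e.edition for e in archive if strict.matches(e)]
print(fuzzy_hits, strict_hits)
```

A requirement that leaves edition and release open is satisfied by any archived milestone, whereas a fully specified requirement only resolves if exactly that version has been kept.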
There are a number of questions which arise when creating or running a software archive or museum:
- On which level should a software archive be run: institutional (e.g. for larger (national) research institutions), state, federal or global? Or should a federated approach be favoured?
- Does it make sense (at all) to run a centralized software archive of a relevant size, assuming that for modern, complex scientific environments the software components are much too individual? What kind of software would be useful in such an archive? Which versions should be kept?
- Would it be possible to establish a PRONOM-like identifier system (agreed upon and shared among the relevant memory institutions)? Or use the DOI system to provide access to the base objects?
- How, and through which APIs, should software and/or metadata be offered (or ingested)?
- How should the software archive adapt to the ever changing form of installation media from tapes, floppies to optical media of different types to solely network based installations?
- Would it be possible to run the software archive as a backend, where locally ingested software is stored in the end?
- Is the advantage gained by centralizing knowledge and storage of standard software components big enough to outweigh the effort required to run such an archive?
- Do proper software license and handling models exist for such an archive, like donation of licenses, taking over abandoned packages, escrow services? Would it be possible to bridge the diverse interests of diverse users of a diverse range of software and software producers?
- Would there be advantages in running such an archive as/in a non-profit organisation? What business model would make most sense for such an organisation?