This article is the fifth in the series on monitoring ageing file formats. The underlying question is: Can we predict which file formats are likely to become obsolete? This project is part of the NDE Preservation Watch and Preferred File Formats program.
Original author: Rein van ‘t Veer
Several dozen different file formats have been discussed in the previous articles about data from participating organisations in the Dutch Digital Heritage Network (DDHN), including the Netherlands Institute for Sound and Vision (NISV) and Data Archiving and Networked Services (DANS). In this article we look at the available applications. Can the files still be opened in sustainably available applications – are they still well supported?
We start with the question of what “sustainably available” means for applications. The risk assessment of disappearing formats involves more than just the numbers in a particular digital collection. To what extent is the format easy to open and with what software? For example: a handful of HTML files in an archive poses a low risk. There is a good chance that in twenty years’ time the HTML file will be opened successfully without any additional software required. The Web has a tradition of “backwards compatibility”, a famous example of which is the Space Jam website from 1996, still online and viewable in modern browsers. We are looking for a form of sustainability that is, in a sense, an extension of the role of the archive: optimal availability, for the widest possible range of users, for as long as possible. The good news is that such initiatives exist in the field of software and are readily applicable.
This article is divided into two parts. First we look at what other relevant projects exist in the preservation of files in relation to applications to open them. We discuss national and international initiatives, the main question being: can we automatically extract from the available data sources which applications can open a specific file format? How many applications are known that support a certain format and from/until when were these applications in circulation? In the second part, we highlight a few file formats to see to what extent they can be opened “sustainably”, in ways that we explain there.
In the field of usable applications for file formats, much has already been explored. We look at the most relevant initiatives that have existed in this area: the Guide Preferred Formats, the United States National Archives and Records Administration (NARA) Digital Preservation Framework, Wikidata Digital Preservation, and finally the United Kingdom National Archives Public Record Office and Nôm 喃 [sic] (PRONOM). The striking “Nôm 喃” in the name is derived from a historical Vietnamese script, based on Chinese characters. What we are looking for is a method to relate a specific file format – as specifically as possible – to available software to open or convert it. Let’s take the Microsoft Access 95 file format as an example: it is very outdated, and the current version of MS Access can no longer open it.
Guide Preferred Formats
The most important initiative in the field of safeguarding digital files in the Netherlands can be found in the Guide Preferred Formats, an initiative of the DDHN. Participating organisations include the National Library of the Netherlands, the Dutch National Archives, the Netherlands Institute for Sound and Vision and a large number of other national and regional archives. It offers, among other things, a step-by-step plan and a registry of file formats, searchable by field of use, for example. While the guide is instrumental in standardising formats for better shelf life, it puts less emphasis on the question of what applications are used for source formats. The file formats in the registry of the Guide are derived from PRONOM (see below), which can be used to draw up a format policy by an archive organisation with a digital file collection. For example, the established policy may state that Access 95 is a non-preferred format.
The United States National Archives and Records Administration has launched a file format risk assessment project. It is primarily a tool for assigning scores to file formats based on, for example, whether the format has an open specification or not. An important outcome of this project was an overview table with risk scores for shelf life. While many of the listed formats are named after the software used to create them, a large number of “generic” file formats are missing. For the many file formats named after a software package, the table also offers no insight into which other applications can open them.
Wikidata has a wealth of information on applications and formats: it has inventoried about 200,000 applications and about 14,000 file formats. The list of applications includes not only user applications, but also applications such as operating systems and software libraries. Wikidata is therefore by far the largest inventory of data on file formats and applications. What has not yet been provided is a good link between the two. The page for the family of MS Access applications lists the supported formats, but not which version of MS Access supports the MS Access 95 format. The page for the MS Access 95 format does show a lot of other useful information, including a link to PRONOM. In short: at the time of writing this article it is difficult to know which software version can be linked to which file format version.
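To make the linking problem concrete, here is a minimal sketch of how one might ask Wikidata which software declares that it can read the format behind a given PUID. The property IDs (`P2748` for the PRONOM identifier, `P1072` for “readable file format”), the endpoint URL, and all function names are our assumptions for illustration and should be checked against Wikidata before use. As noted above, the answer will at best name applications, not the specific versions that support a specific format version.

```python
import json
import urllib.parse
import urllib.request

# Assumed Wikidata property IDs; verify them on wikidata.org before relying on them.
PUID_PROPERTY = "P2748"      # assumed: "PRONOM file format identifier"
READABLE_PROPERTY = "P1072"  # assumed: "readable file format"

# Assumed public SPARQL endpoint of the Wikidata Query Service.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(puid: str) -> str:
    """Return a SPARQL query for software that states it can read the format with this PUID."""
    return f"""
SELECT ?software ?softwareLabel WHERE {{
  ?format wdt:{PUID_PROPERTY} "{puid}" .
  ?software wdt:{READABLE_PROPERTY} ?format .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
""".strip()


def parse_labels(response: dict) -> list[str]:
    """Extract the software labels from a SPARQL JSON result document, sorted by name."""
    bindings = response.get("results", {}).get("bindings", [])
    return sorted(b["softwareLabel"]["value"] for b in bindings if "softwareLabel" in b)


def fetch_software(puid: str) -> list[str]:
    """Run the query against the endpoint (needs network access)."""
    params = urllib.parse.urlencode({"query": build_query(puid), "format": "json"})
    request = urllib.request.Request(
        WIKIDATA_ENDPOINT + "?" + params,
        headers={"User-Agent": "format-watch-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=30) as reply:
        return parse_labels(json.load(reply))
```

Calling `fetch_software("x-fmt/238")` would then list whatever applications Wikidata links to the MS Access 95 format at that moment; an empty list means the link has simply not been recorded, not that no software exists.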
The UK National Archives has embarked on a major project to provide insight into file formats and uses. A comprehensive set of tools allows not only file format identification, but also an index of formats and applications. For example, it lists the MS Access 95 format with a PRONOM universal identifier (PUID) which can be found in the WikiData entry for this format: “x-fmt/238”. The key question here is how such a file format can be related to software that can open it: a recent installation of MS Access cannot do that. Unfortunately, PRONOM isn’t quite there yet: it only mentions MS Access 2000, with a link to the supported MS Access 2000 format. All in all: PRONOM has come a long way in relating file formats to supported software, but we are not there yet. Another small hurdle is that the PRONOM website is very suitable for human “consumption”, but you need to know how to get machine-readable access to the data. However, the XML representations of formats provide very detailed and useful metadata (see examples in the “Further digging” section below).
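Getting at the machine-readable side of PRONOM is mostly a matter of knowing the URL pattern: the XML representation lives at the same address as the HTML page, with “.xml” appended. The sketch below builds both addresses for a format PUID. The base URL and the function name are assumptions on our part, and the pattern check only covers format identifiers such as “fmt/40” or “x-fmt/238”, not software entries.

```python
import re

# Assumed base URL of the PRONOM registry; check it against the live site.
PRONOM_BASE = "https://www.nationalarchives.gov.uk/PRONOM"

# Format PUIDs look like "fmt/40" or "x-fmt/238".
PUID_PATTERN = re.compile(r"^(x-)?fmt/\d+$")


def pronom_urls(puid: str) -> dict[str, str]:
    """Return the HTML and XML addresses for a format PUID."""
    if not PUID_PATTERN.match(puid):
        raise ValueError(f"not a format PUID: {puid!r}")
    html_url = f"{PRONOM_BASE}/{puid}"
    # The XML export is the HTML address with ".xml" appended.
    return {"html": html_url, "xml": html_url + ".xml"}
```

For the MS Access 95 format, `pronom_urls("x-fmt/238")["xml"]` gives the address of the XML export, which can then be fetched and parsed with any XML library.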
Related work: an interim conclusion
Many useful and valuable tools have been developed by Wikidata, DDHN, and the National Archives of the United States and the United Kingdom, among others, but none of these tools currently allows us to conduct quantitative research on application support for arbitrary file formats. Either there is not yet enough data available (PRONOM), or the links between specific formats and software have not yet been established (Wikidata). In addition, there is still a lack of tooling: few search interfaces make identification easy in both machine-readable and human-readable form. Ideally, a web application would be available into which a file can be dragged and dropped, which first identifies the file type (a web version of DROID, as it were), and then offers a list of associated applications.
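The two-step idea (identify first, then list applications) can be sketched in a few lines. This is a toy under stated assumptions, not a substitute for DROID: it recognises only TIFF and JPEG by their well-known leading bytes, and the application lists are simply the ones named for these two formats later in this article. All names in the code are hypothetical.

```python
from typing import Optional

# Leading bytes ("magic numbers") of two formats; real tools such as DROID
# use the much richer signatures maintained in PRONOM.
SIGNATURES = [
    (b"II*\x00", "TIFF"),     # TIFF, little-endian byte order
    (b"MM\x00*", "TIFF"),     # TIFF, big-endian byte order
    (b"\xff\xd8\xff", "JPEG"),
]

# Open source applications named for these formats in this article.
APPLICATIONS = {
    "TIFF": ["GIMP", "ImageMagick", "QGIS (for GeoTIFF)"],
    "JPEG": ["GIMP", "ImageMagick"],
}


def identify(header: bytes) -> Optional[str]:
    """Step 1: identify the file type from the first bytes of the file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


def suggest_applications(header: bytes) -> list[str]:
    """Step 2: look up applications for the identified type (empty if unknown)."""
    name = identify(header)
    return APPLICATIONS.get(name, []) if name else []
```

In practice one would read the first few bytes of a file with `open(path, "rb").read(16)` and pass them to `suggest_applications`. Note that this identifies “TIFF” only; it says nothing about the subtype.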
Applications for sustainable access
In this second part of the article, we discuss a number of file formats that we encountered in the previous analyses by the NISV and DANS. We explain what the ideal conditions are for opening old file formats and which current applications can be used to open them, given these circumstances.
Maximum version availability
Maximum availability of previous versions of software for support of old formats is paramount. It happens that recent versions of software no longer support old versions of the file formats designed for this software. The discontinued support for version 97 of Microsoft databases in recent Access versions is an example of this. A system is needed that functions as an archive: in which all previous “releases” of software can be retrieved and used.
Minimum licence administration
A lot of software needs to be registered with a licence key, often via a specially designed licence service. When old versions of software are deprecated, licences for these obsolete versions may no longer be issued, or the licence server may no longer be available. Licences should therefore stand in the way of installing old versions of software as little as possible. A so-called “vendor lock-in” can also mean that software manufacturers can charge exorbitant amounts when they are the only ones who provide access to a file format – it is clear that this should be avoided as much as possible.
The wish is for a minimum of dependence on old operating systems on which software is supported. Suppose a WordPerfect version can no longer be installed on the current version of Windows – then a virtual machine must be started with an old operating system, which can cost a considerable investment of time. A huge amount is now available in the field of emulation: a variety of old hardware and software platforms, including Commodore 64 machines, can be “emulated” on modern systems. Many older systems can even be used in a web browser, such as the first generation Apple Macintosh. Archive.org offers a number of such online emulators, and the Software Preservation Network led by Yale University is working on an initiative to offer emulation software as an online service. As beautiful as these possibilities are (especially for a retro-computing enthusiast), it is more accessible for an archive not to depend on them, due to a few very knowledge-intensive prerequisites:
- It requires knowledge of the emulated operating systems and the software to open the files. How do you actually use Mac OS 6? Which menus should you use to find a program?
- It requires knowledge of what combination of operating system and software is needed.
- Many emulator systems don’t necessarily make it easier to export data. Once the file is open in the emulator, how do you transfer the data to a modern system?
Open source software
The upshot of these requirements is that in an archival context it is advisable to use formats that can be opened with open source software as much as possible, and to use open source software as much as possible for reading and conversion. This does not necessarily mean that all these formats need to be open specifications or standards themselves – “proprietary” formats such as the DOC format are fine to open and convert with open source software such as LibreOffice. In many cases, open source software meets almost all of the above requirements. There is generally an absolute minimum of licensing hurdles. Open source software is usually stored in an open code repository such as GitHub, SourceForge, or GitLab so that older versions of the software remain available. Much open source software is available as releases for Linux, Mac, and Windows, something we’ll take into account in the discussion of file formats below.
In software availability for file formats, we only review multi-platform open source applications that have a lot of traction and are actively maintained. This is not an uncontroversial choice: in some cases there may actually be better paid applications available for creating files of a particular file format. However, opening and converting these file formats should be hindered as little as possible by vendor lock-ins, expensive software, or complex licensing issues.
A few highlighted formats
To investigate the availability of open source applications, we have selected a number of well-known and lesser known file formats from the previous archive analyses. It concerns five functional areas: images, text, audiovisual, geographical (geo) and table data (tabular).
The vast majority of files in the analysed archives are image files, with only two major file format types: TIFF and JPEG.
|File type||Supporting software|
|TIFF||e.g. GNU Image Manipulation Program (GIMP), ImageMagick, QGIS (for GeoTIFF)|
|JPEG||e.g. GNU Image Manipulation Program (GIMP), ImageMagick|
Due to the proliferation of images, there is a lot of open source software available for image formats, on all possible operating systems: the need for image software is almost universal. Applications such as GIMP can open and edit the files, while a program such as ImageMagick can convert them in batches from the command line. Both of these applications are multi-platform. That doesn’t mean that certain file formats don’t have availability challenges.
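As an illustration of such batch processing, the sketch below prepares one ImageMagick command per source file and only runs them when asked to. It assumes ImageMagick 7, where the command is called `magick` (older installations use `convert`), and it picks PNG as the target purely as an example; the actual target should follow the archive's own format policy. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_commands(sources, out_dir, target_ext="png", binary="magick"):
    """Return one 'magick <input> <output>' command per source image."""
    commands = []
    for src in sources:
        src = Path(src)
        # Keep the file name, swap the extension, and write into out_dir.
        target = Path(out_dir) / f"{src.stem}.{target_ext}"
        commands.append([binary, str(src), str(target)])
    return commands


def run_batch(commands):
    """Execute the prepared commands one by one, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the commands separately from running them makes it easy to inspect or log what will happen before any file is touched. Files that share a name but differ in extension (say `a.tif` and `a.tiff`) would map to the same output here, so a real script needs a collision check.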
JPEG is not much of a problem, but a prominent file format here is the TIFF format. The TIFF file type is especially problematic because it is a flexible container format, with many possible “subtypes” for different purposes with specific supporting software requirements. I have specifically highlighted the GeoTIFF format here, as it is well known in the geo world. Although GeoTIFF is an open standard, some geo-related knowledge is required to know that a GeoTIFF file is a GeoTIFF file at all: often the file name says little about the fact that the file contains extra metadata that is crucial for placing the image anywhere on the globe.
With a background in geo, the author of this article knows GeoTIFF well, but there are at least nine other “sub-types” and versions of TIFF for other domains, for which it is not entirely clear what specialised software they require to extract all (meta)data. In my opinion, it is therefore better to use an equivalent replacement format that does justice to the full reproduction of the contents of the original TIFF. Notable here is that TIFF is a preferred format with a high sustainability score, also in the NARA risk assessment, while identifying the correct subtype of TIFF can be tricky – tools such as DROID are required to identify it correctly. For the specific GeoTIFF subformat, we tried this – it is indeed correctly identified.
A significant portion of the file formats treated fall under the heading of “unstructured text”. By this we mean file formats that contain running text, possibly combined with embedded images or other media. Here we discuss a number of common formats.
|File type||Supporting software|
|PDF||For reading: e.g. Firefox, Chromium, Evince. Editing and converting: e.g. LibreOffice Draw, Inkscape.|
|DOC||e.g. LibreOffice Writer, OpenOffice|
|DOCX||e.g. LibreOffice Writer, OpenOffice|
As we have seen in the DANS analysis, the DOC format is clearly in decline. The versions of MS Office that produce this format are hardly in circulation anymore and almost everyone uses the successor format DOCX. DOC support is still strong, but it may be advisable to consider converting it to an open format such as ODT or PDF/A. A representative sample is needed to verify that the full formatting of the DOC files converts properly.
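A trial conversion of this kind could be sketched as follows: draw a reproducible sample of DOC files and prepare a headless LibreOffice conversion to ODT for each. This is only a sketch under assumptions: the helper names and the default seed are ours, `soffice` must be on the path, and a simple random draw is only “representative” if the collection is reasonably homogeneous. PDF/A output needs additional export filter options that are not shown here.

```python
import random


def pick_sample(paths, size, seed=2022):
    """Draw a reproducible random sample of files for a trial conversion."""
    ordered = sorted(paths)  # sort first so the draw does not depend on input order
    if size >= len(ordered):
        return ordered
    return random.Random(seed).sample(ordered, size)


def soffice_command(path, out_dir, target="odt"):
    """Return the LibreOffice command line for a headless conversion of one file."""
    return ["soffice", "--headless", "--convert-to", target, "--outdir", out_dir, path]
```

Each command can be run with `subprocess.run(command, check=True)`, after which the converted sample still has to be compared with the originals by eye: the tooling converts, but it does not tell you whether the layout survived.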
PDF as a text format is a special case. We mentioned earlier in the article about the approach to this project that there are quite a few different sub-specifications of PDF, not to mention the different versions of the PDF format. For each archived PDF file, it should therefore at least be known in which sub-format and which version the file is stored. Ideally, all of these files would be PDF/A, since this specification makes it impossible, for example, to make the file unreadable with a password. DROID is also important here to properly identify the subtype.
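The version part of that question can be answered roughly from the file header, which starts with `%PDF-` followed by the version number. The sketch below reads it; the function name is hypothetical. It is a first approximation only: a PDF may override the header version further inside the file, and the header says nothing about the sub-format. Whether a file claims to be PDF/A is declared in its embedded metadata, which is why a tool such as DROID remains necessary.

```python
import re
from typing import Optional

# A PDF file starts with a header such as "%PDF-1.4".
HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data: bytes) -> Optional[str]:
    """Return the version stated in the PDF header, or None if no header is found."""
    # Only look near the start of the file, where the header belongs.
    match = HEADER.search(data[:1024])
    if match is None:
        return None
    return f"{match.group(1).decode()}.{match.group(2).decode()}"
```

Reading the first kilobyte with `open(path, "rb").read(1024)` is enough input for this check.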
The audiovisual formats are mainly used by the NISV.
|File type||Supporting software|
|MP3||e.g. VLC, Audacity, Kdenlive|
|MPG||e.g. VLC, FFmpeg, Kdenlive|
|MXF||e.g. VLC, FFmpeg, Kdenlive|
|WAV||e.g. VLC, FFmpeg, Kdenlive|
There is little to worry about for the audio formats: the MP3 and WAV formats are widely supported and are clearly specified. As far as video formats are concerned, things are a bit less straightforward: the MXF format in particular is mainly used in professional circles, and, like TIFF, it is a container format that supports a variety of sub-formats, so it is difficult to guarantee that all of these will still be readable in a few decades. As long as it is clear that the included MXF files only contain open video encoding standards for which open source code is available, the risk is probably small.
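One way to find out what an MXF container actually holds is to ask `ffprobe`, the inspection tool that ships with FFmpeg, for the codec of each stream and compare the answer with a list of codecs the archive considers acceptable. The sketch below does that. The function names are hypothetical, and which codecs belong on the allow-list is a policy decision we do not make here; the one used in the usage note is a placeholder.

```python
import json
import subprocess


def ffprobe_command(path):
    """Return an ffprobe call that reports type and codec of each stream as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "stream=codec_type,codec_name",
        "-of", "json", path,
    ]


def codecs_from_json(text):
    """Turn ffprobe's JSON output into a list of (stream type, codec name) pairs."""
    streams = json.loads(text).get("streams", [])
    return [(s.get("codec_type", "unknown"), s.get("codec_name", "unknown")) for s in streams]


def unexpected_codecs(text, allowed):
    """Return the codec names in the output that are not on the allow-list."""
    return sorted({name for _, name in codecs_from_json(text) if name not in allowed})


def probe(path):
    """Run ffprobe on a file (requires FFmpeg to be installed)."""
    result = subprocess.run(ffprobe_command(path), capture_output=True, text=True, check=True)
    return codecs_from_json(result.stdout)
```

With a placeholder allow-list such as `{"pcm_s24le"}`, a file containing an `mpeg2video` stream and a `pcm_s24le` stream would be reported as having one unexpected codec, `mpeg2video`, which then becomes a candidate for closer inspection.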
Mainly DANS has to deal with geo-specific file formats, as much data in archaeological field drawings is stored in such formats.
|File type||Supporting software|
|SHP||e.g. QGIS, GDAL (all multi-platform)|
|TAB||e.g. QGIS, GDAL|
|GeoJSON||e.g. QGIS, GDAL|
The expectation is that Shapefile (SHP) will be around for a long time, but that MapInfo TAB files will slowly but surely disappear. Good replacement formats are emerging, such as the open GeoPackage standard, which can be used as a source format and as a preferred format. Surprisingly, this format is still missing from the overview of preferred formats for geo.
Almost all archives deal with data files in some tabular form. Due to their ease of use, spreadsheets are generally the most common form, compared to more complex but more structured data stores such as databases.
|File type||Supporting software|
|XLS||e.g. LibreOffice Calc (multi-platform)|
|XLSX||e.g. LibreOffice Calc|
|MDB||e.g. DBeaver, LibreOffice Base (all multi-platform): depending on version|
|ACCDB||e.g. DBeaver, LibreOffice Base|
We have already discussed the MDB format in detail above. It is, according to the consensus among Preservation Watch members, a prime candidate to undergo a migration trajectory. To keep this article concise – and having covered the MDB format extensively – we are not doing a comprehensive analysis of the various XLS, XLSX and ACCDB formats, subformats and versions here.
In this article we have discussed the application side as an important sustainability aspect of digital files. We’ve done this through two approaches: first, by asking what information is available on the Web to find out what software versions can open a specific file format. The conclusion is that many good initiatives have been developed that work towards this goal, but that there is still some work to be done before any file can be linked to one or more software versions.
Secondly, we looked at a number of common formats that we encountered in the previous archive analyses. Almost all files can still be opened well with open source applications, although there are enough formats with so many different specification versions and subformats that it is difficult to say whether all variants can be opened properly. A limiting factor is that in this series of analyses we have not looked inside files for details of the file specifications themselves; instead we have analysed the file metadata now available to us, especially filename extensions and MIME/IANA types. It is therefore advisable to do a content analysis of the specific file subtypes for all digital archives and to include the PUID and/or the Wikidata file type identifier in the metadata databases of these archives. Open source tools are available for this, and with more specific file metadata, more detailed risk analysis can be carried out, with more targeted migration processes.
For those who want to learn more about file formats and application versions:
- A Wikidata query for file formats, format titles, counts of related applications
- A Wikidata query for applications, application titles, PUIDs, and file format titles
- An XML export of the MS Access 95 format on PRONOM. This is the same URL as the HTML representation, with “.xml” appended. This means that an XML representation of every format and every application can be requested.
- An XML export of the MS Access 2000 software on PRONOM: the same URL, also with “.xml” after it.
- A Wikidata query listing applications for opening TIFF files
- A Wikidata query listing applications for opening JPEG files
© 2022 CC-BY-SA-4.0 Rein van ‘t Veer/Dutch Digital Heritage Network.