This article is the third in the series on monitoring ageing file formats. Can you predict which file formats are likely to become obsolete? This project is part of the Dutch Digital Heritage Network Preservation Watch and Preferred Formats programme.
Original author: Rein van ‘t Veer
In mid-2022, the Preservation Watch working group launched an open call for research on monitoring file formats and their life cycle. Its aim is to investigate the predictability of ageing file formats. In October 2022, Rein van ‘t Veer started working on this project as an author and “data scientist”.
In this third article we analyse the file formats used at the Netherlands Institute for Sound and Vision (NISV). For this we especially thank Mari Wigham, Willem Melder and Kiki Lennaerts. We will focus on the following questions:
- Are there clear trends of decreasing use of certain file formats at the NISV?
- Is the Bass model of value as a predictive model in case of decreasing use? Is the Bass model more accurate than a straight trendline counted from the time a file format’s usage begins to decline?
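For reference, the Bass diffusion model is commonly written as below. The symbols are the textbook ones, not taken from this project's source code: $m$ is the total number of files a format will eventually account for, $p$ the coefficient of innovation, $q$ the coefficient of imitation, and $t$ the time since the format's first appearance (here counted in quarters). Whether the analysis script uses exactly this parameterisation is an assumption.

```latex
n(t) = m \, \frac{(p+q)^2}{p} \,
       \frac{e^{-(p+q)t}}{\left(1 + \frac{q}{p}\, e^{-(p+q)t}\right)^{2}}
```

Here $n(t)$ is the expected number of files archived around time $t$. The curve rises to a single peak at $t^{*} = \ln(q/p)/(p+q)$ and then declines, which is what makes it a candidate for describing formats that fall out of use.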
For the analyses in this project, we gained access to metadata records from NISV in the form of a CSV file of 4.7 million records, amounting to more than 800 MB of data. We can only offer the aggregated data, since the source data cannot be shared as open data. The CSV source file contains, among other things, the file name, the file type and the date on which the digital file was created, as described in the delivery of the metadata by the NISV. You can find the CSV source file in the CLARIAH media suite. The aggregated data contains counts per month, which have been further reduced to quarters in the analysis script and in the analysis below.
The data was filtered on readability of both file type and creation date, which dropped almost 600 thousand of the 4.7 million records, a loss of 12.6%. The reason is that not every file has a creation date, or the date has not yet been registered. We therefore chose to use only dates that were made available in a consistent manner, so that the data is unambiguous. 12.6% is a substantial loss, but a loss of this size is not unusual when cleaning real-world datasets. After this, the source data was aggregated by file format and by period. The grouping by file format is fairly straightforward: the file format was extracted from the data, based on a combination of the file name extension and the file type as registered in the NISV archives.
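The filtering step described above can be sketched as follows. This is a minimal illustration, not the project's actual script: the column names `file_name`, `file_type` and `creation_date`, the ISO date format, and the four sample rows are all assumptions made for the example.

```python
import csv
import io
from datetime import datetime

def parse_date(value):
    """Return a date for strings like 2015-03-01, or None if unreadable."""
    try:
        return datetime.strptime(value.strip(), "%Y-%m-%d").date()
    except (ValueError, AttributeError):
        return None

def filter_records(rows):
    """Keep rows that have both a file type and a readable creation date."""
    kept = []
    total = 0
    for row in rows:
        total += 1
        file_type = (row.get("file_type") or "").strip()
        created = parse_date(row.get("creation_date") or "")
        if file_type and created is not None:
            kept.append({"file_type": file_type, "created": created})
    dropped = total - len(kept)
    loss_rate = dropped / total if total else 0.0
    return kept, loss_rate

# Tiny made-up sample standing in for the non-public source CSV.
sample = io.StringIO(
    "file_name,file_type,creation_date\n"
    "a.wav,WAV,2015-03-01\n"
    "b.tif,TIFF,\n"
    "c.mxf,,2012-07-15\n"
    "d.mp3,MP3,2019-11-30\n"
)
kept, loss_rate = filter_records(csv.DictReader(sample))
print(len(kept), f"{loss_rate:.1%}")
```

Reporting the loss rate alongside the kept records makes it easy to check whether a stricter or looser date rule changes how much data survives.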
Aggregating the data into periods is a matter of choosing wisely. A class width of one quarter turned out to be a good choice: on the one hand it irons out some irregularities in the data and thus produces a cleaner curve, and on the other hand it leaves enough data points for the main file formats to plot good graphs. Of the 16 identifiable file formats in the source data, 7 remained suitable for analysis. The dropped file formats were in all cases duplicates of the retained formats, described in a different way, such as “TIFF” or the MIME designation “image/tiff” instead of TIF. The numbers that dropped out are negligibly small.
We did not take the quarter in which this article was written (fourth quarter of 2022) into account, because the file counts for that quarter may not yet be complete. The counts per period and per file format have been exported and made available online as a JSON file. The source code for generating this aggregated data and performing the analysis in this article is also available as open source software.
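As an illustration of the aggregation described above, the sketch below rolls monthly counts up into quarters, skips the incomplete quarter and serialises the result as JSON. The function names, the quarter label format and the sample counts are hypothetical; the layout of the published JSON file may well differ.

```python
import json
from collections import defaultdict

def quarter_label(year, month):
    """Map a calendar month (1-12) to a label such as '2014-Q1'."""
    return f"{year}-Q{(month - 1) // 3 + 1}"

def aggregate_by_quarter(monthly_counts, incomplete_quarter=None):
    """Sum (file_format, year, month, count) tuples into quarterly totals.

    Counts that fall in `incomplete_quarter` are skipped, mirroring the
    exclusion of the quarter in which the article was written.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for file_format, year, month, count in monthly_counts:
        quarter = quarter_label(year, month)
        if quarter == incomplete_quarter:
            continue
        totals[file_format][quarter] += count
    return {fmt: dict(sorted(per_q.items())) for fmt, per_q in totals.items()}

# Made-up monthly counts, for illustration only (not NISV figures).
monthly = [
    ("WAV", 2014, 1, 10),
    ("WAV", 2014, 3, 5),
    ("WAV", 2014, 4, 7),
    ("MXF", 2014, 2, 4),
    ("WAV", 2022, 11, 3),
]
quarterly = aggregate_by_quarter(monthly, incomplete_quarter="2022-Q4")
print(json.dumps(quarterly, indent=2))
```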
The following formats remained from the pre-processing:
|Format|Number of files|Description|
|---|---|---|
|TIFF|1876108|The tagged image file format, a container format for images.|
|WAV|1056765|The well-known WAV audio format.|
|MXF|977133|Material Exchange Format, a container format for digital video and audio.|
|MP3|114405|The well-known audio compression file format.|
|TAR|58429|A bundling format for digital files. It is unclear which formats are included in these TAR files.|
|PDF|36853|The Portable Document Format, a document file format.|
|MPG|16446|A collection of video compression standards and file formats.|
It is striking that the most commonly used format here is TIFF, a digital image file format. The list is also considerably shorter than the list of formats analysed from the Common Crawl dataset discussed in the previous blog.
The list is short enough that we can discuss the analysis of each of these seven file formats. The graphs show the absolute numbers of files found per quarter (in blue), the historical trends that the Bass models can extract (in red) and the projections that the Bass models make over the last four quarters. These projections vary in accuracy: not all Bass projections are equally close to the actual measurements over the last four measured quarters. We discuss the formats and their counts over time in more detail below.
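The comparison used throughout the sections below (fit on the history, project the last four quarters, compare with a straight trendline drawn from the peak) can be sketched as follows. This is a rough sketch under stated assumptions and not the project's analysis code: the coarse grid search, the mean absolute error as yardstick and every name in the block are choices made for this example. The counts are synthetic and generated from a Bass curve, so the outcome only demonstrates the mechanics and says nothing about the NISV data.

```python
import math

def bass_density(t, p, q):
    """Bass adoption density for innovation p and imitation q at time t."""
    e = math.exp(-(p + q) * t)
    return ((p + q) ** 2 / p) * e / (1 + (q / p) * e) ** 2

def fit_bass(counts):
    """Coarse grid search over p and q; m follows in closed form."""
    best = None
    ts = range(1, len(counts) + 1)
    for i in range(1, 51):
        p = i * 0.002
        for j in range(1, 101):
            q = j * 0.02
            f = [bass_density(t, p, q) for t in ts]
            # For fixed p and q, the least-squares scale m is a simple ratio.
            m = sum(y * v for y, v in zip(counts, f)) / sum(v * v for v in f)
            sse = sum((y - m * v) ** 2 for y, v in zip(counts, f))
            if best is None or sse < best[0]:
                best = (sse, m, p, q)
    return best[1:]

def fit_line(ts, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(ts)
    mean_t, mean_y = sum(ts) / n, sum(ys) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    den = sum((t - mean_t) ** 2 for t in ts)
    slope = num / den
    return slope, mean_y - slope * mean_t

def mean_abs_error(actual, predicted):
    return sum(abs(a - b) for a, b in zip(actual, predicted)) / len(actual)

# Synthetic quarterly counts (NOT NISV data), 24 quarters in total.
counts = [100_000 * bass_density(t, 0.01, 0.3) for t in range(1, 25)]
train, holdout = counts[:-4], counts[-4:]
future = range(len(train) + 1, len(counts) + 1)

m, p, q = fit_bass(train)
bass_pred = [m * bass_density(t, p, q) for t in future]

# Straight trendline counted from the quarter in which usage peaks.
peak = train.index(max(train)) + 1
slope, intercept = fit_line(list(range(peak, len(train) + 1)), train[peak - 1:])
line_pred = [max(0.0, slope * t + intercept) for t in future]

bass_err = mean_abs_error(holdout, bass_pred)
line_err = mean_abs_error(holdout, line_pred)
print(f"Bass MAE: {bass_err:.1f}, linear MAE: {line_err:.1f}")
```

Clamping the linear projection at zero avoids negative file counts, which a straight line would otherwise produce once it crosses the axis.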
Non-disappearing formats
NISV amount of archived MP3 files
For MP3 files, the Bass model cannot really find a good “fit”. Apart from the first measurement in 2018, deliveries are fairly stable and show no significant decrease. There seems to be an interesting wave movement, in which clearly fewer files are archived in one part of the year (a “summer break”?) than in the other. Given the relative stability of these numbers, we can safely leave this file format out of consideration. The Bass model is clearly less accurate than the linear model here: based on low starting numbers in the first measurements, it expects a decline that does not exist. Since this is a stable format, there is no reason to judge the Bass model on this. Our main concern is that it can make reliable predictions for disappearing formats.
NISV amount of archived WAV files
The number of archived WAV files shows a clear peak period between 2009 and 2016, but has not shown a clear decrease in the past few years; if anything, there is a modest increase. If we were to draw a straight trendline through the period 2017-2022, we would see a slight increase in numbers, so we do not have to rank WAV among the disappearing file formats. The Bass model significantly underestimates the number of archived files of this type over the past five years, but since the WAV format has been fairly stable in use over the past few years, we shouldn’t hold the model to account for this. We have therefore omitted a comparison with a linear model here: it does not add any extra information.
Disappearing non-audiovisual formats
NISV amount of archived TIFF files
The TIFF files historically show very irregular numbers of registered files. Even aggregating from days to quarters cannot iron out the irregularities here: apparently large numbers of TIFF files were created and archived at irregular times. However, based on the data over the past six years, it is easy to say whether the numbers are declining: no files at all have been registered since 2016. The Bass model agrees with us here. It accurately predicts the absence of files over the last four quarters, whereas a linear trendline is no longer effective in these last quarters. It is evident from these data that the file type has fallen into disuse at the NISV. NDE has published a preferred file format guide that can help in determining a replacement file format.
NISV amount of archived TAR files
The graph clearly shows that relatively little use is made of the TAR format at the NISV. TAR is a file format for bundling other files, which needs to be unpacked before it is clear what file formats it contains. This not only makes it more difficult to use, but also more difficult to judge which “preferred format” it should be converted to, as that depends on the file types packaged in it. Perhaps this also explains the reduced use: it is better to archive the data unpackaged than in packaged form. The Bass model clearly gives a better forecast for the last four quarters than the linear trendline.
NISV amount of archived PDF files
Since PDF is not an audio or video file type, it is a bit of an odd one out for the NISV. Its archiving pattern at Sound & Vision is also striking: two major peaks in 2014 and 2015, followed by virtually no activity. Since PDF files are used in large numbers by other digital archives, we do include them in the analysis. Although the Bass model manages to find a good “fit” to the data, it leaves something to be desired with regard to accuracy. Even so, the linear trendline is far behind it.
NISV amount of archived MXF files
The MXF file format was clearly more popular in the early period of the measurements: from 2008 to about 2018, Sound & Vision still archived tens of thousands of files in this format per quarter, whereas in the past few years the numbers have fallen to several thousand per month. Although the Bass model makes a somewhat more pessimistic estimate than the actual measurements, and although MXF is a preferred format for the NISV, this trend may be a reason to consider a different “preferred format” than MXF.
One of the problems with MXF is that it is a container format, like TIFF is for images, that can accommodate a wide variety of codecs, all of which are difficult to support and maintain. Open video software like VLC can play MXF files, but depends on installations of the codecs used in the MXF file.
The Bass model (in orange) clearly struggles to find a good “fit” for the shape of the graph. Nevertheless, the model clearly scores better on the test data of the past four quarters: the linear trendline (in red) underestimates the numbers over the last four quarters (in purple) much more than the Bass model does (in green).
NISV amount of archived MPG files
MPG or MPEG as a file type is an umbrella term for a number of video and audio formats widely supported by video and audio software. Unfortunately, we do not have a better definition of these files or of the specific codecs used in them. Despite the general popularity of the file type in the world, the number of archived files at the NISV has fallen sharply in recent years: no files of this type have been registered in the past three years. Is this a reason to switch to a different file type? And if so, which one? In any case, the Bass model is clearly a better model here than the linear model.
In this article we looked at the file types used by the archives of the Netherlands Institute for Sound and Vision. It is striking that only two video file formats emerged in the usable counts of the NISV: MXF and MPEG. Perhaps even more striking is that no MPEG files have been registered in the past few years, and that the numbers of MXF files also saw a significant decrease after the period 2009-2012. Are fewer files being archived or digitised at Sound & Vision in general, or do we not have a complete picture?
It is also striking that considerable numbers of “packaged” files in TAR format were registered in the initial phase of the measurements, whereas files in this format have hardly been used from 2014 onwards. That in itself is positive: the (meta)data is easier to interpret and share when it is not packaged in a generic bundle format such as TAR, which may well explain the decrease.
Looking at the usefulness of the Bass model in predicting the declining file formats, we see that in all cases the model significantly outperforms the simpler linear trendlines. We can therefore regard this experiment as a success: the model is clearly of greater value than a simple linear model.
Do you want to learn how to make predictions about the life cycle of file formats in an e-depot?
We will host an in-person workshop in Dutch where you get to work hands-on with your own data. Join us on 5 September 2023, 13:00-16:00, in The Hague, the Netherlands.
© 2022 CC-BY-SA-4.0 Rein van ‘t Veer/Network Digital Heritage