Assessing file format risks: searching for Bigfoot?

Last week someone drew my attention to a recent iPres paper by Roman Graf and Sergiu Gordea titled "A Risk Analysis of File Formats for Preservation Planning". The authors propose a methodology for assessing preservation risks for file formats using publicly available information sources. In short, their approach involves two stages:

  1. Collect and aggregate information on file formats from data sources such as PRONOM, Freebase and DBPedia
  2. Use this information to compute scores for a number of pre-defined risk factors (e.g. the number of software applications that support the format, the format's complexity, its popularity, and so on). A weighted average of these individual scores then gives an overall risk score.
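As a rough illustration of the second stage, the weighted average could look like the Python sketch below. This is not FFMA's actual code: the factor names, the scores, the weights and the convention that scores run from 0 (low risk) to 1 (high risk) are all assumptions made up for this example.

```python
def overall_risk(scores, weights):
    """Weighted average of per-factor risk scores.

    Assumption for this sketch: each score lies between 0 (low risk)
    and 1 (high risk), and weights are non-negative numbers.
    """
    total_weight = sum(weights[factor] for factor in scores)
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    weighted = sum(scores[factor] * weights[factor] for factor in scores)
    return weighted / total_weight


# Hypothetical factors, scores and weights, invented for illustration only.
scores = {"software_count": 0.8, "complexity": 0.4, "popularity": 0.2}
weights = {"software_count": 2.0, "complexity": 1.0, "popularity": 1.0}

print(overall_risk(scores, weights))
```

Note that every choice in this sketch (which factors, which weights, how a raw count is mapped to a score) has to be made by someone before a single number comes out.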

This has resulted in the "File Format Metadata Aggregator" (FFMA), which is an expert system aimed at establishing a "well structured knowledge base with defined rules and scored metrics that is intended to provide decision making support for preservation experts".

The paper caught my attention for two reasons: first, a number of years ago some colleagues at the KB developed a method for evaluating file formats that is based on a similar way of looking at preservation risks. Second, just a few weeks ago I found out that the University of North Carolina is also working on a method for assessing "File Format Endangerment" which seems to be following a similar approach. Now let me start by saying that I'm extremely uneasy about assessing preservation risks in this way. To a large extent this is based on experiences with the KB-developed method, which is similar to the assessment method behind FFMA. I will use the remainder of this blog post to explain my reservations.

Criteria are largely theoretical

FFMA implicitly assumes that it is possible to assess format-specific preservation risks by evaluating formats against a list of pre-defined criteria. In this regard it is similar to (and builds on) the logic behind, to name but two examples, Library of Congress' Sustainability Factors and UK National Archives' format selection criteria. However, these criteria are largely based on theoretical considerations, without being backed up by any empirical data. As a result, their predictive value is largely unknown.

Appropriateness of measures

Even if we agree that criteria such as software support and the existence of migration paths to some alternative format are important, how exactly do we measure this? It is pretty straightforward to simply count the number of supporting software products or migration paths, but this says nothing about their quality or suitability for a specific task. For example, PDF is supported by a plethora of software tools, yet it is well known that few of them support every feature of the format (possibly even none, with the exception of Adobe's implementation). Here's another example: quite a few (open-source) software tools support the JP2 format, but for this many of them (including ImageMagick and GraphicsMagick) rely on JasPer, a JPEG 2000 library that is notorious for its poor performance and stability. So even if a format is supported by lots of tools, this will be of little use if the quality of those tools is poor.

Risk model and weighting of scores

Just as the employed criteria are largely theoretical, so is the computation of the risk scores, the weights that are assigned to each risk factor, and the way the individual scores are aggregated into an overall score. The latter is computed as the weighted sum of all individual scores, which means that a poor score on, for example, Software Count can be compensated by a high score on other factors. This doesn't strike me as very realistic, and it is also at odds with e.g. David Rosenthal's view of formats with open source renderers being immune from format obsolescence.
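The compensation effect is easy to demonstrate with a small Python sketch. The factor names, weights and scores are invented for illustration (scores run from 0 for low risk to 1 for high risk), and none of this is taken from FFMA itself. The worst-factor figure is only printed for contrast; it is not something the paper proposes.

```python
def weighted_sum(scores, weights):
    """Overall score as the weighted sum of the per-factor scores."""
    return sum(scores[factor] * weights[factor] for factor in scores)


# Hypothetical factors with equal weights (an assumption, not FFMA's values).
FACTORS = ["software_count", "complexity", "popularity", "documentation"]
weights = {factor: 0.25 for factor in FACTORS}

# Format A: no supporting software at all, but ideal on every other factor.
format_a = {"software_count": 1.0, "complexity": 0.0,
            "popularity": 0.0, "documentation": 0.0}
# Format B: slightly risky across the board.
format_b = {factor: 0.25 for factor in FACTORS}

for name, scores in [("A", format_a), ("B", format_b)]:
    print(name,
          "weighted sum:", weighted_sum(scores, weights),
          "worst factor:", max(scores.values()))
```

Both hypothetical formats end up with the same weighted sum, even though format A cannot be opened by any software at all. Only the worst-factor column tells them apart.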

Accuracy of underlying data

A cursory look at the web service implementation of FFMA revealed some results that make me wonder about the data that are used for the risk assessment. According to FFMA:

  • PNG, JPG and GIF are uncompressed formats (they're not!);
  • PDF is not a compressed format (in reality text in PDF nearly always uses Flate compression, whereas a whole array of compression methods may be used for images);
  • JP2 is not supported by any software (Software Count=0!), it doesn't have a MIME type, it is frequently used, and it is supported by web browsers (all wrong, although arguably some browser support exists if you account for external plugins);
  • JPX is not a compressed format and it is less complex than JP2 (in reality it is an extension of JP2 with added complexity).

To some extent this may also explain the peculiar ranking of formats in Figure 6 of the paper, which marks down PDF and MS Word (!) as formats with a lower risk than TIFF (GIF has the overall lowest score).

What risks?

It is important to note that the concept of 'preservation risk' as addressed by FFMA is closely related to (and has its origins in) the idea of formats becoming obsolete over time. This idea is controversial, and the authors do acknowledge this by defining preservation risks in terms of the "additional effort required to render a file beyond the capability of a regular PC setup in [a] particular institution". However, in its current form FFMA only provides generalized information about formats, without addressing specific risks within formats. A good example of this is PDF, which may contain various features that are problematic for long-term preservation. Also note how PDF is marked as a low-risk format, despite the fact that it can be a container for JP2, which is considered high-risk. So doesn't that imply that a PDF that contains JPEG 2000 compressed images is at a higher risk?

Encyclopedia replacing expertise?

A possible response to the objections above would be to refine FFMA: adjust the criteria, modify the way the individual risk scores are computed, tweak the weights, change the way the overall score is computed from the individual scores, and improve the underlying data. Even though I'm sure this could lead to some improvement, I'm eerily reminded here of this recent blog post by Andy Jackson, in which he shares his concerns about the archival community's preoccupation with format, software, and hardware registries. Apart from the question whether the existing registries are actually helpful in solving real-world problems, Jackson suggests that "maybe we don't know what information we need", and that "maybe we don't even know who or what we are building registries for". He also wonders if we are "trying to replace imagination and expertise with an encyclopedia". I think these comments apply equally well to the recurring attempts at reducing format-specific preservation risks to numerical risk factors, scores and indices. This approach simply doesn't do justice to the subtleties of practical digital preservation. Worse still, I see a potential danger of non-experts taking the results from such expert systems at face value, which can easily lead to ill-judged decisions. Here's an example.

KB example

About five years ago some colleagues at the KB developed a "quantifiable file format risk assessment method", which is described in this report. This method was applied to decide which still image format was the best candidate to replace the then-current format for digitisation masters. The outcome of this was used to justify a change from uncompressed TIFF to JP2. It was only much later that we found out about a host of practical and standard-related problems with the format, some of which are discussed here and here. None of these problems were accounted for by the earlier risk assessment method (and I have a hard time seeing how they ever could be)! The risk factor approach of FFMA is covering similar ground, and this adds to my scepticism about addressing preservation risks in this manner.

Final thoughts

Taking into account the problems mentioned in this blog post, I have a hard time seeing how scoring models such as the one used by FFMA would help in solving practical digital preservation issues. It also makes me wonder why this idea keeps on being revisited over and over again. Similar to the format registry situation, is this perhaps another manifestation of the "trying to replace imagination and expertise with an encyclopedia" phenomenon? What exactly is the point of classifying or ranking formats according to perceived preservation "risks" if these "risks" are largely based on theoretical considerations, and are so general that they say next to nothing about individual file (format) instances? Isn't this all a bit like searching for Bigfoot? Wouldn't the time and effort involved in these activities be better spent on trying to solve, document and publish concrete format-related problems and their solutions? Some examples can be found here (accessing old PowerPoint 4 files), here (recovering the contents of an old Commodore Amiga hard disk), here (BBC Micro Data Recovery), or even here (problems with contemporary formats).

I think there could also be a valuable role for some of the FFMA-related work in all this: the aggregation component of FFMA looks really useful for the automatic discovery of, for example, software applications that are able to read a specific format, and this could be hugely helpful in solving real-world preservation problems.


  1. johan
    October 9, 2013 @ 8:38 am CEST

    I'm addressing these comments in a separate blog post, see link below:

  2. Roman Graf
    October 5, 2013 @ 9:05 pm CEST

    Thank you for your comments. I agree that a format study seems to be more important and useful for the community at the moment than an expert system extension.

  3. andy jackson
    October 5, 2013 @ 8:26 pm CEST

    I really like the idea of that systematic format study – it would generate so much useful information, and there are plenty of assumptions behind things like the Format Sustainability Factors that should really be put to some kind of test.

    I almost wonder if it is too ambitious, given how much one can say about even a single format!

  4. andy jackson
    October 5, 2013 @ 8:10 pm CEST

    I think stories always come first.  There is no limit to the number of numbers we might imagine measuring, and when we choose things like 'Software Count' or 'Migration Tool Count' there are rafts of assumptions implied by the choice to measure those over spending our time measuring something else. Having said that, I do agree that ideally we should measure broadly and so hope to avoid prematurely biasing our findings – just that the metrics and the stories are complicit.

    However, it's worth noting that all of the registry work I've seen has not been 'numbers first' or 'stories first', and certainly not 'users first', it's all been 'data model first', where a small number of people (usually just one!) disappear for a while and then produce something rather sophisticated, but consequently inflexible, riddled with debatable assumptions and stymied by little or no user testing. I suspect this may be because we don't actually know who will fill these data models as part of their usual day-to-day work.

    This is why I like to think about it in terms of stories first. Let's bring the stories together and help the people that need the stories. But once that's in place, we can find the people that do the work that fills those stories, understand what they need, and build the tools that help them do more while also helping preserve the information they uncover.

    The more I talk about it, the more it sounds like a scientific journal with a very specific scope!

    Put it this way… Imagine if you told a physicist that they could publish their paper, but only if they expressed it in terms of a mountainous ontology created by the editors of the journal, rather than, you know, using text and equations. That's how I feel about ontology/data-model driven design for representation information registries.
