In the last two days, two very different approaches to the digital preservation "file format problem" have been announced and discussed. Jason Scott of the Archive Team announced "SOLVE THE FILE FORMAT PROBLEM MONTH" in his own inimitable and robust manner, and the folks at CDL announced the unveiling of the Unified Digital Format Registry (albeit with slightly more formal language). So which of these rather different initiatives will answer our digital preservation needs?
If we're not careful, neither of them. To be frank, either solution will require *you* to get up and get involved if there's to be any prospect of success. And regardless of the rest of what I have to say in this blog post, getting involved is something you should do! So that's my contribution to the call to arms. But hang on, just what are our preservation needs? Surely someone has worked this out before coming up with the approaches…?
I looked first at the new UDFR system and the first blog post I've seen about it, from Gary McGath. Gary seems to have instantly hit upon the crux of the problem I have with UDFR. With a somewhat impenetrable interface, how are we going to get a critical mass of contributions? The kind of critical mass that would be needed (for example) to make the JPEG page he describes more complete/trusted than the wealth of information already available on Wikipedia describing JPEG? Even then, would this actually help anyone? And how do we know when we have sufficient information describing JPEG? Write a new JPEG renderer to test if we know enough? Why bother when there are lots of JPEG tools out there (not to mention source code for them)?
I read the final report on UDFR, and I've looked back over the various documents and web pages collated on the project site. I can see functional requirements and use cases at the level of things like "Edit an entry", "Export data" etc. But I see nothing describing what concrete digital preservation problems this system will help with solving. I hope I've just missed it. But I'm really worried that it doesn't exist.
Chris Rusbridge just got a bit closer to articulating some problems/aims in a blog post related to the Archive Team's appeal to crowd source the "formats problem". It's interesting to see that Chris's list is mainly about tools that do handy things to certain formats. This seems helpful and a bit more practical, but even though Chris titles the list "…what is the file format problem that we need to solve?", most of the entries still sound more like solutions than problems! We as a community really are bad at articulating our challenges and requirements, and just can't wait to dive into the solution. My worry of course is that we then create an amazing technical solution to a problem we don't have.
*If* our challenges are actually related to knowing about tools rather than formats, then a formal registry with a complex and rigid structure containing "facts" is probably not what we need. We want to know what experiences people have had in applying specific tools to actual data. What works, what doesn't, and so on.
This leads me on to Archive Team's "SOLVE THE FILE FORMAT PROBLEM MONTH". Jason wasn't exactly as clear about what the problem is as I would have liked. But if I understood him correctly, it goes something like this: he's got a bunch of stuff he can't understand/render (as do other people) and wants to make it usable now and in the future. This feels instantly more concrete. Collect actual data, identify and describe specific challenges with that data, then crowd source the information needed to solve the challenges in an easy-to-edit wiki. There is still a definite focus on the format, but it seems a bit more practical. Jason specifically mentions a couple of things I liked: "links to programs that deal with the formats", although as I said, I think it's the experiences people have with those tools that are important to capture. We already have lots of lists of tools. And: "known variations or problems with the format". These format "gotchas" (often subtle but critical) are what we really need to record and share information about (as my colleagues at the KB have been saying for some time). I suspect this would be far more valuable than endless format specification type detail.
Clearly there are other considerations to be made, and I imagine that many will be unhappy with the somewhat ad hoc approach of Archive Team. I think, however, that successfully engaging the contributors we need, and making sure we're targeting a clearly identified problem (and a problem that we actually have), are going to make or break these initiatives. I'll be following both of them closely and contributing where I can. But I have to say, if the bet is about impact, my money is on the Archive Team.
paul
July 5, 2012 @ 12:50 pm CEST
Chris Rusbridge channels Douglas Adams and object-oriented computing in a follow-up to my blog rambling:
http://unsustainableideas.wordpress.com/2012/07/04/the-solution-is-42-what-was-the-problem/