On Monday I was asked to speak at an experts’ workshop aimed at steering developments in preservation services on the Reponet+ Project (part of the JISC Innovation Zone). For my presentation (“Pain Points for preservation workflows/services in repositories”) I had a look at the growing collection of practitioner-sourced preservation challenges we’ve been accumulating on the OPF wiki. I’ve attempted to pull these preservation issues together into key themes that might be useful in steering the development of future digital preservation tools and services.
The 120+ preservation issues are sourced from around 80 digital preservation practitioners, most of whom are based in museums, libraries and archives and have some responsibility for the stewardship of digital collections. We captured most of the issues during our SPRUCE Mashup events, but a good number also came from the EU-funded SCAPE Project, which is also taking a practitioner-focused approach to its development work. As a consequence they provide quite a good impression of the real issues being faced on the front line of digital preservation. A word of caution is, however, required. This is not a scientific study of user needs, and the results are shaped somewhat by the mashup events at which they were captured (not to mention the structure of SCAPE). In particular, the first two events had a stronger focus on QA (being part of the AQuA Project). Despite this, the free hand given to practitioners to bring along datasets of their choice and raise the preservation issues that concerned them provides at least some confidence in broad conclusions drawn from this information.
I identified 5 key themes from the wiki:
Theme 1: Quality Assurance (of broken or potentially broken data)
A large number of the Issues related to quality assurance of data or metadata. In some cases there were examples of broken data to investigate (some of them feature in the Atlas of Digital Damages). In other cases practitioners were concerned that some of their data might be broken in some way, but had no comprehensive, automated way of verifying this. A final group were interested in processing their data in some way, but were worried that without QA this would be a risky step. All of these Issues required some form of cross-checking, characterisation or validation of the data to confirm its quality (a minimal sketch of one such check follows the category list below). QA also appeared as a side issue in many Issues that focused on other challenges.
See Issue pages categorised as: Quality assurance, Bit rot, and Integrity.
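By way of illustration, here’s a minimal sketch (in Python, my choice for the example rather than anything used at the events) of the kind of automated cross-check many of these Issues called for: verifying a collection against a stored manifest of checksums. The two-column “digest path” manifest format is an assumption made purely for the example.

```python
import hashlib
import sys
from pathlib import Path

def sha256_of(path, chunk_size=1024 * 1024):
    """Stream the file in chunks so large objects don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest_path):
    """Yield (path, status) for each entry in a 'digest  path' manifest."""
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        target = Path(name)
        if not target.exists():
            yield name, "MISSING"
        elif sha256_of(target) != expected:
            yield name, "CORRUPT"
        else:
            yield name, "OK"

if __name__ == "__main__":
    for name, status in verify_manifest(sys.argv[1]):
        print(f"{status:8} {name}")
```

Trivial as it is, a report of OK/MISSING/CORRUPT against a manifest is exactly the kind of comprehensive, automated answer that many practitioners said they lacked.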
Theme 2: Appraisal and Assessment
Many practitioners brought along challenges associated with the initial appraisal or assessment of unfamiliar, and often newly acquired, digital collections that they now had responsibility for. These challenges typically involved characterisation of technical and informational characteristics in order to inform preservation planning and the next steps in preserving the content (a rough sketch of a first-pass format profile follows the category list below). A related, but slightly different, set of challenges involved assessing or validating content against some kind of policy-driven profile.
See Issue pages categorised as: Appraisal and assessment, Conformance, Unknown characteristics, and Unknown file formats.
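Again purely as an illustration, the sketch below shows the crude first pass that this kind of characterisation tends to start with: walking a collection and tallying file formats. It shells out to the ubiquitous Unix `file` tool for brevity; in practice a dedicated characterisation tool such as DROID or FITS would give far richer and more reliable output.

```python
import subprocess
import sys
from collections import Counter
from pathlib import Path

def identify(path):
    """Ask the Unix `file` tool for a MIME type; crude but widely available."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", str(path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def profile(collection_root):
    """Tally MIME types across a newly acquired collection."""
    counts = Counter(
        identify(p) for p in Path(collection_root).rglob("*") if p.is_file()
    )
    for mime, n in counts.most_common():
        print(f"{n:6}  {mime}")

if __name__ == "__main__":
    profile(sys.argv[1])
```

Even a rough tally like this is often enough to decide where the preservation planning effort should go first.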
Theme 3: Identify/Locate Preservation-Worthy Data (typically on shared server space)
Although not identified specifically in the body of preservation issues here, this theme appeared anecdotally to be very common at the larger institutions (particularly Universities). Many practitioners who had Theme 2 type Issues also appeared to have Theme 3 Issues. The issue centres on preservation-worthy data that has been temporarily placed on shared server space of some kind, before being abandoned or simply left with no available route to a more suitable location such as a digital repository. As a consequence the data may be backed up, but is not checksummed, may have no responsible owner, and is effectively left in an unmanaged state. The solution might involve automated processes to sniff out the data that is preservation worthy (which is likely to be mixed in with all sorts of other data), make it safe with checksumming and a more organised preservation plan, and, when time and resources allow, get it into shape ready for ingest to a repository (a sketch of such a sweep follows the note below). Data residing on the somewhat ubiquitous “box of disks under the desk”, more recently recast as the “memory stick in the drawer”, appears (anecdotally) to remain an issue at some institutions. I suspect this theme will be of growing concern for those taking up responsibility for managing research data at HEIs.
(not specifically identified in the Issues wiki, but allied closely to practitioners’ Issues categorised under Appraisal and assessment and related categories)
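To make the idea concrete, here’s a hedged sketch of such a sweep. The candidate file extensions and the “untouched for a year” heuristic are invented for the example and would need tuning to any real shared drive; the output is the same two-column checksum manifest that the Theme 1 sketch verifies.

```python
import hashlib
import sys
import time
from pathlib import Path

# Illustrative heuristics only: formats we guess are preservation worthy,
# and an age threshold suggesting the data has been left unmanaged.
CANDIDATE_SUFFIXES = {".tif", ".tiff", ".wav", ".pdf", ".xml", ".csv"}
ABANDONED_AFTER_DAYS = 365

def sha256_of(path):
    """Stream the file in chunks to cope with large objects."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def sweep(shared_root, manifest="manifest.sha256"):
    """Find candidate files and record their fixity, so the data is at
    least minimally managed until it can be ingested properly."""
    cutoff = time.time() - ABANDONED_AFTER_DAYS * 86400
    with open(manifest, "w") as out:
        for p in Path(shared_root).rglob("*"):
            if (p.is_file()
                    and p.suffix.lower() in CANDIDATE_SUFFIXES
                    and p.stat().st_mtime < cutoff):
                out.write(f"{sha256_of(p)}  {p}\n")

if __name__ == "__main__":
    sweep(sys.argv[1])
```

A sweep like this doesn’t get the data into a repository, but it does move it from “unmanaged” to “known about and fixity-checked”, which is the step these Issues were mostly missing.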
Theme 4: Identify Preservation Risks
Related in many respects to the broad category of Appraisal and assessment, many of the Issues had a slightly more direct focus on identifying the preservation risks faced by the digital data. In some cases this was a clear examination of the risks and their implications for the data. Others jumped straight into the subsequent solution, considering particular migration or emulation actions (although these tended to be in the minority).
See Issue pages categorised as: Obsolescence, Preservation risk, and Business constraint.
Theme 5: Long tail of many other issues
As in many preservation situations, there is a long tail of more varied Issues, taking in everything from Contextual and Data capture issues through to Embedded objects, and broader issues around Value and cost.
It’s interesting to note that the overriding focus of the majority of these Issues is the need to characterise digital data and better understand what it is and what its condition is, and only then perhaps begin to think about what to do with it and how. This is an interesting observation, as much of the technology development seen in this field appears to be focused on some of the slightly more sexy preservation topics (I did say slightly!) such as migration, emulation and preservation planning. Whilst I have little doubt that these efforts are worthwhile, what seems clear is the need to understand our digital data better before we start monkeying around with it!
Updated to correct numbers of issues and practitioners 25/10/2012