A standardised mechanism for writing and retaining technical provenance metadata
Technical provenance metadata is metadata that details technical changes that have been made to a file during its life. Note that this does not detail changes made across versions of a file, but to the file itself.
Within the National Library of New Zealand systems environment, we have no single natural home for technical provenance information about our digital objects. Currently this kind of data lives with our collection as a record of technical interactions and interventions in our corporate document management system.
Looking across our organisation we run the Rosetta preservation repository, an instance of ALMA, an instance of EMu, and our own in-house metadata broker service called DNZ. As well as these enterprise-sized systems we have a number of digital systems, workspaces and tools that conceptually fall into our digital preservation environment, with varying degrees of intersection.
Digital collections are processed by both our archive (Alexander Turnbull Library) and legal deposit teams. In processing collections, both teams have a shared need to accurately record our processing treatments to ensure the integrity of our records.
When capturing records of our treatments we can use the catalogue platforms to record the contextual descriptions of digital items. We record some technical provenance that wraps around the collections (e.g. originating hardware or media) as well as the collection descriptions. We record PREMIS structured records via the preservation system, but these records start from the time the preservation system first encountered a file as part of an ingest process.
A key part of our processing stage is pre-conditioning treatments that are required to ensure a smooth ingest process. It is the metadata created at this stage that requires a formal place in our processing workflow.
Its place is not to supersede any digital preservation metadata, but to augment it. It covers a processing step that is often missing (in our experience) from the formalised preservation metadata we typically see.
One of the reasons we might find that data missing, it might be contended, is that the traditional view of standards like PREMIS is predicated on their being applicable once the object is ingested into a digital preservation system. This is perhaps changing given recent updates to controlled vocabularies, but documentation for the most part still refers to metadata for “digital preservation systems”.
We process before that point, and we hope that the sentiment of the PREMIS editorial community is one that agrees that we would all benefit from a standardised data object to record our interventions, particularly when they happen before ingest into a system that normally prohibits modification of collection items.
When a file is processed prior to ingest into a formal “repository” or “system” we often find ourselves needing to take some stabilising actions to help the given file have an easy path through our system and its inherent constraints. These actions might include changing a file extension to ease access to a file when we find a “mislabelled” file extension. It might be subtle shifts to some structural components within a file to ensure conformance to a standard or successful validation, for example where the date-time separator used in the metadata is not permitted by the standard. It might be the removal of malicious, erroneous or superfluous data included in the bitstream of what we think of as our file object. It might be sanitising a filename to ensure safe passage through cascading applications that may struggle with trivial text encodings.
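As a rough illustration of the last of these actions, the sketch below sanitises a filename while keeping hold of the original name so the change can be recorded. The function name and the set of permitted characters are assumptions made for this example; they are not our actual processing policy.

```python
import re

# Assumed "safe" character set for illustration only: ASCII letters,
# digits, dot, underscore and hyphen. A real policy may differ.
UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9._-]")


def sanitise_filename(original: str) -> tuple[str, str]:
    """Return (original, sanitised) so both names can be written to a record.

    Every character outside the assumed safe set is replaced with an
    underscore. The original name is returned untouched so the change
    stays reversible.
    """
    sanitised = UNSAFE_CHARS.sub("_", original)
    return original, sanitised


if __name__ == "__main__":
    before, after = sanitise_filename("r\u0101poi k\u014drero.txt")
    print(before, "->", after)
```

The point of returning both values is that the sanitised name alone is not enough to recover the original: the replacement is lossy, so the original has to be stored alongside it.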
We find ourselves making these changes based on some well-considered policy, but struggling to manage the record of these actions as a part of the auditable history of the file object.
We would also like to be able to revert those changes, allowing the future user to choose between the perfect original and the version we have needed to work on as our present-day systems and processes dictate.
We don’t consider these types of interventions to merit full preservation actions (that is, making a new version of the file). These are small but precise adjustments that seek to ease the file object towards its resting state in the repository and into future use. Hence, we modify the file.
These changes are designed to be reversible, and so to efficiently reverse them we require a data structure that allows our actions to be written and read by machines and humans alike.
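To make the idea concrete, here is a minimal sketch of what a reversible, machine- and human-readable record of one intervention could look like. It is not the proposed data model: the class name, field names and the "filename_change" label are all hypothetical, chosen only to show the shape of the problem.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


# Hypothetical record shape for illustration; NOT the proposed data model.
@dataclass
class ProvenanceEvent:
    event_type: str  # assumed label, e.g. "filename_change"
    before: str      # value before the intervention
    after: str       # value after the intervention
    agent: str       # who or what made the change
    reason: str      # human-readable justification
    timestamp: str   # ISO 8601 timestamp, UTC

    def to_json(self) -> str:
        """Serialise to JSON that a person can read and a machine can parse."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "ProvenanceEvent":
        """Rebuild the record from its JSON form."""
        return cls(**json.loads(text))


def record_rename(before: str, after: str, agent: str, reason: str) -> ProvenanceEvent:
    """Create a record describing a filename change."""
    stamp = datetime.now(timezone.utc).isoformat()
    return ProvenanceEvent("filename_change", before, after, agent, reason, stamp)


def revert_filename(current_name: str, event: ProvenanceEvent) -> str:
    """Return the original filename, refusing if the record does not match."""
    if event.event_type != "filename_change":
        raise ValueError("record does not describe a filename change")
    if current_name != event.after:
        raise ValueError("current name does not match the recorded change")
    return event.before


if __name__ == "__main__":
    event = record_rename("report.dcx", "report.docx",
                          "processing-team", "mislabelled file extension")
    print(event.to_json())
    print(revert_filename("report.docx", event))
```

The record stores both the before and after values, so reversal needs no extra information, and the check in `revert_filename` guards against applying a record to the wrong file.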
We know from experience that these types of efforts work best when they are well standardised. We could implement our own version of the proposed data model, but really we want everyone to use the same thing, and for the data model to be managed with all the other vital preservation metadata we depend on.
The next stage in our journey in finding consensus is the sharing of the proposal and seeking comments on the applicability, usability and (ideally) desirability of the proposed data model. Part of that is posting this blog, and following it up in due course with an OPF webinar to discuss our proposed data model in more detail with interested persons.
Once we have gathered feedback and made adjustments/improvements, we would then hope to make a new proposal to the PREMIS Editorial Committee to consider the inclusion of a community supported technical provenance data object.
By Jay Gattuso, posted in Jay Gattuso's Blog