A standardised mechanism for writing and retaining technical provenance metadata

Technical provenance metadata is metadata that details the technical changes that have been made to a file during its life. Note that this does not detail changes made across versions of a file, but changes to the file itself.

Within the National Library of New Zealand systems environment, we have no single natural home for technical provenance information about our digital objects. Currently this kind of data lives in our corporate document management system, as a record of technical interactions with, and interventions in, our collections.

Looking across our organisation we run the Rosetta preservation repository, an instance of ALMA, an instance of EMu, and our own in-house metadata broker service called DNZ. As well as these enterprise-sized systems we have a number of digital systems, tools and workspaces that conceptually fall within our digital preservation environment to varying degrees.

Digital collections are processed by both our archive (Alexander Turnbull Library) and legal deposit teams. In processing collections, both teams have a shared need to accurately record our processing treatments to ensure the integrity of our records.

When capturing records of our treatments we can use the catalogue platforms to record the contextual descriptions of digital items. We record some technical provenance that wraps around the collections (e.g. originating hardware/media, etc.) as well as the collection descriptions. We record PREMIS-structured records via the preservation system, but these records only begin from the point at which the preservation system first encounters a file during ingest.

A key part of our processing stage is pre-conditioning treatments that are required to ensure a smooth ingest process. It is the metadata created at this stage that requires a formal place in our processing workflow.

Its place is not to supersede any digital preservation metadata, but to augment it. It covers a processing step that is often missing (in our experience) from the formalised preservation metadata we typically see.

One of the reasons we might find that data missing is, it might be contended, that the traditional view of standards like PREMIS is predicated on them being applicable once the object is ingested into a digital preservation system. This is perhaps changing given recent updates to controlled vocabularies, but the documentation for the most part still refers to metadata for “digital preservation systems”.

We process before that point, and we hope the sentiment of the PREMIS editorial community is one that agrees we would all benefit from a standardised data object to record our interventions, particularly when they happen before ingest into a system that normally prohibits modification of collection items.

When a file is processed prior to ingest into a formal “repository” or “system” we often find ourselves needing to take some stabilising actions to help the given file have an easy path through our system and its inherent constraints. These actions might include changing a “mislabelled” file extension to ease access to the file. It might be subtle shifts to some structural components within a file to ensure conformance to a standard or successful validation, for example where the date-time separator in the metadata is not one permitted by the standard. It might be the removal of malicious, erroneous or superfluous data included in the bitstream of what we think of as our file object. It might be the sanitising of a filename to ensure safe passage through cascading applications that may struggle with trivial text encodings.
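
To give a flavour of what we mean, the following is a minimal sketch only (in Python, with a hypothetical file and a single hypothetical signature rule): correcting a mislabelled extension while keeping just enough detail to reverse the change later. It is illustrative and not part of the proposal itself.

```python
# Illustrative sketch only: correct a mislabelled file extension pre-ingest,
# keeping enough detail that the change can be reversed later.
# The file name and the single "signature" rule here are hypothetical.
from pathlib import Path


def fix_mislabelled_extension(path: Path) -> dict:
    """Rename a file whose signature says PDF but whose extension does not."""
    with path.open("rb") as f:
        magic = f.read(5)
    if magic == b"%PDF-" and path.suffix.lower() != ".pdf":
        new_path = path.with_suffix(".pdf")
        path.rename(new_path)
        # Return a small, reversible record of what was done and why.
        return {
            "action": "extension corrected",
            "original_name": path.name,
            "new_name": new_path.name,
            "reason": "file signature identifies PDF; extension said otherwise",
        }
    return {}
```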

We find ourselves making these changes based on some well-considered policy, but struggling to manage the record of these actions as part of the auditable history of the file object.
We would also like to be able to revert those changes, allowing the future user to decide between the perfect original and the version we have needed to work on, as our present-day systems and processes dictate.

We don’t consider these types of interventions to merit full preservation actions (that is, making a new version of the file). These are small but precise adjustments that seek to ease the file object towards its resting state in the repository and into future use. Hence, we modify the file.
These changes are designed to be reversible, and so to efficiently reverse them we require a data structure that allows our actions to be written and read by machines and humans alike.
We know from experience that these types of efforts work best when they are well standardised. We could implement our own version of the proposed data model, but really we want everyone to use the same thing, and for the data model to be managed with all the other vital preservation metadata we depend on.
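
To make that idea concrete, here is one purely hypothetical shape such a record could take. The field names are ours for illustration only and do not represent the proposed data model.

```python
# Purely illustrative: one possible shape for a human- and machine-readable
# record of a reversible pre-ingest intervention. Field names are hypothetical
# and do not represent the proposed data model.
import json
from datetime import datetime, timezone

event = {
    "eventType": "filename extension correction",   # hypothetical type
    "eventDateTime": datetime.now(timezone.utc).isoformat(),
    "agent": "pre-conditioning workflow",            # hypothetical agent name
    "object": {"before": "report.doc", "after": "report.pdf"},
    "reversible": True,
    "justification": "file signature identifies PDF; extension said otherwise",
}

# The serialised form can travel alongside the object and be read back by
# tools that need to reverse the change.
print(json.dumps(event, indent=2))
```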

The next stage in our journey towards consensus is sharing the proposal and seeking comments on the applicability, usability and (ideally) desirability of the proposed data model. Part of that is posting this blog, and following it up in due course with an OPF webinar to discuss our proposed data model in more detail with interested persons.

Once we have gathered feedback and made adjustments/improvements, we would then hope to make a new proposal to the PREMIS Editorial Committee to consider the inclusion of a community supported technical provenance data object.


4 Comments

  1. lindlar
    September 11, 2018 @ 8:47 am CEST

    sorry for the horrible formatting … that didn’t quite turn out as intended

  2. lindlar
    September 11, 2018 @ 8:42 am CEST

    Great post, thanks Jay! We have a few examples of workflows where we would love to capture more technical provenance metadata.
    In addition to the examples given in the post, such as changing the filename / label, I’d like to give a few use cases we currently have and am keen to hear whether you think those fall within the scope of what you have in mind:

    Removing junk data before header / after trailer in PDF files
    Some PDF writers may add extra (junk) data before the header / after the trailer of PDF files. The characters typically carry no function and, more importantly, violate the PDF specification. In some of those cases we are evaluating the possibility of removing the junk data pre-ingest and documenting the change.
    A comparable example from a different file format group would be replacing non-standard conforming date/time-stamps in TIFF files pre-ingest.
    USB / Optical disc imaging processes
    During imaging processes we sometimes encounter nasty data carriers, where different imaging software / methods provide different results and what we know to be the “correct content” can only be assembled via a combination of imaging runs using different drives / different software. It would be helpful if a description of how an image / a file vacuum was performed could be stored alongside the digital object(s).
    AV scanning – technical analysis of (analog) masters
    I’m not sure if this issue would fit here – but as we are desperately looking for a home for this information it’s worth a try 😉
    We are currently undertaking an AV digitization project. As part of this, we semi-automatically capture information about the analogue film state, e.g. regarding shrinkage, warping, vinegar syndrome, etc. Shrinkage ratio, for example, is captured in a machine-readable way by the scanner during the scanning process. Unfortunately there is currently neither a standard schema nor a standard process for this kind of provenance metadata. As the information is helpful in understanding the genesis (and, also, state) of the digital object we would like to store this within the archive alongside the digitization results. I am currently leaning towards sidecar metadata – however, more out of necessity, as I couldn’t find another place to capture it. Would much prefer a technical metadata container.

    Looking forward to hearing your thoughts regarding coverage of these issues in the model you guys have in mind. Happy to contribute to the discussion in any way I can!

  3. alexgreen
    September 7, 2018 @ 3:41 pm CEST

    Very interested to hear about this as we’ve been thinking the same. Keen to get involved in the proposed data model.

  4. Peter May
    September 7, 2018 @ 3:21 pm CEST

    Interesting post Jay, looking forward to hearing more about this proposed data model…
