For anyone dealing with a relatively small number of records, compared to, say, an internet archive or data archive, a reasonable process for ingesting material into your digital preservation system might be:
- 1. Process files with a file format identification tool
- 2. Per 1, process files with a file format validation tool
- 3. Per 1, process files with a property extraction tool
- 4. Ingest the file into permanent managed storage along with the results of processes 1, 2, and 3
- The Assumption: It will just work
The devil is in the exception workflow for each of these processes:
- What happens if we don’t receive a format ID?
- What happens if we:
- a) find the files are not valid?
- b) find the validation/extraction tools fail?
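To make that concrete, a minimal sketch of the process and its exception paths might look like the following. The helper functions here are placeholders for whichever tools you actually run (DROID or Siegfried for identification, JHOVE or similar for validation and extraction), not a real implementation:

```python
import logging
from pathlib import Path

log = logging.getLogger("ingest")

def identify_format(path: Path) -> str | None:
    # Placeholder: wrap your identification tool here and return a PUID,
    # or None if no identification is received.
    return None

def validate_file(path: Path, puid: str | None) -> bool:
    # Placeholder: wrap your validation tool; raise RuntimeError on tool failure.
    raise RuntimeError("no validator configured")

def extract_properties(path: Path, puid: str | None) -> dict:
    # Placeholder: wrap your property-extraction tool.
    return {}

def ingest(path: Path) -> dict:
    """Sketch of steps 1-4 above, with an explicit exception path for each."""
    results: dict = {"file": str(path)}

    # 1. Format identification: what happens if we don't receive a format ID?
    results["puid"] = identify_format(path)
    if results["puid"] is None:
        log.warning("%s: no format identification", path)

    # 2. Validation: the file may be invalid, or the tool itself may fail.
    try:
        results["valid"] = validate_file(path, results["puid"])
    except RuntimeError as err:
        results["valid"] = None
        log.warning("%s: validation tool failed: %s", path, err)

    # 3. Property extraction, with the same tool-failure path.
    try:
        results["properties"] = extract_properties(path, results["puid"])
    except RuntimeError as err:
        results["properties"] = {}
        log.warning("%s: extraction tool failed: %s", path, err)

    # 4. The object goes into managed storage either way; unresolved
    # exceptions are logged alongside it to be revisited later.
    return results
```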
In working to understand exceptions and resolve each of them, we can find ourselves on a hiding to nothing depending on the number of affected files.
So what is the point?
The first rule of digital preservation is identify everything, or is it?
I still believe in this and a great deal of my work remains in understanding what is coming into the archive. I will research unknown formats, create new file format signatures, and submit that research to the DROID mailing list. Part of my work is improving existing signature definitions, often to improve the granularity of identification. As well as the mailing list, there is a contact form for more direct contact with the PRONOM team:
- Mailing list: https://groups.google.com/forum/#!forum/droid-list
- Contact form: http://www.nationalarchives.gov.uk/contact/contactform.asp?id=13
The concept of format identification may be more closely related to records management though – if I don’t know what a file is, did the donor/depositor? And if not, is the file/collection really of archival value?
If it’s identifiable when it is received do I need to do as much work prior to ingest?
And then, if we have to take the accession in the first place, how many options do we have in terms of turning it back? Perhaps there are more use cases out there where we just have to ingest what we get.
What do we do then? I’ll try to answer each of the questions I pose in the conclusions below.
File formats need to be well-formed and valid?
I’m not sure where the community currently stands on this, but I’m sure it will remain the foundation for robust discussion – ‘What is validity?’!
My chosen definition is that validity is about checking for conformance against a given specification. A validation tool should be an actionable implementation of a file format specification. We are measuring anything in a file that its creating application has produced, intentionally or unintentionally, outside of that specification, that is, out-of-band.
The surmise is that readers of a file format (the software) will have been created alongside a specification, and that correcting out-of-band features will lead to a greater number of readers per format being able to handle it.
Validity is nicely defined in the XML standard and it’s closely related to the concept of being well-formed. We might question how easily the same concept can be applied to more complex binary formats. In practice the range of issues we can encounter is infinite. Some formats could be invalid if certain markers are missing, for example, the ‘start of image’ or ‘end of image’ markers (SOI, EOI) in JPEG. Some formats can be invalid for making use of features that aren’t permitted in the official documentation, for example, if someone used JavaScript in a file purporting to be PDF/A. And some formats are considered formats in their own right precisely because they can be validated more rigorously against a particular standard, as is the case with PDF/A.
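As a crude illustration of what checking against a specification can mean at the byte level, the sketch below tests only those two JPEG markers. It is nowhere near a real validator, which would parse every marker segment, but it shows the kind of structural claim a specification lets us test:

```python
from pathlib import Path

SOI = b"\xff\xd8"  # JPEG 'start of image' marker
EOI = b"\xff\xd9"  # JPEG 'end of image' marker

def has_jpeg_markers(path: Path) -> bool:
    """Crude structural check: the stream should begin with SOI and end with EOI.
    Files with trailing bytes after EOI will be flagged even though many readers
    cope with them, which is exactly the 'what is validity?' question above."""
    data = path.read_bytes()
    return data.startswith(SOI) and data.endswith(EOI)
```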
In most cases, validation is closely related to property extraction.
We extract properties so that file formats can be well understood to promote preservation?
It sounds like a reasonable expectation to me. Digital preservation data is not big data but it is multi-dimensional data, and it can be a monster to wrangle.
A Microsoft Word file is not just a Microsoft Word file – it can be encrypted, it can be denoted specifically as a ‘template’ file, it can contain dynamic fields, it can contain track changes, comments, and many different versions – and more!
Similarly, a TIFF file (Tagged Image File Format) is not just a TIFF file. I don’t even need to get deep into the feature set for the hairs on the back of my neck to stand up: the sixth version of the TIFF specification, from 1992, allows a file to be compressed using any one of several different compression methods. If I have to write a reader for the TIFF format one day, I have to think about how I implement the decompression engine for each of those methods. Fortunately, that is just one worst-case scenario when working with legacy file formats.
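To show what that means for a would-be reader: the compression method is recorded in the Compression tag (tag 259) of a TIFF’s image file directory, so even working out which decompression engines a collection would need means walking the header and IFD structure. A minimal sketch, reading only the first IFD and ignoring BigTIFF:

```python
import struct
from pathlib import Path

# A few of the Compression tag values defined in TIFF 6.0 (not exhaustive).
COMPRESSION = {1: "uncompressed", 2: "CCITT Group 3 1D", 3: "Group 3 fax",
               4: "Group 4 fax", 5: "LZW", 32773: "PackBits"}

def tiff_compression(path: Path) -> str:
    """Read the Compression tag (259) from the first IFD of a TIFF file.
    A sketch only: it ignores BigTIFF, later IFDs, and malformed offsets."""
    data = path.read_bytes()
    byte_order = data[:2]
    if byte_order not in (b"II", b"MM"):
        raise ValueError("not a TIFF: unknown byte order")
    endian = "<" if byte_order == b"II" else ">"
    magic, ifd_offset = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a TIFF: bad magic number")
    (n_entries,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(n_entries):
        entry = ifd_offset + 2 + i * 12
        tag, _ftype, _count = struct.unpack_from(endian + "HHI", data, entry)
        if tag == 259:  # Compression: a single SHORT stored inline in the entry
            (value,) = struct.unpack_from(endian + "H", data, entry + 8)
            return COMPRESSION.get(value, f"other ({value})")
    return "no Compression tag (defaults to uncompressed)"
```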
At Archives New Zealand, if an appropriate extraction tool has been configured, extraction is completed on ingest by the Rosetta digital preservation system; however, it is possible for extraction to happen anywhere in the digital preservation workflow. The data is always embedded in the file, ready to be extracted now or later: a user downloading the file from our online catalogue could use their own tools to extract more information as they wish.
The reason it belongs as part of this conversation is that an extractor is not expected to validate a file format – but what if it encounters an error and halts during the extraction process?
An example might be an incorrect data offset or data length specification. A buffer overflow is a good example of the kind of failure that occurs when a program expects a fixed length of data but finds something else: https://en.wikipedia.org/wiki/Buffer_overflow.
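As a toy illustration (the field layout here is hypothetical, not any particular format), an extractor reading a length-prefixed field has to check the declared length against the data it actually holds; if it doesn’t, it either reads past the end of its buffer or raises an exception and halts:

```python
import struct

def read_length_prefixed(data: bytes, offset: int) -> bytes:
    """Read a field stored as a 4-byte big-endian length followed by that many
    bytes of payload. The layout is hypothetical, for illustration only."""
    if offset + 4 > len(data):
        raise ValueError("truncated: no room for the length field")
    (length,) = struct.unpack_from(">I", data, offset)
    end = offset + 4 + length
    if end > len(data):
        # The declared length runs past the end of the data we actually have.
        raise ValueError(f"declared length {length} exceeds available data")
    return data[offset + 4:end]
```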
The root of the error could be the way the format was output; it could be data corruption; it could be caused by a format using an out-of-band feature. It might just be a bug in the extractor itself…
If it isn’t the extractor and it’s not a corruption of the data stream, isn’t it then that the file is in some way invalid?
The answer is complicated. The current trend is both yes and no. Yes because one tool processing the file cannot read it and extract the data we know is there from the specification. No because we can say within certain tolerances that the file is valid according to a specification adopted by the developer who wrote the creating application that output the file in the first place.
So what do we do?
In any of the three instances above, my position is that it depends on your motivation for using these tools in the first place.
For me, the benefit of format identification, validation, and property extraction is two-fold:
- 1. We can provide better metadata to our users (including ourselves!) to search on and work with
- 2. We can route our outputs more precisely, whatever our approach to preservation
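As a sketch of the second point, routing can be as simple as a lookup from format identifier to a preservation pathway. The PUID and pathway names below are placeholders for illustration, not recommendations:

```python
# Hypothetical routing table: PRONOM PUID -> preservation pathway.
PATHWAYS = {
    "fmt/43": "ingest-as-is",  # e.g. JPEG (JFIF 1.01); PUIDs shown are examples only
    None: "manual-review",     # unidentified formats go to a person
}

def route(puid: str | None) -> str:
    """Pick a pathway for an object based on its format identification."""
    return PATHWAYS.get(puid, "default-pathway")
```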
But we have a problem: the number of potential issues that can arise far exceeds the time and people we have available to manage exceptions and resolve them.
If an object fails anywhere in the simplified process outlined above I have to think seriously about whether I can fix it now, or how I will consider fixing it later.
If you are part of a small team then it’s going to be nearly impossible to resolve issues immediately. I’m in a team of three dedicated to the repository and digital preservation issues, with other colleagues providing support outside that too. So I am fortunate to be in a larger team than many colleagues out there in the field, but I still have trouble managing these resource constraints.
To manage it as well as possible we have adopted something approaching the following method, which I hope will prove valuable for others to see.
- If an exception occurs, weigh up the pros and cons of fixing the issue now.
- Rationale: Unless there is an option where the file won’t be ingested, the file is going into the repository regardless, and if it’s not fixed the problem is not going anywhere. It will be kept in stasis alongside the file to be visited later.
- If the problem is with the file, can it be resolved with minimally invasive effort?
- Rationale: We must maintain the integrity and authenticity of the record by being openly transparent about any work completed on a file. The work should be documented before ingest, visible to users, and should be reversible. Otherwise, you’d be better off ingesting the file as your preservation master and working on a second preservation master at a later date.
- If an exception occurs then log it in two places: 1) as a provenance note attached to the record in your digital preservation system; 2) as a separate log of technical exceptions (this might simply be a spreadsheet; see the sketch after this list), also recording the unique ID of the intellectual entity for ease of finding it in future.
- Rationale: Continuity of knowledge is paramount. Years could pass between continual processing of collections and being able to re-visit issues found on ingest.
- Log as much about the technical exception as you know and is feasible to document within any restrictions.
- Rationale: If we can’t pinpoint whether the issue is with the file or with the validating or extracting tool, then this is invaluable information for working it out at a later date. It may also be useful for understanding differences when moving between versions of software in your preservation system.
- Determine if the issue is genuinely with the file, or with the software.
- Rationale: Our approach from this point on will be radically different depending on the answer.
- If the issue is with the file have a look at how many other files are affected.
- Rationale: You might be able to pinpoint the issue down to a handful of bytes or the issue may be more systemic. As the preservation master is always available in your system, a fix may range from forensic byte-level fixes to re-saving the file in the same format, or migration. The appetite for any particular fix must be discussed widely with your organization until you can develop enough institutional understanding. Eventually policy can be wrapped around this institutional knowledge. Note, I’ve tried to focus on forensic, byte-level, easily documented fixes in practice thus far.
- Is there an appetite to fix issues at the file level?
- Rationale: The potential to emulate an operating environment could be a big enough reason to reduce this appetite entirely. If you know that a piece of software you can run in an emulator output file (A), and that it does so consistently, then ‘fixing’ the file won’t necessarily make a difference in that environment. Without appropriate testing it is reasonable to assume that ‘fixing’ it could ‘break’ the file in the originating environment.
- If the issue is with the software, determine the level of skill you have to be able to debug the software.
- Rationale: Contributing to a project by debugging an issue could see it better understood by the developers; it could also be fixed quicker. Being able to debug software shouldn’t be a barrier to contribution though. As a subject matter expert or researcher looking at a format you can contribute the reason as to why the tool’s output is incorrect. If your knowledge doesn’t extend that far you can still say to the developers – “look, something is wrong with this output, can you help me to understand it, and help me understand if it is a problem with the tool or the file”. A good developer will always be receptive to new input and new test-cases.
- Is the software open source?
- Rationale: It might be more obvious how to contribute as many open source projects will have publicly visible bug trackers. If the software is closed source, either free or commercial, you still have avenues that you can go through. Contact your vendor or look for the support pages on the developer’s website. If it is still unclear, find whatever email address you can and describe the issue in its entirety and it might find its way to the right people. Unfortunately sometimes this is all you can do. Record it in your technical log regardless.
- Tell the community!
- Rationale: Whatever you’re facing tell the digital preservation community. On Twitter or mailing lists, at conferences, and so forth. The more you contribute and the more you share, the more we can tackle together.
- What about file format signatures?
- Follow the same approach as you would for the other tools. If you can do the research, even better; if not, then sharing information about non-identified or misidentified files on mailing lists, and providing samples if possible, can help others to solve your problem, and it will often benefit those users too. From experience, it can sometimes be easier to contribute to PRONOM after an ingest has completed, as files are often no longer restricted; this goes for improving any of the tools that we’re using. The process may, however, need to be set in motion sooner rather than later, as it can take a bit of time to get everything together.
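To make the separate technical exception log mentioned above concrete, it need not be anything more sophisticated than rows appended to a CSV file; the column names in this sketch are an assumed layout, not a standard:

```python
import csv
from datetime import date
from pathlib import Path

LOG = Path("technical-exceptions.csv")
FIELDS = ["ie_id", "file", "tool", "tool_version", "exception", "logged"]

def log_exception(ie_id: str, file: str, tool: str,
                  tool_version: str, exception: str) -> None:
    """Append one technical exception to a shared CSV log (example layout only)."""
    new_log = not LOG.exists()
    with LOG.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if new_log:
            writer.writeheader()
        writer.writerow({"ie_id": ie_id, "file": file, "tool": tool,
                         "tool_version": tool_version, "exception": exception,
                         "logged": date.today().isoformat()})
```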
Conclusion
So what about motivation? – I guess that is the main point I wanted to address. If we have tools in our digital preservation system doing the work of format identification, validation, and extraction, they’re there for a reason.
There might be a period of grace where we decide whether or not the function of a tool is fit for purpose. For example, the notion of XML validation is great in a digital repository IF you can guarantee that your donor/depositor will provide a valid schema or DTD along with their XML. If not, then the ‘validation’ process will fail every time.
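With a schema supplied, the check itself is straightforward; without one, there is nothing to validate against and the step cannot succeed. A minimal sketch using the third-party lxml library:

```python
from lxml import etree  # third-party: pip install lxml

def validate_against_schema(xml_path: str, xsd_path: str) -> bool:
    """Validate an XML document against the XSD the donor/depositor supplied."""
    schema = etree.XMLSchema(etree.parse(xsd_path))
    document = etree.parse(xml_path)
    if schema.validate(document):
        return True
    for error in schema.error_log:
        print(f"line {error.line}: {error.message}")
    return False
```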
Once we’re up and running I believe that it’s not enough to just take the output of a tool as an FYI. We have to entertain the tool’s potential to enable us to maintain our digital legacy, and I hope that some of the approaches proposed above can help us to continue to do that.
Please comment below, let me know your thoughts, and let me know what else we can be doing to continue to improve the state of our discipline.
Credit for the idea to write this short personal commentary goes to colleagues in the field, Andrew Berger, Euan Cochrane, and David Underdown for their continued willingness to openly discuss and debate these issues on forums like Twitter.