During my time at The National Archives UK, a colleague, Adam Retter, developed a methodology for the reversible pre-conditioning of complex binary objects. The technique was needed to avoid doubling the storage required for malformed JPEG2000 objects numbering in the hundreds of thousands. In this instance the difference between a malformed JPEG2000 file and a corrected, well-formed one was a handful of bytes, yet the objects themselves were many megabytes in size. The cost of storage means that doubling it in such a scenario is not desirable in today’s fiscal environment, especially if it can be avoided.
As we approach ingest of our first born-digital transfers at Archives New Zealand, we also have to think about such issues. We’re also concerned about the documentation of any comparable changes to binary objects, as well as any more complicated changes to objects in any future transfers.
The reason for making changes to a file pre-ingest (in our process terminology, pre-conditioning) is to ensure that well-formed, valid objects are ingested into the long-term digital preservation system. By using processes to ensure changes are:
- Reversible
- Documented
- Approved
we can counter any issues identified as digital preservation risks in the system’s custom rules up front, ensuring we don’t have to perform any preservation actions in the short to medium term. Such issues may be raised through format identification, validation, or characterisation tools; they can be trivial or complex, and the objects that contain them may themselves be trivial or complex.
At present, if pre-conditioning is approved, it results in a change being made to the digital object, plus written documentation of the change, associated with the file in its metadata and in the organisation’s content management system outside of the digital preservation system.
As example documentation for a change, we can look at a provenance note I might write to describe a change to a plain-text file. The reason for the change is that the digital preservation system expects the object to be encoded as UTF-8; a conversion gives us stronger confidence about what this file is in future. Such a change, converting the object from ASCII to UTF-8, can be completed either as a pre-conditioning action pre-ingest, or as a preservation migration post-ingest.
Provenance Note
“Programmers Notepad 2.2.2300-rc used to convert plain-text file to UTF-8. UTF-8 byte-order-mark (0xEFBBBF) added to beginning of file – file size +3 bytes. Em-dash (0x97 ANSI) at position d1256 replaced by UTF-8 representation 0xE28094 at position d1256+3 bytes (d1259-d1261) – file size +2 bytes.”
Such a small change is deceptively complex to document. Without the presence of a character sitting outside of the ASCII range we might have simply been able to write, “UTF-8 byte-order-mark added to beginning of file.” – but with its presence we have to provide a description complete enough to ensure that the change can be observed, and reversed by anyone accessing the file in future.
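The byte-level arithmetic in the note can be checked with a few lines of Python. This is a sketch only, using a made-up string in place of the real file, and assuming a Windows-1252 (“ANSI”) source containing the 0x97 em-dash byte:

```python
# Sketch of the change described in the provenance note, assuming a
# Windows-1252 ("ANSI") source containing an em-dash byte (0x97).
raw = b"before \x97 after"                     # hypothetical ANSI content
text = raw.decode("cp1252")                    # 0x97 decodes to U+2014
utf8 = b"\xef\xbb\xbf" + text.encode("utf-8")  # prepend the UTF-8 BOM

assert b"\xe2\x80\x94" in utf8    # em-dash is now the 3-byte 0xE28094
assert len(utf8) == len(raw) + 5  # +3 for the BOM, +2 for the em-dash
```

The two assertions mirror the note exactly: the file grows by three bytes for the byte-order-mark and two for the widened em-dash.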
Pre-conditioning vs. Preservation Migration
As pre-conditioning is a form of preservation action that happens outside of the digital preservation system, we haven’t adequate tools to complete the action and document it for us, especially for complex objects; we’re relying on good hand-written documentation being provided on ingest. The temptation, therefore, is to let the digital preservation system handle this, using its inbuilt capability to record and document all additions to a digital object’s record, including the generation of additional representations. The biggest reason not to rely on this is the cost of storage, and how that cost increases with the likelihood of so many objects requiring this sort of treatment over time.
Proposed Solution
It is important to note that the proposed solution can be implemented either pre- or post-ingest, removing the emphasis from where in the digital preservation process it occurs; however, incorporating it post-ingest requires changes to the digital preservation system, while doing it pre-ingest enables it to be done manually, with immediate challenges addressed. Consistent usage and proven advantages over time might see it included in a digital preservation mechanism at a higher level.
The proposed solution is to use a patch file, specifically a binary diff, which stores instructions about how to convert one bitstream to another. We can create a patch file by using a tool that compares an original bitstream to a corrected (pre-conditioned) version of it and stores the result of the comparison. Patch files can add and remove information as required, so we can apply the instructions to the corrected version of any file to re-produce the un-corrected original.
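To make the idea concrete, here is a deliberately naive Python sketch of what a reverse patch conceptually contains. The real BSDIFF format is far more sophisticated and compact (and compresses its output), so this is illustration only:

```python
# Toy illustration of the patch-file idea (NOT the real BSDIFF format):
# the patch stores instructions for rebuilding the malformed original
# from the corrected (pre-conditioned) bitstream.
def make_patch(corrected: bytes, original: bytes) -> dict:
    n = min(len(corrected), len(original))
    return {
        # positions where the streams differ, with the original byte values
        "subs": [(i, original[i]) for i in range(n) if corrected[i] != original[i]],
        "tail": original[n:],  # bytes of the original beyond the shared length
        "len": len(original),  # total length of the original bitstream
    }

def apply_patch(corrected: bytes, patch: dict) -> bytes:
    out = bytearray(corrected[: patch["len"]])  # trim if corrected is longer
    out.extend(patch["tail"])                   # re-append removed bytes
    for i, value in patch["subs"]:              # restore changed bytes
        out[i] = value
    return bytes(out)

corrected = b"The quick brown fox jumped over the lazy dog.\n"
original = b"The quick brown fox jumped over the lazy hen.\n"
assert apply_patch(corrected, make_patch(corrected, original)) == original
```

The key property, shared with BSDIFF/BSPATCH, is the round trip: applying the patch to the corrected bitstream reproduces the original exactly.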
The tool we adopted at The National Archives, UK was called BSDIFF. It is distributed with the popular operating system FreeBSD, but is also available under Linux and Windows.
The tool was created by Colin Percival and comes as two utilities: BSDIFF itself, which creates a binary diff, and BSPATCH, which applies it. The manual instructions are straightforward, but the important part of the solution in a digital preservation context is to flip the terms <oldfile> and <newfile>. So, for example, the manual’s:
$ bsdiff <oldfile> <newfile> <patchfile>
Can become:
$ bsdiff <newfile> <oldfile> <patchfile>
Further, in the descriptions below, I will replace <newfile> and <oldfile> with <pre-conditioned-file> and <malformed-file> respectively, e.g.:
$ bsdiff <pre-conditioned-file> <malformed-file> <patchfile>
BSDIFF
BSDIFF generates a patch <patchfile> between two binary files. It compares <pre-conditioned-file> to <malformed-file> and writes a <patchfile> suitable for use by BSPATCH.
BSPATCH
BSPATCH applies a patch built with BSDIFF. It generates <malformed-file> using <pre-conditioned-file> and the <patchfile> from BSDIFF.
Examples
For my examples I have been using the Windows port of BSDIFF referenced from Colin Percival’s site.
To begin with, a non-archival example simply re-producing a binary object:
If I have the plain text file, hen.txt:
The quick brown fox jumped over the lazy hen.
I might want to correct the text to its more well-known pangram form – dog.txt:
The quick brown fox jumped over the lazy dog.
I create dog.txt and using the following command I create hen-reverse.diff:
$ bsdiff dog.txt hen.txt hen-reverse.diff
We now have two objects to look after: dog.txt and hen-reverse.diff.
If we ever need to look at the original again we can use the BSPATCH utility:
$ bspatch dog.txt hen-original.txt hen-reverse.diff
We end up with a file that matches the original byte for byte, which can be confirmed by comparing the two checksums:
$ md5sum hen.txt
84588fd6795a7e593d0c7454320cf516 *hen.txt
$ md5sum hen-original.txt
84588fd6795a7e593d0c7454320cf516 *hen-original.txt
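The same verification can be scripted. A minimal Python equivalent of the md5sum check might look like this (the file names are those from the example, assumed to exist):

```python
# Minimal Python equivalent of the md5sum check: two files match byte for
# byte exactly when their MD5 digests are equal.
import hashlib

def md5_of(path: str) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        # read in chunks so large archival objects don't exhaust memory
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# e.g. md5_of("hen.txt") == md5_of("hen-original.txt")
```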
Used as an illustration, we can re-create the original binary object, but we’re not saving any storage space at this point as the patch file is bigger than the <malformed-file> and <pre-conditioned-file> together:
- hen.txt – 46 bytes
- dog.txt – 46 bytes
- hen-reverse.diff – 159 bytes
The savings we can make by using binary diff objects to store pre-conditioning instructions, however, begin to show when we ramp up the complexity and size of the objects we’re working with. Still working with text, we can convert the following plain-text object to UTF-8, complementing the pre-conditioning action we might perform on archival material as described in the introduction to this blog entry:
- Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas quam lacus, tincidunt sit amet lobortis eget, auctor non nibh. Sed fermentum tempor luctus. Phasellus cursus, risus nec eleifend sagittis, odio tellus pretium dui, ut tincidunt ligula lorem et odio. Ut tincidunt, nunc ut volutpat aliquam, quam diam varius elit, non luctus nulla velit eu mauris. Curabitur consequat mauris sit amet lacus dignissim bibendum eget dignissim mauris. Nunc eget ullamcorper felis, non scelerisque metus. Fusce dapibus eros malesuada, porta arcu ut, pretium tellus. Pellentesque diam mauris, mollis quis semper sit amet, congue at dolor. Curabitur condimentum, ligula egestas mollis euismod, dolor velit tempus nisl, ut vulputate velit ligula sed neque. Donec posuere dolor id tempus sodales. Donec lobortis elit et mi varius rutrum. Vestibulum egestas vehicula massa id facilisis.
Converting the passage to UTF-8 doesn’t require the conversion of any characters within the text itself, just the addition of the UTF-8 byte-order-mark at the beginning of the file. Using Programmers Notepad we can open lorem-ascii.txt and re-save it with a different encoding as lorem-utf-8.txt. As with dog.txt and hen.txt, we can then create the patch and apply it to reproduce the original, using the following commands:
$ bsdiff lorem-utf-8.txt lorem-ascii.txt lorem-reverse.diff
$ bspatch lorem-utf-8.txt lorem-ascii-original.txt lorem-reverse.diff
Again, confirmation that bspatch outputs a file matching the original can be seen by looking at their respective MD5 values:
$ md5sum lorem-ascii.txt
ec6cf995d7462e20f314aaaa15eef8f9 *lorem-ascii.txt
$ md5sum lorem-ascii-original.txt
ec6cf995d7462e20f314aaaa15eef8f9 *lorem-ascii-original.txt
The file sizes here are much more illuminating:
- lorem-ascii.txt – 874 bytes
- lorem-utf-8.txt – 877 bytes
- lorem-reverse.diff – 141 bytes
Just one more thing… Complexity!
We can also demonstrate the complexity of the modifications we can make to digital objects that BSDIFF affords us. Attached to this blog is a zip file containing supporting files, lorem-ole2-doc.doc and lorem-xml-docx.docx.
The files are used to demonstrate a migration exercise from an older Microsoft OLE2 format to the up-to-date OOXML format.
I’ve also included the patch file lorem-word-reverse.diff.
Using the commands as documented above:
$ bsdiff lorem-xml-docx.docx lorem-ole2-doc.doc lorem-word-reverse.diff
$ bspatch lorem-xml-docx.docx lorem-word-original.doc lorem-word-reverse.diff
We can observe that applying the diff file to the ‘pre-conditioned’ object results in a file identical to the original OLE2 object:
$ md5sum lorem-ole2-doc.doc
3bb94e23892f645696fafc04cdbeefb5 *lorem-ole2-doc.doc
$ md5sum lorem-word-original.doc
3bb94e23892f645696fafc04cdbeefb5 *lorem-word-original.doc
The file-sizes involved in this example are as follows:
- lorem-ole2-doc.doc – 65,536 bytes
- lorem-xml-docx.docx – 42,690 bytes
- lorem-word-reverse.diff – 16,384 bytes
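Putting the figures from both examples side by side makes the saving over keeping two full copies explicit:

```python
# Storage comparison using the byte counts reported above: storing the
# corrected file plus a reverse patch vs storing both full copies.
lorem_both = 874 + 877       # lorem-ascii.txt + lorem-utf-8.txt
lorem_patch = 877 + 141      # lorem-utf-8.txt + lorem-reverse.diff
word_both = 65_536 + 42_690  # .doc + .docx full copies
word_patch = 42_690 + 16_384 # .docx + reverse diff

assert lorem_patch < lorem_both  # 1,018 vs 1,751 bytes
assert word_patch < word_both    # 59,074 vs 108,226 bytes
```

For the Word documents the patch route needs roughly half the storage of keeping both copies, and the gap only widens as objects grow relative to the size of their corrections.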
The neat part of this as a solution, beyond the fact that the most complex of modifications are reversible, is that the provenance note remains the same for all transformations between all digital objects. The tools and techniques are documented instead, and the rest stays consistent; it is perhaps even more accessible to users, who can understand this documentation more readily than the complex narrative, byte-by-byte breakdowns that might otherwise be necessary.
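One way to picture that consistency is a single note template, parameterised only by the details that vary. The function below is a hypothetical sketch of my own, not something our systems implement:

```python
# Hypothetical provenance-note template: with the patch-file approach the
# wording is identical for every object; only tool, patch and checksum vary.
def provenance_note(tool: str, version: str, patch_file: str, original_md5: str) -> str:
    return (
        f"Reverse patch '{patch_file}' created with {tool} {version}. "
        f"Applying it to the pre-conditioned file with bspatch reproduces "
        f"the original object (MD5 {original_md5})."
    )

note = provenance_note("bsdiff", "4.3", "hen-reverse.diff",
                       "84588fd6795a7e593d0c7454320cf516")
```

Contrast the output with the hand-written UTF-8 note earlier: no byte offsets, no per-character accounting, and nothing that changes from object to object except names and checksums.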
Conclusions
Given the right problem, and the insight of an individual who was, at the time, outside the digital preservation sphere, Adam has shown us an innovative solution that helps us to demonstrate provenance in a technologically and scientifically sound manner, more accurately and more efficiently than we might otherwise manage using current approaches. The solution:
- Enables more complex pre-conditioning actions on more complex objects
- Prevents us from doubling storage space
- Encapsulates pre-conditioning instructions more efficiently and more accurately – there are fewer chances to make errors
While it is unclear whether Archives New Zealand will be able to incorporate this technique into its workflows at present, the solution will be presented alongside our other options so that it can be discussed and taken into consideration by the organisation as appropriate.
Work does need to be done to incorporate it properly, e.g. respecting original file-naming conventions, and some consideration should be given to where and when in the transfer / digital preservation process the method should be applied. However, it should prove an attractive and useful option for many archives performing pre-conditioning or preservation actions on future digital objects, trivial and complex alike.
—
Footnotes
Documentation for the BSDIFF file format is attached to this blog; it was created by Colin Percival and is BSD-licensed.
adamretter
July 14, 2014 @ 5:16 pm CEST
Nice article Ross 🙂
If you have not seen it, then maybe of interest for you is the UTF-8 Validator https://github.com/digital-preservation/utf8-validator.
Also I am intrigued to know your need for BOM in UTF-8 files, as I understood it was optional until you step up to UTF-16?