Hero or Villain? A Tool to Create a Digital Preservation Rogues Gallery

Presented here is a tool that will create a 'rogues gallery' out of any digital collection for which you have a DROID report (or, soon, a Siegfried report). The tool was presented at a recent OPF webinar, Preservation in Practice: Archives New Zealand; slides here. It was created by Andrea K. Byrne and me while trying to find new ways of developing the technical skills of the non-digital archivists within the organization.

In this context, a rogues gallery is the isolated set of files posing a sentencing or digital preservation risk, for example, unknown file types contained in a collection of archival value. The inverse, also developed, is a 'heroes gallery': the isolated set of files that are unlikely to pose an immediate risk, that is, files that at the most fundamental level are known to the DROID format identification tool.

At a point in the near future it is hoped the output of other tools can be incorporated into the analysis engine to create rogues galleries, for example, based on the concept of validation or incomplete metadata. 

The tool was originally developed for training purposes but we also recognized a number of other potential uses:

  • It enables users to work on copies of content that requires immediate attention
  • Users can clone the directory structures (context) containing rogue content e.g. to aid in the analysis of duplicate files
  • Archivists may be able to consider the immediate ingest and delivery of a 'clean' collection independent of a rogues collection, to promote immediate access while file format issues are worked on in an isolated treatment environment
  • The footprint on disk of a collection is reduced by enabling users to create working copies of only those files of immediate interest
  • Complexities of collections and issues are reduced, and patterns in a collection can become apparent as different rogues galleries are isolated

The tool, an extension to the DROID Analysis Engine, works in concert with the rsync command found in Unix-based environments. The argument --rogues will output an rsync-compatible file path listing:

$ python droidsqliteanalysis.py --csva opf-test-corpus-droid-analysis.csv --rogues

Using the default values at the time of writing, this will contain:

  • Files with multiple identifications
  • Files with no identification
  • Files identified only by extensions
  • Files with an extension mismatch
  • Zero-byte objects
  • Files with duplicate MD5 values
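
To make the criteria above concrete, here is a minimal Python sketch of how they might be applied to the rows of a DROID CSV export. It is not the DROID Analysis Engine's own code. The column names (TYPE, FILE_PATH, METHOD, FORMAT_COUNT, EXTENSION_MISMATCH, SIZE, MD5_HASH) are assumptions based on a typical DROID export and may differ between DROID versions, and the function names are hypothetical.

```python
import csv
from collections import Counter


def rogue_reasons(row, md5_counts):
    """Return the list of rogue criteria that a single CSV row meets."""
    reasons = []
    # Assumed column: FORMAT_COUNT holds the number of identifications.
    format_count = int(row.get("FORMAT_COUNT") or 0)
    if format_count > 1:
        reasons.append("multiple identifications")
    if format_count == 0:
        reasons.append("no identification")
    # Assumed value: METHOD is "Extension" when only the extension matched.
    if row.get("METHOD") == "Extension":
        reasons.append("identified by extension only")
    if (row.get("EXTENSION_MISMATCH") or "").lower() == "true":
        reasons.append("extension mismatch")
    if row.get("SIZE") == "0":
        reasons.append("zero-byte object")
    md5 = row.get("MD5_HASH")
    if md5 and md5_counts[md5] > 1:
        reasons.append("duplicate MD5")
    return reasons


def find_rogues(rows):
    """Map each rogue file path to the reasons it was flagged."""
    # Folders are listed in the export but are not files to be assessed.
    files = [r for r in rows if r.get("TYPE") != "Folder"]
    md5_counts = Counter(r["MD5_HASH"] for r in files if r.get("MD5_HASH"))
    rogues = {}
    for row in files:
        reasons = rogue_reasons(row, md5_counts)
        if reasons:
            rogues[row["FILE_PATH"]] = reasons
    return rogues


def find_rogues_in_csv(csv_path):
    """Read a DROID CSV export from disk and return its rogues."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return find_rogues(list(csv.DictReader(handle)))
```

Anything not returned by find_rogues would, under the same assumptions, belong to the heroes listing.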

A sample from a rogues listing created from the OPF format corpus:

/home/git/opf-format-corpus/format-corpus/knowledge-management/nova-mind/Curation outline 3.opml.md

/home/git/opf-format-corpus/format-corpus/knowledge-management/nova-mind/Curation outline 3.opml

/home/git/opf-format-corpus/format-corpus/ebooks/iBooks Author 1.1 (190)/lorem-ipsum-plus-image.iba

Once output to a text file, the listing can be used as input to rsync as follows:

rsync -rlptDv --files-from=[SOURCE FILE LISTING] [SOURCE] [DESTINATION]

Windows (via Cygwin):

rsync -rlptDv --files-from=opf-rogues-list.txt "/cygdrive/c/git/opf-format-corpus/" "/cygdrive/c/treatment/opf-rogues-gallery/"

Linux:

sudo rsync -rlptDv --files-from=opf-rogues-list.txt "/home/git/opf-format-corpus/" "/home/treatment/opf-rogues-gallery/"

The flags used here are:

  • -r recurse into directories
  • -l copy symlinks as symlinks
  • -p preserve permissions
  • -t preserve modification times
  • -D preserve device/special files
  • -v increase verbosity

Rsync also has an --archive flag (-a), which additionally attempts to preserve group and ownership. We discovered that without the correct permissions the command attempts to chmod the object in order to achieve this, with the result that the modification dates are no longer preserved. Given the right environment, the --archive flag may prove to be more appropriate for some users.

We can place the rogues gallery anywhere we choose and copy it as many times as we like, simply by modifying the destination parameter.
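
One caveat worth checking before running rsync: it treats each entry in a --files-from listing as relative to the source directory and strips any leading slash. If your listing holds absolute paths, they may need rewriting relative to the source first. Below is a minimal Python sketch of that step; the function names are ours for illustration and are not part of the DROID Analysis Engine, and the sketch assumes POSIX-style paths.

```python
from pathlib import PurePosixPath


def relativize(lines, source_root):
    """Yield each listed path rewritten relative to source_root.

    A path outside source_root raises ValueError rather than being
    silently dropped, so that no rogue goes missing from the gallery.
    """
    root = PurePosixPath(source_root)
    for line in lines:
        path = line.rstrip("\r\n")  # keep any spaces inside file names
        if not path:
            continue  # skip blank lines in the listing
        yield str(PurePosixPath(path).relative_to(root))


def write_relative_listing(listing_in, listing_out, source_root):
    """Rewrite a listing file so it suits rsync --files-from."""
    with open(listing_in, encoding="utf-8") as src, \
            open(listing_out, "w", encoding="utf-8") as dst:
        for relative in relativize(src, source_root):
            dst.write(relative + "\n")
```

For example, write_relative_listing("opf-rogues-list.txt", "opf-rogues-relative.txt", "/home/git/opf-format-corpus/") would produce a listing (the output filename is hypothetical) whose entries begin at format-corpus/ and can be paired with the same source directory in the rsync command.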

If we supply the Python script with an argument of --heroes instead of --rogues we can output the inverse of the rogues gallery.

Andrea is currently utilizing the output of the tool to analyze legacy collections at Archives New Zealand. It is also likely it will be used when we continue work on other born-digital material including analysis pre-ingest and for training.

Future work will improve how configurable the tool is to give users more control over the definition of a ‘rogue’. Unit tests are on the way with the scaffolding for the test harness already developed. Unit tests will cover the full scope of the DROID Analysis Engine but initial tests will be focused around the file counts output by functions that impact rogues and heroes output to ensure that they remain the direct inverse of one another.
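
As an illustration of the 'direct inverse' property, here is a sketch of the kind of check such a test might make. It is not taken from the tool's test harness: the function name, the fixture file names, and the three conditions chosen are our assumptions about what 'inverse' should mean for two listings.

```python
import unittest


def inverse_problems(all_files, rogues, heroes):
    """Describe every way two listings fail to be exact inverses."""
    all_set, rogue_set, hero_set = set(all_files), set(rogues), set(heroes)
    problems = []
    if len(rogues) + len(heroes) != len(all_files):
        problems.append("counts do not sum to the total")
    if rogue_set & hero_set:
        problems.append("files listed as both rogue and hero")
    if (rogue_set | hero_set) != all_set:
        problems.append("listings do not cover exactly the full file set")
    return problems


class InverseListingTest(unittest.TestCase):
    # Hypothetical fixture standing in for a real collection listing.
    ALL_FILES = ["a.opml", "b.iba", "c.pdf"]

    def test_exact_inverse_passes(self):
        problems = inverse_problems(
            self.ALL_FILES, ["a.opml", "b.iba"], ["c.pdf"])
        self.assertEqual(problems, [])

    def test_overlap_is_reported(self):
        problems = inverse_problems(
            self.ALL_FILES, ["a.opml", "b.iba"], ["b.iba", "c.pdf"])
        self.assertIn("files listed as both rogue and hero", problems)


if __name__ == "__main__":
    unittest.main()
```

Checking counts alone would miss a file that moved from one listing to the other in error, which is why the sketch also compares the sets themselves.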

As mentioned above, other work that will also benefit the analysis engine is to see other digital preservation tools' outputs incorporated. I'd like to work with FITS because of the range of mechanisms that it integrates, but outstanding issues with that include upgrading DROID to 6.1.5 to enable FITS to work in a Java 8 environment, and upgrading the XSLT, and perhaps other parts of the codebase, to see FITS support JHOVE 1.11. Update, 26 August: Since publishing this blog I have received word that the next release of FITS will include the latest version of DROID.

Are you a hero or a villain?

If you begin to use this mechanism or begin to make use of a similar technique we are keen to hear more about your approach. If you find bugs, or have any feature requests that we can help with then feel free to record those in my GitHub issues log: https://github.com/exponential-decay/droid-sqlite-analysis/issues. If you've any views on the archival merit of isolating a 'clean' collection for ingest before issues with rogues have been addressed, and then making that isolated set available sooner than perhaps might be possible through more traditional workflows, then your views are appreciated in the comments section below.

The tool can be downloaded from here: https://github.com/exponential-decay/droid-sqlite-analysis 

Update, 28 August: Since publishing this blog I have had the opportunity to chat to Richard Lehane and to test Siegfried further. Two commands in the tool help promote rogues gallery-esque usage:

sf --known <file-directory>

sf --unknown <file-directory>

These commands will output lists of either known or unknown files to stdout, which can then be used directly with rsync as per the instructions above.

Further, the full set of rogues can be output using Siegfried by utilizing the DROID CSV output option:

sf -droid -hash md5 <file-directory>

This will output a CSV file that is fully compatible with the DROID Analysis Engine described above.
