Conducting some research into chaining digital preservation tools together with a Linux shell script, I once again found it difficult to source a set of files that I could use as a stake in the ground, so that my work could be replicated by others wishing to confirm results and find future optimisations. Read: scientific method.
The Open Planets Foundation (OPF) format corpus represents a great set of files. Arguably it isn’t as complete as it could be, but it is a set of files already used for testing digital preservation tools, that I can attach a count to, and that others can easily access. It also gives me some level of complexity to test against: there is structure to the collection in terms of the number and depth of folders, decent format coverage, and enough files that the tools in question won’t simply process them at light-speed, allowing us to collect useful, comparative timing values.
But the corpus in its current form on GitHub is difficult to use without some additional effort. Like a garden needs watering, it sometimes needs a little weeding too, and I think that’s the case with the format corpus. Ideally, if we can pull the weeds out of it, it can become completely standalone, and useful to anyone who comes along needing it for any of a number of ever-so-slightly different purposes.
The purposes of having a format corpus with a broad range of files, with known characteristics and attributes in many different combinations, have been expounded before. Other reasons include:
- Monitoring the consistency of the behaviour of tools and their output.
- Monitoring improvements in capability and performance.
- Understanding the behaviour of tools against files with more obscure characteristics and attributes.
- Enabling other tools to be developed and measured alongside existing tools, using the same baseline for testing.
- And, in this instance, measuring (and demonstrating) the performance of the tools within the digital preservation toolkit, and enabling those same tests to be run by other users and organisations (see the sketch below).
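As a concrete illustration of that last point, a minimal timing experiment over the corpus might look something like the following, with ‘file’ standing in for whichever characterisation tool is actually under test (the tool, and the path to the cloned corpus, are assumptions of mine):

    # Time a characterisation pass over every file in the corpus;
    # 'file' is a stand-in for the tool actually being measured.
    time find format-corpus/ -type f -exec file {} + > /dev/null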
It is possible to do this with the corpus in its current form, but the corpus contains two types of file. First there are the files meant specifically for testing a tool, the functional files, e.g. exemplars of JP2, PDF, and ODF.
Compare these with the non-functional files: those that contain metadata, or even the results of other experiments run on the corpus files with other tools. To explain further:
A process one might follow at present to experiment on the corpus is to download the repository from GitHub. This will give you a folder that looks like this:
    .gitattributes
    .gitignore
    .opf.yml
    .project
    .pydevproject
    metadata-template.ext.md
    README.md
    <DIR> desktop-publishing
    <DIR> ebooks
    <DIR> file-archive
    <DIR> filesys-trials
    <DIR> govdocs1-error-pdfs
    <DIR> jp2k-formats
    <DIR> jp2k-test
    <DIR> knowledge-management
    <DIR> office
    <DIR> office-examples
    <DIR> pcraster
    <DIR> pdfCabinetOfHorrors
    <DIR> statistica
    <DIR> tiff-examples
    <DIR> tools
    <DIR> variations
    <DIR> video
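For reference, getting to this point is a single clone; I’m assuming here that the repository still lives under the Open Planets Foundation’s GitHub account:

    git clone https://github.com/openplanets/format-corpus.git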
The files that we are interested in testing against exist inside the lower-level directories, e.g. ‘office-examples/’.
One directory at this top level, sitting amongst the others, that does not constitute part of the corpus is ‘tools’. This folder does what it says on the tin: it contains tools for working with the format corpus, not corpus files.
Then there are files like README.md, metadata-template.ext.md, and .gitignore; while these can be 'handled' by any of the tools we might test, usefully or not, they’re not part of the corpus of files that we want to be measuring against.
This presents us with an issue in counting the number of functionally useful files in the corpus so that we can present back useful, repeatable results. As we move deeper into the structure of the repository we find other *.md files, and other metadata objects in inconsistent formats, e.g. comma-separated values files, distributed inconsistently across the different folders where contributors may or may not have created them.
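To give a flavour of the filtering this forces on anyone who wants a count, here is a sketch run from the root of the cloned repository; the exclusion patterns are illustrative and almost certainly incomplete, which is rather the point:

    # Count candidate test files, manually excluding known metadata
    # and utility objects; the patterns are illustrative, not exhaustive.
    find . -type f \
        ! -path "./.git/*" \
        ! -path "./tools/*" \
        ! -name "*.md" \
        ! -name ".git*" \
        ! -name ".opf.yml" \
        ! -name ".project" \
        ! -name ".pydevproject" \
        | wc -l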
The two sets of objects need to be separated so that the testing objects can simply be picked up and used by others. The descriptive and utility objects should not clutter that collection and should sit, modularly, somewhere else.
Following these principles I remixed the corpus as-is to create the following structure at the top level:
    diagram.png
    README.md
    <DIR> format-corpus
    <DIR> tools
README.md now contains all of the descriptive information previously found alongside the files in ‘format-corpus/’, including metadata about the objects and licensing information where it has been provided by the content creator. diagram.png is an image embedded in README.md.
A further issue inside the corpus is the presence of tool output, such as Jpylyzer output and Jhove output.
As examples of XML objects these files would prove useful; beyond that, they clutter the corpus repository somewhat and serve little purpose. They do fulfil some of the metadata requirements suggested by Andy Jackson of the British Library, i.e. they include some or all of the following fields:
    formatName:
    formatVersion:
    extensions:
    mimeType:
    mimeTypeAliases:
    pronomId:
    xmlNameSpace:
    creatorTool:
    creatorToolUrl:
    formatSpecUrl:
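For illustration only, a filled-in record for one of the JP2 exemplars might look something like this; the values below are mine, not taken from the corpus, and should be verified against PRONOM before being relied upon:

    formatName: JP2 (JPEG 2000 part 1)
    formatVersion: 1.0
    extensions: jp2
    mimeType: image/jp2
    mimeTypeAliases:
    pronomId: x-fmt/392
    xmlNameSpace:
    creatorTool:
    creatorToolUrl:
    formatSpecUrl: http://www.jpeg.org/jpeg2000/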
Even so, the tool output does not constitute part of the collection that we’re interested in testing against.
This output is variable over time, but the corpus files are constants. Any specific version of a tool generating output from these files is a constant too (although the platform might not be). Provided these utilities are kept under proper source control, we don’t need to store any of their output: we simply run older or newer versions of them, depending on perspective, to recreate these ‘metadata’ objects.
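For example, rather than storing Jpylyzer output in the repository, anyone can recreate it on demand with whichever tool version interests them. A minimal sketch, assuming jpylyzer is on the PATH, that the JP2 exemplars carry a .jp2 extension, and writing the results outside the corpus itself:

    # Recreate the Jpylyzer 'metadata' objects on demand rather
    # than storing them alongside the corpus files.
    mkdir -p results
    for f in format-corpus/jp2k-test/*.jp2; do
        jpylyzer "$f" > "results/$(basename "$f").xml"
    done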
Should this output be viewed as the result of ‘testing’, and should those results be important to keep with the format-corpus, then perhaps we can create a ‘testing-results’ folder at the top level of the repository.
Tool output has been removed entirely.
The corpus now sits in a standalone space. When it is downloaded via Git, the user receives a folder containing the test files and the directory structure wrapping those files, and none of the extraneous non-functional, descriptive data. The ‘format-corpus/’ directory can simply be passed to any tool being tested, free from the additional pollution.
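The practical difference is that a reproducible baseline count, for example, now drops straight out of the directory with no filtering at all:

    # With the remix, counting the corpus needs no exclusion patterns:
    find format-corpus/ -type f | wc -l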
I am not sure this is a perfect model yet. README.md contains the file index simply because GitHub renders it on the front page of a repository folder by default. We could still extract all of this information into individual files to be managed that way. Work could also be done to generate files matching the metadata schema suggested by Andy Jackson (a task for a hackathon, maybe?). Not as cool as coding, but maintaining such a useful resource is equally important.
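As a starting point for that task, here is a sketch that drops an empty record alongside every corpus file, following the field list above; the ‘.metadata.md’ naming is my own assumption rather than an agreed convention:

    # Generate a skeleton metadata record next to each corpus file.
    find format-corpus/ -type f ! -name "*.metadata.md" -print0 |
        while IFS= read -r -d '' f; do
            printf '%s:\n' formatName formatVersion extensions \
                mimeType mimeTypeAliases pronomId xmlNameSpace \
                creatorTool creatorToolUrl formatSpecUrl \
                > "${f}.metadata.md"
        done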
Regardless, I hope that in my remix I’ve demonstrated some principles that other contributors will be happy to follow, or perhaps some ideas that can open up a discussion about alternative ways to do this.
To conclude, I’ve branched the current corpus here: https://github.com/ross-spencer/opf-format-corpus/tree/opf-format-corpus-20-feb-2014
I need this branch so that any write-up can reference a set of files fixed to a point in time. The forked, full corpus is here: https://github.com/ross-spencer/opf-format-corpus/
Overall, I’d be extremely happy if I’ve managed to keep this work in a state that enables it to be forked and placed somewhere more useful to the community, and I’m happy to see it move back into the Open Planets Foundation GitHub where the community can continue to work on it.
Happy gardening!
—
Notes:
Structure: Because it is difficult to rework structure en masse using Git, we do need to think quite carefully up front about our work. Unfortunately this remix required a full extract, the creation of a new repository, and a re-upload. Ideally I would have worked on a fork of the current collection to the same end.
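That said, for anyone attempting the fork-based route, individual moves are expressible with git mv, which keeps the history attached to the moved paths; a sketch, using one directory from the corpus as an example:

    # Restructure within a fork rather than rebuilding the repository.
    mkdir format-corpus
    git mv office-examples format-corpus/office-examples
    git commit -m "Move office-examples under format-corpus/"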