Corpora

We believe in improving the quality of open source digital preservation tools through good software development practices. A key practice is public testing of software using continuous integration services. For this to be effective, shareable test corpora that represent the real-world issues facing the community are needed.

This page lists corpora that we’ve used in software testing and hack events, and we are looking to improve it. We’re happy to receive suggestions for additions to the list; here are a few pointers to consider.

Size

There are no rules regarding the size or number of files in a corpus, but very large test collections do bring some problems, as they’re:

  • difficult to use on virtual build services, e.g. Travis, as they take too long to download;
  • awkward for unit testing, as tests take too long to run over a large corpus; and
  • time-consuming to copy and distribute.

We won’t discount a suggestion due to size alone: the Govdocs corpus is certainly large, yet it’s also very useful for testing format identification tools.

Scope

Corpora that focus on representing a single problem or a small set of related problems are preferred, as they’re easier to use. Restricting scope helps keep the overall size of the corpora manageable, side-stepping the issues with large collections described above. Smaller corpora can be combined to build larger test sets.
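As a rough illustration of combining smaller corpora into one larger test set, here is a minimal Python sketch. It assumes each corpus has already been downloaded to its own local directory. The function name `iter_corpus_files` and the directory names in the usage note are hypothetical and not part of any tool listed on this page.

```python
from pathlib import Path
from typing import Iterable, Iterator


def iter_corpus_files(corpus_roots: Iterable[Path]) -> Iterator[Path]:
    """Yield every regular file under each corpus root (hypothetical helper)."""
    for root in corpus_roots:
        # Sorting keeps the order stable, so test runs are reproducible.
        for path in sorted(Path(root).rglob("*")):
            if path.is_file():
                yield path
```

A test suite could then loop over, say, `iter_corpus_files([Path("corpus_a"), Path("corpus_b")])` and run the tool under test on each file, while each small corpus remains usable on its own.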

Corpora listing

The Govdocs corpus is a large collection of approximately 1 million documents that are freely available for research, provided by the Digital Corpora site. Each file is presented as a numbered file with a tentative file extension (e.g. 0000001.jpg). The corpus is particularly useful for testing cross-format tools such as format identification software. The […]

While the Govdocs corpus is useful, it’s also very large. This means long transfer and test execution times, depending upon your bandwidth and compute power. David Tarrant used some informed analysis to remove “repeat” files from the corpus and reduce it to a more manageable 21,000 files from the original 1,000,000. You can read a […]

An openly licensed corpus of small example files, covering a wide range of formats and creation tools. All test files are CC0 licensed unless otherwise stated. A recent summary of the contents of the repository can be found here. The corpus can be downloaded as a git repository from this GitHub project on the OPFF’s GitHub […]