Corpora

We believe in improving the quality of open-source digital preservation tools through good software development practices. A key practice is public testing of software using continuous integration services. For this to be effective, shareable test corpora that represent the real-world issues facing the community are needed.

This page lists corpora that we’ve used in software testing and hack events, and we are looking to improve it. We’re happy to receive suggestions for additions to the list; here are a few pointers to consider.

Size

There are no rules regarding the size or number of files in a corpus, but very large test collections do bring some problems, as they’re:

  • difficult to use on virtual build services, e.g. Travis, as they take too long to download;
  • awkward for unit testing, as tests take too long to run over a large corpus; and
  • time-consuming to copy and distribute.

We won’t discount a suggestion on size alone. The Govdocs corpus is certainly large, yet it’s also very useful for testing format identification tools.
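As a rough illustration of the size considerations above, the following minimal sketch reports how many files a corpus holds and how many bytes they occupy, which can help judge whether it suits a build service or unit test run. The function name `corpus_summary` is hypothetical, and the sketch assumes the corpus is simply a directory tree on local disk.

```python
from pathlib import Path


def corpus_summary(root):
    """Return (file_count, total_bytes) for all regular files under root.

    Hypothetical helper: assumes the corpus is a local directory tree.
    """
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), total_bytes
```

Calling `corpus_summary("path/to/corpus")` before adding a corpus to a test suite gives a quick sense of the download and run-time cost it would bring.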

Scope

Corpora that focus on representing a single problem or a small set of related problems are preferred, as they’re easier to use. Restricting scope helps keep the overall size of the corpora manageable, side-stepping the issues with large collections described above. Smaller corpora can be combined to build larger test sets.
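One way to picture combining smaller corpora into a larger test set is sketched below. It is only an illustration under stated assumptions: each corpus is taken to be a separate local directory, and the function name `combine_corpora` is hypothetical rather than part of any existing tool.

```python
from pathlib import Path


def combine_corpora(roots):
    """Yield (corpus_name, file_path) pairs across several corpus directories.

    Hypothetical helper: each entry in roots is assumed to be a directory
    holding one small, single-purpose corpus.
    """
    for root in roots:
        root = Path(root)
        # Sort so that the combined test set is listed in a stable order.
        for path in sorted(root.rglob("*")):
            if path.is_file():
                yield root.name, path
```

Keeping the corpus name alongside each file means a failing test can still be traced back to the small corpus, and therefore the specific problem, that the file came from.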

Corpora listing
