While the Govdocs corpus is useful, it is also very large. This means long transfer and test execution times, depending on your bandwidth and compute power. David Tarrant used some informed analysis to remove “repeat” files from the corpus, reducing it from the original 1,000,000 files to a more manageable 21,000. You can read a blog post detailing how the corpus was reduced.
- Download the Govdocs Selected dataset as a gzipped tar file.
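
Once the archive is downloaded, it needs unpacking before the files can be used for testing. The following is a minimal sketch of one way to do that with Python's standard `tarfile` module. The function name `extract_corpus` and any paths passed to it are hypothetical, chosen for illustration; they are not part of the dataset or its tooling.

```python
import os
import tarfile


def extract_corpus(archive_path, dest_dir):
    """Extract the regular files from a gzipped tar archive.

    Returns the number of files extracted. Both arguments are
    illustrative: pass whatever local paths you actually use.
    """
    root = os.path.realpath(dest_dir)
    os.makedirs(root, exist_ok=True)
    count = 0
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # Skip directories, links and device nodes; only regular
            # files are of interest in a test corpus.
            if not member.isfile():
                continue
            # Refuse members whose path would land outside dest_dir.
            target = os.path.realpath(os.path.join(root, member.name))
            if not target.startswith(root + os.sep):
                raise ValueError(f"unsafe path in archive: {member.name}")
            tar.extract(member, root)
            count += 1
    return count
```

For example, `extract_corpus("govdocs_selected.tgz", "corpus")` would unpack an archive saved under that (assumed) file name into a `corpus` directory and return the file count, which gives a quick check that the download is complete.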