Govdocs
The Govdocs corpus is a collection of approximately 1 million documents that are freely available for research, provided by the Digital Corpora site. Each file has a numeric name and a tentative file extension (e.g. 0000001.jpg). The corpus is particularly useful for testing cross-format tools such as format identification software.
The corpus is available as a set of 1,000 zip files, each containing about 1,000 test files.
Resources
- The corpus as a set of zip files can be downloaded here.
- MD5 hash values for the zip files.
- SHA1 hash values for the zip files.
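
The published hash values can be used to check that a downloaded zip file arrived intact. The following is a minimal sketch using Python's standard `hashlib` module. The function names `file_digest` and `verify`, and the file name in the usage note, are hypothetical and chosen for illustration; the layout of the published hash lists is not described here, so the expected digest is simply passed in as a hex string.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`.

    The file is read in chunks so that large zip files do not
    have to fit in memory. `algorithm` can be "md5" or "sha1".
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify(path, expected_hex, algorithm="md5"):
    """Return True if the file's digest matches the expected hex string."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For example, `verify("000.zip", expected_md5)` would return `True` only if the local file matches the listed MD5 value, where `000.zip` stands in for whichever zip file was downloaded and `expected_md5` is the value copied from the hash list. Passing `algorithm="sha1"` checks against the SHA1 list instead.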