Govdocs

The Govdocs corpus is a collection of approximately 1 million documents, freely available for research and provided by the Digital Corpora site. Each document is distributed as a numbered file with a tentative file extension (e.g. 0000001.jpg). The corpus is particularly useful for testing cross-format tools such as format identification software.

The corpus is available as a set of 1,000 zip files, each containing about 1,000 test files.
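Because the file extensions are only tentative, a quick tally of extensions across the downloaded archives can be a useful first look before running a format identification tool. The following is a minimal sketch in Python using only the standard library. The function name count_extensions and the "(none)" label are illustrative choices, not part of the corpus, and the sketch makes no assumption about how files are laid out inside each zip.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_extensions(zip_paths):
    """Tally the file extensions of all members of the given zip archives.

    zip_paths is an iterable of paths to locally downloaded zip files.
    Returns a Counter mapping lower-cased extension (e.g. ".jpg") to count.
    """
    counts = Counter()
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue  # skip directory entries, count files only
                # The extension is a tentative label, so only normalise its case.
                suffix = PurePosixPath(info.filename).suffix.lower()
                counts[suffix or "(none)"] += 1
    return counts
```

For example, if the archives were saved in a local directory (here a hypothetical govdocs/ folder), count_extensions(sorted(glob.glob("govdocs/*.zip"))).most_common(10) would list the ten most frequent extensions. The tally reflects the labels only; it does not verify that a file's content matches its extension.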

Resources