Plain text
- Internet Archive Books (includes plain-text [“full text”] access to books, issues of magazines, etc.)
- Early English Books Online (EEBO) (some texts TEI-encoded)
- Early Caribbean Digital Archive (ECDA)
- Oxford Text Archive (large number of texts available in a variety of forms, including plain text; texts are accessed one at a time)
- Project Gutenberg
TEI-Encoded
- Women Writers Online
- Eighteenth Century Collections Online (ECCO-TCP)
- Documenting the American South
- Resources from Laura Nelson’s “Analyzing Complex Digitized Data”
- Demonstration Corpora, by Alan Liu
- U.S. Presidents’ Inaugural Speeches
- Abraham Lincoln Speeches and Letters
- Documenting the American South
- The Church in the Black Community
- First-Person Narratives of the American South (African Americans, women, enlisted men, Native Americans, ex-slaves, etc.)
- North American Slave Narratives
- Sunday School Books in 19th Century America
- The Grange Visitor (Michigan newspaper)
- Historic American Cookbooks
- Adult British Fiction – 1880s (by gender)
- Children’s Fiction – 1880s (by gender) (I have formatted some of these data; ask me)
- William Wordsworth writings
- Book summaries and film summaries from Wikipedia
- U.S. patents related to the humanities
Text in a Variety of Formats
- HathiTrust (16 million volumes, mostly in English)
- Chronicling America (12.8 million pages of American newspapers)
- DocSouth Data (narratives & literature from the American South)
- Perseus Digital Library (large collection of classical texts, much of it encoded in TEI/XML)
- EEBO-TCP (ca. 50,000 early English books, many encoded in TEI/XML)
- Old Bailey Online (197,745 London criminal trials, 1674-1913)
- Canadian Hansard (debates & journals of the Canadian Senate & House of Commons)
- Australian Hansard (Parliamentary debates, 1901-1980)
- UK Hansard (UK Parliamentary debates)
- Open Islamicate Texts Initiative (see also repositories; 10,000 premodern Islamicate texts)
- Transkribus Corpus and READ (efforts to use computer vision to recognize handwriting)
- ToposText (557 classical texts linked with a gazetteer of the ancient world)
- BYU Corpora (widely used corpora of American English)
- Wright American Fiction (American adult fiction, 1774–1900)
- UCLA Broadcast NewsScape (170K hours of captioned news programs; see Red Hen Lab for information on access)
- Media History Digital Library (nearly 2 million pages of media-related books and articles, 1875-1995)
- Christian Classics Ethereal Library (classic Christian texts)
- NYT Annotated Corpus (1.8 million NYT articles + NYT-supplied metadata)
- Europeana Collections (many datasets from European libraries & archives, from papyri to photographs to newspapers)
- Foreign Records of the US (nearly complete run of Foreign Relations of the United States; see these tools to obtain full text)
- Internet Archive (huge collection of websites, texts, audio, and other media, available for bulk download via wget)
- Twitter Datasets (a catalog of Twitter datasets that are publicly available on the web)
- BitCurator (effort to develop tools to analyze features of digital texts)
- Movie Quotes Corpus (“220,579 conversational exchanges between 10,292 pairs of movie characters”)
- Europe PMC (repository of life sciences books, articles, and preprints)
- Trove Australia (565 million documents collected by the National Library of Australia, including a sizeable collection of newspapers)
- BNC-Baby (4 million-word subcorpus of the 100 million-word British National Corpus, with part-of-speech tagging in XML)
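
Several of the collections listed above distribute texts as TEI/XML rather than plain text. As a rough illustration of how such a file might be reduced to plain text for analysis, here is a minimal Python sketch using only the standard library. The function name `tei_to_plain_text` and the inline `SAMPLE` document are hypothetical, made up for this example. The sketch assumes the file uses the TEI P5 namespace (`http://www.tei-c.org/ns/1.0`) and a `<text>` element for the body; files from a given archive may be encoded differently and need adjustment.

```python
import xml.etree.ElementTree as ET

# Assumption: the file declares the TEI P5 namespace.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def tei_to_plain_text(xml_string: str) -> str:
    """Return the text inside the <text> element, with whitespace collapsed.

    Hypothetical helper for illustration; not part of any archive's tooling.
    """
    root = ET.fromstring(xml_string)
    # Prefer <text> in the TEI namespace, then try an un-namespaced <text>.
    body = root.find(f".//{TEI_NS}text")
    if body is None:
        body = root.find(".//text")
    if body is None:
        # No <text> element found: fall back to the whole document.
        body = root
    # itertext() yields every text node in document order, dropping the tags.
    return " ".join("".join(body.itertext()).split())


# A tiny made-up TEI document, used only to demonstrate the function.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>Hello, <hi rend="italic">plain</hi> text.</p></body></text>
</TEI>"""

if __name__ == "__main__":
    print(tei_to_plain_text(SAMPLE))
```

The header (title, encoding notes, and other metadata) is skipped on purpose, since it is usually not wanted in a corpus of running text. For a real file, read its contents into a string first and pass that to the function.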