Text (Datasets) – Digital Scholarship @ LMU

Plain text

Internet Archive Books (includes plain-text [“full text”] access to books, issues of magazines, etc.)

Early English Books Online (EEBO) (some texts TEI-encoded)
Early Caribbean Digital Archive (ECDA)
Oxford Text Archive (large number of texts available in a variety of forms, including plain text; texts are accessed one at a time)
Project Gutenberg

TEI-Encoded

- U.S. Presidents’ Inaugural Speeches
- Abraham Lincoln Speeches and Letters
- Documenting the American South
  - The Church in the Black Community
  - First-Person Narratives of the American South (African Americans, women, enlisted men, Native Americans, ex-slaves, etc.)
  - North American Slave Narratives
- Sunday School Books in 19th Century America
- The Grange Visitor (Michigan newspaper)
- Historic American Cookbooks
- Adult British Fiction – 1880s (by gender)
- Children’s Fiction – 1880s (by gender) (I have formatted some of these data; ask me if you would like them)
- William Wordsworth writings
- Book summaries and film summaries from Wikipedia
- U.S. patents related to the humanities
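
For text analysis, TEI-encoded files usually need to be reduced to plain text first. The following is a minimal sketch of one way to do that with Python's standard library, not a definitive method for any collection listed above. The function name `tei_to_plain_text` and the sample document are illustrative assumptions. The sketch also assumes the file either uses the TEI P5 namespace or, as a fallback, has no namespace at all; individual collections may need adjustments.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents; older TEI files often have none.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_plain_text(xml_string: str) -> str:
    """Return the text of the TEI <body> with markup removed.

    Hypothetical helper for illustration: it drops the header and all
    tags, then collapses runs of whitespace into single spaces.
    """
    root = ET.fromstring(xml_string)

    # Look for <text><body> in the TEI P5 namespace first.
    body = root.find(".//tei:text/tei:body", TEI_NS)
    if body is None:
        # Fallback (assumption): a TEI file without a namespace.
        body = root.find(".//text/body")
    if body is None:
        return ""

    # itertext() walks every text node inside <body>, including nested tags.
    raw = "".join(body.itertext())
    return " ".join(raw.split())


# A tiny made-up TEI document, used only to demonstrate the function.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>Four score and <hi rend="italic">seven</hi> years ago.</p></body></text>
</TEI>"""

if __name__ == "__main__":
    print(tei_to_plain_text(SAMPLE))
```

Only the body is kept here, so bibliographic metadata in the TEI header does not end up in word counts. If you need titles or dates, read them from the header separately.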

Text in a Variety of Formats

HathiTrust (16 million volumes, mostly in English)
Chronicling America (12.8 million pages of American newspapers)
DocSouth Data (narratives & literature from the American South)
Perseus Digital Library (large collection of classical texts, much of it encoded in TEI/XML)
EEBO-TCP (ca. 50,000 early English books, many encoded in TEI/XML)
Old Bailey Online (197,745 London criminal trials, 1674–1913)
Canadian Hansard (debates & journals of the Canadian Senate & House of Commons)
Australian Hansard (Parliamentary debates, 1901–1980)
UK Hansard (UK Parliamentary debates)
Open Islamicate Texts Initiative (see also repositories; 10,000 premodern Islamicate texts)
Transkribus Corpus and READ (efforts to use computer vision to recognize handwriting)
ToposText (557 classical texts linked with a gazetteer of the ancient world)
BYU Corpora (widely used corpora of American English)
Wright American Fiction (American adult fiction, 1774–1900)
UCLA Broadcast NewsScape (170K hours of captioned news programs; see Red Hen Lab for information on access)
Media History Digital Library (nearly 2 million pages of media-related books and articles, 1875–1995)
Christian Classics Ethereal Library (classic Christian texts)
NYT Annotated Corpus (1.8 million NYT articles + NYT-supplied metadata)
Europeana Collections (many datasets from European libraries & archives, from papyri to photographs to newspapers)
Foreign Records of the US (nearly complete run of Foreign Relations of the United States; see these tools to obtain full text)
Internet Archive (huge collection of websites, texts, audio, and other media, available for bulk download via wget)
Twitter Datasets (a catalog of Twitter datasets that are publicly available on the web)
BitCurator (effort to develop tools to analyze features of digital texts)
Movie Quotes Corpus (“220,579 conversational exchanges between 10,292 pairs of movie characters”)
Europe PMC (repository of life sciences books, articles, and preprints)
Trove Australia (565 million documents collected by the National Library of Australia, including a sizeable collection of newspapers)
BNC-Baby (4 million-word subcorpus of the 100 million-word British National Corpus, with part-of-speech tagging in XML)
