LibGuides: Data & Statistics Research Guide: Textual Data

Textual Data Archives

HathiTrust
HathiTrust Digital Library is a digital preservation repository and highly functional access platform. It provides long-term preservation and access services for public domain and in-copyright content from a variety of sources, including Google, the Internet Archive, Microsoft, and in-house partner institution initiatives. Bibliographic and full-text search are available for all volumes in HathiTrust. Public domain volumes are freely accessible to the public; persons affiliated with partner institutions can download them in their entirety after authenticating. NYU Libraries is a HathiTrust partner institution.

Documenting the American South
Documenting the American South (DocSouth) is a digital publishing initiative that provides Internet access to texts, images, and audio files related to southern history, literature, and culture. Currently DocSouth includes sixteen thematic collections of books, diaries, posters, artifacts, letters, oral history interviews, and songs.

English-Corpora.org
The online version of English-Corpora.org comprises several corpora, including iWeb (the Intelligent Web Corpus); NOW (News on the Web); the Coronavirus Corpus; COCA (Corpus of Contemporary American English); GloWbE (Global Web-based English); the Wikipedia Corpus; COHA (Corpus of Historical American English); the TV Corpus; the Movies Corpus; and the SOAP Corpus, as well as Corpus del Español and Corpus do Português. The corpora have many different uses, including finding out how native speakers actually speak and write; finding the frequency of words, phrases, and collocates; looking at language variation and change (for example, across historical periods, dialects, and genres); gaining insight into culture (for example, what is said about different concepts over time and in different countries); and designing authentic language teaching materials and resources. To access the corpora as a downloadable set for offline use, see the resource "English-Corpora Text-as-Data."

Users must create an account with English-Corpora.org using their NYU email addresses. Users must also connect using this link at least once every 365 days to retain their account's access.

Chronicling America: Historic American Newspapers
Search America's historic newspaper pages from 1789 to 1925, or use the U.S. Newspaper Directory to find information about American newspapers published between 1690 and the present.

Cultural Analytics Dataverse (McGill University)
This dataverse is a small collection of individual corpora produced or explored by the txt lab at McGill University.

Project Gutenberg
Project Gutenberg is a repository of ebooks that can be downloaded as text. You will find the world's great literature here, especially older works for which copyright has expired.

Google nGram texts
These datasets were generated by Google in July 2012 (Version 2) and July 2009 (Version 1). Google states that it will update the datasets as its book scanning continues, and that the updated versions will have distinct and persistent version identifiers (20120701 and 20090715 for the current sets). Each of the numbered links on the download page directly downloads a fragment of the corpus. In Version 2 the ngrams are grouped alphabetically (languages with non-Latin scripts were transliterated); in Version 1 the ngrams are partitioned into files of equal size. In addition, for each corpus there is a file named total_counts, which records the total number of 1-grams contained in the books that make up the corpus. This file is useful for computing the relative frequencies of ngrams: an ngram's count in a given year divided by that year's total.
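
The relative-frequency calculation described above can be sketched in Python. This is a minimal illustration, not an official tool. It assumes the Version 2 layout, in which each ngram line holds tab-separated ngram, year, match_count, and volume_count fields, and total_counts holds tab-separated records of the form year,match_count,page_count,volume_count. Check the layout against the files you actually download. The function names and the sample numbers are made up for the example.

```python
def parse_total_counts(text):
    """Return {year: total 1-gram count}.

    Assumes tab-separated records of the form
    'year,match_count,page_count,volume_count' (assumed Version 2 layout).
    """
    totals = {}
    for record in text.split("\t"):
        record = record.strip()
        if not record:
            continue  # skip empty leading/trailing records
        year, match_count = record.split(",")[:2]
        totals[int(year)] = int(match_count)
    return totals


def relative_frequencies(ngram_lines, totals, target):
    """Return {year: match_count / yearly total} for one target ngram.

    Assumes each line is 'ngram<TAB>year<TAB>match_count<TAB>volume_count'.
    """
    freqs = {}
    for line in ngram_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        ngram, year, match_count = fields[0], int(fields[1]), int(fields[2])
        if ngram == target and totals.get(year, 0) > 0:
            freqs[year] = match_count / totals[year]
    return freqs


# Made-up sample data standing in for downloaded files.
sample_totals = "\t1900,1000000,5000,40\t1901,2000000,9000,75"
sample_ngrams = [
    "archive\t1900\t50\t12\n",
    "archive\t1901\t80\t20\n",
    "library\t1900\t300\t30\n",
]

totals = parse_total_counts(sample_totals)
freqs = relative_frequencies(sample_ngrams, totals, "archive")
for year in sorted(freqs):
    print(year, freqs[year])
```

With real data, the list of sample lines could be replaced by an open file handle for a downloaded fragment, since the function only iterates over lines.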