Word frequency data


Comparison of the 100,000 list (see samples) and the 5,000-60,000 lists (see samples)

Lemma / word form

100,000 list: Word forms. This lets you see the frequency of each form within the noun, verb, and adjective classes, e.g. paper (nn1), papers (nn2)  ||  listen (vv0), listens (vvz), listening (vvg), listened (vvd), listened (vvn)  ||  fast (jj), faster (jjr), fastest (jjt). This is probably the better option for linguists, for those doing natural language processing, and for those developing an application that needs a list of each individual word form. Note that a word can be listed multiple times for different parts of speech (e.g. run as a noun and as a verb), so the list contains 86,413 distinct words (e.g. with all forms of run counted just once).

5,000-60,000 lists: Lemmas. The main parts of speech are kept separate (e.g. separate entries for point as a noun and as a verb), but all forms within noun, verb, etc. are grouped together; for example, the frequencies for listen, listens, listening, and listened are all counted under listen (verb). This may be the better option for some teachers and learners.
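The relationship between the two kinds of list can be sketched in a few lines of Python. This is only an illustration: the tuple layout, the function names, and every frequency below are invented for the example and are not taken from the actual lists.

```python
from collections import defaultdict

# Hypothetical word-form entries:
# (lemma, main part of speech, word form, tag, frequency).
# All frequencies are invented for illustration only.
WORD_FORMS = [
    ("listen", "verb", "listen", "vv0", 500),
    ("listen", "verb", "listens", "vvz", 120),
    ("listen", "verb", "listening", "vvg", 300),
    ("listen", "verb", "listened", "vvd", 200),
    ("listen", "verb", "listened", "vvn", 80),
    ("point", "noun", "point", "nn1", 900),
    ("point", "noun", "points", "nn2", 400),
    ("point", "verb", "point", "vv0", 150),
]


def lemma_frequencies(entries):
    """Sum word-form frequencies under each (lemma, main part of speech) pair."""
    totals = defaultdict(int)
    for lemma, pos, _form, _tag, freq in entries:
        totals[(lemma, pos)] += freq
    return dict(totals)


def distinct_word_count(entries):
    """Count distinct spellings, ignoring tag and part of speech."""
    return len({form for _lemma, _pos, form, _tag, _freq in entries})


totals = lemma_frequencies(WORD_FORMS)
distinct = distinct_word_count(WORD_FORMS)
print(totals[("listen", "verb")])        # all forms of listen grouped as one verb lemma
print(len(WORD_FORMS), "entries,", distinct, "distinct words")
```

Grouping by the pair (lemma, part of speech) keeps point as a noun separate from point as a verb, which mirrors how the lemma lists are described above, while the distinct-word count collapses repeated spellings in the same way the 86,413 figure does.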

Corpora

100,000 list: Provides data from the 520 million word Corpus of Contemporary American English (COCA; 2012 update of 450 million words), the 400 million word Corpus of Historical American English (COHA), the 100 million word British National Corpus (BNC), and the 100 million word Corpus of American Soap Operas (SOAP).

5,000-60,000 lists: Data from the Corpus of Contemporary American English (COCA) only (2010 version, 400 million words).

Genre frequency

100,000 list: Provides the frequency in each COCA genre (spoken, fiction, popular magazine, newspaper, and academic) and each BNC genre (spoken, fiction, popular magazine, newspaper, non-academic journals, academic journals, and miscellaneous). For COHA, the list also shows the frequency in the three main time periods of COHA (1950s-1980s, 1900s-1940s, and 1810s-1890s).

5,000-60,000 lists: The standard word lists contain no frequency information for genres. Researchers can, however, purchase lists with genre frequency for COCA only (see sample), covering the five main genres as well as 40+ sub-genres (e.g. Newspaper-Sports or Academic-Medicine).

Format / distribution

100,000 list: Distributed as Excel files (see a sample: xlsx, xls), with a hyperlink for each of the 100,000 words for one-click searches of the online corpora. Also available as a text file for those who want to import the data into another program (such as a relational database). Both formats are included in the purchase price.

5,000-60,000 lists: Available only as text files (basic word lists and collocates lists) or as an Excel file (genre frequencies list), with no hyperlinks for online queries.
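For readers who want to import a text file into a relational database, a minimal sketch using Python's standard csv and sqlite3 modules might look like the following. The tab-separated layout, the header names (rank, word, pos, freq), the table name, and the sample rows are all assumptions made for this example; check the actual file and adjust the column handling to match.

```python
import csv
import io
import sqlite3

# Stand-in for the downloaded text file. The tab-separated layout, the header
# names, and the numbers are assumptions for this sketch, not the real format.
SAMPLE = (
    "rank\tword\tpos\tfreq\n"
    "1\tpaper\tnn1\t1000\n"
    "2\tpapers\tnn2\t400\n"
    "3\tfast\tjj\t250\n"
)


def load_word_list(conn, text_stream):
    """Create a word_list table and fill it from a tab-separated stream."""
    conn.execute(
        "CREATE TABLE word_list "
        "(list_rank INTEGER, word TEXT, pos TEXT, freq INTEGER)"
    )
    reader = csv.DictReader(text_stream, delimiter="\t")
    rows = [
        (int(r["rank"]), r["word"], r["pos"], int(r["freq"])) for r in reader
    ]
    conn.executemany("INSERT INTO word_list VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # use a file path instead to keep the data
loaded = load_word_list(conn, io.StringIO(SAMPLE))

# Example query: total frequency of all noun forms (tags starting with "nn").
noun_total = conn.execute(
    "SELECT SUM(freq) FROM word_list WHERE pos LIKE 'nn%'"
).fetchone()[0]
print(loaded, "rows loaded; noun frequency total:", noun_total)
```

To load a real file, replace `io.StringIO(SAMPLE)` with an open file handle, e.g. `open(path, newline="", encoding="utf-8")`, where the path and encoding are again up to the reader to confirm.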

Number of texts

100,000 list: Sometimes it's useful to know not just how many times a word occurred in a corpus (or a particular genre of a corpus), but also in how many texts it occurred (e.g. 37,273 of 182,893 texts overall, or 20.4% of all texts), which shows how well it is "spread across" the texts in a corpus. The 100,000 word lists provide this data for each corpus (COCA, COHA, BNC, SOAP) as well as for each genre in COCA and the BNC and for the three main time periods of COHA (1950s-1980s, 1900s-1940s, and 1810s-1890s).

5,000-60,000 lists: No information on the number of texts.
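The text-coverage percentage quoted for the 100,000 word lists is simply the number of texts containing the word divided by the total number of texts. A short Python check of the 37,273 of 182,893 example follows; the function name is made up for this sketch.

```python
def text_coverage(texts_with_word, total_texts):
    """Return the percentage of texts that contain the word at least once."""
    if total_texts <= 0:
        raise ValueError("total_texts must be positive")
    return 100.0 * texts_with_word / total_texts


coverage = text_coverage(37_273, 182_893)
print(f"{coverage:.1f}% of all texts")  # prints "20.4% of all texts"
```

Two words with the same overall frequency can have very different coverage values: a word concentrated in a handful of texts scores low, while one that appears throughout the corpus scores high.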