Word frequency data

Corpus of Contemporary American English


The following are samples of the data available in the new 100,000 word list of English. If you get the full list as an Excel file (see samples: xlsx, xls), you can sort and limit it however you want: by part of speech; by overall frequency in COCA, COHA, the BNC, or SOAP; or by frequency in any of the genres of COCA or the BNC (e.g. academic or newspapers). You might also take a look at 5,000 randomly selected words from the list (every twentieth word, 1 to 100,000) to check the accuracy of the list.
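As a rough illustration of the sorting and limiting described above, here is a minimal Python sketch. The field names (rank, word, pos, coca, bnc) and the toy rows are made-up placeholders, not the column headers or values of the actual spreadsheet, and the starting offset of the every-twentieth-word sample is an assumption.

```python
from operator import itemgetter

# Toy rows with made-up values. The field names are assumptions for
# illustration only, not the real column headers of the Excel file.
ROWS = [
    {"rank": 1, "word": "alpha",   "pos": "n", "coca": 900, "bnc": 300},
    {"rank": 2, "word": "bravo",   "pos": "v", "coca": 800, "bnc": 850},
    {"rank": 3, "word": "charlie", "pos": "n", "coca": 700, "bnc": 950},
    {"rank": 4, "word": "delta",   "pos": "j", "coca": 600, "bnc": 100},
    {"rank": 5, "word": "echo",    "pos": "n", "coca": 500, "bnc": 400},
    {"rank": 6, "word": "foxtrot", "pos": "v", "coca": 400, "bnc": 200},
]


def limit_and_sort(rows, pos, freq_field):
    """Keep rows with the given part of speech, most frequent first."""
    kept = [r for r in rows if r["pos"] == pos]
    return sorted(kept, key=itemgetter(freq_field), reverse=True)


def every_nth(rows, n, start=0):
    """Take every n-th row in rank order, beginning at index `start`."""
    ordered = sorted(rows, key=itemgetter("rank"))
    return ordered[start::n]


if __name__ == "__main__":
    # Nouns only, ordered by the (hypothetical) BNC frequency column.
    for row in limit_and_sort(ROWS, "n", "bnc"):
        print(row["word"], row["bnc"])
    # Every second row of the toy list.
    print([r["rank"] for r in every_nth(ROWS, 2)])
```

Applied to a list of 100,000 rows, every_nth(rows, 20) returns 5,000 rows, which matches the size of the every-twentieth-word sample mentioned above.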
 

# | Sample list | Description (click to download) | # entries
1 | Rank-ordered | Every 100 words, 1-100,000 | 1008
2 | Alphabetical | All words starting with [V-] | 1212
3 | Part of speech | Simple past tense forms (every fifth entry) | Top 1000
4 | Genres (COCA: Academic) | Verbs that are 50% more frequent in COCA Academic (per million words) than overall | 1188
5 | Dialects (COCA / BNC) | Nouns that are at least 10 times as frequent in COCA (overall) as in the BNC (random entries) | Top 1000
6 | New words (COCA / COHA) | Nouns that are at least 10 times as frequent in COCA (1990-2012) as in COHA, 1950-1989 (random entries) | Top 1000
7 | Informal words | Adjectives that are at least twice as frequent in SOAP (Soap Operas) as in COCA (overall) | 255
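Several of the sample lists above are defined by a frequency ratio between two corpora or genres. The following Python sketch shows one way such a filter could be written. It is not the method used to build the lists: the field names and toy values are hypothetical, the comparison is assumed to use per-million frequencies, and "50% more frequent" is read here as a ratio of at least 1.5.

```python
def per_million(count, corpus_tokens):
    """Normalize a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_tokens


def overrepresented(rows, a_field, b_field, min_ratio, pos=None):
    """Return words whose a_field frequency is >= min_ratio times b_field.

    Words that occur in A but have zero frequency in B are kept;
    that is a design choice for this sketch, not a documented rule.
    """
    words = []
    for r in rows:
        if pos is not None and r["pos"] != pos:
            continue
        a, b = r[a_field], r[b_field]
        if a > 0 and a >= min_ratio * b:
            words.append(r["word"])
    return words


# Toy per-million values, made up for illustration only.
TOY = [
    {"word": "w1", "pos": "v", "acad_pm": 30.0, "all_pm": 20.0},
    {"word": "w2", "pos": "v", "acad_pm": 10.0, "all_pm": 20.0},
    {"word": "w3", "pos": "n", "acad_pm": 50.0, "all_pm": 5.0},
    {"word": "w4", "pos": "v", "acad_pm": 4.0,  "all_pm": 0.0},
]

if __name__ == "__main__":
    # Verbs at least 1.5 times as frequent in the "academic" column.
    print(overrepresented(TOY, "acad_pm", "all_pm", 1.5, pos="v"))
```

The same function covers the other ratio-based lists by changing min_ratio (10 for the dialect and new-word lists, 2 for the informal-word list) and the pair of frequency fields being compared.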

Note: no entries have been removed from these sample lists. In other words, the full list (words 1-100,000) has not been "cleaned up" for these sample lists, and as a result, the accuracy of the sample lists is indicative of the accuracy of the full list.