Word frequency data

Corpus of Contemporary American English



See general page on comparing our lists to others (lists 5,000-60,000)

No other word frequency list of English besides ours contains an accurate listing of 100,000 or more words. There are at least four main problems with other lists, and all other lists suffer from at least one of these:

1. Their corpus is too small. We use the 450 million word Corpus of Contemporary American English (COCA), and the words that occur at the end of the list (e.g. words #90,000-100,000) occur about 16-17 times each. In a 100 million word corpus like the British National Corpus (between 1/4 and 1/5 the size of COCA), each of these words would occur about 4 times. At that point, many of these words are just "noise" -- there's no way to know if they appear in one or two texts by chance, or if other words should be in the list instead. And very small corpora like the American National Corpus or some of the "corpora" for the lists at Wiktionary are completely inadequate for a 100,000 (or even a 20,000) word list.

2. Their "corpus" is not balanced. If you have a corpus based on just newspapers (which is easy to create), you'll just get "newspaper language", and the same holds for fiction or any other genre. For example, see the verbs in COCA that appear in a corpus that includes fiction but only rarely in a newspaper-only corpus (undress, sigh, glance, stare, suppose, kiss, smile, etc). Most of the corpora from the Wiktionary lists are based on one narrow genre. And corpora that are based just on web-accessible texts are perhaps the worst -- there's no way to know what's in there, and what is left out (e.g. fiction). COCA is almost perfectly balanced between spoken, fiction, popular magazines, newspapers, and academic -- a wide range of English.

3. Their "corpus" is not tagged for part of speech. There are more than 140 words that occur both as a noun and as a verb at least 10,000 times in COCA, including words like state, work, place, head, point, end, study, face, care, use, control, and experience. But what is their frequency as a noun and as a verb? Or what about words like [school, room, father, party, oil, programs], or [serve, save, spoke, pulls, fail, joins]? Which are mostly nouns and which are mostly verbs? It's crucial to know the part of speech of a word in order to know how it is used in the language, but many other word lists don't show this.

4. Their word list is not carefully corrected. Take a look at the word lists from the British National Corpus (we downloaded the unlemmatized / all.num.gz file). As is shown in #1 above on this page, their words #90,000-100,000 would occur about four times each. Take a look at these files of nouns, verbs, adjectives, and adverbs that occur four times in their list. How accurate is this list? While we can't claim absolute 100% accuracy for our list (but see a sample of 5,000 completely randomly-selected words), we would argue that it is at least 99% correct -- even for these low-frequency words (e.g. #80,000-100,000), and very close to 100% for more common words (e.g. #1-40,000) -- and we believe that this is much higher than any other list.
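
The "about 4 times" figure used in #1 and #4 above is a simple proportional scaling of a word's count from one corpus size to another. As a rough sketch (the function name `expected_count` is just an illustrative choice; the corpus sizes and the 16-17 counts are the figures quoted above), the arithmetic looks like this:

```python
# Corpus sizes in words, as quoted in the text above.
COCA_SIZE = 450_000_000
BNC_SIZE = 100_000_000


def expected_count(count_in_source, source_size, target_size):
    """Scale a raw count from one corpus to a corpus of a different size.

    This assumes the word is equally likely in both corpora, which is
    only a rough approximation for very low-frequency words.
    """
    return count_in_source * target_size / source_size


# Words ranked #90,000-100,000 occur about 16-17 times each in COCA.
for coca_count in (16, 17):
    scaled = expected_count(coca_count, COCA_SIZE, BNC_SIZE)
    print(f"{coca_count} in COCA -> about {scaled:.1f} in a BNC-sized corpus")
```

Both values come out a little under 4, which is why words at this rank in a 100 million word corpus are so hard to tell apart from noise.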