See the page comparing the 100,000 word list to other similar lists.
Our data is based on the 520 million word Corpus of
Contemporary American English (COCA) -- the only large corpus of
English that is both up-to-date (the latest texts are from late 2015) and based on a wide range of genres (e.g.
spoken, fiction, newspapers, magazines, and academic writing).
Why worry about what corpus is
used? After all, there are many English word lists and
frequency lists out on the Web (see in particular the British National
Corpus and the
American National Corpus). Some are good, and others are very
poor in quality. Not all
frequency lists are created equal.
One should be very,
very suspicious of word lists that are taken
from samples of
web data, outdated texts, or corpora that are
too small to effectively model what is happening in the real world.
Or worse, word lists that don't give you
any idea what they
are based on. As the saying goes: "garbage in,
garbage out".
Here are some questions
you might ask yourself as you consider downloading or purchasing
a word list:
Depth and accuracy. Why do so many wordlists on the web
contain just the top 1000-3000 words of English? Why not the
top 20,000 or 60,000? It's because even a bad corpus (the
collection of texts that the word lists are based on) can produce a
moderately accurate list for the very most frequent words. But
because the corpus is neither deep nor balanced enough, you start
getting messy data for medium and lower frequency words. Ask to see
the top 20,000 or 60,000 words (e.g. every 7th or 10th word). If
they don't have it, then you should be very, very suspicious of that word list.
Does the corpus contain texts from a wide variety of genres -- spoken,
fiction, popular magazines, newspapers, and academic journals?
Frequency lists that are based on just one of these may only contain
40-50% of the words from a more balanced corpus. As mentioned, our frequency list
is based on the
Corpus of Contemporary American English (COCA), which is almost perfectly
balanced across genres.
COCA contains more than 450 million words, and each of the top 20,000
words occurs at least 300 times. In a small 10-20 million word
corpus, some of these words would occur just 7-8 times. At that
point, the lower frequency words might make it into the list "by
chance", whereas others are left out. No such problem with COCA.
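The size argument above is simple proportional arithmetic, and a minimal sketch makes it concrete. The function name `expected_count` is ours, and the sketch assumes that a word's rate per million words stays the same when the corpus shrinks, which real corpora only approximate.

```python
def expected_count(count_in_large, large_size, small_size):
    """Scale an observed count to a smaller corpus.

    Assumption: the word occurs at the same rate per million words
    in both corpora.
    """
    return count_in_large * small_size / large_size


LARGE_CORPUS = 450_000_000  # corpus size quoted in the text
MIN_COUNT = 300             # minimum count for a top-20,000 word, per the text

for small_corpus in (10_000_000, 20_000_000):
    estimate = expected_count(MIN_COUNT, LARGE_CORPUS, small_corpus)
    millions = small_corpus // 1_000_000
    print(f"{millions} million words: about {estimate:.1f} occurrences")
```

Under that assumption, a word at the 300-occurrence threshold would be expected roughly 7 times in a 10 million word corpus and roughly 13 times in a 20 million word corpus. Counts that low are easily swamped by chance variation in which texts happened to be sampled.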
How recent is it?
Language change happens. If the word list is based on
outdated texts (or much worse, 100-year-old public domain novels), then it
will be missing many of the words from the modern language. COCA is
based on texts from 1990-2015 (20 million words each year), or in
other words, virtually right up to the current time.
Is it just a bare wordlist? Word lists are nice, but to be
really useful (especially for language learning) there ought to be
some indication of what these words mean and how they are used. Some
of our frequency lists contain the top 20-30 collocates (nearby
words) for each word in the list, which creates a great "sketch" of
what each word means and how it is used.
Are they just word forms? Do you really want to see the
individual frequency of shoe and shoes, or realize,
realizes, realized, and realizing? Do you want to
have the combined frequency of watch as a verb (they watch
TV) and watch as a noun (his watch broke)? If the
lists are simply taken from
pages that are "scraped" from the web, they will just provide
long lists of words, without grouping them meaningfully (e.g. shoe/shoes),
or separating them when necessary (e.g. watch as a noun and
as a verb).
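The difference between counting word forms and counting lemmas by part of speech can be sketched in a few lines. The tagged tokens below are a hypothetical, hand-made sample (not COCA data), and the tuple layout is our own; a real list would come from a corpus that has been lemmatized and tagged for part of speech.

```python
from collections import Counter

# Hypothetical hand-tagged tokens: (word form, lemma, part of speech).
tagged_tokens = [
    ("shoe", "shoe", "noun"),
    ("shoes", "shoe", "noun"),
    ("shoes", "shoe", "noun"),
    ("realized", "realize", "verb"),
    ("realizing", "realize", "verb"),
    ("watch", "watch", "verb"),  # they watch TV
    ("watch", "watch", "noun"),  # his watch broke
    ("watch", "watch", "verb"),
]

# A bare word-form list: every spelling is counted separately, and
# "watch" the noun is merged with "watch" the verb.
form_counts = Counter(form for form, _, _ in tagged_tokens)

# A lemma-based list: related forms are grouped under one entry, and
# the same spelling is split by part of speech.
lemma_counts = Counter((lemma, pos) for _, lemma, pos in tagged_tokens)

print(form_counts.most_common())
print(lemma_counts.most_common())
```

In the form-based count, shoe and shoes appear as two separate entries and watch is a single entry of 3. In the lemma-based count, shoe is one noun entry of 3, while watch is split into a verb entry of 2 and a noun entry of 1.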
Summary. There are many word frequency lists out on the web.
Some are just OK, and some are truly bad. The frequency lists that
we have created are the only ones available anywhere that are based on a large, recent,
and balanced corpus of English.