See general page on comparing our lists to others (lists 5,000-60,000)
No other word frequency list of English besides
ours contains an accurate listing of 100,000 or more words. There are at least
four main problems with other lists, and all other lists suffer from at least
one of these:
1. Their corpus is too small. We use the 450
million word Corpus of Contemporary
American English (COCA), and the words that occur at the end of the list
(e.g. words #90,000-100,000) occur about 16-17 times each. If you had a 100
million word corpus like the British
National Corpus (1/4 to 1/5 the size of COCA), each of these words would
occur about 4 times. At that point, many of these words are just "noise" --
there's no way to know if they appear in one or two texts by chance, or if other
words should be in the list instead. And very small corpora like the
American
National Corpus or some of the "corpora" for the lists at
Wiktionary are completely inadequate
for a 100,000 (or even a 20,000) word list.
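
The arithmetic behind the corpus-size argument can be sketched as follows. This is only a rough illustration: it assumes that a word has the same relative frequency in both corpora, which is a simplification, and the function name `expected_count` is our own label. The corpus sizes and counts are the figures quoted above.

```python
def expected_count(count_in_source, source_size, target_size):
    """Scale a raw count from one corpus to another.

    Assumes the word has the same relative frequency in both corpora
    (a simplification used only for this illustration).
    """
    return count_in_source * target_size / source_size


COCA_SIZE = 450_000_000  # words, as stated above
BNC_SIZE = 100_000_000   # words, as stated above

# Words #90,000-100,000 occur about 16-17 times each in COCA.
low = expected_count(16, COCA_SIZE, BNC_SIZE)
high = expected_count(17, COCA_SIZE, BNC_SIZE)

print(f"{low:.1f} to {high:.1f}")  # prints "3.6 to 3.8"
```

In other words, the same words would be expected to occur fewer than four times each in a 100 million word corpus, which is too few occurrences to separate real words from noise.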
2. Their "corpus" is not balanced. If you
have a corpus based on just newspapers (which is easy to create), you'll just
get "newspaper language", and the same holds for fiction or any other genre. For
example, see verbs in
COCA that do appear in a corpus that includes fiction but rarely appear in a
newspaper-only corpus (undress, sigh, glance, stare, suppose, kiss, smile, etc.).
Most of the corpora from the
Wiktionary lists are based on one narrow genre. And corpora that are based
just on
web-accessible texts are perhaps the worst -- there's no way to know what's
in there, and what is left out (e.g. fiction). COCA is almost perfectly balanced
between spoken, fiction, popular magazines, newspapers, and academic -- a wide
range of English.
3. Their "corpus" is not tagged for part of
speech. There are more than 140 words that occur both as a noun and as a
verb at least 10,000 times in COCA, including words like state, work, place, head, point, end,
study, face, care, use, control, and experience. But what is their
frequency as a noun and as a verb? Or what about words like [school, room, father,
party, oil, programs], or [serve, save, spoke, pulls, fail, joins]? Which are mostly nouns and which are mostly verbs? It's
crucial to know the part of speech of a word in order to know how it is used in
the language, but many other word lists don't
show this.
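
To make the point concrete, here is a minimal sketch of how frequency by part of speech differs from a plain word count. The token list is a made-up toy sample (not COCA data), and the tag names `NOUN` and `VERB` and the helper names `pos_breakdown` and `dominant_pos` are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical toy sample of (word, part-of-speech) pairs; not real COCA data.
tagged_tokens = [
    ("work", "NOUN"), ("work", "VERB"), ("work", "VERB"),
    ("state", "NOUN"), ("state", "NOUN"), ("state", "VERB"),
]


def pos_breakdown(tokens):
    """Return a dict mapping each word to a Counter of its part-of-speech tags."""
    breakdown = {}
    for word, pos in tokens:
        breakdown.setdefault(word, Counter())[pos] += 1
    return breakdown


def dominant_pos(tokens, word):
    """Return the most frequent part-of-speech tag for a word."""
    return pos_breakdown(tokens)[word].most_common(1)[0][0]


# An untagged list can only report the total for each word form.
untagged = Counter(word for word, _ in tagged_tokens)
print(untagged["work"])                      # 3

# A tagged list can split that total by part of speech.
print(dict(pos_breakdown(tagged_tokens)["work"]))
print(dominant_pos(tagged_tokens, "state"))  # NOUN
```

With an untagged corpus only the first kind of count is available, so there is no way to answer the questions above.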
4. Their word list is not
carefully corrected. Take a look at the
word lists from the
British National Corpus (we downloaded the unlemmatized /
all.num.gz file).
As shown in #1 above on this page, their words #90,000-100,000 would occur
about four times each. Take a look at these files of
nouns,
verbs, adjectives, and
adverbs that occur four times in their list.
How accurate is this list? While we can't claim absolute 100% accuracy for our
list (but see a
sample of 5,000 completely randomly-selected words), we would argue that it
is at least 99% correct -- even for these low-frequency words (e.g.
#80,000-100,000), and very close to 100% for more common words (e.g. #1-40,000)
-- and we believe that this is much higher than any other list.