The following questions and issues
relate to the 100,000 word list.
1. What is the list based on? It is
based on the 520 million word Corpus
of Contemporary American English (COCA; July 2012 update of 450 million words), the 400 million
word Corpus of Historical American English,
the 100 million word British National
Corpus, and the 100 million word
Corpus of American Soap Operas.
2. What about part of speech tagging
(especially -ing)? We used the
CLAWS 7 part of speech tagger (see the
C7 tagset). In many cases
CLAWS suggests multiple possible parts of speech (e.g. strike: singular
noun or the base form of the verb). In these cases, we took the most frequent
tag, as suggested by CLAWS. We then spent 2-3 months manually correcting the
part of speech tags. This was particularly difficult in the case of, for
example, -ING words (e.g. noun/verb: learning, meeting, thinking, beginning,
living, teaching, reading, feeling or verb/adjective: leading, following,
growing, changing, developing, missing, supporting). There are thousands of
such words, and we looked at all of them. Our part of speech tagging isn't
perfect, but it's very good.
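The "take the most frequent tag" step can be sketched roughly as follows. This is only an illustration, not the actual pipeline: it assumes each token arrives with its candidate tags already ranked from most to least likely, which is a simplification of real CLAWS output. The function names and the sample tokens are hypothetical; NN1, VV0, VVG, and AT are C7 tags.

```python
def pick_primary_tag(candidates):
    """Return the top-ranked tag from a list of candidate tags.

    Assumption: the list is ordered from most to least likely.
    """
    if not candidates:
        raise ValueError("token has no candidate tags")
    return candidates[0]


def resolve_tags(tokens):
    """Map (word, [candidate tags]) pairs to (word, tag) pairs."""
    return [(word, pick_primary_tag(tags)) for word, tags in tokens]


# Hypothetical sample: "strike" and "meeting" are ambiguous, "the" is not.
tokens = [
    ("strike", ["NN1", "VV0"]),   # singular noun vs. base form of verb
    ("meeting", ["NN1", "VVG"]),  # noun vs. -ing participle
    ("the", ["AT"]),
]
resolved = resolve_tags(tokens)
```

A rule like this cannot tell whether a particular "meeting" is a noun or a verb in context, which is why the manual correction pass described above was still needed.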
3. How were the words selected? After
extracting the words and parts of speech, we selected the top
100,000 word form / part of speech pairs in COCA for the list. We did not distinguish upper case from lower case (e.g.
service and Service (as nouns) are just one entry). We
then spent 2-3 months comparing these lists to many other word lists (e.g. from WordNet or online dictionaries, or the BNC lists) to weed out "problem words".
In addition to looking at frequency, we also placed a distributional
threshold on each word, in that it had to occur in at least five different texts
(from the 180,000 or so texts in COCA). This eliminated, for example, words that
might have occurred 20-30 times total in COCA, but which were limited to just 2
or 3 different texts.
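The selection step (case-insensitive word form / part of speech pairs, a minimum number of texts, then the top N by frequency) could be sketched like this. It is a minimal sketch under stated assumptions, not the code used for the list: the input format (a dict of text id to tagged tokens), the function name, and the order in which the two filters are applied are all assumptions, and the sample data is invented.

```python
from collections import Counter, defaultdict


def select_entries(tagged_texts, top_n=100_000, min_texts=5):
    """Return the top_n (word, pos) pairs that occur in >= min_texts texts.

    tagged_texts: dict mapping a text id to a list of (word, pos) tokens.
    Words are lowercased, so "service" and "Service" share one entry.
    """
    freq = Counter()
    texts_seen = defaultdict(set)
    for text_id, tokens in tagged_texts.items():
        for word, pos in tokens:
            key = (word.lower(), pos)
            freq[key] += 1
            texts_seen[key].add(text_id)

    # Distributional threshold: drop pairs confined to too few texts.
    eligible = [
        (key, count)
        for key, count in freq.items()
        if len(texts_seen[key]) >= min_texts
    ]
    # Most frequent first; ties broken alphabetically for stable output.
    eligible.sort(key=lambda item: (-item[1], item[0]))
    return eligible[:top_n]


# Tiny invented corpus; min_texts is lowered to 2 so the effect is visible.
sample = {
    "t1": [("Service", "NN1"), ("service", "NN1"),
           ("zorb", "NN1"), ("zorb", "NN1"), ("zorb", "NN1")],
    "t2": [("service", "NN1"), ("run", "VV0")],
    "t3": [("run", "VV0"), ("run", "NN1")],
}
selected = select_entries(sample, top_n=10, min_texts=2)
```

Here "zorb" is as frequent as "service" but is dropped because all of its occurrences fall in a single text, which is the kind of word the threshold is meant to eliminate.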
4. Multi-word expressions. What about
fixed phrases of two words (each other, even if, kind of, depending on, at all),
three words (as well as, in favor of, in light of, as long as, with respect to),
or four words (all of a sudden, once and for all)? Do we list words like
kind, long, light, or favor as separate words? You will see some
words like this occurring both as (for example) a noun (the light
turned on) and as a conjunction, preposition, or adverb (in light
of). But if you see a "strange" word for one of these three parts of speech,
check the multi-word lists at the beginning of this paragraph to see if it's
part of one of those expressions.
5. Foreign words. Should the
following italicized words be in this list of English, or are they foreign
words: (in) absentia, (ad) nauseam, habeas / corpus,
(ad) infinitum, (in) memoriam, (Homo) sapiens, (a)
capella, (per) capita, (au) contraire, esprit
(de corps), ancien
(regime), café, (au) naturel, papier (mache), (de)
rigueur, bambino, barrio, macho, loco, latino, taco (grande),
gracias, (Yom) Kippur, kung (fu), feng shui, teriyaki, or banzai? There is no
simple answer for these and thousands of other words like these. We have
followed an admittedly subjective approach to such words -- some will be in the
list; some won't. Let us know if you find a word that shouldn't be there but is,
or vice versa (and how about that word?).
6. Proper nouns (and capitalized words).
Should the following italicized words be in the list: (Baltimore) Ravens,
March (the month), (Mr) Brown, (Daytona) Beach, AIDS,
or Rice (University)? Most people would say no. Just as we wouldn't
include Alice or Chicago or Smith in a frequency list, we
wouldn't include these either, because they are (at least quasi-) proper nouns.
The problem, though, is that the same words could be common nouns or adjectives,
e.g. ravens (actual birds), march (= "walk"), brown
(color), beach (waves and sand), visual aids, or rice
(food). One way to know how to handle these words is to look at what percentage
of the time the word is capitalized. For example, Alice and Chicago
are always capitalized, whereas beach and rice sometimes are not. We
looked at all words that are capitalized at least 20% of the time, to determine
whether they should be included in the list. For those words that are in the
list but that do have a heavy "proper noun" component, we indicate in the list
what percentage of the time they are capitalized (e.g. .46 or .81), and you can
decide whether you want to keep these words in the list.
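The capitalization check can be illustrated with a short sketch. It assumes a flat list of tokens in their original case and uses hypothetical function names; the 20% threshold is the one given above. It also ignores sentence-initial capitalization, which would inflate the ratios in real text, so treat it as an approximation rather than the procedure actually used.

```python
from collections import Counter


def capitalization_ratios(tokens):
    """Return {lowercased word: share of occurrences with a capital initial}."""
    total = Counter()
    capitalized = Counter()
    for token in tokens:
        key = token.lower()
        total[key] += 1
        if token[:1].isupper():
            capitalized[key] += 1
    return {word: capitalized[word] / total[word] for word in total}


def needs_review(ratios, threshold=0.20):
    """Words capitalized at least `threshold` of the time, sorted A-Z."""
    return sorted(word for word, ratio in ratios.items() if ratio >= threshold)


# Invented sample: "rice" is capitalized 1 time in 6, "beach" 1 in 2,
# and "chicago" every time.
tokens = ["Rice", "rice", "rice", "rice", "rice", "rice",
          "Beach", "beach", "Chicago", "Chicago"]
ratios = capitalization_ratios(tokens)
flagged = needs_review(ratios)
```

In this sample "beach" and "chicago" are flagged for manual review while "rice" falls below the threshold. The ratio kept for a flagged word corresponds to the capitalization figure (e.g. .46 or .81) shown in the list.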