1. What are the lists based on?
One set of datasets is based on iWeb,
which contains about 14 billion words from an extremely wide range of websites. The
main set of wordlists is based on the one-billion-word Corpus
of Contemporary American English, which gives you data from a wide range of
genres of English.
2. What about part of speech tagging
(especially -ing)? We used the
CLAWS 7 part of speech tagger (see the
C7 tagset). In many cases
CLAWS suggests multiple possible parts of speech (e.g. strike: singular
noun or the base form of the verb). In these cases, we took the most frequent
tag, as suggested by CLAWS. We then spent 2-3 months manually correcting the
part of speech tags. This was particularly difficult in the case of, for
example, -ING words (e.g. noun/verb: learning, meeting, thinking, beginning,
living, teaching, reading, feeling or verb/adjective: leading, following,
growing, changing, developing, missing, supporting). There are thousands of
such words, and we looked at all of them. Our part of speech tagging isn't
perfect, but it's very good.
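
As a rough illustration of the "take the most frequent tag" step, here is a minimal Python sketch. It assumes each ambiguous word comes with a list of candidate tags and a score for each; the function name, the data layout, and the numbers below are invented for the example and are not actual CLAWS output.

```python
def pick_most_frequent_tag(candidates):
    """Return the tag with the highest score from (tag, score) pairs."""
    if not candidates:
        raise ValueError("no candidate tags supplied")
    # max() keeps the first pair it sees when two scores are tied.
    best_tag, _ = max(candidates, key=lambda pair: pair[1])
    return best_tag


# Hypothetical scores for "strike": NN1 = singular noun, VV0 = base form of verb.
strike_candidates = [("NN1", 0.35), ("VV0", 0.65)]
chosen = pick_most_frequent_tag(strike_candidates)
```

An automatic choice like this is only a starting point, which is why the manual correction described above was still needed.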
3. How were the words selected? After
extracting the words and parts of speech, we selected the top
60,000 lemmas (where lemma = "dictionary headword", e.g. decide =
decide, decides, decided, deciding). We did not distinguish between upper and lower case (e.g.
service and Service (as nouns) are just one entry). We
then spent 2-3 months comparing these lists to many other word lists (e.g. from WordNet
or online dictionaries, or the BNC lists) to weed out "problem words".
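
The counting part of this selection can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the input is already a sequence of (lemma, part of speech) pairs, and the function name and sample data are made up for the example. It does not cover the manual comparison against other word lists.

```python
from collections import Counter


def top_lemmas(tagged_lemmas, n):
    """Count (lemma, pos) pairs, ignoring case, and return the n most frequent."""
    counts = Counter((lemma.lower(), pos) for lemma, pos in tagged_lemmas)
    return counts.most_common(n)


# Tiny made-up sample: "Service" and "service" (both nouns) fall into one entry.
sample = [
    ("decide", "v"), ("Service", "n"), ("service", "n"),
    ("decide", "v"), ("decide", "v"), ("light", "n"),
]
top_two = top_lemmas(sample, 2)
```

For the real lists, n would be 60,000 and the input would be the whole corpus.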
4. Multi-word expressions. What about
fixed phrases like (two
words) each other, even if, kind of, depending on, at all; (three
words) as well as, in favor of, in light of, as long as, with respect to;
or (four words) all of a sudden, once and for all? Do we list words like
kind, long, light, or favor as separate words? You will see some
words like this occurring both as (for example) a noun (the light
turned on) and as a conjunction, preposition, or adverb (in light
of). But if you see a "strange" word for one of these three parts of speech,
check the multi-word lists at the beginning of this paragraph to see if it's
listed there.
5. Foreign words. Should the
following italicized words be in this list of English, or are they foreign
words: (in) absentia, (ad) nauseam, habeas / corpus,
(ad) infinitum, (in) memoriam, (Homo) sapiens, (a)
capella, (per) capita, (au) contraire, esprit
(de corps), ancien
(regime), café, (au) naturel, papier (mache), (de)
rigueur, bambino, barrio, macho, loco, latino, taco (grande),
gracias, (Yom) Kippur, kung (fu), feng shui, teriyaki, or banzai? There is no
simple answer for these and thousands of other words like these. We have
followed an admittedly subjective approach to such words -- some will be in the
list; some won't. Let us know if you find a word that shouldn't be there but is,
or vice versa (and how about that word?).
6. Proper nouns (and capitalized words).
Should the following italicized words be in the list: (Baltimore) Ravens,
March (the month), (Mr) Brown, (Daytona) Beach, AIDS,
or Rice (University). Most people would say no. Just as we wouldn't
include Alice or Chicago or Smith in a frequency list, we
wouldn't include these either, because they are (at least quasi-) proper nouns.
The problem, though, is that the same words could also be common nouns or adjectives,
e.g. ravens (actual birds), march (= "walk"), brown
(color), beach (waves and sand), visual aids, or rice
(food). One way to know how to handle these words is to look at what percentage
of the time the word is capitalized. For example, Alice and Chicago
are always capitalized, whereas beach and rice sometimes are not. We
looked at all words that are capitalized at least 20% of the time, to determine
whether they should be included in the list. For those words that are in the
list but that do have a heavy "proper noun" component, we indicate in the list
what percentage of the time they are capitalized (e.g. .46 or .81), and you can
decide for yourself whether you want to keep those words in the list.
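
The capitalization check can be approximated with a short Python sketch. It assumes a plain list of word tokens as input; the function names and the sample tokens are hypothetical, and only the 20% threshold comes from the description above.

```python
from collections import Counter


def capitalization_ratios(tokens):
    """Map each lowercased word to the share of its tokens that start with a capital."""
    total = Counter()
    capitalized = Counter()
    for token in tokens:
        key = token.lower()
        total[key] += 1
        if token[:1].isupper():
            capitalized[key] += 1
    return {word: capitalized[word] / total[word] for word in total}


def words_to_review(ratios, threshold=0.20):
    """Return words capitalized at least `threshold` of the time, sorted alphabetically."""
    return sorted(word for word, ratio in ratios.items() if ratio >= threshold)


# Made-up tokens: "rice" is capitalized 1 time in 6, "beach" 1 in 2, "Chicago" always.
sample_tokens = ["Rice", "rice", "rice", "rice", "rice", "rice",
                 "Chicago", "Chicago", "beach", "Beach", "walk"]
ratios = capitalization_ratios(sample_tokens)
flagged = words_to_review(ratios)
```

A simple count like this also picks up words capitalized only because they begin a sentence, so a flagged word is a candidate for manual review rather than an automatic exclusion.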