Word frequency data


The following questions and issues relate to the 100,000 word list.

1. What is the list based on? It is based on the 520 million word Corpus of Contemporary American English (COCA; July 2012 update of 450 million words), the 400 million word Corpus of Historical American English, the 100 million word British National Corpus, and the 100 million word Corpus of American Soap Operas.

2. What about part-of-speech tagging (especially -ing words)? We used the CLAWS 7 part-of-speech tagger (see the C7 tagset). In many cases CLAWS suggests multiple possible parts of speech (e.g. strike: singular noun or the base form of the verb). In these cases, we took the most frequent tag, as suggested by CLAWS. We then spent 2-3 months manually correcting the part-of-speech tags. This was particularly difficult for -ING words, which can be ambiguous between noun and verb (e.g. learning, meeting, thinking, beginning, living, teaching, reading, feeling) or between verb and adjective (e.g. leading, following, growing, changing, developing, missing, supporting). There are thousands of such words, and we looked at all of them. Our part-of-speech tagging isn't perfect, but it's very good.

3. How were the words selected? After extracting the words and parts of speech, we selected the top 100,000 word form / part-of-speech pairs in COCA for the list. We did not distinguish between upper and lower case (e.g. service and Service, as nouns, are just one entry). We then spent 2-3 months comparing these lists to many other word lists (e.g. from WordNet, online dictionaries, or the BNC lists) to weed out "problem words". In addition to looking at frequency, we also placed a distributional threshold on each word: it had to occur in at least five different texts (out of the 180,000 or so texts in COCA). This eliminated, for example, words that might have occurred 20-30 times total in COCA but that were limited to just 2 or 3 different texts.
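The selection step described above (fold case, count frequency per word / part-of-speech pair, require at least five different texts, keep the top 100,000) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure actually used to build the list: the function name select_entries, the (text_id, word, pos) input format, and the sample data (including the made-up word "zorb") are all hypothetical.

```python
from collections import defaultdict

MIN_TEXTS = 5     # distributional threshold stated in the text
TOP_N = 100_000   # list size stated in the text


def select_entries(tokens, top_n=TOP_N, min_texts=MIN_TEXTS):
    """Select the most frequent (word, pos) pairs that pass the text threshold.

    tokens: iterable of (text_id, word, pos) tuples (assumed input format).
    Returns a list of (word, pos, frequency, number_of_texts) tuples,
    sorted by descending frequency.
    """
    freq = defaultdict(int)
    texts = defaultdict(set)
    for text_id, word, pos in tokens:
        key = (word.lower(), pos)   # fold case: service and Service are one entry
        freq[key] += 1
        texts[key].add(text_id)

    # Keep only pairs that occur in enough different texts.
    kept = [
        (word, pos, freq[(word, pos)], len(texts[(word, pos)]))
        for (word, pos) in freq
        if len(texts[(word, pos)]) >= min_texts
    ]
    # Highest frequency first; ties are broken alphabetically for stable output.
    kept.sort(key=lambda entry: (-entry[2], entry[0], entry[1]))
    return kept[:top_n]


# Invented sample data: "service" occurs once in each of 6 texts,
# while the made-up word "zorb" occurs 25 times but in only 2 texts.
tokens = [(f"t{i}", "Service" if i % 2 else "service", "NN1") for i in range(6)]
tokens += [(f"t{i % 2}", "zorb", "NN1") for i in range(25)]

selected = select_entries(tokens)
print(selected)
```

In this sample, "zorb" is more frequent than "service" but is dropped because it is confined to two texts, which is the kind of case the distributional threshold is meant to catch.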

4. Multi-word expressions. What about fixed phrases of two words (each other, even if, kind of, depending on, at all), three words (as well as, in favor of, in light of, as long as, with respect to), or four words (all of a sudden, once and for all)? Do we list words like kind, long, light, or favor as separate words? You will see some words like this listed both as (for example) a noun (the light turned on) and as a conjunction, preposition, or adverb (in light of). If you see a "strange" word listed as a conjunction, preposition, or adverb, check the multi-word expressions at the beginning of this paragraph to see whether it appears there.

5. Foreign words. Should the following italicized words be in this list of English words, or are they foreign words: (in) absentia, (ad) nauseam, habeas / corpus, (ad) infinitum, (in) memoriam, (Homo) sapiens, (a) capella, (per) capita, (au) contraire, esprit (de corps), ancien (regime), café, (au) naturel, papier (mache), (de) rigueur, bambino, barrio, macho, loco, latino, taco (grande), gracias, (Yom) Kippur, kung (fu), feng shui, teriyaki, or banzai? There is no simple answer for these and thousands of other words like them. We have followed an admittedly subjective approach to such words: some will be in the list; some won't. Let us know if you find a word that shouldn't be there but is, or vice versa (and how about that word?).

6. Proper nouns (and capitalized words). Should the following italicized words be in the list: (Baltimore) Ravens, March (the month), (Mr) Brown, (Daytona) Beach, AIDS, or Rice (University)? Most people would say no. Just as we wouldn't include Alice or Chicago or Smith in a frequency list, we wouldn't include these either, because they are (at least quasi-) proper nouns. The problem is that the same words can also be common nouns or adjectives, e.g. ravens (actual birds), march (= "walk"), brown (color), beach (waves and sand), visual aids, or rice (food). One way to decide how to handle these words is to look at what percentage of the time the word is capitalized. For example, Alice and Chicago are always capitalized, whereas beach and rice are sometimes not. We looked at all words that are capitalized at least 20% of the time to determine whether they should be included in the list. For those words that are in the list but that do have a heavy "proper noun" component, we indicate in the list the proportion of the time they are capitalized (e.g. .46 or .81), and you can decide whether you want to keep those words in the list.
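The capitalization check described above can be sketched as follows. This is a minimal illustration, not the method actually used for the list: the function names and the sample tokens are hypothetical, and the sketch treats any token that starts with an upper-case letter as capitalized, so it does not correct for sentence-initial capitals.

```python
from collections import Counter

REVIEW_THRESHOLD = 0.20   # "capitalized at least 20% of the time" (from the text)


def capitalization_ratios(tokens):
    """Map each lower-cased word to the share of its tokens that are capitalized."""
    total = Counter()
    capitalized = Counter()
    for token in tokens:
        key = token.lower()
        total[key] += 1
        if token[:1].isupper():   # simplification: first letter is upper case
            capitalized[key] += 1
    return {word: capitalized[word] / total[word] for word in total}


def needs_review(ratios, threshold=REVIEW_THRESHOLD):
    """Return, in alphabetical order, the words at or above the threshold."""
    return sorted(word for word, ratio in ratios.items() if ratio >= threshold)


# Invented sample counts, for illustration only.
tokens = ["Chicago"] * 4 + ["Beach"] + ["beach"] * 3 + ["Rice"] + ["rice"] * 9

ratios = capitalization_ratios(tokens)
flagged = needs_review(ratios)
print(ratios)
print(flagged)
```

With these sample counts, "chicago" and "beach" reach the 20% threshold and would be looked at by hand, while "rice" does not. The stored ratio is the same kind of figure as the .46 or .81 values shown in the list.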