Word frequency data

1. What are the lists based on? One set of datasets is based on iWeb, which contains about 14 billion words from an extremely wide range of websites. The main set of wordlists is based on the one-billion-word Corpus of Contemporary American English, which provides data from a wide range of genres of English.

2. What about part of speech tagging (especially -ing words)? We used the CLAWS 7 part of speech tagger (see the C7 tagset). In many cases CLAWS suggests multiple possible parts of speech (e.g. strike: singular noun or the base form of the verb). In these cases, we took the most frequent tag, as suggested by CLAWS. We then spent 2-3 months manually correcting the part of speech tags. This was particularly difficult in the case of, for example, -ing words (e.g. noun/verb: learning, meeting, thinking, beginning, living, teaching, reading, feeling; or verb/adjective: leading, following, growing, changing, developing, missing, supporting). There are thousands of such words, and we looked at all of them. Our part of speech tagging isn't perfect, but it's very good.
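As a rough illustration of the "most frequent tag" step described above, here is a minimal Python sketch. It assumes we already have the list of tags assigned to each occurrence of a word form; the function name and the sample data are hypothetical, and the actual CLAWS output format and the authors' procedure may well differ. NN1 (singular common noun) and VV0 (base form of lexical verb) are tags from the C7 tagset.

```python
from collections import Counter

def most_frequent_tag(observed_tags):
    """Return the tag that occurs most often in a list of tags for one word form."""
    counts = Counter(observed_tags)
    # most_common(1) returns a one-element list of (tag, count) pairs
    tag, _count = counts.most_common(1)[0]
    return tag

# Hypothetical tags for five occurrences of "strike" (not real corpus data)
strike_tags = ["VV0", "NN1", "NN1", "VV0", "NN1"]
print(most_frequent_tag(strike_tags))
```

A default chosen this way is only a starting point; the manual correction pass described above is what handles the hard cases such as -ing words.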

3. How were the words selected? After extracting the words and parts of speech, we selected the top 60,000 lemmas (where lemma = "dictionary headword", e.g. decide = decide, decides, decided, deciding). We did not distinguish words by capitalization (e.g. service and Service, as nouns, are a single entry). We then spent 2-3 months comparing these lists to many other word lists (e.g. from WordNet, online dictionaries, or the BNC lists) to weed out "problem words".
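The selection step above can be sketched roughly as follows. This is a minimal Python illustration, assuming the corpus has already been reduced to (word form, lemma, part of speech) triples; the function name, the tuple layout, and the sample data are hypothetical and are not taken from the actual pipeline.

```python
from collections import Counter

def top_lemmas(tokens, n):
    """Count (lemma, pos) pairs, ignoring capitalization, and return the n most frequent."""
    counts = Counter()
    for _form, lemma, pos in tokens:
        # Fold case so that "service" and "Service" count as one entry
        counts[(lemma.lower(), pos)] += 1
    return counts.most_common(n)

# Hypothetical sample of tagged, lemmatized tokens
sample = [
    ("decides", "decide", "v"),
    ("decided", "decide", "v"),
    ("Service", "Service", "n"),
    ("service", "service", "n"),
    ("deciding", "decide", "v"),
    ("light", "light", "n"),
]
print(top_lemmas(sample, 2))
```

In the real lists, n would be 60,000, and the result would still need the manual comparison against other word lists described above.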

4. Multi-word expressions. What about fixed phrases like (two words) each other, even if, kind of, depending on, at all; (three words) as well as, in favor of, in light of, as long as, with respect to; or (four words) all of a sudden, once and for all? Do we list words like kind, long, light, or favor as separate words? You will see some words like this occurring both as (for example) a noun (the light turned on) and as a conjunction, preposition, or adverb (in light of). But if you see a "strange" word listed as one of these three parts of speech, check the multi-word expressions at the beginning of this paragraph to see if it's listed there.
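One way to do the check suggested above is a simple lookup. The following Python sketch uses only the example expressions listed in this paragraph (the full multi-word lists are much longer); the constant and function names are hypothetical.

```python
# Example expressions taken from the paragraph above, not the complete lists
MULTI_WORD_EXPRESSIONS = [
    "each other", "even if", "kind of", "depending on", "at all",
    "as well as", "in favor of", "in light of", "as long as", "with respect to",
    "all of a sudden", "once and for all",
]

def expressions_containing(word):
    """Return the multi-word expressions in which the given word appears as a whole word."""
    target = word.lower()
    return [expr for expr in MULTI_WORD_EXPRESSIONS if target in expr.split()]

print(expressions_containing("light"))
print(expressions_containing("favor"))
```

If a word tagged as a conjunction, preposition, or adverb looks odd, a non-empty result here suggests the tag comes from a fixed phrase.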

5. Foreign words. Should the following words (context in parentheses) be in this list of English words, or are they foreign words: (in) absentia, (ad) nauseam, habeas / corpus, (ad) infinitum, (in) memoriam, (Homo) sapiens, (a) capella, (per) capita, (au) contraire, esprit (de corps), ancien (regime), café, (au) naturel, papier (mache), (de) rigueur, bambino, barrio, macho, loco, latino, taco (grande), gracias, (Yom) Kippur, kung (fu), feng shui, teriyaki, or banzai? There is no simple answer for these and thousands of other words like them. We have followed an admittedly subjective approach to such words: some will be in the list; some won't. Let us know if you find a word that shouldn't be there but is, or vice versa (and how about that word?).

6. Proper nouns (and capitalized words). Should the following words (context in parentheses) be in the list: (Baltimore) Ravens, March (the month), (Mr) Brown, (Daytona) Beach, AIDS, or Rice (University)? Most people would say no. Just as we wouldn't include Alice or Chicago or Smith in a frequency list, we wouldn't include these either, because they are (at least quasi-) proper nouns. The problem is that the same words can also be common nouns or adjectives, e.g. ravens (actual birds), march (= "walk"), brown (color), beach (waves and sand), visual aids, or rice (food). One way to decide how to handle these words is to look at what percentage of the time the word is capitalized. For example, Alice and Chicago are always capitalized, whereas beach and rice sometimes are not. We looked at all words that are capitalized at least 20% of the time to determine whether they should be included in the list. For those words that are in the list but that do have a heavy "proper noun" component, we indicate in the list what proportion of the time they are capitalized (e.g. .46 or .81), and you can decide whether you want to keep those words in the list as well.
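The capitalization check described above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample occurrences are hypothetical, and the sketch ignores complications such as sentence-initial capitalization, which a real pipeline would need to handle.

```python
def capitalization_ratio(occurrences):
    """Return the fraction of occurrences whose first letter is uppercase."""
    if not occurrences:
        return 0.0
    capitalized = sum(1 for form in occurrences if form[:1].isupper())
    return capitalized / len(occurrences)

def needs_review(occurrences, threshold=0.20):
    """Flag a word for manual review if it is capitalized at least `threshold` of the time."""
    return capitalization_ratio(occurrences) >= threshold

# Hypothetical occurrences (not real corpus counts)
beach = ["beach", "Beach", "beach", "beach", "Beach"]
rice = ["rice"] * 9 + ["Rice"]

print(round(capitalization_ratio(beach), 2), needs_review(beach))
print(round(capitalization_ratio(rice), 2), needs_review(rice))
```

The ratio computed here corresponds to the value shown in the list (e.g. .46 or .81), so the same number can be used to filter out words with a heavy proper-noun component.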