Word frequency data

The wordlists from the Corpus of Contemporary American English (COCA) and the American National Corpus (ANC) are quite different, because of the way in which the two corpora were created. The ANC has just 22 million words and is heavily skewed in terms of genres and sources. COCA is nearly 50 times as large and contains a much wider range of genres and sources.


About 20-25% of the words in the top 5000 COCA wordlist are not in the corresponding ANC list. That is, for these lemmas, the word ranks at least twice as far down in the ANC list (e.g. COCA #4000, ANC #8000 or lower). Things get much messier at lower frequency levels, where the ANC lists are missing 50-60% of the words in the COCA lists.
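The comparison described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used to produce the figures here: the function name `demoted_words`, its parameters, and the toy ranks are all hypothetical, and treating a word that is absent from the second list as "demoted" is an assumption.

```python
from typing import Dict, List


def demoted_words(
    ref_ranks: Dict[str, int],
    other_ranks: Dict[str, int],
    top_n: int = 5000,
    factor: float = 2.0,
) -> List[str]:
    """Return words in the top `top_n` of `ref_ranks` that rank at least
    `factor` times farther down in `other_ranks`.

    Assumption: a word missing from `other_ranks` altogether is also flagged.
    """
    flagged = []
    # Walk the reference list from most to least frequent (rank 1 first).
    for word, rank in sorted(ref_ranks.items(), key=lambda kv: kv[1]):
        if rank > top_n:
            break
        other = other_ranks.get(word)
        if other is None or other >= factor * rank:
            flagged.append(word)
    return flagged


# Toy ranks, invented for illustration only (not real COCA or ANC data).
coca = {"the": 1, "need": 2, "smile": 3, "gene": 4}
anc = {"the": 1, "need": 9, "gene": 2}

flagged = demoted_words(coca, anc, top_n=4, factor=2.0)
share = len(flagged) / 4  # fraction of the top 4 reference words flagged
print(flagged, share)
```

Swapping the two arguments gives the comparison in the other direction (ANC as the reference list), and setting `factor=4.0` gives a stricter cutoff. With real data the keys would more likely be (lemma, part of speech) pairs than bare strings.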

The following words are examples. They are among the top 3000-4000 words in COCA, but (in this case) they are at least four times farther down the ANC list (for example, #2000 in COCA, #9000 in the ANC). As one can see, many of them are "everyday" words:

Adjectives: left, far, concerned, involved, supposed, Christian, growing, clean, alone, married, Catholic, English, used, surprised, spiritual, existing, living, fun, remaining, leading

Nouns: university, back, data, American, Republican, congress, south, east, Democrat, troop, institute, Christmas, learning, sir, fat, Jew, e-mail, academy, Indian, navy, teen, pine, Muslim, Olympics, handle

Verbs: need, stand, thank, lay, laugh, shake, smile, stare, drink, lift, grab, lean, nod, stir, dance, bend, slide, kiss, whisper, glance, pray, wave, bake, pause, shrug, cope, brush, sigh, excuse, hurry, burst, spill, hug, blend

(Note that many of these words come from fiction and from "popular magazines". They occur very infrequently in the ANC, since the ANC has essentially no texts from fiction or popular magazines. COCA, on the other hand, has 120-130 million words from these genres.)


Conversely, about 20-25% of the words in the ANC top 5000 list are not in the COCA list, and things are much messier for lower-frequency words. The following words are in the top 5000 of the ANC list but are at least four times farther down the COCA list (e.g. ANC #2000, COCA #9000). As one can see, they either reflect errors in the ANC (bad part of speech or lemma) or are a function of the skewed text composition of the ANC (apparently, lots of academic journal articles on DNA sequencing):

Adjectives: uh-huh, um-hum, binding, e-mail, amino, conserved, mutant, genomic, molecular, incubated, viral, wild-type, purified, bye-bye, cultured, locus, correlated, putative, phylogenetic, endogenous, cytoplasmic, downstream, mammalian, catalytic, sequenced, transfected, recombinant, transgenic, terminus, gene-expression, eukaryotic

Nouns: yeah, um, cell, gene, datum, protein, sequence, gonna, tissue, acid, receptor, genome, mutation, tumor, huh, www, probe, cdna, mhm, mrna, clone, assay, membrane, activation, transcription, chromosome

Verbs: accord, detect, induce, calculate, isolate, label, activate, usee, controll, bind, stain, clone, cluster, inhibit, code, underlie, rang, amplify, overlap, school, sequence, encode, splice