The wordlists from the Corpus of Contemporary American English
(COCA) and the American National Corpus (ANC) are quite different. These differences stem from
the way the two corpora were created. The ANC has
just 22 million words and is heavily skewed in terms of genres and
sources. COCA is nearly 50 times as large and contains a much
wider range of genres and sources.
+COCA / -ANC
About 20-25% of the words in the top 5,000 COCA wordlist are effectively missing from the ANC list. In
other words, for about 20-25% of the top 5,000 lemmas in COCA, the word ranks at least twice
as far down in the ANC list (e.g. COCA #4000, ANC #8000 or lower).
Things get much messier at lower frequency levels, where the ANC lists
are missing 50-60% of the words in the COCA lists.
The following words are examples. They are in the top 3000-4000 words in COCA,
but (in this case) they are
at least four times farther down the ANC list (for example, #2000 in COCA,
#9000 in the ANC). As one can see, the lists are full of "everyday" words:
Adjectives: left, far, concerned, involved, supposed, Christian,
growing, clean, alone, married, Catholic, English, used, surprised,
spiritual, existing, living, fun, remaining, leading
Nouns: university, back, data, American, Republican, congress,
south, east, Democrat, troop, institute, Christmas, learning, sir, fat,
Jew, e-mail, academy, Indian, navy, teen, pine, Muslim, Olympics, handle
Verbs: need, stand, thank, lay, laugh, shake, smile, stare, drink,
lift, grab, lean, nod, stir, dance, bend, slide, kiss, whisper, glance,
pray, wave, bake, pause, shrug, cope, brush, sigh, excuse, hurry, burst,
spill, hug, blend
(Note that many of these words come from fiction and from "popular magazines". They occur very infrequently in the ANC, since the ANC has essentially no texts from fiction or popular magazines. COCA, on the other hand, has 120-130 million words from these genres.)
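The rank comparison described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `rank_map` and `displaced_words` are hypothetical, each list is assumed to hold unique entries ordered from most to least frequent, and the two short lists are made-up toy data, not the actual COCA or ANC wordlists.

```python
def rank_map(wordlist):
    """Map each word to its 1-based rank in a frequency-ordered list."""
    return {word: rank for rank, word in enumerate(wordlist, start=1)}


def displaced_words(source, target, top_n, factor):
    """Return words from the top_n of `source` that are absent from `target`
    or whose rank in `target` is at least `factor` times their source rank."""
    target_rank = rank_map(target)
    displaced = []
    for rank, word in enumerate(source[:top_n], start=1):
        other = target_rank.get(word)
        # e.g. factor=2: source #4000 counts if it is target #8000 or lower
        if other is None or other >= factor * rank:
            displaced.append(word)
    return displaced


# Toy, made-up lists (NOT real COCA or ANC data), most frequent first.
coca_toy = ["the", "need", "smile", "gene", "data", "kiss"]
anc_toy = ["the", "gene", "cell", "need", "data", "protein", "smile"]

twice_lower = displaced_words(coca_toy, anc_toy, top_n=6, factor=2)
four_times_lower = displaced_words(coca_toy, anc_toy, top_n=6, factor=4)
share = len(twice_lower) / 6  # fraction of the top 6 that are displaced
```

Swapping the first two arguments gives the comparison in the opposite direction (words prominent in the second list but displaced or absent in the first).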
+ANC / -COCA
On the other hand, about 20-25% of the words in the ANC top 5000 list
are not in the COCA list, and things are much messier for lower-frequency
words. The following are words in the top 5000 of the
ANC list that rank at least four times lower in COCA (e.g. ANC
#2000, COCA #9000). As one can see, they either reflect errors in the ANC (bad part of
speech or lemma) or are a function of the skewed text
composition of the ANC (apparently, lots of academic journal articles
on DNA sequencing):
Adjective: uh-huh, um-hum, binding, e-mail, amino, conserved,
mutant, genomic, molecular, incubated, viral, wild-type, purified,
bye-bye, cultured, locus, correlated, putative, phylogenetic,
endogenous, cytoplasmic, downstream, mammalian, catalytic, sequenced,
transfected, recombinant, transgenic, terminus, gene-expression,
eukaryotic
Noun: yeah, um, cell, gene, datum, protein, sequence, gonna,
tissue, acid, receptor, genome, mutation, tumor, huh, www, probe, cdna,
mhm, mrna, clone, assay, membrane, activation, transcription, chromosome
Verb: accord, detect, induce, calculate, isolate, label, activate,
usee, controll, bind, stain, clone, cluster, inhibit, code, underlie,
rang, amplify, overlap, school, sequence, encode, splice