Word frequency: based on the one-billion-word COCA corpus

Our data is based on two different corpora: the 14 billion word iWeb corpus and the Corpus of Contemporary American English (COCA). COCA is the only corpus of English that is large (one billion words), up to date (the latest texts are from late 2019), and based on a wide range of genres (e.g. blogs and other web pages, TV/movie subtitles, more formal spoken language, fiction, newspapers, magazines, and academic writing). Most of the following refers to the COCA word lists.

Why worry about what corpus is used? After all, there are many English word lists and frequency lists out on the Web (see in particular the British National Corpus and the American National Corpus). Some are good, and others are very poor in quality. Not all frequency lists are created equal.

One should be very, very suspicious of word lists that are taken from messy web data, outdated texts, or corpora that are too small to effectively model what is happening in the real world. Even worse are word lists that don't give you any idea what they are based on. As the saying goes: "garbage in (bad texts), garbage out (frequency lists)".

Here are some questions you might ask yourself as you consider downloading or purchasing a word list:

Depth and accuracy. Why do so many word lists on the web contain just the top 3000-5000 words of English? (For example, see the New General Service List or the Oxford 3000 and 5000 lists.) Why not the top 20,000 or 60,000? It's because even a bad corpus (the collection of texts that the word lists are based on) can produce a moderately accurate list for the most frequent words. But because the corpus is neither deep nor balanced enough, you start getting messy data for medium and lower frequency words. Ask to see samples of the top 20,000 or 60,000 words (e.g. every 7th or 10th word). If they can't provide such samples, then you should be very, very suspicious of that word list.
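
The "every 7th or 10th word" check is easy to run yourself once you have a rank-ordered list. The following is a minimal sketch, assuming the list is already sorted from most to least frequent; the function name `sample_every_nth` and the placeholder entries `word1`, `word2`, ... are hypothetical and stand in for a real word list.

```python
def sample_every_nth(ranked_words, step=10, start=0):
    """Return every `step`-th entry of a rank-ordered word list."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return ranked_words[start::step]


# Hypothetical placeholder data standing in for a real 20,000-word list.
ranked = [f"word{rank}" for rank in range(1, 20001)]

sample = sample_every_nth(ranked, step=10)
print(len(sample))   # 2000
print(sample[:3])    # ['word1', 'word11', 'word21']
```

Reading through such a sample, especially the entries beyond rank 5,000 or so, gives a quick impression of whether the lower-frequency part of a list still looks like real vocabulary.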

Genres. Does the corpus contain texts from a wide variety of genres -- spoken, fiction, popular magazines, newspapers, academic, and web texts? Frequency lists that are based on just one of these (such as just web pages) may only contain 40-50% of the words from a more balanced corpus. For example, if the corpus is composed solely of texts from newspapers or web pages (which are very easy to get), but it doesn't have any texts from fiction, then words like eyes, stairs, smile (nouns); pale, faint, dark (adjectives); stare, fade, lean (verbs); and softly, gently (adverbs) will be very infrequent in the corpus. But most native speakers of English wouldn't think of eyes or dark or softly or lean (as a verb) as being particularly strange, which shows how skewed the data from a corpus that is based solely on newspapers or web texts might be. COCA, by contrast, is almost perfectly balanced across genres.
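
One way to make a figure like "40-50% of the words" concrete is to measure how much of a balanced list also appears in a single-genre list. The sketch below is only an illustration of that comparison; the function name `coverage` and both tiny word lists are made up for the example and are not taken from COCA or any other corpus.

```python
def coverage(reference_words, candidate_words):
    """Fraction of the reference list that also appears in the candidate list."""
    reference = set(reference_words)
    if not reference:
        raise ValueError("reference list must not be empty")
    return len(reference & set(candidate_words)) / len(reference)


# Hypothetical toy lists, for illustration only.
balanced = ["eyes", "stairs", "pale", "stare", "softly",
            "government", "percent", "market"]
news_only = ["government", "percent", "market", "election", "spokeswoman"]

print(coverage(balanced, news_only))  # 0.375
```

With real lists, the same calculation would be run over the top 20,000 or 60,000 entries of each list rather than a handful of words.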

Genres (more). Other lists, like the New General Service List or the Oxford 3000 and 5000 lists, don't allow you to see the frequency by genre -- to see whether a word is mainly informal or formal, or limited to a particular subject area. Our lists show the frequency by genre, as well as the frequency in 100+ sub-genres, such as Newspaper-Sports, Magazine-Financial, Blogs-Personal, or Academic-Medicine. The table below gives examples of words that are characteristic of each genre.

	noun	verb	adjective	adverb
TV/movies	sweetie, bro, fella, ma'am, sir, honey, dude, sweetheart, congratulation, babe	excuse, kid, hurry, gasp, pee, freak, calm, mess, thank, swear	okay, sorry, alright, weird, cute, nice, crazy, scared, stupid, insane	okay, alright, kinda, like, right, tomorrow, sure, all, here, anymore
fiction	doorway, gaze, forehead, cheek, grin, brow, chin, nostril, whisper, eyebrow	glance, nod, murmur, gesture, grin, squint, frown, glare, mutter, clench	damp, pale, blond, shut, slender, puzzled, faint, bare, dim, gray	softly, nervously, silently, sideways, abruptly, upright, calmly, last, loudly, cautiously
magazine	teaspoon, watercolor, skier, nebula, skillet, saucepan, astronomer, palette, ski, telescope	preheat, ski, simmer, chop, sprinkle, stir, coat, bake, rinse, hike	chopped, lightweight, medium, versatile, planetary, built-in, decorative, durable, ceramic, compact	thinly, finely, evenly, freshly, outdoors, lightly, indoors, comfortably, annually, famously
newspaper	homer, semifinal, spokeswoman, baseman, inning, cornerback, linebacker, postseason, playoff, quarterback	coach, rebound, staff, renovate, average, pitch, score, oversee, total, bat	all-star, freelance, consecutive, Shiite, Methodist, Olympic, downtown, upscale, longtime, saturated, veteran	nationwide, downtown, nationally, illegally, finely, annually, freshly, allegedly, daily, route, aggressively
Web	URL, browser, font, attribute, commenter, directory, password, server, template, functionality	upload, delete, update, download, google, email, blog, submit, upgrade, encode	applicable, anonymous, accessible, informative, updated, unavailable, valid, alternate, mobile, eligible	online, automatically, lastly, above, currently, below, intentionally, globally, remotely, explicitly
academic	subscale, coefficient, regression, fluency, variance, predictor, questionnaire, variable, adolescent, impairment	hypothesize, correlate, omit, assess, mediate, facilitate, compute, categorize, evaluate, underlie	instructional, qualitative, normative, longitudinal, spatial, differential, interpersonal, descriptive, conceptual, empirical	statistically, culturally, respectively, significantly, negatively, consequently, furthermore, moreover, thus, thereby

Size. COCA contains about one billion words of text, and each of the top 20,000 words occurs ~1000 times or more. In a small 10-20 million word corpus, some of these words would occur just 7-8 times. At that point, some lower frequency words might make it into the list "by chance", whereas others are left out. There is no such problem with COCA. (And iWeb is 14 times as large as COCA.)
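
The arithmetic behind this point can be sketched in a few lines. The example below scales a count of 1000 in a one-billion-word corpus down to a 10 million word corpus, and then estimates how often chance alone would push the observed count to 8 or fewer. The Poisson model for word counts is a simplifying assumption made here for illustration, and the function names are hypothetical.

```python
import math


def expected_count(reference_count, reference_size, target_size):
    """Scale a raw count from a reference corpus to a corpus of another size."""
    return reference_count * target_size / reference_size


def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson-distributed count with the given mean."""
    return sum(math.exp(-mean) * mean ** i / math.factorial(i)
               for i in range(k + 1))


COCA_SIZE = 1_000_000_000   # about one billion words
SMALL_SIZE = 10_000_000     # a small 10 million word corpus

# A word that occurs 1000 times in the large corpus...
mean_small = expected_count(1000, COCA_SIZE, SMALL_SIZE)
print(mean_small)           # 10.0 expected occurrences in the small corpus

# ...shows up 8 times or fewer in roughly a third of small corpora
# under the (assumed) Poisson model.
p_low = poisson_cdf(8, mean_small)
print(round(p_low, 2))      # 0.33
```

Under these assumptions, a word near the bottom of the top 20,000 has a sizable chance of looking much rarer (or much more common) than it really is in a small corpus, which is how words end up in or out of a list by chance.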

How recent is it? Language change happens. If the word list is based on 30- to 35-year-old texts (or much worse, 100-year-old public domain novels), then it will be missing many of the words from the modern language. COCA is based on texts from 1990-2019 (28 million words each year, plus blogs and other web pages from 2012-13) and iWeb was collected in 2017 -- or in other words, virtually right up to the current time.

Are they just word forms? Do you really want to see the individual frequency of shoe and shoes, or realize, realizes, realized, and realizing? Do you want to have the combined frequency of watch as a verb (they watch TV) and watch as a noun (his watch broke)? If the lists are simply taken from pages that are "scraped" from the web, they will just provide long lists of words, without grouping them meaningfully (e.g. shoe/shoes), or separating them when necessary (e.g. watch as a noun and as a verb). Both the COCA and iWeb word lists show the lemma (e.g. decide = decide, decides, decided, deciding) and group by part of speech (e.g. watch as a noun and as a verb).
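
The difference between counting word forms and counting lemmas by part of speech can be illustrated with a small sketch. It assumes the texts have already been lemmatized and tagged for part of speech; the rows and counts below are invented for the example and are not real COCA or iWeb frequencies.

```python
from collections import Counter

# Hypothetical rows: (word form, lemma, part of speech, count).
tagged_counts = [
    ("shoe", "shoe", "NOUN", 120),
    ("shoes", "shoe", "NOUN", 300),
    ("watch", "watch", "VERB", 500),
    ("watches", "watch", "VERB", 90),
    ("watched", "watch", "VERB", 410),
    ("watch", "watch", "NOUN", 150),
    ("watches", "watch", "NOUN", 40),
]


def by_word_form(rows):
    """Total counts per surface form, ignoring lemma and part of speech."""
    totals = Counter()
    for form, _lemma, _pos, count in rows:
        totals[form] += count
    return totals


def by_lemma_and_pos(rows):
    """Total counts per (lemma, part of speech) pair."""
    totals = Counter()
    for _form, lemma, pos, count in rows:
        totals[(lemma, pos)] += count
    return totals


print(by_word_form(tagged_counts)["watch"])                 # 650 (noun and verb mixed)
print(by_lemma_and_pos(tagged_counts)[("watch", "VERB")])   # 1000
print(by_lemma_and_pos(tagged_counts)[("watch", "NOUN")])   # 190
print(by_lemma_and_pos(tagged_counts)[("shoe", "NOUN")])    # 420
```

The form-based count merges the noun and verb uses of watch while splitting shoe and shoes; the lemma-based count does the opposite, which is the grouping described above.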

Are the words grouped in a meaningful way? Some word lists, such as the New General Service List, group words by "word families". These may be helpful for learners, but they often combine words that researchers might want to keep separate. (See this article from Applied Linguistics, Section 2.1.) The words in our lists are grouped in a way that we believe makes the most sense for researchers -- by lemma and part of speech, but not by word family.

Summary. There are many word frequency lists out on the web. Some are just OK, and some are truly bad. The frequency lists that we have created are the only ones available anywhere that are based on a large, recent, and balanced corpus of English.

Word frequency data