The Corpus of Contemporary American English (COCA) is the most
widely used corpus in the world. In March 2020 it was updated for
the last time (with data up through Dec 2019), and the word
frequency data from the corpus was updated in April 2020. The
following are the major changes and improvements in the word
frequency data.
Feature |
Previous |
COCA 2020 |
Corpus: size |
400 million words |
More than twice
as large, at one billion words. This means that the data
is even more accurate for lower frequency words. |
Corpus: how up to date |
Texts from 1990 to ~2012 |
The most recent
texts are from Dec 2019. There are 20 million
words each year from 1990-2019 (plus about 240 million words
from blogs and other websites from 2013), so there are about
600 million new words of data since the
previous data was released in 2012. |
Corpus: genres |
Spoken, fiction, magazine, newspaper, academic. |
Same five genres
as before (with about 120-130 million words per genre), plus
three new genres:
-- Blog posts and other web pages
(120-130 million words for each of these two genres). So
much of what we consume nowadays comes from the web, and
these genres include many words that don't occur much
elsewhere (e.g. ebook, webpage, browsing, password,
template, meme, snarky, off-topic, downloadable,
open-source, updated, (to) monetize, upgrade, debunk,
archive, pirate).
-- TV and movie subtitles (130 million
words). This is by far the most informal language we've ever
had in COCA. Many studies (e.g.
A,
B, and
C) show that the data from subtitles
agrees with native speaker intuitions about their language even
better than the data from actual everyday conversation (like in
the BNC). Until now, COCA didn't really have this highly
informal language. |
Data: formats |
Separate lists for:
-- 60k lemmas
-- 60k genres
-- 100k word forms |
Now all purchases
include all three of these lists. In addition, the "genres"
list now includes the frequency of each of the 60,000 lemmas
in nearly 100 different sub-categories, like
Magazine-Sports, Newspaper-Finance, Academic-Medical,
Web-Reviews, Blogs-Personal,
or TV-Comedies.
The new data also includes something
that people have been wanting for a long time. Every
purchase also includes a list of the top 220,000 words
in the billion-word corpus (word forms, not lemmas). For
each word, there is helpful information on whether or not
the word might be a proper noun, how evenly the word is spread
across the entire corpus, and in which of the eight main
genres it is the most common. |
Data: accuracy |
Very good (compared to other corpora) |
Even better. We
have exhaustively compared the 60k lemmas list to the
previous COCA word frequency lists, as well as the iWeb
frequency lists. We have also compared each word to five
online dictionaries to see if the word occurs there, and (if
not) we have manually checked each of these words. No
frequency list will ever be 100% correct, but we believe
that the COCA 2020 lists are by far the most accurate word
frequency lists available anywhere. |
Pricing |
Separate prices for each purchase: 60k lemmas list, 100k
word forms list, 60k genres list, etc. |
All four of the
formats are now included for the same price as
one format previously. |
|
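For readers who want to see what "how well the word is spread across the corpus" and "in which genre it is the most common" can mean in practice, here is a minimal Python sketch. It is not the method used to build the COCA lists, which this page does not describe. It assumes a hypothetical set of per-genre counts for one word, uses illustrative round genre sizes in the 120-130 million range, and swaps in one common dispersion measure (deviation of proportions, where 0 means perfectly even and values near 1 mean concentrated in one genre). All names below are made up for the example.

```python
from typing import Dict

# Illustrative genre sizes in words. These are round numbers chosen for the
# example, NOT official COCA figures.
GENRE_SIZES: Dict[str, int] = {
    "blog": 125_000_000,
    "web": 125_000_000,
    "tv_movies": 130_000_000,
    "spoken": 125_000_000,
    "fiction": 120_000_000,
    "magazine": 125_000_000,
    "newspaper": 120_000_000,
    "academic": 120_000_000,
}


def per_million(count: int, size: int) -> float:
    """Normalize a raw count to occurrences per million words."""
    return count / size * 1_000_000


def top_genre(counts: Dict[str, int]) -> str:
    """Return the genre where the word has the highest per-million rate."""
    return max(
        GENRE_SIZES,
        key=lambda g: per_million(counts.get(g, 0), GENRE_SIZES[g]),
    )


def deviation_of_proportions(counts: Dict[str, int]) -> float:
    """Dispersion of a word across genres (0 = even, near 1 = concentrated).

    Compares the share of the word's occurrences in each genre with the
    share of the corpus that genre makes up, and halves the summed gaps.
    """
    total_words = sum(GENRE_SIZES.values())
    total_hits = sum(counts.get(g, 0) for g in GENRE_SIZES)
    if total_hits == 0:
        raise ValueError("word has no occurrences in any genre")
    return 0.5 * sum(
        abs(counts.get(g, 0) / total_hits - GENRE_SIZES[g] / total_words)
        for g in GENRE_SIZES
    )


# Hypothetical counts for one word form (invented for illustration).
example_counts = {"blog": 900, "web": 700, "tv_movies": 40, "academic": 5}
print(top_genre(example_counts))
print(round(deviation_of_proportions(example_counts), 3))
```

Normalizing to a per-million rate before picking the top genre matters because the genres are not exactly the same size, so raw counts alone would slightly favor the larger ones.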