|
COCA+ 100,000 list (see
samples) |
COCA 20,000 / 60,000 lists (see
samples) |
Lemma / word form |
Word forms. This allows you to see
the frequency of the different part-of-speech tags within noun, verb, and
adjective, e.g. paper (nn1) papers (nn2) ||
listen (vv0) listens (vvz) listening (vvg)
listened (vvd) listened (vvn) || fast (jj)
faster (jjr) fastest (jjt). This is probably the better
option for linguists, for those
who do natural language processing, or for those who are developing an
application where they need a list of each individual word form. Note
that a word can be listed multiple times for different parts of speech
(e.g. run as a noun and as a verb), so the list contains 86,413
distinct words (e.g. with all forms of run counted just
once). |
Lemmas. While the main parts of
speech are separated (e.g. separate entries for point as a noun and as a
verb), all uses within noun, verb, etc. are grouped together, i.e. the
frequencies for listen, listens, listening, and listened
are all grouped together under listen (verb). This may be the
better option for some teachers and learners. |
Corpora |
Provides data from the 520 million word
Corpus of Contemporary American
English (COCA; 2012 update of 450 million words), the 400 million word
Corpus of Historical American
English, the 100 million word
British National Corpus, and the 100 million word
Corpus of American Soap Operas.
|
Data from just the
Corpus of Contemporary American
English (COCA) (version from 2010 -- 400 million words) |
Genre frequency |
Provides the frequency for COCA spoken,
fiction, popular magazine, newspaper, and academic, as well as BNC
spoken, fiction, popular magazine, newspapers, non-academic journals,
academic journals, and miscellaneous. For COHA, the lists also show the
frequency in the three main time periods of COHA (1950s-1980s,
1900s-1940s, and 1810s-1890s). |
No frequency information for genres in the
standard word lists. However, researchers can purchase the lists with genre
frequency for COCA only (see sample):
the five main genres, as well as 40+ sub-genres (e.g.
Newspaper-Sports or Academic-Medicine). |
Format / distribution |
As Excel files (see a sample:
xlsx,
xls), with hyperlinks for each of
the 100,000 words, for one-click searches to the online corpora. Also
available as a text file, for those who want to import the data into
another program (such as a relational database). Both formats are
included in the purchase price. |
Available only as text files (basic word
lists and collocates lists) or as an Excel file (genre frequencies
list), but with no hyperlinks for online queries. |
# texts |
Sometimes it's useful to know not just how
many times a word occurred in a corpus (or a particular genre of a
corpus), but also in how many texts it occurred (e.g. 37,273 of 182,893
texts overall, or 20.4% of all texts), which lets you see how well it's
"spread across" the texts in a corpus. The 100,000 word lists provide
this data for each corpus (COCA, COHA, BNC, SOAP) as well as each genre
in COCA and the BNC, and the three main time periods of COHA
(1950s-1980s, 1900s-1940s, and 1810s-1890s). |
No information on the number of texts. |
|
|
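
The "# texts" figure described above is a simple ratio, and the following minimal Python sketch shows how it could be computed. It uses the counts quoted in the table (37,273 of 182,893 texts); the function name text_coverage is hypothetical and is not part of the word lists or of any corpus tool.

```python
def text_coverage(texts_with_word: int, total_texts: int) -> float:
    """Return the percentage of texts in which a word occurs.

    Hypothetical helper for illustration only; the word lists
    supply the raw counts, not this function.
    """
    if total_texts <= 0:
        raise ValueError("total_texts must be positive")
    if not 0 <= texts_with_word <= total_texts:
        raise ValueError("texts_with_word must be between 0 and total_texts")
    return 100.0 * texts_with_word / total_texts


# Counts taken from the example in the "# texts" row.
coverage = text_coverage(37_273, 182_893)
print(f"{coverage:.1f}% of all texts")  # prints "20.4% of all texts"
```

A word with a high overall frequency but a low percentage here is concentrated in relatively few texts, which is the kind of uneven spread the "# texts" column is meant to reveal.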