COCA 20k / 60k lemmas list (compare to COCA 100k word forms list)

There are a number of different formats available for the 20,000-60,000 word list, as shown below. Click here to order.
1. Wordlist
- Lemma, rank, part of speech, dispersion score
- Text or Excel file; can be printed / copied
- List sizes: 5,000, 20,000, 60,000
Short sample (see expanded sample: 6,000 entries; every tenth word 1-60,000)
rank | lemma / word | PoS | frequency | dispersion
7309 | attic | n | 2711 | 0.91
17311 | tearful | j | 542 | 0.93
27303 | tailgate | v | 198 | 0.85
37310 | hydraulically | r | 78 | 0.83
47309 | unsparing | j | 35 | 0.83
57309 | embryogenesis | n | 22 | 0.66
2. Wordlist + genre frequency
- Overall frequency (+ dispersion), as above, but also includes:
- Frequency in the five main genres: spoken, fiction, popular magazine, newspaper, academic
- Frequency in each of the 40+ sub-genres (e.g. MAG-Sports, NEWS-Financial, ACAD-Medicine)
- With the frequency data for specific genres and sub-genres, you can create customized wordlists for specific purposes: medical, technology, sports, etc.
- Excel file; can be printed / copied
- List sizes: 5,000, 20,000, 60,000
Short sample (see expanded sample: 6,000 entries; every tenth word 1-60,000)
Note 1: Due to space constraints, only six of the 40+ sub-genres are shown in this sample (M1: MAG-Financial; M2: MAG-Science/Tech; N1: NEWS-Sports; N2: NEWS-Editorial; A1: ACAD-Law/PolSci; A2: ACAD-Medicine).
Note 2: The green shading for the five main genres highlights those words whose frequency in that genre is at least double what would otherwise be expected (based on genre size).
rank | lemma / word | PoS | disp | totFreq | spok | fic | mag | news | acad | M1 | M2 | N1 | N2 | A1 | A2
25083 | piglet | n | 0.88 | 239 | 20 | 97 | 54 | 46 | 22 | 10 | 2 | 3 | 3 | 0 | 2
25088 | woodsman | n | 0.70 | 300 | 10 | 176 | 77 | 12 | 25 | 1 | 2 | 1 | 3 | 2 | 0
25090 | candied | j | 0.87 | 242 | 17 | 49 | 102 | 73 | 1 | 0 | 1 | 2 | 1 | 0 | 0
25093 | metacognitive | j | 0.69 | 306 | 0 | 0 | 0 | 0 | 306 | 0 | 0 | 0 | 0 | 0 | 0
25107 | industry-wide | j | 0.89 | 236 | 16 | 2 | 64 | 109 | 45 | 19 | 10 | 2 | 1 | 10 | 6
25108 | health-food | j | 0.85 | 246 | 10 | 19 | 154 | 55 | 8 | 6 | 4 | 7 | 1 | 0 | 2
25110 | posterior | n | 0.88 | 240 | 6 | 30 | 36 | 27 | 139 | 0 | 5 | 4 | 0 | 0 | 99
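
The rule in Note 2 and the idea of a customized wordlist can be sketched in a few lines of Python. This is an illustration only, not the procedure used to build the list: the function names, the `GENRE_SHARE` values (which assume five equally sized genres) and the 25% cutoff for a medical list are all assumptions. Only the counts are taken from the sample above.

```python
# Illustrative sketch only. Counts come from the sample table;
# GENRE_SHARE and both thresholds are assumptions, not published figures.

GENRES = ["spok", "fic", "mag", "news", "acad"]

# ASSUMPTION: each main genre is an equal share of the corpus.
# Substitute the real genre sizes when working with the actual data.
GENRE_SHARE = {g: 0.2 for g in GENRES}

# (lemma, PoS, totFreq, spok, fic, mag, news, acad, A2 = ACAD-Medicine)
SAMPLE = [
    ("piglet", "n", 239, 20, 97, 54, 46, 22, 2),
    ("metacognitive", "j", 306, 0, 0, 0, 0, 306, 0),
    ("health-food", "j", 246, 10, 19, 154, 55, 8, 2),
    ("posterior", "n", 240, 6, 30, 36, 27, 139, 99),
]


def overrepresented_genres(row, factor=2.0):
    """Return the genres whose count is at least `factor` times the expected count."""
    tot_freq = row[2]
    counts = dict(zip(GENRES, row[3:8]))
    return [
        g for g in GENRES
        if counts[g] >= factor * tot_freq * GENRE_SHARE[g]
    ]


def medical_wordlist(rows, min_share=0.25):
    """Pick lemmas for which ACAD-Medicine supplies at least `min_share` of all tokens."""
    return [row[0] for row in rows if row[8] / row[2] >= min_share]


for row in SAMPLE:
    print(row[0], overrepresented_genres(row))
print(medical_wordlist(SAMPLE))
```

Under the equal-share assumption, piglet is flagged for fiction and health-food for magazines, and posterior is the only sample word that passes the hypothetical medical cutoff. With the real genre sizes the flags may differ.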
3. Wordlist + collocates (see www.collocates.info for full info)
- More than 4,800,000 entries, showing which words occur most frequently with others.
- 200-300 collocates each for most of the top 20,000-30,000 words, and somewhat fewer for less-common words.
- Collocates provide useful information on word meaning and usage.
- List sizes: 5,000, 20,000, 60,000
Short sample (see expanded sample: 45,390 entries; every hundredth word 1-60,000):
nodeID | node | nodePoS | collocate | collPoS | freq | MutInfo | preNode | postNode | % preNode
15349 | smolder | v | still | r | 76 | 4.39 | 74 | 2 | 0.97
15349 | smolder | v | fire | n | 59 | 6.33 | 39 | 20 | 0.66
15349 | smolder | v | eye | n | 43 | 4.41 | 24 | 19 | 0.55
15349 | smolder | v | cigarette | n | 26 | 6.93 | 17 | 9 | 0.65
15349 | smolder | v | ash | n | 15 | 7.42 | 5 | 10 | 0.33
15349 | smolder | v | ember | n | 14 | 10.62 | 4 | 10 | 0.28
15349 | smolder | v | resentment | n | 14 | 8.26 | 2 | 12 | 0.14
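
The columns in this sample can be related to one another with a short Python sketch. It assumes that `% preNode` is simply preNode divided by freq; the sample values agree with that to within rounding, but the page does not state the formula. The `Collocate` record and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Collocate:
    node: str
    collocate: str
    freq: int        # freq column
    mut_info: float  # MutInfo column
    pre_node: int    # preNode column
    post_node: int   # postNode column


# Rows copied from the sample for the node "smolder" (v).
ROWS = [
    Collocate("smolder", "still", 76, 4.39, 74, 2),
    Collocate("smolder", "fire", 59, 6.33, 39, 20),
    Collocate("smolder", "ash", 15, 7.42, 5, 10),
    Collocate("smolder", "ember", 14, 10.62, 4, 10),
    Collocate("smolder", "resentment", 14, 8.26, 2, 12),
]


def pre_node_share(c):
    """ASSUMPTION: '% preNode' = preNode / freq."""
    return c.pre_node / c.freq


def by_mutual_information(rows, min_freq=1):
    """Sort collocates from highest to lowest MutInfo, dropping those below min_freq."""
    kept = [c for c in rows if c.freq >= min_freq]
    return sorted(kept, key=lambda c: c.mut_info, reverse=True)


for c in by_mutual_information(ROWS):
    print(f"{c.collocate:12} MI={c.mut_info:5.2f} pre={pre_node_share(c):.2f}")
```

Sorting by MutInfo rather than by freq moves ember and resentment ahead of still, which is the most frequent collocate in the sample but has the lowest score.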
4. N-grams (see www.ngrams.info for full info)
- Format #1: All n-grams that occur three times or more: 6.2 million 2-grams, 11.9 million 3-grams, 8.3 million 4-grams, 3.4 million 5-grams. Available with or without case sensitivity, and with or without part of speech.
- Format #2: All distinct 2-5 grams in the corpus (hundreds of millions of rows of data). The n-grams table is linked to a "lexicon" table. With this data, you can run an unlimited number of queries against the corpus from your own machine.
Short sample (see expanded sample: 194,000 n-grams)
Note: for ease of presentation, the words themselves are displayed in this sample table, as in format #1 above. You can also see the part of speech for each word in the string. In format #2 above, there are two tables: a lexicon (with a unique wordID, along with word form, lemma, and PoS), and an n-grams table that has the wordID values from the lexicon table (as integer values, for smaller size and faster searching).
frequency | word1 | word2 | word3
1419 | much | the | same
461 | much | more | likely
432 | much | better | than
266 | much | more | difficult
235 | much | of | the
226 | much | more | than
195 | much | less | a
194 | much | like | a
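
The two-table layout described for format #2 can be pictured with a small in-memory SQLite database. This is a sketch under assumptions, not the schema shipped with the data: the table names (`lexicon`, `ngrams3`), the column names and the wordID values are invented for illustration. Only the frequencies and word strings come from the sample above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: all names are assumptions, not the shipped layout.
    CREATE TABLE lexicon (
        wordID INTEGER PRIMARY KEY,
        word   TEXT NOT NULL,
        lemma  TEXT,
        pos    TEXT
    );
    CREATE TABLE ngrams3 (
        freq INTEGER NOT NULL,
        w1   INTEGER NOT NULL REFERENCES lexicon(wordID),
        w2   INTEGER NOT NULL REFERENCES lexicon(wordID),
        w3   INTEGER NOT NULL REFERENCES lexicon(wordID)
    );
""")

# wordID values are made up; lemma and PoS are left empty because the
# sample table does not show them.
words = ["much", "the", "same", "more", "likely", "difficult", "than"]
conn.executemany(
    "INSERT INTO lexicon (wordID, word) VALUES (?, ?)",
    list(enumerate(words, start=1)),
)
word_id = {w: i for i, w in enumerate(words, start=1)}

# Frequencies and word strings taken from the sample 3-grams.
sample = [
    (1419, "much", "the", "same"),
    (461, "much", "more", "likely"),
    (266, "much", "more", "difficult"),
    (226, "much", "more", "than"),
]
conn.executemany(
    "INSERT INTO ngrams3 (freq, w1, w2, w3) VALUES (?, ?, ?, ?)",
    [(f, word_id[a], word_id[b], word_id[c]) for f, a, b, c in sample],
)


def trigrams_starting_with(first, second):
    """Return (freq, word1, word2, word3) rows for 3-grams that begin with two given words."""
    query = """
        SELECT n.freq, l1.word, l2.word, l3.word
        FROM ngrams3 AS n
        JOIN lexicon AS l1 ON l1.wordID = n.w1
        JOIN lexicon AS l2 ON l2.wordID = n.w2
        JOIN lexicon AS l3 ON l3.wordID = n.w3
        WHERE l1.word = ? AND l2.word = ?
        ORDER BY n.freq DESC
    """
    return conn.execute(query, (first, second)).fetchall()


for row in trigrams_starting_with("much", "more"):
    print(row)
```

Storing integer IDs in the n-grams table and joining to the lexicon only when strings are needed is what keeps the large table compact, which matches the "smaller size and faster searching" rationale given in the note.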
5. eBook
- Top 20-30 collocates for each word, grouped by part of speech, as well as synonyms (for most words)
- Note that this file cannot be edited, printed, or copied from
- List sizes: 5,000, 10,000, 20,000
Short sample (see expanded sample: 2,700 entries; every seventh word 1-20,000):

1421 blow v
noun wind•, whistle, air, •nose, •smoke, breeze•, •face, hair, •kiss, head, window, horn, •candle, •mind, storm•
misc •away, •through, •across
out •candle, window, •breath, air, wind•, •smoke, •knee, tire, •match
up •building, plot•, bomb, plane, car, bridge, wind•, threaten•
off •steam, head•, roof•, leg•
● whoosh, gust, waft, puff || move, propel, drive, carry
27254 | 0.94 F

19964 bodice n
adj black, tight, white, embroidered, fitted, red, blue, gathered, beaded, pleated
noun dress, skirt, gown, lace, sleeve, •ripper, back, satin, silk, waist
verb embroider, rip, pull, •fit, feature•, wear, cover, lace, cut, tuck•
421 | 0.86 F
6. Printed book (available online, e.g. Routledge or Amazon)
- Top 20-30 collocates for each word, grouped by part of speech.
- Also contains 31 frequency-based, thematically-oriented lists
- List size: 5,000 words
Short sample (see
expanded sample: every seventh page in the book):
