Word frequency: based on the one-billion-word COCA corpus

COCA 20k / 60k lemmas list (compare to COCA 100k word forms list)

The 20,000-60,000 word list is available in several formats, as shown below.

1. Wordlist

Lemma, rank, part of speech, frequency, dispersion score
Text or Excel file: can be printed / copied
List sizes: 5,000, 20,000, 60,000

Short sample (see expanded sample: 6,000 entries: every tenth word 1-60,000)

rank	lemma / word	PoS	frequency	dispersion
7309	attic	n	2711	0.91
17311	tearful	j	542	0.93
27303	tailgate	v	198	0.85
37310	hydraulically	r	78	0.83
47309	unsparing	j	35	0.83
57309	embryogenesis	n	22	0.66
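
As a rough illustration of working with the text version of the list, the sketch below parses tab-separated rows like those in the sample above and keeps only the entries up to a chosen rank. The column names, the `load_wordlist` function, and the `top_n` helper are assumptions made for this example; the layout of the distributed files may differ.

```python
import csv
import io

# Rows copied from the short sample above. The header names are an
# assumption for this sketch; the real file layout may differ.
SAMPLE = (
    "rank\tlemma\tPoS\tfrequency\tdispersion\n"
    "7309\tattic\tn\t2711\t0.91\n"
    "17311\ttearful\tj\t542\t0.93\n"
    "27303\ttailgate\tv\t198\t0.85\n"
    "37310\thydraulically\tr\t78\t0.83\n"
)


def load_wordlist(text):
    """Parse tab-separated wordlist text into dicts with typed fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        rows.append({
            "rank": int(row["rank"]),
            "lemma": row["lemma"],
            "pos": row["PoS"],
            "frequency": int(row["frequency"]),
            "dispersion": float(row["dispersion"]),
        })
    return rows


def top_n(rows, n):
    """Keep entries whose rank is at most n (hypothetical helper)."""
    return [r for r in rows if r["rank"] <= n]


entries = load_wordlist(SAMPLE)
print([r["lemma"] for r in top_n(entries, 20000)])  # ['attic', 'tearful']
```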

2. Wordlist + genre frequency

Overall frequency (+dispersion), as above. But also includes:
Frequency in five main genres: spoken, fiction, popular magazine, newspaper, academic
Frequency in each of the 40+ sub-genres (e.g. MAG-Sports, NEWS-Financial, ACAD-Medicine)
With the frequency data for specific genres and sub-genres, you can create customized wordlists for specific purposes: medical, technology, sports, etc.
Excel file; can be printed / copied
List sizes: 5,000, 20,000, 60,000

Short sample (see expanded sample: 6,000 entries: every tenth word 1-60,000)
Note 1: Due to space constraints, in this sample only six of the 40+ sub-genres are shown (M1: MAG-Financial; M2: MAG-Science/Tech; N1: NEWS-Sports; N2: NEWS-Editorial; A1: ACAD-Law/PolSci; A2: ACAD-Medicine).
Note 2: The green shading for the five main genres highlights those words whose frequency in that genre is at least double what would otherwise be expected (based on genre size).

rank	lemma / word	PoS	disp	totFreq	spok	fic	mag	news	acad	M1	M2	N1	N2	A1	A2
25083	piglet	n	0.88	239	20	97	54	46	22	10	2	3	3	0	2
25088	woodsman	n	0.70	300	10	176	77	12	25	1	2	1	3	2	0
25090	candied	j	0.87	242	17	49	102	73	1	0	1	2	1	0	0
25093	metacognitive	j	0.69	306	0	0	0	0	306	0	0	0	0	0	0
25107	industry-wide	j	0.89	236	16	2	64	109	45	19	10	2	1	10	6
25108	health-food	j	0.85	246	10	19	154	55	8	6	4	7	1	0	2
25110	posterior	n	0.88	240	6	30	36	27	139	0	5	4	0	0	99
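
The "at least double the expected frequency" test in Note 2, and the idea of building a customized wordlist from genre data, can be sketched as follows. This is only an illustration: the actual genre sizes are not given on this page, so the code assumes the five main genres are equally sized (`GENRE_SHARE`), and the `overrepresented` and `custom_list` functions are hypothetical names. With real genre sizes the flagged genres could differ.

```python
GENRES = ("spok", "fic", "mag", "news", "acad")

# Frequencies in the five main genres, copied from the sample above.
SAMPLE = {
    "piglet": (20, 97, 54, 46, 22),
    "woodsman": (10, 176, 77, 12, 25),
    "candied": (17, 49, 102, 73, 1),
    "metacognitive": (0, 0, 0, 0, 306),
    "industry-wide": (16, 2, 64, 109, 45),
    "health-food": (10, 19, 154, 55, 8),
}

# ASSUMPTION: each genre makes up one fifth of the corpus.
# Replace these shares with the real genre proportions if you have them.
GENRE_SHARE = {genre: 0.2 for genre in GENRES}


def overrepresented(freqs, factor=2.0):
    """Return genres whose frequency is at least `factor` times the
    frequency expected from the genre's share of the corpus."""
    total = sum(freqs)
    flagged = []
    for genre, freq in zip(GENRES, freqs):
        expected = total * GENRE_SHARE[genre]
        if expected > 0 and freq >= factor * expected:
            flagged.append(genre)
    return flagged


def custom_list(genre):
    """Words from SAMPLE that are over-represented in the given genre."""
    return [word for word, freqs in SAMPLE.items()
            if genre in overrepresented(freqs)]


print(custom_list("fic"))   # ['piglet', 'woodsman']
print(custom_list("acad"))  # ['metacognitive']
```

The same approach would apply to the sub-genre columns (for example ACAD-Medicine) when building a list for a specific field.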

3. Wordlist + collocates (see www.collocates.info for full info)

More than 4,800,000 entries, showing which words occur most frequently with others.
200-300 collocates each for most of the top 20,000-30,000 words, and somewhat fewer for less-common words.
Collocates provide useful information on word meaning and usage
List sizes: 5,000, 20,000, 60,000

Short sample (see expanded sample: 45,390 entries; every hundredth word 1-60,000):

nodeID	node	nodePoS	collocate	collPoS	freq	MutInfo	preNode	postNode	% preNode
15349	smolder	v	still	r	76	4.39	74	2	0.97
15349	smolder	v	fire	n	59	6.33	39	20	0.66
15349	smolder	v	eye	n	43	4.41	24	19	0.55
15349	smolder	v	cigarette	n	26	6.93	17	9	0.65
15349	smolder	v	ash	n	15	7.42	5	10	0.33
15349	smolder	v	ember	n	14	10.62	4	10	0.28
15349	smolder	v	resentment	n	14	8.26	2	12	0.14
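
The last column of the sample can be derived from the two before it: it is the share of occurrences in which the collocate comes before the node word. The sketch below recomputes it from the `preNode` and `postNode` counts. Truncating to two decimals (rather than rounding) is an inference from the sample values, not a documented rule, and the `pre_node_share` function is a hypothetical name. MutInfo is not recomputed here because it depends on corpus-wide counts that are not shown on this page.

```python
# (collocate, freq, preNode, postNode) for the node "smolder",
# copied from the sample above.
ROWS = [
    ("still", 76, 74, 2),
    ("fire", 59, 39, 20),
    ("eye", 43, 24, 19),
    ("cigarette", 26, 17, 9),
    ("ash", 15, 5, 10),
    ("ember", 14, 4, 10),
    ("resentment", 14, 2, 12),
]


def pre_node_share(pre, post):
    """Share of occurrences where the collocate precedes the node,
    truncated (not rounded) to two decimal places."""
    total = pre + post
    return (pre * 100 // total) / 100


for collocate, freq, pre, post in ROWS:
    # In the sample, preNode + postNode always equals freq.
    assert pre + post == freq
    print(collocate, pre_node_share(pre, post))
```

A share close to 1 (as with "still") suggests the collocate nearly always comes before the node, while a low share (as with "resentment") suggests it usually follows.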

4. N-grams (see www.ngrams.info for full info)

Format #1: All n-grams that occur three times or more: 6.2 million 2-grams, 11.9 million 3-grams, 8.3 million 4-grams, 3.4 million 5-grams. Available with or without case sensitivity, and with or without part of speech.
Format #2: All distinct 2-5 grams in the corpus (hundreds of millions of rows of data). The n-grams table is linked to a "lexicon" table. With this data, you can run an unlimited number of queries against the corpus from your own machine.

Short sample (see expanded sample: 194,000 n-grams)
Note: for ease of presentation, the words themselves are displayed in this sample table, as in format #1 above. You can also see the part of speech for each word in the string. In format #2 above, there are two tables: a lexicon (with a unique wordID, along with word form, lemma, and PoS), and an n-grams table that has the wordID values from the lexicon table (as integer values, for smaller size and faster searching).

frequency	word1	word2	word3
1419	much	the	same
461	much	more	likely
432	much	better	than
266	much	more	difficult
235	much	of	the
226	much	more	than
195	much	less	a
194	much	like	a
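
To give a feel for the two-table layout of format #2, the sketch below builds a tiny lexicon table and a 3-grams table in an in-memory SQLite database and joins them to turn wordID values back into words. The table names, column names, and wordID values are assumptions made for this example and are not the schema of the distributed data; the frequencies are taken from the sample above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lexicon (wordID INTEGER PRIMARY KEY, word TEXT, lemma TEXT, PoS TEXT);
    CREATE TABLE ngrams3 (freq INTEGER, w1 INTEGER, w2 INTEGER, w3 INTEGER);
""")

# Hypothetical wordIDs. For simplicity the lemma repeats the word form and
# PoS is left empty; the real lexicon carries its own lemma and PoS values.
words = ["much", "the", "same", "more", "likely", "difficult", "than"]
conn.executemany(
    "INSERT INTO lexicon (wordID, word, lemma, PoS) VALUES (?, ?, ?, NULL)",
    [(i, w, w) for i, w in enumerate(words, start=1)],
)
word_id = {w: i for i, w in enumerate(words, start=1)}

# Frequencies copied from the sample above, stored as integer wordIDs.
sample = [
    (1419, "much", "the", "same"),
    (461, "much", "more", "likely"),
    (266, "much", "more", "difficult"),
    (226, "much", "more", "than"),
]
conn.executemany(
    "INSERT INTO ngrams3 (freq, w1, w2, w3) VALUES (?, ?, ?, ?)",
    [(f, word_id[a], word_id[b], word_id[c]) for f, a, b, c in sample],
)

# All 3-grams that begin with "much more", most frequent first.
results = conn.execute(
    """
    SELECT n.freq, a.word, b.word, c.word
    FROM ngrams3 AS n
    JOIN lexicon AS a ON a.wordID = n.w1
    JOIN lexicon AS b ON b.wordID = n.w2
    JOIN lexicon AS c ON c.wordID = n.w3
    WHERE a.word = ? AND b.word = ?
    ORDER BY n.freq DESC
    """,
    ("much", "more"),
).fetchall()

for row in results:
    print(row)
```

Storing integer IDs in the n-grams table keeps rows small, which matters at hundreds of millions of rows; on a real data set you would probably also want indexes on the wordID columns you filter on.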

5. eBook

Top 20-30 collocates for each word, grouped by part of speech, as well as synonyms (for most words)
Note that this file cannot be edited, printed, or copied from.
List sizes: 5,000, 10,000, 20,000

Short sample (see expanded sample: 2,700 entries; every seventh word 1-20,000):

1421 blow v
noun wind_•, whistle, air, _•nose, _•smoke, breeze_•, _•face, hair, _•kiss, head, window, horn, _•candle, _•mind, storm_•
misc _•away, _•through, _•across
out _•candle, window, _•breath, air, wind_•, _•smoke, _•knee, tire, _•match
up _•building, plot_•, bomb, plane, car, bridge, wind_•, threaten_•
off _•steam, head_•, roof_•, leg_•
● whoosh, gust, waft, puff || move, propel, drive, carry
27254 | 0.94 F

19964 bodice n
adj black, tight, white, embroidered, fitted, red, blue, gathered, beaded, pleated
noun dress, skirt, gown, lace, sleeve, _•ripper, back, satin, silk, waist
verb embroider, rip, pull, _•fit, feature_•, wear, cover, lace, cut, tuck_•
421 | 0.86 F

6. Printed book (available online, e.g. Routledge or Amazon)

Top 20-30 collocates for each word, grouped by part of speech.
Also contains 31 frequency-based, thematically-oriented lists
List size: 5,000 words

Short sample (see expanded sample: every seventh page in the book):
