Word frequency: based on the one-billion-word COCA corpus

Why not just have AI generate word frequency data, such as the "top 100 words" that meet certain criteria? Because AI-generated word frequency data does a very poor job of modeling what actually happens in human-generated language, such as the texts in the COCA corpus.

This site contains what is probably the most accurate word frequency data for English. The data is based on the one billion word Corpus of Contemporary American English (COCA) -- the only corpus of English that is large, up-to-date, and balanced between many genres.

When you purchase the data, you have access to four different datasets, and you can use whichever ones are the most useful for you. Short samples are given below for each of these datasets, and you can also see much more complete samples (every tenth entry), as well as free copies of the top 5,000 entries for each list.

1 The most basic data shows the frequency of each of the top 60,000 words (lemmas) in each of the eight main genres in the corpus. Unlike word frequency data that is based only on web pages, the COCA data lets you see the frequency across genres, so you can tell whether a word is more informal (e.g. blogs or TV and movie subtitles) or more formal (e.g. academic). The following are just a few entries for words at different frequency levels (ranks 1-60,000).

rank	lemma	PoS	freq	# texts	disp	BLOG	WEB	TV/M	SPOK	FIC	MAG	NEWS	ACAD
614	describe	v	159521	81551	0.94	17718	25573	4270	15796	7866	20583	18065	49650
615	guess	v	159454	74761	0.96	21706	15355	57719	28378	23413	5878	5453	1552
616	choice	n	159277	82417	0.98	28879	23742	17114	16776	10416	21726	17835	22789
617	source	n	158588	74656	0.95	23426	28366	5171	11870	4764	26220	18954	39817
618	mom	n	158511	44697	0.95	12884	10313	66877	19934	25394	13766	8432	911
619	soon	r	158194	95115	0.98	18711	19451	26696	14647	29098	22532	17933	9126
620	director	n	158028	79105	0.94	14248	18521	5554	19383	5197	28196	51981	14948
15024	redhead	n	1766	1209	0.90	96	101	432	86	761	167	95	28
15025	despair	v	1766	1637	0.95	250	290	127	104	330	343	172	150
15026	pretentious	j	1766	1420	0.94	293	384	261	93	252	203	185	95
15027	disservice	n	1765	1580	0.94	437	292	53	322	44	206	276	135
15028	childlike	j	1765	1510	0.94	209	232	94	106	498	239	214	172
15029	complicit	j	1765	1450	0.93	510	330	83	230	99	164	131	218
15030	macaroni	n	1765	1170	0.92	84	101	317	125	315	393	412	18
30005	glutamate	n	372	159	0.77	40	67	11	19	0	101	8	126
30006	twisty	j	372	341	0.89	50	42	26	6	100	104	36	8
30007	lyricism	n	372	301	0.87	35	49	5	19	23	67	67	107
30008	peppery	j	372	331	0.86	17	16	5	15	68	114	134	3
30009	firebird	n	372	69	0.15	15	7	21	1	305	8	11	4
30010	wuss	n	372	323	0.90	55	36	188	21	35	28	8	1
30011	strafe	v	372	319	0.89	24	53	32	24	79	83	62	15
45003	thawing	n	115	102	0.81	5	11	2	10	18	26	21	22
45004	sugarless	j	115	97	0.82	12	7	20	1	21	38	12	4
45005	fold-up	j	115	107	0.83	5	7	6	5	41	29	20	2
45006	energizing	j	115	112	0.84	14	14	2	10	3	44	12	16
45007	endoplasmic	j	115	65	0.64	5	36	4	0	0	14	0	56
45008	ejector	n	115	93	0.80	10	9	27	3	8	41	7	10
45009	saliency	n	115	76	0.69	3	9	0	6	0	1	1	95
60026	exudative	j	45	16	0.25	1	8	2	0	1	5	0	28
60027	shakti	n	45	21	0.44	25	1	0	0	0	10	3	6
60028	shearling	j	45	41	0.73	2	1	1	3	15	18	4	1
60029	sheerly	r	45	45	0.77	3	8	1	6	9	8	3	7
60030	short-selling	n	45	37	0.67	4	11	1	1	0	10	14	4
60031	phytic	j	45	19	0.48	10	16	0	0	0	4	0	15
60032	piedmontese	j	45	31	0.68	2	13	1	0	5	6	10	8
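
As a rough illustration of how the genre columns can be read, the sketch below parses a few rows copied from the sample above and reports the genre in which each lemma is most frequent, along with that genre's share of the lemma's total frequency. The function names (parse_row, top_genre) are made up for this example, and the rows are whitespace-separated here for readability; the layout of the purchased files may differ. The comparison uses raw counts, which is only a fair guide if the genres are of roughly comparable size.

```python
# Genre column order, as given in the sample table header.
GENRES = ["BLOG", "WEB", "TV/M", "SPOK", "FIC", "MAG", "NEWS", "ACAD"]

# Three rows copied from the sample table (whitespace-separated for this sketch).
SAMPLE = """\
614 describe v 159521 81551 0.94 17718 25573 4270 15796 7866 20583 18065 49650
618 mom n 158511 44697 0.95 12884 10313 66877 19934 25394 13766 8432 911
30009 firebird n 372 69 0.15 15 7 21 1 305 8 11 4
"""


def parse_row(line):
    """Turn one table row into a dict (hypothetical helper for this example)."""
    fields = line.split()
    rank, lemma, pos, freq, texts, disp = fields[:6]
    return {
        "rank": int(rank),
        "lemma": lemma,
        "pos": pos,
        "freq": int(freq),
        "texts": int(texts),
        "disp": float(disp),
        "genres": dict(zip(GENRES, map(int, fields[6:14]))),
    }


def top_genre(row):
    """Return (genre, share of total frequency) for the most frequent genre."""
    genre = max(row["genres"], key=row["genres"].get)
    return genre, row["genres"][genre] / row["freq"]


rows = [parse_row(line) for line in SAMPLE.splitlines()]
for row in rows:
    genre, share = top_genre(row)
    print(f"{row['lemma']:10s} {genre:5s} {share:.0%}")
```

In these three rows the genre counts add up to the freq column, so the share is simply the genre count divided by freq.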

2 Another dataset shows the frequency not only in the eight main genres, but also in nearly 100 "sub-genres" (e.g. Magazine-Sports, Newspaper-Finance, Academic-Medical, Web-Reviews, Blogs-Personal, and TV-Comedies).

3 A third dataset shows the frequency of the word forms of the top 60,000 lemmas:

lemmaRank	lemma	PoS	lemFreq	wordFreq	word form
13164	rehabilitate	v	2286	1033	rehabilitate
13164	rehabilitate	v	2286	749	rehabilitated
13164	rehabilitate	v	2286	452	rehabilitating
13164	rehabilitate	v	2286	52	rehabilitates
13165	subprime	j	2286	2079	subprime
13165	subprime	j	2286	207	sub-prime
13166	headline	v	2285	999	headline
13166	headline	v	2285	943	headlined
13166	headline	v	2285	343	headlining
13167	blue-collar	j	2285	2262	blue-collar
13167	blue-collar	j	2285	23	bluecollar
13168	deduce	v	2285	1088	deduce
13168	deduce	v	2285	965	deduced
13168	deduce	v	2285	136	deducing
13168	deduce	v	2285	96	deduces
13169	oats	n	2284	2284	oats
13170	stand-up	j	2284	2284	stand-up
13171	squeak	v	2283	894	squeaked
13171	squeak	v	2283	593	squeaking
13171	squeak	v	2283	583	squeak
13171	squeak	v	2283	213	squeaks
13172	naming	n	2283	2254	naming
13172	naming	n	2283	29	namings
13173	toad	n	2283	1483	toad
13173	toad	n	2283	800	toads
13174	clockwise	r	2282	2282	clockwise
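
One way to work with the word-form data is to group the forms under their lemma and express each form as a share of the lemma's total. The sketch below does this for a few rows copied from the sample above. The helper name group_forms is invented for the example, and the rows are whitespace-separated here for readability. In the sample rows the word-form frequencies add up to lemFreq; whether that holds for every lemma in the full dataset is an assumption worth checking on your own copy.

```python
from collections import defaultdict

# Rows copied from the sample table: lemmaRank, lemma, PoS, lemFreq, wordFreq, word form.
SAMPLE = """\
13164 rehabilitate v 2286 1033 rehabilitate
13164 rehabilitate v 2286 749 rehabilitated
13164 rehabilitate v 2286 452 rehabilitating
13164 rehabilitate v 2286 52 rehabilitates
13165 subprime j 2286 2079 subprime
13165 subprime j 2286 207 sub-prime
13173 toad n 2283 1483 toad
13173 toad n 2283 800 toads
"""


def group_forms(text):
    """Map (lemma, PoS, lemFreq) to a list of (word form, wordFreq) pairs."""
    lemmas = defaultdict(list)
    for line in text.splitlines():
        _rank, lemma, pos, lem_freq, word_freq, form = line.split()
        lemmas[(lemma, pos, int(lem_freq))].append((form, int(word_freq)))
    return dict(lemmas)


forms_by_lemma = group_forms(SAMPLE)
for (lemma, pos, lem_freq), forms in forms_by_lemma.items():
    total = sum(freq for _form, freq in forms)
    # Flag any lemma whose listed forms do not account for the full lemma frequency.
    status = "ok" if total == lem_freq else f"forms sum to {total}"
    print(f"{lemma} ({pos}), lemFreq {lem_freq}: {status}")
    for form, freq in forms:
        print(f"    {form:15s} {freq:5d}  {freq / lem_freq:.1%}")
```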

4 A final dataset shows the top 219,000 words in the billion word corpus -- each word that occurs at least 20 times and in at least 5 different texts. In this list, the words are not lemmatized (i.e. each form of a word is listed separately from its other forms) and the words are not tagged for part of speech. For each word, it shows in which genres the word is most common (again, to indicate +/- formal) and what share of its occurrences are capitalized (useful for determining +/- proper noun; see daymond and dentzer below).

word rank	word	freq	# texts	% caps	BLOG	WEB	TV/M	SPOK	FIC	MAG	NEWS	ACAD
100033	datatypes	89	20	0.18	8	74	0	0	0	0	0	7
100034	daymond	89	40	1.00	13	9	0	30	1	16	17	3
100035	deductively	89	68	0.03	4	18	3	0	0	3	1	60
100036	delp	89	25	1.00	2	5	2	0	52	10	12	6
100037	demoed	89	81	0.02	24	18	7	4	2	33	0	0
100038	dentzer	89	40	1.00	0	20	0	46	0	20	2	1
100039	denys	89	53	0.94	2	18	1	0	2	28	14	23
100040	despatch	89	50	0.17	6	38	6	0	17	12	1	9
100041	digged	89	33	0.04	6	69	6	1	2	1	1	0
100042	diigo	89	32	0.98	10	12	0	0	0	0	0	67
100043	dilator	89	25	0.02	0	2	5	1	6	5	0	70
100044	dimitroff	89	43	1.00	4	4	0	0	0	23	54	4
100045	disasterous	89	86	0.01	41	42	1	0	1	0	3	1
100046	disgracefully	89	83	0.12	18	11	7	10	11	18	5	8
100047	dispersant	89	43	0.07	9	25	8	26	0	6	3	12
100048	do-overs	89	72	0.11	18	11	11	15	2	10	12	4
100049	docter	89	48	0.88	7	22	12	12	0	9	27	0
100050	dollarization	89	20	0.10	2	52	0	2	0	3	10	20
100051	doohickey	89	72	0.11	9	8	53	0	9	7	2	0
100052	doozies	89	80	0.01	21	9	13	10	9	14	7	1
100053	dorrie	89	26	1.00	1	2	7	2	64	2	11	0
100054	dort	89	39	0.82	1	6	20	7	5	2	16	29
100055	doster	89	41	1.00	1	7	0	0	0	6	62	13
100056	dowler	89	40	1.00	4	14	0	13	6	17	20	15
100057	drainer	89	71	0.12	3	11	14	4	40	9	3	5
100058	drams	89	56	0.51	12	2	4	1	11	16	5	38
100059	drina	89	42	1.00	1	1	4	14	33	12	19	5
100060	druggy	89	74	0.10	11	3	8	8	17	21	11	1
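
The "% caps" column can be used as a simple filter for probable proper nouns. The sketch below applies a cutoff to a few rows copied from the sample above. The cutoff of 0.90 and the function name likely_proper_nouns are arbitrary choices made for this example, not something the dataset prescribes, and the rows are whitespace-separated here for readability.

```python
# Rows copied from the sample table:
# word rank, word, freq, # texts, % caps, then the eight genre counts.
SAMPLE = """\
100033 datatypes 89 20 0.18 8 74 0 0 0 0 0 7
100034 daymond 89 40 1.00 13 9 0 30 1 16 17 3
100035 deductively 89 68 0.03 4 18 3 0 0 3 1 60
100038 dentzer 89 40 1.00 0 20 0 46 0 20 2 1
100039 denys 89 53 0.94 2 18 1 0 2 28 14 23
100058 drams 89 56 0.51 12 2 4 1 11 16 5 38
"""

# Hypothetical cutoff: treat a word as a probable proper noun when at least
# this share of its occurrences are capitalized.
CAPS_CUTOFF = 0.90


def parse_rows(text):
    """Yield (word, share of capitalized occurrences) for each row."""
    for line in text.splitlines():
        fields = line.split()
        yield fields[1], float(fields[4])


def likely_proper_nouns(text, cutoff=CAPS_CUTOFF):
    """Return the words whose capitalized share meets the cutoff."""
    return [word for word, caps in parse_rows(text) if caps >= cutoff]


print(likely_proper_nouns(SAMPLE))
```

A word such as drams (0.51) falls below the cutoff. Mid-range values like this may be worth reviewing by hand rather than classifying automatically.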
