For samples of the four different datasets, see
https://www.wordfrequency.info/samples.asp.
Note: "lemmas" on this page means that all of the different word forms are
grouped together. For example, the frequencies of the verb forms {decide, decides,
decided, deciding} are all counted under the one entry {decide}. Word
forms refer to each of the distinct word forms {decide, decides, decided,
deciding}. The "lemmatized" entries always distinguish by part of speech, however,
so that deciding as an adjective (the deciding factor) and deciding as a verb
(he really had a hard time deciding what to do) will always be distinguished
from each other and calculated separately. |
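The grouping described above can be sketched in a few lines of Python. This is only an illustration: the counts are invented (not COCA figures), the part-of-speech letters "v" and "j" are assumed to be the first letters of the CLAWS 7 verb and adjective tags, and `lemma_frequencies` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical tagged counts: (word form, lemma, part of speech, count).
# All numbers are invented for illustration; they are not COCA data.
tagged_counts = [
    ("decide", "decide", "v", 40),
    ("decides", "decide", "v", 10),
    ("decided", "decide", "v", 30),
    ("deciding", "decide", "v", 15),
    ("deciding", "deciding", "j", 5),  # "the deciding factor"
]


def lemma_frequencies(rows):
    """Sum word-form counts under their (lemma, part of speech) entry."""
    totals = defaultdict(int)
    for _word, lemma, pos, count in rows:
        totals[(lemma, pos)] += count
    return dict(totals)


lemma_totals = lemma_frequencies(tagged_counts)
```

Because the key is the pair (lemma, part of speech), the adjective use of "deciding" stays separate from the verb entry "decide".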
1. lemmas_60k.txt: top 60,000 lemmas.
Explanation of columns:
rank |
1-62,000. Based on
word frequency. |
lemma |
Again, the
"dictionary / headword" entry. This is why was or
happier or shoes would not be included here; they
are word forms of the lemmas be, happy, and
shoe. |
PoS |
Part of speech.
This is the first letter of the codes from https://ucrel.lancs.ac.uk/claws7tags.html |
freq |
Total frequency |
perMil |
The "normalized"
frequency -- per million words in COCA (992,960,152 words
total) |
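As a rough sketch of how a per-million figure relates to a raw frequency, assuming [perMil] is simply [freq] divided by the corpus size stated above and scaled to one million (the helper name `per_million` is hypothetical, and the file itself may round differently):

```python
COCA_TOTAL_WORDS = 992_960_152  # corpus size stated in the column description


def per_million(freq, corpus_size=COCA_TOTAL_WORDS):
    """Convert a raw frequency to occurrences per million words."""
    return freq / corpus_size * 1_000_000
```

Under this assumption, a lemma would need roughly 993 occurrences in the corpus to reach a normalized frequency of 1.0.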
%caps |
The percent of all
tokens that are capitalized, e.g. February, German, Trump,
Adobe, Hummer, Marshall |
Especially for nouns, the [%caps]
column can be very useful to find entries that might
actually be used as proper nouns the majority of the time,
even though the CLAWS 7 tagger did not tag them as proper nouns (e.g. Trump,
Springer, Savannah, Newt). Most proper nouns (e.g. Minnesota, Alice) have
been removed from the frequency list. But there are many "intermediate" cases
like those just listed, and it is impossible to have one single rule for what
percent of the time it needs to be used as a proper noun, in order to be removed
from the list. But with the [%caps] column, you can find (and delete, if
desired) some of the "marginal" words. |
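One way to use the [%caps] column for this kind of cleanup is sketched below. The rows and their values are invented for illustration, the column is assumed to hold a fraction between 0 and 1 (the actual file may store a percentage), and the 0.9 cutoff is an arbitrary choice, since no single threshold is recommended here.

```python
# Hypothetical rows; the "caps" values are invented, not taken from the data.
rows = [
    {"lemma": "trump", "PoS": "n", "caps": 0.98},
    {"lemma": "table", "PoS": "n", "caps": 0.03},
    {"lemma": "savannah", "PoS": "n", "caps": 0.91},
    {"lemma": "march", "PoS": "n", "caps": 0.60},
]


def likely_proper_nouns(rows, threshold=0.9):
    """Return noun lemmas whose capitalized share meets the threshold."""
    return [
        r["lemma"]
        for r in rows
        if r["PoS"] == "n" and r["caps"] >= threshold
    ]


marginal = likely_proper_nouns(rows)
```

Lowering the threshold flags more "intermediate" cases for manual review rather than deleting them automatically.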
%allC |
The percent of all
tokens that are completely capitalized, e.g. USB, DNA, CEO |
Related to the [%caps] column are words that are
completely capitalized, and are often acronyms, e.g. BC
(before Christ), IT (Information technology), DNA, USB, ROI
(Return on Investment). Which acronyms should we include, and which ones (e.g.
state or local agencies, or highly technical scientific terms) should we omit?
There is no clear answer on this, but the [%allC] (all caps) column can at least
provide data on this. |
range |
The number of the
485,179 texts in which the lemma occurs at least one time |
disp |
The Juilland "d"
dispersion measure (0.00 to 1.00) shows how "evenly" a word
is spread across the corpus. |
For example, if the word occurs
in 1000 texts, but only 1 or 2 times in 987 of those 1000
texts (and many times in the other 13), the "range" figure
simply shows "1000". The dispersion measure reflects that
even though all 1000 texts contain the word, it is not
evenly spread across these texts. A word like "the" or
"with" will have a dispersion value very close to 1.00, and
a highly specialized word (which occurs in so few texts)
will be closer to 0.00. |
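For readers who want to see how such a measure behaves, here is a minimal sketch of the commonly cited form of Juilland's D, D = 1 - V / sqrt(n - 1), where V is the coefficient of variation of the word's frequency across n corpus parts. It assumes equal-sized parts (or frequencies already normalized by part size); how the corpus was actually partitioned for the [disp] column is not stated here, and `juilland_d` is a hypothetical helper name.

```python
import math


def juilland_d(part_freqs):
    """Juilland's D for a word's frequencies across equal-sized corpus parts."""
    n = len(part_freqs)
    if n < 2:
        raise ValueError("need at least two corpus parts")
    mean = sum(part_freqs) / n
    if mean == 0:
        return 0.0  # word never occurs; treat as minimally dispersed
    # Population standard deviation, so V peaks at sqrt(n - 1).
    variance = sum((f - mean) ** 2 for f in part_freqs) / n
    v = math.sqrt(variance) / mean
    return 1 - v / math.sqrt(n - 1)
```

A word spread perfectly evenly across the parts gets a value of 1.00, and a word confined to a single part gets a value of (approximately) 0.00.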
{blog, web, TVM, spok, fic, mag, news, acad} |
The raw frequency
in each of these eight genres. |
{blogPM, webPM,
. . .newsPM, acadPM} |
The normalized
frequency (per million words: PM) in each of these eight
genres. |
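The raw genre columns can be turned into a quick genre profile, as in the sketch below. The row is invented for illustration and `genre_shares` is a hypothetical helper. Note that shares of raw frequency do not correct for differences in genre size, which is what the per-million columns are for.

```python
GENRES = ["blog", "web", "TVM", "spok", "fic", "mag", "news", "acad"]


def genre_shares(row):
    """Return each genre's share of the row's summed raw genre frequencies."""
    total = sum(row[g] for g in GENRES)
    if total == 0:
        return {g: 0.0 for g in GENRES}
    return {g: row[g] / total for g in GENRES}


# Hypothetical raw counts for one lemma (invented, not COCA data).
row = {"blog": 10, "web": 10, "TVM": 5, "spok": 5,
       "fic": 0, "mag": 20, "news": 10, "acad": 140}
shares = genre_shares(row)
top_genre = max(shares, key=shares.get)
```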
2. lemmas_60k_subgenres.txt: top 60,000 lemmas
+ sub-categories. This table is essentially a continuation of #1
above. But because there are so many columns (nearly 200 columns),
we've created a separate file for this, for those who don't need
this much detail. Explanation of columns:
3. lemmas_60k_words.txt: top 60,000 lemmas + words
(more than 100,000 forms). Shows the frequency of each word form for each of
the top 60,000 lemmas, where the word form occurs at
least five times total. Explanation of columns:
lemRank, lemFreq |
Same as [rank] and [freq] in #1 above |
lemma, PoS |
Same as the columns in #1 above |
wordFreq |
The frequency of the individual word forms, e.g.
{decide, decides, decided, deciding}, {big, bigger,
biggest}, or {shoe, shoes}. The word form must have a
frequency of at least 5. For some lower frequency words, it
is possible that not all of the word forms will be listed
(but this can often be compensated for by using the data
from #4 below) |
word |
The individual word form itself, e.g. one of
{decide, decides, decided, deciding}, {big, bigger,
biggest}, or {shoe, shoes}. Only word forms that meet the
frequency threshold described under [wordFreq] are listed. |
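The effect of the minimum word-form frequency can be illustrated with a short sketch. The counts are invented, and `listed_forms` and `coverage` are hypothetical helpers; the point is only that forms below the threshold drop out, so the listed forms may not add up to the full lemma frequency.

```python
MIN_WORD_FREQ = 5  # threshold stated in the column description


def listed_forms(form_counts, min_freq=MIN_WORD_FREQ):
    """Keep only the word forms that meet the minimum frequency."""
    return {w: c for w, c in form_counts.items() if c >= min_freq}


def coverage(form_counts, min_freq=MIN_WORD_FREQ):
    """Share of the lemma's total frequency covered by the listed forms."""
    total = sum(form_counts.values())
    if total == 0:
        return 0.0
    return sum(listed_forms(form_counts, min_freq).values()) / total


# Hypothetical counts for a low-frequency lemma (invented, not COCA data).
forms = {"decide": 40, "decides": 4, "decided": 30, "deciding": 15}
kept = listed_forms(forms)
share = coverage(forms)
```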
4. words_219k.txt: top
~220,000 word forms. This includes all word forms that occur at least 20 times in the
corpus, in at least five different texts (so a strange
name that occurs in just 1 or 2 of the 500,000 texts
wouldn't be included).
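The inclusion rule just described amounts to two simple thresholds, sketched here. `qualifies` is a hypothetical helper, and the example values are invented.

```python
def qualifies(freq, n_texts, min_freq=20, min_texts=5):
    """True if a word form meets both the frequency and text-count thresholds."""
    return freq >= min_freq and n_texts >= min_texts


# A name repeated 500 times in only 2 texts fails the text-count threshold.
strange_name_included = qualifies(freq=500, n_texts=2)
```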
rank |
Same as [rank] in
#1 above |
word |
Same as [word] in
#3 above |
freq |
Same as [freq] in
#1 above (but obviously for the individual word form, rather
than lemma) |
#texts |
The number of the
485,179 texts in which the word form occurs at least one time |
%caps |
Same as in #1
above. |
As is mentioned there, this can often
be used to distinguish between proper nouns (capitalized)
and common nouns (not capitalized), and this is particularly
useful in a list like this, where there is no indication of
part of speech. |
{blog, web, TVM, spok, fic, mag, news, acad} |
Same as the
columns in #1 above |
{blogPM, webPM,
. . .newsPM, acadPM} |
Same as the
columns in #1 above |