When you purchase the word frequency data, you get access to several different datasets (all included for the same price). The samples below contain every tenth entry and are available in both Excel (XLSX) and text (TXT) formats (more information on converting TXT to Excel).
You can now download the top 5,000 entries for each wordlist (each word #1-5,000, not just every tenth entry). If you re-post the list on the web, you must clearly indicate www.wordfrequency.info as the source of the data. Automated queries run every night to find copies of the data that do not follow these guidelines, and you will be forced to remove those copies from your website. Please respect these guidelines -- thanks. |
# | Frequency list | Text | Excel | Top 5000 | Explanation |
1 | Top 60,000 "lemmas" (see below) | TXT | XLSX | Download |
- Perhaps most useful for language learners, who probably don't care about the separate frequency of individual word forms, e.g. decide, decides, decided.
- Shows the frequency (raw frequency and frequency per million words) in each of the eight main genres: blogs, other web, TV/Movies, (more formal) spoken, fiction, magazine, newspaper, and academic.
- Shows range (the percentage of the nearly 500,000 texts that contain the lemma) and dispersion (a more complicated measure showing how "evenly" the word is spread across the corpus).
|
2 | Top 60,000 lemmas + sub-categories | TXT | XLSX | Download |
- An extension of #1. Distributed as a separate file because of the number of sub-categories, for those who don't need this much detail.
- Shows the frequency in 96 different sub-categories of the eight main genres, such as Magazine-Sports, Newspaper-Finance, Academic-Medical, Web-Reviews, Blogs-Personal, or TV-Comedies.
- Perhaps most useful for teachers or students of a particular domain of English, such as legal or medical English.
|
3 | Top 60,000 lemmas + word forms (100,000+ forms) | TXT | XLSX | Download |
- Shows the frequency of each word form for each of the top 60,000 lemmas, where the word form occurs at least five times total.
- For example, 5950 tokens of compensate; 2922 compensated, 902 compensating, 505 compensates.
- Perhaps most useful for computational processing of English.
|
4 | Top ~220,000 word forms | TXT | XLSX | Download |
- All word forms that occur at least 20 times in the corpus, in at least five different texts (so a strange name that occurs in just 1 or 2 of the 500,000 texts wouldn't be included).
- Words are listed without lemma or part of speech.
- Shows the range: in how many of the nearly 500,000 texts the word occurs.
- Shows what percentage of the time the word is capitalized, which often gives insight into whether the word is a proper noun.
- Shows the frequency in each of the eight main genres shown above in #1.
|
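As a rough illustration of how the criteria for list #4 fit together (at least 20 occurrences, in at least five different texts, plus the capitalization percentage), the following Python sketch computes those figures for a toy corpus. The function name `word_form_stats`, the input shape (one list of tokens per text), and the rule that a token counts as capitalized when its first character is uppercase are all assumptions made for illustration; the site does not describe how its own figures were computed.

```python
from collections import Counter


def word_form_stats(texts, min_freq=20, min_texts=5):
    """Return frequency, range, and capitalization stats per word form.

    `texts` is a list of token lists, one per text (hypothetical input shape).
    Forms are compared case-insensitively; a form is kept only if it occurs
    at least `min_freq` times in total and in at least `min_texts` texts.
    """
    freq = Counter()      # total occurrences of each lowercased form
    n_texts = Counter()   # number of texts containing the form (range)
    caps = Counter()      # occurrences whose first character is uppercase

    for tokens in texts:
        seen_in_text = set()
        for tok in tokens:
            form = tok.lower()
            freq[form] += 1
            if tok[:1].isupper():
                caps[form] += 1
            seen_in_text.add(form)
        # Each text counts at most once toward a form's range.
        n_texts.update(seen_in_text)

    return {
        form: {
            "freq": freq[form],
            "texts": n_texts[form],
            "pct_caps": 100.0 * caps[form] / freq[form],
        }
        for form in freq
        if freq[form] >= min_freq and n_texts[form] >= min_texts
    }


# Toy corpus with lowered thresholds so that something passes the filter.
toy_texts = [["Paris", "is", "big"], ["paris", "is", "far"], ["Zyx", "is"]]
toy_stats = word_form_stats(toy_texts, min_freq=2, min_texts=2)
```

With the default thresholds of 20 and 5, a name that appears many times but in only one or two texts is dropped, which matches the "strange name" case described in the list above.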
"Lemmas", as used above, means that all of the different word forms are grouped together. For example, the frequencies of the verb forms {decide, decides, decided, deciding} are all grouped together under the one entry {decide}. "Word forms" refers to each of the distinct forms {decide, decides, decided, deciding}.
The "lemmatized" entries are always separated by part of speech, however, so that deciding as an adjective (the deciding factor) and deciding as a verb (he really had a hard time deciding what to do) will always be distinguished from each other and calculated separately.
|
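The grouping of word forms into lemmas, kept separate by part of speech, can be sketched in a few lines of Python. Everything in this example is hypothetical: the row layout (word form, lemma, part of speech, count), the counts themselves, and the choice to file the adjective under the lemma "deciding" are invented for illustration and are not taken from the actual data files.

```python
from collections import defaultdict

# Hypothetical rows: (word form, lemma, part of speech, count).
# The counts are invented and do not come from the real frequency lists.
FORM_ROWS = [
    ("decide", "decide", "verb", 100),
    ("decides", "decide", "verb", 20),
    ("decided", "decide", "verb", 150),
    ("deciding", "decide", "verb", 30),
    ("deciding", "deciding", "adj", 5),  # "the deciding factor"
]


def lemma_frequencies(rows):
    """Sum word-form counts under each (lemma, part of speech) pair."""
    totals = defaultdict(int)
    for _form, lemma, pos, count in rows:
        totals[(lemma, pos)] += count
    return dict(totals)


lemma_totals = lemma_frequencies(FORM_ROWS)
```

Keying the totals on the pair (lemma, part of speech) rather than on the word form alone is what keeps the two uses of "deciding" apart: the verb count is added to the entry for decide, while the adjective gets its own entry.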