Word frequency data


When you purchase the word frequency data, you get access to several different datasets, all included for the same price. The samples below contain every tenth entry and are available in both Excel (XLSX) and text (TXT) format (more information on converting TXT to Excel).
 
  Frequency list Text Excel Explanation
1 Top 60,000 "lemmas" (see below) TXT XLSX
  • Perhaps most useful for language learners, who probably don't care about the separate frequency of individual word forms, e.g. decide, decides, decided.
  • Shows the frequency (raw frequency and frequency per million words) in each of the eight main genres: blogs, other web, TV/Movies, (more formal) spoken, fiction, magazine, newspaper, and academic.
  • Shows range (what percentage of the nearly 500,000 texts have the lemma) and dispersion (a more complicated measure showing how "evenly" the word is spread across the corpus).
2 Top 60,000 lemmas + sub-categories TXT XLSX
  • An extension of #1. It is distributed as a separate file because of the number of sub-categories, so that those who don't need this much detail can simply use #1.
  • Shows the frequency in 96 different sub-categories of the eight main genres, such as Magazine-Sports, Newspaper-Finance, Academic-Medical, Web-Reviews, Blogs-Personal, or TV-Comedies.
  • Perhaps most useful for teachers or students of a particular domain of English, such as legal or medical English.
3 Top 60,000 lemmas + word forms (100,000+ forms) TXT XLSX
  • Shows the frequency of each word form for each of the top 60,000 lemmas, where the word form occurs at least five times total.
  • For example, 5950 tokens of compensate, 2922 of compensated, 902 of compensating, and 505 of compensates.
  • Perhaps most useful for computational processing of English.
4 Top ~220,000 word forms TXT XLSX
  • All word forms that occur at least 20 times in the corpus and in at least five different texts (so a strange name that occurs in just 1 or 2 of the nearly 500,000 texts wouldn't be included).
  • Words are listed without lemma or part of speech.
  • Shows the range: the number of the nearly 500,000 texts in which the word occurs.
  • Shows what percentage of the time the word is capitalized, which often gives insight into whether the word is a proper noun.
  • Shows the frequency in each of the eight main genres shown above in #1.
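
The inclusion rule and the capitalization percentage described for list #4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used to build the lists: the function name `word_form_stats`, the input format (one list of tokens per text), and the choice to count forms case-insensitively are all hypothetical.

```python
from collections import Counter, defaultdict


def word_form_stats(texts, min_freq=20, min_texts=5):
    """Return stats for word forms that meet both thresholds.

    `texts` is a list of token lists, one per text (hypothetical format).
    Forms are lowercased before counting (an assumption of this sketch).
    """
    freq = Counter()                # total occurrences of each form
    capitalized = Counter()         # occurrences that start with an uppercase letter
    texts_with = defaultdict(set)   # ids of the texts in which each form occurs

    for text_id, tokens in enumerate(texts):
        for token in tokens:
            form = token.lower()
            freq[form] += 1
            if token[:1].isupper():
                capitalized[form] += 1
            texts_with[form].add(text_id)

    stats = {}
    for form, count in freq.items():
        n_texts = len(texts_with[form])
        # Keep a form only if it is frequent enough AND spread over enough texts.
        if count >= min_freq and n_texts >= min_texts:
            stats[form] = {
                "freq": count,
                "range": n_texts,
                "pct_caps": 100.0 * capitalized[form] / count,
            }
    return stats
```

With the default thresholds, a name that occurs 50 times but only in two texts is dropped, because it fails the `min_texts` check even though it passes `min_freq`.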

"Lemmas" above means that all of the different word forms are grouped together. For example, the frequencies of the verb forms {decide, decides, decided, deciding} are all grouped together under the one entry {decide}. "Word forms" refers to each of the distinct forms {decide, decides, decided, deciding}. The "lemmatized" entries are always separated by part of speech, however, so that deciding as an adjective (the deciding factor) and deciding as a verb (he really had a hard time deciding what to do) are always distinguished from each other and counted separately.
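
The grouping described above, together with the "frequency per million words" figure mentioned for list #1, might look roughly like the following Python sketch. It assumes the tokens have already been lemmatized and tagged; the function name `lemma_frequencies`, the tuple layout, and the part-of-speech labels `"v"` and `"j"` are hypothetical and are not taken from the datasets themselves.

```python
from collections import Counter


def lemma_frequencies(tagged_tokens, corpus_size=None):
    """Group counts under (lemma, part of speech) entries.

    `tagged_tokens` is an iterable of (word_form, lemma, pos) tuples
    (hypothetical format). `corpus_size` is the total number of words
    used for the per-million figure; it defaults to the number of tokens.
    """
    tokens = list(tagged_tokens)
    total = corpus_size if corpus_size is not None else len(tokens)

    # Keying on (lemma, pos) keeps the verb and the adjective apart.
    raw = Counter((lemma, pos) for _form, lemma, pos in tokens)

    return {
        key: {"raw": count, "per_million": count * 1_000_000 / total}
        for key, count in raw.items()
    }


tokens = [
    ("decide", "decide", "v"),
    ("decided", "decide", "v"),
    ("deciding", "decide", "v"),     # he had a hard time deciding
    ("deciding", "deciding", "j"),   # the deciding factor
]
result = lemma_frequencies(tokens, corpus_size=2_000_000)
```

Here the three verb forms are counted under the single entry `("decide", "v")`, while the adjective use of deciding gets its own entry `("deciding", "j")`.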