Word frequency data

Corpus of Contemporary American English




(See also the page for the 100,000 word list, which has some differences from the 5,000-60,000 word lists)

1. Word frequency (overall)

The whole idea of frequency lists and dictionaries is to discover the most frequent words in a language. If you are a learner or teacher, this allows you to use your time more effectively, by focusing on learning the words that you or your students are most likely to encounter in the real world. A typical dictionary will list words A-Z, and it might even highlight "frequent" words, but which ones are the most frequent ones? The only way to know this is by means of real frequency lists, based on a reliable corpus (collection of texts).

So what can one do with real frequency data? There is no end to the possibilities, but here are a few ideas:

  • Learners can go through the lists word by word in frequency order, finding words that they aren't familiar with. This is a great way to fill in gaps in their vocabulary.

  • Teachers can assign their students a certain block of words to learn each week and then give a short quiz at the end of the week. At the end of the semester, they'll know that their students are at least familiar with a certain frequency range of words.

  • Materials developers can use the frequency data to design language learning materials that are more realistic and more useful, since they know that these are the words that students will need in the real world.

  • Linguists can use the frequency and collocates data to design language experiments or to carry out research -- on lexical semantics, psycholinguistics, morphology, and a wide range of other fields.

  • Computational linguists can use the frequency and collocates data for a wide range of natural language applications. We are not aware of any other frequency / collocates list of English that is this extensive and this accurate.

2. Word frequency (by genre and sub-genre)

These lists show the frequency of each word in each of the five main genres (spoken, fiction, popular magazines, newspapers, and academic) as well as the frequency in more than 40 sub-genres (e.g. MAG-Sports, NEWS-Financial, ACAD-Medicine). Let's look at one example (from among the 60,000 words in the lists):

word      PoS   total   spoken   fiction   magazine   newspaper   academic
bullish   j     569     19       17        188        232         12

This shows us that bullish is most common in newspapers, and that it is also fairly common in popular magazines. There isn't room to show it here, but the downloadable list also shows the frequency in each of the 40+ sub-genres, and we would find that this word is most common in NEWS-Financial and also in MAG-Financial.

This type of data can be used for materials development and for teaching ESP -- English for Specific Purposes. Rather than having students look at English vocabulary in its entirety, they can focus on specific areas, like Medical English or Legal English, and find the words that are much more common in that genre than in others. Likewise, linguists can use the data from a certain "slice" of English as they are extracting data for experiments and surveys.
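As a rough illustration of how such a row might be used, the sketch below takes the five genre counts for bullish shown above and ranks the genres by frequency. The function names are made up for this example, and it assumes the five genre sections are similar enough in size that raw counts can be compared directly; if they are not, the counts would first need to be normalized (for example, per million words).

```python
# Genre counts for "bullish", copied from the example row above.
row = {
    "spoken": 19,
    "fiction": 17,
    "magazine": 188,
    "newspaper": 232,
    "academic": 12,
}


def genre_shares(counts):
    """Return each genre's share of the summed genre counts."""
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.items()}


def top_genres(counts, k=2):
    """Return the k genres with the highest counts, most frequent first."""
    return sorted(counts, key=counts.get, reverse=True)[:k]


shares = genre_shares(row)
for genre in top_genres(row, k=len(row)):
    print(f"{genre:<10} {row[genre]:>4}  {shares[genre]:.1%}")
```

The same ranking could be applied to the sub-genre columns of the downloadable list to find words that stand out in, say, Medical or Legal English.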

3. Collocates (nearby words)

Collocates provide information on word meaning and usage, following the idea that "you can tell a lot about a word by the words that it hangs out with". Collocates are grouped by part of speech and then sorted by frequency. Let's look at two quick examples:

13730 brooding j
  noun: dark, eyes, look, silence, presence, sky, sense, cloud, thought, mood, portrait, bird
  misc: dark, over, sit, silent, heavy, gray, stare, handsome, mysterious, beneath, moody

Suppose you find the word brooding in a short story and you don't know what it means. You could simply look it up in a dictionary, and you'd find a definition like "cast in subdued light so as to convey a somewhat threatening atmosphere". But the collocates lists provide a much better and more complete "word sketch". You can really "feel" the meaning of this word by seeing what other words it occurs with.

11961 sprawl n
  adjective: urban, suburban, rural, industrial, metropolitan, vast, unchecked, surrounding, Southern, increasing
  noun: city, development, traffic, growth, pollution, congestion, land, town, farmland, county
  verb: create, encourage, stop, fight, reduce, curb, slow, threaten, limit, crawl

A dictionary would tell you that sprawl refers to "growth" or a "spreading out". But the collocates show that it refers particularly to the growth of cities (city, suburban, farmland), that it may be more common in the Southern US, that it is associated with pollution and congestion, and that people are trying to reduce, stop, and fight against it.

4. N-grams

These would mainly be useful for (computational and corpus) linguists. Let's take the example of the ten or so most common three-word strings with point in the middle position (with the frequency of the string indicated as well):

6093 the point of; 3309 the point where; 2646 to point out; 2558 the point is; 2304 the point that; 2118 a point of; 1324 this point in; 1126 a point where; 814 no point in; 814 some point in; 594 starting point for

Corpus linguists use n-grams to look for patterns in language. By looking at the immediate contexts of a word and how often they occur, we can begin to identify and categorize the different uses of a word.

Computational linguists use n-grams to train computers to process language in roughly the same way that humans do. Humans know what words occur together, and given one word, what the next word might be. Computers don't know that. But if we train a computer to see patterns in 155,000,000 strings of words (with their frequencies) from a robust, balanced corpus, then computers can begin to learn.
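To make the idea of "given one word, what the next word might be" concrete, here is a minimal sketch that uses only the three-word strings listed above. It builds a lookup from the first two words to the possible third words and ranks them by frequency. The function names are hypothetical, and the resulting proportions are relative to this short list only, not to the full n-grams data.

```python
from collections import defaultdict

# Three-word strings with "point" in the middle, copied from the list above.
trigrams = {
    ("the", "point", "of"): 6093,
    ("the", "point", "where"): 3309,
    ("to", "point", "out"): 2646,
    ("the", "point", "is"): 2558,
    ("the", "point", "that"): 2304,
    ("a", "point", "of"): 2118,
    ("this", "point", "in"): 1324,
    ("a", "point", "where"): 1126,
    ("no", "point", "in"): 814,
    ("some", "point", "in"): 814,
    ("starting", "point", "for"): 594,
}


def next_word_table(counts):
    """Map each two-word context to a dict of {next_word: count}."""
    table = defaultdict(dict)
    for (w1, w2, w3), n in counts.items():
        table[(w1, w2)][w3] = n
    return table


def predict_next(table, w1, w2):
    """Return (next_word, relative frequency) pairs, most likely first."""
    followers = table.get((w1, w2), {})
    total = sum(followers.values())
    ranked = sorted(followers.items(), key=lambda item: item[1], reverse=True)
    return [(word, n / total) for word, n in ranked]


table = next_word_table(trigrams)
for word, share in predict_next(table, "the", "point"):
    print(f"the point {word:<6} {share:.1%}")
```

A real application would be trained on the complete set of n-grams and would need some way to handle contexts it has never seen; here an unseen context simply returns an empty list.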