Word frequency data

1. Word frequency (overall)

The whole idea of frequency lists and dictionaries is to discover the most frequent words in a language. If you are a learner or teacher, this allows you to use your time more effectively, by focusing on learning the words that you or your students are most likely to encounter in the real world. A typical dictionary will list words A-Z, and it might even highlight "frequent" words, but which ones are the most frequent ones? The only way to know this is by means of real frequency lists, based on a reliable corpus (collection of texts).

So what can one do with real frequency data? There is no end to the possibilities, but here are a few ideas:

  • Learners can go through the lists word by word in frequency order, finding words that they aren't familiar with. This is a great way to fill in gaps in their vocabulary.

  • Teachers can assign their students a certain block of words to learn each week and then give a short quiz at the end of the week. At the end of the semester, they'll know that their students are at least familiar with a certain frequency range of words.

  • Materials developers can use the frequency data to design language learning materials that are more realistic and more useful, since they know that these are the words that students will need in the real world.

  • Linguists can use the frequency and collocates data to design language experiments or to carry out research -- on lexical semantics, psycholinguistics, morphology, and a wide range of other fields.

  • Computational linguists can use the frequency and collocates data for a wide range of natural language applications. We are not aware of any other frequency / collocates list of English that is this extensive and this accurate.
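The weekly-block idea for teachers can be sketched in a few lines of Python. This is only a minimal sketch, assuming the frequency list has already been loaded as a list of words in rank order. The function name `weekly_blocks`, the block size, and the sample words are hypothetical and are there purely for illustration.

```python
def weekly_blocks(ranked_words, block_size):
    """Split a rank-ordered word list into consecutive blocks, one per week."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [ranked_words[i:i + block_size]
            for i in range(0, len(ranked_words), block_size)]


# Hypothetical sample standing in for a real frequency list in rank order.
sample = ["the", "be", "and", "of", "a", "in", "to"]
blocks = weekly_blocks(sample, 3)

for week, words in enumerate(blocks, start=1):
    print(f"Week {week}: {', '.join(words)}")
```

The last block may be shorter than the others when the list length is not a multiple of the block size.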

2. Word frequency (by genre and sub-genre)

These lists show the frequency of each word in each of the eight main genres (blogs, general web pages, TV and movie subtitles, spoken, fiction, popular magazines, newspapers, and academic) as well as the frequency in nearly 100 sub-genres (e.g. MAG-Sports, NEWS-Financial, ACAD-Medicine). Let's look at one example (from among the 60,000 words in the lists):

16908 bullish j 1410 322 180 14 161 28 348 340 17

This shows us that bullish occurs 1,410 times in total (the first figure after the part-of-speech tag, and the sum of the eight genre counts that follow it). It is most common in newspapers, magazines, and blogs, and it is also fairly common in general web pages and in the (somewhat more formal) spoken genre, which in COCA is taken from unscripted conversation on national TV and radio programs. But it is not very common in TV and movie subtitles (sitcoms, etc.), in fiction, or in academic texts. There isn't room to show it here, but the downloadable list also shows the frequency in each of the nearly 100 different sub-genres, where we would find that this word is most common in NEWS-Financial and MAG-Financial.
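A row like the one for bullish can be read programmatically. The sketch below is hedged: the row itself does not label its columns, so the genre order in `GENRES` is an assumption (chosen because it agrees with the description above), and the names `GENRES` and `parse_row` are hypothetical. The downloadable list may well use a different layout.

```python
# ASSUMPTION: column order of the eight genre counts. The row does not label them.
GENRES = ["blog", "web", "tv_movies", "spoken",
          "fiction", "magazine", "newspaper", "academic"]


def parse_row(line):
    """Parse 'rank word pos total c1 ... c8' into a dictionary."""
    fields = line.split()
    counts = [int(x) for x in fields[4:]]
    if len(counts) != len(GENRES):
        raise ValueError(f"expected {len(GENRES)} genre counts, got {len(counts)}")
    return {
        "rank": int(fields[0]),
        "word": fields[1],
        "pos": fields[2],
        "total": int(fields[3]),
        "by_genre": dict(zip(GENRES, counts)),
    }


row = parse_row("16908 bullish j 1410 322 180 14 161 28 348 340 17")

# Sanity check: the genre counts should add up to the total frequency.
assert sum(row["by_genre"].values()) == row["total"]

# Genres from most to least frequent for this word.
ranked = sorted(row["by_genre"].items(), key=lambda kv: kv[1], reverse=True)
for genre, count in ranked:
    print(f"{genre:10s} {count}")
```

The check that the eight counts sum to the total is a cheap way to catch a misaligned or truncated row before using the data.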

This type of data can be used for materials development and for teaching ESP -- English for Specific Purposes. Rather than having students look at English vocabulary in its entirety, they can focus on specific areas, like Medical English or Legal English, and find the words that are much more common in that genre than in others. Likewise, linguists can use the data from a certain "slice" of English as they are extracting data for experiments and surveys.
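One rough way to find words that are "much more common in that genre than in others" is to compare a word's count in the target genre with its average count elsewhere. The sketch below is an illustration, not the method used to build the lists. It assumes the genres are of roughly comparable size (otherwise the counts should first be normalized, e.g. per million words), and the genre labels and the function name `genre_ratio` are hypothetical.

```python
def genre_ratio(by_genre, target):
    """Count in the target genre divided by the mean count in all other genres."""
    others = [count for genre, count in by_genre.items() if genre != target]
    mean_other = sum(others) / len(others)
    if mean_other == 0:
        return float("inf")  # the word occurs only in the target genre
    return by_genre[target] / mean_other


# Counts for "bullish" from the example row; the genre labels are an assumption.
bullish = {"blog": 322, "web": 180, "tv_movies": 14, "spoken": 161,
           "fiction": 28, "magazine": 348, "newspaper": 340, "academic": 17}

for genre in ("newspaper", "academic"):
    print(f"{genre}: {genre_ratio(bullish, genre):.2f}")
```

A ratio well above 1 suggests the word is characteristic of that genre, while a ratio well below 1 suggests it is rare there. A teacher building a Financial English word list could keep only the words whose ratio exceeds some chosen threshold.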

3. Collocates (nearby words)

Collocates provide information on word meaning and usage, following the idea that "you can tell a lot about a word by the words that it hangs out with". Collocates are grouped by part of speech and then sorted by frequency. Let's look at two quick examples:

13730 brooding j
noun: dark, eyes, look, silence, presence, sky, sense, cloud, thought, mood, portrait, bird
misc: dark, over, sit, silent, heavy, gray, stare, handsome, mysterious, beneath, moody

Suppose you find the word brooding in a short story and you don't know what it means. You could simply look it up in a dictionary, and you'd find a definition like "cast in subdued light so as to convey a somewhat threatening atmosphere". But the collocates lists provide a much better and more complete "word sketch". You can really "feel" the meaning of this word by seeing what other words it occurs with.

11961 sprawl n
adjective: urban, suburban, rural, industrial, metropolitan, vast, unchecked, surrounding, Southern, increasing
noun: city, development, traffic, growth, pollution, congestion, land, town, farmland, county
verb: create, encourage, stop, fight, reduce, curb, slow, threaten, limit, crawl

A dictionary would tell you that sprawl refers to "growth" or a "spreading out". But the collocates show that it refers particularly to the growth of cities (city, suburban, farmland), that it may be more common in the Southern US, that it is associated with pollution and congestion, and that people are trying to reduce, stop, and fight against it.
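To make the idea of collocates concrete, here is a minimal sketch that counts the words occurring within a few words of a node word. It is an illustration under simple assumptions, not the procedure used to build these lists: the text is a made-up toy example, the tokenizer is a bare regular expression, sentence boundaries are ignored, and there is no part-of-speech tagging (so the results cannot be grouped into noun, verb, and so on). The function name `collocates` and the `span` parameter are hypothetical.

```python
import re
from collections import Counter


def collocates(text, node, span=4):
    """Count words found within `span` tokens to the left or right of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            # Skip other occurrences of the node word itself.
            counts.update(w for w in left + right if w != node)
    return counts


# Made-up toy text, for illustration only.
toy = ("Urban sprawl threatens farmland. The city tried to curb sprawl. "
       "Suburban sprawl brings traffic and pollution to the city.")

top = collocates(toy, "sprawl", span=2)
for word, count in top.most_common(5):
    print(word, count)
```

On a real corpus one would also filter out very frequent function words (or rank by an association measure rather than raw count) so that words like "the" and "to" do not crowd out the informative collocates.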

4. N-grams

These would mainly be useful for (computational and corpus) linguists. Let's take the example of the ten or so most common three-word strings with point in the middle position (with the frequency of the string indicated as well):

6093 the point of; 3309 the point where; 2646 to point out; 2558 the point is; 2304 the point that; 2118 a point of; 1324 this point in; 1126 a point where; 814 no point in; 814 some point in; 594 starting point for

Corpus linguists use n-grams to look for patterns in language. By looking at the immediate contexts of a word and how often they occur, we can begin to identify and categorize the different uses of a word.
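A list like the one for point can be produced by sliding a three-word window over the text and keeping the windows whose middle word matches. The following is a minimal sketch on a made-up sentence, assuming a simple regular-expression tokenizer. The function name `middle_trigrams` is hypothetical, and a real corpus pipeline would handle tokenization and sentence boundaries more carefully.

```python
import re
from collections import Counter


def middle_trigrams(text, middle):
    """Count three-word strings whose middle word is `middle`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    # zip over three offset views of the token list yields every trigram.
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        if second == middle:
            counts[(first, second, third)] += 1
    return counts


# Made-up toy text, for illustration only.
toy = ("The point of the talk is clear. At some point in the talk, "
       "I missed the point of it.")

counts = middle_trigrams(toy, "point")
for trigram, n in counts.most_common():
    print(n, " ".join(trigram))
```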

Computational linguists use n-grams to train computers to process language in roughly the same way that humans do. Humans know which words occur together and, given one word, what the next word might be. Computers don't know that. But if we train a computer to see patterns in 155,000,000 strings of words (with their frequencies) from a robust, balanced corpus, then computers can begin to learn.
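As a small illustration of "given one word, what the next word might be", the sketch below turns the three-word strings for point listed above into a next-word guesser. It is a toy under stated assumptions: it uses only those eleven strings, so the probabilities are relative to that short list rather than to the whole corpus, and the names `build_model`, `next_word_probs`, and `predict` are hypothetical.

```python
from collections import defaultdict

# Frequencies of the three-word strings with "point" in the middle, as listed above.
TRIGRAMS = {
    ("the", "point", "of"): 6093, ("the", "point", "where"): 3309,
    ("to", "point", "out"): 2646, ("the", "point", "is"): 2558,
    ("the", "point", "that"): 2304, ("a", "point", "of"): 2118,
    ("this", "point", "in"): 1324, ("a", "point", "where"): 1126,
    ("no", "point", "in"): 814, ("some", "point", "in"): 814,
    ("starting", "point", "for"): 594,
}


def build_model(trigram_counts):
    """Map each two-word context to the counts of the words that follow it."""
    model = defaultdict(dict)
    for (w1, w2, w3), n in trigram_counts.items():
        model[(w1, w2)][w3] = n
    return model


def next_word_probs(model, w1, w2):
    """Relative frequency of each follower of the context (w1, w2)."""
    followers = model.get((w1, w2), {})
    total = sum(followers.values())
    return {w: n / total for w, n in followers.items()} if total else {}


def predict(model, w1, w2):
    """Most frequent follower of (w1, w2), or None for an unseen context."""
    probs = next_word_probs(model, w1, w2)
    return max(probs, key=probs.get) if probs else None


model = build_model(TRIGRAMS)
print(predict(model, "the", "point"))
print(next_word_probs(model, "a", "point"))
```

With millions of strings instead of eleven, the same counting idea gives much better coverage, although unseen contexts still need some form of smoothing or back-off to shorter n-grams.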