1. Word frequency (overall)
The whole idea of frequency lists and
dictionaries is to discover the most frequent words in a language.
If you are a learner or teacher, this allows you to use your time
more effectively, by focusing on learning the words that you or your
students are most likely to encounter in the real world. A typical
dictionary will list words A-Z, and it
might even highlight "frequent" words, but which ones are
the most frequent ones? The only way to know this is by means
of real frequency lists, based on a reliable corpus
(collection of texts).
So what can one do with real
frequency data? There is no end to the possibilities, but here are
a few ideas:
-
Learners can go
through the lists word by word in frequency order,
finding words that they aren't familiar with. This is a great way
to fill in gaps in their vocabulary.
-
Teachers can assign their
students a certain block of words to learn each week and then
give a short quiz at the end of the week. At the end of the
semester, they'll know that their students are at least familiar
with a certain frequency range of words.
-
Materials developers can use
the frequency data to design language learning materials that
are more realistic and more useful, since they know that these
are the words that students will need in the real world.
-
Linguists can use the
frequency and collocates data to design language experiments or
to carry out research -- on lexical semantics, psycholinguistics,
morphology, and a wide range of other fields.
-
Computational linguists can
use the frequency and collocates data for a wide range of
natural language applications. We are not aware of any other
frequency / collocates list of English that is this extensive
and this accurate.
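
The weekly-block idea in the teachers' item above can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_blocks`, the block size, and the placeholder words are assumptions, and the input is simply any list of words already sorted from most to least frequent.

```python
from typing import Iterator


def frequency_blocks(
    ranked_words: list[str], block_size: int
) -> Iterator[tuple[int, int, list[str]]]:
    """Yield (first_rank, last_rank, words) for consecutive blocks.

    `ranked_words` is assumed to be sorted most frequent first, so the
    word at index 0 has rank 1.
    """
    for start in range(0, len(ranked_words), block_size):
        block = ranked_words[start:start + block_size]
        yield start + 1, start + len(block), block


# Placeholder entries standing in for a real frequency list.
placeholder_list = ["w1", "w2", "w3", "w4", "w5", "w6", "w7"]

for first, last, words in frequency_blocks(placeholder_list, block_size=3):
    print(f"ranks {first}-{last}: {', '.join(words)}")
```

The last block is allowed to be shorter than the others, so no words are dropped when the list length is not a multiple of the block size.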
2. Word frequency (by genre and sub-genre)
These lists show the frequency of each
word in each of the eight main genres (blogs, web pages, TV/movie
subtitles, spoken, fiction, popular magazines, newspapers, and
academic) as well as the frequency in
more than 40 sub-genres (e.g. MAG-Sports,
NEWS-Financial, ACAD-Medicine). Let's look at one example
(from among the 60,000 words in the lists):
rank | word | PoS | TOTAL | BLOG | WEB | TV/M | SPOK | FIC | MAG | NEWS | ACAD
16908 | bullish | j | 1410 | 322 | 180 | 14 | 161 | 28 | 348 | 340 | 17
This shows us that bullish is most
common in magazines, newspapers, and blogs, and that it is also fairly common in
general web pages and in the (somewhat more formal) spoken genre, which in COCA is taken
from unscripted conversation on national TV and radio programs. But it is not
very common in TV and movie subtitles (sitcoms, etc.) or in fiction or academic
texts. There isn't room to show it here, but the downloadable list also shows
the frequency in each of the nearly 100 different sub-genres, and we would find that
this word is most common in NEWS-Financial and also in MAG-Financial.
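
To make the genre comparison concrete, here is a minimal Python sketch that turns the raw counts in the bullish row above into each genre's share of the total. The function name `genre_shares` is an assumption made for illustration. Note also that these are raw token counts: the sizes of the individual genres are not given here, so the shares are not normalized per million words.

```python
# Raw counts for "bullish", copied from the row above.
bullish = {
    "BLOG": 322, "WEB": 180, "TV/M": 14, "SPOK": 161,
    "FIC": 28, "MAG": 348, "NEWS": 340, "ACAD": 17,
}


def genre_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each genre's share of the total count, largest first."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {genre: count / total for genre, count in ordered}


shares = genre_shares(bullish)
for genre, share in shares.items():
    print(f"{genre:5s} {share:6.1%}")
```

The eight genre counts sum to the TOTAL column (1410), and magazines, newspapers, and blogs come out on top, matching the description above.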
This type of data can be used for
materials development and for teaching ESP -- English for Specific
Purposes. Rather than having students look at English vocabulary in
its entirety, they can focus on specific areas, like Medical English
or Legal English, and find the words that are much more common in
that genre than in others. Likewise, linguists can use the data from a
certain "slice" of English as they are extracting data for
experiments and surveys.
3. Collocates (nearby words)
Collocates provide information on word
meaning and usage, following the idea that "you can tell a lot about
a word by the words that it hangs out with". Collocates are
grouped by part of speech and then sorted by frequency. Let's look at
two quick examples:
13730 brooding j
noun: dark, eyes, look, silence, presence, sky, sense,
cloud, thought, mood, portrait, bird
misc: dark, over, sit, silent, heavy, gray, stare, handsome,
mysterious, beneath, moody
Suppose you find the word brooding
in a short story and you don't know what it means. You could
simply look it up in a dictionary, and you'd find a definition like
"cast in subdued light so as to convey a somewhat threatening
atmosphere". But the collocates lists provide
a much better and more complete "word sketch". You can really "feel" the
meaning of this word by seeing what other words it occurs with.
11961 sprawl n
adjective: urban, suburban, rural, industrial, metropolitan,
vast, unchecked, surrounding, Southern, increasing
noun: city, development, traffic, growth, pollution, congestion,
land, town, farmland, county
verb: create, encourage, stop, fight, reduce, curb, slow,
threaten, limit, crawl
A dictionary would tell you that sprawl
refers to "growth" or a "spreading out". But the collocates show
that it refers particularly to the growth of cities (city,
suburban, farmland), that it may be more common in the
Southern US, that it is associated with pollution and
congestion, and that people are trying to reduce, stop,
and fight against it.
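
For readers curious how collocate counts of this kind can be produced, here is a minimal Python sketch that counts the words appearing within a fixed window around a node word. It is not a description of how the lists above were built: the window size, the function name `collocates`, and the example sentence are all assumptions, and the sketch does no part-of-speech grouping or lemmatization.

```python
from collections import Counter


def collocates(tokens: list[str], node: str, window: int = 4) -> Counter:
    """Count words within `window` tokens to either side of `node`."""
    counts: Counter = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the node occurrence itself
                counts[tokens[j]] += 1
    return counts


# Made-up sentence, for illustration only.
text = ("urban sprawl threatens farmland while suburban sprawl "
        "adds traffic and congestion")
tokens = text.lower().split()

for word, count in collocates(tokens, "sprawl", window=2).most_common():
    print(word, count)
```

A real collocates list would be built from a large corpus rather than one sentence, and would usually rank collocates by frequency or by an association measure.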
4. N-grams
These would mainly be useful for
(computational and corpus) linguists. Let's take the example of the
ten or so most common three-word strings with point in the
middle position (with the frequency of the string indicated as
well):
6093 the point of; 3309 the point where; 2646 to point out;
2558 the point is; 2304 the point that; 2118 a point of;
1324 this point in; 1126 a point where; 814 no point in;
814 some point in; 594 starting point for
Corpus linguists use n-grams to look
for patterns in language. By looking at the immediate contexts of a
word and how often they occur, we can begin to identify and
categorize the different uses of a word.
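
As a rough illustration of this kind of pattern search, the following Python sketch counts three-word strings that have a given word in the middle position. The function name `middle_trigrams` and the sample sentence are assumptions; a real analysis would run over a full tokenized corpus.

```python
from collections import Counter


def middle_trigrams(tokens: list[str], target: str) -> Counter:
    """Count 3-grams whose middle token equals `target`."""
    counts: Counter = Counter()
    # zip over three offset views of the token list to get each 3-gram
    for left, mid, right in zip(tokens, tokens[1:], tokens[2:]):
        if mid == target:
            counts[(left, mid, right)] += 1
    return counts


# Made-up sentence, for illustration only.
tokens = "the point of the story is the point of no return".split()

for trigram, count in middle_trigrams(tokens, "point").most_common():
    print(count, " ".join(trigram))
```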
Computational linguists use n-grams to
train computers to process language in roughly the same way that
humans do. Humans know what words occur together, and given one
word, what the next word might be. Computers don't know that. But if
we train a computer to see patterns in 155,000,000 strings of words
(with their frequencies) from a robust, balanced corpus, then
computers can begin to learn.
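
The idea of guessing the next word from n-gram frequencies can be sketched with the counts listed earlier for strings beginning with "the point". This is a simplified illustration, not a full language model: the function name `next_word_probs` is an assumption, and because only the four listed continuations are included, the resulting proportions are relative to those four strings rather than to everything that follows "the point" in the corpus.

```python
# Counts copied from the three-word strings listed earlier,
# restricted to those that begin with "the point".
the_point = {"of": 6093, "where": 3309, "is": 2558, "that": 2304}


def next_word_probs(counts: dict[str, int]) -> dict[str, float]:
    """Relative-frequency estimate of each continuation."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


probs = next_word_probs(the_point)
best = max(probs, key=probs.get)
print(f"most likely continuation of 'the point': {best}")
for word, p in probs.items():
    print(f"  {word:6s} {p:.3f}")
```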