Portuguese word frequency data (introduction)
Note: "lemmas" on this page means that all of the different word forms are
grouped together. For example, the frequencies of the verb forms {estar, está,
estão, estava, estou} are all grouped together under the single entry {estar}.
"Word forms" refers to each of the distinct forms {estar, está, estão, estava,
estou}. The lemmatized entries always distinguish by part of speech, however,
so trabalho as a noun (um trabalho difícil) and trabalho as a form of the verb
trabalhar (eu trabalho com crianças) are always distinguished from each other
and counted separately.
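The grouping described in the note can be sketched in a few lines of Python. This is only an illustration: the tagged tokens, the helper names `lemma_frequencies` and `form_frequencies`, and the one-letter tags "v" and "n" are assumptions made for the example, not the corpus's actual data or tooling.

```python
from collections import Counter

# Hypothetical tagged tokens: (word form, lemma, part of speech).
# The one-letter tags "v" and "n" are placeholders for this sketch.
tokens = [
    ("estou", "estar", "v"),
    ("está", "estar", "v"),
    ("estava", "estar", "v"),
    ("trabalho", "trabalho", "n"),   # um trabalho difícil
    ("trabalho", "trabalhar", "v"),  # eu trabalho com crianças
    ("trabalho", "trabalhar", "v"),
]

def lemma_frequencies(tagged):
    """Count tokens per (lemma, PoS) pair, so homographs stay separate."""
    return Counter((lemma, pos) for _form, lemma, pos in tagged)

def form_frequencies(tagged):
    """Count tokens per surface word form, ignoring lemma and PoS."""
    return Counter(form for form, _lemma, _pos in tagged)

lemma_counts = lemma_frequencies(tokens)
form_counts = form_frequencies(tokens)
```

Here the three occurrences of the form trabalho are split between the noun lemma and the verb lemma, while the three forms of estar collapse into one lemma entry.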
1. port_40k_lemmas (sample): top 40,000 lemmas. Explanation of columns:
ID |
Rank from 1 to 40,000, based on frequency. |
lemma |
The "dictionary / headword" entry. This is why trabalhava or sapatos
would not be included here; they are word forms of the lemmas
trabalhar and sapato. |
PoS |
Part of speech. This is the first letter of the codes from
https://www.corpusdoportugues.org/web-dial/help/posList.asp |
freq |
Total frequency |
texts |
Number of texts (out of the total of 1,111,288) in which at least
one form of the lemma occurs at least once |
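As a rough sketch of how the five columns above might be loaded, the following Python reads them into typed records and derives the share of texts that contain each lemma. The tab-separated layout with a header line is an assumption (the page does not state the file's delimiter), the sample rows are made up, and `read_lemmas` and `text_coverage` are hypothetical helper names.

```python
import csv
import io

TOTAL_TEXTS = 1_111_288  # corpus size given in the [texts] description

# Made-up rows in an ASSUMED tab-separated layout with a header line;
# the real file's delimiter and header are not documented on this page.
SAMPLE = (
    "ID\tlemma\tPoS\tfreq\ttexts\n"
    "1\to\td\t1000\t900\n"
    "2\tsapato\tn\t50\t20\n"
)

def read_lemmas(handle):
    """Parse the lemma list into dictionaries with integer counts."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        rows.append({
            "ID": int(row["ID"]),
            "lemma": row["lemma"],
            "PoS": row["PoS"],
            "freq": int(row["freq"]),
            "texts": int(row["texts"]),
        })
    return rows

def text_coverage(row):
    """Fraction of the corpus texts in which the lemma occurs."""
    return row["texts"] / TOTAL_TEXTS

lemmas = read_lemmas(io.StringIO(SAMPLE))
```

For a real file, `io.StringIO(SAMPLE)` would be replaced by an opened file handle, and the delimiter adjusted to whatever the download actually uses.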
2. port_40k_forms.txt (sample): word forms for the top 40,000 lemmas
(more than 200,000 forms). Shows the frequency of each word form of each of
the top 40,000 lemmas, provided the word form accounts for at
least 1/10,000 of the total lemma frequency. Explanation of columns:
lemID |
Same as [ID] in #1 above |
lemFreq |
Same as [freq] in #1 above |
lemma |
Same as [lemma] in #1 above |
PoS |
Same as [PoS] in #1 above |
wordFreq |
The frequency of the individual word form. The word form must have
a frequency of at least 5, and its frequency must be at least
1/10,000 of the total lemma frequency (for example, if the lemma
occurs 83,930 times, the word form must occur at least 9 times).
For some lower-frequency words, it is possible that not all of the
word forms will be listed (but this can often be compensated for by
using the data from #3 below) |
word |
The individual word forms, e.g.
{estar, está, estão, estava, estou} or {sapato,
sapatos}. |
% form |
The percentage of the overall lemma frequency accounted for by the
particular word form; e.g. if the lemma frequency is 5,000 and the
form frequency is 250, the percentage is 5%. |
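The two numeric rules in this table, the inclusion threshold for [wordFreq] and the [% form] calculation, can be checked with a short Python sketch. The function and constant names are hypothetical, and rounding the 1/10,000 share up to a whole count is an assumption inferred from the 83,930 example.

```python
MIN_FORM_FREQ = 5         # absolute floor stated for [wordFreq]
LEMMA_SHARE_DIV = 10_000  # form must reach 1/10,000 of the lemma frequency

def min_form_frequency(lemma_freq: int) -> int:
    """Smallest whole-number form frequency that satisfies both rules."""
    # Integer ceiling of lemma_freq / 10,000 (avoids floating point).
    share_floor = -(-lemma_freq // LEMMA_SHARE_DIV)
    return max(MIN_FORM_FREQ, share_floor)

def percent_form(word_freq: int, lemma_freq: int) -> float:
    """Share of the lemma frequency contributed by one word form, in %."""
    return 100 * word_freq / lemma_freq

threshold_example = min_form_frequency(83_930)  # the page's own example
percent_example = percent_form(250, 5_000)      # the page's own example
```

With the page's figures, `threshold_example` is 9 and `percent_example` is 5.0. For any lemma with a frequency of 50,000 or less, the floor of 5 is the binding condition.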
3. port_200k.txt (sample): top ~200,000 word forms. This includes all
word forms that occur at least 20 times in the corpus, in at least five
different texts (so a strange name that occurs in just 1 or 2 of the
1,111,288 texts wouldn't be included). Explanation of columns:
ID |
Same as [ID] in
#1 above |
word |
The word form |
freq |
Same as [freq] in
#1 above (but obviously for the individual word form, rather
than lemma) |
#texts |
The number of the 1,111,288 texts in which the word form occurs at
least once. Because this is difficult to calculate for the top 9-10
words, we simply use 1,110,000 for them instead of the actual
number of texts |
%caps |
The percentage of the occurrences counted in [freq] that are
capitalized. This can often be used to distinguish between proper
nouns (capitalized) and common nouns (not capitalized), and this is
particularly useful in a list like this, where there is no
indication of part of speech. |
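The inclusion rule for this list and the use of [%caps] can be illustrated with a minimal Python sketch. The helper names are hypothetical, the 90% cutoff is an arbitrary choice for the example rather than anything the page prescribes, and the sketch assumes [%caps] is expressed on a 0-100 scale.

```python
MIN_FREQ = 20   # a word form must occur at least 20 times in the corpus
MIN_TEXTS = 5   # ...and in at least five different texts

def is_listed(freq: int, texts: int) -> bool:
    """Apply the two inclusion conditions stated for port_200k.txt."""
    return freq >= MIN_FREQ and texts >= MIN_TEXTS

def percent_caps(capitalized: int, freq: int) -> float:
    """Share of a word form's occurrences that are capitalized, in %."""
    return 100 * capitalized / freq

def looks_like_proper_noun(pct_caps: float, cutoff: float = 90.0) -> bool:
    """Heuristic only: the 90% cutoff is an assumption of this sketch."""
    return pct_caps >= cutoff
```

A word form that occurs 500 times but in only 2 texts fails `is_listed`, which matches the "strange name" case described above. Sentence-initial capitalization means common words will rarely score 0%, so any cutoff should be treated as a rough filter.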