Portuguese word frequency data (introduction)

Note: "lemmas" on this page means that all of the different word forms are grouped together. For example, the frequencies of the verb forms {estar, está, estão, estava, estou} are all counted under the single entry {estar}. "Word forms" refers to each of the distinct forms {estar, está, estão, estava, estou}. The lemmatized entries always distinguish by part of speech, however, so trabalho as a noun (um trabalho difícil) and trabalho as a form of the verb trabalhar (eu trabalho com crianças) are always distinguished from each other and counted separately.
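The grouping described in the note can be sketched as a sum over word forms keyed by lemma and part of speech. This is only an illustration: the rows and counts below are invented, and `lemma_frequencies` is a hypothetical helper, not part of the data set.

```python
from collections import defaultdict

# Hypothetical rows: (word form, lemma, part of speech, count).
# The counts are invented for illustration; they are not corpus figures.
tagged_counts = [
    ("estar", "estar", "v", 40),
    ("está", "estar", "v", 120),
    ("estou", "estar", "v", 25),
    ("trabalho", "trabalho", "n", 60),   # um trabalho difícil
    ("trabalho", "trabalhar", "v", 15),  # eu trabalho com crianças
]

def lemma_frequencies(rows):
    """Sum word-form counts under a (lemma, part of speech) key."""
    totals = defaultdict(int)
    for _form, lemma, pos, count in rows:
        totals[(lemma, pos)] += count
    return dict(totals)

totals = lemma_frequencies(tagged_counts)
print(totals[("estar", "v")])      # 185
print(totals[("trabalho", "n")])   # 60
print(totals[("trabalhar", "v")])  # 15
```

Keying on the pair rather than on the lemma alone is what keeps the noun trabalho and the verb trabalhar in separate entries.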

1. port_40k_lemmas (sample): top 40,000 lemmas. Explanation of columns:


[ID] 1-40,000. Rank, based on word frequency.

[lemma] Again, the "dictionary / headword" entry. This is why trabalhava or sapatos would not be included here; they are word forms of the lemmas trabalhar and sapato.

[PoS] Part of speech. This is the first letter of the codes from https://www.corpusdoportugues.org/web-dial/help/posList.asp

[freq] Total frequency of the lemma.

[texts] Number of texts (of the 1,111,288 total) in which at least one form of the lemma occurs at least once.
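A minimal sketch of reading the lemma list, under stated assumptions: the file layout is not described on this page, so the code assumes tab-separated rows with the five columns in the order listed above. `LemmaEntry`, `read_lemmas` and `text_share` are hypothetical names, and the sample rows are invented.

```python
import csv
import io
from typing import NamedTuple

TOTAL_TEXTS = 1_111_288  # number of texts in the corpus, as stated above

class LemmaEntry(NamedTuple):
    rank: int
    lemma: str
    pos: str
    freq: int
    texts: int

def read_lemmas(handle):
    """Parse rows of ID, lemma, PoS, freq, texts (assumed tab-separated)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    entries = []
    for row in reader:
        # Skip header lines, blank lines and anything without five columns.
        if len(row) != 5 or not row[0].strip().isdigit():
            continue
        rank, lemma, pos, freq, texts = (cell.strip() for cell in row)
        entries.append(LemmaEntry(int(rank), lemma, pos, int(freq), int(texts)))
    return entries

def text_share(entry):
    """Fraction of all corpus texts that contain the lemma."""
    return entry.texts / TOTAL_TEXTS

# Invented sample rows standing in for the real file.
sample = (
    "ID\tlemma\tPoS\tfreq\ttexts\n"
    "1\testar\tv\t5000\t900\n"
    "2\ttrabalho\tn\t3000\t700\n"
)
entries = read_lemmas(io.StringIO(sample))
print(entries[0].lemma, entries[0].freq)
```

For the real file, `read_lemmas(open(path, encoding="utf-8"))` would be the equivalent call, provided the assumed layout and encoding hold.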

2. port_40k_forms.txt (sample): top 40,000 lemmas + words (more than 200,000 forms). Shows the frequency of each word form for each of the top 40,000 lemmas, where the word form accounts for at least 1/10,000 of the total lemma frequency. Explanation of columns:


Same as [ID] in #1 above


Same as [freq] in #1 above


Same as [lemma] in #1 above


Same as [PoS] in #1 above


The frequency of the individual word form. The word form must have a frequency of at least 5, and its frequency must be at least 1/10,000 of the total lemma frequency (for example, if the lemma occurs 83,930 times, the word form must occur at least 9 times). For some lower frequency words, it is possible that not all of the word forms will be listed (but this can often be compensated for by using the data from #3 below).


The individual word forms, e.g. {estar, está, estão, estava, estou} or {sapato, sapatos}.

% form

The percentage of the overall lemma frequency accounted for by the particular word form. For example, if the lemma frequency is 5,000 and the form frequency is 250, the percentage is 5%.
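The inclusion rule and the percentage column described above can be sketched as follows. The function names are hypothetical, and the code only restates the two thresholds and the two worked examples given in the text.

```python
MIN_FORM_FREQ = 5          # a word form must occur at least 5 times
SHARE_DENOMINATOR = 10_000  # and at least 1/10,000 of the lemma frequency

def min_form_frequency(lemma_freq):
    """Smallest form frequency that satisfies both thresholds."""
    # Integer ceiling of lemma_freq / 10,000, avoiding float rounding.
    share_floor = (lemma_freq + SHARE_DENOMINATOR - 1) // SHARE_DENOMINATOR
    return max(MIN_FORM_FREQ, share_floor)

def is_listed(form_freq, lemma_freq):
    """True if the word form meets the listing thresholds."""
    return form_freq >= min_form_frequency(lemma_freq)

def percent_of_lemma(form_freq, lemma_freq):
    """Share of the lemma frequency taken by one word form, in percent."""
    return 100 * form_freq / lemma_freq

print(min_form_frequency(83_930))   # 9
print(is_listed(8, 83_930))         # False
print(percent_of_lemma(250, 5000))  # 5.0
```

For lemmas with a frequency of 50,000 or less, the fixed minimum of 5 is the stricter of the two conditions.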

3. port_200k.txt (sample): top ~200,000 word forms. This includes all word forms that occur at least 20 times in the corpus, in at least five different texts (so a strange name that occurs in just 1 or 2 of the 1,111,288 texts wouldn't be included).


Same as [ID] in #1 above


The word form


Same as [freq] in #1 above (but obviously for the individual word form, rather than lemma)


The number of the 1,111,288 texts in which the word form occurs at least once. Due to the difficulty of calculating this for the top 9-10 words, we simply use 1,110,000 for them instead of the actual number of texts.


The percentage of the occurrences in [freq] that are capitalized. This can often be used to distinguish between proper nouns (capitalized) and common nouns (not capitalized), and this is particularly useful in a list like this, where there is no indication of part of speech.
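As a rough sketch, the inclusion filter for this list and the capitalization percentage could be computed like this. It assumes that case variants of a word are folded into one lowercase entry, which this page does not state; the helper names and the token sample are invented.

```python
from collections import Counter

def is_included(freq, texts):
    """Filter stated above: at least 20 occurrences in at least 5 texts."""
    return freq >= 20 and texts >= 5

def capitalization_percentages(tokens):
    """Percent of each word's occurrences that start with a capital letter.

    Assumption: case variants are folded into one lowercase entry.
    """
    total = Counter()
    capitalized = Counter()
    for token in tokens:
        key = token.lower()
        total[key] += 1
        if token[:1].isupper():
            capitalized[key] += 1
    return {word: 100 * capitalized[word] / total[word] for word in total}

# Invented token sample, not corpus data.
tokens = ["Lisboa", "Lisboa", "Lisboa", "Lisboa", "casa", "Casa", "casa", "casa"]
caps = capitalization_percentages(tokens)
print(caps["lisboa"])       # 100.0
print(caps["casa"])         # 25.0
print(is_included(25, 2))   # False
```

A high percentage suggests a proper noun, but a common noun that often begins a sentence will also show some capitalized occurrences, so any cutoff is a judgment call.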