In addition to frequency lists for English, we also have what we believe are the most accurate frequency lists for Portuguese, containing the top 40,000 lemmas / words in the language. The Portuguese data is based on 1 billion words of data in the Web/Dialects corpus from the Corpus do Português.
When you purchase the data, you will receive three different datasets (see detailed information on format):
1. Top 40,000 lemmas (see samples), where a lemma covers all forms of a word (e.g. estar = estar, está, estão, estava, estou).
Rank | Lemma | PoS | Frequency | # texts |
15285 | embaixadora | n | 1232 | 908 |
15295 | fosfato | n | 1230 | 734 |
15305 | destronar | v | 1227 | 1111 |
15315 | susceptibilidade | n | 1226 | 1034 |
15325 | desvelar | v | 1224 | 1043 |
15335 | canonizar | v | 1222 | 923 |
15345 | guinada | n | 1220 | 1091 |
15355 | notabilizar | v | 1219 | 1140 |
15365 | baço | j | 1217 | 1057 |
15375 | alcoolizar | v | 1215 | 946 |
15385 | quilate | n | 1213 | 753 |
15395 | apologético | j | 1210 | 776 |
2. Frequency of the forms of the top 40,000 lemmas (see samples)
Lemma rank | Lemma freq | Lemma | PoS | Word freq | Word | % of lemma |
5355 | 9173 | contador | n | 7214 | contador | 0.786 |
5355 | 9173 | contador | n | 1958 | contadores | 0.213 |
5365 | 9149 | ambicioso | j | 4558 | ambicioso | 0.498 |
5365 | 9149 | ambicioso | j | 1930 | ambiciosos | 0.211 |
5365 | 9149 | ambicioso | j | 1899 | ambiciosa | 0.208 |
5365 | 9149 | ambicioso | j | 760 | ambiciosas | 0.083 |
5375 | 9129 | sensacional | j | 7672 | sensacional | 0.840 |
5375 | 9129 | sensacional | j | 1441 | sensacionais | 0.158 |
5385 | 9100 | viabilizar | v | 4633 | viabilizar | 0.509 |
5385 | 9100 | viabilizar | v | 813 | viabiliza | 0.089 |
5385 | 9100 | viabilizar | v | 519 | viabilizando | 0.057 |
5385 | 9100 | viabilizar | v | 398 | viabilizou | 0.044 |
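In the samples above, the "% of lemma" column appears to be the word frequency divided by the lemma frequency, stored as a proportion between 0 and 1 and rounded to three decimals. The following is a minimal Python sketch that checks this on a few of the sample rows. It assumes the rows are pipe-delimited exactly as displayed here; the delivered files may use a different layout, and the names `parse_form_rows` and `share_of_lemma` are hypothetical helpers written for this illustration.

```python
# Sample rows copied from the table above. The pipe-delimited layout is an
# assumption based on this page, not on the documented file format.
SAMPLE = """\
5355 | 9173 | contador | n | 7214 | contador | 0.786 |
5355 | 9173 | contador | n | 1958 | contadores | 0.213 |
5375 | 9129 | sensacional | j | 7672 | sensacional | 0.840 |
5375 | 9129 | sensacional | j | 1441 | sensacionais | 0.158 |
"""


def parse_form_rows(text):
    """Parse 'forms of lemmas' rows into a list of dictionaries."""
    rows = []
    for line in text.splitlines():
        # Drop the trailing pipe, then split into the seven expected cells.
        cells = [c.strip() for c in line.strip().rstrip("|").split("|")]
        if len(cells) != 7:
            continue  # skip headers, blank lines, or malformed rows
        rank, lemma_freq, lemma, pos, word_freq, word, share = cells
        rows.append({
            "lemma_rank": int(rank),
            "lemma_freq": int(lemma_freq),
            "lemma": lemma,
            "pos": pos,
            "word_freq": int(word_freq),
            "word": word,
            "share": float(share),
        })
    return rows


def share_of_lemma(row):
    """Recompute the proportion of the lemma accounted for by this form."""
    return row["word_freq"] / row["lemma_freq"]


rows = parse_form_rows(SAMPLE)
for row in rows:
    # Print the listed value next to the recomputed one for comparison.
    print(row["word"], row["share"], round(share_of_lemma(row), 3))
```

A check like this can be useful after loading the data, since a mismatch between the listed and recomputed values would suggest that columns were misaligned during parsing.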
3. Frequency of the top 200,000 word forms (see samples), not lemmatized or tagged for part of speech. This list shows # texts (to find and perhaps remove words that occur in just a few texts) and % capitalized (to find words that are probably proper nouns).
Rank | Word form | Frequency | # texts | % capitalized |
38265 | avelãs | 854 | 555 | 0.25 |
38275 | deluxe | 853 | 524 | 0.74 |
38285 | risível | 853 | 741 | 0.08 |
38295 | proletário | 853 | 559 | 0.09 |
38305 | desacelerar | 852 | 773 | 0.03 |
38315 | placa-mãe | 852 | 329 | 0.03 |
38325 | etéreo | 851 | 713 | 0.03 |
38335 | monstruosidade | 851 | 778 | 0.02 |
38345 | mises | 851 | 464 | 0.97 |
38355 | empenhadas | 851 | 815 | 0.01 |
38365 | imunização | 850 | 571 | 0.13 |
38375 | telex | 850 | 295 | 0.32 |
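The two extra columns in this list can be used as simple filters: a low "# texts" value points to words concentrated in a few documents, and a high "% capitalized" value (a proportion between 0 and 1 in the samples) points to likely proper nouns. The sketch below shows one way to apply both filters in Python. It assumes pipe-delimited rows as displayed on this page, and the thresholds `min_texts=300` and `max_capitalized=0.5` are arbitrary illustrative values, not recommendations from the data provider; `parse_word_rows` and `keep_word` are hypothetical names.

```python
# Sample rows copied from the table above. The pipe-delimited layout is an
# assumption based on this page, not on the documented file format.
SAMPLE = """\
38275 | deluxe | 853 | 524 | 0.74 |
38285 | risível | 853 | 741 | 0.08 |
38315 | placa-mãe | 852 | 329 | 0.03 |
38345 | mises | 851 | 464 | 0.97 |
38375 | telex | 850 | 295 | 0.32 |
"""


def parse_word_rows(text):
    """Parse word-form rows into a list of dictionaries."""
    rows = []
    for line in text.splitlines():
        # Drop the trailing pipe, then split into the five expected cells.
        cells = [c.strip() for c in line.strip().rstrip("|").split("|")]
        if len(cells) != 5:
            continue  # skip headers, blank lines, or malformed rows
        rank, word, freq, texts, capitalized = cells
        rows.append({
            "rank": int(rank),
            "word": word,
            "freq": int(freq),
            "texts": int(texts),
            "capitalized": float(capitalized),
        })
    return rows


def keep_word(row, min_texts=300, max_capitalized=0.5):
    """Keep words found in enough texts that are not mostly capitalized.

    Both thresholds are illustrative assumptions; tune them for your use.
    """
    return row["texts"] >= min_texts and row["capitalized"] <= max_capitalized


rows = parse_word_rows(SAMPLE)
kept = [row["word"] for row in rows if keep_word(row)]
print(kept)
```

With these thresholds, "deluxe" and "mises" are dropped as probable proper nouns and "telex" is dropped for appearing in too few texts, leaving "risível" and "placa-mãe".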
To order the data:
1. Download and fill out the license agreement (academic, non-academic) and then send it back to us as an attachment.
- To receive academic pricing, you must send the license agreement from an academic email address (i.e. not Gmail or a similar service).
- The license agreement states that you will not give the data to anyone outside of your university or company (which also means that you cannot post it on the web).
2. Once we receive the license agreement, we will send you a request for payment from PayPal.
3. As soon as we receive confirmation of the payment, we will send you the link to download the data.
Thanks for your interest.