Word frequency data


In addition to frequency lists for English, we have what we believe are the most accurate frequency lists for Spanish, containing the top 40,000 lemmas / words in the language. The Spanish data is based on 1 billion words in the Web/Dialects corpus from the Corpus del Español.

When you purchase the data, you will receive three different datasets (see detailed information on format):

1. Top 40,000 lemmas (see samples), where lemma = all forms of a word (e.g. estar = estar, estamos, estaba, estarán).
Rank    Lemma            PoS   Frequency   # texts
18895   jubileo          n     1630        813
18905   crecientemente   r     1628        1439
18915   filibustero      n     1626        745
18925   cubrimiento      n     1625        1260
18935   caradura         n     1623        1316
18945   aspirante        j     1622        1449
18955   fatalmente       r     1620        1484
18965   sima             n     1617        1020
18975   proscribir       v     1614        1337
18985   aullar           v     1612        1332
18995   sollozar         v     1609        1352
19005   resbaladizo      j     1607        1412

2. Frequency of forms of top 40,000 lemmas (see samples)
Lemma rank   Lemma freq   Lemma          PoS   Word freq   Word           % of lemma
8715         8146         vegetariano    j     4534        vegetariana    0.557
8715         8146         vegetariano    j     1344        vegetariano    0.165
8715         8146         vegetariano    j     1136        vegetarianos   0.139
8715         8146         vegetariano    j     1132        vegetarianas   0.139
8725         8132         occidental     n     4297        occidentales   0.528
8725         8132         occidental     n     3834        occidental     0.471
8735         8116         teóricamente   r     7538        teóricamente   0.929
8735         8116         teóricamente   r     578         teoricamente   0.071
8745         8101         susurrar       v     1555        susurra        0.192
8745         8101         susurrar       v     1346        susurró        0.166
8745         8101         susurrar       v     1023        susurrar       0.126
8745         8101         susurrar       v     775         susurrando     0.096
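
The "% of lemma" column appears to be the word form's frequency divided by the total frequency of its lemma, given as a proportion between 0 and 1. The following Python sketch checks that reading against a few of the sample rows. The function name and the tuple layout are assumptions made for illustration; they are not part of the delivered data format.

```python
# Sample rows copied from the forms table:
# (lemma, lemma frequency, word form, word frequency, published "% of lemma")
SAMPLE_FORMS = [
    ("vegetariano", 8146, "vegetariana", 4534, 0.557),
    ("vegetariano", 8146, "vegetariano", 1344, 0.165),
    ("occidental", 8132, "occidentales", 4297, 0.528),
    ("teóricamente", 8116, "teoricamente", 578, 0.071),
    ("susurrar", 8101, "susurrando", 775, 0.096),
]


def share_of_lemma(word_freq: int, lemma_freq: int) -> float:
    """Return the word form's share of its lemma, rounded to three decimals.

    Hypothetical helper: assumes the published value is word_freq / lemma_freq.
    """
    return round(word_freq / lemma_freq, 3)


for lemma, lemma_freq, word, word_freq, published in SAMPLE_FORMS:
    computed = share_of_lemma(word_freq, lemma_freq)
    print(f"{word:<14} {computed:.3f} (published {published:.3f})")
```

In the sample, the four forms of vegetariano sum to 8146, the lemma frequency, which is consistent with this interpretation.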

3. Frequency of top 200,000 word forms (see samples), not lemmatized or tagged for part of speech. Shows # texts (to find and perhaps remove words that occur in just a few texts) and % capitalized (to find words that are probably proper nouns).
Rank    Word form     Frequency   # texts   % capitalized
30585   tripulados    2389        1587      0.03
30595   previenen     2388        2073      0.04
30605   atmósferas    2386        1839      0.02
30615   poquitos      2385        2136      0.02
30625   monóxido      2384        1414      0.07
30635   influyendo    2382        2199      0.01
30645   cerezo        2381        1360      0.70
30655   engañen       2378        2034      0.01
30665   prospección   2377        1586      0.09
30675   joy           2376        1651      0.93
30685   bernarda      2375        588       0.99
30695   detenemos     2374        2169      0.01
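
As a rough illustration of how the # texts and % capitalized columns might be used, the Python sketch below drops word forms that occur in few texts or that are capitalized most of the time. The thresholds (1000 texts, 0.5 capitalized), the `WordForm` type and the `filter_forms` function are all assumptions chosen for this example, and the % capitalized values are treated as proportions between 0 and 1, as they appear in the sample.

```python
from typing import NamedTuple


class WordForm(NamedTuple):
    rank: int
    word: str
    frequency: int
    texts: int
    pct_capitalized: float  # proportion between 0 and 1, as in the sample


# A few rows copied from the word-form sample.
SAMPLE_WORD_FORMS = [
    WordForm(30585, "tripulados", 2389, 1587, 0.03),
    WordForm(30645, "cerezo", 2381, 1360, 0.70),
    WordForm(30675, "joy", 2376, 1651, 0.93),
    WordForm(30685, "bernarda", 2375, 588, 0.99),
    WordForm(30695, "detenemos", 2374, 2169, 0.01),
]


def filter_forms(forms, min_texts=1000, max_capitalized=0.5):
    """Keep forms found in enough texts that are not mostly capitalized.

    Both thresholds are hypothetical; suitable values depend on the use case.
    """
    return [
        form
        for form in forms
        if form.texts >= min_texts and form.pct_capitalized <= max_capitalized
    ]


kept = filter_forms(SAMPLE_WORD_FORMS)
print([form.word for form in kept])  # ['tripulados', 'detenemos']
```

With these thresholds, bernarda and joy are excluded as likely proper nouns. So is cerezo (0.70), which is also a common noun, so a capitalization cutoff alone can remove legitimate vocabulary.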

To order the data:

1. Download and fill out the license agreement (academic, non-academic) and then send it back to us as an attachment.

  • To receive academic pricing, you must send the license agreement from an academic email address (i.e. not Gmail or a similar personal account).

  • The license agreement states that you will not give the data to anyone else outside of your university or company (which also means that you cannot post it on the web).

2. Once we receive the license agreement, we will send you a request for payment from PayPal.

3. As soon as we receive confirmation of the payment, we will send you the link to download the data.

Thanks for your interest.