Spanish word frequency data
(introduction)
 

Note: "lemmas" on this page means that all of the different word forms are grouped together. For example, the frequencies of the verb forms {estar, está, estoy, estuvimos} are all grouped together under the single entry {estar}. "Word forms" refers to each of the distinct forms {estar, está, estoy, estuvimos}. The lemmatized entries always distinguish by part of speech, however, so that trabajo as a noun (un trabajo difícil) and trabajo as a form of the verb trabajar (yo trabajo en la fábrica) are always distinguished from each other and counted separately.
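As a rough illustration of the grouping described in the note, the sketch below sums word-form counts under a (lemma, part of speech) key. The counts and the one-letter PoS codes "v" and "n" are invented for the example and are not taken from the corpus; the function name is hypothetical.

```python
from collections import Counter

# Hypothetical (word form, lemma, PoS, count) rows. The counts are made up
# for illustration only and are NOT figures from the corpus.
tagged_forms = [
    ("estar", "estar", "v", 10),
    ("está", "estar", "v", 40),
    ("estoy", "estar", "v", 8),
    ("estuvimos", "estar", "v", 2),
    ("trabajo", "trabajo", "n", 30),   # un trabajo difícil
    ("trabajo", "trabajar", "v", 5),   # yo trabajo en la fábrica
]

def lemma_frequencies(rows):
    """Sum word-form counts under their (lemma, PoS) entry."""
    totals = Counter()
    for _form, lemma, pos, count in rows:
        totals[(lemma, pos)] += count
    return totals

totals = lemma_frequencies(tagged_forms)
```

Because the key includes the part of speech, the noun trabajo and the verb form trabajo end up in separate entries even though the spelling is identical.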

1. span_40k_lemmas (sample): top 40,000 lemmas. Explanation of columns:

ID

Rank order, 1-40,000. Based on word frequency.

lemma

Again, the "dictionary / headword" entry. This is why trabajaba or zapatos would not be included here; they are word forms of the lemmas trabajar and zapato.

PoS

Part of speech. This is the first letter of the codes from https://www.corpusdelespanol.org/web-dial/help/posList.asp

freq

Total frequency

texts

Number of texts (from the total of 2,127,738) in which at least one form of the lemma occurs at least once
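A minimal sketch of reading one row of the lemma list, assuming the file is tab-separated with the five columns in the order listed above (the actual delimiter and layout may differ). The sample row uses invented numbers, and the function names are hypothetical.

```python
TOTAL_TEXTS = 2_127_738  # corpus size stated in the column description

def parse_lemma_row(line):
    """Split one row into typed fields.

    Assumption: tab-separated, columns in the order ID, lemma, PoS, freq, texts.
    """
    rank, lemma, pos, freq, texts = line.rstrip("\n").split("\t")
    return {
        "ID": int(rank),
        "lemma": lemma,
        "PoS": pos,
        "freq": int(freq),
        "texts": int(texts),
    }

def text_share(row):
    """Fraction of all corpus texts containing at least one form of the lemma."""
    return row["texts"] / TOTAL_TEXTS

# Placeholder row with invented numbers, only to show the shape of the data.
row = parse_lemma_row("123\tzapato\tn\t50000\t21277\n")
share = text_share(row)
```

Dividing the texts column by the corpus size gives a simple measure of how widely a lemma is spread across texts, as opposed to how often it occurs overall.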


2. span_40k_forms.txt (sample): top 40,000 lemmas + words (more than 200,000 forms). Shows the frequency of each word form for each of the top 40,000 lemmas, where the word form accounts for at least 1/10,000 of the total lemma frequency. Explanation of columns:

lemID

Same as [ID] in #1 above

lemFreq

Same as [freq] in #1 above

lemma

Same as [lemma] in #1 above

PoS

Same as [PoS] in #1 above

wordFreq

The frequency of the individual word forms. The word form must have a frequency of at least 5, and its frequency must be at least 1/10,000 of the total lemma frequency (for example, if the lemma occurs 83,930 times, the word form must occur at least 9 times). For some lower frequency words, it is possible that not all of the word forms will be listed (but this can often be compensated for by using the data from #3 below).

word

The individual word forms, e.g. {estar, está, estoy, estuvimos}, {seco, seca, secos, secas} or {zapato, zapatos}.

% form

The percentage of the overall lemma frequency accounted for by the particular word form. For example, if the lemma frequency is 5,000 and the form frequency is 250, the percentage is 5%.
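The inclusion rule for [wordFreq] and the calculation behind [% form] can be sketched as follows. The function and constant names are hypothetical; only the two thresholds (a frequency of at least 5, and at least 1/10,000 of the lemma frequency) come from the descriptions above.

```python
MIN_FORM_FREQ = 5           # absolute floor for a word form
LEMMA_SHARE_DIVISOR = 10_000  # form must reach 1/10,000 of the lemma frequency

def min_listed_freq(lem_freq):
    """Smallest word-form frequency that satisfies both conditions."""
    # Integer ceiling of lem_freq / 10,000, avoiding floating-point rounding.
    share_floor = (lem_freq + LEMMA_SHARE_DIVISOR - 1) // LEMMA_SHARE_DIVISOR
    return max(MIN_FORM_FREQ, share_floor)

def is_listed(word_freq, lem_freq):
    """True if a word form meets the thresholds for inclusion."""
    return word_freq >= min_listed_freq(lem_freq)

def percent_of_lemma(word_freq, lem_freq):
    """Share of the lemma frequency accounted for by one word form, in percent."""
    return 100 * word_freq / lem_freq

threshold = min_listed_freq(83_930)   # the example from the wordFreq description
pct = percent_of_lemma(250, 5_000)    # the example from the % form description
```

For lemmas with a frequency of 50,000 or less, the 1/10,000 share is at most 5, so the absolute floor of 5 is the condition that decides inclusion.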


3. span_200k.txt (sample): top ~200,000 word forms. This includes all word forms that occur in at least five different texts (so a strange name that occurs in just 1 or 2 of the 2,127,738 texts wouldn't be included).

ID

Same as [ID] in #1 above, but this time applied just to the word form.

word

The word form

freq

Same as [freq] in #1 above (but obviously for the individual word form, rather than the lemma).

#texts

The number of the 2,127,738 texts in which the word form occurs at least once. Because this is difficult to calculate for the top 9-10 words, we simply use 2,100,000 for them instead of the actual number of texts.

%caps

The percentage of [freq] that is capitalized. This can often be used to distinguish between proper nouns (capitalized) and common nouns (not capitalized), and this is particularly useful in a list like this, where there is no indication of part of speech.
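One possible way to use [%caps] as a rough filter is shown below. This is a heuristic sketch: it assumes the value is expressed on a 0-100 scale, and the 90% cutoff is an arbitrary illustrative choice, not a value recommended by the data. Sentence-initial capitalization means common words will rarely sit at exactly 0%, so any cutoff needs checking against the actual list.

```python
def likely_proper_noun(pct_caps, cutoff=90.0):
    """Heuristic: flag a word form as a probable proper noun.

    Assumptions: pct_caps is on a 0-100 scale; cutoff is an arbitrary
    illustrative threshold that should be tuned against real data.
    """
    return pct_caps >= cutoff

# Invented %caps values, only to show how the filter behaves.
examples = {"mostly_capitalized": 98.5, "rarely_capitalized": 3.2}
flags = {name: likely_proper_noun(value) for name, value in examples.items()}
```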