Portuguese word frequency data (introduction) 
 
				
Note: "lemmas" on this page means that all of the different word forms are grouped together. For example, the frequencies of the verb forms {estar, está, estão, estava, estou} are all combined under the single entry {estar}. "Word forms" refers to each of the distinct forms {estar, está, estão, estava, estou}. The lemmatized entries always distinguish by part of speech, however, so trabalho as a noun (um trabalho difícil) and trabalho as a form of the verb trabalhar (eu trabalho com crianças) are always kept apart and counted separately.
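The difference between word-form counts and lemma counts can be sketched in a few lines of Python. This is only an illustration: the tagged tokens and the one-letter part-of-speech codes ("n", "v") below are invented, not taken from the corpus.

```python
from collections import Counter

# Hypothetical tagged tokens: (word form, lemma, part of speech).
# These tokens and PoS codes are invented for illustration only.
tagged_tokens = [
    ("estou", "estar", "v"),
    ("está", "estar", "v"),
    ("estão", "estar", "v"),
    ("trabalho", "trabalho", "n"),   # um trabalho difícil
    ("trabalho", "trabalhar", "v"),  # eu trabalho com crianças
    ("trabalhava", "trabalhar", "v"),
]

# Word-form counts key on the surface form alone.
form_freq = Counter(form for form, _, _ in tagged_tokens)

# Lemma counts key on (lemma, PoS), so noun and verb readings stay separate.
lemma_freq = Counter((lemma, pos) for _, lemma, pos in tagged_tokens)

print(form_freq["trabalho"])           # 2
print(lemma_freq[("estar", "v")])      # 3
print(lemma_freq[("trabalho", "n")])   # 1
print(lemma_freq[("trabalhar", "v")])  # 2
```

Keying the lemma counter on the pair (lemma, PoS) is what keeps the noun trabalho apart from the verb trabalhar, even though both share the surface form trabalho.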
				 
			 
1. port_40k_lemmas (sample): top 40,000 lemmas. Explanation of columns:
			
				
| ID | 1-40,000. Based on word frequency. |
| lemma | Again, the "dictionary / headword" entry. This is why trabalhava or sapatos would not be included here; they are word forms of the lemmas trabalhar and sapato. |
| PoS | Part of speech. This is the first letter of the codes from https://www.corpusdoportugues.org/web-dial/help/posList.asp |
| freq | Total frequency. |
| texts | Number of texts (from the total of 1,111,288) in which at least one form of the lemma occurs at least once. |
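As a rough sketch of how these columns might be used, the Python below reads lemma rows and adds the share of texts in which each lemma occurs. The file layout is an assumption: this page does not state the delimiter or whether there is a header row, so the code assumes tab-separated values in the column order listed above, with no header. The sample rows, their numbers, and the PoS codes are invented.

```python
import csv
import io

TOTAL_TEXTS = 1_111_288  # total number of texts, as stated above

# Invented rows in an ASSUMED tab-separated layout: ID, lemma, PoS, freq, texts.
sample = "1\testar\tv\t1000\t900\n2\tsapato\tn\t50\t20\n"

def read_lemmas(handle):
    """Yield one dict per lemma row, adding the share of texts it occurs in."""
    reader = csv.reader(handle, delimiter="\t")
    for rank, lemma, pos, freq, texts in reader:
        yield {
            "ID": int(rank),
            "lemma": lemma,
            "PoS": pos,
            "freq": int(freq),
            "texts": int(texts),
            # Hypothetical derived field, not a column in the data.
            "text_share": int(texts) / TOTAL_TEXTS,
        }

rows = list(read_lemmas(io.StringIO(sample)))
for row in rows:
    print(row["ID"], row["lemma"], row["PoS"], f"{row['text_share']:.4%}")
```

To read a real file, io.StringIO(sample) would be replaced by an open file handle, after checking the actual delimiter, encoding, and header row.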
				 
				 
			 
2. port_40k_forms.txt (sample): the top 40,000 lemmas plus their word forms (more than 200,000 forms). Shows the frequency of each word form for each of the top 40,000 lemmas, provided the word form accounts for at least 1/10,000 of the total lemma frequency. Explanation of columns:
			
				
| lemID | Same as [ID] in #1 above. |
| lemFreq | Same as [freq] in #1 above. |
| lemma | Same as [lemma] in #1 above. |
| PoS | Same as [PoS] in #1 above. |
| wordFreq | The frequency of the individual word form. The word form must have a frequency of at least 5, and its frequency must be at least 1/10,000 of the total lemma frequency (for example, if the lemma occurs 83,930 times, the word form must occur at least 9 times). For some lower-frequency words, it is possible that not all of the word forms will be listed (but this can often be compensated for by using the data from #3 below). |
| word | The individual word forms, e.g. {estar, está, estão, estava, estou} or {sapato, sapatos}. |
| % form | The percentage of the overall lemma frequency accounted for by the particular word form, e.g. if lemma = 5000 and form = 250, the percentage = 5%. |
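The wordFreq cutoff and the % form column both come down to simple arithmetic, sketched below. The function names are hypothetical, and rounding the 1/10,000 threshold up to the next whole number is an assumption inferred from the 83,930 example (83,930 / 10,000 = 8.393, so at least 9).

```python
MIN_FORM_FREQ = 5        # absolute floor stated above
LEMMA_DIVISOR = 10_000   # a form must reach 1/10,000 of the lemma frequency

def min_form_freq(lem_freq: int) -> int:
    """Smallest wordFreq that satisfies both conditions (hypothetical helper)."""
    # Integer ceiling division avoids floating-point rounding.
    relative_floor = -(-lem_freq // LEMMA_DIVISOR)
    return max(MIN_FORM_FREQ, relative_floor)

def pct_form(word_freq: int, lem_freq: int) -> float:
    """Percentage of the lemma frequency accounted for by one word form."""
    return 100 * word_freq / lem_freq

print(min_form_freq(83_930))  # 9
print(min_form_freq(20_000))  # 5
print(pct_form(250, 5000))    # 5.0
```

For lemmas with a frequency of 50,000 or less, the absolute floor of 5 is the binding condition; above that, the 1/10,000 rule takes over.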
				 
				 
			 
3. port_200k.txt (sample): top ~200,000 word forms. This includes all word forms that occur at least 20 times in the corpus and in at least five different texts (so an unusual name that occurs in just 1 or 2 of the 1,111,288 texts would not be included). Explanation of columns:
			
				
| ID | Same as [ID] in #1 above (a rank based on frequency, here for word forms rather than lemmas). |
| word | The word form. |
| freq | Same as [freq] in #1 above (but for the individual word form rather than the lemma). |
| #texts | The number of the 1,111,288 texts in which the word form occurs at least once. Because this is difficult to calculate for the top 9-10 words, those entries simply use 1,110,000 rather than the actual count. |
| %caps | The percentage of [freq] that is capitalized. This can often be used to distinguish between proper nouns (capitalized) and common nouns (not capitalized), which is particularly useful in a list like this, where there is no indication of part of speech. |
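The inclusion rule and the %caps heuristic might be applied along the following lines. This is a sketch under stated assumptions: the sample entries are invented, %caps is assumed to be on a 0-100 scale, and the 90% cutoff for "probably a proper noun" is an arbitrary illustrative choice, not a value given on this page.

```python
from typing import NamedTuple

class WordForm(NamedTuple):
    word: str
    freq: int
    texts: int
    pct_caps: float  # %caps, ASSUMED here to be on a 0-100 scale

def meets_inclusion_rule(w: WordForm) -> bool:
    """At least 20 occurrences, in at least five different texts."""
    return w.freq >= 20 and w.texts >= 5

def likely_proper_noun(w: WordForm, threshold: float = 90.0) -> bool:
    """Heuristic only: the threshold is an arbitrary illustrative cutoff."""
    return w.pct_caps >= threshold

# Invented entries, for illustration only.
sample = [
    WordForm("lisboa", 500, 300, 99.0),
    WordForm("sapato", 400, 250, 2.0),
    WordForm("xyzzy", 40, 2, 100.0),   # frequent enough, but in too few texts
]

kept = [w for w in sample if meets_inclusion_rule(w)]
proper = [w.word for w in kept if likely_proper_noun(w)]
common = [w.word for w in kept if not likely_proper_noun(w)]

print([w.word for w in kept])  # ['lisboa', 'sapato']
print(proper)                  # ['lisboa']
print(common)                  # ['sapato']
```

Sentence-initial capitalization means that common words also have a nonzero %caps, so any cutoff should be treated as a rough filter rather than a part-of-speech tag.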
				 
				 