| Source | Notes |
|--------|-------|
| COCA (Corpus of Contemporary American English) | 60k list available for purchase; gold standard for US English |
| SUBTLEX | Based on movie/TV subtitles; good for spoken frequency |
| Google Books Ngram | Can be filtered and exported, but less cleaned |
| Wiktionary frequency lists | Free, but variable quality |
| Leipzig Corpora | Offers word lists for many languages, incl. English |
Most frequency lists stop at 10,000 or 20,000 entries. So why 60,000?
Let’s break down the keyword:
In essence, this file is a massive, sortable database of English vocabulary ranked by real-world utility.
If you find a plain text (.txt) or CSV file with word/frequency columns:
Or use Python (if you have the list in CSV):
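```python
import pandas as pd

# Assumes a headerless, comma-separated file with two columns: word, frequency
df = pd.read_csv("frequency_list.txt", header=None, names=["word", "frequency"])
# Writing .xlsx from pandas requires an Excel engine such as openpyxl
df.to_excel("word_frequency_60k.xlsx", index=False)
```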
A 60,000-word frequency list does not emerge from intuition but from computation. It is the product of a corpus: a massive, structured collection of written and spoken English. Common corpora include the British National Corpus (BNC), the Corpus of Contemporary American English (COCA), or web-derived collections like the Google Books Ngram corpus. The process is deceptively simple: a computer program tokenizes the text (splitting it into words and punctuation), lemmatizes or counts word forms, and then sorts them by raw frequency or by a weighted metric like "frequency per million words."
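The tokenize, count, and sort steps described above can be sketched in a few lines of Python. This is a minimal illustration, not how any published list was actually built: the function name `build_frequency_list`, the regex tokenizer, and the sample sentence are assumptions made for the example, and no lemmatization is performed (each word form is counted separately).

```python
import re
from collections import Counter

def build_frequency_list(text, top_n=60000):
    """Return (word, count, per_million) tuples sorted by descending count."""
    # Crude tokenizer (an assumption): lowercase alphabetic runs,
    # allowing internal straight apostrophes as in "didn't"
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    rows = []
    for word, count in counts.most_common(top_n):
        # Normalize the raw count to occurrences per million running words
        per_million = count / total * 1_000_000
        rows.append((word, count, per_million))
    return rows

# Made-up sample text; a real list would be built from millions of words
sample = "The cat sat on the mat. The dog didn't sit."
freq = build_frequency_list(sample)
for word, count, per_million in freq[:3]:
    print(word, count, round(per_million))
```

On a ten-word sample the per-million figures are meaningless in themselves; the normalization only becomes useful when comparing corpora of different sizes.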
Why 60,000? This number sits at a critical intersection. Research suggests that a typical educated native speaker knows between 20,000 and 35,000 word families. However, passive recognition vocabulary can reach 50,000–75,000 words. A list of 60,000 lemmas or word forms covers the vast majority of running text in general English—often over 98% coverage—while excluding the "long tail" of rare words (e.g., obscure scientific terms, archaic literary words, or highly specialized jargon). Thus, the 60K list is a pragmatic balance between comprehensiveness and utility.
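One way to check a coverage claim against a list you actually have is to compute the cumulative share of tokens accounted for by the top-ranked entries. The sketch below assumes the counts are already sorted from most to least frequent; the function name `coverage_at_ranks` and the toy numbers are invented for illustration. Note that the result is coverage relative to the tokens represented in the list, which only equals corpus coverage if the list includes every word type in the corpus.

```python
from itertools import accumulate

def coverage_at_ranks(counts_by_rank, ranks):
    """Share of all listed tokens covered by the top-k entries, for each k in ranks."""
    total = sum(counts_by_rank)
    if total == 0:
        return {k: 0.0 for k in ranks}
    # cumulative[i] holds the combined count of the i+1 most frequent words
    cumulative = list(accumulate(counts_by_rank))
    result = {}
    for k in ranks:
        k_eff = min(k, len(cumulative))  # clamp ranks beyond the end of the list
        result[k] = cumulative[k_eff - 1] / total if k_eff > 0 else 0.0
    return result

# Toy counts, sorted from most to least frequent (made-up numbers)
toy_counts = [500, 250, 125, 60, 30, 20, 10, 5]
cov = coverage_at_ranks(toy_counts, [1, 3, 8])
print(cov)
```

With a real 60,000-entry list, comparing the values at ranks such as 10,000, 20,000, and 60,000 shows how little each additional block of words adds to coverage.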
However, treating a frequency list as an objective truth is dangerous. Several limitations must be acknowledged.
First, corpus bias. No corpus perfectly represents all English. A list built from newswire text will overrepresent journalistic words (e.g., "alleged," "verdict") and underrepresent conversational words (e.g., "gonna," "yeah"). A list from Twitter will be rich in slang and hashtags but poor in formal expository prose. Most 60K lists blend multiple genres, but residual bias remains.
Second, word sense ambiguity. The list treats each word form as a single entity, but "bank" (financial) and "bank" (river) are different senses with different frequencies. A true frequency list should ideally be sense-disambiguated, but that requires far more complex annotation.
Third, the curse of the long tail. The difference between rank 40,000 and rank 60,000 is minimal in coverage but large in obscurity. Words at this level might appear once in 50 million words of text—hardly worth memorizing for a learner, but crucial for a specialist.
Fourth, grammar and collocation. Frequency lists ignore syntax. Knowing that "make" is common is useless unless you also know it forms "make a decision" (not "do a decision"). A word list does not teach patterns.
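The pattern information that a plain word list lacks has to come from the running text itself. As a rough sketch, the snippet below counts which words follow a given node word within a small window. The function name `right_collocates`, the window size, and the sample sentences are assumptions made for illustration, and real collocation work would typically use an association measure rather than raw counts.

```python
import re
from collections import Counter

def right_collocates(text, node, window=3):
    """Count words appearing within `window` tokens to the right of `node`."""
    # Punctuation is discarded, so a large window can cross sentence boundaries
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            hits.update(tokens[i + 1 : i + 1 + window])
    return hits

# Made-up sample text for illustration
sample = (
    "We must make a decision today. They make a decision quickly. "
    "Do your homework and make a plan."
)
col = right_collocates(sample, "make", window=2)
print(col.most_common(3))
```

Even on this tiny sample, "decision" surfaces as a frequent partner of "make", which is the kind of information a ranked word list cannot show.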