A few months ago I wrote a spider to fetch content from various RSS feeds and count the words in the posts.  Since then, it's fetched over 20,000 documents, discovered usages of 50,000 words, and counted a grand total of 18 million word instances.

Results

The 30 most common words are:

the to and of in was that her he his you it for on with she as at is had him be said me but have my up from not

"the" alone accounts for almost 5% of all words used.

The most common words of at least 10 characters are:

everything government understand completely immediately information especially expression themselves remembered

"everything" makes up only 0.04% of words in the corpus.

Full results can be found in this CSV file.  The file is in the public domain, but if you discover anything interesting when using it, I'd love to hear about it.

Word distribution shows the expected power law graph:
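A rough way to check for a power law is to sort the word counts from most to least common and fit a straight line to log(count) against log(rank); a slope near -1 is the classic Zipf pattern. The sketch below is only an illustration in Python, not the code used for this post, and the function name loglog_slope is made up for the example.

```python
import math

def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    `counts` is any iterable of positive word counts; rank 1 is the
    most common word. (Hypothetical helper, not from the original spider.)
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two non-zero counts")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least squares: slope = cov(x, y) / var(x)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator
```

Feeding it the count column from the results file would give a single number to compare against -1. Real corpora usually bend away from a straight line at both the very common and the very rare ends, so the fit is only a rough summary.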

Letter distribution is more interesting:

Methodology

The corpus was built from a variety of RSS feeds, including Newsweek, People, TechCrunch, FanFiction.net, and Sports Illustrated.

The spider fetched new entries from these feeds every 4 hours.  Each link was followed, and the page was converted to a rough text format and split into sentences.  Because it was hard to separate page structure from content, each distinct sentence was counted only once, in an attempt to reduce the prevalence of text like "Click here to sign up" or "Make CNN your home page".  Sentences with two or fewer words, made up entirely of capitalized words, or consisting entirely of lower case letters were also ignored.
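The sentence filtering described above can be sketched roughly as follows. This is a Python illustration, not the original spider. The regex splitter is a crude stand-in for a proper sentence splitter, all function names are invented for the example, and "capitalized words" is assumed to mean that every word starts with an upper case letter.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def keep_sentence(sentence):
    """Apply the three filters from the text (interpretation is an assumption)."""
    words = sentence.split()
    if len(words) <= 2:
        return False  # two or fewer words
    alpha_words = [w for w in words if w[0].isalpha()]
    if alpha_words and all(w[0].isupper() for w in alpha_words):
        return False  # every word capitalized, e.g. a headline or menu item
    if not any(c.isupper() for c in sentence):
        return False  # no upper case letters at all
    return True

def unique_sentences(documents):
    """Yield each acceptable sentence once across the whole corpus."""
    seen = set()
    for text in documents:
        for sentence in split_sentences(text):
            if sentence in seen or not keep_sentence(sentence):
                continue
            seen.add(sentence)
            yield sentence
```

Counting each sentence only once means a line of page furniture that appears on every article contributes a single occurrence instead of thousands. The cost is that a genuinely repeated sentence in real content is also counted once.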

Each sentence was then split into words, and the word list was filtered against the ENABLE2K data set.  96,000 of the words found were not in ENABLE2K and so were not counted as valid English; these may have been names or word fragments.  125,000 words in the ENABLE2K set were not found in any document.
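As a minimal sketch of this step, again in Python and not the original code: words are lower-cased, counted, and sorted into those that appear in a reference word list and those that do not. The function name count_words and the simple [a-z]+ tokenizer are assumptions, and the word list is passed in as a plain set instead of being loaded from the ENABLE2K file.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of ASCII letters only, so "don't" becomes
# "don" and "t". Fragments like these end up in the "unknown" bucket.
WORD_RE = re.compile(r"[a-z]+")

def count_words(sentences, dictionary):
    """Count words, separating dictionary words from everything else.

    `dictionary` is a set of lower-case valid words.
    Returns (known counts, unknown counts, dictionary words never seen).
    """
    known = Counter()
    unknown = Counter()
    for sentence in sentences:
        for word in WORD_RE.findall(sentence.lower()):
            if word in dictionary:
                known[word] += 1
            else:
                unknown[word] += 1
    unused = dictionary.difference(known)
    return known, unknown, unused
```

With these three results, a word's share of the corpus is known[word] / sum(known.values()), the size of unknown corresponds to the count of words that were not valid English, and the size of unused corresponds to the dictionary words not found in any document.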

Caveats

The system is far from perfect: for instance, the word "de" is erroneously counted among the top 50 most popular words.

References

Text::Sentence

ENABLE2K

Similar Lists
http://esl.about.com/library/vocabulary/bl1000_list1.htm

 


Comments

gene

Tue, 08 Apr 2008 09:01:55

here's the lists i use for english stopwords:

http://esl.about.com/library/vocabulary/bl1000_list1.htm

 

Frank Hopkins

Thu, 06 Nov 2008 06:23:16

Do you have an updated list of common English words that you have gathered from the web lately?

 


