Generating a word list from Wikipedia

Holy shit, a site update! And after only 6 weeks too! Great.

This update is mainly a small program that shows how to parse huge XML files (about 3.5 GB) with C#. Recently I needed a giant word list, and all the word lists I found on the internet were unsatisfactory. So I decided to make my own, and the best source for words right now is probably Wikipedia (which you can thankfully download in XML format).
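For readers who just want the general idea of parsing a file that is too large to load at once, here is a minimal sketch in Python (not the C# program from this post). It streams the XML and throws away each page once it has been read. The element names `page` and `text`, the namespace URL in the sample, and the helper name `iter_page_texts` are assumptions made for illustration; check them against the dump you actually download.

```python
import io
import xml.etree.ElementTree as ET


def iter_page_texts(source):
    """Yield the content of every <text> element without loading the whole file."""
    # "start" events are requested only to get hold of the root element,
    # so finished pages can be detached from it and memory use stays flat.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event != "end":
            continue
        # Tags may carry an XML namespace prefix like "{uri}text";
        # compare the local name only.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "text":
            yield elem.text or ""
        elif local_name == "page":
            root.clear()  # drop the page that was just completed


# Tiny made-up stand-in for a dump file (structure and namespace are assumed).
SAMPLE = b"""<mediawiki xmlns="http://example.org/export">
  <page><title>A</title><revision><text>The cat sat.</text></revision></page>
  <page><title>B</title><revision><text>On the mat.</text></revision></page>
</mediawiki>"""

texts = list(iter_page_texts(io.BytesIO(SAMPLE)))
print(texts)
```

For a real dump you would pass an open file (or a decompressing file object) instead of the `io.BytesIO` sample, and consume the generator page by page rather than collecting everything into a list.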

No, I didn't need that word list for a dictionary attack on some unsuspecting victim. Let's just pretend I was inspired by this flash movie and I wanted to find out what the highest scoring Scrabble words are.

Unfortunately that Scrabble program is on hold right now because I realized that I've never actually played Scrabble (except for about 10 games against a shareware game while I developed my program), and there were some discrepancies between the Scrabble score lists available online and the results I calculated. I don't know whether I'm wrong or they are, as I'm not really familiar with the Scrabble rules at this point.
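As a point of comparison, here is a rough Python sketch of the simplest possible scoring: adding up the face values of the tiles in the standard English-language set. It is not the on-hold program mentioned above, and the names `TILE_VALUES` and `base_score` are made up for this example. It deliberately ignores premium squares, blank tiles, the 50-point bonus for playing all seven tiles, and the limited number of each tile in the bag, so it may or may not explain the discrepancies described here.

```python
# Face values of the tiles in the standard English-language Scrabble set.
TILE_VALUES = {}
for letters, value in (
    ("AEILNORSTU", 1),
    ("DG", 2),
    ("BCMP", 3),
    ("FHVWY", 4),
    ("K", 5),
    ("JX", 8),
    ("QZ", 10),
):
    for letter in letters:
        TILE_VALUES[letter] = value


def base_score(word):
    """Sum the tile values of a word; raises KeyError for anything outside A-Z."""
    return sum(TILE_VALUES[ch] for ch in word.upper())


print(base_score("quiz"))
```

Running every entry of a word list through a function like this and sorting by the result gives a first ranking, but words that cannot actually be laid out with the available tiles would still need to be filtered out.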

Anyway, the word list generated from Wikipedia using this C# 2005 program is a nice by-product of that project. Not only is it probably the most comprehensive word list available right now (it contains 1035166 words, misspellings included of course), it can also be used to generate random statistics about common words.
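The counting step itself is simple. Below is a minimal Python sketch (again, not the C# program): it lowercases the text, treats every run of the letters a to z as a word, and tallies the words with `collections.Counter`. That definition of a word is an assumption; it splits "didn't" into "didn" and "t" and ignores accented letters, and the actual program may well do this differently. The helper names `count_words` and `format_ranking` are made up for this example.

```python
import re
from collections import Counter

# Assumed definition of a word: one or more ASCII letters.
WORD_RE = re.compile(r"[a-z]+")


def count_words(texts):
    """Count how often each word occurs across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts


def format_ranking(counts, limit=10):
    """Format the most common words as 'rank. zero-padded-count word' lines."""
    return [
        f"{rank:02d}. {n:012d} {word}"
        for rank, (word, n) in enumerate(counts.most_common(limit), start=1)
    ]


counts = count_words(["The cat sat on the mat.", "The dog sat."])
top = counts.most_common(2)
lines = format_ranking(counts, limit=2)
print("\n".join(lines))
```

Because `count_words` accepts any iterable, it can be fed page by page from a streaming parser, so the full text never has to sit in memory; only the counter grows with the number of distinct words.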

Here are the 10 most common words found in Wikipedia (the number shows how often my program found each word), followed by the 10 most common words in the English language according to about.com.

01. 000016388912 the
02. 000011596615 of
03. 000007508385 and
04. 000006187831 in
05. 000005658564 to
06. 000005629777 a
07. 000003998170 is
08. 000002427399 was
09. 000002355800 for
10. 000001977568 as

01. the
02. of
03. to
04. a
05. and
06. in
07. is
08. it
09. you
10. that

Notice a major difference? The word "you" is not in the Wikipedia Top 10 list. In fact the word "you" is only at position #73 in the Wikipedia word list. Not very surprising if you consider that Wikipedia articles are generally not dialogs where a speaker needs to address another person directly.

Comments

sp on :

Thanks, but I think I need some more practice of actually playing Scrabble. Maybe I'm going to buy the game one day, maybe I'll try to play it online somewhere.

uncleboob on :

Very nice little article. That C# program is exactly what I needed.
