Generating a word list from Wikipedia
Holy shit, a site update! And after only 6 weeks too! Great.
This update is mainly a small program that shows how to parse huge XML files (about 3.5 GB) with C#. Recently I needed a giant word list, and all the word lists I found on the internet were very unsatisfactory. So I decided to make my own, and the best source for words right now is probably Wikipedia (which you can thankfully download in XML format).
No, I didn't need that word list for a dictionary attack on some unsuspecting victim. Let's just pretend I was inspired by this flash movie and I wanted to find out what the highest scoring Scrabble words are.
Unfortunately that Scrabble program is on hold right now because I realized that I've never actually played Scrabble (except for about 10 games against a shareware game while I developed my program), and there were some discrepancies between the Scrabble score lists available online and the results I calculated. I don't know whether I'm wrong or they are, as I'm not really familiar with the Scrabble rules at this point.
Anyway, the word list generated from Wikipedia using this C# 2005 program is a nice by-product of that project. Not only is it probably the most comprehensive word list available right now (it contains 1035166 words, misspellings included of course), it can also be used to generate random statistics about common words.
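A minimal sketch of the streaming idea (not the original program) might look like the following. It assumes the article text sits inside `<text>` elements of the dump, and it makes up its own definition of a "word": a run of ASCII letters, lowercased. The names `WikiWordCounter`, `CountWords` and `AddWords` are invented for this example, and wiki markup is not stripped, so markup keywords get counted as words too.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

static class WikiWordCounter
{
    // Walks the XML one node at a time, so the multi-gigabyte file is
    // never loaded into memory as a whole. Only the text of one
    // <text> element is held as a string at any moment.
    public static Dictionary<string, long> CountWords(TextReader input)
    {
        var counts = new Dictionary<string, long>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "text")
                {
                    // Reads the element's content and moves past its end tag.
                    AddWords(reader.ReadElementContentAsString(), counts);
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return counts;
    }

    // Assumption: a word is a run of the letters a-z, case-insensitive.
    static void AddWords(string text, Dictionary<string, long> counts)
    {
        var word = new StringBuilder();
        // The extra pass (i == text.Length) flushes a trailing word.
        for (int i = 0; i <= text.Length; i++)
        {
            char c = i < text.Length ? text[i] : ' ';
            if (c >= 'A' && c <= 'Z')
            {
                c = (char)(c + ('a' - 'A'));
            }

            if (c >= 'a' && c <= 'z')
            {
                word.Append(c);
            }
            else if (word.Length > 0)
            {
                string key = word.ToString();
                long seen;
                counts.TryGetValue(key, out seen); // seen stays 0 for a new word
                counts[key] = seen + 1;
                word.Length = 0;
            }
        }
    }
}
```

To run it over a real dump you would pass something like `new StreamReader("pages-articles.xml")` to `CountWords`, where the file name is only a placeholder for wherever the download was saved.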
Here are the 10 most common words found in Wikipedia (the number says how often my program found the word), and right below them are the 10 most common words in the English language according to about.com.
Wikipedia:
01. 000016388912 the
02. 000011596615 of
03. 000007508385 and
04. 000006187831 in
05. 000005658564 to
06. 000005629777 a
07. 000003998170 is
08. 000002427399 was
09. 000002355800 for
10. 000001977568 as

about.com:
01. the
02. of
03. to
04. a
05. and
06. in
07. is
08. it
09. you
10. that
Notice a major difference? The word "you" is not in the Wikipedia Top 10 list. In fact the word "you" is only at position #73 in the Wikipedia word list. Not very surprising if you consider that Wikipedia articles are generally not dialogs where a speaker needs to address another person directly.
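Finding a position like that is just a sort by count followed by a lookup. Here is a rough sketch, assuming the counts already sit in a dictionary. `WordRanking`, `Rank`, `PositionOf` and `FormatLine` are names invented for this example, and breaking ties alphabetically is an arbitrary choice, not necessarily what the original program does.

```csharp
using System;
using System.Collections.Generic;

static class WordRanking
{
    // Orders the words by descending count; ties are broken
    // alphabetically (an assumption, see above).
    public static List<KeyValuePair<string, long>> Rank(IDictionary<string, long> counts)
    {
        var ranked = new List<KeyValuePair<string, long>>(counts);
        ranked.Sort((a, b) =>
        {
            int byCount = b.Value.CompareTo(a.Value);
            return byCount != 0 ? byCount : string.CompareOrdinal(a.Key, b.Key);
        });
        return ranked;
    }

    // 1-based position of a word in the ranked list, or -1 if it is missing.
    public static int PositionOf(List<KeyValuePair<string, long>> ranked, string word)
    {
        int index = ranked.FindIndex(entry => entry.Key == word);
        return index < 0 ? -1 : index + 1;
    }

    // Formats one line in the style of the Wikipedia listing:
    // two-digit position, twelve-digit zero-padded count, word.
    public static string FormatLine(int position, KeyValuePair<string, long> entry)
    {
        return string.Format("{0:D2}. {1:D12} {2}", position, entry.Value, entry.Key);
    }
}
```

Calling `PositionOf(ranked, "you")` on the full Wikipedia counts is the kind of lookup that produces the #73 mentioned above.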