Here are some more details about the program I used to create some graphs from Wikipedia two days ago.
The C# source code is now available. The program takes five command-line parameters.
- The name of the Wikipedia XML input file. You can download it here. It's the 1.8 GB pages-articles file.
- A number that specifies how many nodes you want in the final graph. Values between 25 and 50 are reasonable.
- The keyword to search for (for example, Egypt). Note that this parameter is case-sensitive.
- A number that specifies how many sentences at the start of each article are considered when searching for the keyword. Set it to 0 to search the entire article.
- The name of the output file. This file is a graph definition file that can be turned into a graph using dot.exe from the GraphViz package (for example, dot.exe -Tsvg output.txt > graph.svg).
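To give an idea of what such a graph definition file contains, here is a minimal sketch in Python (not the C# code of the program itself). It assumes the graph is a set of directed links between article titles; the function name `to_dot` and the graph name `wiki` are made up for this illustration, and the real program's output may look different.

```python
def to_dot(edges, name="wiki"):
    """Render (source title, target title) pairs as DOT text for dot.exe."""
    def quote(title):
        # Inside a double-quoted DOT ID, embedded double quotes must be escaped.
        return '"' + title.replace('"', '\\"') + '"'

    lines = ["digraph " + name + " {"]
    for source, target in edges:
        lines.append("    " + quote(source) + " -> " + quote(target) + ";")
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical edges, only to show the shape of the output.
dot_text = to_dot([("Egypt", "Nile"), ("Egypt", "Cairo")])
```

Writing `dot_text` to output.txt gives a file that the dot.exe command shown above can render.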
I toyed around with the parameters for a while and here are a few more things I noticed:
- Searching through the first three sentences of all articles seems to produce very nice results for most keywords.
- If the keyword is relatively rare (for example "Australian rules football") it's OK to search through the entire article (set the sentences parameter to 0). Don't do this for popular keywords, though, or you'll end up with a graph that shows articles that are objectively important but only tangentially related to the keyword. If you do a full-article search for "Germany", for example, you end up with a graph full of nodes containing the names of other European countries that played a role in Germany's history. That's because articles about countries have a high importance, and all European countries were somehow important to Germany in the last few thousand years.
- Trying to use the size of articles as an indicator of their relative importance didn't work out. Look at this 4,000-word treatise on the Goomba and compare it to the page for Niels Bohr, which is only half as long. That should give you a first idea of the potential problems. There's still legacy code in the app from when I tried that idea. It's easy to enable it again, but you need to recompile the app.
- Trying to use the position of the keyword relative to the size of the article didn't work either. Long articles still have too much weight. A better idea might be to give an article weight 10 if the keyword appears in the first 10% of the article, weight 9 if it appears in the next 10%, and so on. I didn't try that though.
- Different parameters lead to different meanings of the resulting graphs. If you do a three-sentence search for LSD, the graph shows information about the drug itself and its history. If you do a full-text search, one half of the graph is dominated by rock stars.
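Two of the ideas from this list can be sketched in a few lines of Python: limiting the keyword search to the first few sentences, and the untried weighting by the 10% slice in which the keyword first appears. This is an illustration under assumptions, not the program's actual code. The function names are invented, the sentence splitter is deliberately naive (it breaks on abbreviations), and real article text would contain wiki markup that is ignored here.

```python
import re

# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def mentions_keyword(text, keyword, sentences):
    """True if keyword occurs in the first `sentences` sentences (0 = whole text)."""
    if sentences > 0:
        text = " ".join(_SENTENCE_END.split(text.strip())[:sentences])
    return keyword in text  # case-sensitive, like the keyword parameter


def position_weight(text, keyword):
    """10 if the keyword first appears in the first 10% of the text,
    9 if in the next 10%, ... down to 1; 0 if it does not appear."""
    if not keyword:
        return 0
    index = text.find(keyword)
    if index < 0:
        return 0
    decile = index * 10 // len(text)  # 0 for the first tenth, 9 for the last
    return 10 - decile


article = "LSD is a drug. It was first made in 1938. Many rock stars used it."
in_intro = mentions_keyword(article, "rock", 2)     # looks at two sentences only
in_full_text = mentions_keyword(article, "rock", 0)  # looks at everything
```

Because the weight depends only on the relative position of the first hit, a long article gets no bonus for its length, which is the point of the idea.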
Here are a few other graphs which turned out particularly well:
The key to cool graphs is to choose a keyword that has lots of articles which nevertheless belong closely together. An example of a bad keyword is "Mathematics". There are thousands of math-related articles in Wikipedia, but they don't belong closely together because math is a huge and fragmented field. The resulting graphs for keywords like "Mathematics" degenerate into trees or unconnected subgraphs.
Generating a graph takes approximately 5 minutes on my computer. In most cases, nearly all of that time is spent parsing the 8 GB XML file. Generating the actual graph is nearly always a matter of seconds. Only for keywords like Germany or America, which have around ten thousand relevant articles, does generating the graph take a few more minutes.
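Since parsing dominates the run time, the file has to be read as a stream rather than loaded into memory at once. The following Python sketch shows one common way to do that with the standard library; it is not how the C# program does it. It assumes the dump consists of page elements that each contain a title and, inside a revision, a text element, and it ignores XML namespaces by looking at local tag names only.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix: '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(source):
    """Yield (title, text) for every <page>, discarding parsed pages as it goes."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    title, text = None, ""
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            root.clear()  # drop finished pages so memory use stays bounded


# Usage with a tiny in-memory document standing in for the real dump:
sample = io.BytesIO(
    b"<mediawiki><page><title>Egypt</title>"
    b"<revision><text>Egypt is a country.</text></revision></page></mediawiki>"
)
articles = list(iter_articles(sample))
```

For the real file you would pass the file name instead of the in-memory sample and filter the yielded articles by keyword as you go.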