
Data-mining Wikipedia

Recently I finished Ian Shaw's book The Oxford History of Ancient Egypt. It was pretty interesting, but it introduces more names per page than your average Tolkien book. I really could have used a chronologically ordered graph that shows the names of important people and places and how they are connected. Creating a graph like that manually is obviously too much work, so I tried to use the power of Wikipedia to create one automatically. My original plan didn't quite work out. Apparently the problem of mining data from natural-language text can't be solved in a few hours on a lazy Saturday afternoon. At least not by me. Nevertheless, I managed to get some interesting partial results.

After I downloaded the latest Wikipedia dump I wrote a short C# program to analyze the data. My general strategy was the following:

  • Find all articles that mention "Egypt" in the first sentence. I figured that every article that is really relevant to Egypt contains the word "Egypt" right in the first sentence (I found 2738 articles that way).
  • For each of these articles collect all outbound links to other articles.
  • Once all articles are parsed, create a graph that mirrors the links between the articles. Discard links to articles that don't contain "Egypt" in the first sentence.
  • Sort the articles by importance (which I defined as the sum of inbound links from other articles and outbound links to other articles about Egypt).
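The steps above can be sketched roughly as follows. This is a minimal Python sketch, not the original C# program: the `articles` dictionary (title mapped to wikitext) is a hypothetical stand-in for the parsed dump, the function names are made up for illustration, and counting each pair of linked articles only once is an assumption.

```python
import re

# Matches the target of a [[wiki link]], ignoring display text ("|...")
# and section anchors ("#...").
LINK_RE = re.compile(r"\[\[([^\]|#]+)")

def first_sentence(text):
    """Naive heuristic from the post: everything before the first period."""
    return text.split(".", 1)[0]

def outbound_links(text):
    """Set of article titles that this wikitext links to."""
    return {target.strip() for target in LINK_RE.findall(text)}

def rank_articles(articles, keyword="Egypt"):
    """Return (ranking, edges) for articles whose first sentence has the keyword."""
    relevant = {title for title, body in articles.items()
                if keyword in first_sentence(body)}
    # Keep only links whose source and target are both relevant.
    edges = {(src, dst)
             for src in relevant
             for dst in outbound_links(articles[src])
             if dst in relevant and dst != src}
    # Importance = inbound links + outbound links.
    score = {title: 0 for title in relevant}
    for src, dst in edges:
        score[src] += 1
        score[dst] += 1
    ranking = sorted(score.items(), key=lambda item: (-item[1], item[0]))
    return ranking, edges

# Tiny made-up sample, only to show the shape of the input.
sample = {
    "Cairo": "Cairo is the capital of Egypt. See [[Thebes]] and [[Nile|the Nile]].",
    "Thebes": "Thebes was a city in Ancient Egypt. It is far from [[Cairo]].",
    "Nile": "The Nile is a river in Africa. It flows through [[Cairo]].",
}
ranking, edges = rank_articles(sample)
```

In the sample, "Nile" is dropped because its first sentence doesn't mention Egypt, so the link from Cairo to it is discarded as well.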

At first I wanted to create a graph that shows all articles connected to at least 5 other articles. Unfortunately it turned out that 981 articles fulfill this criterion, far too many for a readable graph. I changed my strategy and decided to create graphs showing the Top X articles instead.

Here are two graphs I created showing the Top 20 and the Top 58. The SVG version is zoomable but you need a browser plugin. The size of a node reflects the importance of the article.
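For anyone who wants to try something similar, here is a rough Python sketch of how a Top X graph could be written out in the DOT format that GraphViz reads. It is an illustration only: the function name `to_dot`, the input shapes and the font-size scaling are assumptions, not what my program actually did.

```python
def to_dot(ranking, edges, top_n=20):
    """Build a DOT digraph for the top_n articles.

    ranking: list of (title, score) pairs, best first.
    edges:   iterable of (source_title, target_title) pairs.
    """
    top = dict(ranking[:top_n])

    def quote(title):
        # DOT string literals use double quotes; escape embedded ones.
        return '"' + title.replace('"', '\\"') + '"'

    lines = ["digraph egypt {"]
    for title, score in top.items():
        # Hypothetical scaling: a bigger score gives a bigger label,
        # and therefore a bigger node.
        lines.append(f"  {quote(title)} [fontsize={10 + score}];")
    for src, dst in sorted(edges):
        # Only draw links between two articles that both made the cut.
        if src in top and dst in top:
            lines.append(f"  {quote(src)} -> {quote(dst)};")
    lines.append("}")
    return "\n".join(lines)

# Made-up input, only to show the call.
dot_source = to_dot(
    [("Cairo", 2), ("Thebes", 2), ("Ra", 1)],
    {("Cairo", "Thebes"), ("Ra", "Cairo")},
    top_n=2,
)
```

Saved to a file, the result can be rendered with something like `dot -Tsvg egypt.dot -o egypt.svg`.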

The first noticeable thing is that nearly all nodes are about Ancient Egypt. Apparently modern Egypt is not particularly interesting. The only two exceptions in the Top 20 are Fatimid and Cairo, and Cairo is only half an exception because the general area around Cairo was already important in ancient times. Originally I wanted to search for dates in the articles too and restrict the graph to articles containing dates before the year 0. Since nearly all important Wikipedia articles about Egypt are about Ancient Egypt anyway, I didn't bother with that.

The next noticeable thing is that the nodes of the Top 20 graph really are about topics that are very important to Egypt. I feel that validates my code at least somewhat. There's the modern capital Cairo and the old spiritual capital Thebes (although curiously Memphis is ranked a lot lower). There are the Pharaohs with their Valley of the Kings. The 18th dynasty, which is maybe the most interesting Egyptian dynasty, is mentioned along with the New Kingdom (the era the 18th dynasty belongs to). There's Nubia, which was Egypt's main enemy for many centuries, and there's Ra (one of the primary gods of Ancient Egypt). The article about Greek Language appears to be an outlier at first glance, but Alexander the Great conquered Egypt in 332 BC, and after his death his general Ptolemy founded the Ptolemaic dynasty that ruled Egypt (see the left side of the Top 58). During the Ptolemaic era several well-known buildings like the Library of Alexandria or the lighthouse on the island Pharos were created. There's also the Rosetta Stone, which establishes a direct connection between Greek Language and Egypt.

My methodology is nowhere near perfect of course. On the code side of things, the biggest problem I have observed is the way I select the first sentence of an article. I merely search for the first period character. Everything before that character is the first sentence. Certain article templates appear before the actual text though and mess up this strategy. Another problem is abbreviations that include period characters, because they cut the first sentence short. The result is that some relevant articles are not included in my graphs. In a particularly amusing parallel to life Anwar Al Sadat didn't make it (into the graph, I mean) while Hosni Mubarak did.
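Both problems could probably be reduced with a slightly smarter first-sentence function. The following is a Python sketch of one possible approach, not something I have run against the dump: it skips `{{...}}` templates at the start of the text and ignores periods that end a known abbreviation. The abbreviation list is a hypothetical, very incomplete example.

```python
import re

# Hypothetical, incomplete list of abbreviations that end with a period.
ABBREVIATIONS = {"c.", "ca.", "e.g.", "i.e.", "St.", "Dr."}

def strip_leading_templates(text):
    """Drop {{...}} blocks (possibly nested) that precede the article text."""
    text = text.lstrip()
    while text.startswith("{{"):
        depth, i = 0, 0
        while i < len(text):
            if text.startswith("{{", i):
                depth += 1
                i += 2
            elif text.startswith("}}", i):
                depth -= 1
                i += 2
                if depth == 0:
                    break
            else:
                i += 1
        if depth != 0:
            return text  # unbalanced braces: give up and keep the text as-is
        text = text[i:].lstrip()
    return text

def first_sentence(text):
    """Text up to the first period that ends a word and isn't an abbreviation."""
    text = strip_leading_templates(text)
    # Only consider periods followed by whitespace or the end of the text.
    for match in re.finditer(r"\.(?=\s|$)", text):
        candidate = text[:match.end()]
        last_word = candidate.split()[-1].lstrip("([")
        if last_word not in ABBREVIATIONS:
            return candidate
    return text

# Made-up wikitext with a leading template and an abbreviation.
example = "{{Infobox|name=X}}\nX (c. 1500 BC) was a ruler of Egypt. More text."
sentence = first_sentence(example)
```

A fixed abbreviation list will never be complete, so this only narrows the problem instead of solving it.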

Improving code can only go so far though. I believe that Wikipedia articles would greatly benefit from having invisible tags that could be parsed and processed easily. I have only passing knowledge of the Semantic Web ideas, but I think a hierarchical model like the Scientific Classification of organisms could work. Imagine something like this, a hierarchical structure that includes fully typed entries (types are in parentheses):

  • Name(String)=George Washington
  • Type(Category)=Person
    • Born(Date)=1732/02/22
    • Profession(String)=National President
      • Country(Country)=United States of America

  • Name(String)=Melbourne
  • Type(Category)=City
    • Country(Country)=Australia
    • Population(Integer)=3689700

A system like that certainly would have made my job significantly easier. I could have easily identified all articles about people and cities of Ancient Egypt and I could have linked them properly instead of applying a simple heuristic which didn't produce the desired results.
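To make the idea a bit more concrete, here is a small Python sketch of how such typed entries could be represented and queried. Everything in it is hypothetical: the `Entry` class, the `find` helper and the flat attribute dictionary (which ignores the nesting shown above) are only there to illustrate the kind of lookup I have in mind.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Entry:
    """One tagged article: a name, a category, and typed attributes."""
    name: str
    type: str
    attributes: dict = field(default_factory=dict)

# The two example entries from above, flattened for simplicity.
entries = [
    Entry("George Washington", "Person", {
        "Born": date(1732, 2, 22),
        "Profession": "National President",
        "Country": "United States of America",
    }),
    Entry("Melbourne", "City", {
        "Country": "Australia",
        "Population": 3689700,
    }),
]

def find(entries, type_, **criteria):
    """Entries of the given type whose attributes match all criteria."""
    return [entry for entry in entries
            if entry.type == type_
            and all(entry.attributes.get(key) == value
                    for key, value in criteria.items())]

# Example query: all cities in Australia.
cities = find(entries, "City", Country="Australia")
```

With tags like these, a query such as "all people and cities of Ancient Egypt" would be a simple filter instead of a guess based on the first sentence.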

Of course there are many shortcomings in my simple tag system. Here's one: how do you handle disputes like the exact place of birth of the former Peruvian president Alberto Fujimori? Smarter people than me should solve these problems though. Data-mining Wikipedia might be worth a Bachelor/Master/PhD thesis. Maybe someone wants to give it a try.

Oh yeah, about The Oxford History of Ancient Egypt. I think the quality of the book is proven by the fact that I knew all topics of the Top 58 with the single exception of Manetho. Not knowing what (or rather who) Manetho was is strictly my own fault too. According to the book's index he's mentioned quite often.


Comments


mazzoo on :

very interesting work!
Are any sources to be released soon?
What's the name of the graph rendering library you used?

sp on :

Hi,

I didn't plan to release the sources because at this point they're quite horrible. I can clean them up later today and post them though.

The graph library I used was dot.exe from the GraphViz package.

Glich on :

This is nice. Connecting information like this can make everything much clearer. You've inspired me to try to make my own program! thx!
