So for grins I implemented pagerank on Wikipedia (this was actually last month). I figured I'd share my results (and code, although it's kind of a hack job) in case anyone was curious how it turned out.
The code is here: https://github.com/lkolbly/pagerank
There are two folders: pagecount/ contains code to download the November page view data and aggregate it. pagerank/ contains all the code to parse the Wikipedia enwiki dump and run pagerank on the result.
If you run:
gcc -O2 sparsematrix.c pagerank.c main.c -lpthread -lm -o pagerank
inside pagerank/, you will get an executable called "pagerank", which performs the pagerank algorithm on an arbitrary network specified in the file "network". There's a small test network in the file testdata - it's the one Dr. Cline showed us in class (in the slides). It'll output the pagerank data as a vector, one element on each line, into the file vector-out.
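If you want to sanity-check it on the test network first, something like this should work (I'm assuming here that the binary always reads from a file literally named "network" in the current directory, per the above):

cp testdata network
./pagerank
cat vector-out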
The file toolchain.sh shows how I ran it. On the Wikipedia data you need about 20GB of disk space and 16GB of RAM to run everything comfortably. It ran in about 2-3 hours.
Final results are here, in a CSV file. (Just kidding. It's over 260MB compressed, so you'll have to make your own.) Also, it has more rows than most spreadsheet programs will accept.
There are three columns: the name of the article, the expected rank according to pagerank, and the actual rank based on page view data from November 2014.
The pagerank algorithm works by repeatedly multiplying the Markov matrix by the rank vector; I ran 1000 iterations. I'm not sure if that's enough in general, but the eigenvector wasn't changing by a measurable amount (I measured the angle between successive iterations), so I assumed it had converged.
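If you'd rather not dig through my hack-job code, here's the gist of that loop as a minimal sketch. This is not the actual sparsematrix.c/pagerank.c - the COO matrix layout, names, and tolerance are all made up, and it ignores details like the damping factor and dangling pages:

#include <math.h>
#include <stdlib.h>

/* A bare-bones sparse matrix in coordinate (COO) form. This is a
 * stand-in for the repo's sparsematrix.c, not a copy of it. */
typedef struct {
    int n;          /* number of pages */
    long nnz;       /* number of links */
    int *row, *col; /* entry k sits at (row[k], col[k]) */
    double *val;    /* val[k] = 1/outdegree(col[k]) for the Markov matrix */
} SparseMatrix;

/* y = M * x */
static void spmv(const SparseMatrix *m, const double *x, double *y) {
    for (int i = 0; i < m->n; i++) y[i] = 0.0;
    for (long k = 0; k < m->nnz; k++)
        y[m->row[k]] += m->val[k] * x[m->col[k]];
}

/* Angle between successive iterates - the convergence check I used. */
static double angle(const double *a, const double *b, int n) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (int i = 0; i < n; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    double c = dot / (sqrt(na) * sqrt(nb));
    if (c > 1.0) c = 1.0; /* guard acos() against rounding error */
    return acos(c);
}

/* Multiply the Markov matrix into the rank vector up to 1000 times
 * (matching the run above), stopping early once the angle between
 * successive vectors is negligible. x must hold m->n doubles. */
void pagerank(const SparseMatrix *m, double *x) {
    int n = m->n;
    double *y = malloc(n * sizeof *y);
    for (int i = 0; i < n; i++) x[i] = 1.0 / n; /* uniform start */
    for (int iter = 0; iter < 1000; iter++) {
        spmv(m, x, y);
        double sum = 0.0; /* renormalize so rounding doesn't drift */
        for (int i = 0; i < n; i++) sum += y[i];
        for (int i = 0; i < n; i++) y[i] /= sum;
        double theta = angle(x, y, n);
        for (int i = 0; i < n; i++) x[i] = y[i];
        if (theta < 1e-12) break; /* arbitrary tolerance */
    }
    free(y);
}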
Okay, now that the technical stuff is over, here are some pictures. I made a plot where X was the expected rank from pagerank and Y was the actual rank in November, then turned it into a density chart. In this image, red means more points and blue means fewer points:
Odd little graphic. (The Z axis is the log of the number of points in each cell.) You would expect it to be square, except that a lot of pages didn't have empirical page view data attached to them, so they had to be cut out. The scale is 512 pixels = 16,000,000 ranks, so each cell is about 31,250 ranks on a side.
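If you're wondering how the chart itself was made: it's just a 2-D histogram with a log-scaled Z. Here's a rough sketch of the binning - the 512x512 grid and the 16,000,000-rank scale come from above, but the function and names are otherwise made up:

#include <math.h>
#include <stdlib.h>

#define GRID 512             /* the chart is 512 cells on a side */
#define MAX_RANK 16000000.0  /* 512 pixels = 16,000,000 ranks */

/* Bin (expected rank, actual rank) pairs into a GRIDxGRID histogram
 * and log-scale each cell for the Z axis. z must hold GRID*GRID doubles. */
void density(const double *expected, const double *actual, long npages,
             double *z) {
    long *counts = calloc((size_t)GRID * GRID, sizeof *counts);
    for (long i = 0; i < npages; i++) {
        int x = (int)(expected[i] / MAX_RANK * GRID);
        int y = (int)(actual[i]   / MAX_RANK * GRID);
        if (x >= GRID) x = GRID - 1;
        if (y >= GRID) y = GRID - 1;
        counts[(long)y * GRID + x]++;
    }
    for (long c = 0; c < (long)GRID * GRID; c++)
        z[c] = log(1.0 + counts[c]); /* log(1+n) so empty cells stay 0 */
    free(counts);
}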
Ideally, if the pagerank algorithm perfectly predicted the actual page ranks, then the graph would be white except for a bright red line along Y=X.
We don't see that, exactly. We see a sort of plume in the lower-left corner, which roughly follows that line. The plume shows that pagerank mostly got the higher-ranked pages right, and mostly didn't guess that unpopular pages would be popular or vice versa.
Over towards the right, we see a few bands of lighter blue. These are presumably because certain pages on Wikipedia aren't linked to often, and aren't visited very often either (it's not hard to think of an example or two). I can imagine there are clusters of pages like this - perhaps bridges in Iran, or gymnastics competitions in China. These clusters would form those vertical bands.
Anyway, here's a zoomed-in image of the plume:
As you can see, it's slightly more likely for a page that's popular in reality to get a poor rank from pagerank than for a page that pagerank expects to be popular to get a poor rank in reality. (Remember, lower rank numbers are more popular.)
I would imagine that a lot of the error comes from Wikipedia's traffic being driven primarily by what's happening in the news, rather than by network effects like the rest of the internet's traffic.
Anyway, I guess the result of all this is that pagerank actually works. It may still be magic, but it's magic that actually works. You also get a working sparse-matrix pagerank program out of it, if you ever need one.