I was unable to find a simple and efficient implementation of PageRank, so I wrote this post and my own Python code, which can compute the PageRank of the Polish Wikipedia, with its 17,880,897 links between 1,113,939 articles, within ten seconds. For the curious, here are the top ten articles.

| Title of the article | PageRank | Equivalent English title |
|---|---|---|
| Polska | 0.00349 | Poland |
| Francja | 0.00343 | France |
| Stany Zjednoczone | 0.00341 | United States |
| Język angielski | 0.00257 | English language |
| Łacina | 0.00225 | Latin |
| Niemcy | 0.00205 | Germany |
| Włochy | 0.00162 | Italy |
| Wieś | 0.00156 | Village |
| Wielka Brytania | 0.00131 | United Kingdom |
| Podział administracyjny Polski 1975–1998 | 0.00124 | Voivodeships of Poland (1975–98) |

### The random surfer

The PageRank of a page is the probability that a random surfer visits the page at a fixed, distant moment of her surf through the web. Before each of her actions, the ideal random surfer tosses a biased coin.

- If the toss is a head, she clicks a random link on the current page, provided that the page has any outbound links.
- If the toss is a tail, she gets bored and jumps to a random page.
- If the toss is a head and the current page has no outbound links, she also jumps to a random page.

The original papers recommend setting the head probability *c* to 0.85.
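This surfing process can be simulated directly, which gives a handy sanity check on the formulas that follow. A minimal Monte Carlo sketch (the `simulate_surfer` helper and the tiny example graph are my own, hypothetical additions), counting visits during a long surf:

```python
import random

def simulate_surfer(outlinks, n, c=0.85, steps=100_000, seed=0):
    """Estimates PageRank by counting how often the random surfer
    visits each of the n pages. |outlinks[i]| lists the link targets
    of page i; pages without outbound links are simply absent."""
    rng = random.Random(seed)
    visits = [0] * n
    page = rng.randrange(n)
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < c and outlinks.get(page):
            # Head, and the page has outbound links: follow one.
            page = rng.choice(outlinks[page])
        else:
            # Tail, or a dead end: jump to a random page.
            page = rng.randrange(n)
    return [v / steps for v in visits]

# A hypothetical five-page graph; page 4 is a dead end.
outlinks = {0: [1, 2, 3], 1: [2, 3], 2: [1], 3: [2, 4]}
estimate = simulate_surfer(outlinks, n=5)
```

The visit frequencies approximate the PageRank values that the linear-algebra view below computes exactly.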

### Linear algebra

Number the pages from 1 to *n*. Store the probabilities of visiting each page at a given moment in an *n*-element vector **p**. The actions of the random surfer correspond to linear transformations of vector **p**. To express these transformations, we need three building blocks:

- A sparse *n*×*n* matrix **H** that reflects the following of links. When page *i* has *f*_{i} outbound links, among them a link to page *j*, then the element *h*_{ij} of matrix **H** is equal to 1/*f*_{i}. When there is no link from page *i* to page *j*, then *h*_{ij} = 0. The vector *c***H**^{T}**p** contains the probabilities of visiting each page after action 1 above.
- A vector with *n* elements equal to 1/*n*, which we denote as (1/*n*)**1**, that reflects jumping to a random page. For a scalar *a* or a 1×1 matrix **A** = [*a*], each element of the vector *a*(1/*n*)**1** or (1/*n*)**1A** contains the same number *a*/*n*. In particular, the vector (1 − *c*)(1/*n*)**1** contains the probabilities after action 2.
- A sparse vector **d** that reflects dead ends. Its element *d*_{i} has value 1 when page *i* has no outbound links. If there is a link from page *i* somewhere else, then *d*_{i} = 0. The 1×1 matrix **d**^{T}**p** contains the probability that the surfer is visiting the end of the Internet, so the vector *c*(1/*n*)**1d**^{T}**p** corresponds to action 3.
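For a concrete graph, the three building blocks take only a few lines of NumPy. A dense sketch (fine for a toy graph; the real code later in this post uses a sparse matrix), built from the five-page link list that appears in the code at the end of the post:

```python
import numpy as np

# The link list from the code at the end of the post.
links = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 1), (3, 2), (3, 4)]
n = 5

# H: h_ij = 1/f_i when page i links to page j, 0 otherwise.
H = np.zeros((n, n))
for i, j in links:
    H[i, j] = 1.0
f = H.sum(axis=1)            # f_i: number of outbound links of page i
for i in range(n):
    if f[i] > 0:
        H[i] /= f[i]

# d: d_i = 1 when page i has no outbound links (here, only page 4).
d = (f == 0).astype(float)

# (1/n) 1: the uniform jump vector.
jump = np.ones(n) / n
```

Each row of **H** with outbound links sums to one, and **d** marks exactly the dead ends.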

The PageRank vector **p** can be computed iteratively. Start with any initial vector **p**_{0} of probabilities that add to one, for instance the vector (1/*n*)**1**. Repeat the step

**p**_{k+1} := *c***H**^{T}**p**_{k} + *c*(1/*n*)**1d**^{T}**p**_{k} + (1 − *c*)(1/*n*)**1** (1)

until, say, the sum of absolute differences |**p**_{k+1} − **p**_{k}| is smaller than a given threshold *ϵ*. Do not worry, the sequence (**p**_{k}) does converge whenever *c* < 1.
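In code, the iteration is only a few lines. A minimal dense sketch (my own illustration, not the post's final implementation), using the five-page link list that appears in the code at the end of the post:

```python
import numpy as np

c, epsilon = 0.85, 1e-6
links = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 1), (3, 2), (3, 4)]
n = 5

# Build H and d as defined above.
H = np.zeros((n, n))
for i, j in links:
    H[i, j] = 1.0
f = H.sum(axis=1)                    # outbound link counts
for i in range(n):
    if f[i] > 0:
        H[i] /= f[i]                 # h_ij = 1/f_i
d = (f == 0).astype(float)           # dead-end indicator

p = np.ones(n) / n                   # p_0 = (1/n) 1
while True:
    p_next = c * H.T @ p + c * (d @ p) / n + (1 - c) / n
    if np.abs(p_next - p).sum() < epsilon:
        break
    p = p_next
```

At the end of the loop, `p_next` holds the PageRank vector up to the chosen tolerance.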

The dominant operation in step (1) is the multiplication **H**^{T}**p**_{k}. For a sparse matrix **H**, it takes time proportional to the number of nonzero elements in **H**, which in our case is the total number of links. While it is possible to express step (1) in the simpler form **p**_{k+1} := **Gp**_{k}, the matrix **G** = *c*(**H**^{T} + (1/*n*)**1d**^{T}) + (1 − *c*)(1/*n*)**11**^{T} would be dense. The multiplication would then take much longer, proportional to the square of the number of pages.

Note that you can use the dot product **d**·**p**_{k} to write (**d**·**p**_{k})(1/*n*)**1** instead of (1/*n*)**1d**^{T}**p**_{k}, so step (1) above is often written equivalently as

**p**_{k+1} := *c*(**H**^{T}**p**_{k} + (**d**·**p**_{k})(1/*n*)**1**) + (1 − *c*)(1/*n*)**1**.
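Both forms produce the same vector, since **1d**^{T}**p** = (**d**·**p**)**1**. A quick numerical check of this identity, with a hypothetical dead-end vector and a random probability vector:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
d = np.zeros(n)
d[4] = 1.0                      # assume page 4 is the only dead end
p = rng.random(n)
p /= p.sum()                    # a probability vector

ones = np.ones(n)
# Matrix form: (1/n) 1 d^T p, an outer product applied to p.
v1 = (1 / n) * np.outer(ones, d) @ p
# Dot-product form: (d . p) (1/n) 1, a scalar times a vector.
v2 = (d @ p) * (1 / n) * ones

assert np.allclose(v1, v2)
```

The dot-product form is the one to use in practice: it avoids materializing the dense outer product **1d**^{T}.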

### An example

Let us work with a concrete graph of five pages, the one used in the code at the end of this post: page 0 links to pages 1, 2, and 3; page 1 links to pages 2 and 3; page 2 links to page 1; page 3 links to pages 2 and 4; page 4 has no outbound links.

Repeatedly computing

**p**_{k+1} := *c***H**^{T}**p**_{k} + *c*(1/*n*)**1d**^{T}**p**_{k} + (1 − *c*)(1/*n*)**1**,

starting from **p**_{0} = (1/5)**1**, we obtain the PageRank vector **p**.

The sum of absolute differences between **p**_{k+1} and **p**_{k} decreases exponentially with *k*.
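That exponential decay can be checked numerically: each step multiplies the sum of absolute differences by at most *c*, so the ratios of successive differences never exceed 0.85. A sketch using the five-page link list from the code at the end of the post:

```python
import numpy as np

c = 0.85
links = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 1), (3, 2), (3, 4)]
n = 5
H = np.zeros((n, n))
for i, j in links:
    H[i, j] = 1.0
f = H.sum(axis=1)
for i in range(n):
    if f[i] > 0:
        H[i] /= f[i]
d = (f == 0).astype(float)

p = np.ones(n) / n
diffs = []
for _ in range(20):
    p_next = c * H.T @ p + c * (d @ p) / n + (1 - c) / n
    diffs.append(np.abs(p_next - p).sum())
    p = p_next

# Each step contracts the L1 difference by at least a factor of c,
# so every ratio of successive differences stays below 0.85.
ratios = [b / a for a, b in zip(diffs, diffs[1:])]
```

This contraction is also why the iteration converges for any starting vector whenever *c* < 1.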

### The code

```python
#!/usr/bin/python3

import numpy
from scipy import sparse


def ComputePageRank(links, c=0.85, epsilon=0.001):
    """Computes the PageRank for given |links|.

    The source or target elements of |links| are numbered from zero
    upwards. Gaps in the numbering represent pages having neither
    inbound nor outbound links. Note that with normal settings, such
    pages still get a nonzero PageRank.

    Args:
        links: A list of ⟨source, target⟩ pairs. Source and target
            must be integers >= 0.
        c: The probability that the random surfer follows a link,
            if there are any links.
        epsilon: The desired accuracy.

    Returns:
        A numpy.ndarray whose ith element is the PageRank of the
        ith page.
    """
    ones = numpy.ones(len(links))
    sources = [x[0] for x in links]
    targets = [x[1] for x in links]
    n = max(max(sources), max(targets)) + 1
    # |HT| is sparse, so store it in COOrdinate format: a list of
    # ⟨source, target, probability⟩ triples. Make it explicitly
    # square in case the last target has no outbound links or the
    # last source has no inbound links.
    HT = sparse.coo_matrix((ones, (targets, sources)), shape=(n, n))
    # Count the outbound links for each source.
    num_outlinks = numpy.array(HT.sum(axis=0)).flatten()
    # Divide each nonzero element of |HT|, so far equal to one, by
    # the number of outbound links in its source.
    HT.data /= num_outlinks[sources]
    d_indices = numpy.where(num_outlinks == 0)[0]
    r = numpy.ones(n) / n
    while True:
        previous_r = r
        r = c * (HT * r + sum(r[d_indices]) / n) + (1.0 - c) / n
        # r.sum() ≈ 1 but prevent errors from adding up.
        r /= r.sum()
        if numpy.absolute(r - previous_r).sum() < epsilon:
            return r


if __name__ == '__main__':
    print(ComputePageRank(
        [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3),
         (2, 1), (3, 2), (3, 4)]))
```