PageRank in 600 words and 60 lines of Python

I was unable to find a simple and efficient implementation of PageRank, so I wrote this post and my own Python code, which computes the PageRank of the Polish Wikipedia, with its 17,880,897 links between 1,113,939 articles, within ten seconds. For the curious, here are the top ten articles.

Article title                              PageRank   Equivalent English title
Polska                                     0.00349    Poland
Francja                                    0.00343    France
Stany Zjednoczone                          0.00341    United States
Język angielski                            0.00257    English language
Łacina                                     0.00225    Latin
Niemcy                                     0.00205    Germany
Włochy                                     0.00162    Italy
Wieś                                       0.00156    Village
Wielka Brytania                            0.00131    United Kingdom
Podział administracyjny Polski 1975–1998   0.00124    Voivodeships of Poland (1975–98)

The random surfer

The PageRank of a page is the probability of visiting the page at a fixed, distant moment during a random surf through the web. Before each of her actions, the ideal random surfer tosses a biased coin.

1. If the toss is a head, she clicks a random link on the current page, provided that the page has any outbound links.
2. If the toss is a tail, she gets bored and jumps to a random page.
3. If the toss is a head and the current page has no outbound links, she also jumps to a random page.

The original papers recommend setting the head probability c to 0.85.
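The three actions above can be checked directly with a small Monte Carlo sketch. This is not the algorithm the rest of the post develops, only a sanity check of the surfer model; the three-page graph and the simulate_surfer helper are made up for the illustration:

```python
import random

def simulate_surfer(outlinks, n, c=0.85, steps=100_000, seed=0):
    """Estimates PageRank by simulating the random surfer.

    outlinks maps each page to the list of pages it links to.
    Returns the fraction of steps spent on each page.
    """
    rng = random.Random(seed)
    visits = [0] * n
    page = rng.randrange(n)
    for _ in range(steps):
        visits[page] += 1
        links = outlinks.get(page, [])
        if links and rng.random() < c:
            page = rng.choice(links)   # action 1: follow a random link
        else:
            page = rng.randrange(n)    # actions 2 and 3: jump anywhere
    return [v / steps for v in visits]

# Hypothetical chain 0 -> 1 -> 2, where page 2 has no outbound links.
print(simulate_surfer({0: [1], 1: [2]}, n=3))
```

The visit frequencies approach the PageRank vector as the number of steps grows, which is exactly what the linear-algebra formulation below computes without simulating anything.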

Linear algebra

Number the pages from 1 to $n$. Store the probabilities of visiting each page at a given moment in an $n$-element vector $\mathbf{p}$. The actions of the random surfer correspond to linear transformations of the vector $\mathbf{p}$. To express these transformations, we need three building blocks:

1. A sparse $n \times n$ matrix $\mathbf{H}$ that reflects the following of links. When page $i$ has $f_i$ outbound links, among them a link to page $j$, then the element $h_{ij}$ of matrix $\mathbf{H}$ is equal to $1/f_i$. When there is no link from page $i$ to page $j$, then $h_{ij} = 0$. The vector $c\mathbf{H}^\mathrm{T}\mathbf{p}$ contains the probabilities of visiting each page after action 1 above.
2. A vector with $n$ elements equal to $1/n$, which we denote as $(1/n)\mathbf{1}$, that reflects jumping to a random page. For a scalar $a$ or a $1 \times 1$ matrix $\mathbf{A} = [a]$, each element of the vector $a(1/n)\mathbf{1}$ or $(1/n)\mathbf{1}\mathbf{A}$ contains the same number $a/n$. In particular, the vector $(1-c)(1/n)\mathbf{1}$ contains the probabilities after action 2.
3. A sparse vector $\mathbf{d}$ that reflects dead ends. Its element $d_i$ has value 1 when page $i$ has no outbound links. If there is a link from page $i$ somewhere else, then $d_i = 0$. The $1 \times 1$ matrix $\mathbf{d}^\mathrm{T}\mathbf{p}$ contains the probability that the surfer is visiting the end of the Internet, so the vector $c(1/n)\mathbf{1}\mathbf{d}^\mathrm{T}\mathbf{p}$ corresponds to action 3.

The PageRank vector $\mathbf{p}$ can be computed iteratively. Start with any initial vector $\mathbf{p}_0$ of probabilities that add to one, for instance the vector $(1/n)\mathbf{1}$. Repeat the step

$\mathbf{p}_{k+1} := c \big(\mathbf{H}^\mathrm{T}\mathbf{p}_k + (1/n)\mathbf{1}\mathbf{d}^\mathrm{T}\mathbf{p}_k\big) + (1 - c)(1/n)\mathbf{1}\qquad(1)$

until, say, the sum of absolute differences $|\mathbf{p}_{k+1} - \mathbf{p}_k|$ is smaller than a given threshold $\epsilon$. Do not worry, the sequence $(\mathbf{p}_k)$ does converge whenever $c < 1$.
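As a dense NumPy sketch, here is step (1) iterated seven times on the five-page example graph worked through later in this post; the matrix and vectors below are copied from that example:

```python
import numpy as np

c = 0.85
n = 5
# The five-page example graph: H reflects the links,
# d marks the single dead end (page 5).
H = np.array([[0, 1/3, 1/3, 1/3, 0],
              [0, 0,   1/2, 1/2, 0],
              [0, 1,   0,   0,   0],
              [0, 0,   1/2, 0,   1/2],
              [0, 0,   0,   0,   0]])
d = np.array([0, 0, 0, 0, 1.0])
p = np.ones(n) / n  # p0 = (1/n)1

for _ in range(7):
    # Step (1): follow links, escape dead ends, get bored.
    p = c * (H.T @ p + np.ones(n) / n * (d @ p)) + (1 - c) / n

print(np.round(p, 3))
```

Each step preserves the total probability of one, and after seven steps the vector has settled to three decimal places.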

The dominant operation in step (1) is the multiplication $\mathbf{H}^\mathrm{T}\mathbf{p}_k$. For a sparse matrix $\mathbf{H}$, it takes time proportional to the number of nonzero elements of $\mathbf{H}$, which in our case is the total number of links. While it is possible to express step (1) in the simpler form $\mathbf{p}_{k+1} := \mathbf{G}\mathbf{p}_k$, the matrix $\mathbf{G} = c\big(\mathbf{H}^\mathrm{T} + (1/n)\mathbf{1}\mathbf{d}^\mathrm{T}\big) + (1-c)(1/n)\mathbf{1}\mathbf{1}^\mathrm{T}$ would be dense. The multiplication would then take much longer, time proportional to the square of the number of pages.
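A quick NumPy check on the five-page example makes both halves of this claim concrete: one multiplication by $\mathbf{G}$ performs the whole step (1), but every single element of $\mathbf{G}$ is nonzero:

```python
import numpy as np

c, n = 0.85, 5
H = np.array([[0, 1/3, 1/3, 1/3, 0],
              [0, 0,   1/2, 1/2, 0],
              [0, 1,   0,   0,   0],
              [0, 0,   1/2, 0,   1/2],
              [0, 0,   0,   0,   0]])
d = np.array([0, 0, 0, 0, 1.0])
ones = np.ones(n)

# G = c(H^T + (1/n)1 d^T) + (1 - c)(1/n)1 1^T.
G = c * (H.T + np.outer(ones / n, d)) + (1 - c) / n * np.outer(ones, ones)

p = ones / n
step = c * (H.T @ p + ones / n * (d @ p)) + (1 - c) / n
assert np.allclose(G @ p, step)  # one multiplication does all of step (1)
assert np.all(G > 0)             # ...but G has no zero elements at all
```

Since every column of $\mathbf{G}$ sums to one, $\mathbf{G}$ is a genuine transition matrix of the surfer, which is why the iteration converges to its dominant eigenvector.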

Note that you can use the dot product $\mathbf{d} \cdot \mathbf{p}_k$ to write $(\mathbf{d} \cdot \mathbf{p}_k)(1/n)\mathbf{1}$ instead of $(1/n)\mathbf{1}\mathbf{d}^\mathrm{T}\mathbf{p}_k$, and that you can often see step (1) above written equivalently as

$\mathbf{p}_{k+1}^\mathrm{T} := c \big(\mathbf{p}_k^\mathrm{T}\mathbf{H} + \mathbf{p}_k^\mathrm{T}\mathbf{d}(1/n)\mathbf{1}^\mathrm{T}\big) + (1 - c)(1/n)\mathbf{1}^\mathrm{T}$.
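Both forms compute the same vector, transposed. A two-page sketch, on a graph made up just for this check (page 0 links to page 1, which is a dead end):

```python
import numpy as np

n = 2
H = np.array([[0.0, 1.0],   # page 0 links only to page 1
              [0.0, 0.0]])  # page 1 is a dead end
d = np.array([0.0, 1.0])
p = np.array([0.3, 0.7])
ones = np.ones(n)

# Column form: H^T p + (1/n)1 d^T p.
column_form = H.T @ p + ones / n * (d @ p)
# Row form: p^T H + p^T d (1/n)1^T, using the dot product d . p.
row_form = p @ H + (p @ d) * ones / n
assert np.allclose(column_form, row_form)
```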

An example

Let us work with a concrete graph:

$\hfill \mathbf{H} = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 & 0 \\ 0 & 0 & 1/2 & 1/2 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1/2 & 0 & 1/2 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}, (1/n)\mathbf{1} = \begin{pmatrix} 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \end{pmatrix}, \mathbf{d} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}, \mathbf{p}_0 = \begin{pmatrix} 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \end{pmatrix}$.

Repeatedly computing

$\begin{pmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \end{pmatrix}_{k+1} := 0.85 \left( \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 & 0 \\ 0 & 0 & 1/2 & 1/2 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1/2 & 0 & 1/2 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}^\mathrm{T} \begin{pmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \end{pmatrix}_k + \begin{pmatrix} 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}^\mathrm{T} \begin{pmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \end{pmatrix}_k \right) + 0.15 \begin{pmatrix} 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \end{pmatrix}$,

for instance

$\begin{pmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \end{pmatrix}_1 := 0.85 \left( \begin{pmatrix} 0 \\ 1/3\times1/5+1\times1/5 \\ 1/3\times1/5+1/2\times1/5+1/2\times1/5 \\ 1/3\times1/5+1/2\times1/5 \\ 1/2\times1/5 \end{pmatrix} + \begin{pmatrix} 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \end{pmatrix} \begin{pmatrix} 1/5 \end{pmatrix} \right) + 0.15 \begin{pmatrix} 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \end{pmatrix}$,

we obtain

$\mathbf{p}_0 = \begin{pmatrix} 0.200 \\ 0.200 \\ 0.200 \\ 0.200 \\ 0.200 \end{pmatrix}, \mathbf{p}_1 = \begin{pmatrix} 0.064 \\ 0.291 \\ 0.291 \\ 0.205 \\ 0.149 \end{pmatrix}, \mathbf{p}_2 = \begin{pmatrix} 0.055 \\ 0.321 \\ 0.284 \\ 0.197 \\ 0.143 \end{pmatrix}, \mathbf{p}_3 = \begin{pmatrix} 0.054 \\ 0.315 \\ 0.289 \\ 0.201 \\ 0.141 \end{pmatrix}$
$\mathbf{p}_4 = \begin{pmatrix} 0.054 \\ 0.315 \\ 0.289 \\ 0.203 \\ 0.139 \end{pmatrix}, \mathbf{p}_5 = \begin{pmatrix} 0.054 \\ 0.315 \\ 0.289 \\ 0.203 \\ 0.139 \end{pmatrix}, \mathbf{p}_6 = \begin{pmatrix} 0.054 \\ 0.314 \\ 0.289 \\ 0.203 \\ 0.140 \end{pmatrix}, \mathbf{p}_7 = \begin{pmatrix} 0.054 \\ 0.315 \\ 0.289 \\ 0.202 \\ 0.140 \end{pmatrix}$.

The sum of absolute differences between $\mathbf{p}_{k+1}$ and $\mathbf{p}_k$ decreases exponentially:

$|\mathbf{p}_1 - \mathbf{p}_0| = 0.374 \\ |\mathbf{p}_2 - \mathbf{p}_1| = 0.060 \\ |\mathbf{p}_3 - \mathbf{p}_2| = 0.029 \\ |\mathbf{p}_4 - \mathbf{p}_3| = 0.013 \\ |\mathbf{p}_5 - \mathbf{p}_4| = 0.005 \\ |\mathbf{p}_6 - \mathbf{p}_5| = 0.002 \\ |\mathbf{p}_7 - \mathbf{p}_6| = 0.001$

The code

#!/usr/bin/python3

import numpy
from scipy import sparse

def ComputePageRank(links, c=0.85, epsilon=0.001):
  """Computes the PageRank for given |links|.

  The source and target elements of |links|
  are numbered from zero upwards. Gaps in
  the numbering represent pages having neither
  inbound nor outbound links. Note that with
  normal settings, such pages still get a
  nonzero PageRank.

  Args:
    links: A list of ⟨source, target⟩ pairs.
      Source and target must be integers >= 0.
    c: The probability that the random surfer
      follows a link, if there are any links.
    epsilon: The desired accuracy.

  Returns:
    A numpy.ndarray whose ith element is the
    PageRank of the ith page.
  """
  sources = [source for source, target in links]
  targets = [target for source, target in links]
  n = max(max(sources), max(targets)) + 1
  # Count the outbound links of each source.
  num_outlinks = numpy.bincount(sources, minlength=n)
  # |HT| is sparse, so store it in COOrdinate
  # format: a list of ⟨target, source, probability⟩
  # triples.  Make it explicitly square in case
  # the last page has no outbound links or no
  # inbound links.  Each nonzero element equals
  # one divided by the number of outbound links
  # of its source.
  weights = 1.0 / num_outlinks[sources]
  HT = sparse.coo_matrix(
      (weights, (targets, sources)), shape=(n, n)).tocsr()
  # |d_indices| lists the dead ends: the pages
  # without outbound links.
  d_indices = numpy.where(num_outlinks == 0)[0]

  r = numpy.ones(n) / n
  while True:
    previous_r = r
    r = c * (HT @ r + r[d_indices].sum() / n) + (1.0 - c) / n
    # r.sum() ≈ 1 but prevent errors from adding up.
    r /= r.sum()
    if numpy.absolute(r - previous_r).sum() < epsilon:
      return r

if __name__ == '__main__':
  print(ComputePageRank(
      [(0, 1), (0, 2), (0, 3), (1, 2),
       (1, 3), (2, 1), (3, 2), (3, 4)]))
