I was unable to find a simple and efficient implementation of PageRank, so I wrote this post and my own Python code, which can compute the PageRank of the Polish Wikipedia, with its 17,880,897 links between 1,113,939 articles, within ten seconds. For the curious, here are the top ten articles.
| Title of the article | PageRank | Equivalent English title |
|---|---|---|
| Stany Zjednoczone | 0.00341 | United States |
| Język angielski | 0.00257 | English language |
| Wielka Brytania | 0.00131 | United Kingdom |
| Podział administracyjny Polski 1975–1998 | 0.00124 | Voivodeships of Poland (1975–98) |
## The random surfer
The PageRank of a page is the probability of visiting the page at a fixed distant moment during a random surf through the web. Before each of her actions, the ideal random surfer tosses a biased coin.
1. If the toss is a head, she clicks a random link on the current page, provided that the page has any outbound links.
2. If the toss is a tail, she gets bored and jumps to a random page.
3. If the toss is a head but the current page has no outbound links, she also jumps to a random page.
The original papers recommend setting the head probability c to 0.85.
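As a sanity check of these rules, here is a minimal Monte Carlo sketch of the surfer. The graph, function name, and parameters are my own, made up for illustration; the efficient linear-algebra computation follows below.

```python
import random

def simulate_surfer(out_links, n, c=0.85, steps=100_000, seed=0):
    """Estimate PageRank by letting one random surfer walk for many steps.

    out_links maps each page number to the list of pages it links to;
    pages absent from the dict have no outbound links (dead ends).
    """
    rng = random.Random(seed)
    visits = [0] * n
    page = rng.randrange(n)
    for _ in range(steps):
        visits[page] += 1
        links = out_links.get(page, [])
        if links and rng.random() < c:
            page = rng.choice(links)   # head: follow a random link
        else:
            page = rng.randrange(n)    # tail, or a dead end: jump anywhere
    return [v / steps for v in visits]

# A hypothetical four-page graph; page 3 is a dead end.
estimate = simulate_surfer({0: [1, 2], 1: [2], 2: [0]}, n=4)
```

The fraction of steps spent on each page approximates its PageRank, but the estimate converges slowly; the iterative method below is far more practical.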
Number the pages from 1 to n. Store the probabilities of visiting each page at a given moment in an n-element vector p. The actions of the random surfer correspond to linear transformations of vector p. To express these transformations, we need three building blocks:
- A sparse n×n matrix H that reflects the following of links. When page i has fᵢ outbound links, among them a link to page j, the element hᵢⱼ of matrix H equals 1/fᵢ. When there is no link from page i to page j, then hᵢⱼ = 0. The vector cHᵀp contains the probabilities of visiting each page after action 1 above.
- A vector with n elements equal to 1/n, which we denote as (1/n)𝟏, that reflects jumping to a random page. For a scalar a or a 1×1 matrix A = [a], each element of the vector a(1/n)𝟏 or (1/n)𝟏A contains the same number a/n. In particular, the vector (1 − c)(1/n)𝟏 contains the probabilities after action 2.
- A sparse vector d that reflects dead ends. Its element dᵢ has value 1 when page i has no outbound links. If there is a link from page i somewhere else, then dᵢ = 0. The 1×1 matrix dᵀp contains the probability that the surfer is visiting the end of the Internet, so the vector c(1/n)𝟏dᵀp corresponds to action 3.
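These building blocks can be sketched with scipy.sparse. The four-page graph and the variable names here are my own, chosen for illustration:

```python
import numpy as np
from scipy import sparse

# A hypothetical 4-page graph: 0->1, 0->2, 1->2, 2->0; page 3 is a dead end.
links = [(0, 1), (0, 2), (1, 2), (2, 0)]
n = 4

sources = [s for s, _ in links]
targets = [t for _, t in links]
# Start with ones where links exist, then scale row i by 1/f_i.
H = sparse.coo_matrix((np.ones(len(links)), (sources, targets)),
                      shape=(n, n)).tocsr()
f = np.asarray(H.sum(axis=1)).flatten()           # outbound link counts f_i
inv_f = np.divide(1.0, f, out=np.zeros(n), where=f > 0)
H = sparse.diags(inv_f) @ H                       # rows of dead ends stay zero

d = (f == 0).astype(float)                        # dead-end indicator vector
```

Every row of H now sums to one, except the all-zero row of the dead end, which is exactly the mass that the vector d accounts for.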
The PageRank vector p can be computed iteratively. Start with any initial vector p₀ of probabilities that add to one, for instance the vector (1/n)𝟏. Repeat the step

pₖ₊₁ := c(Hᵀpₖ + (1/n)𝟏dᵀpₖ) + (1 − c)(1/n)𝟏    (1)

until, say, the sum of absolute differences |pₖ₊₁ − pₖ| is smaller than a given threshold ϵ. Do not worry, the sequence (pₖ) does converge whenever c < 1.
The dominant operation in step (1) is the multiplication Hᵀpₖ. For a sparse matrix H, it takes time proportional to the number of nonzero elements in H, which in our case is the total number of links. While it is possible to express step (1) in the simpler form pₖ₊₁ := Gpₖ, the matrix G = c(Hᵀ + (1/n)𝟏dᵀ) + (1 − c)(1/n)𝟏𝟏ᵀ would be dense. The multiplication would then take much longer, proportional to the square of the number of pages.
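To convince yourself that the factored step and the dense matrix G agree, here is a small check on a hypothetical five-page graph (the dense H below is for illustration only):

```python
import numpy as np

c, n = 0.85, 5
# H[i, j] = 1/f_i when page i links to page j; page 4 is a dead end.
H = np.array([[0, 1/3, 1/3, 1/3, 0],
              [0, 0, 1/2, 1/2, 0],
              [0, 1, 0, 0, 0],
              [0, 0, 1/2, 0, 1/2],
              [0, 0, 0, 0, 0]])
d = np.array([0., 0., 0., 0., 1.])
one = np.ones(n)

# The dense matrix G from the text: every entry is strictly positive.
G = c * (H.T + np.outer(one / n, d)) + (1 - c) / n * np.outer(one, one)

p = np.full(n, 1 / n)                 # any probability vector would do
factored = c * (H.T @ p + (d @ p) / n * one) + (1 - c) / n * one
dense = G @ p
```

The two results coincide because 𝟏ᵀp = 1; the factored form merely avoids materializing the n² positive entries of G.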
Note that you can use the dot product d·pₖ to write (d·pₖ)(1/n)𝟏 instead of (1/n)𝟏dᵀpₖ, and that you can often see step (1) above written equivalently as

pₖ₊₁ := cHᵀpₖ + (c·d·pₖ + 1 − c)(1/n)𝟏
Let us work with a concrete graph of five pages: page 0 links to pages 1, 2, and 3; page 1 links to pages 2 and 3; page 2 links to page 1; page 3 links to pages 2 and 4; page 4 has no outbound links.
The sum of absolute differences between pₖ₊₁ and pₖ decreases exponentially:
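This decay can be observed directly. Here is a sketch of the iteration on the five-page graph, using a dense H for brevity (the real code below uses a sparse matrix):

```python
import numpy as np

c, n = 0.85, 5
# Dense version of H for the five-page example graph.
H = np.array([[0, 1/3, 1/3, 1/3, 0],
              [0, 0, 1/2, 1/2, 0],
              [0, 1, 0, 0, 0],
              [0, 0, 1/2, 0, 1/2],
              [0, 0, 0, 0, 0]])
d = np.array([0., 0., 0., 0., 1.])    # page 4 is a dead end

p = np.full(n, 1 / n)
diffs = []
for _ in range(20):
    p_next = c * (H.T @ p + d @ p / n) + (1 - c) / n
    diffs.append(np.abs(p_next - p).sum())
    p = p_next
```

Because the matrix Hᵀ + (1/n)𝟏dᵀ is column-stochastic, each step shrinks the sum of absolute differences by at least a factor of c, which is the exponential decay in the plot.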
```python
#!/usr/bin/python3
import numpy as np
from scipy import sparse


def ComputePageRank(links, c=0.85, epsilon=0.001):
    """Computes the PageRank for given |links|.

    The source or target elements of |links| are numbered from zero
    upwards. Gaps in the numbering represent pages having neither
    inbound nor outbound links. Note that with normal settings, such
    pages still get a nonzero PageRank.

    Args:
      links: A list of <source, target> pairs. Source and target must
        be integers >= 0.
      c: The probability that the random surfer follows a link, if
        there are any links.
      epsilon: The desired accuracy.

    Returns:
      A numpy array whose ith element is the PageRank of the ith page.
    """
    ones = np.ones(len(links))
    sources = [source for source, _ in links]
    targets = [target for _, target in links]
    n = max(max(sources), max(targets)) + 1
    # |HT| is sparse, so store it in COOrdinate format: a list of
    # <target, source, probability> triples. Make it explicitly square
    # in case the last target has no outbound links or the last source
    # has no inbound links.
    HT = sparse.coo_matrix((ones, (targets, sources)), shape=(n, n))
    # Count the outbound links for each source.
    num_outlinks = np.asarray(HT.sum(axis=0)).flatten()
    # Divide each nonzero element of |HT|, so far equal to one, by the
    # number of outbound links in its source.
    HT.data /= num_outlinks[sources]
    d_indices = np.where(num_outlinks == 0)
    r = np.ones(n) / n
    while True:
        previous_r = r
        r = c * (HT @ r + r[d_indices].sum() / n) + (1.0 - c) / n
        # r.sum() should stay 1; renormalize so rounding errors do not
        # add up.
        r /= r.sum()
        if np.absolute(r - previous_r).sum() < epsilon:
            return r


if __name__ == '__main__':
    print(ComputePageRank(
        [(0, 1), (0, 2), (0, 3), (1, 2),
         (1, 3), (2, 1), (3, 2), (3, 4)]))
```