PageRank in 600 words and 60 lines of Python

I was unable to find a simple and efficient implementation of PageRank. As a consequence, I wrote this post and my own Python code that within ten seconds can compute the PageRank of the Polish Wikipedia containing 17,880,897 links between 1,113,939 articles. For the curious, here are the top ten articles.

Title of the article PageRank Equivalent English title
Polska 0.00349 Poland
Francja 0.00343 France
Stany Zjednoczone 0.00341 United States
Język angielski 0.00257 English language
Łacina 0.00225 Latin
Niemcy 0.00205 Germany
Włochy 0.00162 Italy
Wieś 0.00156 Village
Wielka Brytania 0.00131 United Kingdom
Podział administracyjny Polski 1975–1998 0.00124 Voivodeships of Poland (1975–98)

The random surfer

The PageRank of a page is the probability of visiting the page at a fixed distant moment during a random surf through the web. Before each of her actions, the ideal random surfer tosses a biased coin.

  1. If the toss is a head, she clicks a random link on the current page, provided that the page has any outbound links.
  2. If the toss is a tail, she gets bored and jumps to a random page.
  3. If the toss is a head and the current page has no outbound links, she also jumps to a random page.

The original papers recommend setting the head probability c to 0.85.

Linear algebra

Number the pages from 1 to n. Store the probabilities of visiting each page at a given moment in an n-element vector p. The actions of the random surfer correspond to linear transformations of vector p. To express these transformations, we need three building blocks:

  1. A sparse n×n matrix H that reflects the following of links. When page i has fi outbound links, among them a link to page j, then the hij element of matrix H is equal to 1/fi. When there are no links from page i to page j, then hij = 0. The vector cHTp contains the probabilities of visiting each page after action 1 above.
  2. A vector with n elements equal to 1/n, which we denote as (1/n)1, that reflects jumping to a random page. For a scalar a or a 1×1 matrix A = [a], each element of the vector a(1/n)1 or (1/n)1A contains the same number a/n. In particular, the vector (1 − c)(1/n)1 contains the probabilities after action 2.
  3. A sparse vector d that reflects dead ends. Its element di has value 1 when page i has no outbound links. If there is a link from page i somewhere else, then di = 0. The 1×1 matrix dTp contains the probability that the surfer is visiting the end of the Internet so the vector c(1/n)1dTp corresponds to action 3.

The PageRank vector p can be computed iteratively. Start with any initial vector p0 of probabilities that add to one, for instance the vector (1/n)1. Repeat the step

\mathbf{p}_{k+1} := c \big(\mathbf{H}^\mathrm{T}\mathbf{p}_k + (1/n)\mathbf{1}\mathbf{d}^\mathrm{T}\mathbf{p}_k\big) + (1 - c)(1/n)\mathbf{1}\qquad(1)

until, say, the sum of absolute differences |pk+1 – pk| is smaller than a given threshold ϵ. Do not worry, the sequence (pk) does converge whenever c < 1.

The dominant operation in step (1) is the multiplication HTpk. For a sparse matrix H, it takes time proportional to the number of nonzero elements in H, which in our case is the total number of links. While it is possible to express step (1) in a simpler form pk+1 := Gpk, the matrix G = c(HT + (1/n)1dT) + (1 − c)(1/n)1T1 would be dense. The multiplication would take much longer time, proportional to the square of the number of pages.

Note that you can use the dot product d·pk to write (d·pk)(1/n)1 instead of (1/n)1dTpk and that often you can see step (1) above written equivalently as

\mathbf{p}_{k+1}^\mathrm{T} := c \big(\mathbf{p}_k^\mathrm{T}\mathbf{H} + \mathbf{p}_k^\mathrm{T}\mathbf{d}(1/n)\mathbf{1}^\mathrm{T}\big) + (1 - c)(1/n)\mathbf{1}^\mathrm{T}.

An example

Let us work with a concrete graph:

\hfill \mathbf{H} = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 & 0   \\ 0 & 0   & 1/2 & 1/2 & 0   \\ 0 & 1   & 0   & 0   & 0   \\ 0 & 0   & 1/2 & 0   & 1/2 \\ 0 & 0   & 0   & 0   & 0 \end{pmatrix}, (1/n)\mathbf{1} = \begin{pmatrix} 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \end{pmatrix}, \mathbf{d} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}, \mathbf{p}_0 = \begin{pmatrix} 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \end{pmatrix}.

Repeatedly computing

\begin{pmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \end{pmatrix}_{k+1} := 0.85 \left( \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 & 0   \\ 0 & 0   & 1/2 & 1/2 & 0   \\ 0 & 1   & 0   & 0   & 0   \\ 0 & 0   & 1/2 & 0   & 1/2 \\ 0 & 0   & 0   & 0   & 0 \end{pmatrix}^\mathrm{T} \begin{pmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \end{pmatrix}_k + \begin{pmatrix} 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}^\mathrm{T} \begin{pmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \end{pmatrix}_k \right) + 0.15 \begin{pmatrix} 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \end{pmatrix} ,

for instance

\begin{pmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \end{pmatrix}_1 := 0.85 \left( \begin{pmatrix} 0 \\ 1/3\times1/5+1\times1/5 \\ 1/3\times1/5+1/2\times1/5+1/2\times1/5 \\ 1/3\times1/5+1/2\times1/5 \\ 1/2\times1/5 \end{pmatrix} + \begin{pmatrix} 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \end{pmatrix} \begin{pmatrix} 1/5 \end{pmatrix} \right) + 0.15 \begin{pmatrix} 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \end{pmatrix} ,

we obtain

\mathbf{p}_0 = \begin{pmatrix} 0.200 \\ 0.200 \\ 0.200 \\ 0.200 \\ 0.200 \end{pmatrix}, \mathbf{p}_1 = \begin{pmatrix} 0.064 \\ 0.291 \\ 0.291 \\ 0.205 \\ 0.149 \end{pmatrix}, \mathbf{p}_2 = \begin{pmatrix} 0.055 \\ 0.321 \\ 0.284 \\ 0.197 \\ 0.143 \end{pmatrix}, \mathbf{p}_3 = \begin{pmatrix} 0.054 \\ 0.315 \\ 0.289 \\ 0.201 \\ 0.141 \end{pmatrix}
\mathbf{p}_4 = \begin{pmatrix} 0.054 \\ 0.315 \\ 0.289 \\ 0.203 \\ 0.139 \end{pmatrix}, \mathbf{p}_5 = \begin{pmatrix} 0.054 \\ 0.315 \\ 0.289 \\ 0.203 \\ 0.139 \end{pmatrix}, \mathbf{p}_6 = \begin{pmatrix} 0.054 \\ 0.314 \\ 0.289 \\ 0.203 \\ 0.140 \end{pmatrix}, \mathbf{p}_7 = \begin{pmatrix} 0.054 \\ 0.315 \\ 0.289 \\ 0.202 \\ 0.140 \end{pmatrix}.

The sum of absolute differences between pk+1 and pk decreases exponentially:

|\mathbf{p}_1 - \mathbf{p}_0| = 0.374 \\ |\mathbf{p}_2 - \mathbf{p}_1| = 0.060 \\ |\mathbf{p}_3 - \mathbf{p}_2| = 0.029 \\ |\mathbf{p}_4 - \mathbf{p}_3| = 0.013 \\ |\mathbf{p}_5 - \mathbf{p}_4| = 0.005 \\ |\mathbf{p}_6 - \mathbf{p}_5| = 0.002 \\ |\mathbf{p}_7 - \mathbf{p}_6| = 0.001

The code


import scipy
from scipy import sparse

def ComputePageRank(links, c=0.85, epsilon=0.001):
  """Computes the PageRank for given |links|.

  The source or target elements of |links|
  are numbered from zero upwards. Gaps in
  the numbering represent pages having neither
  inbound nor outbound links. Note that with
  normal settings, such pages still get a
  nonzero PageRank.

    links: A list of ⟨source, target⟩ pairs.
      Source and target must be integers >= 0.
    c: The probability that the random surfer
      follows a link, if there are any links.
    epsilon: The desired accuracy.

    A scipy.array whose ith element is the PageRank
    of the ith page.
  ones = scipy.ones(len(links))
  sources = [x[0] for x in links]
  targets = [x[1] for x in links]
  n = max(max(sources), max(targets)) + 1
  # |HT| is sparse, so store it in COOrdinate
  # format: a list of ⟨source, target, probability⟩
  # triples.  Make it explicitly square in case
  # the last target has no outbound links or the
  # last source has no inbound links.
  HT = sparse.coo_matrix(
      (ones, (targets, sources)), shape=(n, n))
  # Count the outbound links for each source.
  num_outlinks = scipy.array(HT.sum(axis=0)).flatten()
  # Divide each nonzero element of |HT|,
  # so far equal to one, by the number
  # of outbound links in its source. /= num_outlinks[sources]
  d_indices = scipy.where(num_outlinks == 0)[0]
  r = scipy.ones(n) / n
  while True:
    previous_r = r
    r = c * (HT * r + sum(r[d_indices])/n) + (1.0 - c)/n
    # r.sum() ≈ 1 but prevent errors from adding up.
    r /= r.sum()
    if scipy.absolute(r - previous_r).sum() < epsilon:
      return r

if __name__ == '__main__':
      [(0, 1), (0, 2), (0, 3), (1, 2),
       (1, 3), (2, 1), (3, 2), (3, 4)]))
PageRank in 600 words and 60 lines of Python

What Makes a Best-Selling Novel?

A Machine Learning Approach

In 2013, Ashok et al. answered this question basing on the writing style, with 61–84% accuracy. This post, on the other hand, examines plot themes in best sellers. Note that my model can hardly predict the commercial success of a novel from its plot. That would be quite a surprising feat, making reviewers obsolete. My goal was more modest: finding statistically profitable topics to write about.

Using PetScan and Wikipedia’s page export, I downloaded 25,359 Wikipedia articles belonging to Category:Novels by year. From each article, I extracted the section named Plot, Plot summary, Synopsis, etc. if present and, stripped of MediaWiki markup, saved it into an SQLite database along with the title of the novel, its year of publication, and a Boolean that indicates if it ever topped the New York Times Fiction Best Seller list:

SELECT title, year, was_bestseller, length(plot) FROM Novels
ORDER BY random() LIMIT 5;
Sharpe's Havoc            | 2003 | 0 | 2759
The Rescue (Sparks novel) | 2000 | 1 |
Slayers                   | 1989 | 0 |
The Warden                | 1855 | 0 | 2793
The Fourth Protocol       | 1984 | 1 | 5666

SELECT count(*) FROM Novels

SELECT count(*) FROM Novels
WHERE plot IS NOT NULL AND was_bestseller;

SELECT min(year) FROM Novels  -- The year of publication.
WHERE was_bestseller;  -- The NYT list starts in 1942.

To obtain easy to interpret results, I have built a logistic regression model on top of the TF–IDF transformation of articles processed by the Porter stemmer. The parameters have default values. In particular, the logistic regression uses L2 regularization so all lowercase words that are not stopwords appear in the model.

import nltk
from nltk.corpus import stopwords
from nltk.stem import porter
from sklearn import cross_validation
from sklearn import linear_model
from sklearn import pipeline
from sklearn.feature_extraction import text

def Tokenize(
    punctuation_re = re.compile(
  text = punctuation_re.sub(' ', text)
  tokens = nltk.word_tokenize(text)
  return [stemmer.stem(x) for x in tokens
          if x.lower() not in stop_set and x[0] not in uppercase]

X = []
y = []
connection = sqlite3.connect('novels.sqlite')
for row in connection.cursor().execute(
    """SELECT plot, was_bestseller FROM Novels
    WHERE year >= 1941 AND plot IS NOT NULL"""):
X_train, X_test, y_train, y_test = (
    cross_validation.train_test_split(X, y, test_size=0.3))
model = pipeline.Pipeline(
    [('tfidf', text.TfidfVectorizer(
          lowercase=False, tokenizer=Tokenize)),
     ('logistic', linear_model.LogisticRegression())]), y_train)

The model can return the probability of being a best seller for any novel b with a plot summary:

logit(b) = −4.6 + 2.5 tfidf(lawyer, b) + 2.4 tfidf(kill, b) + ⋯ − 1.5 tfidf(planet, b)

Pr(was_bestseller(b)|plot(b)) = elogit(b) / (1 + elogit(b))

To put these coefficients in context, tfidf(lawyer, The Firm) ≈ 0.06. As it happens, the model returns logit(b) > 0, that is Pr(was_bestseller(b)|plot(b)) > 1/2 for no novel b from the train or test set. The highest probability, 0.39, is predicted for Cross Fire, indeed a best seller in December 2010. Only if I disable the normalization in TF–IDF or weaken the regularization in the logistic regression, I can overfit the model to the train set while for the test set both its precision and recall would be at most 20%. But, like I wrote in the introduction, this is not the point of this exercise. Let us look at the words with high absolute value of coefficients.

  • Apparently, it pays off to write legal thrillers: lawyer +2.5, case +2.4, law +1.5, client +1.3, jury +1.3, trial +1.3, attorney +1.0, suspect +1.0, judge +0.9, convict +0.8;
  • kill +2.4, murder +1.8, terrorist +1.2, shoot +1.1, body +1.1, die +1.0, serial +0.9, attack +0.9, assassin +0.8, kidnap +0.8, killer +0.8.
  • Political thrillers are not bad either: agent +1.4, politics +1.4, president +1.3, defector +1.2.
  • Business may be involved: firm +1.3, company +1.3, career +1.1, million +1.0, success +1.0, business +0.9, money +0.9.
  • Finally, the characters should have families: husband +1.4, family +1.3, house +1.2, couple +1.2, daughter +1.2, baby +1.1, wife +1.0, father +1.0, child +0.9, birth +0.8, pregnant +0.8, and use a car +1.5 and a phone +0.8.

The genres to avoid for prospective best-selling authors?

  • Sci-fi: planet −1.5, human −1.0, space −0.7, star −0.4, robot −0.3, orbit −0.3.
  • Children’s literature: boy −1.3, school −1.0, young −0.8, girl −0.8, youth −0.4, teacher −0.4, aunt −0.4, grow −0.4.
  • Geography and travels: village −1.0, city −1.0, ship −0.8, way −0.7, go −0.7, land −0.6, adventure −0.6, colony −0.5, native −0.5, follow −0.5, mountain −0.5, crew −0.5, forest −0.5, travel −0.5, inhabit −0.4, sail −0.4, road −0.4, map −0.3, tribe −0.3.
  • War: fight −1.0, warrior −0.6, war −0.6, weapon −0.5, soldier −0.5, army −0.5, ally −0.4, enemy −0.3, conquer −0.3.
  • Fantasy: magic −0.9, creature −0.5, magician −0.4, zombie −0.3, treasure −0.3, dragon −0.3.
  • History: princess −0.5, rule −0.5, kingdom −0.4, castle −0.4, century −0.4, ruler −0.3, palace −0.3 (for what it’s worth, A Game of Thrones only made it to the third place on the list so it does not count as a best seller).

Note that the code above ignores capitalized words. If it does not, the most significant words become the names of characters from best selling book series: Scarpetta +3.0, Stephanie +2.9, Ayla +2.0, etc., with additional insights like FBI +1.3, CIA +1.3, NATO +0.9, Soviet +0.9, or Earth −1.1.

What Makes a Best-Selling Novel?

Stylometry—It Works! (in Some Circumstances)

This post was supposed to reveal the author of the 13th Book of Pan Tadeusz, an anonymous pornographic sequel to the Polish national epic. Despite my attempts that took into account rhyming sounds, word syllable count, and custom morphological analysis for Early Modern Polish, I failed to identify the author. Which is not that bad: authorship attribution, especially when regurgitated by journalists, is often reduced to ex cathedra statements: “a computer has proven that work X was written by author Y”; the fact that the confidence level is unknown is not reported.

Instead of a literary discovery, I present you a little game: Which Polish text is your writing like? It tells me that The 13th Book is most similar to Antymonachomachia by Ignacy Krasicki who died 33 years before the publication of Pan Tadeusz. Oh well.

The game is based on texts from Wolne Lektury, the Polish equivalent of Project Gutenberg. I appreciate Radek Czajka’s help in downloading them.

Since I know little about writing style analysis (known as stylometry), the entire sophistication of my program lies in calculating the frequency of a few dozen of tokens in each text. This idea is similar to Alphonse Bertillon’s anthropometry, a late-19th-century efficient system of identifying recidivists by classifying eleven body parts as small, medium or large.

We compare text style rather than text topics, so the program pays little attention to content words. It counts final punctuation marks, commas, and 86 frequent function words, that is conjunctions, prepositions, adverbs, and so-called qubliks. These counts are divided by the total number of tokens in the text, yielding a 90-dimensional vector of token frequency for each text.

The figure below shows the results of hierarchical clustering of the texts longer than 5000 tokens, obtained with

        frequency_matrix, method=’ward’, distance=’euclidean’))


I, for one, am impressed by its gathering together most of texts written by Kasprowicz, Krasicki, Rzewuski, and Sienkiewicz, or translated by Boy–Żeleński and Ulrich.

How reliable are the results? To answer this question, I perturbed the token counts: for each text composed of N tokens, I replaced k occurrences of each counted token by a random variable with the binomial distribution B(N, k/N), that is the count of heads in N tosses of a biased coin whose heads probability is k/N. For each text from Wolne Lektury, the x axis in the figures below shows the total number of tokens. The y axis shows the frequency with which the nearest point by the Euclidean metric corresponded to a different text or a text by another author/translator, measured in 1000 such random perturbations. In case you wonder how the y axis appears logarithmic and contains zero at once, the plotted variable is log(y + 0.001).

I approximated both the text misattribution probability and the author misattribution probability by 1−(erf(√N/c))b, with empirical values of constants b and c depending on the language, the tokens, and the texts.

Here is my hand-waving explanation of this formula. The coordinates of perturbed points, multiplied by N, have a multivariate binomial distribution (it does not matter whether the coordinates are correlated or not). When N approaches infinity and k/N remains constant, the binomial distribution is asymptotically normal with variance proportional to N (by the central limit theorem applied to tossing the coin), and the multivariate binomial distribution is asymptotically multivariate normal. Dividing the random variables by N, we return to the coordinates, which asymptotically have a multivariate normal distribution with individual variances and covariances proportional to 1/N.

The points divide the 90-dimensional vector space into Voronoi cells whose centres correspond to the mean vectors of the distributions. Moving a point to the other side of some wall of its Voronoi cell means moving it by more than d in the direction perpendicular to the wall. The projection of any multivariate normal distribution with variances and covariances proportional to 1/N onto a vector is a (univariate) normal distribution with variance proportional to 1/N. The probability that a random variable with variance σ2=a/N differs from its mean by more than d (that is, that the permuted point crosses the wall, causing a misattribution) equals 1−erf(d/σ) = 1−erf(dN/√a) = 1−erf(√N/c). Since the Voronoi cell has many walls in different directions, the overall probability that the point exits its cell is approximately equal to 1−erf(√N/c1)×⋯×erf(√N/cn). The erf function decreases rapidly so the factors with the smallest cis dominate the product, which can be approximated by the formula 1−(erf(√N/c))b.


The figures explain why it was hard to ascribe the author to The 13th Book: even if other works by the author belonged to the Wolne Lektury corpus (they probably do not), The 13th Book has merely 1773 tokens.

Stylometry—It Works! (in Some Circumstances)

The Vector Space of the Polish Parliament in Pictures

In fact, only of the lower house, the Sejm. During its current term, 460 deputies voted 5545 times since the 8th of November 2011. The Sejm website publishes the individual results of the votes as PDF files. I put the results into an SQLite database with an unholy hodgepodge of shell scripts, wget, pdftohtml, and Python. Here is a sample query and its output:

SELECT deputy, party, voting, result
FROM VotingResults
JOIN Deputies USING(deputy_id)
JOIN Parties USING(party_id)
JOIN Votings USING(voting_id)
JOIN Results USING(result_id)
ORDER BY random()

HOK MAREK          | PO  | data/glos_32_10.html | NAY
MILLER LESZEK      | SLD | data/glos_49_42.html | AYE
MASŁOWSKA GABRIELA | PiS | data/glos_18_14.html | ABSTAIN
ROMANEK ANDRZEJ    | SP  | data/glos_93_58.html | AYE
ŁOPATA JAN         | PSL | data/glos_57_65.html | ABSENT

The database registers as many as 509 deputies since some of them died during the term or were appointed or elected to other posts. The deputies are grouped into parties according to their latest status. There are data on 5539 votings. I skipped 6 elections of speakers whose PDF layout was different.

Using Python and numpy, we can put the results into a matrix R with 5539 rows and 509 columns. Its columns assigned to deputies are vectors in the 5539-dimensional space of voting results. The elements of R are +1 for AYE, −1 for NAY, and 0 for ABSTAIN and ABSENT. Shift each row to make its mean equal to zero, keeping ABSTAIN and ABSENT unbiased at zero:

M = R.sum(axis=1) / numpy.fabs(R).sum(axis=1)
MT = M[:, numpy.newaxis]
Rcentred = numpy.where(R != 0, R - MT, 0)

Then perform the Singular Value Decomposition of Rcentred:

U, S, VT = numpy.linalg.svd(Rcentred, full_matrices=False)

U holds the so-called left-singular vectors of Rcentred along which the variance of Rcentred is the largest possible. S contains the so-called singular values of Rcentred. We do not use VT.

We have just carried out the Principal Component Analysis of R. Thanks to it, we can visualize 5539 dimensions of R in a much smaller number of dimensions. In our case, three initial left-singular vectors contribute 70.1%, 8.5%, and 3.6% of the total variance of R, respectively. The projection of the vectors of Rcentred along its three initial left-singular vectors looks like this:

A scatter plot of three principal components of the voting patterns

The x axis, usually called the partisan axis, corresponds to the division between the government (PO+PSL) and the opposition (everybody else). The votings with the highest absolute weights in the first left-singular vector concern the involvement of the government in the Amber Gold scandal: this (−0.0216) and this (+0.0217).

The y axis appears to correspond to right wing-left wing polarization (everybody else versus SLD+RP). Its most significant votings are the rejection of a proposal to waive the law against insulting religious feelings (−0.0451) and the proposal to raise the taxes on copper and silver mining (+0.0447).

The z axis, in turn, seems to correspond to sentiments towards the European Union (everybody else versus PiS+RP+ZP aka SP). Its most significant votings concern two alignments between Polish and EU law: this (−0.0463) and this (+0.0470).

Finally, here are the votes of individual deputies projected on the xy, xz, and yz planes.

EDIT: here is a zoomable version, made with the Bokeh visualization library. Thanks for the tip, stared!

The 1st and the 2nd principal components of the voting patterns

The 1st and the 3rd principal components of the voting patterns

The 2nd and the 3rd principal components of the voting patterns

The Vector Space of the Polish Parliament in Pictures