Seven Triple Double Checks

I have recently enhanced the chess search engine I write in my spare time with double-check detection. To test it, I made the engine ingest 5.82 million unique over-the-board games from the 15 Million chess games database kept up to date until 2015. 0.85% of these games turned out to contain double checks. More precisely, there are:

  • 48261 games with one double check,
  • 1184 games with two double checks,
  • 76 games with three double checks,
  • 3 games with four double checks,
  • 3 games with five double checks.

I naively expected that the games with many double checks would shine with mate combinations, yet as a rule they just end with perpetual check, plain and double in turn. Much more spectacular and much less frequent are games where double checks occur in a row. It took my engine 0.75 and 0.55 seconds, respectively, to answer the queries

???++ K?? ???++


???++ K?? ???##

with 129 pairs and 6 triplets of consecutive double checks. I show the triplets below. They all follow the same pattern where rook blocks and discovers the diagonals of a bishop pair. Incidentally, among chess problems, a mate in 13 by Stojnić and Babić (The Problemist, 2004) sports as many as 13 consecutive double checks based on the same idea.

Holger Norman–Hansen vs Erik Andersen
Copenhagen, 1930

1. e4 e5 2. Nf3 Nf6 3. Nxe5 d6 4. Nf3 Nxe4 5. d4 d5 6. Bd3 Bg4 7. O-O Bd6 8. c4 O-O 9. cxd5 f5 10. Nc3 Nd7 11. h3 Bh5 12. Nxe4 fxe4 13. Bxe4 Nf6 14. Bf5 Kh8 15. Be6 Ne4 16. g4 Bg6 17. Kg2 Qf6 18. Be3 Rae8 19. h4


19. … Rxe6 20. dxe6 Nc3 21. bxc3 Be4 22. Kh3 Qxf3+ 23. Qxf3 Rxf3+ 24. Kg2 Rg3++ 25. Kh2 Rg2++ 26. Kh1 Rh2++ 27. Kg1 Rh1# 0-1

Steven Avramidis vs Georgios Alexopoulos
Windsor, 1978

1.d4 Nf6 2.c4 c5 3.dxc5 e5 4.b4 a5 5.Ba3 axb4 6.Bxb4 Na6 7.Ba3 Nxc5 8.Bxc5 Bxc5 9.Nf3 e4 10.Nd4 e3 11.fxe3 d5 12.cxd5 Nxd5 13.Nc2 O-O 14.g3 Nxe3 15. Qxd8 Nxc2+ 16.Kd2 Rxd8+ 17.Kxc2 Bf5+ 18.Kb2 Bd4+ 19.Nc3 Rac8 20.Rc1


20. … Rxc3 21.Rxc3 Rc8 22.Bg2 Rxc3 23.Rc1 Rc2++ 24.Kb1 Rb2++ 25.Ka1 Rb1# 0-1

Pascal Tching–Sin vs Edouard Bonnet
Étang-Salé, 2000.11.03

1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4 Nf6 5. Nc3 a6 6. Be2 e5 7. Nf3 Be7 8. h3 O-O 9. O-O b5 10. Bg5 Nbd7 11. a3 Bb7 12. Bd3 Rc8 13. Re1 h6 14. Bh4 Nb6 15. Qd2 Nc4 16. Bxc4 Rxc4 17. Bxf6 Bxf6 18. Rad1 Be7 19. Qe2 Qa8 20. Nd2 Rcc8 21. Nf1 f5 22. Ng3 fxe4 23. Ncxe4 d5 24. Nd2 Bc5 25. Nf3 Rce8 26. Rf1 Qb8 27. Nh5 e4 28. Nd4 Qe5 29. c3 Bd6 30. g3 Bc8 31. Kg2 Qg5 32. g4 Qh4 33. Ng3 g6 34. Qc2 Kg7 35. Nde2 Bb7 36. Qd2 Be5 37. Qe3 Re7 38. Qb6 Ref7 39. Qe6 Bb8 40. Qb6 Be5 41. Qe3 Rf3 42. Qb6 R8f7 43. Qe6 Bb8 44. Qe8 Bc7 45. Rd4 e3 46. g5 Qxg5 47. Rg4


47. … d4 48. Rxg5 Rxg3++ 49. Kh2 Rg2++ 50. Kh1 Rh2++ 0-1

Guido Müssig vs Natia Engels
Germany, 2003.11.08

1. d4 Nf6 2. c4 c5 3. d5 e6 4. Nc3 exd5 5. cxd5 d6 6. e4 g6 7. f3 Bg7 8. Bg5 a6 9. a4 h6 10. Be3 O-O 11. Qd2 Kh7 12. Nge2 Nbd7 13. Ng3 Qa5 14. Be2 b5 15. O-O b4 16. Nd1 Re8 17. Bf4 Ne5 18. Ne3 Nfd7 19. Nh1 Qc7 20. Bg3 c4 21. f4 Nd3 22. Nxc4 Qxc4 23. Bxd3 Qd4+ 24. Nf2 Nc5 25. f5 Nb3 26. fxg6+ fxg6 27. Qc2 Nxa1 28. Rxa1 Qxb2 29. Qxb2 Bxb2 30. Rb1 Bc3 31. Bxd6 a5 32. Bb5 Rg8 33. g4 h5 34. h3 Bd4 35. Kg2 Ba6 36. Bc6 Raf8 37. Bxf8 Rxf8 38. Nh1 Bd3 39. Re1 b3 40. Bb5 Bc2 41. Bc4 b2 42. Ba2 Bc3 43. Rg1


43. … Bxe4+ 44. Kg3 Rf3+ 45. Kg2 Bd4 46. Re1 Re3+ 47. Kf1 Bd3+ 48. Kf2 Re2++ 49. Kf1 Rf2++ 50. Kg1 Rf1++ 0-1

Maria Petsetidi vs Nikolaos Tepelenis
Agios Kirykos, 2010.07.15

1. e4 e6 2. b3 d5 3. Bb2 dxe4 4. Nc3 Nf6 5. Qe2 Bd7 6. Nxe4 Nxe4 7. Qxe4 Bc6 8. Qg4 Nd7 9. O-O-O Nf6 10. Qe2 Qd5 11. Nf3 Qe4 12. d4 Bd6 13. Ne5 Qxe2 14. Bxe2 Bxg2 15. Rhg1 Be4 16. c4 g6 17. d5 exd5 18. Ng4 Bf4+ 19. Ne3 Ke7 20. Ba3+ Ke6 21. Rd4 c6 22. Kd1 Be5 23. cxd5+ Nxd5 24. Nxd5 Bxd5 25. Rd2 Bxh2 26. Re1 Be5 27. Bg4+ Kf6 28. f4 Bxf4 29. Rf2 g5 30. Bb2+ Kg6 31. Rg1 f5 32. Be2 Be3 33. Rh2 h5 34. Rf1 Rh7 35. Bd3 Bf4 36. Re2 Rd8 37. Re5 Be4 38. Re6+ Kf7 39. Rf6+ Ke7 40. Re1


40. … Rxd3+ 41. Kc2 Rd2++ 42. Kc1 Rc2++ 43. Kb1 Rxb2++ {missing a mate in one} 0-1

Kjell Børre Grebstad vs Karl–Petter Jernberg
Tromsø, 2010.07.31

1. d4 d5 2. c4 e6 3. Nf3 Nf6 4. Nc3 Bb4 5. Bg5 Bxc3+ 6. bxc3 Nbd7 7. e3 c6 8. Qc2 O-O 9. Bd3 Qc7 10. Bh4 Re8 11. h3 c5 12. Bg3 Qc6 13. Rb1 cxd4 14. cxd4 Ne4 15. Bh2 Ndf6 16. Ne5 Qa6 17. O-O Qa5 18. f3 Nd2 19. Rb5 Nxf3+ 20. Nxf3 Qd8 21. Ne5 a6 22. Rb2 Qe7 23. c5 g6 24. Qf2 Kg7 25. Bg3 h6 26. Qf3 Rf8 27. Rbf2 Bd7


28. Qxf6+ Qxf6 29. Rxf6 Be8 30. Nxg6 Rg8 31. Be5 fxg6 32. Rxg6++ Kh7 33. Rg7++ Kh8 34. Rh7# 1-0

The seventh confirmed occurrence of a triple double check comes from Renaud and Kahn’s book The Art of Checkmate. The initial moves of the game are lost.

Victor Place vs N.N.
Café de la Régence, 1922


1. Nxg7 Kxg7 2. d5 Bg4 3. Rxf6 Bxd1 4. Rg6++ Kh7 5. Rg7++ Kh8 6. Rh7++ Kg8 7. Rh8# 1-0

Seven Triple Double Checks

Reproducing the AuthaGraph World Map

The AuthaGraph world map by Hajime Narukawa took the media by storm when it was honoured with the Japan Good Design Grand Award in October 2016. After several rounds of copying the news, the journalists dubbed it “the most accurate world map ever”, comparing it with the usual straw man, Mercator projection. Hype aside, the facts are: the map almost avoids the chopping of land, almost has the equal-area property, and does a decent job with land shapes. Its large distortions near the vertices of the basal tetrahedron and drastic discontinuities along the boundary are masked by the ocean.

Since the map projection is not public, I set out to develop its lookalike with PROJ.4. My way is simpler than the AuthaGraph method at the cost of losing the proportions of areas that AuthaGraph claims to preserve. You can compare the results below. Among the most visible changes, my Iceland lies on the European side of the map, Australia properly looks smaller than China, and Brazil’s shape is distorted differently. Given the overall distortions of AuthaGraph, treating Earth as an ellipsoid would not buy us much, so my method assumes it is a sphere. The code is available on Bitbucket.

Original AuthaGraph


The top border of the original AuthaGraph map corresponds to a great-circle arc that separates Africa from Iceland. If the border passed through the North Pole, it would have to follow a meridian. Any meridian near Africa with westernmost longitude 17° 33′ W and Iceland with easternmost longitude 13° 16′ W must pass through either land. As the border avoids land, it must not pass through the North Pole, which is projected inside the map. Therefore, in spite of the claims about tiling the plane with the map, juxtaposing two maps yields a map with two North Poles close to each other. Incidentally, even the Bering Strait is blocked for a land-avoiding meridian by St. Lawrence Island and Unalaska.

The map of the Arctic below shows the possible locations of the northern vertex with a land-avoiding top border. My northern vertex lies around 73° N 121° E, near the delta of Lena.


Another constraint is to keep Australia and New Zealand in the middle triangle of the map. The original map does not fulfill it entirely: the Stewart Island lies in its Antarctic triangle. Therefore, I based the projection on a disphenoid, an irregular tetrahedron that can be rotated, scaled, and translated so that its vertices have Cartesian coordinates (+a, +b, +c), (+a, -b, -c), (-a, +b, -c), and (-a, -b, +c). A disphenoid can still be developed to a parallelogram so we can cut a right triangle or trapezoid near the acute-angle vertex of its net and attach it to the opposite side, making the map rectangular.

My approach to projecting a point P from a sphere onto a plane begins with finding the spherical triangle ABC, ABD, ACD, or BCD that contains P.

Given a tetrahedron KLMN inscribed into the unit sphere, the point P lies inside the spherical triangle KLM if \left(P \cdot (K \times L)\right)\left(N \cdot (K \times L)\right) \le 0, \left(P \cdot (K \times M)\right)\left(N \cdot (K \times M)\right) \le 0, and \left(P \cdot (L \times M)\right)\left(N \cdot (L \times M)\right) \le 0.

I considered several ways to map spherical triangles onto planar triangles, keeping in mind that the projection should be continuous on the edges shared by two triangles and smooth in their interiors. Here are my attempts:

  • Barycentric coordinates.

    The barycentric coordinates of point P with respect to a spherical or planar triangle KLM are the ratios of areas of spherical or planar triangles: (\Delta PLM / \Delta KLM, \Delta KPM / \Delta KLM, \Delta KLP / \Delta KLM).

    I was finding a point P' with planar barycentric coordinates corresponding to the spherical barycentric coordinates of P. Unfortunately, this elegant mapping is continuous on the shared edge of two triangles only if they are symmetric with respect to this edge. In an irregular tetrahedron, this is not the case.

  • Gnomonic projection, that is casting from the centre of the sphere onto the plane KLM. Although this projection is continuous on the edges of triangles, I find its distortion unacceptable.
  • Find points K', L', and M' on the sphere, being the intersections of great circles LM with KP, KM with LP, and KL with MP. Express the location of point P as a triple \left(\lvert KM'\rvert / \lvert KL \rvert, \lvert LK'\rvert / \lvert LM \rvert, \lvert ML'\rvert / \lvert MK \rvert\right). On a plane, the three lines that correspond to this triple in general do not cross in one point. Treating the centre of the intersection triangle as P' leads to discontinuities along the edges. I took the incentre of the intersection triangle instead.

The above method produces the map shown below, with sharp turns on the edges of the basal tetrahedron. I smoothed the joints by interpolation.


The gepard spots on the map below are Tissot indicatrices for the projection, magnified images of infinitesimal circles on Earth’s surface.

The Tissot indicatrix is an ellipse with scale factor along the meridian h = \sqrt{(\partial x/\partial\phi)^2 + (\partial y/\partial\phi)^2}/R, scale factor along the parallel k = \sqrt{(\partial x/\partial\lambda)^2 + (\partial y/\partial\lambda)^2}/(R\cos\phi), and the angle \theta' at which the meridian and parallel intersect given by \sin\theta' = \left((\partial y/\partial\phi)(\partial x/\partial\lambda) - (\partial x/\partial\phi)(\partial y/\partial\lambda)\right)/(R^2 hk\cos\phi). The angular distortion is equal to \lvert\pi/2 - \theta'\rvert.


Using the Nelder–Mead method, I numerically minimized the total angular distortion in 461 approximately equally spaced points located on land while keeping the borders of the map at least 20 kilometres from any land. With respect to these conditions, the optimal parameters of the projection turned out to be:

  • The coordinates of the northern vertex: 73.10° N 120.96° E.
  • The rotation of the tetrahedron: −23.56°.
  • The coordinates of the vertices: b = 0.763 a, c = 0.963 a.
  • The vertical scale factor: 1.03.

These parameters yield a rectangle whose sides are in ratio 4:1.96. To separate Antarctica from Africa, I cut the map vertically at x = 1.6.

With a forward projection, you can make vector maps. For making raster maps, you need the inverse projection. As inverting the equations of the projection algebraically is a hopeless task, I resorted to the numerical Broyden’s method. The initial guess of (\phi, \lambda) = f^{-1}(x, y) comes from inverting the gnomonic projection. When the accuracy is set to 10-7 R, i.e. about one metre, the method usually finds \phi and \lambda after 5 iterations.

The map below is based on a public domain raster image from Natural Earth. The other maps in this post use public domain shapefiles from the same source.


Reproducing the AuthaGraph World Map

Warping Maps

Polish railway network
In this warped map of Poland, straight line distance (emphatically not distance measured along the segments) between points in the network is approximately proportional to their rail distance. I do not have a good name for this kind of map. All I know is: if it showed the travel time, we would call it a time-space map. By the way, here is the unwarped version.

Interestingly, the map is the best possible in the sense of minimizing the squared error of the approximation. The rest of this article explains how to produce it.

  1. Scrape the data about Polish railway lines from and put it into an SQLite database. Here is a sample:
    SELECT line, name, kind, metrage, another_line
    FROM Lines JOIN Names USING(name_id) JOIN Kinds USING(kind_id)
    WHERE line = 137 ORDER BY metrage LIMIT 21;
    137 | Katowice              | stacja węzłowa     |   381 |   1
    137 | Katowice              | stacja węzłowa     |   381 | 138
    137 | Katowice              | stacja węzłowa     |   381 | 139
    137 | Katowice              | stacja węzłowa     |   381 | 656
    137 | Katowice              | stacja węzłowa     |   381 | 713
    137 | Katowice Załęże       | przystanek osobowy |  2528 |
    137 | Katowice Towarowa KTC | stacja węzłowa     |  2913 | 713
    137 | Chorzów Batory        | stacja węzłowa     |  6166 | 131
    137 | Chorzów Batory        | stacja węzłowa     |  6166 | 164
    137 | Chorzów Batory        | stacja węzłowa     |  6166 | 713
    137 | Świętochłowice        | przystanek osobowy |  8413 |
    137 | Ruda Chebzie          | stacja węzłowa     | 11740 | 187
    137 | Ruda Chebzie          | stacja węzłowa     | 11740 | 189
    137 | Ruda Śląska           | przystanek osobowy | 14068 |
    137 | Zabrze                | stacja             | 18926 |
    137 | Gliwice               | stacja węzłowa     | 27100 | 141
    137 | Gliwice               | stacja węzłowa     | 27100 | 147
    137 | Gliwice               | stacja węzłowa     | 27100 | 168
    137 | Gliwice               | stacja węzłowa     | 27100 | 200
    137 | Gliwice               | stacja węzłowa     | 27100 | 671
    137 | Gliwice               | stacja węzłowa     | 27100 | 711
  2. Construct a list of 1838 edges between all terminal and junction nodes of the network, labelled with their length in metres:
    edges = [
        (u’Katowice’, u’Katowice Towarowa KTC’, 2532),
        (u’Katowice Towarowa KTC’, u’Chorzów Batory’, 3253),
        (u’Chorzów Batory’, u’Ruda Chebzie’, 5574),
        (u’Ruda Chebzie’, u’Gliwice’, 15360), ... ]
  3. Convert the list of edges into an adjacency matrix, putting infinite distance between nonadjacent vertices. The matrix has 1258 rows and columns.
  4. Compute the shortest distances between all pairs of vertices with the Floyd–Warshall algorithm, obtaining the so-called proximity matrix D.
  5. Perform multidimensional scaling (MDS) of the proximity matrix D, in its classical version invented by Torgerson in 1952:
    coords = sklearn.manifold.MDS(

    Under the hood, it would perform four steps:

    1. set up the matrix of squared proximities D(2)
      [if the ith point has coordinates xiyizi,…, then dij(2) = (xixj)2 + (yiyj)2 + (zizj)2 +⋯],
    2. perform the “double centring”: subtract the row mean and the column mean from the elements of D(2), add the total mean, and multiply by −1/2, obtaining matrix B
      [bij = xixj + yiyj + zizj +⋯],
    3. find the singular value decomposition of B into the product of three matrices USUT,
    4. take the initial two singular vectors U2 and two corresponding singular values S2 — the rows of matrix X = U2S21/2 give the coordinates of the points
      [the initial two coordinates computed by SVD follow the principal axes of B, the remaining coordinates are less important, so bijxixj + yiyj or in matrix form BXXT = U2S2U2T],

    were it not using another algorithm, called SMACOF (Scaling by Majorizing a Complicated Function).

  6. The obtained map is randomly oriented. Rotate it to move Gdańsk Główny (G) straight north of Katowice (KO).
  7. If Lublin (Lb) happens to lie west of the Katowice–Gdańsk Główny meridian, flip the X axis.

On my machine, steps 2–7 take about 20 seconds. The code is available on Bitbucket.

Warping Maps

O sol é grande…

Słońce jest wielkie: ptactwo cicho zniża loty
w okresie, który zwykle w zimnej trwa pogodzie.
Zbudziłbym się przy z góry spadającej wodzie,
lecz nie ze snu bynajmniej — z poważnej zgryzoty.

O rzeczy, ciągle próżne — ciągłe wasze zwroty,
jakie serce was przyjmie w ufności i zgodzie?
Czas mija, dni za dniami ciągną się w pochodzie,
bardziej chwiejne niż żagli na wietrze trzepoty.

Widziałem tu już cienie, widziałem kwitnienie,
widziałem tyle wody, zieleń tak bogatą,
słyszałem wszystkich ptaków o miłości pienie.

Wszystko jest suche, nieme, z barwą popielatą;
ja też, zmieszany, inne przyjmuję odcienie.
Odżywa wszystko inne: nie ma rady na to!

Coat of arms of Portugal
Coat of arms of Portugal. Source: Wikipedia

This is the Polish translation… not of the first Portuguese sonnet but of the most popular one among the earliest Portuguese sonnets. The original, shown below, was written by Francisco de Sá de Miranda between 1530 and 1558. And today, June 10, is the Portugal Day.

O sol é grande: caem coa calma as aves,
Do tempo em tal sazão, que sói ser fria.
Esta água que de alto cai acordar-me-ia,
Do sono não, mas de cuidados graves.

Ó cousas, todas vãs, todas mudaves,
Qual é tal coração que em vós confia?
Passam os tempos, vai dia trás dia,
Incertos muito mais que ao vento as naves.

Eu vira já aqui sombras, vira flores,
Vi tantas águas, vi tanta verdura,
As aves todas cantavam de amores.

Tudo é seco e mudo; e, de mistura,
Também mudando-me eu fiz doutras cores.
E tudo o mais renova: isto é sem cura!

O sol é grande…

Kling-dikt över författarens sinnebild, en silkesmask

Sonet o godle autora, jedwabniku

Ucisz się, myśli! Z wolna spekulujesz,
czym być to może. Spójrz: ta kreatura,
żałosny, nagi robak, to figura,
co nie ma kształtu — nic w niej nie wyczujesz

wzrokiem. Lecz zobacz, przecież więcej tu jest
niż sądzisz: zdatna, czysta, cna natura,
dziwna, przez Boga utworzona, która
gryząc rośliny, kornie włókno snuje,

wyplata nici, tka płótno jedwabne.
Tworzy skarb z liści, aż pusta, zmarniała,
owita przędzą, żywota dokona.

Lecz wstanie nowe stworzenie powabne
o pięknych skrzydłach: świeża i wspaniała
dusza, przez żywe słońce przebudzona.

Coat of arms of Sweden
Coat of arms of Sweden. Source: Wikipedia

I published this translation of the first Swedish sonnet to celebrate June 6, the National Day of Sweden. Below is the original, written in 1644 by Georg Stiernhielm, and a literal re-translation from Polish into English for those who want to criticize my version. For another sonnet with a silkworm in it, read my translation of Farie së ndeerme….

Kling-dikt över författarens sinnebild, en silkesmask

Håll stilla mitt förnuft, dig saktelig besinna,
vad detta vara må. Du sir här en figur,
en usel, naken kropp, en mask, ett kreatur,
som ingen skapnad har, där intet är till finna,

som ögat lyster se. Men märk: här ligger inna
mer än en tänka kan, en nyttig, ädel, pur,
en sällsam, underlig av Gud beredd natur:
en mask, dess spis är blad, dess id är artigt spinna,

dess spunna silkes-tråd, dess verk och väv är siden.
Av blad gör han en skatt, till dess han, tom och mager,
invecklat in-dör i sin väv och livet stäcker.

Men si, en ny figur, med vingar prydd, med tiden
här kommer fram igen, uppkvickter, fin och fager,
en livlig sol hans själ med kraft en gång uppväcker.

Sonnet about the author’s crest, silkworm

Calm down, [my] thought! You slowly speculate
what this can be. Look: this creature,
pathetic, naked worm, is a figure
that has no shape — you will sense nothing in it

with sight. But see: still, there is more here
than you suppose: an apt, pure, noble nature,
strange, formed by God, which,
chewing plants, humbly spins fibre,

plaits threads, weaves silk linen.
It is creating a treasure from leaves until, empty, wasted,
wrapped in yarn, it will end [its] life.

But a new graceful being will rise,
with beautiful wings: a fresh and splendid
soul, awakened by the vivid sun.

Kling-dikt över författarens sinnebild, en silkesmask

What Makes a Best-Selling Novel?

A Machine Learning Approach

In 2013, Ashok et al. answered this question basing on the writing style, with 61–84% accuracy. This post, on the other hand, examines plot themes in best sellers. Note that my model can hardly predict the commercial success of a novel from its plot. That would be quite a surprising feat, making reviewers obsolete. My goal was more modest: finding statistically profitable topics to write about.

Using PetScan and Wikipedia’s page export, I downloaded 25,359 Wikipedia articles belonging to Category:Novels by year. From each article, I extracted the section named Plot, Plot summary, Synopsis, etc. if present and, stripped of MediaWiki markup, saved it into an SQLite database along with the title of the novel, its year of publication, and a Boolean that indicates if it ever topped the New York Times Fiction Best Seller list:

SELECT title, year, was_bestseller, length(plot) FROM Novels
ORDER BY random() LIMIT 5;
Sharpe's Havoc            | 2003 | 0 | 2759
The Rescue (Sparks novel) | 2000 | 1 |
Slayers                   | 1989 | 0 |
The Warden                | 1855 | 0 | 2793
The Fourth Protocol       | 1984 | 1 | 5666

SELECT count(*) FROM Novels

SELECT count(*) FROM Novels
WHERE plot IS NOT NULL AND was_bestseller;

SELECT min(year) FROM Novels  -- The year of publication.
WHERE was_bestseller;  -- The NYT list starts in 1942.

To obtain easy to interpret results, I have built a logistic regression model on top of the TF–IDF transformation of articles processed by the Porter stemmer. The parameters have default values. In particular, the logistic regression uses L2 regularization so all lowercase words that are not stopwords appear in the model.

import nltk
from nltk.corpus import stopwords
from nltk.stem import porter
from sklearn import cross_validation
from sklearn import linear_model
from sklearn import pipeline
from sklearn.feature_extraction import text

def Tokenize(
    punctuation_re = re.compile(
  text = punctuation_re.sub(' ', text)
  tokens = nltk.word_tokenize(text)
  return [stemmer.stem(x) for x in tokens
          if x.lower() not in stop_set and x[0] not in uppercase]

X = []
y = []
connection = sqlite3.connect('novels.sqlite')
for row in connection.cursor().execute(
    """SELECT plot, was_bestseller FROM Novels
    WHERE year >= 1941 AND plot IS NOT NULL"""):
X_train, X_test, y_train, y_test = (
    cross_validation.train_test_split(X, y, test_size=0.3))
model = pipeline.Pipeline(
    [('tfidf', text.TfidfVectorizer(
          lowercase=False, tokenizer=Tokenize)),
     ('logistic', linear_model.LogisticRegression())]), y_train)

The model can return the probability of being a best seller for any novel b with a plot summary:

logit(b) = −4.6 + 2.5 tfidf(lawyer, b) + 2.4 tfidf(kill, b) + ⋯ − 1.5 tfidf(planet, b)

Pr(was_bestseller(b)|plot(b)) = elogit(b) / (1 + elogit(b))

To put these coefficients in context, tfidf(lawyer, The Firm) ≈ 0.06. As it happens, the model returns logit(b) > 0, that is Pr(was_bestseller(b)|plot(b)) > 1/2 for no novel b from the train or test set. The highest probability, 0.39, is predicted for Cross Fire, indeed a best seller in December 2010. Only if I disable the normalization in TF–IDF or weaken the regularization in the logistic regression, I can overfit the model to the train set while for the test set both its precision and recall would be at most 20%. But, like I wrote in the introduction, this is not the point of this exercise. Let us look at the words with high absolute value of coefficients.

  • Apparently, it pays off to write legal thrillers: lawyer +2.5, case +2.4, law +1.5, client +1.3, jury +1.3, trial +1.3, attorney +1.0, suspect +1.0, judge +0.9, convict +0.8;
  • kill +2.4, murder +1.8, terrorist +1.2, shoot +1.1, body +1.1, die +1.0, serial +0.9, attack +0.9, assassin +0.8, kidnap +0.8, killer +0.8.
  • Political thrillers are not bad either: agent +1.4, politics +1.4, president +1.3, defector +1.2.
  • Business may be involved: firm +1.3, company +1.3, career +1.1, million +1.0, success +1.0, business +0.9, money +0.9.
  • Finally, the characters should have families: husband +1.4, family +1.3, house +1.2, couple +1.2, daughter +1.2, baby +1.1, wife +1.0, father +1.0, child +0.9, birth +0.8, pregnant +0.8, and use a car +1.5 and a phone +0.8.

The genres to avoid for prospective best-selling authors?

  • Sci-fi: planet −1.5, human −1.0, space −0.7, star −0.4, robot −0.3, orbit −0.3.
  • Children’s literature: boy −1.3, school −1.0, young −0.8, girl −0.8, youth −0.4, teacher −0.4, aunt −0.4, grow −0.4.
  • Geography and travels: village −1.0, city −1.0, ship −0.8, way −0.7, go −0.7, land −0.6, adventure −0.6, colony −0.5, native −0.5, follow −0.5, mountain −0.5, crew −0.5, forest −0.5, travel −0.5, inhabit −0.4, sail −0.4, road −0.4, map −0.3, tribe −0.3.
  • War: fight −1.0, warrior −0.6, war −0.6, weapon −0.5, soldier −0.5, army −0.5, ally −0.4, enemy −0.3, conquer −0.3.
  • Fantasy: magic −0.9, creature −0.5, magician −0.4, zombie −0.3, treasure −0.3, dragon −0.3.
  • History: princess −0.5, rule −0.5, kingdom −0.4, castle −0.4, century −0.4, ruler −0.3, palace −0.3 (for what it’s worth, A Game of Thrones only made it to the third place on the list so it does not count as a best seller).

Note that the code above ignores capitalized words. If it does not, the most significant words become the names of characters from best selling book series: Scarpetta +3.0, Stephanie +2.9, Ayla +2.0, etc., with additional insights like FBI +1.3, CIA +1.3, NATO +0.9, Soviet +0.9, or Earth −1.1.

What Makes a Best-Selling Novel?

Wisła in Fact Likes Cracovia but Doesn’t Know How to Start Talking

The relationships between fans of football clubs in Poland can be fourfold: neutrality, friendship (zgoda), enmity (kosa), or alliance (układ). The belief that there are two disjoint blocs gathered around The Great Triad (Arka, Cracovia, and Lech) and Three Kings of Great Cities (Śląsk, Wisła, and Lechia) is false. Here is the largest connected component of the graph of friendships.


The graph of enmities would be less clear. For instance, Cracovia has friendships with Tarnovia and Sandecja but Tarnovia and Sandecja are enemies. Or GKS, Górnik, Ruch, and Zagłębie: every two of them are enemies.


Wisła in Fact Likes Cracovia but Doesn’t Know How to Start Talking