What Makes a Best-Selling Novel?

A Machine Learning Approach

In 2013, Ashok et al. answered this question based on writing style, with 61–84% accuracy. This post, on the other hand, examines plot themes in best sellers. Note that my model can hardly predict the commercial success of a novel from its plot; that would be quite a surprising feat, one that would make reviewers obsolete. My goal was more modest: finding statistically profitable topics to write about.

Using PetScan and Wikipedia’s page export, I downloaded 25,359 Wikipedia articles belonging to Category:Novels by year. From each article, I extracted the section named Plot, Plot summary, Synopsis, etc., if present, stripped it of MediaWiki markup, and saved it into an SQLite database along with the title of the novel, its year of publication, and a Boolean indicating whether it ever topped the New York Times Fiction Best Seller list:

SELECT title, year, was_bestseller, length(plot) FROM Novels
ORDER BY random() LIMIT 5;
Sharpe's Havoc            | 2003 | 0 | 2759
The Rescue (Sparks novel) | 2000 | 1 |
Slayers                   | 1989 | 0 |
The Warden                | 1855 | 0 | 2793
The Fourth Protocol       | 1984 | 1 | 5666

SELECT count(*) FROM Novels;

SELECT count(*) FROM Novels
WHERE plot IS NOT NULL AND was_bestseller;

SELECT min(year) FROM Novels  -- The year of publication.
WHERE was_bestseller;  -- The NYT list starts in 1942.
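The section extraction itself can be sketched with a regular expression over the wikitext. This is a guess at the approach, not the author’s actual code; the section-name list and the pattern are assumptions:

```python
import re

# Match a level-2 heading named Plot, Plot summary, or Synopsis,
# and capture everything up to the next level-2 heading (or the end).
SECTION_RE = re.compile(
    r'^==\s*(?:Plot(?: summary)?|Synopsis)\s*==\s*$(.*?)(?=^==[^=]|\Z)',
    re.MULTILINE | re.DOTALL | re.IGNORECASE)

def extract_plot(wikitext):
    """Return the plot section of a Wikipedia article, or None."""
    match = SECTION_RE.search(wikitext)
    return match.group(1).strip() if match else None
```

Stripping the remaining MediaWiki markup (links, templates, references) would take a real parser such as mwparserfromhell.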

To obtain easy-to-interpret results, I built a logistic regression model on top of the TF–IDF transformation of the articles, processed with the Porter stemmer. The parameters have their default values. In particular, the logistic regression uses L2 regularization, so all lowercase words that are not stopwords appear in the model.

import re
import sqlite3
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import porter
from sklearn import cross_validation
from sklearn import linear_model
from sklearn import pipeline
from sklearn.feature_extraction import text

stemmer = porter.PorterStemmer()
stop_set = frozenset(stopwords.words('english'))
uppercase = frozenset(string.ascii_uppercase)
# Treat punctuation as token separators.
punctuation_re = re.compile(r'[^\w\s]')

def Tokenize(text):
  text = punctuation_re.sub(' ', text)
  tokens = nltk.word_tokenize(text)
  # Skip stopwords and capitalized words (mostly proper names).
  return [stemmer.stem(x) for x in tokens
          if x.lower() not in stop_set and x[0] not in uppercase]

X = []
y = []
connection = sqlite3.connect('novels.sqlite')
for row in connection.cursor().execute(
    """SELECT plot, was_bestseller FROM Novels
    WHERE year >= 1941 AND plot IS NOT NULL"""):
  X.append(row[0])
  y.append(row[1])
X_train, X_test, y_train, y_test = (
    cross_validation.train_test_split(X, y, test_size=0.3))
model = pipeline.Pipeline(
    [('tfidf', text.TfidfVectorizer(
          lowercase=False, tokenizer=Tokenize)),
     ('logistic', linear_model.LogisticRegression())])
model.fit(X_train, y_train)

The model can return the probability of being a best seller for any novel b with a plot summary:

logit(b) = −4.6 + 2.5 tfidf(lawyer, b) + 2.4 tfidf(kill, b) + ⋯ − 1.5 tfidf(planet, b)

Pr(was_bestseller(b) | plot(b)) = e^logit(b) / (1 + e^logit(b))
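In code, the second formula is just the logistic function; the intercept −4.6 comes from the fitted model:

```python
import math

def bestseller_probability(logit_value):
    # The logistic function: maps the linear score to a probability.
    return math.exp(logit_value) / (1 + math.exp(logit_value))

# A plot with no loaded words at all scores logit = -4.6,
# i.e. roughly a 1% chance of topping the list.
baseline = bestseller_probability(-4.6)
```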

To put these coefficients in context, tfidf(lawyer, The Firm) ≈ 0.06. As it happens, the model returns logit(b) > 0, that is, Pr(was_bestseller(b)|plot(b)) > 1/2, for no novel b from the train or test set. The highest probability, 0.39, is predicted for Cross Fire, indeed a best seller in December 2010. Only by disabling the normalization in TF–IDF or weakening the regularization in the logistic regression can I overfit the model to the train set, and even then both its precision and recall on the test set stay at most 20%. But as I wrote in the introduction, this is not the point of this exercise. Let us look at the words whose coefficients have a high absolute value.

  • Apparently, it pays off to write legal thrillers: lawyer +2.5, case +2.4, law +1.5, client +1.3, jury +1.3, trial +1.3, attorney +1.0, suspect +1.0, judge +0.9, convict +0.8.
  • Crime sells in general: kill +2.4, murder +1.8, terrorist +1.2, shoot +1.1, body +1.1, die +1.0, serial +0.9, attack +0.9, assassin +0.8, kidnap +0.8, killer +0.8.
  • Political thrillers are not bad either: agent +1.4, politics +1.4, president +1.3, defector +1.2.
  • Business may be involved: firm +1.3, company +1.3, career +1.1, million +1.0, success +1.0, business +0.9, money +0.9.
  • Finally, the characters should have families: husband +1.4, family +1.3, house +1.2, couple +1.2, daughter +1.2, baby +1.1, wife +1.0, father +1.0, child +0.9, birth +0.8, pregnant +0.8, and use a car +1.5 and a phone +0.8.

The genres to avoid for prospective best-selling authors?

  • Sci-fi: planet −1.5, human −1.0, space −0.7, star −0.4, robot −0.3, orbit −0.3.
  • Children’s literature: boy −1.3, school −1.0, young −0.8, girl −0.8, youth −0.4, teacher −0.4, aunt −0.4, grow −0.4.
  • Geography and travels: village −1.0, city −1.0, ship −0.8, way −0.7, go −0.7, land −0.6, adventure −0.6, colony −0.5, native −0.5, follow −0.5, mountain −0.5, crew −0.5, forest −0.5, travel −0.5, inhabit −0.4, sail −0.4, road −0.4, map −0.3, tribe −0.3.
  • War: fight −1.0, warrior −0.6, war −0.6, weapon −0.5, soldier −0.5, army −0.5, ally −0.4, enemy −0.3, conquer −0.3.
  • Fantasy: magic −0.9, creature −0.5, magician −0.4, zombie −0.3, treasure −0.3, dragon −0.3.
  • History: princess −0.5, rule −0.5, kingdom −0.4, castle −0.4, century −0.4, ruler −0.3, palace −0.3 (for what it’s worth, A Game of Thrones only made it to third place on the list, so it does not count as a best seller).

Note that the code above ignores capitalized words. Without that filter, the most significant words become the names of characters from best-selling book series: Scarpetta +3.0, Stephanie +2.9, Ayla +2.0, etc., with additional insights like FBI +1.3, CIA +1.3, NATO +0.9, Soviet +0.9, or Earth −1.1.


Stylometry—It Works! (in Some Circumstances)

This post was supposed to reveal the author of the 13th Book of Pan Tadeusz, an anonymous pornographic sequel to the Polish national epic. Despite attempts that took into account rhyming sounds, word syllable counts, and custom morphological analysis for Early Modern Polish, I failed to identify the author. That is not all bad: authorship attribution, especially when regurgitated by journalists, is often reduced to ex cathedra statements (“a computer has proven that work X was written by author Y”) that never report the confidence level.

Instead of a literary discovery, I present a little game: Which Polish text is your writing like? It tells me that The 13th Book is most similar to Antymonachomachia by Ignacy Krasicki, who died 33 years before the publication of Pan Tadeusz. Oh well.

The game is based on texts from Wolne Lektury, the Polish equivalent of Project Gutenberg. I appreciate Radek Czajka’s help in downloading them.

Since I know little about writing-style analysis (known as stylometry), the entire sophistication of my program lies in calculating the frequency of a few dozen tokens in each text. The idea resembles Alphonse Bertillon’s anthropometry, an efficient late-19th-century system of identifying recidivists by classifying eleven body measurements as small, medium, or large.

We compare text style rather than text topics, so the program pays little attention to content words. It counts final punctuation marks, commas, and 86 frequent function words, that is, conjunctions, prepositions, adverbs, and so-called qubliks (a Polish class of particles). These counts are divided by the total number of tokens in the text, yielding a 90-dimensional vector of token frequencies for each text.
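The feature extraction can be sketched in a few lines; the feature list below is a made-up miniature, standing in for the real set of punctuation marks and 86 Polish function words:

```python
from collections import Counter

# Toy feature set; the real program uses final punctuation marks,
# commas, and 86 function words, for 90 features in total.
FEATURES = ['.', '?', '!', ',', 'i', 'w', 'nie', 'na', 'się', 'ale']

def frequency_vector(tokens):
    """Relative frequency of each counted token in the text."""
    counts = Counter(tokens)
    return [counts[f] / len(tokens) for f in FEATURES]
```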

The figure below shows the results of hierarchical clustering of the texts longer than 5000 tokens, obtained with

scipy.cluster.hierarchy.dendrogram(
    scipy.cluster.hierarchy.linkage(
        frequency_matrix, method='ward', metric='euclidean'))


I, for one, am impressed that it gathers together most of the texts written by Kasprowicz, Krasicki, Rzewuski, and Sienkiewicz, or translated by Boy-Żeleński and Ulrich.

How reliable are the results? To answer this question, I perturbed the token counts: in each text of N tokens, I replaced the count k of every counted token by a random variable with the binomial distribution B(N, k/N), that is, the number of heads in N tosses of a biased coin whose heads probability is k/N. For each text from Wolne Lektury, the x axis in the figures below shows the total number of tokens. The y axis shows the frequency with which the nearest point in the Euclidean metric corresponded to a different text, or to a text by another author/translator, measured over 1000 such random perturbations. In case you wonder how the y axis manages to appear logarithmic and contain zero at once, the plotted variable is log(y + 0.001).
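The perturbation step itself is one line of numpy; a minimal sketch, assuming the counts are kept as an integer vector:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(counts, n_tokens):
    """Replace each token count k by a draw from Binomial(N, k/N)."""
    counts = np.asarray(counts)
    return rng.binomial(n_tokens, counts / n_tokens)
```

Note that a count of 0 stays 0 and a count of N stays N; only the intermediate frequencies jitter, more strongly for short texts.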

I approximated both the text misattribution probability and the author misattribution probability by 1 − (erf(√N/c))^b, with empirical values of the constants b and c depending on the language, the tokens, and the texts.
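Fitting b and c is a standard least-squares job. A sketch on synthetic (N, probability) pairs, which stand in for the measured misattribution frequencies:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import erf

def misattribution(n, b, c):
    # The model: 1 - (erf(sqrt(N)/c))^b.
    return 1 - erf(np.sqrt(n) / c) ** b

# Synthetic data generated from the model itself (b=2, c=40),
# in place of the frequencies measured on Wolne Lektury.
n = np.array([500.0, 1000, 5000, 20000, 100000])
p = misattribution(n, 2.0, 40.0)

(b, c), _ = curve_fit(misattribution, n, p, p0=[1.0, 10.0])
```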

Here is my hand-waving explanation of this formula. The coordinates of perturbed points, multiplied by N, have a multivariate binomial distribution (it does not matter whether the coordinates are correlated or not). When N approaches infinity and k/N remains constant, the binomial distribution is asymptotically normal with variance proportional to N (by the central limit theorem applied to tossing the coin), and the multivariate binomial distribution is asymptotically multivariate normal. Dividing the random variables by N, we return to the coordinates, which asymptotically have a multivariate normal distribution with individual variances and covariances proportional to 1/N.

The points divide the 90-dimensional vector space into Voronoi cells whose centres correspond to the mean vectors of the distributions. Moving a point to the other side of some wall of its Voronoi cell means moving it by more than its distance d to that wall in the direction perpendicular to the wall. The projection of any multivariate normal distribution with variances and covariances proportional to 1/N onto a vector is a (univariate) normal distribution with variance proportional to 1/N. The probability that a random variable with variance σ² = a/N differs from its mean by more than d (that is, that the perturbed point crosses the wall, causing a misattribution) equals 1 − erf(d/σ) = 1 − erf(d√N/√a) = 1 − erf(√N/c). Since the Voronoi cell has many walls in different directions, the overall probability that the point exits its cell is approximately 1 − erf(√N/c₁) × ⋯ × erf(√N/cₙ). Each factor erf(√N/cᵢ) approaches 1 rapidly, so the factors with the largest cᵢ dominate the product, which can be approximated by the formula 1 − (erf(√N/c))^b.


The figures explain why it was hard to attribute The 13th Book: even if other works by its author belonged to the Wolne Lektury corpus (they probably do not), The 13th Book has merely 1773 tokens.


The Vector Space of the Polish Parliament in Pictures

In fact, only of the lower house, the Sejm. During its current term, 460 deputies have voted 5545 times since the 8th of November 2011. The Sejm website publishes the individual results of the votes as PDF files. I put the results into an SQLite database with an unholy hodgepodge of shell scripts, wget, pdftohtml, and Python. Here is a sample query and its output:

SELECT deputy, party, voting, result
FROM VotingResults
JOIN Deputies USING(deputy_id)
JOIN Parties USING(party_id)
JOIN Votings USING(voting_id)
JOIN Results USING(result_id)
ORDER BY random() LIMIT 5;

HOK MAREK          | PO  | data/glos_32_10.html | NAY
MILLER LESZEK      | SLD | data/glos_49_42.html | AYE
MASŁOWSKA GABRIELA | PiS | data/glos_18_14.html | ABSTAIN
ROMANEK ANDRZEJ    | SP  | data/glos_93_58.html | AYE
ŁOPATA JAN         | PSL | data/glos_57_65.html | ABSENT

The database registers as many as 509 deputies, since some of them died during the term or were appointed or elected to other posts. The deputies are grouped into parties according to their latest status. There are data on 5539 votings; I skipped 6 elections of speakers, whose PDF layout was different.

Using Python and numpy, we can put the results into a matrix R with 5539 rows and 509 columns. Its columns, one per deputy, are vectors in the 5539-dimensional space of voting results. The elements of R are +1 for AYE, −1 for NAY, and 0 for ABSTAIN and ABSENT. We shift each row so that the mean of its AYE/NAY entries is zero, keeping ABSTAIN and ABSENT unbiased at zero:

# Row means over the nonzero (AYE/NAY) entries only.
M = R.sum(axis=1) / numpy.fabs(R).sum(axis=1)
MT = M[:, numpy.newaxis]
# Subtract the mean from AYE/NAY entries; leave ABSTAIN/ABSENT at zero.
Rcentred = numpy.where(R != 0, R - MT, 0)

Then perform the Singular Value Decomposition of Rcentred:

U, S, VT = numpy.linalg.svd(Rcentred, full_matrices=False)

U holds the so-called left-singular vectors of Rcentred along which the variance of Rcentred is the largest possible. S contains the so-called singular values of Rcentred. We do not use VT.
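On a toy matrix, the variance shares and the low-dimensional coordinates follow directly from U and S; the random data below merely stands in for Rcentred:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for Rcentred: 20 "votings" by 8 "deputies".
Rcentred = rng.standard_normal((20, 8))

U, S, VT = np.linalg.svd(Rcentred, full_matrices=False)

# Share of the total variance carried by each singular direction.
variance_share = S**2 / np.sum(S**2)

# Coordinates of every deputy along the first three singular vectors.
coords = U[:, :3].T @ Rcentred   # shape: (3, 8)
```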

We have just carried out a Principal Component Analysis of R. Thanks to it, we can visualize the 5539 dimensions of R in a much smaller number of dimensions. In our case, the first three left-singular vectors contribute 70.1%, 8.5%, and 3.6% of the total variance, respectively. The projection of the columns of Rcentred onto its first three left-singular vectors looks like this:

A scatter plot of three principal components of the voting patterns

The x axis, usually called the partisan axis, corresponds to the division between the government (PO+PSL) and the opposition (everybody else). The votings with the highest absolute weights in the first left-singular vector concern the involvement of the government in the Amber Gold scandal: this (−0.0216) and this (+0.0217).

The y axis appears to correspond to right wing-left wing polarization (everybody else versus SLD+RP). Its most significant votings are the rejection of a proposal to waive the law against insulting religious feelings (−0.0451) and the proposal to raise the taxes on copper and silver mining (+0.0447).

The z axis, in turn, seems to correspond to sentiments towards the European Union (everybody else versus PiS+RP+ZP aka SP). Its most significant votings concern two alignments between Polish and EU law: this (−0.0463) and this (+0.0470).

Finally, here are the votes of individual deputies projected on the xy, xz, and yz planes.

EDIT: here is a zoomable version, made with the Bokeh visualization library. Thanks for the tip, stared!

The 1st and the 2nd principal components of the voting patterns

The 1st and the 3rd principal components of the voting patterns

The 2nd and the 3rd principal components of the voting patterns
