What Makes a Best-Selling Novel?

A Machine Learning Approach

In 2013, Ashok et al. answered this question based on writing style, with 61–84% accuracy. This post, on the other hand, examines plot themes in best sellers. Note that my model can hardly predict the commercial success of a novel from its plot. That would be quite a surprising feat, making reviewers obsolete. My goal was more modest: finding statistically profitable topics to write about.

Using PetScan and Wikipedia’s page export, I downloaded 25,359 Wikipedia articles belonging to Category:Novels by year. From each article, I extracted the section named Plot, Plot summary, Synopsis, etc. if present and, stripped of MediaWiki markup, saved it into an SQLite database along with the title of the novel, its year of publication, and a Boolean that indicates if it ever topped the New York Times Fiction Best Seller list:

SELECT title, year, was_bestseller, length(plot) FROM Novels
ORDER BY random() LIMIT 5;
Sharpe's Havoc            | 2003 | 0 | 2759
The Rescue (Sparks novel) | 2000 | 1 |
Slayers                   | 1989 | 0 |
The Warden                | 1855 | 0 | 2793
The Fourth Protocol       | 1984 | 1 | 5666

SELECT count(*) FROM Novels
WHERE plot IS NOT NULL;
17744

SELECT count(*) FROM Novels
WHERE plot IS NOT NULL AND was_bestseller;
398

SELECT min(year) FROM Novels  -- The year of publication.
WHERE was_bestseller;  -- The NYT list starts in 1942.
1941

To obtain easy-to-interpret results, I built a logistic regression model on top of the TF–IDF transformation of articles processed by the Porter stemmer. The parameters have default values. In particular, the logistic regression uses L2 regularization, so all lowercase words that are not stopwords appear in the model.

import re
import sqlite3
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import porter
from sklearn import linear_model
from sklearn import model_selection
from sklearn import pipeline
from sklearn.feature_extraction import text

# Requires the NLTK tokenizer and stopwords data (see nltk.download).
def Tokenize(
    text,
    stemmer=porter.PorterStemmer(),
    uppercase=set(string.ascii_uppercase),
    stop_set=set(stopwords.words('english')),
    punctuation_re=re.compile(
        r'[’“”…–—!"#$%&\'()*+,\-./:;?@\[\\\]^_`{|}~]')):
  # Replace punctuation with spaces, then stem every token that is
  # neither a stopword nor capitalized.
  text = punctuation_re.sub(' ', text)
  tokens = nltk.word_tokenize(text)
  return [stemmer.stem(x) for x in tokens
          if x.lower() not in stop_set and x[0] not in uppercase]

# Load the plot summaries (X) and the best-seller flags (y).
X = []
y = []
connection = sqlite3.connect('novels.sqlite')
for row in connection.cursor().execute(
    """SELECT plot, was_bestseller FROM Novels
    WHERE year >= 1941 AND plot IS NOT NULL"""):
  X.append(row[0])
  y.append(row[1])
connection.close()
# Hold out 30% of the novels for testing.
X_train, X_test, y_train, y_test = (
    model_selection.train_test_split(X, y, test_size=0.3))
# lowercase=False keeps capitalization so that Tokenize can drop names.
model = pipeline.Pipeline(
    [('tfidf', text.TfidfVectorizer(
          lowercase=False, tokenizer=Tokenize)),
     ('logistic', linear_model.LogisticRegression())])
model.fit(X_train, y_train)

The model can return the probability of being a best seller for any novel b with a plot summary:

logit(b) = −4.6 + 2.5 tfidf(lawyer, b) + 2.4 tfidf(kill, b) + ⋯ − 1.5 tfidf(planet, b)

Pr(was_bestseller(b) | plot(b)) = e^{logit(b)} / (1 + e^{logit(b)})
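As a worked instance of the two formulas above, here is a minimal sketch that keeps only the intercept and the lawyer term. It assumes a TF–IDF weight of 0.06 for lawyer (the value the post quotes for The Firm) and ignores every other term of the sum, so the result is an illustration, not the model's actual prediction.

```python
import math

# Coefficients quoted in the post; all other terms of the sum are omitted.
INTERCEPT = -4.6
LAWYER_COEFFICIENT = 2.5

def sigmoid(z):
    """Logistic function: maps a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Assumed TF-IDF weight of "lawyer" (quoted in the post for The Firm).
partial_logit = INTERCEPT + LAWYER_COEFFICIENT * 0.06
print(round(partial_logit, 2), round(sigmoid(partial_logit), 4))
```

With so negative an intercept, a single strong word moves the probability only slightly; it stays at roughly one percent.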

To put these coefficients in context, tfidf(lawyer, The Firm) ≈ 0.06. As it happens, the model returns logit(b) > 0, that is, Pr(was_bestseller(b)|plot(b)) > 1/2, for no novel b from the train or test set. The highest probability, 0.39, is predicted for Cross Fire, indeed a best seller in December 2010. Only if I disable the normalization in TF–IDF or weaken the regularization in the logistic regression can I overfit the model to the train set, while for the test set both its precision and recall would be at most 20%. But, as I wrote in the introduction, this is not the point of this exercise. Let us look at the words whose coefficients have a high absolute value.

  • Apparently, it pays off to write legal thrillers: lawyer +2.5, case +2.4, law +1.5, client +1.3, jury +1.3, trial +1.3, attorney +1.0, suspect +1.0, judge +0.9, convict +0.8;
  • kill +2.4, murder +1.8, terrorist +1.2, shoot +1.1, body +1.1, die +1.0, serial +0.9, attack +0.9, assassin +0.8, kidnap +0.8, killer +0.8.
  • Political thrillers are not bad either: agent +1.4, politics +1.4, president +1.3, defector +1.2.
  • Business may be involved: firm +1.3, company +1.3, career +1.1, million +1.0, success +1.0, business +0.9, money +0.9.
  • Finally, the characters should have families: husband +1.4, family +1.3, house +1.2, couple +1.2, daughter +1.2, baby +1.1, wife +1.0, father +1.0, child +0.9, birth +0.8, pregnant +0.8, and use a car +1.5 and a phone +0.8.

The genres to avoid for prospective best-selling authors?

  • Sci-fi: planet −1.5, human −1.0, space −0.7, star −0.4, robot −0.3, orbit −0.3.
  • Children’s literature: boy −1.3, school −1.0, young −0.8, girl −0.8, youth −0.4, teacher −0.4, aunt −0.4, grow −0.4.
  • Geography and travels: village −1.0, city −1.0, ship −0.8, way −0.7, go −0.7, land −0.6, adventure −0.6, colony −0.5, native −0.5, follow −0.5, mountain −0.5, crew −0.5, forest −0.5, travel −0.5, inhabit −0.4, sail −0.4, road −0.4, map −0.3, tribe −0.3.
  • War: fight −1.0, warrior −0.6, war −0.6, weapon −0.5, soldier −0.5, army −0.5, ally −0.4, enemy −0.3, conquer −0.3.
  • Fantasy: magic −0.9, creature −0.5, magician −0.4, zombie −0.3, treasure −0.3, dragon −0.3.
  • History: princess −0.5, rule −0.5, kingdom −0.4, castle −0.4, century −0.4, ruler −0.3, palace −0.3 (for what it’s worth, A Game of Thrones only made it to the third place on the list so it does not count as a best seller).

Note that the code above ignores capitalized words. If it did not, the most significant words would be the names of characters from best-selling book series: Scarpetta +3.0, Stephanie +2.9, Ayla +2.0, etc., with additional insights like FBI +1.3, CIA +1.3, NATO +0.9, Soviet +0.9, or Earth −1.1.


Wisła in Fact Likes Cracovia but Doesn’t Know How to Start Talking

The relationships between fans of football clubs in Poland can be fourfold: neutrality, friendship (zgoda), enmity (kosa), or pact (układ). The belief that there are two disjoint blocs gathered around The Great Triad (Arka, Cracovia, and Lech) and Three Kings of Great Cities (Śląsk, Wisła, and Lechia) is false. Here is the largest connected component of the graph of friendships.

kluby

The graph of enmities would be less clear. For instance, Cracovia has friendships with Tarnovia and Sandecja but Tarnovia and Sandecja are enemies. Or GKS, Górnik, Ruch, and Zagłębie: every two of them are enemies.
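For readers who want to reproduce the component search, here is a minimal sketch using breadth-first search. The edge list is a toy: it contains the friendships named in the text, the triad edges (assumed here to be pairwise friendships), and one hypothetical pair (Club X, Club Y). It is not the full data set behind the figure.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Returns the connected components of an undirected graph."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:  # breadth-first search from the unseen club
            node = queue.popleft()
            component.append(node)
            for neighbour in graph[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        components.append(component)
    return components

# Toy friendships; 'Club X' and 'Club Y' are hypothetical placeholders.
FRIENDSHIPS = [
    ('Cracovia', 'Tarnovia'), ('Cracovia', 'Sandecja'),
    ('Arka', 'Cracovia'), ('Arka', 'Lech'), ('Cracovia', 'Lech'),
    ('Club X', 'Club Y'),
]
components = connected_components(FRIENDSHIPS)
largest = max(components, key=len)
print(sorted(largest))
```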

Source: http://polscyhools.w.interiowo.pl/ekipy.html.


Stylometry—It Works! (in Some Circumstances)

This post was supposed to reveal the author of the 13th Book of Pan Tadeusz, an anonymous pornographic sequel to the Polish national epic. Despite my attempts that took into account rhyming sounds, word syllable count, and custom morphological analysis for Early Modern Polish, I failed to identify the author. Which is not that bad: authorship attribution, especially when regurgitated by journalists, is often reduced to ex cathedra statements: “a computer has proven that work X was written by author Y”; the fact that the confidence level is unknown is not reported.

Instead of a literary discovery, I present a little game: Which Polish text is your writing like? It tells me that The 13th Book is most similar to Antymonachomachia by Ignacy Krasicki, who died 33 years before the publication of Pan Tadeusz. Oh well.

The game is based on texts from Wolne Lektury, the Polish equivalent of Project Gutenberg. I appreciate Radek Czajka’s help in downloading them.

Since I know little about writing style analysis (known as stylometry), the entire sophistication of my program lies in calculating the frequency of a few dozen tokens in each text. This idea is similar to Alphonse Bertillon’s anthropometry, an efficient late-19th-century system of identifying recidivists by classifying eleven body parts as small, medium, or large.

We compare text style rather than text topics, so the program pays little attention to content words. It counts final punctuation marks, commas, and 86 frequent function words, that is conjunctions, prepositions, adverbs, and so-called qubliks. These counts are divided by the total number of tokens in the text, yielding a 90-dimensional vector of token frequency for each text.
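The computation of such a frequency vector can be sketched as follows. The token list is a hypothetical, heavily shortened stand-in for the real 90 tokens, and the regular-expression tokenizer is an assumption; the program used in the post may tokenize differently.

```python
import re
from collections import Counter

# Hypothetical short list; the post counts final punctuation marks, commas,
# and 86 function words (assumed here: '.', '?', '!' as final punctuation).
COUNTED_TOKENS = ['.', '?', '!', ',', 'i', 'w', 'na', 'z', 'nie', 'się']

def frequency_vector(text, counted_tokens=COUNTED_TOKENS):
    """Returns the relative frequency of each counted token in the text."""
    # A token is a run of word characters or a single punctuation mark.
    tokens = re.findall(r'\w+|[^\w\s]', text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[t] / total if total else 0.0 for t in counted_tokens]

vector = frequency_vector('Litwo! Ojczyzno moja! ty jesteś jak zdrowie.')
print(vector)
```

Dividing by the total number of tokens, not by the number of counted tokens, keeps vectors of long and short texts comparable.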

The figure below shows the results of hierarchical clustering of the texts longer than 5000 tokens, obtained with

# linkage() takes the distance via the metric argument.
scipy.cluster.hierarchy.dendrogram(
    scipy.cluster.hierarchy.linkage(
        frequency_matrix, method='ward', metric='euclidean'))

dendrogram

I, for one, am impressed by its gathering together most of the texts written by Kasprowicz, Krasicki, Rzewuski, and Sienkiewicz, or translated by Boy–Żeleński and Ulrich.

How reliable are the results? To answer this question, I perturbed the token counts: for each text composed of N tokens, I replaced the number k of occurrences of each counted token with a random variable with the binomial distribution B(N, k/N), that is, the count of heads in N tosses of a biased coin whose heads probability is k/N. For each text from Wolne Lektury, the x axis in the figures below shows the total number of tokens. The y axis shows the frequency with which the nearest point by the Euclidean metric corresponded to a different text or a text by another author/translator, measured in 1000 such random perturbations. In case you wonder how the y axis can appear logarithmic and contain zero at the same time, the plotted variable is log(y + 0.001).
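The perturbation experiment can be sketched roughly as below. Two things are assumptions: the perturbed point is compared against the unperturbed vectors of all texts, and the toy counts and sizes are invented for illustration.

```python
import math
import random

def perturb(counts, n_tokens, rng):
    """Replaces each count k with a draw from B(n_tokens, k / n_tokens)."""
    return [sum(rng.random() < k / n_tokens for _ in range(n_tokens))
            for k in counts]

def nearest(point, references):
    """Index of the reference vector closest in the Euclidean metric."""
    return min(range(len(references)),
               key=lambda i: math.dist(point, references[i]))

def misattribution_rate(index, all_counts, all_sizes, trials=1000, seed=0):
    """Fraction of perturbations of text `index` nearest to another text."""
    rng = random.Random(seed)
    references = [[k / n for k in counts]
                  for counts, n in zip(all_counts, all_sizes)]
    n = all_sizes[index]
    misses = 0
    for _ in range(trials):
        perturbed = [k / n for k in perturb(all_counts[index], n, rng)]
        misses += nearest(perturbed, references) != index
    return misses / trials

# Toy data: two hypothetical texts with two counted tokens each.
toy_counts = [[50, 10], [10, 50]]
toy_sizes = [200, 200]
rate = misattribution_rate(0, toy_counts, toy_sizes, trials=200)
print(rate)
```

The binomial draw is written as a sum of N coin tosses to stay within the standard library; a numerical library would do the same draw faster.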

I approximated both the text misattribution probability and the author misattribution probability by 1 − (erf(√N/c))^b, with empirical values of constants b and c depending on the language, the tokens, and the texts.

Here is my hand-waving explanation of this formula. The coordinates of perturbed points, multiplied by N, have a multivariate binomial distribution (it does not matter whether the coordinates are correlated or not). When N approaches infinity and k/N remains constant, the binomial distribution is asymptotically normal with variance proportional to N (by the central limit theorem applied to tossing the coin), and the multivariate binomial distribution is asymptotically multivariate normal. Dividing the random variables by N, we return to the coordinates, which asymptotically have a multivariate normal distribution with individual variances and covariances proportional to 1/N.

The points divide the 90-dimensional vector space into Voronoi cells whose centres correspond to the mean vectors of the distributions. Moving a point to the other side of some wall of its Voronoi cell means moving it by more than d, the distance from the point to the wall, in the direction perpendicular to the wall. The projection of any multivariate normal distribution with variances and covariances proportional to 1/N onto a vector is a (univariate) normal distribution with variance proportional to 1/N. The probability that a normal random variable with variance σ² = a/N differs from its mean by more than d (that is, that the perturbed point crosses the wall, causing a misattribution) equals 1 − erf(d/(σ√2)) = 1 − erf(d√N/√(2a)) = 1 − erf(√N/c), where c = √(2a)/d. Since the Voronoi cell has many walls in different directions, the overall probability that the point exits its cell is approximately equal to 1 − erf(√N/c_1) × ⋯ × erf(√N/c_n). Because erf approaches 1 rapidly, the factors with the largest c_i (the nearest walls) dominate the product, which can be approximated by the formula 1 − (erf(√N/c))^b.
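To get a feeling for how quickly this approximation falls with text length, here is a small sketch that evaluates 1 − (erf(√N/c))^b. The constants b and c below are hypothetical placeholders chosen only for illustration; the post fits them empirically and does not report their values.

```python
import math

def misattribution_probability(n_tokens, b, c):
    """Evaluates 1 - erf(sqrt(N) / c) ** b for a text of N tokens."""
    return 1.0 - math.erf(math.sqrt(n_tokens) / c) ** b

# Hypothetical constants, NOT the values fitted in the post.
B, C = 50.0, 30.0
probabilities = {n: misattribution_probability(n, B, C)
                 for n in (1773, 5000, 20000)}
for n, p in probabilities.items():
    print(n, p)
```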

wrong-text
wrong-author

The figures explain why it was hard to attribute The 13th Book to an author: even if other works by the author belonged to the Wolne Lektury corpus (they probably do not), The 13th Book has merely 1773 tokens.


Sipping Rum: Some New Palindromes

A somewhat popular sport [1, 2] is extending Leigh Mercer’s immortal palindrome “A man, a plan, a canal—Panama!” It occurred to me that its principle can be applied to the Polish palindrome by Julian Tuwim: “Popija rum as, samuraj i pop.” (“An ace, a samurai, and an Orthodox priest are sipping rum.”) All we need is a computer program and a list of Polish animate nouns. Here we go:

Popija rum as, said, diak, goj, drab, tokolog, igrek, odlewca, mim, tenor, abba, rodak, imam, alkad, gigant, alb, ober, retor, fan, ilot, rapper, nowy car, usar, adresat, efor, papa, grek, saper, treser, epik, bob, angol, ananas, aga, mameluk, urka, tatka, ergolog, ladro, lis, ork, induna, grum, fleja, batiar, akyn, wał, sowar, psar, kudła, renegat, symplak, ilota, kat, alumn, amor, eponim, daremnik, spec, tan, gajowy, durnota, kret, inka, mods, esbol, rajtar, bidak, tamada, mongoloid, arat, sir, abat, imamita, barista, radiolog, nomada, mat, kadi, brat, jarl, obses, domak, niter, katon, rudy woj, agnat, cep, skin, mer, admin, operoman, mulat, akatolik, alp, mysta, generał, duk, ras, prawosławny, karaita, baj, elf, murga, nudnik, rosi, lord, algolog, reak, tata, kruk, ulema, mag, asan, analog, nabob, kiper, eser, trep, asker, gapa, profeta, serdar, asura, cywon, rep, partolina, froter, reb, oblat, nagi gdak, lama, mikado, rab, baronet, mima, cwel, doker, gigolo, kot, bard, jog, kaid, dias, samuraj i pop.

(The adjectives nowy, rudy, and nagi got mixed among nouns. For a better effect, I manually removed the commas that followed them.)
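Any candidate sentence can be verified with a few lines of code. The sketch below is only a checker, not the backtracking search that generates the palindromes; the short extended sentence is a toy example built from a word in the list above (abba, itself a palindrome).

```python
def is_palindrome(sentence):
    """Checks whether the letters of the sentence read the same backwards."""
    letters = [ch for ch in sentence.lower() if ch.isalpha()]
    return letters == letters[::-1]

tuwim = 'Popija rum as, samuraj i pop.'
extended = 'Popija rum as, abba, samuraj i pop.'   # toy extension
broken = 'Popija rum as, kot, samuraj i pop.'      # "kot" alone breaks it
print(is_palindrome(tuwim), is_palindrome(extended), is_palindrome(broken))
```

Since “Popija rum as” is the mirror image of “samuraj i pop”, whatever is inserted in the middle must itself be a palindrome at the level of letters.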

Lazily, I used a slightly modified version of Peter Norvig’s backtracking program. I extracted the nouns from a text file used in the Polish morphological analyzer Polimorfologik. The lines of the file look like this:

samuraj         samuraj subst:sg:nom:m1
samuraja        samuraj subst:sg:acc:m1+subst:sg:gen:m1
samurajach      samuraj subst:pl:loc:m1
samurajami      samuraj subst:pl:inst:m1

The appropriate forms of nouns can be extracted with

$ grep 'subst:sg.*nom.*m1' polimorfologik.txt | cut -f 1 > npdict.txt

The m1 class contains masculine-personal (also known as virile) nouns. Although the names of animals from the m2 class would also suit our purposes, that class also contains names of odd things like currencies, dances, car brands, or mushroom genera that would look strange in the sentence. With apologies to feminists, I have no means of extracting feminine-personal or neuter-personal nouns automatically, as they play no special role in Polish grammar.
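A stricter, tag-aware variant of the same extraction could look like the sketch below. It assumes that each line has three whitespace-separated columns (form, lemma, tags), that alternative tags are joined with “+”, and that alternative values within one field are joined with “.”; the sample lines are the ones shown above.

```python
SAMPLE = """\
samuraj\tsamuraj\tsubst:sg:nom:m1
samuraja\tsamuraj\tsubst:sg:acc:m1+subst:sg:gen:m1
samurajach\tsamuraj\tsubst:pl:loc:m1
samurajami\tsamuraj\tsubst:pl:inst:m1
"""

def is_nominative_singular_m1(tag):
    """True for a single tag such as 'subst:sg:nom:m1'."""
    fields = [set(field.split('.')) for field in tag.split(':')]
    return (len(fields) >= 4 and 'subst' in fields[0] and 'sg' in fields[1]
            and 'nom' in fields[2] and 'm1' in fields[3])

def extract_nouns(lines):
    """Returns the forms that have at least one matching tag."""
    nouns = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not follow the assumed layout
        form, _lemma, tags = parts
        if any(is_nominative_singular_m1(tag) for tag in tags.split('+')):
            nouns.append(form)
    return nouns

print(extract_nouns(SAMPLE.splitlines()))
```

Unlike the regular expression, this check requires number, case, and gender to occur within the same tag, so a line tagged, say, subst:sg:acc:m2+subst:pl:nom:m1 would not slip through.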

The palindrome above contains only singular forms of common nouns (152 words in total). If we also allow singular masculine forms of adjectives, we can get a 269-word palindrome:

Popija rum as, said, diak, goj, drab, perski murga, kadi beż netto, klawy rzutki froter, kto, penolog, iglany sini magaski frant, utyty bojowny kaper, trak, sanowy cynawy rusy rotny sracz, sowar, aspan, ilot, raptor, mamlas, on, rebe, wali lis, jebak, cacy woli amor, fan, wodnik, ergolog, lama, mim, asker, gad, jarski men, as, pajac, darmy rosi preser, oferent, rapsod, nabab, abat, symplak, induna, tebriski gamrat, akatolik, rodak, ladro, lisi migany wandal, papa, cwel, doker, gid, rumski golkiper, asura, elew, okej arat, siwy ratar, maksi potowy żywotni wig, orski alb, obyły baca, cywil, paskuda, gemajn, inka, tępy wolowaty raby skin, turkos, esbol, użyty spec, ki marecki mima, kruk, a-ż popi lżywy wozak, angol, ananas, a-z raja, jary picer, kok, tato, nominat, inaki cap, ospały wilk, muli pupka, mods, aborter, dr, enat, siny talib, abba, reb, be rab, babi latynista, nerd, retro, bas, domak, pupil, umkliwy łaps, opaci kani tani mono tatko, kreci pyra, jajarz, asan, analog, nakazowy wyżli pop, żak, urka, mimik, ceramik, cep, syty żul, obses, okrutnik, sybaryta, wolowy pętak, ninja, mega duk, sapliwy caca były bob, laik, srogi wintowy żywotopis, kamrat, arywista, rajek, ow, elear, usar, epik, logik, smurd, igrek, odlewca, papla, dnawy nagi misi lord, alkad, ork, ilota, katar, magik, sir, beta, nudnik, alp, mysta, baba, bandos, partner, efor, eser, pisorym, radca, japs, anemik, srajda, grek, sam, imam, algolog, rekin, down, afro, mailowy cacka, bej, si lila, weber, nosal, mamrot, partolina, psar, awosz, car, syn, torys, urywany cywon, askar, trep, akyn, woj, obyty tutnar, fiks, aga, mini synal, gigolo, nepot, kret, orfik, tuz, rywal, kot, tenże bidaka, grum, iks, rep, bard, jog, kaid, dias, samuraj i pop.

Using plural and proper nouns, we can reach at least 1493 words, for instance:

Popija rum as, kalif, drab, Belg, Remus, Leon, pank, said, diak, Goj, Acis, Pac, Tabak, Kajus, reb, luj, Iwon, Noe, Selim, kacap, Damon, Melcer, epicy, Rob, Atkinson, Jahn, Wahl, Omar, turowiec, ubici, nabab, Mobutu, helota, Nagórski, Dyda, kraker, Gola, Timur, Gil, Ramzes, Romanik, imperator, Ajnos, waleci, biker, rajtar, Umer, Miotk, Popek, Nils, Lefeld, epik, car, Tamil, Amoni, Capała, Sarnat, Jeron, Urban, Orwelle, Tym, Einar, Usarek, rabi, nemrodzi, Kramnik, Sałacki, Sawini, basza, Idzi, Zaremba, raja, Rogacz, Rubeni, Atlas, ontolog, anemicy, Ron, imitator, Inka, Bulik, setkarz, sublokator, Eisler, akatolicy, tato, idioci, Vidor, Apacz, Alan, Ziomek, Albin, Rom, Oktawian, odaj, Arka, kadi, boss, Ursyn, Ahmad, Dassin, Eden, Siadlak, Oscar, Idec, udecy, Rupert, rastamani, Armin, Onak, Rola, Gill, Eco, były, Tom, Eluard, Niski, Nyk, Anatol, Olson, Orkan, rasta, Geller, Bask, oblat, Emil, Ado, Izydor, ras, sipaj, Amnon, Nelson, imam, lapicyda, Bill, Iwan, Alo, Skiba, Tim, Jankes, Uryga, Nycz, Darek, Ardelli, arbek, lirycy, Raczek, Orion, Roda, Cortazar, Odo, idol, Otis, Sobik, cep, opaci, Sak, Lew, Morka, bydlak, Sommer, Feret, Tyrsi, robol, opilec, Rama, Gert, ufolog, Iżyccy, Malka, Kwak, Malak, Tal, Redak, Solti, Lopez, Sot, rabbi, Latacz, Darren, Sopata, Taj, ninje, limnolog, flisak, udek, Kimak, Byrski, Fibak, Masaj, Elsner, efeb, lokator, Papała, Bacik, Cepil, Sabat, sietniak, tokolog, Igrek, Sade, Mahomet, Opacki, sowar, Racki, Amor, Rawik, suswał, Syta, rwacz, Dow, Zadura, Dunin, Olas, Bakuła, Bil, Loeb, ergolog, Lasek, Liw, jarle, Varga, mener, Tabaka, logicy, Razin, Roja, Paganini, Lak, Baran, ilot, rapper, twoi, Wołosi, Colin, Eweni, Borak, Jaksa, gimnazjasta, Rams, Uri, Cywka, Durka, Meisels, Ilia, Koj, Jordi, Vadim, Arrow, Agaton, Rudnik, Eric, Neil, knajak, Tyrawa, Bazan, Nestor, Olek, Nawoj, uwol, Racine, Magnus, mahdi, Wronka, Bełza, Floryni, Zub, Ołdak, Lasota, Linde, Ibsen, Negr, Ezopi, Woda, Swatek, odlewcy, Cezary, preser, eks, ubol, Koba, 
katar, Katz, Sapir, Nehru, lider, ubek, Collina, blogger, git, Nastak, Inuita, Maj, akolici, Fuk, nomada, rapsod, Adad, Jagger, autycy, Ted, Jeka, Jakub, ulema, Kwoka, Byrnes, rajas, abaci, Bukała, Bigos, satrapa, kortyzani, Lewek, alastor, profes, Ares, Kobak, lord, Labuda, Prada, Gleba, Trak, Sas, Able, Baka, Dubik, Cortez, Carsten, Rokita, Sornat, stenograf, Asser, Roth, celnik, Surmiak, Kern, Elgar, Edgar, Agis, tutnar, Ozawa, Kret, ultrasi, Bulak, Rurak, Dubois, Eldar, Bator, askar, Buksa, Kulig, Annamita, Rodak, Edyp, Agha, psor, Engel, Orione, Rota, King, asan, agregat, symplak, Ezaw, Ed, Toni, Gamrat, Artur, Ammon, organik, radny, doyeni, woj, agnat, sir, tatul, Pini, Drabik, modele, Depa, Dowi, Losey, Lesik, Sobota, Lis, pajac, Rolnik, Topor, Knysak, Turner, Olsen, Rabin, Mizak, Lada, Tyl, kady, filareci, patron, etatowi, Sasak, lemani, Wontor, Wanat, esbol, logik, Seliga, Mały, Waluk, Komsta, radiesteci, Nata, lump, Radlak, Stefani, Samsel, Tibor, Deptuła, baje, rabini, Norris, Ajmar, Grecy, zabici, Cini, akrobata, Dustin, Abel, Bąba, krytycy, Brando, Baj, Lizut, Zin, Bielat, Fokker, Edison, Jarema, logograf, Ilje, Bata, Pazik, smerd, Renat, Hempel, Rus, Rey, matoł, Spytko, jubilat, fani, Sudnik, etolog, Nanaj, Abram, Otokar, wyrypaje, Izraele, Elkana, Motak, Josh, Calik, Kelles, seksoman, acanek, Kazulin, Odon, alim, Kahn, abba, Dante, Mulawa, Lasak, Puk, Jagła, renegat, Samon, Osak, Nijak, Golan, Aznar, Ficek, Odil, aktor, Gruza, maruda, Mick, eleata, Pol, Alec, Lehr, Habakuk, oblaci, Fin, Adam, Rojak, modsi, Welt, Opałka, Sroka, Galaty, Malik, Sałata, Inuici, Walas, homeopata, Kunz, Siwik, Cąkała, Kazko, Wonder, Flak, urka, Maldini, mohele, nemrod, Nash, Allan, orator, Mamak, John, Eros, Kenar, Fik, Cebulak, Lompa, Grek, Rapak, Pot, stangreci, Pen, Rubaj, elf, Kukiz, androfag, Rumas, Ornat, Ante, fajter, rajca, geje, Becu, Mak, lokat, pedagog, adept, togat, Simon, Erik, Cudak, sensat, Natanek, inki, Piccoli, kretyn, Liszcz, sipaje, tamada, Morgała, 
Duff, Alois, Anan, Aresi, Reje, Makuła, Bułat, Olak, Jedak, Repin, Nita, Kudyba, Breza, gej, Dejon, Gała, Parda, Hatak, Lussac, Ulf, larrikini, leweller, rafiner, Atapask, insi, Boni, Capone, Fryz, sumici, Mann, Arent, Socyn, Tomas, Roger, groom, kat, Rufin, Tomasze, Najder, froter, Pinda, Rataj, Zappa, Prodi, skini, model, papa, Dudała, Kalukin, Lisik, Swat, Sycz, Durak, trombon, Snell, atleci, foks, Kamil, Lutz, Solak, Renn, Ernest, pokraka, Meir, uczeni, tramp, pedał, Duk, wał, Carnot, Romeo, Fedak, Ratka, Dydycz, Rubik, Celej, Opoka, baca, Kiwak, Bubak, Orest, Keler, Ebert, Saługa, mimik, pupka, Trawka, mods, Rudi, Wadas, Korpak, Majak, Messi, papla, Sawka, tępak, Dudycz, Delon, Allach, Corelli, Heine, Lejb, Resnais, sahib, mozarab, Otto, Klein, adresant, sardar, homo, niter, Al, Hadała, Bugała, Basta, serdar, DJ, adamici, Numida, wałkoń, erudyci, Gama, Kutz, Skałka, nudnik, Nobel, opętany, zupak, typas, Elamici, Follett, Sujka, Basak, Tutak, Pałka, Perez, Cliff, Latała, hip, opat, Sierak, Pęksa, Haba, Bremer, kok, Loba, Pękała, pokraki, Kasiak, Lepka, Boba, goje, tatka, Pułka, Pełka, Josif, Lars, esbole, Druszcz, sułtan, Hawel, pludrak, Solon, gapa, Klim, Kazik, alumni, Soski, Marecki, Sowa, profeci, Noam, Nalepa, Korda, Lece, dewot, Stowe, Dece, ladro, kapelan, Mao, Nicefor, Paw, Osik, ceramik, Sosin, Mulak, Izak, Milka, Pagnol, Oskar, Dul, Plewa, Hnat, Łuszcz, Surdel, obses, Ralfi, Sojak, łepak, Łupak, Tate, joga, Bobak, Pelka, Isak, Ikar, Kopała, Kępa, Bolko, Kremer, baba, Has, Kępka, reista, popi, Hałat, Alf, Fil, Czerepak, Łapka, Tutka, Sabak, Just, Tell, ofici, Malesa, Pytka, Puzyna, tępole, Bonk, induna, Kłak, Sztuka, magicy, dureń, Okła, Wadim, unici, Madaj, dr, adresat, Sabała, Gubała, Dahl, Aretino, Mohr, Adrast, Naser, Daniel, Kott, Obara, zombi, Hass, Ian, Serb, jelenie, Hille, Roch, Callan, Oledzcy, Dudka, pętak, wasal, Papis, Semka, Jamka, Proksa, Dawid, Urs, domak, Wartak, pupki, mima, Guła, streber, elekt, Seroka, Bubka, Wika, Cabak, opoje, 
Lecki, burzcy, Dydak, Tarka, Defoe, Morton, Racław, Kudła, Depp, Martinez, Curie, Makar, Kopt, Sen, Renner, kalosz, Tulli, Maks, Kofi, Celt, Allen, snob, Mortka, Rudzcy, Stawski, silni, Kula, Kała, Duda, paple, Dominik, Sidor, Papp, Azjata, radni, pretor, Fred, Janez, samotni, Furtak, Moor, Gregor, samotny, Costner, Annamici, muszyr, fenopaci, Nobis, Niksa, patareni, Farrell, Ewelini, Kir, Ralf, Lucas, Sulka, Taha, Drapała, gnoje, DJ-e, gazer, baby, Dukat, inni, Perka, Dej, Kalota, Łuba, Łuka, Mejer, Iser, ananasi, Olaf, Fudała, Gromada, Mateja, Piszcz, silny, Terki, Locci, piknik, enat, Antas, Neska, Ducki, renomista, Gott, pedagoga, Depta, Kolka, muce, beje, Gac, Jarret, Jafet, Natan, Rosa, murga, Ford, nazi, Kuk, fleja, Burne, picer, Gnat, Stopka, Parker, gap, Molka, Lubecki, Franek, Soren, Hojka, mamrot, Aron, Allah, Sandor, menele, hominid, lama, Kruk, Alfred, Nowok, zakała, Kącki, Wisznu, Kata, Poe, Mohs, alawici, uniata, Łaski, lamy, Talaga, Korsak, Łapot, Lewis, Domka, Jorma, Dani, Fic, alb, Okuka, Bahr, Helcel, alopata, elekci, Madura, Mazur, Grot, Kali, dokeci, Franz, analog, Kaj, Inkas, onomasta, generał, Gaj, Kupka, Salawa, Lumet, Nadab, ban, Hak, Milan, Odoni, luzak, Ken, acan, Amos, Kessel, Lekki, Lach, Sojka, Tomana, Klee, Lear, Zieja, pyry, wrak, Otomar, Bajan, Angol, Otek, Indusi, Naftali, Bujok, typ, Słota, Myers, Urlep, Mehta, nerd, Remski, Zapata, bej, Lifar, gogo, lamer, Ajnosi, Derek, Kofta, Leibniz, tuz, Ilja, Bodnar, Bycy, Tyrka, bąble, banit, Suda, Tabor, kainici, Ciba, zycer, Gram, Jasir, ronini, Bareja, Bałut, Pedro, bitles, Masina, Fet, skald, Arp, Mulat, Anicet, seid, arat, Smok, kulawy, łamagi, Leski, Gollob, Seta, Nawrot, Nowina, Melka, Sasi, Wota, tenor, tapicer, Ali, Fyda, Klyta, Dalka, Zimni, Barnes, Loren, Rutka, syn, Kropotkin, Lorca, japsi, Lato, boski, Selye, Soliwoda, pedele, Domki, Bardini, Pluta, Tristan, gajowi, Ney, Odyn, Darkin, agronom, Marut, ratar, Maginot, dewa, zek, alp, mysta, Ger, Ganas, Agni, Kato, Renoir, Oleg, 
Nero, spah, gapy, Deka, Dorati, man, nagi, Lukas, Kubrak, Sarota, Brad, Lesio, Budka, Rurka, lubi, Sart, Luter, Kawa, Zoran, Tutsi, Gara, gdera, Glen, Rek, Kaim, Ruskin, Lech, Torres, Safar, Gonet, Stan, Rosati, Kornet, sracze, Trocki, Buda, kabel, Basa, skartabel, Gad, Arpad, Ubald, Rolka, bokser, as, efor, Prot, Salak, Ewelin, Azy, Troka, par, Tasso, Gibała, Kubica, Basaj, Arsen, Rybakow, Kamel, Ubu, Kajak, ejdetycy, Tuareg, Gajda, Dados, Parada, Monk, ufici, lokaj, Amati, unikat, Santi, Greg, Golba, Nil, Locke, Bur, Edi, Lur, Henri, Paszta, Krata, kaboklo, busker, eser, pyra, zecy, cwel, doketa, wsadowi, pozer, Gennes, biedni, Latos, alkad, łobuziny, Rolf, az, łebak, Norwid, Hamsun, gameni, Carlo, wujo, Wankel, Or-Ot, Senna, Zabawa, Rytka, Jan, klienci, rekin, durnota, Gawor, Rami, David, Roj, Jokai, Lisle, Siemak, Rudak, wycirus, Marat, Saj, Zan, Migas, Kajka, Robin, Ewen, iloci, Sołowiow, trep, partolina, rab, Kalinin, aga, Pajor, nizaryci, Golak, abat, Rene, mag, Ravel, Raj, Wilkes, algolog, rebe, Olli, Bałuk, Absaloni, nuda, Ruda, zwodzca, Wratysław, Suski, Warro, Maik, Carra, Wosik, Capote, Mohamed, asker, gigolo, Kot, Kain, teista, bas, Lipecki, Cabała, Paprota, Kolbe, Ferens, Leja, Sam, kabi, fiks, Rybka, Mikke, Dukas, Ilf, Golon, Milej, ninja, tata, Posner, radzca, Talib, Bartosze, Polit, Loska, Derlatka, Lam, Kawka, Klamyccy, żigolo, Futrega, Marceli, Polo, Boris, Rytter, Efrem, Moskal, Dyba, Kromwel, Kasica, Popecki, Bossi, Tolo, Diodor, Aza, Troc, Adorno, Irokez, carycy, Rilke, Braille, Drake, Radzcy, Nagy, Rusek, najmita, Bik, Solana, Will, ibadyci, Palma, Minos, Lennon, Maja, Pissarro, Dyzio, Dali, metal, Boksa, Brel, legat, Sarna, Kronos, Lolo, tan, akyn, Iks, Indra, ulem, otyły, Bocelli, Gal, ork, anonim, Raina, mat, Sartre, puryce, duce, Dirac, Sokal, Daisne, Denis, Saddam, hanys, Russo, bidaka, Kraj, Adonai, Wat, komorni, Blake, moi, znalazca, Parodi, Vico, idiota, Tyc, ilota, Karel, Sierota, Kolbusz, Rak, Teski, Lubak, Niro, Tati, minoryci, 
menago, Lot, nosal, Taine, burzca, Goraj, Arab, mer, Aziz, Diaz, Sabini, Wasik, Cała, skin, markiz, Dor, meni, Barker, asura, niemy, Telle, Wrona, Bruno, Rejtan, Rasała, Pacino, Mali, Matracki, pedle, Fels, Linke, pop, kto, Imre, Murat, Jarre, kibice, Lawson, Jarota, rep, Mikina, Morse, zmarli, grum, italogrek, Arkady, Diks, Róg, Anatole, Hutu, Bomba, banici, buce, Iwo, Rut, ramol, Hawn, Hajnos, Nik, taboryci, Perec, Lem, nomad, Pacak, Miles, eon, nowi, Jul, Ber, Sujak, Kabat, cap, Sica, jog, kaid, dias, Knap, Noel, Sumer, Gleb, bard, Filak, samuraj i pop.


Tracking Spanish Flu through Austro–Hungarian Press

The area of the red circles on this map is proportional to the smoothed relative number of mentions of influenza in newspapers throughout the Austro–Hungarian lands in the last four months of 1918. Just as Google detected influenza epidemics using search engine query data, perhaps these numbers approximate the local time series of influenza morbidity.

Austria-Hungary

The 1918–1919 pandemic, called Spanish flu, killed at least five times more people than World War I. Here is a chart of its mortality in England and Wales (source: Edwin O. Jordan, Epidemic Influenza: A Survey, 1927):

weekly-mortality

The three waves occurred at similar times worldwide. I concentrated on the second, most deadly wave in the Austro–Hungarian empire and in the states that emerged in its lands. The full-text search of ANNO (AustriaN Newspapers Online, a digitisation initiative of the Austrian National Library) allowed me to count all the occurrences of the words {Grippe, Influenza}, {chřipka, chřipky, chřipce, chřipku, chřipkou} (in the Czech newspaper Dělnické listy), and {grypa, grypą, grypę, grypie, grypy, influenza, influenzy} (in the Polish newspaper Kurjer Lwowski) by the day of their publication in 27 newspapers from 14 cities and towns. A little side discovery is that the name “Spanish flu” in all three languages first appeared in print at the beginning of July 1918.

Despite the inevitable inaccuracy of this method, due among other things to OCR errors and the shadowing of the epidemic by other topics like the surrender of Austria–Hungary, the end of the monarchy, and the independence of national states, noise tends to cancel out. I inspected all 29 mentions of flu in Kurjer Lwowski in the week of Oct 7–13, 1918. Out of them:

  • 19 concerned local events,
  • 4 concerned events elsewhere in Austria,
  • 5 concerned events elsewhere in Europe.

I fitted the Gompertz curve y = c \exp(-e^{-b(t-t_0)}) to the cumulative numbers, keeping in mind the fact that the issues of some newspapers from the end of the year are missing from ANNO (recall the independence). For instance, here is the best fit for the cumulative number of times flu was mentioned in newspapers published in Vienna:

Vienna1

and a full-year chart of mentions of flu per week in the same newspapers:

Vienna2
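To see how the fitted curve relates to the weekly chart, the sketch below evaluates the Gompertz curve and its derivative. The parameter values are hypothetical, not the ones fitted to the Vienna data; an actual fit would need a least-squares routine such as scipy.optimize.curve_fit, which is left out here.

```python
import math

def gompertz(t, c, b, t0):
    """Cumulative count y = c * exp(-exp(-b * (t - t0)))."""
    return c * math.exp(-math.exp(-b * (t - t0)))

def gompertz_rate(t, c, b, t0):
    """Derivative dy/dt: expected new mentions per unit of time."""
    u = math.exp(-b * (t - t0))
    return c * b * u * math.exp(-u)

# Hypothetical parameters: 1000 mentions in total, inflection on day 45.
C, B, T0 = 1000.0, 0.1, 45.0
for day in (15, 30, 45, 60, 90):
    print(day, round(gompertz(day, C, B, T0)), round(gompertz_rate(day, C, B, T0), 1))
```

In this parametrisation c is the final cumulative count, t_0 is the time of the fastest growth (where the cumulative count reaches c/e), and b controls how sharp the peak is.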

I also tried to verify Andrew Price–Smith’s claim that Spanish flu tipped the balance of power towards the Entente in the last days of World War I. Unfortunately, many belligerent states do not have online archives of newspapers from 1918, the German archives are unsearchable, and Gallica and the British Newspaper Archive are useless: the mentions of grippe or influenza in the French and British press are few and far between outside advertisements, a sign of wartime censorship. However, this table from Epidemic Influenza: A Survey, showing that the highest mortality co-occurred in France and Germany, makes me doubt Price–Smith’s claim:

highest-weekly-mortality

The map of Austria–Hungary above comes from MPIDR [Max Planck Institute for Demographic Research] and CGG [Chair for Geodesy and Geoinformatics, University of Rostock] 2012: MPIDR Population History GIS Collection (slightly modified version of a GIS-File by Rumpler and Seger 2010) – Rostock (Rumpler, H. and Seger, M. 2010: Die Habsburgermonarchie 1848–1918. Band IX: Soziale Strukturen. 2. Teilband: Die Gesellschaft der Habsburgermonarchie im Kartenbild. Verwaltungs-, Sozial- und Infrastrukturen. Nach dem Zensus von 1910, Wien).


When Will Earnings in Poland Match Those in Western Europe?

They already did, 450 years ago.

wages

Source of data: Bob Allen’s home page; indirectly also three Polish books: Ceny w Krakowie w latach 1369–1600, 1601–1795, 1796–1914. The prices at the end of the period would appear less inflated if expressed in gold: the gold/silver price ratio was approximately 15.5 until 1872, whereas in the early 1900s, it grew to 33–40.
