Stylometry—It Works! (in Some Circumstances)

This post was supposed to reveal the author of the 13th Book of Pan Tadeusz, an anonymous pornographic sequel to the Polish national epic. Despite my attempts that took into account rhyming sounds, word syllable count, and custom morphological analysis for Early Modern Polish, I failed to identify the author. Which is not that bad: authorship attribution, especially when regurgitated by journalists, is often reduced to ex cathedra statements: “a computer has proven that work X was written by author Y”; the fact that the confidence level is unknown is not reported.

Instead of a literary discovery, I present you a little game: Which Polish text is your writing like? It tells me that The 13th Book is most similar to Antymonachomachia by Ignacy Krasicki who died 33 years before the publication of Pan Tadeusz. Oh well.

The game is based on texts from Wolne Lektury, the Polish equivalent of Project Gutenberg. I appreciate Radek Czajka’s help in downloading them.

Since I know little about writing style analysis (known as stylometry), the entire sophistication of my program lies in calculating the frequency of a few dozen of tokens in each text. This idea is similar to Alphonse Bertillon’s anthropometry, a late-19th-century efficient system of identifying recidivists by classifying eleven body parts as small, medium or large.

We compare text style rather than text topics, so the program pays little attention to content words. It counts final punctuation marks, commas, and 86 frequent function words, that is conjunctions, prepositions, adverbs, and so-called qubliks. These counts are divided by the total number of tokens in the text, yielding a 90-dimensional vector of token frequency for each text.

The figure below shows the results of hierarchical clustering of the texts longer than 5000 tokens, obtained with

scipy.cluster.hierarchy.dendrogram(
    scipy.cluster.hierarchy.linkage(
        frequency_matrix, method=’ward’, distance=’euclidean’))

dendrogram

I, for one, am impressed by its gathering together most of texts written by Kasprowicz, Krasicki, Rzewuski, and Sienkiewicz, or translated by Boy–Żeleński and Ulrich.

How reliable are the results? To answer this question, I perturbed the token counts: for each text composed of N tokens, I replaced k occurrences of each counted token by a random variable with the binomial distribution B(N, k/N), that is the count of heads in N tosses of a biased coin whose heads probability is k/N. For each text from Wolne Lektury, the x axis in the figures below shows the total number of tokens. The y axis shows the frequency with which the nearest point by the Euclidean metric corresponded to a different text or a text by another author/translator, measured in 1000 such random perturbations. In case you wonder how the y axis appears logarithmic and contains zero at once, the plotted variable is log(y + 0.001).

I approximated both the text misattribution probability and the author misattribution probability by 1−(erf(√N/c))b, with empirical values of constants b and c depending on the language, the tokens, and the texts.

Here is my hand-waving explanation of this formula. The coordinates of perturbed points, multiplied by N, have a multivariate binomial distribution (it does not matter whether the coordinates are correlated or not). When N approaches infinity and k/N remains constant, the binomial distribution is asymptotically normal with variance proportional to N (by the central limit theorem applied to tossing the coin), and the multivariate binomial distribution is asymptotically multivariate normal. Dividing the random variables by N, we return to the coordinates, which asymptotically have a multivariate normal distribution with individual variances and covariances proportional to 1/N.

The points divide the 90-dimensional vector space into Voronoi cells whose centres correspond to the mean vectors of the distributions. Moving a point to the other side of some wall of its Voronoi cell means moving it by more than d in the direction perpendicular to the wall. The projection of any multivariate normal distribution with variances and covariances proportional to 1/N onto a vector is a (univariate) normal distribution with variance proportional to 1/N. The probability that a random variable with variance σ2=a/N differs from its mean by more than d (that is, that the permuted point crosses the wall, causing a misattribution) equals 1−erf(d/σ) = 1−erf(dN/√a) = 1−erf(√N/c). Since the Voronoi cell has many walls in different directions, the overall probability that the point exits its cell is approximately equal to 1−erf(√N/c1)×⋯×erf(√N/cn). The erf function decreases rapidly so the factors with the smallest cis dominate the product, which can be approximated by the formula 1−(erf(√N/c))b.

wrong-text
wrong-author

The figures explain why it was hard to ascribe the author to The 13th Book: even if other works by the author belonged to the Wolne Lektury corpus (they probably do not), The 13th Book has merely 1773 tokens.

Stylometry—It Works! (in Some Circumstances)