Visualizing Lexical Distance in Three Dimensions

If you do not know Stephan Steinbach’s blog Alternative Transport, check it out. Among subjects that interest me, Stephan covers data visualization. At one time, a picture from his post on lexical distance among languages of Europe went viral. Since then, he has been pondering ways to improve the illustration by including more languages, positioning them more scientifically, and going into the third dimension. Eventually, we decided to co-operate in this endeavour. Yesterday, Stephan published his article on the dimensional limits of graphs and today, I am publishing this post.

Without further ado, here is the visualization of lexical distance among 210 living and extinct languages, mainly from the Old World.


Here is what it took to make this visualization:

  • Get Vincent Beaufils’s list of 18 most stable word stems: “eye”, “ear”, “nose”, “hand”, “tongue”, “tooth”, “death”, “water”, “sun”, “wind”, “night”, “two”, “three”, “four”, “I”, “you”, “who”, and “name” for the languages in question.
  • For each pair of languages, add up the Brown–Holman–Wichmann distances between all pairs of corresponding words. This kind of distance takes into account 692 correspondences between consonants and vowels that recur among the world’s languages.
  • Apply multidimensional scaling to the matrix of distances. This approach minimizes the sum of squared differences between the input and output distances. Take the first three dimensions of the result.
  • Using Matplotlib, draw the labels of the languages in three dimensions. Move the viewpoint with uniform speed along the tennis ball seam as defined parametrically by López-López:
    x(t) = (1 − b)sin t + b sin 3t
    y(t) = (1 − b)cos t − b cos 3t
    z(t) = √(4(1 − b)b) cos 2t
    Following López-López, set b = 0.20.
  • Make the label of a language less transparent the closer it is to the viewpoint. Concretely, vary the alpha channel of the labels from 0.05 far away from the viewpoint to 1.0 close to it.
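
The last two steps can be sketched in plain Python. This is a minimal illustration, not the code behind the animation. The function names, the arc-length resampling used to get uniform speed, and the linear fade of alpha with distance are all my assumptions; the post itself only fixes the seam equations, b = 0.20, and the alpha range from 0.05 to 1.0.

```python
import math

B = 0.20  # seam parameter used in the post


def seam(t, b=B):
    """Point on the tennis ball seam at parameter t (on the unit sphere)."""
    x = (1 - b) * math.sin(t) + b * math.sin(3 * t)
    y = (1 - b) * math.cos(t) - b * math.cos(3 * t)
    z = math.sqrt(4 * (1 - b) * b) * math.cos(2 * t)
    return (x, y, z)


def uniform_speed_viewpoints(n_frames, n_samples=10_000, b=B):
    """Return n_frames seam points spaced equally by arc length.

    Stepping t uniformly does not give uniform speed, so the curve is
    sampled densely and t is looked up from cumulative chord length.
    (Assumed approach; the post does not say how it was done.)
    """
    ts = [2 * math.pi * i / n_samples for i in range(n_samples + 1)]
    pts = [seam(t, b) for t in ts]
    cum = [0.0]  # cumulative chord length approximates arc length
    for p, q in zip(pts, pts[1:]):
        cum.append(cum[-1] + math.dist(p, q))
    total = cum[-1]

    frames, j = [], 0
    for k in range(n_frames):
        target = total * k / n_frames
        while cum[j + 1] < target:
            j += 1
        seg = cum[j + 1] - cum[j]
        w = (target - cum[j]) / seg if seg > 0 else 0.0
        frames.append(seam(ts[j] + w * (ts[j + 1] - ts[j]), b))
    return frames


def to_elev_azim(p):
    """Convert a viewpoint direction to elevation and azimuth in degrees."""
    r = math.dist(p, (0.0, 0.0, 0.0))
    s = max(-1.0, min(1.0, p[2] / r))  # guard asin against rounding
    return (math.degrees(math.asin(s)), math.degrees(math.atan2(p[1], p[0])))


def label_alpha(label_pos, viewpoint, d_near, d_far, a_far=0.05, a_near=1.0):
    """Alpha for a label: a_near at d_near, a_far at d_far, linear between.

    The linear fade and the d_near/d_far cutoffs are assumptions.
    """
    d = math.dist(label_pos, viewpoint)
    u = (d - d_near) / (d_far - d_near)
    u = min(1.0, max(0.0, u))  # clamp outside the [d_near, d_far] range
    return a_near + u * (a_far - a_near)
```

With Matplotlib, each frame's angles from to_elev_azim could be passed to the 3D axes' view_init(elev, azim), and each label's value from label_alpha to the alpha argument of the text call. Note that view_init only sets a viewing direction, so the viewpoint position used for the distance would have to be scaled to whatever camera distance the plot uses.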

As a bonus, here are visualizations of lexical distance among languages of the Indo–European family and its Germanic, Italic, and Romance branches.



The colors come from Ethan Schoonover’s Solarized palette. The elegant narrow font is Delicious, a free font from exljbris Font Foundry.
