Visualizing Lexical Distance in Three Dimensions

If you do not know Stephan Steinbach’s blog Alternative Transport, check it out. Among subjects that interest me, Stephan covers data visualization. At one time, a picture from his post on lexical distance among languages of Europe went viral. Since then, he has been pondering ways to improve the illustration by including more languages, positioning them more scientifically, and going into the third dimension. Eventually, we decided to co-operate in this endeavour. Yesterday, Stephan published his article on the dimensional limits of graphs and today, I am publishing this post.

Without further ado, here is the visualization of lexical distance among 210 living and extinct languages, mainly from the Old World.


Here is what it took to make this visualization:

  • Get Vincent Beaufils’s list of the 18 most stable word stems (“eye”, “ear”, “nose”, “hand”, “tongue”, “tooth”, “death”, “water”, “sun”, “wind”, “night”, “two”, “three”, “four”, “I”, “you”, “who”, and “name”) and collect the corresponding words for each language in question.
  • For each pair of languages, add up the Brown–Holman–Wichmann distances between all pairs of corresponding words. This distance takes into account 692 correspondences between consonants and vowels that recur among the world’s languages.
  • Apply multidimensional scaling to the matrix of distances. This approach minimizes the sum of squared differences between the input and output distances. Take the first three dimensions of the result.
  • Using Matplotlib, draw the labels of the languages in three dimensions. Move the viewpoint with uniform speed along the tennis ball seam as defined parametrically by López-López:
    x(t) = (1 − b)sin t + b sin 3t
    y(t) = (1 − b)cos t − b cos 3t
    z(t) = √(4(1 − b)b) cos 2t
    Following López-López, set b = 0.20.
  • Make each label more opaque the closer it is to the viewpoint. Concretely, vary the alpha channel of the labels from 0.05 far away from the viewpoint to 1.0 close to it.
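
The last two steps can be sketched in plain Python. This is only an illustration under stated assumptions, not the code used for the animations above: the function names are hypothetical, the uniform speed is approximated by numerically resampling the curve by arc length, and the alpha channel is assumed to fall off linearly with distance (the steps above give only its two endpoints). The elevation and azimuth angles are computed in the form that Matplotlib's `view_init(elev=..., azim=...)` accepts.

```python
import math

B = 0.20  # value of b used in the post


def seam_point(t, b=B):
    """Point on the tennis ball seam for parameter t (lies on the unit sphere)."""
    x = (1 - b) * math.sin(t) + b * math.sin(3 * t)
    y = (1 - b) * math.cos(t) - b * math.cos(3 * t)
    z = math.sqrt(4 * (1 - b) * b) * math.cos(2 * t)
    return x, y, z


def uniform_speed_path(n_frames, b=B, n_samples=4000):
    """Return n_frames seam points spaced (approximately) evenly by arc length."""
    ts = [2 * math.pi * i / n_samples for i in range(n_samples + 1)]
    pts = [seam_point(t, b) for t in ts]
    # Cumulative chord length approximates arc length on a fine grid.
    cum = [0.0]
    for p, q in zip(pts, pts[1:]):
        cum.append(cum[-1] + math.dist(p, q))
    total = cum[-1]
    frames, j = [], 0
    for k in range(n_frames):
        target = total * k / n_frames
        while cum[j + 1] < target:
            j += 1
        seg = cum[j + 1] - cum[j]
        frac = (target - cum[j]) / seg if seg > 0 else 0.0
        # Interpolate the parameter t inside the grid segment.
        frames.append(seam_point(ts[j] + frac * (ts[j + 1] - ts[j]), b))
    return frames


def to_elev_azim(point):
    """Convert a viewpoint direction to (elevation, azimuth) in degrees."""
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    return math.degrees(math.asin(z / r)), math.degrees(math.atan2(y, x))


def label_alpha(label_pos, viewpoint, d_min, d_max, far=0.05, near=1.0):
    """Alpha for a label: `near` at distance d_min, `far` at d_max.

    Linear interpolation in between is an assumption of this sketch.
    """
    if d_max == d_min:
        return near
    d = math.dist(label_pos, viewpoint)
    frac = min(1.0, max(0.0, (d - d_min) / (d_max - d_min)))
    return near + (far - near) * frac
```

For each frame, one would compute the distance from the viewpoint to every label, take the smallest and largest of those distances as `d_min` and `d_max`, and pass the resulting alpha to each text label before rendering. The viewpoint from `seam_point` is a direction on the unit sphere, so it would need scaling to the size of the plotted point cloud before measuring distances.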

As a bonus, here are visualizations of lexical distance among languages of the Indo-European family and its Germanic, Italic, and Romance branches.



The colors come from Ethan Schoonover’s Solarized palette. The elegant narrow font is Delicious, a free font from exljbris Font Foundry.

3 thoughts on “Visualizing Lexical Distance in Three Dimensions”

  1. This is brilliant stuff! Thank you so much for taking the time to write out a thorough explanation of how you arrived at your distances and how you visualized the data. I was wondering if there are any plans to make an interface where one could control the visualization by clicking and dragging? Like the list would be static and you could navigate around the space to see the distance? That would be phenomenal!

