Phylogeny and the history of language and culture

Increasingly, the methods of phylogenetic systematics are being used to uncover cultural and linguistic evolution. A leading group in this work is Russell Gray’s lab at the University of Auckland in New Zealand. He and his collaborators have looked at the evolution of language, particularly Pacific languages, and other cultural trends (like canoe decoration) in evolutionary terms.

Now they have published a paper in Science (Bouckaert et al. 2012), “Mapping the Origins and Expansion of the Indo-European Language Family”. The media of course published this under headlines like “English language originated in Turkey”, thereby demonstrating that journalists understand evolutionary thinking about as well as they understand economics, which is to say not at all. Basically, by applying a Bayesian analysis to word forms from many extant and extinct Indo-European (IE) languages, they concluded that IE originated in Anatolia, in the central regions of modern Turkey.

Anatolian origins of IE

This is not unlikely, for certain values of “began”. They locate the origination event around 9500 years ago, which is not long after agriculture began in more or less the same region. Previous hypotheses were that IE began around 5000 years ago in the central Asian steppes, along with the domestication of horses.

However, how good is this thesis? The BBC article with the silly headline is actually pretty well sourced and written. They quote Prof. Petri Kallio from the University of Helsinki as saying that “Unlike archaeological radiocarbon dating based on the fixed rate of decay of the carbon-14 isotope, there is simply no fixed rate of decay of basic vocabulary, which would allow us to date ancestral proto-languages.” He remains skeptical.

This is a matter of methodology and epistemology, and it goes to the very foundation of phylogenetic method itself. At best a sample of an organism or artefact from within C14 radioisotopic dating ranges only shows that an instance of that type was around at that time and place. It does not show whether it was the earliest or latest; it merely sets up a single anchor point that all hypotheses must account for.

Likewise, a document or monument with written language shows only that a language type was there at a time. And since writing per se did not arise until a little over 5000 years ago, even that cannot help with an origin placed some 9500 years ago. All we know is where recorded languages are or were found. They are anchor points, but not fixed ones. These anchors can and do move.

And Kallio is right: there are no fixed decay rates, or molecular clocks. In fact there aren’t such things in biology either. Molecular rates of change are not universal or constant, and inferences based upon them are at best hypotheses based on hypotheses. Phylogeny is a tool, but what does it show?

In my last post I noted that phylogenetic reconstructions show only relatedness. What they do not show, without some extensive ancillary assumptions, is how that relatedness arose. The increasing awareness of lateral transfer and hybridisation between taxonomic lineages indicates that there can be some complex histories even if the taxonomic relationships are treelike. NB: to head off the most common error about this, lateral transfer does not undercut the treelike structure either of evolution or of phylogenetic diagrams. It makes that structure harder to detect, but a certain admixture can be accommodated in a standard tree classification. Of course, if lateral transfer between two lineages becomes as common as inheritance within each of them, then you no longer have separate taxa. That would count as a single lineage that temporarily separated, like populations either side of a geographic barrier that are brought back into contact.

In the case of sociocultural evolution, such lateral transfer is often assumed to be rife. This is thought by some to undercut the importance of phylogenetic method in cultural contexts. I want to argue that it doesn’t, but that the inferences from phylogeny are not so obviously historical as some seem to think.

First of all, if some lateral transfer is possible in biological contexts between “good species” (the term used by biologists when they know it’s a species but it doesn’t follow some set of strictures they think species must follow), then some must be permissible in sociocultural contexts too. A loan word in French from English doesn’t make English and French the same language, no matter what the Académie française might think. In biology it is the entire shared developmental system, including the genome, that makes a species a species. In culture a tradition has more than just a few elemental objects; it has a functional structure, and in language a grammar.

So phylogenetics can apply nicely in contexts where traditions (or species) are well behaved. If they are relatively stable (i.e., not so transitory that they cannot be tracked), and distinct (i.e., the rate of lateral transfer is not so high they aren’t recognisable traditions any more), then you can do a phylogenetic analysis of them. It’s not so surprising really. Willi Hennig, whose Phylogenetic Systematics (1966) set up modern phylogenetics, took some of his ideas from the discipline of stemmatics, or tracking manuscripts by differences in transcription (Platnick and Cameron 1977, Atkinson and Gray 2005).

However, there are limitations when using this to reconstruct history. For a start, suppose you have two manuscripts that differ. You cannot reconstruct the last version they share historically. Suppose you have three, and two agree mostly. Can you reconstruct the last version from that? Well the two copies that agree might be from a large copy centre, but the one that disagrees might be from a minor monastery that actually had better copying procedures and a more original version, and so on. These issues are well known to historians and biblical scholars, for example.

Now consider the argument put forward by Bouckaert et al. They look at the frequencies of cognate words and conclude from their analysis that IE began in a particular location at a particular time. Such reconstructions rely on assumptions, like a relatively constant rate of diffusion in all directions. What if the language was blocked by a cohesive language and culture in one direction? What if one population into which it diffused was more conservatively structured? What if a small military power managed to spread through large territories? Each of those shifts the “weight” of the diffusion pattern and means we might think something other than the conclusion that Anatolia was the centre of origin. There are many contingencies and possibilities allowed just by a phylogeny, in culture and language as in biology.
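The worry about directional bias can be made concrete with a toy simulation. What follows is a minimal sketch, not the model Bouckaert et al. actually fitted: it assumes lineages spread along a single east-west axis by a simple random walk, and that the analyst estimates the origin as the average of where the descendants ended up. The function names and the parameter `p_east` are invented for illustration.

```python
import random


def simulate_spread(true_origin=0.0, n_lineages=2000, n_steps=200,
                    p_east=0.5, seed=1):
    """Let each lineage take n_steps unit steps from true_origin.

    p_east is the chance that any one step goes east (+1) rather than
    west (-1). p_east = 0.5 is the "same rate in all directions"
    assumption; anything else stands in for a barrier or a conquest
    that favours one direction. All values here are hypothetical.
    """
    rng = random.Random(seed)
    positions = []
    for _ in range(n_lineages):
        x = true_origin
        for _ in range(n_steps):
            x += 1.0 if rng.random() < p_east else -1.0
        positions.append(x)
    return positions


def naive_origin_estimate(positions):
    """Estimate the origin as the mean final position.

    This is only right if spread really was unbiased.
    """
    return sum(positions) / len(positions)


# Same true origin (0.0) in both runs; only the direction bias differs.
unbiased = naive_origin_estimate(simulate_spread(p_east=0.5))
biased = naive_origin_estimate(simulate_spread(p_east=0.6))
print("estimate with unbiased spread:", round(unbiased, 2))
print("estimate with eastward bias:  ", round(biased, 2))
```

With unbiased spread the estimate lands close to the true origin. With even a modest eastward preference it lands far to the east, although nothing about the true origin has changed. The final distribution of descendants does not by itself tell you which of the two worlds you are in.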

I am not denying the conclusions reached here. I think it likely (the use of Bayesian analysis here is significant) that Anatolia was indeed a centre of many cultural novelties. We certainly think that agriculture arose near or around there. But it doesn’t follow that because Anatolia is novel in one respect (farming) it is novel in another (language). We should avoid confirmation bias in science.

In more general terms, what counts as evidence in any historical and evolutionary process? Can we say that passerine birds first evolved in Australasia? Can we say whether writing began once and then diffused, or whether there were many independent inventions? Where did the Etruscans come from? Can we make any origin claims at all? We certainly would like to. The trouble is that information gets lost over time, and the best we can do is anchor events based on actual data. All process hypotheses based on these anchoring events are at best consistent with the data, not proven or even necessarily made more likely by them (to avoid confirmation bias and unwarranted inferences of the affirming-the-consequent kind).

It may sound like I am being contrarian here. I am not. This is the standard view in palaeontology, for example (see Smith 1994). History is hard to find, and we never have much confidence in our extensions beyond the data. It might be that we can reasonably think IE arose in Anatolia; knowing that is a lot harder.


  1. How high does the posterior probability have to get before we switch from “reasonably think” to “know”? 🙂

    • Oh, sure, ask a simple question, why don’t you?

      95.67% precisely.

      When do grains of sand become a pile? How many hairs on one’s head make you not bald?

      Seriously, as you know but readers may not, there is a massive literature on this. It doesn’t materially affect the argument, though, because whatever counts as “knowledge” in a given discipline or community, phylogenies are not yet knowledge apart from the data points as measured. They summarise what is known and act as straight rules for inductive generalisations from those data points. They test hypotheses in whatever way data tests hypotheses. They are not hypotheses (in my view).

  2. Aaron Clausen

    The origins of Proto-Indo-European have undergone just about every kind of analysis linguists over the last two centuries could think of. Perhaps this study will have some merit. Though I am only interested in this as a layman, to me the strongest argument for the Kurgan Hypothesis for the location of the PIE Urheimat is the conservation of so many PIE features in the languages of this region; in particular Lithuanian but also to a lesser degree in the Slavic branch.

    While, under the Kurgan model, Hittite belongs to the earliest family to branch out from PIE, it is still less conservative than Lithuanian is.

    The real problem here is just how far you can take the biological genetic model and fit it into linguistics. They are both most certainly evolutionary processes, but there are always dangers of taking things too far. But if we do consider the highest number of conserved PIE traits as being a good candidate for the closest language to the original “proto” language, then Lithuanian and its close (but extinct) relatives seem to fit best.

    That’s not to say that the earliest PIE speakers couldn’t have arisen in Anatolia and that there was a split from that point as some group moved northward, the more northerly group retaining a more archaic and anachronistic version of PIE (much as, for instance, Quebec French tends to have far more archaisms than Parisian French, despite Parisian French being in the French “Urheimat”).

    Still, that doesn’t explain other aspects that make the Kurgan hypothesis popular. The PIE roots point much more towards a Pontic-Caspian/Eastern European environment than an Anatolian one. There is also at least some degree of affinity with the Uralic languages, with at least a few roots from the two proto-languages seeming to be of common origin. Of course the data is scant, and it is quite possible that these common roots may be borrowings in one direction or another, but still there seems to be at least some reason to formulate a more distant Indo-European-Uralic mother tongue further back in time. Considering the distribution of Uralic languages, if such a hypothesis gains strength, it makes an Anatolian origin for PIE seem less likely.

  3. gerdien

    There are several weaknesses in the article by Bouckaert et al that together might very well invalidate their conclusion.

    One weakness is a rather common reasoning error: equating a basal branch with origin. Basal branching does not directly translate into origin, as any biologist should know – the platypus is not the origin of the placentals. Moreover, the point about basal and origin can be seen in Bouckaert et al’s supplementary information that gives the large phylogenetic tree. We know Latin is the origin of the Romance languages, and Latin clusters basal to Romance languages – OK. But Gothic is not the origin of the other Germanic languages; it just clusters basal. And Romani, the Gypsy language, clusters basal to all other extant Indian languages; it won’t be the origin of those languages. The phylogenetic tree is fully compatible with Hittite being the earliest to branch out from PIE (as Aaron Clausen said above), rather than Hittite representing the origin of PIE.

    Moreover, the shape of the phylogenetic tree in fig 2 of the article makes no sense if one considers an Anatolian homeland. Apart from a few non-significant early branches with the known problem languages, the main split in Bouckaert et al is between Indo-Iranian and a Western group. Look at the map in fig 2. That map presents two possibilities: Indo-Iranian departed first from Anatolia, or the Western branch departed first from Anatolia. In both cases the branch that departed first should cluster more basal, and the branch that left last should cluster with Anatolian! The phylogenetic tree (as presented in fig 2) says instead that IE departed from Anatolia and only after that split (where? Kurgan area?) into the Indo-Iranian and Western language groups.

    Another weakness is the reliance on written languages. We know the historic Skythians spoke an IE language (of the Iranian branch it is thought), but as we only know Skythian from a few words recorded by Greek authors, Skythian is absent in Bouckaert’s database. The area of Skythian is represented in Bouckaert’s figure S4 only by the late invasion of Slavonic languages. Given that Skythian is in the Kurgan area, the absence of early IE branches in the Kurgan area in Bouckaert’s modeling is not a result but a clear artifact.

