Origin of language lies in song

Chris Knight of the Radical Anthropology Group examines one of science’s most intriguing unsolved problems. This is an edited transcript of a talk given to Communist University in August 2015.

Jerome Lewis is that rare thing: an anthropologist who has had years of experience conducting fieldwork with a group of hunter-gatherers in the Congo. He has become especially interested in the way the Bayaka forest people sing and he argues that their polyphonic singing is an expression of their egalitarianism, their communistic way of life.

There is much improvisation, with frequent switching between melodies, but no-one is organising this. Jerome describes the way in which this kind of singing acts as a sort of blueprint for these people’s style of government: they enjoy ‘government from below’. Just as they organise their singing, so they organise everything else. Building on Jerome’s fieldwork, the two of us have been working together in recent years to try to solve one of science’s great remaining mysteries: the evolutionary emergence of language in our species.

Many Marxists might wonder why this should be considered a mystery. Didn’t Engels solve it long ago, when he wrote that people began developing language from the moment they had something to say to one another? Surely when people need to cooperate in labour tasks they must communicate with each other, and that is how language develops?

I agree that this is a fairly accurate description, but we need to go quite a lot further. Many animals cooperate in all sorts of ways and, of course, they communicate. But language in the human sense, with its grammatical rules and its digital structure, is radically different from anything known in the animal world. So Engels’s explanation, although basically correct, would not in itself satisfy any scientist today. It is hardly detailed enough to qualify as a scientific theory.

So why is the origin of language one of the most difficult questions in science? Let me explain some of the paradoxes.

Darwin and Chomsky

I shall start with Darwin’s theory, which he himself described as “descent with modification”. Fish have fins, but land animals have legs. The point here is that the fin provided a starting point. Over millions of years, as certain fish became stranded on mud-flats, their fins became more stubby, more bony, and the eventual result was legs. “Descent with modification” only works if there is a precursor: had there been no fins for evolution to work on, legs could not have evolved.

The problem with language is that it is very hard to find a precursor. Let us take, say, the chimpanzee ‘pant-hoot’. Like all other primate vocalisations, it is basically body language - audible bodily condition, arousal, emotion. Listeners are interested in inferring the exact degree of arousal not by choosing from a set of categories, but by evaluating the signal in terms of an infinite range of gradations. Needless to say, human language can be emotional too, but that is not central to the system. Language works by selecting between a limited range of choices, as in the contrast between ‘bin’ and ‘pin’. Deep down, grammatical language is a digital, dispassionate, logical means of communication.

It has often been said that, since language is so complex, its precursor must have been something correspondingly complex in our primate ancestors. That would suggest a basis in primate cognition, which is hugely complex, rather than primate vocal communication, which shows little if any of that complexity.

Ape intelligence is complex in the sense that it is ‘Machiavellian’. Apes can easily work out that ‘the enemy of my enemy is my friend’ and so on. But to say that is to leave unexplained the basic problem, which is how primate social intelligence ever came to be expressed in ways allowing complex thoughts to be shared.

Most scientists these days agree with Noam Chomsky and his supporters that language is in some sense an instinct. I agree with that too. Not in the sense that words and grammar are instinctive, but in the sense that a young child does not have to be trained to acquire language through reward and punishment. In the 1940s and 1950s, it was thought by many that a child acquires its first language in the same way that a laboratory rat is programmed to run through a maze - that is, through rewards and punishments applied from outside. As a result of Noam Chomsky’s brilliant insights, it is deeply unfashionable to believe this now.

A child very quickly works out how to speak the language of those around it. Grammar is the most theoretically complex edifice imaginable, yet a child picks up the complexities almost effortlessly at the age of two or three - a biologically fixed time known as the critical period for language acquisition. More complexity comes out from the child than apparently goes in. Anyone who has ever had children will know that they will often say things which are so original, so funny or so imaginative that you can only wonder where they got it from! A human baby comes into the world with the innate equipment to do that: a chimpanzee does not. Chomsky calls this biological endowment, possessed equally by all of us, “universal grammar”.

So where does this biological endowment come from? If you look at most human instincts, you can discern a counterpart of some kind in the animal world. We have a sex instinct, and it is clear where this comes from. Aggression? Not a problem. Maternity? Not a problem. These human instincts have counterparts in the animal world. But if we accept that humans have a genetically determined language instinct - and I think we should - then where on earth did that come from? It seems to have evolved in our species from nowhere. It is hard to see any animal counterpart or precursor. Chomsky suggests that it must have been installed in just one step by a single mutation, perhaps triggered by a cosmic ray shower.

Biologists would normally expect a completely new instinct to take millions of years to evolve. If a language instinct began evolving in our species millions of years ago, archaeologists might expect that unusual development to be accompanied by other signs of unusual things happening, such as remarkable levels of planning and social cooperation. But if you look at the Australopithecines evolving several million years ago, there is no sign that they were doing anything particularly cooperative. They seem to have been living much like other monkeys and apes - albeit on two legs and with some tools. So, if language did start evolving two or three million years ago, it seems strange that it left no indirect signs which archaeologists can discern. The conclusion must be that language did not start evolving until very much later.

Most specialists nowadays assume a relatively recent date for the emergence of language - say, 150,000 to 200,000 years ago, coinciding more or less with the emergence of Homo sapiens. Chomsky suggests a date of around 100,000 years ago at most.

In Africa, where we evolved, we begin to get evidence for the use of red ochre pigments in body art from around 250,000 years ago, suggesting that our ancestors may by this stage have been regularly staging symbolic rituals. It seems plausible to suggest that symbolism in such activities was in some way linked with linguistic symbolism. But if grammatically complex language has only been with us for that brief period of time, it is difficult to see how the necessary biological underpinnings could have had time to evolve.

Charles Darwin had a very simple idea, but unfortunately it does not seem to work. He argued that if apes cannot use language, it is because they are not clever enough. As our primate ancestors’ brains got bigger, he argued, they learned to imitate other species’ calls and cries, arranging and combining these sounds in more and more complex ways. But, as Chomsky and others have pointed out, there does not seem to be any simple correlation between the cognitive complexity of a species and its communicative complexity. The creature with perhaps the most complex form of communication apart from humans is the honeybee, whose brain is about the size of a grass seed. Through its celebrated waggle dance, a bee that has discovered important information can tell the rest of the hive where the flowering plants are, where they are in relation to the angle of the sun, how to get there - and it is all done in the darkness of the hive.

There is no correlation here between the size of the brain and communicative ability. And, when it comes to primates, there is a hugely disappointing reverse correlation - the more intelligent the primate, the bigger the brain, the less complex its vocalisations seem to be. So, for example, many monkeys with relatively small brains actually seem to have a more complex repertoire of meaningful vocalisations than is possessed by the larger-brained great apes, such as chimpanzees. At first sight, this reverse correlation seems hard to explain.


To grasp what is happening, you have to remember why scientists describe primate intelligence as ‘Machiavellian’. Monkeys and apes live in complex political systems. Their intelligence is politically calculating and self-serving, with no need to take account of moral principles upheld by the group as a whole. Needless to say, there is also cooperation, as when chimpanzees band together to hunt colobus monkeys. But, once some individual has seized its monkey victim and begun trying to eat it, others quickly arrive and there is a fight over the spoils, with no advance commitment to ensuring fair shares. Apes are highly competitive creatures. Even when they work together, there is no community-wide contract or commitment to put common interests first.

Insofar as an intelligent primate can manipulate its vocal calls, it will tend to do this in order to deceive. Occasionally a monkey, while feeding, will falsely emit an alarm call to scare off its rivals and then grab the available food for itself. But this works only if most calls are honest. Deception has to be relatively rare, otherwise the signaller will soon find itself ignored. The more patently Machiavellian or ‘fake’ an animal’s calls, the less likely are those around it to pay any attention to it at all.

This explains why Darwin’s theory about regular, routine vocal mimicry does not work. To mimic emotional cries on a regular basis would be to fake them deliberately, which would render them no longer trustworthy. This, then, is the difficulty with Darwin’s theory: where primate listeners are concerned, a constant priority is to be on guard against the possibility of being deceived. As a result, signallers are under constant pressure to convince sceptical listeners of the intrinsic reliability of their calls and cries. Lack of trust in communicative intentions has the effect of restricting primate vocal signalling to hard-to-fake body language, much like human screams, sobs and cries.

All this is beautifully explained by the Israeli theoretical biologist and ornithologist, Amotz Zahavi, in his ground-breaking book, The handicap principle: a missing piece of Darwin’s puzzle. It is a wonderful read. Published in 1997, the book helped transform our understanding of animal communication.

A good illustration is the peacock’s tail. Why does the male need such extravagant plumage to attract the peahen? In fact, the tail brings all kinds of problems: it stops the male from flying very far, it invites parasites, and if the brilliant pattern is even slightly asymmetrical, the flaw is plain to see and any female will reject him completely. Zahavi’s answer is that it is precisely because the tail makes the peacock so vulnerable - precisely because the display is so costly in all these ways - that it is so convincing. Only an extremely fit male could afford that kind of handicap.

Another example is ‘stotting’. In Africa, when a lion approaches, what do gazelles do? You would think they would just run away, but no. Instead they perform a sequence of high jumps, each throwing its highly visible backside as high as it can in the air. Why would the herd of gazelles do that? Zahavi’s answer is simple: each gazelle wants the lion to chase someone else. The predator sees a whole range of jumping backsides. Which one will it decide to target? Not the gazelle that is so confident that it demonstrates its reserves of spare energy by impressively jumping up and down. The lion is more likely to pick on any gazelle in the herd that seems in a hurry to escape. So the gazelles are engaged in costly signalling, using as much energy as they can afford to bounce up and down instead of running away. This is a brilliant theory that explains so much.


All animal signalling is in some way costly. Animals do not just use a minimum of sound to make a point. Go to a zoo and hear the monkeys hooting and howling - they keep repeating the same sound over and over again. These are costly signals, audible body language - extravagant rather than efficient. Admittedly, there are occasions when we humans, too, must resort to costly cries, screams, sobs and so forth. This tends to happen when we need to convince others that we really mean it. But for most of the time, when we are engaged in conversation, there is no need to shout, bang the table or scream. Where people trust one another and share similar concerns, it becomes possible to exchange quiet combinations of ultra-low-cost signals, each consisting of a small number of categorically differentiated vowels and consonants. Repetition is kept to a minimum. Human language is the most efficient system of communication on the planet.

One way to maximise efficiency is to rely on the muscles and mechanisms which take least effort to move. When we speak, we use what are called the articulators: movable mechanisms, such as the tongue, lips, mandible, soft palate and larynx. These are parts of the mouth and throat which evolved originally for eating and breathing, not speaking. The lips bring in food, the tongue manipulates it, the mandible chews, the soft palate helps us swallow. Speech scientists like to point out that talking involves going through the motions of eating. It is as if we were chewing, while at the same time phonating - producing sounds from the voice box. It is significant that monkeys and apes appear unable to manage this feat: either they are vocalising or they are eating, one or the other. As an ape starts to vocalise, its tongue becomes temporarily immobilised, playing no role in modulating the sound. We humans are unusual, in that we can do both - switching on the voice while simultaneously activating the ingestion system.

When we do this, each so-called articulator functions as a digital switch, which has to be in one position or the other: typically either ‘on’ or ‘off’. Either the lips are open or closed, the vocal folds are audibly vibrating or silent, the tongue is in this or that position inside the mouth. If the consonant ‘d’ is unvoiced, the sound switches to a ‘t’. An intermediate sound somewhere between ‘d’ and ‘t’ is not possible. Note that the switch from one state to the other could easily change the meaning of a whole sentence, doing so at essentially zero cost.

All the other subtle moves we make while speaking - for example, differentiating between ‘z’ and ‘s’, or between ‘o’ and ‘a’ - are so cheap and easy as to be essentially cost-free. One consequence is that unlike primate vocal signalling, whose costly features must be evaluated by the listener on an analog scale, the signal stream under these circumstances gets reduced to digital format.

Whereas Chomsky explains this by claiming that the human brain incorporates a special digital module, I prefer a simpler explanation. Intentions cost nothing. Where listeners trust you sufficiently to care primarily about what you intend, there is no longer any need to scream, wave your hands or shout. Your audience will now encourage you to use short cuts.

If you can reduce the costs of signalling sufficiently, you soon arrive at the point where all that matters is that your partner can discern the difference between ‘signal on’ and ‘signal off’. This has huge theoretical significance, highlighting what Chomsky considers the core principle which sets language apart from primate vocal communication. Chomsky calls it ‘digital infinity’. I agree that digital format is central, but do not agree that this implies a special computer module in the brain. It is simply that if you keep cutting costs, that point is where you will eventually arrive. To reduce signal costs to zero is, by definition, to arrive at the logical extreme of a digital system of communication. It is well known by speech scientists that speech is digital, but all too often this is explained mechanistically instead of socially and politically. I am arguing that speech is digital because, to repeat, where trust is sufficiently high, that is what you get.

To the listener attending to human speech sounds, what matters about each arrangement of vowels and consonants is not the precise quality of sound or what it reveals about the speaker’s bodily state or reserves of energy. All that matters is the underlying communicative intention. Note how different this is from what happens when a chimpanzee hears another chimpanzee emitting, say, a food call. The listening chimpanzee is hardly interested in what the signaller intends. Its focus, rather, is on the precise quality of that signal as a guide to the other chimp’s involuntary excitement on discovering food.

So what is it that stops apes and monkeys from producing complex syllables, vowels and consonants? The choice is between a Marxist, materialist explanation and an abstract, mechanistic one. Scientists who are not Marxists tend to rely on mechanisms. A good example here is the idea that monkeys and apes have an unfortunately inflexible tongue, or maybe the wrong kind of nerves controlling the tongue. You may find it hard to believe, but arguments of that kind are continuously being proposed. Somehow, it is said, we humans began talking when genetic mutations led to a more flexible or more controllable tongue.

But think about it. The idea that evolution could possibly have yielded an ape or monkey with an inflexible tongue makes no sense. The truth is the exact opposite: non-human primates need a flexible tongue under volitional control, because otherwise they would starve. I like to turn the argument on its head. It is precisely because the tongue is so flexible and easy to manipulate that it is excluded from a role in primate vocal communication. If you follow my train of thought, I hope you will immediately see that if the chimpanzee were to use its remarkably flexible tongue, lips, etc to intentionally manipulate its signalling, then no other chimp would believe what it was saying.

So it is an entirely social and political deficit - lack of community-wide trust - which continuously drives chimpanzees to fall back on the muscles and mechanisms in their bodies which are least subject to volitional control and which, therefore, are most likely to provide reliable evidence of what is happening. If there is any doubt about the truth of what is being conveyed by a particular call or cry, non-human primates have good reason to press for another and more convincing version of that same signal - one which cannot possibly be faked. This explains why primate vocal communication seems so repetitive to human ears. It is easy to see why the same factor of mistrust leads to the marginalisation of the all too flexible, potentially deceptive tongue.

Choral singing

Turning now to human evolution, the articulatory apparatus for speech hardly needs to be explained. For millions of years, the basics were already in place among our ancestors, for the simple reason that possession of a flexible tongue, lips and so forth had long been essential for eating. Much more difficult was to establish something new - full volitional breath control and control over the larynx. The challenge was to develop the uniquely human ability to take a deep breath and make continuous vocal sounds, while breathing out and articulating at the same time. An intriguing theory now being widely debated is that our ancestors refined and developed these capacities by regularly resorting to choral singing.

Singing is extremely important for African hunter-gatherers. That is especially true for the women, who sing whenever they go out foraging. When the men are out hunting they do the opposite, remaining as quiet as they can. If they do need to communicate, they resort to sign language, or perhaps silently move certain twigs or leaves. Silence is needed to avoid scaring away the animals they are hunting. But when the women are out foraging they face quite different challenges. Charging buffalo, elephants and other large animals can pose a real threat. To keep themselves and their children safe from harm, they want the animals to keep away.

Working in the Congo among the Bayaka forest people, Jerome Lewis reports how he once asked the women, “Why do you sing so much?” They replied, “We are singing for our lives.” They explained that when they were out in the forest, especially on dark nights with no moon, they knew that their traditional style of polyphonic singing kept the dangerous animals away. Jerome is inclined to believe what the women themselves say. With that particular kind of polyphonic choral singing, even a small group of five or six women can produce sounds conveying the impression of a sizeable group.

In the distant evolutionary past, the predators of the day - sabre-tooth tigers and similarly fearsome creatures - preferred to hunt on moonless nights, so as to take advantage of their superior night vision. Even to this day, Bayaka women’s polyphonic singing reaches a crescendo during the nights of the month when there is no moon. In the case of the Hadza of Tanzania, we find the same thing - women’s polyphonic singing goes on for hours in or near the camp during dark nights with no moon. Those dark nights are precisely when lions pose most danger. If you add it all up, it makes beautiful sense.

Singing bonds people together, producing a group effect. It is difficult to sing without feeling some kind of emotion. The reason for this is that, as a legacy of our primate ancestry, we still associate the production of pitch changes with profound changes in emotional state. If you go back far enough into our pre-human past, there must have been a time when it just was not possible to cry, scream, moan or howl, producing pitch variations accordingly, unless the corresponding emotions were being genuinely experienced. Over time, it became possible to exert a certain amount of control over these variations in emotionally expressive sound. But, even if this involved a certain ability to imitate or fake the sounds, it remained difficult to establish full independence from bodily or emotional states. Deep down, when you emitted or heard these sounds, they continued to churn the emotions. As if you were still a chimpanzee or gorilla, your emotional brain would have heard the sounds as genuine expressions, not fakes.

For this reason, the best way to control production of such sounds would have been to put yourself in a situation where the corresponding emotions were genuinely being worked up inside you. By getting sorrowful, you could make sorrowful sounds; by getting elated, you could vocally convey elation to everyone else. Note that if your real environment at that moment did not trigger the necessary feelings, all this might have proved impossible to achieve. An obvious solution would have been to produce pitch changes jointly with other people, ensuring that the whole group’s emotional states became shared as part of a constructed artificial environment. That would imply choral singing. Where this occurred, body language would have become collectivised, as singers tuned in to each other. People would now have enjoyed enhanced volitional control over the emotionally expressive sounds they could make. Whenever the group changed its mood, the vocal expression of that mood would have changed accordingly.

It is important to stress here that choral singing is emphatically not the same thing as speech. Every principle of efficiency central to speech is systematically violated. Instead of interspersing short vowels with consonants, we draw out those vowel sounds without requiring tongue movements or consonants at all. Perhaps the key point is that in music efficiency would be absurd and so you do not resort to abbreviations. To produce the effect you must go through the full performance - each and every note of it - and in many cases repeat the entire sequence again and again. Singers are not under pressure to speed up or cut corners. Since in music there are no abbreviations, it follows that there are no sequences of abbreviations which need to be clearly differentiated from one another. Words are not required. Technically, there is no need to overlay pitch variations with contrastive vowel-consonant alternations. While singing in the forest, Bayaka women will dwell endlessly on varying the pitch of just a few vowel sounds.

Singing, then, requires time and energy. To sing through the whole night with a chorus of other women is to do something costly which signals your commitment to the group. I have stressed that singing is not language, but now I want to stress that precisely for that reason - because singing takes time, is costly and is emotionally bonding - it can generate exactly the kind of group-level trust on which language depends.

Let me conclude with a possible scenario. Imagine that the singing has now stopped. Even though no-one is singing, everyone remains aware of the group’s repertoire of different songs. This gives everyone the rudiments of a shared code, allowing the possibility of using fragments of song to convey diverse messages in different, freely chosen contexts. There might be one melody which people traditionally sing when they are looking for honey, another which they sing when someone has recently died. Each song has a certain pattern and content and, when the occasion arises, everyone would normally join together in singing the whole thing.

But, right now, it is the next day. We are starting to plan other activities, perhaps without any singing at all. Someone might just pick a little fragment out of the honey song to suggest we go looking for honey. So you get parts of the song standing in for the whole, just to convey a thought, and then, to distinguish one song fragment as clearly as possible from another, you might find vowels and consonants useful in a way they never were before.

In a short talk of this kind, there is no time to answer all the problems I earlier raised - that would take a book. But this brief sketch in my view captures the essence of how language in our species evolved. It did not all happen the way Chomsky says it did - in one step as a result of a mutation triggered by a cosmic ray shower. Rather, it happened as a consequence of profound social and political change.

Further reading

C Knight and J Lewis, ‘Vocal deception, laughter and the linguistic significance of reverse dominance’, in D Dor, C Knight and J Lewis (eds) The social origins of language Oxford 2014, pp297-314.