The origin of language in gesture–speech unity

Part 2: Gesture-first

By Professor David McNeill

This popular hypothesis says that the first steps of language phylogenetically were not speech, nor speech with gesture, but were gestures alone.  In some versions, it was a sign language. In any case, it was a language of recurring gesture forms in place of spoken forms. Vocalizations in non-human primates, the presumed precursors of speech without gesture’s assistance, are too restricted in their functions to offer a plausible platform for language, but primate gestures appear to offer the desired flexibility. Thus, the argument goes, gesture could have been the linguistic launching pad (speech evolving later). The gestures in this theory are regarded as the mimicry of real actions, a kind of pantomime, hence the appeal of mirror neurons as the mechanism. To quote Rizzolatti and Arbib (1998), in their exposition of gesture-first, mirror neurons are “neurons that discharge not only when the monkey grasped or manipulated the objects, but also when the monkey observed the experimenter making a similar gesture” (p. 188).  Current chimps show this kind of action mimicry (see illustration later in this post).

Did gesture scaffold speech, then speech supplant it?  Even if mirror neurons were a factor in the origin of language, our basic claim is that a primitive phase in which communication was by gesture or sign alone, if it existed, could not have evolved into the kind of speech–gesture combinations that we observe in ourselves today. We see two problems. First, gesture-first must claim that speech, when it emerged, supplanted gesture; second, the gestures would be pantomimes, that is, gestures that simulate actions and events. However, such gestures do not combine with co-expressive speech; they fall instead into other slots on the continuum of gestures, namely supplements to speech (rather than co-expressive with it) and pantomime.

Across the roster of gesture-first advocates, including several writing before the mirror neuron discovery, all say at some point that speech supplants the original gesture language, which then is marginalized. For example, Henry Sweet (said to be Shaw’s model in Pygmalion for Henry Higgins) wrote, “…gesture which later would be dropped as superfluous” (pp. 3-4).  More recently, Rizzolatti and Arbib said,  “… gesture became purely an accessory factor to sound communication.”  In all cases, as in these quotes and many others, gesture withers to the status of an “add-on.”

This is the first wrong assertion. Gesture-first commits one to the false prediction that speech replaced gesture rather than, as we see in ourselves, speech and gesture united as one “thing.”  We say that gesture-first incorrectly predicts that speech would have supplanted gesture, and fails to predict that speech and gesture became a single system. It thus is falsified – twice in fact. The contradiction of gesture-first is that speech supplants gesture, it says, yet ends up integrated with it. The logic of gesture-first, at its very core, means that supplantation, overt or hidden, is inescapable. This is why every advocate naturally posits it.

Empirically, there is a perfect correlation between advocating gesture-first and positing the supplantation step. Moreover, there is a conceptual point that explains it. It is important to see that gesture-first is a theory about the origin of speech (not gesture). Given that aim, it must logically consider that from gesture one gets to speech; and here supplantation enters: it is unavoidable. Even Sweet, who envisions a transition from hand gestures to tongue gestures, and with them to speech, wants to leave hand gestures out at the end as “superfluous.” He has no way to say from his several transitions that gestures in the end are other than left-overs.

When it emerged, why did speech not gradually integrate with gesture? This is possibly what “scaffolding” intends in part. But even if scaffolding took place it could only have been a temporary arrangement. For speech to become an autonomous system, sooner or later gesture and speech must have separated. The reason again lies in the gesture-first tenet. The whole logic of gesture-first is to picture one code coming after another. The models of supplantation immediately below show the effects. The most that can happen is that the codes divide the labor of communication, as will be seen with the second model. Even if speech integrates with a gesture-language (as a kind of vocal gesture) it must sooner or later become an encoded system of its own, and the would-be integration is lost. The first of the models shows this happening – two codes, one for gesture and one for speech, refusing to synchronize. It does not help to point to gestures in non-linguistic primates. There is nothing in them to show how they could lead to language without encountering the same roadblock of supplantation.

Models of supplanting and scaffolding.  To see what may happen when two codes co-occur, as they would at the hypothetical gesture-first/speech supplantation crossover, we have two models: Aboriginal signs performed with speech, and hearing bilingual ASL signs with spoken English. In neither case is there the formation of packages of semiotic opposites, as the example in post 1 illustrated and the growth point explains. When a pairing of semantically equivalent gesture and speech is examined in these models, the two actively avoid speech–gesture combinations at co-expressive points. They repel each other in time or functionality or both, and do not coincide at points of co-expressivity.

1. Warlpiri sign language. Women use the Warlpiri sign language of Aboriginal Australia when they are under (apparently quite frequent) speech bans and also, casually, when speech is not prohibited. When the latter happens, signs and speech co-occur and let us see what may have occurred at the hypothetical gesture or sign-speech crossover. Here is one example from Kendon:

The spacing is meant to show relative durations, not that signs and speech were performed with temporal gaps (both were performed continuously). Speech and sign start out together at the beginning of each phrase but, since signing is slower, they immediately fall out of step. Each is on a track of its own and they do not unify. Speech does not slow down to keep pace with gesture, as would be expected if speech and gesture were unified (mutual speech–gesture slowing is shown by the deafferented patient, IW, “the man who lost his body,” described in post 3). They then reset (there is one reset in the example) and immediately separate again. So, according to this model, co-expressive speech–gesture synchrony would be systematically interrupted at the crossover point of gesture and speech codes. Yet synchrony of co-expressive speech and gesture is what evolved.

2. English-ASL bilinguals. The second model is Emmorey et al.’s observation of the pairings of signs and speech by hearing ASL/English bilinguals. While 94% of such pairings are signs and words translating each other, 6% are not mutual translations. In the latter, sign and speech collaborate to form sentences, half in speech, half in sign. For example, a bilingual says, “all of a sudden [LOOKS-AT-ME]” (from a Sylvester and Tweety narration; capitals signify signs simultaneous with speech). This could be “scaffolding” but it does not create the combinations of unlike semiotic modes at co-expressive points that we are looking for. First, signs and words are of the same semiotic type – segmented, analytic, repeatable, listable, and so on. Second, there is no global-synthetic component, no built-in merging of analytic/combinatoric forms with gesture’s global synthesis, and the spoken and gestured elements are not co-expressive but are the different constituents of a sentence. Of course, ASL/English bilinguals have the ability to form GP-style cognitive units. But if we imagine a transitional species evolving this ability, the bilingual ASL-spoken English model suggests that scaffolding did not lead to GP-style cognition; on the contrary, it implies two analytic/combinatoric codes dividing the work. If we surmise that an old pantomime/sign system did scaffold speech and then withered away, this leaves us unable to explain how gesticulation emerged and became engaged with speech. We conclude that scaffolding, even if it occurred, would not have led to current-day speech-gesticulation linkages.

Corballis, in his 2002 argument for speech supplanting a gesture-first system of communication, points out the advantages of speech over gesture. There is the ability to communicate while manipulating objects and to communicate in the dark. Less obviously, speech reduces demands on attention since interlocutors do not have to look at one another (p. 191). While valid, these qualities are not necessary. There are also positive reasons for gestures not being language-like, and they would be so even if gesture and speech co-evolved as a single adaptation. All across the world, languages are spoken/auditory unless there is some interference to the channel (deafness, acoustic incompatibility, religious practice, etc.), and no culture has a visual/gestural primary language. Susan Goldin-Meadow, Jenny Singleton and I once proposed that gesture is the non-linguistic side of the speech–gesture dual semiotic because it is better than speech for imagery: gesture has multiple dimensions on which to vary, while speech has only the one dimension of time.  Given this asymmetry, even if speech and gesture were jointly selected, as proposed in this series, it would work out that speech is the medium of linguistic segmentation.

Problems with pantomime. The second problem is that the gestures of gesture-first would be pantomimes. Gesture-first claims the initial communicative actions were symbolic replications of actions of self, others and entities, and these pantomimes later scaffolded speech. This process appeals because it so clearly taps the mirror neuron response. Merlin Donald likewise posited mimesis as an early stage in the evolution of human intelligence. It is conceivable that pantomime is something that an apelike brain is capable of and was already in place in the last common chimp–human ancestor, some 8 million years back. Contemporary bonobos are capable of it, supporting this idea:

[Illustration: Bonobo gestures]

The problem is not a lack of pantomime precursors but that pantomime repels speech. The distinguishing mark of pantomime compared to gesticulation is that the latter is integrated with speech; it is an aspect of speaking. In pantomime this does not occur. There is no co-construction with speech and no co-expressiveness; the timing is different (if there is speech at all); and there are no dual semiotic modes. Pantomime, if it relates to speaking at all, does so, as Susan Duncan points out, as a “gap filler” – appearing where speech does not, for example completing a sentence (“the parents were OK but the kids were [pantomime of knocking things over]”). Movement by itself offers no clue to whether a gesture is “gesticulation” or “pantomime”; what matters is whether or not two modes of semiosis combine to co-express one idea unit simultaneously. Pantomime does not have this dual semiosis.

Last word on gesture-first.  Whether you are persuaded by these arguments depends, ultimately, on taking seriously gesture–speech unity, that gesture and speech comprise a single multimodal system, and that gesture is not an accompaniment, ornament, supplement or “add-on” to speech but is actually part of it. Gesture-first does not predict this language–gesture integration. When we look at models of speech–gesture crossovers of the kind that, in theory, gesture-first would have encountered when speech supplanted an original gesture language, we do not find conditions for gesture–speech unity, but instead non-co-expressiveness or mutual speech–gesture exclusion.

Adding to the damage is Woll’s (2005/2006) argument that not only does gesture-first leave gestures unable to integrate with speech but it also blocks, within speech itself, the arbitrary pairing of signifiers with signifieds that is characteristic of (or, Saussure says, defining of) a linguistic code.

Michael Arbib, in his gesture-first theory, envisions an “‘expanding spiral’ of increasingly sophisticated protosign and protospeech,” a spiral moving from gesture-first to speech.  A spiral pictures gradual changes from gesture (or protosign) to speech (or protospeech). This appears not to be the “crossover” modeled above, but the models still apply. Pantomime and signs push synchrony and co-expressiveness with speech away, and do not break out of this self-defeating pattern (despite the spiral’s openness, as Arbib also argues, to sign and speech shaping each other). Nothing in the spiral forms co-expressiveness and gesture–speech unity. With each turn gesture spins off (“scaffolds”) a bit more of itself into speech; but then speech, far from shaping gesture or being shaped by it, repels it and/or divides the labor between itself and its former gesture master.

Michael Corballis likewise continues to advocate gesture-first in a new work, which takes as its central theme a posited linguistic universal, recursion.  However, recursion is equally beyond gesture-first. This is because recursion enters into gesture–speech unities. It co-expressively appears in both gesture and speech simultaneously. In one example, a speaker outlined what she took to be an ambiguity in the bowling ball episode of the cartoon stimulus described in post 1.  She first states the perceived ambiguity (“you can’t tell if the bowling ball”) and then, recursively, states the alternatives (“is under Sylvester or inside of him”); concurrently and co-expressively, she moves her left hand to a certain space for the ambiguity itself, and then opposes two spaces within it for the two poles of the ambiguity (two further gestures in the “ambiguity” space  – first the hand moves forward with “is under”, then inward with “or inside of him”); so there is recursion on both sides of the dialectic. The recursions, spoken and gestured, partake of the usual semiotic oppositions: while speech is codified, comprised of recurrent elements with constraints of meaning and form, gesture is global and synthetic and the meaning of the whole (ambiguity) determines the meanings of the parts (the “under” pole, in particular, being anti-iconic for the meaning of being under something).  None of this can gesture-first explain.

I do not deny that gesture-first may once have existed, and in fact I assume that it did exist once. But if it did, it could not have led to human language.  It would have created pantomime, a type of gesture that does not unify with speech. Gesture-first either went extinct or was shunted off into a dead end.  I propose in a later post that it was a dead end seen now only in children’s earliest language.

The upshot is that gesture-first has little light to shed on the origin of language as we know it; at best it explains the evolution of pantomime as a stage of phylogenesis that, if it once occurred, went extinct as a code and landed at a different point on the continuum of gestures.


Further Reading


Arbib, M. A. 2005. ‘From monkey-like action recognition to human language:  An evolutionary framework for neurolinguistics.’  Behavioral and Brain Sciences, 28: 105-124.

Armstrong, David F. and Wilcox, Sherman E. The Gestural Origins of Language. Oxford.

Armstrong, David F., Stokoe, William F. and Wilcox, Sherman E. 1995. Gesture and the Nature of Language. Cambridge.

Corballis, Michael C. 2002. From hand to mouth: the origins of language. Harvard.

Corballis, Michael C. 2011.  The Recursive Mind: The Origin of Human Language, Thought, and Civilization. Princeton.

Donald, Merlin. 1991. Origins of the Modern Mind: Three Stages in the Evolution of Culture and Cognition.  Harvard.

Henderson, E. (ed.). 1971. The Indispensable Foundation: a selection from the writings of Henry Sweet. Oxford.

Hewes, Gordon W. 1973.  ‘Primate communication and the gestural origins of language.’  Current Anthropology 14:5-24.

Rizzolatti, Giacomo and Arbib, Michael. 1998. ‘Language within our grasp.’ Trends in Neurosciences 21: 188-194.


McNeill, David, Duncan, Susan D., Cole, Jonathan, Gallagher, Shaun & Bertenthal, Bennett. 2008.  ‘Growth points from the very beginning.’  Interaction Studies 9: 117-132.

Goldin-Meadow, Susan, McNeill, David, and Singleton, Jenny. 1996. ‘Silence is liberating: Removing the handcuffs on grammatical expression in the manual modality.’ Psychological Review 103: 34-55.

Woll, Bencie. 2005/2006. ‘Do mouths sign? Do hands speak?’ in Botha, Rudie & de Swart, Henriette (eds.), Restricted Linguistic Systems as Windows on Language Evolution. Utrecht: LOT (Netherlands Graduate School of Linguistics Occasional Series, Utrecht University). (accessed 05/02/11).

Sign languages with speech:

Emmorey, Karen, Borinstein, Helsa B. and Thompson, Robin. 2005. ‘Bimodal bilingualism: Code-blending between spoken English and American Sign Language’, in Cohen, Rolstad and MacSwan (eds.) Proceedings of the 4th International Symposium on Bilingualism, pp. 663-673.  Somerville, MA: Cascadilla Press.

Kendon, Adam. 1988. Sign languages of aboriginal Australia: cultural, semiotic and communicative perspectives. Cambridge.

Gestures of Apes:

Call, Josep and Tomasello, Michael (eds.). 2007. The gestural communication of apes and monkeys. Erlbaum.




David McNeill is a professor in the Departments of Linguistics and Psychology at the University of Chicago.

His new title How Language Began: Gesture and Speech in Human Evolution is now available from Cambridge University Press at £19.99/$36.99
