Human face-to-face communication is a multi-component phenomenon: our everyday speech is embedded in an interactional exchange of unified visual, auditory, and often even tactile signals. Some parts of these multimodal displays are intrinsically coupled due to the effort of vocal production (such as mouth movement accompanying speech sounds), but others are flexible (e.g. gaze and co-speech gestures). Research on the nature and function of human multi-component interaction has recently focused particularly on flexible combinations of different articulators (i.e. communication organs such as hands, lips and eyes) e.g. 1,2. For instance, speech acts accompanied by gestures and gaze are processed faster 3,4 and elicit faster responses 5, respectively. This suggests that a complex orchestration of articulators and sensory channels facilitates comprehension and prediction during language processing 6.
Many non-human species also have a natural predisposition for multi-component social interactions, as evident in complex mating, warning and dominance displays 7,8. Multi-component signalling can be distinguished based on the perspective of production versus perception, as emphasized recently by Holler and Levinson 6: “multiplex” communication involves at least two different articulators or communication organs at the production side 6, such as hands plus gaze, whereas multimodal communication involves at least two different sensory channels at the perception end, such as visual plus auditory 9. Many multi-component acts are both multiplex and multimodal, for instance a tactile gesture combined with a facial expression, whereas some are just multimodal, such as the audio-visual loud scratch gesture 10,11, and others are only multiplex, such as a visual gesture combined with a facial expression. In fact, our closest living relatives, the great apes, are renowned for signalling intentional communicative acts in large part by non-vocal means in their close-range dyadic interactions 8,12,13. Not only are many of these signals intrinsically multimodal (e.g. tactile gestures that can be simultaneously seen and felt by a receiver, or lip-smacking, which can be seen and heard), but they can also be integrated with other non-vocal or vocal means in multimodal signal combinations e.g. 12,13,14. Because the term “multimodality” has confusingly been used for both types of multi-component communication 15, we will henceforth refer to multi-sensory and multi-articulator acts for multimodal and multiplex (in the sense of ref. 7), respectively.
The fact that close-range communicative acts may be either multi-sensory or multi-articulator (even if many are both) highlights the importance of assessing whether they serve different communicative functions. However, to date no study has explicitly investigated and compared the usage of uni-/multi-sensory versus uni-/multi-articulator communicative acts in a great ape taxon (nor, to our knowledge, in humans). The theoretical and empirical differences between these combination types are often ignored in comparative research 12,15, but addressing them is key to drawing conclusions about homologous features in the human/ape communication system 16.
A neurobiological perspective underscores the plausibility of this differentiation: in multi-sensory communication, the recipient is forced to integrate incoming information from at least two different sensory channels that are initially processed in different brain regions. Visual and auditory pathways, for instance, are largely separate before converging in the ventrolateral prefrontal cortex (vlPFC) onto neurons that represent higher-order multisensory representations of signals, such as vocalizations and their associated facial expressions 17. This need to integrate may make it more likely that the communicative act is accurately processed, suggesting that multi-sensory communication serves to ensure that a signal is understood 18,19.
The multi-articulator case explicitly takes the signaller’s perspective. In contexts or situations requiring a multi-articulator act, the signaller is forced to execute (at least) two different motor commands in different articulators. For instance, neurobiological research on human multimodal processing suggests that the integration between speech and gesture depends on context and is under voluntary control rather than obligatory 20: co-speech gestures may provide additional information depending on the communicative nature of the situation (e.g. whether or not there is shared common ground between the signaller and the recipient) 21, as well as on gaze direction (i.e. whether or not the signaller’s gaze is directed at the addressee) 22. Together with rich evidence that multi-articulator acts serve to refine messages 1,23,24, this suggests they are of particular relevance when outcomes (due to lower degrees of familiarity and social tolerance) are less predictable.
These neurobiological considerations suggest that multi-sensory and multi-articulator acts may serve different functions. Comparative researchers have recently begun to study the function of great apes’ multi-component communication via observational research, focusing on bi-articulatory gesture-vocal combinations 12–14,25 and considering two major candidate functions: redundancy and refinement 9,15,19. The redundant signal (hereafter ‘redundancy’) hypothesis implies that the different components convey the same information 9, facilitating the detection and processing of a message 18. In contrast, the refinement hypothesis posits that the presence of one signal component may provide the context in which a receiver can interpret and respond to the second, with the combinations serving to disambiguate meanings (i.e. functions) when these partly overlap 15,19. An important shortcoming of previous work, however, was that researchers did not tease apart production and perception of communicative acts, or whether constituent parts varied with regard to articulators (body parts) or sensory channels (modalities). Teasing these apart will allow us to gain more insight into the function of multi-component communication in great apes.
The aim of this study was to disentangle multi-sensory and multi-articulator communication, and to study the sources of variation in production and outcomes in the great ape genus most suitable for this avenue of research: orang-utans (Pongo spp.). First, the orang-utan populations of Borneo (Pongo pygmaeus wurmbii) and Northwest-Sumatra (i.e. Suaq and Ketambe, Pongo abelii) differ considerably in sociability 26, cf. 27 and social tolerance (Bornean orang-utans become more stressed in group settings than Sumatrans 28). The consistently higher level of sociability in Sumatrans may lead to a greater need to refine messages conveyed in signals, and thus to more multi-articulator communicative acts. Second, in contrast to natural environments, captive orang-utans are always in close proximity and spend more time on the ground 9,15,19, thus reducing the need for multi-sensory signals. Their sociability is also not constrained by food availability 29. In the wild, individuals may have fewer interaction opportunities and communication is hampered by arboreality and obscuring vegetation, whereas captivity enables frequent interactions and short-distance communication with conspecifics other than the mother. Third, the pairing of social partners (interaction dyad) also affects features of social interactions, irrespective of captive-wild and Bornean-Sumatran contrasts, e.g. due to differences in social tolerance and familiarity 30,31. Although mothers are the most important communication partner of infant orang-utans 10,32,33, temporary associations during feeding or travelling occur, particularly if food is abundant 34,35, thus providing opportunities for social interactions beyond the mother-infant unit 36–38. We expect that the reduced familiarity of these dyads, and thus the greater uncertainty of tolerance, would lead them to use more multi-articulator signals.
In the present study, we examined close-range communicative interactions of Bornean and Sumatran orang-utans in two wild populations and five zoos. While focal units in this study consisted of mothers and their dependent offspring, we also examined interactions with and among other members of the group/temporary association. By examining species differences related to differential sociability on one hand, and recipient-dependent factors on the other, we aimed to evaluate two major hypotheses explaining multimodal signal function discussed for great apes: redundancy and refinement. Since there are virtually no studies applying a similar comparative approach to any primate species, our predictions are largely exploratory.
If multi-sensory and multi-articulator communicative acts indeed function as ‘backup signals’, we first predicted that these acts (comprising e.g. visual plus auditory components) are more effective (i.e. more likely to result in the apparently intended outcome) than their single (e.g. purely visual) constituent parts, but have little or no effect on the type of outcome (i.e. dominant versus sub-dominant interaction outcome, see below). Second, we predicted that multi-component acts would be more common in the wild, where semi-solitariness limits interaction opportunities and communication is hampered by arboreality and obscuring vegetation, than in captive settings 15,19,39.
If, on the other hand, multi-component acts serve primarily to refine messages, we predicted that they would be used more often for sub-dominant communicative goals (i.e. to reduce ambiguity). For instance, if a certain communicative act is most frequently (> 50%) produced to solicit food transfers, but also occurs in other contexts (e.g. grooming, co-locomotion), we predict that this act is accompanied by other constituent parts (e.g. a facial expression or gaze) more often in non-begging than in begging interactions. Second, we predicted that multi-component acts would be more common in settings and interactions with higher uncertainty and more differentiated interactions in the social environment 12,14,15. Specifically, we expected species- and dyad-dependent effects of setting: although wild individuals may use more acts associated with gaze than their captive counterparts (due to lower degrees of social tolerance and thus less predictable outcomes), this effect should be more pronounced in Sumatrans (i.e. the more sociable population) and in interactions beyond the mother-offspring unit.
A secondary aim was to examine the sources of variation in the individual sensory modalities and articulators that constitute multi-component communication in orang-utans. Inevitably, some modalities and articulators are predicted to be involved in the communication process more than others: in natural settings, dense vegetation in the canopy means that there are fewer opportunities for the direct lines of sight needed for visual communication. As arboreal apes, orang-utans are thus thought to rely less on purely visual signals than on other (e.g. audio-visual) communicative means 34,40.