Results of Experiment 1: Nameability of proportionally presented gesture and speech fragments
The twenty gestures (mean length = 1771.00 ms, SD = 307.98) and speech segments (mean length = 447.08 ms, SD = 93.48) were each divided into fragments of 5 different durations relative to their minimal lexical length, i.e., 0.5, 0.75, 1, 1.25, and 1.5 DP/IP. For each gesture and speech fragment, the last answer a participant gave was taken as the response indicating comprehension. Nameability was calculated as the percentage of participants who provided the most commonly used label.
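For illustration, the nameability measure can be computed in a few lines of Python. This is a minimal sketch assuming each fragment's responses are collected as a simple list of labels; the data layout and names are hypothetical, not the study's actual pipeline.

```python
from collections import Counter

def nameability(labels):
    """Proportion of participants who produced the modal (most common) label.

    `labels` holds the final answer given by each participant for one
    gesture or speech fragment (hypothetical data layout).
    """
    _, n_modal = Counter(labels).most_common(1)[0]
    return n_modal / len(labels)

# Example: 7 of 10 participants converge on "cut" -> nameability = 0.70
print(nameability(["cut"] * 7 + ["snip"] * 2 + ["open"]))
```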
A one-way analysis of variance (ANOVA) showed a significant main effect of fragment duration on gesture nameability (F(4) = 7.630, p <.001, ηp2 =.135). Overall, gesture nameability increased with presentation time (Figure 1C): it was smallest when gestures were presented at 0.5 DP (mean =.35, SD =.16) and largest when gestures were presented at 1.5 DP (mean =.56, SD =.23), with intermediate values at 0.75 DP (mean =.40, SD =.17), 1 DP (mean =.46, SD =.20), and 1.25 DP (mean =.52, SD =.21).
A similar pattern was found across the five speech conditions (0.5 IP: mean =.41, SD =.20; 0.75 IP: mean =.53, SD =.18; 1 IP: mean =.64, SD =.19; 1.25 IP: mean =.77, SD =.14; 1.5 IP: mean =.86, SD =.12) (Figure 1D), with a one-way ANOVA showing a significant main effect of fragment duration on speech nameability (F(4) = 46.226, p <.001, ηp2 =.487).
Importantly, there was a significant Pearson correlation between gesture fragment length and gesture nameability (Pearson's r =.996, p <.001) (Figure 1C), as well as between speech fragment length and speech nameability (Pearson's r =.999, p <.001) (Figure 1D). Together, these results indicate that the lexical information of both gesture and speech unfolded proportionally over processing time, as measured relative to the semantic discrimination point.
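A minimal Python sketch of these two analyses (a one-way ANOVA across duration conditions and a Pearson correlation of duration with mean nameability), using simulated per-item nameabilities for illustration; scipy's between-groups `f_oneway` is a stand-in and may differ from the exact model used in the study.

```python
import numpy as np
from scipy import stats

# Hypothetical layout: rows = 20 items, columns = the five duration
# conditions (0.5, 0.75, 1, 1.25, 1.5 DP); values = per-item nameability.
rng = np.random.default_rng(0)
durations = np.array([0.5, 0.75, 1.0, 1.25, 1.5])
nameability = np.clip(
    0.35 + 0.2 * (durations - 0.5) + rng.normal(0, 0.05, (20, 5)), 0, 1)

# One-way ANOVA across the five duration conditions
f, p = stats.f_oneway(*nameability.T)
print(f"F = {f:.3f}, p = {p:.3g}")

# Pearson correlation between fragment duration and mean nameability
r, p_r = stats.pearsonr(durations, nameability.mean(axis=0))
print(f"r = {r:.3f}, p = {p_r:.3g}")
```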
Results of Experiment 2: ERP evidence of the dynamic neural stages for proportionally presented gesture-speech integration
Behavioral results
Three of the five gesture/speech fragments were used: 0.75 DP/IP (before_DP/IP), DP/IP, and 1.25 DP/IP (after_DP/IP). Gesture fragments were presented as primes and were immediately followed by speech fragments. Two experimental factors were manipulated: gender congruency (e.g., a gesture performed by a man combined with a male voice, or a gesture performed by a woman combined with a male voice) and semantic congruency (e.g., a man or a woman performing a 'cut' gesture while saying the Mandarin word '剪jian3 (cut)', or a man or a woman performing a 'spray' gesture while saying '剪jian3 (cut)').
A 3 (gesture fragments) * 3 (speech fragments) * 2 (semantic congruency) repeated-measures ANOVA revealed a significant main effect of semantic congruency (F(1, 29) = 38.618, p <.001, ηp2 =.57), with longer reaction times (RTs) for semantically incongruent (mean = 561.60 ms, SD = 65.89) than semantically congruent (mean = 553.60 ms, SD = 62.75) trials. There was also a significant three-way interaction among gesture fragments, speech fragments, and semantic congruency (F(3.655, 105.995) = 2.556, p =.048, ηp2 =.081), reflecting that the magnitude of the semantic congruency effect, used as an index of gesture-speech integration, was modulated by the interplay between the semantic constraints imposed by gesture and the speech representation presented.
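Below is a hedged sketch of this design using statsmodels' AnovaRM on simulated long-format RT data; the degrees of freedom it reports are uncorrected, whereas the values reported above are Greenhouse-Geisser corrected. The same 3 * 3 * 2 model is reused for the ERP amplitudes in each time window analyzed later.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: one mean RT per subject per cell of the
# 3 (gesture fragment) x 3 (speech fragment) x 2 (congruency) design.
rng = np.random.default_rng(1)
rows = []
for subj in range(1, 31):
    for g in ["before_DP", "DP", "after_DP"]:
        for s in ["before_IP", "IP", "after_IP"]:
            for c in ["congruent", "incongruent"]:
                rt = 553 + (8 if c == "incongruent" else 0) + rng.normal(0, 20)
                rows.append((subj, g, s, c, rt))
df = pd.DataFrame(rows, columns=["subject", "gesture", "speech",
                                 "congruency", "rt"])

# Repeated-measures ANOVA; dfs are uncorrected here, unlike the
# Greenhouse-Geisser-corrected values reported in the text.
print(AnovaRM(df, depvar="rt", subject="subject",
              within=["gesture", "speech", "congruency"]).fit())
```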
The proportionally increased presentation of gesture information did not affect the gesture-speech semantic congruency effect, as reflected by a nonsignificant interaction of gesture fragments by semantic congruency (F(1.804, 52.314) =.879, p =.411, ηp2 =.029). Likewise, the nonsignificant interaction of speech fragments by semantic congruency (F(1.925, 55.826) =.791, p =.454, ηp2 =.027) indicated that the semantic congruency effect was not modulated by the amount of speech information presented.
Additionally, simple effects analysis with Bonferroni correction showed a significant RT difference between the semantically incongruent and congruent conditions in the before_DP/before_IP (F(1, 29) = 7.369, p =.011, ηp2 =.203), DP/IP (F(1, 29) = 14.13, p =.001, ηp2 =.328), after_DP/IP (F(1, 29) = 7.141, p =.012, ηp2 =.198) and after_DP/after_IP (F(1, 29) = 22.617, p <.001, ηp2 =.438) conditions (Figure 2A).
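A sketch of one such Bonferroni-corrected simple-effects comparison on simulated per-subject mean RTs; a paired t test is used here for simplicity (for a single-df contrast, F = t², so this is equivalent to the F tests reported above).

```python
import numpy as np
from scipy import stats

# Hypothetical per-subject mean RTs (n = 30) for one gesture x speech cell
rng = np.random.default_rng(2)
congruent = rng.normal(553, 60, 30)
incongruent = congruent + rng.normal(8, 10, 30)

# Paired comparison; for a single-df contrast, F = t**2
t, p = stats.ttest_rel(incongruent, congruent)
p_bonf = min(p * 9, 1.0)  # Bonferroni correction over the 9 cells tested
print(f"t(29) = {t:.2f}, p = {p:.4f}, Bonferroni-corrected p = {p_bonf:.4f}")
```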
For the control factor of gender congruency, there was also a significant main effect (F(1, 29) = 84.403, p <.001, ηp2 =.744), with longer RTs when speech and gesture were produced by individuals of different genders (mean = 570.80 ms, SD = 66.39) than by individuals of the same gender (mean = 545.26 ms, SD = 63.08). Nonsignificant interactions indicated that gender did not influence the semantic information conveyed by either gesture or speech during gesture-speech integration: gesture fragments by gender congruency (F(1.995, 57.867) =.382, p =.684, ηp2 =.013); speech fragments by gender congruency (F(1.768, 51.277) =.015, p =.978, ηp2 =.001); and the three-way interaction of gesture fragments, speech fragments and gender congruency (F(3.650, 105.849) = 2.044, p =.100, ηp2 =.066) (Figure 2B).
ERPs
Figure 3 presents the grand-average ERPs elicited by the nine experimental conditions. Overall, semantically incongruent gesture-speech pairs elicited larger negative ERPs (mean = -1.059, SE =.172) than pairs in which the prime gesture carried information congruent with the subsequent speech (mean = -.934, SE =.170). Based on the averaged ERP waveform, three components were identified and further analyzed: an early N1 effect from 0-100 ms (18), the N400 component from 300-500 ms (10, 26), and an LPC from 500-800 ms (27) (Figure 3). The gender congruency factor was not analyzed further, as paired t tests showed no significant effect of gender congruency on the amplitude of the N1 (t(29) = -1.325, p =.196), N400 (t(29) = 1.068, p =.294), or LPC (t(29) =.352, p =.727).
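The component amplitudes entering these analyses are mean voltages within each window. A minimal numpy sketch, assuming a hypothetical epochs array time-locked to speech onset (shapes and sampling rate are illustrative):

```python
import numpy as np

# Hypothetical epochs array: (n_trials, n_channels, n_times), 500 Hz,
# time zero at speech onset.
sfreq = 500
times = np.arange(-0.2, 0.8, 1 / sfreq)
epochs = np.random.default_rng(3).normal(0, 1, (60, 32, times.size))

def mean_amplitude(epochs, times, tmin, tmax):
    """Mean voltage within [tmin, tmax), averaged over trials;
    returns one value per channel."""
    mask = (times >= tmin) & (times < tmax)
    return epochs[:, :, mask].mean(axis=(0, 2))

n1 = mean_amplitude(epochs, times, 0.0, 0.1)    # N1: 0-100 ms
n400 = mean_amplitude(epochs, times, 0.3, 0.5)  # N400: 300-500 ms
lpc = mean_amplitude(epochs, times, 0.5, 0.8)   # LPC: 500-800 ms
```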
N1 effect: 0-100 ms time window
A 3 (gesture fragments) * 3 (speech fragments) * 2 (semantic congruency) ANOVA on the early ERP component (0-100 ms after speech onset) revealed a significant main effect of gesture fragments (F(1.582, 45.881) = 31.947, p <.001, ηp2 =.524), with the largest negative amplitude in the after_DP condition (mean = -1.822, SE =.326), the smallest in the before_DP condition (mean = -.879, SE =.251), and the DP condition in between (mean = -1.418, SE =.309) (Figure 4C). A further 3 (gesture fragments) * 6 (ROIs) ANOVA indicated that the main effect of gesture fragments peaked over anterior and central sites in both hemispheres: LA (F(2, 28) = 32.249, p <.001, ηp2 =.697), RA (F(2, 28) = 28.650, p <.001, ηp2 =.672), LC (F(2, 28) = 19.430, p <.001, ηp2 =.581) and RC (F(2, 28) = 17.067, p <.001, ηp2 =.549). A main effect of gesture fragments was also found at the midline electrodes (F(2, 28) = 28.822, p <.001, ηp2 =.673) (Figure 4A).
There were no such early effects of speech fragments: the 3 (gesture fragments) * 3 (speech fragments) * 2 (semantic congruency) ANOVA revealed no main effect of speech fragments (F(1.958, 56.775) =.014, p =.985, ηp2 <.001) (Figure 4B). There was neither a significant main effect of semantic congruency (F(1, 29) =.975, p =.332, ηp2 =.033) nor an interaction of semantic congruency with gesture fragments (F(1.682, 48.780) =.192, p =.788, ηp2 =.007) or speech fragments (F(1.965, 56.973) =.505, p =.603, ηp2 =.017). Taken together, these results indicate that the early ERP effect was driven by an increasing top-down gesture constraint resulting from the proportionally added lexical representation.
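For the ROI-level analyses, per-channel window amplitudes are first averaged within each of the six regions (LA, RA, LC, RC, LP, RP). The channel groupings below are hypothetical placeholders for the study's actual montage:

```python
import numpy as np

# Hypothetical per-channel N1 mean amplitudes (e.g., from the sketch above)
channel_amps = np.random.default_rng(4).normal(-1.0, 0.5, 32)

# Hypothetical ROI groupings (channel indices); the study's actual montage
# and assignments will differ.
rois = {
    "LA": [0, 1, 2],    "RA": [3, 4, 5],
    "LC": [6, 7, 8],    "RC": [9, 10, 11],
    "LP": [12, 13, 14], "RP": [15, 16, 17],
}

# One mean amplitude per ROI, which would then enter the
# 3 (gesture fragments) * 6 (ROIs) ANOVA
roi_means = {name: channel_amps[idx].mean() for name, idx in rois.items()}
print(roi_means)
```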
N400 effect: 300-500 ms time window
For the 300-500 ms epochs, the 3 (gesture fragments) * 3 (speech fragments) * 2 (semantic congruency) ANOVA revealed significant main effects of gesture fragments (F(1.979, 57.394) = 11.646, p <.001, ηp2 =.287) and semantic congruency (F(1, 29) = 11.834, p =.002, ηp2 =.290), as well as a significant three-way interaction (F(6.955, 201.691) = 2.202, p =.036, ηp2 =.071). There was no main effect of speech fragments (F(1.895, 54.944) =.785, p =.455, ηp2 =.026), nor a two-way interaction of gesture fragments by semantic congruency (F(1.907, 55.290) = 2.218, p =.121, ηp2 =.071) or of speech fragments by semantic congruency (F(1.964, 56.960) =.201, p =.815, ηp2 =.007). These results suggest that the neural correlate of gesture-speech integration, indexed by a significant N400 effect between incongruent and congruent pairs, was modulated by the interplay between the effects of gesture fragments and speech fragments.
Additionally, separate 3 (gesture fragments) * 3 (speech fragments) * 2 (semantic congruency) ANOVAs for each of the six ROIs demonstrated a significant N400 effect in the before_DP/before_IP condition in the LA (F(1, 29) = 4.186, p =.050, ηp2 =.126), RA (F(1, 29) = 4.227, p =.049, ηp2 =.127), LC (F(1, 29) = 4.175, p =.050, ηp2 =.126) and RC (F(1, 29) = 4.402, p =.045, ηp2 =.132); in the DP/IP condition in the LC (F(1, 29) = 5.477, p =.026, ηp2 =.159), LP (F(1, 29) = 6.450, p =.017, ηp2 =.182) and RP (F(1, 29) = 5.056, p =.032, ηp2 =.148); in the after_DP/IP condition in the LA (F(1, 29) = 8.277, p =.007, ηp2 =.222), RA (F(1, 29) = 11.961, p =.002, ηp2 =.292), LC (F(1, 29) = 5.582, p =.025, ηp2 =.161) and RC (F(1, 29) = 6.355, p =.017, ηp2 =.180); and in the after_DP/after_IP condition in the LA (F(1, 29) = 13.795, p =.001, ηp2 =.322), RA (F(1, 29) = 12.402, p =.001, ηp2 =.300), LC (F(1, 29) = 12.344, p =.001, ηp2 =.299), RC (F(1, 29) = 9.240, p =.005, ηp2 =.242), LP (F(1, 29) = 6.683, p =.015, ηp2 =.187) and RP (F(1, 29) = 6.397, p =.017, ηp2 =.181) (Figure 5A).
For the average of the midline electrodes, a separate 3 (gesture fragments) * 3 (speech fragments) * 2 (semantic congruency) ANOVA indicated a significant three-way interaction (F(3.642, 105.627) = 3.240, p =.018, ηp2 =.101), with no significant interaction between gesture fragments and semantic congruency (F(1.603, 46.487) =.477, p =.582, ηp2 =.016) or between speech fragments and semantic congruency (F(1.975, 57.273) =.332, p =.716, ηp2 =.011). Simple effects analysis showed a significant semantic congruency effect in the before_DP/before_IP (F(1, 29) = 14.588, p =.001, ηp2 =.335), DP/IP (F(1, 29) = 7.008, p =.013, ηp2 =.195), and after_DP/after_IP (F(1, 29) = 16.329, p <.001, ηp2 =.360) conditions (Figure 5B). These results indicate that significant gesture-speech integration, as reflected by the N400 effect, occurred in conditions in which the top-down gesture constraint was balanced with the bottom-up speech presentation.
Most importantly, over the midline electrodes the N400 amplitude decreased linearly (became less negative) across the 9 experimental manipulations (3 gesture fragments * 3 speech fragments) as the amount of information conveyed in the two modalities gradually increased, with significant correlations in both the semantically congruent (Pearson's r =.934, p <.001) and semantically incongruent (Pearson's r =.831, p =.006) conditions. Specifically, the largest negative N400 amplitude was found in the before_DP/before_IP condition, and the smallest negative N400 amplitude was observed in the after_DP/after_IP condition. This suggests that, in addition to the N400 effect, the N400 amplitude itself may be related to the probability of the activated representation and the degree of intersecting semantic features between gesture and speech.
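A sketch of this amplitude-by-condition correlation, with hypothetical midline N400 means ordered by increasing combined gesture and speech information (values invented to match the reported trend, not the study's data):

```python
from scipy import stats

# Hypothetical midline N400 means (in microvolts) for the 9 cells, ordered
# by increasing combined gesture + speech information; values invented to
# match the reported trend (less negative as information grows).
condition_rank = list(range(1, 10))
n400_means = [-1.9, -1.7, -1.6, -1.4, -1.2, -1.1, -0.9, -0.7, -0.5]

r, p = stats.pearsonr(condition_rank, n400_means)
print(f"r = {r:.3f}, p = {p:.4f}")  # amplitude shrinks toward zero
```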
LPC effect: 500-800 ms time window
The 3 (gesture fragments) * 3 (speech fragments) * 2 (semantic congruency) ANOVA for the 500-800 ms epochs showed a significant main effect of semantic congruency (F(1, 29) = 6.915, p =.014, ηp2 =.193). There was no main effect of gesture fragments (F(1.894, 54.940) = 1.255, p =.292, ηp2 =.041) or speech fragments (F(1.760, 51.035) = 1.490, p =.235, ηp2 =.049), nor a two-way interaction of gesture fragments by semantic congruency (F(1.553, 45.044) =.614, p =.506, ηp2 =.021) or of speech fragments by semantic congruency (F(1.864, 54.049) =.208, p =.798, ηp2 =.007).
However, there was a significant three-way interaction among gesture fragments, speech fragments, and semantic congruency (F(3.390, 98.297) = 4.226, p =.005, ηp2 =.127). Simple effects analysis with Bonferroni correction showed a significant ERP difference between the semantically congruent and incongruent conditions in the before_DP/before_IP (F(1, 29) = 16.458, p <.001, ηp2 =.362), DP/IP (F(1, 29) = 6.834, p =.014, ηp2 =.191) and after_DP/after_IP (F(1, 29) = 12.812, p =.001, ηp2 =.306) conditions.
ROI-wise analyses demonstrated a significant LPC effect in the before_DP/before_IP condition in the LA (F(1, 29) = 16.469, p <.001, ηp2 =.362), RA (F(1, 29) = 9.745, p =.004, ηp2 =.252), LC (F(1, 29) = 18.330, p <.001, ηp2 =.387), RC (F(1, 29) = 10.543, p =.003, ηp2 =.267), LP (F(1, 29) = 5.757, p =.023, ηp2 =.166) and RP (F(1, 29) = 7.756, p =.009, ηp2 =.211); in the DP/IP condition in the LC (F(1, 29) = 9.479, p =.005, ηp2 =.246), RC (F(1, 29) = 4.231, p =.049, ηp2 =.127) and LP (F(1, 29) = 11.504, p =.002, ηp2 =.284); and in the after_DP/after_IP condition in the LA (F(1, 29) = 15.737, p <.001, ηp2 =.352), RA (F(1, 29) = 12.644, p =.001, ηp2 =.304), LC (F(1, 29) = 13.102, p =.001, ηp2 =.311), RC (F(1, 29) = 8.582, p =.007, ηp2 =.228), LP (F(1, 29) = 9.074, p =.005, ηp2 =.238) and RP (F(1, 29) = 5.948, p =.021, ηp2 =.170) (Figure 6A).
A gesture fragments * speech fragments * semantic congruency ANOVA on the average of the midline electrodes also revealed a significant three-way interaction (F(3.436, 99.652) = 4.025, p =.004, ηp2 =.122). Simple effects analysis with Bonferroni correction found a significant semantic congruency effect in the before_DP/before_IP (F(1, 29) = 14.873, p =.001, ηp2 =.339), DP/IP (F(1, 29) = 6.003, p =.021, ηp2 =.172), and after_DP/after_IP (F(1, 29) = 15.150, p =.001, ηp2 =.343) conditions (Figure 6B). This finding implies that the LPC associated with gesture-speech integration emerged only when neither modality carried dominant information.