We show here that understanding of speech in noise improves significantly after a short and simple training session when auditory sentences are combined with a tactile input that corresponds to their low frequencies, including the fundamental frequency (f0). The effect that we report for the trained condition (though using novel sentences), a mean decrease of 10 dB in the speech recognition threshold (SRT), is profound, as it represents maintaining the same performance with background noise perceived as twice as loud [47]. The current work expands our previous findings [28], which showed that understanding degraded speech against background noise improves immediately, by a mean of 6 dB in SRT and without any training, when the speech is accompanied by corresponding, synchronized low frequencies delivered as vibrotactile input to the fingertips. Interestingly, in both of our studies we found a very similar mean score in the audio-tactile condition with matching vibration (AM before training in the current study), with a mean group SRT of approximately 12 dB. The sentence database, the applied vocoding algorithm and the background noise were identical in both experiments. Obtaining similar scores in a total of almost 30 participants strengthens the interpretations we propose. Interestingly, and as we expected, the AM training applied in the current study also led to improved speech-in-noise understanding in the unisensory, auditory-only test condition.
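The link between a 10 dB shift and a doubling of perceived loudness follows from the standard sone-scale approximation (used here purely as an illustrative back-of-the-envelope check; see [47] for the underlying psychoacoustics):

```latex
% Sone-scale approximation: loudness N (in sones) at loudness level L (in phons)
N(L) = 2^{(L-40)/10}, \qquad
\frac{N(L+10)}{N(L)} = 2^{\frac{(L+10)-40}{10} - \frac{L-40}{10}} = 2
```

That is, noise presented 10 dB higher is perceived as roughly twice as loud, so a 10 dB lower SRT corresponds to maintaining the same intelligibility with noise perceived as twice as loud.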
These main outcomes of the study have several implications, both for basic science and in terms of their potential practical applications. Intriguingly, we show here that the improvement in speech comprehension through audio-tactile training (~10 dB in SRT) was similar to or larger than that found in audio-visual speech-in-noise settings when performance is compared to the auditory-only condition [5, 48, 49, 50, 51]; the improvements in SRT reported there range from 3 to 11 dB, although the applied methodology and language content varied. The implications of this finding for basic science are discussed further in the text. In addition, showing improvement in speech understanding in both the multisensory and the unisensory test settings is especially important for the development of novel technologies for the general public and of rehabilitation programs for patients with hearing problems. We discuss this in more detail below.
Multisensory perceptual learning
Findings of our study are an example of rapid perceptual learning, which has already been shown for various acoustic contexts and distortions of the auditory signal, including natural speech presented in background noise as well as synthetically manipulated (vocoded or time-compressed) speech [25, 52, 53, 54]. Furthermore, our results are in agreement with the emerging scientific literature demonstrating that learning and memory of sensory/cognitive experiences are more efficient when the applied inputs are multisensory [22, 23]. The experimental procedure was also specifically designed to benefit from a fundamental rule of multisensory integration, namely the inverse-effectiveness rule [1, 55, 56, 57]. Although mainly demonstrated for more basic sensory inputs, the rule predicts that multisensory enhancement, i.e. the benefit of adding information through an additional sensory channel, is especially pronounced under low signal-to-noise conditions and when learning novel tasks [56]. In our study the auditory speech signal was new to the participants, degraded (vocoded), presented against background noise and in their non-native language. All these manipulations created a low signal-to-noise context and rendered the task of understanding the sentences challenging, thereby increasing the chance of improving performance by adding a specifically designed tactile input.
At the same time, the study conditions were ecologically valid, in that we aimed to recreate a challenging everyday acoustic situation, encountered by both hearing-impaired patients and normally hearing people, of being exposed to two auditory streams at the same time. We showed here that auditory stream segregation, and specifically focusing on one speaker, can be facilitated by adding vibrotactile stimulation that is congruent with the target speaker (see [5] for a discussion of similar benefits of audio-visual binding).
Design of the SSD – considerations for the current study
To deliver tactile inputs, we developed our own audio-to-touch SSD. In the ever-growing body of literature it has been suggested that SSDs can extend the benefits of multisensory training even further than described above, since they convey input from one modality via another in a way that is specifically tailored to the neuronal computations characteristic of the original sensory modality [39, 38, 37, 36, 40, 58, 59, 60, 61]. In our study we used the SSD to deliver, through touch, an input complementary to the auditory speech that nevertheless maintained features typical of the auditory modality, as vibrations are likewise a periodic signal fluctuating in frequency and intensity.
At the same time, the applied frequency range of the inputs was detectable by the Pacinian corpuscles, tactile receptors that are most densely represented on the fingertips and are most sensitive to frequencies in the range of 150-300 Hz (responding up to 700-1000 Hz) [62]. Specifically, the tactile vibration provided on the fingertips of our participants comprised the part of the speech signal at and below the extracted fundamental frequency. Access to this low-frequency aspect of the temporal fine structure of speech is specifically reduced in patients with sensorineural hearing loss, including those using a cochlear implant. It has been shown that the lack of this input profoundly impairs speech understanding in challenging acoustic situations, especially with several competing speakers [63, 64, 65].
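For readers interested in the signal chain, the listing below is a minimal sketch of one way such a vibrotactile signal could be derived, assuming a pYIN-based f0 estimate and a Butterworth low-pass filter; the specific estimator, filter order and cut-off are illustrative assumptions and not the exact pipeline used in our set-up.

```python
# Illustrative sketch (assumed parameters, not the study's exact pipeline):
# keep only the speech content at and below an estimated fundamental
# frequency (f0) to drive the fingertip vibration.
import numpy as np
import librosa
from scipy.signal import butter, sosfiltfilt

def make_vibrotactile_signal(wav_path, f0_min=60.0, f0_max=300.0):
    y, sr = librosa.load(wav_path, sr=None, mono=True)

    # Estimate f0 over time with pYIN; unvoiced frames come back as NaN.
    f0, voiced_flag, _ = librosa.pyin(y, fmin=f0_min, fmax=f0_max, sr=sr)
    cutoff_hz = np.nanmax(f0) if np.any(voiced_flag) else f0_max

    # Low-pass the speech at (roughly) the f0 ceiling so the vibration
    # carries only the fundamental and the energy below it.
    sos = butter(4, cutoff_hz, btype="low", fs=sr, output="sos")
    return sosfiltfilt(sos, y), sr
```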
Apart from the current and previous work of our lab, one other research group has shown that adding low-frequency input delivered to the fingertips can improve comprehension of auditory speech presented in background noise in normal-hearing individuals [26]. The main advantage of our approach was, however, that to estimate the SNR for the various task conditions we applied an adaptive procedure rather than fixed SNRs, thereby avoiding floor and ceiling effects [46]. In addition, we showed a larger benefit of adding the specifically designed vibrotactile input than the other group did, possibly because the stimuli we used were in the participants' non-native language, which further enhanced the effect of the inverse-effectiveness rule. The use of the participants' native language, which made the task easier, may also be the reason why Fletcher and colleagues failed to show a benefit of adding the vibrotactile input before training (a benefit we did observe in our previous work [28]).
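As an illustration of the adaptive approach, the sketch below implements a simple one-down/one-up staircase that converges on the SNR yielding roughly 50% correct sentence repetition; the tracking rule, step size and number of trials are assumptions for illustration and do not reproduce the exact procedure of [46].

```python
# Illustrative one-down/one-up adaptive staircase for estimating the SRT
# (assumed rule and step sizes; not the exact procedure used in the study).
from statistics import mean

def run_staircase(present_trial, start_snr_db=10.0, step_db=2.0, n_trials=20):
    """present_trial(snr_db) plays a sentence at the given SNR and returns
    True if the listener repeated it correctly."""
    snr = start_snr_db
    reversals, last_direction = [], None
    for _ in range(n_trials):
        correct = present_trial(snr)
        direction = -1 if correct else +1   # harder after a hit, easier after a miss
        if last_direction is not None and direction != last_direction:
            reversals.append(snr)           # record the SNR at each reversal
        last_direction = direction
        snr += direction * step_db
    # SRT estimate: mean SNR over the most recent reversals.
    return mean(reversals[-6:]) if len(reversals) >= 2 else snr
```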
The audio-tactile interplay
Interestingly, in the current study we also saw improvement in speech understanding in the control test condition, namely when the auditory sentences were paired with vibrotactile input corresponding to a different sentence than the one presented through audition, i.e., with non-matching vibration (AnM). The improvement was, however, far less robust (a mean of 4 dB in SRT) and statistically much weaker than the improvement in the trained AM condition (a mean of almost 10 dB in SRT). The reason we introduced this control test condition was to acquire more information about the mechanisms of multisensory enhancement. Interestingly, we found that the group scores before training for the two conditions combining an auditory and a tactile input (matching and non-matching) were almost the same, and they were also correlated (r = 0.5). We hypothesize that at this point the participants were still becoming familiarized with the study set-up and the vibrotactile input, and that only after a short, dedicated training session did they “realize” the true benefit of the matching tactile f0. It is also possible that presenting these two conditions next to one another in the pre-training session caused some confusion.
In our future studies we would like to further investigate the role of the vibrotactile input delivered to two fingertips and the audio-tactile binding effects. It remains to be studied whether it is crucial that the amplitude and frequency fluctuations delivered as vibrations follow the auditory signal precisely, or whether non-matching vibrations whose spectro-temporal characteristics are nevertheless clearly distinct from the background speech would, after training, provide the same benefit. To answer this question, our future work will include two additional training groups: one exposed to the degraded speech input with concurrent vibrotactile stimulation representing a fundamental frequency that does not match the auditory sentence (an alternative multisensory training), and one receiving auditory-only training (a unisensory training).
Implications for rehabilitation
The findings of our study have implications for auditory rehabilitation programs for the hearing impaired (including the elderly). Besides providing a substantial improvement in speech recognition in noise, our set-up is intuitive to the user and thus requires minimal cognitive effort. In addition, it demands relatively little in terms of technical and time resources. We are working on reducing the size of the device even further. This contrasts with the more cumbersome tactile solutions currently available on the market [43, 66].
Interestingly, after the audio-tactile training the participants of our study also improved when speech understanding was tested through the auditory channel alone, with the tactile stimulation removed. These findings represent transfer of a short multisensory training not only to novel multisensory stimuli but also to the unisensory modality (audio only). The scores for the auditory-only and AM conditions after training were also strongly correlated (which was not the case before training), suggesting common learning mechanisms. We were hoping to see such an effect, as this outcome has crucial implications for the development of rehabilitation programs for hearing-impaired patients, including current and future users of hearing aids and cochlear implants. Our in-house SSD and the multisensory training procedure can potentially be applied in a population of HA/CI users to help them progress in their auditory (i.e., unisensory) performance. A similar effect of multisensory training boosting unisensory auditory and visual speech perception was shown by Bernstein and colleagues [20] and by Eberhard and colleagues [25], respectively, although the language tasks applied there were more basic than repeating whole sentences. At the same time, for future hard-of-hearing candidates for cochlear implantation, we believe that both unisensory tactile and multisensory audio-tactile training can be applied using our set-up, with the aim of “preparing” the auditory cortex for future processing of its natural sensory input [22, 67]. We believe this can be achieved based on a number of studies from our lab showing that specialization of sensory brain regions, such as the (classically termed) visual or auditory cortex, can also emerge following specifically designed training with inputs from another modality, provided those inputs preserve the computational features specific to the given sensory brain area. We have referred to this type of brain organization in our work as Task-Selective and Sensory-Independent (TSSI) [21, 60].
A revised critical period theory
We show here, with our device, significant multisensory enhancement of speech-in-noise comprehension, at a level comparable to or higher than that reported when auditory speech in noise is complemented with cues from lip/speech reading [5, 48, 49, 50, 51]. This is a striking finding, as synchronous audio-visual speech information is what we as humans are exposed to from the earliest years of development and throughout life. The brain networks for speech processing that combine higher-order and sensory brain structures, often involving auditory and visual cortices, are also well established. At the same time, exposure to an audio-tactile speech input is an entirely novel experience. We argue, therefore, that one can establish in adulthood a new coupling between a given neuronal computation and an atypical sensory modality that was never before used for encoding that type of information. We also show that this coupling can be leveraged to improve performance through a tailored SSD training. Quite remarkably, this can be achieved even for a highly complex and dynamic signal such as speech. The current study thus provides further evidence supporting our new conceptualization of critical/sensitive periods in development, as presented in our recent review paper [22]. Although, in line with classical assumptions, we agree that brain plasticity spontaneously decreases with age, we also believe that it can be reignited across the lifespan, even without exposure to certain unisensory or multisensory experiences during childhood. Several studies from our lab and other research groups, mainly involving patients with congenital blindness, as well as the current study, point in that direction [22, 68, 69, 70].
Summary, future applications and research directions
In summary, we show that understanding of speech in noise greatly improves after a short multisensory audio-tactile training with our in-house SSD. The results of the current experiment expand our previous findings, in which we showed a clear and immediate benefit of complementing an auditory speech signal with tactile vibrations on the fingertips, with no training at all [28]. Our research and the specifically developed experimental set-up are novel and contribute to the very scarce literature on audio-tactile speech comprehension.
We believe that the development of assistive communication devices involving tactile cues is especially needed during the COVID-19 pandemic, which imposes numerous restrictions on live communication, including limited access to visual speech cues from lip reading. Besides the rehabilitation regimes for the hearing impaired and the elderly discussed above, our technology and its tactile feedback can also assist normal-hearing individuals in second-language acquisition, in enhancing the appreciation of music, and when talking on the phone. Furthermore, our lab has already started developing new minimal tactile devices that can provide vibrotactile stimulation on other body parts beyond the fingertips. Our aim is to design a wearable set-up that will assist with speech-in-noise comprehension (and sound source localization) in real-life scenarios.
In addition, our current SSD is compatible with a 3T MRI scanner. We have already collected functional magnetic resonance imaging (fMRI) data in a group of participants performing the same unisensory and multisensory speech comprehension tasks. To our knowledge, this study will be the first to examine the neural correlates of understanding speech presented as combined auditory and vibrotactile stimulation. Our results can help uncover the brain mechanisms of speech-related audio-tactile binding and may elucidate the neuronal sources of inter-subject variability in the benefits of multisensory learning. This latter aspect, in turn, could further inform rehabilitation and training programs. Future fMRI studies are planned in the deaf population in Israel to investigate the neural correlates of perceiving a closed set of trained speech stimuli solely through vibration.