Affective science is an interdisciplinary field that examines research questions related to emotion. To address some of those questions, various emotional stimuli corpora or databases have been developed (for a review of some of these databases, see Krumhuber et al., 2017; Wu et al., 2014). For example, psychologists may develop a stimulus set for experimental studies on emotion perception (e.g., Benda & Scherf, 2020; Thompson et al., 2013), and computer scientists may create a corpus of recordings to train machine learning models to annotate emotions automatically (e.g., Cosker et al., 2011; Yin et al., 2008). The development of these databases is often time-consuming and resource-intensive, but fortunately, most of them are made available and shared with other researchers. This paper describes the development of one such audio-visual (AV) database that complements the existing ones in the field: the Reading Everyday Emotion Database (REED) [1].
Most previous databases are unimodal; that is, the stimuli are either auditory-only (AO) or visual-only (VO). Some examples of AO databases are the Macquarie Battery of Emotional Prosody (Thompson et al., 2013), the EU-Emotion Voice Database (Lassalle et al., 2019), and the Vocal Expressions of Nineteen Emotions across Cultures (VENEC) corpus (Laukka et al., 2010). These AO databases contain verbal vocalisations (e.g., utterances spoken in particular emotions) and/or non-verbal vocalisations such as laughs or screams. The VO databases contain stimuli that are either static (i.e., still photographs or images)—such as the NimStim database (Tottenham et al., 2009) and the Facial Expressions of Emotion – Stimuli and Tests (FEEST) (Young et al., 2002), which uses photographs from the classic Pictures of Facial Affect set (Ekman & Friesen, 1976)—or dynamic (i.e., silent videos), created either by morphing still images (e.g., from a neutral expression to an angry one) (Montagne et al., 2007; Young et al., 2002) or by presenting video recordings without the audio (Golan et al., 2006; O’Toole et al., 2005; van der Schalk et al., 2011; Wingenbach et al., 2016).
Stimuli from these databases have often been used to investigate emotion perception, and by far the most widely used are the static VO databases (i.e., the still photographs). The use of still photographs to investigate emotion perception has been criticised, since the temporal, dynamic information of emotions is crucial for emotion processing (Krumhuber et al., 2013) and human perceivers tend to integrate both auditory (e.g., acoustic) and visual (e.g., facial) cues for emotion recognition (Massaro & Egan, 1996). Indeed, direct comparisons of unimodal (AO or VO) vs. bimodal (AV) presentations of emotions reveal that human perceivers are more accurate at recognising emotions (Kim & Davis, 2012) and rate emotions as more intense (Bhullar, 2013) when they are presented in AV mode. Thus, to increase the ecological validity of emotion perception research (and of affective science generally), AV databases are needed.
There are two main types of AV databases in the field: those that involve naturalistic or interaction-based recordings and those that involve posed recordings. The former typically use clips from television shows/films or recordings of one or more individuals interacting or performing a task (e.g., Busso et al., 2008; Dhall et al., 2012; Douglas-Cowie et al., 2011). Recordings from these databases often contain situational cues that aid emotion expression, and the verbal content may not be the same across actors; although this is useful for those investigating spontaneous and naturalistic emotions, it may pose a challenge for those who need precise control over the stimuli. The posed AV databases offer such control, given that the actors typically use the same set of contents or utterances to produce the same set of emotions. Table 1 presents some examples of posed AV databases in the field. These posed AV databases nonetheless have certain limitations: most consist of only a small range of emotions (typically neutral and the six ‘basic’ emotions: angry, disgusted, fearful, happy, sad, and surprised), and most are recorded by professional actors, who may display exaggerated expressions (Jürgens et al., 2015). Moreover, all of these databases were recorded in pristine, studio-like conditions with bright lighting, a plain-coloured background, a high-definition camera, and clear audio. In other words, the currently available posed AV databases may not reflect how emotions are expressed in a typical, ‘real world’ setting (e.g., during teleconferencing), where not only may the expressers lack acting experience, but the recording conditions may also vary (e.g., the lighting level and colour saturation may differ naturally between clips, unlike in studio recordings).
Table 1
List of posed audio-visual databases and information on their recordings including the location (Location); whether they were speech, song, or both (Domain); the language of the recording (Language); the number of encoders (No. Encoder) and whether they were professionals/experienced actors (Pro?); and the list of emotions recorded.
| Name | Location | Domain | Language | No. Encoder | Pro? | Emotions |
|---|---|---|---|---|---|---|
| Audio-visual database of emotional speech in Basque (Navas et al., 2004) | Studio/Lab | Speech | Basque | 1 | Yes | Angry, Disgust, Fearful, Happy, Neutral, Sad, Surprise |
| Database of Kinetic Facial Expressions (DaFEx) (Battocchi et al., 2005) | Studio/Lab | Speech | Italian | 8 | Yes | Angry, Disgust, Fearful, Happy, Neutral, Sad, Surprise |
| Geneva Multimodal Emotion Portrayals - Core Set (GEMEP-CS) (Bänziger et al., 2012) | Studio/Lab | Speech | Pseudospeech | 10 | Yes | Admiration, Amusement, Anxiety, Contempt, Cold anger, Despair, Disgust, Hot anger, Fear, Interest, Joy, Pleasure, Pride, Relief, Sadness, Surprise, Tenderness |
| Multimedia Human-Machine Communication (MHMC) Database (Lin et al., 2012) | Studio/Lab | Speech | Chinese | 7 | No | Angry, Happy, Neutral, Sad |
| Surrey Audio-Visual Expressed Emotion (SAVEE) Database (Haq & Jackson, 2009) | Studio/Lab | Speech | English (British) | 4 | No | Angry, Disgust, Fearful, Happy, Neutral, Sad, Surprise |
| The EU-Emotion Stimulus Set (O’Reilly et al., 2016) | Studio/Lab | Speech | English (British) | 19 | Yes | Afraid, Angry, Ashamed, Bored, Disappointed, Disgusted, Excited, Frustrated, Happy, Hurt, Interested, Jealous, Joking, Kind, Neutral, Proud, Sad, Sneaky, Surprise, Unfriendly, Worried |
| The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) (Livingstone & Russo, 2018) | Studio/Lab | Speech & Song | English (Canadian) | 24 | Yes | Speech: Angry, Calm, Disgust, Fearful, Happy, Neutral, Sad, Surprise; Song: Angry, Calm, Fearful, Happy, Neutral, Sad |
| The STOIC Dynamic Facial Emotional Expressions Database (Roy et al., 2007) | Studio/Lab | Speech | French (Montreal) | 34 | Yes | Angry, Disgust, Fearful, Happy, Neutral, Pain, Sad, Surprise |
As can be seen in Table 1, there is a paucity of databases that include sung emotions, which presents a barrier to cross-domain emotion research. Given that speech and song are human-specific vocal channels, there is considerable interest in studying the similarities and differences between the two. Yet relatively little is known about how the two domains compare in emotion expression, presumably due in part to the lack of available resources. Understanding how the two domains are related in their emotion expression will not only deepen our understanding of the mechanisms potentially shared between them, but may also have implications for the development of emotion skill interventions, such as for individuals with autism or alexithymia (Allen & Heaton, 2010; Katagiri, 2009). In the one database that does include sung emotions (RAVDESS), only six emotions were examined (angry, calm, fearful, happy, neutral, and sad), which limits the generalisability of comparative studies between speech and song to other (complex) emotions.
We developed the REED to complement the existing posed AV databases by addressing those limitations. The recordings from the REED are devoid of situational cues, similar to the previous posed AV databases. Unlike the previous ones, however, we set out to record a wider range of emotions (neutral, the six basic emotions, and six complex emotions—embarrassed, hopeful, jealous, proud, sarcastic, and stressed) with adults across a range of ages, with and without acting/drama experience (the ‘encoders’), to better reflect the general population, who may have varying levels of acting experience. We also aimed to expand the available AV databases by including both the speech and song domains, the latter of which is scarce in the field, and thus to enable comparative studies of spoken vs. sung emotions that are not limited to basic emotions. To ensure variability in the recording conditions, we recorded encoders using everyday recording devices commonly used in teleconferencing (i.e., their own webcam, mobile phone, etc.) [2].
[1] The ‘Reading’ in REED is pronounced as ‘Redd-ing’, following the town in Berkshire, England, where the university is located.
[2] Due to the coronavirus (COVID-19) pandemic, we were not able to systematically manipulate the device and recording conditions for each encoder in the lab; instead, we relied on each encoder’s own recording environment for variation.