Speech emotion recognition (SER) is an important application in the fields of Affective Computing and Artificial Intelligence. Recently, there has been significant interest in deep neural networks that operate on speech spectrograms. Because the two-dimensional spectrogram representation captures richer speech characteristics than one-dimensional features, convolutional neural networks (CNNs) and advanced image recognition models have been leveraged to learn deep patterns in spectrograms and thereby perform SER effectively. Accordingly, in this study we propose a novel SER model that learns from utterance-level spectrograms. First, we use the Spatial Pyramid Pooling (SPP) strategy to remove the fixed input-size constraint of CNN-based image recognition, so that variable-length utterances can be processed directly. The SPP layer is then deployed to extract both a global-level prominent feature vector and multi-local-level feature vectors, followed by an attention model that weighs the feature vectors. Finally, we apply the ArcFace layer, typically used in face recognition, to the SER task, thereby obtaining improved SER performance. Our model achieves an unweighted accuracy of 67.9% on the IEMOCAP dataset and 77.6% on the EMODB dataset.
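To make the pipeline concrete, below is a minimal sketch of the architecture the abstract describes: a CNN over a variable-length utterance-level spectrogram, an SPP layer producing one global and several multi-local feature vectors, attention weighting over those vectors, and an ArcFace-style margin classifier. This is an illustrative reconstruction, not the authors' released code; the layer sizes, pyramid levels, and names (`SPPLayer`, `SERModel`, `dim`, etc.) are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPPLayer(nn.Module):
    """Spatial pyramid pooling: max-pool the feature map at several grid
    resolutions so any input size yields fixed-length vectors."""
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels

    def forward(self, x):                       # x: (B, C, H, W), H/W may vary
        # Level 1 -> global feature vector; levels 2 and 4 -> multi-local vectors.
        return [F.adaptive_max_pool2d(x, n).flatten(1) for n in self.levels]

class ArcFace(nn.Module):
    """Additive angular margin classifier (Deng et al., 2019)."""
    def __init__(self, in_features, n_classes, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, in_features))
        self.s, self.m = s, m

    def forward(self, x, labels):
        # Cosine similarity between normalized embeddings and class weights.
        cos = F.linear(F.normalize(x), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        # Add the angular margin m only to the target-class angle.
        target = F.one_hot(labels, self.weight.size(0)).bool()
        cos_m = torch.where(target, torch.cos(theta + self.m), cos)
        return self.s * cos_m                   # logits for cross-entropy

class SERModel(nn.Module):
    def __init__(self, n_classes=4, channels=64, dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
        )
        self.spp = SPPLayer(levels=(1, 2, 4))
        # Project each pooled vector to a common dimension so attention
        # can weigh the global and local features against each other.
        self.proj = nn.ModuleList(
            [nn.Linear(channels * n * n, dim) for n in (1, 2, 4)])
        self.att = nn.Linear(dim, 1)            # scalar attention score per vector
        self.arcface = ArcFace(dim, n_classes)

    def forward(self, spec, labels):            # spec: (B, 1, n_mels, T), T varies
        fmap = self.cnn(spec)
        feats = torch.stack(
            [p(v) for p, v in zip(self.proj, self.spp(fmap))], dim=1)  # (B, 3, dim)
        w = torch.softmax(self.att(feats), dim=1)    # weights over pyramid levels
        utt = (w * feats).sum(dim=1)                 # weighted utterance embedding
        return self.arcface(utt, labels)

# Usage: spectrograms of any time length map to same-size logits.
model = SERModel()
labels = torch.tensor([0, 3])
logits = model(torch.randn(2, 1, 64, 173), labels)   # (2, 4)
loss = F.cross_entropy(logits, labels)
```

The SPP levels make the size-invariance concrete: pooling to 1x1, 2x2, and 4x4 grids always yields the same number of feature vectors regardless of the utterance's duration, which is what lets the model consume whole utterance-level spectrograms without cropping or padding.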