4.1 Hardware and Software Requirements
An emotion recognition system that classifies speech by featurizing spectrograms with a hybrid Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM) model requires substantial hardware and software resources. At the hardware level, a high-end GPU (e.g., NVIDIA RTX 3080 or better) is needed to accelerate deep learning computations, along with at least 32 GB of RAM and a multi-core CPU (Intel i7 / AMD Ryzen 7 or a more powerful equivalent) to keep data preprocessing and model training efficient. On the software side, the system is built on a Linux-based operating system such as Ubuntu 20.04, with a deep learning framework such as TensorFlow or PyTorch for model development. Supporting libraries include Keras for building neural networks, librosa for loading audio files and extracting mel-spectrograms, and scikit-learn for hyperparameter search (e.g., grid search) and model evaluation. A Python environment with full Jupyter Notebook support allows model building and visualization of emotion recognition results to proceed iteratively.
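As a minimal sketch of the librosa-based feature extraction mentioned above (the file path, sampling rate, and mel parameters are illustrative assumptions, not values taken from the original experiments):

```python
# Sketch: extracting a log-mel spectrogram with librosa for CNN input.
import librosa
import numpy as np

audio_path = "speech_sample.wav"            # hypothetical input file
y, sr = librosa.load(audio_path, sr=22050)  # resample to 22.05 kHz

# Mel spectrogram, then convert power to decibels
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,
                                     n_fft=2048, hop_length=512)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (n_mels, n_frames)
```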
4.2 Dataset
RAVDESS contains 7,356 files in total; the audio-only portion used here comprises 1,440 speech and 1,012 song recordings, and the compressed archive is only about 136 MB, so it can be downloaded quickly from Kaggle. Two lexically matched statements are spoken by 24 professional actors (12 female, 12 male) in a neutral North American accent. The speech recordings cover calm, happy, sad, angry, fearful, surprised, and disgusted expressions, while the song recordings cover calm, happy, sad, angry, and fearful expressions. The CREMA-D dataset consists of 7,442 original clips from 91 actors (48 male, 43 female) aged 20 to 74, spanning five racial/ethnic groups: African American (16 actors), Asian (17), Caucasian (32), Hispanic (18), and Unspecified. Each actor reads 12 simple declarative sentences in one of six emotions (anger, disgust, fear, happiness, neutral, and sadness) at four intensity levels: low, medium, high, and unspecified. Finally, 200 target words were spoken by two actresses (aged 26 and 64) in the carrier phrase "Say the word _" and recorded under seven emotional conditions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral), yielding 2,800 audio files in total.
Dataset Source
- https://www.kaggle.com/datasets/ejlok1/cremad
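As a sketch of how emotion labels might be recovered from the RAVDESS audio files, the snippet below assumes the commonly documented file-naming scheme in which the third hyphen-separated field encodes the emotion; the mapping should be verified against the dataset's own documentation before use.

```python
# Sketch: mapping RAVDESS file names to emotion labels (naming scheme assumed).
import os

EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def label_from_filename(path):
    """Return the emotion label encoded in a RAVDESS-style file name."""
    code = os.path.basename(path).split("-")[2]
    return EMOTIONS.get(code, "unknown")

print(label_from_filename("03-01-05-01-02-01-12.wav"))  # -> "angry"
```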
4.3 Illustrative Example
Figure 3 shows a horizontal bar chart of the frequencies of four emotion classes: sad, happy, disgust, and angry, drawn in the default blue colour palette. Disgust has the longest bar, indicating the highest frequency, while happy and sad have shorter bars of roughly equal length. Angry falls in between, with a bar longer than those of happy and sad but shorter than that of disgust. The chart highlights the relative distribution of emotions in the dataset: disgust appears more often than any other emotion, and happy and sad occur a similar number of times.
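A distribution chart like Figure 3 can be produced, for example, with pandas and matplotlib; the labels series below is a toy placeholder rather than the actual dataset.

```python
# Sketch: horizontal bar chart of emotion-label frequencies.
import matplotlib.pyplot as plt
import pandas as pd

labels = pd.Series(["disgust", "angry", "happy", "sad", "disgust"])  # toy data
counts = labels.value_counts()

plt.barh(counts.index, counts.values)
plt.xlabel("Frequency")
plt.ylabel("Emotion")
plt.title("Emotion distribution")
plt.tight_layout()
plt.show()
```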
Figure 4: Waveplot of an "angry" speech signal over time. The x-axis shows time in seconds and the y-axis the amplitude of the audio signal. After about one second there is a sharp jump in amplitude, indicating an angry outburst. This peak marks a period of intense variability and energy in the waveform, reflecting the heightened emotional state. Before and after the peak the signal drops back down, corresponding to quieter stretches of speech. The plot illustrates the fluctuating nature of anger in speech, with bursts of loud, vocally tense activity.
Figure 5 is a spectrogram showing how the frequency content of the angry speech evolves over time. Time is plotted on the x-axis in seconds and frequency on the y-axis, ranging from 0 to 10,000 Hz. Colour encodes amplitude: red and yellow correspond to high amplitude and blue to low amplitude. The spectrogram shows bursts of energy mostly below 4,000 Hz, with an especially high concentration between roughly 0.5 and 1.5 seconds, which corresponds to the most intense portion of the angry speech. These bursts contain strong low- and mid-frequency components presumably associated with the harsh, loud phonation produced during angry vocalizations. The visualization illustrates the temporal and spectral structure in speech that characterizes anger acoustically.
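Waveplots and spectrograms of this kind (Figures 4 through 17) can be generated with librosa's display utilities. The sketch below assumes a placeholder file path and librosa 0.9 or later (earlier versions use waveplot instead of waveshow).

```python
# Sketch: waveplot and STFT spectrogram of a single audio file.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("angry_sample.wav", sr=None)  # keep native sampling rate

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))

# Waveplot: amplitude over time
librosa.display.waveshow(y, sr=sr, ax=ax1)
ax1.set_title("Waveplot")

# Spectrogram: STFT magnitude in dB over time and frequency
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
img = librosa.display.specshow(D, sr=sr, x_axis="time", y_axis="hz", ax=ax2)
ax2.set_title("Spectrogram")
fig.colorbar(img, ax=ax2, format="%+2.0f dB")
plt.tight_layout()
plt.show()
```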
Figure 6: Waveplot of the "disgust" speech signal over time. Time in seconds is shown on the x-axis and amplitude on the y-axis. The plot shows a sharp increase in amplitude around 0.5 seconds before the amplitude falls again. This pattern reflects a prominent expression of disgust with intermittent surges in speech volume. The high variability of the waveform, even within the same phrase, conveys how varied and emphatic vocal expressions of disgust can be. The representation captures the rapid amplitude changes that characterize them.
Figure 7 depicts a spectrogram of speech conveying "disgust" and its development over time. The x-axis gives time in seconds and the y-axis frequency in Hz, up to 10 kHz. Warm colours (reds and yellows) indicate high amplitude, while cool colours (blues and purples) indicate low amplitude. Bursts of energy extend up to roughly 3,000 Hz, almost forming a band below 4,000 Hz, particularly between about 0.5 and 1.5 seconds. These bursts reflect the highly patterned, varying vocal qualities involved in expressing disgust. The figure displays the spectral-temporal profile of the "disgust" emotion in speech and its distinctive acoustic characteristics.
Figure 8 is a waveplot of a speech signal expressing "fear" over time. The x-axis shows time in seconds and the y-axis the amplitude of the audio signal. The plot shows large spikes in amplitude at around 0.5 seconds and again just after the 1-second mark, indicating bursts of vocal intensity related to fear. The sharp rises in amplitude reflect how quickly fearful speech swells and recedes, and the strong variability around the peaks is characteristic of the tense, anxious quality typically attributed to fearful vocalizations. The plot captures the unsettled, rapidly changing nature of fear in speech.
Figure 9 is a spectrogram of speech expressing "fear" as a function of time. Time is given on the x-axis in seconds and frequency on the y-axis, up to 10,000 Hz. Colour intensity encodes amplitude, with high amplitudes shown in red and yellow and low amplitudes in blue. The spectrogram shows focused bursts of energy, mostly below 4,000 Hz, between 0.5 and 1 second, indicating the vocally active periods associated with fear. Energy rises and falls across frequencies, at times changing rapidly, consistent with the unsteady character of fearful speech. The visualization reveals the intricate temporal and spectral characteristics that serve as acoustic hallmarks of fear.
Figure 10 is a waveplot of speech audio conveying the "happy" emotion over time. The x-axis shows time in seconds and the y-axis the amplitude of the audio signal. The plot shows large peaks in amplitude at about 0.25 and 0.75 seconds, marking onsets of happy vocalization with higher amplitude than the surrounding sections. These high-energy peaks and the pronounced amplitude changes around them reflect the lively, enthusiastic quality of happiness. The waveform varies strongly in shape, especially around the spikes, which is consistent with the animated nature of joyful vocalizations. The plot gives a clear visual impression of how happiness appears in speech.
Figure 11 is a spectrogram showing the frequency content over time for the "happy" speech. Time in seconds runs along the x-axis and frequency, from 0 to 10,000 Hz, along the y-axis. Colour intensity corresponds to amplitude, with red and yellow indicating substantially higher amplitudes than blue. The spectrogram reveals bursts of energy concentrated mainly below 4,000 Hz, with considerable activity between 0.25 and 0.75 seconds and again between about 1 and 1.5 seconds. These bursts represent episodes of high vocal intensity corresponding to joy. The spectral patterns are erratic and high in energy, mirroring the lively nature of happiness. Overall, the figure shows the temporal and spectral dynamics of happiness in speech, with acoustic activity concentrated in the lower to mid frequency ranges.
Figure 12: Waveplot of a sad speech signal over time. The audio signal amplitude is on the y-axis and time in seconds on the x-axis. The plot shows a modest local maximum at approximately 0.75 seconds, corresponding to the moment of greatest vocal intensity in the sad utterance. The peak rises and falls more gradually than the peaks of more intense emotions, reflecting the lack of suddenness in sadness. The calmly shaped waveform, with low variability before and after the peak, shows the flattened, more constrained vocal quality of sad utterances. The plot captures the composure and steadiness characteristic of sadness in speech.
Figure 13: Spectrogram of speech expressing sadness over time. Time in seconds and frequency up to 10,000 Hz are plotted along the x- and y-axes respectively. Colour intensity encodes amplitude: red and yellow are high, blue is low. The spectrogram shows most of the energy below 4,000 Hz, concentrated roughly between 0.5 and 1.5 seconds. These bursts correspond to regions of vocal expression linked with sadness, carrying only moderate energy and showing smooth transitions between frequencies. The spectral patterns are noticeably less diverse and less intense than those of more dynamic emotions, consistent with the subdued character of sadness. The figure visualizes the temporal and spectral evolution of sadness in speech and its distinctive acoustic cues, particularly in the lower to mid frequency bands.
Figure 14 is a waveplot of the speech signal for the "neutral" emotion. The x-axis presents time in seconds and the y-axis the amplitude of the audio signal. The plot shows moderate fluctuations in amplitude between about 0.5 and 1 second, with the amplitude gradually rising toward a peak in vocal intensity within the first second. The rises and falls are moderate and stable, reflecting the balanced, even character of neutral speech. The waveform shows only moderate variability, suggesting controlled delivery that is neither aggressive nor subdued. The plot captures this even, unemphatic amplitude profile.
Figure 15: A spectrogram visualizing the frequency content of an utterance expressing a "neutral" emotion over time. Time in seconds runs along the x-axis and frequency from 0 to 10,000 Hz along the y-axis. Colour intensity represents amplitude, with red and yellow showing higher amplitudes and blue lower ones. The spectrogram displays moderate energy bursts, mainly below 4,000 Hz, from about 0.5 to 1.5 seconds. These bursts mark moments of vocal output in neutral speech that carry a measured, not excessive, amount of energy. The spectral patterns show a fairly flat energy distribution across frequencies, consistent with the stable, controlled state associated with neutrality. The visualization shows the temporal and spectral dynamics of neutral speech, with its modest activity in the low to mid frequency bands.
Figure 16 is a waveplot of the audio signal for speech expressing "surprise" over time. Time in seconds is on the x-axis and amplitude on the y-axis. Clear peaks appear at around 1.5 seconds and between about 2.0 and 2.5 seconds, reflecting bursts of vocal amplitude associated with surprise. The amplitude changes are sudden and large, mirroring how rapidly the emotional intensity of surprise rises. The waveform is highly variable and dynamic, particularly at the peaks, matching the explosive vocalizations that surprise tends to produce. The plot effectively illustrates the transient, explosive nature of surprise in speech.
Figure 17: Spectrogram of speech expressing the emotion of surprise. The x-axis shows time in seconds and the y-axis frequency up to 10 kHz. Colour intensity represents amplitude, with red and yellow for higher amplitudes and blue for lower. The spectrogram displays clusters of energy bursts around 1 second and again between 2 and 2.5 seconds, marking periods of strong vocal intensity typical of surprise. These bursts are accompanied by high-amplitude activity across much of the spectral range, reflecting the sudden, intense nature of the emotion. The spectral patterns vary rapidly, particularly in the lower to mid frequency bands that carry the time-varying acoustic cues a listener expects from surprised speech. The visualization captures the dynamic temporal and spectral character of surprise and its highly energetic bursts.
Figure 18: Zero Crossing Rate (ZCR) plot over time, with the frame index on the x-axis and the ZCR value on the y-axis. The zero crossing rate is the rate at which an audio signal changes sign, passing from positive to negative or vice versa, and indicates how noisy or high in frequency the signal is in each frame. The figure shows the ZCR fluctuating with many peaks and valleys, reflecting changing frequency content and signal periodicity. Major peaks occur at approximately frame indices 40 and 55, indicating frames with pronounced high-frequency content or noise. Lower ZCR values around frame indices 10 and 75 mark more stable, less noisy regions of the signal. The ZCR plot thus offers a view of the temporal evolution of the signal and of how its frequency characteristics change over time.
Figure 19 shows the Root Mean Square (RMS) value of the audio signal over time. The x-axis represents time (frame index) and the y-axis the RMS value, a measure of signal energy or loudness. The plot contains many high peaks and low troughs, meaning the signal's energy fluctuates considerably. A prominent peak occurs near frame index 20, marking a region of the audio with substantial energy, and further large peaks at frame indices of roughly 40 and 55 indicate additional high-energy segments. After these peaks the RMS level gradually decreases, corresponding to quieter, lower-energy sections. The RMS plot is useful for understanding the dynamic range and loudness fluctuations of the signal over time, showing where energy rises and falls.
Figure 20 shows the Mel-Frequency Cepstral Coefficients (MFCCs) of the audio signal over time as a heatmap. The x-axis gives time in seconds and the y-axis the individual MFCC coefficients, which play an important role in speech analysis and recognition. The colour of each cell encodes the sign and magnitude of a coefficient, with yellow/green indicating large positive values and purple indicating large negative values. The patterns and variations visible across the length of the signal capture properties of speech that are vital for recognizing both phonetic content and emotion. This representation summarizes the temporal and spectral facets of the audio and provides features that are readily used in tasks such as emotion detection and speech recognition.
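The frame-level features shown in Figures 18 through 20 can be extracted with librosa; in the sketch below the file path, frame length, and hop length are illustrative assumptions.

```python
# Sketch: ZCR, RMS, and MFCC extraction for a single audio file.
import librosa

y, sr = librosa.load("speech_sample.wav", sr=22050)

zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)[0]
rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)[0]
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=512)

# Per-frame ZCR and RMS vectors, plus a (13, n_frames) MFCC matrix
print(zcr.shape, rms.shape, mfcc.shape)
```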
4.4 Results
Figure 21: Line graph showing model accuracy on the training and validation sets over 20 epochs, with the number of epochs on the x-axis and accuracy on the y-axis. The blue line shows the training accuracy, which rises over time with only minor fluctuations. The orange line shows the validation accuracy, which initially follows a similar curve but begins to decline after about 6 epochs. The separation of the two curves after the first few epochs suggests overfitting: the model keeps improving on the training set but not on the validation set. Overall, the pattern indicates that adjustments to training may be needed to improve generalization.
Figure 22: Line plot of model loss over 20 epochs for the training and validation sets, with the epoch number on the x-axis and the loss on the y-axis. The blue line shows the training loss, which drops steeply from around 60 to below 10 within a few epochs and then plateaus. The orange line shows the validation loss, which starts low and stays nearly flat around 0 throughout training. The rapid decline of the training loss toward zero indicates that the model learns the training data very quickly. A validation loss that is flat from the start, however, points to one of several issues: the training and validation data may differ substantially, or the model may be overfitting, performing well on the training set while failing to generalize, so that the validation curve barely changes. The pattern is a warning sign: the model fits the training data well, but further investigation is needed to confirm that it generalizes to new data.
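Curves such as those in Figures 21 and 22 are typically drawn from the History object returned by Keras' model.fit; the helper below assumes the model was compiled with metrics=["accuracy"], so that the corresponding keys exist in history.history.

```python
# Sketch: plotting training vs. validation accuracy and loss from a Keras History.
import matplotlib.pyplot as plt

def plot_history(history):
    """Plot accuracy and loss curves for training and validation sets."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    ax1.plot(history.history["accuracy"], label="train")
    ax1.plot(history.history["val_accuracy"], label="validation")
    ax1.set_xlabel("Epoch"); ax1.set_ylabel("Accuracy"); ax1.legend()

    ax2.plot(history.history["loss"], label="train")
    ax2.plot(history.history["val_loss"], label="validation")
    ax2.set_xlabel("Epoch"); ax2.set_ylabel("Loss"); ax2.legend()

    plt.tight_layout()
    plt.show()

# Usage (illustrative): plot_history(model.fit(X_train, y_train,
#     validation_data=(X_val, y_val), epochs=20))
```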
Figure 23 is a confusion matrix summarizing how the classification model performed on the seven emotion classes: angry, disgust, fear, happy, neutral, sad, and surprise. Predicted labels are on the x-axis and true labels on the y-axis. Each cell gives the number of times a given true label was assigned a particular predicted label, with colour intensity proportional to the count. The diagonal elements, which should ideally be large, indicate correct classifications; the reported diagonal counts are Disgust (74), Fear (86), Happy (76), Neutral (74), Sad (84), and Surprise (80). The off-diagonal cells show that several emotions are frequently misclassified as "Disgust" (for example, 37 angry samples), hinting at a noticeable bias toward predicting that class. This pattern reveals where the model's performance differs across classes and suggests where classification of these emotions could be improved.
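A confusion matrix like Figure 23 can be computed and plotted with scikit-learn; the label list and the toy y_true/y_pred values below are placeholders for the actual test labels and model predictions.

```python
# Sketch: building and plotting a confusion matrix for the seven emotion classes.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

emotions = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
y_true = ["angry", "disgust", "fear"]     # toy ground-truth labels
y_pred = ["angry", "disgust", "disgust"]  # toy model predictions

cm = confusion_matrix(y_true, y_pred, labels=emotions)
ConfusionMatrixDisplay(cm, display_labels=emotions).plot(xticks_rotation=45)
plt.tight_layout()
plt.show()
```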
4.5 Comparison of Proposed and Existing Work
Table 2. Comparison of the proposed and existing work
Models | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
Probabilistic Neural Network [23] | 95.56 | 94.29 | 95.84 | 94.68
Long Short-Term Memories [24] | 97.1 | 96.75 | 96.85 | 96.98
One-dimensional deep convolutional neural network (1-D DCNN) [25] | 93.31 | 93.21 | 93.08 | 93.14
Gaussian Mixture Model (GMM) [26] | 74.33 | 73.69 | 73.58 | 74.28
Deep Learning Transfer Models [27] | 86.54 | 86.89 | 85.28 | 85.98
Proposed CNN + Bi-LSTM | 98.48 | 97.25 | 98.29 | 97.39
Table 2 and Figure 24 summarize the performance of several emotion recognition models in terms of accuracy, precision, recall, and F1-score. The Probabilistic Neural Network [23] achieves 95.56% accuracy, 94.29% precision, 95.84% recall, and an F1-score of 94.68%. Long Short-Term Memory (LSTM) [24] performs somewhat better, with 97.1% accuracy, 96.75% precision, 96.85% recall, and an F1-score of 96.98%. The one-dimensional deep convolutional neural network (1-D DCNN) [25] reaches 93.31% accuracy, 93.21% precision, 93.08% recall, and an F1-score of 93.14%. The Gaussian Mixture Model (GMM) [26] shows the lowest performance, with 74.33% accuracy, 73.69% precision, 73.58% recall, and an F1-score of 74.28%. The deep learning transfer models [27] achieve a fair level of performance, with 86.54% accuracy, 86.89% precision, 85.28% recall, and an F1-score of 85.98%. The proposed CNN + Bi-LSTM model performs best overall, with 98.48% accuracy, 97.25% precision, 98.29% recall, and an F1-score of 97.39%, indicating that the proposed architecture is more capable on the emotion recognition task than the compared models, although it should still be evaluated on entirely new datasets.
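For reference, metrics of the kind reported in Table 2 can be computed with scikit-learn; y_true and y_pred below are toy placeholders for the test labels and model predictions, and the weighted averaging is an assumption, since the averaging scheme is not stated in the text.

```python
# Sketch: computing accuracy, precision, recall, and F1-score for a multi-class task.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1]  # toy ground-truth class indices
y_pred = [0, 1, 2, 1, 1]  # toy predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="weighted"))
print("Recall   :", recall_score(y_true, y_pred, average="weighted"))
print("F1-Score :", f1_score(y_true, y_pred, average="weighted"))
```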