4.1 Hardware and Software Requirements
An emotion recognition system that classifies speech by featurizing spectrograms with a hybrid Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM) model requires substantial hardware and software resources. At the hardware level, a high-end GPU (e.g., NVIDIA RTX 3080 or better) is needed to accelerate deep learning computations, along with at least 32 GB of RAM and a multi-core CPU (Intel i7 / AMD Ryzen 7 or a more powerful equivalent) to keep data preprocessing and model training efficient. On the software side, the system is built on a Linux-based operating system such as Ubuntu 20.04, with a deep learning framework such as TensorFlow or PyTorch for model development. Supporting libraries include Keras for building neural networks, librosa for loading audio files and extracting mel-spectrograms, and scikit-learn for hyperparameter search (e.g., grid search) and model evaluation. A Python environment with full Jupyter Notebook support allows model building and visualization of emotion recognition results to proceed iteratively.
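As a minimal sketch of the librosa-based feature extraction mentioned above (the file path, sampling rate, and mel parameters are illustrative assumptions, not values taken from the original experiments):

```python
# Sketch: extracting a log-mel spectrogram with librosa for CNN input.
import librosa
import numpy as np

audio_path = "speech_sample.wav"            # hypothetical input file
y, sr = librosa.load(audio_path, sr=22050)  # resample to 22.05 kHz

# Mel spectrogram, then convert power to decibels
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,
                                     n_fft=2048, hop_length=512)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (n_mels, n_frames)
```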
4.2 Dataset
RAVDESS contains 7,356 files in total; the audio-only portion used here comprises 1,440 speech and 1,012 song recordings, and the compressed archive is only about 136 MB, so it can be downloaded quickly from Kaggle. Two lexically matched statements are spoken by 24 professional actors (12 female, 12 male) in a neutral North American accent. The speech recordings cover calm, happy, sad, angry, fearful, surprised, and disgusted expressions, while the song recordings cover calm, happy, sad, angry, and fearful expressions. The CREMA-D dataset consists of 7,442 original clips from 91 actors (48 male, 43 female) aged 20 to 74, spanning five racial/ethnic groups: African American (16 actors), Asian (17), Caucasian (32), Hispanic (18), and Unspecified. Each actor reads 12 simple declarative sentences in one of six emotions (anger, disgust, fear, happiness, neutral, and sadness) at four intensity levels: low, medium, high, and unspecified. Finally, 200 target words were spoken by two actresses (aged 26 and 64) in the carrier phrase "Say the word _" and recorded under seven emotional conditions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral), yielding 2,800 audio files in total.
Dataset Source
- https://www.kaggle.com/datasets/ejlok1/cremad
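As a sketch of how emotion labels might be recovered from the RAVDESS audio files, the snippet below assumes the commonly documented file-naming scheme in which the third hyphen-separated field encodes the emotion; the mapping should be verified against the dataset's own documentation before use.

```python
# Sketch: mapping RAVDESS file names to emotion labels (naming scheme assumed).
import os

EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def label_from_filename(path):
    """Return the emotion label encoded in a RAVDESS-style file name."""
    code = os.path.basename(path).split("-")[2]
    return EMOTIONS.get(code, "unknown")

print(label_from_filename("03-01-05-01-02-01-12.wav"))  # -> "angry"
```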
4.3 Illustrative Example
Figure 3 shows a horizontal bar chart of the frequencies of four emotion classes: sad, happy, disgust, and angry, drawn in the default blue colour palette. Disgust has the longest bar, indicating the highest frequency, while happy and sad have shorter bars of roughly equal length. Angry falls in between, with a bar longer than those of happy and sad but shorter than that of disgust. The chart highlights the relative distribution of emotions in the dataset: disgust appears more often than any other emotion, and happy and sad occur a similar number of times.
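A distribution chart like Figure 3 can be produced, for example, with pandas and matplotlib; the labels series below is a toy placeholder rather than the actual dataset.

```python
# Sketch: horizontal bar chart of emotion-label frequencies.
import matplotlib.pyplot as plt
import pandas as pd

labels = pd.Series(["disgust", "angry", "happy", "sad", "disgust"])  # toy data
counts = labels.value_counts()

plt.barh(counts.index, counts.values)
plt.xlabel("Frequency")
plt.ylabel("Emotion")
plt.title("Emotion distribution")
plt.tight_layout()
plt.show()
```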
Figure 4: Waveplot of an "angry" speech signal over time. The x-axis shows time in seconds and the y-axis the amplitude of the audio signal. After about one second there is a sharp jump in amplitude, indicating an angry outburst. This peak marks a period of intense variability and energy in the waveform, reflecting the heightened emotional state. Before and after the peak the signal drops back down, corresponding to quieter stretches of speech. The plot illustrates the fluctuating nature of anger in speech, with bursts of loud, vocally tense activity.
Figure 5 is a spectrogram showing how the frequency content of the angry speech evolves over time. Time is plotted on the x-axis in seconds and frequency on the y-axis, ranging from 0 to 10,000 Hz. Colour encodes amplitude: red and yellow correspond to high amplitude and blue to low amplitude. The spectrogram shows bursts of energy mostly below 4,000 Hz, with an especially high concentration between roughly 0.5 and 1.5 seconds, which corresponds to the most intense portion of the angry speech. These bursts contain strong low- and mid-frequency components presumably associated with the harsh, loud phonation produced during angry vocalizations. The visualization illustrates the temporal and spectral structure in speech that characterizes anger acoustically.
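Waveplots and spectrograms of this kind (Figures 4 through 17) can be generated with librosa's display utilities. The sketch below assumes a placeholder file path and librosa 0.9 or later (earlier versions use waveplot instead of waveshow).

```python
# Sketch: waveplot and STFT spectrogram of a single audio file.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("angry_sample.wav", sr=None)  # keep native sampling rate

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))

# Waveplot: amplitude over time
librosa.display.waveshow(y, sr=sr, ax=ax1)
ax1.set_title("Waveplot")

# Spectrogram: STFT magnitude in dB over time and frequency
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
img = librosa.display.specshow(D, sr=sr, x_axis="time", y_axis="hz", ax=ax2)
ax2.set_title("Spectrogram")
fig.colorbar(img, ax=ax2, format="%+2.0f dB")
plt.tight_layout()
plt.show()
```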
Figure 6: Waveplot of the "disgust" speech signal over time. Time in seconds is shown on the x-axis and amplitude on the y-axis. The plot shows a sharp increase in amplitude around 0.5 seconds before the amplitude falls again. This pattern reflects a prominent expression of disgust with intermittent surges in speech volume. The high variability of the waveform, even within the same phrase, conveys how varied and emphatic vocal expressions of disgust can be. The representation captures the rapid amplitude changes that characterize them.
Figure 7 depicts a spectrogram of speech conveying "disgust" and its development over time. The x-axis gives time in seconds and the y-axis frequency in Hz, up to 10 kHz. Warm colours (reds and yellows) indicate high amplitude, while cool colours (blues and purples) indicate low amplitude. Bursts of energy extend up to roughly 3,000 Hz, almost forming a band below 4,000 Hz, particularly between about 0.5 and 1.5 seconds. These bursts reflect the highly patterned, varying vocal qualities involved in expressing disgust. The figure displays the spectral-temporal profile of the "disgust" emotion in speech and its distinctive acoustic characteristics.
Figure 8 is a waveplot of a speech signal expressing "fear" over time. The x-axis shows time in seconds and the y-axis the amplitude of the audio signal. The plot shows large spikes in amplitude at around 0.5 seconds and again just after the 1-second mark, indicating bursts of vocal intensity related to fear. The sharp rises in amplitude reflect how quickly fearful speech swells and recedes, and the strong variability around the peaks is characteristic of the tense, anxious quality typically attributed to fearful vocalizations. The plot captures the unsettled, rapidly changing nature of fear in speech.
Figure 9 is a spectrogram of speech expressing "fear" as a function of time. Time is given on the x-axis in seconds and frequency on the y-axis, up to 10,000 Hz. Colour intensity encodes amplitude, with high amplitudes shown in red and yellow and low amplitudes in blue. The spectrogram shows focused bursts of energy, mostly below 4,000 Hz, between 0.5 and 1 second, indicating the vocally active periods associated with fear. Energy rises and falls across frequencies, at times changing rapidly, consistent with the unsteady character of fearful speech. The visualization reveals the intricate temporal and spectral characteristics that serve as acoustic hallmarks of fear.
Figure 10 is a waveplot of speech audio conveying the "happy" emotion over time. The x-axis shows time in seconds and the y-axis the amplitude of the audio signal. The plot shows large peaks in amplitude at about 0.25 and 0.75 seconds, marking onsets of happy vocalization with higher amplitude than the surrounding sections. These high-energy peaks and the pronounced amplitude changes around them reflect the lively, enthusiastic quality of happiness. The waveform varies strongly in shape, especially around the spikes, which is consistent with the animated nature of joyful vocalizations. The plot gives a clear visual impression of how happiness appears in speech.
Figure 11 is a spectrogram showing the frequency content over time for the "happy" speech. Time in seconds runs along the x-axis and frequency, from 0 to 10,000 Hz, along the y-axis. Colour intensity corresponds to amplitude, with red and yellow indicating substantially higher amplitudes than blue. The spectrogram reveals bursts of energy concentrated mainly below 4,000 Hz, with considerable activity between 0.25 and 0.75 seconds and again between about 1 and 1.5 seconds. These bursts represent episodes of high vocal intensity corresponding to joy. The spectral patterns are erratic and high in energy, mirroring the lively nature of happiness. Overall, the figure shows the temporal and spectral dynamics of happiness in speech, with acoustic activity concentrated in the lower to mid frequency ranges.
Figure 12: Waveplot of a sad speech signal over time. The audio signal amplitude is on the y-axis and time in seconds on the x-axis. The plot shows a modest local maximum at approximately 0.75 seconds, corresponding to the moment of greatest vocal intensity in the sad utterance. The peak rises and falls more gradually than the peaks of more intense emotions, reflecting the lack of suddenness in sadness. The calmly shaped waveform, with low variability before and after the peak, shows the flattened, more constrained vocal quality of sad utterances. The plot captures the composure and steadiness characteristic of sadness in speech.
Figure 13: Spectrogram of speech expressing sadness over time. Time in seconds and frequency up to 10,000 Hz are plotted along the x- and y-axes respectively. Colour intensity encodes amplitude: red and yellow are high, blue is low. The spectrogram shows most of the energy below 4,000 Hz, concentrated roughly between 0.5 and 1.5 seconds. These bursts correspond to regions of vocal expression linked with sadness, carrying only moderate energy and showing smooth transitions between frequencies. The spectral patterns are noticeably less diverse and less intense than those of more dynamic emotions, consistent with the subdued character of sadness. The figure visualizes the temporal and spectral evolution of sadness in speech and its distinctive acoustic cues, particularly in the lower to mid frequency bands.
Figure 14 is a waveplot of the speech signal for the "neutral" emotion. The x-axis presents time in seconds and the y-axis the amplitude of the audio signal. The plot shows moderate fluctuations in amplitude between about 0.5 and 1 second, with the amplitude gradually rising toward a peak in vocal intensity within the first second. The rises and falls are moderate and stable, reflecting the balanced, even character of neutral speech. The waveform shows only moderate variability, suggesting controlled delivery that is neither aggressive nor subdued. The plot captures this even, unemphatic amplitude profile.
Figure 15: A spectrogram visualizing the frequency content of an utterance expressing a "neutral" emotion over time. Time in seconds runs along the x-axis and frequency from 0 to 10,000 Hz along the y-axis. Colour intensity represents amplitude, with red and yellow showing higher amplitudes and blue lower ones. The spectrogram displays moderate energy bursts, mainly below 4,000 Hz, from about 0.5 to 1.5 seconds. These bursts mark moments of vocal output in neutral speech that carry a measured, not excessive, amount of energy. The spectral patterns show a fairly flat energy distribution across frequencies, consistent with the stable, controlled state associated with neutrality. The visualization shows the temporal and spectral dynamics of neutral speech, with its modest activity in the low to mid frequency bands.
Figure 16 is a waveplot of the audio signal for speech expressing "surprise" over time. Time in seconds is on the x-axis and amplitude on the y-axis. Clear peaks appear at around 1.5 seconds and between about 2.0 and 2.5 seconds, reflecting bursts of vocal amplitude associated with surprise. The amplitude changes are sudden and large, mirroring how rapidly the emotional intensity of surprise rises. The waveform is highly variable and dynamic, particularly at the peaks, matching the explosive vocalizations that surprise tends to produce. The plot effectively illustrates the transient, explosive nature of surprise in speech.
Figure 17: Spectrogram of speech expressing the emotion of surprise. The x-axis shows time in seconds and the y-axis frequency up to 10 kHz. Colour intensity represents amplitude, with red and yellow for higher amplitudes and blue for lower. The spectrogram displays clusters of energy bursts around 1 second and again between 2 and 2.5 seconds, marking periods of strong vocal intensity typical of surprise. These bursts are accompanied by high-amplitude activity across much of the spectral range, reflecting the sudden, intense nature of the emotion. The spectral patterns vary rapidly, particularly in the lower to mid frequency bands that carry the time-varying acoustic cues a listener expects from surprised speech. The visualization captures the dynamic temporal and spectral character of surprise and its highly energetic bursts.
Figure 18: Zero Crossing Rate (ZCR) plot over time, with the frame index on the x-axis and the ZCR value on the y-axis. The zero crossing rate is the rate at which an audio signal changes sign, passing from positive to negative or vice versa, and indicates how noisy or high in frequency the signal is in each frame. The figure shows the ZCR fluctuating with many peaks and valleys, reflecting changing frequency content and signal periodicity. Major peaks occur at approximately frame indices 40 and 55, indicating frames with pronounced high-frequency content or noise. Lower ZCR values around frame indices 10 and 75 mark more stable, less noisy regions of the signal. The ZCR plot thus offers a view of the temporal evolution of the signal and of how its frequency characteristics change over time.
Figure 19 shows the Root Mean Square (RMS) value of the audio signal over time. The x-axis represents time (frame index) and the y-axis the RMS value, a measure of signal energy or loudness. The plot contains many high peaks and low troughs, meaning the signal's energy fluctuates considerably. A prominent peak occurs near frame index 20, marking a region of the audio with substantial energy, and further large peaks at frame indices of roughly 40 and 55 indicate additional high-energy segments. After these peaks the RMS level gradually decreases, corresponding to quieter, lower-energy sections. The RMS plot is useful for understanding the dynamic range and loudness fluctuations of the signal over time, showing where energy rises and falls.
Figure 20 shows the Mel-Frequency Cepstral Coefficients (MFCCs) of the audio signal over time as a heatmap. The x-axis gives time in seconds and the y-axis the individual MFCC coefficients, which play an important role in speech analysis and recognition. The colour of each cell encodes the sign and magnitude of a coefficient, with yellow/green indicating large positive values and purple indicating large negative values. The patterns and variations visible across the length of the signal capture properties of speech that are vital for recognizing both phonetic content and emotion. This representation summarizes the temporal and spectral facets of the audio and provides features that are readily used in tasks such as emotion detection and speech recognition.
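The frame-level features shown in Figures 18 through 20 can be extracted with librosa; in the sketch below the file path, frame length, and hop length are illustrative assumptions.

```python
# Sketch: ZCR, RMS, and MFCC extraction for a single audio file.
import librosa

y, sr = librosa.load("speech_sample.wav", sr=22050)

zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)[0]
rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)[0]
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=512)

# Per-frame ZCR and RMS vectors, plus a (13, n_frames) MFCC matrix
print(zcr.shape, rms.shape, mfcc.shape)
```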
4.4 Results
Figure 21: Line graph showing model accuracy on the training and validation sets over 20 epochs, with the number of epochs on the x-axis and accuracy on the y-axis. The blue line shows the training accuracy, which rises over time with only minor fluctuations. The orange line shows the validation accuracy, which initially follows a similar curve but begins to decline after about 6 epochs. The separation of the two curves after the first few epochs suggests overfitting: the model keeps improving on the training set but not on the validation set. Overall, the pattern indicates that adjustments to training may be needed to improve generalization.
Figure 22: Line plot of model loss over 20 epochs for the training and validation sets, with the epoch number on the x-axis and the loss on the y-axis. The blue line shows the training loss, which drops steeply from around 60 to below 10 within a few epochs and then plateaus. The orange line shows the validation loss, which starts low and stays nearly flat around 0 throughout training. The rapid decline of the training loss toward zero indicates that the model learns the training data very quickly. A validation loss that is flat from the start, however, points to one of several issues: the training and validation data may differ substantially, or the model may be overfitting, performing well on the training set while failing to generalize, so that the validation curve barely changes. The pattern is a warning sign: the model fits the training data well, but further investigation is needed to confirm that it generalizes to new data.
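Curves such as those in Figures 21 and 22 are typically drawn from the History object returned by Keras' model.fit; the helper below assumes the model was compiled with metrics=["accuracy"], so that the corresponding keys exist in history.history.

```python
# Sketch: plotting training vs. validation accuracy and loss from a Keras History.
import matplotlib.pyplot as plt

def plot_history(history):
    """Plot accuracy and loss curves for training and validation sets."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    ax1.plot(history.history["accuracy"], label="train")
    ax1.plot(history.history["val_accuracy"], label="validation")
    ax1.set_xlabel("Epoch"); ax1.set_ylabel("Accuracy"); ax1.legend()

    ax2.plot(history.history["loss"], label="train")
    ax2.plot(history.history["val_loss"], label="validation")
    ax2.set_xlabel("Epoch"); ax2.set_ylabel("Loss"); ax2.legend()

    plt.tight_layout()
    plt.show()

# Usage (illustrative): plot_history(model.fit(X_train, y_train,
#     validation_data=(X_val, y_val), epochs=20))
```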
Figure 23 is a confusion matrix summarizing how the classification model performed on the seven emotion classes: angry, disgust, fear, happy, neutral, sad, and surprise. Predicted labels are on the x-axis and true labels on the y-axis. Each cell gives the number of times a given true label was assigned a particular predicted label, with colour intensity proportional to the count. The diagonal elements, which should ideally be large, indicate correct classifications; the reported diagonal counts are Disgust (74), Fear (86), Happy (76), Neutral (74), Sad (84), and Surprise (80). The off-diagonal cells show that several emotions are frequently misclassified as "Disgust" (for example, 37 angry samples), hinting at a noticeable bias toward predicting that class. This pattern reveals where the model's performance differs across classes and suggests where classification of these emotions could be improved.
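A confusion matrix like Figure 23 can be computed and plotted with scikit-learn; the label list and the toy y_true/y_pred values below are placeholders for the actual test labels and model predictions.

```python
# Sketch: building and plotting a confusion matrix for the seven emotion classes.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

emotions = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
y_true = ["angry", "disgust", "fear"]     # toy ground-truth labels
y_pred = ["angry", "disgust", "disgust"]  # toy model predictions

cm = confusion_matrix(y_true, y_pred, labels=emotions)
ConfusionMatrixDisplay(cm, display_labels=emotions).plot(xticks_rotation=45)
plt.tight_layout()
plt.show()
```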
4.5 Comparison of Proposed and Existing Work
Table 2. Comparison of the proposed and existing work
Models | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
Probabilistic Neural Network [23] | 95.56 | 94.29 | 95.84 | 94.68
Long Short-Term Memories [24] | 97.1 | 96.75 | 96.85 | 96.98
One-dimensional deep convolutional neural network (1-D DCNN) [25] | 93.31 | 93.21 | 93.08 | 93.14
Gaussian Mixture Model (GMM) [26] | 74.33 | 73.69 | 73.58 | 74.28
Deep Learning Transfer Models [27] | 86.54 | 86.89 | 85.28 | 85.98
Proposed CNN + Bi-LSTM | 98.48 | 97.25 | 98.29 | 97.39
Table 2 and Figure 24 summarize the performance of several emotion recognition models in terms of accuracy, precision, recall, and F1-score. The Probabilistic Neural Network [23] achieves 95.56% accuracy, 94.29% precision, 95.84% recall, and an F1-score of 94.68%. Long Short-Term Memory (LSTM) [24] performs somewhat better, with 97.1% accuracy, 96.75% precision, 96.85% recall, and an F1-score of 96.98%. The one-dimensional deep convolutional neural network (1-D DCNN) [25] reaches 93.31% accuracy, 93.21% precision, 93.08% recall, and an F1-score of 93.14%. The Gaussian Mixture Model (GMM) [26] shows the lowest performance, with 74.33% accuracy, 73.69% precision, 73.58% recall, and an F1-score of 74.28%. The deep learning transfer models [27] achieve a fair level of performance, with 86.54% accuracy, 86.89% precision, 85.28% recall, and an F1-score of 85.98%. The proposed CNN + Bi-LSTM model performs best overall, with 98.48% accuracy, 97.25% precision, 98.29% recall, and an F1-score of 97.39%, indicating that the proposed architecture is more capable on the emotion recognition task than the compared models, although it should still be evaluated on entirely new datasets.
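For reference, metrics of the kind reported in Table 2 can be computed with scikit-learn; y_true and y_pred below are toy placeholders for the test labels and model predictions, and the weighted averaging is an assumption, since the averaging scheme is not stated in the text.

```python
# Sketch: computing accuracy, precision, recall, and F1-score for a multi-class task.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1]  # toy ground-truth class indices
y_pred = [0, 1, 2, 1, 1]  # toy predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="weighted"))
print("Recall   :", recall_score(y_true, y_pred, average="weighted"))
print("F1-Score :", f1_score(y_true, y_pred, average="weighted"))
```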