RQ1. Which algorithm is most commonly used for emotion detection in images?
Table 4 shows which deep learning algorithms the surveyed studies use for emotion recognition.
Table 4
Algorithm by Research Paper
Algorithm | References |
CNN | [17], [18], [19], [20], [21], [22], [5], [23], [24], [25], [26], [27], [28], [4], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38] |
CNN + GoogLeNet | [39], [40], [41], [42] |
CNN + VGG-19 | [43], [44], [27], [28], [5], [42] |
CNN + VGG-19 + ResNet-50 | [45], [46], [4], [28], [41], [47] |
CNN + FCN | [48], [49] |
TDNN | [50] |
R-CNN | [51], [52], [53] |
CBAM + ResNet | [54], [55] |
CNN + RNN (LSTM/BiLSTM) | [11], [56], [57], [58], [59], [60], [61] |
DNN | [62], [63], [5] |
SVM + VGG | [64] |
SHCNN | [65] |
Based on the collected research on emotion recognition from facial expressions, most studies utilize a Convolutional Neural Network (CNN) [66]. VGG, ResNet-34, GoogLeNet, and R-CNN are all architectures developed from the CNN [66][67][68]. However, a CNN's ability to detect emotions from faces still depends on the dataset, architecture, model, and training techniques. Therefore, to improve detection accuracy, some of the aforementioned studies combine CNN with related techniques such as FCN, DBN, image edge computing, transfer learning, CRBM, and CBAM. The Fully Convolutional Network (FCN) is itself a development of VGG [68][69][36][35].
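For orientation, the sketch below shows the kind of baseline CNN classifier these studies build on. The 48x48 grayscale input and seven emotion classes are illustrative assumptions (common for datasets such as FER2013), not details taken from any single cited paper.

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Minimal CNN for facial expression classification (illustrative only)."""
    def __init__(self, num_classes: int = 7):  # seven emotion classes is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),  # 48x48 grayscale input assumed
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 32 x 24 x 24
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 64 x 12 x 12
        )
        self.classifier = nn.Linear(64 * 12 * 12, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = EmotionCNN()(torch.randn(8, 1, 48, 48))  # batch of 8 face crops
print(logits.shape)  # torch.Size([8, 7])
```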
SHCNN uses Leaky ReLU to avoid the "dead ReLU" problem, which can yield better convergence on the dataset [65]. CNN can also be combined with CRBM and transfer learning; this combination addresses the complexity of feature extraction in the target dataset. Pre-training with CRBM helps overcome content differences among datasets, while replacing the fully connected layer of the CNN with a CRBM during the transfer learning stage enhances the ability to recognize abstract features, particularly for facial expression recognition in environments with complex backgrounds. Experimental results indicate that this hybrid transfer learning approach effectively improves feature recognition on the target dataset [35].
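To illustrate the Leaky ReLU point: a standard ReLU outputs zero for every negative input, so a unit stuck in the negative range receives no gradient and "dies", whereas Leaky ReLU keeps a small negative slope so gradients still flow. The sketch below uses PyTorch's default slope of 0.01, which is not necessarily the value SHCNN uses.

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

relu = nn.ReLU()
leaky = nn.LeakyReLU(negative_slope=0.01)  # 0.01 is the framework default, not SHCNN's value

print(relu(x))   # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])  negatives zeroed -> no gradient
print(leaky(x))  # tensor([-0.0200, -0.0050, 0.0000, 0.5000, 2.0000]) negatives scaled -> gradient survives
```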
CBAM, the Convolutional Block Attention Module, is an attention mechanism designed to enhance a CNN's ability to capture important characteristics and interactions within images [70]. TDNN, the Time Delay Neural Network, is a neural network architecture used in signal processing and speech recognition [71]. Besides the CNN's dominance in emotion recognition across these studies, the Recurrent Neural Network (RNN), specifically its LSTM variant, is also used to build models that recognize emotions from facial expressions [57].
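A compact sketch of CBAM's two attention steps, following the module design described in [70]: channel attention from pooled descriptors passed through a shared MLP, then spatial attention from a 7x7 convolution. The reduction ratio of 16 and the 7x7 kernel are the original paper's defaults, assumed here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: avg- and max-pooled descriptors through a shared MLP."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling
        w = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * w

class SpatialAttention(nn.Module):
    """Spatial attention: a 7x7 conv over stacked channel-wise avg and max maps."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class CBAM(nn.Module):
    """CBAM applies channel attention, then spatial attention, to a feature map."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))

feats = torch.randn(1, 64, 12, 12)   # a CNN feature map
print(CBAM(64)(feats).shape)         # same shape, re-weighted: torch.Size([1, 64, 12, 12])
```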
RQ2. Which algorithm performs best in image processing for recognizing emotions from facial expressions?
Table 5 presents the average emotion recognition accuracy reported for each algorithm.
Table 5
Average Emotion Recognition Accuracy by Algorithm
Algorithm | Average emotion recognition accuracy | References |
CNN | 97.75% | [37] |
CNN + GoogLeNet | 75.09% | [39] |
CNN + VGG-19 | 95.35% | [43] |
CNN + VGG-19 + ResNet-50 | 95.39% | [46], [4] |
CNN + FCN | 80.60% | [49] |
TDNN | 90.00% | [50] |
R-CNN | 82.38% | [72] |
CBAM + ResNet | 88.27% | [55] |
CNN + RNN (LSTM/BiLSTM) | 99.43% | [59] |
According to Table 5, the emotion recognition algorithm with the lowest accuracy is the combination of CNN and GoogLeNet. Its lower accuracy may stem from the quantity of training and testing data used, which can degrade GoogLeNet's performance [39]. On the other hand, the algorithm with the highest accuracy is CNN + RNN (LSTM/BiLSTM). Its authors augment the image data before analysis, which improves the performance of the trained model and helps address overfitting [59].
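As an illustration of the data augmentation just mentioned, the sketch below builds a typical image augmentation pipeline with torchvision; the specific transforms and parameters used in [59] may well differ.

```python
from torchvision import transforms

# Illustrative augmentation pipeline; the exact transforms in [59] are not specified here.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # mirrored faces keep the same label
    transforms.RandomRotation(degrees=10),                # small pose variation
    transforms.ColorJitter(brightness=0.2, contrast=0.2), # lighting variation
    transforms.ToTensor(),
])
# Applying `augment` to each training image yields varied samples per epoch,
# enlarging the effective training set and reducing overfitting.
```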
Based on the analysis above, several factors can influence the accuracy of algorithms for analyzing human emotions: the quantity of training and testing data, data augmentation, and features of the LSTM architecture such as the global feature attention layer. A soft-attention mechanism in a deep learning architecture lets the model focus on the important parts of an image or other input. An attention distribution determines the weight assigned to each hidden state the model produces; these weights are then used to compute a weighted average of the hidden states, which reflects the most relevant information in the processed image or data and improves the model's classification [59].
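A minimal sketch of the soft-attention computation just described: one relevance score per hidden state, a softmax attention distribution, and a weighted-average context vector. Dot-product scoring against a learned query vector is one common parameterization, assumed here for illustration only.

```python
import torch
import torch.nn.functional as F

def soft_attention(hidden_states: torch.Tensor, query: torch.Tensor):
    """Weighted average of hidden states under a softmax attention distribution.

    hidden_states: (seq_len, dim) outputs of, e.g., an LSTM
    query:         (dim,) attention query vector (illustrative parameterization)
    """
    scores = hidden_states @ query      # one relevance score per hidden state
    weights = F.softmax(scores, dim=0)  # attention distribution (sums to 1)
    context = weights @ hidden_states   # weighted average -> most relevant information
    return context, weights

h = torch.randn(5, 16)                  # 5 hidden states of size 16
ctx, w = soft_attention(h, torch.randn(16))
print(w.sum().item())                   # 1.0
```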
RQ3. What are the applications of deep learning in image processing for emotion recognition?
Table 6 presents the applications of deep learning algorithms reported in the relevant paper references.
Table 6
Application of Algorithm Based on Research
Application | Forms of Application | References |
Education | Emotion recognition is used to determine students' interest in a particular subject. | [11], [73] |
Robotics | Face and emotion recognition embedded into a robot | [13], [61] |
Automotive | Monitoring the driver's emotional state while driving | [74] |
From the table above, several applications of facial detection technology in everyday life can be further explained as follows:
In the field of education, research divides into two parts based on the online and onsite learning models. In the onsite model, surveillance cameras installed inside the classroom record the actions and expressions of students. Guiping Yu applied a more comprehensive method to identify students' emotions, utilizing information from their faces, body movements, and contextual cues to enhance facial emotion recognition. Face identification and pre-processing were performed on a dataset of images. To gather continuous video data, surveillance cameras were installed in the classrooms where the students were present, and frames were extracted from these videos at a fixed rate (frames per second, FPS). The facial images were then cropped and underwent pre-processing steps such as face localization, alignment, grayscale conversion, and scale normalization. Because the captured images can be low-quality or noisy, these pre-processing steps are crucial for the expression recognition system. Compared to static images, detecting faces in video surveillance scenarios presents greater challenges [11].
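A minimal sketch of the pre-processing pipeline this paragraph describes (frame sampling, face localization, cropping, grayscale conversion, and scale normalization) using OpenCV. The Haar cascade detector, the file name classroom.mp4, the one-frame-per-second sampling, and the 48x48 target size are all illustrative assumptions, not details from [11].

```python
import cv2

# Stock Haar cascade as a stand-in face detector; [11] does not necessarily use this one.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture("classroom.mp4")            # hypothetical surveillance recording
step = max(int(cap.get(cv2.CAP_PROP_FPS)), 1)      # sample ~1 frame/second (assumption)

frame_idx, faces_out = 0, []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % step == 0:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)    # grayscale conversion
        for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
            face = gray[y:y + h, x:x + w]                 # crop (face localization)
            faces_out.append(cv2.resize(face, (48, 48)))  # scale normalization (size assumed)
    frame_idx += 1
cap.release()
```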
On the other hand, in the online learning model, activity is recorded as usual through the camera on each student's device, and the recorded data is then analyzed and identified. Research in this area was also conducted by Swadha Gupta and colleagues, who designed a system that uses facial emotions to detect student engagement in real-time scenarios. The application scenario includes the following steps:
- First, facial emotion information captured by the camera of each student's device is used to evaluate online student engagement.
- Face detection is performed using a pre-trained Faster R-CNN model.
- A modified landmark extractor called MFACEXTOR extracts 470 facial landmark points, or key points.
- For real-time learning scenarios, deep learning models such as Inception-V3, VGG19, and ResNet-50 classify student emotions such as anger, sadness, happiness, neutrality, and others using the softmax function.
- An engagement evaluation algorithm is proposed that uses the output of the facial emotion classification to determine an engagement index.
- The system determines online student engagement based on the engagement index value.
This research focuses on using facial emotions to detect student engagement in real-time learning scenarios, employing various deep learning models and algorithms for emotion classification and engagement evaluation [73].
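The engagement evaluation algorithm of [73] is not reproduced here, but the general idea of mapping softmax emotion probabilities to an engagement index can be sketched as follows; the emotion-to-weight mapping below is entirely hypothetical.

```python
# Hypothetical weights: how strongly each emotion signals engagement.
# These values are illustrative, not the mapping used in [73].
ENGAGEMENT_WEIGHTS = {
    "happiness": 1.0,
    "neutrality": 0.7,
    "surprise": 0.6,
    "sadness": 0.3,
    "anger": 0.1,
}

def engagement_index(probs: dict) -> float:
    """Weighted sum of softmax emotion probabilities -> engagement score in [0, 1]."""
    return sum(ENGAGEMENT_WEIGHTS[e] * p for e, p in probs.items())

# Example softmax output from the emotion classifier for one face:
softmax_output = {"happiness": 0.55, "neutrality": 0.25, "surprise": 0.05,
                  "sadness": 0.10, "anger": 0.05}
print(round(engagement_index(softmax_output), 3))  # 0.79 -> compared against a threshold
```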
Furthermore, in the field of robotics, Tzuu-Hseng S. Li et al. underscore the crucial role of emotion recognition in advancing human-robot interaction (HRI). The authors elaborate on how emotions, involving cognitive appraisal, body language, action tendencies, expressions, and feelings, are integral elements of human interaction, allowing individuals to convey thoughts without words. To address the challenge of classifying facial expressions under different conditions, they propose the Facial Action Coding System (FACS) as a practical solution. FACS measures human facial movement based on muscle actions, decomposing facial expressions into component actions that can be applied further. The study uses the six basic emotions (happiness, anger, disgust, fear, sadness, and surprise) as the foundation for emotion recognition. To enhance human-robot interaction, the authors propose an emotion recognition system based on deep neural networks: a Convolutional Neural Network (CNN) trained on static images, and a Long Short-Term Memory (LSTM) network to capture temporal and contextual information in dynamic facial expressions. Transfer learning is introduced to overcome the limitations of traditional machine learning. The research results in a CNN- and LSTM-based model for facial emotion recognition that incorporates transfer learning and is validated through experiments with a humanoid robot [61].
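A minimal sketch of the CNN-plus-LSTM structure described above: a small CNN extracts features from each frame, and an LSTM aggregates them over time before a six-class (basic emotions) prediction. All layer sizes, the 48x48 grayscale frames, and the clip length are illustrative assumptions, not the architecture of [61].

```python
import torch
import torch.nn as nn

class CNNLSTMEmotion(nn.Module):
    """Per-frame CNN features fed to an LSTM; the final hidden state classifies the clip.
    Layer sizes and the six-class output (basic emotions) are illustrative choices."""
    def __init__(self, num_classes: int = 6):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),                    # 32 * 12 * 12 = 4608 for 48x48 input
        )
        self.lstm = nn.LSTM(input_size=4608, hidden_size=128, batch_first=True)
        self.head = nn.Linear(128, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, 1, 48, 48)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)  # CNN on every frame
        _, (h_n, _) = self.lstm(feats)                        # temporal context
        return self.head(h_n[-1])

logits = CNNLSTMEmotion()(torch.randn(2, 10, 1, 48, 48))  # 2 clips of 10 frames
print(logits.shape)  # torch.Size([2, 6])
```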
Beyond education and robotics, deep learning for emotion recognition is also applied in the automotive field. Mira Jeong and other researchers build on an earlier technology, the Advanced Driver Assistance System (ADAS), with a research focus on autonomous vehicles. Autonomous vehicles benefit from a Driver State Monitoring (DSM) system. DSM research falls into three types: 1) analysis of driving patterns based on movement data; 2) evaluation of psychophysiological states based on sensor information; and 3) analysis of in-vehicle images from camera sensors. The last method uses facial images captured by cameras installed in the vehicle to identify the driver's condition. This image-based DSM approach aims to comprehensively recognize the driver's emotional state and prevent accidents caused by fatigue or drowsiness, while also making driving more comfortable. The driver's status, as monitored by DSM, can also indicate the appropriate timing for transitioning control from autonomous mode to manual mode when needed. Additionally, detecting the driver's facial expressions in autonomous vehicles can prevent passengers from getting motion sickness and create a more comfortable journey by adjusting the vehicle's ambiance according to the driver's facial expressions [74].
Thus, deep learning technology has a tremendous positive impact on human life, and with further development, emotion recognition technology in particular will undoubtedly benefit even more areas of life.