Technical Route
The first phase proposes an automatic image classification model for pain expression based on the ResNet50 network. This approach addresses the shortcomings of traditional machine learning in facial expression recognition, including the high complexity of manual feature extraction, as well as the limitations of conventional deep learning, such as network degradation with increasing depth. The model flow chart is depicted in Fig. 1. Faces are detected in video frames using Multi-Task Cascaded Convolutional Neural Networks (MTCNN). The detected, pre-processed images are then input into the ResNet50 network, which is trained and tuned for pain level classification using Bayesian optimization. Once the performance requirements are met, voice playback and data recording functions are implemented. The Pain Expression Evaluation Assistant (PEEA) is an automated software system that analyzes facial pain expressions to detect and classify features relevant to pain assessment.
Multi-Task Cascaded Convolutional Neural Networks (MTCNN)
In engineering practice, the MTCNN algorithm is renowned for its high detection speed and accuracy. MTCNN employs a three-layer cascade architecture to accomplish face detection and key point localization in images[14]. MTCNN comprises three networks: P-Net, R-Net, and O-Net.
The three stages of MTCNN can be summarized as follows. The first network, P-Net, builds an image pyramid by scaling the input image to multiple levels and slides a 12×12 window with a stride of 2 over each scale; the downscaled images allow large faces to be detected, while the larger images allow small faces to be detected. All candidate face boxes undergo Non-Maximum Suppression (NMS), are mapped back to the original image size, and are padded along the short side into 24×24 squares. The second network, R-Net, processes these 24×24 face candidates with a CNN to obtain more precise face boxes, which again undergo NMS and are resized to 48×48 squares. The third network, O-Net, takes the 48×48 candidates as input and outputs refined bounding boxes, five facial landmark positions, and confidence scores; a final NMS step on these boxes yields the required face box.
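As a point of reference, the sketch below shows how MTCNN face detection of this kind is commonly run in Python; the `facenet_pytorch` package, the video path, and the confidence threshold are illustrative assumptions rather than details taken from this study.

```python
# Illustrative sketch only: detect faces and landmarks in video frames with MTCNN.
# The facenet_pytorch package, file path, and threshold below are assumptions.
import cv2
from facenet_pytorch import MTCNN

mtcnn = MTCNN(keep_all=True)  # cascaded P-Net / R-Net / O-Net under the hood

cap = cv2.VideoCapture("patient_video.mp4")  # hypothetical input video
faces = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    # boxes: (x1, y1, x2, y2); probs: confidences; landmarks: five facial key points
    boxes, probs, landmarks = mtcnn.detect(rgb, landmarks=True)
    if boxes is None:
        continue
    for box, prob in zip(boxes, probs):
        if prob is not None and prob > 0.95:  # assumed confidence cut-off
            x1, y1, x2, y2 = [int(v) for v in box]
            faces.append(rgb[max(y1, 0):y2, max(x1, 0):x2])
cap.release()
```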
ResNet 50 Convolutional Neural Network
A transfer learning approach utilizing ResNet50 is employed in this study. The method involves pre-training a convolutional neural network (CNN) on a large existing dataset and then transferring the pre-trained CNN to a target dataset for fine-tuning. The proposed model is initially pre-trained on the large-scale ImageNet dataset [15]. All layers except the Softmax layer are initialized with the pre-trained model parameters, as opposed to traditional random initialization [16]. A new Softmax layer is added to handle the dataset used in this study. This transfer learning method offers several advantages, including superior generalization performance, considerable depth, high accuracy, and good convergence [17].
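A minimal sketch of this transfer learning setup in PyTorch is given below; the torchvision weights, the four-class output head, and the choice of which layers to fine-tune are assumptions made for illustration, not the exact configuration used by the authors.

```python
# Illustrative sketch: ResNet50 pre-trained on ImageNet, with a new 4-class head.
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # mild, obvious, severe, intense pain

# Load ImageNet pre-trained weights instead of random initialization.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Replace the final fully connected layer (feeding Softmax) for the target dataset.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Optionally freeze early layers and fine-tune only the last block and the new head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("layer4", "fc"))
```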
Bayesian optimization
Bayesian optimization is an algorithm that uses Bayesian probability to search for the optimal value of an objective function, with the probabilistic surrogate model and acquisition function being the key components [18]. The most widely used probabilistic surrogate model is the Gaussian process, and the acquisition function is based on the posterior probability of the objective function. The goal of the Bayesian optimization algorithm is to minimize the total loss \(r\), and this is achieved by selecting the evaluation point \(x_i\) using the acquisition function, which is formulated as follows:
where \(X\) is the decision space, \(\lambda(x, D_{1:i})\) is the acquisition function, and \(y^*\) is the optimal solution [19].
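A plausible reconstruction of this formulation, in standard Bayesian optimization notation and using the symbols defined above, is:

```latex
% Reconstruction in standard notation (an assumption, not necessarily the original form):
x_i = \arg\max_{x \in X} \lambda\left(x, D_{1:i-1}\right),
\qquad
r_N = \sum_{i=1}^{N} \left( y^{*} - y_i \right)
```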
The implementation of Bayesian optimization followed the procedure outlined below [20] (a minimal code sketch follows the list):
(1) Determine the maximum number of iterations \(N\).
(2) Use the acquisition function to obtain the evaluation point \(x_i\).
(3) Evaluate the objective function value \(y_i\) at the evaluation point \(x_i\).
(4) Update the probabilistic surrogate model by integrating the new data \(D_i\).
(5) If the current number of iterations \(n\) is less than the maximum number of iterations \(N\), return to step (2) and continue iterating; otherwise, output \(x_i\).
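The sketch below mirrors steps (1)-(5) with a Gaussian-process surrogate and an expected-improvement acquisition function; the scikit-learn/SciPy implementation, the candidate sampling scheme, and the toy objective are illustrative assumptions rather than the exact procedure used in the study.

```python
# Illustrative sketch of the loop in steps (1)-(5): GP surrogate + expected improvement.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    # Hypothetical 1-D objective standing in for the validation loss.
    return np.sin(3 * x) + 0.1 * (x - 1.0) ** 2

def expected_improvement(gp, candidates, y_best):
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma            # improvement over the best observed value
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
N = 60                                   # (1) maximum number of iterations
X = rng.uniform(-2, 2, size=(3, 1))      # a few initial evaluation points
y = objective(X).ravel()

for n in range(N):
    gp = GaussianProcessRegressor().fit(X, y)              # (4) update the surrogate with D_i
    candidates = rng.uniform(-2, 2, size=(500, 1))
    ei = expected_improvement(gp, candidates, y.min())
    x_i = candidates[np.argmax(ei)].reshape(1, -1)          # (2) pick x_i via the acquisition
    y_i = objective(x_i).ravel()                            # (3) evaluate the objective y_i
    X, y = np.vstack([X, x_i]), np.concatenate([y, y_i])    # (5) loop until n reaches N

print("best x:", X[np.argmin(y)], "best y:", y.min())
```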
Experiment
Database Construction
In order to construct the pain expression database, this study followed established methods and schemes. First, the researchers recorded videos of elderly patients expressing different degrees of pain. Next, key frames capturing the required expressions were extracted from the videos. Experienced orthopedic specialist nurses and doctors evaluated and classified these key frames. Finally, images with consistent ratings were selected and included in the database.
Video collection
Patients with hip fractures attending the inpatient facility of a hospital’s hip joint trauma department participated in the video collection. The Hospital Research Ethics Committee approved the study protocol, and patients provided informed consent before participation. The study aim, subjects’ rights, and investigators’ obligations were explained to the respondents using uniform language. The privacy of the participants was ensured throughout the study.
For image and video acquisition, a clinical nurse used a camera to capture the patient's facial expressions. The camera was positioned 1–1.5 meters from the patient's face, and each video lasted 20–25 seconds. During this period, patients were not asked to mask or suppress their expressions.
The inclusion criteria for the study were: (1) radiological diagnostic criteria[21] of hip fracture in the elderly; (2) consciousness and understanding; (3) age > 65; (4) written informed consent.
The exclusion criteria were: (1) neurological diseases, such as Alzheimer's disease; (2) deafness, aphasia, and other symptoms that hindered communication.
Data Acquisition
The method involved sequentially recording videos of elderly individuals expressing different degrees of pain. The required key expression frames were extracted from these videos, and experienced nurses and doctors evaluated and classified them. The acquired images were labeled according to the criteria of mild, obvious, severe, and intense pain. To ensure the reliability of the database, only images for which multiple evaluators' scores were highly consistent were selected and included. The recorded scenarios covered the following activities: turning over and patting the back, moving from a flat cart to the bed, lower limb flexion and extension training, straight leg raising training, and bedside standing and walking training.
Key Frame Capture
To focus on facial expression recognition algorithm research and minimize the influence of other factors, the original images were normalized. This involved three steps: rotation correction, image cropping, and scale normalization. The goal was to correct for background interference and angle offsets caused by the shooting environment and changes in face pose. This ensured that the face of the patient in the image was upright with both eyes in a horizontal state. The method removed redundant background information as much as possible, retaining only the effective facial region containing expressions. The center points of the two eyes were manually marked, and the image was rotated using the axis connecting the center points of the eyes as a reference. The center points of the eyes were then adjusted to the same horizontal line to eliminate angle deviation. Next, the patient's facial region was manually cropped from the corrected image. After evaluating and selecting from the video key frames, a database of facial pain expressions of elderly patients was established, consisting of 4,538 images, including 2,247 images of mild pain (VAS: 1–3), 735 images of obvious pain (VAS: 4–6), 729 images of severe pain (VAS: 7–8), and 827 images of intense pain (VAS: 9–10), in accordance with guidelines for pain management[1].
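A minimal sketch of this normalization pipeline is shown below, assuming OpenCV and manually supplied eye-center coordinates; the crop margins and the output size are illustrative assumptions.

```python
# Illustrative sketch: rotation correction by the eye axis, cropping, and scale normalization.
# Eye coordinates are assumed to be supplied manually (or by MTCNN landmarks).
import math
import cv2

def normalize_face(image, left_eye, right_eye, face_box, out_size=224):
    # 1) Rotation correction: rotate so the two eye centers lie on a horizontal line.
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = math.degrees(math.atan2(ry - ly, rx - lx))
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)
    rot = cv2.getRotationMatrix2D(center, angle, 1.0)
    h, w = image.shape[:2]
    upright = cv2.warpAffine(image, rot, (w, h))

    # 2) Image cropping: keep only the facial region, discarding background.
    x, y, bw, bh = face_box                      # (x, y, width, height) in the rotated image
    face = upright[max(y, 0):y + bh, max(x, 0):x + bw]

    # 3) Scale normalization: resize every face to a fixed square input size.
    return cv2.resize(face, (out_size, out_size))

# Hypothetical usage with manually marked eye centers and a face box:
# img = cv2.imread("keyframe.png")
# norm = normalize_face(img, left_eye=(180, 210), right_eye=(260, 205),
#                       face_box=(120, 140, 220, 260))
```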
Data Augmentation
After randomizing the entire dataset, it was partitioned into three sets: the training set (60%), validation set (20%), and test set (20%). The training dataset underwent rotation, cropping, and translation to increase the number of training samples and improve the model's robustness and generalization performance.
In computer vision, a model's performance depends not only on the data and the network structure but also on the training strategy, including the optimizer, loss function, regularization technique, and data augmentation. This study therefore applied data augmentation (Table 1) to improve the accuracy of model training.
Table 1
Comparison of Model Accuracy Before and After Data Augmentation

|                          | Accuracy/% |
| Before data augmentation | 67.62      |
| After data augmentation  | 73.99      |
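As a concrete illustration of the augmentation applied to the training set, the sketch below composes rotation, cropping, and translation with torchvision; the specific angles, crop size, and shift magnitudes are assumptions, not the values used in the study.

```python
# Illustrative sketch: training-time augmentation with rotation, cropping, and translation.
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=10),                     # small random rotation (assumed range)
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),       # random crop back to the input size
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # random horizontal/vertical translation
    transforms.ToTensor(),
])

# Only the training set is augmented; validation and test images are just resized and tensorized.
eval_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```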
BO-ResNet 50 network
Furthermore, Bayesian optimization was employed to determine the optimal hyperparameters, including the initial learning rate, with a maximum of 60 iterations guided by the acquisition function. Table 2 shows the decision space for the optimization process.
Table 2
Hyperparameters and their ranges for the improved ResNet50

| Hyperparameter     | Minimum value | Maximum value |
| Initial learn rate | 1×10⁻²        | 1             |
| Momentum           | 0.8           | 0.98          |
| L2 regularization  | 1×10⁻¹⁰       | 1×10⁻²        |
| Optimizer          | RMSprop, Adam, SGDM (categorical) |
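The decision space in Table 2 could be encoded as below; the log-uniform sampling for the learning rate and L2 term and the dictionary layout are assumptions made for illustration.

```python
# Illustrative sketch: the Table 2 decision space and a random draw from it.
import numpy as np

SEARCH_SPACE = {
    "initial_learn_rate": (1e-2, 1.0),    # sampled log-uniformly (assumption)
    "momentum": (0.8, 0.98),              # sampled uniformly
    "l2_regularization": (1e-10, 1e-2),   # sampled log-uniformly (assumption)
    "optimizer": ["rmsprop", "adam", "sgdm"],
}

def sample_hyperparameters(rng=np.random.default_rng()):
    def log_uniform(lo, hi):
        return float(np.exp(rng.uniform(np.log(lo), np.log(hi))))
    return {
        "initial_learn_rate": log_uniform(*SEARCH_SPACE["initial_learn_rate"]),
        "momentum": float(rng.uniform(*SEARCH_SPACE["momentum"])),
        "l2_regularization": log_uniform(*SEARCH_SPACE["l2_regularization"]),
        "optimizer": str(rng.choice(SEARCH_SPACE["optimizer"])),
    }

print(sample_hyperparameters())
```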
Model evaluation criteria
In this study, the classification and generalization ability of the model for recognizing pain expressions in elderly patients with hip fracture were assessed based on the prediction accuracy and the model training time [22]. The prediction accuracy \({P}_{1}\) is calculated as
\({P}_{1}=\frac{{m}_{TA}}{{n}_{T}}\times 100\%\)
where \({n}_{T}\) is the number of samples in the test set used to validate the model, and \({m}_{TA}\) is the number of accurately classified samples in the test set.
Pain Level Prediction
The normalized image is input into the trained BO-ResNet50 model, and the Softmax classifier receives the feature vector output by the fully connected layer and outputs the probability of each category for the input object. Assume there are \(N\) input objects \({\left\{{X}_{i},{Y}_{i}\right\}}_{i=1}^{N}\), where the label of each object is \({y}_{i}\in \left\{\text{1,2},\dots ,k\right\}\) and \(k\) is the number of model output categories (\(k\ge 2\)) [22]. Since a four-level classification (1, 2, 3, 4) is performed for the pain expression, \(k\) is 4. For an input \({X}_{i}\), the hypothesis function \(f\left({X}_{i}\right)\) is used to estimate the probability \(P\left(y=j\mid {X}_{i}\right)\) of its corresponding class \(j\), given by:
\(P\left(y=j\mid {X}_{i}\right)=\frac{{e}^{{f}_{j}\left({X}_{i}\right)}}{\sum _{l=1}^{k}{e}^{{f}_{l}\left({X}_{i}\right)}}\)
The pain level is determined by taking the label category with the highest probability in the Softmax output. To avoid false detections and improve system stability, a detected label must persist for a stable, continuous number of frames before it is used as the output of the pain level determination. The collected video data are organized into a folder and transmitted to a dedicated computer, where the automatic pain expression classification system is run in video detection mode. The report is generated based on the pain expression with the longest duration in the video (Figs. 2 and 3).
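A minimal sketch of this frame-level decision rule is given below; the stability window of 15 frames and the structure of the per-frame predictions are assumptions for illustration.

```python
# Illustrative sketch: per-frame labels are accepted only after a run of stable frames,
# and the report uses the pain level with the longest total stable duration in the video.
from collections import Counter

STABLE_FRAMES = 15  # assumed number of consecutive identical frames required

def stable_labels(frame_labels, window=STABLE_FRAMES):
    """Yield the label of each frame that lies in a run of at least `window` identical frames."""
    run_label, run_length = None, 0
    for label in frame_labels:
        run_length = run_length + 1 if label == run_label else 1
        run_label = label
        if run_length >= window:
            yield label  # label is considered reliable from this point of the run onward

def video_pain_level(frame_labels):
    """Report the pain level with the longest stable duration in the whole video."""
    accepted = list(stable_labels(frame_labels))
    if not accepted:
        return None
    return Counter(accepted).most_common(1)[0][0]

# Hypothetical per-frame output of the BO-ResNet50 + Softmax classifier (levels 1-4):
frames = [1] * 40 + [2] * 5 + [3] * 120 + [2] * 30
print(video_pain_level(frames))  # -> 3
```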
Measure
The study involved 15 nurses, each with more than five years of clinical pain assessment experience, who underwent a training session. The session covered the structure of the Pain Expression Evaluation Assistant (PEEA) report and included three videos to demonstrate the process. The trainer, who was familiar with the outputs of PEEA, provided guidance on where to find relevant information in the graphical outputs but did not interpret any videos or images, so that the nurses could apply their own medical expertise.
During the session, the nurses were instructed to rate the pain level based solely on their visual inspection of the patients' condition. To ensure accurate assessment and avoid reader fatigue, the nurses were allowed to take an unlimited amount of time to complete the assessments at their convenience. The researchers then used the same videos to assess pain levels using the PEEA system.
Statistical Analysis
Agreement Rates
Agreement rates between nurses and PEEA were assessed using the intraclass correlation coefficient (ICC), assuming random effects for the nurses. 95% confidence intervals were calculated according to the original derivations by Shrout and Fleiss [23]. Standard errors of the mean for the ICCs were estimated by resampling the observations with replacement (bootstrap) 1,000 times. We selected the two-way model in which the same raters rated all subjects.
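A sketch of this analysis in Python is given below, assuming the `pingouin` package for the ICC computation and a long-format table of ratings; the column names and the choice of the ICC2 (two-way random effects, single rater) row are illustrative assumptions.

```python
# Illustrative sketch: ICC between raters with a bootstrap standard error (1,000 resamples).
# Assumes a long-format DataFrame with columns: 'video', 'rater', 'score'.
import numpy as np
import pandas as pd
import pingouin as pg

def icc2(df):
    table = pg.intraclass_corr(data=df, targets="video", raters="rater", ratings="score")
    # ICC2: two-way random effects, absolute agreement, single rater.
    return float(table.loc[table["Type"] == "ICC2", "ICC"].iloc[0])

def bootstrap_se(df, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    videos = df["video"].unique()
    estimates = []
    for _ in range(n_boot):
        sampled = rng.choice(videos, size=len(videos), replace=True)
        # Rename resampled subjects so duplicates are treated as distinct targets.
        boot = pd.concat(
            [df[df["video"] == v].assign(video=f"{v}_{k}") for k, v in enumerate(sampled)]
        )
        estimates.append(icc2(boot))
    return float(np.std(estimates, ddof=1))

# Hypothetical usage:
# ratings = pd.read_csv("ratings_long.csv")   # nurses' and PEEA's scores per video
# print(icc2(ratings), bootstrap_se(ratings))
```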