In recent years, benefiting from the growth of surveillance systems for both public and personal use, pose estimation and detection methods have developed rapidly to meet the emerging needs of various industries. University students exhibit many problem behaviors in class, such as sleeping, playing with mobile phones, and chatting. These inappropriate behaviors reduce students' classroom learning, and classroom learning efficiency is an important factor in academic performance. Therefore, student pose estimation and detection in the university classroom is an application of computer vision technology to education with significant research value and practical importance.
There are two candidate approaches for this task. One is based on object detection algorithms. The papers [1][2] used an improved Faster R-CNN [3] model as the object detector to recognize student postures in the classroom, and the paper [4] detects only the sleeping pose, based on an improved R-FCN [5]. These methods can detect poses in small, low-quality classroom images, where students are densely seated and the collected pictures have low resolution. However, using object detection methods to estimate the pose of each individual in this setting is hindered by two main obstacles. First, these methods scale poorly: to detect a new pose, the entire network must be retrained. Second, they identify only poses that differ markedly from one another, such as sitting and standing; less distinct poses, such as reading, chatting, and raising a hand, generally cannot be recognized.
The other approach is based on pose estimation networks. The paper [6] uses OpenPose [7] to estimate the locations of human body keypoints, and then a classifier to classify the collected keypoints. This method is easy to implement and fast to compute, but the accuracy of its results is low. The paper [8] uses pose estimation maps (heatmaps), a byproduct of pose estimation, to recognize human actions. This method is effective for behaviors with large movements, but it struggles to learn behaviors with small movements.
Because classrooms contain many students, and human bodies are occluded by desks or by other students, there are four main obstacles to recognizing students' postures in the classroom.
- Estimation of human body keypoints has a high error rate and low accuracy.
- Some joints are invisible to the camera due to heavy occlusion, so keypoint estimation must rely on only a few unreliable features; this makes the groveling posture particularly hard to recognize.
- Most Top-Down pose estimation methods are computationally slow, so the results cannot be displayed in real time.
- If only an object detection method is used to detect poses in the classroom, it either detects only a single gesture or scales poorly.
Faced with these difficulties, we propose a classroom pose recognition method that combines a pose estimation algorithm with an object detection algorithm. Our contributions are three-fold: (1) We use the YOLOv3 (You Only Look Once) model [9] as the object detector to detect both persons and students groveling on their desks. (2) We propose an improved HRNet model as the pose estimation algorithm to address the high error rate of human keypoint estimation. We term our improved HRNet SE-HRNet; it is obtained by embedding the SENet [10] structure into the HRNet model [11]. (3) Finally, we design a multi-posture classification network based on a Support Vector Machine (SVM) [12]. Experimental results show that the AP of YOLOv3 on the groveling pose is 91.6 points, the AP of the SE-HRNet model on human body keypoints is 73.7 points, the accuracy of pose classification is 88.6%, and the processing speed of our classroom student posture recognition method is 7 images per second.
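The three components above can be sketched as a simple pipeline: detections labeled "groveling" are reported directly, while "person" detections are passed through keypoint estimation and then posture classification. The function names and box/label conventions below are hypothetical placeholders, not the paper's actual interfaces; this is only a minimal sketch of the control flow.

```python
# Sketch of the proposed pipeline. `detector`, `pose_net`, and `classifier`
# are hypothetical stand-ins for YOLOv3, SE-HRNet, and the SVM classifier.
def recognize_poses(image, detector, pose_net, classifier):
    """Combine object detection, keypoint estimation, and pose classification."""
    results = []
    for label, box in detector(image):  # detector yields (label, box) pairs
        if label == "groveling":
            # Groveling is recognized directly from the object detector,
            # since occlusion makes its keypoints unreliable.
            results.append("groveling")
        elif label == "person":
            # For visible persons, estimate keypoints on the cropped box,
            # then classify the keypoint vector into a posture.
            keypoints = pose_net(image, box)
            results.append(classifier(keypoints))
    return results
```

A detector stub returning one person box and one groveling box would yield one classified posture plus the groveling label, which matches the division of labor described above.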
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the proposed classroom student posture recognition method. Section 4 introduces our dataset and discusses the experimental results in detail. Section 5 concludes the paper.
2. Related work
2.1 Pose estimation methods
At present, multi-person pose estimation methods are divided into two categories:
Top-Down approaches: first detect each human body in the image and crop it, then run a single-person pose estimator on each crop. A single-person estimator is therefore run for every detection, so the computational cost grows with the number of people, but the accuracy of Top-Down methods is usually higher. Common models include CPN [13], Hourglass [14], CPM [15], AlphaPose [16], and so on.
Bottom-Up approaches: first detect all human body keypoints in the picture, then match these points to different individuals. This makes the method faster, but its accuracy is slightly lower than that of Top-Down methods. A common model is OpenPose [7].
The High-Resolution Network (HRNet) [11] is a state-of-the-art, Top-Down human pose estimation method. It maintains high-resolution representations through the whole process. It starts from a high-resolution subnetwork in the first stage, gradually adds high-to-low resolution subnetworks one by one to form more stages, and connects the multi-resolution subnetworks in parallel [11]. It performs multiple multi-scale fusions by repeatedly exchanging information across the parallel multi-resolution subnetworks throughout the process, and estimates human body keypoints from the high-resolution representations at the network output. The architecture of HRNet is illustrated in Figure 1.
Compared with common pose estimation networks [13-16], this design has two benefits. First, it connects high-to-low resolution subnetworks in parallel rather than in series, as most existing solutions do; it can therefore maintain high resolution instead of recovering it through a low-to-high process, and the predicted heatmaps are accordingly spatially more precise. Second, most existing fusion schemes combine low-level and high-level representations [11]. In contrast, this method performs repeated multi-scale fusion, using low-resolution representations of the same depth and similar level to improve the high-resolution representations, and vice versa, so that the high-resolution representations are also rich for pose estimation. As a result, the predicted heatmaps are more accurate.
HRNet maintains high-resolution features throughout, without needing to recover high resolution, and repeatedly fuses the parallel multi-resolution representations, enhancing the reliability of the high-resolution representations. It therefore yields accurate and spatially precise keypoint heatmaps. However, because HRNet is a Top-Down method, its image processing speed is lower than that of Bottom-Up methods. In addition, multi-person pose estimation with HRNet requires an object detection algorithm to process the image first, so the speed of the object detector has a great influence on the overall speed of pose estimation.
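Since HRNet-style networks output one heatmap per keypoint, the keypoint location is typically decoded as the heatmap's peak response. The sketch below illustrates this decoding step on plain Python lists; the thresholding rule for marking occluded keypoints as invisible is an illustrative assumption, not the paper's exact procedure.

```python
def decode_keypoints(heatmaps, threshold=0.3):
    """Decode (x, y, score) per keypoint from a list of K heatmaps (H x W lists).

    Keypoints whose peak response falls below `threshold` are treated as
    invisible (None) -- a simple, assumed way to handle heavy occlusion.
    """
    keypoints = []
    for hm in heatmaps:
        best, best_val = (0, 0), hm[0][0]
        # Scan the heatmap for its maximum activation (argmax decoding).
        for y, row in enumerate(hm):
            for x, v in enumerate(row):
                if v > best_val:
                    best_val, best = v, (x, y)
        keypoints.append((best[0], best[1], best_val)
                         if best_val >= threshold else None)
    return keypoints
```

In practice the peak is often refined with a sub-pixel offset toward the second-highest neighbor, but plain argmax suffices to show why spatially precise heatmaps translate directly into precise keypoints.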
2.2 Squeeze-and-Excitation Networks
Squeeze-and-Excitation Networks (SENet) [10] introduced a new architectural unit, termed the Squeeze-and-Excitation (SE) block, with the goal of improving the quality of the representations produced by a network by explicitly modelling the interdependencies between the channels of its convolutional features [10]. In this structure, Squeeze and Excitation are the two critical operations. The block adopts a "feature recalibration" strategy: through this mechanism, the network learns to use global information to selectively emphasize informative features and suppress less useful ones.
The structure of the SE building block is shown in Figure 2, where SE denotes the SENet block. The input first passes through a squeeze operation, which performs global average pooling on the input feature map to obtain a descriptor of size C×1×1 (C is the number of feature map channels), allowing information from the network's global receptive field to be used by all its layers. The aggregation is followed by an excitation operation, which learns the correlations between feature channels and generates a weight for each channel: two fully connected layers (first dimensionality reduction, then dimensionality increase) followed by a sigmoid activation produce a weight vector of size C×1×1. Finally, a reweight operation treats these weights as the importance of each feature channel after feature selection and multiplies them channel-wise with the original features to complete the recalibration. The output of the SE block can be fed directly into subsequent layers of the network. BasicBlock and Bottleneck are the classic residual modules used in ResNet; SE-BasicBlock embeds the SE structure into a regular BasicBlock unit, and SE-Bottleneck embeds it into a regular Bottleneck unit.
The SE block is simple in structure and can be embedded directly into existing state-of-the-art network architectures. It significantly improves network performance while remaining computationally lightweight, imposing only a slight increase in model complexity and computational burden.
2.3 Object detection method
Existing object detection algorithms are mainly divided into two types: two-stage (region proposal) methods and one-stage (regression) methods. Common two-stage detectors include R-CNN [18], Fast R-CNN [19], and Faster R-CNN [20]. Among them, Faster R-CNN tends to be a slower but more accurate model [21]. Faster R-CNN consists of two stages. In the first stage, a region proposal network (RPN) processes the image to predict class-agnostic box proposals. In the second stage, these proposal boxes are used to crop features from the same intermediate feature maps, which are then fed into the feature extractor to predict a class for each proposal box and to refine it.
One-stage object detection algorithms, e.g., SSD [22] and YOLO [23], conduct object classification and bounding-box regression concurrently, without a region proposal stage [24]. YOLO converts object detection into a regression problem: a single end-to-end network computes object positions and categories directly from the original image. One-stage methods usually offer high detection speed and efficiency but lower accuracy. YOLOv3 [9] can detect multiple objects with a single inference, so its detection speed is extremely fast; in addition, by applying multi-scale detection, it compensates for the low accuracy of YOLO and YOLOv2 [25]. Although YOLOv3 does not match Faster R-CNN in detection accuracy for very small targets, it is significantly faster [20], which makes it suitable for many engineering applications.
Considering both the detection speed and the detection accuracy of the algorithm, this paper uses YOLOv3 as the object detection algorithm to detect human bodies and groveling poses in the classroom, providing the foundation for our real-time classroom pose recognition method based on SE-HRNet.
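When one detector produces both person boxes and groveling boxes, overlapping detections must be associated so that a groveling box is attributed to the right student. A standard way to do this is intersection-over-union (IoU) between boxes; the paper does not specify its association rule, so the helper below is only an assumed, conventional sketch.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # Union = sum of areas minus the double-counted intersection.
    return inter / float(area_a + area_b - inter)
```

A groveling box could then be assigned to the person box with which it has the highest IoU, e.g., above a fixed threshold such as 0.5.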