The main framework of the proposed method is shown in Fig. 1. The parallel network HRNet [1] is used to extract preliminary feature maps \(f_i\) (\(i \in [1,4]\)) at four different scales, where \(C\), \(H\), and \(W\) represent the number of channels, height, and width of the feature maps, respectively. Subsequently, the dilated convolution module (DCM) and the hybrid self-attention module (HSM) process these feature maps separately and sequentially, enlarging the receptive field of human poses and enhancing the local feature information. Finally, adaptive convolutional networks are employed to obtain the keypoints of multi-person poses.
3.1 Dilated Convolution Module
Multi-layer convolution kernels increase the number of channels and enrich the feature representation, but they also increase network computation time and cost. To expand the receptive field of posture joints while reducing computational costs, this paper introduces the Dilated Convolution Module (DCM), which employs cascaded dilated convolutions. The DCM receives the feature map \(f_i\) at each scale as input, where \(i \in [1,4]\). The DCM architecture comprises three convolutional layers and a concatenation operation, as illustrated in Fig. 2. In the DCM, the first layer uses 1×1 convolutional blocks to enhance the network's nonlinear representation capabilities. The second layer consists of three standard convolution blocks with 1×3, 3×1, and 3×3 convolution operations, respectively.
The third layer of the DCM consists of four cascaded 3×3 convolutional blocks with dilation rates \(r = [1, 3, 3, 5]\), as shown on the right of Fig. 2 (a). By cascading these four convolutional blocks with varying dilation rates, we extend the receptive field for posture joint points. That is, except for the first dilated block, each of the other three dilated convolutions takes its input from the previous block, progressively increasing the receptive field, so that the fourth dilated block covers an even larger receptive field. The convolution kernel size \(K\) after each cascaded expansion can be calculated according to the following formula:
$$K = k + (k-1)(r-1) \tag{1}$$
where \(k\) is the original convolution kernel size and \(r\) is the dilation rate. Table 1 shows the initial convolution kernel size, dilation rate, number of input and output channels, and dilated convolution kernel size for the four dilated convolution blocks. As can be seen, the receptive field after cascading the four dilated convolutions is 31×31, and the number of output channels is \(C/8\).
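For instance, substituting the values of the second dilated block in Table 1 (\(k = 3\), \(r = 3\)) into Eq. (1) gives

$$K = 3 + (3-1)(3-1) = 7,$$

i.e., the 3×3 kernel with dilation rate 3 behaves as a 7×7 kernel.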
By cascading dilated convolutions, the receptive field of the target features is expanded to capture joint details from a larger contextual area, thereby enhancing local features and yielding more precise keypoint locations. To improve computational efficiency, we progressively halve the number of channels in the convolutional blocks, as outlined in Table 1, keeping computational costs low. Finally, we concatenate the outputs of the four dilated blocks and apply a residual connection with the original input feature map \(f_i\). The output \(d_i\) of the DCM can be represented as:
$$d_i = \operatorname*{Concat}_{j=1}^{4}\left[D_j\left(\operatorname{con}_j\left(W_{1\times 1}(f_i)\right)\right)\right] + f_i, \quad i \in [1,4] \tag{2}$$
where \(W_{1\times 1}(\cdot)\) is the 1×1 group convolution, \(\operatorname{con}_j(\cdot)\) is the convolution operation of the second layer, \(j\) is the index of the branch, and \(D_j(\cdot)\) is the dilated convolution operation of the third layer.
Table 1
The cascade process of the four dilated convolution blocks, where C is the original number of channels; the stride of all four convolution blocks is 1.
| Order | Original kernel size \(k \times k\) | Dilation rate \(r\) | Input channels | Output channels | Dilated kernel size \(K \times K\) |
| --- | --- | --- | --- | --- | --- |
| 1 | 3×3 | 1 | C | C/2 | 3×3 |
| 2 | 3×3 | 3 | C/2 | C/4 | 7×7 |
| 3 | 3×3 | 3 | C/4 | C/8 | 15×15 |
| 4 | 3×3 | 5 | C/8 | C/8 | 31×31 |
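For concreteness, the following is a minimal PyTorch sketch of a DCM-style block built from the description above and Table 1. The class name, the sequential wiring of the second-layer 1×3, 3×1, and 3×3 convolutions, and the omission of normalization and activation layers are our assumptions rather than the exact design of Fig. 2.

```python
import torch
import torch.nn as nn

class DCM(nn.Module):
    """Sketch of a DCM-like block (our naming).

    Layer 1: a 1x1 convolution.
    Layer 2: 1x3, 3x1 and 3x3 standard convolutions (taken here as sequential).
    Layer 3: four cascaded 3x3 dilated convolutions with rates [1, 3, 3, 5]
             and the channel widths of Table 1 (C -> C/2 -> C/4 -> C/8 -> C/8).
    The four dilated outputs are concatenated and added to the input f_i (Eq. 2).
    """

    def __init__(self, channels: int):
        super().__init__()
        c = channels
        self.conv1x1 = nn.Conv2d(c, c, kernel_size=1)
        # Second layer: three standard convolution blocks.
        self.conv_1x3 = nn.Conv2d(c, c, kernel_size=(1, 3), padding=(0, 1))
        self.conv_3x1 = nn.Conv2d(c, c, kernel_size=(3, 1), padding=(1, 0))
        self.conv_3x3 = nn.Conv2d(c, c, kernel_size=3, padding=1)
        # Third layer: cascaded dilated 3x3 convolutions (Table 1).
        rates = [1, 3, 3, 5]
        in_ch = [c, c // 2, c // 4, c // 8]
        out_ch = [c // 2, c // 4, c // 8, c // 8]
        self.dilated = nn.ModuleList([
            nn.Conv2d(i, o, kernel_size=3, dilation=r, padding=r)
            for i, o, r in zip(in_ch, out_ch, rates)
        ])

    def forward(self, f):
        x = self.conv1x1(f)
        x = self.conv_3x3(self.conv_3x1(self.conv_1x3(x)))  # assumed sequential wiring
        outs = []
        for conv in self.dilated:
            x = conv(x)          # each dilated block feeds the next (cascade)
            outs.append(x)
        return torch.cat(outs, dim=1) + f   # Eq. (2): concatenation + residual
```

Note that for an input with \(C\) channels the four cascaded outputs have \(C/2\), \(C/4\), \(C/8\), and \(C/8\) channels, so the concatenation restores the original \(C\) channels before the residual addition.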
3.2 Hybrid Self-attention Module
In the previous section, we expanded the receptive field of the posture joints using the DCM, thereby enhancing the representation of detailed features for local joints. To further capture the contextual information of the DCM output feature map \(d_i\), we introduce a hybrid self-attention module (HSM) to enhance the network's capability to fuse semantic feature information, as shown in Fig. 1. In the HSM, following layer normalization (LN), \(d_i\) is fed into a Convolutional Self-Attention (CSA) block and a Coordinate Attention (CA) block [33] for parallel processing. The CSA block is designed to capture both local and global features of multiple individuals' joints, and its structure is shown in Fig. 3. The CA block [33] is used to mitigate the loss of pose and position information. The outputs of both blocks are combined via summation and residual connections, thereby enhancing the integration of local joint features with their context.
In the CSA architecture shown in Fig. 3, the \(Q\), \(K\), and \(V\) components are generated via 1×1 convolutions. Compared to fully connected networks, employing 1×1 convolutions gives the attention network stronger contextual awareness, reduces computational demands, and strengthens the network's capacity for nonlinear expression. The \(Q\) and \(K\) matrices are multiplied, and the result is passed through a fully connected layer (FC) and the SiLU function to enhance nonlinear expressiveness. The output is \(T_1\):
$$T_1 = \mathrm{SiLU}\left(\mathrm{FC}(Q \otimes K)\right) \tag{3}$$
\(T_1\) is then fed into an FC layer and the Tanh function to derive the weight information \(T_2\):
$$T_2 = \mathrm{Tanh}\left(\mathrm{FC}(T_1)\right) \tag{4}$$
\(T_2\) is then multiplied with \(V\), passed through a 1×1 convolutional layer, and combined with \(\mathrm{LN}(d_i)\) via a residual connection; the output \(\mathrm{CSA}\left(\mathrm{LN}(d_i)\right)\) of the CSA block is represented as:
$$\mathrm{CSA}\left(\mathrm{LN}(d_i)\right) = (T_2 \otimes V) + \mathrm{LN}(d_i) \tag{5}$$
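A possible PyTorch reading of Eqs. (3)–(5) is sketched below. We interpret \(\otimes\) as element-wise multiplication and realise FC as a per-pixel 1×1 convolution; these choices, as well as the final 1×1 projection, are our assumptions about Fig. 3 rather than the exact implementation.

```python
import torch.nn as nn

class CSA(nn.Module):
    """Sketch of the Convolutional Self-Attention block (Eqs. 3-5)."""

    def __init__(self, channels: int):
        super().__init__()
        # Q, K, V are produced by 1x1 convolutions.
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)
        # "FC" realised as per-pixel 1x1 convolutions (our assumption).
        self.fc1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.fc2 = nn.Conv2d(channels, channels, kernel_size=1)
        self.silu = nn.SiLU()
        self.tanh = nn.Tanh()
        # 1x1 projection mentioned in the text before the residual connection.
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                       # x = LN(d_i), shape (B, C, H, W)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        t1 = self.silu(self.fc1(q * k))         # Eq. (3)
        t2 = self.tanh(self.fc2(t1))            # Eq. (4)
        return self.proj(t2 * v) + x            # Eq. (5): residual with LN(d_i)
```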
Inspired by reference [33], we deploy the Coordinate Attention (CA) block to strengthen the network's capacity for integrating location information. The CA block factorizes channel attention into two one-dimensional features, aggregating them along the two spatial dimensions: one captures feature correlations and the other retains precise positional information. The feature maps are then encoded into direction-aware and position-sensitive attention maps, which complement the input feature maps to enhance the feature representation of the objects of interest. The output of the CA block is denoted \(\mathrm{CA}\left(\mathrm{LN}(d_i)\right)\).
Finally, the outputs of the two attention blocks are summed, together with a residual connection to \(d_i\), and represented as \(a_i\):
$$a_i = \mathrm{CSA}\left(\mathrm{LN}(d_i)\right) + \mathrm{CA}\left(\mathrm{LN}(d_i)\right) + d_i \tag{6}$$
Then, \(a_i\) is processed through LN and MLP layers to produce the final output \(h_i\) of the HSM:
$$h_i = \mathrm{MLP}\left(\mathrm{LN}(a_i)\right) + a_i \tag{7}$$
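Eqs. (6) and (7) can be assembled as sketched below, assuming the CSA sketch above and any Coordinate Attention implementation with matching input and output shapes (passed in here as `ca`); the 1×1-convolution MLP and GELU activation are our assumptions, as the paper does not specify them.

```python
import torch.nn as nn

class HSM(nn.Module):
    """Sketch of the Hybrid Self-attention Module (Eqs. 6-7)."""

    def __init__(self, channels: int, csa: nn.Module, ca: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(channels)
        self.ln2 = nn.LayerNorm(channels)
        self.csa, self.ca = csa, ca
        hidden = channels * mlp_ratio
        # MLP realised with 1x1 convolutions (our assumption).
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    @staticmethod
    def _ln(norm, x):
        # Apply LayerNorm over the channel dimension of a (B, C, H, W) tensor.
        return norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

    def forward(self, d):
        x = self._ln(self.ln1, d)
        a = self.csa(x) + self.ca(x) + d             # Eq. (6)
        return self.mlp(self._ln(self.ln2, a)) + a   # Eq. (7)
```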
The hybrid self-attention module (HSM) proposed in this paper improves keypoint detection performance for human poses, indicating that by combining contextual content the HSM can attend to more pixels, represent local pose features more strongly, and localize joint positions more accurately.
3.3 Multi-person Keypoint Regression
In the previous section, we obtained \(h_i\), \(i \in [1,4]\), at four different scales. To integrate the diverse semantic information across feature maps of different scales, we adjust the resolutions of the feature maps \(h_2\), \(h_3\), and \(h_4\) to match that of \(h_1\), and merge the four maps via concatenation to produce the final feature map \(h_s\). Subsequently, we employ multi-branch parallel adaptive convolution [15] for keypoint regression on \(h_s\), with each branch dedicated to the pixel region of its respective keypoint, ultimately regressing the positions of \(n\) keypoints for human joints (\(n = 17\) in this paper). Specifically, the backbone feature map is partitioned into \(n\) segments, and each segment is fed into a distinct branch. Each branch employs adaptive convolution to extract keypoint-specific features and then outputs the corresponding two-dimensional offset vectors via a convolutional layer. Because the feature maps generated by our method contain enhanced local details and contextual features, the network can more precisely infer the positions of the corresponding human joint keypoints in the image.
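A simplified sketch of this fusion and regression stage follows. The bilinear upsampling, the 1×1 transition convolution that makes the channel count divisible by \(n\), and the plain 3×3 convolutions used in place of the adaptive convolutions of [15] are all our simplifications, not the exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_scales(h):
    """Upsample h_2..h_4 to the resolution of h_1 and concatenate into h_s (sketch)."""
    h1 = h[0]
    ups = [h1] + [F.interpolate(x, size=h1.shape[-2:], mode="bilinear",
                                align_corners=False) for x in h[1:]]
    return torch.cat(ups, dim=1)

class KeypointRegressor(nn.Module):
    """Per-keypoint offset regression over h_s with n parallel branches.

    Plain 3x3 convolutions stand in for the adaptive convolutions of [15];
    each branch sees its own channel slice and predicts a 2-D offset map.
    """

    def __init__(self, in_channels: int, num_keypoints: int = 17, branch_channels: int = 32):
        super().__init__()
        self.num_keypoints = num_keypoints
        # 1x1 transition so the fused map splits evenly into n branch slices (assumption).
        self.transition = nn.Conv2d(in_channels, num_keypoints * branch_channels, kernel_size=1)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(branch_channels, branch_channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(branch_channels, 2, kernel_size=1),   # 2-D offset vector per pixel
            )
            for _ in range(num_keypoints)
        ])

    def forward(self, hs):
        x = self.transition(hs)
        chunks = torch.chunk(x, self.num_keypoints, dim=1)
        # Concatenate the per-keypoint offset maps: output shape (B, 2n, H, W).
        return torch.cat([branch(c) for branch, c in zip(self.branches, chunks)], dim=1)
```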