The main framework of the proposed method is shown in Fig. 1. The parallel network HRNet [1] is used to extract preliminary feature maps \(f_i\) (\(i \in [1,4]\)) at four different scales, where \(C\), \(H\), and \(W\) represent the number of channels, height, and width of the feature maps, respectively. Subsequently, the dilated convolution module (DCM) and the hybrid self-attention module (HSM) process these feature maps separately and sequentially, enlarging the receptive field of human poses and enhancing the local feature information. Finally, adaptive convolutional networks are employed to obtain the keypoints of multi-person poses.
3.1 Dilated Convolution Module
Multi-layer convolution kernels increase the number of channels and enrich the feature representation, but they also increase network computation time and cost. To expand the receptive field of posture joints while reducing computational costs, this paper introduces the Dilated Convolution Module (DCM), which employs cascaded dilated convolutions. The DCM receives the feature map \(f_i\) at each scale as input, where \(i \in [1,4]\). The DCM architecture comprises three convolutional layers and a concatenation operation, as illustrated in Fig. 2. In the DCM, the first layer uses 1×1 convolutional blocks to enhance the network's nonlinear representation capabilities. The second layer consists of three standard convolution blocks with 1×3, 3×1, and 3×3 convolution operations, respectively.
The third layer of the DCM consists of four cascaded 3×3 convolutional blocks with dilation rates \(r = [1, 3, 3, 5]\), as shown on the right of Fig. 2 (a). By cascading these four convolutional blocks with varying dilation rates, we extend the receptive field for posture joint points. That is, except for the first dilated block, each of the other three dilated convolutions takes its input from the previous block, progressively increasing the receptive field, so that the fourth dilated block covers an even larger receptive field. The convolution kernel size \(K\) after each cascaded expansion can be calculated according to the following formula:
$$K = k + (k-1)(r-1) \tag{1}$$
where \(k\) is the original convolution kernel size and \(r\) is the dilation rate. Table 1 shows the initial convolution kernel size, dilation rate, number of input and output channels, and dilated convolution kernel size for the four dilated convolution blocks. As can be seen, the receptive field after cascading the four dilated convolutions is 31×31, and the number of output channels is \(C/8\).
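For instance, substituting the values of the second dilated block in Table 1 (\(k = 3\), \(r = 3\)) into Eq. (1) gives

$$K = 3 + (3-1)(3-1) = 7,$$

i.e., the 3×3 kernel with dilation rate 3 behaves as a 7×7 kernel.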
By cascading dilated convolutions, the receptive field of the target features is expanded to capture joint details from a larger contextual area, thereby enhancing local features and yielding more precise keypoint locations. To improve computational efficiency, we progressively halve the number of channels in the convolutional blocks, as outlined in Table 1, keeping computational costs low. Finally, we concatenate the outputs of the four dilated blocks and apply a residual connection with the original input feature map \(f_i\). The output \(d_i\) of the DCM can be represented as:
$$d_i = \operatorname*{Concat}_{j=1}^{4}\left[D_j\left(\operatorname{con}_j\left(W_{1\times 1}(f_i)\right)\right)\right] + f_i, \quad i \in [1,4] \tag{2}$$
where \(W_{1\times 1}(\cdot)\) is the 1×1 group convolution, \(\operatorname{con}_j(\cdot)\) is the convolution operation of the second layer, \(j\) is the index of the branch, and \(D_j(\cdot)\) is the dilated convolution operation of the third layer.
Table 1
The cascade process of the four dilated convolution blocks, where C is the original number of channels; the stride of all four convolution blocks is 1.
| Order | Original kernel size \(k \times k\) | Dilation rate \(r\) | Input channels | Output channels | Dilated kernel size \(K \times K\) |
| --- | --- | --- | --- | --- | --- |
| 1 | 3×3 | 1 | C | C/2 | 3×3 |
| 2 | 3×3 | 3 | C/2 | C/4 | 7×7 |
| 3 | 3×3 | 3 | C/4 | C/8 | 15×15 |
| 4 | 3×3 | 5 | C/8 | C/8 | 31×31 |
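For concreteness, the following is a minimal PyTorch sketch of a DCM-style block built from the description above and Table 1. The class name, the sequential wiring of the second-layer 1×3, 3×1, and 3×3 convolutions, and the omission of normalization and activation layers are our assumptions rather than the exact design of Fig. 2.

```python
import torch
import torch.nn as nn

class DCM(nn.Module):
    """Sketch of a DCM-like block (our naming).

    Layer 1: a 1x1 convolution.
    Layer 2: 1x3, 3x1 and 3x3 standard convolutions (taken here as sequential).
    Layer 3: four cascaded 3x3 dilated convolutions with rates [1, 3, 3, 5]
             and the channel widths of Table 1 (C -> C/2 -> C/4 -> C/8 -> C/8).
    The four dilated outputs are concatenated and added to the input f_i (Eq. 2).
    """

    def __init__(self, channels: int):
        super().__init__()
        c = channels
        self.conv1x1 = nn.Conv2d(c, c, kernel_size=1)
        # Second layer: three standard convolution blocks.
        self.conv_1x3 = nn.Conv2d(c, c, kernel_size=(1, 3), padding=(0, 1))
        self.conv_3x1 = nn.Conv2d(c, c, kernel_size=(3, 1), padding=(1, 0))
        self.conv_3x3 = nn.Conv2d(c, c, kernel_size=3, padding=1)
        # Third layer: cascaded dilated 3x3 convolutions (Table 1).
        rates = [1, 3, 3, 5]
        in_ch = [c, c // 2, c // 4, c // 8]
        out_ch = [c // 2, c // 4, c // 8, c // 8]
        self.dilated = nn.ModuleList([
            nn.Conv2d(i, o, kernel_size=3, dilation=r, padding=r)
            for i, o, r in zip(in_ch, out_ch, rates)
        ])

    def forward(self, f):
        x = self.conv1x1(f)
        x = self.conv_3x3(self.conv_3x1(self.conv_1x3(x)))  # assumed sequential wiring
        outs = []
        for conv in self.dilated:
            x = conv(x)          # each dilated block feeds the next (cascade)
            outs.append(x)
        return torch.cat(outs, dim=1) + f   # Eq. (2): concatenation + residual
```

Note that for an input with \(C\) channels the four cascaded outputs have \(C/2\), \(C/4\), \(C/8\), and \(C/8\) channels, so the concatenation restores the original \(C\) channels before the residual addition.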
3.2 Hybrid Self-attention Module
In the previous section, we expanded the receptive field of the posture joints using the DCM, thereby enhancing the representation of detailed features for local joints. To further capture the contextual information of the DCM output feature map \(d_i\), we introduce a hybrid self-attention module (HSM) to enhance the network's capability to fuse semantic feature information, as shown in Fig. 1. In the HSM, following layer normalization (LN), \(d_i\) is fed into a Convolutional Self-Attention (CSA) block and a Coordinate Attention (CA) block [33] for parallel processing. The CSA block is designed to capture both local and global features of multiple individuals' joints, and its structure is shown in Fig. 3. The CA block [33] is used to mitigate the loss of pose and position information. The outputs of both blocks are combined via summation and residual connections, thereby enhancing the integration of local joint features with their context.
In the CSA architecture shown in Fig. 3, the \(Q\), \(K\), and \(V\) components are generated via 1×1 convolutions. Compared to fully connected networks, employing 1×1 convolutions gives the attention network stronger contextual awareness, reduces computational demands, and strengthens the network's capacity for nonlinear expression. The \(Q\) and \(K\) matrices are multiplied, and the result is passed through a fully connected layer (FC) and the SiLU function to enhance nonlinear expressiveness. The output is \(T_1\):
$$T_1 = \mathrm{SiLU}\left(\mathrm{FC}(Q \otimes K)\right) \tag{3}$$
\(T_1\) is then fed into an FC layer and the Tanh function to derive the weight information \(T_2\):
$$T_2 = \mathrm{Tanh}\left(\mathrm{FC}(T_1)\right) \tag{4}$$
\(T_2\) is then multiplied with \(V\), passed through a 1×1 convolutional layer, and combined with \(\mathrm{LN}(d_i)\) via a residual connection; the output \(\mathrm{CSA}\left(\mathrm{LN}(d_i)\right)\) of the CSA block is represented as:
$$\mathrm{CSA}\left(\mathrm{LN}(d_i)\right) = (T_2 \otimes V) + \mathrm{LN}(d_i) \tag{5}$$
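A possible PyTorch reading of Eqs. (3)–(5) is sketched below. We interpret \(\otimes\) as element-wise multiplication and realise FC as a per-pixel 1×1 convolution; these choices, as well as the final 1×1 projection, are our assumptions about Fig. 3 rather than the exact implementation.

```python
import torch.nn as nn

class CSA(nn.Module):
    """Sketch of the Convolutional Self-Attention block (Eqs. 3-5)."""

    def __init__(self, channels: int):
        super().__init__()
        # Q, K, V are produced by 1x1 convolutions.
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)
        # "FC" realised as per-pixel 1x1 convolutions (our assumption).
        self.fc1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.fc2 = nn.Conv2d(channels, channels, kernel_size=1)
        self.silu = nn.SiLU()
        self.tanh = nn.Tanh()
        # 1x1 projection mentioned in the text before the residual connection.
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                       # x = LN(d_i), shape (B, C, H, W)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        t1 = self.silu(self.fc1(q * k))         # Eq. (3)
        t2 = self.tanh(self.fc2(t1))            # Eq. (4)
        return self.proj(t2 * v) + x            # Eq. (5): residual with LN(d_i)
```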
Inspired by reference [33], we deploy the Coordinate Attention (CA) block to strengthen the network's capacity for integrating location information. The CA block factorizes channel attention into two one-dimensional features, aggregating them along the two spatial dimensions: one captures feature correlations and the other retains precise positional information. The feature maps are then encoded into direction-aware and position-sensitive attention maps, which complement the input feature maps to enhance the feature representation of the objects of interest. The output of the CA block is denoted \(\mathrm{CA}\left(\mathrm{LN}(d_i)\right)\).
Finally, the outputs of the two attention blocks are summed, together with a residual connection to \(d_i\), and represented as \(a_i\):
$$a_i = \mathrm{CSA}\left(\mathrm{LN}(d_i)\right) + \mathrm{CA}\left(\mathrm{LN}(d_i)\right) + d_i \tag{6}$$
Then, \(a_i\) is processed through LN and MLP layers to produce the final output \(h_i\) of the HSM:
$$h_i = \mathrm{MLP}\left(\mathrm{LN}(a_i)\right) + a_i \tag{7}$$
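Eqs. (6) and (7) can be assembled as sketched below, assuming the CSA sketch above and any Coordinate Attention implementation with matching input and output shapes (passed in here as `ca`); the 1×1-convolution MLP and GELU activation are our assumptions, as the paper does not specify them.

```python
import torch.nn as nn

class HSM(nn.Module):
    """Sketch of the Hybrid Self-attention Module (Eqs. 6-7)."""

    def __init__(self, channels: int, csa: nn.Module, ca: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(channels)
        self.ln2 = nn.LayerNorm(channels)
        self.csa, self.ca = csa, ca
        hidden = channels * mlp_ratio
        # MLP realised with 1x1 convolutions (our assumption).
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    @staticmethod
    def _ln(norm, x):
        # Apply LayerNorm over the channel dimension of a (B, C, H, W) tensor.
        return norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

    def forward(self, d):
        x = self._ln(self.ln1, d)
        a = self.csa(x) + self.ca(x) + d             # Eq. (6)
        return self.mlp(self._ln(self.ln2, a)) + a   # Eq. (7)
```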
The hybrid self-attention module (HSM) proposed in this paper improves keypoint detection performance for human poses, indicating that by combining contextual content the HSM can attend to more pixels, represent local pose features more strongly, and localize joint positions more accurately.
3.3 Multi-person Keypoint Regression
In the previous section, we obtained \(h_i\), \(i \in [1,4]\), at four different scales. To integrate the diverse semantic information across feature maps of different scales, we adjust the resolutions of the feature maps \(h_2\), \(h_3\), and \(h_4\) to match that of \(h_1\), and merge the four maps via concatenation to produce the final feature map \(h_s\). Subsequently, we employ multi-branch parallel adaptive convolution [15] for keypoint regression on \(h_s\), with each branch dedicated to the pixel region of its respective keypoint, ultimately regressing the positions of \(n\) keypoints for human joints (\(n = 17\) in this paper). Specifically, the backbone feature map is partitioned into \(n\) segments, and each segment is fed into a distinct branch. Each branch employs adaptive convolution to extract keypoint-specific features and then outputs the corresponding two-dimensional offset vectors via a convolutional layer. Because the feature maps generated by our method contain enhanced local details and contextual features, the network can more precisely infer the positions of the corresponding human joint keypoints in the image.
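A simplified sketch of this fusion and regression stage follows. The bilinear upsampling, the 1×1 transition convolution that makes the channel count divisible by \(n\), and the plain 3×3 convolutions used in place of the adaptive convolutions of [15] are all our simplifications, not the exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_scales(h):
    """Upsample h_2..h_4 to the resolution of h_1 and concatenate into h_s (sketch)."""
    h1 = h[0]
    ups = [h1] + [F.interpolate(x, size=h1.shape[-2:], mode="bilinear",
                                align_corners=False) for x in h[1:]]
    return torch.cat(ups, dim=1)

class KeypointRegressor(nn.Module):
    """Per-keypoint offset regression over h_s with n parallel branches.

    Plain 3x3 convolutions stand in for the adaptive convolutions of [15];
    each branch sees its own channel slice and predicts a 2-D offset map.
    """

    def __init__(self, in_channels: int, num_keypoints: int = 17, branch_channels: int = 32):
        super().__init__()
        self.num_keypoints = num_keypoints
        # 1x1 transition so the fused map splits evenly into n branch slices (assumption).
        self.transition = nn.Conv2d(in_channels, num_keypoints * branch_channels, kernel_size=1)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(branch_channels, branch_channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(branch_channels, 2, kernel_size=1),   # 2-D offset vector per pixel
            )
            for _ in range(num_keypoints)
        ])

    def forward(self, hs):
        x = self.transition(hs)
        chunks = torch.chunk(x, self.num_keypoints, dim=1)
        # Concatenate the per-keypoint offset maps: output shape (B, 2n, H, W).
        return torch.cat([branch(c) for branch, c in zip(self.branches, chunks)], dim=1)
```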