A Fine-grained Attributes Recognition Model for Clothing Based on Improved the CSPDarknet and PAFPN Network

doi:10.21203/rs.3.rs-4092097/v1

Research Article

This work is licensed under a CC BY 4.0 License

Abstract

An efficient and accurate recognition model for fine-grained attributes of clothing has significant business prospects and social implications. However, the inherent diversity and complexity of clothing make datasets with fine-grained attributes costly to acquire. To address these challenges, we propose a lightweight clothing fine-grained attributes recognition model. First, the Ghost module is introduced into the CSPDarknet network to enhance the depth and expressiveness of feature learning while reducing the parameters and computational complexity. Then, the Conv module in the PAFPN network is replaced with the GSConv module to further reduce the computational load, and the SE attention mechanism is added to enhance the perception of key features. Finally, the Detect module performs the recognition of fine-grained attributes of clothing. To evaluate the model, we construct a clothing dataset containing 20 fine-grained attributes. The experimental results show that the model achieves a precision, recall and mAP of 76.2%, 78.9% and 81.7%, respectively. Compared with the original model, the parameters are reduced by 26.2% and the FPS is improved by 25.4%. The proposed model performs well on a small-scale dataset and runs efficiently in resource-constrained environments, which gives it practical applications in clothing recommendation, virtual fitting, and personalization.

Keywords: Fine-grained recognition, feature extraction, object detection, lightweight model

1 Introduction

With the increasing aesthetic awareness of consumers and the rapid development of the Internet, the fashion industry holds great market potential. This trend has created an urgent need for digital and intelligent technology in the fashion field to better meet the personalized needs of consumers. Research on intelligent analysis of fashion clothing in the fields of multimedia, pattern recognition and computer vision has also attracted wide attention from academia and industry [1–3]. In recent years, the rapid development of apparel e-commerce has produced wide market demand for fine-grained attributes recognition of apparel images. Clothing attributes recognition is also fundamental to subsequent intelligent analysis tasks such as trend prediction [4] and retrieval [5], which gives it important research significance. Dresses are a highly fashionable category in consumers' daily clothing purchases. In this paper, the problem of fine-grained attributes recognition of dresses based on deep learning and lightweight networks is investigated. The fine-grained attributes to be recognized in the dress image are shown in Fig. 1.

Fine-grained attributes recognition of clothing requires a high level of expertise to classify and label clothing attributes. In addition, the demand for real-time recognition on short video platforms and e-commerce platforms constrains the computational complexity of models. Previous research on attributes recognition typically used image analysis or object detection algorithms. Methods that extract low-level image features (such as SIFT [6], HOG [7], and DPM [8]) have limited extraction and expression capabilities. Object detection algorithms (such as Faster-RCNN [9], SSD [10], and the YOLO series [11]) still suffer from low detection efficiency and large numbers of model parameters, which makes them difficult to popularize in practical scenarios.

Therefore, based on the CSPDarknet network and the PAFPN network, this paper proposes a new lightweight fine-grained attributes recognition model for clothing, which provides a more practical and efficient digitization solution for the fashion field, and offers a new research direction for practical applications in resource-constrained scenarios.

The paper is structured as follows: Section 2 reviews related research on clothing attributes recognition. Section 3 details the fine-grained attributes dataset we constructed. Section 4 introduces the proposed model and method. Section 5 presents experiments and comparative analysis. Finally, Section 6 summarizes the study.

2 Related works

Research in clothing image recognition provides efficient image processing solutions for the e-commerce and fashion industries. In recent years, this research has focused on clothing classification, clothing segmentation and attributes recognition.

Many research efforts have been devoted to improving the accuracy of clothing classification. Shajini et al. [12] proposed a T-S pair model based on knowledge sharing and semi-supervised multitask learning for clothing classification and attributes prediction, and validated the feasibility of the approach on the DeepFashion dataset. Zhou et al. [13] proposed a method to improve the classification accuracy of clothing images via DenseNet201 network, transfer learning and optimized RVFL. Despite the high classification accuracy of the DenseNet201 neural network, its large number of parameters makes it difficult to ignore the latency problem in practical applications.

Many research works also focus on clothing segmentation to enable fine image editing of clothing. Inacio et al. [14] proposed a framework called EPYNET for extracting clothing features, which is based on SSDs and FPNs, with the EfficientNet model as the backbone to improve the accuracy of segmentation. Zhang et al. [15] proposed ClothingOut, a framework that utilizes a GAN to solve the clothing transformation problem from images containing human bodies to flat clothing images.

Research on clothing images has further evolved to recognize local attributes of clothing. Chun et al. [16] proposed the SAC network for clothing attributes recognition by combining a self-attention mechanism with a CNN, and constructed a new clothing dataset with 8 attributes for predicting fashion styles. Gu et al. [17] proposed a clothing attributes recognition algorithm based on an improved YOLOv4-Tiny, which improves the accuracy of clothing attributes recognition. Xiang et al. [18] designed an R-CNN clothing attributes recognition algorithm that utilizes L-Softmax loss and performs well in identifying clothing attributes such as shirt collar shape. Li et al. [19] automatically identified and segmented the sleeves of shirts and optimized the LSSVM parameters with the PSO algorithm, which achieved better results in the case of small samples. Zhu et al. [20] proposed sRA-Net, which obtains accurate attributes representations by utilizing multiple latent relationships in clothing images to improve the performance of fashion attributes recognition.

In this study, fine-grained attributes recognition of clothing is considered as an object detection task. Different from the previous methods, a dataset of dresses containing 20 attributes is constructed in this paper. By adopting a one-stage object detection algorithm without the need for complex localization and classification steps, and optimizing to reduce computational resource consumption, better detection performance and higher detection efficiency have been achieved.

3 Dataset

Many open-source deep learning datasets focusing on clothing categorization exist, such as DeepFashion [21], DeepFashion2 [22], Fashion MNIST [23], and Streetstyle [24]. However, these datasets generally contain only some major categories and attributes, which cannot satisfy the need for detailed classification and labeling of clothing attributes and limits their application to specific tasks. In order to train the recognition model, a dress image dataset containing approximately 10,000 images was established. The images in the dataset come from shopping websites such as Farfetch, SSENSE, and LuisaViaRoma.

By analyzing the attribute filtering functions of e-commerce platforms and interviewing textile and clothing experts, we selected four clothing attribute dimensions that influence consumer behavior from the perspective of clothing form: collar shape, sleeve shape, silhouette, and sleeve length. We then analyzed the design elements of dresses and, drawing on products in the clothing consumer market, expert analysis, and interviews with clothing consumers, divided the four dimensions into 20 attributes. The fine-grained attributes of clothing are shown in Fig. 2.

To ensure a uniform distribution of apparel attributes in the dataset, we selected 2500 apparel images from the dataset for the recognition task. Next, we used Labelme to label the dress attributes. This detailed categorization provides the basis for further research and strong support for a deeper understanding of apparel consumer behavior and market trends.
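
Because attribute recognition is treated here as object detection, Labelme box annotations are commonly converted to normalized detection labels before training. The following is a minimal sketch of such a conversion, not the authors' pipeline: it assumes rectangle shapes in the standard Labelme JSON layout (`shapes`, `points`, `imageWidth`, `imageHeight`), and the function name, class names and coordinates are hypothetical.

```python
def labelme_to_yolo(annotation, class_names):
    """Convert rectangle shapes from one Labelme annotation dict to YOLO rows.

    Each returned row is (class_index, x_center, y_center, width, height),
    with coordinates normalized to [0, 1] by the image size.
    """
    img_w = annotation["imageWidth"]
    img_h = annotation["imageHeight"]
    rows = []
    for shape in annotation["shapes"]:
        if shape.get("shape_type") != "rectangle":
            continue  # this sketch only handles box annotations
        (x1, y1), (x2, y2) = shape["points"]
        x_min, x_max = sorted((x1, x2))
        y_min, y_max = sorted((y1, y2))
        rows.append((
            class_names.index(shape["label"]),
            (x_min + x_max) / 2 / img_w,
            (y_min + y_max) / 2 / img_h,
            (x_max - x_min) / img_w,
            (y_max - y_min) / img_h,
        ))
    return rows


# Hypothetical example: class names and coordinates are illustrative only.
CLASS_NAMES = ["v-neck", "long sleeve"]
example = {
    "imageWidth": 200,
    "imageHeight": 400,
    "shapes": [
        {"label": "long sleeve", "shape_type": "rectangle",
         "points": [[10, 20], [110, 220]]},
    ],
}
yolo_rows = labelme_to_yolo(example, CLASS_NAMES)
print(yolo_rows)
```

Sorting the two corner points makes the conversion independent of the direction in which the annotator dragged the box.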

4 Methods

In this paper, an improved lightweight clothing attributes recognition model is proposed based on the CSPDarknet network and the PAFPN network. The structure of the fine-grained attributes recognition model for clothing is shown in Fig. 3.

4.1 The backbone network

4.1.1 The GhostConv module

The GhostConv module is a convolution module from the GhostNet [25] network that replaces ordinary convolution. The structure of the GhostConv module is shown in Fig. 4(a). GhostConv divides convolution into two steps: a Conv and a lightweight linear transformation. The Conv uses half as many convolution kernels as the ordinary convolution operation, which roughly halves its computation. The lightweight linear transformation is a cheap operation applied to the feature map extracted in the first step. Finally, a concat operation splices the two feature maps into a complete feature map. The GhostConv module enhances feature learning depth, enabling better capture of abstract features in clothing images.
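
The saving described above can be illustrated by counting weights. This is a rough sketch under stated assumptions rather than the exact layer used in the model: the cheap operation is assumed to be a depthwise convolution with a hypothetical kernel size `cheap_k`, biases and normalization layers are ignored, and the channel sizes are illustrative.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def ghost_conv_params(c_in, c_out, k, cheap_k=5):
    """Weights of a GhostConv-style block (bias omitted).

    Assumption: half of the output channels come from an ordinary convolution
    and the other half from a depthwise cheap_k x cheap_k convolution applied
    to the first half.
    """
    half = c_out // 2
    primary = c_in * half * k * k      # ordinary convolution, half the kernels
    cheap = half * cheap_k * cheap_k   # depthwise: one filter per channel
    return primary + cheap


# Illustrative channel sizes, not taken from the paper.
standard = conv_params(64, 128, 3)
ghost = ghost_conv_params(64, 128, 3)
print(standard, ghost, round(ghost / standard, 3))
```

With these sizes the Ghost variant needs slightly more than half of the weights, since the depthwise step adds only a small overhead to the halved convolution.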

4.1.2 The C3Ghost module

The C3Ghost module builds on the GhostConv module. The structure of the C3Ghost module is shown in Fig. 4(b). The input feature map is first divided into two parts. One part passes through the GhostBottleneck module, which is obtained by stacking GhostConv, to reduce the computational complexity. The resulting feature map is then concatenated with the other part and convolved again to improve the feature representation. The C3Ghost module can improve the feature representation while keeping the computational workload low, which provides significant advantages for clothing attributes recognition tasks, especially in scenarios that require efficient computation and lightweight models.

4.2 The neck network

4.2.1 The GSConv module

The core of the GSConv module [26] lies in the group shuffle operation. The structure of the GSConv module is shown in Fig. 5. The GSConv module first applies a standard Conv to the input feature map, producing half as many channels as the output feature map. Second, a DWConv (depthwise convolution) extracts features again from the output of the standard Conv. Third, the outputs of the Conv and the DWConv are concatenated. Finally, a shuffle operation mixes the feature information extracted by the Conv thoroughly into the feature information extracted by the DWConv. This mixing facilitates information sharing between channels and enhances the model's ability to capture multi-scale features. The time complexity corresponding to GSConv is:

$${T}_{GSConv}=O\left(W\times H\times {K}_{1}\times {K}_{2}\times \frac{{C}_{2}}{2}\left({C}_{1}+1\right)\right)$$

Here, $W$ and $H$ are the width and height of the output feature map; ${K}_{1}$ and ${K}_{2}$ are the dimensions of the convolution kernel; ${C}_{1}$ is the number of channels of each convolution kernel, which equals the number of channels of the input feature map; and ${C}_{2}$ is the number of channels of the output feature map.
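
As a worked example, the GSConv expression can be evaluated next to the usual cost of a standard convolution, $W\times H\times {K}_{1}\times {K}_{2}\times {C}_{1}\times {C}_{2}$. This is a small sketch: the standard-convolution cost is the common textbook count rather than a formula stated in this paper, and the feature-map and channel sizes are illustrative.

```python
def standard_conv_cost(w, h, k1, k2, c1, c2):
    """Multiply-accumulate count of a standard convolution."""
    return w * h * k1 * k2 * c1 * c2


def gsconv_cost(w, h, k1, k2, c1, c2):
    """Cost of GSConv following the time-complexity expression above."""
    return w * h * k1 * k2 * (c2 / 2) * (c1 + 1)


# Illustrative sizes: a 40 x 40 map, 3 x 3 kernels, 128 -> 256 channels.
args = (40, 40, 3, 3, 128, 256)
ratio = gsconv_cost(*args) / standard_conv_cost(*args)
print(ratio)
```

The ratio simplifies to $({C}_{1}+1)/(2{C}_{1})$, so for any reasonably wide layer GSConv costs close to half of a standard convolution.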

The application of the GSConv module in the Neck network effectively provides the model with a more diverse and multi-scale feature representation through feature extraction and integration. This is particularly crucial for tasks such as apparel attributes recognition, where apparel attributes typically manifest across various scales and hierarchies in images.
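
The shuffle step of GSConv can be sketched on plain lists of channel labels. This is a generic two-group channel shuffle written for illustration; the exact tensor layout in the authors' implementation is not given in the paper, and the function names are hypothetical.

```python
def channel_shuffle(channels, groups=2):
    """Interleave channels so that every group contributes in turn."""
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    # View the list as a (groups x per_group) grid and read it column by column.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


def gsconv_mix(conv_features, dwconv_features):
    """Concatenate Conv and DWConv outputs, then shuffle the channels."""
    return channel_shuffle(conv_features + dwconv_features, groups=2)


mixed = gsconv_mix(["c0", "c1", "c2"], ["d0", "d1", "d2"])
print(mixed)
```

After the shuffle, every Conv channel sits next to a DWConv channel, which is how later layers see both kinds of features side by side.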

4.2.2 The SE attention mechanism

The neck network needs to pay more attention to the fusion and extraction of high-level semantic features, and the introduction of the SE attention mechanism [27] can improve the perceptual ability and performance of the model. Although the attention mechanism in Transformer [28] excels in some areas, it may require more computational resources due to its larger parameter size and is less suitable for some lightweight scenarios. The network structure of the SE attention mechanism is shown in Fig. 6.

The SE attention model consists of the Squeeze operation and the Excitation operation. Squeeze is a global average pooling of the input feature map, compressing the features of each channel into a scalar to obtain global contextual information. Excitation is the activation of the output of the Squeeze operation, which assigns different weights, thus achieving the purpose of enhancing key channels and compressing useless channels.
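
The Squeeze and Excitation steps can be sketched without a deep learning framework. This is a minimal illustration, assuming the common SE layout of two fully connected layers with ReLU and sigmoid activations; the helper names, weights and feature values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math


def squeeze(feature_map):
    """Global average pooling: one scalar per channel (channel = 2D list)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]


def linear(x, weight, bias):
    """Fully connected layer: weight has one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def excite(z, w1, b1, w2, b2):
    """FC -> ReLU -> FC -> sigmoid, giving one weight in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in linear(z, w1, b1)]
    return [1.0 / (1.0 + math.exp(-s)) for s in linear(hidden, w2, b2)]


def se_block(feature_map, w1, b1, w2, b2):
    """Rescale every channel by its excitation weight."""
    scale = excite(squeeze(feature_map), w1, b1, w2, b2)
    return [[[v * s for v in row] for row in ch]
            for ch, s in zip(feature_map, scale)]


# Two 2 x 2 channels and a bottleneck of one hidden unit (toy values).
fm = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [4.0, 2.0]]]
out = se_block(fm, [[1.0, -1.0]], [0.0], [[1.0], [-1.0]], [0.0, 0.0])
print(squeeze(fm), out)
```

In a trained network the two weight matrices are learned, so informative channels receive weights near 1 and uninformative channels are suppressed toward 0.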

Through the SE attention mechanism, the model responds more sensitively to the channels that encode key clothing attributes, which improves its performance on this complex task and provides an effective means of improving the adaptability of the model.

4.3 The head module

The loss of the model consists of a classification loss (VFL) and a regression loss composed of the CIoU loss and the distribution focal loss (DFL). The box loss weight is 0.75, the cls loss weight is 0.5, and the dfl loss weight is 1.5. The formulas for these three loss functions are as follows:

$$VFL\left(p,q\right)=\left\{\begin{array}{ll}-q\left(q\log\left(p\right)+\left(1-q\right)\log\left(1-p\right)\right), & q>0\\ -\alpha {p}^{\gamma }\log\left(1-p\right), & q=0\end{array}\right.$$

$${L}_{CIoU}=1-IoU+\frac{{\rho }^{2}\left(b,{b}^{gt}\right)}{{c}^{2}}+\alpha \upsilon$$

$$DFL\left({S}_{i},{S}_{i+1}\right)=-\left(\left({y}_{i+1}-y\right)\log\left({S}_{i}\right)+\left(y-{y}_{i}\right)\log\left({S}_{i+1}\right)\right)$$

Here, $p$ is the predicted score, $q$ is the label, $IoU$ is the intersection over union, $b$ and ${b}^{gt}$ represent the center points of the two rectangular boxes, $\rho$ is the Euclidean distance between these center points, $c$ is the diagonal length of the smallest box enclosing the two rectangular boxes, $\upsilon$ measures the consistency of the aspect ratios of the two rectangular boxes, $\alpha$ is the weight coefficient, $y$ is the general distribution value, $i$ is the index, ${S}_{i}=\frac{{y}_{i+1}-y}{{y}_{i+1}-{y}_{i}}$, and ${S}_{i+1}=\frac{y-{y}_{i}}{{y}_{i+1}-{y}_{i}}$.
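
The three losses can be evaluated on scalar inputs with a short sketch. Several details are assumptions not stated in this paper: the defaults `alpha=0.75` and `gamma=2.0` for VFL, the standard CIoU definitions of $\upsilon$ and its coefficient $\alpha$, the small `eps` added for numerical safety, and the weighted sum used to combine the terms with the loss weights quoted above.

```python
import math


def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """VFL for one prediction; p must lie strictly between 0 and 1."""
    if q > 0:
        return -q * (q * math.log(p) + (1 - q) * math.log(1 - p))
    return -alpha * p ** gamma * math.log(1 - p)


def ciou_loss(box, box_gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2) corners."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = box_gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    iou = inter / (w * h + gw * gh - inter + eps)
    # rho^2: squared distance between the two box centers
    rho2 = ((x1 + x2 - gx1 - gx2) / 2) ** 2 + ((y1 + y2 - gy1 - gy2) / 2) ** 2
    # c^2: squared diagonal of the smallest enclosing box
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 + (max(y2, gy2) - min(y1, gy1)) ** 2
    # Standard CIoU aspect-ratio term and its trade-off coefficient (assumed)
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(w / h)) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / (c2 + eps) + alpha * v


def distribution_focal_loss(s_i, s_i1, y, y_i, y_i1):
    """DFL for a target y lying between the neighbouring bins y_i and y_i1."""
    return -((y_i1 - y) * math.log(s_i) + (y - y_i) * math.log(s_i1))


def total_loss(box, cls, dfl, w_box=0.75, w_cls=0.5, w_dfl=1.5):
    """Weighted sum of the three terms (the combination rule is assumed)."""
    return w_box * box + w_cls * cls + w_dfl * dfl


box_term = ciou_loss((0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0))
cls_term = varifocal_loss(0.5, 1.0)
dfl_term = distribution_focal_loss(0.7, 0.3, 2.3, 2.0, 3.0)
print(box_term, cls_term, dfl_term, total_loss(box_term, cls_term, dfl_term))
```

For two identical boxes every CIoU term vanishes and the loss is approximately zero, while non-overlapping boxes are still penalized through the center-distance term.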

5 Experiments

5.1 Experiment setup

The hardware environment is an HP workstation (Z840 Tower; CPU E5-2623 v4 @ 2.60 GHz; 32 GB memory) with an NVIDIA TITAN Xp GPU (11 GB graphics memory). The software environment is Python 3.8, CUDA 12, and PyTorch 1.13.1. The model is trained for 150 epochs with a batch size of 16 using the SGD optimizer, with an initial learning rate of 0.01, momentum of 0.937, and weight decay of 0.0005.
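
For reproducibility, the training settings above can be collected in one configuration object. This is only a sketch: the key names are hypothetical and do not correspond to any particular training framework, and the training-set size used in the usage line is an illustrative number because the paper does not state the train/test split.

```python
import math

# Values taken from the experiment setup; key names are hypothetical.
TRAIN_CONFIG = {
    "epochs": 150,
    "batch_size": 16,
    "optimizer": "SGD",
    "initial_lr": 0.01,
    "momentum": 0.937,
    "weight_decay": 0.0005,
}


def iterations_per_epoch(num_images, batch_size):
    """Number of optimizer steps needed to see every training image once."""
    return math.ceil(num_images / batch_size)


# 2000 is an illustrative training-set size, not a figure from the paper.
steps = iterations_per_epoch(2000, TRAIN_CONFIG["batch_size"])
print(steps, steps * TRAIN_CONFIG["epochs"])
```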

5.2 Evaluation indicators

The main indicators for evaluating network performance are precision (P), recall (R), and mean average precision (mAP). In balancing lightweight design against performance, the number of parameters and the computational complexity of the model must also be considered. Therefore, two further evaluation indicators are added: the number of parameters (Param) and FPS.
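
The indicators follow their usual definitions, which can be written out as a short sketch. The paper does not spell out these formulas, so the standard ones are assumed here: precision and recall from true positives, false positives and false negatives, mAP as the mean of per-class average precision, and FPS as frames processed per second. All counts below are made up for illustration.

```python
def precision(tp, fp):
    """Fraction of predicted boxes that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of ground-truth boxes that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mean_average_precision(ap_per_class):
    """Mean of per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)


def frames_per_second(num_frames, elapsed_seconds):
    """Throughput of the detector on a batch of frames."""
    return num_frames / elapsed_seconds


# Illustrative counts only, not results from the paper.
print(precision(80, 20), recall(80, 20))
print(mean_average_precision([0.5, 1.0]), frames_per_second(296, 2.0))
```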

5.3 Experimental results and analysis

5.3.1 Comparative attentional mechanisms experiment

Table 1 shows the results of adding different attention modules [29–31]. Although the SimAM module introduces the smallest parameter increment, its improvement in recognition accuracy is limited. The SE attention mechanism also adds few parameters, yet it achieves the highest precision, recall and mAP among the tested modules in the clothing attributes recognition task, raising mAP from 80.8% to 82.2%. The SE attention mechanism therefore strengthens the extraction of important information and suppresses the influence of irrelevant information, which improves the detection precision.

Table 1

The results of adding different attention modules
Method	P (%)	R (%)	mAP (%)	Param(M)	FPS
YOLOv8n	75.3	77.6	80.8	3.02	118
+SE	75.6	79.2	82.2	3.02	97
+SimAM	75.3	78.0	81.7	3.02	105
+CA	75.5	77.1	82.0	3.03	96
+ECA	72.0	79.1	81.5	3.16	93

5.3.2 Comparative lightweight module experiment

In the process of improving the model, this study considers both the lightweight performance of the model and the accuracy of clothing attributes recognition. We tested five lightweight configurations [32–33] in place of the traditional Conv modules.

Table 2

Results of lightweight module additions
Method	P (%)	R (%)	mAP (%)	Param(M)	FPS
YOLOv8n	74.3	76.6	79.8	3.01	118
+Ghost	73.5	74.1	79.5	2.31	146
+GSConv	74.6	75.9	79.7	2.80	135
+MobileNet	70.4	75.2	77.4	1.19	183
+Shufflenet	69.6	67.2	73.9	0.63	203
+Ghost + GSConv	75.2	79.8	81.2	2.10	152

Table 2 shows the experimental results after adding different lightweight modules. The combination of the Ghost module and the GSConv module is the most balanced: rather than excelling in a single indicator, it achieves competitive results across several. Although it has more parameters than the MobileNet and Shufflenet variants (2.10 M versus 1.19 M and 0.63 M), its precision, recall and mAP are clearly higher (75.2%, 79.8% and 81.2%). Therefore, considering both performance and parameters, we chose the Ghost module combined with the GSConv module for model lightweighting.

5.3.3 Ablation experiment

In order to evaluate the effectiveness of the improved algorithm, this paper designs five sets of ablation experiments using the same equipment and dataset for training and testing to ensure comparability. The experimental results are shown in Table 3.

Table 3

The ablation experiment results
+Ghost	+GSConv	+SE	P (%)	R (%)	mAP (%)	Param(M)	FPS
			74.3	76.6	79.8	3.01	118
√			73.5	74.1	79.5	2.31	143
	√		74.6	75.9	79.7	2.80	135
		√	75.6	79.2	82.2	3.02	97
√	√	√	76.2	78.9	81.7	2.22	148

Table 3 shows that after the introduction of the Ghost module, the model parameters are reduced from 3.01 M to 2.31 M and FPS rises from 118 to 143, while precision, recall and mAP decrease slightly. After the introduction of the GSConv module, precision improves slightly (74.3% to 74.6%) and the parameters fall to 2.80 M, but recall and mAP decrease slightly. With the introduction of the SE attention mechanism, precision, recall and mAP all improve (to 75.6%, 79.2% and 82.2%), but FPS drops to 97 and the parameters increase slightly (3.01 M to 3.02 M). Figure 7 depicts a visual comparison of some heatmaps of detection results before and after adding the SE attention mechanism. Each single modification improves some indicators at the cost of either accuracy or efficiency, so none of them alone meets the goal of this study, which is to maintain recognition precision while making the model lightweight.

The improved model in this paper increases precision by 2.6%, recall by 3.0% and mAP by 2.4% relative to the original model (74.3% to 76.2%, 76.6% to 78.9%, and 79.8% to 81.7%). Meanwhile, the model parameters are reduced by 26.2% (3.01 M to 2.22 M), and FPS is improved by 25.4% (118 to 148). This indicates that the improved model achieves clear advantages in comprehensive performance, which supports the effective use of lightweight models in fine-grained recognition of clothing attributes.

5.4 Recognition performance analysis

In the following, we evaluate the performance of the improved model on the four dimensions of clothing attributes recognition.

5.4.1 Collar shape recognition effect

Figure 8(a) shows the results of the model proposed in this paper on collar shape recognition. For the collar shape recognition task, the stand-up collar has the best recognition results, while the recognition performance for the one-shoulder neckline and the u-neck is relatively low. The low recall of the u-neck may be due to the similarity of this collar shape to the v-neck and the one-shoulder neckline, which makes it difficult for the model to distinguish them accurately. In future work, we will add more samples and adjust the training strategy to improve these results.

5.4.2 Sleeve shape recognition effect

Figure 8(b) shows the results of the model proposed in this paper on sleeve shape recognition. For the sleeve shape recognition task, the regular sleeve is recognized best. The flutter sleeve has high recall but low precision: its relatively distinctive shape lets the model identify the majority of flutter sleeves, but the model also misclassifies some other sleeves as flutter sleeves. For the ruffle sleeve, the precision is high but the recall is low. This is because the appearance and shape of the ruffle sleeve vary considerably across images, which makes this type of sleeve challenging for the model to recognize.

5.4.3 Silhouette recognition effect

Figure 8(c) shows the results of the model proposed in this paper on silhouette recognition. For the silhouette recognition task, the x-shaped and s-shaped silhouettes have high precision and recall, indicating that the model recognizes these two attributes well. These two attributes have relatively clear contours and features that allow the model to classify them more accurately. However, the a-shaped and h-shaped silhouettes have lower mAP because their shape features are not obvious, making it difficult for the model to distinguish them accurately.

5.4.4 Sleeve length recognition effect

Figure 8(d) shows the results of the model proposed in this paper on sleeve length recognition. For the sleeve length recognition task, the short sleeve has relatively high precision and recall, leading to higher mAP. A possible reason is that short sleeves have distinctive features in the image, making them easier for the model to learn. The half sleeve has relatively lower precision and higher recall, leading to lower mAP. This could be because the appearance and length of half sleeves vary widely, which causes the model to make errors in recognizing this sleeve length. The model achieved good precision, recall and mAP for the long sleeve and the sleeveless attributes, which have the best recognition results. This indicates that the model accurately captures the features of these two attributes.

5.5 Visualization results

To directly verify the detection ability of the model, we visualized the results. Figure 9 shows the detection results of YOLOv8n and our model.

In the first clothing image, it can be seen that the original model incorrectly recognized the collar shape of the dress as a u-neck, while our proposed model can correctly recognize it as a round collar. In the second image, it can be seen that the original model incorrectly recognized the silhouette of the dress as an x-shape, while the improved model can correctly recognize it as an s-shape. In the third image, it can be seen that the original model incorrectly identified the collar shape of the dress as round neck and v-neck, while the improved model can correctly identify it as round neck. In the fourth image, it can be seen that the original model confused the half sleeve and the short sleeve. In the last clothing picture, the original model predicted duplicate prediction boxes. The visualization results show that our proposed improved model can effectively detect and recognize fine-grained attributes of clothing.

5.6 Comparative experiments

We validate this algorithm's lightweight and excellent detection performance by comparing it with mainstream object detection algorithms. Table 8 presents the comparative experimental results of different methods under the same experimental equipment, dataset, and parameter settings.

The experimental results show that our proposed model achieves the highest precision, recall and mAP among the compared methods, with clear advantages in detection accuracy and object localization for fine-grained attributes of clothing. It also has the fewest parameters (2.22 M) and the highest FPS (148), a clear gain in lightweight and real-time performance. Therefore, our model is suitable for deployment on low-cost devices with limited computational resources.

Table 8

The comparative results of different methods
Method	P (%)	R (%)	mAP (%)	Param (M)	FPS
Faster-RCNN	74.5	73.6	77.5	40.07	12
SSD	70.2	69.5	73.8	26.41	45
YOLOv5n	72.2	76.6	79.5	2.51	105
YOLOv6n	74.4	76.3	80.1	4.24	94
Ours	76.2	78.9	81.7	2.22	148
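
The efficiency comparison can be recomputed from the Param and FPS columns of Table 8. The sketch below derives, for each baseline, how the proposed model's parameter count and FPS differ in relative terms; the helper names are hypothetical and the numbers are copied from the table.

```python
# Param (M) and FPS values copied from Table 8.
TABLE_8 = {
    "Faster-RCNN": {"params_m": 40.07, "fps": 12},
    "SSD": {"params_m": 26.41, "fps": 45},
    "YOLOv5n": {"params_m": 2.51, "fps": 105},
    "YOLOv6n": {"params_m": 4.24, "fps": 94},
    "Ours": {"params_m": 2.22, "fps": 148},
}


def relative_change(new, old):
    """Percentage change of new relative to old."""
    return (new - old) / old * 100.0


def compare_to_ours(table, ours="Ours"):
    """Map each baseline to (parameter change %, FPS change %) for our model."""
    ref = table[ours]
    return {
        name: (
            relative_change(ref["params_m"], row["params_m"]),
            relative_change(ref["fps"], row["fps"]),
        )
        for name, row in table.items()
        if name != ours
    }


changes = compare_to_ours(TABLE_8)
for name, (param_pct, fps_pct) in changes.items():
    print(f"{name}: params {param_pct:+.1f}%, FPS {fps_pct:+.1f}%")
```

A negative parameter change and a positive FPS change for every baseline correspond to the smallest model and the highest throughput in the table.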

6 Conclusion

In this study, we improve the CSPDarknet network and the PAFPN network and propose a new model. We also built a fine-grained attributes dataset of clothing to lay the foundation for future research. The proposed model achieves a precision of 76.2%, recall of 78.9%, and mAP of 81.7%, with 2.22 M parameters and 148 FPS. It optimizes detection performance while meeting lightweight requirements, balancing model size and accuracy, and is suitable for deployment on devices with limited computing power.

The model proposed in this paper has shown good performance on the small-scale dataset, but there is still room for improvement. Future research directions will focus on further improving the recognition performance. In addition, we can also explore the adoption of more advanced attention mechanisms and lightweight modules to meet the demands of increasing data volume and complex clothing styles. These efforts will help promote the further development and application of fine-grained attributes recognition techniques for clothing.

Declarations

Conflict of interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Funding

We thank all the anonymous reviewers for their insightful comments and constructive suggestions. This work was supported by the National Natural Science Foundation of China (Grant No.61976105) and the Textile Vision Basic Research Program (Grant No. J202006).

Author Contribution

BP and JX conceptualized the study, BP and RP developed the methodology, NZ and JX conducted formal analysis and investigation, BP drafted the original manuscript, BP and RP reviewed and edited the manuscript, and RP supervised the study. All authors reviewed the manuscript.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

References

[1] Fang, N., Qiu, L., Zhang, S., Wang, Z., Hu, K., Wang, K.: A novel DAGAN for synthesizing garment images based on design attribute disentangled representation. Pattern Recogn. 136, 109248 (2023)
[2] Yu, F., Chen, Z., Jiang, M., Tian, Z., Peng, T., Hu, X.: Smart Clothing System With Multiple Sensors Based on Digital Twin Technology. IEEE Internet Things. 10, 6377–6387 (2023)
[3] Alirezazadeh, P., Dornaika, F., Moujahid, A.: Deep Learning with Discriminative Margin Loss for Cross-Domain Consumer-to-Shop Clothes Retrieval. Sensors-Basel. 22, 2660 (2022)
[4] Huang, F.-H., Lu, H.-M., Hsu, Y.-W.: From Street Photos to Fashion Trends: Leveraging User-Provided Noisy Labels for Fashion Understanding. IEEE Access. 9, 49189–49205 (2021)
[5] Li, C., Peng, C., Yao, L., Fu, Q., Dai, Y., Yang, J.: Clothes retrieval based on ResNet and cluster triplet loss. Text Res J. 93, 2421–2431 (2023)
[6] Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. Int J Comput Vision. 60, 91–110 (2004)
[7] Zhang, X., Zhang, L., Lou, X.: A Raw Image-Based End-to-End Object Detection Accelerator Using HOG Features. IEEE T Circuits-I. 69, 322–333 (2022)
[8] Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object Detection with Discriminatively Trained Part-Based Models. IEEE T Pattern Anal. 32, 1627–1645 (2010)
[9] Pan, H., Zhang, H., Lei, X., Xin, F., Wang, Z.: Hybrid dilated faster RCNN for object detection. IFS. 43, 1229–1239 (2022)
[10] Zhang, Q., Hu, X., Yue, Y., Gu, Y., Sun, Y.: Multi-object detection at night for traffic investigations based on improved SSD framework. Heliyon. 8, e11570 (2022)
[11] Gupta, C., Gill, N.S., Gulia, P., Chatterjee, J.M.: Correction to: A novel finetuned YOLOv6 transfer learning model for real–time object detection. J Real-Time Image Proc. 20, 54 (2023)
[12] Shajini, M., Ramanan, A.: A knowledge-sharing semi-supervised approach for fashion clothes classification and attribute prediction. Visual Comput. 38, 3551–3561 (2022)
[13] Zhou, Z., Liu, M., Deng, W., Wang, Y., Zhu, Z.: Clothing Image Classification with DenseNet201 Network and Optimized Regularized Random Vector Functional Link. J Nat Fibers. 20, 2190188 (2023)
[14] De Souza Inacio, A., Lopes, H.S.: EPYNET: Efficient Pyramidal Network for Clothing Segmentation. IEEE Access. 8, 187882–187892 (2020)
[15] Zhang, H.: ClothingOut: a category-supervised GAN model for clothing segmentation and retrieval. Neural Comput Appl. (2020)
[16] Chun, Y., Wang, C., He, M.: A Novel Clothing Attribute Representation Network-Based Self-Attention Mechanism. IEEE Access. 8, 201762–201769 (2020)
[17] Gu, M., Hua, W., Liu, J.: Clothing attribute recognition algorithm based on improved YOLOv4-Tiny. Signal Image Video P. 17, 3555–3563 (2023)
[18] Xiang, J., Dong, T., Pan, R., Gao, W.: Clothing Attribute Recognition Based on RCNN Framework Using L-Softmax Loss. IEEE Access. 8, 48299–48313 (2020)
[19] Li, T., Lyu, Y., Guo, Z., Du, L., Zou, F.: Construction of the PSO-LSSVM prediction model for sleeve pattern dimensions based on garment flat recognition. Int J Cloth Sci Tech. 35, 67–87 (2023)
[20] Zhu, S., Zou, X., Qian, J., Wong, W.K.: Learning Structured Relation Embeddings for Fine-Grained Fashion Attribute Recognition. IEEE T Multimedia. 26, 1652–1664 (2024)
[21] Roy, P., Bhattacharya, S., Ghosh, S., Pal, U.: Multi-scale attention guided pose transfer. Pattern Recogn. 137, 109315 (2023)
[22] Chen, Y., Song, J., Song, M.: Hierarchical gate network for fine-grained visual recognition. Neurocomputing. 470, 170–181 (2022)
[23] Seo, Y., Shin, K.: Hierarchical convolutional neural networks for fashion image classification. Expert Syst Appl. 116, 328–339 (2019)
[24] Matzen, K., Bala, K., Snavely, N.: StreetStyle: Exploring world-wide clothing styles from millions of photos, http://arxiv.org/abs/1706.01869, (2017)
[25] Wang, Z., Li, T.: A Lightweight CNN Model Based on GhostNet. Comput Intel Neurosc. 2022, 1–12 (2022)
[26] Wu, Z., Zou, X., Zhou, W., Huang, J.: YOLOX-PAI: An Improved YOLOX, Stronger and Faster than YOLOv6, http://arxiv.org/abs/2208.13040, (2023)
[27] Chen, L., Liu, R., Zhou, D., Yang, X., Zhang, Q.: Fused behavior recognition model based on attention mechanism. Vis Comput Ind Biome. 3, 7 (2020)
[28] Zhao, K., Lu, R., Wang, S., Yang, X., Li, Q., Fan, J.: ST-YOLOA: a Swin-transformer-based YOLO model with an attention mechanism for SAR ship detection under complex background. Front Neurorobotics. 17, 1170163 (2023)
[29] Zhu, D., Qi, R., Hu, P., Su, Q., Qin, X., Li, Z.: YOLO-Rip: A modified lightweight network for Rip currents detection. Front Mar Sci. 9, 930478 (2022)
[30] Shen, L., Lang, B., Song, Z.: CA-YOLO: Model Optimization for Remote Sensing Image Object Detection. IEEE Access. 11, 64769–64781 (2023)
[31] Gu, X., Xie, Y., Tian, Y., Liu, T.: A Lightweight Neural Network Based on GAF and ECA for Bearing Fault Diagnosis. Metals-Basel. 13, 822 (2023)
[32] Chawla, T., Mittal, S., Azad, H.K.: MobileNet-GRU fusion for optimizing diagnosis of yellow vein mosaic virus. Ecol Inform. 81, 102548 (2024)
[33] Chen, Z., Yang, J., Chen, L., Jiao, H.: Garbage classification system based on improved ShuffleNet v2. Resour Conserv Recy. 178, 106090 (2022)

No competing interests reported.

Reviews received at journal
24 Jun, 2024
Reviewers agreed at journal
24 Jun, 2024
Reviews received at journal
10 Apr, 2024
Reviewers agreed at journal
10 Apr, 2024
Reviewers agreed at journal
21 Mar, 2024
Reviewers invited by journal
20 Mar, 2024
Submission checks completed at journal
15 Mar, 2024
Editor assigned by journal
15 Mar, 2024
First submitted to journal
13 Mar, 2024
