In this work, we utilized and enhanced the methodology proposed by Rui-Yang Ju et al. for pediatric wrist fracture detection using YOLOv8 and attention mechanisms [14, 15]. Our model's architecture follows a similar design, building upon the YOLOv8 backbone and integrating attention mechanisms to improve detection accuracy. We refer readers to the original architecture for detailed insights into the basic framework; the improvements introduced in this work are elaborated below.
The YOLOv8 architecture serves as the foundation for this work. It consists of four key components: the Backbone, Neck, Head, and Loss Function, and is largely based on the structure proposed by Chien et al. [16]:
- Backbone: The Cross-Stage Partial (CSP) network forms the backbone, optimized for computational efficiency. YOLOv8 replaces YOLOv5's C3 module with the C2f module, enhancing feature extraction while reducing computational load. All convolutional layers use the Convolution-Batch Normalization-SiLU structure.
- Neck: YOLOv8 combines Feature Pyramid Networks and Path Aggregation Networks for multi-scale feature extraction. Following Ju et al. [15], we made minor modifications, including the addition of attention modules.
- Head: YOLOv8 adopts a decoupled head structure that processes classification and regression separately. Its anchor-free approach improves accuracy for small objects such as fractures.
- Loss Function: YOLOv8 uses Binary Cross-Entropy (BCE) for classification and Distribution Focal Loss (DFL) with Complete Intersection over Union (CIoU) for regression, enhancing small object detection.
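To illustrate the regression loss, the CIoU term penalizes not only a lack of overlap but also center distance and aspect-ratio mismatch. The following is a minimal sketch (not the authors' implementation), assuming boxes in (x1, y1, x2, y2) format:

```python
import math
import torch

def ciou(box1: torch.Tensor, box2: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Complete IoU between two (x1, y1, x2, y2) boxes; illustrative helper,
    not the YOLOv8 source. Returns a scalar tensor in (-1, 1]."""
    # Intersection area
    ix1, iy1 = torch.max(box1[0], box2[0]), torch.max(box1[1], box2[1])
    ix2, iy2 = torch.min(box1[2], box2[2]), torch.min(box1[3], box2[3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    a1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    a2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    iou = inter / (a1 + a2 - inter + eps)
    # Squared centre distance over squared diagonal of the enclosing box
    cw = torch.max(box1[2], box2[2]) - torch.min(box1[0], box2[0])
    ch = torch.max(box1[3], box2[3]) - torch.min(box1[1], box2[1])
    rho2 = ((box2[0] + box2[2] - box1[0] - box1[2]) ** 2 +
            (box2[1] + box2[3] - box1[1] - box1[3]) ** 2) / 4
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (
        torch.atan((box2[2] - box2[0]) / (box2[3] - box2[1] + eps)) -
        torch.atan((box1[2] - box1[0]) / (box1[3] - box1[1] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v
```

For identical boxes every penalty term vanishes and the score approaches 1; the regression loss is then 1 − CIoU.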
Hyperparameter tuning was conducted to enhance the models' performance and develop the improved YOLOv8 (iYOLOv8) model. We began by training for 60 epochs, as recommended by baseline YOLOv8 studies, but quickly found that more epochs yielded better results: systematically testing up to 100 epochs revealed significant improvements in precision and recall.
Curious about the potential benefits of extended training, we also experimented with 300 epochs. While this produced a slight increase in accuracy, returns diminished beyond 100 epochs, with only marginal improvements in mean Average Precision (mAP) at the cost of much longer training times. Through several iterations, the optimal learning rate was identified as 1e-2, paired with a weight decay of 5e-4; this combination allowed the model to converge quickly without overfitting. A batch size of 16 was selected for fracture detection in pediatric wrist X-rays, striking a balance between computational efficiency and model performance: it provides stable gradient updates while preserving the small-scale features critical for accurate fracture identification. The SGD optimizer was preferred over Adam because of its more consistent convergence on high-dimensional medical image data, ultimately enhancing feature extraction and classification accuracy for subtle fractures. The newly modified architecture of the model is illustrated in Fig. 1.
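The tuned hyperparameters above can be written as a minimal PyTorch training configuration. This is a sketch only: the momentum value and the stand-in module are illustrative assumptions, not reported settings of the actual iYOLOv8 network.

```python
import torch

# Stand-in module for illustration; not the actual iYOLOv8 network.
model = torch.nn.Conv2d(3, 16, kernel_size=3)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-2,            # optimal learning rate identified in the paper
    momentum=0.937,     # assumed value, common in YOLO training recipes
    weight_decay=5e-4,  # regularization that avoided overfitting
)

EPOCHS = 100      # gains plateaued beyond this point in the experiments
BATCH_SIZE = 16   # balances gradient stability and small-feature fidelity
```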
To further refine feature extraction and bolster the model's ability to identify fractures in pediatric wrist X-rays, multiple attention mechanisms (AM) were incorporated into the architecture, yielding the iYOLOv8-AM models. These include the Convolutional Block Attention Module (CBAM), Global Attention Mechanism (GAM), Efficient Channel Attention (ECA), Shuffle Attention (SA), and the Global Context (GC) block (Fig. 2). Each module was independently added after the four C2f modules in the Neck, enabling the model to selectively focus on the most relevant features while suppressing irrelevant information.
- CBAM: Sequentially applies channel and spatial attention to emphasize informative parts of the image.
- GAM: Simplifies feature recalibration and removes max pooling to better preserve detail in medical images.
- ECA: Uses 1D convolution for efficient channel-wise attention, improving feature integration.
- SA: Uses channel shuffle to focus on grouped feature maps, balancing accuracy and efficiency.
- GC Block: Captures both global and local features, which are crucial for identifying subtle wrist fractures.
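Of the five mechanisms, ECA is the simplest to sketch: it replaces fully connected channel attention with a single 1D convolution over pooled channel descriptors. The following is a minimal sketch, assuming a fixed kernel size k = 3 (the original ECA formulation derives k adaptively from the channel count):

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention sketch: global average pooling followed
    by a 1D convolution across the channel axis."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, C, 1, 1): one descriptor per channel
        y = x.mean(dim=(2, 3), keepdim=True)
        # 1D conv over channels captures local cross-channel interaction
        y = self.conv(y.squeeze(-1).transpose(1, 2)).transpose(1, 2).unsqueeze(-1)
        return x * torch.sigmoid(y)  # channel-wise reweighting of the input
```

The module is drop-in: it preserves the feature map's shape, which is what allows each attention block to be inserted after a C2f module without altering the surrounding architecture.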
One of the primary innovations in this research was developing and refining the GC block, which proved to be the most effective attention mechanism compared to others such as SA, ECA, and GAM. While the GC block had been previously introduced in object detection models, we proposed critical structural improvements to make it more powerful and efficient in medical image analysis, particularly for fracture detection (Fig. 3).
The original GC block was designed to capture global information from images, enhancing the network's ability to handle complex object detection tasks by aggregating global features. However, certain inefficiencies were identified in capturing smaller features, such as subtle fractures in medical images. To tackle these shortcomings, several modifications were proposed. In the original GC block, global and local features were aggregated without prioritizing critical regions within the image. To improve this, a dynamic weighting mechanism was introduced that assigns greater importance to regions likely to contain fractures while still considering the global context [17]. This adjustment allows the model to focus more on relevant areas, such as bone structures in X-rays, while filtering out irrelevant background noise.
Let the feature map be denoted as \(F \in \mathbb{R}^{C \times H \times W}\), where C is the number of channels and H and W are the height and width of the feature map. Dynamic weighting is applied using a learned weighting map \(W \in \mathbb{R}^{C \times H \times W}\), which modifies the feature map by element-wise multiplication:
$$F_{weighted} = F \odot W$$
Here, \(\odot\) denotes element-wise multiplication, and W is generated by a learned function that assigns greater weight to regions with a high fracture likelihood, helping the model focus on relevant areas.
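One plausible instantiation of this learned weighting function is sketched below; the paper does not specify its exact form, so the 1×1 convolution with a sigmoid gate is an assumption for illustration:

```python
import torch
import torch.nn as nn

class DynamicWeighting(nn.Module):
    """Sketch of F_weighted = F ⊙ W with a learned weighting map W
    (one plausible form; not necessarily the authors' exact function)."""
    def __init__(self, channels: int):
        super().__init__()
        self.weight_fn = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),  # weights in (0, 1); larger on likely-fracture regions
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        w = self.weight_fn(f)   # W has the same C x H x W shape as the features
        return f * w            # element-wise multiplication F ⊙ W
```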
Moreover, the standard GC block used a static global pooling layer, which often discarded detailed spatial information crucial for fracture detection. To address this, we proposed an adaptive pooling layer that adjusts the pooling size based on the detected features, ensuring that finer features, such as small fractures, are preserved during feature extraction while the broader global context is still captured. Adaptive pooling is performed at multiple sizes on an input feature map F to retain both global and local features. Let \(P_{s}(F)\) be the adaptive pooling operation with output size s. The final output is a concatenation of pooled features at multiple scales:
$$F_{pooled} = \mathrm{Concat}(P_{1}(F),\, P_{2}(F),\, P_{3}(F),\, \dots)$$
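The multi-scale pooling above can be sketched as follows; the choice of adaptive average pooling and the output sizes (1, 2, 3) are illustrative assumptions, as the paper leaves them open:

```python
import torch
import torch.nn.functional as F_nn

def multi_scale_pool(f: torch.Tensor, sizes=(1, 2, 3)) -> torch.Tensor:
    """Sketch of F_pooled = Concat(P_1(F), P_2(F), P_3(F), ...): adaptive
    average pooling at several output sizes, flattened and concatenated."""
    pooled = [F_nn.adaptive_avg_pool2d(f, s).flatten(1) for s in sizes]
    return torch.cat(pooled, dim=1)  # (B, C * (1 + 4 + 9)) for sizes 1, 2, 3
```

The size-1 branch preserves the global context of the original GC block, while the larger grids retain coarse spatial layout that a single global pool would discard.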
Additionally, the GC block was enhanced with cross-dimensional interactions to improve the feature refinement process, allowing it to learn dependencies between spatial and channel dimensions more effectively [18]. This change enables the model to process spatial and contextual information jointly, improving the overall feature representation of both small and large fractures. For a feature map F, this is expressed as:
$$F_{interaction} = F_{c}(F) \odot F_{s}(F)$$
where \(F_{c}(F)\) denotes the channel attention map, \(F_{s}(F)\) the spatial attention map, and \(\odot\) element-wise multiplication. The GC block's effectiveness was enhanced while preserving computational efficiency: by streamlining the feature aggregation process and removing redundant operations, the block maintained a low inference time of 8.2 ms, critical for real-time medical applications.
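A CBAM-style sketch of this cross-dimensional interaction is given below; the concrete forms of \(F_{c}\) and \(F_{s}\) are assumptions, since the text specifies only their element-wise combination:

```python
import torch
import torch.nn as nn

class CrossDimInteraction(nn.Module):
    """Sketch of F_interaction = F_c(F) ⊙ F_s(F): channel and spatial
    attention maps computed separately, then combined by broadcasting
    element-wise multiplication (illustrative forms, CBAM-style)."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel_fc = nn.Conv2d(channels, channels, kernel_size=1)
        self.spatial_conv = nn.Conv2d(1, 1, kernel_size=7, padding=3)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # F_c(F): one weight per channel, from global average pooling
        fc = torch.sigmoid(self.channel_fc(f.mean(dim=(2, 3), keepdim=True)))
        # F_s(F): one weight per spatial position, from the channel mean
        fs = torch.sigmoid(self.spatial_conv(f.mean(dim=1, keepdim=True)))
        return (f * fc) * fs  # joint spatial-channel refinement of F
```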
Key parameters and metrics used to evaluate the models' performance:
- Epochs: One full pass of the model through the entire training dataset. Each epoch lets the model refine its internal parameters to improve accuracy in predicting fractures.
- Parameters (Params): Internal values that the model learns during training, including weights and biases, which are adjusted to minimize error and improve fracture detection performance.
- Inference: The phase in which the trained model makes predictions on new data, such as detecting fractures in previously unseen medical images.
- Precision: The proportion of correctly predicted positive cases (true positives) out of all predicted positive cases (true positives plus false positives). It indicates how reliable the positive predictions are.
- Recall: The proportion of actual positive cases (true positives) that the model correctly identified. It reflects the model's ability to detect all relevant cases.
- F1-Score: Combines precision and recall into a single metric to assess overall accuracy, especially when there is an imbalance between fracture and non-fracture instances.
- mAP50 (Mean Average Precision at IoU 50%): The model's average detection accuracy using a 50% overlap threshold between predicted bounding boxes and the actual fracture locations. It is commonly used to evaluate object detection tasks, including medical imaging.
- mAP95 (Mean Average Precision at IoU 50–95%): Extends mAP50 by averaging precision across multiple IoU thresholds (from 50% to 95%), providing a more comprehensive assessment of the model's ability to localize fractures accurately.
- FLOPs (Floating-Point Operations): Quantifies the computational complexity of the model by counting the floating-point operations needed during inference, indicating how much computational effort is required to detect fractures in new data.
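The threshold-level metrics in this list reduce to simple ratios of detection counts; a small illustrative helper is shown below (mAP additionally averages precision over recall levels and, for mAP95, over IoU thresholds):

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from true/false positive and false
    negative counts. Illustrative helper, not an mAP implementation."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, 8 correctly detected fractures with 2 false alarms and 2 misses give precision, recall, and F1 of 0.8 each.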
The GRAZPEDWRI-DX dataset was used, comprising over 20,000 X-ray images, to detect pediatric wrist fractures. To further enhance the model's performance, several steps were implemented to improve the dataset's quality.
First, the dataset underwent a thorough cleaning process, during which low-quality images—such as those with artifacts or poor resolution—were removed. Mislabeling issues were also addressed by cross-referencing image annotations with expert radiologist reviews, with particular attention to underrepresented cases like "bone anomaly" fractures.
Another significant challenge in the dataset was the imbalance between different fracture types. To mitigate this, synthetic data augmentation techniques were employed, including random rotations, flips, and brightness adjustments, specifically targeting minority classes such as "soft tissue" and "bone anomaly" fractures. This approach enhanced the model's ability to detect these rare fracture types.
Additionally, the brightness and contrast of the X-ray images were normalized to achieve greater uniformity across the dataset. This step reduced noise and allowed the model to generalize better across various X-ray sources. To ensure robust evaluation, a stratified random split was performed to create balanced training, validation, and test sets, preserving the ratio of different classes in each split. This strategy improved the model's generalization capability and helped reduce overfitting.
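The stratified split can be sketched as grouping image indices by class and sampling each group proportionally; the split fractions and random seed below are assumptions, as the paper does not report them:

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.1, test_frac=0.1, seed=0):
    """Sketch of a stratified random split: indices are grouped per class
    so each split preserves the class ratio (fractions/seed assumed)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within each class
        n_val = round(len(indices) * val_frac)
        n_test = round(len(indices) * test_frac)
        val += indices[:n_val]
        test += indices[n_val:n_val + n_test]
        train += indices[n_val + n_test:]
    return train, val, test
```

Because each class is partitioned independently, rare classes such as "bone anomaly" are guaranteed representation in every split rather than being lost by chance.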
Integrating these attention mechanisms and the dataset improvements resulted in substantial performance gains over the baseline YOLOv8 model. Specifically, the mAP50 improved from 63.6% to 66.32%, surpassing previous state-of-the-art results. Remarkably, the model maintained an efficient inference time, increasing by only 0.2 ms per image despite the added complexity. Moreover, detection accuracy was notably enhanced for challenging cases, such as small fractures and underrepresented classes, thanks to the attention mechanisms and the improved dataset balance.
By combining the new iYOLOv8 architecture with advanced attention mechanisms and dataset enhancements, this work offers a robust solution for pediatric wrist fracture detection, demonstrating significant improvements in both accuracy and efficiency.