A total of 4023 images were used for training, 501 for validation, and another 501 for testing. Eight different models were used in the implementation of the system. Model complexity is measured by two metrics: the number of parameters (Params) and floating-point operations (Flops) [34]. Table 4 presents the Flops and Params of each model. MobileNetV3 has the lowest complexity, with 58.79 M Flops and 1,519,906 Params, primarily due to its use of depthwise separable convolution, as discussed in Section 2.3. In contrast, VGG16 has the highest complexity among the models, with 15.53 G Flops and 134,277,186 Params, mainly due to its linear connection structure. Models such as ResNet and GoogLeNet fall between MobileNet and VGG in complexity, as they incorporate architectural features, such as the shortcut structure in ResNet and the Inception modules in GoogLeNet, that avoid a purely linear connection structure.
Table 4
Params and Flops of each model.
| Model | Flops | Params |
| --- | --- | --- |
| VGG16 | 15.53 G | 134,277,186 |
| ResNet18 | 1.82 G | 11,177,538 |
| ResNet34 | 3.67 G | 21,285,698 |
| ResNet50 | 4.12 G | 23,512,130 |
| MobileNetV2 | 318.96 M | 2,226,434 |
| MobileNetV3 | 58.79 M | 1,519,906 |
| GoogLeNet-InceptionV1 | 1.51 G | 5,601,954 |
| GoogLeNet-InceptionV3 | 2.85 G | 21,789,666 |
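Complexity figures of the kind reported in Table 4 can be reproduced with standard profiling tools. The following is a minimal sketch assuming PyTorch, torchvision model definitions and the third-party thop profiler; the 224 × 224 input resolution and the two-class output head are assumptions, and thop reports multiply-accumulate operations (MACs), which are often quoted interchangeably with Flops.

```python
import torch
from torchvision import models
from thop import profile  # pip install thop

# Assumed input resolution; the actual preprocessing in this study may differ.
dummy = torch.randn(1, 3, 224, 224)

candidates = {
    "VGG16": models.vgg16(num_classes=2),
    "ResNet18": models.resnet18(num_classes=2),
    "MobileNetV2": models.mobilenet_v2(num_classes=2),
    "MobileNetV3": models.mobilenet_v3_small(num_classes=2),
}

for name, model in candidates.items():
    model.eval()
    macs, _ = profile(model, inputs=(dummy,), verbose=False)  # MACs, not strict Flops
    n_params = sum(p.numel() for p in model.parameters())     # total parameter count
    print(f"{name}: {macs / 1e9:.2f} G MACs, {n_params:,} Params")
```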
Figure 9 shows the loss and accuracy curves of each CNN model during training and validation. Figure 9(a, b) shows the training and validation loss curves of each model as the epochs progress. Based on the training loss, the models can be divided into three groups: MobileNetV2, MobileNetV3 and InceptionV3 form the first group; ResNet18, ResNet34 and ResNet50 comprise the second group; VGG16 and GoogLeNet are in the third group. The first group starts to converge at approximately 15 epochs, the second group converges at around 10 epochs, and VGG16 begins to converge at around 20 epochs. After 50 epochs, the MobileNetV2, MobileNetV3 and InceptionV3 networks achieve the lowest training losses, with values of 0.00314, 0.004452 and 0.0007174, respectively. In Fig. 9(b), the validation loss curves show that all models start to converge before 10 epochs. Similar to Fig. 9(a), the models can be divided into three groups by validation loss. After 50 epochs, MobileNetV2, MobileNetV3 and InceptionV3 exhibit the lowest validation losses among the eight models, with values of 0.02386, 0.03566 and 0.08964, respectively.
Figure 9(c, d) shows the accuracy curves during training and validation for the eight models. In Fig. 9(c), except for the first group, the remaining five models show comparable accuracy, ranging from 0.8 to 0.9, while MobileNetV2, MobileNetV3 and InceptionV3 reach higher accuracy, between 0.95 and 1.0. In Fig. 9(d), the accuracy of ResNet, GoogLeNet and VGG16 is around 0.9, while MobileNet achieves an accuracy of about 1.0 and InceptionV3 is close to 0.95.
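The curves in Fig. 9 come from per-epoch bookkeeping of loss and accuracy on the training and validation sets. A condensed sketch of such a loop is shown below, assuming PyTorch with cross-entropy loss; `model`, `train_loader`, `val_loader` and `optimizer` are hypothetical placeholders for the networks and datasets of this study.

```python
import torch
import torch.nn as nn

def run_epoch(model, loader, criterion, optimizer=None, device="cpu"):
    """One pass over `loader`; trains if an optimizer is given, else evaluates."""
    training = optimizer is not None
    model.train(training)
    total_loss, correct, seen = 0.0, 0, 0
    with torch.set_grad_enabled(training):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            logits = model(images)
            loss = criterion(logits, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # Accumulate sample-weighted loss and correct predictions.
            total_loss += loss.item() * labels.size(0)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            seen += labels.size(0)
    return total_loss / seen, correct / seen

# Hypothetical usage over the 50 epochs reported in Fig. 9:
# for epoch in range(50):
#     tr_loss, tr_acc = run_epoch(model, train_loader, nn.CrossEntropyLoss(), optimizer)
#     va_loss, va_acc = run_epoch(model, val_loader, nn.CrossEntropyLoss())
```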
Overall, all the models converge well and perform strongly during the training and validation processes, and they also show good generalization ability during validation. Among them, the models of the first group exhibit the best performance in the modeling process. These trained models are further evaluated on the testing dataset, and the performance of each model can be analyzed in more depth using the confusion matrix, which reports the percentage of defect and non-defect fluorescent indications for each case.
Figure 10(a-h) displays the confusion matrices of the eight models on the testing set. As mentioned in Section 3.3, the TP, FN, FP and TN values can be read from these matrices. In aerospace precision castings, even a small defect can lead to significant losses. Therefore, in the detection process it is crucial to detect as many defects as possible, which means the TP value should be close to the number of true defects and FN should be close to 0. Based on this criterion, the fluorescent defect identification performance of the eight models is ranked as follows: MobileNetV2, MobileNetV3, InceptionV1, InceptionV3, ResNet50, ResNet34, ResNet18, VGG16.
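For reference, the four confusion matrix entries and the FN-driven criterion above can be computed with scikit-learn as in the sketch below; the label arrays are hypothetical stand-ins for the testing-set predictions, with 1 denoting the defect (positive) class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = defect (positive class), 0 = non-defect.
y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])

# With labels=[0, 1], rows are true classes and columns are predictions,
# so ravel() yields the entries in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# The criterion in the text: FN close to 0 (miss as few true defects as
# possible), i.e. maximize recall = TP / (TP + FN).
recall = tp / (tp + fn)
print(f"TP={tp} FN={fn} FP={fp} TN={tn} recall={recall:.3f}")
```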
However, cost reduction is also an important consideration, so the overall accuracy of defect detection must be taken into account as well; in this case, both FN and FP should be close to 0. From the confusion matrices in Fig. 10, the performance of the eight models under this criterion is ranked as follows: MobileNetV2, MobileNetV3, InceptionV3, ResNet18, ResNet34, InceptionV1, ResNet50, VGG16.
In both cases, MobileNetV2 consistently exhibits the best performance among the eight models. This can be attributed to the efficiency of the depthwise separable convolution and the inverted residual block in the feature extraction process [32].
The ROC curve plots the true positive rate against the false positive rate over varying discrimination thresholds [35]. Figure 11 shows the ROC curves obtained with the MobileNet, ResNet, GoogLeNet and VGG methods on the testing dataset. From Fig. 11, these models can be divided into two groups: MobileNetV2, MobileNetV3 and InceptionV3, and the remaining networks. In the first group, the three models exhibit similar capabilities at lower discrimination thresholds, with MobileNetV2 performing better at higher discrimination thresholds. In the second group, the ResNet18, ResNet34, ResNet50, VGG16 and InceptionV1 architectures have similar capabilities across all discrimination thresholds.
The area under the curve (AUC) of the ROC curve represents the classifier's ability to separate the classes: a larger AUC indicates better classification capability, meaning the classifier ranks more positive cases ahead of negative cases. The AUC scores for each architecture are listed in Table 5. Ranked by AUC from largest to smallest, the eight architectures are MobileNetV2, MobileNetV3, InceptionV3, ResNet18, ResNet50, ResNet34, InceptionV1 and VGG16, with corresponding AUC values of 0.999, 0.998, 0.988, 0.955, 0.954, 0.954, 0.952 and 0.950, respectively. Theoretically, a perfect classifier would have an AUC of 1. All these CNN architectures perform well on the binary task, with MobileNetV2 outperforming the rest.
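Curves of the kind shown in Fig. 11 are typically produced as in the following sketch, using scikit-learn; the score array here is a hypothetical placeholder for the softmax probability of the defect class produced by each trained model on the testing set.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground truth and defect-class probabilities.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.9, 0.3, 0.2, 0.7, 0.6])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one point per threshold
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label=f"model (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], "k--", label="chance")    # AUC = 0.5 baseline
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```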
Within the testing dataset, some false indications look more obviously defect-like than others. MobileNet is expected to exceed the other architectures once the system moves past these more obvious cases to the more difficult ones, as the inverted residual block is able to extract more complex features.
Precision-Recall (PR) analysis was conducted on four representative models from the three groups to investigate the effect of dataset imbalance on the fluorescent defect detection system, as shown in Fig. 12. The models trained on the balanced dataset exhibited superior performance compared to those trained on imbalanced datasets. In contrast to the balanced case, the models trained on imbalanced datasets showed similar performance to one another across all discrimination thresholds, whereas the models trained on balanced datasets split into two groups based on their performance at lower discrimination thresholds. This difference suggests that features may not have been adequately learned from the imbalanced dataset, leaving the imbalanced-trained models both less distinguishable from one another and weaker than the models trained on balanced datasets.
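A PR curve such as those in Fig. 12 can be computed as in the sketch below, again with scikit-learn and hypothetical score arrays; the PR curve is generally more sensitive than the ROC curve when the positive (defect) class is rare, which is why it is used here to probe dataset imbalance.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

# Hypothetical defect-class probabilities on an imbalanced testing set.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.2, 0.6, 0.8, 0.9, 0.4, 0.7])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)  # area under the PR curve

plt.plot(recall, precision, label=f"model (AP = {ap:.3f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```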
Table 5 presents the classification evaluation metrics of the eight CNN architectures on the testing set; results for both the defect and non-defect types are listed. MobileNetV2 achieves the best performance in all metrics for the defect type, with 99.2% Precision, 99.2% Recall, 99.2% F1-score, 99.2% Accuracy and 0.999 AUC. These results indicate that MobileNetV2 exhibits the highest performance among the architectures considered, making it a strong candidate for the automated fluorescent defect detection system.
Table 5
Overall Precision, Recall, F1-score, Accuracy, AP and AUC of models on balanced testing set.
| Model | Type | Precision | Recall | F1-score | AP | AUC | Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VGG16 | Defect | 87.2% | 88.9% | 88.0% | 0.945 | 0.950 | 87.8% |
|  | Non-defect | 88.5% | 86.7% | 87.6% | 0.959 | 0.951 |  |
| ResNet18 | Defect | 88.1% | 91.3% | 89.7% | 0.947 | 0.955 | 89.4% |
|  | Non-defect | 90.8% | 87.6% | 89.2% | 0.964 | 0.955 |  |
| ResNet34 | Defect | 86.6% | 92.1% | 89.2% | 0.945 | 0.954 | 88.8% |
|  | Non-defect | 91.4% | 85.5% | 88.4% | 0.964 | 0.954 |  |
| ResNet50 | Defect | 85.3% | 92.1% | 88.5% | 0.948 | 0.954 | 88.0% |
|  | Non-defect | 91.3% | 84.0% | 87.4% | 0.961 | 0.954 |  |
| MobileNetV2 | Defect | 99.2% | 99.2% | 99.2% | 0.999 | 0.999 | 99.2% |
|  | Non-defect | 99.2% | 99.2% | 99.2% | 0.999 | 0.999 |  |
| MobileNetV3 | Defect | 96.5% | 98.8% | 97.6% | 0.998 | 0.998 | 97.6% |
|  | Non-defect | 98.8% | 96.4% | 97.6% | 0.998 | 0.998 |  |
| GoogLeNet-InceptionV1 | Defect | 82.4% | 98.4% | 89.7% | 0.937 | 0.952 | 88.6% |
|  | Non-defect | 98.0% | 78.7% | 87.3% | 0.787 | 0.963 |  |
| GoogLeNet-InceptionV3 | Defect | 96.0% | 94.0% | 95.0% | 0.986 | 0.988 | 95.0% |
|  | Non-defect | 94.1% | 96.0% | 95.0% | 0.991 | 0.988 |  |
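Per-class metrics in the style of Table 5 can be assembled with scikit-learn as sketched below; the synthetic arrays mimic a 501-image testing set and are purely illustrative, with 1 denoting the defect class.

```python
import numpy as np
from sklearn.metrics import (classification_report, average_precision_score,
                             roc_auc_score)

# Hypothetical testing-set outputs: y_prob is the predicted probability of
# the defect class (label 1), y_true the ground truth.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 501)
y_prob = np.clip(y_true * 0.8 + rng.normal(0.1, 0.2, 501), 0, 1)
y_pred = (y_prob >= 0.5).astype(int)

# Per-class Precision, Recall and F1 as in Table 5 (label order: 0, 1).
print(classification_report(y_true, y_pred,
                            target_names=["Non-defect", "Defect"], digits=3))

# AP and AUC for the defect class; the non-defect class is scored with the
# complementary probability.
print("Defect AP :", average_precision_score(y_true, y_prob))
print("Defect AUC:", roc_auc_score(y_true, y_prob))
print("Non-defect AP:", average_precision_score(1 - y_true, 1 - y_prob))
```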
The good performance of the InceptionV3 architecture can be attributed to the InceptionV3 block, which effectively extracts features from the input images [30]. The MobileNetV2 and MobileNetV3 architectures perform even better among the considered architectures owing to the inverted residual block and depthwise convolution. However, MobileNetV2 slightly outperforms MobileNetV3, which could be due to the Neural Architecture Search (NAS) used to design MobileNetV3 not being optimally suited to the specific task of this study [31].
The overall performance of the ResNet architectures in this study aligns with the findings of Shipway et al. on a fluorescent crack defect dataset [11]. It is observed that as the ResNet architecture becomes deeper, the accuracy decreases while the recall increases. ResNet reaches an accuracy of about 89%, only slightly exceeding the VGG16 network in this study, which indicates that the residual module does not perform as effectively on the fluorescent defect dataset.
Interestingly, InceptionV1 shows excellent performance in recall (98.4%) but not in accuracy (88.6%). As shown in Fig. 10(e), 53 non-defect samples are misclassified as defect samples. This may be attributed to the InceptionV1 block's use of different-sized kernels to extract features at different scales [28]. Some non-defect samples contain backgrounds that resemble those of defect samples, making them difficult for the classifier to separate due to the similar background features.
Figure 13 shows the 2-D representation of the fluorescent images within the testing set, with the classes represented by color: red for defect indications and green for false indications. The initial distribution is semi-elliptical with small clusters, meaning that the images cannot be directly linearly classified. The high concentration of mixed defect and false indications in the bottom-left of the figure is the most important region to focus on for classification in industrial production.
Figure 14 shows the t-SNE visualization after feature extraction by four representative CNN models. Notably, the majority of the samples are well clustered, indicating effective feature extraction. However, for VGG16 and ResNet50, shown in Fig. 14(a, b), a small proportion of samples remain mixed together. Conversely, only a few false indication samples are mixed with the defect indications for MobileNetV2 and InceptionV3, as Fig. 14(c, d) shows. This validates the superior feature extraction of MobileNetV2 and InceptionV3 with respect to the VGG and ResNet models, and their increased accuracy in defect detection.
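Visualizations such as Figs. 13 and 14 can be generated as in the following sketch with scikit-learn's t-SNE; the feature matrix here is a random placeholder standing in for either raw pixels (Fig. 13) or penultimate-layer activations of a trained model (Fig. 14), and the perplexity value is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: `features` would be flattened images or CNN
# activations; `labels` marks 1 = defect indication, 0 = false indication.
features = np.random.rand(200, 512)
labels = np.random.randint(0, 2, 200)

# Embed into 2-D; PCA initialization and a fixed seed keep runs repeatable.
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(features)

plt.scatter(emb[labels == 1, 0], emb[labels == 1, 1],
            c="red", s=8, label="defect indication")
plt.scatter(emb[labels == 0, 0], emb[labels == 0, 1],
            c="green", s=8, label="false indication")
plt.legend()
plt.show()
```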
To gain deeper insight into how the CNN models extract fluorescent defect features, Grad-CAM [36] was used to visualize the extracted features, taking the last feature extraction layer before the classifier as the input to the visualization. As shown in Fig. 15, the Grad-CAM (GC) heat map, Guided Grad-CAM (GGC) image and feature fusion image are used to visualize the features extracted by the models. Figure 15(a) reveals that the final extracted feature resembles a hollow polygon. Combined with Fig. 15(b), the GGC image indicates that the extracted features correspond to the boundaries of the fluorescent display, with a probability of being defective of 0.87 (Fig. 15(c)); VGG16 thus pays more attention to boundary features during classification. For ResNet50, shown in Fig. 15(d-f), the behavior of the extracted features is somewhat similar to that of VGG16, but it focuses more on the lower-right boundary. Conversely, InceptionV3 concentrates more on the bright region and the upper boundary of the display, as depicted in Fig. 15(g-i). For MobileNetV2, illustrated in Fig. 15(j-l), the GC heat map and GGC image show that the features are distributed at the top and bottom in an oval-like shape, with a high probability of being defective of 0.99. This suggests that MobileNetV2 pays particular attention to the upper and lower boundaries of the fluorescent display and gradually expands towards the center of the bright region.
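A GC heat map of the kind shown in Fig. 15 can be produced with a few lines of PyTorch using forward and backward hooks. The sketch below is a generic Grad-CAM implementation under stated assumptions, not the exact code of this study; the choice of `target_layer` (the last convolutional block, e.g. `model.features[-1]` for torchvision's MobileNetV2) is an assumption.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Grad-CAM heat map for one image of shape (1, 3, H, W);
    `target_layer` is the last convolutional block before the classifier."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))

    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()  # explain the predicted class
    model.zero_grad()
    logits[0, class_idx].backward()              # gradients w.r.t. that class
    h1.remove(); h2.remove()

    # Channel weights: global-average-pooled gradients; CAM: weighted sum of
    # activations, ReLU'd and upsampled to the input resolution.
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:],
                        mode="bilinear", align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze().detach()

# Hypothetical usage with a trained torchvision MobileNetV2:
# model.eval()
# heatmap = grad_cam(model, img_tensor, model.features[-1])
```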
The attention paid during feature extraction varies among the four models, but there are commonalities, particularly the boundary features and the texture of the bright region. These features are especially prominent in InceptionV3 and MobileNetV2, which also exhibit the better performance, and they are similar to the cues used in current guidelines for manual inspection. However, the visualization of features extracted by CNN models enables more precise localization of the upper and lower boundary features of fluorescent displays, as well as the texture of bright regions, providing valuable guidance for the design of white-box fluorescent defect display features.