As discussed, the data from the first experimental setup are used for the initial analysis. Figure 9 presents the time-domain and frequency-domain plots of the raw vibration signals for case 7 under both gear conditions. The analysis reveals that neither the time-domain (TD) plot nor the frequency-domain (FD) spectrum provides information pertaining to the gear meshing frequencies (GMF) and the associated fault frequencies. Notably, the plots for the broken-tooth condition and the intact-teeth condition are strikingly similar. The same pattern was observed across all other cases; however, due to space constraints, not all figures are presented here. In the experiment, the input-shaft gear has 23 teeth and meshes with a 29-tooth gear on the intermediate shaft. A second gear on the intermediate shaft, with 25 teeth, meshes with the 20-tooth gear on the output shaft, as seen in Fig. 3. Hence, for case 7, i.e., an input speed of 3000 rpm under the no-load condition, the frequency spectrum should show peaks at the gear meshing frequencies of 1150 Hz and 991 Hz; instead, the dominant peaks in Fig. 9 appear at 200 Hz, 1202 Hz, and 2343 Hz, none of which can be attributed to the broken-tooth fault. Since no GMF components are evident in the frequency spectrum, the next step is to denoise the signal so as to separate the useful information from the noise.
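For reference, the expected meshing frequencies follow directly from the tooth counts and shaft speeds quoted above; the standard relation GMF = z·n/60 (z teeth, n in rpm) gives, for case 7:

```latex
% Expected gear meshing frequencies for case 7 (input speed 3000 rpm)
\mathrm{GMF}_{1} = \frac{z_{\mathrm{in}}\, n_{\mathrm{in}}}{60}
                 = \frac{23 \times 3000}{60} = 1150~\mathrm{Hz}, \qquad
n_{\mathrm{int}} = n_{\mathrm{in}}\,\frac{z_{\mathrm{in}}}{z_{\mathrm{int}}}
                 = 3000 \times \frac{23}{29} \approx 2379~\mathrm{rpm}, \qquad
\mathrm{GMF}_{2} = \frac{z_{\mathrm{int,2}}\, n_{\mathrm{int}}}{60}
                 = \frac{25 \times 2379}{60} \approx 991~\mathrm{Hz}.
```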
The initial analysis of the raw vibration signal proved ineffective in detecting fault information. Consequently, a wavelet denoising approach was employed to enhance the vibration signal. The process commenced with detrending the signals through a linear filter, followed by an 8-level decomposition with the Symlet-4 wavelet. The denoising method selected was Bayes with the median threshold rule. The resulting denoised vibration signal, depicted in Fig. 10, exhibited a notable reduction in noise compared to its pre-denoised state (Fig. 9). The signal-to-noise ratio (SNR) for case 7 was 5.86 for the good teeth and 6.61 for the broken tooth. Examination of the power spectrum revealed elevated amplitudes at higher frequencies for the broken tooth, in contrast to the predominantly low-frequency amplitudes observed for the intact teeth. It is noteworthy that the energy levels in the power spectrum were markedly moderated for the broken-tooth scenario compared to the healthy-teeth condition.
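A minimal sketch of this denoising step is given below using PyWavelets. It assumes the raw signal is available as a NumPy array and approximates the Bayes/median rule of the original workflow with a per-level soft threshold, since that specific rule is not exposed by PyWavelets.

```python
import numpy as np
import pywt
from scipy.signal import detrend

def wavelet_denoise(raw_signal, wavelet="sym4", level=8):
    """Detrend the signal, then denoise it via an 8-level Symlet-4 decomposition.

    A simple soft threshold per detail level is used here as a stand-in for
    the Bayes/median rule employed in the paper.
    """
    x = detrend(raw_signal, type="linear")          # remove linear trend
    coeffs = pywt.wavedec(x, wavelet, level=level)  # multilevel DWT

    denoised = [coeffs[0]]                          # keep approximation coefficients
    for d in coeffs[1:]:
        sigma = np.median(np.abs(d)) / 0.6745       # robust noise estimate per level
        thr = sigma * np.sqrt(2 * np.log(len(x)))   # universal threshold
        denoised.append(pywt.threshold(d, thr, mode="soft"))

    return pywt.waverec(denoised, wavelet)[: len(x)]
```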
To provide a more nuanced representation of the broken-tooth fault, spectrogram plots were generated. Despite this effort, the spectrogram images failed to yield clear indications, as the broken tooth did not produce discernible effects in specific frequency ranges, as also seen in (Vernekar et al. 2014). The high-amplitude frequencies associated with the broken tooth were dispersed throughout the spectrum, hindering the identification of a distinct fault frequency, as anticipated from Fig. 10.
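The spectrogram images can be produced with a short-time Fourier transform of each signal segment. The sketch below uses `scipy.signal.spectrogram`; the window length and overlap are passed as parameters (the values used for setup 2 are listed later in Table 2), and the sampling rate `fs` is assumed to be known for each setup.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import spectrogram

def save_spectrogram_image(segment, fs, window_size, window_overlap_pct, out_path):
    """Compute an STFT spectrogram of one vibration segment and save it as an image.

    window_size is in samples and window_overlap_pct in percent, e.g. a
    200-sample window with 80% overlap as in Table 2.
    """
    f, t, Sxx = spectrogram(segment, fs=fs,
                            nperseg=window_size,
                            noverlap=int(window_size * window_overlap_pct / 100))
    plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading="gouraud")
    plt.axis("off")                               # image only, no axes, for CNN input
    plt.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close()
```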
Since traditional signal processing techniques based on the raw vibration signals, and then on the wavelet-denoised signals, failed to detect the broken-tooth fault, the proposed SVM-based and CNN-based fault detection methods were employed. The same 2000-point segments used for STFT image generation were used for evaluating the time-domain statistical parameters. Kurtosis, crest factor, and RMS are the three time-domain statistical parameters used as input features to the SVM models for gear fault identification. In this paper, linear SVM, quadratic SVM, cubic SVM, fine Gaussian SVM, medium Gaussian SVM, and coarse Gaussian SVM were applied to the dataset. The SVM model with the best classification accuracy is shown in Fig. 11. It can be seen from Fig. 8 that, using all three parameters (kurtosis, crest factor, and RMS) as input features, the fine Gaussian model gave the best classification accuracy at 87.3%, while the linear SVM gave the worst at 56.6%. For the individual features, using only kurtosis as the input the coarse Gaussian SVM gave the best result with 57.2% accuracy; using the crest factor the coarse Gaussian SVM again produced the best accuracy at 56.1%; and using RMS the fine Gaussian SVM gave the best accuracy at 90.7%. The confusion matrices for all four feature sets are shown in Fig. 11. From Fig. 8 it can be concluded that RMS is the best-suited feature for broken-tooth fault detection when employing the fine Gaussian SVM model.
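A compact sketch of this feature-based pipeline is shown below. It computes kurtosis, crest factor, and RMS for each 2000-point segment and trains an RBF-kernel SVM from scikit-learn as a stand-in for the fine Gaussian SVM of the MATLAB Classification Learner; the `segments` and `labels` arrays are assumed to be available from the segmentation step.

```python
import numpy as np
from scipy.stats import kurtosis
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def time_domain_features(segment):
    """Return [kurtosis, crest factor, RMS] of one 2000-point vibration segment."""
    rms = np.sqrt(np.mean(segment ** 2))
    crest_factor = np.max(np.abs(segment)) / rms
    return [kurtosis(segment), crest_factor, rms]

def train_svm(segments, labels):
    """Train an RBF ("Gaussian") SVM on the time-domain features of all segments."""
    X = np.array([time_domain_features(s) for s in segments])
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=0)
    clf = SVC(kernel="rbf", gamma="scale")      # RBF kernel ~ Gaussian SVM
    clf.fit(X_train, y_train)
    return clf, clf.score(X_test, y_test)       # classifier and test accuracy
```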
As the SVM-based classification produced a best accuracy of only 90.7%, the proposed CNN model was next tested on the segmented vibration signals transformed into spectrogram images. For training, 80% of the total dataset (i.e., 720 images from each class, broken tooth and good teeth) was fed to the CNN model in randomized order. It should be noted that the input images were 656 × 875 RGB. A larger-than-usual image size was chosen because the spectrogram images for the broken-tooth and good-teeth cases look similar, as shown in Fig. 12, so using smaller images leads to poor training of the CNN model.
The proposed deep-learning-based CNN model was able to classify the broken-tooth fault with an overall accuracy of 98.6%, as can be seen in Fig. 14. The confusion matrix for this classification is shown in Fig. 15: of the 360 test images (the remaining 20% of the 1800-image dataset), the trained model misclassified five good-teeth images as broken teeth and classified the broken-teeth images with 100% accuracy. Figure 13 compares the accuracy of all the tested SVM models with different input features and of the proposed CNN model.
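As a hedged illustration of how such a classifier can be set up, the Keras sketch below accepts 656 × 875 RGB spectrogram images and produces a binary decision with an 80/20 split; the layer counts, filter sizes, and epoch count are assumptions for illustration only, not the authors' architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(656, 875, 3)):
    """Illustrative CNN for binary gear-fault classification (good vs. broken tooth)."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.GlobalAveragePooling2D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),   # good teeth vs. broken tooth
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Example usage with the 80/20 split of the 1800 spectrogram images (epoch count illustrative):
# model = build_cnn()
# model.fit(train_images, train_labels, validation_data=(test_images, test_labels),
#           epochs=30, batch_size=64)   # batch size of 64 for setup 1, as discussed below
```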
For experimental setup 1 the model performed very well, but its efficacy also needs to be checked on the data from experimental setup 2, where the fault severity is much lower and the available data are sparse. The data generated for experimental setup 2 form a binary classification problem, where the chipped (broken) tooth is one class and the normal-teeth gear is the other. As data were collected for just ten seconds at a 10 kHz sampling rate, only 0.1 million data points are available, which is too few to generate a sufficient number of spectrogram images. To overcome this issue, overlapping of the vibration signal segments is used when generating the spectrogram images. With so few data points, segmentation into 2000-point segments with no overlap yields only 50 images, and with so few images the CNN model could not learn the intricate patterns, leading to an accuracy of 40.8%. To increase the number of images generated, the signal is segmented into smaller segments of 2000, 1000, and 500 data points with overlap percentages of 50 and 80. The number of images generated by each segment length and overlap combination is given in Table 2, and a sketch of the underlying segment-count calculation follows the table. Reducing the segment length below 500 data points leads to poor training of the CNN model, as not a single GMF would be recorded below 423 data points.
Table 2
Signal segmentation details with STFT image parameters and their impact on accuracy

| Fault Case | Segment Length | Segment Overlap (%) | STFT Window Size | STFT Window Overlap (%) | Generated Images | Training Set | Testing Set | Accuracy (%) |
|---|---|---|---|---|---|---|---|---|
| Good Teeth | 2000 | 0 | 200 | 50 | 50 | 40 | 10 | 40.8 |
| Good Teeth | 2000 | 50 | 1000 | 50 | 99 | 79 | 20 | 53.2 |
| Good Teeth | 2000 | 80 | 200 | 80 | 246 | 196 | 100 | 84.6 |
| Good Teeth | 1000 | 80 | 100 | 80 | 496 | 396 | 100 | 91 |
| Good Teeth | 500 | 80 | 50 | 80 | 996 | 796 | 200 | 93 |
| Broken Teeth | 2000 | 0 | 200 | 50 | 50 | 40 | 10 | 40.8 |
| Broken Teeth | 2000 | 50 | 100 | 80 | 99 | 79 | 20 | 53.2 |
| Broken Teeth | 2000 | 80 | 200 | 80 | 246 | 196 | 100 | 84.6 |
| Broken Teeth | 1000 | 80 | 100 | 80 | 496 | 396 | 100 | 91 |
| Broken Teeth | 500 | 80 | 50 | 80 | 996 | 796 | 200 | 93 |
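The image counts in Table 2 follow directly from the segment length and overlap: with N samples, segment length L, and overlap fraction p, the hop size is L(1 − p) and the number of segments is ⌊(N − L)/(L(1 − p))⌋ + 1. A minimal sketch reproducing these counts for the 100,000 samples of setup 2:

```python
def num_segments(n_samples, segment_length, overlap_pct):
    """Number of overlapping segments obtainable from a signal of n_samples points."""
    hop = int(segment_length * (1 - overlap_pct / 100))   # step between segment starts
    return (n_samples - segment_length) // hop + 1

# 10 s at 10 kHz = 100,000 samples (experimental setup 2)
for length, overlap in [(2000, 0), (2000, 50), (2000, 80), (1000, 80), (500, 80)]:
    print(length, overlap, num_segments(100_000, length, overlap))
# -> 50, 99, 246, 496, 996 images per fault class, matching Table 2
```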
Segmentation size and overlap thus play a crucial role in the efficacy of the CNN model when data availability is low: the higher the number of images, the better the training of the CNN model and, hence, the better the classification accuracy. Figure 17 shows the confusion matrix for which the maximum classification accuracy of 93% is attained. Observations from Fig. 16 suggest that the spectrogram images for good and broken teeth exhibit high visual similarity. This can be attributed to the minimal severity of the seeded damage on the broken tooth, whose spectral characteristics therefore closely resemble those of healthy teeth.
Figure 18 illustrates the potential similarity between spectrogram images from different clusters using k-nearest-neighbours (k-NN) clustering with a minimum Euclidean distance metric. Here, PCA was applied to reduce the original high-dimensional features (3792 × 574000) to a 3D space for visualization purposes.
Figure 18 presents the results of principal component analysis (PCA) applied to the extracted features, visualized as clusters in 3D space. PCA dimensionality reduction of the normalized data was employed to facilitate visualization in this three-dimensional plot. The figure reveals a distinct separation between Cluster 2, corresponding to the broken-teeth images from experimental setup 1, and the remaining clusters. Clusters 1, 3, and 4 (good teeth of experimental setup 1, and good teeth and broken teeth of experimental setup 2, respectively) appear more closely grouped. This suggests that the features of the good teeth from both experimental setups (1 and 2) and of the chipped broken tooth from setup 2 are similar. Moreover, clusters 3 and 4 almost overlap each other, which explains the difficulty in classifying the experimental setup 2 data.
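A minimal sketch of this visualization step is given below, assuming the flattened spectrogram-image features are stacked in a matrix `features` with one row per image and `cluster_labels` identifying the four groups; it normalizes the data, projects it onto three principal components, and scatter-plots the clusters as in Fig. 18.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def plot_pca_clusters(features, cluster_labels):
    """Normalize the image features, reduce them to 3 principal components,
    and scatter-plot the four clusters in 3D."""
    X = StandardScaler().fit_transform(features)       # normalize each feature
    X3 = PCA(n_components=3).fit_transform(X)           # project to 3D

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    for c in np.unique(cluster_labels):
        pts = X3[cluster_labels == c]
        ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], label=f"Cluster {c}")
    ax.set_xlabel("PC1"); ax.set_ylabel("PC2"); ax.set_zlabel("PC3")
    ax.legend()
    plt.show()
```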
Apart from the number of images used for training, the batch-size parameter of the CNN also plays a crucial role in training the model. If the segmented vibration signal contains few data points, as in experimental setup 2, a smaller batch of images per epoch is required; conversely, if sufficient data points are available, as in experimental setup 1, a larger batch size can be used. The maximum classification accuracy for experimental setup 1 is achieved using a batch size of sixty-four, whereas for experimental setup 2 a batch size of fifteen is used per epoch.
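In Keras terms this is simply the `batch_size` argument of `fit`; the batch sizes below are the ones reported above, while the variable names and epoch count are illustrative assumptions (reusing the `build_cnn` sketch shown earlier).

```python
# Experimental setup 1: larger dataset, larger batches
model_setup1 = build_cnn()
model_setup1.fit(train_images_1, train_labels_1, batch_size=64, epochs=30)

# Experimental setup 2: sparse data, smaller batches
model_setup2 = build_cnn()
model_setup2.fit(train_images_2, train_labels_2, batch_size=15, epochs=30)
```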
The proposed CNN model is evaluated against existing deep-learning models, namely VGG16, VGG19, AlexNet, and ResNet, which have been pre-trained on extensive image datasets. The results of this comparison are presented in Table 3. The lower training and testing accuracy achieved by these established models can be attributed to their reliance on transfer learning for feature extraction. These deep networks were originally trained on real-world image datasets for object-recognition tasks, and their architectures may not be well suited to processing spectrogram images; applying transfer learning to spectrogram images may therefore not capture the relevant features, leading to diminished accuracy. Additionally, training these pre-trained networks necessitates resizing the input spectrogram images, which can significantly reduce the information content within the images and further hinder their performance. Moreover, the number of images required as input to these pre-trained CNN models is quite high compared with the number of images generated in our case, again resulting in lower accuracy. The comparative analysis graph for all the models is shown in Fig. 19; an illustrative transfer-learning sketch is given after Table 3.
Table 3: Comparison Table of CNNs

| CNN models | Accuracy on dataset 1 (%) | Accuracy on dataset 2 (%) |
|---|---|---|
| Proposed CNN | 98.6 | 93 |
| VGG16 | 74.2 | 61.8 |
| VGG19 | 72.8 | 64.4 |
| AlexNet | 82.4 | 79.2 |
| ResNet | 75.6 | 66.6 |
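The pretrained baselines in Table 3 are typically applied through transfer learning; a hedged sketch of that setup for VGG16 is given below. The 224 × 224 resizing and frozen convolutional base reflect the standard transfer-learning practice referred to above, while the classification head and training settings are illustrative assumptions, not those used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_vgg16_transfer(input_shape=(224, 224, 3)):
    """Transfer-learning baseline: frozen VGG16 convolutional base + small binary head.

    The spectrogram images must be resized from 656x875 down to 224x224,
    which is one source of the information loss discussed above.
    """
    base = VGG16(weights="imagenet", include_top=False, input_shape=input_shape)
    base.trainable = False                      # reuse ImageNet features as-is

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # good teeth vs. broken tooth
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```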