3.1 Analysis of the Improvement Effect of the Wavelength Selection Algorithm
The wavelength points selected by the WBIS algorithm depend closely on the value of k. A balance must be struck: if too few data points are retained, the deep learning model may be unable to learn complex relationships and its prediction capability may suffer. Therefore, k was set to 10, with a window size of 200 and a step size of 50.
Figure 2 shows the full spectra of the three single gases and the positions of the wavelength intervals retained after wavelength selection. The algorithm preserves not only the intervals with the highest absorption intensity and the most prominent absorption peaks but also many other absorption peak positions across the full spectrum. Even where the absorption peak intensity is relatively small, the algorithm retains the intervals that reflect absorption peaks as far as possible. This is advantageous for subsequent quantitative testing in a mixed gas environment.
After wavelength selection, the numbers of wavelength points retained for styrene, p-xylene, and o-xylene are 14505, 13250, and 12063, accounting for 43.71%, 39.92%, and 36.35% of the original full-spectrum data, respectively. The wavelength ranges of the three gases after wavelength selection are shown in Fig. 3. Compared with Fig. 1, the selected wavelength ranges fully retain the absorption peak characteristics of the substances, which is especially obvious for p-xylene. After wavelength selection, the wavelength ranges of the three gases partially overlap because the molecular structures of the three substances are similar, which leads to similar infrared spectra. However, the differing numbers of selected wavelength points show that the ranges are not identical. The selection is not affected by the height of the absorption peaks and discriminates relatively well between the gases.
The datasets before and after wavelength selection were compared on four models: LR, PLS, RR, and SVR. Table 3 reports the root mean square error of the prediction set (RMSEP), the root mean square error of the calibration set (RMSEC), and the overall root mean square error of the model (RMSE).
Table 3
Performance of the Four Models Before and After Wavelength Selection
Data set | Model | RMSEP (Before) | RMSEP (After) | RMSEC (Before) | RMSEC (After) | RMSE (Before) | RMSE (After) |
Styrene | LR | 0.0092 | 0.0092 | 0.0092 | 0.0092 | 0.0092 | 0.0092 |
Styrene | PLS | 0.0092 | 0.0092 | 0.0088 | 0.0088 | 0.0092 | 0.0092 |
Styrene | RR | 0.0092 | 0.0092 | 0.0092 | 0.0092 | 0.0092 | 0.0092 |
Styrene | SVR | 0.0355 | 0.0130 | 0.0355 | 0.0130 | 0.0355 | 0.0130 |
p-Xylene | LR | 0.0618 | 0.0130 | 0.0618 | 0.0130 | 0.0618 | 0.0130 |
p-Xylene | PLS | 0.0618 | 0.0130 | 0.0604 | 0.0119 | 0.0618 | 0.0130 |
p-Xylene | RR | 0.0618 | 0.0130 | 0.0618 | 0.0130 | 0.0618 | 0.0130 |
p-Xylene | SVR | 0.0882 | 0.0601 | 0.0882 | 0.0601 | 0.0882 | 0.0601 |
o-Xylene | LR | 0.0687 | 0.0657 | 0.0687 | 0.0657 | 0.0687 | 0.0657 |
o-Xylene | PLS | 0.0687 | 0.0657 | 0.0638 | 0.0615 | 0.0687 | 0.0657 |
o-Xylene | RR | 0.0687 | 0.0657 | 0.0687 | 0.0657 | 0.0687 | 0.0657 |
o-Xylene | SVR | 0.1289 | 0.1565 | 0.1289 | 0.1565 | 0.1289 | 0.1565 |
It can be seen from Table 3 that the overall performance of the three gases improves gradually across the LR, PLS, and RR models, which is related to the better nonlinear expressiveness of the PLS and RR models themselves. In the SVR model, however, the errors for styrene and p-xylene decrease markedly after wavelength selection, indicating that these two gases are better suited to the SVR model, whereas the error for o-xylene increases rather than decreases, indicating that o-xylene is not suited to the SVR model. In addition, a comparison of the RMSE values shows that the prediction accuracy of the three gases on the same model also differs considerably. This illustrates that the data characteristics of different substances in the infrared spectrum differ greatly and follow no regular pattern. Therefore, wavelength selection is particularly important for multi-component gas prediction.
In terms of prediction-set performance, comparing before and after wavelength selection, the RMSEP of styrene in the LR, PLS, RR, and SVR models decreases by 0.32%, -0.17% (a slight increase), 0.32%, and 63.39%, respectively. The RMSEP of p-xylene in the four models decreases by 79.03%, 80.21%, 79.03%, and 31.86%, respectively. The RMSEP of o-xylene in the LR, PLS, and RR models decreases by 4.36%, 3.62%, and 4.36%, respectively. Apart from o-xylene in the SVR model, where the RMSEP rises from 0.1289 to 0.1565 (Table 3), these results indicate that prediction accuracy is either essentially unchanged or improved after wavelength selection.
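The percentages in this paragraph are relative decreases, that is, (before - after) / before x 100. The following is a minimal Python sketch of that arithmetic, using the SVR rows of Table 3 as input. The function name `relative_decrease` is illustrative and does not come from the paper. Because the table entries are rounded to four decimals, the computed values may differ slightly from the percentages quoted in the text.

```python
def relative_decrease(before: float, after: float) -> float:
    """Percentage decrease from `before` to `after`.

    A negative result means the error increased after wavelength selection.
    """
    if before == 0:
        raise ValueError("before must be non-zero")
    return (before - after) / before * 100.0


# RMSEP (before, after) pairs copied from the SVR rows of Table 3.
svr_rmsep = {
    "styrene": (0.0355, 0.0130),
    "p-xylene": (0.0882, 0.0601),
    "o-xylene": (0.1289, 0.1565),
}

svr_decrease = {
    gas: relative_decrease(before, after)
    for gas, (before, after) in svr_rmsep.items()
}

for gas, pct in svr_decrease.items():
    print(f"{gas}: {pct:+.2f}%")
```

Under this sign convention, styrene and p-xylene give positive values, while o-xylene gives a negative value, reflecting the increase in its SVR error.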
In terms of the fit to the calibration data: for styrene, the RMSEC of the LR, PLS, and RR models does not change significantly after wavelength selection, but it decreases significantly for the SVR model; for p-xylene, the RMSEC decreases significantly for all four models; for o-xylene, the RMSEC decreases to some degree for the LR, PLS, and RR models, which suit it. These findings indicate that the models fit the selected dataset better, as reflected in the lower RMSEC values after wavelength selection.
In terms of overall model performance, a comparison of the RMSE values before and after wavelength selection shows that, except for o-xylene in the SVR model, the RMSE of the three gases is reduced to varying degrees in the four models. The size of the reduction varies with the gas. For a given gas, the RMSE decreases across the different models, and the performance is stable.
Therefore, wavelength selection by the WBIS algorithm does not reduce the prediction accuracy of the model. It also improves the stability and accuracy of the prediction model, which demonstrates the effectiveness of the WBIS algorithm for wavelength selection.
3.2 Analysis of the Improvement in Model Prediction Effect
The fundamental purpose of wavelength selection is to provide better-quality datasets for prediction models. A single gas rarely exists in isolation in real environments. Electrical noise, instrument aging, moisture, temperature, and the interaction between multi-component gas molecules all affect the gas infrared spectrum. The absorption spectra obtained by the same instrument in different external environments also differ, and gas infrared spectra exhibit intricate nonlinear characteristics. Simply put, a model's prediction ability for a single gas cannot guarantee its prediction accuracy under such conditions. Therefore, in recent years, various deep learning models have gradually been introduced into the quantitative and qualitative analysis of infrared spectra [13, 14, 19].
To further evaluate how the datasets before and after WBIS wavelength selection affect model prediction ability, this study goes beyond comparisons on single-gas datasets. The concentration data of the three gases, before and after wavelength selection, were arranged in various combinations to create a total of eight different mixed gas components as input. The output comprises the three gas concentrations. 80% of the dataset was randomly selected as the training set and the remaining 20% as the test set. The number of epochs was set to 30 in all cases. Three different CNN models, constructed as described above, were used for experimental comparison. With k set to 10 in the WBIS algorithm as described earlier, the wavelength points retained after wavelength selection differ in position for each gas. The method of formula (5) was therefore used to retain the 831 wavelength points common to the three gases for subsequent CNN model testing.
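The two data-preparation steps described here, keeping only the wavelength points common to all three gases and splitting the samples 80/20 at random, can be sketched as follows. This is a hedged illustration only: formula (5) is not reproduced in this section, so the sketch assumes it amounts to a set intersection of the per-gas selections. The function names, the toy index ranges, and the fixed seed are all hypothetical.

```python
import random


def common_wavelengths(*selected):
    """Sorted wavelength indices present in every per-gas selection.

    Assumption: formula (5) amounts to a set intersection; the paper's
    exact rule is not reproduced in this section.
    """
    if not selected:
        return []
    common = set(selected[0])
    for indices in selected[1:]:
        common &= set(indices)
    return sorted(common)


def split_80_20(samples, seed=0):
    """Randomly split samples into 80% training and 20% test subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy index sets standing in for the per-gas WBIS selections (hypothetical).
styrene = range(0, 60)
p_xylene = range(20, 80)
o_xylene = range(40, 100)

shared = common_wavelengths(styrene, p_xylene, o_xylene)
train, test = split_80_20(range(50))
print(len(shared), len(train), len(test))
```

With the real selections as input, the length of `shared` would correspond to the 831 common points reported above, provided the intersection assumption holds.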
Table 4
Experimental Predictions of the Three CNN Models Before and After Wavelength Selection
Model | RMSE (Before) | RMSE (After) | MAE (Before) | MAE (After) |
Model 1 | 0.10843 | 0.00055 | 0.08856 | 0.00041 |
Model 2 | 0.01361 | 0.00041 | 0.01100 | 0.00033 |
Model 3 | 0.00175 | 0.00059 | 0.00160 | 0.00053 |
Table 4 lists the prediction results of the three convolutional networks for the three gases before and after wavelength selection. Overall, the RMSE and MAE values are relatively small, indicating that the model predicts each output well.
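For reference, the two metrics in Table 4 can be computed as in the minimal sketch below, which assumes the standard definitions of RMSE and MAE. How the paper aggregates the three concentration outputs is not stated here, so the sketch simply treats all values as one flat list. The sample concentrations are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up concentrations, used only to exercise the functions.
y_true = [0.10, 0.20, 0.30]
y_pred = [0.12, 0.18, 0.33]

rmse_value = rmse(y_true, y_pred)
mae_value = mae(y_true, y_pred)
print(f"RMSE={rmse_value:.5f}  MAE={mae_value:.5f}")
```

RMSE is never smaller than MAE on the same data, which matches every row of Table 4. A large gap between the two would point to a few large errors dominating the result.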
As mentioned earlier, the complexity of the three models increases sequentially. Table 4 shows that, on the full-spectrum data, the RMSE decreases as the complexity of the CNN model rises, and the mean absolute error (MAE) likewise shows a noticeable downward trend. This trend supports the use of the CNN model and indicates that the CNN predicts the dataset effectively, with reasonable error. Comparing the RMSE and MAE before and after wavelength selection, it is observed that although the selected dataset is only 36.35%-43.71% of the original volume, it yields superior performance and improved prediction accuracy.
This highlights the efficacy of wavelength selection. Further analysis reveals that, after wavelength selection, both RMSE and MAE are slightly higher for Model 3 than for Model 2. This suggests that Model 3 may be overfitting the selected dataset and that increased model complexity does not necessarily translate into better performance. Model 2 appears to be more suitable for this dataset. In summary, the dataset obtained by WBIS wavelength selection improves prediction accuracy within the same CNN structure, removing the need for a more complex CNN model while keeping the detection error low.
Table 5
Comparison of Runtime for the Three CNN Models Before and After Wavelength Selection (Unit: Seconds)
Model | Before | After |
Model 1 | 30.03 | 2.28 |
Model 2 | 30.59 | 3.08 |
Model 3 | 39.53 | 3.59 |
Table 5 presents the runtimes of the three CNN models before and after wavelength selection in the same experimental environment (2.5GHz Core i7, 1G memory, 64-bit Win10), excluding the runtime of the wavelength selection algorithm itself. Because the data volume is significantly reduced by the wavelength selection algorithm, the runtime decreases notably, to approximately 7.6%, 10.06%, and 9.08% of the original runtime for Model 1, Model 2, and Model 3, respectively.
Further analysis of the runtime growth rate shows how runtime scales as model complexity doubles (the number of filters per layer for the three models being 32, 64, 128, and 256, respectively). Before wavelength selection, the runtimes of Model 2 and Model 3 are 1.86% and 31.63% higher than that of Model 1, respectively. Although the models run faster after selection, their growth rates are comparatively larger: relative to Model 1, the runtimes of Model 2 and Model 3 increase by 35.09% and 57.46%, respectively. Combined with the previous analysis, it is evident that Model 2 has the most suitable structure. Thus, provided that prediction accuracy is sufficient, a model with a simpler structure is more appropriate for near-infrared spectroscopy analysis.
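The runtime figures discussed here follow from Table 5 by two simple ratios: the post-selection runtime as a share of the original, and the growth of each model's runtime over Model 1. The sketch below reproduces that arithmetic. The function names are illustrative and not from the paper, and the results agree with the quoted percentages only to within rounding.

```python
# Runtimes in seconds, copied from Table 5 as (before, after) pairs.
runtimes = {
    "Model 1": (30.03, 2.28),
    "Model 2": (30.59, 3.08),
    "Model 3": (39.53, 3.59),
}


def share_of_original(before: float, after: float) -> float:
    """Post-selection runtime as a percentage of the pre-selection runtime."""
    return after / before * 100.0


def growth_over_baseline(value: float, baseline: float) -> float:
    """Percentage increase of `value` over `baseline`."""
    return (value - baseline) / baseline * 100.0


share = {m: share_of_original(b, a) for m, (b, a) in runtimes.items()}

base_before, base_after = runtimes["Model 1"]
growth_before = {
    m: growth_over_baseline(b, base_before) for m, (b, _) in runtimes.items()
}
growth_after = {
    m: growth_over_baseline(a, base_after) for m, (_, a) in runtimes.items()
}

for model in runtimes:
    print(
        f"{model}: share={share[model]:.2f}%  "
        f"growth before={growth_before[model]:.2f}%  "
        f"growth after={growth_after[model]:.2f}%"
    )
```

The relative growth is larger after selection because the absolute runtimes are small, so a difference of about one second between models is a large fraction of the Model 1 baseline.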
Figure 4 uses inverted coordinates to display the mean squared error (MSE) values of the three gases before and after wavelength selection across the three models. It is evident that performance on the full-spectrum dataset depends heavily on the model: as the model complexity increases, the prediction accuracy is significantly enhanced. For instance, the MSE for styrene drops from a maximum of 0.016248 to a minimum of 2.17E-06. The primary reason is the large size of the full-spectrum dataset (265,480), which necessitates a more complex network structure to achieve better results. However, this also entails higher hardware requirements and reduced environmental adaptability.
After wavelength selection, the three gases are predicted accurately even by the simplest model. For example, the Model 1 MSE for styrene is 5.17E-08. As the complexity of the model increases, the change in error narrows significantly, indicating that greater model complexity contributes little to prediction accuracy. The dataset after wavelength selection yields very promising prediction results on simpler models, which enhances the adaptability of the model and helps reduce actual equipment costs.
The prediction accuracy of the three gases varies under the same model. Notably, the prediction accuracy of p-xylene is the highest both before and after wavelength selection, significantly outperforming styrene and o-xylene. A possible explanation is that the p-xylene dataset is better suited to the CNN network structure. Conversely, the error for o-xylene is the largest both before and after selection, and model complexity has little effect on it. This suggests that the o-xylene dataset is less suited to the CNN network structure than those of the other two gases.
Before wavelength selection, the prediction performance for styrene continues to improve as model complexity increases. Although the overall prediction effect after wavelength selection is better than before, the prediction performance still improves with increasing model complexity. This indicates that the behavior of deep learning models on infrared spectrum datasets differs from that in more common research fields. The model's expressiveness is greatly influenced by the molecular properties of the material, and materials with different structures exhibit different properties. When designing the model, the prediction effect cannot be improved simply by increasing model complexity; material characteristics and complex external environments must also be considered.