Wavelength selection algorithm for near-infrared spectra of volatile organic gases based on wave-cluster interval

doi:10.21203/rs.3.rs-4027386/v1

Article

https://doi.org/10.21203/rs.3.rs-4027386/v1

This work is licensed under a CC BY 4.0 License

Version 1

A novel wavelength selection algorithm based on wave-cluster interval selection (WBIS) is presented for near-infrared spectroscopic detection of volatile organic gases. The algorithm works in a serial mode: characteristic wavelength points are first classified into clusters, and absorption peak intervals are then screened. Clustering is performed first to preserve significant absorption peak features while avoiding mechanical division of the spectrum or random, uncertain changes of points. Subsequently, an improved moving window method is devised, and a greedy algorithm is employed to re-screen wavelength points within the same cluster. This process ensures the retention of the optimal wavelength range, which is crucial for representing spectral characteristics and facilitating subsequent model predictions. Experimental validation was conducted using near-infrared spectral data of styrene, p-xylene, and o-xylene, employing four models: Least Squares (LS), Partial Least Squares (PLS), Ridge Regression (RR), and Support Vector Machine (SVM). The results demonstrate that, while maintaining model accuracy, the dataset can be reduced to 36.35%-43.71% of its original size. Additionally, using a dataset of three gases (two concentrations each) and all eight of their mixed-gas combinations, comparative experiments were conducted on three different CNN structures. The effectiveness of the proposed algorithm in reducing machine learning model complexity while ensuring prediction accuracy was validated through experimental comparisons before and after wavelength selection, with the CNN prediction models running in roughly one tenth of the original time after wavelength selection (a runtime reduction of about 90%).

wavelength selection algorithm

near-infrared spectra

wave-cluster Interval

volatile organic gases

Near-Infrared Spectroscopy (NIR) detection technology finds widespread applications across various fields, including environmental testing, chemical engineering, and food testing, owing to its high efficiency, cost-effectiveness, non-destructiveness, and lack of consumables [1–5]. Volatile Organic Compounds (VOCs) are omnipresent in both production and daily life, posing significant risks to human health. Consequently, there is a growing demand for affordable and portable detection equipment capable of identifying and quantifying these compounds. The continuous advancements in spectral hardware technology have led to near-infrared spectra containing tens of thousands of wavelength variables, often characterized by multiple components, complex interference factors, and electrical and environmental noise. This complexity presents significant challenges for accurate spectral quantitative prediction. Consequently, recent research efforts in the field of infrared spectroscopy have focused on addressing two key areas: the selection of spectral features and the enhancement of high-precision quantitative prediction.

In the development and refinement of quantitative models, linear regression models have been extensively researched and widely used as the foundation, such as Partial Least Squares (PLS) [6, 7, 8], Ridge Regression (RR) [9, 10], and Principal Component Regression (PCR) [11, 12]. However, with the advent of high-resolution spectra, the complexity of spectral information has increased, showcasing more nonlinear characteristics. Linear regression models often exhibit limited capacity to explain such complexity and may struggle to adapt to dynamic changes in sample components and environmental conditions. Consequently, in recent years, there has been a notable shift towards integrating deep learning into infrared spectrum analysis, emerging as a key research direction. Various standalone and combined models have been proposed and implemented, including Convolutional Neural Networks (CNN) [13–15], Adaptive Boosting (AdaBoost) [16, 17], and Radial Basis Function Neural Networks (RBF) [18, 19]. However, challenges persist. Unlike traditional deep learning domains such as images, near-infrared spectroscopy possesses distinctive attributes. These include nonlinearity, high dimensionality, and unique spectral structures. Notably, even across different concentrations of the same gas, spectral characteristics can vary significantly. Each full spectrum typically encompasses tens of thousands of wavelength variables. Additionally, the computational requirements and hardware costs associated with complex deep learning models often mean that the cost of the modified equipment does not meet actual needs. Therefore, a deep learning model better tailored to spectral characteristics is needed.

Under the assumption that selection can enhance model prediction, the process of identifying wavelength variables within a specific range or the entire spectrum range of near-infrared characteristics to improve model performance is termed wavelength selection. Presently, based on different selection strategies, wavelength selection algorithms are broadly categorized into two types: those based on regression models, such as Forward Interval Partial Least Squares (FiPLS), and those rooted in data feature analysis, such as Random Frog Jumping (RFR)[20], Monte Carlo Uninformative Variable Elimination (MC-UVE)[21–23], and Whale Optimization Algorithm (WOA) [24]. Studies have demonstrated the effectiveness of combining various wavelength selection algorithms in specific fields [25, 26]. Overall, wavelength selection algorithms play a crucial role in eliminating spectral signal noise and enhancing model prediction capabilities [10]. Moreover, they offer an effective means to improve performance without resorting solely to increasing neural network complexity.

In contrast to the fields of large data models such as graphics and natural language commonly used in deep learning, the essence of near-infrared spectroscopy analysis lies in establishing a robust model characterized by accuracy and environmental adaptability. This article proposes a novel wavelength selection algorithm that integrates both wavelength points and intervals, capitalizing on the fact that absorption peaks at specific wavelength points reflect material characteristics and often exhibit certain continuous intervals. Initially, a new cluster aggregation algorithm is developed with wavelength points as the central focus. Subsequently, an enhanced moving window mechanism is introduced to identify reasonable wavelength intervals, with the selected intervals serving as the result for wavelength selection. Following this, multiple convolutional neural network models with distinct structures are constructed, and experiments are conducted on both the full spectrum and post-wavelength selection spectra. The outcome is the establishment of a more accurate prediction model with reduced data volume and enhanced adaptability, thereby fulfilling the objective of identifying effective wavelength variables and mitigating the computational complexity of deep learning models.

1.1 Infrared Spectrum Characteristic Analysis and Algorithm Proposal

The absorption of infrared radiation by molecules results in the production of an infrared absorption spectrum, driven by the vibration and rotation of molecules. However, due to the varied molecular structures, each gas typically involves multiple vibration and rotation modes, making it challenging to analyze their spectra using a single theoretical model.

Figure 1 illustrates the absorption spectra of different concentrations of toluene (C7H8), styrene (C8H8), p-xylene (C8H10), and o-xylene (C8H10), obtained using SpectraGryph 1.2 spectral software. It is noteworthy that p-xylene and o-xylene are isomers of dimethylbenzene, sharing the chemical formula C8H10; they differ only in the positions of their methyl groups, whereas toluene carries a single methyl group and styrene a vinyl group on the benzene ring. These structural variations result in distinct physical and chemical properties, as well as markedly different infrared spectra.

The wavelength position within the absorption spectrum is closely linked to the properties of the material. While there is no specific regularity in the performance of each substance across the full spectrum, every absorption peak, regardless of its position, can reflect the substance's characteristics and exhibit a certain interval. In other words, isolated absorption peaks are non-existent. This paper introduces the WBIS (Wave-cluster-based Interval Selection) wavelength selection algorithm. Initially, the algorithm employs a clustering-like approach to identify the approximate locations of absorption peaks that represent the material's characteristics. During this phase, pertinent absorption peak information is retained within each cluster. Subsequently, the algorithm scrutinizes the wavelength points within each cluster, utilizing a greedy algorithm that employs a moving window method. This method aims to identify clusters of wavelength points and preserve the window containing the most significant fluctuations among several wavelength points. This window, which best encapsulates the absorption peak characteristics, is selected as the outcome of wavelength selection and is subsequently fed into the ensuing model prediction process.

1.2 Wavelength selection algorithm model design

To analyze the positions of the absorption peaks across the full spectrum of each gas, consider the full absorption spectrum of a given gas at a given concentration, with spectral absorption intensity data set $P=\left\{p_{1},p_{2},\cdots ,p_{n}\right\}$. Let $C=\left\{c_{1},c_{2},\cdots ,c_{K}\right\}$ be the absorption peak centers. The data set is divided into $K$ clusters, and the objective function is set as:

$$D\left(C,S\right)=\sum_{k=1}^{K}\sum_{i=1}^{n}s_{ik}\,\left\|p_{i}-c_{k}\right\|^{2}$$

Here, $S=\left\{s_{ik}\right\}$ is a binary assignment matrix linking the absorption intensity at each wavelength point to the clusters: $s_{ik}=1$ indicates that the data point $p_{i}$ belongs to cluster $k$, while $s_{ik}=0$ indicates that $p_{i}$ does not belong to cluster $k$. The total number of absorption intensities in the full spectrum is denoted by $n$. Based on the principle of minimizing the objective function $D$, the cluster allocation relationship $R$ between $C$ and $P$ is found through continuous iteration. The convergence condition is set as follows:

$$\forall j,\underset{i\to \infty }{\text{lim}}{‖{C}_{i+1}^{\left(j\right)}-{C}_{i}^{\left(j\right)}‖}^{n}\le \epsilon$$

where ${C}_{i+1}^{\left(j\right)}$ represents the $j$-th cluster center after the $i$-th iteration is completed, and $\epsilon$ is the set tolerance. This approach not only refrains from mechanically dividing the full spectrum into several intervals but also prevents the isolation of absorption intensities. Instead, it identifies the position interval that is most conducive to analyzing the absorption peak in the total gas absorption spectrum.
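The iteration described above can be sketched in a few lines of NumPy. This is a minimal sketch that reads the objective function as a standard k-means (Lloyd) iteration; the source only calls the approach "clustering-like" and does not state the initialization or exactly which quantity is clustered, so the function name `cluster_intensities`, the random initialization, and the `max_iter` cap are assumptions made for illustration.

```python
import numpy as np

def cluster_intensities(p, k, tol=1e-6, max_iter=300, seed=0):
    """Lloyd-style iteration minimising sum_k sum_i s_ik * ||p_i - c_k||^2."""
    p = np.asarray(p, dtype=float).reshape(len(p), -1)   # shape (n, features)
    rng = np.random.default_rng(seed)
    # Assumption: initial centers are k distinct data points chosen at random.
    centers = p[rng.choice(len(p), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: s_ik = 1 for the nearest center, 0 otherwise.
        dist = ((p[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Update step: each center becomes the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            p[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centers - centers, axis=1)
        centers = new_centers
        if np.all(shift <= tol):   # every center moved by at most the tolerance
            break
    return labels, centers
```

Stopping when every center moves by no more than the tolerance mirrors the convergence condition, with `tol` playing the role of $\epsilon$.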

For the wavelength point clusters obtained above, $CW=\left\{cw_{1},cw_{2},\cdots ,cw_{k}\right\}$, let $\left[a,b\right]$ denote the starting interval of $cw_{j}$. A moving window mechanism is employed to further analyze the absorption intensities in $cw_{j}$: $w_{j}$ represents a moving window within $cw_{j}$, and the adjacent degree value $d_{j}$ between peak intensities is calculated as follows:

$$d_{j}=\max_{j}\left|\max_{\left(x_{p},y_{p}\right),\left(x_{q},y_{q}\right)\in w_{j}}\left(\frac{x_{p}x_{q}+y_{p}y_{q}}{\sqrt{x_{p}^{2}+y_{p}^{2}}\,\sqrt{x_{q}^{2}+y_{q}^{2}}}\right)\right|$$

The set $D=\left\{d_{1},d_{2},\cdots ,d_{k}\right\}$ is designated as the outcome of wavelength selection and is used for subsequent quantitative or qualitative analysis.
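The window screening above can be illustrated with a small sketch. It reads the formula literally: each point in a window is treated as a pair $(x, y)$, assumed here to be (wavelength, intensity), the score of a window is the largest absolute cosine similarity between two distinct points, and the greedy step keeps the window with the highest score. The function names, the exclusion of the trivial pair $p=q$, and the handling of zero-length vectors are assumptions; the source does not specify how the two axes are scaled or how ties are resolved.

```python
import numpy as np

def adjacency_degree(x, y):
    """Largest |cosine| between two distinct points (x_p, y_p) and (x_q, y_q)."""
    pts = np.column_stack([x, y]).astype(float)
    norms = np.linalg.norm(pts, axis=1)
    norms = np.where(norms == 0, np.inf, norms)   # zero vectors contribute 0
    cos = (pts @ pts.T) / np.outer(norms, norms)
    np.fill_diagonal(cos, 0.0)   # assumption: exclude the trivial pair p == q
    return float(np.abs(cos).max())

def best_window(x, y, window=200, step=50):
    """Greedy scan of one cluster: keep the window with the largest score."""
    best_start, best_score = 0, -np.inf
    last_start = max(len(x) - window, 0)
    for start in range(0, last_start + 1, step):
        score = adjacency_degree(x[start:start + window], y[start:start + window])
        if score > best_score:   # strict '>' keeps the earliest window on ties
            best_start, best_score = start, score
    return best_start, min(best_start + window, len(x)), best_score
```

Calling `best_window` once per cluster and collecting the returned index ranges would give one retained interval per cluster under these assumptions.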

1.3 Wavelength Selection Algorithm Verification Model Design

To evaluate the effectiveness of the WBIS wavelength selection algorithm, four models were devised: Least Squares linear regression (LS, listed as LR in the result tables), Partial Least Squares (PLS), Ridge Regression (RR), and Support Vector Machine regression (SVM, listed as SVR in the result tables). Data normalization preprocessing was omitted for each model. The essential information concerning each model, based on spectral characteristics, is outlined below:

The PLS model allows a maximum of 100 model iterations with a convergence tolerance of 1e-06.

For the RR model, the regularization strength is set to 1, with a convergence tolerance of 1e-3. The calculation of the intercept term is permitted, and the Cholesky decomposition method is utilized as the solution method. Random seeds are not utilized due to the non-continuous nature of the selected wavelength interval.

In the SVR model, the Gaussian radial basis function is employed as the kernel function. Coefficients for the "poly" and "sigmoid" kernel functions are specified. The model stopping criterion tolerance is set to 1e-3, with a model error tolerance range of 0.1.
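As an illustration of the RR settings listed above (regularization strength 1, intercept enabled, Cholesky solution), the following NumPy sketch solves the ridge normal equations by Cholesky factorization. It is a stand-in written for this explanation, not the authors' implementation: the source does not name the library used, and the function name `ridge_cholesky` and the centering trick for the intercept are assumptions.

```python
import numpy as np

def ridge_cholesky(X, y, alpha=1.0):
    """Ridge fit with an unpenalised intercept, solved via Cholesky factorisation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean               # centring handles the intercept
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])    # symmetric positive definite for alpha > 0
    L = np.linalg.cholesky(A)                     # A = L @ L.T
    z = np.linalg.solve(L, Xc.T @ yc)             # solve L z = Xc^T yc
    coef = np.linalg.solve(L.T, z)                # solve L^T w = z
    intercept = y_mean - x_mean @ coef
    return coef, intercept
```

With `alpha=1.0` the coefficients shrink toward zero relative to ordinary least squares; setting `alpha=0.0` recovers the least-squares fit whenever the centered design matrix has full column rank.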

1.4 Design of Convolution Models with Different Structures

To evaluate the impact of wavelength selection on the prediction accuracy of deep learning models, three convolutional neural network (CNN) models with distinct structures were developed. A standard CNN model comprises an input layer, convolutional layers, pooling layers, a fully connected layer, and an output layer. Deeper and more complex convolutional models generally excel in capturing features in the data and exhibit higher prediction capabilities. However, in near-infrared spectrum prediction, increased complexity translates to higher hardware cost requirements and reduced environmental adaptability of the instrument. Hence, an effective wavelength selection algorithm should reduce model complexity while maintaining prediction accuracy.

To ensure an objective comparison of model prediction results, all convolutional models utilize the ReLU activation function, Adam optimizer, and mean square error as the loss function. Additionally, each model includes fully connected and output layers. However, to directly gauge the complexity of the convolutional model, differences exist in the number of convolutional layers, the quantity and size of filters, and the number of pooling layers. The structural comparison of the three CNN models is illustrated in Table 1.

Table 1

Comparison of Three Different CNN Architectures
	Model 1	Model 2	Model 3
Number of convolutional layers	1	2	4
Number of filters (per layer)	32	32,64	32,64,128,256
Filter size (per layer)	3	3,3	3,3,3,3
Number of max-pooling layers	1	2	4
Number of fully connected layers	1	1	2
Number of output layers	1	1	1
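To give a feel for how quickly complexity grows across the three structures in Table 1, the sketch below counts the trainable parameters of the convolutional layers only. It assumes one-dimensional convolutions with a bias term and a single input channel (one intensity value per wavelength point); none of these details are stated in the source, and the fully connected layers are left out because their sizes are not given.

```python
def conv1d_params(filters, kernel_sizes, in_channels=1):
    """Weights plus biases of a stack of 1-D convolution layers."""
    total = 0
    for out_channels, kernel in zip(filters, kernel_sizes):
        total += (kernel * in_channels + 1) * out_channels   # +1 is the bias
        in_channels = out_channels
    return total

# Filter counts and sizes copied from Table 1.
table1 = {
    "Model 1": ([32], [3]),
    "Model 2": ([32, 64], [3, 3]),
    "Model 3": ([32, 64, 128, 256], [3, 3, 3, 3]),
}
for name, (filters, kernels) in table1.items():
    print(name, conv1d_params(filters, kernels))
```

Under these assumptions the convolutional stacks hold 128, 6,336, and 129,600 parameters respectively, so Model 3 is roughly a thousand times larger than Model 1 before any dense layers are counted.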

2 Experimental Part

2.1 Experimental Data

The spectral database provided by the US Environmental Protection Agency (EPA) served as the data source for this study. Spectra of three gases were used: styrene, p-xylene, and o-xylene. The spectra were measured at a temperature of 25 °C with an optical path length of 3 m, over a wavenumber range from 400 cm$^{-1}$ to 4000 cm$^{-1}$ at a spectral resolution of 0.12 cm$^{-1}$. These three substances exhibit similar molecular structures and strongly overlapping absorption peaks, and are commonly found in materials such as coatings, paints, and rubbers. Therefore, the simulated experiment is considered typical and holds practical application value. Table 2 presents the experimental sample data.

Table 2

Gas and Concentration of Sample Dataset (Unit: ppm)
O-xylene	P-xylene	Styrene
499	100	99
99	502	500

The full spectrum of each gas at each concentration contains 33,185 data points, recorded to 4 decimal places. With three gases at two concentrations each (six spectra), the dataset totals 199,110 entries. For the wavelength selection stage, one spectrum per gas (styrene at 99 ppm, p-xylene at 100 ppm, and o-xylene at 499 ppm) was used independently to complete the wavelength selection and analyze its effect.

To assess the impact of the wavelength selection algorithm on the prediction performance of different CNN models, the input datasets for the three CNN models in section 1.4 were not single gases but mixtures simulated according to Lambert-Beer's law without considering gas interactions. These datasets comprised multi-component mixed gases, with each of the three gases listed in Table 2 having 2 concentrations, resulting in a total of 8 combinations ($2^{3}$). This yielded 265,480 sample data points (8 × 33,185) under full spectrum conditions. Clearly, using the full spectrum as input data would severely limit the general performance and adaptability of the model.
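The construction of the eight mixtures can be sketched as follows. Under Lambert-Beer's law without interactions, the absorbance of a mixture is the sum of the single-gas absorbances on the same wavelength grid. The concentrations come from Table 2; the function name `mix_spectra`, the dictionary layout, and the random placeholder spectra in the usage lines are assumptions made purely for illustration and are not the EPA data.

```python
from itertools import product
import numpy as np

# Concentrations (ppm) from Table 2.
levels = {"o-xylene": (499, 99), "p-xylene": (100, 502), "styrene": (99, 500)}

def mix_spectra(single):
    """single[(gas, ppm)] -> absorbance array; returns {combination: mixture}."""
    gases = list(levels)
    mixtures = {}
    for combo in product(*(levels[g] for g in gases)):   # 2 * 2 * 2 = 8 combinations
        # Additivity: absorbances of non-interacting gases simply add up.
        mixtures[combo] = sum(single[(g, c)] for g, c in zip(gases, combo))
    return mixtures

# Usage with placeholder spectra (random numbers, NOT measured data).
n_points = 33185
rng = np.random.default_rng(0)
single = {(g, c): rng.random(n_points) for g, cs in levels.items() for c in cs}
mixtures = mix_spectra(single)
print(len(mixtures), sum(len(v) for v in mixtures.values()))
```

The usage lines reproduce the counts quoted above: 8 mixtures and 8 × 33,185 = 265,480 data points.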

2.2 Experimental process and verification methods

All experiments were conducted using PyCharm 2020.3.2.

Initially, wavelength screening was conducted on the full spectrum of each of the three gases, and the datasets before and after screening were used for prediction with the four models: LS, PLS, RR, and SVM. To evaluate comparison and generalization performance more accurately, the four models were assessed by cross-validation using the following measure:

$$\mathrm{CV}\left(D,K\right)=\frac{1}{K}\sum_{k=1}^{K}\mathrm{RMSE}_{k}$$

Here, $D$ is the dataset, $K$ is the number of folds, and $\mathrm{RMSE}_{k}$ represents the root mean square error of the $k$-th cross-validation fold. Both the full-spectrum dataset and the wavelength-selected dataset are divided into 5 mutually exclusive subsets, and 5-fold cross-validation is then conducted. During each training iteration, 4 subsets are used as the training set, while the remaining subset serves as the test set. This process is repeated 5 times, and the model performance indicators are calculated from the results of these 5 cross-validations.
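The cross-validation measure above can be sketched with plain NumPy. The splitting into 5 mutually exclusive subsets and the averaging of per-fold RMSE values follow the text; the shuffling seed, the function names, and the use of ordinary least squares as the stand-in model are assumptions, and any of the four models could be passed in as `fit_predict`.

```python
import numpy as np

def kfold_rmse(X, y, fit_predict, k=5, seed=0):
    """Mean of the per-fold test RMSEs over k mutually exclusive subsets."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    rmses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = fit_predict(X[train], y[train], X[test])
        rmses.append(np.sqrt(np.mean((pred - y[test]) ** 2)))
    return float(np.mean(rmses))

def least_squares(X_train, y_train, X_test):
    """Ordinary least squares with an intercept column (stand-in model)."""
    A = np.column_stack([X_train, np.ones(len(X_train))])
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([X_test, np.ones(len(X_test))]) @ w
```

Running `kfold_rmse` once on the full-spectrum data and once on the wavelength-selected data, with the same model and seed, gives the before/after comparison described in the text.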

Wavelength selection is performed on each single gas separately, so the wavelength variables retained after selection differ from gas to gas. Therefore, when generating a mixed gas data set according to Lambert-Beer's law, the following procedure is used to retain only the wavelength points common to all gases:

Let $D_{1},D_{2},\cdots ,D_{n}$ be the independent data sets obtained by wavelength selection on each single-gas full spectrum, with $D_{i}=\left\{\langle \lambda_{j},I_{j}\rangle ,\ j=1,2,\cdots ,m\right\}$ and $D_{i}\subseteq C_{i}$, where $C_{i}$ is the corresponding full-spectrum data set of the substance. Construct a new data set $S=\left\{S_{1},S_{2},\cdots ,S_{n}\right\}$, where each element $S_{k}$ satisfies the following:

$$S_{k}=\left\{\left(\lambda_{k},I_{1,k},I_{2,k},I_{3,k},\cdots ,I_{n,k}\right),\ \lambda_{k}\in D_{1}\cap D_{2}\cap \cdots \cap D_{n}\right\}$$

Complete the full spectrum data set of the aforementioned 8 mixed gases and the data set after wavelength selection, and perform randomization.

Complete the prediction verification of three CNN models before and after wavelength selection.
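The intersection rule for $S_{k}$ in the procedure above can be sketched in a few lines. Each selected data set is represented here as a dictionary from wavelength to intensity; that layout and the function name `common_wavelengths` are assumptions for illustration. The sketch also assumes that all gases share exactly the same wavelength grid, so that equal wavelengths compare equal.

```python
from functools import reduce

def common_wavelengths(selected):
    """selected: list of {wavelength: intensity} dicts, one per gas (D_1 .. D_n)."""
    shared = reduce(set.intersection, (set(d) for d in selected))
    # One row per shared wavelength: (lambda_k, I_1k, I_2k, ..., I_nk).
    return [(lam, *(d[lam] for d in selected)) for lam in sorted(shared)]

# Toy usage with made-up numbers: only 1001 and 1002 occur in all three sets.
d1 = {1000: 0.1, 1001: 0.2, 1002: 0.3}
d2 = {1001: 0.5, 1002: 0.6, 1003: 0.7}
d3 = {1002: 0.9, 1001: 0.8}
rows = common_wavelengths([d1, d2, d3])
```

Only wavelengths retained for every gas survive, which is why the mixed-gas input can be much smaller than any single-gas selection.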

3.1 Analysis of the Improvement Effect of the Wavelength Selection Algorithm

The wavelength points selected by the WBIS algorithm depend strongly on the value of k. A balance must be struck, because too little data may hinder the learning and prediction capabilities of a complex deep learning model. Therefore, k was set to 10, with a window size of 200 and a step size of 50.

Figure 2 illustrates the full spectrum of the three single gases and the positions of the wavelength intervals retained after wavelength selection. Notably, the wavelength selection algorithm demonstrates adeptness in not only preserving single intervals with the highest absorption intensity and most prominent absorption peaks but also conserving numerous absorption peak positions across the full spectrum segment. Even for intervals exhibiting relatively diminutive absorption peak intensity values, the algorithm endeavors to retain intervals reflective of absorption peaks to the fullest extent possible. This meticulous selection process proves highly advantageous for subsequent quantitative testing in a mixed gas environment.

After wavelength selection, the numbers of wavelength points retained for styrene, p-xylene, and o-xylene are 14505, 13250, and 12063 respectively, accounting for 43.71%, 39.92%, and 36.35% of the original full spectrum data. The wavelength ranges of the three gases after wavelength selection are shown in Fig. 3. Compared with Fig. 1, it is easy to see that the selected wavelength ranges fully retain the absorption peak characteristics of the substances, which is especially obvious for p-xylene. After wavelength selection, the wavelength ranges of the three gases partially overlap. The reason is that the molecular structures of the three substances are similar, which leads to a certain similarity in their infrared spectra. However, the differing numbers of selected wavelength points show that the wavelength ranges are not the same. The selection is not affected by the height of the absorption peak and has relatively good discrimination.

On the four models of LR, PLS, RR, and SVR, the comparison results are shown in Table 3 with the root mean square error of the prediction set (RMSEP), the root mean square error of the calibration set (RMSEC), and the root mean square error of the model (RMSE).

Table 3

Performance of the Four Models Before and After Wavelength Selection
Data set	Model	RMSEP (before)	RMSEP (after)	RMSEC (before)	RMSEC (after)	RMSE (before)	RMSE (after)
Styrene	LR	0.0092	0.0092	0.0092	0.0092	0.0092	0.0092
	PLS	0.0092	0.0092	0.0088	0.0088	0.0092	0.0092
	RR	0.0092	0.0092	0.0092	0.0092	0.0092	0.0092
	SVR	0.0355	0.0130	0.0355	0.0130	0.0355	0.0130
p-Xylenes	LR	0.0618	0.0130	0.0618	0.0130	0.0618	0.0130
	PLS	0.0618	0.0130	0.0604	0.0119	0.0618	0.0130
	RR	0.0618	0.0130	0.0618	0.0130	0.0618	0.0130
	SVR	0.0882	0.0601	0.0882	0.0601	0.0882	0.0601
o-Xylenes	LR	0.0687	0.0657	0.0687	0.0657	0.0687	0.0657
	PLS	0.0687	0.0657	0.0638	0.0615	0.0687	0.0657
	RR	0.0687	0.0657	0.0687	0.0657	0.0687	0.0657
	SVR	0.1289	0.1565	0.1289	0.1565	0.1289	0.1565

It can be seen from Table 3 that the overall performance of the three gases in the LR, PLS, and RR models is gradually better, which is related to the better nonlinear expressiveness of the PLS and RR models themselves. In the SVR model, however, the RMSE of styrene and p-xylene decreases markedly after wavelength selection, indicating that these two gases are more suitable for the SVR model, while the RMSE of o-xylene does not decrease but increases (from 0.1289 to 0.1565), indicating that o-xylene is not suitable for the SVR model. In addition, comparing the RMSE of the three gases shows that their prediction accuracy on the same model also differs considerably. This illustrates that the data characteristics of different substances in the infrared spectrum are very different and follow no regular pattern. Therefore, wavelength selection is particularly important for multi-component gas prediction.

Judging from the performance of the prediction set: comparing before and after wavelength selection, the RMSEP decreases of styrene in the LR, PLS, RR, and SVR models are 0.32%, -0.17%, 0.32%, and 63.39% respectively. The RMSEP decreases of p-xylene in the four models are 79.03%, 80.21%, 79.03%, and 31.86% respectively. The RMSEP decreases of o-xylene in the LR, PLS, and RR models are 4.36%, 3.62%, and 4.36% respectively. Apart from o-xylene on the SVR model, where the RMSEP rises from 0.1289 to 0.1565, these results indicate that the model prediction accuracy remains unaffected or is improved after wavelength selection.
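The percentages above are relative decreases, computed as (before − after) / before. The short sketch below applies that formula to a few RMSEP pairs copied from Table 3; the function name is hypothetical. Because Table 3 is rounded to four decimals, the results can differ slightly from the quoted figures (for p-xylene on LR the rounded values give about 78.96% rather than 79.03%), presumably because the quoted figures were computed from unrounded errors.

```python
def pct_decrease(before, after):
    """Relative decrease in percent; a negative value means the error increased."""
    return 100.0 * (before - after) / before

rmsep = {  # (before, after) pairs copied from Table 3
    ("styrene", "SVR"): (0.0355, 0.0130),
    ("p-xylene", "LR"): (0.0618, 0.0130),
    ("p-xylene", "SVR"): (0.0882, 0.0601),
    ("o-xylene", "LR"): (0.0687, 0.0657),
    ("o-xylene", "SVR"): (0.1289, 0.1565),
}
for key, (before, after) in rmsep.items():
    print(key, round(pct_decrease(before, after), 2))
```

The o-xylene SVR entry comes out negative (an increase of roughly 21%), which is the one case in Table 3 where wavelength selection made the prediction error worse.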

Judging from the degree of fit on the calibration data: for styrene, the RMSEC on the LR, PLS, and RR models does not change significantly before and after wavelength selection, but it decreases significantly on the SVR model; for p-xylene, the RMSEC decreases significantly on all four models; for o-xylene, the RMSEC decreases to some degree on the LR, PLS, and RR models, which are suitable for it. These findings indicate that the models fit the selected dataset better, as shown by the lower RMSEC values after wavelength selection.

From the perspective of overall model performance: except for o-xylene on the SVR model, comparing the RMSE of the three gases on the four models before and after wavelength selection shows that the RMSEs are reduced to varying degrees. The degree of reduction varies with the gas type, while for a given gas the reduction is consistent across the different models and its performance is stable.

Therefore, the data set obtained by WBIS wavelength selection does not reduce the prediction accuracy of the model and also benefits the stability and accuracy of the prediction model, which demonstrates the effectiveness of the WBIS algorithm in wavelength selection.

3.2 Analysis of model prediction effect improvement

The fundamental purpose of wavelength selection is to provide better quality data sets for prediction models. It is basically impossible for a single gas to exist in real environments. Electrical noise, instrument aging, moisture, temperature, and the interaction between multi-component gas molecules will all affect the gas infrared spectrum. The absorption spectra obtained by the same instrument in different external environments are also different. Gas infrared spectra exhibit intricate nonlinear characteristics. Simply put, the model's prediction ability for a single gas cannot guarantee the prediction accuracy. Therefore, in recent years, various deep learning models have been gradually introduced into the quantitative and qualitative analysis of infrared spectra [13, 14, 19].

To further evaluate the impact of datasets before and after WBIS wavelength selection on model prediction ability, this study extends beyond simple comparisons of single gas datasets. Instead, the three gases, each at two concentrations, are combined in all possible ways, before and after wavelength selection, to create a total of eight different mixed gas compositions for input. The output comprises the three gas concentrations. 80% of the data set was randomly selected as the training set, and the remaining 20% served as the test set. The number of epochs was set to 30 in all cases. Three different CNN models were constructed as described above for experimental comparison. With k set to 10 in the WBIS algorithm as described earlier, the wavelength points retained after selection differ in position for each gas. The intersection method of formula (5) was therefore used to retain the 831 wavelength points common to the three gases for subsequent CNN model testing.

Table 4

Experimental Predictions in Three CNN Models before and after Wavelength Selection
Model	RMSE (before)	RMSE (after)	MAE (before)	MAE (after)
Model 1	0.10843	0.00055	0.08856	0.00041
Model 2	0.01361	0.00041	0.01100	0.00033
Model 3	0.00175	0.00059	0.00160	0.00053

Table 4 shows the prediction results of the three convolutional networks for the three gases before and after wavelength selection. Overall, the RMSE and MAE values are relatively small, indicating that the model predicts each output well.

As mentioned earlier, the complexity of the three models increases sequentially. From Table 4, it is evident that, before wavelength selection, the RMSE gradually decreases as the complexity of the CNN model rises, and the mean absolute error (MAE) exhibits the same noticeable downward trend. This trend underscores the rationality of employing the CNN model and indicates that CNN yields a sufficiently effective prediction for the dataset, with reasonable error accuracy. Comparing the RMSE and MAE before and after wavelength selection, it is observed that although the single-gas datasets retain only 36.35%-43.71% of the original volume after selection, the wavelength-selected dataset demonstrates superior performance and improved prediction accuracy.

This highlights the efficacy of wavelength selection. Further analysis reveals that, after wavelength selection, both RMSE and MAE are slightly higher in Model 3 than in Model 2. This suggests that Model 3 may be overfitting the selected dataset, indicating that increased model complexity does not necessarily translate to better performance. Model 2 appears to be more suitable for this dataset. In summary, the dataset obtained by WBIS wavelength selection enhances the prediction accuracy of the model within the same CNN structure, obviating the need for a more complex CNN model while keeping the detection error low.

Table 5

Comparison of Runtime Growth for 3 CNN Models Before and After Waveform Selection(Unit: Seconds)
Waveform Selection	Before	After
Model 1	30.03	2.28
Model 2	30.59	3.08
Model 3	39.53	3.59

Table 5 presents the runtimes of the three CNN models before and after wavelength selection within the same experimental system environment (2.5GHz Core i7, 1G memory, 64-bit Win10), excluding the runtime of the wavelength selection algorithm itself. Primarily because of the significantly reduced data volume after applying the wavelength selection algorithm, the runtime decreases notably, to approximately 7.6%, 10.06%, and 9.08% of the original runtime for Model 1, Model 2, and Model 3, respectively.

Further analysis of the runtime growth rate reveals how runtime scales as model complexity increases (the number of filters doubles with each added convolutional layer: 32, 64, 128, and 256). Before wavelength selection, the runtime increases of Model 2 and Model 3 relative to Model 1 are 1.86% and 31.63%, respectively. Although the post-selection models exhibit lower runtimes, their growth rates are comparatively larger: relative to Model 1, the runtime increases of Model 2 and Model 3 are 35.09% and 57.46%, respectively. Combined with the previous analysis, it is evident that Model 2 possesses the most suitable model structure. Thus, provided that prediction accuracy is sufficient, a model with a simpler structure is more appropriate for near-infrared spectroscopy analysis.
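The runtime figures discussed above follow directly from Table 5 and can be re-derived with a few lines. The function names are hypothetical; the numbers are copied from the table, and small differences in the last digit relative to the quoted percentages come from rounding.

```python
# (before, after) runtimes in seconds, copied from Table 5.
runtime = {"Model 1": (30.03, 2.28), "Model 2": (30.59, 3.08), "Model 3": (39.53, 3.59)}

def remaining_share(before, after):
    """Post-selection runtime as a percentage of the full-spectrum runtime."""
    return 100.0 * after / before

def growth_vs_model1(column):
    """Runtime growth of Models 2 and 3 relative to Model 1, in percent.

    column 0 = before wavelength selection, column 1 = after.
    """
    base = runtime["Model 1"][column]
    return {m: 100.0 * (runtime[m][column] / base - 1.0) for m in ("Model 2", "Model 3")}

for model, (before, after) in runtime.items():
    print(model, round(remaining_share(before, after), 2))
print(growth_vs_model1(0))
print(growth_vs_model1(1))
```

The relative growth looks larger after selection mainly because the Model 1 baseline is so small (2.28 s); in absolute terms Model 3 costs only about 1.3 s more than Model 1 after selection, against about 9.5 s before.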

Figure 4 employs inverted coordinates to display the Mean Squared Error (MSE) values of the three gases before and after wavelength selection across the three models. It is evident that prediction on the full-spectrum dataset relies heavily on the model: as the model complexity increases, the prediction accuracy is significantly enhanced. For instance, the MSE for styrene dropped from a maximum of 0.016248 to a minimum of 2.17E-06. The primary reason lies in the 265,480 data points of the full-spectrum dataset, which necessitate a more complex network structure to achieve better results. However, this also entails higher hardware requirements and reduced environmental adaptability.

After wavelength selection, the prediction accuracy of the three gases is already improved on the simplest model. For example, the Model 1 MSE value for styrene was 5.17E-08. Subsequently, as the complexity of the model increased, the change in error narrowed significantly, indicating that greater model complexity does not substantially improve prediction accuracy. The dataset after wavelength selection yields very promising prediction results on simpler models, which enhances the adaptability of the model and helps reduce actual equipment costs.

The prediction accuracy of the three gases varies under the same model. Notably, the prediction accuracy of p-xylene is the highest both before and after wavelength selection, significantly outperforming styrene and o-xylene. This discrepancy may arise because the p-xylene dataset is more suitable for the CNN network structure. Conversely, the error for o-xylene is the largest both before and after selection, and model complexity has little impact on it. This suggests that the o-xylene dataset is less suitable for the CNN network structure than those of the other two gases.

Before wavelength selection, the prediction performance for styrene continues to improve as model complexity increases. Although the overall prediction performance after wavelength selection is better than before, it still improves with increasing model complexity. This indicates that applying deep learning models to infrared spectrum datasets differs from their use in more common research fields. The model's expressiveness is strongly influenced by the molecular properties of the material, and materials with different structures exhibit different properties. When designing the model, prediction performance cannot be improved simply by increasing model complexity; material characteristics and complex external environments must also be considered.

In conclusion, this paper introduces a novel wavelength selection method for near-infrared spectra that combines wavelength points and absorption peak intervals. The algorithm uses cluster centers based on spectral absorption characteristics and applies moving windows and a greedy algorithm to identify multiple absorption peak intervals beneficial to deep learning models. This process results in a refined dataset. After applying wavelength screening to the styrene, p-xylene, and o-xylene concentration data, the number of wavelength points was reduced to 36.35% of the original dataset in the optimal case.
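The general idea of combining cluster centers with a moving window and a greedy choice can be illustrated with a small sketch. This is not the authors' implementation: it assumes that local absorbance maxima serve as cluster centers, that each wavelength index joins its nearest center, and that each cluster greedily keeps the fixed-width window with the largest summed absorbance. The function names, the window width, the scoring rule, and the toy spectrum are all hypothetical.

```python
from typing import List, Tuple


def find_peaks(spectrum: List[float]) -> List[int]:
    """Indices of strict local maxima, used here as stand-in cluster centers."""
    return [
        i for i in range(1, len(spectrum) - 1)
        if spectrum[i - 1] < spectrum[i] > spectrum[i + 1]
    ]


def assign_clusters(n_points: int, centers: List[int]) -> List[int]:
    """Label each wavelength index with the position of its nearest center."""
    return [
        min(range(len(centers)), key=lambda c: abs(i - centers[c]))
        for i in range(n_points)
    ]


def best_window(spectrum: List[float], members: List[int], width: int) -> Tuple[int, int]:
    """Slide a window over one contiguous cluster and keep the best-scoring one.

    The score (summed absorbance) is an assumed proxy for a model-based criterion.
    """
    start, stop = members[0], members[-1] + 1
    width = min(width, stop - start)
    best = max(
        range(start, stop - width + 1),
        key=lambda s: sum(spectrum[s:s + width]),
    )
    return best, best + width  # half-open interval [best, best + width)


def select_intervals(spectrum: List[float], width: int = 3) -> List[Tuple[int, int]]:
    """Return one retained index interval per cluster."""
    centers = find_peaks(spectrum)
    if not centers:
        return []
    labels = assign_clusters(len(spectrum), centers)
    intervals = []
    for c in range(len(centers)):
        members = [i for i, label in enumerate(labels) if label == c]
        intervals.append(best_window(spectrum, members, width))
    return intervals


# Toy absorbance values with two peaks (illustrative only).
toy_spectrum = [0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.4, 1.2, 0.5, 0.1]
intervals = select_intervals(toy_spectrum, width=3)
kept = sum(stop - start for start, stop in intervals)
print(intervals)                   # [(1, 4), (6, 9)]
print(kept / len(toy_spectrum))    # 0.6
```

In this toy case, 6 of the 10 points are retained, one window around each peak. In practice the window score would be replaced by a criterion tied to prediction error, and the window width would be tuned for the spectra at hand.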

Comparative analyses were conducted on LR, PLS, RR, and SVR models, demonstrating improved prediction performance after wavelength selection. To further verify the adaptability of the wavelength selection algorithm, a mixed gas dataset was constructed using two concentrations of each gas, and three CNN models with varying structures were compared. The results indicated that the two-layer convolution structure (Model 2) yielded the best prediction error and operating efficiency for all three gases.

The proposed wavelength selection algorithm and CNN model hold promise for various types of near-infrared spectroscopy applications. They also offer useful insights for the application and cost control of deep learning methods in near-infrared spectroscopy.

Author Contribution

I wrote and reviewed the manuscript.

Data Availability

The data that support the findings of this study are available from the corresponding author upon request.

No competing interests reported.
