3.1. Sample and spectral analysis
Dried powders of G. lucidum (Fig. 1A), G. sinense (Fig. 1B), and G. applanatum (Fig. 1C) were very similar in appearance, so it is difficult for ordinary consumers to verify authenticity by eye. Figure 2 shows that, despite the different adulteration levels of the G. lingzhi powder, the peak positions and the absorbance curves follow a similar trend. In the 4000–5000, 7000–7500, and 8000–8500 cm⁻¹ bands, the peak absorbance values differed markedly, which may be an important factor in distinguishing G. lucidum, G. sinense, and adulterated G. lingzhi. The 7000–7500 cm⁻¹ region is associated with the first overtone of C-H stretching vibrations in CH3 and CH2 groups, while the bands at 4000–5000 cm⁻¹ arise from C-H combination bands that are characteristic of proteins, amino acids, and carbohydrates. The 8000–8500 cm⁻¹ band may be caused by the second overtone of C-H stretching vibrations in -CH3 (Tahir et al., 2023; Chen et al., 2022). The graph also reveals that the absorbance of curve A was higher than that of curve B. This may be because G. lucidum contains more triterpenoids whereas G. sinense contains more aromatic terpenoids, and triterpenoids absorb more strongly in the NIR region than aromatic terpenoids. The NIR curves of G. lingzhi doped with G. applanatum were lower than those of the pure samples. However, no clear-cut differences can be observed by the naked eye in the original spectra. Therefore, it was necessary to use chemometric methods to identify doped samples and gauge the doping level.
3.2. Qualitative analysis
In this experiment, G. sinense and G. lucidum were identified and adulteration of each was detected. Four modeling methods (PLS-DA, RF, SVM, and BPNN) were compared, together with the effect of different spectral preprocessing techniques on the models. Table 1 lists the results obtained with SG, D1, D2, SNV, and MSC preprocessing.
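As an illustration of what the preprocessing abbreviations stand for, the following minimal sketch implements SNV, MSC, and a first derivative using only the Python standard library. It is not the authors' implementation: the function names are hypothetical, MSC is assumed to use the mean spectrum as its reference, and the derivative is a plain central difference standing in for whatever derivative filter (for example a Savitzky-Golay derivative) was actually used for D1.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own mean and std."""
    m = mean(spectrum)
    s = stdev(spectrum)
    return [(a - m) / s for a in spectrum]


def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum (assumed reference)."""
    n = len(spectra)
    ref = [sum(col) / n for col in zip(*spectra)]
    ref_mean = mean(ref)
    ref_ss = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for s in spectra:
        s_mean = mean(s)
        # Least-squares fit s = intercept + slope * ref, then invert it.
        slope = sum((r - ref_mean) * (a - s_mean) for r, a in zip(ref, s)) / ref_ss
        intercept = s_mean - slope * ref_mean
        corrected.append([(a - intercept) / slope for a in s])
    return corrected


def first_derivative(spectrum, step=1.0):
    """Central finite difference (a simple stand-in for D1); interior points only."""
    return [
        (spectrum[i + 1] - spectrum[i - 1]) / (2 * step)
        for i in range(1, len(spectrum) - 1)
    ]


# Made-up absorbance values, for illustration only.
base = [0.1, 0.4, 0.9, 0.5, 0.2]
scaled = [2 * a + 0.3 for a in base]  # same shape, different offset and gain

snv_base, snv_scaled = snv(base), snv(scaled)
msc_base, msc_scaled = msc([base, scaled])
```

In this toy case the two spectra differ only by an offset and a gain, so SNV and MSC each map them onto the same curve, which is the scatter-removal behaviour these methods are intended to provide.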
Table 1
Classification results of PLS-DA, RF, SVM, and BPNN models combined with different preprocessing and feature variable selection methods.
Task | Combination of modeling methods | Number of characteristic variables | Training set correct rate (%) | Prediction set correct rate (%) |
Identification of G. sinense and G. lucidum species | PLS-DA original data | 1557 | 95.00 | 90.00 |
PLS-DA + SG | 1557 | 97.50 | 95.00 |
PLS-DA + SNV | 1557 | 98.75 | 95.00 |
PLS-DA + MSC | 1557 | 98.75 | 95.00 |
PLS-DA + D1 | 1557 | 100.00 | 100.00 |
PLS-DA + D2 | 1557 | 100.00 | 95.00 |
PLS-DA + D1 + CARS | 54 | 100.00 | 100.00 |
PLS-DA + D1 + SPA | 99 | 100.00 | 90.00 |
PLS-DA + D1 + GA | 94 | 100.00 | 100.00 |
PLS-DA + D1 + UVE | 1087 | 100.00 | 100.00 |
PLS-DA + D1 + UVE + CARS | 62 | 100.00 | 100.00 |
PLS-DA + D1 + UVE + SPA | 99 | 100.00 | 90.00 |
PLS-DA + D1 + UVE + GA | 42 | 100.00 | 100.00 |
RF original data | 1557 | 100.00 | 95.00 |
RF + SG | 1557 | 100.00 | 85.00 |
RF + SNV | 1557 | 100.00 | 100.00 |
RF + MSC | 1557 | 100.00 | 95.00 |
RF + D1 | 1557 | 100.00 | 85.00 |
RF + D2 | 1557 | 100.00 | 95.00 |
RF + SNV + CARS | 16 | 100.00 | 75.00 |
RF + SNV + SPA | 99 | 100.00 | 90.00 |
RF + SNV + GA | 19 | 100.00 | 90.00 |
RF + SNV + UVE | 1010 | 100.00 | 95.00 |
RF + SNV + UVE + CARS | 17 | 100.00 | 85.00 |
RF + SNV + UVE + SPA | 99 | 100.00 | 90.00 |
RF + SNV + UVE + GA | 36 | 100.00 | 85.00 |
SVM original data | 1557 | 93.75 | 70.00 |
SVM + SG | 1557 | 90.00 | 85.00 |
SVM + SNV | 1557 | 93.75 | 90.00 |
SVM + MSC | 1557 | 96.25 | 80.00 |
SVM + D1 | 1557 | 100.00 | 100.00 |
SVM + D2 | 1557 | 100.00 | 90.00 |
SVM + D1 + CARS | 54 | 100.00 | 100.00 |
SVM + D1 + SPA | 99 | 100.00 | 100.00 |
SVM + D1 + GA | 94 | 100.00 | 100.00 |
SVM + D1 + UVE | 1087 | 100.00 | 100.00 |
SVM + D1 + UVE + CARS | 62 | 100.00 | 100.00 |
SVM + D1 + UVE + SPA | 99 | 100.00 | 95.00 |
SVM + D1 + UVE + GA | 42 | 97.50 | 95.00 |
BPNN original data | 1557 | 98.75 | 95.00 |
BPNN + SG | 1557 | 98.75 | 90.00 |
BPNN + SNV | 1557 | 95.00 | 85.00 |
BPNN + MSC | 1557 | 96.25 | 95.00 |
BPNN + D1 | 1557 | 100.00 | 100.00 |
BPNN + D2 | 1557 | 98.75 | 95.00 |
BPNN + D1 + CARS | 54 | 98.75 | 95.00 |
BPNN + D1 + SPA | 99 | 97.50 | 95.00 |
BPNN + D1 + GA | 94 | 100.00 | 95.00 |
BPNN + D1 + UVE | 1087 | 100.00 | 95.00 |
BPNN + D1 + UVE + CARS | 62 | 100.00 | 100.00 |
BPNN + D1 + UVE + SPA | 99 | 98.75 | 95.00 |
BPNN + D1 + UVE + GA | 42 | 95.00 | 95.00 |
Whether G. lucidum is adulterated | PLS-DA original data | 1557 | 90.25 | 86.00 |
PLS-DA + SG | 1557 | 89.75 | 83.00 |
PLS-DA + SNV | 1557 | 88.00 | 80.00 |
PLS-DA + MSC | 1557 | 86.50 | 81.00 |
PLS-DA + D1 | 1557 | 100.00 | 92.00 |
PLS-DA + D2 | 1557 | 100.00 | 98.00 |
PLS-DA + D2 + CARS | 168 | 100.00 | 99.00 |
PLS-DA + D2 + SPA | 81 | 97.75 | 89.00 |
PLS-DA + D2 + GA | 94 | 95.75 | 91.00 |
PLS-DA + D2 + UVE | 235 | 100.00 | 93.00 |
PLS-DA + D2 + UVE + CARS | 58 | 96.50 | 95.00 |
PLS-DA + D2 + UVE + SPA | 84 | 99.00 | 93.00 |
PLS-DA + D2 + UVE + GA | 83 | 97.50 | 92.00 |
RF original data | 1557 | 100.00 | 84.00 |
RF + SG | 1557 | 99.75 | 85.00 |
RF + SNV | 1557 | 100.00 | 79.00 |
RF + MSC | 1557 | 100.00 | 86.00 |
RF + D1 | 1557 | 100.00 | 85.00 |
RF + D2 | 1557 | 100.00 | 97.00 |
RF + D2 + CARS | 168 | 100.00 | 98.00 |
RF + D2 + SPA | 81 | 100.00 | 96.00 |
RF + D2 + GA | 94 | 100.00 | 94.00 |
RF + D2 + UVE | 235 | 100.00 | 93.00 |
RF + D2 + UVE + CARS | 58 | 100.00 | 97.00 |
RF + D2 + UVE + SPA | 84 | 100.00 | 99.00 |
RF + D2 + UVE + GA | 83 | 100.00 | 97.00 |
SVM original data | 1557 | 82.00 | 78.00 |
SVM + SG | 1557 | 85.00 | 79.00 |
SVM + SNV | 1557 | 87.50 | 81.00 |
SVM + MSC | 1557 | 88.00 | 81.00 |
SVM + D1 | 1557 | 100.00 | 93.00 |
SVM + D2 | 1557 | 100.00 | 98.00 |
SVM + D2 + CARS | 168 | 100.00 | 100.00 |
SVM + D2 + SPA | 81 | 97.00 | 91.00 |
SVM + D2 + GA | 94 | 97.25 | 92.00 |
SVM + D2 + UVE | 235 | 100.00 | 99.00 |
SVM + D2 + UVE + CARS | 58 | 97.50 | 96.00 |
SVM + D2 + UVE + SPA | 84 | 99.50 | 95.00 |
SVM + D2 + UVE + GA | 83 | 98.25 | 97.00 |
BPNN original data | 1557 | 92.50 | 85.00 |
BPNN + SG | 1557 | 92.25 | 89.00 |
BPNN + SNV | 1557 | 98.00 | 96.00 |
BPNN + MSC | 1557 | 98.50 | 94.00 |
BPNN + D1 | 1557 | 98.25 | 94.00 |
BPNN + D2 | 1557 | 99.25 | 95.00 |
BPNN + D2 + CARS | 168 | 100.00 | 100.00 |
BPNN + D2 + SPA | 81 | 98.25 | 95.00 |
BPNN + D2 + GA | 94 | 94.50 | 91.00 |
BPNN + D2 + UVE | 235 | 95.50 | 86.00 |
BPNN + D2 + UVE + CARS | 58 | 97.50 | 93.00 |
BPNN + D2 + UVE + SPA | 84 | 98.25 | 93.00 |
BPNN + D2 + UVE + GA | 83 | 98.00 | 94.00 |
Whether G. sinense is adulterated | PLS-DA original data | 1557 | 85.06 | 81.58 |
PLS-DA + SG | 1557 | 88.96 | 73.68 |
PLS-DA + SNV | 1557 | 85.71 | 76.32 |
PLS-DA + MSC | 1557 | 85.06 | 81.58 |
PLS-DA + D1 | 1557 | 100.00 | 100.00 |
PLS-DA + D2 | 1557 | 100.00 | 98.68 |
PLS-DA + D1 + CARS | 113 | 100.00 | 100.00 |
PLS-DA + D1 + SPA | 114 | 100.00 | 84.21 |
PLS-DA + D1 + GA | 109 | 100.00 | 92.11 |
PLS-DA + D1 + UVE | 249 | 100.00 | 92.11 |
PLS-DA + D1 + UVE + CARS | 43 | 100.00 | 100.00 |
PLS-DA + D1 + UVE + SPA | 114 | 100.00 | 92.11 |
PLS-DA + D1 + UVE + GA | 40 | 98.05 | 97.37 |
RF original data | 1557 | 100.00 | 86.84 |
RF + SG | 1557 | 99.35 | 84.21 |
RF + SNV | 1557 | 100.00 | 68.42 |
RF + MSC | 1557 | 100.00 | 71.05 |
RF + D1 | 1557 | 100.00 | 94.74 |
RF + D2 | 1557 | 100.00 | 100.00 |
RF + D2 + CARS | 129 | 100.00 | 100.00 |
RF + D2 + SPA | 114 | 100.00 | 100.00 |
RF + D2 + GA | 86 | 100.00 | 100.00 |
RF + D2 + UVE | 83 | 100.00 | 97.37 |
RF + D2 + UVE + CARS | 34 | 100.00 | 97.37 |
RF + D2 + UVE + SPA | 29 | 100.00 | 96.52 |
RF + D2 + UVE + GA | 24 | 100.00 | 94.74 |
SVM original data | 1557 | 74.03 | 71.05 |
SVM + SG | 1557 | 72.73 | 68.42 |
SVM + SNV | 1557 | 84.42 | 81.58 |
SVM + MSC | 1557 | 84.42 | 78.95 |
SVM + D1 | 1557 | 100.00 | 97.37 |
SVM + D2 | 1557 | 100.00 | 100.00 |
SVM + D2 + CARS | 129 | 100.00 | 100.00 |
SVM + D2 + SPA | 114 | 100.00 | 89.47 |
SVM + D2 + GA | 86 | 100.00 | 100.00 |
SVM + D2 + UVE | 83 | 100.00 | 97.37 |
SVM + D2 + UVE + CARS | 34 | 100.00 | 94.74 |
SVM + D2 + UVE + SPA | 29 | 98.56 | 97.35 |
SVM + D2 + UVE + GA | 24 | 99.35 | 97.37 |
BPNN original data | 1557 | 92.21 | 86.84 |
BPNN + SG | 1557 | 97.40 | 86.84 |
BPNN + SNV | 1557 | 98.70 | 92.11 |
BPNN + MSC | 1557 | 91.56 | 81.58 |
BPNN + D1 | 1557 | 94.81 | 78.95 |
BPNN + D2 | 1557 | 100.00 | 100.00 |
BPNN + D2 + CARS | 129 | 100.00 | 100.00 |
BPNN + D2 + SPA | 114 | 94.81 | 89.47 |
BPNN + D2 + GA | 86 | 93.51 | 92.11 |
BPNN + D2 + UVE | 83 | 98.70 | 94.74 |
BPNN + D2 + UVE + CARS | 34 | 99.35 | 97.37 |
BPNN + D2 + UVE + SPA | 29 | 99.78 | 97.37 |
BPNN + D2 + UVE + GA | 24 | 100.00 | 97.37 |
For the species discrimination task (G. sinense versus G. lucidum), the better models were PLS-DA + D1, RF + SNV, SVM + D1, and BPNN + D1, each of which reached a 100% correct rate in both the training and prediction sets. On this basis, four algorithms (CARS, SPA, GA, and UVE) were applied for characteristic variable selection. Because UVE alone gave unsatisfactory results (it retained too many variables), CARS, SPA, and GA were each applied on top of the UVE-selected variables, which reduced the number of variables retained. RF combined with the five preprocessing methods was the least effective, probably because too much useful information was removed; after feature variable selection its correct rate was not as high as that of the other three models. The best model for the species identification task was therefore PLS-DA + D1 + UVE + GA.
For determining whether G. lucidum was adulterated, the preprocessing methods ranked D2 > D1 > MSC > SNV > SG in correct rate across the four modeling methods. The best UVE-based feature variable selection method was UVE coupled with SPA, used with RF modeling and D2 preprocessing, which selected 84 feature variables (Fig. 3B(b)), 1473 fewer than the full spectrum of 1557 variables. However, despite a slight increase in the correct rate and the number of variables, the UVE-based combinations did not outperform the other algorithms used alone. CARS alone performed better, so SVM + D2 + CARS and BPNN + D2 + CARS were selected as the best methods.
For judging whether G. sinense was adulterated, D1 and D2 preprocessing had clear advantages, with much higher accuracy than the other preprocessing methods. In the PLS-DA model, the most suitable feature variable selection method was UVE + CARS (Fig. 3C(a)). In the RF and SVM models, the most appropriate approach was GA (Fig. 3C(b); Fig. 3C(c)), and in the BPNN model it was CARS (Fig. 3C(d)). Other algorithms used alone, such as SPA and UVE, also achieved high accuracy. Compared with UVE combined with one of the other three algorithms, an algorithm used alone retained more feature variables but gave a higher correct rate. Therefore, a single algorithm was preferable to a combined one for determining whether G. sinense was adulterated, which agrees with earlier research (Gao & Xu, 2022).
Overall, UVE in conjunction with another feature variable selection algorithm was more suitable for species identification, while a single feature wavenumber selection method was more suitable for detecting adulteration. This was to be expected, since different algorithms differ in efficiency and adaptability across tasks and data (Jiang et al., 2023).
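The model comparison in this section can be made concrete with a small sketch. It computes a correct rate as the percentage of matching labels and ranks a subset of the species identification rows of Table 1. The ranking rule (highest prediction set correct rate, then highest training set correct rate, then fewest variables) is an assumption made for illustration; the text does not state an explicit tie-breaking rule, and the names `Result`, `correct_rate`, and `best_model` are hypothetical.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    n_variables: int
    train_pct: float
    pred_pct: float


def correct_rate(y_true, y_pred):
    """Percentage of samples whose predicted label matches the true label."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * hits / len(y_true)


# Subset of the species identification rows of Table 1.
RESULTS = [
    Result("PLS-DA + D1", 1557, 100.00, 100.00),
    Result("PLS-DA + D1 + CARS", 54, 100.00, 100.00),
    Result("PLS-DA + D1 + SPA", 99, 100.00, 90.00),
    Result("PLS-DA + D1 + GA", 94, 100.00, 100.00),
    Result("PLS-DA + D1 + UVE", 1087, 100.00, 100.00),
    Result("PLS-DA + D1 + UVE + CARS", 62, 100.00, 100.00),
    Result("PLS-DA + D1 + UVE + SPA", 99, 100.00, 90.00),
    Result("PLS-DA + D1 + UVE + GA", 42, 100.00, 100.00),
    Result("SVM + D1 + UVE + GA", 42, 97.50, 95.00),
    Result("RF + SNV + CARS", 16, 100.00, 75.00),
]


def best_model(results):
    """Assumed rule: best prediction rate, then training rate, then fewest variables."""
    return min(results, key=lambda r: (-r.pred_pct, -r.train_pct, r.n_variables))


best = best_model(RESULTS)
print(best.model, best.n_variables)
```

Under this assumed rule the sketch selects PLS-DA + D1 + UVE + GA with 42 variables, the same combination named above for the species identification task.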
3.3. Quantitative analysis
3.3.1. Quantitative analysis with PLS
PLS regression can accurately determine the amount of G. applanatum in adulterated samples. Because different pretreatment methods affect PLS regression models differently, several were compared to obtain the most suitable model for determining the adulterant content.
For the task of predicting G. lucidum powder adulteration levels (Table 2), MSC appeared to be the best pretreatment: the MSC model had the highest R²P (0.93) and the lowest RMSEP (0.11) among the five pretreatments, while its R²C and RMSECV were very close to those of the other pretreatments, which indicated the strong predictive power and stability of the MSC model. The best modeling combination with feature variable selection was PLS + MSC + UVE + GA, which screened 57 wavenumbers (Fig. 4A). Because the RMSECV, R²C, RMSEP, and R²P of the modeling combinations did not differ significantly, R²P was compared preferentially, and this combination had the highest R²P (0.91) among the feature variable selection combinations. In general, the accuracy of a regression model is considered excellent when R² ≥ 0.90 (Wang et al., 2022).
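For readers unfamiliar with the two figures of merit, the sketch below shows one common way to compute a root mean square error and a coefficient of determination from reference and predicted adulteration levels. It assumes R² = 1 − SS_res/SS_tot; some chemometrics software reports the squared correlation coefficient instead, and the text does not say which definition was used. The function names and the numeric values are hypothetical.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted values."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up adulteration fractions, for illustration only (not data from this study).
reference = [0.0, 0.2, 0.4, 0.6, 0.8]
predicted = [0.1, 0.2, 0.3, 0.6, 0.9]

print(round(rmse(reference, predicted), 4))
print(round(r_squared(reference, predicted), 3))
```

Applied to a prediction set these two functions correspond to RMSEP and R²P; applied to cross-validated predictions on the training set they correspond to RMSECV and R²C.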
Table 2
Quantitative results of PLS with different spectral pretreatment and feature variable selection methods for G. lucidum mixed with G. applanatum and for G. sinense mixed with G. applanatum.
Task | Combination of modeling methods | Number of characteristic variables | Training set RMSECV | Training set R²C | Prediction set RMSEP | Prediction set R²P |
Prediction of G. lucidum adulteration levels | PLS original data | 1557 | 0.08 | 0.96 | 0.13 | 0.90 |
PLS + SG | 1557 | 0.08 | 0.96 | 0.12 | 0.91 |
PLS + SNV | 1557 | 0.08 | 0.96 | 0.11 | 0.89 |
PLS + MSC | 1557 | 0.09 | 0.95 | 0.11 | 0.93 |
PLS + D1 | 1557 | 0.09 | 0.95 | 0.13 | 0.91 |
PLS + D2 | 1557 | 0.09 | 0.95 | 0.13 | 0.88 |
PLS + MSC + CARS | 18 | 0.08 | 0.96 | 0.14 | 0.85 |
PLS + MSC + SPA | 76 | 0.08 | 0.95 | 0.13 | 0.90 |
PLS + MSC + GA | 51 | 0.10 | 0.93 | 0.15 | 0.85 |
PLS + MSC + UVE | 1319 | 0.11 | 0.92 | 0.12 | 0.88 |
PLS + MSC + UVE + CARS | 36 | 0.10 | 0.93 | 0.15 | 0.86 |
PLS + MSC + UVE + SPA | 76 | 0.11 | 0.93 | 0.13 | 0.88 |
PLS + MSC + UVE + GA | 57 | 0.11 | 0.92 | 0.13 | 0.91 |
Prediction of G. sinense adulteration levels | PLS original data | 1557 | 0.04 | 0.85 | 0.08 | 0.47 |
PLS + SG | 1557 | 5.32 × 10⁻⁵ | 1.00 | 0.14 | 0.86 |
PLS + SNV | 1557 | 2.95 × 10⁻⁵ | 1.00 | 0.14 | 0.87 |
PLS + MSC | 1557 | 4.54 × 10⁻⁵ | 1.00 | 0.14 | 0.84 |
PLS + D1 | 1557 | 4.01 × 10⁻⁵ | 1.00 | 0.14 | 0.92 |
PLS + D2 | 1557 | 3.70 × 10⁻⁵ | 1.00 | 0.15 | 0.87 |
PLS + D1 + CARS | 44 | 3.63 × 10⁻⁵ | 1.00 | 0.14 | 0.91 |
PLS + D1 + SPA | 115 | 4.89 × 10⁻⁵ | 1.00 | 0.13 | 0.89 |
PLS + D1 + GA | 78 | 2.94 × 10⁻⁵ | 1.00 | 0.16 | 0.86 |
PLS + D1 + UVE | 215 | 4.23 × 10⁻⁵ | 1.00 | 0.14 | 0.84 |
PLS + D1 + UVE + CARS | 32 | 3.52 × 10⁻⁵ | 1.00 | 0.18 | 0.70 |
PLS + D1 + UVE + SPA | 114 | 4.42 × 10⁻⁵ | 1.00 | 0.11 | 0.90 |
PLS + D1 + UVE + GA | 8 | 1.92 × 10⁻⁵ | 1.00 | 0.17 | 0.73 |
In the task of predicting the adulteration level of G. sinense powder, the R²C values of the models built with D1, D2, SNV, MSC, and SG preprocessing were all 1.00, indicating a high correlation in the training set. D1 gave the best prediction accuracy, with the highest R²P (0.92) and an RMSEP no higher than those of the other four preprocessing methods. Overall, D1 preprocessing performed excellently. Different preprocessing methods affect different spectral datasets differently, and derivative preprocessing worked best in the PLS model for G. sinense. This result was consistent with previous studies (Li et al., 2020). PLS + D1 + CARS was the best modeling combination (Table 2, Fig. 4B); the other combinations could not guarantee prediction accuracy, even where fewer feature variables were selected.