Classification performance of SVM for T2 relaxation curve data
The applicability of ML algorithms to the data mining of T2 relaxation curves was first evaluated, because such analyses have not been reported in the literature and several ML approaches have shown relatively good performance in classification problems compared with multivariate analyses.[14] Conventional SVM classification, which is a typical ML method, was performed using category information from two groups of compressive force data as explanatory variables for the T2 relaxation curve data (a total of 627 curves) of various fish muscle samples collected from multiple coastal sites in Japan. The classification performance was worse than expected: the AUC, accuracy, correctly classified rate for group A (CCR-A), and correctly classified rate for group B (CCR-B) were 0.780, 0.748, 0.475, and 0.865, respectively (Table S2).
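The four metrics reported above can be illustrated with a minimal sketch. It is not the evaluation code used in this study. The label coding (0 for group A, 1 for group B), the decision threshold of 0.5, and the function names are assumptions made for illustration. The AUC is computed as the Mann-Whitney statistic, i.e., the fraction of (group B, group A) sample pairs in which the group B sample receives the higher score.

```python
from typing import Dict, Sequence


def auc_mann_whitney(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a group B sample outscores a group A sample.

    Ties between a group A and a group B score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]  # group B scores
    neg = [s for y, s in zip(y_true, scores) if y == 0]  # group A scores
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def classification_summary(
    y_true: Sequence[int], scores: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Return AUC, accuracy, CCR-A, and CCR-B for a two-group problem.

    Assumed coding (hypothetical): 0 = group A, 1 = group B, and `scores`
    are classifier outputs where larger values favor group B.
    """
    y_pred = [1 if s >= threshold else 0 for s in scores]
    n_a = sum(1 for y in y_true if y == 0)
    n_b = len(y_true) - n_a
    hit_a = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    hit_b = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    return {
        "auc": auc_mann_whitney(y_true, scores),
        "accuracy": (hit_a + hit_b) / len(y_true),
        "ccr_a": hit_a / n_a,  # correctly classified rate for group A
        "ccr_b": hit_b / n_b,  # correctly classified rate for group B
    }
```

Because accuracy pools both groups, a classifier can reach a moderate accuracy while the smaller group is classified poorly, which is why CCR-A and CCR-B are reported separately.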
Variable optimization to enhance the classification performance
To improve the SVM classification performance, we used a variable optimization approach to enhance the quality of information obtained from simple T2 relaxation curve data. The approach was a search method that determined the best length of the raw curve by sequentially removing variables, starting from the long-relaxation-time components and proceeding toward the short-relaxation-time components. This variable reduction is based on the elimination of “noise” variables, which possibly arise from background noise and/or from free water; that is, the relatively long T2 relaxation time components of the relaxation curve are barely related to the characteristic features of the samples and were suspected of interfering with the accuracy of the SVM learning step.
In this study, the variables were reduced stepwise in 10% decrements, from all variables (256) down to a 10% usage rate (25 variables), by omitting the relatively long T2 relaxation time components. For instance, the dataset at the 90% usage rate included 230 variables (T2 relaxation time components from 0 to 0.92 s), and the components beyond 0.92 s were removed. Each dataset generated by the variable reduction was evaluated by SVM classification, and the optimized number of variables was determined from the classification performance.
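The truncation search described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in this study. The function names are hypothetical, and rounding down to a whole number of variables is an assumption chosen because it reproduces the counts given in the text (230 variables at 90% and 25 at 10% of 256). The `evaluate` argument is a placeholder for any scoring routine, such as a cross-validated AUC of an SVM, which is not implemented here.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Curve = Sequence[float]


def n_variables_at(n_total: int, usage_pct: int) -> int:
    """Number of leading variables kept at a given usage rate (rounded down)."""
    return n_total * usage_pct // 100


def truncate_curves(curves: Sequence[Curve], usage_pct: int) -> List[Curve]:
    """Keep the short-relaxation-time part of every curve.

    Assumes each curve is ordered from short to long relaxation time, so that
    dropping the tail removes the long T2 components.
    """
    n_keep = n_variables_at(len(curves[0]), usage_pct)
    return [row[:n_keep] for row in curves]


def select_best_usage_rate(
    curves: Sequence[Curve],
    labels: Sequence[int],
    evaluate: Callable[[List[Curve], Sequence[int]], float],
    rates: Sequence[int] = tuple(range(100, 0, -10)),
) -> Tuple[int, Dict[int, float]]:
    """Score every truncated dataset and return the best usage rate.

    `evaluate` is a hypothetical callable that returns a score where higher
    is better (for example, a cross-validated AUC).
    """
    scores = {pct: evaluate(truncate_curves(curves, pct), labels) for pct in rates}
    best = max(scores, key=scores.get)
    return best, scores
```

Integer arithmetic is used for the variable count so that the result does not depend on floating-point rounding of products such as 256 × 0.9.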
The incorporation of the variable optimization approach into the SVM classification method improved the classification performance for the two groups (Fig. 2). The best SVM classification performance was obtained with a 10% reduction (90% usage rate) of data points (variables) from the T2 relaxation curve, with AUC and accuracy values of 0.806 and 0.779, respectively. Although CCR-A also improved from 0.475 to 0.538, it remained relatively low and required further improvement to obtain satisfactory classification performance.
Bootstrap resampling to balance the sample size between groups
The classification performance of conventional SVM, with or without the above variable optimization approach, was strongly affected by the imbalance in sample size between groups in the data matrix of the T2 relaxation curves. To circumvent this difficulty, we focused on bootstrap resampling-based matrixing, which constructs a matrix (subtraining dataset) with an even sample size between groups for the model construction of ML classifications (Fig. 1). In this approach, the hyperparameters (i.e., the number of resamples for balanced matrixing and the number of datasets generated by bootstrap resampling) were evaluated (Fig. 3). When the number of generated datasets was set to 100, a slightly better performance was obtained in terms of AUC and accuracy compared with only 10 generated datasets, whereas different resample sizes yielded almost the same AUC and accuracy values, except for resample sizes below 50 per group (a total of 100 samples per dataset). Furthermore, the CCR-A slightly decreased with increasing resamples. Therefore, we performed variable optimization combined with bootstrap resampling-based matrixing by using 100 generated datasets with 150 resamples per group (a total of 300 samples) per dataset (Fig. S4). At the 90% variable usage rate, which gave the best SVM classification performance for the T2 relaxation curves, the classification performance of the constructed analytical framework was significantly improved compared with conventional SVM in terms of the ROC curve and the AUC value (Fig. 4). The AUC, accuracy, and CCR-A values significantly increased from 0.780, 0.748, and 0.475 to 0.820, 0.771, and 0.710, respectively (Fig. S5). In addition, the robustness of the developed method to fluctuation of each variable was evaluated using datasets generated by random resampling based on permutation of a variable (Fig. S6).
The fluctuation of each variable had relatively little effect on the SVM classification performance using the analytical framework developed in this study, indicating that the developed method enables the construction of models that are robust to variable fluctuation. Therefore, the analytical framework described here, namely, the incorporation of the variable optimization approach and the bootstrap resampling-based matrixing into the ML calculations, resulted in improved classification performance in the data mining of T2 relaxation curve data and enhanced the robustness when using unbalanced datasets. Furthermore, the relaxometric learning method developed here enabled the extraction of features related to the physical properties (hardness and tenderness) of fish muscle.
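The balanced matrixing step described above can be sketched as follows. This is a minimal illustration, not the implementation used in this study. The function name, its parameters, and the fixed random seed are assumptions. Each generated dataset is represented by a list of row indices into the original data matrix, drawn with replacement so that every group contributes the same number of samples.

```python
import random
from typing import Dict, Hashable, List, Sequence


def balanced_bootstrap_indices(
    labels: Sequence[Hashable],
    n_per_group: int = 150,
    n_datasets: int = 100,
    seed: int = 0,
) -> List[List[int]]:
    """Generate index lists for group-balanced bootstrap subtraining datasets.

    For every dataset, `n_per_group` row indices are drawn with replacement
    from each group, so small and large groups are equally represented.
    The defaults mirror the settings reported in the text (150 resamples per
    group, 100 datasets); the seed is a hypothetical choice for repeatability.
    """
    rng = random.Random(seed)

    # Collect the row indices that belong to each group label.
    by_group: Dict[Hashable, List[int]] = {}
    for i, lab in enumerate(labels):
        by_group.setdefault(lab, []).append(i)

    datasets: List[List[int]] = []
    for _ in range(n_datasets):
        idx: List[int] = []
        for members in by_group.values():
            # Sampling with replacement allows n_per_group to exceed group size.
            idx.extend(rng.choices(members, k=n_per_group))
        datasets.append(idx)
    return datasets
```

A classifier would then be trained on the rows selected by each index list. How the resulting models are combined into a single prediction is not specified in this section, so that step is left out of the sketch.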
Applicability of the analytical framework to other machine learning methods
The analytical framework developed in this study is optimized for data mining of T2 relaxation curve data. Thus, we considered that not only SVM but also other ML algorithms and multivariate analyses may be useful for the data mining of T2 relaxation curve data. To test this hypothesis, RF and PLS were used as alternatives to SVM for data mining based on the developed analytical framework. The classification performance of RF improved slightly in terms of the ROC curve and the AUC value, and the CCR-A values were significantly improved by the incorporation of our analytical framework (Figs. 5 and S7). On the other hand, the classification performance of PLS improved drastically in terms of the values of both AUC and CCR-A (Figs. 5 and S7). These results suggest that our analytical framework is applicable to various ML algorithms and multivariate analyses to enhance classification performance, but the extent of improvement is method dependent. Therefore, the relaxometric learning approach developed in this study should find use as a versatile and useful method for the analysis of T2 relaxation curve data.
Applicability of relaxometric learning to the determination of geographical origin
NMR-based metabolomics approaches are capable of determining the geographical origins of food products such as fish,[14, 16, 29-31] beef,[32] durum wheat,[33] white rice,[34] apple,[35] cabbage,[36] honey,[37] coffee beans,[38] and wine.[39] Therefore, relaxometric learning was also considered applicable for such analyses. In addition, the quality of biological tissue (such as water content and water dynamics) in different environments varies according to geographical origin.[40] Here, we proposed that differences in water conditions could be detected by pattern recognition of T2 relaxation curves obtained by NMR, but not by high-performance liquid chromatography or mass spectrometry methods. We then performed experiments to evaluate the applicability of relaxometric learning to the discrimination of geographical differences (i.e., to extract features related to the habitats of Girella punctata, a fish of the family Kyphosidae, living in Tokyo Bay and Sagami Bay) based on T2 relaxation curves (Fig. S8). The SVM-based relaxometric learning method exhibited relatively good performance in the geographical origin discrimination of fish compared with conventional SVM, with a significant increase in AUC value from 0.886 to 0.936 (Fig. S8). These results suggest that relaxometric learning is applicable as a method for determining geographical provenance, solving problems related to food fraud, and certifying the “terroir” of food, similar to the case for conventional metabolomics approaches.
Compact benchtop or portable NMR spectrometers are low-cost alternatives to conventional high-field and high-resolution spectrometers. Benchtop low-field NMR spectrometers can theoretically obtain T2 relaxation curves with a similar quality to those obtained with high-resolution NMR spectrometers, such as the one used in this study; therefore, similar classification accuracies can be expected. Relaxometric learning using benchtop and portable NMR spectrometers, even without using D2O, might also find applications in on-site quality control and freshness management, optimization of production processes, and improvement of product quality not only in food but also in various industrial fields such as polymers, cosmetics, fabrics, pharmaceuticals, and healthcare. Relaxometric learning is expected to be a versatile and powerful approach for the characterization and evaluation of industrial products and as an option for biological and chemical research that requires a nondestructive, cost-effective, and time-saving method.