3.1 Features of MS data obtained from direct ionization of black peppers
An example spectrum (Fig. 2) highlights several notable features of the MS data. First, it is surprising at first glance that no peak corresponding to piperine (m/z 286 for [M + H]+) or its derivatives is observed. Piperine is responsible for the pungency of black pepper and makes up as much as 7% of its dry weight44. Its absence is likely due to its higher melting and boiling points, as well as its higher polarity, compared with the volatile terpenes, another group of compounds commonly found in black pepper. In fact, an aqueous extract of black pepper gave a clear piperine signal via PS-MS (Fig. S2). Since the main goal of this work is to discover a key set of data that can differentiate the origins of black peppers, the absence of a single, albeit major, compound is deemed inconsequential. Next, volatile terpenes were clearly detected in this experimental setup. For example, a peak corresponding to monoterpenes (C10H16, m/z 137 for [M + H]+) can be seen. Apart from piperine, terpenes are a group of compounds that directly contribute to the flavor profiles of black peppers45–48. Examples include myrcene, sabinene, and terpinene. Another peak, at m/z 205, is attributed to sesquiterpenes (C15H24), with caryophyllene as a prominent example.
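The nominal m/z values quoted above can be checked with simple arithmetic. The following minimal Python sketch computes the monoisotopic m/z of a singly protonated ion, [M + H]+, from an elemental formula. The function name `protonated_mz` and the dictionary layout are illustrative only, and the formula of piperine (C17H19NO3) is not stated in the text but is supplied here as a known value.

```python
# Monoisotopic masses (u) of the most abundant isotopes, rounded to 6 decimals.
MONOISOTOPIC = {"C": 12.000000, "H": 1.007825, "N": 14.003074, "O": 15.994915}
PROTON = 1.007276  # mass of a proton (u)

def protonated_mz(formula):
    """Return m/z of [M + H]+ for a formula given as {element: count}."""
    neutral = sum(MONOISOTOPIC[element] * count for element, count in formula.items())
    return neutral + PROTON

# Formulae of the species discussed in the text (piperine formula is an added assumption).
compounds = {
    "piperine (C17H19NO3)": {"C": 17, "H": 19, "N": 1, "O": 3},
    "monoterpene (C10H16)": {"C": 10, "H": 16},
    "sesquiterpene (C15H24)": {"C": 15, "H": 24},
}

for name, formula in compounds.items():
    print(f"{name}: [M + H]+ = {protonated_mz(formula):.2f}")
```

Rounded to the nearest integer, these values give m/z 286, 137 and 205, in agreement with the nominal assignments above.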
Importantly, the amounts of these compounds, as reflected in the ion intensities, can vary between sources of black pepper. For instance, while the representative sample in Fig. 2 had m/z 151, likely carvone, as the most intense peak, some other samples had a different most intense peak (see Fig. S3 and S4 for the complete set of MS spectra, plotted from the numerical data in the supplementary data). These differences, when treated with systematic statistical analysis, may be sufficient for the differentiation of black peppers from different origins.
3.2 Data analysis of obtained MS data
In this study, multivariate data analysis was performed on the mass spectra in order to discriminate peppers based on three categorizations: I) the types of peppers (black and white peppers including all origins), II) the origins of black peppers (Thailand, China, India and UK), and III) the origins of white peppers (Thailand and China). First, one-sample t-tests were used to assess the variability of the raw data. No m/z data point from seeds of the same source showed a statistically significant difference at the 5% significance level, i.e., the null hypothesis was not rejected. This indicates that the variability of the analysis was generally low. Then, all m/z data points were used for PCA on the types of peppers, the origins of black peppers, and the origins of white peppers using the first three PCs (Fig. 3), which together accounted for more than 70% of the total variance in all cases. Overall, the PCA score projections obtained in this manner exhibit poor separation between classes in most studies. In addition, the contingency table of the classification using PCA-LDA is shown in Table S1. Classification rates of 76.92%, 74.46% and 69.35% were observed in the discrimination of the types of pepper, the origins of black peppers and the origins of white peppers, respectively. This result indicates that PCA based on the largest principal components may not always be the most effective method. In some cases, especially in biological systems43,49, the PCs with high variance may instead correlate with the background noise of biological samples, and the best discriminator may lie in a later PC with small variance. Hence, to explore the possibility of obtaining useful information from small-intensity data points, the later PCs were also analyzed.
As shown in Fig. 4 (A,C,E), the classification accuracy using the first three PCs is poor for all three studies. When later PCs were included, 18, 14, and 11 PCs were found to be optimal for the classification of the types of peppers, the origins of black peppers, and the origins of white peppers, respectively. In addition, the prediction strength of each PC was calculated to visualize the discriminating power of the PCs in each study (Fig. S5). The higher the prediction strength, the better the discriminating power. Some later PCs showed higher prediction strength than the first PCs. Specifically, PC 3-4-10, PC 3-4-7, and PC 1-7-8 are the sets of PCs with the highest prediction strength for the studies of the types of peppers, the origins of black peppers, and the origins of white peppers, respectively. The score plots for the PCs with the highest prediction strength are shown in Fig. 4 (B,D,F).
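The idea that a later, low-variance PC can discriminate better than the first PCs can be illustrated with a short sketch. The definition of prediction strength is not given in this section, so the code below uses the ratio of between-class to within-class sum of squares of each PC's scores as a stand-in; this is an assumption, not the measure used in Fig. S5. The data are synthetic, and the function names `pc_scores` and `variance_ratio` are hypothetical.

```python
import numpy as np

def pc_scores(X):
    """Scores on all PCs for a data matrix X (samples x variables)."""
    Xc = X - X.mean(axis=0)                      # mean-centre each variable
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U * s                                 # column j holds the scores on PC j+1

def variance_ratio(t, y):
    """Between-class over within-class sum of squares for one PC's scores.
    Used here as a stand-in for prediction strength (assumption)."""
    classes = np.unique(y)
    grand = t.mean()
    between = sum(np.sum(y == c) * (t[y == c].mean() - grand) ** 2 for c in classes)
    within = sum(np.sum((t[y == c] - t[y == c].mean()) ** 2) for c in classes)
    return between / within

# Synthetic two-class data, not the pepper data: variable 0 has large
# class-independent variance, variable 1 carries a small class-dependent shift.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 30)
X = rng.normal(scale=0.3, size=(60, 6))
X[:, 0] = rng.normal(scale=10.0, size=60)
X[:, 1] = np.where(y == 0, -2.0, 2.0) + rng.normal(size=60)

T = pc_scores(X)
ratios = np.array([variance_ratio(T[:, j], y) for j in range(T.shape[1])])
best_pc = int(np.argmax(ratios)) + 1
print("variance ratio per PC:", np.round(ratios, 3))
print("most discriminating PC:", best_pc)
```

In this toy example the first PC captures the class-independent variance, so a PC with a smaller share of the total variance ranks highest.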
To provide numerical data for comparison, the classification rates for each PCA-LDA model built from the first three PCs and from the optimal number of PCs are illustrated in Fig. 5 (see Table S1 for contingency tables). The overall classification accuracies using the first three PCs are merely 76.92%, 44.68% and 72.58% for the studies of the type of peppers, the origins of black peppers, and the origins of white peppers, respectively. On the other hand, the classification rates for the models built from the optimal number of PCs are substantially higher, at 98.72%, 98.94% and 100%. Although the optimal number of PCs for the LDA classifications was greater than 10 in all cases, the classification rates increased dramatically only in the range of PC1-5. These results suggest that the first five PCs are the most relevant, but that the later PCs are still meaningful in differentiating between the classes. This is in good agreement with the PC score plots in Fig. 4 that use the later PCs with high prediction strength.
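The comparison between PCA-LDA models built from different numbers of PCs can be sketched as follows. This is a minimal NumPy illustration on synthetic data, not the authors' implementation: the hold-out split, the pooled-covariance form of LDA, and all function names (`fit_pca`, `fit_lda`, `predict_lda`, `pca_lda_accuracy`) are assumptions made for the example.

```python
import numpy as np

def fit_pca(X, k):
    """Return the training mean and the first k loading vectors (as columns)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T

def fit_lda(T, y):
    """Fit LDA with a pooled within-class covariance on scores T (samples x k)."""
    classes = np.unique(y)
    means = np.array([T[y == c].mean(axis=0) for c in classes])
    pooled = sum((T[y == c] - m).T @ (T[y == c] - m) for c, m in zip(classes, means))
    pooled = pooled / (len(y) - len(classes))
    priors = np.array([np.mean(y == c) for c in classes])
    return classes, means, np.linalg.pinv(pooled), priors

def predict_lda(model, T):
    """Assign each row of T to the class with the largest linear discriminant score."""
    classes, means, precision, priors = model
    scores = (T @ precision @ means.T
              - 0.5 * np.sum(means @ precision * means, axis=1)
              + np.log(priors))
    return classes[np.argmax(scores, axis=1)]

def pca_lda_accuracy(X_train, y_train, X_test, y_test, k):
    """Classification rate of a PCA-LDA model that uses the first k PCs."""
    mean, loadings = fit_pca(X_train, k)
    model = fit_lda((X_train - mean) @ loadings, y_train)
    predicted = predict_lda(model, (X_test - mean) @ loadings)
    return float(np.mean(predicted == y_test))

# Synthetic three-class data, not the pepper data: variables 0-2 carry large
# class-independent variance, variables 3-4 carry the class signal.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 50)
X = rng.normal(scale=0.5, size=(y.size, 30))
X[:, 0:3] = rng.normal(size=(y.size, 3)) * [10.0, 8.0, 6.0]
centres = np.array([[4.0, 0.0], [-4.0, 0.0], [0.0, 4.0]])
X[:, 3:5] = centres[y] + rng.normal(size=(y.size, 2))

order = rng.permutation(y.size)
train, test = order[:105], order[105:]            # 70% / 30% hold-out split (assumption)
accuracy = {k: pca_lda_accuracy(X[train], y[train], X[test], y[test], k)
            for k in (3, 5, 10)}
for k, rate in accuracy.items():
    print(f"first {k} PCs: classification rate = {100 * rate:.1f}%")
```

The PCA loadings and the LDA model are fitted on the training samples only and then applied to the held-out samples, so the reported rate is not inflated by information from the test set.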
3.3 Significant m/z data points as markers
A conventional approach for variable selection is to apply a selection method (Fisher weight in this case) to the entire dataset, and to determine the significance of an m/z value as a marker from the magnitude of its test value. However, selecting a set of markers from the entire dataset may lead to overfitting. An alternative approach is to identify potential m/z data points from several split training sets. This approach is called “iterative reformulation of training set models” and was first introduced elsewhere43. In this study, we also employed this approach to confirm the relevance of the obtained m/z data points. That is, a randomized partial set (70% of the entire dataset) was used for the discovery of relevant m/z data points, whose appearances were then counted. After 100 iterations, each m/z value was evaluated by the number of times it appeared in the list of relevant m/z data points across the split training sets. As different m/z values may be selected in each iteration, the data points that are most frequently selected are likely the most significant markers.
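The iterative procedure described above can be sketched in a few lines of Python. This is a minimal illustration on synthetic data rather than the authors' code. The Fisher weight is taken here as the per-variable ratio of between-class to within-class sum of squares, which is an assumed definition since none is given in the text. The function names and the `top_fraction` parameter (set to the top 10% used for Fig. 6) are hypothetical.

```python
import numpy as np

def fisher_weight(X, y):
    """Per-variable ratio of between-class to within-class sum of squares
    (assumed definition of the Fisher weight)."""
    classes = np.unique(y)
    grand = X.mean(axis=0)
    between = sum(np.sum(y == c) * (X[y == c].mean(axis=0) - grand) ** 2 for c in classes)
    within = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0) for c in classes)
    return between / within

def selection_counts(X, y, n_iter=100, train_fraction=0.7, top_fraction=0.1, seed=0):
    """Count how often each variable ranks in the top fraction by Fisher weight
    over repeated random training subsets."""
    rng = np.random.default_rng(seed)
    n_samples, n_variables = X.shape
    n_train = int(round(train_fraction * n_samples))
    n_top = max(1, int(round(top_fraction * n_variables)))
    counts = np.zeros(n_variables, dtype=int)
    for _ in range(n_iter):
        subset = rng.choice(n_samples, size=n_train, replace=False)  # random 70% subset
        weights = fisher_weight(X[subset], y[subset])
        counts[np.argsort(weights)[-n_top:]] += 1                    # top-ranked variables
    return counts

# Synthetic two-class data, not the pepper data: variables 7 and 21 differ between classes.
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 50))
X[:, [7, 21]] += np.where(y == 0, -2.0, 2.0)[:, None]

counts = selection_counts(X, y)
markers = np.flatnonzero(counts == 100)   # variables selected in every iteration
print("selected in all 100 iterations:", markers)
```

Variables that carry no class information are selected only occasionally, so their counts generally remain below those of the true discriminating variables.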
Figure 6 illustrates the number of times (out of 100 iterations) that each m/z value was selected in the subset of the top 10% of m/z data points with the highest Fisher weight. Any m/z value that was selected in all 100 iterations was identified as an impactful marker. The study on the types of peppers provides a set of 3 m/z data points (m/z = 205.2, 206.2 and 295.2) that appeared in all 100 iterations (Fig. 6A), while the study on the origins of white peppers gave 6 m/z data points (m/z = 163.1, 205.2, 206.2, 222.2, 224.2, and 238.3) that appeared 100 times (Fig. 6C). In the study on the origins of black peppers, there were 7 m/z values (121.1, 205.2, 206.2, 245.2, 273.3, 291.2, 292.1, 292.2) that appeared in all 100 iterations (Fig. 6B), which indicates a more challenging case because a larger number of variables is involved. In all cases, m/z 205.2 appeared as a significant marker; it is attributed to sesquiterpenes, a major class of compounds found in peppercorn seeds. It should be noted, however, that focusing on individual data points may be misleading because of the possibility of strategic adulteration or modification. This is reflected in the results, where different studies required different numbers of variables for effective cluster separation. Therefore, the non-targeted approach50, as practiced in this study, is deemed more flexible and versatile for broader applications, because it does not rely on a small set of known markers that could be targeted by strategic adulteration.