In this work we introduce the FlavorMiner algorithm, which takes as input the isomeric SMILES of a set of molecules and produces as output their flavor profile (Fig. 1). The first step is to query a database of 13,387 molecules with known flavor profiles. Only the molecules with no database match pass to the prediction step, in which the respective mathematical representation of each molecule is generated and fed to seven independent binary classifiers. The average prediction capability of these classifiers is 0.88 (ROC AUC score). Each classifier predicts one of the seven target flavor categories (bitter, floral, fruity, off-flavor, nutty, sour, and sweet). The results are provided as a table containing the predicted flavor profile of each compound, the source of that profile (database match or prediction), and the probability values indicating the confidence of each prediction. Finally, a radar chart showing the recurrence of molecules with each target flavor is also generated.
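The lookup-then-predict workflow described above can be sketched as follows. This is a minimal illustration, not the FlavorMiner implementation: the database entry, the `predict_flavors` placeholder, and all names are hypothetical.

```python
# Sketch of the FlavorMiner workflow: database lookup first, prediction
# only for unmatched molecules. Database contents and the classifier
# below are illustrative stand-ins, not the real models.

FLAVOR_DB = {  # isomeric SMILES -> known flavor profile (hypothetical entry)
    "CC(=O)OCC": {"fruity"},
}

TARGET_FLAVORS = ["bitter", "floral", "fruity", "off-flavor",
                  "nutty", "sour", "sweet"]

def predict_flavors(smiles):
    """Placeholder for the seven independent binary classifiers."""
    return {f for f in TARGET_FLAVORS if hash((smiles, f)) % 4 == 0}

def flavor_profile(smiles_list):
    rows = []
    for smi in smiles_list:
        if smi in FLAVOR_DB:                       # database match step
            rows.append((smi, FLAVOR_DB[smi], "database"))
        else:                                      # prediction step
            rows.append((smi, predict_flavors(smi), "prediction"))
    return rows

table = flavor_profile(["CC(=O)OCC", "c1ccccc1O"])
```

Each output row carries the flavor set together with its provenance, mirroring the source column in the tool's result table.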
2.1. Development of ML models for flavor prediction, including management of class imbalance biases
To train the classifiers incorporated in FlavorMiner, a flavor molecule dataset containing 13,387 compounds with experimentally validated flavor profiles was assembled. The positive examples (those with a specific flavor) represent on average 20% of the dataset, while the negative examples (those without that flavor) represent 80% (Supplementary Fig. 1). This class imbalance is an important challenge in the development of ML models, as it can lead to bias towards the majority class[33, 34]. Due to this imbalance, all algorithms trained on the original data, except for those trained on sweet molecules, had poor recall (Fig. 2), which measures the ability to correctly identify positive examples[30, 31]. In contrast, the specificity, which measures the ability to correctly identify negative examples[30, 31], was notably high. This bias towards the majority class was consistently observed regardless of the target flavor, algorithm, or mathematical representation.
Figure 2. Classification Metrics for Algorithms Trained with Original Descriptor Data on Test Set. The metrics include Recall (blue bar), Specificity (orange bar), and ROC AUC Score (green bar) for each algorithm. (a) Random Forest trained with molecular descriptors. (b) Random Forest trained with extended connectivity fingerprint. (c) K-Nearest Neighbors trained with molecular descriptors. (d) K-Nearest Neighbors trained with extended connectivity fingerprint. (e) Convolutional Graph Neural Network trained with molecular graph.
Most algorithms trained on the original data showed a specificity higher than 0.9 on the test set. Nonetheless, these models had a recall lower than 0.5, evidencing a bias towards the majority class of more than 40% for most algorithms. The Convolutional Graph Neural Network trained with the original molecular graph had the lowest recall for most target flavors (close to zero) (Fig. 2c). This is likely because this algorithm is more complex (it has a larger number of parameters) and hence requires more data to be trained effectively[35, 36]. On the other hand, the sweet category showed a bias of less than 10% with Random Forest and K-Nearest Neighbors, trained either with RDKit descriptors or ECFP, which can be explained by its having the smallest class imbalance: the number of sweet positives is only 2% lower than the number of negative examples. Conversely, the sour category has a class imbalance of 97% and showed the highest bias towards the majority class (> 85%). This is a common issue in ML models dealing with imbalanced data[37].
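The metrics discussed here can be computed directly from confusion-matrix counts. The sketch below uses one plausible reading of the bias figures, namely the gap between specificity and recall; the counts are invented for illustration.

```python
# Recall, specificity, and majority-class bias from confusion counts.
def recall(tp, fn):
    """Sensitivity: fraction of positives correctly identified."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of negatives correctly identified."""
    return tn / (tn + fp)

# Illustrative counts mimicking a majority-class-biased model:
tp, fn, tn, fp = 20, 80, 95, 5
rec = recall(tp, fn)          # 0.20: most positives missed
spec = specificity(tn, fp)    # 0.95: negatives almost always correct
bias = spec - rec             # 0.75 gap towards the majority class
```

A model with specificity above 0.9 but recall below 0.5, as observed here, would show a bias above 40% under this reading.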
Additionally, overfitting was observed in the Random Forest and K-Nearest Neighbors models after the first training iteration with the original data. This is likely due to the limited number of positive examples, which can lead the model to overweight specific features of the negative training set, constraining its ability to generalize to previously unseen data[30, 38, 39]. The difference between train and test specificity was under 10% in most cases, but the recall dropped considerably, by 20–90%, from train to test for most algorithms trained with the original data (Supplementary Fig. 2), a clear sign of overfitting[30]. Only some models obtained with the Convolutional Graph Neural Network showed no sign of overfitting, but only because their recall was near zero both during training and testing. Similarly, there was a proportional relationship between class imbalance and overfitting. For example, the models for the sweet flavor (the class with no imbalance) showed the lowest overfitting, while those for the sour flavor (the class with a high imbalance and special problems associated with its perception, v.i.) had the highest overfitting percentage.
SMOTE and Cluster Centroid sampling techniques were implemented to address the class imbalance. These strategies significantly reduced bias and overfitting. SMOTE, an oversampling technique previously used in flavor predictors[19, 20, 28], was applied to the minority class to increase the number of positive examples. This resulted in a bias of less than 20% for most algorithms (Supplementary Fig. 3). The overfitting level was also reduced to less than 30% for most algorithms (Supplementary Fig. 4). Under-sampling with Cluster Centroid [40] was also applied to reduce the number of negative examples (Supplementary Fig. 5). This resulted in an overfitting reduction to less than 30% for K-Nearest Neighbors models and less than 15% for Random Forest models (Supplementary Fig. 6). Most K-Nearest Neighbors models had a bias of less than 10%, while most Random Forest models had a bias of over 20%.
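The core idea of SMOTE can be illustrated in a few lines: each synthetic minority example is placed on the line segment between a real minority sample and one of its nearest minority neighbours. This is a simplified sketch, not the imbalanced-learn implementation used in practice.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, rng=None):
    """SMOTE-style oversampling sketch: interpolate between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # skip the point itself
        j = rng.choice(nn)
        lam = rng.random()                   # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(out)

# Four minority points at the corners of the unit square; all synthetic
# points land inside their convex hull.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_synth = smote_like(X_minority, n_new=4, rng=0)
```

Cluster Centroid undersampling takes the opposite route, replacing groups of majority examples with their cluster centroids, so the training set shrinks rather than grows.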
Bias and overfitting were reduced due to a significant increase in the recall after applying the resampling strategies. Although the bias and overfitting were still slightly high, this represented a significant improvement compared to the performance with the original data. The recall of all the algorithms trained with resampled data was over 50%, while the specificity of most of these models remained above 70%. Multiple studies have shown that both oversampling and undersampling can be used to correct the problems caused by class imbalance in machine learning approaches[33, 34, 40]. In the context of flavor prediction, several studies have investigated the effect of SMOTE oversampling[19, 20, 28]. These studies have focused mainly on sweet, bitter, and sour flavors, and have obtained results similar to those of the present work[19, 25, 38].
On the other hand, using a balancing transformer on the molecular graph to train the Convolutional Graph Neural Network significantly improved the recall but also significantly reduced the specificity. The recall for the classes with more class imbalance improved by 73–99%, but the specificity dropped by a similar proportion (Supplementary Fig. 7). Additionally, the recall for classes such as sweet and bitter decreased. Consequently, the bias and overfitting increased for all models trained with the balanced molecular graph compared to the original data. The bias was higher than 50% for most target flavors and as high as 90% for fruity, off-flavor, nutty, and sour, indicating that the balancing transformer had a significant negative effect on the specificity of the models. The overfitting for bitterness and sweetness predictions also increased with the balanced data. For fruity, off-flavor, nutty, and sour, the recall change from train to test fell below zero by more than 20%. A negative recall change indicates underfitting, which occurs when the model does not learn a strong enough pattern from the training data[30, 31]. This can be addressed with a more intensive hyperparameter optimization, but at a considerable computational cost compared to the Random Forest and K-Nearest Neighbors algorithms.
The balancing transformer and the resampling techniques (SMOTE and Cluster Centroid) differ in how they address class imbalance. The balancing transformer adjusts the weights of positive and negative examples in the neural network, while resampling techniques act on the feature space[30, 40, 41]. The balancing transformer does not change the input data or the number of examples in each class[30]; the poor results obtained with this strategy demonstrate that it is insufficient to overcome the severe class imbalance of the input data. Resampling techniques, on the other hand, change the input data by creating synthetic examples in the minority class (SMOTE) or by removing examples from the majority class and replacing them with cluster centroids[40]. Considering the significant improvement in the performance of the algorithms trained with resampled data, this seems to be the best approach to balance the flavor compound database. Unfortunately, resampling strategies are challenging to implement on molecular graphs and are only practical with molecular descriptors and fingerprints, because clustering molecular graphs without affecting their structure and losing valuable information is nearly impossible. Moreover, in flavor studies, minor changes in structure (graphs) can cause severe changes in perception, so synthetic filling can create more problems than it solves. Although other balancing methods are available for graph data, their usefulness with molecular graphs remains to be evaluated[41].
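The reweighting idea behind a balancing transformer can be made concrete with inverse-class-frequency weights, a common choice (the exact weighting scheme used by the transformer is assumed here for illustration): each example's loss contribution is scaled so that both classes carry equal total weight, while the data themselves stay untouched.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Inverse-class-frequency weights: each class gets total weight n/k,
    so minority examples are upweighted without altering the data."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# 97% negatives, 3% positives, like the sour class described above:
y = [0] * 97 + [1] * 3
w = inverse_frequency_weights(y)
# w[1] is ~32x larger than w[0]: the minority class is heavily upweighted,
# but the feature space the model sees is unchanged.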
2.2. FlavorMiner combines the best ML models for prediction of different flavor classes.
Random Forest outperformed the K-Nearest Neighbors algorithm for most target flavor notes, except sour (see below for discussion). Random Forest trained with ECFP oversampled with SMOTE performed best for bitter, fruity, sweet, and off-flavor notes. Random Forest trained with RDKit descriptors performed best for floral and nutty notes. K-Nearest Neighbors trained with ECFP oversampled with SMOTE performed best for sour notes. In general, K-Nearest Neighbors had similar recall to Random Forest with the same input datasets, but slightly lower specificity. Also, algorithms trained with data resampled with the Cluster Centroid algorithm had slightly better recall, but a larger drop in specificity compared to datasets resampled with SMOTE. These results are consistent with previous studies, which found that Random Forest outperforms other algorithms for predicting sweet and bitter flavors[3, 22, 26]. A correlation was observed between the amount of positive data available and the performance of the classifiers. Sweet, the class with the highest number of positive instances, had the best overall performance, with a ROC AUC score of 0.97. Sour, the class with the lowest number of positive instances, had the lowest performance, with a ROC AUC score of 0.78. Overall, these results confirm the superior performance of algorithms trained with resampled datasets compared to those trained with the original data.
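The ROC AUC score used to rank these models has a simple probabilistic reading: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. A minimal sketch of this rank (Mann-Whitney) formulation, with invented scores:

```python
def roc_auc(scores, labels):
    """ROC AUC via the rank statistic: P(random positive outranks
    random negative), with ties counted as half a win."""
    pairs = wins = 0.0
    for sp, lp in zip(scores, labels):
        if lp != 1:
            continue
        for sn, ln in zip(scores, labels):
            if ln != 0:
                continue
            pairs += 1
            if sp > sn:
                wins += 1
            elif sp == sn:
                wins += 0.5
    return wins / pairs

# Toy scores: one positive (0.3) is outranked by one negative (0.4).
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0]
auc = roc_auc(scores, labels)   # 5 of 6 positive-negative pairs correct
```

Because the statistic only depends on ranks, it is insensitive to the classification threshold, which makes it a reasonable basis for comparing models trained on differently balanced data.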
The performance of the seven final predictors selected for the FlavorMiner backbone is shown in Fig. 3. The average ROC AUC score, specificity, and recall of these classifiers were 0.88, 0.82, and 0.77, respectively. The performance of FlavorMiner for bitter and sweet prediction was comparable to that of existing predictors[20, 22, 26]. For fruity and floral prediction, FlavorMiner achieved recalls of 0.71 and 0.76, respectively, representing an improvement of over 50% compared to previous studies[18, 42]. FlavorMiner is the first model to predict nutty and off-flavor notes from molecular structures. For sour prediction, FlavorMiner was outperformed by a previously published tool[25] by about 15%. However, the dataset, the composition of positive and negative examples, and the code used in that study are not publicly available, making it difficult to assess the reasons for the observed difference.
Figure 3. Performance of the Optimized Classifiers for Target Flavor Notes in FlavorMiner. The metrics include Recall (blue bar), Specificity (orange bar), and ROC AUC Score (green bar) for each algorithm. (a) Classification metrics obtained during training using 5-fold cross-validation. (b) Classification metrics obtained using the test set. Random Forest was used for bitter, fruity, sweet, off-flavor, floral and nutty. K-Nearest Neighbors for sour notes.
Variable importance (VIP) scores[31] revealed the most important features for predicting floral, off-flavor, and nutty notes (Supplementary Fig. 8) with RDKit molecular descriptors. Six descriptors appeared repeatedly in all three cases, accounting for around 45% of the classification. These descriptors measure properties such as the size and polarity of molecules (TPSA), their electronic structure (PEOE_VSA and EState_VSA) and stability (SMR_VSA1 and MinEStateIndex), and their tendency to partition into a hydrophobic environment (MolLogP). Supplementary Fig. 9 shows the trend of the five most relevant features for positive and negative examples of each flavor note. Off-flavor molecules tend to be smaller and less polar than non-off-flavor molecules, with a higher tendency to partition into hydrophobic environments. Floral molecules tend to be smaller and more flexible than non-floral molecules, with a higher tendency to partition into hydrophilic environments. Finally, nutty molecules tend to be smaller and less flexible than non-nutty molecules, with a higher electronic stability. These results are new for these flavor notes and provide a basis for future research to select more specific mathematical representations and to apply data mining techniques to better understand why molecules elicit these flavors.
Supplementary Fig. 10 shows the VIP scores for the Random Forest models trained on oversampled ECFP data for predicting bitterness, fruitiness, and sweetness. The four most important bits for the binary classifiers predicting these flavor notes were 897, 314, 489, and 463. The fragments corresponding to these bits are shown in Supplementary Fig. 4. For the K-Nearest Neighbors algorithm, the permutation importance score[43] was used to estimate feature importance, because the VIP score cannot be applied in this case (Fig. 4). Interestingly, most of the top five fingerprint bits for these notes corresponded to fragments that were absent in the positive compounds, likely due to the higher chemical diversity of the negative compounds. For example, many typical bitter compounds contain an (alkaloid) nitrogen, yet no N-containing fragment appeared in the top five for bitter. Conversely, several top fragments, such as bit 897 (a C-O-C moiety), appeared in bitter, fruity, and sweet alike; they are thus of universal flavor relevance, but probably determine a specific note only in the context of other features (e.g., in esters for fruity or in cyclic sugars for sweet). Even though resampling strategies improved the overall performance of the models, this did not necessarily enhance the chemical diversity of the positive examples.
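Permutation importance, the score used here for the K-Nearest Neighbors models, measures how much a metric drops when one feature's column is shuffled. A minimal sketch with a toy model (the two-feature dataset and threshold rule are invented for illustration):

```python
import numpy as np

def permutation_importance(score_fn, X, y, n_repeats=10, rng=0):
    """Mean drop in score when each feature column is shuffled in turn."""
    rng = np.random.default_rng(rng)
    base = score_fn(X, y)
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])   # break feature j's link to the labels
            drops.append(base - score_fn(Xp, y))
        imp[j] = np.mean(drops)
    return imp

# Toy "model": predict positive when feature 0 exceeds 0.5; feature 1 is noise.
def accuracy(X, y):
    return np.mean((X[:, 0] > 0.5).astype(int) == y)

rng = np.random.default_rng(1)
X = rng.random((200, 2))
y = (X[:, 0] > 0.5).astype(int)
imp = permutation_importance(accuracy, X, y)
# Shuffling feature 0 destroys the score; shuffling feature 1 changes nothing.
```

Unlike VIP scores, this procedure treats the model as a black box, which is why it also works for distance-based learners such as K-Nearest Neighbors.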
Fingerprints have two main advantages over molecular descriptors. First, they can provide information about the structural features that lead an algorithm to a certain decision. Future work could involve a deeper analysis of the fragments that play a central role in the classification to better understand the structural features that underlie these flavor notes. Second, they can be calculated from MS-spectra data, even when the structure of a compound is not fully elucidated[44, 45]. This makes fingerprints useful for accelerating the discovery of new flavor molecules in metabolomics experiments. Such experiments typically involve the analysis of many compounds, and ECFP can help concentrate the annotation and structural elucidation on the most promising candidates, saving time and money by focusing efforts on the most likely flavor-active compounds.
The CGNModel combined with molecular graphs showed poor performance, even with a balancing transformer. This is likely due to the inherent noisiness of the data, which is exacerbated by the susceptibility of Graph Neural Networks to noisy data[46, 47]. This noisiness arises from the heavy dependence of flavor characterization on human tasters and the influence of genetic, sensory, and environmental factors on flavor perception[3, 16, 48, 49]. It is challenging to implement a denoising strategy without losing valuable information. Therefore, the CGNModel was discarded for FlavorMiner, given the limitations of the current data and the better performance of other algorithms.
The flavor profile of a molecule also depends on its concentration and the surrounding matrix[50, 51]. This is related to the concept of the flavor threshold and the synergistic and antagonistic effects of flavor molecules in complex mixtures. The flavor threshold is the minimum concentration at which the flavor is detectable[50, 51]. This version of FlavorMiner only performs binary prediction, and intensity data are not yet incorporated. Although some data are available, they are not readily accessible, as there is no standardized database of threshold concentrations for molecules with known flavor profiles. Some databases, such as FlavorDB[6, 7] and the LSB@TUM Odorant Database (https://www.leibniz-lsb.de/en/databases/leibniz-lsbtum-odorant-database/start/), contain information on flavor thresholds. However, the thresholds reported in these databases lack standardization, so a method is needed to unify these data and make them comparable. Also, most information on flavor thresholds is available in unstructured format (text). Therefore, an intensive text-mining process is required to extract these data and make them usable for machine learning purposes.
Additionally, some studies have shown that combining several molecules with different flavor profiles can enhance the flavor profile of a mixture or block certain notes[50, 51]. However, data in this area is limited, and any effort in this direction will require a preliminary experimental process to generate it. Overcoming these challenges could lead to the development of regression algorithms that can be combined with flavor classifiers to predict not only the flavor profile of a molecule but also its threshold concentration and matrix effect.
Sour (like salty, not evaluated here) is a special flavor note, as it relies on the smallest available “molecule”, the proton. Moreover, it does not activate a classical GPCR like the other taste receptors (T1R and T2R) or the olfactory receptors; only quite recently were the responsible Otop1 ion channels identified[52]. Thus, typical structural features of a molecule might be considered irrelevant, except for its pKa properties, i.e., its ability to lower the pH, an effect that will strongly depend on the matrix's overall pH, buffer capacity, and possibly its proton relay/ion transport capacity. Predicting sour taste from structure might therefore be considered impossible if only the pH change is sensed. However, like GPCRs, ion channels can be influenced by more than the ion they are selective for, for various reasons, including ion pairing, matrix/mucosa effects, and secondary interactions at additional binding sites directly on the ion channel, which will have selective structural preferences, as every protein does. In conclusion, structure-based predictions for ion-channel-based tastes (here sour, but also salty) have to be considered with caution, as slight changes in the tasting parameters, e.g., of the matrix (pH, buffer capacity), can confound the results and thus any ML model. To determine whether the anionic, organic (i.e., structurally influenced) part contributes to sour taste, such taste experiments must be run with a standardized, high-capacity buffered matrix at neutral pH, or better, at 2–3 different pH values. Only this can reveal any possible structural influence of the organic counterion, or of a neutral molecule influencing or mimicking sour taste. Otherwise, the prediction will be no better than a standard pKa prediction, which does not require ML. Independent of this, perception is also influenced by the other receptors; a classic example is of course the action of Miraculin.
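The pKa/pH coupling invoked above follows directly from the Henderson-Hasselbalch relation. The sketch below computes the dissociated fraction of a monoprotic acid at two matrix pH values (citric acid's first pKa of ~3.1 is a literature value used for illustration), showing why sourness tracks pKa and matrix pH rather than finer structural detail:

```python
# Henderson-Hasselbalch sketch: fraction of a monoprotic acid that is
# dissociated (proton-donating) at a given matrix pH.
def dissociated_fraction(pka, ph):
    """[A-] / ([HA] + [A-]) = 1 / (1 + 10^(pKa - pH))."""
    return 1.0 / (1.0 + 10 ** (pka - ph))

# Citric acid's first pKa (~3.1) at two matrix pH values:
low_ph = dissociated_fraction(3.1, 2.5)   # acidic matrix: mostly protonated
neutral = dissociated_fraction(3.1, 7.0)  # neutral matrix: almost fully dissociated
```

The same molecule thus spans nearly the full ionization range depending solely on the matrix, which is precisely why unbuffered tasting experiments confound any structural signal.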
2.3. Molecular flavor prediction for compounds involved in the processing of cocoa.
Previous studies have annotated around 210 compounds during the fermentation, drying, and roasting of fine-flavor cocoa[53, 54]. However, a flavor profile has been reported for less than half of these compounds. The existing data were analyzed with FlavorMiner to predict the flavor profile of the remaining compounds. After the prediction, the fraction of compounds with a “known” flavor profile increased to 92%. The newly predicted compounds include 12 floral, 8 fruity, and 4 compounds with unknown fine-flavor attributes that are potentially linked to positive impacts on quality and price. Additionally, 2 compounds linked to off-flavors and 27 previously uncharacterized, potentially sweet compounds were suggested by the model. These predictions represent an important step towards closing the gap between the variation of the cocoa metabolic fingerprint during processing and flavor quality.
Figure 5 shows the frequency of compounds increasing in association with each of the seven target flavors at the end of every cocoa processing stage (fermentation, drying, and roasting). In general, the frequency of compounds for the different target flavors is similar during fermentation and drying. The most relevant change along the processing chain concerns the sweet compounds, which decrease considerably during the process. This drop is associated with a decrease in the carbohydrate content during the processing chain[53, 54], as most of these molecules are reported as sweet agents. In the roasted samples, some compounds linked to sour and bitter showed a higher abundance, but the real impact of these suggested flavor molecules still needs to be elucidated. For example, some degradation products of more complex compounds have a lower biological activity (e.g., antioxidant activity) than their precursors[55]; whether a similar trend occurs with respect to flavor will require further investigation. In contrast, most compounds linked to fine flavor notes (fruity, floral, and nutty) show a relatively constant frequency throughout the cocoa processing chain. These results provide further insight into flavor development, from biochemistry to processing, which was a missing component until now.