Mathematical model extrapolation testing
To examine the extrapolation ability of ML models, models are developed for data with deterministic functional relationships (i.e., linear univariate, linear multivariate, and nonlinear multivariate) using 11 ML methods: multiple linear regression (MLR), least absolute shrinkage and selection operator (LASSO), ridge regression (Ridge), support vector machine (SVM), Gaussian process regression (GPR), multilayer perceptron (MLP), adaptive boosting (AdaBoost), extreme gradient boosting (XGBoost), random forest (RF), K-nearest neighbor (KNN), and gradient boosting decision tree (GBDT). A test (B) set and a test (F) set are set up to evaluate the extrapolation performance of the ML models: the test (B) set consists mainly of samples whose dependent variable lies below the minimum dependent variable of the training set, and the test (F) set of samples whose dependent variable lies above its maximum. In addition, a test (I) set, whose dependent variable falls within the range of the dependent variable of the training set, is used to validate the interpolation ability of the models.
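The three test sets can be sketched as follows; this is a minimal illustration with a hypothetical linear univariate relationship (y = 3x + 5) and arbitrary cut-offs, not the actual data generation used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear univariate relationship: y = 3x + 5
x = rng.uniform(0.0, 100.0, size=1000)
y = 3.0 * x + 5.0

# Training set covers the middle of the target range; the two
# extrapolation test sets lie strictly outside it.
in_range = (y >= 100.0) & (y <= 250.0)
x_train, y_train = x[in_range], y[in_range]

# test (B): dependent variable below the training-set minimum
x_test_b, y_test_b = x[y < 100.0], y[y < 100.0]
# test (F): dependent variable above the training-set maximum
x_test_f, y_test_f = x[y > 250.0], y[y > 250.0]
```

A test (I) set would simply be a held-out subset of samples whose dependent variable falls inside the training range.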
Based on the models established with the initial hyperparameters of the 11 ML methods (Figs. 1a, b, and c; Supplementary Information Tables S1 ~ S3), the regressors involving tree algorithms (i.e., RF, KNN, XGBoost, AdaBoost, and GBDT) show excellent predictive ability within the value domain of the training set, as confirmed by squared correlation coefficients (R2) close to 1 on both the training set and the test (I) set. However, for target values outside the value domain of the training set, their predicted vs. observed values form horizontal straight lines, and both R2test(B) and R2test(F) are 0. This suggests that ML models involving tree algorithms have strong interpolation ability but may lack extrapolation ability. Since hyperparameters have non-negligible effects on model performance, the hyperparameters of the models established by the 10 methods other than MLR are optimized. Even for the optimal models (Figs. 1d, e, and f; Supplementary Information Tables S4 ~ S9), the R2test(B) and R2test(F) of the models involving tree algorithms remain 0, ruling out a correlation between extrapolation failure and hyperparameter selection. Furthermore, the predicted values of the optimal MLR, LASSO, GPR, and MLP models are close to the observed values. Nevertheless, for data from linear or nonlinear multivariate functional relationships, the Ridge and SVM models show large prediction errors when observed values lie far outside the value domain of the training set, as evidenced by their R2test(B) failing to reach 1 (Figs. 1e and f).
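The "horizontal straight line" behavior is easy to reproduce; the sketch below, using scikit-learn's RandomForestRegressor on hypothetical linear data, shows that any two inputs below (or above) the training domain receive essentially the same prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Hypothetical linear relationship on x in [0, 10]
x_train = rng.uniform(0.0, 10.0, size=500).reshape(-1, 1)
y_train = 7.0 * x_train.ravel() + 2.0

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(x_train, y_train)

# Outside the training domain, predictions collapse to near-constant values:
below = rf.predict([[-5.0], [-50.0]])   # both land near the training-set minimum
above = rf.predict([[20.0], [200.0]])   # both land near the training-set maximum
```

However far the input moves from the training domain, the prediction does not change, which is exactly the flat predicted-vs-observed line with R2 of 0.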
During hyperparameter tuning of the regressors involving tree algorithms (Supplementary Information Figs. S1 ~ S3), all predictions in the test (B) and test (F) sets remain close to the minimum and maximum values of the training set, respectively. Notably, the AdaBoost, RF, and XGBoost models exhibit piecewise functional data relationships as the hyperparameters are varied; it is therefore conjectured that this piecewise behavior, i.e., constant output values over certain ranges of input values, may be the reason for the extrapolation failure of models developed by ML methods involving tree algorithms.
Revealing the reason for extrapolation-failure
To gain insight into the poor extrapolation ability of regressors involving tree algorithms, an RF model developed on data from the linear univariate functional relationship is visualized; it contains 10 decision trees (DTs), each of depth 4 (Fig. 2; Supplementary Information Figs. S4 and S5). Each node can be regarded as a dichotomy point in the decision-making process. For any input value below the domain of definition of the training set, every node evaluates to “True”, so the predicted value of each DT is the minimum of its value domain; for any input value above the domain of definition of the training set, every node evaluates to “False”, so the predicted value of each DT is the maximum of its value domain. The predicted value of the RF model is the average of the predicted values of all DTs. The example model, with only one independent variable, has a maximum predictive value of 979.9619 and a minimum predictive value of 417.0969, so its potential prediction range is [417.0969, 979.9619] (Supplementary Information Fig. S4). Thus, for ML methods involving tree algorithms, the range of potential predicted values for any independent variable is a closed interval. With multiple independent variables, the dependent variable is a combined transformation of values within these closed intervals, and the value domain constituted by the combined transformations of the potential minimum to maximum predicted values of all independent variables is likewise a closed interval, which may be the reason for the extrapolation failure of models developed by ML methods involving tree algorithms.
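The closed-interval bound can be checked directly: each DT can only output its leaf values, so the ensemble average is confined to the interval spanned by the per-tree extremes. A minimal sketch with scikit-learn (synthetic monotone data, hypothetical coefficients):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=400).reshape(-1, 1)
y = 5.0 * x.ravel() + 3.0   # hypothetical linear relationship

# 10 trees of depth 4, mirroring the structure of the visualized model
rf = RandomForestRegressor(n_estimators=10, max_depth=4, random_state=0).fit(x, y)

# For regression trees, internal-node values are means of their leaves, so the
# min/max over all node values equal the min/max leaf values of each tree.
tree_mins = [t.tree_.value.min() for t in rf.estimators_]
tree_maxs = [t.tree_.value.max() for t in rf.estimators_]
lower, upper = np.mean(tree_mins), np.mean(tree_maxs)

# Arbitrarily extreme inputs still map inside the closed interval [lower, upper]:
extreme = rf.predict([[-1e6], [1e6]])
```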
Extrapolation validation (EV) method
To quantitatively evaluate the extrapolation ability of a model, the extrapolation validation (EV) method is proposed. Each independent variable is serialized, and the training and test sets are then re-divided in that order at a fixed ratio. That is, the dataset is divided into a training (EV) set and a test (EV) set in the order of the serialized independent variable, e.g., the first 80% as the training (EV) set and the remaining 20% as the test (EV) set (Fig. 3a). This provides data support for using the performance on the test (EV) set (i.e., samples that are not entirely within the domain of application of the model) to evaluate the extrapolation ability of a model. To account for the contribution of all independent variables, the serialized leverage value35 (h; Methods) is also applied for dividing the training and test sets. Both forward and backward sequences are adopted, i.e., forward extrapolation validation and backward extrapolation validation. The extrapolation performance for each serialized independent variable is evaluated by the performance on the test (EV) set after re-fitting the model on the training (EV) set. Following this approach, all independent variables of the developed model are evaluated.
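The sequential split can be sketched as a small helper; the function name and signature are illustrative, not from the original work. The sort key may be one independent variable or the leverage value h.

```python
import numpy as np

def ev_split(X, y, sort_key, train_frac=0.8, forward=True):
    """Order samples by a serialized quantity (one independent variable,
    or the leverage value h) and split sequentially: forward EV trains on
    the low end and extrapolates upward; backward EV does the reverse."""
    order = np.argsort(sort_key)
    if not forward:
        order = order[::-1]
    n_train = int(round(train_frac * len(y)))
    tr, te = order[:n_train], order[n_train:]
    return X[tr], y[tr], X[te], y[te]

# Example: forward EV on the first independent variable
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0])
X_tr, y_tr, X_te, y_te = ev_split(X, y, sort_key=X[:, 0])
```

By construction, every test (EV) sample lies beyond the training (EV) range of the serialized variable, though the other variables may still interpolate.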
When serializing one independent variable for extrapolation, it is difficult to ensure that all independent variables of the test (EV) set lie outside the corresponding ranges of the training (EV) set, so the performance on the test (EV) set inevitably includes a contribution from interpolation. Hence, the extrapolation degree (ED; Eq. (1), Fig. 3a) is defined as a metric to assist in evaluating the extrapolation ability of the model. The ED quantifies the extent to which the independent variables of the test (EV) set lie outside the corresponding domain of definition of the training set, thereby digitalizing the contribution of extrapolation to the performance on the test (EV) set. Furthermore, as shown in Section 2.1 (Figs. 1e and f), for samples far from the domain of definition of the training set, the predicted values of Ridge and SVM models developed on linear or nonlinear multivariate functional relationship data differ significantly from the observed values. This emphasizes that the reliability of a model's predictions for out-of-distribution samples is related to the distance between the sample's independent variables and the domain of definition of the training set.
\(\text{ED}=\frac{1}{n_i}\sum\limits_{i}\left(\frac{\sum\limits_{j}e_{i,j}}{\sum\limits_{j}a_{i,j}}\right)\) Eq. (1)
\(e_{i,j}=\begin{cases}\min\limits_{j}\left(x_{i,j}^{\text{train}}\right)-x_{i,j}^{\text{test}}, & x_{i,j}^{\text{test}}<\min\limits_{j}\left(x_{i,j}^{\text{train}}\right)\\[2pt] x_{i,j}^{\text{test}}-\max\limits_{j}\left(x_{i,j}^{\text{train}}\right), & x_{i,j}^{\text{test}}>\max\limits_{j}\left(x_{i,j}^{\text{train}}\right)\\[2pt] 0, & \text{otherwise}\end{cases}\)
\(a_{i,j}=\begin{cases}\frac{1}{n_j}\sum\limits_{j}x_{i,j}^{\text{train}}-x_{i,j}^{\text{test}}, & x_{i,j}^{\text{test}}<\min\limits_{j}\left(x_{i,j}^{\text{train}}\right)\\[2pt] x_{i,j}^{\text{test}}-\frac{1}{n_j}\sum\limits_{j}x_{i,j}^{\text{train}}, & x_{i,j}^{\text{test}}>\max\limits_{j}\left(x_{i,j}^{\text{train}}\right)\\[2pt] 0, & \text{otherwise}\end{cases}\)
where i and j are the serial numbers of the independent variables and samples, respectively, and ni and nj are the numbers of independent variables and samples.
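A vectorized reading of Eq. (1) can be sketched as below (the function name is illustrative); e measures how far each test value falls outside the training range of its variable, a measures its distance from the training mean, and the per-variable ratios are averaged.

```python
import numpy as np

def extrapolation_degree(X_train, X_test):
    """Extrapolation degree (ED) per Eq. (1). Rows are samples (index j),
    columns are independent variables (index i)."""
    lo = X_train.min(axis=0)      # min_j of each variable in the training set
    hi = X_train.max(axis=0)      # max_j of each variable in the training set
    mean = X_train.mean(axis=0)   # (1/n_j) * sum_j of each variable

    below = X_test < lo
    above = X_test > hi

    # e_{i,j}: distance outside the training range (0 if inside)
    e = np.where(below, lo - X_test, 0.0) + np.where(above, X_test - hi, 0.0)
    # a_{i,j}: distance from the training mean for out-of-range values
    a = np.where(below, mean - X_test, 0.0) + np.where(above, X_test - mean, 0.0)

    e_sum, a_sum = e.sum(axis=0), a.sum(axis=0)
    # Variables with no out-of-range test samples contribute a ratio of 0
    ratios = np.divide(e_sum, a_sum, out=np.zeros_like(a_sum), where=a_sum > 0)
    return ratios.mean()
```

Since the training maximum is at least the training mean (and the minimum at most), each ratio lies in [0, 1]: ED near 1 means the test (EV) set is dominated by extrapolation, ED near 0 by interpolation.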
Adopting the EV method, the optimal models for the tested mathematical relationship data are evaluated (Figs. 3b, c, and d; Supplementary Information Figs. S6 and S7). Root mean squared error (RMSE) is adopted as the main statistical parameter in this work. For distinction, the RMSEs on the test (EV) set for the originally developed model and for the model re-fitted on the training (EV) set are denoted RMSEtest(Model) and RMSEtest(EV), respectively. For data from both linear and nonlinear relationships, the RMSEtest(EV) of models developed by methods involving tree algorithms, such as AdaBoost, RF, and GBDT, is large; for example, the RMSEtest(EV) of the models involving tree algorithms developed on data from linear univariate functions is always greater than 50 (Fig. 3b). Models established by methods such as MLP, GPR, and SVM show better extrapolation ability, with RMSEtest(EV) close to 0. The EV results indicate that models developed by ML methods involving tree algorithms have poor extrapolation ability, while those developed by methods not involving tree algorithms extrapolate well, consistent with the results of Section 2.1. Further, since the methods involving tree algorithms show small prediction errors during model development (i.e., RMSEtest(Model)) but large prediction errors in model application (i.e., RMSEtest(EV)), performance degrades in deployment, reducing trust in the ML model. A prior evaluation of extrapolation ability using the EV method therefore helps in selecting a trustworthy ML model.
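The contrast in RMSEtest(EV) between a tree-based and a non-tree-based regressor can be reproduced with a minimal sketch (synthetic linear univariate data with hypothetical coefficients, not the study's dataset):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 100.0, size=(1000, 1))
y = 3.0 * X.ravel() + 5.0   # hypothetical linear univariate relationship

# Forward EV split on the single independent variable:
# train on the lower 80%, test on the upper 20%
order = np.argsort(X[:, 0])
n_train = int(0.8 * len(y))
tr, te = order[:n_train], order[n_train:]

rmse_ev = {}
for model in (LinearRegression(), GradientBoostingRegressor(random_state=0)):
    model.fit(X[tr], y[tr])
    mse = mean_squared_error(y[te], model.predict(X[te]))
    rmse_ev[type(model).__name__] = mse ** 0.5
```

On such data the linear model's RMSEtest(EV) is essentially zero, while the GBDT's is large because its predictions are capped at the training-set maximum.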
Application of EV method
An application of the EV method is demonstrated with a PI-Tg model36 developed on a large dataset containing 1321 Tg values. Besides the MLR model developed in the literature (R2 = 0.8793, Q2 = 0.8718), 10 further models, such as MLP, RF, and GBDT, are established, all fully consistent with the literature in terms of independent variables (norm index, I), training set, and test set settings (Supplementary Information Table S10 and Fig. S8). The extrapolation ability of the 11 PI-Tg models is evaluated by the EV method (Figs. 4 and 5).
In the EV of the PI-Tg models, the RMSEtest(EV) of the forward and backward serialized extrapolation validations for models established by tree-involving algorithms is always larger than that for models established by non-tree-involving algorithms, for every I and for h (Fig. 4a). To evaluate the overall extrapolation ability of a model, the averages of the RMSEs over the independent-variable and h extrapolation validations, i.e., the average RMSEtest(EV) and the average RMSEtest(Model), are used as statistical parameters. The average RMSEtest(EV) of the MLP, MLR, Ridge, GPR, and LASSO models is around 20°C (Fig. 4b), which is close to the experimental measurement error and acceptable37. By contrast, the average RMSEtest(EV) of the models developed by the RF, KNN, GBDT, XGBoost, and AdaBoost methods is larger, at around 40°C (Fig. 4b), suggesting that these models involving tree algorithms have relatively poor extrapolation ability.
Furthermore, the standard deviation of the samples within the 95% confidence level interval (σ95) is introduced as a threshold for evaluating extrapolation ability. If the RMSEtest(EV) of an independent variable is greater than σ95, the prediction error of the model may exceed the difference between the actual value and the mean of the samples within the 95% confidence level interval. The RMSEtest(EV)s of the independent-variable extrapolation validations for models established by ML methods involving tree algorithms are all high, with several I extrapolation validations even approaching σ95 (60.35°C; Figs. 5f, g, h, i, and j): for instance, the forward extrapolation validations of I10, I18, I3, and I5 and the backward extrapolation validations of I9 and I8 for the AdaBoost model (Fig. 5f), and the forward extrapolation validations of I10 and I3 and the backward extrapolation validation of I9 for the XGBoost model (Fig. 5g). This indicates that the prediction error of such a model may even exceed AE when the above-mentioned independent variables of a sample lie far from the corresponding domain of definition of the training set.
The MLP, MLR, LASSO, GPR, and Ridge models have RMSEtest(EV) values for every I close to the corresponding RMSEtest(Model) (Figs. 5a, b, c, and e), indicating good predictive ability. In addition, applying these models to predict a new sample can be considered reliable when the ED of the predicted sample is smaller than the maximum ED of the extrapolation validation. I10, I13, and I9 of the SVM model have low backward EDs but large RMSEtest(EV); therefore, when applying this model, if the I10, I13, or I9 of a prediction sample is smaller than the corresponding minimum in the training set, the predicted value may be unreliable, i.e., extrapolation along such independent variables is not recommended. Conversely, I5 in the forward extrapolation validation of the SVM model has a high forward ED but a small RMSEtest(EV), meaning that this independent variable has little effect on the prediction reliability of the model when it exceeds the domain of definition of the corresponding training set; the predicted value of a sample can therefore be considered reliable when such independent variables are extrapolated.