Mathematical model extrapolation testing
To examine the extrapolation ability of ML models, models are developed for data with deterministic functional relationships (i.e., linear univariate, linear multivariate, and nonlinear multivariate) using 11 ML methods: multiple linear regression (MLR), least absolute shrinkage and selection operator (LASSO), ridge regression (Ridge), support vector machine (SVM), Gaussian process regression (GPR), multilayer perceptron (MLP), adaptive boosting (AdaBoost), extreme gradient boosting (XGBoost), random forest (RF), K-nearest neighbor (KNN), and gradient boosting decision tree (GBDT). A test (B) set and a test (F) set are set up to evaluate the extrapolation performance of the ML models: the test (B) set consists mainly of samples whose dependent variable lies below the minimum dependent variable of the training set, and the test (F) set of samples whose dependent variable lies above its maximum. In addition, a test (I) set, whose dependent variable falls within the range of the dependent variable of the training set, is used to validate the interpolation ability of the models.
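The three test sets can be sketched as follows; this is a minimal illustration with a hypothetical linear univariate relationship (y = 3x + 5) and arbitrary cut-offs, not the actual data generation used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear univariate relationship: y = 3x + 5
x = rng.uniform(0.0, 100.0, size=1000)
y = 3.0 * x + 5.0

# Training set covers the middle of the target range; the two
# extrapolation test sets lie strictly outside it.
in_range = (y >= 100.0) & (y <= 250.0)
x_train, y_train = x[in_range], y[in_range]

# test (B): dependent variable below the training-set minimum
x_test_b, y_test_b = x[y < 100.0], y[y < 100.0]
# test (F): dependent variable above the training-set maximum
x_test_f, y_test_f = x[y > 250.0], y[y > 250.0]
```

A test (I) set would simply be a held-out subset of samples whose dependent variable falls inside the training range.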
Based on the models established with the initial hyperparameters of the 11 ML methods (Figs. 1a, b, and c; Supplementary Information Tables S1 ~ S3), the regressors involving tree algorithms (i.e., RF, KNN, XGBoost, AdaBoost, and GBDT) show excellent predictive ability within the value domain of the training set, as confirmed by squared correlation coefficients (R2) close to 1 on both the training set and the test (I) set. However, for target values outside the value domain of the training set, their predicted vs. observed values form horizontal straight lines, and both R2test(B) and R2test(F) are 0. This suggests that ML models involving tree algorithms have strong interpolation ability but may lack extrapolation ability. Since hyperparameters have non-negligible effects on model performance, the hyperparameters of the models established by the 10 methods other than MLR are optimized. Even for the optimal models (Figs. 1d, e, and f; Supplementary Information Tables S4 ~ S9), the R2test(B) and R2test(F) of the models involving tree algorithms remain 0, ruling out a correlation between extrapolation failure and hyperparameter selection. Furthermore, the predicted values of the optimal MLR, LASSO, GPR, and MLP models are close to the observed values. Nevertheless, for data from linear or nonlinear multivariate functional relationships, the Ridge and SVM models show large prediction errors when observed values lie far outside the value domain of the training set, as evidenced by their R2test(B) failing to reach 1 (Figs. 1e and f).
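The "horizontal straight line" behavior is easy to reproduce; the sketch below, using scikit-learn's RandomForestRegressor on hypothetical linear data, shows that any two inputs below (or above) the training domain receive essentially the same prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Hypothetical linear relationship on x in [0, 10]
x_train = rng.uniform(0.0, 10.0, size=500).reshape(-1, 1)
y_train = 7.0 * x_train.ravel() + 2.0

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(x_train, y_train)

# Outside the training domain, predictions collapse to near-constant values:
below = rf.predict([[-5.0], [-50.0]])   # both land near the training-set minimum
above = rf.predict([[20.0], [200.0]])   # both land near the training-set maximum
```

However far the input moves from the training domain, the prediction does not change, which is exactly the flat predicted-vs-observed line with R2 of 0.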
During hyperparameter tuning of the regressors involving tree algorithms (Supplementary Information Figs. S1 ~ S3), all predictions in the test (B) and test (F) sets remain close to the minimum and maximum values of the training set, respectively. Notably, the AdaBoost, RF, and XGBoost models exhibit piecewise functional data relationships as the hyperparameters are varied; it is therefore conjectured that this piecewise behavior, i.e., constant output values over certain ranges of input values, may be the reason for the extrapolation failure of models developed by ML methods involving tree algorithms.
Revealing the reason for extrapolation-failure
To gain insight into the poor extrapolation ability of regressors involving tree algorithms, an RF model developed on data from the linear univariate functional relationship is visualized; it contains 10 decision trees (DTs), each of depth 4 (Fig. 2; Supplementary Information Figs. S4 and S5). Each node can be regarded as a dichotomy point in the decision-making process. For any input value below the domain of definition of the training set, every node evaluates to “True”, so the predicted value of each DT is the minimum of its value domain; for any input value above the domain of definition of the training set, every node evaluates to “False”, so the predicted value of each DT is the maximum of its value domain. The predicted value of the RF model is the average of the predicted values of all DTs. The example model, with only one independent variable, has a maximum predictive value of 979.9619 and a minimum predictive value of 417.0969, so its potential prediction range is [417.0969, 979.9619] (Supplementary Information Fig. S4). Thus, for ML methods involving tree algorithms, the range of potential predicted values for any independent variable is a closed interval. With multiple independent variables, the dependent variable is a combined transformation of values within these closed intervals, and the value domain constituted by the combined transformations of the potential minimum to maximum predicted values of all independent variables is likewise a closed interval, which may be the reason for the extrapolation failure of models developed by ML methods involving tree algorithms.
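The closed-interval bound can be checked directly: each DT can only output its leaf values, so the ensemble average is confined to the interval spanned by the per-tree extremes. A minimal sketch with scikit-learn (synthetic monotone data, hypothetical coefficients):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=400).reshape(-1, 1)
y = 5.0 * x.ravel() + 3.0   # hypothetical linear relationship

# 10 trees of depth 4, mirroring the structure of the visualized model
rf = RandomForestRegressor(n_estimators=10, max_depth=4, random_state=0).fit(x, y)

# For regression trees, internal-node values are means of their leaves, so the
# min/max over all node values equal the min/max leaf values of each tree.
tree_mins = [t.tree_.value.min() for t in rf.estimators_]
tree_maxs = [t.tree_.value.max() for t in rf.estimators_]
lower, upper = np.mean(tree_mins), np.mean(tree_maxs)

# Arbitrarily extreme inputs still map inside the closed interval [lower, upper]:
extreme = rf.predict([[-1e6], [1e6]])
```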
Extrapolation validation (EV) method
To quantitatively evaluate the extrapolation ability of a model, the extrapolation validation (EV) method is proposed. Each independent variable is serialized, and the training and test sets are then re-divided in that order at a fixed ratio. That is, the dataset is divided into a training (EV) set and a test (EV) set in the order of the serialized independent variable, e.g., the first 80% as the training (EV) set and the remaining 20% as the test (EV) set (Fig. 3a). This provides data support for using the performance on the test (EV) set (i.e., samples that are not entirely within the domain of application of the model) to evaluate the extrapolation ability of a model. To account for the contribution of all independent variables, the serialized leverage value35 (h; Methods) is also applied for dividing the training and test sets. Both forward and backward sequences are adopted, i.e., forward extrapolation validation and backward extrapolation validation. The extrapolation performance for each serialized independent variable is evaluated by the performance on the test (EV) set after re-fitting the model on the training (EV) set. Following this approach, all independent variables of the developed model are evaluated.
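The sequential split can be sketched as a small helper; the function name and signature are illustrative, not from the original work. The sort key may be one independent variable or the leverage value h.

```python
import numpy as np

def ev_split(X, y, sort_key, train_frac=0.8, forward=True):
    """Order samples by a serialized quantity (one independent variable,
    or the leverage value h) and split sequentially: forward EV trains on
    the low end and extrapolates upward; backward EV does the reverse."""
    order = np.argsort(sort_key)
    if not forward:
        order = order[::-1]
    n_train = int(round(train_frac * len(y)))
    tr, te = order[:n_train], order[n_train:]
    return X[tr], y[tr], X[te], y[te]

# Example: forward EV on the first independent variable
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0])
X_tr, y_tr, X_te, y_te = ev_split(X, y, sort_key=X[:, 0])
```

By construction, every test (EV) sample lies beyond the training (EV) range of the serialized variable, though the other variables may still interpolate.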
When serializing one independent variable for extrapolation, it is difficult to ensure that all independent variables of the test (EV) set lie outside the corresponding ranges of the training (EV) set, so the performance on the test (EV) set inevitably includes a contribution from interpolation. Hence, the extrapolation degree (ED; Eq. (1), Fig. 3a) is defined as a metric to assist in evaluating the extrapolation ability of the model. The ED quantifies the extent to which the independent variables of the test (EV) set lie outside the corresponding domain of definition of the training set, thereby digitalizing the contribution of extrapolation to the performance on the test (EV) set. Furthermore, as shown in Section 2.1 (Figs. 1e and f), for samples far from the domain of definition of the training set, the predicted values of Ridge and SVM models developed on linear or nonlinear multivariate functional relationship data differ significantly from the observed values. This emphasizes that the reliability of a model's predictions for out-of-distribution samples is related to the distance between the sample's independent variables and the domain of definition of the training set.
\(\text{ED}=\frac{1}{n_i}\sum\limits_{i}\left(\frac{\sum\limits_{j}e_{i,j}}{\sum\limits_{j}a_{i,j}}\right)\) Eq. (1)
\(e_{i,j}=\begin{cases}\min\limits_{j}\left(x_{i,j}^{\text{train}}\right)-x_{i,j}^{\text{test}}, & x_{i,j}^{\text{test}}<\min\limits_{j}\left(x_{i,j}^{\text{train}}\right)\\[2pt] x_{i,j}^{\text{test}}-\max\limits_{j}\left(x_{i,j}^{\text{train}}\right), & x_{i,j}^{\text{test}}>\max\limits_{j}\left(x_{i,j}^{\text{train}}\right)\\[2pt] 0, & \text{otherwise}\end{cases}\)
\(a_{i,j}=\begin{cases}\frac{1}{n_j}\sum\limits_{j}x_{i,j}^{\text{train}}-x_{i,j}^{\text{test}}, & x_{i,j}^{\text{test}}<\min\limits_{j}\left(x_{i,j}^{\text{train}}\right)\\[2pt] x_{i,j}^{\text{test}}-\frac{1}{n_j}\sum\limits_{j}x_{i,j}^{\text{train}}, & x_{i,j}^{\text{test}}>\max\limits_{j}\left(x_{i,j}^{\text{train}}\right)\\[2pt] 0, & \text{otherwise}\end{cases}\)
where i and j are the serial numbers of the independent variables and samples, respectively, and ni and nj are the numbers of independent variables and samples.
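A vectorized reading of Eq. (1) can be sketched as below (the function name is illustrative); e measures how far each test value falls outside the training range of its variable, a measures its distance from the training mean, and the per-variable ratios are averaged.

```python
import numpy as np

def extrapolation_degree(X_train, X_test):
    """Extrapolation degree (ED) per Eq. (1). Rows are samples (index j),
    columns are independent variables (index i)."""
    lo = X_train.min(axis=0)      # min_j of each variable in the training set
    hi = X_train.max(axis=0)      # max_j of each variable in the training set
    mean = X_train.mean(axis=0)   # (1/n_j) * sum_j of each variable

    below = X_test < lo
    above = X_test > hi

    # e_{i,j}: distance outside the training range (0 if inside)
    e = np.where(below, lo - X_test, 0.0) + np.where(above, X_test - hi, 0.0)
    # a_{i,j}: distance from the training mean for out-of-range values
    a = np.where(below, mean - X_test, 0.0) + np.where(above, X_test - mean, 0.0)

    e_sum, a_sum = e.sum(axis=0), a.sum(axis=0)
    # Variables with no out-of-range test samples contribute a ratio of 0
    ratios = np.divide(e_sum, a_sum, out=np.zeros_like(a_sum), where=a_sum > 0)
    return ratios.mean()
```

Since the training maximum is at least the training mean (and the minimum at most), each ratio lies in [0, 1]: ED near 1 means the test (EV) set is dominated by extrapolation, ED near 0 by interpolation.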
Adopting the EV method, the optimal models for the tested mathematical relationship data are evaluated (Figs. 3b, c, and d; Supplementary Information Figs. S6 and S7). Root mean squared error (RMSE) is adopted as the main statistical parameter in this work. For distinction, the RMSEs on the test (EV) set for the originally developed model and for the model re-fitted on the training (EV) set are denoted RMSEtest(Model) and RMSEtest(EV), respectively. For data from both linear and nonlinear relationships, the RMSEtest(EV) of models developed by methods involving tree algorithms, such as AdaBoost, RF, and GBDT, is large; for example, the RMSEtest(EV) of the models involving tree algorithms developed on data from linear univariate functions is always greater than 50 (Fig. 3b). Models established by methods such as MLP, GPR, and SVM show better extrapolation ability, with RMSEtest(EV) close to 0. The EV results indicate that models developed by ML methods involving tree algorithms have poor extrapolation ability, while those developed by methods not involving tree algorithms extrapolate well, consistent with the results of Section 2.1. Further, since the methods involving tree algorithms show small prediction errors during model development (i.e., RMSEtest(Model)) but large prediction errors in model application (i.e., RMSEtest(EV)), performance degrades in deployment, reducing trust in the ML model. A prior evaluation of extrapolation ability using the EV method therefore helps in selecting a trustworthy ML model.
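The contrast in RMSEtest(EV) between a tree-based and a non-tree-based regressor can be reproduced with a minimal sketch (synthetic linear univariate data with hypothetical coefficients, not the study's dataset):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 100.0, size=(1000, 1))
y = 3.0 * X.ravel() + 5.0   # hypothetical linear univariate relationship

# Forward EV split on the single independent variable:
# train on the lower 80%, test on the upper 20%
order = np.argsort(X[:, 0])
n_train = int(0.8 * len(y))
tr, te = order[:n_train], order[n_train:]

rmse_ev = {}
for model in (LinearRegression(), GradientBoostingRegressor(random_state=0)):
    model.fit(X[tr], y[tr])
    mse = mean_squared_error(y[te], model.predict(X[te]))
    rmse_ev[type(model).__name__] = mse ** 0.5
```

On such data the linear model's RMSEtest(EV) is essentially zero, while the GBDT's is large because its predictions are capped at the training-set maximum.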
Application of EV method
An application of the EV method is demonstrated with a PI-Tg model36 developed on a large dataset containing 1321 Tg values. Besides the MLR model developed in the literature (R2 = 0.8793, Q2 = 0.8718), 10 further models, such as MLP, RF, and GBDT, are established, all fully consistent with the literature in terms of independent variables (norm index, I), training set, and test set settings (Supplementary Information Table S10 and Fig. S8). The extrapolation ability of the 11 PI-Tg models is evaluated by the EV method (Figs. 4 and 5).
In the EV of the PI-Tg models, the RMSEtest(EV) of the forward and backward serialized extrapolation validations for models established by tree-involving algorithms is always larger than that for models established by non-tree-involving algorithms, for every I and for h (Fig. 4a). To evaluate the overall extrapolation ability of a model, the averages of the RMSEs over the independent-variable and h extrapolation validations, i.e., the average RMSEtest(EV) and the average RMSEtest(Model), are used as statistical parameters. The average RMSEtest(EV) of the MLP, MLR, Ridge, GPR, and LASSO models is around 20°C (Fig. 4b), which is close to the experimental measurement error and acceptable37. By contrast, the average RMSEtest(EV) of the models developed by the RF, KNN, GBDT, XGBoost, and AdaBoost methods is larger, at around 40°C (Fig. 4b), suggesting that these models involving tree algorithms have relatively poor extrapolation ability.
Furthermore, the standard deviation of the samples within the 95% confidence level interval (σ95) is introduced as a threshold for evaluating extrapolation ability. If the RMSEtest(EV) of an independent variable is greater than σ95, the prediction error of the model may exceed the difference between the actual value and the mean of the samples within the 95% confidence level interval. The RMSEtest(EV)s of the independent-variable extrapolation validations for models established by ML methods involving tree algorithms are all high, with several I extrapolation validations even approaching σ95 (60.35°C; Figs. 5f, g, h, i, and j): for instance, the forward extrapolation validations of I10, I18, I3, and I5 and the backward extrapolation validations of I9 and I8 for the AdaBoost model (Fig. 5f), and the forward extrapolation validations of I10 and I3 and the backward extrapolation validation of I9 for the XGBoost model (Fig. 5g). This indicates that the prediction error of such a model may even exceed AE when the above-mentioned independent variables of a sample lie far from the corresponding domain of definition of the training set.
The MLP, MLR, LASSO, GPR, and Ridge models have RMSEtest(EV) values for every I close to the corresponding RMSEtest(Model) (Figs. 5a, b, c, and e), indicating good predictive ability. In addition, applying these models to predict a new sample can be considered reliable when the ED of the predicted sample is smaller than the maximum ED of the extrapolation validation. I10, I13, and I9 of the SVM model have low backward EDs but large RMSEtest(EV); therefore, when applying this model, if the I10, I13, or I9 of a prediction sample is smaller than the corresponding minimum in the training set, the predicted value may be unreliable, i.e., extrapolation along such independent variables is not recommended. Conversely, I5 in the forward extrapolation validation of the SVM model has a high forward ED but a small RMSEtest(EV), meaning that this independent variable has little effect on the prediction reliability of the model when it exceeds the domain of definition of the corresponding training set; the predicted value of a sample can therefore be considered reliable when such independent variables are extrapolated.