A machine-learning algorithm for accelerating the maximum-likelihood tree search
Our goal is to rank all possible SPR neighbors of a given tree according to their log-likelihood without actually computing the likelihood function. To this end, we rely on a set of features that capture essential information regarding an SPR tree rearrangement and that can be computed efficiently. Specifically, we trained a random-forest regression model to predict the ranking of all possible SPR modifications according to their effect on the log-likelihood score. The algorithm was trained on a large set of known examples (data points). In our case, each data point is a pair (V, L), where V is an array comprising the starting tree, the tree resulting from an SPR move, and the set of features, and L is a function of the log-likelihood difference between the starting and resulting trees (see Methods). The regression model learns the association between V and L. Given a trained model and a starting tree topology, an array V is computed for each possible SPR move. The trained machine-learning algorithm then ranks all possible SPR moves according to their predicted L values. A perfect machine-learning model would predict the optimal SPR neighbor and would thus eliminate the need for expensive likelihood computations. A sub-optimal predictor may also be highly valuable, provided that the vast majority of SPR moves can be safely discarded without computing their likelihoods.
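To make the regression setup concrete, the following minimal sketch shows how such a ranking could be produced with scikit-learn's RandomForestRegressor. All arrays, dimensions, and hyperparameters here are illustrative placeholders, not the configuration used in this study (see Methods for the actual feature definitions and label transformation).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder training data: one 20-dimensional feature row per SPR move
# (standing in for the V arrays described above) and one label per move
# (standing in for L, a function of the log-likelihood difference).
rng = np.random.default_rng(0)
X_train, y_train = rng.random((10000, 20)), rng.random(10000)

model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)

# For a new starting tree, compute the feature vector of every possible
# SPR neighbor and rank the moves by their predicted labels, best first.
X_moves = rng.random((500, 20))
predicted_ranking = np.argsort(-model.predict(X_moves))
```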
The machine-learning algorithm was trained on 16,991,941 data points, one for each possible SPR move across 4,300 different empirical phylogenies. The empirical alignments varied in the number of sequences (7 to 70), the number of positions (62 to 50,000), and the extent of sequence divergence. The number of neighbors of each tree depends on the number of sequences and on the tree topology, and ranges from a few dozen to over ten thousand. We chose to analyze empirical rather than simulated data, as it is known that reconstructing the best tree is more challenging for the former11–13. The learning was based on 20 features extracted from each data point (Table 1). The first eight features were extracted from the starting and resulting trees, e.g., the lengths of the branches at the pruning and regrafting locations. The remaining features were generated from the subtrees induced by the SPR move, e.g., the sum of branch lengths of the pruned subtree (Fig. 1).
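As an illustration of how subtree-based features of this kind can be computed efficiently, the sketch below derives two such quantities with the ete3 library; the Newick string, node-naming scheme, and exact feature definitions are assumptions for demonstration purposes only.

```python
from ete3 import Tree

def subtree_features(tree_newick, prune_node_name):
    """Illustrative versions of two of the 20 features: the total branch
    length and the longest branch of the subtree pruned at a given node.
    The pruned branch itself is excluded, since its length is treated as
    a separate feature (Table 1)."""
    tree = Tree(tree_newick, format=1)  # format=1 keeps internal node names
    pruned = tree.search_nodes(name=prune_node_name)[0]
    branch_lengths = [n.dist for n in pruned.traverse() if n is not pruned]
    return sum(branch_lengths), max(branch_lengths)

# Toy four-taxon tree; pruning at internal node "AB" yields (0.3, 0.2).
total_bl, longest_bl = subtree_features(
    "((A:0.1,B:0.2)AB:0.05,(C:0.3,D:0.1)CD:0.04);", "AB")
```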
Performance evaluation
We evaluated the performance of our trained learner in a ten-fold cross-validation procedure. Namely, the empirical datasets were divided into ten subsets, such that in each of the ten training iterations, the data points induced by nine folds were used for training the model and the remaining data points were used for testing. We first evaluated the accuracy of the learner in ranking alternative SPR moves. The Spearman rank correlation coefficient (ρ) was computed between the true ranking, inferred through full likelihood optimization, and the predicted ranking, based on the machine-learning predictions. The mean ρ, averaged over all 4,300 samples, was 0.93 (Fig. 2a), suggesting that the machine-learning algorithm successfully discriminates between beneficial and unfavorable SPR moves.
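A minimal sketch of this evaluation scheme follows, with placeholder data; grouping by source phylogeny, so that all moves of one tree fall into the same fold, is our reading of the per-dataset split described above.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold

# Placeholder data; `groups` assigns every SPR data point to its source
# phylogeny so that all moves of one tree stay in the same fold.
rng = np.random.default_rng(0)
X, y = rng.random((5000, 20)), rng.random(5000)
groups = rng.integers(0, 50, size=5000)

rhos = []
for tr, te in GroupKFold(n_splits=10).split(X, y, groups):
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[tr], y[tr])
    pred = model.predict(X[te])
    # rho is computed per starting tree (group) and then averaged,
    # mirroring the mean rho over the 4,300 samples reported above.
    for g in np.unique(groups[te]):
        mask = groups[te] == g
        rhos.append(spearmanr(y[te][mask], pred[mask])[0])
print(np.mean(rhos))
```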
Notably, the Spearman correlation quantifies the prediction performance when all SPR neighbors are considered. However, in a typical hill-climbing heuristic, the single best SPR neighbor is chosen as the starting tree for the next step. It is thus of interest to estimate the ability of the algorithm to predict this best neighbor. Accordingly, we measured the performance of the trained algorithm using two additional metrics: (1) the rank of the true best move within the predicted ranking; (2) the rank of the predicted best move within the true ranking, as obtained by full likelihood optimization. Our results indicated that in 63% and 91% of the datasets, the best move was among the top 10% and 25% of predictions, respectively (Fig. 2b). In 89% and 97% of the datasets, the top-ranked prediction was among the top 10% and 25% of SPR moves, respectively (Fig. 2c). Moreover, in 99.99% of the cases, the top prediction resulted in a higher likelihood than the starting tree, indicating that an improvement is typically obtained. In contrast, a random move increased the likelihood score in only 3.7% of the datasets. These results underscore the potential of the machine-learning algorithm to direct the tree search to a narrow region of tree space, thus avoiding numerous expensive likelihood calculations.
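Both metrics can be computed directly from the true and predicted scores, as in the illustrative helper below; normalizing ranks by the number of neighbors to obtain the percentile thresholds is our assumption.

```python
import numpy as np

def rank_metrics(true_ll, pred_ll):
    """(1) Fractional rank of the truly best move within the predicted
    ranking; (2) fractional rank of the predicted best move within the
    true ranking. Both rankings place the best move first."""
    pred_order = np.argsort(-np.asarray(pred_ll))
    true_order = np.argsort(-np.asarray(true_ll))
    rank_of_true_best = np.where(pred_order == true_order[0])[0][0]
    rank_of_pred_best = np.where(true_order == pred_order[0])[0][0]
    n = len(true_ll)
    return rank_of_true_best / n, rank_of_pred_best / n

# Values below 0.10 correspond to the "top 10%" thresholds reported above.
m1, m2 = rank_metrics(np.random.randn(500), np.random.randn(500))
```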
We next evaluated the trained model on datasets entirely distinct from the training data (see Methods). The machine-learning algorithm was not optimized on these data, not even in cross-validation, thus ruling out possible overfitting effects. When applied to these validation data, the performance of the trained model was very similar to that reported above using cross-validation (ρ = 0.91; Supplementary Fig. 1), suggesting that the machine-learning algorithm generalizes well to diverse datasets that evolved under an array of evolutionary scenarios.
To gain further insight into factors affecting prediction accuracy, we analyzed whether the accuracy is affected by: (1) the number of taxa; (2) the level of divergence, as measured by the sum of branch lengths; (3) the six databases used (four for training and two for validation); (4) the amount of training data. No significant correlation was observed between ρ and the number of taxa or between ρ and the level of sequence divergence (Supplementary Figs. 2a and 2b). Among the six databases, predictions were most accurate for Selectome, with a mean ρ of 0.95, and least accurate for ProtDBs, with a mean ρ of 0.83 (Supplementary Fig. 2c). Finally, our results imply that increasing the number of training datasets above 4,300 does not significantly increase the accuracy (P > 0.24 for an ANOVA test comparing training on 4,300 datasets with training on 6,000; Supplementary Fig. 3).
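The last comparison reduces to a one-way ANOVA over per-dataset ρ values, as sketched below with simulated placeholders rather than the study's measurements.

```python
import numpy as np
from scipy.stats import f_oneway

# Simulated per-dataset rho values under the two training-set sizes
# (placeholders, not the study's data).
rng = np.random.default_rng(0)
rhos_4300 = rng.normal(0.93, 0.05, size=200)
rhos_6000 = rng.normal(0.93, 0.05, size=200)

# A large p-value indicates no significant gain from more training data.
f_stat, p_value = f_oneway(rhos_4300, rhos_6000)
```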
Performance evaluation on an example dataset: the SEMG2 gene in primates
We exemplify the application of the machine-learning algorithm on a specific dataset, consisting of 28 primate sequences of the reproductive gene SEMG2 (see Methods). We reconstructed a starting tree (Fig. 3a), generated all of its 2,345 SPR neighbors, and ranked them according to their log-likelihoods. We then compared this ranking to the ranking predicted by the trained machine-learning algorithm. The Spearman rank correlation (ρ) between the true and predicted rankings was 0.91, similar to the average ρ reported for both the training and validation data. Indeed, the best move was among the top 6% of predictions, and the best SPR move predicted by the model was the second-best possible move. Furthermore, the best SPR move and the predicted best SPR move led to rearrangements in the same clade of the phylogenetic tree, containing nine species (Fig. 3).
As explained in the Methods section, our algorithm ranks neighboring trees by predicting their log-likelihoods. While the ultimate goal is to predict the ranking of the possible SPR moves in order to limit the search space, focusing on one example enables inspection of the actual predicted change in log-likelihood between each potential resulting tree and the starting tree. For this example, a squared Pearson correlation of R² = 0.88 between the predicted and true change in log-likelihood was observed (the full list of predicted and true log-likelihood differences for all 2,345 single-step SPR moves is given in Supplementary Data 1). The predicted best move improved the log-likelihood of the initial tree by 22.8, similar to the improvement of 23.4 obtained by the best SPR move. Moreover, according to our model, 97 and 2,248 SPR moves were predicted to increase and decrease the log-likelihood, respectively, and these predictions were correct for 66% and 92% of the cases, respectively. These results corroborate the potential of the machine-learning approach to correctly discard many irrelevant SPR neighbors.
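The quantities in this paragraph can all be derived from two vectors of log-likelihood differences, as in the following sketch (placeholder values stand in for those of Supplementary Data 1).

```python
import numpy as np
from scipy.stats import pearsonr

# d_true, d_pred: true and predicted log-likelihood differences of each
# neighbor relative to the starting tree (placeholders shown here).
rng = np.random.default_rng(1)
d_true = rng.normal(size=2345)
d_pred = d_true + 0.35 * rng.normal(size=2345)

r, _ = pearsonr(d_true, d_pred)
print(r ** 2)                            # squared Pearson correlation

pred_up = d_pred > 0                     # moves predicted to improve
print((d_true[pred_up] > 0).mean())      # fraction of correct "improve" calls
print((d_true[~pred_up] <= 0).mean())    # fraction of correct "worsen" calls
```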
In addition, we measured the running time for evaluating the 2,345 neighboring trees for this example. The computation of the features and the application of the trained model took 1.1 × 10⁻³ seconds per neighbor on average. The likelihood computation took 40.49 seconds per neighbor on average, roughly 35,000 times longer than the machine-learning algorithm.
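Per-move prediction time can be measured as below; the model and feature matrix are placeholders, and the likelihood computation itself, performed with an external phylogenetic library in the study, is not reproduced here.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder model and one feature row per neighbor.
rng = np.random.default_rng(0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(rng.random((1000, 20)), rng.random(1000))
X_moves = rng.random((2345, 20))

t0 = time.perf_counter()
model.predict(X_moves)                             # score all neighbors in one call
print((time.perf_counter() - t0) / len(X_moves))   # seconds per move
```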
We next examined whether the high performance of the trained model is maintained when it is applied to other intermediate trees along the chain towards the maximum-likelihood tree. When applied to the second phase of the search, i.e., starting from the best possible neighbor of the initial tree, the trained model yielded results highly similar to those reported for the initial tree (Spearman correlation coefficient of ρ = 0.91). Notably, the best move according to the predictions increased the true log-likelihood score by 0.039, implying that the likelihood improvement is maintained over additional SPR steps. Finally, we examined the performance of the algorithm when the initial tree is one step away from the maximum-likelihood tree. To this end, we selected a neighbor of the maximum-likelihood tree by applying a random SPR move to the latter. When applied to this tree, the model predicted the maximum-likelihood tree as the best possible neighbor, with a log-likelihood improvement of 0.007 (ρ = 0.92).
Feature importance
The prediction capabilities of the random forest algorithm are determined by the provided features. In turn, the relative contribution of each of these features to the predictive model, termed ‘feature importance’, can be extracted. In our implementation, the feature that contributed most to the prediction accuracy was the sum of branch lengths along the path between the pruning and regrafting locations, while the second-best feature was the number of nodes along that path (for the importance values of all features, see Supplementary Table 1). These findings provide some justification for the common practice of considering only local changes in various tree search heuristics6,8,14. The next two features were the length of the pruned branch and the longest branch in the pruned subtree. The fifth feature was the estimated sum of branch lengths of the resulting tree, a feature previously suggested as a single filtering criterion by Hordijk and Gascuel7.
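If the model is a scikit-learn random forest, these importance values are exposed directly on the fitted regressor; the sketch below prints them in descending order, with placeholder data and illustrative feature labels standing in for Table 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative labels for the 20 features (see Table 1 for the real ones).
feature_names = [f"feature_{i}" for i in range(20)]

rng = np.random.default_rng(0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(rng.random((1000, 20)), rng.random(1000))

# Impurity-based importances: one value per feature, summing to 1.
for idx in np.argsort(-model.feature_importances_):
    print(feature_names[idx], round(model.feature_importances_[idx], 3))
```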
Many common tree search heuristics utilize a single feature to limit the scope of inspected neighbors. We thus exploited the devised framework to examine whether the use of a single feature leads to similar performance. To this end, we trained 20 random forest models on the training set, such that each model accounted for a single feature. The performance of each of these models provided a measure of the predictive power of each feature, independent of the others. The best single-feature model obtained a Spearman correlation coefficient of ρ = 0.706 on average across the training set and was based on the number of nodes in the path between the pruning and regrafting locations, a feature that was ranked second when the entire set of features was used for training (Supplementary Table 1). The average ρ for each of the remaining features was below 0.47 (Supplementary Table 2). These observations, together with the substantial increase in average ρ when the entire set of features is used (average ρ of 0.93) rather than any single feature, highlight the benefit of relying on a large set of features that together provide a more informative prediction.
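Such a single-feature comparison can be set up by retraining the same model class on one feature column at a time, as sketched below with placeholder data.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor

# Placeholder training and held-out data.
rng = np.random.default_rng(0)
X, y = rng.random((5000, 20)), rng.random(5000)
X_test, y_test = rng.random((1000, 20)), rng.random(1000)

# Train one model per feature, keeping a single column each time, and
# score it by the rank correlation of its predictions on held-out data.
for j in range(X.shape[1]):
    m = RandomForestRegressor(n_estimators=100, random_state=0)
    m.fit(X[:, [j]], y)
    rho, _ = spearmanr(y_test, m.predict(X_test[:, [j]]))
    print(j, rho)
```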