True phylogenies can be inferred from incongruent sites
Several studies have demonstrated that gene trees are often incongruent with core genome phylogenies, opening the possibility that core genome phylogenies might correspond to artifacts [18, 36, 37]. It has been observed that most sites in the core genome can be incongruent with the overall tree topology and this has been considered as evidence that the reconstructed trees are not representative of the true evolutionary history of the strains [35]. We first tested whether true phylogenetic trees can be recovered when nearly all sites of the core genome are inconsistent with the overall tree topology. As a proof-of-concept we evolved a 2Mb-long core genome in silico following a branching process. In order to obtain a realistic tree and core genome, we used the phylogenetic tree generated on a real dataset previously published [38]. We chose the tree and the parameters of the core genome of Acinetobacter pittii to conduct the simulations because the average bootstrap supports of this tree were closest to the average bootstrap support of our set of trees (average bootstrap support of 89%, see below). The real tree topology and branch lengths were used to simulate the evolution of the core genome evolving clonally (i.e. only by mutations). As expected, building the phylogenetic tree on the simulated dataset results in the same tree that was used to simulate the evolution of the core genome. Since the simulations were run from a real dataset, two nodes were poorly resolved in the simulated tree (these nodes were also poorly resolved in the real tree).
We then used the core genome of the clonal simulation to generate a new core genome alignment while introducing exactly one random recombination event at each polymorphic site with phylogenetic signal (i.e. all sites with two or more alleles, except singletons). For instance, for a site composed of two alleles A and C (each present in at least two strains), we randomly picked one of these strains and re-affected the other allele (a C instead of an A or the opposite). Our procedure resulted in exactly one recombination event at each site containing phylogenetic signal and, as expected, the vast majority of alleles (96.4%) became incongruent with the phylogeny obtained from the clonal simulation (the true phylogeny). Note that, by chance, a few recombinant alleles were still congruent with the true phylogeny when the allele was transferred to a sister strain. We then built the phylogenetic tree from this recombined alignment and, despite the fact that only 3.6% of informative alleles support the true phylogeny, we obtained the same tree topology as the real tree (Fig. 1). A single node was incongruent in the two trees, but this node was one of the two unresolved nodes in the clonal and the true phylogeny. These results indicate that even when nearly all informative alleles are incongruent with the true phylogeny, it is still possible to retrieve the true phylogeny of the core genome. It is frequently observed that most gene trees are incongruent with the core genome phylogeny [39], but it does not necessarily imply that the true phylogeny cannot be recovered. Although this result might seem counterintuitive, it can be attributed to the fact that the simulated recombination events were random and only affected two strains at a time. As a consequence, each site still retains most of the phylogenetic signal, i.e. the strains grouped together by each site of the core genome were indeed more related to each other except for the one strain that was re-assigned. Combining the signal of all sites together allowed for the retrieval of the true phylogenetic signal of the core genome. This result demonstrates that it is theoretically possible to recover correct phylogenetic signals even when all sites are incongruent with the true phylogeny of the core genome.
Impact of recombination rate on tree inference
The previous result indicates that correct phylogenetic trees can be built from core genomes even though virtually every informative site can be incongruent with the true phylogeny. However, these simulations were conducted by introducing exactly one recombination event at each informative site and it is difficult to compare this analysis to the actual levels of recombination occurring across prokaryote species. To assess the effect of recombination rate on our ability to reconstruct accurate phylogenies in prokaryotes, we simulated the evolution of core genomes with various levels of recombination using CoreSimul [40]. In order to mimic realistic conditions, we used a previously published set of 100 core genomes encompassing 13 major prokaryotic phyla (99 bacteria and one archaeon), different lifestyles (e.g. free living, commensal, obligate intracellular) and various genome sizes (from 913Kb to 9.2Mb). Detailed information about the dataset is provided TableS1. The maximum likelihood tree and the core genome of each species were used to infer the simulation parameters: GC-content, transition/transversion ratio, uneven substitution rates across codon positions, branch-specific substitution rates and tree topology (see Methods). Recombination rates r were expressed relative to mutations rates following a Poisson process. For each simulation, r was maintained as constant relative to the substitution rate across the branches of the tree. The donor and recipient genomes were randomly chosen between branches of the tree that overlapped in time. Each recombinant fragment size was pulled from a geometric distribution of mean 100bp and all the sites of the core genome had the same probability to recombine. Following this procedure, we simulated 100 core genome alignments with recombination rates varying from r = 0 to r = 10 (with a step of 0.1) for each species. Recombination rates r/m were then transformed into effective recombination rates r/m (see Methods), which represents the effective number of alleles exchanged by recombination relative to substitutions introduced by mutations. Note that we generated the simulation parameters with a range of r values varying from 0 to 10 and a fixed value of d in order to simulate genome evolution with an increasing range of r/m values in line with r/m values reported across species and studies [14, 41]. For each simulation, the phylogenetic tree was built from each simulated core genome, and the topologies of the simulated trees were compared to the true topology of the tree. We defined the tree topology score (TTS) as the percent of internal monophyletic clades shared by both trees (i.e. the percent of internal nodes that are composed of the same sets of strains). Average r/m and TTS values across the 100 species are presented Fig. 2; all r/m and TTS combined values across the 100 species are presented Fig. S1A and r/m and TTS values are presented individually for each species Fig. S2). Results are also presented for recombination rates expressed as r/m in Fig. S3.
Our results revealed that while many trees are substantially impacted by recombination rates, the phylogenetic signal is not completely lost (Fig. 2). Tree topologies sharply declined for r/m values between 0 and 2; however, even high rates of recombination did not completely erode the phylogenetic signal of the core genomes. Indeed, the average tree topology score plateaus around 50% topology similarity even for high rates of recombination (average r/m=3). This conservation of tree topologies was much higher than expected by chance (TTS =3% for comparisons against randomized tree topologies), indicating that even high levels of recombination are not completely obliterating the phylogenetic signal in these sequences. However, due to the variation of polymorphisms across species, several species presented much higher effective recombination rates (r/m) than others (i.e. more alleles were exchanged for the same number of recombination events) and those species with very high recombination rates (r/m>7) yielded trees that were completely inconsistent with the true phylogeny (TTS=0%, Fig. S1A and S2). Note that simulations with high recombination rates (r/m>2) represent cases where alleles are, on average, exchanged multiple times. In contrast to our results, a previous study found that core genome phylogenies were extremely robust to recombination [33], but this discrepancy likely stands from the fact that the cited analysis was conducted under more modest recombination rates compared to our study.
Impact of recombination rate on bootstrap supports
We tested to what extent recombination rates impact the bootstrap supports of the simulated trees. To do so, we computed the bootstrap supports of the maximum likelihood trees for the simulated core genomes of A. pittii. We observed a rather complex relationship between average bootstrap supports and recombination rates (Fig. 3A and Fig. S4), but we found an average bootstrap support of 82 (over 100) across all the simulated trees. Interestingly, average bootstrap supports were the highest for the trees simulated with either low recombination rates (r/m<1) or high recombination rates (r/m>3). Many of the simulated trees presented average bootstrap supports higher than the average bootstrap support observed in the real tree (89, Fig. 3A, blue solid line). In addition, we did not observe a significant correlation between the average bootstrap support of the simulated trees and their topology score relative to the real tree (Fig. 3B, Spearman’s Rho=0.02, P=0.88). These results indicate that recombination can result in incorrect trees presenting high bootstrap supports and that bootstrap supports therefore provide little information about the accuracy of the tree. In light of these observations, we recommend that the interpretation of bootstrap supports should be taken with care when assessing the quality of core genome phylogenies.
Impact of phylogenetic methods on core genome phylogenies
We further tested which phylogenetic methods were the most adapted to reconstruct core genome phylogenies in the presence of recombination. We asked whether core genome phylogenies are best inferred by maximum likelihood (ML) methods or distance approaches using RAxML and BIONJ, respectively. We simulated core genome evolution using parameters of A. pittii under various recombination rates. Each simulated dataset was then used to reconstruct a ML and a distance-based phylogenetic tree. Each tree topology was then compared to the topology of the real tree and the tree topology score was computed as the number of shared nodes between the trees (Fig. 4A-B). Tree topology scores were found highly correlated, indicating that both methods yielded trees of overall similar quality. However, we observed that ML trees performed slightly but significantly better than distance trees, (P<0.003, Wilcoxon signed-rank test, Fig. 4C). The better performance of ML trees was only observed when recombination rate was relatively low (r/m<1).
Impact of tree resolution on simulations
Our simulations produced a wide range of trees with various levels of topological similarities relative to the real trees, i.e. some datasets simulated with high recombination rates presented very similar topologies to the real trees, whereas others yielded inconsistent trees. We first hypothesized that these disparities were due to different levels of resolution across our dataset of real trees, since it is expected that poorly resolved trees are less likely to match the topology of the simulated trees. As expected, we found a strong positive correlation between the overall resolution of real trees (defined as the average bootstrap supports of all internal nodes) relative to the average topology score computed across all simulations (Fig. 5), revealing that the real trees that were originally poorly supported (low average bootstrap support), were most strongly affected by recombination. This indicates that many of the simulated trees were sensitive to recombination because the real tree was not robustly resolved.
Many simulated core genomes yielded inaccurate trees due to the poor resolution of the original tree; however, other core genomes yielded simulated trees that were strongly affected by recombination, although the original trees presented high bootstrap supports (average ≥90, Fig. S1B and S2). In contrast, some core genomes produced well-resolved trees that matched the topologies of their simulated data even when exposed to high levels of recombination. This observation indicates that highly recombining species can yield inaccurate trees with high bootstrap supports. Our results indicate that even for well-supported trees (average bootstrap ≥90), less than 50% of the nodes are correctly inferred for r/m > 1.5. However, the robustness of a tree to recombination rate varied a lot across our dataset of species with well-supported trees (Fig. S2), suggesting that other factors are influencing the robustness of trees to recombination rate.
Recombination disrupts tree topology when selection is strong
We further investigated the potential factors affecting the topology of the trees. From our set of trees with high average bootstrap supports (≥90, n=65), we analyzed the impact of multiple parameters on tree quality. We observed that two parameters were significantly associated with the overall quality of the trees: the average branch length of the trees (Fig. 6. Spearman’s Rho=-0.61, P<10-6) and the selective constraints as estimated by dN/dS ratios (Spearman’s Rho=0.53, P<10-5). Note that these two parameters are not independent from one another since dN/dS is a time-dependent metric and more divergent genomes are known to present lower dN/dS estimates [42]. Nevertheless, these results suggest that, typically, more divergent core genomes and/or species experiencing stronger purifying selection are more likely to yield incorrect trees when affected by substantial levels of recombination. In theory, because the r/m metric of recombination rate is expressed relative to the substitution rates along the branches of the trees, the impact of recombination on tree topology should be independent from the overall branch length of the trees. For this reason, we hypothesize that differential selective pressures are most likely to be responsible for the uneven robustness of core genome phylogenies to recombination. We showed above that recombination does not impact the tree topology even when each site has been recombined randomly (Fig. 1). This could be explained by the fact that, under weaker selective pressures, alleles are more likely to be randomly exchanged across strains. In contrast, the species evolving under stronger selective pressures likely undergo more biased exchanges of alleles (i.e. though epistasis and selection against allele incompatibilities). Other parameters such as GC-content and the number of taxa in each tree were not significantly associated with tree robustness (Fig. S5).
Estimating the overall accuracy of core genome phylogenies
The range of recombination rates used in our simulation is higher than what most studies have reported for prokaryotes [14, 43]. It is difficult to give an overall estimate of recombination rate in prokaryotes, since these values are species-specific and different methodologies often give inconsistent estimates. Nevertheless, many species have been estimated to recombine with a rate ranging from 0 to 3, with a majority of species around r/m=1 [14]. If we consider that a typical prokaryote species recombines with r/m=1, we can anticipate that ~70% of the nodes of the reconstructed trees match the real tree topology. Under these approximate numbers, this would indicate that core genome phylogenies are generally correct for most prokaryote species, albeit not completely accurate.
Caveats and limitations
Modelling homologous recombination in prokaryotes is a complex endeavor and reported recombination rates can vary greatly across datasets and methods [43]. We simulated the evolution of prokaryotic core genomes under a various set of recombination rates, but it remains difficult to estimate what constitutes typical recombination rates in these organisms. In contrast to a previous study [33], our set of parameters covered a broader range of recombination rates, which likely explains the difference in our conclusions. The recombination rates that we simulated appear more in line with rates typically reported in prokaryotes [14], but our analyses suggest that such results, inferred with phylogeny-based methods, are unlikely to yield accurate estimates when recombination rate is high and selection is strong. By using an empirical set of core genome phylogenies from prokaryote species encompassing diverse phyla, genome sizes and lifestyles, we attempted to simulate core genome evolution with recombination under more realistic conditions. However, recombination processes might be more complex in natural conditions, since rates of recombination can change over time and are likely impacted by population structure and ecology. Overall, more effort is needed to fully understand the relationship between recombination rate and population structure.