A coarse-graining approach to phylogenetic reconstruction
Figure 1 gives a brief illustration on how the proposed coarse-graining phylogenetics (CGP) algorithms fits the distribution of local single site polymorphisms (SSPs) density of the genome pairs to infer their phylogenetic tree, forgoing the reconstruction of ancestral recombination graph. In short, CGP is based on a mathematical model [5, 6] that quantitatively describes the evolution of genomic sequence divergence; this model is applicable to both nucleotide sequences and amino acid sequences, and does not assume low recombination rate. Recombination can introduce DNA stretches characterized by a high density of substitutions, and the model considers substitution densities defined on the genomic segments. A nucleotide sequence alignment (or a corresponding concatenation of amino acid sequence alignments) is divided into a chain of consecutive, non-overlapping segments, each with ls sites; for a pair of genomes, the single site polymorphisms (SSPs) on each segment are counted, resulting in an SSP distribution. CGP takes the SSP distribution of every pair of considered genomes as input. The coalescent time of two genomes can be inferred by fitting the CGP model to the empirical SSP distribution. The ultrametric tree describing the vertical component of inheritance among n genomes can be inferred from the coalescent times resulting from the fits to the n(n-1)/2 empirical SSP distributions, implemented by the score function of Eq. (3). We developed the CGP algorithm, which employs Monte Carlo simulation to sample the model+tree space, identifying the tree and parameters that result in the highest score. As an example, Figure 2 compares the phylogenetic trees reconstructed by CGP and RAxML, and Figure S3 for trees reconstructed by more different algorithms.
Testing the CGP algorithm on simulated genomes
We performed Fisher-Wright simulations with recombination to generate genome sequences, allowing us to test different phylogenetic reconstruction algorithms. In the simulation, each recombination-attempting DNA stretch starts at a random site of a genome, with equal chance to be either a micro (geometric distribution, mean 100bp) or a macro stretch (geometric distribution, mean 1kb). We used three sets of parameters that correspond to prokaryotic populations with r/m=2, 40, 80 (r/m is the ratio between substitutions contributed by mutations and by recombinations; these three settings are denoted as low, intermediate, and high recombination levels, respectively), and prepared the test groups, each with 4-10 genome sequences. For comparison, r/m values observed in nature range from 0.02 to 63.6 [11].
The MRCA of a group of random genomes in a simulated population has an average age close to the age of the population root node troot. We would like to mimic the condition where a single lineage diverges from the rest of the population and forms its own subpopulation, so that the genomes in its subpopulation continue exchanging DNA among themselves and with the rest outside. Hence, when picking genomes in the population to form test groups, we constrained the age of the MRCA of the genomes in a test group, ttest-group-root, to be ttest-group-root≪troot (see Supplementary Text in Additional File 5 for the details, and Additional File 4 for the genome sequences in each test-group).
We applied CGP, as well as the previously published methods RaxML [20], ClonalFrameML [14], and Gubbins [17] to the sequences of each test group. The RAxML and Gubbins trees are midpoint-rooted. ClonalFrameML requires an initial tree as input and we used the RAxML tree. CGP uses segment size ls=150 and lscutoff=100. We compared each reconstructed tree with the true tree, measuring their unrooted symmetric distance (SD) [21], as well as their rooted and unrooted branch score distance (BSD) [22] to quantify the accuracy of the reconstructed phylogeny (see Additional File 1 for the these values); the lower the unrooted SD / unrooted BSD / rooted BSD, the more accurate is the topology / branch lengths / root positioning, respectively. We normalized the branch lengths of each tree by its total branch length when calculating BSD.
CGP can predict the topology of a phylogeny of vertical inheritance as accurately as the other algorithms. We quantified the topological deviation between two trees by unrooted SD, which is defined as the number of phylogenetic clusters that are found in only one of the two; hence, an unrooted SD value of 0 corresponds to correct topological prediction. Figure 3 shows the histogram of unrooted SD values of different algorithms; ClonalFrameML is excluded as it uses the topology of RAxML trees. The rate of correct prediction decreases with increasing level of recombination for all algorithms; at low, intermediate, and high recombination levels, CGP’s correct prediction rate is 96%, 94%, and 84%, respectively. In addition, the distributions of unrooted SD of CGP are not significantly different from the distributions of RAxML and Gubbins. Two-sided Wilcoxon signed-rank tests (WSRT) at low, intermediate, and high recombination levels resulted in p=0.25, 0.69, 0.54 between CGP and RAxML, and p=0.25, 0.38, 0.92 between CGP and Gubbins.
Branch length predictions are more accurate with CGP than with alternative programs at a higher level of recombination. Figure 4 plots the distributions of the unrooted BSD of different algorithms and their z-scores; the unrooted BSD of trees reconstructed by different algorithms on the same test group sequences are pooled together to calculate their z-scores to help data visualization. The distribution of the unrooted BSD of CGP is significantly lower than RAxML and Gubbins (p<10-10 at low, intermediate, and high recombination levels compared to both RAxML and Gubbins). The unrooted BSD of CGP is significantly lower than ClonalFrameML except at low recombination level (p=0.76, 2.2×10-7, 10-7 at low, intermediate, and high recombination levels).
CGP can perform accurate root positioning. Figure 5 plots the distribution of the rooted BSD and their z-scores; the rooted BSD of trees reconstructed by different algorithms on the same test group sequences are pooled together to calculate their z-scores. The distribution of the rooted BSD of CGP is significantly lower than the other algorithms (p<2×10-17 at all recombination levels for CGP compared with the other three algorithms).
Testing the CGP algorithm on real E. coli genomes
We tested CGP, RAxML, ClonalFrameML, and Gubbins using E. coli and Shigella genome sequences (see Supplementary Table S1 in Additional File 5 for their names); we refer to them as E. coli, as these two species have intertwined phylogenies. We prepared test groups, each with 10 random strains, where each strain is represented by a nucleotide sequence and an amino acid sequence made from its concatenated core genes (see Additional File 2 for the strains in each test group, and also the 1,636 orthologous gene families of core genes; see Additional File 4 for their sequences). We applied CGP, RAxML, ClonalFrameML (with the topology from RAxML trees), and Gubbins on the nucleotide sequences, and CGP and RAxML on amino acid sequences; the RAxML and Gubbins trees are midpoint-rooted. CGP uses segment length ls=150 and lscutoff=100 for nucleotide sequences, and ls=50, lscutoff=50 for amino acid sequences.
We analysed the time for reconstructing each tree to appraise the computational cost of different algorithms (see Supplementary Text in Additional File 5). We found that our CGP is the slowest among the algorithms for reconstructing nucleotide sequences, and also among those for amino acid sequences (Figure S4), as our script is developed as a proof-of-concept and has not be optimized. In general, the run time of our algorithm is slower for nucleotide sequences than amino acid sequences, which is a consequent of our choice of parameters. Since we used a segment size 150 for nucleotide sequences and 50 for amino acid sequences, it involves the operations of the larger 150×150 matrices for nucleotide sequences, which is computationally more demanding compared with the smaller 50×50 matrices for amino acid sequences. Further, while the divergence between amino acid sequences is lower than nucleotide sequences, we assumed, for simplicity, the same mutation rate for both types of sequences; hence, it takes more time steps for the algorithm to explore the solution space when using nucleotide sequences, and thus has a higher computational cost.
To assess the accuracy of the phylogenetic trees reconstructed by the different algorithms, we compared the reconstructed trees with the phylogenetic signal inferred from the distribution of orthologous gene families in different genomes. We applied the GLOOME algorithm [23], which considers the interior nodes of the tree as ancestral strains and reconstructs their gene distribution; it takes a tree and the presence-and-absence of genes across the extant strains as input, and performs a reconstruction of presence-and-absence of genes in the ancestral strains based on the GLOOME posterior likelihood (GPL). We used GPL as a score to quantify the accuracy of the tree fed into GLOOME; the more consistent the phylogenetic signal from the gene distributions with a given tree, the higher the GPL (see Additional File 2 for the GPL values of the reconstructed trees). Figure 6 plots the distribution of the GPLs and the corresponding z-scores; the GPLs of trees of the same test groups reconstructed by different methods are pooled together to calculate the z-scores. Trees reconstructed from amino acid sequences have a higher GPL than trees calculated from nucleotide sequences; moreover, CGP trees based on amino acid sequences are more accurate than trees calculated using RAxML (p<4×10-14 when comparing CGP on amino acid sequences with any other algorithm; other recombination-aware algorithms are not applicable to amino acid sequences). Considering only trees reconstructed from nucleotide sequences, the CGP trees generally have higher GPL than RAxML, ClonalFrameML, and Gubbins trees (p=5.2×10-4, 0.049, 1.4×10-15, respectively).
Alternatively, we also used the consistency of a phylogenetic algorithm as a proxy indicator of its accuracy. Within a test-group that comprises 10 real genomes, we randomly assigned their core genes to either set A or set B, making a super-gene sequence A and a super-gene sequence B for each genome; tree A and tree B of the group are then reconstructed from the two sets of super-gene sequences, using the same phylogenetic algorithm. We then calculated the unrooted SD / unrooted BSD / rooted BSD between the tree pair of 100 test-groups to quantify the (in)consistency of different algorithms (see Additional File 3 for these scores; see Supplementary Text in Additional File 5 for details); these tree-distance measures are pooled together to calculate their z-score (Figure S5). The unrooted SD between tree pairs reconstructed by CGP applied to nucleotide sequences is lower than RAxML and Gubbins applied to nucleotide sequences, and also RAxML applied to amino acid sequences (p=0.0051, 0.0012, 1.2×10-5, respectively), but not CGP applied to amino acid sequences (p=0.12); CGP applied to amino acid sequences is lower than Gubbins applied to nucleotide sequences and RAxML applied to amino acid sequences (p=0.0343, 0.0036, respectively), but not RAxML applied to nucleotide sequences (p=0.2273); hence, CGP applied to nucleotide sequences, ignoring the branch length of the trees, has the highest topological consistency. The unrooted BSD for CGP applied to nucleotide sequences is higher than RAxML, ClonalFrameML, and Gubbins applied to nucleotide sequences (p=0.015, 0.00014, 0.0032, respectively), as high as CGP applied to amino acid sequences (p=0.36), and lower than RAxML applied to amino acid sequences (p=0.0183); CGP applied to amino acid sequences is higher than RAxML, ClonalFrameML, and Gubbins applied to nucleotide sequences (p=0.0015, 8.6×10-6, 3.5×10-5, respectively), and lower than RAxML applied to amino acid sequences (p=3.0×10-8); thus, CGP at best does not outperform other algorithms, except for RAxML applied to amino acid sequences, in terms of topological and branch length consistency. The rooted BSD for CGP applied to nucleotide sequences is as high as RAxML, ClonalFrameML, and Gubbins applied to nucleotide sequences, and also CGP applied to amino acid sequences (p=0.63, 0.17, 0.24, 0.70, respectively), but lower than RAxML applied to amino acid sequences (1.3×10-5); moreover, CGP applied to amino acid sequences is higher than ClonalFrameML and Gubbins (p=0.019, 0.0011, respectively), as high as RAxML applied to nucleotide sequences (p=0.12), and lower than RAxML applied to amino acid sequences (p=1.9×10-10); therefore, CGP is no better than other algorithms, except for RAxML applied to amino acid sequences, in terms of topological, branch length and root-positioning consistency. Some of the findings in this analysis contradict the observations in previous analyses. For example, trees reconstructed from amino acid sequences tend to have a higher GPL—a measure of accuracy—than those reconstructed from nucleotide sequences, whereas here we observed that tree pairs reconstructed from amino acid sequences tend to have a lower consistency (higher unrooted SD / unrooted BSD / rooted BSD) than those reconstructed from nucleotide sequences. When comparing the reconstructed tree with the real tree on simulated sequences, CGP has a lower unrooted and rooted BSD, but not a lower unrooted SD, than other algorithms; however, the opposite trend is observed in this analysis. Therefore, we concluded that this consistency test does not reflect the accuracy of algorithms.
Furthermore, we investigated how the UPGMA assumption perturbs coalescent time inference. For a phylogenetic reconstruction performed by CGP, we obtained its optimal model parameters μ, ρ, θ, and δTE; we simulated the theoretical model based on these optimal parameters, and fitted the model to the empirical pairwise SSP distributions to directly infer the coalescent time, tdirect, of different sequence pairs. Given the coalescent time inferred from the optimal tree reconstructed by CGP, ttree, we calculated the deviation between the two coalescent times, Δ, defined as , to quantify the perturbation caused by UPGMA assumption (see Supplementary Text in Additional File 5 for details). The distribution of Δ for phylogenetic reconstructions based on nucleotide sequences, and also on amino acid sequences, are mostly confined between -0.1 and 0.1 (Figure S6), roughly indicating a less-than-10% perturbation on the inferred branch length.