3.1 Genomic phenomena that might alter Trypanosomatid isolate complexity
The proportion of reads in each allele in a heterozygous SNP position can be impacted by several phenomena, including multiclonality (multiple clones or genotypes in a sample, often present in different proportions), polyploidy (multiple copies of all chromosomes) and aneuploidy (multiple copies of some chromosomes). (Fig. 1). In a non-complex clonal, euploid, diploid isolate, the mean alternate allele read depth (AARD), meaning the proportion of reads that correspond to the alternate allele in heterozygous positions, is expected to be close to 0.5, as there will be a similar number of reads mapping in both alleles (Fig. 1A). Hence, when all heterozygous SNPs in a genome are evaluated, the distribution of the AARDs will have a peak in 0.5, with a distribution of AARD values expected from ‘random draws’ of reads calling each allele. Some phenomena that have already been observed in Trypanosomatids, such as multiclonality (Fig. 1-B), polyploidy (Fig. 1-C) and aneuploidy (Fig. 1-D) will alter this proportion, changing the distribution peaks or flattening their curve, which may be seen in density plots of ARRD values for all heterozygous SNPs in a sample.
To provide a numeric and statistical large-scale evaluation of this deviation from expected AARD, we estimated CI: the absolute value of the deviation from the expected 0.5 proportion in each heterozygous SNP position [31]; and compared real samples with simulated clonal isolates at individual and populational level. To exclude deviations from AARD due to paralogous genes and aneuploidy, we included only single-copy gene regions and excluded chromosomes with somy variation and gene duplication/losses. We have estimated cutoffs for complex isolates based on the mean complexity of simulated clonal isolates from population genomic data from various Trypanosomatid clades, and used the Cochran-Mantel-Haenszel (CMH) test to support the evaluation in each isolate.
3.2 Assessing the accuracy of the CI to identify multi-clonal and polyploid isolates, using simulated or controlled data
To evaluate the accuracy of the CI metric to identify multiclonal isolates, we created sequencing read data sets to represent multiclonal isolates from L. donovani, by combining downsampled read files from laboratory cloned field isolates in various proportions. We evaluated the two features that could impact the complexity estimations in multiclonal infections: the proportion of the secondary clone and the number of heterozygous SNP positions.
To simulate multiclonal isolates with different proportion of the secondary clones, WGS data from three L. donovani lab-derived isolates that have been cloned were used: ERR205809, ERR205816 and ERR205819 [31]. Reads from these clones were combined pairwise in proportions of 2.5, 5, 10, 15, 25 and 50% of the reads from the secondary clone, resulting in 33 datasets. The complexity of these mixed samples (MIX) and the 24 clones from the Frassen 2021 [31] dataset was assessed based using two parameters: the CI: which had to be higher than the mean + 3 standard deviations (SDEV) from the simulated clonal isolates in the population; and CMH test to evaluate if the real isolate AARD differs from the expected clonal isolate, with a p-value lower than 10− 10 (Fig. 2). Based on these cutoffs, zero clones (0%), and 26 (79%) of the MIX samples were classified as complex isolates. When evaluated separately, the CI parameter was the most specific, as only one clone was classified as complex (false positive), compared to 3 for CMH. However, CI was the least sensitive, as it only classified 26/33 (79%) MIX as complex, when compared to 31/33 (94%) for CMH (Fig. 2A and B, Supplementary Figure_2).
The complexity estimation accuracy was greatly influenced by the proportion of the secondary isolate, where lower proportions resulted in false negative results. None of the six MIX samples where the secondary clone read proportion was 2.5% was classified as complex. Increasing the proportion of the secondary clone resulted in higher accuracy, where five of the six samples where the secondary clone corresponded to 5% of the reads, and all samples where the secondary clone had 10–50% of the reads were classified as complex (Supplementary Figure_2). This was expected, as a low proportion of the secondary clone had a low impact in AARD distributions (Fig. 2B). Hence, our method can detect complex isolates when the secondary clone represents at least 5–10% of the reads.
To evaluate the impact of the number of heterozygous SNPs in the complexity estimation, the heterozygous SNP counts for each MIX sample was downsampled to 10, 50, 100, 300 or full set, and the accuracy to classify each group as complex was assessed (Supplementary Figure_2). To remove potential sampling bias, the analysis was repeated in 100 iterations, re-sampling random SNP positions each time, and the final results are a combination of all iterations. When compared with the full dataset, which had between 978 and 5910 SNPs, the use of 10 SNPs resulted in poor accuracy in all proportions of the secondary clone. By using 100 and 300 SNPs, the results were similar to those observed for the full set, with lower accuracy only for samples with ~ 5% of the reads originating from the secondary clone (Supplementary Figure_2, Supplementary_table_5). Hence, the complexity index estimation requires 100 or more heterozygous SNPs to be accurate.
Besides multiclonality, another source of complexity is polyploidy, having extra full sets of chromosomal copies. To evaluate the impact of polyploidy in the complexity estimations, we used the T. cruzi dataset described in Matos 2022, [17], containing 8 diploid parental clones, and 11 triploid or tetraploid hybrids clones, where the somy of some were validated by flow cytometry Supplementary_Table_2. As performed for the multiclonal isolates, the SNPs counts for each sample were downsampled to 10, 50, 100, 300 or full set, in 100 iterations, and the accuracy to classify each group as complex was evaluated (Fig. 2C and D; and Supplementary Table 6).
Using the combination of CI and CHM cutoffs, on average 4.4, 73.5, 82.5, 90.9 and 100% of the polyploid isolates, respectively for the 10, 50, 100, 300 or the full set of SNPs were correctly classified as complex. No parental diploid clones were classified as complex in any replicate. As expected, the triploid isolates had a distribution of AARD with peak distributions in 0.33 and 0.66, while the tetraploid had peaks in 0.25, 0.5 and 0.75. Both triploid and tetraploid isolates had an CI higher than the observed for the diploid isolates (Fig. 2C and D, Supplementary_Figure 3). These results suggest that the complexity estimate can also be used to identify polyploid isolates with reasonable sensitivity (~ 80%) when 100 or more heterozygous SNPs are present.
Based on these results, we decided to use the combined results of CI and CHM tests to identify complex isolates, and to only evaluate samples with 100 or more SNPs. A conservative approach that minimises false positives, accepting some false negatives, especially in cases where the secondary clone proportion is low.
3.3 Complexity evaluation among Trypanosomatid species:
After establishing the accuracy of the CI metric to identify multiclonal and polyploid samples with simulated and controlled data, we estimated the complexity in a total of 530 laboratory and field isolates from L. donovani, L. braziliensis, T. brucei and T. cruzi, identifying a total of 28 complex isolates (Fig. 3, Table 1, Supplementary Figs. 4–7). The CI cutoff was similar among the evaluated species, with the lowest value in T. cruzi (0.072) and the highest in L. braziliensis (0.089), which supports the robustness of the method. We propose a global cutoff of 0.1 (slightly higher than the highest cutoff, in L. braziliensis) as a value that may be used to classify any Trypanosomatid isolate, or possibly other diploid eukaryotic samples, as complex, which will allow any researcher to classify single isolates without the need of population data to estimate a custom cutoff. Samples with CI values lower than the global cutoff but still higher than their species cutoff were classified as “potential complex” and evaluated separately. Only three potentially complex isolates were identified, two in T. cruzi and one in L. donovani.
Even though we removed aneuploid chromosomes that had deviations from the mean genome coverage from each isolate, the intra isolate chromosome mosaicism (mosaic aneuploidy, and chromosome imbalance) [47–51] may add noise to complexity measurements in field isolates, by having unbalanced values in a few chromosomes. Hence, we are only considering as “complex”, isolates that had at least 50% of its evaluated chromosomes with a mean complexity value higher than 0.1.
The proportion of isolates that were classified as complex varied across clades, where T. cruzi and T. brucei had the lowest (~ 2.5%) and L. braziliensis had the highest (30%) proportion of complex samples in the evaluated dataset. Complexity values also varied in different isolates, where the lowest value was observed in T. brucei SRR17479767 (0.025), and the highest in L. donovani ERR205770 (0.398). Complex isolates have more heterozygous SNPs than non complex samples (Mann-Whitney p-value = 0.003), especially for L. braziliensis (Mann-Whitney p-value 2.87 x 10− 6) and T. brucei (Mann-Whitney, p-value 0.0074). This increase was not observed in the L. donovani evaluated samples (Supplementary_Figure_8). The increase in overall heterozygous SNP counts may reflect the presence of multiclonal infections, where using WGS bulk data, clone specific homozygous SNPs are interpreted as heterozygous SNPs, increasing their counts; or allopolyploids.
The genome coverage from each set also varied, and this might have an impact on the CI limit of detection, as we only classified SNPs with coverage ≥ 5 in both alleles as heterozygous to be used in the complexity estimations. The median of the genome coverages for each dataset was 39 for L. braziliensis, 31 for L. donovani, 56 for T. brucei and 45 for T. cruzi (Supplementary_Figure_8 G), which due to our cutoff of at least 5 reads in the rarer allele, would only allow the identification of multiclonal infections where the secondary clone proportion of reads was respectively of at least 17%, 16%, 8% and 11%.
When each dataset was evaluated separately, from the 85 evaluated L. donovani samples, 6 were classified as complex (7%), and one as potentially complex. Among the 6 complex isolates, three ERR205724 (MHOM/SD/82/GILANI), ERR205770 (MHOM/IT/02/ISS2429) and ERR205774 (MHOM/BR/2003/MAM), were already classified as multiclonal by Fransen 2020 [52], and one, ERR3956121 (1052_ToD_1_primary_neg), was classified as complex by Frassen 2021. In fact, ERR205774 also presented a higher count of heterozygous SNPs in the maxicircle sequence, which further corroborates that it is a multiclonal infection (Supplementary Figure_9). Two isolates, ERR205748 (MHOM/CY/2006/CH32) and ERR205789 (MHOM/SD/62/LRC-L61), were classified by Frassen 2020 as hybrids, and had a ARRD distribution compatible with triploidy in our analysis. Hybridizations in trypanosomatids may result in polyploid lineages, which might revert back to diploidy by genome erosion [17]. The sample that was classified as potential complex ERR3956143 (1073_ToD_1_primary_neg), corresponds to a field isolate obtained from a patient from Ethiopia, which might be multiclonal.
For the L. braziliensis dataset, from the 42 evaluated samples, 13 were classified as complex and all had previous evidence of being polyploid (30%). From these 13, 10 corresponded to experimental tetraploid hybrids, described in [53], while SRR21604774 corresponded to a triploid L. braziliensis and Leishmania guyanensis hybrid. Finally the last two samples, ERR2508271 and ERR2508272, correspond to read libraries used in the assembly of the triploid L. braziliensis M2904 genome [16, 54]. We found no strong evidence of multiclonal infections in any of the evaluated L. braziliensis samples.
From the 211 T. cruzi evaluated samples, five were classified as complex, and two were classified as potential complex. From the complex set, three were isolated from the insect vector: SRR8503553 (Panstrongylus lignarius in Peru); SRR3676272 and SRR3676273 (Triatoma dimidiata in Texas) [55–57]. The AARD density peaks in these three samples are similar to those expected for triploid isolates (0.33 and 0.66), suggesting that they are polyploid. The other two complex samples were isolated from chronic chagasic human patients in Panama (SRR3676281, SRR3676310) [55], and had AARD peaks that are not similar to what is expected for tri or tetraploid isolates. This suggests that they might be multiclonal infections. In fact, SRR3676310 had also a higher count of heterozygous SNPs in the maxicircle sequence when compared to other T. cruzi isolates (Supplementary_Figure_9), which further support that it is potentially a multiclonal infection.
Finally, for T. brucei we identified 4 complex isolates in a dataset of 159 samples. From those, SRR17479764 corresponds to a triploid hybrid from the J10 and KETRI 1738 strains [58], while SRR17479766 was previously suggested to be a multiclonal infection [58]. The final two complex T. brucei strains, ERR270813 and SRR6052140 have AARD profiles that are similar to what is respectively expected for tetraploid and triploid isolates.
Table 1
Complexity evaluation of each Trypanosomatid group of samples. In the Complex samples column, the number in parentheses indicates the number of potential complex samples.
Species | Sample number | CI threshold | Max. CI | Min. CI | Complex samples | Complex % | Assessment |
---|
L. donovani | 85 | 0.083 | 0.398 | 0.041 | 6 (+ 1) | 7 | 4 multiclonal 2 polyploid |
L. braziiensis | 42 | 0.089 | 0.194 | 0.035 | 13 | 30 | 13 polyploid |
T. brucei | 159 | 0.077 | 0.167 | 0.025 | 4 | 2.5 | 1 multiclonal 3 polyploid |
T. cruzi | 211 | 0.072 | 0.227 | 0.030 | 5 (+ 2) | 2.3 | 2 multiclonal (chronic cases) 3 triploid (insect source) |
Simulated | 33 | 0.075 | 0.3 | 0.05 | 25 | 75 | - |
Polyploid | 11 | 0.084 | 0.22 | 0.134 | 11 | 100 | - |
Taken together these results suggest that complex isolates represent a small percentage of the cultured field isolates for all the TriTryp species evaluated. This corresponds to the lower bound of potential complex infections compared to what is observed in natural conditions due to limitations in parasite isolation, culture and a low proportion of the secondary clone, which hampers complexity detection by our method. We suggest that the CI should be estimated in all new WGS from Trypanosomatids isolates obtained in the future, to identify polyploid/multiclonal isolates before proceeding with downstream analysis. To facilitate this, we provide the R code for this method on github (https://github.com/jaumlrc/Complex-Infections.git), which requires only a variant call format (VCF) file to produce complexity estimates.