Optimization of the pipeline on monoclonal and simulated mixed infection samples.
Towards optimizing GATK4 for P. falciparum, we sought to create an improved training “truth set” for the pipeline. To filter raw VCFs with a high quality truth callset, which is difficult to obtain using wet lab methods, we decided to generate a robust in silico training variant dataset from the PacBio assemblies (Additional file1: Figure S1). We first ran the variant calling pipeline with default settings of GATK4’s HaplotypeCaller and GenotypeGVFs tools on high-quality public Illumina read data (Additional file1: Table S1) of 10 laboratory strains (7G8, Dd2, GA01, GB4, GN01, HB3, IT, KH01, KH01 and SN01) for which there exist accurate PacBio assemblies (Otto et al., 2018; Vembar et al., 2016). We recalibrated the raw VCFs with the published training dataset generated from 3D7 x HB3, HB3 × Dd2 and 7G8 × GB4 crosses in splenectomized chimpanzees (Walliker et al. 1987; Wellems et al. 1990; Hayton et al. 2008) and used by MalariaGEN’s Pf6 release (Miles et al. 2016; MalariaGEN et al. 2021). The results were compared to reference callsets we generated from the accurate PacBio assemblies mapped onto the 3D7 reference (Methods, Additional file1: Figure S2).
Sensitivity of this default variant calling in the core genome were 77.7 ± 1.3% (median ± interquartile range) for SNPs and 73.1 ± 5.1% for indels (Fig. 1A). When we replaced the cross dataset with our PacBio-derived in silico training dataset (called pipeline 1), we found greatly improved sensitivity ( 84.2 ± 2.5% for SNPs; and 78.8 ± 5.4% for indels). With GATK3 recalibrated with the cross training dataset, the sensitivity for both SNPs and indels was the lowest. As next step in GATK4 optimization, we altered multiple parameters of HaplotypeCaller and GenotypeGVFs of pipeline 1 to make pipeline 2, including adjusting expected SNP and indel rates (--heterozygosity and --indel-heterozygosity) and parameters that control base quality (--base-quality-score-threshold and -stand-call-conf), mapping (-mbq and -DF) and local assembly size (--min-assembly-region-size), aiming to further improve the sensitivity and reduce false calls. While default values of pipeline 1 are fairly robust, sensitivity increased significantly to 86.6 ± 1.7% for SNPs and 82.2 ± 5.9% for indels when modified parameters were used (Methods; Fig. 1A; adjusted P < 0.001, Wilcoxon test, Pipeline 2 vs. default GATK4 with cross training dataset for both SNPs and Indels). Despite the trade off between sensitivity and precision, the latter was very high (> 90%) for all GATK4 pipelines, including our pipeline 1 and 2.
We finally tested the GATK4 pipelines on simulated mixed infection samples after combining high-quality reads of IT + KH01 at different proportions (95:5, 90:10, 85:15, 80:20, 75:25 and 50:50) to make 100X coverage. Sensitivity of our optimized pipelines was higher than that of the default GATK4 pipeline trained with the cross dataset for both SNPs and indels (adjusted P = 0.001, Wilcoxon test) (Fig. 1B). Interestingly, the most significant performance gains were observed with indels for which the sensitivity of pipeline 2 was 78.3 ± 5.1% versus 38.9 ± 0.7% for default GATK4 with cross training dataset (Fig. 1B, adjusted P < 0.001, Wilcoxon test). Here the precision was similar across all GATK4 pipelines but lower compared to single infection samples.
To understand how sequencing quality affects the results of our pipelines, we added shorter (old) Illumina read samples (n = 7) from the Pf6 dataset and analyzed any potential effect of the read coverage, insert size, read length and read quality on variant calling performance (Additional file1: Table S2). In spite of the great variation in the read coverage (median between 35 and 180X), neither the specificity nor the precision was correlated with this parameter (Fig. 2). Interestingly, we found a strong positive correlation between pipeline performance and insert size and read length (Fig. 2).
Combination of SNPs and indels shows higher resolution of population structure
Since the performance of indel calling was markedly improved with GATK4, we examined if the inclusion of indels improved the resolution of global and local population structure using Malariagen’s Pf6 samples. After restricting our analysis to the core genome and filtering out low quality variants; our fully optimized pipeline detected more variants in total compared to GATK3. The majority (55%) of indel positions overlapped with each other and with SNP sites and formed complex multiallelic markers (2,341,377 SNPs, 1,381,687 indels and 1,686,041 multiallelic sites in 6,626 samples for our pipeline; and 2,441,874 SNPs, 889,667 indels and 1,327,993 multiallelic sites for the GATK3 VCFs). We used high quality SNPs and indels from the core regions of chromosome 1, excluding hypervariable and subtelomeric regions, after pruning them for linkage disequilibrium and selecting high-quality samples with less than 20% missing genotypes (aiming to reduce potential false noise mostly for indels). We calculated the pairwise variance-standardized genetic relationship matrix and performed a t-distributed stochastic neighbor embedding (tSNE) analysis using SNPs and indels, individually and in combination. We observed a similar general population structure between sub-Saharan Africa, South-East Asia, South America and Papua New Guinea similarly across all three variant sets as previously reported (MalariaGEN et al., 2021; Manske et al., 2013; Miotto et al., 2015), including SNPs, indels and SNPs+Indels (Additional file 1: Figure S3). Subsequently, we selected only African samples, a more challenging population to differentiate, and performed the same relatedness analysis to look at continental population structure. SNPs showed more pronounced separation of Democratic Republic of Congo from West African samples (Fig. 3C) whereas indels provided greater differentiation of parasite populations within the same geographical regions (Fig. 3B). Specifically, with indels, the vast majority of samples from Ghana and Guinea were clearly separated from the rest of parasites from West Africa; and Tanzania, Madagascar, Kenya, Ethiopia and Malawi formed more population structure within East and South-East Africa. Interestingly, when combined both SNPs and indels led to higher resolution of population separation within and across the different African regions (Fig. 3A) compared to each of them analyzed separately.
Increasing ploidy improves variant calling performance in unbalanced mixed infections
Given the decreased performances of both GATK4 and GATK3 in mixed infections in diploid mode in general, we tested our pipelines at ploidy 6 to improve detection of low abundance within-sample variants (due to minor low abundance parasites). Since most downstream analytic tools cannot fully process hexaploid VCFs, such as decomposing complex indels into smallest fragments prior to comparing two callsets, we limited our evaluation to SNPs. The total number of additional true positive SNPs detected by the hexaploid mode relative to the diploid one on chromosome 13 was 449, 359, 112, 102, 95 and 78 for 15%:95%, 10%:90%, 15%:85%, 20%:80%, 25%:75% and 50%:50% of IT:KH01 mixed infections, respectively (Additional file1: Table S3). For the 50%:50% mixed infection sample, the hexaploid mode did not substantially improve variant detection, confirming our hypothesis whereby the diploid mode works more accurately when there are equal proportions of individual strains in the sample. These findings suggest that the polyploid modes would be more suitable for monitoring molecular markers of malaria drug resistance or detection of low abundance strains.
To verify this, we applied our pipeline 2 at ploidy 2 and 6 on 6,626 field isolate samples from MalariaGEN’s Pf6 release that have been collected from all malaria-endemic regions worldwide (MalariaGEN et al., 2021) to analyze k13 resistant mutations that have been reported by the World Health Organization (World Health Organization, 2020). We compared mutations found by our pipelines to those present in the public GATK3 VCF (diploid mode). As expected, the hexaploid mode of the optimized GATK4 detected 32 additional resistant mutations that were totally missed by GATK3 (Table 1). These resistant samples rescued by the hexaploid mode of our optimized pipeline were confirmed by visualizing their respective BAM files in IGV (examples are illustrated in Supplementary Figure S4 for C580Y mutation). Interestingly, one sample from Cameroon in Central West Africa where artemisinin resistance is yet to be reported, classified as wildtype by GATK3, was found with C580Y mutation (Additional file1: Figure S4).
Table 1: Resistant k13 mutations detected by the optimized GATK4 at hexaploid and diploid modes versus GATK3. All the mutations were found in South-East Asia except for one sample carrying the C580Y allele that was collected in Cameroon.
K13 mutations
|
Optimized GATK4 (ploidy6)
|
Optimized GATK4 (ploidy2)
|
GATK3
|
A481V
|
5
|
5
|
5
|
C580Y
|
664
|
649
|
650
|
G449A
|
6
|
5
|
5
|
G538V
|
24
|
21
|
21
|
I543T
|
36
|
34
|
33
|
M446I
|
38
|
36
|
36
|
N458Y
|
17
|
17
|
17
|
P441L
|
28
|
28
|
28
|
P553L
|
24
|
23
|
23
|
P574L
|
30
|
30
|
30
|
R539T
|
65
|
63
|
62
|
R561H
|
24
|
24
|
24
|
Y493H
|
111
|
106
|
106
|
Our optimized pipeline with ploidy 6 enables robust estimation of complexity of infections
To examine the practical benefits of improved variant calling, we assessed the overall WGS dataset complexity of infection (COI) in field isolates from sub-Saharan Africa, South-Asia Asia and South America. We computed COI based on common SNPs on chromosome 13 called using our optimized pipeline with ploidy of 6 and 2 and the GATK3 at diploid mode. Variants with minor allele frequencies > 1% and missing genotypes > 10% were selected for the COI calculation using the REAL McCOIL package (Chang et al., 2017). Overall, more mixed infections were found when the hexaploid callset was used to estimate the COI compared to the other VCFs and these results were consistent across all sampling locations although polyclonality was generally reduced outside sub-Saharan Africa (Fig. 4). We found 52.2%, 58.2% and 58.3% monoclonal infections in sub-Saharan African regions using optimized GATK4 with ploidy 6 and ploidy 2 and GATK3, respectively. Regarding biclonal infections in the same regions, we found 28.2%, 28.6% and 29.2% with our pipeline at ploidy 6 and ploidy 2 and GATK3, respectively. The prevalence of samples with COI > 2 was 19.6%, 13.2% for the hexaploid and diploid modes of the optimized GATK4, respectively and 12.5% for GATK3. Thus, increasing the ploidy up to 6 significantly increased the detection of polyclonal samples compared to ploidy 2 for GATK4 and GATK3 (p < 0.05, Wilcoxon test) in 8 sampling sites; including Cambodia, Cameroon, Democratic Republic of Congo, Ghana, Guinea, Malawi, Mali and Tanzania.