In the current study, we re-examined the genomes of 99 individuals from the Wellderly project [15]. The wellderly phenotype describes individuals over 80 who present without any known chronic diseases and do not take any regular medication. Each individual had already been sequenced with three different next-generation sequencing platforms, namely Complete Genomics (CG), Illumina HiSeq (MOL), and Illumina HiSeq X (HSX). Variant calling had already been performed via cgatools for CG and the Isaac software for the Illumina cohorts (Fig 1a). Therefore, the raw data in our current study were the resulting vcf files. We only considered variants, i.e., SNPs and InDels with filter tag ‘PASS’ as defined by the variant calling pipeline, without equalizing the ‘PASS’ criteria between different setups. We believe that this is the most realistic approach when comparing datasets generated by different methods. Additionally, we left-aligned the variants and split multiallelic variants into consecutive blocks to equalize the variant annotation between the different sequencing cohorts.
In order to disentangle the impact of the bioinformatic processing pipeline from the sequencing technology on the concordance of variant calls, we reprocessed the Illumina sequencing data with GATK, following the current best practice recommendations for variant calling (Fig. 1a). However, due to proprietary data formats, we could not obtain and reprocess the raw data for the CG cohort. Throughout the study, we use the name of the sequencing platform (HSX, MOL, or CG) in combination with the mapping and variant calling pipeline to refer to the five distinct experimental setups we compared, namely CG, MOL+Isaac, HSX+Isaac, MOL+GATK, and HSX+GATK.
Our comparative analysis consisted of three different investigations. First, we determined the absolute number of variants in the respective experimental setups and all possible intersections. Subsequently, we analyzed the concordance of different setups within genomic regions, including exons, introns, repetitive elements, or genomic bins with varying GC content (Fig 1b). Finally, we focused on the subset of variants filtered out after applying the MAF 5% and HWE 5% filters. We compared the distribution and reliability of such variants between the different experimental setups (Fig 1c).
Comparison of concordant variants between the three platforms
In the first step of the analysis, we estimated the number of variants detected in each setup. We assessed the concordance between the five experimental approaches (pipeline details are summarized in Fig. 1b).
HSX+Isaac was associated with the highest average SNPs number, followed by HSX+GATK, MOL+Isaac, MOL+GATK, and then CG (Fig. 2a). Altogether, an average of 2,942,659 SNPs was detected per individual by all methods, which corresponds to 82.2% of HSX+Isaac SNPs, 84.8% of HSX+GATK SNPs, 86.06% of MOL+Isaax SNPs, 86.6% of MOL+GATK SNPs, and 88.6% of CG SNPs (Fig. 2a). With respect to InDels, the highest number was detected by HSX+GATK, followed by MOL+GATK, HSX+Isaac, CG, and finally MOL+Isaac (Fig 2b). The average number of InDels seen in all setups was 214,730, corresponding to 23.9% of HSX+GATK InDels, 28.8% of MOL+GATK InDels, 32.5% of MOL+GATK Indels, 54.3% of CG InDels, and 57.9% of MOL+Isaac InDels.
We tested for the statistical over- and underrepresentation of observed unique variant calls for each setup and their interaction (Fig. 2c) with a Monte Carlo simulation approach. The intersection between all setups contained 1.41 times more SNPs and 5.01 times more InDels than expected by chance (chi-squared test, p < 2.2x10-16, Fig 2c), increasing the confidence in these calls. However, the variants unique to each platform were highly overrepresented, especially in the case of SNPs (chi-squared test, p < 2.2x10-16, Fig 2c). This indicated the presence of experimental-setup-specific systematic biases in variant calls.
Distribution of variants along the genome
We were interested in the regional effects of variant calling across the genome. Therefore, we compared the genome-wide distribution of variants in particular areas, such as exons, introns, and intergenic regions, as well as the concordance in variant calls between the different experimental setups. Notably, only a very small proportion of variants were located in exons. However, the absolute number of exonic SNPs and InDels was very similar for all investigated setups (Additional file 1, Fig. S1a, b). The majority of SNPs were found in introns, followed by intergenic regions with comparable distributions under different conditions. In contrast, the number of intronic and intergenic InDels varied substantially where the HSX+Isaac, MOL+GATK, and HSX+GATK strategies detected substantially more InDels. This finding implicates InDels positioned in introns and intergenic regions as a potential primary source of heterogeneity between different setups.
In order to investigate the effect of genomic regions on the concordance of variant calls, we calculated the pairwise Jaccard distances between all setups. As expected, based on the distributions of the absolute numbers of variants, we observed a higher concordance for SNPs compared to InDels (Figure 3). The Jaccard distances were lowest in exons, whereas we detected the worst concordance in intergenic regions. Prominently, the best concordance for InDels was achieved between MOL+GATK and HSX+GATK, which correspond to the data sets reprocessed with the currently recognized best practice mapping and variant calling strategies.
Distribution of variants in repetitive genomic regions
Regions frequently excluded from the analysis of genomic variants are repetitive elements because they are known for accumulating sequencing errors [1]. In order to calculate the impact of such regions, we proceeded as follows (Fig. 1b): We obtained the RepeatMasker annotation for the following classes of repetitive elements, Alu elements, long interspersed nuclear elements (LINE), low complexity regions, long terminal repeats (LTR) and simple repeats. Then, we estimated the absolute number of variants detected by each setup in the respective repetitive region. Finally, we examined the pairwise concordance of variant calls between the different methods using the Jaccard distance.
We observed similar distributions for the number of SNPs in distinct repetitive regions under all experimental setups (Additional file 1, Fig. S1c). The majority of SNPs were detected in LINE, Alu, and LTR elements, whereas the number of SNPs in low complexity regions was much lower. Conversely, the absolute number of InDels in the RepeatMasker annotated regions varied enormously between the setups. Notably, the highest number of InDels were detected with the HSX+GATK strategy, and the difference was most prominent in Alu, LINE elements, and simple repeats (Additional file 1, Fig. S1d).
Analysis of the Jaccard distance in pairwise comparisons of experimental approaches indicated that concordance of variant calls was consistently worse for InDels compared to SNPs in all types of repetitive regions (Fig. 4). This effect was most strongly pronounced in simple repeats and Alu elements with distances as high as 0.89 and 0.85, respectively (Fig. 4b and f). This finding correlates with the fact that we observed the most considerable discrepancy in the number of InDels under each setup in these two types of repetitive regions (Additional File 1, Fig, S1d). Furthermore, concordance was lowest for SNPs in simple repeats compared to all other types of RepeatMasker regions (Fig. 4a). Notably, reprocessing data with the currently recognized mapping and variant calling techniques was again associated with improved concordance, particularly for InDels as indicated by the lowest Jaccard distances between the MOL+GATK and HSX+GATK setups.
Distribution of variants in genomic bins with varying GC content
GC-rich genomic regions are another potential source of sequencing errors, so they are often also excluded from the analysis of genomic variants [1]. To investigate the impact of GC content on variant detection, we calculated the GC content in genomic bins of 100 kbp. Then, we estimated the non-parametric correlation between the number of SNPs and InDels in the same genomic bins with the relative proportion of GC bases.
Fig. 5a, b shows the trend lines for all experimental setups fitted with generalized additive models for SNPs and InDels, respectively. Individual scatter plots with observed values for each approach are summarized in Additional file 1, Fig. S2a-j. We identified a very comparable weak positive correlation between the number of variants and the GC content in the 100 kpb genomic bins under all experimental conditions. This trend was more pronounced for bins with a GC content between 0% and approximately 37%. In contrast, the positive correlation between the number of variants and GC content was weaker in bins with a GC content between ~37% and ~64%. It is important to mention that the 6th percentile of the GC content distribution was already at 33%, thereby bins with a lower GC content are relatively rare (cf. Additional file 1, Fig. S2k).
In order to investigate the concordance of variant calls between experimental approaches as a function of GC content, we annotated genomic bins with a proportion of GC bases below 37% as having low, between 37% and 47% as medium, and above 47% as having high GC content (see Methods). Consequently, the majority of variants were located in regions with medium GC content (Additional file 1, Fig S1e-f).
The Jaccard distances for SNPs and InDels in the respective GC genomic regions are depicted in Fig. 5c-h. As previously observed, concordance was uniformly better for SNPs as compared with InDels. The differences between regions with different GC content were not pronounced with a slightly higher concordance in regions with a low GC content (Fig5c, f). Similar to the other genomic annotations (Fig. 3 and 4), the best concordance for InDels was observed for the MOL-GATK vs. HSX+GATK setup.
Distribution of rare variants and variants deviating from HWE equilibrium
Applying common filters such as MAF and HWE might improve the concordance of different experimental setups. However, rare SNPs/InDels and variants significantly deviating from HWE might be disease-associated which makes them particularly interesting for biomedical research. Therefore, we applied the MAF 5% and HWE 5% filters and focused on the variants which failed these two filtering criteria under each experimental setup. We observed a considerable increase in the numer of such SNPs and InDels after reprocessing the sequencing data with GATK (Fig. 6a,b) compared to the remaining experimental setups. In an attempt to evaluate the realiability of these variants, we employed the GIAB annotation for high and low confidence variant call intervals. A higher number of variants in high confidence intervals could be indicative of increased sensititivy for the respective setup. The MOL+GATK and HSX+GATK setups were associated with more SNPs in high but also in low confidence intervals compared to the remaining strategies. Interestingly, the CG, MOL+Isaac and HSX+Isaac setups had more SNPs located in high compared to low confidence regions (Fig. 6b) which was also true for CG Indels (Fig 6c). In contrast, all other setups were associated with considerably more InDels detected in low confidence relative to high confidence GIAB intervals (Fig. 6d). As the reliability of such variant calls might be proportional to the quantity of reads supporting them, we compared the read depth distributions of SNPs and InDels with MAF<0.05 and HWE <0.05 in high and low confidence GIAB intervals. Median read depth for SNPs was higher in high confidence intervals under all experimental setups (Fig. 6e). In contrast, MOL+GATK and HSX+GATK exhibited reduced read depth for InDels in both high and low confidence intervals compared to the remaining strategies (Fig. 6f). Therefore, it remains questionable if the increased number of variants, especially InDels after reprocessing with GATK points to higher sensitivity or potentially reduced specificity.