Overview of the Nationwide WES proficiency test in China
To assess both inter-lab and intra-lab performance in detecting germline small variants, four Quartet DNA reference materials22 (Quartet_M8 (mother), Quartet_F7 (father), and Quartet_D5 and Quartet_D6 (monozygotic twin daughters)), each with three replicates, were anonymously distributed to participating labs in a random order (Fig. 1a). Each lab employed its own WES workflow to sequence the provided reference materials, adhering to a minimum sequencing depth requirement of 100x, and subsequently analyzed and reported germline variants. Each lab then submitted metadata containing essential experimental and analytical details, raw sequencing data (FASTQ files), variant calling results (VCF files), and targeted regions (BED files) (Methods).
A total of 103 independent labs participated in this nationwide multi-center WES proficiency test in China, with 89 labs returning their data before the deadline (Fig. 1b). Each lab employed a different workflow, integrating diverse combinations of experimental and analytical methods along with various parameter settings. Analysis of the metadata provided insights into the methods utilized by these labs, allowing us to identify the most widely accepted approaches (complete metadata for each lab are available in Supplementary Table 1). Because extracted DNA reference materials were distributed to each lab, the proficiency test began with DNA fragmentation and did not cover the DNA extraction process. Of the 89 labs, 50 used ultrasonic methods for fragmentation, while 39 opted for enzymatic approaches (Fig. 1c). Following fragmentation, a library was constructed through adaptor ligation, and exons were targeted and captured using different capture kits. More than 12 capture kits were utilized, with IDT xGEN Exome Panel (28), Agilent SureSelect (22), and Roche KAPA HyperExome (12) being the most commonly used (Fig. 1c, Supplementary Table 2). After amplifying captured targets through qPCR and conducting quality control, the resulting captured library underwent high-throughput sequencing. Sixty-seven labs used Illumina sequencing platforms, while 22 labs employed MGI sequencing platforms. Illumina NovaSeq emerged as the most widely used, with 61 out of 89 labs adopting this platform (Fig. 1c). Computational analysis of the WES data began with raw read preprocessing, involving the removal of adapters and the filtration of low-quality reads. Fastp24 (45) was the most popular preprocessing tool, followed by Cutadapt25 (16) and Trimmomatic26 (13) (Fig. 1d). BWA27 (77) was employed by almost all labs for aligning reads to the human reference genome (Fig. 1d). For variant calling, at least 18 pipelines were utilized.
GATK germline variant calling best practices with HaplotypeCaller28 (64) was the most widely used, followed by Sentieon DNAscope29 (7), Strelka230 (5), Freebayes31 (3), DeepVariant32 (2), and VarScan33 (2). Five labs reported variants supported by two or more callers (Fig. 1d).
We should note that the targeted regions used by participating labs for WES did not cover 100% of the exome regions (Supplementary Table 1). Typically, they encompassed 76.2–89.2% of the exome regions (Fig. 1e). Among the participating labs, targeted regions spanned from 32.7 Mb to 100.8 Mb. While some labs utilized the same capture kits, their targeted regions varied due to the inclusion of custom targets or the addition or subtraction of 10 to 50 bp on each side of the targets. A total of 23.4 Mb of targeted regions was covered by all 85 labs (see below), encompassing about 60% of the exome regions. We evaluated the coverage of clinically significant genes (a total of 2,773 genes) within the targeted regions for each capture kit, encompassing 2,742 genes from ClinGen (downloaded on 2023-09-21), 171 genes from COGIV34 (catalogue of germline variants in cancer), and 157 genes from TCGA (pathogenic germline variants in cancer)35 (Supplementary Table 3). The targeted regions of different capture kits spanned 32.6–46.8% of the clinically important gene regions (Extended Data Fig. 1), covering 2,706 (97.4%) to 2,736 (98.4%) genes. Less than 6% of the targeted regions for all capture kits were identified as repetitive regions. The NadPrep Exome Plus Panel and IGT v1 covered more repetitive regions than the others (Extended Data Fig. 1).
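The overlap percentages above reduce to interval arithmetic on BED records. A minimal sketch of computing how much of one region set is covered by a lab's targets, with illustrative coordinates rather than real exome intervals:

```python
def merge_intervals(ivs):
    """Merge overlapping (start, end) intervals on one chromosome."""
    merged = []
    for s, e in sorted(ivs):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def overlap_bp(a, b):
    """Total base pairs shared by two interval lists (sweep over merged lists)."""
    a, b = merge_intervals(a), merge_intervals(b)
    total, i, j = 0, 0, 0
    while i < len(a) and j < len(b):
        s = max(a[i][0], b[j][0])
        e = min(a[i][1], b[j][1])
        if s < e:
            total += e - s
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

# Half-open, BED-style intervals; values are illustrative only.
targets = [(100, 200), (150, 300), (500, 600)]
exome = [(180, 550)]
shared = overlap_bp(targets, exome)                               # 170 bp
covered = shared / sum(e - s for s, e in merge_intervals(exome))  # fraction of exome hit
```

In practice the same sweep would run per chromosome over BED files from the lab and from RefSeq or GENCODE exon annotations.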
We carefully checked the files returned by participating labs for compliance before analysis. Upon examining their target regions, we found that three labs utilized excessively small targeted regions (13.1 Mb, 13.1 Mb and 13.7 Mb), significantly smaller than the exome regions (RefSeq release 210: 38.5 Mb and GENCODE v39: 40.1 Mb). Additionally, one lab’s targeted regions covered 30.5 Mb but showed only a 54% overlap with the exome regions in the RefSeq or GENCODE databases (Fig. 1e). Among the remaining 85 labs, one reported an incorrect format for all FASTQ files, another lab reported an incorrect FASTQ format for two replicates, and two labs submitted VCF files containing only plausible disease-causing variants instead of all germline small variants detected in the targeted regions. Consequently, we utilized 996 VCF files from 83 labs to evaluate the analytical validity of WES among sequencing service providers and clinical labs in China. Simultaneously, we used 2,012 paired-end FASTQ files from 84 labs to investigate the quality of sequencing data and the impact of experimental and analytical factors on variant calling accuracy.
Read quality and capture efficiency
We performed pre-alignment quality control on raw reads using FastQC36 to evaluate read quality and FastqScreen37 to detect any library contamination. Post-alignment quality control and examination of capture efficiency were conducted on alignments to identify potential issues in library preparation and sequencing that could impact variant detection accuracy. We observed quality variations during the sequencing of identical DNA reference materials across different labs, even among those using identical capture kits and sequencing platforms (Fig. 2).
A sequencing depth of 100x was required, and 82 labs met this threshold in raw sequencing depth. After removing duplicated reads, 80 labs had a sequencing depth greater than 100x; the lowest observed deduplicated sequencing depth was 44x. The duplication rate of reads ranged from 2.5–52.3% (median = 21.5%). We observed a positive correlation between increased sequencing depth and a higher duplication rate (Pearson’s correlation coefficient, r = 0.39). Despite one lab achieving the highest raw sequencing depth of 1676x, its duplication rate peaked at 46.9%, indicating wasted sequencing data. Labs employing the Illumina platforms exhibited a higher duplication rate than those using the MGI platforms (24.3% versus 9.7%).
For raw read quality control, all libraries consistently exhibited high base-quality scores, low levels of GC bias, low N base content, and overrepresented sequences accounting for less than 1% of the total (Extended Data Fig. 2). Three major issues were identified. First, one third of the labs (27 out of 84) reported raw reads containing adapter sequences comprising more than 10% of all reads. Second, biased sequences in the first 12 bases of the run were observed in 36 labs, indicating biased selection of fragments during the random priming step in library preparation38. Third, during contamination checks, 19 labs had over 5% of reads mapped to both the human genome and the vector, and three labs had less than 95% of reads mapped to the human genome (Extended Data Fig. 3).
To evaluate the post-alignment quality of the sequence data from the participating labs, all raw reads were aligned to the reference genome (GRCh38) using BWA. Overall, all labs demonstrated a high mapping rate (the percentage of reads aligned to the reference sequence, median = 99.5%) and a low mismatch rate (median = 0.3%). However, despite a higher mapping rate of the MGI platforms compared to that of the Illumina platforms (99.97% versus 99.92%), the MGI platforms exhibited a higher mismatch rate (0.5% versus 0.3%) and lower Q30 rate (90.1% versus 93.9%). To prevent the generation of excessive duplicate sequences, the insert size should be at least twice the read length39. For instance, with a read length of 150 bp, the insert size should be a minimum of 300 bp, and for a read length of 100 bp, it should be at least 200 bp. However, only 11 labs met the minimum insert size requirement (Fig. 2).
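The insert-size rule of thumb above can be expressed as a one-line check (a sketch; in practice the median insert size would come from post-alignment QC output such as an insert-size histogram):

```python
def insert_size_ok(median_insert: int, read_length: int) -> bool:
    # Rule of thumb from the text: the insert should be at least twice
    # the read length, so paired reads do not overlap each other.
    return median_insert >= 2 * read_length

assert insert_size_ok(320, 150)       # 150 bp reads: a 320 bp insert passes
assert not insert_size_ok(260, 150)   # 260 bp is below the 300 bp minimum
assert insert_size_ok(200, 100)       # 100 bp reads: >= 200 bp passes
```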
Efficient exome capture is important for ensuring high-quality data. Inefficient target capture can lead to regions with insufficient read depth, necessitating additional sequencing to maintain data coverage. To evaluate the capture efficiency of each WES library, we assessed targeted coverage, defined as the percentage of target regions with at least 100x read depth. We observed considerable differences in the coverage of target regions across various labs, ranging from as low as 13.9% to as high as 98.1%, with a median of 66.5%. Approximately one fourth (24) of the labs achieved a minimum of 80% coverage with at least 100x read depth. The on-target bases rate, reflecting the percentage of bases mapped to target regions, ranged from 42.8–76.7% with a median of 55.9%. The fold-80 base penalty was used to assess coverage uniformity, indicating the extra sequencing required to achieve 80% coverage of the target bases at the mean coverage level. Values above 1 indicate non-uniform coverage. Across all WES libraries, the fold-80 penalty ranged from 1.3 to 4.4, with a median of 1.8. Notably, only 18 labs exhibited a fold-80 penalty below 1.5, and six labs recorded a high fold-80 penalty greater than 3 (Fig. 2).
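The fold-80 base penalty can be approximated from a per-base depth distribution as the mean coverage divided by the depth that 80% of target bases reach (a sketch using a crude percentile; production tools such as Picard compute it from the full coverage histogram):

```python
def fold80_penalty(depths):
    """Mean target coverage divided by the depth exceeded by 80% of
    target bases (the 20th percentile of the depth distribution).
    Values near 1 indicate uniform coverage; larger values mean more
    extra sequencing is needed to lift 80% of bases to the mean."""
    d = sorted(depths)
    mean = sum(d) / len(d)
    q20 = d[int(0.2 * (len(d) - 1))]  # crude percentile, fine for a sketch
    return mean / q20

# Perfectly uniform coverage gives a penalty of 1.0.
uniform = fold80_penalty([100] * 10)
# A library with under- and over-covered bases is penalized:
# mean = 110x, but 80% of bases only reach 50x -> penalty 2.2.
uneven = fold80_penalty([50, 50] + [100] * 6 + [200, 200])
```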
Based on read quality and capture efficiency, we categorized the 84 labs with compliant FASTQ files into three quality groups: 25 were designated as low quality, 38 as middle quality, and 21 as high quality (Methods). Generally, high-quality sequencing data exhibited lower read duplication rates, increased sequencing depth after removing duplicated reads, higher mapping rates, lower mismatch rates, elevated Q30 rates, improved coverage, higher rates of on-target bases, and a reduced fold-80 penalty (Extended Data Fig. 4). When library contamination status was considered, we found that 6 out of 25 (24%) labs in the low-quality group and 5 out of 38 (13%) labs in the middle-quality group had high contamination, while none in the high-quality group did (Extended Data Fig. 5).
Performance of small variant detection varied across labs
We evaluated the performance of variant calling results from each participating lab, generated from their own customized bioinformatic pipelines. First, we tallied the number of variants detected by each lab. Due to variations in target region sizes and bioinformatic pipelines, significant differences were observed in the number of detected variants among the labs. The number of detected variants ranged from 13,069 to 82,887 SNVs and 270 to 16,544 Indels after variant filtration (Extended Data Fig. 6a). Thirty-three labs reported variants outside the target regions, with 20 of them reporting as many or more variants outside the target regions than inside them (Extended Data Fig. 6b), because these labs added 50–200 bp padding regions to the target intervals when detecting variants. Comparing the accuracy of variants reported inside and outside target regions, we found that the Mendelian concordance rates of variants outside the target regions were lower than those inside (SNV: 0.91 versus 0.97, Indel: 0.72 versus 0.83) (Extended Data Fig. 7). This highlights the need for cautious interpretation of variants detected outside the target regions, requiring orthogonal confirmation, especially if they are clinically relevant.
We assessed the accuracy of variants present only within the target regions. Notably, 86.4–99.3% of SNVs and 43.0–93.7% of indels were located within the high-confidence bed regions, or benchmark regions, defined for the Quartet DNA reference materials. The accuracy of these variants can be evaluated by comparing them with benchmark variants. For the remaining SNVs (4.6% on average) and indels (26.6% on average) outside the benchmark regions, we estimated their precision by examining the Mendelian concordance among the Quartet reference samples (parents and twin daughters). Since there are typically only around 30 de novo variants per generation, most Mendelian-inconsistent variants are deemed to be false positives caused by sequencing or alignment errors40.
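The Mendelian-concordance logic for the Quartet design reduces to two checks: the monozygotic twins must agree, and their shared genotype must be assemblable from one allele of each parent. A minimal sketch over genotype tuples (real pipelines operate on multi-sample VCFs and handle missing calls):

```python
from itertools import product

def mendelian_consistent(father, mother, d5, d6):
    """A variant call is Mendelian-consistent in the Quartet design when
    the monozygotic twins (D5, D6) agree and their shared genotype can
    be formed by one allele from each parent. Genotypes are allele
    tuples, e.g. ('A', 'G') for a heterozygous call."""
    if sorted(d5) != sorted(d6):
        return False
    child = sorted(d5)
    return any(sorted((fa, mo)) == child
               for fa, mo in product(father, mother))

# A het child of an A/A father and G/G mother is consistent...
ok = mendelian_consistent(('A', 'A'), ('G', 'G'), ('A', 'G'), ('A', 'G'))
# ...but a twin mismatch is flagged as a likely false positive.
bad = mendelian_consistent(('A', 'A'), ('A', 'A'), ('A', 'G'), ('A', 'A'))
```

The Mendelian concordance rate is then the fraction of detected variants passing this check across a lab's call set.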
Compared to SNVs, the overall detection accuracy of indels was lower (Figs. 3a, b). The precision of SNVs was 0.997 ± 0.048, with all labs except two achieving an SNV detection precision above 0.983. For indels, the precision was 0.935 ± 0.053. The recall of SNVs was 0.974 ± 0.104, and for indels, it was 0.900 ± 0.142, which were notably lower than precision. This suggests that stringent filtration was applied by most participating labs, potentially resulting in the loss of many true variants. The Mendelian concordance rate was used to estimate the precision of variants detected both inside and outside high-confidence bed regions (SNV: 0.972 ± 0.087, Indel: 0.769 ± 0.111), which was much lower than the precision of variants only detected inside high-confidence bed regions. This discrepancy was likely due to variants in complex and repetitive genomic regions. For detection performance of small variants within clinical genes, the precision (0.983 ± 0.072) and recall (0.964 ± 0.128) of SNVs were much higher than those for Indels (precision: 0.925 ± 0.069; recall: 0.910 ± 0.137) (Extended Data Fig. 8a). Among the 2,773 genes we investigated, the HLA gene loci (HLA-B: 22, HLA-DQA1: 11, HLA-C: 8), SMPD1 (8), KMT2C (8), SEC63 (6) and DSPP (6) had more false positives compared to other genes. This is due to the presence of repetitive genomic regions within these genes, which can cause mapping errors (Extended Data Fig. 8b).
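The precision, recall, and F1 values used throughout follow from set comparison against the benchmark calls. A sketch over hashable variant keys (the coordinates are hypothetical):

```python
def benchmark_metrics(called, truth):
    """Precision/recall/F1 of a call set against benchmark variants,
    comparing (chrom, pos, ref, alt) keys inside benchmark regions."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)                       # true positives
    precision = tp / len(called) if called else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr2", 50, "G", "A")}
calls = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr3", 10, "T", "C")}                   # one FP, one FN
p, r, f1 = benchmark_metrics(calls, truth)
```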
We investigated the reasons contributing to performance differences among labs by considering both the F1 score and Mendelian concordance rate (MCR) (Figs. 3c, d). According to the distribution of F1 scores and MCR, labs fell into four regions. Region 1 comprised labs with high F1 scores and high MCR. Region 2 represented labs with high MCR but low F1 scores. Region 3 included labs with high F1 scores but low MCR. Region 4 comprised labs with low F1 scores and low MCR. Labs in Region 1 displayed F1 scores and Mendelian concordance rates above the median values, suggesting excellent variant detection performance. Fourteen labs exhibited high F1 scores and Mendelian concordance rates for both SNVs and Indels, indicative of high or moderate read quality and lower contamination levels. In Region 2, 13 labs exhibited F1 scores below the median values but Mendelian concordance rates surpassing the median values for SNV detection, and 19 labs for Indel detection. The diminished F1 scores were primarily due to low recall caused by stringent filtration thresholds. In Region 3, 14 labs demonstrated F1 scores exceeding the median values but Mendelian concordance rates below the median values for SNV detection, with 30 labs exhibiting this pattern for Indel detection. The reduced Mendelian concordance rates were primarily associated with target regions encompassing repetitive genomic regions. Region 4 contained labs with both F1 scores and Mendelian concordance rates below the median values for both SNV and Indel detection. This outcome was likely influenced by multiple factors, including low read quality or high contamination, particularly affecting Indel detection; overly stringent filtration thresholds might have exacerbated these issues. Additionally, some labs displayed inconsistent performance across replicates (Extended Data Fig. 9), potentially reflecting sequencing failures for certain samples that diminished the overall Mendelian concordance rate.
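The four-region assignment can be sketched as a median split on both metrics (lab names and values are illustrative):

```python
from statistics import median

def assign_regions(labs):
    """Place each (name, f1, mcr) lab into the four regions of the
    F1-vs-MCR plane using the cohort medians as the split points."""
    f1_med = median(f1 for _, f1, _ in labs)
    mcr_med = median(mcr for _, _, mcr in labs)
    regions = {}
    for name, f1, mcr in labs:
        if f1 >= f1_med and mcr >= mcr_med:
            regions[name] = 1   # high F1, high MCR: best performers
        elif f1 < f1_med and mcr >= mcr_med:
            regions[name] = 2   # low F1: often over-stringent filtration
        elif f1 >= f1_med and mcr < mcr_med:
            regions[name] = 3   # low MCR: often repetitive target regions
        else:
            regions[name] = 4   # both low: multiple quality issues
    return regions

labs = [("lab_a", 0.99, 0.97), ("lab_b", 0.90, 0.98),
        ("lab_c", 0.99, 0.80), ("lab_d", 0.88, 0.75)]
regions = assign_regions(labs)
```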
We identified three labs in Region 4 with high read quality but significantly low F1 scores and Mendelian concordance rates. We discovered the reasons by clustering the Quartet reference samples from all labs using their variant calling results (Fig. 4). One lab had mislabeled the samples: Quartet_F7 and Quartet_M8 were clustered with the twins Quartet_D5 and Quartet_D6. Another lab likely experienced contamination from external DNA samples rather than Quartet samples, as its variant profiles differed markedly from those of other labs, and the variants called from the monozygotic twins were not identical. Such mislabeling and cross-contamination cannot be detected using pre-alignment quality control methods but only through the Quartet family design. We corrected the mislabeling issue and assigned the contaminated lab to the low-quality group for further analysis. Additionally, we observed that the variant calling results from a third lab maintained the Quartet family relationship but differed significantly from variants called by other labs. This lab applied overly stringent filtration thresholds, removing many true variants and retaining fewer variants, resulting in high precision (0.98) but very low recall (0.28).
Impact of bioinformatic factors on variant detection performance
The participating labs employed various variant calling pipelines and filtration methods. We first investigated the impact of analytical factors on variant calling performance. We aligned the submitted raw reads to the reference genome (GRCh38) using BWA. To adjust for differences in sequencing depth across labs, the mapped reads were then normalized to mean 100x coverage and subjected to various pipelines for variant calling. We evaluated the performance of five commonly used variant calling pipelines among the participating labs: GATK HaplotypeCaller28, Sentieon HaplotypeCaller41, Sentieon DNAscope29, Strelka230, and DeepVariant32. Among the 64 labs that employed HaplotypeCaller, 24 did not apply the recommended post-alignment process, base quality score recalibration (BQSR). We further investigated the impact of including or excluding BQSR on variant calling accuracy. These analyses resulted in a total of 6036 call sets (1006 samples × 6 pipelines) (Methods).
We assessed the performance of each raw call set without filtration, based on precision and recall in the benchmark regions and the Mendelian concordance rate across the entire target regions (Figs. 5a, b). For all bioinformatic pipelines, higher-quality reads showed greater accuracy in both SNV and Indel detection. Among the pipelines evaluated, DeepVariant, Sentieon DNAscope, and Strelka2 achieved higher precision compared to Sentieon HaplotypeCaller and GATK HaplotypeCaller. DeepVariant and Sentieon DNAscope maintained high precision in the benchmark regions even when handling low-quality reads for both SNV and Indel detection. Although HaplotypeCaller exhibited lower precision compared to the other three callers, it achieved a higher recall rate. This is attributed to its identification of more variants (median number of variants detected across all call sets: 14,928 versus 14,639), suggesting that variant filtration may be necessary to improve precision. Sentieon and GATK HaplotypeCaller exhibited comparable performance, although GATK HaplotypeCaller demonstrated lower precision in Indel detection. The inclusion or exclusion of the BQSR post-alignment process did not significantly impact the variant calling performance. Performance varied significantly across different complex and repetitive regions, with SNVs showing higher performance in low-complexity and simple repeat regions, while Indels performed better in the long terminal repeat (LTR) regions (Fig. 5c). In clinically relevant gene regions, the SNV F1 score was 0.993 and the Indel F1 score was 0.929. Sentieon DNAscope had the highest performance, achieving an SNV F1 score of 0.994 and an Indel F1 score of 0.971.
A total of 33 labs identified variants outside of the targeted regions by expanding the boundaries of target regions, adding padding ranging from 50 bp to 200 bp. We assessed the precision of variants detected within these padding regions using the Mendelian concordance rate, analyzed in 50 bp bins (Fig. 5d). Comparing variants identified within the target regions with those in the padding regions, we observed a significant decrease in precision for variants located farther from the boundaries of the target regions. This suggests potential false positives in these padding regions, emphasizing the need for careful interpretation.
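The padding analysis amounts to binning each off-target variant by its distance to the nearest target boundary. A sketch with illustrative boundary coordinates:

```python
import bisect

def distance_bin(pos, target_edges, bin_bp=50):
    """Bin an off-target variant by its distance (bp) to the nearest
    target-region boundary, in 50 bp bins (bin 0 = 0-49 bp, etc.).
    target_edges is the flattened, sorted list of start/end positions."""
    i = bisect.bisect_left(target_edges, pos)
    dists = []
    if i < len(target_edges):
        dists.append(target_edges[i] - pos)   # boundary at or after pos
    if i > 0:
        dists.append(pos - target_edges[i - 1])  # boundary before pos
    return min(dists) // bin_bp

# Boundaries of two targets, [1000, 1200) and [5000, 5300); illustrative.
edges = [1000, 1200, 5000, 5300]
near = distance_bin(1230, edges)   # 30 bp past a boundary -> bin 0
far = distance_bin(1320, edges)    # 120 bp past a boundary -> bin 2
```

Grouping padding-region variants by this bin and computing the Mendelian concordance rate per bin reproduces the decay-with-distance analysis.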
A common strategy employed to enhance the accuracy of variant detection for clinical samples is to utilize the intersection of two or more variant callers. We investigated the variant detection performance by employing the intersection of any combinations of the four callers on high-quality reads: DeepVariant, Sentieon DNAscope, Strelka2, and Sentieon HaplotypeCaller (Fig. 5e). Utilizing the intersection of two or more callers resulted in a slight increase in precision, while recall dropped rapidly. When dealing with low-quality sequencing datasets, employing the intersection of multiple callers did not significantly enhance precision, and there was a slight decrease in recall as the number of intersecting callers increased (Extended Data Fig. 10).
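The intersection strategy, and its precision/recall trade-off, can be illustrated with toy call sets (variant keys and the "fp*" false positives are invented):

```python
def intersect_callsets(callsets):
    """Keep only variants reported by every caller in the combination."""
    kept = set(callsets[0])
    for cs in callsets[1:]:
        kept &= set(cs)
    return kept

truth = {"v1", "v2", "v3"}
deepvariant = {"v1", "v2", "v3", "fp1"}
dnascope = {"v1", "v2", "v3", "fp2"}
strelka2 = {"v1", "v2", "fp3"}       # misses v3

# Two callers remove the caller-specific false positives...
two = intersect_callsets([deepvariant, dnascope])
# ...but adding a third also drops a true variant that one caller missed.
three = intersect_callsets([deepvariant, dnascope, strelka2])
recall_two = len(two & truth) / len(truth)
recall_three = len(three & truth) / len(truth)
```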
We also examined the impact of variant filtration on variant calling performance by comparing the results obtained from the same bioinformatic pipeline without filtration (Sentieon HaplotypeCaller with BQSR) to the submitted variant calling results after filtration. For most participating labs, F1 scores from both pipelines were similar. However, we observed a significant increase in the F1 scores of 13 labs when utilizing the Sentieon pipeline without filtration compared to their submitted variant calling results with filtration. Only very few labs submitted filtered variant calling results with higher F1 scores than the corresponding unfiltered variants. This suggests that most labs may have employed overly stringent filtration thresholds, leading to the exclusion of many true variants and thereby greatly reducing recall without a significant improvement in precision (Extended Data Fig. 11).
One of the most inconsistent variant filtration parameters across labs was the depth threshold, which ranged from 0 to 30. We assessed the accuracy of variants across different depth filtration thresholds using Sentieon HaplotypeCaller with the 100x down-sampled datasets. Up to a depth filtration threshold of 5, the precision and recall for both SNVs and Indels remained essentially unchanged. However, starting from a threshold of 10, the precision showed a slight increase, while the recall decreased rapidly (Extended Data Fig. 12).
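The threshold sweep can be sketched as follows (variant keys, depths, and the truth set are invented for illustration; the shape of the result mirrors the observation above):

```python
def sweep_depth_filter(calls, truth, thresholds=(0, 5, 10, 20, 30)):
    """Precision/recall at several DP filtration cutoffs. `calls` is a
    list of (variant_key, depth) pairs; `truth` is a set of keys."""
    out = {}
    for t in thresholds:
        kept = {v for v, dp in calls if dp >= t}
        tp = len(kept & truth)
        out[t] = (tp / len(kept) if kept else 0.0,  # precision
                  tp / len(truth))                  # recall
    return out

truth = {"v1", "v2", "v3", "v4"}
calls = [("v1", 80), ("v2", 40), ("v3", 12), ("v4", 7), ("fp1", 6)]
metrics = sweep_depth_filter(calls, truth)
# Up to DP >= 5 nothing changes; at DP >= 10 the low-depth false positive
# is removed (precision up) but a true low-depth variant is lost too.
```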
Impact of experimental factors on variant detection performance
In evaluating the influence of wet-lab experimental factors on variant detection performance, we considered factors such as DNA input amount, DNA fragmentation method, DNA shearing time, capture kits, library sizes and concentrations, and sequencing platforms. To minimize the effects of bioinformatic analytical factors, we relied on performance metrics (precision, recall, and Mendelian concordance rate) obtained from a single caller (DeepVariant). Because the datasets originated from different labs, some were of high quality while others were not; to mitigate the influence of labs with evident experimental issues, we excluded labs with obvious contamination from other samples from this analysis.
We first examined the impact of experimental factors on variant calling performance using principal variance components analysis (PVCA). Capture kits contributed the most to explaining the variance in SNV (65.6%) and Indel (51.7%) detection, while sequencing platforms and fragmentation methods had no significant effect on variant calling performance (Figs. 6a, b). Notably, MGI platforms exhibited slightly lower performance compared to Illumina platforms in detecting Indels (t-test, p = 0.014). This difference may be attributed to the fact that more labs using Illumina platforms generated higher-quality sequencing datasets, and performance was influenced by multiple experimental factors beyond just the choice of sequencing platform. Among the capture kits evaluated, KAPA, IDT v1, IDT v2, NanoWES, and NadPrep achieved higher performance for both SNV and Indel detection compared to Agilent MyGenostics and IGT v1. This superiority was attributed to their lower fold-80 penalty, higher coverage rate over 100x, and higher on-target bases rate (Extended Data Fig. 13). Four popular capture kits, KAPA, Agilent v6, IDT v1, and IDT v2, showed a wide performance range. This suggests that while capture kits play a significant role in variant detection performance, experimental procedures also exert a substantial influence. The best performance achieved an F1 score of 0.995 for SNV and 0.973 for Indel, while the worst was 0.967 for SNV and 0.810 for Indel.
We next investigated how experimental factors affect the final variant calling performance by examining the correlations among experimental factors, pre-alignment and post-alignment quality metrics, and variant calling performance metrics (Figs. 6c, d). Various factors influenced the final variant calling performance, with some factors correlated with each other. Among all considered factors, the fold-80 penalty significantly affected both precision and recall of variant calling results, demonstrating a notable negative correlation (Pearson’s correlation coefficient, precision: SNV: r = -0.6, Indel: r = -0.7; recall: SNV: r = -0.4, Indel: r = -0.6; p < 0.0001). When the fold-80 penalty was less than 1.5 and there were no other obvious failures in the sequencing data, SNV precision exceeded 0.997, Indel precision surpassed 0.970, SNV recall was above 0.983, and Indel recall remained above 0.915. The fold-80 penalty of Agilent v6 showed large variability among participating labs, influenced by insert size: as the insert size increased, the fold-80 penalty decreased. Additionally, the fold-80 penalty was positively correlated with read duplication rates and negatively correlated with library concentration. The percentage of target regions with coverage over 100x exhibited a strong positive correlation (Pearson’s correlation coefficient, r = 0.5, p < 0.0001) with MCR, suggesting that higher coverage improves variant calling precision in complex genomic regions. Library size negatively correlated with recall, indicating that increased DNA shearing time, which results in shorter DNA fragment sizes, led to a decreased recall rate. A longer insert size also improved the on-target bases rate.
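The correlations quoted above use the plain Pearson coefficient; a dependency-free sketch with illustrative lab-level values (not the study's actual data):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical lab-level values: fold-80 penalty vs SNV precision.
fold80 = [1.3, 1.5, 1.8, 2.5, 3.0, 4.4]
precision = [0.998, 0.997, 0.995, 0.990, 0.985, 0.975]
r = pearson_r(fold80, precision)   # strongly negative, as in the text
```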
Best practices for germline small variant detection with WES
Based on our comprehensive analysis of the largest real-world WES datasets, generated across 89 labs with the well-characterized Quartet DNA reference materials, we have compiled best practices and recommendations for detecting germline small variants using WES (Table 1). Given that the Quartet DNA reference materials have a high DNA integrity number (DIN exceeding 8.5) and median fragment sizes over 60 kb, the best practices begin with DNA fragmentation, excluding DNA extraction. Ultrasonic and enzymatic methods are commonly used for DNA fragmentation, with no significant differences observed in their impact on variant detection performance. During library construction, fragment size emerges as a crucial factor: shorter DNA shearing times result in longer fragment sizes, leading to higher on-target base rates and thereby enhancing precision and recall. After initial library construction, the next step is target capture. It is crucial to select the right capture kit, as different kits cover varying target regions; ensuring that the target regions include the variants of interest is essential. Capture efficiency metrics such as the fold-80 penalty, on-target bases rate, and coverage are pivotal in selecting capture kits, favoring those with lower fold-80 penalties and higher on-target rates and coverage. While the choice of capture kit significantly impacts variant calling performance, human intervention during experimental procedures also plays a vital role: some labs achieved high performance with a given capture kit, while others performed sub-optimally with the same kit. Furthermore, uniform and high coverage of target regions enhances variant detection, particularly in difficult genomic regions.
Table 1
Best practices and recommendations for germline small variant detection using WES. Median performance values from the high read-quality group are displayed for capture efficiency metrics and sequencing quality control metrics.
Process and pipeline | Recommendations

Library construction
Fragmentation method:
• No significant differences between ultrasonic and enzymatic methods
Insert fragment size:
• At least twice the read length for optimal results
• Longer fragment sizes lead to higher on-target base rates, precision, and recall

Target capture
Capture efficiency metrics:
• Fold-80 penalty < 1.8
• On-target bases rate > 55%
• Coverage at 100× > 69%
Capture kit:
• Ensure that the target regions cover the variants of interest
• Capture efficiency determines the performance of the capture kit
• Uniform and high coverage of the target regions improves variant detection in complex and repetitive regions
• Although the capture kit is an important factor for variant calling performance, experimental procedures also play a crucial role; some labs achieve high performance with a given kit, while others perform poorly with the same kit

Sequencing
Quality control metrics:
• Pre-alignment
o Deduplicated sequencing depth > 100×
o Read duplication rate < 22%
• Contamination
o Human genome mapping rate > 98%
o Mapping rate for other organisms < 1%
o No-hit rate < 1%
• Post-alignment
o Mapping rate > 0.999
o Mismatch rate < 0.003
o Q30 rate > 0.934
Platform:
• No significant differences between sequencing platforms

Data analysis
Quality control:
• Check for mislabeling
• Check for cross-sample contamination
Read preprocessing:
• Adapter removal: about one third of labs had adapter content greater than 10%
Read alignment:
• BWA-MEM is the most popular mapper
Post-alignment process:
• No significant differences between including or excluding GATK-recommended BQSR
• Some callers do not require BQSR, such as DeepVariant, Sentieon DNAscope, and Strelka2
Variant calling:
• Precision: DeepVariant ≈ Sentieon DNAscope ≈ Strelka2 > HaplotypeCaller
• Recall: HaplotypeCaller > DeepVariant ≈ Sentieon DNAscope ≈ Strelka2
• Higher read quality results in better performance, regardless of the caller choice
• High read quality does not rule out mislabeling or cross-sample (human) contamination
Variant filtration:
• Retain variants within target regions
• Depth > 5
• Stringent filtration thresholds slightly improve precision but significantly lower recall
• Keeping only variants detected by multiple callers slightly improves precision but significantly lowers recall
Sequencing platform does not make a significant difference in variant detection performance, but sequencing data with higher read quality consistently lead to higher variant calling performance. We provide cutoff values for quality metrics important for quality control of sequencing datasets; they are median values from the high read-quality group and cover pre-alignment quality control metrics, post-alignment quality control metrics, and library contamination. For pre-alignment metrics, a good library is expected to have high base quality scores, low levels of GC bias, low N base content, and no overrepresented sequences. Although we required a sequencing depth of at least 100×, some labs generated sequencing datasets with a sequencing depth exceeding 1000x. Excessively high sequencing depth does not ensure higher variant calling performance; instead, it leads to a higher read duplication rate, thus wasting sequencing data. Uniform coverage with high sequencing depth is more important for accurate variant calling. For the library contamination check, the most obvious issue was vector contamination: over one third (39) of the labs had greater than 1% of reads mapped to vectors. FastqScreen can only detect contamination from other organisms; it will not identify cross-sample contamination. We found one high read-quality lab with cross-sample contamination, resulting in significantly reduced variant calling performance. This problem could be avoided by handling each sample carefully and preventing carryover caused by aerosols.
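The Table 1 cutoffs can be wired into a simple QC gate (a sketch; the metric names are our own, and the thresholds are the medians quoted above):

```python
# Cutoffs taken from Table 1 (medians of the high read-quality group).
QC_THRESHOLDS = {
    "dedup_depth_min": 100,     # deduplicated sequencing depth (x)
    "dup_rate_max": 0.22,       # read duplication rate
    "human_map_rate_min": 0.98, # contamination check
    "mapping_rate_min": 0.999,  # post-alignment
    "mismatch_rate_max": 0.003,
    "q30_min": 0.934,
    "fold80_max": 1.8,          # capture efficiency
    "on_target_min": 0.55,
    "cov100x_min": 0.69,
}

def qc_failures(m, t=QC_THRESHOLDS):
    """Return the names of the Table 1 checks a library fails."""
    checks = [
        ("dedup_depth", m["dedup_depth"] >= t["dedup_depth_min"]),
        ("dup_rate", m["dup_rate"] <= t["dup_rate_max"]),
        ("human_map_rate", m["human_map_rate"] >= t["human_map_rate_min"]),
        ("mapping_rate", m["mapping_rate"] >= t["mapping_rate_min"]),
        ("mismatch_rate", m["mismatch_rate"] <= t["mismatch_rate_max"]),
        ("q30", m["q30"] >= t["q30_min"]),
        ("fold80", m["fold80"] <= t["fold80_max"]),
        ("on_target", m["on_target"] >= t["on_target_min"]),
        ("cov100x", m["cov100x"] >= t["cov100x_min"]),
    ]
    return [name for name, ok in checks if not ok]

# A hypothetical library that passes everything except coverage uniformity.
lab = {"dedup_depth": 120, "dup_rate": 0.18, "human_map_rate": 0.99,
       "mapping_rate": 0.9995, "mismatch_rate": 0.002, "q30": 0.95,
       "fold80": 2.1, "on_target": 0.60, "cov100x": 0.72}
failed = qc_failures(lab)
```

A gate like this flags uniformity problems (fold-80) even when depth and mapping metrics look fine, which matches the finding that uniform coverage matters more than raw depth.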
When receiving sequencing data, in addition to the quality control checks mentioned above, it is crucial to address mislabeling issues, as they can fundamentally impact scientific results. This can be verified using pedigree information or replicates if available. Prior to aligning the reads to the reference genome, adapters should be removed, as over one-third of the labs produced sequencing datasets with adapter content exceeding 10% of the reads. BWA is the most popular alignment tool, and several studies have reported that the mapper does not significantly contribute to variant calling performance9, 15. In the post-alignment process, no significant differences were observed between including or excluding BQSR, and some callers do not require BQSR, such as DeepVariant, Sentieon DNAscope, and Strelka2. Higher read quality consistently results in better performance, regardless of the caller chosen. DeepVariant, Sentieon DNAscope, and Strelka2 demonstrate similar performance, exhibiting higher precision but lower recall compared to GATK HaplotypeCaller. After calling variants, it is essential to retain only those within the target regions, as variants reported outside the targeted region are more likely false positives. Most labs apply overly stringent variant filtration methods, resulting in a slight improvement in precision but a significant loss in recall. Similarly, retaining variants supported by multiple callers will slightly improve precision but significantly decrease recall.