Overview of the Nationwide WES proficiency test in China
To assess both inter-lab and intra-lab performance in detecting germline small variants, four Quartet DNA reference materials22 (Quartet_M8 (mother), Quartet_F7 (father), and Quartet_D5 and Quartet_D6 (monozygotic twin daughters)), each with three replicates, were anonymously distributed to participating labs in a random order (Fig. 1a). Each lab employed its own WES workflow to sequence the provided reference materials, adhering to a minimum sequencing depth requirement of 100x, and subsequently analyzed and reported germline variants. Each lab then submitted metadata containing essential experimental and analytical details, raw sequencing data (FASTQ files), variant calling results (VCF files), and targeted regions (BED files) (Methods).
A total of 103 independent labs participated in this nationwide multi-center WES proficiency test in China, with 89 labs returning their data before the deadline (Fig. 1b). Each lab employed a different workflow, integrating diverse combinations of experimental and analytical methods along with various parameter settings. Analysis of the metadata provided insights into the methods utilized by these labs, allowing us to identify the most widely accepted approaches (complete metadata for each lab are available in Supplementary Table 1). Because extracted DNA reference materials were distributed to each lab, the proficiency test began with DNA fragmentation and did not cover the DNA extraction process. Of the 89 labs, 50 used ultrasonic methods for fragmentation, while 39 opted for enzymatic approaches (Fig. 1c). Following fragmentation, a library was constructed through adaptor ligation, and exons were targeted and captured using different capture kits. More than 12 capture kits were utilized, with IDT xGEN Exome Panel (28), Agilent SureSelect (22), and Roche KAPA HyperExome (12) being the most commonly used (Fig. 1c, Supplementary Table 2). After amplifying captured targets through qPCR and conducting quality control, the resulting captured library underwent high-throughput sequencing. Sixty-seven labs used Illumina sequencing platforms, while 22 labs employed MGI sequencing platforms. Illumina NovaSeq emerged as the most widely used, with 61 out of 89 labs adopting this platform (Fig. 1c). Computational analysis of the WES data began with raw read preprocessing, involving the removal of adapters and the filtration of low-quality reads. Fastp24 (45) was the most popular preprocessing tool, followed by Cutadapt25 (16) and Trimmomatic26 (13) (Fig. 1d). BWA27 (77) was employed by almost all labs for aligning reads to the human reference genome (Fig. 1d). For variant calling, at least 18 pipelines were utilized.
GATK germline variant calling best practices with HaplotypeCaller28 (64) was the most widely used, followed by Sentieon DNAscope29 (7), Strelka230 (5), Freebayes31 (3), DeepVariant32 (2), and VarScan33 (2). Five labs reported variants supported by two or more callers (Fig. 1d).
We should note that the targeted regions used by participating labs for WES did not cover 100% of the exome regions (Supplementary Table 1). Typically, they encompassed 76.2–89.2% of the exome regions (Fig. 1e). Among the participating labs, targeted regions spanned from 32.7 Mb to 100.8 Mb. While some labs utilized the same capture kits, their targeted regions varied due to the inclusion of custom targets or the addition or subtraction of 10 to 50 bp on each side of the targets. A total of 23.4 Mb of targeted regions was covered by all 85 labs (see below), encompassing about 60% of the exome regions. We evaluated the coverage of clinically significant genes (a total of 2,773 genes) within the targeted regions for each capture kit, encompassing 2,742 genes from ClinGen (downloaded on 2023-09-21), 171 genes from COGIV34 (catalogue of germline variants in cancer), and 157 genes from TCGA (pathogenic germline variants in cancer)35 (Supplementary Table 3). The targeted regions of different capture kits spanned 32.6–46.8% of the clinically important gene regions (Extended Data Fig. 1), covering 2,706 (97.4%) to 2,736 (98.4%) genes. Less than 6% of the targeted regions for all capture kits were identified as repetitive regions. The NadPrep Exome Plus Panel and IGT v1 covered more repetitive regions than the others (Extended Data Fig. 1).
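The overlap percentages above reduce to interval arithmetic on BED records. A minimal sketch of computing how much of one region set is covered by a lab's targets, with illustrative coordinates rather than real exome intervals:

```python
def merge_intervals(ivs):
    """Merge overlapping (start, end) intervals on one chromosome."""
    merged = []
    for s, e in sorted(ivs):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def overlap_bp(a, b):
    """Total base pairs shared by two interval lists (sweep over merged lists)."""
    a, b = merge_intervals(a), merge_intervals(b)
    total, i, j = 0, 0, 0
    while i < len(a) and j < len(b):
        s = max(a[i][0], b[j][0])
        e = min(a[i][1], b[j][1])
        if s < e:
            total += e - s
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

# Half-open, BED-style intervals; values are illustrative only.
targets = [(100, 200), (150, 300), (500, 600)]
exome = [(180, 550)]
shared = overlap_bp(targets, exome)                               # 170 bp
covered = shared / sum(e - s for s, e in merge_intervals(exome))  # fraction of exome hit
```

In practice the same sweep would run per chromosome over BED files from the lab and from RefSeq or GENCODE exon annotations.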
We carefully checked the files returned by participating labs for compliance before analysis. Upon examining their target regions, we found that three labs utilized excessively small targeted regions (13.1 Mb, 13.1 Mb and 13.7 Mb), significantly smaller than the exome regions (RefSeq release 210: 38.5 Mb and GENCODE v39: 40.1 Mb). Additionally, one lab’s targeted regions covered 30.5 Mb but showed only a 54% overlap with the exome regions in the RefSeq or GENCODE databases (Fig. 1e). Among the remaining 85 labs, one reported an incorrect format for all FASTQ files, another lab reported an incorrect FASTQ format for two replicates, and two labs submitted VCF files containing only plausible disease-causing variants instead of all germline small variants detected in the targeted regions. Consequently, we utilized 996 VCF files from 83 labs to evaluate the analytical validity of WES among sequencing service providers and clinical labs in China. Simultaneously, we used 2,012 paired-end FASTQ files from 84 labs to investigate the quality of sequencing data and the impact of experimental and analytical factors on variant calling accuracy.
Read quality and capture efficiency
We performed pre-alignment quality control on raw reads using FastQC36 to evaluate read quality and FastqScreen37 to detect any library contamination. Post-alignment quality control and examination of capture efficiency were conducted on alignments to identify potential issues in library preparation and sequencing that could impact variant detection accuracy. We observed quality variations during the sequencing of identical DNA reference materials across different labs, even among those using identical capture kits and sequencing platforms (Fig. 2).
A sequencing depth of 100x was required, and 82 labs met this threshold in raw sequencing depth. After removing duplicated reads, 80 labs had a sequencing depth greater than 100x; the lowest observed deduplicated sequencing depth was 44x. The duplication rate of reads ranged from 2.5–52.3% (median = 21.5%). We observed a positive correlation between increased sequencing depth and a higher duplication rate (Pearson’s correlation coefficient, r = 0.39). Despite one lab achieving the highest raw sequencing depth of 1676x, its duplication rate peaked at 46.9%, indicating wasted sequencing data. Labs employing the Illumina platforms exhibited a higher duplication rate than those using the MGI platforms (24.3% versus 9.7%).
For raw read quality control, all libraries consistently exhibited high base-quality scores, low levels of GC bias, low N base content, and overrepresented sequences accounting for less than 1% of the total (Extended Data Fig. 2). Three major issues were identified. First, one third of the labs (27 out of 84) reported raw reads containing adapter sequences comprising more than 10% of all reads. Second, biased sequences in the first 12 bases of the run were observed in 36 labs, indicating biased selection of fragments during the random priming step in library preparation38. Third, during contamination checks, 19 labs had over 5% of reads mapped to both the human genome and the vector, and three labs had less than 95% of reads mapped to the human genome (Extended Data Fig. 3).
To evaluate the post-alignment quality of the sequence data from the participating labs, all raw reads were aligned to the reference genome (GRCh38) using BWA. Overall, all labs demonstrated a high mapping rate (the percentage of reads aligned to the reference sequence, median = 99.5%) and a low mismatch rate (median = 0.3%). However, despite a higher mapping rate of the MGI platforms compared to that of the Illumina platforms (99.97% versus 99.92%), the MGI platforms exhibited a higher mismatch rate (0.5% versus 0.3%) and lower Q30 rate (90.1% versus 93.9%). To prevent the generation of excessive duplicate sequences, the insert size should be at least twice the read length39. For instance, with a read length of 150 bp, the insert size should be a minimum of 300 bp, and for a read length of 100 bp, it should be at least 200 bp. However, only 11 labs met the minimum insert size requirement (Fig. 2).
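The insert-size rule of thumb above can be expressed as a one-line check (a sketch; in practice the median insert size would come from post-alignment QC output such as an insert-size histogram):

```python
def insert_size_ok(median_insert: int, read_length: int) -> bool:
    # Rule of thumb from the text: the insert should be at least twice
    # the read length, so paired reads do not overlap each other.
    return median_insert >= 2 * read_length

assert insert_size_ok(320, 150)       # 150 bp reads: a 320 bp insert passes
assert not insert_size_ok(260, 150)   # 260 bp is below the 300 bp minimum
assert insert_size_ok(200, 100)       # 100 bp reads: >= 200 bp passes
```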
Efficient exome capture is important for ensuring high-quality data. Inefficient target capture can lead to regions with insufficient read depth, necessitating additional sequencing to maintain data coverage. To evaluate the capture efficiency of each WES library, we assessed targeted coverage, defined as the percentage of target regions with at least 100x read depth. We observed considerable differences in the coverage of target regions across various labs, ranging from as low as 13.9% to as high as 98.1%, with a median of 66.5%. Approximately one fourth (24) of the labs achieved a minimum of 80% coverage with at least 100x read depth. The on-target bases rate, reflecting the percentage of bases mapped to target regions, ranged from 42.8–76.7% with a median of 55.9%. The fold-80 base penalty was used to assess coverage uniformity, indicating the extra sequencing required to achieve 80% coverage of the target bases at the mean coverage level. Values above 1 indicate non-uniform coverage. Across all WES libraries, the fold-80 penalty ranged from 1.3 to 4.4, with a median of 1.8. Notably, only 18 labs exhibited a fold-80 penalty below 1.5, and six labs recorded a high fold-80 penalty greater than 3 (Fig. 2).
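The fold-80 base penalty can be approximated from a per-base depth distribution as the mean coverage divided by the depth that 80% of target bases reach (a sketch using a crude percentile; production tools such as Picard compute it from the full coverage histogram):

```python
def fold80_penalty(depths):
    """Mean target coverage divided by the depth exceeded by 80% of
    target bases (the 20th percentile of the depth distribution).
    Values near 1 indicate uniform coverage; larger values mean more
    extra sequencing is needed to lift 80% of bases to the mean."""
    d = sorted(depths)
    mean = sum(d) / len(d)
    q20 = d[int(0.2 * (len(d) - 1))]  # crude percentile, fine for a sketch
    return mean / q20

# Perfectly uniform coverage gives a penalty of 1.0.
uniform = fold80_penalty([100] * 10)
# A library with under- and over-covered bases is penalized:
# mean = 110x, but 80% of bases only reach 50x -> penalty 2.2.
uneven = fold80_penalty([50, 50] + [100] * 6 + [200, 200])
```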
Based on read quality and capture efficiency, we categorized the 84 labs with compliant FASTQ files into three quality groups: 25 were designated as low quality, 38 as middle quality, and 21 as high quality (Methods). Generally, high-quality sequencing data exhibited lower read duplication rates, increased sequencing depth after removing duplicated reads, higher mapping rates, lower mismatch rates, elevated Q30 rates, improved coverage, higher rates of on-target bases, and a reduced fold-80 penalty (Extended Data Fig. 4). When library contamination status was considered, we found that 6 out of 25 (24%) labs in the low-quality group and 5 out of 38 (13%) labs in the middle-quality group had high contamination, while none in the high-quality group did (Extended Data Fig. 5).
Performance of small variant detection varied across labs
We evaluated the performance of variant calling results from each participating lab, generated from their own customized bioinformatic pipelines. First, we tallied the number of variants detected by each lab. Due to variations in target region sizes and bioinformatic pipelines, significant differences were observed in the number of detected variants among the labs. The number of detected variants ranged from 13,069 to 82,887 SNVs and 270 to 16,544 Indels after variant filtration (Extended Data Fig. 6a). Thirty-three labs reported variants outside the target regions, with 20 of them reporting as many or more variants outside the target regions than inside them (Extended Data Fig. 6b), because these labs added 50–200 bp padding regions to the target intervals when detecting variants. Comparing the accuracy of variants reported inside and outside target regions, we found that the Mendelian concordance rates of variants outside the target regions were lower than those inside (SNV: 0.91 versus 0.97, Indel: 0.72 versus 0.83) (Extended Data Fig. 7). This highlights the need for cautious interpretation of variants detected outside the target regions, requiring orthogonal confirmation, especially if they are clinically relevant.
We assessed the accuracy of variants present only within the target regions. Notably, 86.4–99.3% of SNVs and 43.0–93.7% of indels were located within the high-confidence bed regions, or benchmark regions, defined for the Quartet DNA reference materials. The accuracy of these variants can be evaluated by comparing them with benchmark variants. For the remaining SNVs (4.6% on average) and indels (26.6% on average) outside the benchmark regions, we estimated their precision by examining the Mendelian concordance among the Quartet reference samples (parents and twin daughters). Since there are typically only around 30 de novo variants per generation, most Mendelian-inconsistent variants are deemed to be false positives caused by sequencing or alignment errors40.
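The Mendelian-concordance logic for the Quartet design reduces to two checks: the monozygotic twins must agree, and their shared genotype must be assemblable from one allele of each parent. A minimal sketch over genotype tuples (real pipelines operate on multi-sample VCFs and handle missing calls):

```python
from itertools import product

def mendelian_consistent(father, mother, d5, d6):
    """A variant call is Mendelian-consistent in the Quartet design when
    the monozygotic twins (D5, D6) agree and their shared genotype can
    be formed by one allele from each parent. Genotypes are allele
    tuples, e.g. ('A', 'G') for a heterozygous call."""
    if sorted(d5) != sorted(d6):
        return False
    child = sorted(d5)
    return any(sorted((fa, mo)) == child
               for fa, mo in product(father, mother))

# A het child of an A/A father and G/G mother is consistent...
ok = mendelian_consistent(('A', 'A'), ('G', 'G'), ('A', 'G'), ('A', 'G'))
# ...but a twin mismatch is flagged as a likely false positive.
bad = mendelian_consistent(('A', 'A'), ('A', 'A'), ('A', 'G'), ('A', 'A'))
```

The Mendelian concordance rate is then the fraction of detected variants passing this check across a lab's call set.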
Compared to SNVs, the overall detection accuracy of indels was lower (Figs. 3a, b). The precision of SNVs was 0.997 ± 0.048, with all labs except two achieving an SNV detection precision above 0.983. For indels, the precision was 0.935 ± 0.053. The recall of SNVs was 0.974 ± 0.104, and for indels, it was 0.900 ± 0.142, which were notably lower than precision. This suggests that stringent filtration was applied by most participating labs, potentially resulting in the loss of many true variants. The Mendelian concordance rate was used to estimate the precision of variants detected both inside and outside high-confidence bed regions (SNV: 0.972 ± 0.087, Indel: 0.769 ± 0.111), which was much lower than the precision of variants only detected inside high-confidence bed regions. This discrepancy was likely due to variants in complex and repetitive genomic regions. For detection performance of small variants within clinical genes, the precision (0.983 ± 0.072) and recall (0.964 ± 0.128) of SNVs were much higher than those for Indels (precision: 0.925 ± 0.069; recall: 0.910 ± 0.137) (Extended Data Fig. 8a). Among the 2,773 genes we investigated, the HLA gene loci (HLA-B: 22, HLA-DQA1: 11, HLA-C: 8), SMPD1 (8), KMT2C (8), SEC63 (6) and DSPP (6) had more false positives compared to other genes. This is due to the presence of repetitive genomic regions within these genes, which can cause mapping errors (Extended Data Fig. 8b).
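The precision, recall, and F1 values used throughout follow from set comparison against the benchmark calls. A sketch over hashable variant keys (the coordinates are hypothetical):

```python
def benchmark_metrics(called, truth):
    """Precision/recall/F1 of a call set against benchmark variants,
    comparing (chrom, pos, ref, alt) keys inside benchmark regions."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)                       # true positives
    precision = tp / len(called) if called else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr2", 50, "G", "A")}
calls = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr3", 10, "T", "C")}                   # one FP, one FN
p, r, f1 = benchmark_metrics(calls, truth)
```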
We investigated the reasons contributing to performance differences among labs by considering both the F1 score and Mendelian concordance rate (MCR) (Figs. 3c, d). According to the distribution of F1 scores and MCR, labs fell into four regions. Region 1 comprised labs with high F1 scores and high MCR. Region 2 represented labs with high MCR but low F1 scores. Region 3 included labs with high F1 scores but low MCR. Region 4 comprised labs with low F1 scores and low MCR. Labs in Region 1 displayed F1 scores and Mendelian concordance rates above the median values, suggesting excellent variant detection performance. Fourteen labs exhibited high F1 scores and Mendelian concordance rates for both SNVs and Indels, indicative of high or moderate read quality and lower contamination levels. In Region 2, 13 labs exhibited F1 scores below the median values but Mendelian concordance rates surpassing the median values for SNV detection, and 19 labs for Indel detection. The diminished F1 scores were primarily due to low recall caused by stringent filtration thresholds. In Region 3, 14 labs demonstrated F1 scores exceeding the median values but Mendelian concordance rates below the median values for SNV detection, with 30 labs exhibiting this pattern for Indel detection. The reduced Mendelian concordance rates were primarily associated with target regions encompassing repetitive genomic regions. Region 4 contained labs with both F1 scores and Mendelian concordance rates below the median values for both SNV and Indel detection. This outcome was likely influenced by multiple factors, including low read quality or high contamination, particularly affecting Indel detection; overly stringent filtration thresholds might have exacerbated these issues. Additionally, some labs displayed inconsistent performance across replicates (Extended Data Fig. 9), potentially reflecting sequencing failures for certain samples that diminished the overall Mendelian concordance rate.
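The four-region assignment can be sketched as a median split on both metrics (lab names and values are illustrative):

```python
from statistics import median

def assign_regions(labs):
    """Place each (name, f1, mcr) lab into the four regions of the
    F1-vs-MCR plane using the cohort medians as the split points."""
    f1_med = median(f1 for _, f1, _ in labs)
    mcr_med = median(mcr for _, _, mcr in labs)
    regions = {}
    for name, f1, mcr in labs:
        if f1 >= f1_med and mcr >= mcr_med:
            regions[name] = 1   # high F1, high MCR: best performers
        elif f1 < f1_med and mcr >= mcr_med:
            regions[name] = 2   # low F1: often over-stringent filtration
        elif f1 >= f1_med and mcr < mcr_med:
            regions[name] = 3   # low MCR: often repetitive target regions
        else:
            regions[name] = 4   # both low: multiple quality issues
    return regions

labs = [("lab_a", 0.99, 0.97), ("lab_b", 0.90, 0.98),
        ("lab_c", 0.99, 0.80), ("lab_d", 0.88, 0.75)]
regions = assign_regions(labs)
```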
We identified three labs in Region 4 with high read quality but significantly low F1 scores and Mendelian concordance rates. We discovered the reasons by clustering the Quartet reference samples from all labs using their variant calling results (Fig. 4). One lab had mislabeled the samples: Quartet_F7 and Quartet_M8 were clustered with the twins Quartet_D5 and Quartet_D6. Another lab likely experienced contamination from external DNA samples rather than Quartet samples, as its variant profiles differed markedly from those of other labs, and the variants called from the monozygotic twins were not identical. Such mislabeling and cross-contamination cannot be detected using pre-alignment quality control methods but only through the Quartet family design. We corrected the mislabeling issue and assigned the contaminated lab to the low-quality group for further analysis. Additionally, we observed that the variant calling results from a third lab maintained the Quartet family relationship but differed significantly from variants called by other labs. This lab applied overly stringent filtration thresholds, removing many true variants and retaining fewer variants, resulting in high precision (0.98) but very low recall (0.28).
Impact of bioinformatic factors on variant detection performance
The participating labs employed various variant calling pipelines and filtration methods. We first investigated the impact of analytical factors on variant calling performance. We aligned the submitted raw reads to the reference genome (GRCh38) using BWA. To adjust for differences in sequencing depth across labs, the mapped reads were then normalized to mean 100x coverage and subjected to various pipelines for variant calling. We evaluated the performance of five commonly used variant calling pipelines among the participating labs: GATK HaplotypeCaller28, Sentieon HaplotypeCaller41, Sentieon DNAscope29, Strelka230, and DeepVariant32. Among the 64 labs that employed HaplotypeCaller, 24 did not apply the recommended post-alignment process, base quality score recalibration (BQSR). We further investigated the impact of including or excluding BQSR on variant calling accuracy. These analyses resulted in a total of 6036 call sets (1006 samples × 6 pipelines) (Methods).
We assessed the performance of each raw call set without filtration, based on precision and recall in the benchmark regions and the Mendelian concordance rate across the entire target regions (Figs. 5a, b). For all bioinformatic pipelines, higher-quality reads showed greater accuracy in both SNV and Indel detection. Among the pipelines evaluated, DeepVariant, Sentieon DNAscope, and Strelka2 achieved higher precision compared to Sentieon HaplotypeCaller and GATK HaplotypeCaller. DeepVariant and Sentieon DNAscope maintained high precision in the benchmark regions even when handling low-quality reads for both SNV and Indel detection. Although HaplotypeCaller exhibited lower precision compared to the other three callers, it achieved a higher recall rate. This is attributed to its identification of more variants (median number of variants detected across all call sets: 14,928 versus 14,639), suggesting that variant filtration may be necessary to improve precision. Sentieon and GATK HaplotypeCaller exhibited comparable performance, although GATK HaplotypeCaller demonstrated lower precision in Indel detection. The inclusion or exclusion of the BQSR post-alignment process did not significantly impact the variant calling performance. Performance varied significantly across different complex and repetitive regions, with SNVs showing higher performance in low-complexity and simple repeat regions, while Indels performed better in the long terminal repeat (LTR) regions (Fig. 5c). In clinically relevant gene regions, the SNV F1 score was 0.993 and the Indel F1 score was 0.929. Sentieon DNAscope had the highest performance, achieving an SNV F1 score of 0.994 and an Indel F1 score of 0.971.
A total of 33 labs identified variants outside of the targeted regions by expanding the boundaries of target regions, adding padding ranging from 50 bp to 200 bp. We assessed the precision of variants detected within these padding regions using the Mendelian concordance rate, analyzed in 50 bp bins (Fig. 5d). Comparing variants identified within the target regions with those in the padding regions, we observed a significant decrease in precision for variants located farther from the boundaries of the target regions. This suggests potential false positives in these padding regions, emphasizing the need for careful interpretation.
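The padding analysis amounts to binning each off-target variant by its distance to the nearest target boundary. A sketch with illustrative boundary coordinates:

```python
import bisect

def distance_bin(pos, target_edges, bin_bp=50):
    """Bin an off-target variant by its distance (bp) to the nearest
    target-region boundary, in 50 bp bins (bin 0 = 0-49 bp, etc.).
    target_edges is the flattened, sorted list of start/end positions."""
    i = bisect.bisect_left(target_edges, pos)
    dists = []
    if i < len(target_edges):
        dists.append(target_edges[i] - pos)   # boundary at or after pos
    if i > 0:
        dists.append(pos - target_edges[i - 1])  # boundary before pos
    return min(dists) // bin_bp

# Boundaries of two targets, [1000, 1200) and [5000, 5300); illustrative.
edges = [1000, 1200, 5000, 5300]
near = distance_bin(1230, edges)   # 30 bp past a boundary -> bin 0
far = distance_bin(1320, edges)    # 120 bp past a boundary -> bin 2
```

Grouping padding-region variants by this bin and computing the Mendelian concordance rate per bin reproduces the decay-with-distance analysis.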
A common strategy employed to enhance the accuracy of variant detection for clinical samples is to utilize the intersection of two or more variant callers. We investigated the variant detection performance by employing the intersection of any combinations of the four callers on high-quality reads: DeepVariant, Sentieon DNAscope, Strelka2, and Sentieon HaplotypeCaller (Fig. 5e). Utilizing the intersection of two or more callers resulted in a slight increase in precision, while recall dropped rapidly. When dealing with low-quality sequencing datasets, employing the intersection of multiple callers did not significantly enhance precision, and there was a slight decrease in recall as the number of intersecting callers increased (Extended Data Fig. 10).
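The intersection strategy, and its precision/recall trade-off, can be illustrated with toy call sets (variant keys and the "fp*" false positives are invented):

```python
def intersect_callsets(callsets):
    """Keep only variants reported by every caller in the combination."""
    kept = set(callsets[0])
    for cs in callsets[1:]:
        kept &= set(cs)
    return kept

truth = {"v1", "v2", "v3"}
deepvariant = {"v1", "v2", "v3", "fp1"}
dnascope = {"v1", "v2", "v3", "fp2"}
strelka2 = {"v1", "v2", "fp3"}       # misses v3

# Two callers remove the caller-specific false positives...
two = intersect_callsets([deepvariant, dnascope])
# ...but adding a third also drops a true variant that one caller missed.
three = intersect_callsets([deepvariant, dnascope, strelka2])
recall_two = len(two & truth) / len(truth)
recall_three = len(three & truth) / len(truth)
```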
We also examined the impact of variant filtration on variant calling performance by comparing the results obtained from the same bioinformatic pipeline without filtration (Sentieon HaplotypeCaller with BQSR) to the submitted variant calling results after filtration. For most participating labs, F1 scores from both pipelines were similar. However, we observed a significant increase in the F1 scores of 13 labs when utilizing the Sentieon pipeline without filtration compared to their submitted variant calling results with filtration. Only very few labs submitted filtered variant calling results with higher F1 scores than the corresponding unfiltered variants. This suggests that most labs may have employed overly stringent filtration thresholds, leading to the exclusion of many true variants and thereby greatly reducing recall without a significant improvement in precision (Extended Data Fig. 11).
One of the most inconsistent variant filtration parameters across labs was the depth threshold, which ranged from 0 to 30. We assessed the accuracy of variants across different depth filtration thresholds using Sentieon HaplotypeCaller with the 100x down-sampled datasets. Up to a depth filtration threshold of 5, the precision and recall for both SNVs and Indels remained essentially unchanged. However, starting from a threshold of 10, the precision showed a slight increase, while the recall decreased rapidly (Extended Data Fig. 12).
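The threshold sweep can be sketched as follows (variant keys, depths, and the truth set are invented for illustration; the shape of the result mirrors the observation above):

```python
def sweep_depth_filter(calls, truth, thresholds=(0, 5, 10, 20, 30)):
    """Precision/recall at several DP filtration cutoffs. `calls` is a
    list of (variant_key, depth) pairs; `truth` is a set of keys."""
    out = {}
    for t in thresholds:
        kept = {v for v, dp in calls if dp >= t}
        tp = len(kept & truth)
        out[t] = (tp / len(kept) if kept else 0.0,  # precision
                  tp / len(truth))                  # recall
    return out

truth = {"v1", "v2", "v3", "v4"}
calls = [("v1", 80), ("v2", 40), ("v3", 12), ("v4", 7), ("fp1", 6)]
metrics = sweep_depth_filter(calls, truth)
# Up to DP >= 5 nothing changes; at DP >= 10 the low-depth false positive
# is removed (precision up) but a true low-depth variant is lost too.
```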
Impact of experimental factors on variant detection performance
In evaluating the influence of wet-lab experimental factors on variant detection performance, we considered factors such as DNA input amount, DNA fragmentation method, DNA shearing time, capture kits, library sizes and concentrations, and sequencing platforms. To minimize the effects of bioinformatic analytical factors, we relied on performance metrics (precision, recall, and Mendelian concordance rate) obtained from a single caller (DeepVariant). Because the datasets originated from different labs, some were of high quality while others were not; to mitigate the influence of labs with evident experimental issues, we excluded labs with obvious contamination from other samples from this analysis.
We first examined the impact of experimental factors on variant calling performance using principal variance components analysis (PVCA). Capture kits contributed the most to explaining the variance in SNV (65.6%) and Indel (51.7%) detection, while sequencing platforms and fragmentation methods had no significant effect on variant calling performance (Figs. 6a, b). Notably, MGI platforms exhibited slightly lower performance compared to Illumina platforms in detecting Indels (t-test, p = 0.014). This difference may be attributed to the fact that more labs using Illumina platforms generated higher-quality sequencing datasets, and performance was influenced by multiple experimental factors beyond just the choice of sequencing platform. Among the capture kits evaluated, KAPA, IDT v1, IDT v2, NanoWES, and NadPrep achieved higher performance for both SNV and Indel detection compared to Agilent MyGenostics and IGT v1. This superiority was attributed to their lower fold-80 penalty, higher coverage rate over 100x, and higher on-target bases rate (Extended Data Fig. 13). Four popular capture kits, KAPA, Agilent v6, IDT v1, and IDT v2, showed a wide performance range. This suggests that while capture kits play a significant role in variant detection performance, experimental procedures also exert a substantial influence. The best performance achieved an F1 score of 0.995 for SNV and 0.973 for Indel, while the worst was 0.967 for SNV and 0.810 for Indel.
We next investigated how experimental factors affect the final variant calling performance by examining the correlations among experimental factors, pre-alignment and post-alignment quality metrics, and variant calling performance metrics (Figs. 6c, d). Various factors influenced the final variant calling performance, with some factors correlated with each other. Among all considered factors, the fold-80 penalty significantly affected both precision and recall of variant calling results, demonstrating a notable negative correlation (Pearson’s correlation coefficient, precision: SNV: r = -0.6, Indel: r = -0.7; recall: SNV: r = -0.4, Indel: r = -0.6; p < 0.0001). When the fold-80 penalty was less than 1.5 and there were no other obvious failures in the sequencing data, SNV precision exceeded 0.997, Indel precision surpassed 0.970, SNV recall was above 0.983, and Indel recall remained above 0.915. The fold-80 penalty of Agilent v6 showed large variability among participating labs, influenced by insert size: as the insert size increased, the fold-80 penalty decreased. Additionally, the fold-80 penalty was positively correlated with read duplication rates and negatively correlated with library concentration. The percentage of target regions with coverage over 100x exhibited a strong positive correlation (Pearson’s correlation coefficient, r = 0.5, p < 0.0001) with MCR, suggesting that higher coverage improves variant calling precision in complex genomic regions. Library size negatively correlated with recall, indicating that increased DNA shearing time, which results in shorter DNA fragment sizes, led to a decreased recall rate. A longer insert size also improved the on-target bases rate.
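The correlations quoted above use the plain Pearson coefficient; a dependency-free sketch with illustrative lab-level values (not the study's actual data):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical lab-level values: fold-80 penalty vs SNV precision.
fold80 = [1.3, 1.5, 1.8, 2.5, 3.0, 4.4]
precision = [0.998, 0.997, 0.995, 0.990, 0.985, 0.975]
r = pearson_r(fold80, precision)   # strongly negative, as in the text
```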
Best practices for germline small variant detection with WES
Based on our comprehensive analysis of the largest real-world WES datasets, generated across 89 labs with the well-characterized Quartet DNA reference materials, we have compiled best practices and recommendations for detecting germline small variants using WES (Table 1). Given that the Quartet DNA reference materials have a high DNA integrity number (DIN exceeding 8.5) and median fragment sizes over 60 kb, the best practices begin with DNA fragmentation, excluding DNA extraction. Ultrasonic and enzymatic methods are commonly used for DNA fragmentation, with no significant differences observed in their impact on variant detection performance. During library construction, fragment size emerges as a crucial factor: shorter DNA shearing times result in longer fragment sizes, leading to higher on-target base rates and thereby enhancing precision and recall. After initial library construction, the next step is target capture. It is crucial to select the right capture kit, as different kits cover varying target regions; ensuring that the target regions include the variants of interest is essential. Capture efficiency metrics such as the fold-80 penalty, on-target bases rate, and coverage are pivotal in selecting capture kits, favoring those with lower fold-80 penalties and higher on-target rates and coverage. While the choice of capture kit significantly impacts variant calling performance, human intervention during experimental procedures also plays a vital role: some labs achieved high performance with a given capture kit, while others performed sub-optimally with the same kit. Furthermore, uniform and high coverage of target regions enhances variant detection, particularly in difficult genomic regions.
Table 1
Best practices and recommendations for germline small variant detection using WES. Median performance values from the high read-quality group are displayed for capture efficiency metrics and sequencing quality control metrics.
Process and pipeline | Recommendations

Library construction
Fragmentation method:
• No significant differences between ultrasonic and enzymatic methods
Insert fragment size:
• At least twice the read length for optimal results
• Longer fragment sizes lead to higher on-target base rates, precision, and recall

Target capture
Capture efficiency metrics:
• Fold-80 penalty < 1.8
• On-target bases rate > 55%
• Coverage at 100× > 69%
Capture kit:
• Ensure that the target regions cover the variants of interest
• Capture efficiency determines the performance of the capture kit
• Uniform and high coverage of the target regions improves variant detection in complex and repetitive regions
• Although the capture kit is an important factor for variant calling performance, experimental procedures also play a crucial role; some labs achieve high performance with a given kit, while others perform poorly with the same kit

Sequencing
Quality control metrics:
• Pre-alignment
o Deduplicated sequencing depth > 100×
o Read duplication rate < 22%
• Contamination
o Human genome mapping rate > 98%
o Mapping rate for other organisms < 1%
o No-hit rate < 1%
• Post-alignment
o Mapping rate > 0.999
o Mismatch rate < 0.003
o Q30 rate > 0.934
Platform:
• No significant differences between sequencing platforms

Data analysis
Quality control:
• Check for mislabeling
• Check for cross-sample contamination
Read preprocessing:
• Adapter removal: about one third of labs had adapter content greater than 10%
Read alignment:
• BWA-MEM is the most popular mapper
Post-alignment process:
• No significant differences between including or excluding GATK-recommended BQSR
• Some callers do not require BQSR, such as DeepVariant, Sentieon DNAscope, and Strelka2
Variant calling:
• Precision: DeepVariant ≈ Sentieon DNAscope ≈ Strelka2 > HaplotypeCaller
• Recall: HaplotypeCaller > DeepVariant ≈ Sentieon DNAscope ≈ Strelka2
• Higher read quality results in better performance, regardless of the caller choice
• High read quality does not rule out mislabeling or cross-sample (human) contamination
Variant filtration:
• Retain variants within target regions
• Depth > 5
• Stringent filtration thresholds slightly improve precision but significantly lower recall
• Keeping only variants detected by multiple callers slightly improves precision but significantly lowers recall
Sequencing platform does not make a significant difference in variant detection performance, but sequencing data with higher read quality consistently lead to higher variant calling performance. We provide cutoff values for quality metrics important for quality control of sequencing datasets; they are median values from the high read-quality group and cover pre-alignment quality control metrics, post-alignment quality control metrics, and library contamination. For pre-alignment metrics, a good library is expected to have high base quality scores, low levels of GC bias, low N base content, and no overrepresented sequences. Although we required a sequencing depth of at least 100×, some labs generated sequencing datasets with a sequencing depth exceeding 1000x. Excessively high sequencing depth does not ensure higher variant calling performance; instead, it leads to a higher read duplication rate, thus wasting sequencing data. Uniform coverage with high sequencing depth is more important for accurate variant calling. For the library contamination check, the most obvious issue was vector contamination: over one third (39) of the labs had greater than 1% of reads mapped to vectors. FastqScreen can only detect contamination from other organisms; it will not identify cross-sample contamination. We found one high read-quality lab with cross-sample contamination, resulting in significantly reduced variant calling performance. This problem could be avoided by handling each sample carefully and preventing carryover caused by aerosols.
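The Table 1 cutoffs can be wired into a simple QC gate (a sketch; the metric names are our own, and the thresholds are the medians quoted above):

```python
# Cutoffs taken from Table 1 (medians of the high read-quality group).
QC_THRESHOLDS = {
    "dedup_depth_min": 100,     # deduplicated sequencing depth (x)
    "dup_rate_max": 0.22,       # read duplication rate
    "human_map_rate_min": 0.98, # contamination check
    "mapping_rate_min": 0.999,  # post-alignment
    "mismatch_rate_max": 0.003,
    "q30_min": 0.934,
    "fold80_max": 1.8,          # capture efficiency
    "on_target_min": 0.55,
    "cov100x_min": 0.69,
}

def qc_failures(m, t=QC_THRESHOLDS):
    """Return the names of the Table 1 checks a library fails."""
    checks = [
        ("dedup_depth", m["dedup_depth"] >= t["dedup_depth_min"]),
        ("dup_rate", m["dup_rate"] <= t["dup_rate_max"]),
        ("human_map_rate", m["human_map_rate"] >= t["human_map_rate_min"]),
        ("mapping_rate", m["mapping_rate"] >= t["mapping_rate_min"]),
        ("mismatch_rate", m["mismatch_rate"] <= t["mismatch_rate_max"]),
        ("q30", m["q30"] >= t["q30_min"]),
        ("fold80", m["fold80"] <= t["fold80_max"]),
        ("on_target", m["on_target"] >= t["on_target_min"]),
        ("cov100x", m["cov100x"] >= t["cov100x_min"]),
    ]
    return [name for name, ok in checks if not ok]

# A hypothetical library that passes everything except coverage uniformity.
lab = {"dedup_depth": 120, "dup_rate": 0.18, "human_map_rate": 0.99,
       "mapping_rate": 0.9995, "mismatch_rate": 0.002, "q30": 0.95,
       "fold80": 2.1, "on_target": 0.60, "cov100x": 0.72}
failed = qc_failures(lab)
```

A gate like this flags uniformity problems (fold-80) even when depth and mapping metrics look fine, which matches the finding that uniform coverage matters more than raw depth.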
When receiving sequencing data, in addition to the quality control checks mentioned above, it is crucial to address mislabeling issues, as they can fundamentally impact scientific results. This can be verified using pedigree information or replicates if available. Prior to aligning the reads to the reference genome, adapters should be removed, as over one-third of the labs produced sequencing datasets with adapter content exceeding 10% of the reads. BWA is the most popular alignment tool, and several studies have reported that the mapper does not significantly contribute to variant calling performance9, 15. In the post-alignment process, no significant differences were observed between including or excluding BQSR, and some callers do not require BQSR, such as DeepVariant, Sentieon DNAscope, and Strelka2. Higher read quality consistently results in better performance, regardless of the caller chosen. DeepVariant, Sentieon DNAscope, and Strelka2 demonstrate similar performance, exhibiting higher precision but lower recall compared to GATK HaplotypeCaller. After calling variants, it is essential to retain only those within the target regions, as variants reported outside the targeted region are more likely false positives. Most labs apply overly stringent variant filtration methods, resulting in a slight improvement in precision but a significant loss in recall. Similarly, retaining variants supported by multiple callers will slightly improve precision but significantly decrease recall.