WEAP runs in multi-threaded mode and combines widely used whole-exome analysis tools and databases with the GATK Best Practices for calling germline and somatic variants. Users can therefore go from FASTQ files to annotated variants in a single step, while also making use of a panel of normals (PoN). In contrast to the existing SeqMule automated workflow, WEAP integrates the latest protocol recommended by the GATK Best Practices guidelines and automatically performs joint variant calling, false-positive filtering, genotyping, and annotation. Moreover, most tools built for automated variant calling do not support somatic calling from tumor and tumor-normal paired samples [28].
Benchmarking of variant calling tools has shown that each tool has advantages and limitations. Performance depends on internal parameters, sample quality, sequencing technology, and the alignment quality of the data. GATK HaplotypeCaller and Mutect2 were incorporated into WEAP because they have performed well in several studies and have been widely used in The Cancer Genome Atlas (TCGA), a major genome sequencing project, for germline and somatic mutation screening [29, 30]. Large-scale WGS projects such as the UKB WGS Consortium and the 1000 Genomes Project have also employed GATK HaplotypeCaller for germline variant discovery [31, 32]. An advantage of Mutect2 over VarScan (another popular variant calling tool) is its high sensitivity for detecting somatic variants without a matched control sample. Using a PoN and a germline resource further helps to filter false-positive variants out of the Mutect2 call sets [33, 34].
GATK uses a probabilistic model for variant calling that takes into account base quality scores, mapping quality, variant quality scores, hard-filtering, and other features. This improves the accuracy of variant calling by modeling the sequencing data more completely and mitigating potential sequencing artifacts [35]. Mutect2 supports both tumor-only and tumor-normal modes, providing flexibility in variant calling based on the available sample types. The tumor-normal mode allows better identification of somatic variants by comparing the tumor sample against a matched normal sample [36]. GATK FilterMutectCalls employs allele-specific filtering to improve the accuracy of variant calls, particularly in the context of tumor heterogeneity. This feature helps to distinguish true somatic variants from sequencing errors and from germline variants present in the normal samples. Mutect2 also incorporates advanced artifact filtering techniques, including machine learning models, to reduce false positives caused by sequencing artifacts and systematic errors. This improves the precision of somatic variant detection, especially in cancer genomics studies [37].
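To make the difference between the two Mutect2 modes concrete, the following minimal sketch assembles a `gatk Mutect2` argument list for either tumor-only or tumor-normal calling, optionally adding a PoN and a germline resource. The function name, parameter names, and file names are hypothetical and chosen only for illustration; this is not taken from the WEAP source code, and the pipeline may pass additional options.

```python
from typing import List, Optional


def build_mutect2_command(
    reference: str,
    tumor_bam: str,
    output_vcf: str,
    normal_bam: Optional[str] = None,
    normal_sample_name: Optional[str] = None,
    panel_of_normals: Optional[str] = None,
    germline_resource: Optional[str] = None,
) -> List[str]:
    """Assemble a `gatk Mutect2` argument list (illustrative helper, not WEAP code)."""
    cmd = ["gatk", "Mutect2", "-R", reference, "-I", tumor_bam]

    # Tumor-normal mode: add the matched normal BAM and name its sample.
    if normal_bam is not None:
        if normal_sample_name is None:
            raise ValueError("tumor-normal mode needs the normal sample name")
        cmd += ["-I", normal_bam, "-normal", normal_sample_name]

    # Optional resources that help remove recurrent artifacts and germline variants.
    if panel_of_normals is not None:
        cmd += ["--panel-of-normals", panel_of_normals]
    if germline_resource is not None:
        cmd += ["--germline-resource", germline_resource]

    cmd += ["-O", output_vcf]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, used only to show the two modes side by side.
    tumor_only = build_mutect2_command(
        "ref.fasta", "tumor.bam", "tumor_only.vcf.gz",
        panel_of_normals="pon.vcf.gz", germline_resource="gnomad.vcf.gz",
    )
    paired = build_mutect2_command(
        "ref.fasta", "tumor.bam", "paired.vcf.gz",
        normal_bam="normal.bam", normal_sample_name="NORMAL",
        panel_of_normals="pon.vcf.gz", germline_resource="gnomad.vcf.gz",
    )
    print(" ".join(tumor_only))
    print(" ".join(paired))
```

The resulting list can be handed to `subprocess.run` as is; the raw Mutect2 calls would then normally be passed to FilterMutectCalls, as described above.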
Although GATK HaplotypeCaller and Mutect2 have demonstrated high accuracy and robust performance in variant calling, the choice of variant caller may depend on the specific requirements of the analysis and the nature of the genomic data being processed. Alternative pipelines and tools for variant calling are available, such as Pibase, SNPSVM, and DeepVariant, but GATK is still the most commonly used variant calling pipeline for both whole-genome and whole-exome data analysis [38]. In a recent benchmark based on a trio sample, DeepVariant (v0.8.0) performed better than GATK (v4.1.2.0) [39]. However, GATK offers tunable hard-filtering parameters that improve variant call sets by removing false-positive variants. GATK's variant quality score recalibration (VQSR) step uses machine learning to filter variants, offering a tunable approach to filtering when high-quality training data are provided [40]. In GATK v4.5.0.0, Illumina DRAGEN features were added to HaplotypeCaller, bringing it closer to functional equivalence with DRAGEN v3.7.8. Furthermore, HaplotypeCaller and Mutect2 use a hardware-accelerated implementation of the Smith-Waterman algorithm, which significantly improves speed.
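As an illustration of what tunable hard-filtering means in practice, the sketch below applies threshold rules to the annotations of a single SNP call. The thresholds are the generic starting values suggested in the GATK hard-filtering documentation for SNPs; they are an assumption here and are not necessarily the values used by WEAP. The rule names and the `failed_filters` helper are hypothetical, and annotations missing from a record are simply skipped in this sketch.

```python
import operator
from typing import Callable, Dict, List, Tuple

# (filter name, annotation key, comparison that means "fail", threshold)
Rule = Tuple[str, str, Callable[[float, float], bool], float]

# Generic SNP starting thresholds from the GATK hard-filtering documentation
# (assumed here for illustration; tune them for the data set at hand).
SNP_RULES: List[Rule] = [
    ("QD2", "QD", operator.lt, 2.0),
    ("FS60", "FS", operator.gt, 60.0),
    ("SOR3", "SOR", operator.gt, 3.0),
    ("MQ40", "MQ", operator.lt, 40.0),
    ("MQRankSum-12.5", "MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum-8", "ReadPosRankSum", operator.lt, -8.0),
]


def failed_filters(annotations: Dict[str, float],
                   rules: List[Rule] = SNP_RULES) -> List[str]:
    """Return the names of all hard filters that a variant fails."""
    failed = []
    for name, key, is_failing, threshold in rules:
        value = annotations.get(key)
        # Annotations absent from the record are skipped rather than failed.
        if value is not None and is_failing(value, threshold):
            failed.append(name)
    return failed


if __name__ == "__main__":
    good_call = {"QD": 21.3, "FS": 1.2, "SOR": 0.8, "MQ": 60.0}
    weak_call = {"QD": 1.4, "FS": 72.5, "SOR": 3.6, "MQ": 38.0}
    print(failed_filters(good_call) or "PASS")
    print(failed_filters(weak_call))
```

Tightening or loosening a threshold only requires editing one entry in `SNP_RULES`, which is the kind of adjustment that hard-filtering leaves in the user's hands.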
It is also essential to consider the constantly evolving landscape of bioinformatics tools; users should stay informed about updates and improvements to ensure the most accurate results in their analyses. DeepVariant from Google for variant calling, along with other popular annotation tools, will be incorporated in a subsequent release of WEAP. The WEAP workflow will also be made available for the popular workflow manager Nextflow. The current workflow uses ANNOVAR for variant annotation; however, users can also run Variant Effect Predictor (VEP) on the output of WEAP's variant calling step. WEAP is an automated pipeline for genetic variant calling with annotation, and it is also the only pipeline that offers variant analysis with the most up-to-date BWA aligner (BWA-MEM2) and the latest GATK version (v4.5.0.0) following the GATK Best Practices guidelines.
The tools used in NGS analysis are heterogeneous in many respects, such as the data types they support and their provisions for utilizing available resources, which makes analyzing large volumes of data highly challenging. Automating the workflow often increases robustness and reduces time and cost by removing opportunities for human error. WEAP in parallel mode significantly reduces the time required for germline and somatic variant calling compared to the conventional step-by-step serial process. For instance, in germline mode, parallel WEAP reduced the time required for variant calling and annotation from the FASTQ files of four samples from 5 hours 58 minutes in serial mode to 3 hours 16 minutes (45.25% reduction in time). Similarly, the time taken for PoN creation from 40 samples was reduced from approximately 111 hours 55 minutes to 36 hours 18 minutes (67.56% reduction in time). In somatic variant calling, the time was reduced from 15 hours 28 minutes to 4 hours 55 minutes in tumor-only mode (68.17% reduction in time), and from 21 hours 6 minutes to 6 hours 9 minutes (70.83% reduction in time) in tumor with matched normal mode. WEAP thus used the available computational resources efficiently to reduce the overall analysis time.
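The percentage reductions quoted above follow from (serial time - parallel time) / serial time. The short sketch below recomputes them from the reported durations, taken to the nearest minute, and also prints the corresponding speed-up factor. The helper names are hypothetical. Values recomputed this way agree with the reported percentages to within about 0.05 percentage points; the small differences presumably arise because the reported durations are rounded to the minute.

```python
def minutes(hours: int, mins: int) -> int:
    """Convert an hours-and-minutes duration to total minutes."""
    return hours * 60 + mins


def percent_reduction(serial_min: float, parallel_min: float) -> float:
    """Percentage of the serial run time saved by the parallel run."""
    return 100.0 * (serial_min - parallel_min) / serial_min


# (serial, parallel) run times in minutes, as reported in the text.
RUNS = {
    "germline, 4 samples": (minutes(5, 58), minutes(3, 16)),
    "PoN creation, 40 samples": (minutes(111, 55), minutes(36, 18)),
    "somatic, tumor-only": (minutes(15, 28), minutes(4, 55)),
    "somatic, tumor with matched normal": (minutes(21, 6), minutes(6, 9)),
}

if __name__ == "__main__":
    for label, (serial, parallel) in RUNS.items():
        reduction = percent_reduction(serial, parallel)
        speedup = serial / parallel
        print(f"{label}: {reduction:.2f}% reduction, {speedup:.2f}x speed-up")
```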
4.1 Strengths and Limitations
WEAP simplifies the process of calling variants from multiple samples. It works on four tasks simultaneously, streamlining the workflow from FASTQ files to variant annotation. Users only need to provide the path to the sample directory and a few essential parameters at the start, and WEAP handles the rest automatically.
However, the tool has certain limitations. WEAP does not support pausing and resuming tasks during an analysis. In terms of performance, it functions optimally on Linux (Ubuntu 20.04 or higher) and on Windows 10/11 through the Windows Subsystem for Linux, on machines with mid- to high-end computing configurations.