WEAP: An automatic and accelerated pipeline for analysing multi-sample whole exome sequencing data

doi:10.21203/rs.3.rs-4512130/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Background

Whole Exome Sequencing (WES) is commonly used for SNP discovery in the coding regions of the human genome and has a wide range of clinical applications. Because WES data analysis is an intensive and time-consuming task, automation is key to making it straightforward.

Method

The WEAP workflow aligns FASTQ files to a reference genome and then performs variant calling and annotation without user intervention. WEAP follows the GATK workflow and incorporates popular NGS analysis tools such as bwa-mem2, samtools, GATK, bcftools, and ANNOVAR, coupled with GNU parallel.

Results

WEAP successfully identified and annotated germline and somatic variants. In germline variant discovery, the major steps, including alignment to the reference genome, file conversion, and duplicate removal, were 1.5- to 3.6-fold faster in parallel mode than in serial mode. In tumor analysis, creating a PoN from 40 samples was about 3 times faster in parallel mode. The individual steps of tumor-only analysis were 1.4 to 7.7 times faster. When tumor samples were compared with matched normal tissues, the time taken was reduced, making the process 1.8 to 3.6 times faster.

Conclusions

WEAP accepts Quality Control (QC) checked and trimmed FASTQ reads and provides annotated variants, enabling non-bioinformaticians to perform variant calling from WES data. WEAP uses GNU parallel to process multiple samples at a time, while leveraging the native parallel processing of the integrated tools and software to perform the analysis faster. A comparison between the parallel and serial modes of WEAP indicates that WEAP can be one of the best alternative tools for end-to-end analysis of WES data, integrating the gold-standard GATK Best Practices workflow.

Bioinformatics

Epigenetics & Genomics

WEAP

Automated WES Data Analysis

GNU Parallel

Variant calling

Variant Annotation

1. Introduction

Genetic variants in protein-coding sequences have clinical consequences because they can directly influence protein structure and can affect a patient’s health and disease. The whole exonic region (1–2% of the whole genome) can be used to catalogue the variants in the coding region of the genome [1]. Whole Exome Sequencing (WES) provides a cost-effective and efficient way to sequence the protein-coding regions of the genome, enabling the identification of novel candidate genes associated with diseases, and the technology is widely used in clinical genomics. Large-scale analysis of exome data, typically WES data, is extremely important for elucidating disease-associated genetic variants [2]. To ensure precision and efficiency in clinical diagnostic lab operations, it is imperative to establish a reliable and automated data analysis pipeline. This not only helps to prevent erroneous variant calls but also helps to achieve the highest possible accuracy [3].

WES has revolutionized the way researchers approach the study of heterogeneous diseases such as cardiovascular disease, cancers, and other genetic diseases, as well as inherited rare disorders. The ability to accurately identify the genes involved has led to studies of their association with disease and paved the way for novel approaches in diagnostics and treatment [3, 4]. The association of life-threatening diseases with specific driver gene mutations has proven to be a crucial step in developing personalized medicine, allowing the design of better patient-specific treatment regimens [5, 6]. In recent years, many novel biomarkers and drug targets have been identified in different diseases, including cancers, using Next Generation Sequencing methods [7]. Apart from mutations, WES can also be used to detect copy number variation in cancer and other diseases [8].

Next Generation Sequencing (NGS) data analysis consists of multiple intermediate steps. As NGS is now a well-established method in biomedical research, knowledge of and expertise in different bioinformatics algorithms, software, and databases are essential. Although abundant online servers enable researchers to analyze their data, these often come with limited storage space, which is a bottleneck for the analysis [9, 11]. Data quality and data type, such as read length, also depend on the sequencing platform. The choice of sequencing platform in turn depends largely on the study goal, and Illumina is one of the popular NGS platforms widely chosen for health research [12].

The typical exome data analysis pipeline starts with pre-processing and quality checks on the raw reads. Easy-to-use tools such as FastQC, FastProNGS, fastp and Trimmomatic can summarize the quality scores recorded in the FASTQ files and use those scores to trim low-quality bases and adapter sequences, generating alignment-ready FASTQ files [13, 14, 15, 16]. The subsequent steps involve alignment of the quality-checked reads to the reference genome using an aligner, followed by removal of duplicate reads generated during library preparation, recalibration of base quality scores, variant calling, and annotation [17]. Illumina and NVIDIA have introduced Dynamic Read Analysis for GENomics (DRAGEN) for the Illumina BioIT platform and Clara™ Parabricks with NVIDIA GPU support, respectively, for hardware-dependent accelerated NGS data analysis with an enormous improvement in speed [18]. Although web platforms like Galaxy enable researchers to analyze data online, limited storage space and the large number of user requests from across the globe make joint variant calling from large sample sets significantly challenging. Variant calling algorithms are complex and sensitive to parameter settings, and users need to understand the intricacies of each tool. Galaxy workflows should be designed to optimize parameter choices and minimize false positives and false negatives [19]. The GATK variant calling pipeline is one of the oldest variant calling pipelines, with great flexibility and broad platform support, from older to the latest versions of Linux. The evolution of the GATK pipeline has not only improved robustness in identifying variants but has also revolutionized computational resource utilization, accuracy and speed. GATK’s joint genotyping method is more sensitive and flexible than traditional approaches, as it reduces computational challenges and facilitates incremental variant discovery across distinct sample cohorts [20, 21].
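The quality-based trimming mentioned above can be illustrated with a few lines of Python. This is a minimal sketch, assuming Phred+33 encoded qualities and a simple rule that removes bases from the 3' end while their score is below a threshold; it does not reproduce the algorithms of fastp or Trimmomatic, and the function names and the default threshold of 20 are illustrative assumptions.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq: str, quality: str, min_q: int = 20) -> tuple:
    """Drop bases from the 3' end while their Phred score is below min_q."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


# 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0 in Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIIII#!")
print(trimmed_seq, trimmed_qual)  # ACGTAC IIIIII
```

Production trimmers typically use sliding windows and adapter detection in addition to a per-base threshold, so this shows only the principle.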

NGS data analysis is tedious and often requires expert bioinformaticians at every stage, from data quality control (QC) to variant calling and interpretation. NGS analysis tools and algorithms evolve rapidly and are frequently updated, which makes it hard for researchers to choose appropriate tools. Therefore, we introduce WEAP, an open-source automated tool that calls and annotates germline and somatic variants from raw data using popular bioinformatics tools, without user intervention and without requiring bioinformatics expertise, in a research setup.

2. Material and methods

The binary executable ‘configuration’ script that comes with WEAP allows the user to set up the necessary tools: the axel download accelerator (https://github.com/axel-download-accelerator/axel), samtools [22], the most recent version of GATK (v4.5.0.0) [23, 24], bwa-mem2 [25], bcftools [26] and the latest version of ANNOVAR [27]. Another binary executable, ‘reference.bin’, is included to download the reference genome and prepare its index, along with variants from dbSNP and the 1000 Genomes Project and reference data from the ANNOVAR database, which are required at various steps of variant calling and annotation. Axel is used within ‘reference.bin’ to download the reference data quickly from the source. WEAP is designed to use the best database sources for base quality score recalibration, variant quality score recalibration, and annotation of variants, along with pre-classified annotation as per ACMG guidelines and allele frequencies.

The pipeline uses bwa-mem2 to align the paired-end reads to the reference genome (hg38). bwa-mem2 reads the FASTQ files from the input directory and assigns the sample name (the directory name) to the respective SAM files. The SAM files are then converted to BAM files and sorted by chromosomal coordinates using Samtools. Picard (https://broadinstitute.github.io/picard/) is used to mark the duplicate reads that result from the library preparation steps.
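The pre-processing steps described above can be sketched as a list of command lines for one sample. This is an illustration rather than WEAP's actual implementation: the FASTQ naming scheme (sample_R1.fastq.gz, sample_R2.fastq.gz), the output file names, the read-group fields and the function name preprocessing_commands are assumptions made for the example. The commands are only built, not executed.

```python
from pathlib import Path


def preprocessing_commands(sample_dir: str, reference: str, threads: int = 16) -> list:
    """Build (but do not run) alignment, conversion, sorting and duplicate-marking commands."""
    directory = Path(sample_dir)
    sample = directory.name  # the directory name is used as the sample name
    # Hypothetical FASTQ naming convention, for illustration only.
    r1 = str(directory / f"{sample}_R1.fastq.gz")
    r2 = str(directory / f"{sample}_R2.fastq.gz")
    # bwa expects the literal two-character sequence "\t" between read-group fields.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        ["bwa-mem2", "mem", "-t", str(threads), "-R", read_group,
         "-o", f"{sample}.sam", reference, r1, r2],
        ["samtools", "view", "-b", "-o", f"{sample}.bam", f"{sample}.sam"],
        ["samtools", "sort", "-o", f"{sample}.sorted.bam", f"{sample}.bam"],
        ["gatk", "MarkDuplicates", "-I", f"{sample}.sorted.bam",
         "-O", f"{sample}.dedup.bam", "-M", f"{sample}.dup_metrics.txt"],
    ]


for command in preprocessing_commands("data/S1", "hg38.fa"):
    print(" ".join(command))
```

Each inner list could be passed to subprocess.run on a system where the tools are installed; keeping the commands as data makes it easy to inspect them before running anything.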

In germline variant calling, GATK (v4.5.0.0) was used for base quality score recalibration (BQSR), followed by variant calling in gVCF mode, joint genotyping, variant quality score recalibration (using db-snp138.vcf, 1000g.Omni2.5.vcf and 1000g.phase1.HighConfidence.vcf), and hard-filtering with the default criteria provided by the GATK Best Practices workflow. Somatic variant calling using Mutect2 within GATK (v4.5.0.0) was implemented in two ways: tumor-only mode (TOM) and tumor with matched normal (TWM) mode, both coupled with a germline resource (af_only_genomead.vcf.gz) provided by GATK and a Panel of Normals (PoN) to filter out false-positive somatic variants. Remaining false-positive somatic variants were filtered using the default criteria of GATK FilterMutectCalls [24, 25].
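The difference between the two somatic modes can be made concrete with a small command builder. This is a sketch under stated assumptions, not WEAP's own code: the function names, file names and sample names are hypothetical, and only standard Mutect2 and FilterMutectCalls arguments are used. In tumor-only mode the normal BAM is simply omitted.

```python
from typing import Optional


def mutect2_command(reference: str, tumor_bam: str, output_vcf: str,
                    germline_resource: str, panel_of_normals: str,
                    normal_bam: Optional[str] = None,
                    normal_sample: Optional[str] = None) -> list:
    """Build a Mutect2 command for tumor-only or tumor with matched normal calling."""
    cmd = ["gatk", "Mutect2", "-R", reference, "-I", tumor_bam,
           "--germline-resource", germline_resource,
           "--panel-of-normals", panel_of_normals]
    if normal_bam is not None:
        if normal_sample is None:
            raise ValueError("normal_sample is required when normal_bam is given")
        # The matched normal is passed as a second input and named explicitly.
        cmd += ["-I", normal_bam, "-normal", normal_sample]
    cmd += ["-O", output_vcf]
    return cmd


def filter_command(reference: str, unfiltered_vcf: str, filtered_vcf: str) -> list:
    """Build the FilterMutectCalls command applied after Mutect2."""
    return ["gatk", "FilterMutectCalls", "-R", reference,
            "-V", unfiltered_vcf, "-O", filtered_vcf]


# Hypothetical file and sample names.
tom = mutect2_command("hg38.fa", "T1.bam", "T1.vcf.gz",
                      "af_only_genomead.vcf.gz", "pon.vcf.gz")
twm = mutect2_command("hg38.fa", "T1.bam", "T1.vcf.gz",
                      "af_only_genomead.vcf.gz", "pon.vcf.gz",
                      normal_bam="N1.bam", normal_sample="N1")
print(" ".join(tom))
print(" ".join(twm))
```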

ANNOVAR is used for gene-based and filter-based annotation of variants using refGene, avSNP150, ClinVar, 1000g (with SNP effect predictions of SIFT, PolyPhen2, PROVEAN, LRT, MutationPrediction, FATHMM), ExAC, gnomAD exome, and COSMIC (for cancer-associated mutations in somatic variant calling) [26]. At each step of the analysis, the tool retains the sample ID in the output files. To make variant calling faster, we implemented GNU parallel throughout the pipeline to run four jobs at a time. WEAP automatically decides the number of CPU threads to use in the alignment step; the variant calling step uses a total of 16 threads, with each job using 4 threads. The maximum memory allocation in the variant calling step was set to 8 GB.
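The distinction between gene-based and filter-based annotation maps onto the operation codes that ANNOVAR's table_annovar.pl expects for each database ("g" for gene-based, "f" for filter-based). The sketch below builds such a command. The particular protocol list, the database directory and the function name are assumptions for illustration; the exact database versions used by WEAP are not stated in the text.

```python
from typing import Dict, Optional

# Illustrative subset of ANNOVAR databases mapped to their operation type.
DEFAULT_PROTOCOLS = {
    "refGene": "g",       # gene-based annotation
    "avsnp150": "f",      # filter-based annotation
    "exac03": "f",
    "gnomad_exome": "f",
}


def annovar_command(vcf: str, db_dir: str, out_prefix: str,
                    protocols: Optional[Dict[str, str]] = None,
                    build: str = "hg38") -> list:
    """Build a table_annovar.pl command with matching protocol and operation lists."""
    chosen = DEFAULT_PROTOCOLS if protocols is None else protocols
    return ["table_annovar.pl", vcf, db_dir,
            "-buildver", build, "-out", out_prefix, "-remove",
            "-protocol", ",".join(chosen.keys()),
            "-operation", ",".join(chosen.values()),
            "-nastring", ".", "-vcfinput"]


print(" ".join(annovar_command("cohort.vcf", "humandb/", "cohort")))
```

Keeping protocols and operations in one mapping guarantees that the two comma-separated lists stay the same length and in the same order, which table_annovar.pl requires.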

The current version of the tool was tested on WES data generated from gastric cancer patients for both germline and somatic variant calling. Four WES datasets each from blood, tumor tissue, and adjacent normal tissue of gastric cancer origin, generated on an Illumina NovaSeq 6000 platform at 100x depth, were evaluated (Supplementary Data). For somatic variant calling, a PoN was created from forty WES datasets sequenced using the same library preparation method on the same platform (Supplementary Data).

Reads from one sample were aligned using 16 CPU threads, so a total of 64 CPU threads were used to align four samples in parallel. Before alignment, the tool also recommends the number of CPU threads to use during the analysis based on the available resources. HaplotypeCaller and Mutect2 of GATK v4.5.0.0 used four CPU threads per task, and GNU parallel ran four tasks at a time, utilizing a total of 16 CPU threads. Moreover, the Smith-Waterman implementation used by HaplotypeCaller in GATK v4.5.0.0 makes it faster than previous versions of GATK.
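The thread budgets quoted above (4 jobs x 16 threads for alignment, 4 jobs x 4 threads for variant calling) can be reproduced with a simple allocation rule. The heuristic below is a hypothetical illustration of how a recommendation could be derived from the available CPU threads; the text does not describe the rule WEAP actually uses, and the function name and defaults are assumptions.

```python
import os


def plan_threads(available: int, jobs: int = 4, per_job_cap: int = 16) -> dict:
    """Split available CPU threads across concurrent jobs, capped per job."""
    if available < 1 or jobs < 1 or per_job_cap < 1:
        raise ValueError("available, jobs and per_job_cap must be positive")
    concurrent = min(jobs, available)  # never run more jobs than threads
    per_job = max(1, min(per_job_cap, available // concurrent))
    return {"jobs": concurrent,
            "threads_per_job": per_job,
            "total": concurrent * per_job}


print(plan_threads(64))                 # alignment on a 64-thread machine
print(plan_threads(64, per_job_cap=4))  # variant calling with 4 threads per job
print(plan_threads(os.cpu_count() or 1))
```

With 64 threads available, the first call yields four jobs of 16 threads each and the second yields four jobs of 4 threads each, matching the totals of 64 and 16 reported in the text.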

3. Results

WEAP successfully called and annotated germline and somatic variants from the trimmed FASTQ files of blood DNA and tissue DNA, respectively. In germline variant calling, WEAP took less time in parallel mode than in serial mode (one task at a time).

In parallel mode, the time taken by the major steps decreased by approximately 2-fold for alignment to the reference genome, SAM to BAM conversion, sorting of BAM files by coordinates, and duplicate read removal; 1.5-fold for base quality score recalibration (BQSR); 3.6-fold for variant calling; and 3-fold for variant annotation. However, the gVCF combining, joint genotyping, variant quality score recalibration (VQSR) and variant filtration steps were not run in parallel, since GATK does not offer parallel processing for these steps (Fig. 2A).
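The split between per-sample steps that run four at a time and cohort-level steps that run once can be illustrated with Python's standard thread pool, used here only as a stand-in for GNU parallel. Both functions are placeholders that return file names instead of running any tool, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def call_sample(sample: str) -> str:
    """Placeholder for the per-sample steps (alignment through gVCF calling)."""
    return f"{sample}.g.vcf.gz"


def joint_genotype(gvcfs: list) -> str:
    """Placeholder for the cohort-level steps, which run once over all gVCFs."""
    return f"cohort_{len(gvcfs)}_samples.vcf.gz"


def run_germline(samples: list, jobs: int = 4) -> tuple:
    # Scatter: at most `jobs` samples are processed at the same time.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        gvcfs = list(pool.map(call_sample, samples))  # order of samples is preserved
    # Gather: combining and joint genotyping start only after every sample is done.
    return gvcfs, joint_genotype(gvcfs)


gvcfs, cohort = run_germline(["S1", "S2", "S3", "S4"])
print(gvcfs, cohort)
```

Because the gather stage runs serially, its run time is the same in both modes, which limits the overall speed-up that parallelizing the per-sample steps can deliver.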

In somatic variant calling, the PoN VCF was created using WES data from 40 samples of blood origin sequenced on the Illumina NovaSeq 6000 platform. WEAP created the PoN VCF in 402900 seconds in serial mode and in 130711 seconds in parallel mode, which was approximately 3-fold faster (Supplementary Data).
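As a quick check of the PoN timings, the reported seconds can be converted to hours and minutes and the fold change computed directly. The helper name is illustrative; the two timings are taken from the text.

```python
def hours_minutes(seconds: int) -> tuple:
    """Convert a duration in seconds to whole hours and minutes (seconds dropped)."""
    total_minutes = seconds // 60
    return total_minutes // 60, total_minutes % 60


serial_s, parallel_s = 402_900, 130_711  # PoN creation times reported above

print(hours_minutes(serial_s))           # (111, 55)
print(hours_minutes(parallel_s))         # (36, 18)
print(round(serial_s / parallel_s, 2))   # 3.08
```

The result corresponds to roughly 111 hours 55 minutes versus 36 hours 18 minutes, a speed-up of about 3-fold.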

In tumor-only mode of somatic variant calling, WEAP parallel was faster by 2.3-fold in alignment, 1.7-fold in SAM to BAM conversion, 2-fold in sorting, 3.4-fold in duplicate read removal, 3.3-fold in variant calling, 7.7-fold in variant filtration and 1.4-fold in variant annotation (Fig. 2B). In tumor with matched normal mode, the average time required for the alignment, SAM to BAM conversion, sorting, and duplicate removal steps of the tumor and adjacent normal samples was reduced (Fig. 2C). Similarly, the time taken for PoN creation from 40 samples was reduced from approximately 111 hours & 55 minutes to 36 hours & 18 minutes. The time for somatic variant calling was reduced from 15 hours & 28 minutes to 4 hours & 55 minutes in tumor-only mode, and from 21 hours & 6 minutes to 6 hours & 9 minutes in tumor with matched normal mode. In tumor with matched normal mode, WEAP parallel was on average 2.6-fold faster in alignment, 1.8-fold in SAM to BAM conversion, 2.2-fold in sorting and 3.4-fold in duplicate read removal. Mutect2 variant calling was 3.6-fold faster in WEAP parallel. Likewise, variant filtration and annotation were 4.1-fold and 2.9-fold faster, respectively.

4. Discussion

WEAP is designed for multi-threaded operation, utilizing popular whole exome data analysis tools and databases in conjunction with GATK Best Practices for calling germline and somatic variants. This enables users to obtain variant annotations from FASTQ files in a single step, while also utilizing a PoN. In contrast to the already available SeqMule automated workflow, WEAP integrates the latest protocol recommended by the GATK Best Practices guidelines and automatically performs joint variant calling, filtering of false-positive variants, genotyping and annotation. Moreover, most of the tools made for automatic variant calling are not designed for somatic calling from tumor and tumor-normal paired samples [28].

Benchmarking of various variant calling tools has revealed that each tool has advantages and limitations. The performance of the tools depends on various internal parameters, the quality of the samples, the sequencing technology and the alignment quality of the data. GATK HaplotypeCaller and Mutect2 have been incorporated in WEAP because they have outperformed other tools in various studies and have been widely used in a popular genome sequencing project, The Cancer Genome Atlas (TCGA), for germline and somatic mutation screening [29, 30]. Moreover, large-scale WGS projects such as the UKB WGS Consortium and the 1000 Genomes Project also employed GATK HaplotypeCaller for germline variant discovery [31, 32]. The advantage of Mutect2 over VarScan (another popular variant calling tool) is its high sensitivity in detecting somatic variants without a matched control sample. Using a PoN and a germline resource further aids in filtering out false-positive variants from the Mutect2 call sets [33, 34].

GATK uses a probabilistic model for variant calling, considering base quality scores, mapping quality, variant quality score, hard-filtering and other features. This enhances the accuracy of variant calling by providing a more comprehensive understanding of the sequencing data and mitigating potential sequencing artifacts [35]. Mutect2 supports both tumor-only and tumor-normal modes, providing flexibility in variant calling based on the available sample types. The tumor-normal mode allows for better identification of somatic variants by comparing the tumor sample against a matched normal sample [36]. GATK FilterMutectCalls employs allele-specific filtering to enhance the accuracy of variant calls, particularly in the context of tumor heterogeneity. This feature aids in distinguishing true somatic variants from sequencing errors or germline variants present in the normal samples. Mutect2 incorporates advanced artifact filtering techniques, including machine learning models, to reduce false positives caused by sequencing artifacts and systematic errors. This enhances the precision of somatic variant detection, especially in cancer genomics studies [37].

While GATK HaplotypeCaller and Mutect2 have demonstrated high accuracy and robust performance in variant calling, the choice of variant caller may depend on the specific requirements of the analysis and the nature of the genomic data being processed. Alternative pipelines and tools are available for variant calling, such as Pibase, SNPSVM, and DeepVariant, but GATK is still the most commonly used variant calling pipeline for both whole genome and whole exome data analysis [38]. In a recent benchmark study based on a trio sample, DeepVariant (v0.8.0) showed better performance than GATK (v4.1.2.0) [39]. However, GATK offers tuneable hard-filtering parameters that improve variant call sets by removing false-positive variants. GATK’s variant quality score recalibration (VQSR) step uses machine learning to filter variants, offering a tuneable approach to filtering given high-quality training set data [40]. In GATK v4.5.0.0, Illumina DRAGEN features were added to HaplotypeCaller, bringing it closer to functional equivalence with DRAGEN v3.7.8. Furthermore, the hardware-accelerated implementation of Smith-Waterman in HaplotypeCaller and Mutect2 significantly improves speed.

Additionally, it is essential to consider the constantly evolving landscape of bioinformatics tools, and users should stay informed about updates and improvements to ensure the most accurate results in their analyses. DeepVariant from Google for variant calling and other popular annotation tools will be incorporated in a subsequent release of WEAP. Moreover, the WEAP workflow will also be made available for the popular workflow manager ‘Nextflow’. The current workflow of WEAP uses ANNOVAR for annotation of the variants; however, the user can also run Variant Effect Predictor (VEP) on the output generated by WEAP in the variant calling step. WEAP is an automated pipeline for genetic variant calling with annotation and, to our knowledge, the only pipeline that combines the most up-to-date BWA aligner (BWA-MEM2) and the latest GATK version (GATK v4.5.0.0) with the GATK Best Practices guidelines.

The tools used in NGS analysis are heterogeneous in many characteristics, such as the data types they support and their provisions for utilizing available resources, which makes analyzing large volumes of data highly challenging. Automation of the workflow often increases robustness and reduces time and cost by eliminating possible human errors. WEAP in parallel mode significantly reduces the time required for germline and somatic variant calling compared to the conventional step-by-step serial process. For instance, in germline mode, WEAP parallel reduced the time required for variant calling and annotation from the FASTQ files of four samples from 5 hours & 58 minutes in serial mode to 3 hours & 16 minutes (45.25% reduction in time). Similarly, the time taken for PoN creation from 40 samples was reduced from approximately 111 hours & 55 minutes to 36 hours & 18 minutes (67.56% reduction in time). In somatic variant calling, the time was reduced from 15 hours & 28 minutes to 4 hours & 55 minutes in tumor-only mode (68.17% reduction in time), and from 21 hours & 6 minutes to 6 hours & 9 minutes (70.83% reduction in time) in tumor with matched normal mode. WEAP used the available computational resources efficiently to reduce the overall analysis time.
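The percentage reductions quoted above follow from the serial and parallel run times as 100 x (serial - parallel) / serial. The short script below recomputes them from the durations given in the text, rounded to whole minutes; the function and variable names are illustrative.

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Express a duration given as hours and minutes in minutes."""
    return 60 * hours + minutes


def percent_reduction(serial: float, parallel: float) -> float:
    """Percentage of the serial run time saved by the parallel run."""
    return 100.0 * (serial - parallel) / serial


# (serial, parallel) durations as (hours, minutes), taken from the text.
runs = {
    "germline, 4 samples": ((5, 58), (3, 16)),
    "PoN, 40 samples": ((111, 55), (36, 18)),
    "tumor-only": ((15, 28), (4, 55)),
    "tumor with matched normal": ((21, 6), (6, 9)),
}

for name, (serial, parallel) in runs.items():
    s, p = to_minutes(*serial), to_minutes(*parallel)
    print(f"{name}: {percent_reduction(s, p):.2f}% less time, {s / p:.2f}-fold faster")
```

For the germline run this reproduces the reported 45.25%. For the other runs the result can differ from the reported figures by a few hundredths of a percent, presumably because the reported percentages were derived from timings in seconds rather than whole minutes.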

4.1 Strengths and Limitations

WEAP simplifies the process of calling variants from multiple samples. It works on four tasks simultaneously, streamlining the workflow from fastq to variant annotation. Users only need to input the sample directory's path and some essential parameters at the start, and WEAP takes care of the rest automatically.

However, the tool has certain limitations. WEAP does not support pausing and resuming tasks during the analysis. In terms of performance, it functions optimally on Linux (Ubuntu 20.04 or higher) and on Windows 10/11 via the Windows Subsystem for Linux, on systems with mid-range to high-end computing configurations.

5. Conclusions

WEAP enables automated germline and somatic variant calling from large WES datasets, improving efficiency and reducing the risk of human error. In the future, WEAP will be updated with various downstream analysis modules, automated copy number analysis from tumor-normal paired samples and SNP calling from mitochondrial genomes.

Abbreviations

BAM: Binary Alignment Map

BWA: Burrows-Wheeler Aligner

GATK: Genome Analysis Toolkit

gVCF: Genomic Variant Call Format

PoN: Panel of Normals

SAM: Sequence Alignment/Map Format

VCF: Variant Call Format

WEAP: Whole Exome Analysis Pipeline

WES: Whole Exome Sequencing

Data Availability Statement

WEAP is available on GitHub: https://github.com/ranjanjs34/weap-v1.0.1. The binary executables and a detailed, easy-to-follow user manual are provided as additional data with this manuscript. The complete pipeline is provided as Additional data 1 and the manual for running the pipeline is provided as Additional data 2.

Author Contributions

RJS: Conceptualization, Data curation, Formal analysis, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing; NSK: Supervision, Writing – review & editing. All authors have read and agreed to the published version of the manuscript.

Funding

The work was supported by the Department of Biotechnology (DBT), New Delhi (No. BT/NER/143/SP44475/2021).

Conflict of Interest

Authors have no conflict of interest.

Acknowledgment

The authors thank the Department of Biotechnology, Govt of India, New Delhi for the funding under the Advanced Level State Biotech Hub program (No. BT/NER/143/SP44475/2021) that supported the infrastructure and manpower in the project.

References

E.G. Seaby, R.J. Pengelly, S. Ennis. Exome sequencing explained: a practical guide to its clinical application, Brief. Funct. Genomics. 15 (2016) 374–384. https://doi.org/10.1093/bfgp/elv054.
J. D. Backman, A.H. Li, A. Marcketta, D. Sun, J. Mbatchou, M.D. Kessler, C. Benner, D. Liu, A.E Locke, S. Balasubramanian, A. Yadav, N. Banerjee, C.E. Gillies, A. Damask, S. Liu, X. Bai, A. Hawes, E. Maxwell, Gurski L, Watanabe K, Kosmicki JA, Rajagopal V, Mighty J; Regeneron Genetics Center; DiscovEHR; M. Jones, L. Mitnaul, E. Stahl, G. Coppola, E. Jorgenson, L. Habegger, W.J. Salerno, A.R. Shuldiner, L.A. Lotta, J.D. Overton, M.N. Cantor, J.G. Reid, G. Yancopoulos, H.M Kang, J. Marchini, A. Baras, G.R. Abecasis, M.A.R. Ferreira. (2021). Exome sequencing and analysis of 454,787 UK Biobank participants, Nature. 599(7886) (2021) 628–634. https://doi.org/10.1038/s41586-021-04103-z.
Alganmi N, Abusamra H. Evaluation of an optimized germline exomes pipeline using BWA-MEM2 and Dragen-GATK tools. PLoS One. 2023 Aug 3;18(8):e0288371. doi: 10.1371/journal.pone.0288371. PMID: 37535628; PMCID: PMC10399881.
G. Goh, M. Choi. Application of Whole Exome Sequencing to Identify Disease-Causing Variants in Inherited Human Diseases, Genomics. Inform. 10 (2012) 214-219. https://doi.org/10.5808/GI.2012.10.4.214.
K. Retterer, J. Juusola, M.T. Cho, P. Vitazka, F. Millan, F. Gibellini, A. Vertino-Bell, N. Smaoui, J. Neidich, K.G. Monaghan, D. McKnight, R. Bai, S Suchy, B Friedman, J. Tahiliani, D. Pineda-Alvarez, G. Richard, T. Brandt, E. Haverfield, W.K. Chung, S. Bale. Clinical application of whole-exome sequencing across clinical indications, Genet. Med. 18 (2016) 696–704. https://doi.org/10.1038/gim.2015.148.
S.B. Seidelmann, E. Smith, L. Subrahmanyan, D. Dykas, M.D. Abou Ziki, B. Azari, F. Hannah-Shmouni, Y. Jiang, J.G. Akar, M. Marieb, D. Jacoby, A.E. Bale, R.P. Lifton, A. Mani, 2017. Application of Whole Exome Sequencing in the Clinical Diagnosis and Management of Inherited Cardiovascular Diseases in Adults. Circ. Cardiovasc. Genet. 10, e001573. https://doi.org/10.1161/CIRCGENETICS.116.001573.
M. Zhang, L. Zhang, Y. Li, F. Sun, Y. Fang, R. Zhang, J. Wu, Z. Zhou, H. Song, L Xue, B. Han, C. Zheng. Exome sequencing identifies somatic mutations in novel driver genes in non-small cell lung cancer, Aging. 12(13) (2020) 13701–13715. https://doi.org/10.18632/aging.103500.
M. Avila, F. Meric-Bernstam. Next-generation sequencing for the general cancer patient, Clin Adv Hematol Oncol. 17 (2019) 447-454.
P. Suwinski, C. Ong, M.H.T. Ling, Y.M. Poh, A.M. Khan, H.S. Ong, 2019. Advancing Personalized Medicine Through the Application of Whole Exome Sequencing and Big Data Analytics. Front. Genet. 10 (2019) 49. https://doi.org/10.3389/fgene.2019.00049.
D. Blankenberg, J. Hillman-Jackson. Analysis of Next-Generation Sequencing Data Using Galaxy, Methods. Mol. Biol. 1150 (2014) 21–43. https://doi.org/10.1007/978-1-4939-0512-6_2.
P. Kulkarni, P. Frommolt. Challenges in the Setup of Large-scale Next-Generation Sequencing Analysis Workflows, Comput. Struct. Biotechnol. J. 15 (2017) 471–477. https://doi.org/10.1016/j.csbj.2017.10.001.
O. An, K.-T. Tan, Y. Li, j. Li, C.-S. Wu, B. Zhang, L. Chen, H. Yang. CSI NGS Portal: An Online Platform for Automated NGS Data Analysis and Sharing, Int. J. Mol. Sci. 21(11) (2020) 3828. https://doi.org/10.3390/ijms21113828.
Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Liu X, Yan Z, Wu C, Yang Y, Li X, Zhang G. FastProNGS: fast preprocessing of next-generation sequencing reads. BMC Bioinformatics. 2019 Jun 17;20(1):345. doi: 10.1186/s12859-019-2936-9. PMID: 31208325; PMCID: PMC6580563.
Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sep 1;34(17):i884-i890. doi: 10.1093/bioinformatics/bty560. PMID: 30423086; PMCID: PMC6129281.
Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014 Aug 1;30(15):2114-20. doi: 10.1093/bioinformatics/btu170. Epub 2014 Apr 1. PMID: 24695404; PMCID: PMC4103590.
D.C. Koboldt. Best practices for variant calling in clinical sequencing, Genome. Med. 12, (2020) 91. https://doi.org/10.1186/s13073-020-00791-w.
K.A. O'Connell, Z.B. Yosufzai, R.A. Campbell, C.J. Lobb, H.T. Engelken, L.M. Gorrell, T.B. Carlson, J.J. Catana, D. Mikdadi, V.R. Bonazzi, J.A. Klenk. Accelerating genomic workflows using NVIDIA Parabricks, BMC Bioinformatics. 24(1) (2023) 221. https://doi.org/10.1186/s12859-023-05292-2.
E. Afgan, D. Baker, B. Batut, M. van den Beek, D. Bouvier, M. Cech, J. Chilton, D. Clements, N. Coraor, B.A. Grüning, A. Guerler, J. Hillman-Jackson, S. Hiltemann, V. Jalili, H. Rasche, N. Soranzo, J. Goecks, J. Taylor, A. Nekrutenko, D. Blankenberg. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update, Nucleic. Acids. Res. 46(W1) (2018) W537–W544. https://doi.org/10.1093/nar/gky379.
Brouard JS, Bissonnette N. Variant Calling from RNA-seq Data Using the GATK Joint Genotyping Workflow. Methods Mol Biol. 2022;2493:205-233. doi: 10.1007/978-1-0716-2293-3_13. PMID: 35751817.
Liu X, Han S, Wang Z, Gelernter J, Yang BZ. Variant callers for next-generation sequencing data: a comparison study. PLoS One. 2013 Sep 27;8(9):e75619. doi: 10.1371/journal.pone.0075619. PMID: 24086590; PMCID: PMC3785481.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools, Bioinformatics. 25(16) (2009) 2078–2079. https://doi.org/10.1093/bioinformatics/btp352.
H. Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv (2013) arXiv:1303.3997, https://doi.org/10.48550/arXiv.1303.3997.
A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, M.A. DePristo. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data, Genome. Res. 20(9) (2010) 1297–1303. https://doi.org/10.1101/gr.107524.110.
G.A. Van der Auwera, M.O. Carneiro, C. Hartl, R. Poplin, G. Del Angel, A. Levy-Moonshine, T. Jordan, K. Shakir, D. Roazen, J. Thibault, E. Banks, K.V. Garimella, D. Altshuler, S. Gabriel, M.A. DePristo. From FastQ Data to High‐Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline, Curr. Protoc. Bioinformatics. 43(1110) (2013) 11.10.1-11.10.33. https://doi.org/10.1002/0471250953.bi1110s43.
Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. Twelve years of SAMtools and BCFtools. Gigascience. 2021 Feb 16;10(2):giab008. doi: 10.1093/gigascience/giab008. PMID: 33590861; PMCID: PMC7931819.
K. Wang, M. Li, H. Hakonarson, 2010. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic. Acids. Res. 38, e164. https://doi.org/10.1093/nar/gkq603.
Guo, Y., Ding, X., Shen, Y. et al. SeqMule: automated pipeline for analysis of human exome/genome sequencing data. Sci Rep 5, 14283 (2015). https://doi.org/10.1038/srep14283
S . Pei, T. Liu, X. Ren, W. Li, C. Chen, Z. Xie, 2021. Benchmarking variant callers in next-generation and third-generation sequencing analysis. Brief. Bioinform. 22, bbaa148. https://doi.org/10.1093/bib/bbaa148.
K.L Huang, R.J. Mashl, Y. Wu, D.I. Ritter, J. Wang, C. Oh, M. Paczkowska, S. Reynolds, M.A. Wyczalkowski, N. Oak, A.D. Scott, M. Krassowski, A.D. Cherniack, K.E. Houlahan, R. Jayasinghe, L.B. Wang, D.C. Zhou, D. Liu, S. Cao, Y.W. Kim, A. Koire, J.F. McMichael, V. Hucthagowder, T.B. Kim, A. Hahn, C. Wang, M.D. McLellan, F. Al-Mulla, K.J. Johnson; Cancer Genome Atlas Research Network; O. Lichtarge, P.C. Boutros, B. Raphael, A.J. Lazar, W. Zhang, M.C. Wendl, R. Govindan, S. Jain, D. Wheeler, S. Kulkarni, J.F. Dipersio, J. Reimand, F. Meric-Bernstam, K. Chen, I. Shmulevich, S.E. Plon, F. Chen, L. Ding. Pathogenic Germline Variants in 10,389 Adult Cancers, Cell. 173(2) (2018) 355-370.e14. https://doi.org/10.1016/j.cell.2018.03.039.
B.V. Halldorsson, H.P. Eggertsson, K.H.S. Moore, H. Hauswedell, O. Eiriksson, M.O. Ulfarsson, G. Palsson, M.T. Hardarson, A. Oddsson, B.O. Jensson, S. Kristmundsdottir, B.D. Sigurpalsdottir, O. A. Stefansson, D. Beyter, G. Holley, V. Tragante, A. Gylfason,P.I. Olason, F. Zink, M. Asgeirsdottir, S.T. Sverrisson, B. Sigurdsson, S.A. Gudjonsson, G.T. Sigurdsson, G.H. Halldorsson, G. Sveinbjornsson, K. Norland, U. Styrkarsdottir, D.N. Magnusdottir, S. Snorradottir, K. Kristinsson, E. Sobech, H. Jonsson, A.J. Geirsson, I. Olafsson, P. Jonsson, O.B. Pedersen, C. Erikstrup, S. Brunak, S.R. Ostrowski; DBDS Genetic Consortium; G. Thorleifsson, F. Jonsson, P. Melsted, I. Jonsdottir, T. Rafnar, H. Holm, H. Stefansson, J. Saemundsdottir, D.F. Gudbjartsson, O.T. Magnusson, G. Masson, U. Thorsteinsdottir, A. Helgason, H. Jonsson, P. Sulem, K. Stefansson. The sequences of 150,119 genomes in the UK Biobank, Nature. 607(7920) (2022) 732–740. https://doi.org/10.1038/s41586-022-04965-x.
M. Byrska-Bishop, U.S. Evani, X Zhao, A.O. Basile, H.J. Abel, A.A. Regier, A. Corvelo, W.E. Clarke, R. Musunuri, K. Nagulapalli, S. Fairley, A. Runnels, L. Winterkorn, E. Lowy; Human Genome Structural Variation Consortium; F. Paul, S. Germer, H. Brand, I.M. Hall, M.E. Talkowski, G. Narzisi, M.C. Zody. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios, Cell. 185(18) (2022) 3426-3440.e19. https://doi.org/10.1016/j.cell.2022.08.004.
Y. Dou, M. Kwon, R.E. Rodin, I. Cortés-Ciriano, R. Doan, L.J. Luquette, A. Galor, C. Bohrson, C.A. Walsh, P.J. Park. Accurate detection of mosaic variants in sequencing data without matched controls, Nat. Biotechnol. 38(3) (2020) 314-319. https://doi.org/10.1038/s41587-019-0368-8.
Q. Wang, V Kotoula, P.C. Hsu, K. Papadopoulou, J.W.K. Ho, G. Fountzilas, E. Giannoulatou. Comparison of somatic variant detection algorithms using Ion Torrent targeted deep sequencing data, BMC. Med. Genomics. 12 (2019) 181. https://doi.org/10.1186/s12920-019-0636-y.
McKenna A et al. (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297-1303.
Cibulskis K et al. (2013). Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnology, 31(3), 213-219.
Benjamin D et al. (2019). Calling Somatic SNVs and Indels with Mutect2. bioRxiv, 861054.
Eren KK, Çınar E, Karakurt HU, Özgür A. Improving the filtering of false positive single nucleotide variations by combining genomic features with quality metrics. Bioinformatics. 2023 Dec 1;39(12):btad694. doi: 10.1093/bioinformatics/btad694. PMID: 38019945; PMCID: PMC10692869.
Lin, YL., Chang, PC., Hsu, C. et al. Comparison of GATK and DeepVariant by trio sequencing. Sci Rep 12, 1809 (2022). https://doi.org/10.1038/s41598-022-05833-4
Barbitoff, Y.A., Abasov, R., Tvorogova, V.E. et al. Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery. BMC Genomics 23, 155 (2022). https://doi.org/10.1186/.
De Summa, S., Malerba, G., Pinto, R. et al. GATK hard filtering: tunable parameters to improve variant calling for next generation sequencing targeted gene panel data. BMC Bioinformatics 18 (Suppl 5), 119 (2017). https://doi.org/10.1186/s12859-017-1537-8.

The authors declare no competing interests.
