1. A Nextflow Workflow for BRCA1/2 variant profiling.
An NGS workflow for paired-end Illumina libraries was developed to identify germline variants for targeted sequencing using the AmpliSeq for Illumina BRCA Panel [20]. This workflow, built on the Nextflow platform, enables an autonomous, reproducible, and scalable scientific pipeline [21]. The integrated tools are executed through containers, and the entire workflow is available on the GitHub repository ( https://github.com/digenoma-lab/BRCA).
The BRCA workflow (Fig. 1) starts by mapping FASTQ files to the human hg38 genome reference using the Burrows-Wheeler Alignment tool (BWA) [22] and performs read quality control with FastQC, which allows for the identification of potential problems in the generated reads. Metrics of aligned reads and exon coverage are computed with Qualimap [23]. After aligning reads with BWA during data preprocessing, the reads are processed to ensure consistency and correct mate information in BAM files using Samtools fixmate, and reads are sorted using the Samtools sort [24]. In parallel to variant calling, the outputs from FastQC, Qualimap, and Samtools are consolidated by MultiQC [25] to create an HTML report with the metrics of processed samples. The preprocessed BAM files are used as input for Variant calling that is performed using two callers: Strelka [26] and DeepVariant [27, 28]. Both have demonstrated high performance for detecting germline SNVs and Indels. These callers were executed in two modes: single-sample mode (S), which calls variants for each sample individually, and multisample mode (MS), which performs joint variant calling and genotyping, considering all samples simultaneously. Finally, the functional annotation of the identified variants is performed using ANNOVAR [29], integrating databases such as gnomAD v4, dbSNP v150, ICGC v28, ClinVar, InterVar, and REVEL (Fig. 1). REVEL is an ensemble method for predicting the pathogenicity of missense variants based on 18 individual tools, including conservation and functional scores [30] and InterVar is a tool designed for the clinical interpretation of genetic variants according to the ACMG/AMP 2015 guidelines [31]. We used REVEL and InterVar to understand the functional significance of the genetic variants and their potential effects on BRCA1/2 genes. This pipeline design ensures comprehensive variant detection and annotation, providing a robust and reliable tool for clinical and research applications.
2. The BRCA pipeline enables rapid and reproducible variant profiling for clinical practice.
Sixteen breast cancer patients were sequenced using the AmpliSeq for Illumina BRCA Panel, which includes all exonic regions of the BRCA1 and BRCA2 genes and flanking intronic sequences, covering a total of 22 Kb [20].
To ensure reproducibility, we executed the BRCA Nextflow pipeline 10 times using data from 16 breast cancer patients. The results showed high reproducibility, with 100% concordance of the variants reported by Strelka and DeepVariant in both single-sample and multi-sample modes. Additionally, we used the Fastq Shuffle tool [32] to create new FASTA files by randomly reordering the reads per sample for three independent runs, as well as altering the sample names in one instance. We observed a 100% concordance of the variants reported in Strelka and DeepVariant in both modes. Finally, all tasks showed a median execution time of less than 3 minutes (Supplementary Fig. 1) and with an average total execution of 13 minutes for all 16 samples. The most time-consuming tasks were annotation with ANNOVAR, which had a median of 2 minutes and 43 seconds, and BWA-MEM, with a median of 1 minute and 31 seconds. Among the variant callers, Strelka had a median execution time of 50 seconds in single-sample mode and 1 minute and 35 seconds in multi-sample mode. In comparison, DeepVariant had a median execution time of 51 seconds, and GLnexus took only 4.5 seconds per run (Supplementary Fig. 1). Overall, the BRCA Nextflow pipeline demonstrated exceptional performance, with rapid execution times and high reproducibility, making it a reliable and efficient tool for clinical practice.
3. High-quality sequencing and comprehensive variant report for BRCA1/2 genes
We obtained high-quality reads for all 16 samples, with an average of 478,730 raw reads per sample (Supplementary Fig. 2). All the reads passed quality control. The average GC content was 38%, with 99% of reads mapped to the target regions per sample, achieving a mean coverage of 1,905X (Supplementary Fig. 2). We identified a total of 70 unique mutations in the BRCA1/2 genes across all patients, of which 4 are UTR, 39 intronic, and 27 exonic. Among the exonic mutations, 10 are silent, 16 are missense, and one is a frameshift insertion (Fig. 2). The most frequent mutation changes are T > C (n = 193), followed by C > T (n = 81) and T > G (n = 26). Strelka and DeepVariant identified a median of 13 mutations in exonic regions per patient. Patients HR06 and HR12 had a higher number of mutations (n = 9), while patients HR14 had 3 mutations and HR02 had 2 mutations. (Fig. 2). We examined coverage variation across the target regions of the panel and observed a high degree of uniformity (Supplementary Fig. 3).
We detected 8 nonsynonymous mutations in BRCA1 and 9 in BRCA2. Among these, one is a frameshift insertion, and the remaining are missense mutations. Strelka reported one variant more than DeepVariant in 16 BC patients (Fig. 3). With a median of 6.5 nonsynonymous variants per sample for both varcallers. Non-synonymous mutations were detected in 81.25% (n = 13) and 100% (n = 16) in BRCA1 and BRCA2 with Strelka and DeepVariant, respectively (Fig. 3).
4. Comparison of Single and Multi-Sample Modes in Variant Detection for BRCA1 and BRCA2 Using Strelka and DeepVariant.
In Strelka, both single and multi-sample modes found 8 nonsynonymous variants in BRCA1. These were 1 frameshift insertion and 7 missense mutations. Moreover, 9 missense mutations in BRCA2 are also found by both strelka modes. The E1390G variant in BRCA1 was identified in one patient in single mode and in three patients in multisample mode. This suggests it could be a novel mutation or a sequencing artifact (Supplementary Fig. 4). Visual inspection of this variant in IGV supports the former rather than the latter possibility (variant phred quality 49, Supplementary Fig. 6).
On the other hand, DeepVariant, in both single and multi-sample modes, reported identical exonic nonsynonymous variants. We observed no differences in variant calling between the two modes of DeepVariant (Supplementary Fig. 5).
5. Frequency and Impact of BRCA1/2 Mutations in Chilean Patients Diagnosed with Breast Cancer
The 16 sequenced patients were Chilean women, up to 40 years old, diagnosed with breast cancer via histopathological confirmation between January 2015 and December 2021. None of the participants were selected based on a family history of cancer. Of the germline mutations detected on this cohort, 97% are benign, 1% are pathogenic, 1% are of uncertain significance, and 1% are unknown. We identified a pathogenic variant in one breast cancer patient (HR06). Both Strelka and DeepVariant reported the p.L655Ffs*10 frameshift insertion in exon 9 of BRCA1 in single and multisample modes (Figs. 2 and 4). This variant, with rs80357880, is classified as pathogenic in ClinVar and has been associated with hereditary breast and ovarian cancer. According to the OncoKB database [33], this variant is classified as likely oncogenic with a Level 1 evidence classification, indicating that it is an FDA-recognized biomarker predictive of response to several FDA-approved drugs, including PARP inhibitors.
In patient HR10, we identified a BRCA2 mutation of uncertain significance, p.K797Q, reported by both Strelka and DeepVariant in both modes. According to ClinVar, this mutation has been associated with hereditary breast and ovarian cancer syndrome. Predictors SIFT, PROVEAN, and M-CAP suggest the variant could be deleterious. However, FATHMM, REVEL (score of 0.281), and InterVar indicate that it could be benign. Additionally, this mutation is not present in gnomAD v4.0. We identified one variant that has not been detected in other repositories. The E1390G mutation in BRCA1 was reported by Strelka in single mode for one patient and in multisample mode for three patients. DeepVariant did not report this variant. This mutation does not have a REVEL score; however, predictors MutationTaster, MetaRNN, and FATHMM-MKL classify it as benign. The E1390G mutation in BRCA1 was visually inspected using IGV for the three patients reporting this variant. The mutation is located in reads flanking one of the target regions, showing low read coverage (greater than 8). However, it has a Strelka Phred quality score of 47, suggesting a low probability of being an artifact (Supplementary Fig. 6). The mutations with the highest REVEL scores are A2951T in BRCA2 and S993N in BRCA1, with scores of 0.39 and 0.37, respectively. We did not identify any variants with a REVEL score over 0.5 (Fig. 4), which is the minimum required to classify variants as pathogenic.
We identified 14 mutations in BRCA1/2 that are present in gnomAD. The most frequent mutations in our cohort for BRCA1 are K1136R, E991G, P824L, and S104G, each reported in 13 patients, with an allele frequency of 0.5 in our cohort (Fig. 4). For BRCA2, the most common mutations are I2490T, N372H, and V2466A, found in 5, 9, and 16 patients, respectively, with allele frequencies of 0.15, 0.34, and 1.0 in our cohort (Fig. 4). In the gnomAD Latino/Admixed American (AMR) subpopulation, the frequencies for these mutations are 0.07, 0.30, and 0.99, respectively.
To evaluate the frequency of these mutations, we compared the allele frequencies (AF) in our Chilean cohort (CHI) with those of the subpopulations reported in gnomAD. We subsequently performed a PCA analysis based on the allele frequencies of the mutations in our cohort and gnomAD, followed by hierarchical clustering. We observed that the first principal component accounted for 44.9% of the variance in allele frequencies, while the second explained 18%. Using the first two components (62.9% variance), we identified three distinct clusters. This analysis indicates that the allele frequencies identified in our study are more closely aligned with those reported for South Asians and Admixed Americans (clustered in blue in Fig. 5). The mutations contributing most to the variance of dimension 1 were E991G, K1136R, and S104G with contributions of 12.8%, 12,6%, and 12,6%, respectively (Fig. 5).