An integrated approach for short variant discovery
The integrated approach described here provides a method for the discovery of short germline variants from the integration of WES and RNA-seq data. It is mainly based on the individual calling of the WES and RNA-seq data in GVCF mode and their subsequent joint genotyping. The comparison of WES calls and RNA-seq calls allows the validation of variants, enables the identification of RNA-editing and ASE events, and allows the classification of variants into six groups based on their quality, genotype or expression in the different types of source data: strong-evidence, DNA-only, RNA-only, ASE, RNA-editing and RNA-rescue variants.
A total of 718,239 variants were identified using the integrated approach from four samples, of which 622,229 were SNPs and 96,010 were indels (Table 1). These variants were classified into six groups consisting of 44,264 strong-evidence, 578,359 DNA-only, 68,193 RNA-only, 589 ASE, 24,082 RNA-editing and 2,752 RNA-rescue variants (Table 1). To evaluate the identified variants, we considered two types of matches, genotype match (GT) and allele match (AL), for each type of source data. Therefore, we evaluated the allele and genotype information obtained from WES data, the allele and genotype information from the RNA-seq data, and the allele information obtained by jointly considering both WES and RNA-seq alleles.
Table 1
Overview of total variants identified in all samples by variant type using the integrated approach.
Variant type
|
Total
|
SNPs
|
Indels
|
ALL
|
718,239
|
622,229
|
96,010
|
Strong-evidence
|
44,264
|
41,363
|
2,901
|
DNA-only
|
578,359
|
501,496
|
76,863
|
RNA-only
|
68,193
|
57,666
|
10,527
|
RNA-rescue
|
589
|
318
|
271
|
RNA-editing
|
24,082
|
19,041
|
5,041
|
ASE
|
2,752
|
2,345
|
407
|
For the evaluation of genotyping from WES data, RNA-only and RNA-rescue variants were excluded from evaluation because they had a low DNA genotype quality, as well as RNA-editing and ASE variants because they had a different allele expression. For the assessment of genotype from RNA-seq data, DNA-only variants were excluded as they had a low RNA genotype quality, as well as RNA-editing and ASE variants because they had a different allele expression. For the evaluation of the allele information, RNA-editing variants were excluded from the analysis in all cases (DNA, RNA and both).
In general, genotyping from the integrated approach achieved similar recall, precision and F-score regardless of source data (DNA or RNA) (Fig. 1). Regarding allele identification metrics, recall, precision and F-score were also similar when WES data, RNA-seq data, or both were used (Fig. 2). Furthermore, we observed a higher performance of the integrated approach in the identification of SNPs than of indels (Figs. 1 and 2).
Regarding the variants obtained from WES data, the identification of alleles and the genotyping in SNPs achieved a recall greater than 97%, a precision greater than 87% and an F-score greater than 92% in all samples (Figs. 1 and 2). Concerning the indels, their genotypes were identified with a recall greater than 88%, a precision greater than 39% and an F-score greater than 56% (Fig. 1), while the allele identification achieved a recall greater than 72%, a precision greater than 40% and an F-score greater than 57% in all samples (Fig. 2).
With respect to the short variants identified from the RNA-seq data, the genotyping of SNPs reached a recall greater than 99%, a precision greater than 93% and an F-score greater than 96% in the four samples, while indel genotyping achieved recall, precision, and F-score greater than 90%, 40% and 56%, respectively, in all samples (Fig. 1). The identification of the alleles of SNPs achieved a recall greater than 97%, a precision greater than 90% and an F-score greater than 93% (Fig. 2). The identification of the alleles of indels achieved a recall greater than 78%, a precision greater than 37% and an F-score greater than 54% in all samples (Fig. 2).
The alleles of the SNPs identified jointly considering both the WES alleles and the RNA-seq alleles achieved a recall greater than 97%, a precision greater than 88%, and an F-score greater than 92% in all samples (Fig. 2). For indels, there was a recall greater than 73%, a precision greater than 40%, and an F-score greater than 57% in all samples (Fig. 2).
Strong-evidence variants
Strong-evidence variants are variants with the same genotype in the DNA and RNA-seq data, as well as a reliable quality in both. Thus, the same alleles and genotypes are obtained regardless of the type of data and therefore the same metrics.
There was a high recall, precision and F-score in genotyping and identification of alleles for strong-evidence SNPs and indels (Supplementary Fig. 1 online). Strong-evidence SNPs were identified with a recall, precision and F-score greater than 98% in all samples (Supplementary Fig. 1 online). Strong-evidence indels did not achieve as high metrics as SNPs but they did achieve the highest metrics of all types of variants (Supplementary File S1).
DNA-only variants
DNA-only variants have an RNA coverage lower than or equal to 6 and/or a strand bias in RNA greater than or equal to 2 and/or an RNA GQ lower than or equal to 20. Because of this, the alleles and genotype from the RNA data cannot be trusted, and therefore RNA-editing and ASE cannot be detected.
For these variants we observed low recall, precision and F-score in the genotyping from RNA data (Supplementary Fig. 2 online). From the DNA data, there was a reliable but lower recall, precision, and F-score than observed in the strong-evidence variants (Supplementary File S1). DNA-only SNPs were genotyped with a F-score greater than 92% in all samples (Supplementary Fig. 2 online). Furthermore, there was a decrease in recall, precision and F-score in genotyping of DNA-only variants compared to the final GT metrics from DNA data of the integrated approach, which includes strong-evidence and DNA-only variants (Supplementary File S1).
Regarding the allele identification of SNPs and indels from DNA data, RNA data and jointly considering both (DNA + RNA), a reliable recall, precision and F-score was observed (Supplementary Fig. 2 online). Alleles of DNA-only SNPs were identified with a F-score greater than 91% in all the cases (Supplementary Fig. 2 online). There was a decrease in recall, precision and F-score in identifying DNA-only SNPs alleles relative to total SNPs alleles, including all variant types except RNA-editing variants, when using DNA data, RNA data and both (Supplementary File S1). However, in general, there was a higher precision and F-score in the allele identification of DNA-only indels than of total indels in DNA, RNA and with the joint use of both (Supplementary File S1).
RNA-only variants
RNA-only variants are variants with a DNA coverage lower than or equal to 6, therefore the alleles and genotype from DNA data cannot be trusted and ASE and RNA-editing cannot be identified. The presence of ASE variants or RNA editing variants can negatively affect the metrics when evaluating genotypes in RNA data since the truth dataset was derived from DNA data.
We observed an expected low recall, precision and F-score in genotyping from DNA data (Supplementary Fig. 3 online) for RNA-only variants. Concerning SNPs and indels genotyping from RNA data, we observed a good recall, precision and F-score, but lower than those observed in strong-evidence variants (Supplementary File S1). Genotyping of RNA-only SNPs from RNA data had a F-score greater than 94% in all samples (Supplementary Fig. 3 online). Moreover, we observed lower recall, precision and F-score in genotyping of RNA-only variants relative to total variants in RNA data, including Strong-evidence, RNA-only and RNA-rescue variants (Supplementary File S1).
The recall, precision and F-score of the allele identification of RNA-only SNPs were greater than 96.8%, 90.6%, and 93.7%, respectively, in all samples when considering the alleles from RNA data and considering both DNA and RNA alleles (Supplementary Fig. 3 online). The alleles of RNA-only indels were identified with a recall greater than 79.5%, a precision greater than 39.8% and an F-score greater than 56.8% in all samples when considering the alleles from RNA data and considering both DNA and RNA alleles (Supplementary Fig. 3 online). Regarding the identification of alleles from DNA data, metrics were lower with an F-score greater than 89.3% for RNA-only SNPs and greater than 49.8% for RNA-only indels in all samples (Supplementary Fig. 3 online). There was an increase in precision and F-score in detecting alleles for RNA-only SNPs than for total SNPs, which includes all variant types except RNA-editing variants, when using RNA data and RNA with DNA data, but a decrease when using DNA data (Supplementary File S1). However, there was a decrease in precision and F-score for allele identification in RNA-only in relation to the total indels in all cases (DNA, RNA or DNA + RNA) (Supplementary File S1).
ASE variants
ASE variants are heterozygous for DNA but homozygous for RNA because only one of the two alleles has been expressed. The recall, precision and F-score in the AL match were the same for the DNA and RNA data (Supplementary Fig. 4 online). In the sample NA12878, allele identification reached an F-score of 87.4% for ASE SNPs and 21.4% for ASE indels when considering DNA alleles, RNA alleles, or jointly considering both (Supplementary Fig. 4 online). For sample HG00171, we observed an F-score of 44.9% for the allele identification of ASE SNPs and of 52.5% for the allele identification of ASE indels in all cases (DNA, RNA and DNA + RNA) (Supplementary Fig. 4 online). In HG00378, we observed an F-score of 25.7% for the alleles of the ASE SNPs and an F-score of 33.8% for the alleles of the ASE indels, while in NA20509, the F-scores were 57% and 49.4% for the alleles of ASE SNPs and ASE indels, respectively, when considering DNA alleles, RNA alleles, or jointly considering both (Supplementary Fig. 4 online).
Concerning the genotyping of ASE variants, there was an improvement in the metrics when using DNA data instead of RNA data (Supplementary Fig. 4 online), this is because the truth dataset was derived from DNA data so the expression of the variants was not considered. In NA12878, we observed an F-score of 87.4% in the genotyping of ASE SNPs from DNA data and an F-score of 0% in the genotyping of ASE SNPs from RNA data, while the genotyping of ASE indels only achieved an F-score of 16.6% when using DNA data and an F-score of 13.3% when using RNA data (Supplementary Fig. 4 online). In the sample HG00171, genotyping from DNA data reached an F-score of 40.2% for ASE SNPs and an F-score of 46.5 for ASE indels, whereas genotyping from RNA data achieved an F-score of 1.8% and 11.7% for ASE SNPs and ASE indels, respectively (Supplementary Fig. 4 online). For samples HG00378 and NA20509, we observed an F-score of 21.6% and 48.3%, respectively, in the genotyping of ASE SNPs from DNA and an F-score of 26.9% and 32.2%, respectively, in the genotyping of ASE indels from DNA. In the genotyping of samples HG00378 and NA20509 from RNA data, we observed an F-score of 2.4% and 7.3%, respectively, for the ASE SNPs and an F-score of 13.3% and 17.3%, respectively, for the ASE indels (Supplementary Fig. 4 online).
RNA-editing variants
RNA-editing variants are the variants present in RNA but absent in DNA. They cannot be evaluated as our truth dataset was derived from DNA data where RNA-editing variants were not present. For this reason, we observed very low recall, precision and F-score in genotype and allele identification from RNA data (Supplementary Fig. 5 online).
RNA-rescue variants
RNA-rescue variants meet the minimum thresholds of coverage and strand bias in DNA and RNA, have an RNA GQ greater than 50 but a DNA GQ lower than or equal to 20. Therefore, the genotype information from RNA data can be trusted but not the genotype information from DNA data. In addition, RNA-editing and ASE variants cannot be recognized and will be present within the RNA-rescue variants, which may affect their evaluation.
We observed an F-score greater than 70% in all samples in the genotype and allele identification of RNA-rescue SNPs in RNA data (Supplementary Fig. 6 online). However, the identification of indels in RNA data had very low metrics, we observed an F-score in genotyping lower than 22% and an F-score in allele identification lower than 40% in all the samples (Supplementary Fig. 6 online).
Comparison of variant discovery with the integrated approach versus WES calling
The total number of identified variants increased when WES and RNA-seq data were integrated (WES + RNA calling) instead of using only WES data (WES calling) (Table 2). Specifically, a total of 718,239 variants were identified in all samples with the integrated approach (Table 1), while 620,959 variants were identified in all samples with only WES calling (Table 2). Two types of variant matches were considered for evaluation (GT match and AL match) and each type of match includes different variants. Because of this, the number of variants obtained from WES + RNA calling was evaluated depending on the type of match and compared with the total variants obtained from WES calling.
Table 2
Overview of the total variants identified by sample using the integrated approach and using only WES data.
Sample
|
Analysis
|
Variants
|
Total
|
SNPs
|
Indels
|
NA12878
|
WES + RNA calling
|
300,597
|
259,503
|
41,094
|
WES calling
|
259,692
|
224,840
|
34,852
|
HG00171
|
WES + RNA calling
|
142,760
|
122,761
|
19,999
|
WES calling
|
130,068
|
113,298
|
16,770
|
HG00378
|
WES + RNA calling
|
150,370
|
129,610
|
20,760
|
WES calling
|
133,523
|
116,342
|
17,181
|
NA20509
|
WES + RNA calling
|
124,512
|
110,355
|
14,157
|
WES calling
|
97676
|
89030
|
8646
|
The evaluation of the genotype information extracted from DNA data included two types of variants: strong-evidence and DNA-only variants. These two types of variants were enough to exceed the total number of variants obtained with WES calling in all samples except NA20509 (Table 3). The evaluation of genotype information from RNA data included strong-evidence, RNA-only and RNA-rescue variants. The number of variants detected for genotype matching from RNA data was much lower than those obtained from DNA data when using both WES and RNA-seq data (Table 3).
Table 3
Overview of the number of variants identified by sample using the integrated approach and using only WES data.
Sample
|
Analysis
|
Data type
|
Match
|
Variants
|
Total
|
SNPs
|
Indels
|
NA12878
|
WES + RNA calling
|
DNA
|
GT
|
260535
|
224987
|
35548
|
AL
|
266317
|
229854
|
36463
|
RNA
|
GT
|
47288
|
42359
|
4929
|
AL
|
68881
|
60859
|
8022
|
DNA + RNA
|
AL
|
287333
|
248646
|
38687
|
WES calling
|
DNA
|
-
|
259692
|
224840
|
34852
|
HG00171
|
WES + RNA calling
|
DNA
|
GT
|
131453
|
113844
|
17609
|
AL
|
134158
|
116030
|
18128
|
RNA
|
GT
|
13295
|
11425
|
1870
|
AL
|
19922
|
17403
|
2519
|
DNA + RNA
|
AL
|
140257
|
120943
|
19314
|
WES calling
|
DNA
|
-
|
130068
|
113298
|
16770
|
HG00378
|
WES + RNA calling
|
DNA
|
GT
|
134219
|
116216
|
18003
|
AL
|
138657
|
120023
|
18634
|
RNA
|
GT
|
19761
|
17454
|
2307
|
AL
|
27187
|
24154
|
3033
|
DNA + RNA
|
AL
|
147339
|
127262
|
20077
|
WES calling
|
DNA
|
-
|
133523
|
116342
|
17181
|
NA20509
|
WES + RNA calling
|
DNA
|
GT
|
96406
|
87806
|
8600
|
AL
|
101784
|
92342
|
9442
|
RNA
|
GT
|
32702
|
28109
|
4593
|
AL
|
42550
|
37076
|
5474
|
DNA + RNA
|
AL
|
119228
|
106337
|
12891
|
WES calling
|
DNA
|
-
|
97676
|
89030
|
8646
|
Regarding the assessment of allele matches, all variant types except RNA-editing variants were considered. Here, the number of variants detected with our integrated strategy (WES + RNA calling) from DNA data was higher than with WES calling and even higher when the allelic information of both DNA and RNA was considered (Table 3). Specifically, a total of 640,916 variants were identified in all samples with the integrated strategy for the allele match from DNA data (266,317 in NA12878, 134,158 in HG00171, 138,657 in HG00378 and 101,784 in NA20509) and a total of 694,157 variants when considering both DNA and RNA data (287,333 in NA12878, 140,257 in HG00171, 147,339 in HG00378 and 119,228 in NA20509), while a total of 259,692 variants were identified in the allele match with only WES calling (259,692 in NA12878, 130,068 in HG00171, 133,523 in HG00378 and 97,676 in NA20509) (Table 3).
Evaluation of recall, precision and F-score in the identification of variants was performed with respect to the genotype information (GT match) and the allele information (AL match). There was an improvement in SNP genotyping from DNA and RNA data in most samples (HG00171, HG00378 and NA20509), but a decrease in the precision and F-score when indels genotyping from DNA (NA12878, HG00171 and HG00378) and RNA (NA12878, HG00171, HG00378 and NA20509) (Fig. 1). Concerning the SNPs genotyping from DNA data, the recall was the same with the integrated approach or with WES calling for all samples: NA12878 (99.6%), HG00171 (99.3%), HG00378 (99.3% and) and NA20509 (99.1%) (Fig. 1). Precision was the same in NA12878 (96.5%) (Fig. 1 – A) but higher in WES + RNA calling than in WES calling for HG00171 (PrecisionWES+RNA−calling = 88.5% and PrecisionWES−calling = 88.4%), HG00378 (PrecisionWES+RNA−calling = 87.9% and PrecisionWES−calling = 87.7%) and NA20509 (PrecisionWES+RNA−calling = 87.6% and PrecisionWES−calling = 87.2%) (Fig. 1 – B, C, D). F-score was higher in WES + RNA calling than in WES calling for HG00171 (F-scoreWES+RNA−calling = 93.6% and F-scoreWES−calling = 93.5%), HG00378 (F-scoreWES+RNA−calling = 93.3% and F-scoreWES−calling = 93.1%) and HG00378 (F-scoreWES+RNA−calling = 93% and F-scoreWES−calling = 92.7%) (Fig. 1 – B, C, D), but lower for NA12878 (F-scoreWES+RNA−calling = 98.0% and F-scoreWES−calling = 98.1%) (Fig. 1 – A). Regarding the genotyping of indels from DNA data, recall, precision and F-score were higher with WES + RNA calling than with WES calling for NA20509 (RecallWES+RNA−calling = 92%, RecallWES−calling = 91.9%, PrecisionWES+RNA−calling = 66.4%, PrecisionWES−calling = 66.1%, F-scoreWES+RNA−calling = 77.1% and F-scoreWES−calling = 76.9%) (Fig. 1 – D), while for the rest of samples, they were lower (with the exception of the recall of NA12878 which is the same) (Fig. 1). Concerning the SNPs genotyping from RNA data, there was an improvement in recall, precision and F-score for HG00171 (F-scoreWES+RNA−calling = 96.3% and F-scoreWES−calling = 93.5%), HG00378 (F-scoreWES+RNA−calling = 96.8% and F-scoreWES−calling = 93.1%) and NA20509 (F-scoreWES+RNA−calling = 97% and F-scoreWES−calling = 92.7%) with the exception of the recall for NA20509 which was the same (99.1%) (Fig. 1 – B, C, D). However, all metrics for NA12878 were lower with WES + RNA calling than with WES calling (RecallWES+RNA−calling = 99.1%, RecallWES−calling = 99.6%, PrecisionWES+RNA−calling = 96.1%, PrecisionWES−calling = 96.5%, F-scoreWES+RNA−calling = 97.1% and F-scoreWES−calling = 98.1%) (Fig. 1 – A). Regarding the genotype of indels from RNA data, there was a decrease in precision and F-score for all samples (Fig. 1) and an increase in recall for NA12878 (RecallWES+RNA−calling = 97.4% and RecallWES−calling = 95.6%), HG00171 (RecallWES+RNA−calling = 92.8% and RecallWES−calling = 89.6%) and HG00378 (RecallWES+RNA−calling = 92.7% and RecallWES−calling = 88.9%) (Fig. 1 – A, B, C), but there was also a decrease in recall for NA20509 (RecallWES+RNA−calling = 90.9% and RecallWES−calling = 91.9%)(Fig. 1 – D).
Overall, the precision and F-score in the allele identification of SNPs and indels decreased when the integrated approach was applied instead of performing WES calling (Fig. 2). There was a decrease in the precision and F-score of allele identification of SNPs and indels from DNA data in all samples: NA12878 (F-scoreWES+RNA−calling = 97.6% and F-scoreWES−calling = 98.4%), HG00171 (F-scoreWES+RNA−calling = 92.6% and F-scoreWES−calling = 93.2%), HG00378 (F-scoreWES+RNA−calling = 92.3% and F-scoreWES−calling = 92.8%) and NA20509 (F-scoreWES+RNA−calling = 92.3% and F-scoreWES−calling = 92.8%) (Fig. 2). Concerning the alleles obtained from RNA data, there was a decrease in the precision and F-score to identify SNPs and indels in all samples, but they increased for the identification of SNPs in HG00378 (F-scoreWES+RNA−calling = 93.8% and F-scoreWES−calling = 92.8%) and NA20509 (F-scoreWES+RNA−calling = 95.8% and F-scoreWES−calling = 92.8%) (Fig. 2). Regarding the alleles obtained jointly considering DNA and RNA data, the recall, precision and F-score in the identification of SNPs in NA20509 increased (F-scoreWES+RNA−calling = 92.9% and F-scoreWES−calling = 92.8%) (Fig. 2 – D), however, the precision and F-score decreased in the detection of SNPs in the rest of samples: NA12878 (F-scoreWES+RNA−calling = 97.6% and F-scoreWES−calling = 98.4%), HG00171 (F-scoreWES+RNA−calling = 92.8% and F-scoreWES−calling = 93.2%) and HG00378 (F-scoreWES+RNA−calling = 92.5% and F-scoreWES−calling = 92.8%) (Fig. 2 – A, B, C). In the identification of alleles in indels there was a decrease in the precision and F-score in all samples when considering jointly DNA and RNA data (Fig. 2).