Effect of trimming procedure on mapped mtDNA sequencing data
In NGS data analysis, trimming procedure is commonly used to remove the adaptor sequences and bases with low quality from sequencing reads. Therefore, we first evaluated the effect of trimming procedure on mtDNA mapping of sequencing reads in three types of samples, including FFPE tissues, fresh tissues and plasma. As shown in Fig.1A and B, the trimming procedure had no significant influence on both the mtDNA mapping rate of sequencing reads and average sequencing depth in FFPE and fresh tissue samples. In contrast, the mtDNA mapping rate and average sequencing depth were significantly increased in plasma samples after trimming (Fig. 1A and B). Furthermore, our data indicated the very consistent size distribution of mtDNA insert fragments in FFPE and fresh tissue samples (Fig. 1C). However, the mapped mtDNA sequencing reads were significantly increased in plasma samples after trimming (Fig. 1C). These results indicate that trimming procedure is essential for improving the mtDNA mappability in plasma but not tissue samples.
Evaluation of mtDNA mapping strategy in mtDNA mutation calling
Next, we evaluated the application of two commonly used mapping strategies in three different sample types. As shown in Fig. 2A, when compared with the first mapping strategy (rCRS-alone), the second mapping strategy (combined rCRS-hg19) clearly removed the sequencing reads mapped to both hg19 and rCRS, leading to 11 more remarkable peaks. Among them, the most significant peak occurred around mtDNA site 8500, with 45% - 80% removed reads in three different sample types. Compared to tissue samples, plasma DNA exhibited a clearly different distribution pattern of removed reads. Then, the mtDNA mutations detected by the two mapping strategies were depicted in Fig. 2B, with mutation site and heteroplasmy level (MAF≥1%). The venn diagram showed that all mtDNA mutations detected by the second mapping strategy were included in those detected by the first one, while 1, 5 and 1256 of mtDNA mutations only detected by mapping to rCRS were observed in FFPE tissue, fresh tissue and plasma samples, respectively (Fig. 2C). The site and heteroplasmy level of mtDNA mutations only detected by mapping to rCRS were shown in Fig. 2D. To explore the potential reason explaining extremely high number of mtDNA mutations only detected by mapping to rCRS in plasma samples, we further analyzed the mtDNA copy number and percentage of mtDNA sequencing reads mapped to both hg19 and rCRS in three sample types. Our results showed that the mtDNA copy number in plasma was significantly lower than tissues (Fig. S1), and the percentage of mtDNA sequencing reads mapped to both hg19 and rCRS was significantly higher in plasma (17.18%) compared to FFPE and fresh tissues (3.46% and 4.03%, respectively) (Fig. S2). To investigate the NUMTs-derived possibility of those mutations, we determined the percentage of mutant reads mapped to rCRS only or both hg19 and rCRS. As shown in Fig. 2E, 93.33%, 92.86% and 91.91% of the mutant reads were mapped to both hg19 and rCRS in three sample types, respectively, while 6.67%, 7.14% and 9.09% were only mapped to rCRS. Further analysis showed that 82.57%, 91.50% and 92.72% of the mutant reads had higher alignment scores when mapped to hg19 than rCRS in three sample types, respectively (Fig. 2F). These data suggest that these mutations may be introduced by NUMTs. Collectively, when mtDNA mutations are detected in tissue samples, both the two mapping strategies can be used, although rare mutations may be introduced by NUMTs. In comparison, the second mapping strategy is strongly suggested for mtDNA mutation detection in plasma samples due to the extreme abundance of NUMTs.
Effect of mismatch number selection on mtDNA mutation calling
Then, we investigated the effect of mismatch number selection in the second mapping strategy on mtDNA mutation calling by using the repeated sequencing data in three sample types. As shown in Fig. 3A, the total repeated mtDNA mutation numbers in two repeat experiments of 10 samples gradually increased when the mismatch number changed from 1 to 4 in three sample types. We then found that the average number of repeatable mutations in the group with 3 or 4 mismatches was significantly higher than that in the group with 1 mismatch in all three sample types. The average number of repeatable mutations varied from 35.5-42.6, 32.3-36.6, and 37.7-49.2 under different mismatches in three sample types, respectively. Furthermore, the assumed false positive (AFP) mutations were defined as those only detected in one mismatch group. The assumed false negative (AFN) mutations were defined as those not detected in this group but detected in at least two other mismatch groups. We found no significant difference of AFP mutation number among the groups with 1, 2, or 3 mismatches in three sample types, while it was significantly increased in the group with 4 mismatches when compared to the groups with 1, 2, or 3 mismatches (Fig. 3B). In contrast, the AFN mutation number in groups with 1 or 2 mismatches were significantly higher than that in the groups with 3 or 4 mismatches among three sample types (Fig. 3C). Collectively, these results demonstrate the impacts to some extent of mismatch number selection on mtDNA mutation detection, and strongly suggests the setting of 3 mismatches for mtDNA mutation calling.
Relationship between the site sequencing depth and consistency of mtDNA mutations at different heteroplasmy levels.
Next, we comprehensively investigated the effect of mtDNA site sequencing depth on the detection accuracy of mutations at low heteroplasmy levels by using the repeated sequencing data in three sample types. As shown in Fig. 4A, a very consistent trend was observed in all three sample types, indicating a faster rise of the consistency when the site sequencing depth was relatively low. To achieve the consistency of 90%, the site sequencing depth of greater than 2700X, 2300X, and 3200X was needed in FFPE tissues, fresh tissues and plasma, respectively, when the heteroplasmy level was set above 0.5%. In comparison, the consistency of mtDNA mutations was notably decreased in all three sample types when the heteroplasmy level was set at 0.1%, suggesting that the system used in this study may not be suitable for mtDNA mutation detection at the heteroplasmy level of 0.1%. Furthermore, the negative logarithmic relationship was established between mtDNA site sequencing depth and minimum detectable mutation frequency when the consistency was set at 90% (Fig. 4B). When the minimum detectable mutation frequency was 1% and 0.5%, the site sequencing depth was required to be higher than 1000X and 4000X, 1200X and 3600X, 1700X and 4700X in FFPE tissue, fresh tissue and plasma, respectively.
Commonly, the average sequencing depth was calculated and provided based on mtDNA sequencing data. Considering the accuracy of mtDNA mutation calling was actually affected by site sequencing depth, we thus determined the relationship between them. As shown in Fig. 4C, with the increasing average sequencing depth, the percentage of mtDNA sites with site depth greater than average sequencing depth were gradually increased. When the average sequencing depth was greater than 8000X, there were 80% of sites with site depth higher than the average.
Assessment of different mtDNA mutation filtering strategies
To reduce false positive mutations, several filtering strategies has commonly been applied (18). First, we assessed the effect of two filtering strategies previously reported on mtDNA mutation calling. One filtering strategy (filter 1) is to remove C > A/G > T mutations with low MAF (1%) and strong sequence context bias (at CpCpN>CpApN; most frequently at CpCpG>CpApG), which is known to arise from artificial guanine oxidation during sequencing library preparation steps (18, 19). As shown in Fig. 5A, no difference of mtDNA mutation numbers was observed between two groups filtered with and without filter 1 in FFPE tissue, fresh tissue and plasma. Furthermore, second filtering strategy (filter 2) is to remove mtDNA mutations where the quality of mutant bases does not fit well with binominal distribution (P > 0.0001) (20, 21). Those may be false positive mutations introduced by sequencing errors. Very similarly with filter 1, filter 2 exhibited no significant effect on mtDNA mutation numbers in all three sample types (Fig. 5B). Considering the great influence of site sequencing depth on mtDNA mutation detection, we built up a novel filtering strategy (filter 3) based on negative logarithmic functions presented in Fig. 4B to remove mtDNA mutations with heteroplasmy level lower than minimum detectable mutation frequency. As shown in Fig 5C, the number of mtDNA mutations was significantly decreased after filtering in all three sample types. Moreover, the repeated mtDNA sequencing data were used to analyze whether those filtered mutations were repeatable or not. We found that 91.84%, 89.4%, and 93.3% of these filtered mutations were unrepeatable in FFPE tissue, fresh tissue and plasma, respectively. In comparison, only 9.4%, 5.7% and 15% of the unaffected mtDNA mutations was unrepeatable (Fig. S3). These findings indicate that both filter 1 and filter 2 are not necessary in our analysis pipeline, whereas filter 3 greatly contribute to improve the detection accuracy.
Comparison between PCR-based and capture-based enrichment strategies or two sequencing platforms
To reduce sequencing cost, both capture-based and PCR-based strategies are commonly used for mtDNA enrichment from total genomic DNA. Therefore, we systematically compared the performance of two enrichment strategies in NGS-based mtDNA mutation detection of plasma samples. We found that capture-based mtDNA sequencing exhibited a more uniform distribution of the coverage across the whole mitochondrial genome and the coefficient of variation (CV) significantly decreased when compared to PCR-based approach (Fig. 6A). Then, six plasma DNA samples were sequenced three times, twice by capture-based approach and one time by PCR-based approach. The number of mtDNA mutation sites with more than 1% heteroplasmy level was depicted in Fig. 6B. Moreover, we found that the consistency of the mtDNA mutations between two capture-based sequencing data was significantly higher than those between each capture-based sequencing data and PCR-based sequencing data (Fig. 6C). With the increasing of mtDNA site sequencing depth, the mtDNA mutation consistency between capture-based and PCR-based sequencing data was gradually elevated (Fig. 6D). To get more accurate mtDNA mutations, our results suggest higher sequencing depth for PCR-based sequencing approach when compared with capture-based sequencing.
Additionally, to test the robust application of our innovative bioinformatics pipeline, we evaluated the consistence of mtDNA mutation profiling between Illumina and BGI sequencing platforms, which were most widely used for next-generation sequencing. The mtDNA sequencing data of 10 fresh tissue samples from two platforms were analyzed. As shown in Fig. 6E, all samples exhibited the good consistency of mtDNA mutation numbers. We further compared the heteroplasmy level of mtDNA mutations. Our results revealed a remarkable correlation between two sequencing platforms (r = 0.9974, P < 0.01), indicating the widespread applicability of our innovative data analysis pipeline.