QoALa: a comprehensive workflow for viral quasispecies diversity comparison using long-read sequencing data

doi:10.21203/rs.3.rs-4637890/v1

Method Article

This work is licensed under a CC BY 4.0 License

The concept of viral quasispecies refers to a constantly mutating viral population occurring within hosts, which is essential for grasping the micro-evolutionary patterns of viruses. Despite its high error rate, long-read sequencing holds potential for advancing viral quasispecies research by resolving coverage limitations in next-generation sequencing. We introduce a refined workflow, QoALa, implemented in the longreadvqs R package. This workflow begins with nucleotide position-wise noise minimization of read alignments and sample size standardization, and extends to viral quasispecies comparison across related samples. Raw read samples from five studies of different viruses (HCV, HBV, HIV, SARS-CoV-2, and IAV), sequenced by major long-read platforms, were used to evaluate these approaches. The comparative results provide novel insights into intra- and inter-host diversity dynamics in various scenarios and unveil rare haplotypes not reported in the original study, underscoring the versatility and practicality of our methodology.

The emergence of SARS-CoV-2 in 2019, alongside historical pandemics, emphasizes RNA viruses, with their characteristically high evolutionary rates, as the most concerning group of pathogens. Their remarkable adaptability stands as a primary factor behind their widespread success¹. Their rapid mutation stems from a lack of proofreading during genome replication², which serves as a mechanism fostering genetic variability. This variability manifests as a cloud of closely related genetic variants of the virus within the viral population infecting a single individual, which is known as viral quasispecies³. Studies elucidating quasispecies dynamics in relation to virus adaptation have significantly enhanced our understanding of RNA virus microevolution, virus-host interactions, and the ability of viruses to adapt to evade host immunity^4–9.

The viral quasispecies concept has been referred to in nearly one thousand virology studies, notably for hepatitis C and B viruses, and HIV¹⁰, where some viral variants correlate with antiviral drug resistance^11–14. The advent of deep sequencing technology, alongside improvements in de novo viral genome assembly techniques, has enhanced our ability to detect minority variants within the viral cloud that may hold clinical significance^15–18. While next-generation sequencing (NGS) has been widely employed for viral quasispecies analysis^19,20, one of the persisting challenges is the reconstruction of continuous viral quasispecies haplotypes—sets of identical genomic sequences. This challenge arises due to the short length of NGS reads (typically 100–400 bases), which often do not cover the entire targeted gene or genome, necessitating short-read assembly techniques that are often unable to discern between haplotypes that are closely related^17,21–24.

Long-read sequencing technology provides a solution to the coverage limitations of NGS technology in viral quasispecies research²¹, yielding read lengths typically exceeding 1,000 bp. Few analytical workflows or software exist that are tailored for viral quasispecies exploration using long-read data, and these primarily focus on single nucleotide variant (SNV) calling, haplotype reconstruction, or quasispecies profiling on an individual sample basis^25–33. However, these packages do not fully address the unique challenges or harness the full potential of long-read sequencing for viral quasispecies analysis. One challenge is the higher error rate of long reads (10–30%) compared to NGS (~1%)³⁴, which affects haplotype reconstruction accuracy. Efforts in post-sequencing error correction have spurred the development of numerous tools for improving sequencing reads generated by Oxford Nanopore Technologies (ONT) or Pacific Biosciences (PacBio)^34–38. However, even with lower error rates, the rapid mutation rate of RNA viruses means that long reads often differ from one another by only one or two bases, so that each is called as a unique haplotype. This ultimately results in a large number of haplotypes that each occur only once within the data, and obscures our ability to discern the structure of the viral quasispecies.

Sequencing read depth and length also significantly influence viral quasispecies diversity measures^21,39,40. Certain diversity metrics, such as Shannon entropy and mutation frequency, are dependent on sample size or read depth^40,41. As yet, it is uncertain how such diversity metrics are influenced by sampling depth, particularly when comparing diversity between samples with different depths or down-sampled to the same depth, or even what minimum depth is needed to attain a robust measure of diversity. In addition, the longer the gene or genome length used for quantifying quasispecies, the greater the number of unique haplotypes or singletons due to the increased detection of mutations along its length. This variability may introduce bias, particularly when comparing viral quasispecies profiles across longitudinal or related samples.

In response to these challenges, we developed longreadvqs, an R package designed for viral quasispecies comparison using long-read data that can be applied to any RNA virus. This package creates a customizable workflow for long-read noise-minimization and down-sampling as well as for grouping related haplotypes into operational taxonomic units (OTUs). In addition, this package can analyze multiple samples together, identifying common haplotypes or OTUs that recur in different samples. This feature is particularly useful when analyzing longitudinal samples collected from a single individual or multiple individuals that may be epidemiologically linked. Our suggested analytical steps — QoALa: Quasi-species Optimized & Adaptive Long-read Analysis — performed by the key functions of the package, were tested with publicly available long-read data sequenced by both ONT and PacBio technologies from the most studied viruses in the field of viral quasispecies.

QoALa workflow and example datasets

The goals of the QoALa workflow are to standardize and comprehensively compare viral quasispecies and OTU profiles across multiple samples based on user customization. For each comparison, equal length read alignments from the same gene or genomic region of interest (Fig. 1a) are separately imported from any genome assembly pipeline’s output (see examples in Methods) and noise-minimized using position-wise nucleotide base replacement in the “vqsassess” function (Fig. 1b). The noise-minimization step replaces potential erroneous nucleotides, defined as SNVs with a frequency lower than n% of the total read depth (cut-off percentage), with either the majority base or the dominant haplotype’s base at that position. This cut-off percentage can be determined using prior information such as the sequencing error rate, or by observing the change in the percentage of singleton haplotypes at different percentage cut-offs (see Methods and Fig. 1a). Additionally, when comparing multiple samples, all alignments must be randomly down-sampled by the “vqsassess” function to achieve equivalent read depth (the shallowest depth among all samples is recommended) to reduce bias in diversity measurement caused by sample size disparity. SNV profiles of the samples visualized by the “snvcompare” function should be inspected to evaluate the result of noise-minimization (Fig. 1c). Finally, prepared samples are pooled by the “vqscompare” function to (1) identify common haplotypes, (2) reclassify haplotypes into OTUs, (3) visualize diversity profiles and genetic relationships among samples (Fig. 1d), and (4) summarize quantitative diversity metrics.
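The position-wise replacement described above can be sketched in a few lines. This is a minimal Python illustration of the idea only, not the package's R implementation: the function name `minimize_noise`, its arguments, and the toy reads are hypothetical, and ties for the majority base are resolved arbitrarily.

```python
from collections import Counter


def minimize_noise(reads, cutoff_pct):
    """Replace bases rarer than cutoff_pct at their position with the majority base.

    `reads` is a list of equal-length aligned sequences (hypothetical input format).
    """
    depth = len(reads)
    threshold = cutoff_pct / 100 * depth  # minimum count a base needs to be kept
    cleaned_columns = []
    for column in zip(*reads):  # iterate over alignment positions
        counts = Counter(column)
        majority = counts.most_common(1)[0][0]
        cleaned_columns.append(
            [base if counts[base] >= threshold else majority for base in column]
        )
    # Transpose the cleaned columns back into reads.
    return ["".join(bases) for bases in zip(*cleaned_columns)]


# Ten toy reads: eight identical, two carrying one rare base each.
reads = ["ACGT"] * 8 + ["ACGA", "TCGT"]
cleaned = minimize_noise(reads, cutoff_pct=15)
```

With a 15% cut-off, both rare bases (each at 10% frequency) are smoothed to the majority base and no read is discarded; with a 5% cut-off they would be kept.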

We evaluated this workflow using read alignments assembled from publicly available raw reads of the five most studied viruses in the viral quasispecies field, namely HCV, HBV, HIV, SARS-CoV-2, and IAV. These datasets varied not only by sequencing technologies (PacBio or ONT) and sampling scenarios (short- or long-term longitudinal or cross-sectional), but also by the depth of coverage per gene of interest, ranging from 130 reads in the IAV M gene segment sample (Supplementary Table 6) to 854,147 reads in the SARS-CoV-2 ORF3a gene sample (Supplementary Table 5). In addition, the level of sequencing error, estimated from the percentage of singleton haplotypes before noise-minimization, also varied. This value ranged from approximately 25% in the SARS-CoV-2 ORF3a gene sample to 100% in the HCV and HIV samples (Supplementary Fig. 1). This considerable variation highlights the necessity for customized long-read data preparation on a case-by-case basis using our package, particularly focusing on two key parameters: the cut-off percentage and sample size (Methods).

Effect of down-sampling on diversity metrics

While read down-sampling is necessary for standardizing sample size before comparison, it may alter viral quasispecies diversity measures, especially at very low sample sizes. To assess this impact, we sub-sampled reads with replacement at sizes ranging from 10,000 down to 25 reads, depending on the original depth of each read alignment sample, and computed nine diversity metrics for each sub-sample (Methods). This process was repeated 100 times per sub-sample size. This assessment was conducted twice: once for alignments sampled after noise-minimization, and once for alignments sampled before noise-minimization.

As anticipated, diversity metrics directly reliant on read depth [number of haplotypes (H) and mutation frequency at the molecular level (Mfm)], indirectly reliant on read depth [Shannon entropy (H_S), mutation frequency at the entity level (Mfe), functional attribute diversity (FAD), and nucleotide diversity at the entity level (π_e)], or normalized by read depth [normalized Shannon entropy (H_SN)] showed varying degrees of fluctuation corresponding to sample size (Supplementary Figs. 2–11). Down-sampling post noise-minimization from high-depth alignments (> 10,000 reads) minimally affected H_S and Mfm at sample sizes over 1,000 reads (Supplementary Figs. 2, 4, and 8), while sampling from low-depth alignments (< 1,000 reads) resulted in abrupt changes in these metrics at any sample size (Supplementary Figs. 6 and 10). The Gini-Simpson index (H_GS) and nucleotide diversity (π_m) remained relatively stable across all sample sizes, irrespective of the original alignment depth (Fig. 2). Metrics computed from alignments sampled after noise-minimization were generally less variable than those computed from alignments sampled before noise-minimization (Supplementary Figs. 3, 5, 7, 9, and 11).

Regarding inter-sample comparisons, variations in metrics due to sample size and sampling strategy marginally affected comparisons in scenarios where the actual diversity measures of samples were either highly similar (Fig. 2 and Supplementary Figs. 6–11) or markedly different (Fig. 2 and Supplementary Figs. 2–5). However, down-sampling to very low sample sizes (< 100 reads) or down-sampling before noise-minimization likely resulted in false equivalences or differences because of the high variation in computed diversity metrics (Fig. 2 and Supplementary Figs. 2–11). Thus, our analysis suggests that diversity metrics from samples with read depths of < 100 should be interpreted with caution.

Each dot indicates the mean of one target; boxes span the 25th to 75th percentiles, lines mark medians, and whiskers extend from minimum to maximum values.

Example scenarios of diversity comparison

Since a single mutation can differentiate one haplotype from another, and such mutations may introduce noise and obscure the true frequency of different major genetic variants within the quasispecies, solely using haplotype-based diversity metrics to summarize the viral quasispecies profile might hinder our ability to observe the continuous dynamics of the virus across samples. To illustrate this, we compared related viral samples in various scenarios using both traditional haplotyping based on a strict mutational profile and a novel OTU assignment based on genetic distance (Methods). In the latter, haplotypes are clustered by relatedness into larger operational taxonomic units (OTUs). This allows the user to better comprehend the structure of the quasispecies, i.e., the evolving genetic diversity surrounding several main variants within the viral cloud.

The studies on HIV and SARS-CoV-2 served as examples of short-term longitudinal sampling scenarios. The env genes of four HIV samples, taken within a few hours of each other from T cells infected in vitro, exhibited numerous haplotypes, among which a common dominant haplotype was not found (Fig. 3a). When OTU clustering was applied to the same samples, the members and proportions of all four samples were relatively similar (Fig. 3b), with the 186 haplotypes being optimally clustered into 10 OTUs (Fig. 3c/d). In five SARS-CoV-2 S gene samples taken within a nine-day period from a single patient, the dominant haplotype and OTU occupied over half of the virus population in each sample (Fig. 3a/b), with minor haplotypes appearing as outliers (Fig. 3c/d). It is also worth noting that the sporadic appearance of many minority variants in the day 13 and 17 SARS-CoV-2 samples (Fig. 3a/b) was likely due to noise arising from incomparable read depths in the original alignments, which were over ten times lower than those of the other three samples (Supplementary Table 5).

As a model for long-term longitudinal sample comparison, studies on HCV and HBV were utilized. Analysis of the NS4 coding region of HCV revealed that viral haplotypes and their proportions underwent sudden changes after treatment. Interestingly, certain major haplotypes detected pre-treatment reappeared 18 months post-treatment (Fig. 3a). Concurrently, OTU clustering provided better insights into the dynamics of HCV across three sampling points by grouping genetically closely related minor haplotypes into common clusters observed at varying proportions across all time points (Fig. 3b/c/d). The within-host virus population dynamics were also clearly illustrated by HBV's S gene. Both haplotype and OTU classifications underscored the gradual turnover of predominant haplotypes from timepoint one to four over a 30-month period, and a completely different population emerged at timepoint five, occurring over 80 months after the initial sampling (Fig. 3a/b).

Lastly, in the scenario of the outbreak represented by IAV's M gene segment samples, viruses were collected cross-sectionally among patients in two wards (B and C) within the same hospital. At the haplotype level, only two mostly identical sets of virus populations were found in more than one patient (B1, B2 and C1, C3), without other prominent links between cases or wards (Fig. 3a). However, upon regrouping them into new OTUs, we discovered that all patients in ward B were partly infected with the same genetically similar sub-population, while all patients in ward C were infected with a similar set of OTUs. Furthermore, the predominant OTU found in two ward B patients (B1, B2) was also present as a minority group in all ward C patients, suggesting potential inter-ward transmission of IAV (Fig. 3b/c/d).

The comparative profiles of viral quasispecies, whether analyzed at the haplotype or OTU levels, based on other genes from the same viral samples (including HBV’s P gene, HIV’s gag gene, and SARS-CoV-2’s ORF3a gene), exhibited visual similarities to the chosen genes depicted in Fig. 3 (Supplementary Figs. 12 to 14). This suggests the absence of linkage disequilibrium between the observed genes of these three viruses, indicating that they may adequately represent the population dynamics at the whole-genome level. Interestingly, one comparable result between the original and our studies is the number and frequency of haplotypes found in the day 11 and day 15 SARS-CoV-2 samples. We captured a similar quasispecies diversity in the S gene samples and discovered two SNV positions in the ORF3a gene that created two rare haplotypes not reported in the original study (Supplementary Table 7 and Ko et al., 2021⁴⁶).

The longreadvqs package was developed as a tool for both quantitative and qualitative analysis of viral quasispecies diversity, addressing the challenges posed by varied error rates and read depths inherent in long-read sequencing technologies as well as noise introduced by rare SNVs in the alignment. During the testing phase, we carefully considered these variations across diverse sampling scenarios and multiple viral species, ensuring the tool’s general applicability. The strengths of this tool lie in its ability to customize parameter settings based on prior information and in providing comprehensive visualizations that elucidate the dynamics of virus microevolution across multiple samples.

The analytical findings derived from the QoALa workflow presented in this study provide perspectives and insights not explored in the original research. For instance, we utilized raw read data to reconstruct within-host virus haplotypes and OTUs, emphasizing an overview of population structure changes over time or among outbreak cases. In contrast, the original studies focused on specific aspects, such as mutation spectra⁴², deletions⁴³, gene splicing⁴⁴, or phylogenetic analysis of consensus sequences⁴⁵. To build read alignments, we employed a uniform genome assembly workflow for all raw read samples, which was not tailored or reproduced from the original protocols created for each dataset. Furthermore, in the comparison step, we fixed the sample sizes and the number of k-means clusters for all scenarios. Hence, discrepancies in details between the original findings and ours may have arisen, which we acknowledge as a limitation on our part. Nevertheless, it is essential to note that the objective of our study is not to deliver new findings from each dataset but rather to demonstrate the usage and versatility of this package.

Previous studies have established a robust foundation in measuring viral quasispecies diversity and developing analytical workflows for NGS data over the past decade^20,40,41,47. However, much of this research cannot be directly applied to long-read sequencing data. Long-read quasispecies analyses are hindered by the significantly higher level of noise present in long-read data, driven by the higher likelihood that longer reads will have at least one SNV across their length. For NGS short-read data, it is recommended to employ techniques such as rarefaction, resampling, and fringe trimming based on haplotype frequency to estimate appropriate sample sizes and minimize bias from unbalanced samples before comparing diversity metrics⁴¹. The misleadingly large proportion of singleton haplotypes (Supplementary Fig. 1), compounded by the high sequencing error rate of current long-read technologies, prevents us from following such approaches.

Instead of excluding low-frequency haplotypes as fringe trimming does, we opt to retain all reads and replace low-frequency SNVs with the majority base at their respective positions. This approach offers two primary benefits. Firstly, low-frequency haplotypes, comprising both true mutations and errors, are not eliminated from the alignment; rather, potentially erroneous SNVs within the haplotype are smoothed to the mode. This advantage is particularly crucial when comparing shallow depth read alignments, where removing reads belonging to low-frequency haplotypes would severely reduce sample size. Secondly, the false haplotype diversity resulting from long-read errors is mitigated by consolidating previously low-frequency haplotypes into larger haplotype groups (OTUs) that share common SNVs, which are more likely to represent true mutations and provide more clarity on the structured diversity existing within quasispecies.

Despite these advantages, many factors need to be considered in selecting the cut-off percentage for position-wise noise-minimization. Factors such as gene mutation rate, sequencing error rate, and the trade-off between retaining versus unifying rare haplotypes are critical. Especially when dealing with a lengthy alignment where singleton haplotypes are more common, one may have to decide whether to prioritize retaining most mutation details, thus making it difficult to quantify overall population complexity, or to focus on the broader picture of viral cloud diversity while potentially discarding some details that may or may not be significant.

Standardizing unbalanced sample sizes by down-sampling can introduce bias in diversity comparisons, especially when integrating noise-minimization. Since sampling techniques used for NGS data⁴¹ are not appropriate for long-read data, we redesigned sampling strategies by integrating the concept of rarefaction and repeat sampling, taking into account the order of sampling before or after noise-minimization to simulate possible options for analysis. Here, we demonstrate the sensitivity of diversity metrics across our example scenarios. Based on this analysis, down-sampling after noise-minimization is our recommended technique. However, repeated down-sampling to observe the distribution of metrics at different sample sizes, at least between the largest and smallest samples in the dataset, is encouraged to better understand the sensitivities of diversity metrics to sampling depth in a particular project. Ultimately, we suggest interpreting quantitative metrics comparisons alongside qualitative profiles that uncover shared viral variants between related samples, either at the haplotype or OTU levels, to better illuminate the dynamics of viral quasispecies, as made possible by the QoALa workflow implemented in the longreadvqs package.

Sequencing reads data

To evaluate our analytic approaches, we gathered raw sequencing reads from five projects focusing on whole genome sequencing (WGS) of five different viruses: hepatitis C virus (HCV), hepatitis B virus (HBV), human immunodeficiency virus (HIV), severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and influenza A virus (IAV), which are among the most studied species in the field of viral quasispecies. These reads were generated using either PacBio Single Molecule, Real-Time (SMRT) sequencing or Oxford Nanopore Technologies (ONT) and were sourced from the National Center for Biotechnology Information’s Sequence Read Archive (NCBI SRA). Specifically, we selected projects that obtained either longitudinal viral samples or cross-sectional samples from a single outbreak, where the genetic relationships were suitable for comparison at the quasispecies level.

PacBio read samples were obtained from three projects. First, three HCV samples from a chronic hepatitis patient were collected before and after treatment over a span of 18 months⁴². Second, five samples were collected at various time points within a period of 100 months from an untreated HBV-infected patient⁴³. Third, five longitudinal SARS-CoV-2 samples were collected between 8 and 17 days after clinical onset from a single patient⁴⁶. Data from the remaining two projects were generated using ONT, which included four samples from HIV-infected T cells collected up to 24 hours post-infection⁴⁴, and seven samples from different patients infected by IAV during a nosocomial outbreak in the same hospital⁴⁵. The NCBI accession numbers of all 24 raw read samples and five projects are listed in Supplementary Table 1.

Genome assembly and read alignment generation

We conducted comparable genome assembly steps for both PacBio and ONT raw reads. Initially, we employed the default settings of HiFiAdapterFilt v3.0.1⁴⁸ and Porechop v0.2.4⁴⁹ to trim PacBio and ONT adapter sequences, respectively. Subsequently, poor-quality PacBio reads—those shorter than 1,000 bp or constituting the worst 10% of reads based on the final score—were filtered out using Filtlong v0.2.1⁵⁰. Similarly, ONT reads with an average quality score < 7 were removed using NanoFilt v2.8.0⁵¹.

We then aligned the trimmed and filtered reads against the NCBI reference genomic sequence of each viral species (Supplementary Table 1) using minimap2 v2.26’s presets⁵² tailored for each sequencing technology ("map-hifi" for PacBio and "map-ont" for ONT). Subsequently, to prepare the resulting read alignments for downstream analyses, SAM files from read mapping were converted to FASTA format and fragmented into genes or coding regions based on the genomic annotation of each reference sequence, utilizing SeqKit v2.7.0⁵³. We ensured that the reads retained in the alignment covered the full length of the particular gene or region (Fig. 1a). The depth and length of the final alignments are documented in Supplementary Tables 2 to 6.

Sequencing read noise-minimization

A main function of the longreadvqs package, “vqsassess,” aims to minimize noise from sequencing errors, artifacts, or rare mutations by performing position-wise nucleotide base replacement. We hypothesize that any SNV in a read alignment whose base frequency at its position is less than the specified cut-off percentage is likely an artifact that should be replaced with the majority base of that position (Fig. 1b). The cut-off percentage can be determined based on prior knowledge, such as the estimated sequencing error rate of the technology used or documented mutation or evolutionary rates of the studied virus. However, the selection of the cut-off for noise-minimization can be further guided by the “pctopt” function, which reports the percentage of singleton haplotypes (haplotypes with only a single read member) in the alignment at different cut-offs. The percentage of singleton haplotypes decreases as the cut-off percentage increases, as low-frequency SNVs are replaced, creating more groups of identical reads (Fig. 1a). The normal range for the percentage of singleton haplotypes found in previous studies using more accurate technologies like Sanger or NGS sequencing can serve as a reference for this step. Alternatively, the plot showing changes in the percentage of singleton haplotypes can be used to identify the cut-off value where the percentage of singleton haplotypes ceases decreasing, indicating a plateau (Fig. 1a).
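To make the cut-off selection concrete, the following Python sketch computes the percentage of singleton haplotypes over a range of cut-offs on a toy alignment. It is an illustration under stated assumptions rather than the “pctopt” implementation: the helper names are hypothetical, and the percentage is taken here relative to the number of distinct haplotypes, which is one possible reading of the definition above.

```python
from collections import Counter


def majority_smooth(reads, cutoff_pct):
    """Replace bases below cutoff_pct at their position with the majority base."""
    depth = len(reads)
    smoothed = []
    for column in zip(*reads):
        counts = Counter(column)
        majority = counts.most_common(1)[0][0]
        smoothed.append(
            [b if counts[b] * 100 >= cutoff_pct * depth else majority for b in column]
        )
    return ["".join(row) for row in zip(*smoothed)]


def singleton_pct(reads):
    """Percentage of distinct haplotypes supported by exactly one read (assumed denominator)."""
    haplotypes = Counter(reads)
    singletons = sum(1 for n in haplotypes.values() if n == 1)
    return 100 * singletons / len(haplotypes)


# Toy alignment: two genuine variants (5 and 3 reads) plus two reads with one rare base each.
reads = ["ACGT"] * 5 + ["ACTT"] * 3 + ["ACGA", "TCTT"]
curve = {c: singleton_pct(majority_smooth(reads, c)) for c in (0, 5, 10, 15, 20)}
```

In this toy case the singleton percentage stays at 50% up to a 10% cut-off and drops to 0% from 15% onward, while the minor variant carried by several reads survives; the point where the curve stops falling is the kind of plateau described above.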

An alternative option for noise-minimization involves replacing potentially erroneous low-frequency SNV bases with the base of the dominant haplotype. However, this method is not recommended, since the base used for replacement could itself be an error, and the “dominant” haplotype before noise-minimization may be relatively rare or non-existent (100% singleton haplotypes).

In some cases, a single cut-off percentage cannot be generalized for the entire gene or region of interest because the amount of noise may vary throughout the sequence length. For example, low ONT base calling accuracy is often reported in homopolymer regions (continuous identical bases)^54,55, and soft clipping (unaligned regions) may persist at the ends of reads after mapping in both technologies^56,57. The “snvcompare” function, which visualizes SNV distribution across sequence length between different samples (Fig. 1c), can help identify noise-rich regions indicated by the overaccumulation of SNV positions. Subsequently, the “vqscustompct” function can be applied to readjust the cut-off percentage for noise-minimization at specific region(s) (Fig. 1b).

To validate the noise-minimization workflow, we extracted read alignments from specific genes or regions of five exemplary viral species. These alignments were chosen based on their sequence length, with each exceeding or closely approximating 1,000 bases, and their depth of coverage surpassing 100 reads, while ensuring that the soft-clipped region, if present, comprised less than 25% of the total length. Utilizing the "pctopt" function, we determined the optimal cut-off percentage for minimizing errors, aiming to achieve nearly 0% singleton haplotypes, or identifying the point where the percentage of singleton haplotypes in the median sample ceased decreasing, indicating a plateau phase (Fig. 1a). The selected alignments (with their respective cut-off percentages) comprised the NS4 coding region (15%) of HCV, the P (5%) and S (5%) genes of HBV, the env (15%) and gag (22%) genes of HIV, the S (1%) and ORF3a (1%) genes of SARS-CoV-2, and the M gene segment (10%) of IAV (Supplementary Fig. 1).

Down-sampling and diversity metric sensitivity

For the demonstration of our package usage, we specified basic settings for the "vqsassess" function. After the noise-minimization step described above, we then down-sampled to either the size of the shallowest depth sample or to 1,000 reads (for samples with a depth over 10,000 reads), for every sample. These steps were taken to prepare the samples for the final quasispecies comparison within the same viral species or project.
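The depth standardization described above can be sketched as follows. This is a hypothetical Python illustration, not the “vqsassess” code: the function name, the `cap` argument, and the choice to draw reads without replacement are all assumptions made for the example.

```python
import random


def downsample_to_common_depth(samples, cap=None, seed=0):
    """Randomly down-sample every sample to the shallowest depth (optionally capped).

    `samples` maps a sample name to its list of aligned reads (hypothetical format).
    Reads are drawn without replacement, which is an assumption of this sketch.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    target = min(len(reads) for reads in samples.values())
    if cap is not None:
        target = min(target, cap)
    return {name: rng.sample(reads, target) for name, reads in samples.items()}


# Three toy samples of unequal depth; read labels stand in for sequences.
samples = {
    "t1": [f"t1_read{i}" for i in range(50)],
    "t2": [f"t2_read{i}" for i in range(20)],
    "t3": [f"t3_read{i}" for i in range(35)],
}
balanced = downsample_to_common_depth(samples)
capped = downsample_to_common_depth(samples, cap=10)
```

Here every sample ends up with 20 reads, the depth of the shallowest sample; passing `cap` mimics the choice of a fixed size such as 1,000 reads for very deep alignments.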

However, it's important to note that read depth or sample size significantly influences some quantitative metrics used for quasispecies diversity measures^40,41. Aggressive down-sampling may result in information loss and misinterpretation. To evaluate the impact of down-sampling on between-sample diversity comparison, we conducted a sensitivity analysis of nine diversity metrics as fully described by Gregori et al.⁴⁰, including the number of haplotypes (H), the Shannon entropy (H_S), the normalized Shannon entropy (H_SN), the Gini-Simpson index (H_GS), the functional attribute diversity (FAD), the mutation frequency at the entity level (Mfe), the nucleotide diversity at the entity level (π_e), the mutation frequency at the molecular level (Mfm), and the nucleotide diversity (π_m).
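As a concrete reference for three of the incidence-based metrics named above, the sketch below computes the number of haplotypes, Shannon entropy, and the Gini-Simpson index from read counts. It uses the textbook plug-in formulas, H_S = -Σ p_i ln p_i and H_GS = 1 - Σ p_i², as an assumption; the estimators applied by QSutils may include corrections that this sketch omits, and the function name is hypothetical.

```python
import math
from collections import Counter


def incidence_metrics(reads):
    """Plug-in estimates of H, H_S and H_GS from haplotype frequencies."""
    counts = Counter(reads)  # identical reads form one haplotype
    n = len(reads)
    freqs = [c / n for c in counts.values()]
    shannon = sum(-p * math.log(p) for p in freqs)
    gini_simpson = 1 - sum(p * p for p in freqs)
    return {"H": len(counts), "H_S": shannon, "H_GS": gini_simpson}


# Two equally frequent haplotypes versus one dominant haplotype with a rare variant.
even = incidence_metrics(["ACGT"] * 5 + ["ACTT"] * 5)
skewed = incidence_metrics(["ACGT"] * 9 + ["ACTT"])
```

Both toy samples contain two haplotypes, yet the evenly split sample has the higher entropy and Gini-Simpson index, which is why the number of haplotypes alone is a poor summary of quasispecies structure.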

We utilized two to four samples of one gene of interest per species for the analysis. Using the QSutils v1.18.0⁵⁸ embedded in our package, all nine metrics were computed from both unsampled and down-sampled read alignments. For samples with a depth over 10,000 reads (HCV, HBV, and SARS-CoV-2), down-sample sizes started from 10,000 reads and were gradually halved until the final size reached 78 reads. For samples with low depth (HIV and IAV), sample sizes were set at 300, 150, 100, 50, and 25 reads. Random down-sampling with replacement was repeated 100 times for each sample size after noise-minimization. The same repeated sampling strategy was also implemented before noise-minimization. Distributions of each metric were visually compared between different sample sizes and approaches (down-sampling after versus before noise-minimization).
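The resampling scheme above can be sketched in Python as a halving schedule plus repeated draws with replacement. The function names and the toy alignment are hypothetical, and the number of haplotypes stands in for any of the nine metrics.

```python
import random


def halving_sizes(start, stop):
    """Sample sizes obtained by repeatedly halving `start` (integer division) down to `stop`."""
    sizes = []
    size = start
    while size >= stop:
        sizes.append(size)
        size //= 2
    return sizes


def n_haplotypes(reads):
    """Number of distinct haplotypes among the reads."""
    return len(set(reads))


def resample_metric(reads, size, metric, repeats=100, seed=0):
    """Distribution of `metric` over repeated random draws of `size` reads with replacement."""
    rng = random.Random(seed)
    return [metric(rng.choices(reads, k=size)) for _ in range(repeats)]


sizes = halving_sizes(10_000, 78)

# Toy noise-minimized alignment: one dominant haplotype and two minor ones.
reads = ["ACGT"] * 900 + ["ACTT"] * 90 + ["TCGT"] * 10
distributions = {size: resample_metric(reads, size, n_haplotypes) for size in (1000, 100, 25)}
```

Integer halving from 10,000 gives 10,000, 5,000, 2,500, 1,250, 625, 312, 156, and 78 reads, matching the final size quoted above. Comparing the spread of `distributions` across sizes is the kind of visual check described in the text.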

Viral quasispecies comparison

Once the within-species samples were noise-minimized and standardized to equal depths, they were ready for comparison using the "vqscompare" function (Fig. 1d). This function aggregates the prepared read alignments of the listed samples, initially identifying shared haplotypes between them, and then visualizes the proportion of unique haplotypes for each sample as a color-coded bar plot. In addition, it reclassifies haplotypes into new operational taxonomic units (OTUs) based on the genetic distance matrix, computed with the "dist.dna" function in ape v5.7.1⁵⁹ from the SNV alignment extracted from the pooled read alignment. In detail, the distance matrix is transformed into dissimilarity coordinates by classical multidimensional scaling, and these coordinates are subsequently clustered into new OTUs by k-means clustering, using the "cmdscale" and "kmeans" functions in stats v4.3.1⁶⁰, respectively. The number of OTUs or clusters of genetically closely related haplotypes must be customized by specifying the number of clusters (k). The proportions of OTUs are also depicted in a color-coded bar plot. The clustering scheme of OTUs, along with major haplotypes within them, is illustrated with corresponding colors in multidimensional scaling (MDS) plots (Fig. 1d). All resulting plots are generated using ggplot2 v3.4.4⁶¹.
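The distance, scaling, and clustering steps can be illustrated with a dependency-free Python sketch. It is a stand-in under several assumptions, not the package's R code: distances are plain proportions of differing sites rather than a specific "dist.dna" model, the eigenvectors for classical MDS are found by power iteration (which assumes the leading eigenvalues are positive), k-means is Lloyd's algorithm with a deterministic farthest-point start rather than R's default, and all function names are hypothetical.

```python
import math
import random


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def classical_mds(dist, dims=2, iters=300, seed=0):
    """Double-centre the squared distances, then keep the leading eigenvectors."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        vec = [rng.uniform(-1, 1) for _ in range(n)]
        for _ in range(iters):  # power iteration towards the dominant eigenvector
            nxt = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        length = math.sqrt(sum(x * x for x in vec))
        vec = [x / length for x in vec]
        eigval = sum(vec[i] * b[i][j] * vec[j] for i in range(n) for j in range(n))
        if eigval <= 1e-12:
            break  # nothing positive left to extract; remaining axes stay at zero
        for i in range(n):
            coords[i][axis] = vec[i] * math.sqrt(eigval)
        # Deflate so the next pass converges on the next eigenvector.
        b = [[b[i][j] - eigval * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return coords


def kmeans(points, k, iters=100):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    def d2(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(d2(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: d2(p, centers[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def assign_otus(haplotypes, k):
    """Cluster haplotypes into k OTUs from their pairwise distances."""
    dist = [[p_distance(a, b) for b in haplotypes] for a in haplotypes]
    return kmeans(classical_mds(dist), k)


# Two toy lineages, each with a major haplotype and two single-base variants.
haplotypes = ["AAAAAAAAAAAA", "AAAAAAAAAAAT", "AAAAAAAAAATA",
              "CCCCCCCCAAAA", "CCCCCCCCAAAT", "CCCCCCCCAATA"]
otus = assign_otus(haplotypes, k=2)
```

In this toy example the single-base variants are grouped with their parent haplotype, so six haplotypes collapse into two OTUs. The cluster labels themselves are arbitrary; only the grouping is meaningful.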

Apart from the comparative plots, the "vqscompare" function offers several other valuable outputs for in-depth investigation. These include noise-minimized down-sampled read and SNV alignments for all samples, along with classified haplotype and OTU identifications. Additionally, the nine quasispecies diversity metrics for each sample, computed using QSutils v1.18.0⁵⁸, are tabulated. A second set of metrics is calculated from the consensus reads and read frequencies of each OTU, rather than of each haplotype, to simplify diversity quantification at a larger scale.
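The difference between haplotype-level and OTU-level metrics can be shown with a small Python sketch. It assumes Shannon entropy (natural logarithm) as a representative abundance-based index. The read counts and the haplotype-to-OTU mapping are hypothetical, and the function names are not part of longreadvqs or QSutils.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (natural log) of a collection of read counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def collapse_to_otus(hap_counts, hap_to_otu):
    """Sum the read counts of all haplotypes assigned to the same OTU."""
    otu_counts = Counter()
    for hap, n in hap_counts.items():
        otu_counts[hap_to_otu[hap]] += n
    return otu_counts


# Hypothetical sample: three haplotypes, two of which belong to one OTU.
hap_counts = {"hap1": 50, "hap2": 25, "hap3": 25}
hap_to_otu = {"hap1": "OTU1", "hap2": "OTU2", "hap3": "OTU2"}

otu_counts = collapse_to_otus(hap_counts, hap_to_otu)
entropy_haplotype = shannon_entropy(hap_counts.values())
entropy_otu = shannon_entropy(otu_counts.values())
```

Merging haplotypes into OTUs can only keep or lower this index, so the OTU-level value summarizes diversity at a coarser resolution than the haplotype-level value.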

To demonstrate these comparative features on our example datasets, noise-minimized down-sampled alignments of each selected gene and viral species, prepared by the "vqsassess" function, were listed as input for individual "vqscompare" analyses. The number of clusters (k) for OTU classification via k-means clustering was set to 10 for every run. The resulting summary plot was used to illustrate the dynamics of viral quasispecies diversity across different example scenarios, ranging from short-term (hours to days in HIV and SARS-CoV-2 datasets) to long-term (months to years in HCV and HBV datasets) longitudinal samples, as well as samples from the same outbreak cohort (IAV dataset).

Contributions

N.P. and K.V. conceived, initiated, and designed the study. N.P. developed the longreadvqs package, collected raw read data, implemented the QoALa workflow, conducted all analyses, and drafted the manuscript. K.V. and D.C.S. supervised the development of the package and workflow. All authors contributed to the final version of the paper.

Acknowledgement

We thank J. Gregori and M. Guerrero-Murillo from VHIR Vall d’Hebron Research Institute for their support in troubleshooting the dependencies of our package. We thank T.G.S. Williams from Guy's and St Thomas' NHS Foundation Trust for providing additional information on the IAV study. We also thank D. Makau and J. Baker for testing the package.

Data Availability

All raw read sequencing data used as example datasets in this study are publicly available at the National Center for Biotechnology Information’s Sequence Read Archive (NCBI SRA) and were described in previous publications. Accession numbers and corresponding publications are tabulated in Supplementary Table 1.

Code availability

All code used in this study is available as the R package longreadvqs, accessible via the CRAN repository at https://cran.r-project.org/web/packages/longreadvqs/index.html

References

Carrasco-Hernandez R, Jácome R, López Vidal Y, Ponce de León S. Are RNA Viruses Candidate Agents for the Next Global Pandemic? A Review. ILAR J. 2017;58:343–58.
Steinhauer DA, Domingo E, Holland JJ. Lack of evidence for proofreading mechanisms associated with an RNA virus polymerase. Gene. 1992;122:281–8.
Domingo E, Sheldon J, Perales C. Viral Quasispecies Evolution. Microbiol Mol Biol Rev. 2012;76:159–216.
Domingo E. Quasispecies Structure and Persistence of RNA Viruses. Emerg Infect Dis. 1998;4:521–7.
Mandary M, Poh. Impact of RNA Virus Evolution on Quasispecies Formation and Virulence. IJMS. 2019;20:4657.
Vignuzzi M, Stone JK, Arnold JJ, Cameron CE, Andino R. Quasispecies diversity determines pathogenesis through cooperative interactions in a viral population. Nature. 2006;439:344–8.
Woo H-J, Reifman J. A quantitative quasispecies theory-based model of virus escape mutation under immune selection. Proc Natl Acad Sci USA. 2012;109:12980–5.
Domingo E, García-Crespo C, Perales C. Historical Perspective on the Discovery of the Quasispecies Concept. Annu Rev Virol. 2021;8:51–72.
Lauring AS. Within-Host Viral Diversity: A Window into Viral Evolution. Annu Rev Virol. 2020;7:63–81.
PubMed. PubMed https://pubmed.ncbi.nlm.nih.gov/.
Metzner K. The significance of minority drug-resistant quasispecies. In: Geretti AM, editor. Antiretroviral Resistance in Clinical Practice. London: Mediscript; 2006.
Monaco DC, Zapata L, Hunter E, Salomon H, Dilernia DA. Resistance profile of HIV-1 quasispecies in patients under treatment failure using single molecule, real-time sequencing. AIDS. 2020;34:2201.
Perales C. Quasispecies dynamics and clinical significance of hepatitis C virus (HCV) antiviral resistance. Int J Antimicrob Agents. 2020;56:105562.
Kai Y, et al. Baseline quasispecies selection and novel mutations contribute to emerging resistance-associated substitutions in hepatitis C virus after direct-acting antiviral treatment. Sci Rep. 2017;7:41660.
Margeridon-Thermet S, et al. Ultra‐Deep Pyrosequencing of Hepatitis B Virus Quasispecies from Nucleoside and Nucleotide Reverse‐Transcriptase Inhibitor (NRTI)–Treated Patients and NRTI‐Naive Patients. J INFECT DIS. 2009;199:1275–85.
Rozera G, et al. Massively parallel pyrosequencing highlights minority variants in the HIV-1 env quasispecies deriving from lymphomonocyte sub-populations. Retrovirology. 2009;6:15.
Baaijens JA, Aabidine AZE, Rivals E, Schönhuth A. De novo assembly of viral quasispecies using overlap graphs. Genome Res. 2017;27:835–48.
Fritz A, et al. Haploflow: strain-resolved de novo assembly of viral genomes. Genome Biol. 2021;22:212.
Houldcroft CJ, Beale MA, Breuer J. Clinical and biological insights from viral genome sequencing. Nat Rev Microbiol. 2017;15:183–92.
Lu I-N, Muller CP, He FQ. Applying next-generation sequencing to unravel the mutational landscape in viral quasispecies. Virus Res. 2020;283:197963.
Posada-Cespedes S, Seifert D, Beerenwinkel N. Recent advances in inferring viral diversity from high-throughput sequencing data. Virus Res. 2017;239:17–32.
Huang A, Kantor R, DeLong A, Schreier L, Istrail S. QColors: An algorithm for conservative viral quasispecies reconstruction from short and non-contiguous next generation sequencing reads. In Silico Biol. 2012;11:193–201.
Hong LZ, et al. BAsE-Seq: a method for obtaining long viral haplotypes from short sequence reads. Genome Biol. 2014;15:517.
Mardis ER. DNA sequencing technologies: 2006–2016. Nat Protoc. 2017;12:213–8.
Dilernia DA, et al. Multiplexed highly-accurate DNA sequencing of closely-related HIV-1 variants using continuous long reads from single molecule, real-time sequencing. Nucleic Acids Res. 2015;43:e129–129.
Huang DW. Towards Better Precision Medicine: PacBio Single-Molecule Long Reads Resolve the Interpretation of HIV Drug Resistant Mutation Profiles at Explicit Quasispecies (Haplotype) Level. J Data Min Genomics Proteom. 2016;7.
Dudouet P, et al. SARS-CoV-2 quasi-species analysis from patients with persistent nasopharyngeal shedding. Sci Rep. 2022;12:18721.
Artyomenko A, et al. Long Single-Molecule Reads Can Resolve the Complexity of the Influenza Virus Composed of Rare, Closely Related Mutant Variants. J Comput Biol. 2017;24:558–70.
Jiao X, et al. QuasiSeq: profiling viral quasispecies via self-tuning spectral clustering with PacBio long sequencing reads. Bioinformatics. 2022;38:3192–9.
Link RW, et al. HIV-Quasipore: A Suite of HIV-1-Specific Nanopore Basecallers Designed to Enhance Viral Quasispecies Detection. Front Virol. 2022;2:858375.
Luo X, Kang X, Schönhuth A. Strainline: full-length de novo viral haplotype reconstruction from noisy long reads. Genome Biol. 2022;23:29.
Ng TT-L, et al. Long-Read Sequencing with Hierarchical Clustering for Antiretroviral Resistance Profiling of Mixed Human Immunodeficiency Virus Quasispecies. Clin Chem. 2023;69:1174–85.
Su J, Li S, Zheng Z, Lam T-W, Luo R. ClusterV-Web: a user-friendly tool for profiling HIV quasispecies and generating drug resistance reports from nanopore long-read data. Bioinf Adv. 2024;4:vbae006.
Morisse P, Marchet C, Limasset A, Lecroq T, Lefebvre A. Scalable long read self-correction and assembly polishing with multiple sequence alignment. Sci Rep. 2021;11:761.
Sahlin K, Medvedev P. Error correction enables use of Oxford Nanopore technology for reference-free transcriptome analysis. Nat Commun. 2021;12:2.
Wang L, Qu L, Yang L, Wang Y, Zhu H. NanoReviser: An Error-Correction Tool for Nanopore Sequencing Based on a Deep Learning Algorithm. Front Genet. 2020. 10.3389/fgene.2020.00900.
Xiao C-L, et al. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nat Methods. 2017;14:1072–4.
Salmela L, Walve R, Rivals E, Ukkonen E. Accurate self-correction of errors in long reads using de Bruijn graphs. Bioinformatics. 2017;33:799–806.
Zagordi O, Däumer M, Beisel C, Beerenwinkel N. Read length versus Depth of Coverage for Viral Quasispecies Reconstruction. PLoS ONE. 2012;7:e47046.
Gregori J, et al. Viral quasispecies complexity measures. Virology. 2016;493:227–37.
Gregori J, et al. Inference with viral quasispecies diversity indices: clonal and NGS approaches. Bioinformatics. 2014;30:1104–11.
Nakamura F, et al. Mutational spectrum of hepatitis C virus in patients with chronic hepatitis C determined by single molecule real-time sequencing. Sci Rep. 2022;12:7083.
Arasawa S, et al. Evolutional transition of HBV genome during the persistent infection determined by single-molecule real-time sequencing. Hepatol Commun. 2023;7:e0047–0047.
Nguyen Quang N, et al. Dynamic nanopore long-read sequencing analysis of HIV-1 splicing events during the early steps of infection. Retrovirology. 2020;17:25.
Williams TGS, et al. Feasibility and clinical utility of local rapid Nanopore influenza A virus whole genome sequencing for integrated outbreak management, genotypic resistance detection and timely surveillance. Microb Genomics. 2023;9.
Ko SH, et al. High-throughput, single-copy sequencing reveals SARS-CoV-2 spike variants coincident with mounting humoral immunity during acute COVID-19. PLoS Pathog. 2021;17:e1009431.
Knyazev S, Hughes L, Skums P, Zelikovsky A. Epidemiological data analysis of viral quasispecies in the next-generation sequencing era. Brief Bioinform. 2021;22:96–108.
Sim SB, Corpuz RL, Simmonds TJ, Geib SM. HiFiAdapterFilt, a memory efficient read processing pipeline, prevents occurrence of adapter sequence in PacBio HiFi reads and their negative impacts on genome assembly. BMC Genomics. 2022;23:157.
Wick RR, Judd LM, Gorrie CL, Holt KE. Completing bacterial genome assemblies with multiplex MinION sequencing. Microb Genomics. 2017;3.
Wick R. rrwick/Filtlong. (2024).
De Coster W, D’Hert S, Schultz DT, Cruts M, Van Broeckhoven C. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics. 2018;34:2666–9.
Li H. New strategies to improve minimap2 alignment accuracy. Bioinformatics. 2021;37:4572–4.
Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS ONE. 2016;11:e0163962.
Sarkozy P, Jobbágy Á, Antal P. Calling Homopolymer Stretches from Raw Nanopore Reads by Analyzing k-mer Dwell Times. In: Eskola H, Väisänen O, Viik J, Hyttinen J, editors. EMBEC & NBC 2017. vol. 65. Singapore: Springer Singapore; 2018. p. 241–4.
Huang Y-T, Liu P-Y, Shih P-W. Homopolish: a method for the removal of systematic errors in nanopore sequencing by homologous polishing. Genome Biol. 2021;22:95.
Delahaye C, Nicolas J. Sequencing DNA with nanopores: Troubles and biases. PLoS ONE. 2021;16:e0257521.
Zhang S-J, et al. Isoform Evolution in Primates through Independent Combination of Alternative RNA Processing Events. Mol Biol Evol. 2017;34:2453–68.
Guerrero-Murillo M. QSutils. https://doi.org/10.18129/B9.BIOC.QSUTILS (2018).
Paradis E, Schliep K. Ape 5.0: An environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 2019;35:526–8.
R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing; 2019.
Ginestet C. ggplot2: Elegant Graphics for Data Analysis. J R Stat Soc Ser A (Statistics in Society). 2011;174.

No competing interests reported.

Editor assigned by journal
28 Jun, 2024
Submission checks completed at journal
26 Jun, 2024
First submitted to journal
25 Jun, 2024
