Sequencing reads data
To evaluate our analytic approaches, we gathered raw sequencing reads from five projects focusing on whole genome sequence (WGS) of five different viruses: hepatitis C virus (HCV), hepatitis B virus (HBV), human immunodeficiency virus (HIV), severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and influenza A virus (IAV), which are among the most studied species in the field of viral quasispecies. These reads were generated using either PacBio Single Molecule, Real-Time (SMRT) sequencing or Oxford Nanopore Technologies (ONT) and were sourced from the National Center for Biotechnology Information’s Sequence Read Archive (NCBI SRA). Specifically, we selected projects that obtained either longitudinal viral samples or cross-sectional samples from a single outbreak, where the genetic relationships were suitable for comparison at the quasispecies level.
PacBio read samples were obtained from three projects. First, three HCV samples from a chronic hepatitis patient, were collected before and after treatment over a span of 18 months42. Second, five samples were collected at various time points within a period of 100 months from an untreated HBV-infected patient43. Third, five longitudinal SARS-CoV-2 samples were collected between 8 to 17 days after clinical onset from a single patient46. Data from the remaining two projects were generated using ONT, which included four samples from HIV-infected T cells collected up to 24 hours post-infection44, and seven samples from different patients infected by IAV during a nosocomial outbreak in the same hospital45. The NCBI accession numbers of a total of 24 raw read samples and five projects were listed in Supplementary Table 1.
Genome assembly and read alignment generation
We conducted comparable genome assembly steps for both PacBio and ONT raw reads. Initially, we employed the default settings of HiFiAdapterFilt v3.0.148 and Porechop v0.2.449 to trim PacBio and ONT adapter sequences, respectively. Subsequently, poor-quality PacBio reads—those shorter than 1,000 bp or constituting the worst 10% of reads based on the final score—were filtered out using Filtlong v0.2.150. Similarly, ONT reads with an average quality score < 7 were removed using NanoFilt v2.8.051.
We then aligned the trimmed and filtered reads against the NCBI reference genomic sequence of each viral species (Supplementary Table 1) using minimap2 v2.26’s presets52 tailored for each sequencing technology ("map-hifi" for PacBio and "map-ont" for ONT). Subsequently, to prepare the resulting read alignments for downstream analyses, SAM files from read mapping were converted to FASTA format and fragmented into genes or coding regions based on the genomic annotation of each reference sequence, utilizing SeqKit v2.7.053. It was ensured that the reads retained in the alignment covered the full length of the particular gene or region (Fig. 1a). The depth and length of the final alignments were documented in Supplementary Tables 2 to 6.
Sequencing read noise-minimization
A main function of the longreadvqs package, “vqsassess,” aims to minimize noise from sequencing errors, artifacts, or rare mutations by performing position-wise nucleotide base replacement. We hypothesize that any SNV in a read alignment with a base frequency at each position less than the specified cut-off percentage is likely an artifact that should be replaced with the majority base of that position (Fig. 1b). The cut-off percentage can be determined based on prior knowledge, such as the estimated sequencing error rate of the technology used or documented mutation or evolutionary rates of the studied virus. However, the selection of the cut-off for noise-minimization can be further guided by the “pctopt” function, which demonstrates the percentage of singleton haplotypes (haplotypes with only a single read member) in the alignment. The percentage of singleton haplotypes decreases as the cut-off percentage increases, as low-frequency SNVs are replaced, creating more groups of identical reads (Fig. 1a). The normal range for the percentage of singleton haplotypes found in previous studies using more accurate technologies like Sanger or NGS sequencing can serve as a reference for this step. Alternatively, the plot showing changes in the percentage of singleton haplotypes can be used to identify the cut-off value where the percentage of singleton haplotypes ceases decreasing, indicating a plateau (Fig. 1a).
An alternative option for noise-minimization involves replacing potential erroneous low-frequency SNV bases with the base of the dominant haplotype. However, this method is not recommended, since either the base used for replacement could also be an error, or the “dominant” haplotype before noise-minimization may be relatively rare or non-existent (100% singleton haplotypes).
In some cases, a single cut-off percentage cannot be generalized for the entire gene or region of interest because the amount of noise may vary throughout the sequence length. For example, low ONT base calling accuracy is often reported in homopolymer regions (continuous identical bases)54,55 or soft clipping (unaligned regions) may persist at the ends of reads after mapping in both technologies56,57. The “snvcompare” function, which visualizes SNV distribution across sequence length between different samples (Fig. 1c), can help identify noise-rich regions indicated by the overaccumulation of SNV positions. Subsequently, the “vqscustompct” function can be applied to readjust the cutoff percentage for noise-minimization at specific region(s) (Fig. 1b).
To validate the noise-minimization workflow, we extracted read alignments from specific genes or regions of five exemplary viral species. These alignments were chosen based on their sequence length, with each exceeding or closely approximating 1,000 bases, and their depth of coverage surpassing 100 reads, while ensuring that the soft-clipped region, if present, comprised less than 25% of the total length. Utilizing the "pctopt" function, we determined the optimal cut-off percentage for minimizing errors, aiming to achieve nearly 0% singleton haplotypes, or identifying the point where the percentage of singleton haplotypes in the median sample ceased decreasing, indicating a plateau phase (Fig. 1a). The selected alignments (with their respective cut-off percentages) comprised the NS4 coding region (15%) of HCV, the P (5%) and S (5%) genes of HBV, the env (15%) and gag (22%) genes of HIV, the S (1%) and ORF3a (1%) genes of SARS-CoV-2, and the M gene segment (10%) of IAV (Supplementary Fig. 1).
Down-sampling and diversity metric sensitivity
For the demonstration of our package usage, we specified basic settings for the "vqsassess" function. After the noise-minimization step described above, we then down-sampled to either the size of the shallowest depth sample or to 1,000 reads (for samples with a depth over 10,000 reads), for every sample. These steps were taken to prepare the samples for the final quasispecies comparison within the same viral species or project.
However, it's important to note that read depth or sample size significantly influences some quantitative metrics used for quasispecies diversity measures40,41. Aggressive down-sampling may result in information loss and misinterpretation. To evaluate the impact of down-sampling on between-sample diversity comparison, we conducted a sensitivity analysis of nine diversity metrics as fully described by Gregori et al.40, including the number of haplotypes (H), the Shannon entropy (HS), the normalized Shannon entropy (HSN), the Gini-Simpson index (HGS), the functional attribute diversity (FAD), the mutation frequency at the entity level (Mfe), the nucleotide diversity at the entity level (πe), the mutation frequency at the molecular level (Mfm), and the nucleotide diversity (πm).
We utilized two to four samples of one gene of interest per species for the analysis. Using the QSutils v1.18.058 embedded in our package, all nine metrics were computed from both unsampled and down-sampled read alignments. For samples with a depth over 10,000 reads (HCV, HBV, and SARS-CoV-2), down-sample sizes started from 10,000 reads and were gradually halved until the final size reached 78 reads. For samples with low depth (HIV and IAV), sample sizes were set at 300, 150, 100, 50, and 25 reads. Random down-sampling with replacement was repeated 100 times for each sample size after noise-minimization. The same repeated sampling strategy was also implemented before noise-minimization. Distributions of each metric were visually compared between different sample sizes and approaches (down-sampling after versus before noise-minimization).
Viral quasispecies comparison
Once the within-species samples were noise-minimized and standardized to equal depths, they were ready for comparison using the "vqscompare" function (Fig. 1d). This function aggregates the prepared read alignments of the listed samples, initially identifying shared haplotypes between them, and then visualizes the proportion of unique haplotypes for each sample as a color-coded bar plot. In addition, it reclassifies haplotypes into new operational taxonomic units (OTUs) based on the genetic distance matrix, utilizing the "dist.dna" function in ape v5.7.159, from the SNV alignment extracted from the pooled read alignment. In detail, the distance matrix is transformed into dissimilarity coordinates, which are subsequently clustered into new OTUs using classical multidimensional scaling and k-means clustering, respectively, through the "cmdscale" and "kmeans" functions in stats v4.3.160. The number of OTUs or clusters of genetically closely related haplotypes must be customized by specifying the number of clusters (k). The proportions of OTUs are also depicted in a color-coded bar plot. The clustering scheme of OTUs, along with major haplotypes within them, is illustrated with corresponding colors in multidimensional scale (MDS) plots (Fig. 1d). All resulting plots are generated using ggplot2 v3.4.461.
Apart from the comparative plots, the "vqscompare" function offers several other valuable outputs for in-depth investigation. These include noise-minimized down-sampled read and SNV alignments for all samples, along with classified haplotype and OTU identifications. Additionally, the nine quasispecies diversity metrics for each sample, computed using QSutils v1.18.058, are tabulated. Another set of metrics are calculated based on consensus reads and read frequencies of each OTU, rather than haplotypes, to simplify the diversity quantification in a larger scale.
To exemplify such comparative features for our examples, noise-minimized down-sampled alignments of each selected gene and viral species, prepared by the "vqsassess" function, are listed as input for individual "vqscompare" analyses. The number of clusters (k) for OTU classification via k-means clustering was set to 10 for every run. The resulting summary plot was used to illustrate the dynamics of viral quasispecies diversity across different example scenarios, ranging from short-term (hours to days in HIV and SARS-CoV-2 datasets) to long-term (months to years in HCV and HBV datasets) longitudinal samples, as well as samples from the same outbreak cohort (IAV dataset).