Though hypermutated proviruses persist in all people living with HIV (PLWH) (Bruner et al. 2016; Ho et al. 2013; Kinloch et al. 2023), we know relatively little about their within-host origins because they cannot be readily incorporated into phylogenies. We explored three simple approaches to remove hypermutation from nucleotide alignments, with the dual goals of 1) inferring phylogenies that accurately reconstruct the within-host evolutionary histories of hypermutated sequences and 2) applying molecular dating approaches to these trees to gain insights into the within-host origins and longevity of hypermutated proviruses.
Of the approaches we evaluated, stripping nucleotide positions containing putative APOBEC3 mutations from the alignment, or replacing individual APOBEC3 mutations in hypermutated sequences with R, consistently normalized tree topologies and metrics. By contrast, replacing APOBEC3 mutations in hypermutated sequences with G failed to consistently resolve their erroneous clustering in the tree. We speculate that this is because G replacement is an overcorrection, as not all A bases at target sites are necessarily the result of APOBEC3 activity (the HIV genome is naturally rich in A bases; Kypr and Mrazek 1987; Kypr et al. 1989). Across-the-board G replacement therefore likely obscures some legitimate ancestral information (i.e., inherited A bases), leaving these sequences at continued risk of long-branch attraction. Replacing putative APOBEC3 mutations with R instead mitigates this risk by acknowledging this ambiguity. We therefore advise against replacement of APOBEC3 mutations in hypermutated sequences with G.
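To make the three alignment modifications concrete, the following is a minimal Python sketch rather than the pipeline used in this study. It assumes an ungapped toy alignment and a simplified rule for calling putative APOBEC3 mutations (a reference G that appears as A in the query, followed in the query by R and then D); the function names `apobec3_sites`, `replace_sites` and `modify_alignment` are hypothetical.

```python
def apobec3_sites(reference, sequence):
    """Return 0-based positions of putative APOBEC3 (G-to-A) mutations.

    Assumption: a site counts if the reference has G, the query has A, and
    the next two query bases are R (A/G) then D (A/G/T). Gaps are ignored.
    """
    sites = []
    for i in range(len(reference) - 2):
        g_to_a = reference[i] == "G" and sequence[i] == "A"
        in_context = sequence[i + 1] in "AG" and sequence[i + 2] in "AGT"
        if g_to_a and in_context:
            sites.append(i)
    return sites


def replace_sites(sequence, sites, char):
    """Overwrite the listed positions of one sequence with `char`."""
    bases = list(sequence)
    for i in sites:
        bases[i] = char
    return "".join(bases)


def modify_alignment(alignment, reference, hypermutated, mode):
    """Apply one modification to an alignment given as {name: sequence}.

    mode "strip": drop every column flagged in any hypermutated sequence
                  from all sequences in the alignment.
    mode "R"/"G": overwrite flagged sites in hypermutated sequences only.
    """
    flagged = {name: apobec3_sites(reference, alignment[name])
               for name in hypermutated}
    if mode == "strip":
        drop = {i for sites in flagged.values() for i in sites}
        return {name: "".join(b for i, b in enumerate(seq) if i not in drop)
                for name, seq in alignment.items()}
    if mode in ("R", "G"):
        return {name: replace_sites(seq, flagged.get(name, []), mode)
                for name, seq in alignment.items()}
    raise ValueError("mode must be 'strip', 'R' or 'G'")
```

Note that stripping shortens every sequence in the alignment, whereas both replacement modes leave the alignment length and all non-hypermutated sequences unchanged.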
We further showed that the integration dates of env-intact proviruses inferred from the HM-Stripped and HM-Replacedw/R approaches were highly concordant with those inferred from benchmark trees that excluded hypermutated sequences entirely, as is the current practice. The demonstration that these corrected trees provide valid molecular dating results is important because it provides, for the first time, an approach to study the within-host evolutionary origins and longevity of the large and genetically diverse population of hypermutated proviruses that persist in all PLWH during ART.
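The dating logic can be illustrated with a short Python sketch. It assumes a strict linear root-to-tip model calibrated on dated pre-ART plasma sequences, which is one common way of performing such inference and is not necessarily the exact procedure used here; the function names and the numbers in the usage note are hypothetical.

```python
def fit_clock(sampling_dates, divergences):
    """Least-squares fit of root-to-tip divergence against sampling date.

    Returns (rate, root_date): the slope in substitutions/site/year and
    the x-intercept, i.e. the date at which divergence from the root is 0.
    """
    n = len(sampling_dates)
    mean_x = sum(sampling_dates) / n
    mean_y = sum(divergences) / n
    sxx = sum((x - mean_x) ** 2 for x in sampling_dates)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(sampling_dates, divergences))
    rate = sxy / sxx
    intercept = mean_y - rate * mean_x
    return rate, -intercept / rate


def integration_date(divergence, rate, root_date):
    """Date at which the fitted clock reaches a provirus's divergence."""
    return root_date + divergence / rate
```

For instance, with plasma sequences sampled in 2000 through 2003 at divergences of 0.01 through 0.04, the fitted rate is 0.01 substitutions per site per year, and a provirus with a root-to-tip divergence of 0.025 would be dated to mid-2001. Because the provirus contributes only its divergence, any hypermutation left uncorrected in the alignment inflates that divergence and pushes the estimate later.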
Proviral integration date estimates produced by the two approaches were highly concordant, and there was no clear difference in their performance. While the p-values derived from comparing the tree-based metrics of env-intact and hypermutated sequences, shown in Fig. 4, are overall slightly higher for the HM-Replacedw/R approach than for the HM-Stripped approach, we caution against interpreting this to mean that the former is superior. Though we applied statistical tests to guide interpretation, the main goal was to produce tree metric values for hypermutated and env-intact sequences that were in the same range as one another. Both the HM-Stripped and HM-Replacedw/R approaches achieved this. We did not necessarily expect that env-intact and hypermutated sequence metrics would all normalize completely (i.e., produce non-significant p-values) because some evolutionary attributes of env-intact and hypermutated sequences might plausibly differ. For example, because hypermutated sequences do not normally yield descendants, their closest neighbors in the tree might be more distant than those of env-intact proviruses, simply because of the lower likelihood of sampling a close relative (which, for a hypermutated sequence, could only be an ancestor). Differential evolutionary dynamics between hypermutated and env-intact proviruses could also produce differential root-to-tip measurements (and by extension integration date estimates) between groups, a phenomenon that was indeed observed in WIHS-P2 and WIHS-P4.
We therefore offer the following considerations when choosing an approach. Since the HM-Replacedw/R approach retains the full alignment, it should preserve more phylogenetic signal than the HM-Stripped approach, where an average of 9% of each env-gp120 alignment was removed. This could be advantageous for HIV regions that are relatively conserved yet are hotspots for APOBEC3 mutation, for example parts of pol (Kieffer et al. 2005; Kijak et al. 2008). However, before implementing the HM-Replacedw/R approach, it is essential to verify that the chosen phylogenetic inference package supports ambiguous characters. IQ-TREE 2, used in the present study, assigns equal likelihood to each component character (Minh et al. 2020), but other packages, such as the approximate maximum likelihood algorithm FastTree, treat all non-ACTG characters as missing data (Price et al. 2010).
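The practical difference between these two behaviors can be seen in how a tip character is encoded for likelihood calculation. The Python sketch below is illustrative only and does not reproduce the internals of any particular package; `tip_vector` and its `ambiguity_aware` flag are hypothetical names.

```python
STATES = "ACGT"

# Subset of IUPAC nucleotide codes relevant to this example.
IUPAC = {"R": "AG", "Y": "CT", "N": "ACGT", "-": "ACGT"}


def tip_vector(char, ambiguity_aware=True):
    """Per-state tip likelihoods, in A, C, G, T order, for one character.

    ambiguity_aware=True : R is compatible with A and G only.
    ambiguity_aware=False: any non-ACGT character is treated as missing,
                           i.e. compatible with all four states.
    """
    if char in STATES:
        compatible = char
    elif ambiguity_aware:
        compatible = IUPAC.get(char, "ACGT")
    else:
        compatible = "ACGT"
    return [1.0 if state in compatible else 0.0 for state in STATES]
```

Under the ambiguity-aware encoding, an R still excludes C and T at that site, so the site contributes some information. When R is treated as missing data, the site contributes nothing for that sequence, and the replacement behaves much like masking the base entirely.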
It is also important to recognize when sequence alignment modifications are warranted. For routine phylogenetic visualization of HIV datasets, hypermutated sequences can be incorporated directly. Such trees might even be adequate for some limited tree-based inferences, as suggested by our finding that uncorrected trees produced reasonable root dates and evolutionary rates, likely because these calculations only use information from pre-ART plasma HIV RNA sequences. Nevertheless, our demonstration that uncorrected trees erroneously reconstructed the ancestry of hypermutated proviruses, and produced inaccurate (and often nonsensical) integration dates for them, underscores why they cannot be used to answer questions about the evolutionary history of hypermutated proviruses. For such questions, the above alignment modification approaches should be used.
Our results also reveal insights into hypermutated provirus evolutionary dynamics. Like env-intact ones, hypermutated proviruses spanned a broad age range. From WIHS-P2, for example, we isolated hypermutated proviruses that had integrated as early as a year following seroconversion. This indicates that hypermutated proviruses, like other provirus types, begin to be seeded into the proviral pool essentially immediately following transmission, and can persist for decades thereafter. Our results also revealed evidence of differential evolutionary dynamics of hypermutated and env-intact proviruses in two of the six participants studied, namely WIHS-P2, whose hypermutated proviruses were on average older than env-intact ones, and WIHS-P4, in whom the opposite was observed. This suggests that the decay rates of different types of proviruses can be heterogeneous within a given host, as well as heterogeneous between hosts.
Our study has some limitations. We analyzed the present dataset (Shahid et al. 2024) because it is among the most comprehensive of its type (in terms of number of sequences, follow-up time and sampling near seroconversion) and because env-gp120 is commonly used for within-host HIV evolutionary studies (Brooks et al. 2020; Dapp et al. 2017). That said, participants WIHS-P3 and WIHS-P6 had only modest numbers of hypermutated proviruses, which limited our power to detect differences between these and env-intact proviruses in their data. Furthermore, while our proposed method should be applicable to any HIV gene region, we did not explicitly investigate this. The identification of hypermutated sequences, on which our method depends, is by definition imperfect, as it relies on a statistical cut-off and can be subtly influenced by the choice of reference sequence, particularly if a heterologous sequence (e.g., the HXB2 HIV reference strain) is used for this purpose (Rose and Korber 2000). As recommended, we used the most frequent sequence observed post-seroconversion as the reference (Rose and Korber 2000), and we verified that use of a different sequence impacted the identification of hypermutated sequences minimally or not at all (e.g., using an arbitrarily chosen reference sequence from WIHS-P2's earliest sampling time point yielded 137 (out of 1515) nucleotide positions with putative APOBEC3 mutations, versus the original 140). Finally, we cannot assume that intact env-gp120 sequences come from fully intact HIV genomes. As such, the comparison group for hypermutated sequences in the present study is not the replication-competent HIV reservoir, but rather the pool of proviruses with intact env-gp120 sequences, many of which will have defects elsewhere.
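To illustrate what a statistical cut-off means in this context, the Python sketch below implements a one-sided Fisher's exact test of the kind commonly used to flag hypermutated sequences. The layout of the 2x2 table, the 0.05 threshold and the function names are assumptions made for illustration; they are not a restatement of the exact settings used in this study.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null for the 2x2 table

                     G-to-A   no change
    APOBEC3 context    a         b
    control context    c         d
    """
    total = a + b + c + d
    row1 = a + b          # sites in APOBEC3 context
    col1 = a + c          # sites carrying a G-to-A change
    denominator = comb(total, row1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += comb(col1, k) * comb(total - col1, row1 - k) / denominator
    return p


def is_hypermutated(a, b, c, d, alpha=0.05):
    """Flag a sequence whose G-to-A changes are enriched in APOBEC3 context."""
    return fisher_one_sided(a, b, c, d) < alpha
```

Because the counts in the table are tallied relative to a reference sequence, changing the reference shifts a, b, c and d slightly, which is why sequences near the threshold can change classification while clearly hypermutated sequences do not.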
In summary, the current practice of excluding hypermutated proviruses from phylogenies used for hypothesis testing has been a major barrier to understanding the in vivo evolutionary origins and longevity of these sequences. Here, we validated two simple nucleotide alignment modification approaches that, for the first time, allow hypermutated sequences to be correctly incorporated into phylogenies that can be used for molecular dating. Overall, our observations reveal that hypermutated proviruses, like other provirus types, are archived throughout untreated infection and can persist for years on ART. Our observations further suggest that the evolutionary dynamics of hypermutated proviruses may differ from those of other proviral types in some individuals. In addition to enriching our understanding of HIV persistence towards the ultimate goal of HIV cure, the approaches developed here could be extended to between-host phylogenies and to testing other hypotheses related to the within-host evolutionary origins of hypermutated sequences.