Models.
For this project, we sought to substantially improve upon previous studies aimed at resolving the highly complex EBV transcriptome and to add an extra layer to the transcriptome information through the classification of each transcript isoform within the early, leaky late and late categories. To improve the depth of EBV long read data and thereby enhance discernment of the EBV transcriptome, we utilized a selection model that we previously used to study EBV’s remodeling of cell transcription initiation 39. Specifically, we used the EBV positive Burkitt’s Akata cell model in which we introduced an episomally replicating lytic cycle reporter harboring the early gene BMRF1 promoter driving the expression of a GFP marker gene (Supplemental Fig. 1). These Akata lytic reporter cells were treated with anti-IgG to activate the B-cell receptor (BCR) and fluorescence activated cell sorting (FACS) was used to isolate the subpopulation of cells that successfully entered the lytic cycle. RNA was isolated from the GFP positive cells and subjected to high depth Oxford Nanopore Technology (ONT) long read sequencing of the polyadenylated (polyA) fraction (Supplemental Fig. 1). In a separate model, EBV positive Mutu cells were co-transfected with a GFP expression vector and an expression cassette for the EBV encoded immediate early transcription factor Zta to induce reactivation, GFP positive cells were isolated and likewise subjected to high depth ONT polyA sequencing (Supplemental Fig. 1).
Identification of EBV lytic transcripts.
Because long read sequencing data suffers from 5’ truncations, 3’ internal priming, and false splice junction calls 35 40, we developed a new pipeline to quickly validate each 5’ and 3’ end as well as splice junctions for each long read using CAGE-seq data (from purified reactivating cells) 39, 3’ ONT clusters, and splice junctions identified through short read sequencing 39, respectively (Fig. 1A). Because of unique aspects of viral transcription initiation, including our observed high density of start sites 35, we developed our own peak calling software to better assess viral transcription start sites. Using merged CAGE-seq coverage data from both Mutu-Zta and Akata-BCR models 39, we identified peaks with a minimum of 1000 reads for at least one genomic position and a maximal peak width of 8bp in which the most distal base of the peak is defined by having a depth value of at least 0.2X the highest position count. Using these criteria, we identified 573 unique start sites (vs 240 start sites identified in our previous study 35) across the EBV genome during reactivation (Fig. 1A, upper left inset). This compares to 1697 start sites identified across the cell genome using the same criteria (Fig. 1A, upper left inset). Considering that the human genome is approximately 3.1 billion base pairs long whereas the EBV genome is about 171,000 base pairs long, this indicates that EBV initiates transcription through a vastly higher density of start sites (> 6,000X) than what is observed across cell chromatin during reactivation.
Combining all replicates for purified BCR activated Akata cells and combining all replicates for Zta transfected Mutu cells, we generated 23.4 million and 25.4 million ONT reads, prior to validation. Of these, 2.6 million and 4.2 million mapped to EBV, respectively (Supplemental Fig. 2). Our validation CAGE-seq data 39 which were also generated from purified lytic cells totaled 13.1 and 13.3 million reads for Akata-BCR and Mutu-Zta models, of which approximately 22% and 36% mapped to the EBV genome (Supplemental Fig. 2). Lastly, our previously published short read data (generated from purified lytic cells) 39 consisted of approximately 79 million and 58 million mapped read pairs per replicate with approximately 50% and 36% of these reads mapping the EBV genome for the Akata-BCR and Mutu-Zta models (Supplemental Fig. 2).
Using this data and our new long-read validation pipeline, we identified a total of 792 unique viral transcripts in the Akata-BCR model and 1256 unique viral transcripts in the Mutu-Zta model, for a total of 1453 unique transcripts and 595 viral transcripts that were found in both models (Fig. 1A). The finding of 1453 unique transcripts in this current study compares to 355 that we identified previously 35 and 351 that Fulop et al 40 identified. A unique additional component of our pipeline is that we are able to classify each identified viral transcript as early, leaky late, or late (i.e. dependency on the EBV BALF2 gene which is essential for viral DNA replication), as well as whether each transcript is independent, partially dependent or fully dependent on the EBV lytic origin of replication (OriLyt) or the viral preinitiation complex (vPIC) (dependency on the vPIC component, BDLF4) (Fig. 1A) using CAGE-seq data from EBV recombinants published recently by the Johannsen group 41. Altogether, we identified and classified 1453 unique viral transcripts expressed during EBV reactivation. It is noteworthy, however, that although long read sequencing has the capacity to sequence long transcripts, there is still a substantial bias against recovery of longer reads. We therefore believe that despite identifying 1453 transcripts in our study, this is likely an under-estimation of the substantial complexity of the EBV transcriptome.
Generation of a “core” EBV transcriptome annotation.
Based on our findings, there is an extraordinary number of polyadenylated transcript isoforms that are produced during EBV during reactivation. We therefore attempted to generate a more manageable annotation consisting of the most robustly expressed key viral transcripts (Supplemental table 1 and Supplemental file 1 (annotations for standard Akata EBV genome available at flemingtonlab.tulane.edu)). Specifically, we include the major isoform for each reading frame, we include additional alternative start site isoforms with CAGE peak read depth comparable to that of the major isoform, we include alternatively spliced isoforms with more than 1000 splice junction reads in at least one of the Mutu-Zta short read RNA-seq samples, we include polyA read-through isoforms when the identified transcript is not spliced, and we include new transcripts with exceptionally high ONT read depth (mostly lncRNAs).
Within this core set of transcripts, we 1) refined the start site (within a few bases of our previous annotation 35) of 28 primary isoforms, 2) we newly identified the start site of the major isoform of 22 ORF transcripts, 3) we discovered two new major spliced isoforms, one of which forms a chimeric protein, 4) we identified secondary start or termination sites for 59 ORF transcripts, and 5) we identified 6 highly detected and/or notable lncRNAs (Fig. 1B and Supplemental table I). All combined, we have resolved the structure of the major isoform for all but one (BILF1) EBV lytic reading frame and we identified major additional isoforms of known ORFs and major novel transcripts (Supplemental table I).
Classification of lytic promoters and transcripts.
To classify promoters and transcripts according to their dependency on either viral DNA replication (BALF2 dependency), the EBV lytic origin of replication (OriLyt), or the virus encoded preinitiation complex (vPIC) (BDLF4 dependency), we used published CAGE-seq data from the Johannsen lab 41 generated from cells infected with either wild type or an OriLyt deletion mutant (defective for viral DNA replication and OriLyt enhancer function 42), a BALF2 deletion mutant (defective for viral DNA replication) or a BALF2 deletion mutant reconstituted with BALF2, and a BDLF4 mutant (defective for vPIC function) or a BDLF4 mutant reconstituted with BDLF4. For classification, we determined the fold change in all CAGE-seq reads spanning each start site in mutant conditions versus wild type or reconstituted conditions. Fold change cutoffs were used to classify each promoter as “independent”, “partially dependent”, or “fully dependent” and were chosen based on thresholds that best recapitulated “early”, “leaky late”, or “late” gene classifications made previously by the Johannsen lab 41. Transcripts were then classified based on the classification of their respective start site (Fig. 1A and Supplemental files 2–7.
Contribution of OriLyt to early promoter activity.
Early, leaky late, and late genes are defined based on their reliance on viral DNA replication for their expression. Classification of early, leaky late, and late promoters can therefore be defined by their dependence on the key replication associated protein, BALF2. Viral DNA replication also requires OriLyt and accordingly, we observe that nearly all promoters with partial or full dependency on BALF2 are also dependent on OriLyt (Fig. 1C). The two promoters that show unique dependency on BALF2 are partially dependent and are just below the threshold cutoff. These are therefore likely due to experimental noise in the data. On the other hand, excluding the adjacent promoters in the BHLF1 and BHRF1 regions for which the OriLyt region plays a direct role in activity, 55 out of 131 TSSs with partial or full dependency on OriLyt do not require viral DNA replication (based on BALF2 dependency) (Fig. 1C). This indicates that OriLyt has an impact on the activity of a substantial number of promoters outside of its role in facilitating viral DNA replication. Overall, we found that OriLyt plays a role in 42% of viral early TSSs (Fig. 1C), contributing to expression of the major isoform of 14 viral early genes (Table I). These OriLyt responsive TSSs are distributed throughout the viral genome (Fig. 2, red only lines) suggesting that OriLyt provides enhancer function to multiple genomic regions. Overall, these results indicate a key role for OriLyt outside of its requirement for replication dependent late gene expression, namely, a widespread role in enhancing early viral gene expression. While our analyses are technically limited to early genes because for late genes, we cannot discern between a role in replication dependent expression and potential enhancer function, it is likely that OriLyt also contributes to leaky late and late gene expression through a potential enhancer mechanism.
vPIC contributes to early promoter activity.
In addition to classic viral DNA replication mechanisms driving late gene expression in herpesviruses, late gene expression in gammaherpesviruses is also regulated in part through encoding their own preinitiation complex that further contributes to their expression through the TATA-like motif, “TATT” 41, 43, 44, 45. Accordingly, we found that the majority of leaky late and late TSSs (BALF2 dependent) displayed a dependency on vPIC (Fig. 1C and 2). Surprisingly, however, we also found that 31% of early TSSs displayed a vPIC dependency (Fig. 1C), contributing to expression of major isoforms of 4 early genes, 2 of which have the classic TATT motif, one having a TATA motif and the other having a variant motif (Table I). These findings show that vPIC not only regulates leaky late and late gene expression, but also contributes to the expression of early EBV gene expression.
Unique biphasic initiation characteristics of leaky late promoters.
Inspection of start sites at leaky late promoters revealed an unusual characteristic in response to viral DNA replication (BALF2 dependency). Specifically, we noted that in several cases, there were multiple start site positions within each cluster whose activity differed with respect to their dependency on BALF2, with BALF2 dependency being biased specifically towards the downstream positions (Fig. 3). vPIC dependency was similarly biased towards the downstream positions (Supplemental Fig. 3) and it is notable that each of these promoters contain either a TATA or a TATT motif. Based on these findings, we hypothesize that there are distinct differences in the composition of pre-initiation complexes that drive early and late expression at these promoters and that this gives rise to unique downstream positioning of initiation. One scenario could involve a predisposition for non-vPIC pre-initiation factors (potentially a TBP-based preinitiation complex) to drive proximal initiation during the early stage of the lytic cascade while the vPIC pre-initiation complex drives distal initiation during late transcription. Other potential determinants could involve differences in the chromatinized nature of early genomes vs the naked nature of newly replicated genomes during late transcription and/or the impact of additional unknown factors. Overall, the finding of the biphasic nature of these leaky late promoters provides insights into the underlying mechanisms driving the leaky late characteristic of these promoters. It should also be mentioned that while other leaky late initiation sites that lack multiple start site positions may similarly be initiated by distinct pre-initiation complexes that do not cause a shift in positioning of the TSS.
We also noticed a similar shifted initiation site bias for three early gene promoters (Fig. 3A and Supplemental Fig. 3). For BNLF2a, we separate these into two different isoforms in our core annotation: an early and a leaky late isoform. The early isoform is likely expressed to protect lytic cells from adaptive immune responses against viral lytic antigens immediately as they are being produced while the late isoform may ensure high amounts of BNLF2a to maintain this protective impact and/or for potentially loading into viral particles to protect de novo infected cells from adaptive immune responses. Interestingly, for BDLF4, this shift results in initiation downstream from the BDLF4 ATG initiation codon, thereby encoding a transcript lacking the ability to express BDLF4 protein. Instead, this small shift in transcription initiation gives rise to a variant isoform of the overlapping leaky late BGLF1 gene (BGLF1-v1) (Fig. 3). The biphasic nature of this promoter provides an unusual example of a single promoter that has the potential to give rise to the production of two entirely different viral proteins with differing dependencies on viral replication.
Regulatory diversity of lytic ORFs through multiple promoter usage.
For our core annotation, we included lytic gene isoforms that are driven by alternative promoters with comparable activity. In total, 27 lytic ORFs have two or more major promoters (Supplemental table I). While in some cases, the additional promoters displayed the same dependence on BALF2, BGLF4, and OriLyt, in most cases, the additional promoter(s) added unique dependencies on viral DNA replication (BALF2), vPIC, and/or OriLyt. For example, while the BFLF2 ORF is driven by three different promoters with no dependence on viral DNA replication, they all displayed a dependence on OriLyt and the activity of the proximal most promoter was unique in its partial dependence on vPIC (Fig. 3B). Djavadian et al. have proposed that viral early promoters are shut down once viral DNA replication initiates 41. The use of biphasic initiation sites as mentioned above and/or the use of different alternative promoters with differing dependence on BALF2 may have evolved as a way to adapt to requirements for select sets of viral proteins to overcome these expression restrictions to support functions in both early and late phases of the lytic replication cascade.
Viral polyA read-through - Downstream of Gene (DoG) transcription.
Rutkowski et al. discovered that alphaherpesvirus replication causes transcription termination defects in the cell transcriptome that leads to long transcripts generated from transcription termination well beyond the canonical polyadenylation site (referred to as “Downstream of Gene” – DoG transcripts) 46, 47. Recently, this finding has recently been extended to EBV reactivation 48. While the functional impact of defective cell transcription termination in supporting viral replication is unclear, it is speculated that this may influence cell chromatin structure, potentially contributing to the substantial structural changes in cell chromatin that occur during reactivation 46, 49. Among the validated ONT reads in our work here, we find extensive evidence for DoG transcription scattered throughout the EBV viral genome (Fig. 4). These include DoG transcripts that arise from transcription proceeding through multiple previously annotated polyA sites including a DoG transcript initiating from the BFRF3 promoter that passes through 5 annotated polyA sites (Fig. 4).
There is substantial evidence from the literature demonstrating that upstream transcription proceeding through a downstream promoter causes “transcriptional interference” through displacement of transcription factors and other proteins 46, 50, 51, 52, 53, 54, 55, 56, 57, 58. In the context of the highly compacted viral genome, it is likely that DoG transcription plays a role in fine tuning transcriptional regulation at least in part through this mechanism. For example, copies of the genome in which transcription of the late BDLF1 or BDLF2 genes extend transcription beyond its polyA site, expression of the early genes, BGRF1/BDRF1, BDLF3.5, BGLF3, BGLF3.5, BGLF4 and BGLF5 as well as the leaky late BBLF1 and late BGLF2 and BGLF1 genes would be disrupted. DoG transcription may therefore play a role in the temporal regulation of viral lytic transcription.
High abundance viral lncRNAs.
Among the most highly detected transcripts detected in our ONT data was the small lncRNA (265bp), lncBGLF5, which is the 9th most highly detected ONT read (66166 reads) in the Mutu-Zta dataset (lower but still significant detection in the Akata-BCR model may result from lncBGLF5 being near the size selection cutoff) (Table II). This lncRNA is generated by the overlapping context of the BGLF5 and the BALF4/BGLF3.5/BGLF3 gene structures which end 265bp downstream from the BGLF5 start site (Fig. 5A). Because of this configuration, the generation of full length BGLF5 requires transcriptional read-through of the BALF4/BGLF3.5/BGLF3 polyA site. lncBGLF5, then, is generated from failure of BGLF5 promoter-initiated transcripts to read through the BALF4/BGLF3.5/BGLF3 polyA signal, resulting in a high abundance lncRNA. In addition, the abundant lncBALF3 and lncBALF5 lncRNAs are similarly generated from BALF3 and BALF5 initiated transcripts that fail to readthrough overlapping gene polyA sites (Fig. 5B-C and Table II). It will be interesting to assess whether these transcripts are exported to the cytoplasm to serve some unknown function or whether they remain localized at their site of transcription where they may play roles as stable forms of enhancer RNA-like RNAs that help establish phase separation at these promoters to promote transcription.
Another abundant lncRNAs worth noting is the 1.5kb lncBALF4 RNA. Despite being longer than the above mentioned small lncRNAs, it is still highly represented in our ONT libraries (Table II). While it may have some unknown function, it is also noteworthy that this lncRNA is driven from a novel late promoter with transcription proceeding through the early BALF5 promoter (Fig. 5C). In this context, it is likely that lncBALF4 transcription causes transcriptional interference and may be one mechanism to shut down early BALF5 transcription during the late phase of the reactivation cascade.
Also noteworthy are a series of lncRNAs generated from the BZLF1 locus (Fig. 5D and Table II). The BZLF1 reading frame contains only two methionine residues that are present in tandem at the very beginning of the reading frame and there are two promoters of approximately equal strength that drive transcription upstream from the reading frame (Fig. 5D). In addition, there are two downstream start sites of similar strength that drive transcription downstream from the initiation codon, giving rise to lncRNAs (Fig. 5D). Lastly, there is an upstream transcription initiation site of similar strength that causes transcription in the opposite orientation (not shown). Together, this cluster of start sites, some of which make lncRNAs is somewhat reminiscent of transcription initiation at the OriLyt enhancer, raising the possibility that the BZLF1 locus and/or associated lncRNAs may similarly provide enhancer activity.
Novel highly abundant EBV lytic gene splice variants.
Forty seven and 64% of all validated ONT reads in the Akata-BCR and Mutu-Zta models were found to encode at least one splicing event. Despite these high numbers, while 50% and 36% of short reads mapped to the EBV genome in the Akata-BCR and Mutu-Zta models, we only observed that 2% and 1% of splice junction reads that mapped to the EBV genome. While the substantial isoform diversity that we observe is of likely importance to reactivation, many of these may represent rare splicing events of highly expressed long transcripts that are too long to be easily detected by ONT sequencing. To ensure that new spliced isoforms are of likely high significance, warranting inclusion in our “core” annotation, we only include spliced isoforms that are either previously identified and/or have more than 1000 splice junction reads per short read sequencing replicate of the Mutu-Zta model. Using this criterion, we identified two new splice variants for inclusion in our “core” annotation (Supplemental Fig. 4 and Fig. 6). One is a second splice variant of BLLF1 (gp350) which encodes the first 121 amino acids of BLLF1 and a carboxy-terminal 5 new amino acids arising from a frame shift at the splice junction (Fig. 6A). This results in a BLLF1 protein isoform encoding the first of three anti-parallel beta-barrels that in the full-length protein, forms an anchor bridging the other two (Fig. 6) 59. Although this splice variant encodes this well-defined structural motif, full length BLLF1 binding to the CR2 receptor on B-cells occurs through the third beta-barrel. What possible function the protein product of this splice variant has is therefore unclear.
The other high abundance splicing isoform results in the generation of an early fusion protein containing the first 489 amino acids of BXLF1 (viral thymidine kinase gene) and the last 459 amino acids of the leaky late BXLF2 gH glycoprotein (Fig. 6B). Because translation initiation occurs from the natural BXLF1 start codon, it is almost certain that this fusion protein is produced. While the function of BXLF1 is unclear, expression of BXLF1 in EBV negative cells results in localization to a few distinctive large perinuclear dots 60, 61. Although BXLF2 encodes gp85, which is a key virion factor for EBV infection, expression of BXLF2 in EBV negative cells results in a cytoplasmic localization as well as “discrete patches” at the nuclear rim 62. Through AlphaFold3 63 structure predictions, the structure of BXLF2 appears largely intact in the fusion context, as is a bundle of central BXLF1 alpha-helices (Fig. 6B). It is possible that the covalent linkage of these regions of BXLF1 and BXLF2 play some role in linking the functional impacts of both proteins at the perinuclear and nuclear rim to support some as yet to be determined viral function.