Most TE expression candidates are fragmented and polymorphic TE loci
Following the system established by Lizamore [23] to activate TE transcription, V. vinifera (Pinot noir UCD5 clone) embryogenic callus cultures were subjected to a mock treatment (hereinafter Vv_Mock; which included vigorous shaking that was expected to mount a wounding response in this tissue; see Methods) or mock treatment plus H. uvarum live cultures (hereinafter Vv_Yeast), and samples were collected at four time points after exposure to the yeast culture (Fig. 1a). Untreated embryogenic callus culture was included as a common time zero (Vv_T = 0) for both Vv_Mock and Vv_Yeast treatments. To comprehensively survey the transcriptional activity of TEs in this system, we generated RNA-seq data for each sample. The short sequencing reads of stranded polyadenylated transcriptome data were analysed with a computational workflow that includes multiple existing software to incorporate unique- and multi-mapping reads for identification of potentially expressed TE loci, which were denoted as 'expression candidates' (Fig. 1a; Additional file 1: Table S1; Additional file 2: Figure S1). In addition, publicly accessible A. thaliana RNAseq data of wild-type (Col) [36], ibm2 [36], and ddm1 [19] were also analysed using the same computational workflow and denoted as At_WT, At_ibm2, and At_ddm1, respectively (Fig. 1a; Additional file 1: Table S1; Additional file 2: Figure S1). This analysis workflow was consisting of three parts that differed in the software (including HTSeq [42], Bedtools [43] and TEFingerprint [44]) used to quantify the sequencing reads mapping to individual TE loci, as well as in the characteristics of the quantified reads (Fig. 1a; Additional file 2: Figure S1; see Methods for details). The three sets of TE loci passing corresponding expression threshold (see Methods) were united as a pool of expression candidates (Fig. 1a; Additional file 2: Figure S1A, B), in which those collected by htseq-count or TEFingerprint were denoted as 'trackable' expression candidates as their expression pattern was able to be tracked by virtue of unique-mapping reads, whereas those exclusively collected by Bedtools were denoted as 'un-trackable' expression candidates as the multi-mapping reads did not have suitable sequence polymorphisms to allow unambiguous mapping to individual TE-loci. Thus those expression candidates that are deemed to be trackable represent those transcribed TE loci with accumulated unique polymorphisms, such as single nucleotide variants (SNVs) or small insertions or deletions (INDELs). By contrast, un-trackable expression candidates represent a group of highly conserved TE loci, which are likely to be younger than trackable expression candidates and might include recently mobilised TE loci.
In the V. vinifera system, our analysis workflow identified 3,698 (1.6%), 5,524 (2.5%), and 5,531 (2.5%) of a total of 223,411 annotated TE loci as expression candidates in Vv_T = 0, Vv_Mock, and Vv_Yeast, respectively (Fig. 1b). Of the total number of loci containing TE sequence, 75%-87% were effectively excluded as having no evidence of expression and a further 11%-22% of the total annotated TE loci were excluded due to insufficient support in terms of low numbers of mapped reads and thus fell under threshold (Additional file 2: Figure S2). In each of the tested conditions (Vv_T = 0, Vv_Mock and Vv_Yeast), different sets of expression candidates were identified (Fig. 1c). The increase in the number of expression candidates uniquely found in Vv_Mock (1,139 TE loci; Fig. 1c) and Vv_Yeast (1,913 TE loci; Fig. 1c) comparing to Vv_T = 0 (329 TE loci; Fig. 1c) indicates the transcriptional activation of TE activity due to the wound and biotic stress treatments, respectively.
Similarly, in At_WT and At_ibm2, over 95% of the total 31,189 annotated Arabidopsis TE loci were excluded from the pool of expression candidates, leaving 1,410 and 1,342 TE loci identified as expression candidates in these two genotypes, respectively (Fig. 1d). However, in At_ddm1, the number of expression candidates was increased to 4,156 TE loci (Fig. 1d), in which 3,091 TE loci were uniquely present in the pool of expression candidates of At_ddm1 (Fig. 1e), most likely reflecting the deficiency of ddm1 mutant in TE silencing [20].
Depending on the experimental condition in the V. vinifera system and genotypes in the A. thaliana system, 69.2%-77.4% and 87.1%-94.9% of the expression candidates, respectively, showed evidence of transcription that unique-mapping reads can represent, thus were identified as trackable expression candidates in these two plant systems (Fig. 1b, d). Only 9%-10% of the expression candidates in V. vinifera system and 11%-17% of the expression candidates in the A. thaliana system retained > 90% length integrity (hereinafter 'full-length' loci) as compared with the length of canonical elements (Fig. 1f, g). The high proportion of trackable and fragmented TE loci in the pools of expression candidates in both V. vinifera and A. thaliana systems indicate that most of the expressed TE loci are both divergent sequences and degenerated in length.
The characteristics of TE expression candidates varies by TE family
We were interested in determining whether the degree of mutation accumulation within expressed TE families varied depending on each family's mobilisation history. TE loci of more recently mobile families (i.e. younger TE families) are expected to possess a high degree of sequence conservation compared to those TE loci belonging to families that have more ancient mobilisation events (i.e. older TE families) [45, 46]. These observations suggest that younger TE families may not necessarily be transcribed to a higher level than older TE families, but younger TE families may possess a higher proportion of full-length un-trackable expression candidate loci than older TE families.
To test this assumption, we grouped grapevine expression candidates hierarchically by families, degree of polymorphism (i.e. trackable or un-trackable), and length integrity (i.e. fragmented or full-length). For Vv_T = 0, 2,565 (69%) of the total 3,698 expression candidates were trackable, whereas the remaining 1,280 expression candidates (un-trackable) remained indistinguishable (Fig. 2a). Among the total 232 TE families (corresponding to nine superfamilies) presented on the y-axis of Fig. 2a, 174 families contained expression candidates in Vv_T = 0. Among these expressed families, 102 contained fewer than ten expression candidate loci, and a further 32 families contained 10 to 23 expression candidate loci (the median abundances of expression candidates among expressed TE families; Additional file 3: Table S3), indicating that the bulk of the expressed TE families have few transcriptionally active TE loci. It was noticeable that 6 TE families had more than 100 expression candidate loci; these families were Copia-23 (211 expression candidates), Gypsy-12 (175 expression candidates), VLINE1 (245 expression candidates), VLINE4 (211 expression candidates), VLINE5 (117 expression candidates) and VLINE6 (139 expression candidates). However, of these 6 TE families, the expression candidate loci in 5 of the families were mostly fragmented and rackable TE sequences (Additional file 3: Table S3). In fact, half of the TE families containing transcribed loci (87 of the 174 expressed families) had zero un-trackable expression candidates, and a further 74 TE families had only 1 to 20 un-trackable expression candidates, most of which are fragmented. This observation demonstrates that the vast majority of the expressed TE families comprised transcriptionally active loci that are mostly fragmented and could be identified by polymorphisms (e.g. SNVs and INDELs). In contrast, Copia-23 and Copia-3 families were over-represented with expression candidates that were both un-trackable and potentially autonomous (full length). These findings were concordant with the observations in Vv_mock (Fig. 2b) and Vv_Yeast (Fig. 2c). Although more TE families were obtaining low numbers of full-length candidates in each of these two treatments than that in Vv_T = 0, most of the expressed families are still in short of un-trackable expression candidates, and Copia-3 and Copia-23 are still the two families comprising most full-length un-trackable expression candidates (Additional file 3: Table S3).
A closer look at the sequences of the canonical and 90 annotated full-length Copia-3 elements shows a condensed phylogenetic cluster mostly comprised with intact Copia-3 indicated by the presence of LTRs flanking INT domain (i.e. structurally autonomous; Additional file 2: Figure S3; also see Methods). This cluster included 26 sequences, 19 of which were structurally autonomous un-trackable candidates with over 90% read coverage of the annotated INT domain, and 4 of which were intact trackable Copia-3 with a nearly complete transcription of INT (see Methods). The majority of the remaining un-trackable expression candidates formed three groups, the largest two close to the cluster formed by intact Copia-3 loci. The opposite distal end of the tree was occupied mainly by un-expressed full-length Copia-3 loci.
The neighbour-joining tree built from the canonical and 220 full-length Copia-23 sequences revealed that the 11 un-trackable and four trackable structurally autonomous candidates with nearly complete transcription across INT were scattered in 5 broom-like clusters. These clusters were densely packed with other un-trackable candidates that had either lost intact LTRs or lacked full INT coverage (Additional file 2: Figure S4). These compact clades with short branches were distinguished from the sequences of unexpressed full-length Copia-23.
To test the contribution of the full-length Copia-3 loci to the pool of Copia-3-related transcripts, reads mapping to all Copia-3 candidates were categorised into four groups by whether they mapped to full-length/fragmented and trackable/un-trackable candidates. This analysis revealed that each category contained reads shared with one or more categories, irrespective of treatments and treatment time-point (Additional file 2: Figure S5). Nonetheless, each category obtained a unique subset of reads that only mapped to one of the four groups (pairwise combination of full-length/fragmented and trackable/un-trackable) of expression candidates, meaning none of the group was able to represent the whole collection of Copia-3 transcripts. The same analysis was applied to reads mapping to all Copia-23 expression candidates. This analysis also demonstrated reads shared across different categories and those reads unique to a single category (Additional file 2: Figure S6).
Copia-3 and Copia-23 belong to LTR-retrotransposon (LTR-TEs). This type of TEs is known for the identical long terminal repeat (LTR) at both ends upon insertion. The pair of LTRs gradually accumulates independent mutations across time; therefore, it has been believed that the more diverse the two LTR sequences, the more time that has passed since insertion [46, 47]. To test whether Copia-3 and Copia-23 were active more recently than other LTR-TEs, we analysed the divergence of each pair of LTRs, following with the estimation of the insertion time of each structurally autonomous TE loci. The insertion dates of the 87 and 177 structurally autonomous loci of Copia-3 and Copia-23, respectively, were calculated based on the divergence of individual pairs of LTRs for each element. The peak of Copia-3 and Copia-23 mobilisation was then estimated from the distribution of insertion times and found to occur approximately 0.02 and 0.017 million years ago (MYA), respectively (Additional file 2: Figure S7A, B). Peak insertion times of the other 39 LTR-TE families with at least ten intact copies were analysed in the same way (Additional file 2: Figure S7C; Additional file 4: Table S3). Most LTR-TE families experienced bursts no longer than 4.5 million years ago (MYA). Note that Copia-3 and Copia-23 were the most recently active LTR-TE families (Additional file 2: Figure S7C). Comparison of the peak insertion time of Copia-3, Copia-23 and the other 5 Copia families, which obtained trackable full-length candidates across all treatments but lacked un-trackable full-length candidates, showed that Copia-3 and Copia-23 experienced significantly more recent bursts than these five other families (Additional file 2: Figure S7D).
Together, these results suggest that LTR-TE families that experienced the most recent mobilisation burst in evolutionary time may not necessarily the TE families having the most number of expression candidate loci. Instead, they may tend to have structurally autonomous expression candidates involving in ambiguous alignment and lacking polymorphisms to facilitate the identification of the exact origin of transcription among these highly conserved loci.
TE expression candidates are not randomly distributed in the genome
A number of recent reports highlight that TE mobilisation and reintegration into the host genome often show distinct insertion bias [14, 28, 29, 30–32]. However, little is known regarding the genome-wide distribution of transcriptionally active TE loci. In order to investigate whether there is location bias between all annotated TEs and expression candidates, our analysis compartmented the annotated reference genome into genic and intergenic regions. The genic region comprised gene units, which were made of exons and introns included from the transcription start sites to the transcription stop sites of genes, and flanking regions of genes which encompassed 2kb upstream (N-flanks) and 2kb downstream (C-flanks) of corresponding translation start and stop sites (Fig. 3a). All annotated TEs intersected with specific genome compartments were categorised accordingly and hierarchically in the order of genic/intergenic regions, location within the genic region (e.g. exon, intron, flanks), and integrity (full-length or fragmented). In the grapevine system, over half of all annotated TE loci fell into intergenic regions (126,976 TEs, 56.83%), while 96,435 (43.16%) TEs co-localised with genes (Fig. 3b). About half of the annotated genic TEs of V. vinifera were in flanking regions, without particular preference for either side (N-flank or C-flank). As expected, intronic TEs comprised most TEs in gene units (Fig. 3b; Additional file 5: Table S4).
Expression candidates of our V. vinifera system were classified in the same way, with additional categories added, including the transcriptional activity of co-localised genes (i.e. TEs associated with expressed or non-expressed genes) and the presence or absence of unique-mapping reads (trackable or un-trackable). Untreated embryogenic callus (Vv_T = 0), Vv_mock, and Vv_Yeast respectively showed 71.47%, 75.69%, and 74.62% of the expression candidates were in the genic regions (Fig. 3c-e, Additional file 5: Table S4). The goodness of fit X-square test shows that the proportion of genic TEs was significantly elevated from 43% of the 'default' distribution of all annotated TEs to 71% of Vv_T = 0 expression candidates (see the comparison of 'All annotated' versus Vv_T = 0 in Fig. 3f). This distribution bias is further enhanced in Vv_mock and Vv_Yeast (Fig. 3f). Delving deeper into the insertion context, about two-thirds of these genic TE expression candidates overlapped with introns; in particular, there was a significant bias toward introns of expressed genes (Fig. 3c-e, g, h; Additional file 5: Table S4).
To examine whether the location bias of expression candidates of our V. vinifera system is also conserved in other plant species, the A. thaliana dataset was also analysed in the same way. There are 58.37% of A. thaliana annotated TE loci located in the genic region, while TEs in gene unit and flanking regions, respectively, comprised 16.00% and 42.37% of the total pool (Fig. 3i; Additional file 5: Table S4). Remarkably, the majority of the TEs annotated within the 'gene unit' overlapped with exons (Fig. 3i), contributing to 12.42% of all annotated loci. This high proportion of exonic TE loci is because, based on the TE annotation established by Jin et al. [48], considerable numbers of long TEs overlap with multiple exons and introns, which were annotated based on the gene annotation file deposited in Ensembl Plants (see Methods), and therefore were preferentially categorised as exonic TEs.
In wild-type Arabidopsis (At_WT), TE loci located in the genic region comprised 91.28% of the 1,410 expression candidates, of which 1,093 loci co-localised with expressed genes (Fig. 3j; Additional file 5: Table S4). In addition, the proportion of gene-unit loci had skewed from 16% of total annotated TE loci (Fig. 3i) to 74.61% of the expression candidates in At_WT (Fig. 3j; Additional file 5: Table S4). The majority of these expression candidates in the gene unit were associated with expressed genes (65.67% to the total expression candidates). The expression candidates of At_ibm2 showed very similar location distribution to At_WT (Fig. 3k). However, while intergenic expression candidates comprised 8.72% of the expression candidate pool of At_WT, the intergenic proportion in At_ddm1 was significantly higher at 54.79% of expressed TE loci (Fig. 3l, m, Additional file 5: Table S4). As opposed to the high proportion of 'gene-unit' expression candidates in At_WT (74.61%), only 24.95% of the expression candidates in At_ddm1 were located in the gene unit. These results were supported by statistical tests (Fig. 3m-o) and show that the distribution of expression candidates in Arabidopsis ibm2 mutant is highly similar to that in wild-type, whereas a striking difference was observed between Arabidopsis wild-type and ddm1.
The proposed preference for expressed TE loci to be located in genic regions may be explained by either a general increase in the proportion of genic expression candidates from most TE families or simply a reflection of the genic-enriched annotation of a few TE families that largely contribute to the pool of expression candidates. To test these two assumptions, we further delved into the location distribution of all annotated TE loci and expression candidates of individual TE families. The genic and intergenic proportions of annotated TE loci and expression candidates were plotted for families belonging to Copia, Gypsy, LINE, hAT and MULE, the five superfamilies that contributed to the majority of expression candidates (Additional file 2: Figure S8-S12). This analysis first looked at the genic and intergenic proportion of all annotated TE loci grouped by families and then categorised expression candidates, in the same way, to examine that whether the genic proportion of expression candidates is higher than that of annotated TE loci in most of the investigated TE families. For the genic and intergenic distribution of annotated TE loci belonging to Copia families, about two-thirds of the families show underrepresentation (< 50%) of genic TE loci (Additional file 2: Figure S6A). When it comes to expression candidates in Vv_T = 0, 58 of the 71 expressed families demonstrated higher genic proportions of expression candidates (Additional file 2: Figure S6B) than that of the annotated TE loci (Additional file 2: Figure S6A), meaning that the tendency of TE expression candidates to be located in the genic region is broadly presented in V. vinifera Copia families in Vv_T = 0. This trend in Copia families is also observed in Vv_Mock and Vv_Yeast (Additional file 2: Figure S6C-D). The analysis for V. vinifera TE families of Gypsy, LINE, hAT and MULE (Additional file 2: Figure S7-9) is concordant with the aforementioned findings in Copia: despite the different degrees of elevation in genic proportion, the distribution bias of expression candidates towards genic region seems to occur in most of the families broadly. The same analysis for the location preference within genic regions (exon, intron, and flanking regions) also reveals that the elevation of the intronic fraction of expression candidates is widely presented in most TE families of V. vinifera (Additional file 2: Figure S13-S17).
The distribution bias of TE expression candidates suggests the tolerance of transcription of TE loci within or proximal to expressed genes. We then asked whether this distribution bias only restricted to non-autonomous TE loci that are assumed to pose less stress of mutational load to host cells than the structurally autonomous TE loci. To interrogate this issue, we investigated the location distribution of structurally autonomous TE loci of Copia-3 and Copia-23, the two families estimated to mobilise most recently among all LTR-TE families in V. vinifera. Among the total annotated structurally autonomous TE loci of Copia-3 and Copia-23, 55 (63%) of 87 and 117 (66%) of 177 loci are within introns (Additional file 2: Figure S18A), respectively. Seventy-four and 138 of these structurally autonomous TE loci of Copia-3 and Copia-23 were identified as TE expression candidates in at least one experimental condition, respectively. Astonishingly, 59 (80%) of the 74 structurally autonomous Copia-3 expression candidate loci and 104 (75%) of the 138 structurally autonomous Copia-23 expression candidates co-localised with expressed genes, primarily within introns of expressed genes (Additional file 2: Figure S18B), indicating the distribution bias of structurally autonomous TE expression candidates toward expressed genes in the LTR-TE families that are most recently active.
The transcriptional dynamics of TEs is closely related to that of co-localised genes
The over-representation of TE expression candidates in proximal of expressed genes in the V. vinifera and A. thaliana systems (particularly At_WT and At_ibm2) indicates the tolerance of the transcriptional activity of intragenic TEs within expressed genes and suggests a' hitchhiker-like' manner of intragenic TEs in that they take advantage of the genic transcriptional permissive status for their own expression [18, 49]. Therefore these TEs might display expression dynamics that resemble a host gene's activity.
To address this possibility, differential transcriptional changes of TE expression candidates and genes of the grapevine system were detected by the computational tool DESeq2 [50]. Due to the repetitive characteristics of TEs, only a subset of expression candidates (i.e. trackable expression candidates) that were mapped by unique-mapping reads were suitable for this analysis. The differential analysis was performed on 5,869 trackable expression candidates found in at least one of the three experimental conditions of the V. vinifera system (Vv_T = 0, Vv_mock, and Vv_Yeast). Hierarchical clustering of differentially expressed TEs (DETEs) demonstrated various predominant expression patterns in response to different treatments (Fig. 4a, b). The mock treatment (Vv_Mock) showed that over 50% of the DETEs were transcriptionally activated in the first 3 hours (3h) of post-treatment and then returned to an expression level similar to that observed in Vv_T = 0 (Fig. 4a, c), illustrating an 'up-back' expression pattern. Interestingly, 206 of the 291 DETEs (70.79%) responded to H. uvarum incubation (Vv_Yeast) in an up-regulated manner (Fig. 4b, e).
In order to define the expression pattern of genes co-localised with DETEs, differential expression analysis was also applied on all expressed genes (FPKM > 1; Additional file 2: Figure S18), following by collection of differentially expressed genes (DEGs) that co-localised with DETEs (see Methods; Fig. 4d, f). Among the DEGs of Vv_Mock samples, 40 DEGs were co-localised with 45 of the 78 DETEs (Fig. 4d). In Vv_Yeast samples, 106 DEGs were co-localised with 124 of the 291 DETEs in this treatment (Fig. 4f). In an attempt to investigate the relationship of expression pattern between DEGs and the co-localised DETEs, these corresponding DETEs and DEGs were used for hierarchical clustering, in which DETEs and DEGs of similar expression pattern were grouped into the same clusters (Fig. 4g, h). Note that a small number of DETEs, especially DETEs within 2kb flanking regions of genes, might co-localise with multiple DEGs and vice versa. Instead of arbitrarily excluding DETEs or DEGs that fell into this scenario, the comparison of the expression pattern of co-localised DETEs and DEGs was conducted on each DETE-DEG pair. The expression pattern of the 45 DETEs co-localized with 40 DEGs in Vv_Mock was then compared with that of paired DEGs, resulting in 45 pairs of DETE-DEG comparisons summarised in Fig. 4i, where 42 (93.33%) pairs of co-localised DETE-DEG showed concordant clustering between DETEs and corresponding DEGs. The same approach was applied on co-localised DETEs and DEGs of Vv_Yeast, in which 113 (89.68%) of the total 126 co-localised DETE-DEG pairs showed the same expression pattern between paired DETEs and DEGs (Fig. 4j). These findings indicate that the dynamic expression pattern of DETEs co-localized with DEGs tended to resemble that of the paired DEGs.
Transcriptionally activated intronic TE loci are associated with intron retention and exposure of premature termination codons
Aberrant alternative splicing, such as exon skipping and intron retention, has been observed at gene loci containing epigenetically unmasked intronic TEs [51, 52]. While TE expression candidates in the V. vinifera and A. thaliana systems show a strong location bias towards expressed genes, how broad the transcriptionally active intronic TE loci associated with aberrant alternative splicing remains unanswered. To interrogate this issue on a genome-wide scale and retain intact sequence information of TE-related transcripts, Oxford Nanopore Technology (ONT) cDNA sequencing was utilised for V. vinifera (P. noir UCD5 clone) embryogenic callus subjected to 12 hours of continuous incubation with live H. uvarum culture (hereinafter Vv_Yeast12h) and the corresponding mock treatment (hereinafter Vv_Mock12h). The alignment depth of the ONT cDNA sequencing reads to the V. vinifera reference genome was comparable to that of the Illumina libraries (Additional file 6: Table S5). The quantification of gene and TE family expression level in the ONT dataset shows a medium to high level of correlation with that of the Illumina dataset (Spearman's correlation coefficient ρ > 0.80 for genes; ρ > 0.58 for TE families; Additional file 2: Figure S20). The FLAIR pipeline [53] was used to categorise alternative splicing events into four categories: alternative 3' splicing (Alt3), alternative 5' splicing (Alt5), intron retention (IR) and exon skipping (ES). Gene-related alternative splicing features overlapping with TEs were further collected, following by estimating the productivity (as per the definition in FLAIR pipeline, this denotes the ability of a transcript to produce protein), of gene transcripts having these alternative splicing features. First of all, among the total 21,081 alternative splicing features identified by FLAIR across the ONT libraries of Vv_Mock12h and Vv_Yeast12h, 19,526 (92.6 %) of these are related to annotated genes. Over 90% of these gene-related alternative splicing features were IR (8,806 alternative splicing features) and ES (9,378 alternative splicing features). Note that an isoform may contain multiple numbers and various types of alternative splicing features. Nonetheless, an alternative splicing feature could appear in multiple isoforms, as indicated in Fig. 5a. Notably, there are more genes than the number of associated ES features, suggesting that, for some ES events, each may involve more than one gene. Of the 19,526 gene-related alternative splicing features, only 524 (2.7%) of these overlapped with TEs (Fig. 5a). As expected, almost all TEs overlapping with Alt3, Alt5 and IR features are located within introns, while 22 of 40 ES-associated TEs overlapped with annotated exons.
To understand whether the presence of these TEs associated with the productivity of the gene transcripts, the productivity of isoforms containing these gene-and-TE-related alternative splicing was estimated using FLAIR and grouped into four types: productive (PRO), having premature termination codon (PTC, i.e. unproductive), no start codon (NGO), and having start codon but no stop codon (NST). This analysis shows that 50–68% of the isoforms having Alt3, Alt5, or ES remained productive, no matter whether the alternative splicing features overlapped with TEs (Fig. 5b). However, 80.6% of isoforms having TE-related IR were PTC, while the PTC proportion in isoforms having IR events non-overlapping with TEs was less than 45%. Looking into the estimated translation stop site of these isoforms containing TE-related IR feature, 196 of the 261 PTC isoforms exhibit premature stop codon exactly within the TE-overlapping IR feature (Additional file 7: Table S6). From the perspective of the isoform orientation, nine of the translational premature termination sites appear within TEs, two are after TEs, and the rest 186 isoforms show premature termination sites before the presence of TEs. The distance between TEs and the premature termination sites presented prior to TEs ranged from 2 bp to over 4 kb, with the first quartile, median and third quartile at 147 bp, 311 bp, and 693 bp, respectively (Additional file 7: Table S6).
Interestingly, different TE superfamilies were preferentially observed among the four types of alternative splicing features. Retrotransposon VLINE was over-represented in Alt3 and Alt5 alternative splicing events (Additional file 2: Figure S21A, B), and Harbinger, a DNA transposon, was predominantly seen in IR features (Additional file 2: Figure S21C). For ES features, MULE DNA transposon was the most predominant superfamily among all TE superfamily (Additional file 2: Figure S21D). In addition, most of these TEs were fragmented. There're only 10, 7, 34 and 1 full-length TE loci associated with Alt3, Alt5, IR, and ES features, respectively (Additional file 7: Table S6).
Identification of the potential origin of full-length transcription for autonomous mobilisation
Transcriptional activation of TEs under stress condition has been widely reported in plants and primarily investigated at the TE family level [17, 54]. Nonetheless, what proportion of these TE transcripts is derived from structurally autonomous TE loci and where are these autonomous TE loci seeding TE mobilisation in the genome remained unclear. Since our V. vinifera system has shown transcriptional activation of TEs under stressed conditions and demonstrated the characteristics of individual TE loci in the pool of expression candidates, we then aimed at identifying TE loci that potentially contribute full-length TE transcripts necessary for autonomous mobilisation. Due to the ability of ONT sequencing technology to sequence full-length transcripts, ONT cDNA libraries have the potential to reveal, if any, competent transcription of autonomous TE and decipher the origins of these transcripts [55]. The sequence integrity and structure of TE loci were firstly screened for structurally autonomous TE loci annotated in the genome. The structurally autonomous TE loci identified as expression candidates in the stress treatments (Vv_Mock12h and Vv_Yeast12h) were then examined for the breadth of read coverage against the TE sequence compartments whose transcription is required for autonomous mobilisation (see Methods). Intact LTR-TEs with > 90% INT coverage (Additional file 2: Figure S22 A, B), autonomous LINEs with > 0.9 breadth of coverage across whole elements (Additional file 2: Figure S23 A, B), as well as intact TIR-transposon (TIR-TE) with > 90% ORF covered by ONT reads (Additional file 2: Figure S24 A, B) were collected. This process captured 20 and 19 LTR-TE loci in Vv_Mock12h and Vv_Yeast12h libraries, respectively (Additional file 2: Figure S22 C). These include Copia-3, Copia-23, and Gypsy-V1. For LINE retrotransposon, only a single VLINE7 locus and two VLINE 8 loci were selected from Vv_Mock12h library (Additional file 2: Figure S23 C). For TIR-TE, three hAT-7 loci in Vv_Mock12h library revealed > 0.9 breadth of coverage across ORF (Additional file 2: Figure S24 C).
To further check whether the nearly full breadth of coverage resulted from contiguous full-length ONT reads across TEs, rather than a co-contribution of multiple reads, the read length of each read and the bases mapped to the potentially autonomous TEs were investigated. The ONT read should meet two criteria to prove that a full-length autonomous TE transcript was present. Firstly, depending on the type of mapped TE, it should be at least as long as the INT domain, the ORF, or the full feature of the TE locus. Secondly, this read should have its TE-mapped bases almost as much as its read length. Take Copia-23 as an example, the ONT reads mapping to the 19 structurally autonomous, seemly fully expressed, Copia-23 loci in Vv_Mock12h mainly were shorter than 3,000 bp (x-axis, Additional file 2: Figure S25 A), whereas the size of the canonical Copia-23 INT domain is 4,084 bp. The only read longer than 3 kb identified exhibited a very poor mapping to the element (y-axis, Additional file 2: Figure S25 A). In addition, the majority of these ONT reads were multi-mapping (red dots, Additional file 2: Figure S25 A). This analysis showed that most of these reads were skewed from the diagonal line, indicating the inconsistency between lengths of ONT reads and the mapped bases of these reads to TEs. To figure out the factors underlying this inconsistency, the alignment start and end sites of these ONT reads in relation to the mapped Copia-23 loci were surveyed. As illustrated in the cartoons in Additional file 2: Figure S25 B and C, the head and tail of the alignment were grouped into three categories, internal, external, and clipped. The investigation reveals that most of these ONT reads represented transcription started within the Copia-23 loci (Additional file 2: Figure S25 B). However, the tail of the reads, especially those that deviated from the diagonal line, were mostly clipped due to the sequence discrepancy between ONT reads and TEs (Additional file 2: Figure S25 C). Only a few of them extended through the annotated boundary. Overall, there is no evidence of autonomous transcription from annotated Copia-23 loci.
The situation described for the 19 autonomous Copia-23 loci was also observed in captured TE loci of Copia-3, Gypsy-V1, VLINE7, VLINE8, and hAT-7 (Fig. 6; Additional file 2: Figure S25). Only a single ONT read mapping to an autonomous Gypsy-V1 locus and eight reads of hAT-7 appeared to adequately cover the bases of the INT (Fig. 6a-c) or ORF (Fig. 6d-f) of the associated TE loci. The genome browser image of the only Gypsy-V1 locus demonstrates the full coverage of this locus by a single ONT read (Fig. 6g). The genome browser image for hAT-7 shows ONT reads covering the ORF of the hAT-7 in chromosome 14 (Fig. 6h). These suggest potential transcription of Gypsy-V1 and hAT-7 may allow limited mobilisation of these elements.