Effect of TE expansion on gene evolution in cotton
Previous studies have indicated that there is a more than threefold difference in the size of cotton genomes (Table S1) [41]. Notably, TEs accounted for a significant proportion of these genomes, ranging from 57% (D5) to 81% (K2), with the majority being class I LTR retrotransposons, primarily Gypsy elements (Fig. 1a, Table S2). Based on previous studies, each genome underwent TE annotation [41]. However, we found that the complexity of transposon activity in cotton species made it difficult to assess the relationship of transposon activity among cotton species. Therefore, we combined TEs from all species to generate a library containing 20,868 entries, and then grouped the entries into clusters based on sequence similarity using CD-HIT2. By default, CD-HIT2 outputs the longest sequence in each cluster. This means that full-length entries are preferred over fragmented entries, and entries with nested structures or chimeric TEs will prioritize full-length but shorter elements. Therefore, in addition to sequence length, we also evaluated each TE in each cluster according to the methods from previous studies to selectively choose representative and full-length TE consensus sequences. After two rounds of CD-HIT2 clustering, we further removed redundant sequences using the "cleanup_nested.pl" script provided by EDTA, reducing the TE library size to 6,180 consensus sequences (Material). Finally, based on the final TE library, we re-annotated the TEs of all cotton genomes using EDTA to document the differential expansion of TEs across cotton species.
Among the 6,180 TE entries, 698 entries expanded in all cotton lineages, 582 entries specifically expanded in D5 (Cluster 6), 1,148 entries specifically expanded in E1 (Cluster 1), 1,401 entries specifically expanded in the African-Asian cottons (A2, B1, F1) (Cluster 2), and 2,351 entries specifically expanded in the Australian cottons (C1, G1, K2) (Cluster 4). There are substantial numbers of lineage-specific TEs within the African-Asian and Australian clades, indicating a potential association between cotton lineage divergence and TEs differential expansion (Fig. 1b, Fig. S1).
To explore the impact of TEs on gene evolution after speciation, we conducted gene family analysis on all the genomes. Conserved gene families across cotton species were identified based on sequence conservation and genome alignment. The genes from eight species were categorized into 18,676 single-copy orthologous genes, 12,501 (D5) – 13,768 (C1) multi-copy gene families, 6,035 (K2) – 8,938 (C1) variable conserved genes, and 471 (C1) – 1,990 (D5) specie-specific genes present in only one genome (Fig. 1c, Fig. S2, Table S3). By comparing the TE coverage of different gene types, we found a significant enrichment of TEs on species-specific genes (P < 0.0001), suggesting that TE activity may contribute to the formation of species-specific genes (Fig. S3). We also found that the number of conserved TE insertions on single-copy orthologous genes was highly similar among closely related species, such as the Australian and African-Asian cotton clades (Fig. S4). This aligns with the general understanding of evolutionary biology, where closely related species share more genetic similarities due to their recent common ancestry. Conversely, in relatively distant species, there was significant divergence, consistent with the accumulation of genetic differences over time. These results, along with previous studies indicating that TEs provide a phylogenetic signal [42], further support the role of TEs in the genetic divergence of cotton species.
We focused on the impact of differential TE expansion on 18,676 single-copy orthologous genes (the following as orthologous genes). Based on the presence of conserved TE insertion events in genes, the orthologous gene families were classified into various evolutionary time nodes and a TE insertion map was constructed (Fig. 1d). This map provides a more intuitive display of differential TE expansion among orthologous genes in cotton. The orthologous genes on the N0 branch exhibited conserved TE insertions in all eight species, indicating that TE insertions occurred before species divergence of the Gossypium genus. Within this branch, we identified 4,103 (F1) to 6,557 (K2) genes that have undergone TE contraction, which also serve as potential factors leading to structural and functional differences in orthologous genes. There were 1,183 gene families in the N3 branch (Clade 1) that exhibited conserved TE insertions only in the African-Asian cotton, and 1,384 gene families in the N5 branch (Clade 2) that exhibited conserved TE insertions only in the Australian cotton (C1, G1, K2) (Fig. 1d). These results further indicate that TEs potentially contribute to the genetic divergence of cotton species.
Differential TE expansion regulates orthologous expression divergence
To explore the impact of TE on posttranscriptional regulation during the divergence of cotton species, we first analyzed gene transcription in eight cotton species to determine their transcription status and any differences in transcription levels. We constructed RNA-seq libraries and quantified gene expression in each species (Table S4, S5). Out of 18,676 single-copy orthologous genes (the following as orthologous genes), we categorized 1,636 genes as not expressed in any species because their FPKM values were ≤ 0.05. Additionally, we identified 8,420 genes that were expressed in only some cotton species and 8,620 genes that were expressed in all eight species (Fig. 2a middle). Among the genes expressed in only some cotton species, 32.30% of orthologous genes were not expressed in a single cotton species, and 7.69% of orthologous genes were detected to be expressed in only one cotton species (Fig. 2a right). To clarify our classification, genes with FPKM ≤ 0.05 are considered not expressed as their expression levels fall below the detection threshold, indicating they do not contribute meaningfully to the transcriptome. Conversely, genes expressed in some species (FPKM > 0.05) but not in others suggest a regulatory mechanism controlling their transcription presence or absence in certain species. This differs from differentially expressed genes, which are expressed across all species but show varying expression levels, implying differential regulation or adaptation. For example, among the genes expressed in all species, significant differences in expression levels were observed. In a comparison between K2 and A2, we found 1,076 genes up-regulated and 1,037 genes down-regulated in A2 relative to K2 (Fig. 2a left). This distinction helps us better understand the regulatory dynamics and potential functional significance of gene expression across different cotton species.
To investigate the impact of TEs on the transcriptional regulation of orthologous genes, the differential TE insertions within 2 kb upstream of orthologous genes across different cotton species were examined (Table S6). Subsequently, we analyzed the orthologous genes with these differential TE insertions and the differences in their transcription. The results revealed that among genes expressed only in some species, the predominant TE type with differential insertions was DNA transposons, accounting for 45.6% (Fig. 2b left). However, the total length of DNA transposons only accounts for 6% of all TEs. A chi-square test confirmed that the proportion of DNA transposons in these genes is significantly higher than expected based on their overall genome proportion (P < 0.01). This significant enrichment of DNA transposons in genes expressed only in some cotton species implies that they play a crucial role in regulating whether genes are transcribed. For example, in the gene family G0020242 encoding a peroxidase-like protein that influences plant development and stress resistance, no expression signal was detected in Grai_10G013610 (D5) with DNA transposon insertion within 2 kb upstream. In contrast, the orthologous gene Garb_10G013290 in A2 exhibited a high expression level (Fig. 2c). We speculate that the DNA transposon insertion in the upstream 2 kb region disrupted the promoter of Grai_10G013610 (D5) or replaced its transcription start site (TSS), thereby impeding proper recognition and binding by RNA polymerase. This observation suggests a potential impact of the DNA transposon insertion on the expression of orthologous genes.
Moreover, the main type of TE with differential insertions in the differentially expressed orthologous genes is LTR retrotransposons, constituting 69.5% (Fig. 2b right). This indicates that LTR retrotransposons may play a role in regulating differences in gene transcription levels. It appears that different types of TE insertions have varying impacts on the expression of orthologous genes. Promoter regions are known to contain numerous regulatory elements, and TE insertions into these regions can influence gene expression. To further explore the impact of TE insertions on promoters, we identified TE-associated transcription factor-binding sites (TFBSs) in cotton. The heatmap illustrates the Z-scores of TFBS enrichments across various TE families, with LTR retrotransposons exhibiting significantly higher enrichment scores compared to other TE clusters (Fig. 2d). The clustering analysis clearly differentiates between LTR retrotransposons and other TEs based on their TFBS enrichment profiles, highlighting the greater impact of LTR retrotransposons on gene regulation. The results indicate that LTR retrotransposons exhibit a higher enrichment of TFBSs compared to other types of TEs. Specifically, well-known regulators of gene expression such as MYB and WRKY are associated with LTR retrotransposons (Fig. 2d). Within the gene family OG0000794, which is involved in plant cell wall growth, the LTR retrotransposon TE1502079 inserted upstream of the Garb_09G019690 (A2) contains a MYB TFBS. The expression of Garb_09G019690 was significantly upregulated compared to that of the orthologous gene in D5 (Fig. 2e). Therefore, LTR retrotransposons may regulate gene expression by influencing the binding of transcription factors.
Subsequently, we investigated the specific LTR retrotransposon insertions of orthologous genes on four major cotton branches (D5, E1, Clade 1, and Clade 2). When there were no specific LTR insertions on genes in each branch, their expression did not significantly differ; however, branches with specific LTR insertions exhibited significantly greater expression levels than did the other branches (Fig. 2f). Interestingly, in branches with specific LTR insertions, their genes contain more TEs that have undergone specific expansion in that branch. For instance, in Clade 1 with specific LTR insertions, the most abundant TE type in their genes is cluster 2 TE, which was also the TE family that has undergone specific expansion in Clade 1 (Fig. 2f, Fig. 1b). This result suggests that specifically expanded TEs can regulate lineage divergence at the transcriptional level of orthologous genes by introducing specific LTR insertions, thereby affecting the evolution of cotton species. Similarly, comparing the expression levels of orthologous genes with and without specific DNA TE insertions on the four branches revealed that branches with specific DNA TE insertions had significantly lower expression levels (Fig. S5). In conclusion, these results reveal the regulatory role of differential TE expansion at the transcriptional level in gene function and the divergence of cotton species, providing new evidence for our understanding of species evolution.
TEs regulate transcript isoforms of orthologous genes by affecting AS
Following the analysis of transcriptional level differences among genes in various cotton species, Nanopore Direct RNA Sequencing (DRS) was utilized to investigate the variations in the transcript structures of the aforementioned transcribed orthologous genes. DRS does not require reverse transcription and PCR amplification, and directly sequencing RNA strands, enabling the acquisition of more comprehensive and precise transcriptomic information, thereby unveiling the intricate structures and splicing patterns within transcripts (Table S7). Thus we generated full-length DRS data and identified transcript isoforms in eight species. A total of 45,075 (F1) – 57,274 (D5) isoforms were transcribed from 22,538 (D5) – 23,950 (F1) genes across the species analyzed (Fig. S6a). Among these genes, 11,640 (K2) – 12,574 (G1) orthologous genes were transcribed, with almost half of them transcribing multiple (≥ 2) isoforms in each species (Fig. 3a, Table S8). Furthermore, there is a significant disparity in the number of isoforms among orthologous genes. Approximately 76% of these genes exhibit a coefficient of variation (CV) in isoform number exceeding 0.3 (Fig. S6b). This indicates that there are differences in splicing patterns among orthologous genes, leading to the generation of diverse mRNA isoforms.
We next explored the putative role of TEs in regulating gene splicing. We found that cotton species with a higher number of isoforms exhibit a greater enrichment of TEs on their genes. Specifically, there is a significant positive correlation (P < 0.05) between the number of isoforms and the TE enrichment score (Fig. 3b). For instance, the gene family OG0015491, which regulates rRNA generation, transcribed 5 isoforms in Garb_07G013780 (A2) with a TE enrichment score of 0.93, and 3 isoforms in Glon_07G013840 (F1) with a score of 0.70 (Fig. 3d). Compared with F1, the TE1548733 introduced a splicing site C in A2, resulting in the splicing of a new isoform T64325, while the introduction of splicing sites C and G generated a new isoform T64323 (Fig. 3d). This demonstrates that certain TEs can introduce new splicing sites, leading to the production of a greater number of transcripts through AS. However, some orthologous genes exhibit a significant negative correlation (P < 0.05) between the number of isoforms and the TEs enrichment score (Fig. 3c). For example, the gene family OG0014419, which regulates fatty acid synthesis, transcribed 4 isoforms in Gsto_06G009650 (E1) without TE insertions, while its orthologous gene Garb_06G018610 in A2 has a TE enrichment score of 0.52 but only transcribed a single short isoform (Fig. 3e). This is probably due to the TE insertion introducing a premature transcription termination site, resulting in the premature termination of transcription and the production of only one isoform. These results collectively illustrate the regulatory role of TEs on the number of transcripts isoforms.
Apart from affecting the number of transcripts, how do TEs influence the structure of transcripts? The AS events within transcripts are directly linked to the diversity of transcript structure. We proceeded to identify AS events across 8 species, totaling 22,568 (F1) – 36,716 (A2) AS events (Table S9). Among these events, intron retention (IR) is the most abundant, accounting for 63.1% (A2) to 71.5% (G1) of all events (Fig. 3f, Table S9), consistent with other species such as maize [43]. A comparative analysis of genes with and without TE insertions in each species revealed that genes with TE insertions had significantly more AS events (P < 0.0016) (Fig. 3g). Furthermore, among orthologous genes across cotton species, genes with TE insertions had significantly more AS events (P = 6.9 × 10− 41) compared to those without TE insertions (Fig. 3h). Interestingly, by subdividing genes with TE insertions based on insertion positions, we found that genes with TE insertions in introns contained more AS events than those with insertions in upstream/downstream regions and exons (Fig. 3h).
We analyzed the level of transcript conservation among cotton species. Based on isoform sequence similarity, a total of 32,752 sets of conserved isoforms were identified across all genes. Among these, 4,323 sets of isoforms were conserved in all eight species, accounting for only 13.2% (Fig. 3i). This indicates significant differences in the splicing patterns of genes. The conservation level of major isoforms among orthologous genes was slightly greater. Among the 8,620 orthologous gene families transcribed in all cotton species, 3,881 (45%) had major isoforms conserved in all eight species, and these major isoforms mostly did not involve splicing events. However, the major isoforms with AS are conserved only in some (2–7) cotton species (Fig. 3j). This indicates that splicing events contribute to structural differences in major isoforms. Importantly, isoforms containing splicing events exhibited significantly greater TE enrichment scores (P = 2.1 × 10− 6) compared to the major isoforms without splicing events (Fig. 3j). This further demonstrates that TE can mediate change of alternative splicing event, leading to structural differences in major isoforms of orthologous genes. Additionally, we identified the potential non-conserved major isoforms caused by TE difference expansion on four branches to investigate the impact of TE-mediated alternative splicing on the lineage differentiation of genes. In branches with specific TE insertions (D5, E1, Clade 1, and Clade 2), there were 278 (F1) – 322 (K2), 116 (F1) – 199 (D5), 147 (C1) – 311 (D5), and 229 (F1) – 448 (D5) genes with different major isoform structures (Fig. 3k). In summary, these results suggest that TEs can alter splice sites or splicing regulatory sequences, leading to changes in gene splicing patterns and subsequently influencing the transcript structural differentiation of orthologous genes. These observations contribute to further elucidating the posttranscriptional regulatory mechanisms underlying cotton species divergence.
TE-mediated uORF regulates translational differences of orthologous genes
Given that transcriptional differences do not fully elucidate the diversity and differentiation mechanisms among cotton species, the gene translation levels are more indicative of the actual protein levels. Therefore, we focused on differences at the translation level among orthologous genes. We constructed polysome profiling libraries for eight cotton species and quantified their translation levels (Table S10). The correlation between the transcriptome and translatome data in each cotton species ranged from 0.73 to 0.90 (Fig. 4a, Fig. S7), indicating a certain discrepancy between the transcriptional and translational levels. The coefficient of variation (CV) for transcriptional and translational levels within each cotton species showed that the CV for translational levels was greater than that for transcriptional levels (Fig. 4b). This indicates that while the transcriptome and translatome show a global good correlation, there are some discrepancies between the transcriptional and translational levels within cotton species. We compared the differences (Δ) in transcriptional and translational level changes among orthologous genes across cotton species and found that the translational level changes were greater (Δ > 0) (Fig. 4c). This finding also indicates that greater differences exist at the translational level between orthologous genes.
To investigate the post-transcriptional changes of orthologous gene regulation, we simultaneously compared the transcriptional and translational levels among orthologous genes. For example, compared to E1, there were 1,804 genes in D5 with no difference in transcription but up-regulated translation (class 2) and 1,930 genes with no difference in transcription but down-regulated translation (class 8), which were the two most abundant types of differentially expressed genes (DEGs) (Fig. 4d). Similarly, in comparison with those in other species, class 2 and class 8 genes were also the most abundant (Fig. S8, Table S11). The presence of large numbers of class 2 and class 8 genes suggests significant variability in gene expression levels after transcription, with some post-transcriptional regulatory factors playing crucial roles. Further analysis revealed that genes with differential translation but no difference in transcription levels contained more specific uORFs at the 5’ end (Fig. 4e, Table S12). During translation, uORFs may compete with mORFs for ribosome binding, thereby reducing the translation efficiency of mORFs. Therefore, we speculate that these specific uORFs regulate the differences in translation levels of orthologous genes after transcription. Investigations into four genes neighboring the orthologous gene family OG0010564, which encodes a WRKY transcription factor, demonstrated that genes containing specific uORFs exhibited lower gene translation efficiencies (Fig. 4f). For instance, OG0010573 in G1, OG0010570 in F1, OG0010564 in G1, and OG0010562 in A2, B1, G1 and K2 contained specific uORFs, and their translation efficiencies were significantly lower than those without uORFs (Fig. 4f). Additionally, the translation efficiencies of orthologous genes with and without uORFs at different evolutionary nodes (N1 – N6) were compared and found that genes with uORFs had significantly lower translation efficiencies (Fig. 4g). These results demonstrate that uORFs may regulate the translation efficiency differences of orthologous genes, thereby impacting cotton species evolution.
To explore whether differential TE expansion leads to uORF variations, the TE coverage in the 5' regions of transcripts containing uORFs was calculated. The results revealed that genes with specific uORFs have greater TE coverage in their 5' regions than conserved uORFs present in all cotton species (Fig. 4h). We hypothesized that differential TE expansion may regulate the translation efficiency of orthologous genes by introducing specific uORFs, thereby mediating species divergence. For example, the gene Garb_03G025400 (A2) that responds to light stimulation contains a specific uORF introduced by a TE at the 5' end, leading to significantly reduced translation levels compared those of to the orthologous gene Grai_03G004290 in D5 (Fig. S9). Subsequently, we compared the translational efficiencies of genes with and without specific uORFs introduced by differential TE expansion across the four branches (D5, E1, Clade 1, Clade 2). The results showed that genes in branches with specific uORFs had significantly lower translation efficiencies than those in other branches (Fig. 4i). This result indicates that differential TE expansion introduced specific uORFs and contributed to differences in gene translation levels and lineage divergence in cotton. GO enrichment analysis of genes with differential translation in the four branches showed that the genes in D5 were mainly enriched in pathways such as cellular response to stimulus and cellular nitrogen compound metabolic processes. The genes in Clade 1 were enriched in macromolecule catabolic processes and protein transport pathways, and genes in Clade 2 were enriched in intracellular signal transduction and cytoskeletal protein binding (Fig. S10). These findings provide insights into understanding the posttranscriptional regulation evolution of gene expression among cotton species.
TE-derived miRNA regulates translation differences in transcripts
Structural variations in transcript isoforms may affect the effective binding of regulatory factors, thereby influencing translation efficiency. Analysis of translational efficiency differences at the transcript isoforms level can provide deeper insights into the detailed mechanisms of posttranscriptional regulation. The actions of small RNAs on mRNA degradation and translation inhibition are well-known posttranscriptional regulatory mechanisms [44, 45]. To investigate the regulatory roles of small RNAs in transcript translation, we performed small RNA sequencing for each cotton species (Table S13). A total of 315,936–947,549 small RNA loci were identified (Fig. S11). It is found that some miRNAs and siRNAs are shared among different cotton species (Fig. S12). Given the large number of siRNAs making analysis cumbersome, we focused on miRNAs, which have 193 (K2) – 249 (G1) loci (Table S14). We predicted their targets and constructed a network of miRNA-regulated target genes (Fig. S13a). We observed that miRNA targets were not conserved in some orthologous genes, with some cotton species showing an absence of miRNA target sites. For example, the target of MIR477 is present only in the Australian cotton but is absent in other cotton species (Fig. S13b). This implies that miRNAs may contribute to the expression differences among orthologous genes in cotton species. Subsequently, we counted the numbers of conserved and nonconserved targets at each evolutionary node (Fig. 5a). The translation efficiencies of miRNA target and nontarget genes were compared, revealing significantly lower translation efficiency of miRNA target genes within each species and between orthologous genes of different species (Fig. 5b). These results indicate that miRNAs indeed exert translational inhibition on cotton target isoforms. As expected, the conservation level of these miRNA target sites among orthologous genes was relatively low. For example, the ancient superfamily miR482 and miR2118 exhibit target site specificity of 63.8% and 53.4% in the orthologous genes, respectively (Fig. 5c, Table S15, S16). This finding suggests that miRNAs can mediate the expression differences by regulating the translation efficiency of genes.
To investigate the role of TEs in this context, we quantified the contributions of different TE families to the evolution of miRNA target sites. We found a significant enrichment of miRNA target sites within LTR retrotransposons, particularly at the insertion sites of Gypsy transposons (Fig. 5d, Table S17). This observation indicates that TE expansion may lead to an increase in miRNA target sites, thereby influencing the regulation of gene translation. Interestingly, the translation efficiencies of genes with miRNA target sites decreased, especially as TE enrichment increased, in comparison to those of nontarget genes (Fig. 5e). This further demonstrates that TEs can regulate gene translation through miRNA-mediated mechanisms.
Upon closer observation, it was found that TEs primarily affect the regulation of miRNA-mediated isoform translation differences through two mechanisms. First, TE-derived miRNAs can reduce the translation efficiencies of isoforms containing target sites, while nontarget isoforms undergo normal translation. For example, in Garb_08G006760 (A2), three nontarget isoforms exhibited greater translation efficiencies, whereas the isoform containing the TE-derived miR171 target site had a translation efficiency of only 0.265, resulting in a lower translation efficiency of the entire Garb_08G006760 gene compared to its orthologous gene in D5 (Fig. 5f). The second mechanism involves TEs introducing new splicing sites in miRNA target genes, which leads to the emergence of isoforms that evade miRNA regulation due to the absence of miRNA target sites. This leads to an increase in the translation efficiency of that isoform. For example, TE1229 introduces a new splicing site in Garb_01G001730 in A2, resulting in the transcription of an isoform without the miR166 target site. This isoform exhibited significantly greater translation efficiency than other target isoforms (Fig. 5g). Notably, both of these mechanisms result in nontarget transcripts becoming new major transcripts, leading to possible structural and functional alterations in the proteins encoded by orthologous genes. We identified the number of genes regulated by these two mechanisms in each cotton species, ranging from 314 (D5) – 541 (F1) and 326 (F1) – 400 (C1), respectively (Fig. 5f, 5g). We further investigated the impact of TE-derived miRNAs on cotton lineage divergence. By comparing the translation efficiencies of genes with and without TE-derived miRNA targets across four branches, we found that branches with miRNA targets exhibited significantly lower translation efficiencies (Fig. S14). These results indicate that TE-derived miRNAs mediate the lineage divergence of gene translation in cotton, and provide clues for exploring the role of TEs in driving the evolution.
Phylogenetic insight into posttranscriptional evolution in the Gossypium genus
To delve deeper into gene expression regulation during cotton species divergence, we generated expression profiles of orthologous genes from four lineages (D5, E1, Clade 1, Clade 2) at three levels (transcription, splicing, translation) and found numerous lineage-specifically expressed genes through K-means clustering. It is noteworthy that there were variations in lineage-specific expression of orthologous genes across the levels of transcription, splicing and translation (Fig. 6a). Specifically, there were distinct regulatory divergences of orthologous genes among different lineages. Specifically, there were distinct regulatory divergences of orthologous genes among different lineages. For instance, the orthologous gene OG0012439 exhibited Clade 1 lineage specificity at the transcriptional level, while showing Clade 2 lineage specificity at the translational level (Fig. 6b). A total of 68 orthologous genes display this pattern. These findings indicate a dynamic trend in gene expression regulation during cotton divergence, involving a substantial number of genes. Among the genes exhibiting Clade 1 lineage-specific expression at the transcriptional level, 72.1% showed changes in lineage-specific expression at both the splicing and translation levels, while only 2.3% remained unchanged (Fig. S15). These results reveal the high flexibility and plasticity of gene expression regulation during evolution, providing a fundamental guarantee for organisms to adapt to diverse environments and growth requirements.
To comprehensively explore the expression divergence of orthologous genes during the evolution of cotton species, we focused on key evolutionary time points in the cotton evolutionary process. Based on phylogenetic analysis, the evolution of cotton species was divided into four stages: an undifferentiated ancestral stage of eight cotton species (Stage 1), the D5 differentiation stage (Stage 2), the E1 differentiation stage (Stage 3), and the Clade1 and Clade2 differentiation stage (Stage 4) (Fig. 6c left). We meticulously identified DEGs at the three levels present in these four evolutionary stages. Specifically, there were 877 conserved gene families across all eight cotton species (Stage 1), 1,799 gene families with lineage-specific expression in the D5 (Stage 2), 3,617 gene families with lineage-specific expression in the E1 (Stage 3), and 4,455 gene families exhibiting significant expression differences between Clade1 and Clade2 (Stage 4) (Fig. 6c middle). These DEGs are key factors driving the divergence of cotton lineages. We further quantified the regulatory role of TEs on the DEGs, and documented the potential number of genes with TE-mediated differential expression in the four evolutionary stages. The expression divergences of 1,035 genes (Stage 2), 2,288 genes (Stage 3), and 3,931 genes (Stage 4) were found to be caused by differential TE insertions, accounting for 57.5%, 63.3%, and 76.1% of the DEGs, respectively (Fig. 6c middle).
Finally, we conducted GO enrichment analysis on these divergent genes and found that the functional enrichment patterns of DEGs affected by TE varied significantly across different evolutionary stages. In Stage2, genes with D5-specific divergence were predominantly enriched in processes related to pigment metabolic process and sterol biosynthetic process, reflecting the differentiation of secondary metabolic pathways. In Stage 3, the enrichment of E1 lineage-specific divergent genes in cellular response to abiotic stimulus and protein methyltransferase activity implies a differentiation towards stress resistance. Genes with divergence between Clade 1 and Clade 2 at Stage 4 were enriched in seed trichome initiation, cell wall, cell morphogenesis (Fig. 6c right). Importantly, A2 (G. arboreum) in Clade 1 can produce spinnable cotton fibers. In contrast, K2 (G. rotundifolium) and C1 (G. sturtianum) in Clade 2 lack fibers or have very short fibers [46], possibly implying differentiation in cotton fiber evolution. In conclusion, the results of this study provide valuable insights into deciphering posttranscriptional regulatory mechanisms in species evolution.