Phylogenetic relationships in Crocus
Until now, the phylogenetic backbone of Crocus was not yet resolved. Therefore, we first established a molecular classification of Crocus to have a robust foundation for interpreting the relationships between WGD, repeat dynamics, and dysploidy in a phylogenomic context. We generated phylogenetic trees based on chloroplast genomes and 35S rDNA sequences of 175–178 taxa (Supplementary Table 1). Four Romulea species, as well as Syringodea bifurcata and Afrocrocus unifolius, served as outgroups. We included only those with complete 35S rDNA and plastid genome assemblies for Crocus samples.
In the 35S rDNA phylogeny (Supplementary Fig. S2), S. bifurcata and A. unifolius group, with 93% bootstrap support, as sister to the maximally supported genus Crocus. Both species were lacking in the plastid dataset. Therefore, we refer for divergence time estimations to the 35S rDNA dataset.
The genus Crocus originated about 17.5 Mya ago (16.59–19.74 Mya) and is divided into four subgenera (Orientalis, Carpetanus, Crocus, and Nudiscapus) in the chloroplast as well as in the 35S rDNA dataset (Supplementary Fig S2). In the 35S rDNA phylogeny, subg. Orientalis split first (15.56 Mya [14.9–16.7 Mya]), followed by Carpetanus (15.2 Mya [14.2–16.7 Mya]) and Crocus (13.8 Mya [13.2–15.2 Mya]), with Nudiscapus as the sister clade (bootstrap supports of 97–100%) (Supplementary Fig. S2). The cp phylogeny differs with Carpetanus splitting first, followed by Orientalis (Supplementary Figs. S3, S4). However, while the support of the backbone of the plastid phylogeny is very high (Supplementary Fig. S3), this split of Carpetanus and Orientalis is only weakly supported (51%) in the 35S rDNA phylogeny. Still, support for the four main clades and most of the subclades is strong in both datasets. The only exception with more incongruences and lower support values concerns the subclades of Nudiscapus (Supplementary Fig. S4). In this subgenus, we could detect hybridization between different series (C. sieheanus 2n = 16, series Abantensis or ser. Reticulati x ser. Punctati, C. mazziaricus 2n = 16, 2C = 11.77 pg originated by a cross of species from ser. Punctati species and ser. Cancellati, Supplementary Fig. S4).
In summary, although we could resolve the phylogeny, we found several incongruences that are, at least in some cases, likely caused by hybridization. We used the 35S rDNA tree to anchor our WGD, repeat dynamics, and dysploidy data, considering the biparental evolution of the nuclear genome-derived 35S rDNA, which is more appropriate for analyzing these nuclear genomic events.
Two periods of WGD events in Crocus
We identified WGD events using paralogous genes from genomic and transcriptomic data of five species (subgenus Crocus: C. hadriaticus and C. sativus from ser. Crocus, C. longiflorus and C. vernus from ser. Verni, and subgenus Nudiscapus: C. weldenii from ser. Punctati) to understand whether WGD events occurred before the diversification of Crocus and of its intrageneric lineages.
With the duplicated BUSCO gene sets from the five species, we could detect a WGD event between 14.89–19.74 Mya (Ks = 0.118–0.168) before Crocus diversification, henceforth referred to as Cr-β WGD event (Fig. 1A, B). Part of the vacuolar protein sorting 11 gene (vps11, Ks = 0.122 in C. longiflorus (evolutionary rate [ER; [46]]: 0.84, Ks normalized by the ER: 0.145), used as a phylogenetic marker for samples without omics dataset, also showed duplication in species representing all four Crocus subgenera, supporting the shared Cr-β WGD event by all crocuses (Fig. 1C). Similar results were obtained using other Angiosperm 353 marker genes (Ks ≈ 0.145), corroborating the shared Cr-β WGD event among crocuses (Supplementary Fig. S5). The gene trees using these marker genes resulted in grouping one Crocus paralog with one or both outgroups S. bifucata and A. unifolius and the other Crocus paralog as a sister (Supplementary Fig. S5). This observation indicates that the divergence of Crocus paralogs is older than the actual Cr-β WGD and that the Cr-β WGD event was allopolyploidization involving an ancestor of S. bifucata and A. unifolius. Because allopolyploidization suggests that the paralogous genes have already accumulated some divergence before the merger of the two parental genomes, the actual Cr-β WGD event should be younger than the dated stem age of 16.59–19.74 Mya but before crocuses started to diversify around 14.89–16.74 Mya.
In addition to the Crocus-specific Cr-β WGD, we also detected more recent independent and intrageneric WGD duplication events. We collectively refer to this period of recent intra-Crocus WGD events as the Cr-α WGDs (Fig. 1A, B). Using the Ks values of duplicated BUSCO genes from C. sativus (density peak at Ks = 0.03) and C. hadriaticus (density peak at Ks = 0.02), we inferred the ser. Crocus Cr-α WGD event at 2.0–3.9 Mya at the root of ser. Crocus. Similarly, for ser. Verni, we estimated the Ks values from C. vernus (density peak Ks = 0.02) and inferred the ser. Verni Cr-α WGD event at 2.1–2.7 Mya after divergence from C. longiflorus (Fig. 1A, B).
With the C. weldenii dataset, we could clearly detect the Cr-β WGD, but the Cr- α WGD-related Ks peak is not distinctly visible, just showing a “bump.” The number of duplicated BUSCO genes with Ks values of 0.015–0.035 around this “bump” is much lesser in C. weldenii (110 out of 3,055 duplicated genes) than in other species that experienced the Cr-α WGD (981 out of 3020 in C. vernus, 772 out of 2,995 in C. sativus, and 163 out of 1,314 in C. hadriaticus transcriptome, Supplementary Table 2). These 110 duplicated genes in C. weldenii are more than those in C. longiflorus (16 out of 2,969), which did not experience the Cr-α WGD. This observation suggests that C. weldenii may have experienced introgressed genomic segments that experienced the Cr-α WGD.
In summary, we have shown that at least two periods of WGD events have predated the initial (Cr-β) and several recent intrageneric (Cr-α) diversifications in Crocus.
Repeat bursts after WGD events
To understand whether WGDs spurred repeat bursts in Crocus, we analyzed the Crocus panrepeatome, which includes TEs, satDNAs, rDNA, and microsatellites. First, we quantified the different repeat classes by subgenus, sub-lineages, and samples (Fig. 2A–C). Then, we calculated statistical support for repeat bursts within each sub-lineage (Fig. 2D–F) and anchored the evolutionary interpretation on the 35S rDNA tree.
We analyzed the repeatomes of 219 samples, including 215 Crocus and four Romulea outgroups (Fig. 2A, B; Supplementary Table 3). RepeatExplorer2 and TAREAN generated 142,149 repeat clusters represented by contigs with the highest genome representations (Supplementary Table 4), which grouped into 47 repeat categories based on RepeatExplorer2 annotation and into six major panrepeatome categories: Class I TE, Class II TE, satDNA, rDNA, Unknown, and Organelle (Supplementary Table 5). The sequences clustered into 40,599 uclusters, excluding organellar repeats (see Methods for the difference between “cluster” and “ucluster”). Of these uclusters, 37,878 are specific to Crocus, 2,552 to Romulea, and 169 are shared (Supplementary Fig. S6 and Supplementary Table 6). For comparative purposes, we included Romulea in the panrepeatome dataset, which comprised 7,354 and 33,245 core and accessory uclusters, respectively (Supplementary Table 6).
Sequences within all uclusters ranged from 13 to 20,697 bp (Fig. 2C, Supplementary Fig. S7, Supplementary Table 4). Transposable elements, rDNA, and Unknown repeats are in the longer range, and satDNAs are in the shorter range. Most Unknown repeats fall into the 100–600 bp range, with 126-bp the predominant sequence representation, and satDNAs were dominated by 178-bp repeats.
The genome proportions (GP) of total repeats are significantly higher in Crocus (34.9–73.2%) than in Romulea (18.5–34.1%) (Fig. 2A), and the repeat content significantly contributed to the genus-level genome size (GS) increase (p < 0.001) (Supplementary Fig. S8). However, GS positively correlated with the repeat content only in subg. Crocus and not in other subgenera (Supplementary Fig. S9). In the cytogenetic group, ser. Verni, total repeat content did not affect GS when samples are strictly limited to ser. Verni. However, this relationship is significantly positively correlated when the immediate sister group, ser. Malyi, is included (Supplementary Fig. S10). However, this relationship between genome size and total repeats should be interpreted cautiously, as many samples do not have genome size data (Supplementary Table 1). Chromosome numbers generally did not correlate with genome size at the genus level but showed a strong negative correlation with GS in combined ser. Verni and ser. Malyi (Supplementary Fig. S10).
Regarding overall repeat content, the subgenera Crocus and Nudiscapus have significantly more repeats than Carpetanus and Orientalis combined, and Nudiscapus has significantly more repeats than Crocus. Moreover, repeat content varied between different sub-lineages (Supplementary Fig. S11). Class I TEs (9.6–57.7% GP) significantly contributed to genome size more than Class II TEs (0.3– 3.4% GP) (Fig. 2A, B, Supplementary Figs. S12, 13). Among Class I TE, LTR retrotransposons are the predominant repeats. Ty1/Copia is significantly higher than the Ty3/Gypsy superfamily (Fig. 2B). Angela (Ty1/Copia), Athila, and Retand (Ty3/Gypsy) LTR families have significant contribution to genome size (Supplementary Figs S12). In contrast, Class II TEs, with EnSpm/CACTA having the highest GP (Fig. 2B), have negative correlations with genome size, although not as substantial as those of Class I TEs (Supplementary Fig. S13). In ser. Verni, Angela, Bianca, Retand, SIRE, Tekay, EnSpm/CACTA, and uncharacterized LTR elements significantly increased relative to genome size (Supplementary Fig. S14).
Both satDNAs (0.2–9.7% GP) and rDNAs (0.4–11.9% GP) also contributed considerably to the genome sizes of some samples. Series Adanensis, ser. Crocus, and ser. Scardici have the most abundant satDNA, 35S rDNA, and 5S rDNA content, respectively (Supplementary Figs. S11 and Fig. S15), although C. wattiorum (currently unplaced in sect. Heterogenei, provisional named ser. Wattiorum) has the highest 5S rDNA content (Fig. 2A). Nevertheless, a considerable proportion of repeats (3.6–46.8 GP) remain uncharacterized (Fig. 2A, C). These unknown repeats may include short, non-autonomous mobile elements like miniature inverted-repeat transposable elements (MITE) [47], terminal-repeat retrotransposons in miniature (TRIM) [48, 49], or short interspersed nuclear elements (SINES) [50].
Of the 40,599 uclusters, 8,918 were significantly associated with each group (p < 0.05), and 2,308 were strictly specific (p ≥ 0.05 in other groups) to each group (Fig. 2D and Supplementary Table 6, 7). These group-specific repeats could resolve series within the subg. Crocus, but not within subg. Nudiscapus, except for section Adamii (Fig. 2E).
The 2,308 group-specific repeats represent 2–70% of the total repeats in each sample and a mean of 5–60% per series (Supplementary Table 8), indicating that the group-specific repeats varied in the magnitude of amplification in each sample and sub-lineage (Fig. 2F). These repeats represent 18–70% (mean 60%) of the total repeats in ser. Crocus, 38–70% (mean 60%) in sect. Adamii, and 16–38% (mean 31%) in ser. Verni, indicating that bursts of specific repeats contribute differently to the overall repeat content in different clades. Moreover, both ser. Crocus and ser. Verni, where WGD events were involved in cladogenesis, showed bursts of these repeats, indicating a link between WGD and repeat bursts.
In summary, we have shown that different types of repeats contributed differently to each sample, series or section, and subgenus. We have also demonstrated a direct correlation between WGD and repeat bursts in ser. Crocus and ser. Verni where we have observed the Cr-α WGD. The panrepeatome profile (heatmap) can also enable one to extrapolate potential WGD events in clades where no genomic data is available but shows a similar profile and amount of repeat burst with those in the two series (e.g., ser. Sieberi, sect. Adamii, and ser. Cancellati). Next, we wondered if TEs co-evolved with some tandem repeats like microsatellites.
Microsatellite repeats co-amplified with transposable elements
To comprehensively survey the panrepeatome, we included microsatellites, but we extended the length range to cover 2–30 bp repeats because genomic repeats are not limited to only the longer satDNAs and TEs. We extracted 357,682 microsatellite sequences Supplementary Table 9) and saw a higher representation of hexanucleotides (105 Mb cumulative length), which is second only to dinucleotide microsatellites (189 Mb) (Supplementary Fig. S16). Moreover, microsatellites that are 6-bp, 12-bp, and 18-bp long, all multiples of six, are the most represented sequences in the dataset, indicating a biological function of hexanucleotide microsatellites in Crocus.
All microsatellites clustered into 35,985 uclusters with the 6-bp human-type telomeric repeat (TTAGGG, micro10581, htel) representing the fourth longest cumulative length, only next to AG, AT, and AAC microsatellites as the three highest, respectively (Fig. 3A and Supplementary Tables 9 and 10). Whereas the Arabidopsis-type telomere (TTTAGGG, micro03525, aratel) ranked only 44th. Intriguingly, both telomeric repeat variants only had 53-bp and 50-bp mean array sizes, respectively, which did not cover the input read length (103 bp) (Fig. 3A and Supplementary Table 10), indicating an abundance of short-arrayed loci over the long arrays found at chromosome termini.
FISH analysis of the htel and aratel microsatellites confirmed the distribution of htel at chromosome termini but not of aratel (Fig. 3B). In addition to the chromosome termini, a chromosome painting-like pattern was also observed for htel in all species we analyzed (Fig. 3B and Supplementary Figs. S17-S37). We suspected this chromosome painting-like pattern of htel is caused by the insertion of short htel arrays in dispersed repeats like retrotransposons. So, we analyzed the co-amplification of microsatellites and TEs between Romulea and the Crocus subgenera and found that several microsatellites were significantly co-amplified with TEs (p < 0.05, Fig. 3C and Supplementary Table 11). For example, micro03527 and micro18034 co-amplified with Ucluster00803, which is a TAR element (Ty1/Copia), and micro02204 co-amplified with Ucluster00104 (SIRE) and Ucluster00323 (Bianca) (Fig. 3D). Moreover, micro10581 (htel) and micro03525 (aratel) co-amplified with Athila (Ucluster00339), Bianca (Ucluster00323, Ucluster00395, and Ucluste00398), and Angela elements (Ucluster00303).
This co-amplification is explained by the insertion of short co-evolved microsatellite arrays into their corresponding co-amplified TEs (Fig. 4). For example, micro10581 forms short arrays between the polyprotein domain and the 3’ LTR of the Athila Ucluster00339. Conversely, the insertion of micro10581 in Bianca and Angela elements is between the 5’ LTR and the gag gene. These short arrays within these TEs explain the overall short mean array size of htel microsatellites (Fig. 3A), contrary to what we expected if htel was exclusively found at chromosome ends. In such a case, we expected htel to cover all 103-bp reads, not just the mean 53 bp.
Micro18034 and micro03527 are mainly localized in short arrays at the 3’ LTR of TAR elements (Ucluster00803). Microsat02204 is localized as part of a longer repeat monomer in short arrays towards the 3’ LTR in SIRE elements (Ucluster00104) and after the 5’ LTR in Bianca elements (Ucluster00323). We did not observe a high amplification of micro02204 in our C. vernus LTR dataset because this microsat is amplified more in subg. Nudiscapus than in subg. Crocus (Fig. 3D).
In summary, we have shown that human-type telomeric repeat makes up the chromosome ends in Crocus, and that this repeat, along with the much fewer aratel, are inserted in TEs. Some other microsatellites, like micro02204, also show co-evolution with TEs and lineage-driven amplification. Next, we wondered if the pattern of repeat amplification and distribution in the genome could tell us about their potential involvement in descending dysploidy.
Amplification of CroSat042 suggests a role in descending dysploidy
To explore whether the patterns of satDNA abundance and chromosomal distribution among related taxa in a clade can tell us about the potential role of repeats in descending dysploidy, we analyzed the karyotypes of species in ser. Verni. We focused our analysis on ser. Verni because of this group's high rate of dysploidy (Supplementary Fig. S10) [45] and the readily available materials for chromosome analysis. To do this, we focused on the satDNA fraction of the panrepeatome (pansatellitome) (Fig. 5A, and Supplementary Table 6), as satDNAs are the repeat types found mainly at chromosome fusion points [22].
The 2,146 satDNA sequences ranging from 13–7948 bp were clustered into 564 CroSats with 178 bp as the most abundant monomer length in terms of sequence count and cumulative length (Fig. 5B). Of the 564 CroSats, some show sub-lineage enrichment, while others are shared by several groups (Fig. 5A). Three CroSats (CroSat042, CroSat144, and CroSat137) are relatively abundant in ser. Verni but are also shared by other groups. Most groups in subg. Crocus share the 178-bp CroSat042, but it can also be detected in very high copies in ser. Adanensis within subg. Nudiscapus. The 186-bp CroSat144 is more abundant in ser. Sieberi than in ser. Verni. Finally, the 159-bp CroSat137 is shared in ser. Verni and sister outgroup (ser. Malyi) and in ser. Versicolores.
These three satDNA markers and the 5S and 35S rDNA markers enabled karyotyping of eight C. vernus populations; three C. heuffelianus populations; one population each for C. bertiscensis, C. kosaninii, C. tommasinianus, and C. longiflorus; two populations of a tetraploid C. cf. heuffelianus; and the cultivar C. etruscus ‘Zwanenburg’ (which is a hybrid of C. tommasinianus and C. etruscus) (Supplementary Fig. S17). However, the homologous pairing was not always straightforward for some species as these cytogenetic markers also revealed the highly heteromorphic karyotypes of many species from different populations, like those of C. heuffelianus and C. vernus. No two populations showed the same karyotype (Supplementary Figs S17-S37).
CroSat042, CroSat144, and CroSat137 varied in the number of chromosome loci, distribution, and cumulative lengths within ser. Verni. Species with more chromosomes have fewer CroSat042 loci and, thus, shorter cumulative lengths, such that the number of loci and the cumulative lengths are strongly negatively correlated with chromosome number (Fig. 5C). Whereas CroSat144 has no significant correlation, and CroSat137 is strongly positively correlated.
Regarding chromosomal distribution, CroSat144 was detected on only one chromosome in C. longiflorus, at the interstitial region of the long arm of chromosome 8 but varied in presence or absence and distribution in its dysploid relatives (Supplementary Figs. S17-S37, Supplementary Table 12). CroSat137 was mainly distributed at the primary constriction, indicating centromeric distribution, but also at intercalary loci in some dysploid species. CroSat042 was localized on only one chromosome in C. longiflorus, a small locus in the subtelomeric regions of both chromosome arms (Fig. 5D, Supplementary Fig. S19, Supplementary Table 12). However, dysploids, particularly C. vernus, which has the lowest chromosome count, have more loci, and FISH signals are more intense at subtelomeric, pericentromeric, and intercalary regions (Supplementary Figs. S20-S37).
The increase in loci number and cumulative length of CroSat137 is reasonable because this satDNA is localized in the centromere; hence, the more chromosomes, the more loci and the longer the cumulative length. On the other hand, the increase in loci number and cumulative lengths of CroSat042 as chromosome numbers decrease suggests a role in chromosome fusion, especially since many loci are present at intercalary regions, potentially mediating chromosomal rearrangements at these sites.
In summary, we have shown that comparative pansatellitome and cytogenetics analyses revealed a putative involvement of CroSat042 in satDNA-linked chromosome fusion and descending dysploidy. Nevertheless, relying solely on this correlation does not unambiguously demonstrate the fusion of chromosomes in dysploids.
Crocus longiflorus chromosomes formed fused chromosome blocks in C. vernus
To demonstrate the fusion of C. longiflorus chromosomes in a dysploid relative, we performed chromosome painting using the genome assembly of C. longiflorus to design the paint probes. Then, we compared the FISH painting signals between C. longiflorus and C. vernus. We developed probes for six chromosome arms: CL_2S for the short arm of chromosome 2, CL_2L for the long arm of chromosome 2, CL_3L for the long arm of chromosome 3, CL_11S for the short arm of chromosome 11, CL_13L for the long arm of chromosome 13, and CL_14L for the long arm of chromosome 14 (Fig. 6A).
FISH signals of all six C. longiflorus chromosome arms showed patterns precisely as predicted (Fig. 6B). Each pool showed two signals, indicating one homologous pair and confirming the diploid genome of C. longiflorus. FISH analysis of the same six targets confirmed the fusion of C. longiflorus chromosome arms in C. vernus. Moreover, the six targets showed a more complex distribution in C. vernus (Fig. 6C, D). Four targets (CL_2L, CL_11S, CL_13L, and CL_14L) showed two signals indicating conserved synteny with that in C. longiflorus; one target (CL_3L) showed four signals but on the same chromosome arm, which could indicate chromosome inversion; and finally, one target (CL_2S) showed six signals, indicating either fragmentation of the short arm of C. longiflorus chromosome 2 before fusion in C. vernus or duplicated segments after WGD that rearranged to different chromosomal regions.
Because five of our probes targeted only one of the two chromosome arms for each chromosome, we could not tell whether the entire chromosomes were fused as a unit or whether chromosome arms fragmented before rearrangements. Nevertheless, we could tell from the C. longiflorus chromosome 2 (CL_2S and CL_2L) that one chromosome was fused as a unit as indicated by adjacent CL_2S and CL_2L in C. vernus chromosome 3a. Whereas in C. vernus chromosome 3b, CL_2S and CL_2L are separated into the different chromosome arms.
This comparative chromosome painting approach confirmed three C. vernus karyotype features: 1) The fusions of C. longiflorus chromosomes forming chromosome blocks in C. vernus. 2) These chromosome blocks are often bordered by CroSat042, reiterating the putative role of CroSat042 in a satDNA-linked chromosome fusion. 3) The absence of exact homologous chromosome pair but rather complex heterologous chromosomes, reiterating the highly heteromorphic C. vernus karyotypes. While some chromosome arms show similar painting patterns indicating pairing in meiosis, most do not (Fig. 6C, D).
In summary, we have shown through chromosome painting that the dysploid C. vernus genome is composed of C. longiflorus chromosomes that “fused” to form syntenic blocks, and that cytogenetically visible loci of CroSat042 are in between these blocks.