Sequencing and genome annotation. We obtained high quality-standard genome sequences of five representative perennial diploid species across the Glycine genus, Glycine falcata (FF), Glycine stenophita (BB), Glycine cyrtoloba (CC), Glycine syndetika (AA), and Glycine tomentella D3 (DD), and a perennial allopolyploid, Glycine dolichocarpa (AADD, dubbed AtAtDtDt to distinguish it from the ancestral AA and DD genomes). These genomes were de novo assembled through a combination of PacBio single molecule real-time (SMRT) sequencing, Illumina sequencing, and chromatin conformation capture Hi-C technologies (Methods), and further corrected/improved through integrating our previously generated paired bacterial artificial chromosome (BAC) end sequences (BESs) from the same set of accessions (soybase.org) (Supplementary Table 1). Of the BESs uniquely mapped to their respective genomes, 99.0~99.7% were anchored in pairs in expected orientations and ranges of physical distances. Approximately 98.6%-99.6% of the genomes were assembled into 20 (diploids) or 40 (allopolyploid) chromosomes, with average contigs N50 ranging from 2.2 to 6.8 megabase pairs (Mbs) and average scaffold N50 ranging from 49 to 71 Mb in size (Supplementary Table 2), comparing favorably with those of the reference genomes of the annual soybeans3, 9. Completeness of the genome assemblies was assessed by BUSCO10 and CEGMA11 (Supplementary Tables 3, 4). The assembled diploid genomes range in size from 941 to 1,374 megabase pairs (Mb), with 55,376-58,312 annotated protein-coding genes (Supplementary Table 5). The assembled allopolyploid genome is 1,948 Mb, harboring 107,346 annotated protein-coding genes (Supplementary Table 5). Approximately 40%-62% of these genomes are composed of transposable elements (TEs) (Fig. 2a and Supplementary Table 5).
Phylogeny and karyotype stability. A phylogeny of these perennials and the annual species was constructed with 281 randomly selected single-copy orthologous genes using common bean (Phaseolus vulgaris), which diverged from the Glycine lineage ~19 MYA12, as an outgroup. The evolutionary relationships, as reflected by the phylogenetic tree (Fig. 1a), are consistent with previous reports based on a limited number of genes, although the dates for the split of the perennial lineage from the annual species and the speciation of individual perennials were estimated to be ~1-2 MY earlier than the previous estimates with a different method13 (Fig. 1a).
The perennial diploids exhibited a high level of chromosomal conservation (Fig. 1b and Supplementary Fig. 1a). Only 183 non-redundant genomic rearrangements including inversions, translocations, and insertions/deletions (InDels) of >50 kb in size were identified by pairwise comparisons with the annual soybeans using common bean as an outgroup (Supplementary Table 6). Perennial-specific and species-specific rearrangements were also identified (Fig. 1c and Supplementary Fig. 1a). Comparison among multiple species/genomes enabled validation of many genomic rearrangements and defining the relative timing and nature of those events, such as the occurrence of the inversion of a ~3.4-Mb fragment involving 398 genes in the annual species (Supplementary Fig. 1b) and the reorganization of chromosome 8 of the D/Dt genome by a combination of inversion, deletion, and translocation events, which resulted in a reduction of >20Mb of genomic sequence as compared with chromosome 8 of the A/At genome (Fig. 1c). Intriguingly, the ancestral telomeric region adjacent to the site where these rearrangements occurred has been retained as one of the telomeric regions of the re-arranged chromosome in the D/Dt genome (Fig. 1c and Supplementary Fig. 1c). The gene density along this chromosome has been reshaped primarily by accumulation of TEs in the re-structured pericentromeric regions and reduction of TE sequences in the re-structured chromosomal arm (Fig. 1c). Overall, fewer genomic rearrangements have occurred in the perennials than in the annual species as revealed by comparison with common bean (Fig. 1b, d).
TEs, centromeric repeats, and rapid intergenic differentiation. Among all categories of TEs, long terminal repeat-retrotransposons (LTR-RTs) are most abundant, accounting for 79.4-90.6% of annotated TEs, or 31.6-49.9% of the sequenced perennial genomes (Extended Data 1; Supplementary Table 5). Individual LTR-RT families exhibited distinct spectra of amplification among the species as reflected by their relative abundance and estimated insertion times (Fig. 2a, b, Supplementary Fig. 2b-e and Supplementary Tables 7, 8). For example, the largest gypsy- and copia-LTR-RT families in the 1,374-Mb F genome, comprise 317.4 Mb (23.1%) of the genome. By contrast, few LTR-RTs belonging to these two families were seen in the annual genomes. Given that 98.2% of intact LTR-RTs were estimated to be amplified in the last 5 MYA (Supplementary Tables 7, 8), it is apparent that LTR-RT amplification was largely responsible for genome variation. On the other hand, the pace and degree of LTR-RT DNA loss are also striking. Of the 905 intact LTR-RTs in the perennials estimated to be amplified ~7 MYA, and thus considered to be inserted prior to the split of the perennial and annual lineages, none were found to have intact orthologs in the annuals, and fewer than a quarter of these elements had detectable remnants at putatively orthologous sites in the annuals, suggesting a loss of at least 94.3% DNA from the “original” copies. In addition, a large amount of LTR-RT DNA has been removed by formation of solo-LTRs through unequal recombination (Supplementary Fig. 2a). Besides LTR-RTs, Mutator and Helitrons are the two major TE superfamilies contributing to the perennial genome size variation and differentiation of intergenic regions (Fig. 2a).
Centromeres in most plant species are composed of long arrays of centromeric satellite repeats (CSRs), which are often interrupted by centromere-enriched retrotransposons (CRs). In the annual soybean, two subfamilies of CSRs, Gm-Cent1 and Gm-Cent2, were previously identified and found to mark two distinct subsets of the 20 chromosomes by fluorescence in-situ hybridization14 and interpreted as evidence for the paleo-allopolyploid origin of soybean. Intriguingly, few copies of these CSRs were detected in the perennial genomes. Compared with the annual G. max (Gm) genome, which harbors ~38 Mb of CSRs, the F genome contains ~7 Mb, of CSRs, while the B, C, A, and D genomes only contain 75, 5, 0.4, and 12 kb of the CRS-homologous sequences, respectively (Fig. 2c). The Gm-Cent1 and Gm-Cent2 were estimated to have diverged ~8 MYA, and CRS homologs in the F genome (dubbed Gf-Cent) are slightly more similar to Gm-Cent2 than Gm-Cent1 (Fig. 2d and Supplementary Fig, 3, 4a). Based on the relative frequencies of physical adjacency of Gm-Cent1/Gm-Cent2 and specific retrotransposon sequences detected with the Illumina sequences, we identified Gmr17 (enriched in the Gm-Cent1 sequences) and Gmr01 (enriched in the Gm-Cent2 sequences) (Fig. 2e-g), as putative CR families in the annual genomes, whereas no similar retrotransposon families were found in the perennial genomes (Fig. 2e, f, h and Supplementary Fig. 4b-d).
Super-pangenome framework and evolutionary architecture. The representative perennial Glycine genomes provide a foundation for constructing a super-pangenome15 framework of this genus. Rather than defining presence/absence of individual gene families that would not reflect the gain/loss of individual genes, we identified and compared orthologous genes among the sequenced perennial species. A total of 109,827 non-redundant genes were annotated in the five perennial diploids, of which, 31,936 (29%) are shared by all the five perennials as orthologs and referred to as perennial core genes (Fig. 3a and Supplementary Table 9). By contrast, a total of 129,006 non-redundant genes were annotated in the 26 G. soja/G. max accessions5, 31,564 (24.5%) are shared by all these annual accessions and referred to as annual core genes (Supplementary Table 10 and Supplementary Fig. 5a). Of the 31,936 perennial core genes, 17,922 (56.2%) overlap with the annual core genes, 8,704 (27.2%) overlap with the annual non-core genes, and 5,310 (16.6%) were perennial specific core genes (Fig. 3b). Of the 77,891 perennial non-core genes, 7,022 (9.0%) overlap with the annual core gene set, 6,745 (8.7%) overlap with the annual non-core genes, and 64,124 (82.3%) were identified as perennial specific non-core genes (Fig. 3b). The shared orthologs between any two of the five perennial species after 5.7-3.8 MY independent evolution account for 77.4-83.5% of all non-redundant genes annotated in the compared species (Fig. 3a and Supplementary Table 9), whereas the shared orthologs between a G. max and a G. soja accession that diverged ~0.25 MYA make up 84.5% of all non-redundant genes in the compared two accessions (Supplementary Table 10). These observations suggest a much higher rate of non-core gene formation in the annuals than found in the perennials.
Overall, the 17,922 core genes shared between the perennials and annuals exhibited lower rates of synonymous substitution (Ks) and non-synonymous substitution (Ka), and stronger intensities of purifying selection (w) than the 6,745 non-core genes shared between the perennials and annuals in both lineages, but neither the core genes nor the non-core genes showed differences in Ks, Ka and w between the two lineages (Fig. 3c-e). Among these shared genes, the duplicates showed lower rates of Ka and stronger intensities of purifying selection than the singletons in both lineages, but no differences in Ks, Ka and w were detected between the two lineages (Fig. 3f-h).
The duplicates were more conservative between the annuals and the perennials than the singletons, as 83.74% (23,671 in the D genome), 86.27% (21,537 in the F genome) of the duplicates were shared with the G. max genome compared to 46.80% (6,386 in D) and 50.85% (6,036 in F) for the singletons (Supplementary Fig. 5b, c). In addition, a higher ratio of duplicates to singletons was observed in the core gene sets (4.7:1) than in the non-core gene set in the perennial diploids (1:1.6) (Supplementary Fig. 6 and Supplementary Tables 11, 12). Interestingly, a higher ratio of duplicates to singletons was observed in the annual genome than in the perennial genomes, and the rapid emergence of non-core genes in the annuals appear to be partly caused by the rapid turnover of singletons (Supplementary Fig. 5d). These observations suggest the two life strategies have had distinguishable effects on gene evolution during the course of the diploidization process after the split of the annual and perennial lineages.
Adaptive evolution of flowering networks underlying perenniality. Little is known about the genetic basis of the life strategy transitions between perenniality and annuality in flowering plants. Nevertheless, it has been suggested that regulatory networks controlling flowering are involved, as the outcomes of adaptive response to particular environments. Mutagenesis analysis revealed that SQUAMOSA PROMOTER BINDING PROTEIN-LIKE 15 (SPL15), the ortholog of Arabidopsis thaliana FLOWERING LOCUS C (FLC), in the perennial crucifer Arabisalpina contributes to perenniality through reducing flowering duration to facilitate a return to vegetative development, preventing floral transition of some branches for polycarpic growth, and conferring flowering response to winter temperatures that restrict flowering to spring16, 17. In an attempt to pinpoint candidate genes underlying the perenniality-annuality transition in Glycine, we selected 174 genes shared by these perennials with seven wild annual soybean accessions that are orthologous/homologous to the A. thaliana genes controlling flowering18 (Supplementary Table 13), and determined whether they have undergone adaptive evolution during the divergence of the two lineages (see Methods). Of the 174 genes examined, two genes, which are orthologous to the Arabidopsis PLANT HOMOLOGOUS TO PARAFIBROMIN (PHP) and APETALA2 (AP2), were found to have experienced adaptive evolution (Fig. 4a and Supplementary Table 14). However, only a PHP ortholog in Glycine showed purifying selections within the annual and perennial groups (Supplementary Fig. 7). In Arabidopsis, the loss-of-function mutant php resulted in mis-regulation of FLC and activate SPL15 and FT expression, conditioned accelerated phase transition from vegetative growth to flowering (Fig. 4d and Supplementary Table 14). Structural predictions from the consensus sequences of the PHP protein in the annuals and perennials showed noticeable differences (Fig. 4b). The comparion of Ka/Ks also indicates subgenera specific adaptation (Fig. 4c). Together, these observations suggest that the adaptive evolution of the PHP ortholog in Glycine may be partly responsible for the perennial-annual life strategy transition, although additional genes may also be involved.
Biased subgenome fractionation in the recent allopolyploid. We first evaluated the collinearity of genes among the A, D, At and Dt genomes/subgenomes, focusing on rearrangements each involving DNA larger than 50 kb. Compared with the more diverse perennial genomes such as F and B, which produced 7.5 rearrangements per MY, and B and C, which produced 6.8 rearrangements per MY, the A and D genomes exhibited a higher rate of rearrangements (13.5 per MY), with more rearrangements in D than A (Supplementary Table 6). A and At, and D and Dt, are highly conserved, with only 10 transpositions between A and At and six transpositions between D and Dt identified (Fig. 5a). Of the 10 transpositions between A and At, three occurred in A involving 113 genes, and seven occurred in At involving 314 genes, Of the six transpositions between D and Dt, three occurred in D involving 123 genes, and three occurred in Dt involving 146 genes. It remains unclear if the transpositions detected in the At and Dt subgenomes occurred before or after the allopolyploidization event. Nevertheless, 23 small inter-subgenomic transpositions involving a total of 45 genes were identified as post-allopolyploidization events (Fig. 5b).
Comparison among the A, D, At and Dt genomes revealed loss of 7,351 genes from the At and Dt subgenomes of the allopolyploid, including 3,242 from At and 4,109 from Dt (Fig. 5c, Supplementary Fig. 8a and Supplementary table 15). Based on their homoeologs present in the other subgenome, 60.4% of the lost genes were deduced to be singletons in their respective subgenomes prior to their losses, 39.6% were deduced to be members of duplicated gene pairs generated by the ~13-MYA WGD event prior to their losses (Fig. 5d). More singletons were lost in Dt than in At, whereas no difference in the number of losses of duplicates was observed between the two subgenomes (Fig. 5e). Only 0.14% of the ~13-MYAWGD-derived homoeologs lost both copies from either the At or Dt subgenomes (Fig. 5e). Based on the orthologs of the 7,351 genes lost in At and Dt that are present in the A and D genomes, we found that 55.4% and 76.2% of the non-core genes present in both A and D were lost from the At and Dt subgenomes respectively, whereas 39.8% and 49.6% of the core genes present in both A and D were lost from the At and Dt subgenomes. It is apparent that non-core genes had a higher tendency of loss than the core genes in both subgenomes (Supplementary Table 16). In addition, the A-orthologs of the genes lost from At have experienced an overall lower stringency of purifying selection and exhibited an overall lower level of expression than their orthologs in the D genome, while the D-genome orthologs of the genes lost from Dt have undergone an overall lower stringency of purifying selection and exhibited an overall lower level of expression than their orthologs in the A genome (Fig. 5f, g). Thus, the At subgenome was likely dominant over the Dt subgenome upon the formation of the allopolyploid. The biased fractionation between the At and Dt subgenomes by gene losses were likely pre-destined by diverged genomic features of their diploid progenitors such as the levels of expression (Supplementary Fig. 8b), the stringencies of purifying selection (Supplementary Fig. 8c), the distribution of TE distal to the promofter regions of adjacent genes that is associated with levels of gene expression (Supplementary Fig. 8d). We found that gene losses in At and Dt tended to occur in clusters Fig. 5h, as exemplified (Supplementary Fig. 8e), whose orthologs in A and D generally showed a pattern of co-expression (Fig. 5i; r2=0.155, P=0.012). By contrast, such a pattern of co-expression was not seen between the A- or D-orthologs of a single lost gene in At or Dt and its adjacent gene (Fig. 5j; r2=0.014, P=0.508).
Illegitimate recombination as a key mechanism for genomic fractionation. The loss of a gene, as described above, was defined when it was annotated in A, D, and only one of the two subgenomes (At and Dt). After careful manual inspection of individual genes, we found that only 13.1% of the 7,351 gene losses are complete absence of the gene sequences, whereas 45.3% and 6.3% of the losses were pseudogenized genes caused by small InDels and point mutations, respectively, and 4.4% were genes interrupted by TE insertions (Fig. 5k and Supplementary Fig. 9). The remaining 30.9% of gene losses were defined as “losses” as they were not predicted as genes Nevertheless, this category of “losses” in the At and Dt subgenomes were proportional to the ratio of total lost genes in respective subgenomes and were not excluded from our analysis.
In an attempt to shed light on the mechanism(s) that gave rise to the small deletions resulting in pseudogenization that was largely responsible for subgenome fractionation, we examined 2,850 small deletions with clearly defined boundaries in 2,315 genes in the allopolyploid. We found that 31.2% of these deletions were flanked by short repeats of 2-18bp, as exemplified in Fig. 5l, a hallmark of illegitimate recombination19, suggesting that illegitimate recombination that gradually deleted genic sequences was a key mechanism for subgenome fractionation.
It has been predicted that allopolyploidization by interspecific hybridization triggers ‘genomic shock’ that would lead to widespread activation of TEs20. If this happened in the recently synthesized Glycine allopolyploid, as a consequence, a large number of identical or nearly identical TEs, particularly LTR-RTs between the At and Dt subgenomes would be observed. We therefore extracted the reverse transcriptase (RT) sequences from 1,202 copia type and 3,070 gypsy type intact LTR-RTs which inserted into the genome within last 350,000 years to construct their phylogenetic relationships, respectively. Out of the 4,272 LTR-RTs examined, only 38 of LTR-RTs showed a high level of sequence identity (e.g., ³99%) between the At and Dt subgenomes (Supplementary Fig. 10, 11). By contrast, none of the LTR-RTs from A and D genomes showed such a level of sequence identity (Supplementary Fig. 12, 13). These observations suggest that the allopolyploidization event did not lead to massive amplification of LTR-RTs.