We selected a landrace line JMZ1 for genome size estimate. The highest peak provided a peak depth of 74 for genome size estimation. As the total number of k-mers was 45,035,768,647, the genome size was calculated to be ~ 601.44 Mb. Therefore, the sequenced Illumina reads (56.18 Gb) provided ~ 93 × coverage. The estimated repetitive sequence content was ~ 58.61% based on a peak depth of 142. As no heterozygosity peak was found, the estimated heterozygosity was ~ 0.02% based on a half-depth of 74. The GC content of the genome of this species was ~ 37.30% (Supplementary Fig. S1; Supplementary Table S1;).
De novo DSS3 genome assembly was performed using a combination of PacBio single-molecule, real-time (SMRT) sequencing and chromatin interaction mapping (Hi-C). We obtained 56.40 Gb raw data by PacBio SMRT sequencing (Supplementary Table S2). After filtering out low-quality data, we generated 55.23 Gb of high-quality subreads with an average contig length of 8.85 kb and N50 = 13.42 kb. This large dataset covered ~ 93-fold of the predicted genome size. We optimized the genome with Quickmerge and generated a preliminary genome assembly capturing 621.06 Mb with contig N50 = 4.16 Mb and 4,320 contigs (Supplementary Table S3).
A total of 119 Mb of read pairs was collected via Illumina sequencing and 35.94 Gb of clean Hi-C reads were generated with 57-fold coverage. The scaffolds were broken into 50-kb fragments of equal length and reassembled by Hi-C data analysis. The low- and middle Hi-C coverage depths in this region were localized and identified as the error points. The aforementioned genome assembly was corrected and a final genome was generated that captured 621.06 Mb with contig N50 = 2.5 Mb, 4,473 contigs, and scaffold N50 = 29.35 Mb (Table 1).
Table 1
C. tora genome assembly information.
Chromosome number (n)
|
13
|
Genome size (Mb)
|
568.59
|
Contig number
|
4473
|
Contig N50 (kb)
|
2,500
|
Contig N90 (kb)
|
43
|
Contig max (kb)
|
17,300
|
Scaffold number
|
4,229
|
Scaffold N50 (kb)
|
29,349
|
Scaffold N90 (bp)
|
43,629
|
Scaffold max (bp)
|
44,760,356
|
GC content (%)
|
35.96
|
Note: Scaffolds, including unplaced contigs, are defined as input contigs not placed by the optical map. |
The Hi-C data reads mapping to the assembly genome consisted of 212 Mb pairs corresponding to 88.38% of the genome pairs. There were 63 Mb of unique mapped read pairs accounting for 52.90% and meeting the requirements of the subsequent analysis (Supplementary Table S4). For the unique mapped read pairs, we evaluated the library data, deleted invalid dangling end-, religation-, self-cycle-, and dumped pairs, and identified 52.85 Mb of valid interaction pairs to anchor the aforementioned genome sequences to pseudochromosomes. A total of 568.59 Mb of sequences were mapped to 13 pseudochromosomes accounting for 91.55% of the whole-genome sequence. Of these, 440.34 Mb was ordered and oriented with pseudochromosomes in the size range of 26.33–55.29 Mb (Table 2). The Hi-C assembly heatmap of the chromosomes displayed 13 distinct groups which denoted high-quality assembly (Fig. 1).
Table 2
Clustered sequence data of each pseudochromosome.
Group
|
Sequence Number
|
Sequence Length (bp)
|
Lachesis Group 0
|
281
|
57,814,068
|
Lachesis Group 1
|
138
|
53,811,674
|
Lachesis Group 2
|
231
|
55,297,032
|
Lachesis Group 3
|
234
|
51,878,843
|
Lachesis Group 4
|
161
|
45,427,400
|
Lachesis Group 5
|
294
|
48,951,038
|
Lachesis Group 6
|
144
|
39,736,525
|
Lachesis Group 7
|
191
|
41,298,315
|
Lachesis Group 8
|
137
|
35,748,794
|
Lachesis Group 9
|
135
|
35,033,854
|
Lachesis Group 10
|
227
|
38,596,663
|
Lachesis Group 11
|
61
|
26,337,058
|
Lachesis Group 12
|
71
|
38,666,604
|
Total Sequences Clustered
|
2,305
|
568,597,868
|
Total Sequences Ordered and Oriented
|
257
|
440,341,984
|
Note: Ratio% of first two items shows % of clustered sequence number or length compared to contig number or genome size. Ratio% of next two items shows % of ordered and oriented sequence number or length compared to cluster sequence number or sequence. |
We used the raw Illumina HiSeq paired-end reads to validate the genome assembly. Of these, 85% could be properly mapped to the assembly (Supplementary Table S5). Thus, the genome assembly contained comprehensive genome information. The Core Eukaryotic Genes Mapping Approach (CEGMA) and Benchmarking Universal Single-Copy Orthologs (BUSCO) methods were used to assess genome assembly accuracy and completeness. The CEGMA v. 2.5 alignment showed that 448 CEG proteins (97.82%) with high-confidence hits and 229 (92.34%) highly conserved full-length sequences were contained in the genome assembly. A BUSCO analysis revealed that we obtained > 93.89% (1,352/1,440 BUSCOs) genome coverage.
Annotation of the C. tora genome sequence
Using homology-based and de novo approaches, we identified 361.17 Mb of repetitive elements accounting for 58.15% of the genome. The main repetitive sequences were transposable elements (TEs) of which the most abundant were long terminal repeats (LTRs) (Supplementary Table S6). Among the Class I elements (retrotransposons), the Gypsy-, LARD, and Copia-type LTRs represented the three largest subfamilies (19.88%, 7.12%, and 6.81%, respectively). Among the Class II elements (DNA transposons), the CACTA repeats and Helitron-type elements occupied 1.68% and 1.39% of the genome, respectively. The remaining genome consisted of putative host genes (1.74%) and simple sequence repeats (SSRs; 6.82%). Most of these TEs were distributed in the central parts of the chromosomes. Most of the repeat sequence types were unequally distributed along the chromosomes. Their densities were higher in the centromeric than the telomeric regions on all pseudomolecules with similar patterns (Fig. 2). Overall, the centromeric regions were enriched in various repeat elements.
We identified 11.64% unclassified repeats in the genome. Thus, many new LTR types could be discovered in plants. We hypothesize that these TEs are different from those of the ancestral legume species and may have contributed to diversification and speciation in these plants.
A combination of homology-based, de novo, and RNA-seq approaches predicted 32,361 protein-coding genes that covered 20.37% of the genomic sequence length (126.52 Mb) (Supplementary Table S7). In total, 87.76% of the predicted genes were supported by homology-based- and RNA-seq data derived from a mixture of leaf, stem, flower, and pod tissues (Supplementary Fig. S2). The average gene size and exon number per gene were 3,875 bp and 5.09, respectively (Supplementary Table S8). The average exon and intron lengths were 239.09 bp and 528.87 bp, respectively. Of the predicted genes, 29,163 (90.12%) were functionally annotated in the NR, TrEMBL, GO, KOG, and KEGG databases (Supplementary Table S9). Most of the protein-coding genes were concentrated at the distal chromosome regions and were relatively sparse in the proximal regions on pseudochromosomes with similar distribution patterns (Fig. 2). Thus, the gene distribution was non-random and resembled that for other plant species reported [23, 24].
We predicted 3,948 non-coding genes including 632 transfer RNAs, 446 ribosomal RNAs, 88 microRNAs, and 2,782 pseudogenes with evidence of transcription but no consistent coding sequence. We also detected 2,980 motifs and 34,761 domains (Supplementary Table S10).
Comparative analysis of the genomes of C. tora and other plant species
A gene family cluster analysis of the complete gene sets was performed on C. tora and certain sequenced legume species such as Arachis duranensis, Arachis ipaensis, Cajanus cajan, Cicer arietinum, Glycine max, Glycyrrhiza uralensis, Lotus japonicus, Medicago truncatula, Phaseolus vulgaris, Vigna angularis, Vigna radiata, and the non-legume Vitis vinifera.
The 26,757 predicted genes were classified into 15,619 gene families with unique gene family number of 780 to C. tora (Supplementary Table S11). Four hundred and six gene families expanded and 5,363 gene families contracted in C. tora. The expanded gene families were enriched mainly in the GO (gene ontology) categories ‘metabolic process’, ‘cellular process in biological process’, ‘membrane in cellular component’, and ‘catalytic activity and binding in molecular function’ (Supplementary Fig. S3). The contracted gene families in C. tora were enriched in ‘biological death process’, ‘antioxidant activity’, and ‘nutrient reservoir activity’.
A phylogenetic tree was constructed based on the single-copy orthologous genes shared by C. tora and 11 other legumes (Fig. 3). The legumes differentiated from grapes over 116 MYA. Cassia tora differentiated from 11 other legumes ~ 95 MYA. Therefore, C. tora is only distantly related to other legumes. After ~ 20 MYA, peanut differentiated from the other nine legumes. The latter then evolved into two clusters. Five vegetable legume species grouped into one cluster and differentiated only ~ 33–4.7 MYA. The other cluster included four legumes (alfalfa for leaf use, licorice for root use, lobelia for leaf use, and chickpea for bean use) that differentiated as early as 56–34 MYA.
LTR family expansion number and time were analyzed for the genomes of C. tora and 11 other plant species (Fig. 4). Using C. cajan as a reference, no distinct peak LTR amplification activity was observed in C. tora over 10 MYA. Thus, the C. tora genome was stable and inactive compared to the other 11 species.
Collinearity analysis between C. tora and A. ipaensis
Cassia tora and A. ipaensis (subfamily Faboideae) diverged ~ 95 MYA. A synteny analysis between C. tora and A. ipaensis revealed that 33.79% of their genes were collinear in 490 blocks (Fig. 5). The genes on one peanut chromosome were dispersed over four C. tora chromosomes. This finding reflected poor collinearity and demonstrated substantial differences between genomes.
The 4DTv (fourfold degenerate transversion) distribution of the homologous gene pairs within these syntenic blocks suggested that C. tora has not undergone any recent lineage-specific whole-genome duplication (WGD) event. Rather, it experienced the paleohexaploidy event (γ) common to all eudicots (Fig. 6).