Genome assembly and annotation
The genomes of all the taxa were assembled using 55-75.9 Gb of PacBio HiFi reads with median read quality ranging from Q31 - Q36 and with an estimated genome coverage depth of 162X to 223X from two PacBio SMRT cells. Scaffolding was performed using 85.1–116 Gb of Hi-C data (250X to 341X coverage of the genome) (Table S1). The previously assembled C. australis genome was used as the reference to assign the contigs/scaffolds into pseudochromosomes in two occasions [Chr2 of C. inodora collapsed assembly (Table S2) and Chr4 of C. australasica (Nakandala et al. 2023a)]. The numbering of pseudochromosomes and their orientations were based on the C. australis genome (Nakandala et al. 2023c) (Table S2). The sizes of the collapsed nuclear genomes ranged between 315 Mb (C. sp) and 391 Mb (C. inodora). The size difference between any two haplotypes ranged between 2.1 Mb (C. australis) and 34.5 Mb (C. inodora). The N50 of the collapsed nuclear genomes ranged from 29.5 to 35 Mb. The gene completeness (BUSCO) of the assembled genomes was high for all the taxa (Table 01). Information about assembly and annotated gene models of nine pseudochromosomes is given in Table S3.
Table 01
The size, contiguity (N50), and gene completeness (BUSCO) of the collapsed and haplotype genome assemblies, including the nine pseudochromosomes and unplaced scaffolds
Species | Assembly | Assembly size (Mb) | Assembly N50 (Mb) | Assembly BUSCO (%) |
C. australasica | Collapsed | 344.2 | 35.0 | 99.1 |
| hap1 | 321.1 | 32.3 | 98.9 |
| hap2 | 323.2 | 34.4 | 99.0 |
C. australis | Collapsed | 328.5 | 35.1 | 98.8 |
| hap1 | 325.8 | 31.4 | 98.8 |
| hap2 | 300.0 | 30.7 | 97.4 |
C. garrawayi | Collapsed | 316.8 | 29.5 | 98.4 |
| hap1 | 286.8 | 30.0 | 98.3 |
| hap2 | 312.7 | 28.9 | 98.6 |
C. glauca | Collapsed | 340.2 | 33.5 | 99.3 |
| hap1 | 318.0 | 33.3 | 99.5 |
| hap2 | 311.6 | 32.4 | 98.4 |
C. inodora | Collapsed | 391.1 | 31.5 | 99.0 |
| hap1 | 290.8 | 29.3 | 98.8 |
| hap2 | 325.3 | 30.3 | 98.8 |
C. sp* | Collapsed | 315.0 | 31.9 | 98.4 |
| hap1 | 298.3 | 31.5 | 98.6 |
| hap2 | 305.0 | 29.7 | 98.8 |
* A distinct form of C. garrawayi |
The genomes were structurally annotated to predict the repeat elements and protein coding genes, and their locations in the genomes. All the genomes contained a high number of repeat elements (53–60%), where many of them were interspersed repeats, predominantly containing unclassified repeats and LTR elements (Fig. 2a, Table S4). Among the Australian Citrus, the C. inodora genome had the highest repeat content (60.8%), while C. sp. had the lowest repeat content (52.9%). The high repeat content of C. inodora was due to the large content of unclassified repeats compared to all other species. We compared the repeat composition of these six Australian Citrus taxa with those of four diverse non-Australian species [C. sinensis (L.) Osbeck, C. maxima (Burm.) Merr, Fortunella hindsii (C. hindsii)] and Poncirus trifoliata a (L.) Raf (C. trifoliata), for which high-quality genomes were available. The total repeat content of the Australian species was similar to C. sinensis, C. maxima and F. hindsii. However, the total repeat content of the P. trifoliata genome was lower than the other Citrus species (44.5%) and that was mainly due to the low percentage of unclassified repeats in the genome (21.9%). The unclassified repeat content of the F. hindsii genome was also relatively low (24.5%) similar to that of P. trifoliata, however, the LTR elements of F. hindsii was the highest among all (Fig. 2a).
Heterozygosity based on a 21 k-mer varied among Australian species (Fig. 2b) and was compared to available data for three other non-Australian wild species including Atalantia buxifolia, C. ichangensis and P. trifoliata. Five C. australasica cultivars showed a great variation ranging from 0.94% (C. australasica cv 1) to 1.52% (C. australasica cv 3 - Red Champagne and C. australasica cv 5 - Ricks Red). These results are consistent with our previous study on C. australasica, which revealed a high variation in heterozygosity among the five cultivars based on genome wide heterozygous single nucleotide variants (SNVs) (Nakandala et al. 2023a). C. australis had the lowest heterozygosity (0.50%) among all the wild limes. The k-mer analysis of three other non-Australian wild species revealed a high heterozygosity for Atalantia buxifolia (1.65%), which was higher than that of the other wild limes. P. trifoliata had a low heterozygosity which was similar to C. australis, while C. ichangensis had a relatively high heterozygosity (1.41%) (Fig. 2b). The predicted gene models of all the Australian collapsed nuclear genomes were from 28,946 (with 34,141 transcripts in C. garrawayi) to 41,304 (with 45,935 transcripts in C. australasica) (Fig. 2c, Table S6). The completeness of the annotated gene sets (BUSCO) was high for all the assemblies (97.4 to 99.2%) (Table S6). The difference in number of annotated genes in pseudochromosomes between haplotypes was ranged from a minimum of 70 in C. australis to a maximum of 4,089 in C. australasica. The CDS sequences of the collapsed and haplotype assemblies of all the species were subjected to functional annotation. Many of the sequences had BLAST hits explaining their protein descriptions. The sequences with no BLAST hits had high coding potentials based on both Citrus and Arabidopsis coding models (Table S6).
The inference of gene duplications in genomes and gene collinearity between and within species
The genes in the genomes were classified into five categories, based on their copy number, and distributions in the genome by MCScanX algorithm (Fig. 3a). The duplicated genes were predominantly dispersed, and were not adjacent on chromosomes, and did not show conserved synteny. All the genomes had more than 10,000 dispersed duplicates (> 34%), which might have been translocated by transposons in the genomes. The next dominant type of duplicates was segmental / WGD duplicates. The collinear genes derived from WGD events were anchored into collinear blocks by collinearity analysis (Wang et al. 2012). The largest number of segmental / WGD duplicates were identified in the C. australasica genome (8,920–24%), followed by C. australis (6,257–19.7%), whilst the lowest number of WGD duplicates were found in C. garrawayi (4,120–15%). The number of WGD duplicates were comparable, and lower in the other genomes. The other two types of duplicates (proximal and tandem) were also defined as paralogs and were nearly similar in all the genomes. The tandem duplicates were adjacent to each other on chromosomes and the proximal duplicates were near to each other, however separated by a few other genes. All the other genes which were present as one copy (5,548 in C. inodora to 6,248 in C. australasica) were defined as singletons (Fig. 3a).
The circos plots indicated high collinearity between homologous chromosomes in all the species (Fig. 3b). More than 70% of genes were found to be collinear between the two haplotypes in all the species (Table S7). The two haplotypes also had structural rearrangements, which were mainly translocations between them as depicted in Fig. 3b. In C. australasica, some collinear blocks were found between Chr7 of hap1 and Chr8 of hap2. Collinear blocks were also found between Chr1 of hap1 and Chr5, Chr8 of hap2. Some collinear blocks were present between Chr7 of hap1 and Chr8 of hap2. Similarly, in C. australis, a major translocation was found between Chr4 of hap1 and Chr8 of hap2. In C. garrawayi, blocks of genes and their orders in Chr1 were found to be preserved in Chr3, Chr4 and Chr8 of hap2. In C. glauca, noticeable translocations were found between Chr4 of hap1 and Chr2, 7, 8 of hap2. In addition, blocks of genes in Chr7, 8, and 9 of hap1, were found to be collinear with Chr2 of hap2. In C. inodora, some collinear blocks between Chr2 and 7 of hap1 and Chr4 of hap2 were dominant. In C. sp, blocks of genes in Chr1 of hap1 were found to be in a collinear arrangement with Chr3, 4 and 8 of hap2. Additionally, some blocks were also found between Chr2 of hap1 and Chr7 of hap2, and Chr9 of hap1 and Chr5 of hap2, indicating structural rearrangements between the two haplotypes.
Collinear genes are important in deciphering the evolutionary relationships of genomes. The collinear genes between any of the two genomes revealed the genes that were conserved in the same order as their ancestral genomes. The results revealed a high homology of genes among our six Australian Citrus genomes (Fig. 3c). Many genes and their orders were conserved between the same corresponding chromosomes of any two species (Fig. 3c, Figure S1). Interestingly, some collinear genes were also found between different chromosomes revealing the structural rearrangements. A conserved pattern of chromosomal locations of the collinear genes could be observed for all the species (Figure S1). The highest number of collinear genes were shared between C. australasica and C. glauca (40,320). A high number of collinear genes were also shared by C. australis and C. glauca (39,787), C. australis and C. australasica (39,243) and C. garrawayi and C. sp (39,301) (Table S8). Chr4 showed collinearity with Chr1, 2 between any two species. Apart from the same chromosome, a noticeable number of collinear blocks were found between Chr8 of C. australis and Chr2, 3, 4, 6, 7, 9 of C. glauca (Figure S1). Collinear genes were also found between Chr4 and Chr1, 2, 8, 9 in C. australis and C. australasica, C. australis and C. glauca, C. australasica and C. glauca, which were not found between any other two species (Figure S1). The lowest number of collinear genes were found between C. inodora and C. glauca (35,755). (Table S8).
The undetermined C. sp was previously identified as a closely related, but distinct form of C. garrawayi, using 86 single copy nuclear genes (Nakandala et al. 2023b). Comparison of this distinct accession with the conventional form, using complete whole genomes now revealed large structural variations between the two forms of this species (Fig. 4a). A large-scale inversion was identified in Chr5, and small-scale inversions were found in all chromosomes. Chr3 had the largest number of structural variations represented by translocations and duplications, while Chr6 had more syntenic regions and less structural variations. In total, 10,619 syntenic regions, 123 inversions and 3,980 translocations were found between the two genomes. Local sequence variations such as insertions (973), deletions (995), 898 copy gains, 820 copy losses and 3,023 highly diverged regions were found across the two genomes.
The haplotype comparison of C. garrawayi (Fig. 4b) and C. sp (Fig. 4c) revealed many structural variations within C. garrawayi compared to C. sp. A total of 81 inversions, 635 insertions, 650 deletions, 589 copy gains and 624 copy losses were found between C. garrawayi haplotypes, whereas it was 70 (inversions), 589 (insertions), 578 (deletions), 522 (copy gains), and 479 (copy losses) between C. sp haplotypes. The number of translocations were high between C. sp (2,403), compared to C. garrawayi (2,164). Moreover, the number of duplications were also higher in C. sp than in C. garrawayi. Similar to the collapsed genomes, the Chr3 of both C. garrawayi and C. sp haplotype assemblies had the largest number of total structural variations, while Chr6 had the lowest number of variations.
Evolutionary events and comparative analysis of gene families
To investigate the polyploidization events in the evolution of Australian Citrus species, the synonymous substitutions (Ks) were estimated for each species. C. sinensis and V. vinifera were used as references. The peaks for the distributions of Ks for paralogous gene pairs of each Citrus species were clearly seen at Ks = ~ 1.5 (C. australasica = 1.55, C. australis = 1.56, C. garrawayi = 1.59, C. glauca = 1.55, C. inodora = 1.58, C. sinensis = 1.56) (Fig. 5a, Figure S2). The results are consistent with previous studies for Citrus [C. clementina Ks peak = 1.5; (Wu et al. 2014)] and Citrus related genomes [P. trifoliata Ks peak = 1.5; (Peng et al. 2020)]. The reference Vitis vinifera had the Ks peak at 1.18. Thus, this is a clear indication that the Australian Citrus species have undergone only one ancient whole-genome triplication (γ-WGT event), which was shared by all eudicots in their evolutionary history similar to other Citrus species, and there were no recent WGD events after the divergence from Asian Citrus. The reference V. vinifera has also undergone only one polyploidization event in its history (the γ-WGT event) (Almeida-Silva and Van de Peer 2023). The different Ks peaks for the same WGD event in different species are due to the diverged evolutionary rates of each species (Sun et al. 2022). To determine the divergence between Australian Citrus and Asian Citrus, the Ks were estimated between the orthologs of C. australis and C. sinensis and a peak at Ks = 0.032 was observed for this divergence event. The time that it happened was estimated to be 8 ~ 16 MYA with the assumption of at neutral substitution rate r to be 1 ~ 2 × 10− 9 in citrus, which is similar to that in Poplar (Wu et al. 2014) (Figure S3).
Conserved and unique gene families
To explore the unique gene families, the data was collected from the five main Australian native citrus taxa from this study, another representative citrus species (C. sinensis), and the related genus P. trifoliata, for which high quality genomes and chromosome-level annotations were available. 92.9% of the total genes were assigned to 27,262 orthogroups (Fig. 5b). In total, 15,536 gene families, containing 127,511 genes were shared among the seven species, which represents their core proteome. Of these species, C. glauca had the highest number of species-specific gene families (orthogroups) (1,099), encompassing 5,147 genes, and 4,579 singletons which were not assigned to orthogroups (Fig. 5b, Fig. 5c). The second highset number of unique gene families was identified in C. australasica with 401 gene families including 8,233 genes and 2,184 singletons. C. garrawayi, P. trifoliata and C. inodora had the lowest number of species-specific gene families in them (C. garrawayi – 55, P. trifoliata – 93, C. inodora – 94) (Figure S4).
All the species had unique genes enriched in two main pathways related to purine and thiamine metabolism (Table S9). Each species had less than 32 unique genes associated with these two pathways except for C. glauca which had 117 unique genes for purine metabolism and 96 unique genes for thiamine metabolism which is exceptionally high compared to the other species (Table S9). In the purine metabolism pathway, allantoinase (ALN) is a key enzyme which converts allantoin to allantoate (Kaur et al. 2023) (Figure S5). ALN is one of the unique genes identified in C. glauca, which is structurally different from the corresponding genes in the other native species. In C. glauca, the ALN CDS contains only four exons, whereas all the other species had 15 exons. The corresponding protein of C. glauca had 1148 amino acids (aa), whereas other species had 495 to 507 aa (C. inodora – 507 aa, C. garrawayi – 506 aa, C. australis – 495 aa, C. australasica – 507 aa). Other than purine and thiamine metabolism, C. glauca specific genes were enriched in biosynthesis of cofactors, pyruvate metabolism, glycine, serine and threonine metabolism, butanoate metabolism, glyoxylate and dicarboxylate metabolism and cysteine and methionine metabolism (Table S9). For C. glauca specific genes, the enriched GO terms included the molecular functions corresponding to transferase activity, hydrolase activity and oxidoreductase activity. The proteins encoded by unique genes in C. glauca were mostly present in the plasma membrane (Table S10).
In addition to purine and thiamine metabolism pathways, gene families unique to C. australasica were predominantly enriched in terpenoid backbone biosynthesis, monoterpenoid biosynthesis, starch and sucrose metabolism, glutathione metabolism, and the Toll-like receptor (TLR) signalling pathway (Table S9). The top enriched molecular functions included protein binding, and hydrolase activity, and many gene products were identified to be present in cell nucleus (Table S10). The top enriched pathways of unique genes in C. australis were associated with biosynthesis of cofactors, glycerolipid metabolism, tryptophan metabolism, phenylpropanoid biosynthesis, plant-pathogen interaction, starch, and sucrose metabolism (Table S9). The molecular functions were mainly related to hydrolase activity, protein binding and nucleic acid binding (Table S10). C. garrawayi had unique genes enriched in plant-pathogen interactions, phenylpropanoid biosynthesis, starch and sucrose metabolism, terpenoid backbone biosynthesis and monoterpenoid biosynthesis, while those of C. inodora corresponded to tryptophan metabolism and metabolism of xenobiotics by cytochromosome P450 in KEGG pathways (Table S9).
Gene family expansion and contraction
The likelihood approach for gene family evolution rates revealed the gene families which underwent significant expansion or contraction in a specific lineage when compared to the most recent common ancestor. The number of expanded gene families in C. australasica (405) and C. glauca (474) were higher than their contracted gene families, however, the reverse was true for C. inodora, C. australis and C. garrawayi (Fig. 5d). In total, 405 expanded gene families were identified in C. australasica with notable enrichment in pathways of plant-pathogen interaction, thiamine metabolism and purine metabolism (Table S11) and in the GO terms of transferase activity and hydrolase activity (Table S12). For C. australis, the expanded gene families were mostly enriched in plant-pathogen interaction, steroid hormone biosynthesis and tryptophan metabolism in KEGG pathways (Table S11) and protein binding, hydrolase activity in the GO terms (Table S12). In C. garrawayi, the top three significant pathways associated with expanded genes included phenylpropanoid biosynthesis, purine metabolism and thiamine metabolism (Table S11) and were enriched in protein binding and hydrolase activity in GO terms (Table S12). In C. glauca, the expanded gene families were mostly enriched in tryptophan, purine, and thiamine metabolism in KEGG pathways (Table S11) and nucleic acid binding and protein binding in GO terms (Table S12). For C. inodora, the topmost pathways associated with expanded gene families included steroid hormone biosynthesis, purine, tryptophan metabolism (Table S11) and GO terms included nucleic acid binding and transferase activity (Table S12).
The above results indicated that both C. australasica and C. australis had gene families with a relatively high number of gene copies for plant-pathogen interactions. Among the expanded gene families related to plant-pathogen interactions, 37 genes in C. australasica and 51 genes in C. australis were related to Disease Resistance Protein RPS2 (RPS2). C. garrawayi, C. glauca and C. inodora had 25 genes, 15 genes and 15 genes, respectively, related to RPS2 protein. In all the species, most of the RPS2 proteins contained nucleotide-binding site (NBS), and leucine-rich repeats (LRRs) or ARC domain. In addition to RPS2 protein, the other expanded genes in C. australasica were related to mitogen-activated protein kinase 1, 3-Ketoacyl-CoA Synthase (KCS) and Glycerol Kinase (GK). In C. australis, the other expanded genes were associated with mitogen-activated protein kinase 1, GK, LRR Receptor-Like Serine/Threonine-Protein Kinase (FLS2), and Senescence-Induced Receptor-Like Serine/Threonine-Protein Kinase (FRK1). In C. garrawayi, the other expanded genes were associated with KCS, FRK1 and Cyclic Nucleotide Gated Channel, Plant (CNGC). The expanded genes in C. glauca were associated with KCS, GK, FRK1, CNGC, Pathogenesis-Related Protein 1 (PR1) and Elongation Factor Tu (Tuf, TUFM).
The contracted gene families of C. australasica were mainly enriched in biological processes related to GO terms of signal transduction, protein phosphorylation and response to stress, and molecular functions related GO terms of ATP binding, and protein binding (Table S13). For C. australis, the GO terms included regulation of cellular process, cellular nitrogen compound biosynthetic process, protein modification process (biological processes) and transferase activity, protein binding (molecular functions). In C. garrawayi, most of the contracted genes were involved in biological processes including RNA metabolic process, regulation of cellular process, DNA integration and molecular functions such as protein binding and oxidoreductase activity. Those of C. glauca were mainly involved in protein phosphorylation and RNA, DNA metabolic processes and molecular functions such as nucleic acid binding and hydrolase activity. C. inodora had contracted gene families enriched in protein phosphorylation, multicellular organismal process and anatomical structural development as biological processes, and protein binding, hydrolase activity as molecular functions (Table S13).
Gene families related to immunity and abiotic stress tolerance
The total genes related to some important immunity related gene families / pathways and abiotic stress related genes / pathways were compared among the five main native taxa (Fig. 6). C. glauca had a high number of genes related to the abiotic stress related KEGG pathways of proline and arginine metabolism, tryptophan metabolism, phenylalanine, tyrosine and tryptophan biosynthesis, and glycine, serine, threonine metabolism. The number of genes for HSP proteins, calcium-dependent protein kinases and WRKY TFs were comparable among the species. Previous studies have reported that a cold-regulated gene LOW-TEMPERATURE-INDUCED 65 (LTI65), which is induced by C-repeat/DREB binding factor (CBF) genes, are positively selected in two mostly cold-tolerant species P. trifoliata and C. ichangensis (Peng et al. 2020). We identified the homologs of P. trifoliata LTI65, in Australian citrus and C. sinensis using Orthofinder. The gene tree of orthogroup OG0009805 revealed two gene duplication events, one shared by C. glauca and C. sinensis, and a second shared by C. australis, C. australasica, P. trifoliata, C. inodora, and C. garrawayi (Figure S6). C. glauca, which can survive under extreme cold temperatures was identified with a homolog of LTI65, which is g28904.t1, and this gene would be a good candidate for future research on its expression and evolutionary selection in response to cold tolerance. When considering immunity related genes, C. glauca had the highest number of genes for thiamine metabolism and glutathione metabolism pathways (Fig. 6). C. australasica had a relatively large number of genes for terpenoid backbone biosynthesis, monoterpenoids and diterpenoid biosynthesis pathways and disease resistant protein (RPS2). C. australis had a large number of genes encoding receptor like proteins (RLP) and, and C. garrawayi had a high number of genes related to diterpenoids, sesquiterpenoids and triterpenoid biosynthesis pathways and receptor like protein kinases (RLPK).
WRKY TF family
WRKY TFs are known to have pivotal roles, modulating abiotic and biotic stress tolerance in plants. The number of genes for WRKY TFs were comparable among the five main Australian native citrus species included in our study (Figure S7). The WRKY family was represented by 42 genes in C australasica, 48 genes in C. glauca, 48 genes in C. australis, 50 genes in C. inodora, and 42 genes in C. garrawayi. The WRKY genes were categorized into three main groups (G1, GII and GIII) based on the number of WRKY domains and the type of zinc-finger domains. Group I had two WRKY domains and C2H2-type zinc-finger domain. Group II had one WRKY domain and C2H2-type zinc-finger domain. Group III had one WRKY domain and C2HC-type zinc-finger domain. Group II was further categorized into five sub-groups, based on the type of zinc-finger structure and their phylogenetic position. Group IIa consisted of C-X6-C-X23-H-X-H zinc-finger structure, with X being an arbitrary amino acid). Group IIb consisted of C-X5-C-X23-H-X-H. Group IIc contained C-X4-C-X23-H-X-H. Group IId contained C-X5-C-X23-H-X-H and Group IIe consisted of C-X4-C-X23-H-X-H (Figure S7). All the genes had the conserved WRKYGQK domain except for ten genes within Group IIe, which had WRKYGKK domain.
The duplication events of WRKY genes were analysed to reveal the expansions of WRKY TF family in each species. Among the WRKY TF genes, 14 gene pairs in C. glauca (58.3%), 11 gene pairs in C. australis (45%), C. inodora (44%) and C. garrawayi (52.3%) and 10 gene pairs in C. australasica (47.6%) were involved in duplication events (Table S14). All the duplicated genes were derived from WGD / segmental duplication events. To further understand the adaptive evolution of WRKY genes, the Ka/Ks (non-synonymous/synonymous) ratios of all gene pairs were calculated. All of them had Ka/Ks < 1, indicating that the purifying selection had acted on the evolution of WRKY genes and some duplicated gene pairs had no Ka/Ks values, which might be due to the low sequence divergence (Table S14).