Characteristics of Engelhardia plastomes
The lengths of the complete plastomes of Engelhardia were slightly different, ranging from 161,069 bp to 162,336 bp, exhibiting a quadripartite structure with a large singlecopy (LSC) region (89,927–91,637 bp), dual inverted repeat (IR) regions (25,813–26,016 bp), and a small single-copy (SSC) region (18,790–19,203 bp) (Fig. 1; Table 2). There were a total of 134 genes were identified in the newly sequenced plastomes, including 88 protein-coding genes (CDS), 2 pseudogenes (Ψycf1、Ψrps19), 37 transfer RNA (tRNA) genes and 8 ribosomal (rRNA) genes (Table 2). The ycf1 in the IRb region (Ψycf1) and the rps19 in IRa region (Ψrps19) of all Engelhardia species were identified as pseudogenes (Table S2). Among these genes, there were 18 intron-containing genes, of which three genes rps12, clpP and ycf3, had two introns, and the rest contained a single intron (trnA-UGC, trnG-UCC, trnI-GAU, trnK-UUU, trnL-UAA, trnV-UAC, rpl2, rpl16, rps16, rpoC1, atpF, ndhA, ndhB, petB, and petD) (Table S2). These newly-generated Engelhardia plastomes were deposited in GenBank (Assession Number showed in Table 1).
Table 2
Complete features of 13 newly assembled Engelhardia plastomes and one Rhoiptelea chiliantha plastome.
Species | Total (bp) | LSC (bp) | SSC (bp) | IR (bp) | CDS (bp) | Total GC content (%) | Total genes | CDS | Pseudo | tRNA genes | rRNA genes |
E. Arminian | 162,088 | 91,302 | 19,054 | 25,866 | 78,969 | 35.9 | 134 | 88 | 2 | 37 | 8 |
E. fenzelii_JNSX01 | 161,069 | 89,927 | 19,112 | 26,015 | 78,867 | 36.0 | 134 | 88 | 2 | 37 | 8 |
E. fenzelii_TTD01 | 161,105 | 89,964 | 19,111 | 26,015 | 78,867 | 36.0 | 134 | 88 | 2 | 37 | 8 |
E. hainanensis_02 | 161,574 | 91,157 | 18,791 | 25,813 | 79,137 | 35.8 | 134 | 88 | 2 | 37 | 8 |
E. hainanensis_HN01 | 161,572 | 91,156 | 18,790 | 25,813 | 79,137 | 35.8 | 134 | 88 | 2 | 37 | 8 |
E. roxburghiana_BPZ11 | 161,713 | 90,478 | 19,203 | 26,016 | 78,906 | 35.9 | 134 | 88 | 2 | 37 | 8 |
E. roxburghiana_JFL02 | 161,511 | 90,373 | 19,106 | 26,016 | 78,915 | 35.9 | 134 | 88 | 2 | 37 | 8 |
E. roxburghiana_TPS06 | 161,713 | 90,478 | 19,203 | 26,016 | 78,906 | 35.9 | 134 | 88 | 2 | 37 | 8 |
E. roxburghiana_XSBN01 | 161,667 | 90,448 | 19,187 | 26,016 | 78,906 | 35.9 | 134 | 88 | 2 | 37 | 8 |
E. serrata | 162,161 | 91,313 | 19,092 | 25,878 | 78,852 | 35.9 | 134 | 88 | 2 | 37 | 8 |
E. spicata | 161,551 | 90,937 | 18,890 | 25,862 | 79,113 | 35.8 | 134 | 88 | 2 | 37 | 8 |
E. spicata var. rigida | 161,520 | 90,951 | 18,871 | 25,849 | 79,122 | 35.9 | 134 | 88 | 2 | 37 | 8 |
E. villosa | 162,336 | 91,637 | 19,043 | 25,828 | 79,143 | 35.8 | 134 | 88 | 2 | 37 | 8 |
R. chiliantha_MWS2 | 161,702 | 90,447 | 19,081 | 26,087 | 79,182 | 36.1 | 133 | 88 | 1 | 37 | 8 |
The overall GC content of Engelhardia plastomes was 35.8%-36.0% (Table S3), and the GC contents of coding sequence (CDS) regions was 37.2%-37.3%. We found that the GC content of LSC (33.2–33.6%) and SSC (29.3–29.6%) region were lower than those of IR regions (42.6–42.7%) (Table S3).
Comparative analysis of Engelhardia plastomes
Multiple plastomes comparison among all the Engelhardia species using mVISTA and Mauve alignment showed high degree of collinearity. It was found that the composition and sequence of genes in Engelhardia were highly consistent, and no inversion or translocation of DNA fragment rearrangement was detected in the sequences, which is consistent with the rarity of recombination in plant plastomes [25] (Fig. S2). The regions with relatively low identity were rps16_trnQ-UUG, trnS-GCU_trnG-UCC, trnT-GGU_psbD, trnF-GAA_ndhJ, ndhK_ndhC, accD_psaI, petA_psbJ, and ndhF_trnL-UAG (Fig. S1). Most of the DNA sequence variation in Engelhardia species occurred in non-coding regions such as gene spacer region and gene intron region, and the sequence differentiation between LSC and SSC region was significantly higher than that in IR region (Figs. S1-2).
By analyzing the boundary differences of LSC, SSC, IRa and IRb sequences in plastomes of Engelhardia, it was found the inner boundary differences were small. There was no large regional expansion and shortening of spacer regions occurred, which was consistent with the conserved character of platomes within this genus (Fig. 2). The ycf1 gene in all species spanned the SSC/IRa region, with the length of ycf1 in SSC is 4623 bp-4729 bp, and that in IRa is 1004 bp-1104 bp. The pseudogene (Ψycf1) was formed at the corresponding position near the IRb/SSC boundary, and the extension of the short Ψycf1 fragment into the SSC region was observed in all Engelhardia species. Similar to most plants, the ndhF gene involved in photosynthesis was located in the SSC region. The overlapping of Ψycf1 and ndhF has only been detected in E. anminiana, E. spicata and E. villosa. The rps19 gene spanned the LSC/IRb regions in all the Engelhardia species, and it formed a pseudogene (Ψrps19) at the IRa/LSC boundary (Fig. 2).
Repetitive sequences in Engelhardia plastomes
Plastid genome repeats include dispersed repeats and tandem repeats. Dispersed repeats are further divided into four types: forward, reverse, complement and palindromic repeats. REPuter software identified 2,368 repeated sequences, including 24–47 forward repeats, 7–16 reverse repeats, 21–31 palindromic repeats, 1–4 complement repeats, and 89–163 tandem repeats, in the 13 Engelhardia plastomes (Table S4-5, Fig. 3). Most of the repetitive sequences existed in non-coding regions such as IGS and introns (Tables S4–5). Overall, tandem repeats were more prevalent in Engelhardia, accounting for about 60.52% of all repeat types. On the contrary, the complement repeats were relatively small, accounting for 1.01% (Table S5).
Simple sequence repeats (SSR) have been selected as ideal genetic markers in plant molecular studies due to their high variability [26]. In this study, statistical analysis of SSR was performed by MISA online software, and a total of 1530 SSR loci were detected in the 13 Engelhardia plastomes. The total number of SSRs varied little among individuals, ranging from 111 (E. roxburghiana_JFL02) to 127 (E. villosa). The majority of these plastid SSR (ptSSR) were mono-nucleotide repeats, accounting for 71.24% of all SSR, followed by di- (13.07%), tri- (5.69%), tetra- (4.97%) and trinucleotide repeats (4.64%), while hexanucleotide repeats was the least, accounting for only 0.39% (Table S5, Fig. 3). The A/T type mononucleotide was the most abundant SSR, accounting for 98.44%, and only 17 G/C single nucleotide repeats were detected, which also resulted in the enrichment of A and T in the plastomes. Most SSRs were located in the LSC region (72.88%), and a smaller percentage of SSRs were distributed in the SSC (19.67%) and IR (7.45%) regions, respectively. Furthermore, most of the SSRs (87.84%) were distributed in the IGS and introns, while only 12.16% in the coding sequences (Table S4-5, Fig. 4).
Comparative analysis of genomic variation in the Carya, Engelhardia, and Juglans plastomes
Comparative analysis of nucleotide polymorphisms of Carya, Engelhardia, and Juglans plastomes, the variability in Engelhardia is higher than that of Carya and Juglans (Fig. 5). There were eighteen hypervariable regions in Engelhardia with Pi > 0.010 were trnH-trnK, trnK-rps16, rps16-psbK, trnG-atpI, rpoB-trnT, trnT-psbD, psbC-trnM, rps4-trnT, trnL-ndhJ, ndhC-trnV, petA-psbJ, psbE-rpl33, rps11-rps8, rps3-rpl2, trnN-ndhF, ndhF-ccsA, ndhA and ndhH-ycf1, while there were only seven and eight hypervariable regions in Carya and Juglans, respectively. Among them, trnK-rps16, ndhF-rpl32, and ycf1 were common high variation hotspots in these three genera (Fig. 5).
Using R. chiliantha as a reference, we characterized genomic variations including single nucleotide variants (SNVs), insertions and deletions (InDels) in the plasomes of Juglandoideae and Engelhardioideae and found that they were very different among different species (Table S6a). A total of 115213 SNVs, 9502 insertions (1–274 bp) and 10428 deletions (1–2,468 bp) were identified in all the collected species (Table S6d). The number of SNVs, deletions and insertions per kb varied at the plastid genome level, with the average values of 15.03, 1.36 and 1.24 in Juglandaceae, 11.84, 1.04 and 0.93 in Carya, 17.48, 1.48 and 1.64 in Engelhardia, and 17.20, 1.70 and 1.28 in Juglans, respectively. In terms of these three types of genomic variation, the IR regions presented the fewest numbers per kb, with the average values of 1.71, 0.15 and 0.09 in Juglandaceae, 1.81, 0.13 and 0.11 in Carya, 1.66, 0.20 and 0.10 in Engelhardia, and 1.70, 0.14 and 0.05 in Juglans, respectively. The LSC regions exhibited the maximum numbers of SNVs, deletions and insertions per kb with the average values of 9.04, 0.97 and 0.93 in Juglandaceae, 6.17, 0.70 and 0.63 in Carya, 10.62, 1.01 and 1.23 in Engelhardia being, and 11.30, 1.30 and 1.04 in Juglans (Table S6b). These results collectively indicate that the IR regions were more conserved than the single-copy regions.
All genomic structural variations were mapped onto a phylogenetic tree constructed based on the plastid genome, with very different times of insertion events and deletion events occurring in Carya (insertion events: 132–199 times; deletion events: 135–236 times), Engelhardia (186–364; 155–311), and Juglans (149–230; 192–331). It can be concluded that the structural variation of Carya was less than that of Engelhardia and Juglans. The range of structural variation among Engelhardia species was relatively large, especially in E. serrata and E. villosa, with 329 insertions and 306 deletions in E. serrata, and 364 insertions and 311 deletions in E. villosa (Fig. S3).
The corresponding genomic positions of these identified InDels were mapped and located into these plastomes of Juglandoideae and Engelhardioideae. It can be found that 90% of InDels were found in intronic (35%) and intergenic regions (55%) for Juglandaceae, 92% of InDels were found in intronic (43%) and intergenic regions (49%) for Carya, 88% of InDels were found in intronic (33%) and intergenic regions (55%) for Engelhardia, and 91% of InDels were found in intronic (31%) and intergenic regions 60%) for Juglans (Table S6c; Fig. 6).
Codon usage analysis of Juglandaceae plastomes
Comparing the coding sequences of 50 genes with a length over 300 bp of Juglandaceae plastomes, it was found that there were two codons with RSCU value of 1, which are AUG and UGG encoding methionine (Met) and tryptophan (Trp), indicating that the codon usage of two amino acids doesn’t show bias. There were 29 codons with RSCU > 1, of which 16 end with U and 12 end with A, which was the same in Engelhardioideae, Juglandoideae and Rhoipteleoideae. It indicates that the codons ending with U or A are the preferred codons in the plastomes of these three subfamilies. The use frequency and RSCU value of each amino acid in the collected accessions were shown in the Table S7a, Figure S4. There was no significant difference in the codon bias of the genes of most Juglandaceae plastomes. However, the A/T content of the third base in the coding sequence is significantly higher than the G/C content, T3s (0.4748–0.4782) > A3s (0.4399–0.4438) > G3s (0.1695–0.1722) > C3s (0.1613–0.1649). We found that Carya ovata and Carya palmeri had the highest values of ENC, GC3s and GC, while Platycarya strobilacea had the lowest values. No significant differences in intra-genus codon preferences were detected among in these three subfamilies (Table S7b).
The codon usage pattern parameters, ENC, Fop, CBI and CAI of the coding genes of three subfamilies were further calculated and plotted. The CAI values were between 0.09 and 0.31, psbA, rbcL and psbD with the highest CAI value, and rpl20, rpl18 and rps8 with the lowest one. Most of the CBI values ranged from − 0.23 to 0.23, the highest were psbA, psbD and rbcL, and the lowest were ndhF, ndhG and rps14. Most of the Fop values were between 0.26–0.55, the highest were psbA, psbD and rbcL, and the lowest were ndhG, ndhF and petD. Most of the ENC value were concentrated between 35.71and 60.6, the highest were ycf3, ycf2 and rpl2, and the lowest were rps18, petD and rps14 (Table S7c).The highly expressed genes in the plastid genome of three subfamilies were ycf2, rpoC1 and rpoC2, and the low expressed genes are rps18, petD and rps14 (Table S7d). Combined with the 29 high-frequency codons with RSCU value > 1 in Table S4a, 10 common optimal codons were finally determined, which were CUU, GUU, UCU, UCA, CCU, CCA, GCU, AAU, CGA, GGA, and all end with A or U (Table S7d).
There was a positive correlation between the codon preference index (CBI) and the optimal codon usage frequency (Fop), and the highest correlation coefficient of 0.97. The correlation coefficients between CAI and CBI and between CAI and Fop were also higher, which were 0.72 and 0.76, respectively, showing a positive correlation. In addition, there were negative correlation between T3s/C3s, T3s/A3s, T3s/G3s, T3s/GC3s, T3s/GC, C3s/A3s, C3s/G3s, A3s/G3s, A3s/CAI, A3s/CBI, A3s/Fop, A3s/ENC, A3s/GC3s, A3s/GC, G3s/CAI, G3s/CBI, G3s/Fop, CAI/GC, etc. Among them, A3s/CAI has the highest degree of negative correlation, with a correlation coefficient of -0.57 (Fig. S5). The results of three subfamilies were similar to the whole Juglandaceae family, with the highest correlation coefficients being CBI and Fop, followed by the correlation coefficients between CAI and CBI, as well as between CAI and Fop (Fig. S5). The ENC values were positively correlated with T3s, C3s, G3s and GC3s. However, ENC values were negatively correlated with A3s. ENC values can be used to determine the relative expression levels of genes [27], and our results showed that the content of the third base of synonymous codons is closely related to gene expression levels, and T3s, C3s and G3s are positively correlated with gene expression, while A3s was negatively correlated with gene expression (Table S7e, Fig. S5).
ENC values for all screened gene coding sequences ranged from 35.71 to 60.6. The ENC frequencies were calculated using the formula (ENCexp-ENCobs)/ENCexp, ranging from − 0.25 to 0.28. There were 2051 ENC frequencies in the range of -0.1 to 0.1, which were close to the expected ENC values (Table S7f). Based on the standard curve formula ENC = 2 + GC3 + 29/[GC32+(1 − GC3)2], we took ENC as the ordinate and GC3s as the abscissa to draw a scatter plot (Fig. 7). It can be seen that most of the genes are located on or near the standard curve (Fig. 7A), indicating that the expected ENC value and the actual value of most genes is nearly uniform, and the mutation has a greater impact on the codons bias of coding gene. However, we also found the observed ENC values of six genes (rpl16, rps18, cemA, psbA, rps14, and ycf3) in all species deviated significantly from the standard curve (Fig. 7A, B). Among all genes, ycf3 showed the highest ENC value, while rps18 and rpl16 showed the lowest ENC value (Fig. 7B; Table S7f).
PR2-plot is used to analyze the composition of the four bases at the third position of the codon encoding the amino acid, plotting with G3/(G3 + C3) and A3/(A3 + T3) as the horizontal and vertical coordinates. When the four bases are evenly distributed, codon bias is not affected by mutation and natural selection [28]. The results showed that A/T and G/C (pyrimidine versus purine) were used slightly differently at the third codon position in the protein-coding sequence of Juglandaceae (Fig. 7C). The PR2-plot shows that there is a slight imbalance in the use of A/T and G/C at the third codon of the CDSs of 36 Juglandaceae plasomes, especially the four CDSs (psbA, rpl20, rpl16 and rps8) (Fig. 7C).The number of genes in the third and fourth quadrants is more than that in the first and second quadrants, and the number of genes distributed in the fourth quadrant is greater than the number of genes distributed in the other three quadrants, so G and T are used most frequently (Fig. 7C).
Selective pressure analysis of CDS in Juglandaceae
To analyze the evolutionary pressure among the protein coding sequences of 8 Juglandaceae species, the Ka/Ks value of 80 protein coding sequences (CDS) were calculated. The results showed that the Ka/Ks values of 78 genes was almost all less than 1, indicating that most genes were subjected to purifying selection (Fig. 8C, Table S8a). Only rpl22 and psaI have Ka/Ks > 1, indicating that positive selection happened during plastome evolution in Juglandaceae and three subfamilies. We also found that rps16 was only subjected to positive selection in Juglandaceae and Engelhardioideae. For all Juglandaceae samples, the Ka/Ks values of photosynthesis-related genes were significantly lower than those of self-replication-related and other genes, indicating stronger purification selection (Fig. 8A, Table S8b). For functional classification genes, except for differences in photosynthesis related genes between Engelhardioideae and Juglandoideae, other Ka/Ks values showed no significant differences (Fig. 8C, Table S8c).
Phylogenetic analysis of Juglandaceae
In this study, Quercus rubra (Fagaceae) was used as the outgroup, ML and BI trees of Juglandaceae species based on the complete plastome (excluding one copy of the inverted repeat) showed nearly identical topologies (Fig. S6). The phylogenomic results indicated that Juglandaceae family was mainly divided into three groups, including Juglandoideae, Engelhardioideae and Rhoipteleoideae subfamilies, with a very high support rate (BS = 100%, PP = 1). The phylogenomic tree further supported 7 major branches, corresponding to 7 genera, namely the monophyletic Carya, Juglans, Pterocarya, Cyclocarya, Platycarya, Engelhardia and Rhoiptelea.
There were two main clades in Juglandoideae, clade I was Carya, clade II were Juglans, Pterocarya, Cyclocarya and Platycarya. In the ML tree, the support rate inside clade I (BS = 63–100%) was lower than that of clade II (BS = 66–100%). The species of Carya were divided into two sects, C. hunanensis, C. kweichowensis, C. sinensis, C. polianei, C. tonkinensis, and C. cathayensis were grouped together, while the remaining 12 species were grouped together. The Juglans were divided into three sects, namely the Sect. Juglans or Dioscaryon, Sect. Cardiocaryon, and the Sect. Rhysocaryon. The Sect. Juglans/Dioscaryon included J. regia and J. sigillata, while the Sect. Cardiocaryon included J. mandshurica, J. ailanthifolia and J. hopeiensis, the Sect. Rhysocaryon included J. cinerea, J. nigra, J. hindsii, J. major and J. microcarpa. The Pterocarya were divided into two sects, one included P. fraxinifolia, P. stenoptera and P. hupehensis, the other included P. macroptera var. insignis and P. tonkinensis. Cyclocarya paliurus was a single species of Cyclocarya, which was closely related to Pterocarya according to the phylogenetic relationships.
The species of the Engelhardioideae were closely related and were further divided into two main clades, which is consistent with the Sect. Engelhardia (Clade I) and Sect. Psilocarpeae (Clade II), with a very high support rate (BS = 100%, PP = 1). The Clade I included E. spicata, E. spicata var. rigida, E. hainanensis, E. serrata, E. anminiana, and E. villosa. The clade II included E. roxburghiana and E. fenzelii, which were sister species. The Rhoipteleoideae subfamily only included R. chiliantha, a single genus and single species.
The divergence time and historical diversification of Juglandaceae
By using multiple fossil correction points to estimate the differentiation time of the Juglandaceae family, the results showed that the crown nodes of Juglandaceae were approximately 97.69 Mya (95% highest posterior density (HPD): 95.49 Mya–100.58 Mya), which differentiated from Myricaceae in the Early Cretaceous period (Fig. 9). The three subfamilies, namely Rhoiptelioideae, Engelhardioideae and Juglandoideae, successively diverged at 89.28 Mya (95% HPD: 85.6–92.96 Mya; Middle Cretaceous) and 73.59 Mya (95% HPD: 69.01–78.13 Mya; Late Cretaceous) (Fig. 9).
Most genera of Juglandaceae were diverged between 46.20 and 73.59 Mya. The differentiation time of the two clades in the Engelhardoideae subfamily was approximately between 27.64 and 46.11 Mya, mainly occurring from the Early Eocene to the Middle Oligocene. The phylogeny and divergence times didn’t support rapid radiation occurred in the evolution history of Engelhardia. In the Juglandoideae subfamily, the crown age of Carya was estimated at 64.98 Mya (95% HPD: 60.49–69.70 Mya), Platycarya at 60.51 Mya (95% HPD: 56.32–64.91 Mya) during the Late Paleocene, and Cyclocarya paliurus at 54.10 Mya (95% HPD: 50.84–57.41 Mya). The divergence of Pterocarya and Juglans was estimated at 46.29 Mya (95% HPD: 43.43–49.63 Mya) during the Middle Eocene. Most genera of Juglandoideae subfamily were diverged between 46.29 and 64.98 Mya in the relatively warm and dry climate of the Middle Paleocene to the Early Eocene (Fig. 9).