Chloroplast genome characterization of M. baccata var. gracilis
The complete chloroplast genome of M. baccata var. gracilis was obtained by assembly and splicing, giving a total sequence length of 159,992 bp (Fig. 1) and an overall GC content of 36.56%. The genome has a quadripartite structure consisting of a large single-copy region (LSC), a small single-copy region (SSC), and a pair of inverted repeats (IRs; IRa and IRb) (Table 1). The lengths of the four regions were as follows: LSC, 88,100 bp (positions 1 to 88,100); IRb, 26,353 bp (positions 88,101 to 114,453); SSC, 19,186 bp (positions 114,454 to 133,639); and IRa, 26,353 bp (positions 133,640 to 159,992). The GC content of the IRs is 42.70%, higher than that of the whole chloroplast genome (36.56%).
Table 1
Regions of the M. baccata var. gracilis chloroplast genome
Region name | Start | End | Length (bp) | GC (%)
LSC | 1 | 88,100 | 88,100 | 34.23
IRa | 88,101 | 114,453 | 26,353 | 42.70
SSC | 114,454 | 133,639 | 19,186 | 30.40
IRb | 133,640 | 159,992 | 26,353 | 42.70
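The region coordinates can be cross-checked with a few lines of Python. This is only a sketch: the function names are illustrative, the two IR copies are labelled generically, and `gc_percent` assumes a plain string of bases rather than any particular file format.

```python
# Region coordinates (1-based, inclusive) as reported for the
# M. baccata var. gracilis chloroplast genome.
REGIONS = [
    ("LSC", 1, 88_100),
    ("IR copy 1", 88_101, 114_453),
    ("SSC", 114_454, 133_639),
    ("IR copy 2", 133_640, 159_992),
]
GENOME_LENGTH = 159_992


def region_lengths(regions):
    """Return (name, length) pairs for 1-based inclusive coordinates."""
    return [(name, end - start + 1) for name, start, end in regions]


def is_contiguous(regions, genome_length):
    """True if the regions tile positions 1..genome_length without gaps or overlaps."""
    expected_start = 1
    for _, start, end in regions:
        if start != expected_start or end < start:
            return False
        expected_start = end + 1
    return expected_start - 1 == genome_length


def gc_percent(seq):
    """GC content of a nucleotide string, in percent."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    for name, length in region_lengths(REGIONS):
        print(f"{name}: {length} bp")
    print("contiguous:", is_contiguous(REGIONS, GENOME_LENGTH))
```

The four lengths (88,100 + 26,353 + 19,186 + 26,353) sum to the reported genome length of 159,992 bp.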
Annotation of the M. baccata var. gracilis chloroplast genome yielded 112 unique genes (Table 2): 79 protein-coding genes (nine present in multiple copies), 29 tRNA genes (eight present in multiple copies), and four rRNA genes (all four present in multiple copies). The protein-coding genes belong to 15 gene families. The genome contains 11 genes encoding NADH dehydrogenase subunits, five genes encoding photosystem I subunits, 15 genes encoding photosystem II subunits, six genes encoding cytochrome b/f complex subunits, six genes encoding ATP synthase subunits, one gene encoding the large subunit of ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL), four genes encoding DNA-dependent RNA polymerase subunits, nine genes encoding ribosomal large subunit proteins, 12 genes encoding ribosomal small subunit proteins, one gene encoding a maturase (matK), one gene encoding a c-type cytochrome synthesis protein (ccsA), one gene encoding an envelope membrane protein (cemA), one gene encoding a protease (clpP), one gene encoding a subunit of acetyl-CoA carboxylase (accD), and five genes encoding conserved open reading frames (ycf1, ycf2, ycf3, ycf4, ycf15).
Table 2
Coding genes of chloroplast genome in M. baccata var. gracilis
Groups of genes | Names of genes
Subunits of NADH dehydrogenase | ndhA, ndhB(×2), ndhC, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ, ndhK
Subunits of photosystem I | psaA, psaB, psaC, psaI, psaJ
Subunits of photosystem II | psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbL, psbM, psbN, psbT, psbZ
Subunits of cytochrome b/f complex | petA, petB, petD, petG, petL, petN
Subunits of ATP synthase | atpA, atpB, atpE, atpF, atpH, atpI
Large subunit of rubisco | rbcL
Small subunit of ribosome | rps2, rps3, rps4, rps7(×2), rps8, rps11, rps12(×2), rps14, rps15, rps16, rps18, rps19(×2)
Large subunit of ribosome | rpl2(×2), rpl14, rpl16, rpl20, rpl22, rpl23(×2), rpl32, rpl33, rpl36
DNA-dependent RNA polymerase | rpoA, rpoB, rpoC1, rpoC2
rRNA genes | rrn4.5S(×2), rrn5S(×2), rrn16S(×2), rrn23S(×2)
tRNA genes | trnA-UGC(×2), trnC-GCA, trnD-GUC, trnE-UUC, trnF-GAA, trnfM-CAU, trnG-GCC, trnG-UCC, trnH-GUG, trnI-CAU(×2), trnI-GAU(×2), trnK-UUU, trnL-CAA(×2), trnL-UAA, trnL-UAG, trnM-CAU, trnN-GUU(×2), trnP-UGG, trnQ-UUG, trnR-ACG(×2), trnR-UCU, trnS-GCU(×2), trnS-UGA, trnT-GGU, trnT-UGU, trnV-GAC(×2), trnV-UAC, trnW-CCA, trnY-GUA
Maturase | matK
c-type cytochrome synthesis gene | ccsA
Envelope membrane protein | cemA
Protease | clpP
Subunit of acetyl-CoA carboxylase | accD
Genes of unknown function (open reading frames) | ycf1(×2), ycf2(×2), ycf3, ycf4, ycf15(×2)
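The gene tallies in the text can be checked against each other with a short script. This is a bookkeeping sketch only: the dictionary keys are shorthand labels and `tally` is a hypothetical helper. The per-family counts are the ones stated in the annotation summary.

```python
# Unique protein-coding genes per family, as stated in the annotation summary.
PROTEIN_FAMILIES = {
    "NADH dehydrogenase": 11,
    "photosystem I": 5,
    "photosystem II": 15,
    "cytochrome b/f complex": 6,
    "ATP synthase": 6,
    "rubisco large subunit": 1,
    "RNA polymerase": 4,
    "ribosome large subunit": 9,
    "ribosome small subunit": 12,
    "maturase": 1,
    "c-type cytochrome synthesis": 1,
    "envelope membrane protein": 1,
    "protease": 1,
    "acetyl-CoA carboxylase": 1,
    "conserved ORFs (ycf)": 5,
}
TRNA_GENES = 29
RRNA_GENES = 4

# Protein-coding genes marked (×2) in Table 2.
MULTICOPY_PROTEIN = ["ndhB", "rps7", "rps12", "rps19", "rpl2",
                     "rpl23", "ycf1", "ycf2", "ycf15"]


def tally(families, trna, rrna):
    """Summarise family count, protein-coding genes, and total unique genes."""
    protein = sum(families.values())
    return {
        "families": len(families),
        "protein_coding": protein,
        "total": protein + trna + rrna,
    }


if __name__ == "__main__":
    print(tally(PROTEIN_FAMILIES, TRNA_GENES, RRNA_GENES))
    print("multi-copy protein-coding genes:", len(MULTICOPY_PROTEIN))
```

The family counts add up to 79 protein-coding genes in 15 families, and 79 + 29 + 4 gives the reported total of 112.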
Simple repeat sequences
Simple sequence repeats (SSRs), or microsatellites, are DNA sequences consisting of short, tandemly repeated mono-, di-, tri-, tetra-, penta-, or hexanucleotide motifs. As shown in Fig. 2, a total of 93 SSR loci were detected in the chloroplast genome of M. baccata var. gracilis, of which 68 were mononucleotide repeats, 18 were dinucleotide repeats, and seven had longer motifs. Mono- and dinucleotide SSRs together constituted 92.47% of all SSRs. Thymine (T) repeats made up 57.35% of the 68 mononucleotide SSRs, and AT repeats were the predominant dinucleotide SSRs, accounting for 94.44% of them. Dispersed repeats in the chloroplast genome were also analyzed. A total of 55 pairs of repeats, each at least 30 bp in length, were identified. Among these pairs, 22 were palindromic repeats, 27 were forward repeats, one was a reverse repeat, and none were complementary repeats. The longest palindromic repeat was 63 bp, the longest forward repeat was 40 bp, and the longest reverse repeat was 45 bp.
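The text does not state which tool or thresholds were used to detect SSRs. As a minimal sketch, perfect SSRs could be located as below. The minimum repeat counts (10, 5, 4, 3, 3, 3 for mono- to hexanucleotide motifs) are an assumption, not values taken from this study, and `find_ssrs` is an illustrative name.

```python
from collections import Counter

# Assumed minimum repeat counts per motif length (not taken from the study).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return all(motif != motif[:k] * (n // k) for k in range(1, n) if n % k == 0)


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, end, motif, repeats) for perfect SSRs, 1-based inclusive."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        i = 0
        while i + unit_len <= len(seq):
            motif = seq[i:i + unit_len]
            if set(motif) - set("ACGT") or not is_primitive(motif):
                i += 1
                continue
            repeats = 1
            while seq[i + repeats * unit_len:i + (repeats + 1) * unit_len] == motif:
                repeats += 1
            if repeats >= min_rep:
                hits.append((i + 1, i + repeats * unit_len, motif, repeats))
                i += repeats * unit_len
            else:
                i += 1
    return sorted(hits)


def count_by_motif_length(hits):
    """Count SSRs per motif length (1 = mononucleotide, 2 = dinucleotide, ...)."""
    return Counter(len(motif) for _, _, motif, _ in hits)


if __name__ == "__main__":
    toy = "GC" + "T" * 12 + "GCA" + "AT" * 6 + "CG"
    for hit in find_ssrs(toy):
        print(hit)
```

Skipping non-primitive motifs prevents a homopolymer run such as T×12 from also being reported as a TT or TTT repeat.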
Codon preference analysis of M. baccata var. gracilis
A codon with a relative synonymous codon usage (RSCU) value greater than 1 is used more often than expected if all synonymous codons for that amino acid were used equally, and is therefore considered preferred. According to Fig. 3, the chloroplast coding sequences of M. baccata var. gracilis contained 32 codons with RSCU values greater than 1, including UUA (Leu), GCU (Ala), AGA (Arg), ACU (Thr), and UAU (Tyr).
Leucine (Leu) was the most frequent amino acid, followed by arginine (Arg) and serine (Ser). For leucine, the UUA codon was preferred, with an RSCU value of 2.04. The preferred codons for arginine and serine were AGA and UCU, respectively, both of which end in A or U at the third position. UAA was the preferred termination codon.
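RSCU is the observed count of a codon divided by the mean count of all synonymous codons for the same amino acid. The sketch below computes it from a list of coding sequences. It assumes the standard genetic code and in-frame input, and the function name `rscu` is illustrative rather than the tool used in this study.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered U, C, A, G at each position.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """Return {codon: RSCU} for a list of in-frame coding sequences (DNA or RNA)."""
    counts = Counter()
    for cds in cds_list:
        rna = cds.upper().replace("T", "U")
        usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = rna[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    # Group synonymous codons by the amino acid (or stop, "*") they encode.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        for c in codons:
            # observed count / (family total / number of synonymous codons)
            values[c] = counts[c] * len(codons) / total
    return values


if __name__ == "__main__":
    toy = ["TTATTATTACTT"]  # three UUA codons and one CUU codon (all Leu)
    for codon, value in sorted(rscu(toy).items()):
        print(codon, round(value, 2))
```

In the toy input, leucine has six synonymous codons and four observations, so UUA scores 3 × 6 / 4 = 4.5 and CUU scores 1.5. The unused leucine codons score 0.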
Chloroplast genome alignment of M. baccata var. gracilis and other Malus species
To assess the similarities and differences among the chloroplast genomes of different Malus species, we globally aligned the newly assembled sequence with the chloroplast genome sequences of five Malus species (Fig. 4). The chloroplast sequence of M. baccata var. gracilis showed good collinearity with those of the five Malus species. It was more similar to the sequences of M. toringoides, M. hupehensis, and M. floribunda, and less similar to those of M. ioensis and M. yunnanensis.
Analysis of variation hotspots in the cpDNA of M. baccata var. gracilis and other Malus species
To find variation hotspots in the M. baccata var. gracilis chloroplast genome, the cpDNA sequences of ten Malus species (including M. baccata var. gracilis) were compared (Fig. 5). The sequences were largely consistent across species, but some regions differed substantially. Most of the highly variable regions were located in the LSC region, including psbA-trnH(GUG), rps16-trnK(UUU), trnR(UCU)-atpA, trnT(GGU)-psbD, psbZ-trnG(GCC), trnT(UGU)-rps4, rps14-trnM(CAU), trnV(UAC)-ndhC, accD-psaI, psaJ-rpl33, rpl14-rps8, and rps3-rpl16. The hotspots were predominantly located in non-coding regions, whereas the coding regions were more conserved. These regions may therefore account for much of the sequence variation among species.
To further assess DNA polymorphism, the chloroplast genome of M. baccata var. gracilis was analyzed together with those of the other Malus species using DnaSP v6.0 (Fig. 6). Four highly variable regions (Pi ≥ 0.008) were identified: trnR(UCU)-atpA (Pi = 0.02017), trnT(GGU)-psbD (Pi = 0.0105), trnT(UGU)-trnL(UAA) (Pi = 0.008), and ndhF-rpl32 (Pi = 0.00817). The Pi values in the LSC and SSC regions were considerably higher than those in the IRs. These highly variable regions may be useful for species identification, molecular marker development, and studies of species evolution.
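Nucleotide diversity (Pi) is the average number of pairwise differences per site among the aligned sequences. The sketch below illustrates the calculation in sliding windows. It is not DnaSP's implementation: the window and step sizes are hypothetical parameters, and columns containing gaps or ambiguous bases are simply skipped.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Average pairwise differences per site across aligned sequences."""
    seqs = [s.upper() for s in aligned]
    if len(seqs) < 2:
        raise ValueError("need at least two aligned sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    # Keep only columns where every sequence has an unambiguous base.
    cols = [i for i in range(length) if all(s[i] in "ACGT" for s in seqs)]
    if not cols:
        return 0.0

    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a[i] != b[i] for i in cols) for a, b in pairs)
    return diffs / (len(pairs) * len(cols))


def sliding_pi(aligned, window, step):
    """Return (start, end, Pi) per window, with 1-based inclusive coordinates."""
    length = len(aligned[0])
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        chunk = [s[start:start + window] for s in aligned]
        end = min(start + window, length)
        results.append((start + 1, end, nucleotide_diversity(chunk)))
    return results


if __name__ == "__main__":
    toy = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
    print(nucleotide_diversity(toy))
    print(sliding_pi(toy, window=4, step=4))
```

In the toy alignment one sequence differs at a single site, so two of the three pairs differ by one base over eight sites, giving Pi = 2 / (3 × 8) ≈ 0.083. The whole difference falls in the second window.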
Boundaries of the IR/SC region of the cpDNA of M. baccata var. gracilis and other Malus species
The chloroplast genome is circular, and its four regions meet at four junctions: LSC-IRb, IRb-SSC, SSC-IRa, and IRa-LSC, referred to as JLB, JSB, JSA, and JLA, respectively (Fig. 7). Expansion and contraction of the IR regions contribute to the variation in chloroplast genome length.
As the IR regions expand and contract, genes close to the boundaries can shift into the LSC, SSC, or IR regions. The cpDNA boundaries of ten Malus species were analyzed using IRscope. Several genes spanned or lay close to the boundaries between the IR and SC regions, including rpl22, rps19, rpl2, ycf1, ndhF, and trnH.
As presented in Fig. 7, Malus coronaria had the largest chloroplast genome (160,295 bp), whereas that of M. baccata var. gracilis was slightly smaller (159,992 bp). Among the ten cpDNA sequences analyzed, the lengths of the LSC region were in the range of 88,100 to 248,274 bp. However, no variation was observed in the lengths of the SSC, IRa, and IRb regions. The position of the rps19 gene at the JLB boundary was the same in eight of the Malus species, with 159 bp in the LSC region and 120 bp in the IRb region. In M. coronaria, however, rps19 extends further into the LSC (171 bp) and less into IRb (108 bp). In M. yunnanensis the shift is even more pronounced, with the longest LSC portion (210 bp) and the shortest IRb portion (69 bp). In M. ioensis, the ycf1 gene at the JSB junction expanded towards the SSC region, lengthening the gene. In M. toringoides and M. hupehensis, the ycf1 gene at the IRb/SSC junction did not cross the JSB boundary. In all ten Malus species, the trnH gene was displaced relative to the JLA junction. Together, these findings indicate that expansion and contraction of the boundary regions affect the size and structure of cpDNA.
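The rps19 figures can be read as a single gene split by the JLB junction. The sketch below shows how such a split is computed. The gene coordinates used for M. baccata var. gracilis are back-calculated from the reported 159 bp and 120 bp portions and the LSC end at position 88,100, so they are illustrative rather than taken from the annotation. `split_at_junction` is a hypothetical helper, not part of IRscope.

```python
def split_at_junction(gene_start, gene_end, last_pos_before_junction):
    """Return (bp left of the junction, bp right of it) for a gene.

    Coordinates are 1-based and inclusive; the junction lies between
    last_pos_before_junction and the next position.
    """
    j = last_pos_before_junction
    left = max(0, min(gene_end, j) - gene_start + 1)
    right = max(0, gene_end - max(gene_start, j + 1) + 1)
    return left, right


# rps19 portions (LSC side, IRb side) as reported in the text.
REPORTED_RPS19 = {
    "eight Malus species": (159, 120),
    "M. coronaria": (171, 108),
    "M. yunnanensis": (210, 69),
}

if __name__ == "__main__":
    # Illustrative coordinates: LSC ends at 88,100, gene spans 159 + 120 bp.
    print(split_at_junction(87_942, 88_220, 88_100))
    for name, (lsc_bp, ir_bp) in REPORTED_RPS19.items():
        print(name, lsc_bp + ir_bp)
```

All three reported splits sum to 279 bp, which is consistent with the junction moving relative to rps19 while the gene length stays the same.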
Construction of a phylogenetic tree of the chloroplast genome of M. baccata var. gracilis
Phylogenetic trees were constructed from the chloroplast genomes of 24 Malus species and two outgroup taxa to explore the evolution of M. baccata var. gracilis (Fig. 8). As shown in Figs. 8-A and 8-B, the 24 Malus species were well resolved in both the NJ and ML trees, with the outgroup forming a separate large branch. M. baccata var. gracilis clustered in a large group with its closest relatives, M. hupehensis and M. sikkimensis.
The chloroplast gene matK is a single-copy gene that evolves at a moderate rate and has been identified as a potential DNA barcode for taxonomic applications (Uchoi et al., 2016). Using Vitis vinifera as an outgroup, we next constructed ML and NJ trees based on the chloroplast matK gene of M. baccata var. gracilis and nine other apple species, as shown in Figs. 8-C and 8-D. M. baccata var. gracilis clustered with the other apple species on one large branch, and V. vinifera occupied a separate branch. M. baccata var. gracilis was most closely related to M. hupehensis, whereas M. ioensis was the most divergent from the other Malus species.