Chloroplast genome assembly and features
The chloroplast genomes of C. arborescens and C. opulens were sequenced on the Illumina NovaSeq platform. The assembled chloroplast genomes were 129,473 bp (Fig. 1A) and 132,815 bp (Fig. 1B) long, respectively. Owing to the loss of the IR region, neither genome has the typical quadripartite structure of most angiosperm chloroplast genomes, and both are correspondingly shorter. Nonetheless, the two genomes are highly similar in structure.
The chloroplast genomes of C. arborescens and C. opulens each contained 111 unique genes, including 76 protein-coding genes, 31 tRNA genes, and 4 rRNA genes, and their GC contents were 34.30% and 34.71%, respectively (Table 1), indicating that the GC content of the two species was very similar. We compared the chloroplast genome sequences of six Caragana species lacking the IR region. Their total lengths ranged from 129,331 to 133,122 bp: the chloroplast genome of C. korshinskii was the shortest (129,331 bp) and that of C. rosea was the longest (133,122 bp). In addition, C. arborescens and C. opulens had one more gene than the other species (an additional tRNA gene, trnN-GUU), whereas the numbers of protein-coding genes and rRNA genes were identical among the six species. In terms of gene content, protein-coding genes were the most numerous in all six species and accounted for approximately half of the genome length, followed in number by tRNA genes, whose total length was nonetheless the shortest of the three gene types. C. rosea had the highest GC content (34.84%) and C. microphylla the lowest (34.26%), with C. kozlowii intermediate (34.50%). We also examined the variation in GC content among the three gene types. The GC content of rRNA genes was high and stable at over 50%, followed by that of tRNA genes, whereas the GC content of protein-coding genes was approximately 37%. In conclusion, the sequence length and gene number of the chloroplast genomes of the six Caragana species were generally consistent, and their average GC content was approximately 34%, which suggests that the chloroplast genome of the genus Caragana has evolved relatively conservatively.
Table 1
Summary of complete chloroplast genomes for six Caragana species.
Plastome characteristics | Measure | Caragana arborescens | Caragana opulens | Caragana kozlowii | Caragana rosea | Caragana microphylla | Caragana korshinskii |
Protein-coding genes | Length (bp) | 66,222 | 66,333 | 66,234 | 66,243 | 66,231 | 66,231 |
| GC (%) | 36.89 | 37.01 | 37.03 | 37.13 | 36.88 | 36.88 |
| Length (%) | 51.15 | 50.0 | 50.45 | 49.76 | 50.94 | 51.21 |
| Number | 76 | 76 | 76 | 76 | 76 | 76 |
tRNA | Length (bp) | 2,379 | 2,370 | 2,285 | 2,359 | 2,370 | 2,379 |
| GC (%) | 52.74 | 52.83 | 53.15 | 52.73 | 53.14 | 53.05 |
| Length (%) | 1.83 | 1.80 | 1.74 | 1.77 | 1.82 | 1.83 |
| Number | 31 | 31 | 30 | 30 | 30 | 30 |
rRNA | Length (bp) | 4,522 | 4,520 | 4,521 | 4,537 | 4,520 | 4,520 |
| GC (%) | 54.8 | 54.56 | 54.75 | 54.77 | 54.82 | 54.82 |
| Length (%) | 3.49 | 3.40 | 3.44 | 3.4 | 3.48 | 3.49 |
| Number | 4 | 4 | 4 | 4 | 4 | 4 |
Total | Length (bp) | 129,473 | 132,815 | 131,274 | 133,122 | 130,029 | 129,331 |
| Number of genes | 111 | 111 | 110 | 110 | 110 | 110 |
| GC (%) | 34.3 | 34.71 | 34.5 | 34.84 | 34.26 | 34.36 |
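The GC (%) and Length (%) entries in Table 1 are simple ratios. The sketch below shows one way they could be computed; the function names are illustrative and are not taken from the authors' pipeline, and ambiguous bases are assumed to be excluded from the GC denominator.

```python
def gc_percent(seq):
    """GC content as a percentage of unambiguous (A/C/G/T) bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def length_percent(feature_bp, genome_bp):
    """Share of the genome length occupied by a feature class, in percent."""
    return 100.0 * feature_bp / genome_bp


# Protein-coding genes of C. arborescens (values from Table 1)
print(round(length_percent(66222, 129473), 2))  # 51.15
```

Applied to the protein-coding length and total length of C. arborescens, this reproduces the 51.15% reported in Table 1.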
As in other species, the chloroplast genomes of C. arborescens and C. opulens encode three categories of genes (Table 2). Fifty-seven genes were associated with self-replication, comprising genes for the large and small ribosomal subunit proteins, the subunits of the DNA-dependent RNA polymerase, ribosomal RNAs, and transfer RNAs; 44 genes were related to photosynthesis; and the remainder were other genes and genes of unknown function. In the chloroplast genomes of C. arborescens and C. opulens, 16 intron-containing genes were detected, of which one, ycf3, had two introns, and the remaining 15 (trnK-UUU, trnV-UAC, trnL-CAA, rpoC1, atpF, trnG-UCC, clpP, petB, petD, rpl16, rpl2, ndhB, trnI-GAU, trnA-UGC, ndhA) had one intron each (Table 3). For most of these 16 genes, the intron lengths were very similar between the two species.
Table 2
Genes in the chloroplast genome of Caragana species.
Category | Group of genes | Name of genes |
Self-replication | Proteins of large ribosomal subunit | rpl14, rpl16*, rpl2*, rpl20, rpl23, rpl32, rpl33, rpl36 |
| Proteins of small ribosomal subunit | rps11, rps12*, rps14, rps15, rps18, rps19, rps2, rps3, rps4, rps7, rps8 |
| Subunits of RNA polymerase | rpoA, rpoB, rpoC1*, rpoC2 |
| Ribosomal RNAs | rrn16, rrn23, rrn4.5, rrn5 |
| Transfer RNAs | trnA-UGC*, trnC-GCA, trnD-GUC, trnE-UUC, trnF-GAA, trnG-GCC, trnG-UCC*, trnH-GUG, trnI-CAU, trnI-GAU*, trnK-UUU*, trnL-CAA*, trnL-UAA, trnL-UAG, trnM-CAU, trnN-GUU(2), trnP-UGG, trnQ-UUG, trnR-ACG, trnR-UCU, trnS-GCU, trnS-GGA, trnS-UGA, trnT-GGU, trnT-UGU, trnV-GAC, trnV-UAC*, trnW-CCA, trnY-GUA, trnfM-CAU |
Photosynthesis | Subunits of photosystem I | psaA, psaB, psaC, psaI, psaJ |
| Subunits of photosystem II | psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbL, psbM, psbN, psbT, psbZ |
| Subunits of NADH dehydrogenase | ndhA*, ndhB*, ndhC, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ, ndhK |
| Subunits of cytochrome b/f complex | petA, petB*, petD*, petG, petL, petN |
| Subunits of ATP synthase | atpA, atpB, atpE, atpF*, atpH, atpI |
| Large subunit of rubisco | rbcL |
Other genes | Maturase | matK |
| Protease | clpP |
| Envelope membrane protein | cemA |
| Acetyl-CoA carboxylase | accD |
| c-type cytochrome synthesis gene | ccsA |
Unknown | Conserved hypothetical chloroplast ORF | ycf1, ycf2, ycf3**, ycf4 |
Notes: Gene*: gene with one intron; Gene**: gene with two introns; Gene(2): number of copies of a multi-copy gene.
Table 3
The intron-containing genes and the lengths of exons and introns in the chloroplast genomes of two Caragana species.
Species | Gene | Exon I (bp) | Intron I (bp) | Exon II (bp) | Intron II (bp) | Exon III (bp) |
C. arborescens | trnK-UUU | 37 | 2488 | 29 | | |
| trnV-UAC | 39 | 574 | 37 | | |
| trnL-CAA | 37 | 550 | 50 | | |
| ycf3 | 126 | 713 | 228 | 877 | 153 |
| rpoC1 | 432 | 789 | 1623 | | |
| atpF | 168 | 660 | 411 | | |
| trnG-UCC | 23 | 682 | 49 | | |
| clpP | 368 | 701 | 229 | | |
| petB | 6 | 818 | 642 | | |
| petD | 8 | 717 | 475 | | |
| rpl16 | 9 | 1054 | 399 | | |
| rpl2 | 396 | 685 | 435 | | |
| ndhB | 723 | 685 | 762 | | |
| trnI-GAU | 38 | 951 | 35 | | |
| trnA-UGC | 38 | 812 | 35 | | |
| ndhA | 552 | 1170 | 540 | | |
C. opulens | trnK-UUU | 37 | 2485 | 29 | | |
| trnV-UAC | 39 | 574 | 37 | | |
| trnL-CAA | 37 | 534 | 50 | | |
| ycf3 | 126 | 702 | 228 | 875 | 153 |
| rpoC1 | 432 | 790 | 1623 | | |
| atpF | 168 | 679 | 411 | | |
| trnG-UCC | 23 | 682 | 49 | | |
| clpP | 368 | 1159 | 223 | | |
| petB | 6 | 826 | 642 | | |
| petD | 8 | 720 | 475 | | |
| rpl16 | 9 | 1108 | 399 | | |
| rpl2 | 396 | 692 | 435 | | |
| ndhB | 723 | 685 | 762 | | |
| trnI-GAU | 38 | 953 | 35 | | |
| trnA-UGC | 38 | 810 | 35 | | |
| ndhA | 552 | 1169 | 540 | | |
Analyses of repetitive sequences and SSRs
Repeat sequences play an important role in genome evolution, for example in structural rearrangement and changes in genome size [32, 33]. In this study, we identified and characterized the repetitive sequences in the chloroplast genomes of C. arborescens and C. opulens. Repeats of at least 30 bp fell into four categories: forward (F), palindromic (P), reverse (R), and complementary (C) repeats. In total, 129 repeats (30–249 bp) were identified in C. arborescens and 229 repeats (30–472 bp) in C. opulens (Table S1). In both species, repeats of 30–49 bp were the most frequent length class (68.22% and 52.40%, respectively).
Structural analysis showed that the repeats of C. arborescens comprised 85 forward repeats (65.89%), 36 palindromic repeats (27.91%), 7 reverse repeats (5.43%), and 1 complementary repeat (0.78%) (Fig. 2A, Fig. 2C). C. opulens had no complementary repeats; its repeats comprised three types: 165 forward repeats (72.05%), 62 palindromic repeats (27.07%), and 2 reverse repeats (0.87%) (Fig. 2B, Fig. 2C). In both species, most repeats were located in intergenic spacer (IGS) regions, and most were forward repeats.
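The four repeat categories differ only in how the second copy relates to the first. The following minimal sketch classifies a pair of equal-length, exactly matching copies and recomputes the proportions reported for C. arborescens. It is an illustration of the definitions, not the repeat-finding tool used in this study; the function name is hypothetical and mismatches between copies are not handled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify_repeat(first, second):
    """Return F, P, R or C for two exactly matching copies, else None."""
    first, second = first.upper(), second.upper()
    complement = first.translate(COMPLEMENT)
    if second == first:
        return "F"  # forward: same sequence, same orientation
    if second == complement[::-1]:
        return "P"  # palindromic: reverse complement
    if second == first[::-1]:
        return "R"  # reverse: same strand, reversed order
    if second == complement:
        return "C"  # complementary: complement, same order
    return None


# Repeat counts for C. arborescens as reported in the text
counts = {"F": 85, "P": 36, "R": 7, "C": 1}
total = sum(counts.values())
shares = {kind: round(100.0 * n / total, 2) for kind, n in counts.items()}
print(shares)
```

The order of the checks matters only for sequences that are their own reverse or reverse complement, which this sketch assigns to the first matching category.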
Simple sequence repeats (SSRs) are abundant in plant chloroplast genomes. These sequences are inherited from parent to offspring, have a relatively simple structure, and show low variability, which makes them efficient molecular markers [34]. Using MISA v1.0, we identified a total of 18 SSR types in the two Caragana species. The chloroplast genomes of C. arborescens and C. opulens contained 277 and 265 SSR loci, respectively (Table S2). Mononucleotide repeats were the most abundant in both species, accounting for 57.04% and 63.40% of SSRs, respectively. In C. arborescens, dinucleotide and trinucleotide repeats accounted for 7.58% and 29.24%, and tetranucleotide repeats made up the smallest proportion (6.14%). In C. opulens, dinucleotide, trinucleotide, and tetranucleotide repeats accounted for 4.91%, 28.68%, and 2.64%, respectively, and pentanucleotide repeats made up the smallest proportion (0.38%).
The longest SSR in C. arborescens, located in the ycf1 gene, was a mononucleotide (A) repeat of 46 bp, whereas the longest SSR in C. opulens was a mononucleotide (T) repeat of 26 bp (Table S3). In addition, the distribution of SSRs in coding and non-coding regions was analyzed. Figure 3A shows that the number of SSRs in protein-coding regions was significantly lower than in non-coding regions. The majority of these SSRs were A/T mononucleotide repeats; 158 and 167 SSRs in the two Caragana species contained A/T, while only one contained C/G (Fig. 3B, Fig. 3C). Similarly, the majority of dinucleotide repeats consisted of AT/AT, resulting in a bias in base composition, which is consistent with the finding that the overall AT content of plastids is greater than the GC content [35].
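To make the notion of an SSR locus concrete, the sketch below finds perfect microsatellites with a regular expression. It is not MISA and does not reproduce its output; the minimum repeat numbers are assumptions chosen for illustration, because the thresholds used in this study are not stated here.

```python
import re

# Assumed thresholds (unit length -> minimum number of repeats); hypothetical values
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, unit, repeat_count) for perfect SSRs in seq (0-based start)."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A unit of unit_len bases followed by at least (min_rep - 1) further copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are themselves built from a shorter motif (e.g. "AA", "ATAT")
            if any(unit == unit[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)


example = "GC" + "A" * 12 + "GCGC" + "AT" * 6 + "CC"
print(find_ssrs(example))  # [(2, 'A', 12), (18, 'AT', 6)]
```

Counting the hits by unit length gives the mono-, di-, and trinucleotide proportions of the kind reported above; compound or imperfect SSRs would need additional handling.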
Codon usage bias analysis
Codon usage bias is prevalent in plastid genomes. Because codon usage bias can reflect the phylogenetic relationships between organisms and the molecular evolution of genes [36], analyzing it helps to study the origin, mutation patterns, and evolution of species. We analyzed the codon distribution of all protein-coding genes in the two species. The 76 protein-coding gene sequences of the two Caragana species yielded 12,812 codons in total. Leucine (Leu) was the most abundant amino acid, accounting for 10.58% and 10.65% of codons, respectively, followed by isoleucine (Ile) (9% and 8.89%), while cysteine (Cys) was the least abundant in both species (Table S4).
We also calculated the relative synonymous codon usage (RSCU) values to determine the codon usage bias of the two chloroplast genomes (Fig. 4). A codon with an RSCU value greater than 1 is used more often than expected and is considered preferred. Among the 31 codons with RSCU values greater than 1, the AUG codon encoding methionine had the highest usage bias (C. arborescens RSCU: 2.99 (Fig. 4A), C. opulens RSCU: 2.98 (Fig. 4B)). Tryptophan, which is encoded by a single codon, showed no codon usage bias. Except for UUG, which encodes leucine, and AUG, which encodes methionine, the remaining preferred codons ended in A (12) or U (16) (Table S4).
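RSCU is the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below implements this textbook definition under the standard genetic code; it is an assumption-laden illustration rather than the software used to produce Fig. 4, and it does not model alternative start codons.

```python
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons ordered T, C, A, G at each position
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def rscu(cds_list):
    """Return {codon: RSCU} for the in-frame coding sequences in cds_list."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("U", "T")
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            # Ignore stop codons and codons containing ambiguous bases
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1
    values = {}
    for amino_acid in set(AMINO) - {"*"}:
        synonyms = [c for c, a in CODON_TABLE.items() if a == amino_acid]
        total = sum(counts[c] for c in synonyms)
        if total:
            for codon in synonyms:
                # observed count / (total for the amino acid / number of synonyms)
                values[codon] = counts[codon] * len(synonyms) / total
    return values


values = rscu(["ATGTTTTTTTTCTGG"])  # codons: ATG TTT TTT TTC TGG
print(round(values["TTT"], 3), round(values["TTC"], 3))
```

Under this definition an amino acid with a single codon, such as tryptophan, always has an RSCU of 1.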
Sequence divergence analysis
Previous research has demonstrated that highly variable loci in the plastid genome can be developed as molecular markers [13]. Therefore, DnaSP 6 [37] was used to calculate the nucleotide diversity (Pi) in order to identify highly variable regions in the chloroplast genomes of C. arborescens and C. opulens. According to the sliding window analysis, the Pi values of the two species ranged from 0 to 0.05516, with an average of 0.0067655 (Fig. 5), indicating that the chloroplast genome sequences of these congeneric species differ little and are highly similar. Based on the Pi values of 111 genes, rpoC2-rps2, accD-cemA, rps18-clpP, rpoA-rpl36, and rpl2-rpl23 were identified as the most likely highly variable regions. The rpoA-rpl36 region had the highest Pi value, followed by the rps18-clpP region.
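A sliding-window Pi profile can be sketched as follows. This is a simplified illustration, not DnaSP: it assumes the sequences are already aligned to equal length, skips gapped sites pair by pair, and uses a window of 600 bp with a 200 bp step, which are assumed values since the settings used in this study are not given here.

```python
from itertools import combinations


def window_pi(aligned, start, width):
    """Mean pairwise proportion of differing sites within one alignment window."""
    pairs = list(combinations(aligned, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for seq_a, seq_b in pairs:
        sites = [(x, y)
                 for x, y in zip(seq_a[start:start + width], seq_b[start:start + width])
                 if x != "-" and y != "-"]  # pairwise deletion of gapped sites
        if sites:
            total += sum(x != y for x, y in sites) / len(sites)
    return total / len(pairs)


def sliding_pi(aligned, width=600, step=200):
    """Return (window midpoint, Pi) tuples along the alignment."""
    length = len(aligned[0])
    return [(start + width // 2, window_pi(aligned, start, width))
            for start in range(0, length - width + 1, step)]


toy_alignment = ["AAAAAAAAAA", "AAAATAAAAA"]
print(sliding_pi(toy_alignment, width=5, step=5))  # [(2, 0.2), (7, 0.0)]
```

With only two sequences, as in the comparison of C. arborescens and C. opulens, Pi in each window reduces to the proportion of sites that differ between them.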
To assess sequence divergence at the whole-genome level, the complete plastid genome sequences of six Caragana species (C. arborescens, C. opulens, C. kozlowii, C. rosea, C. microphylla, and C. korshinskii) were compared with that of C. jubata (Fig. 6). The extremely low sequence divergence among species suggests that the chloroplast genome is highly conserved. The intergenic spacers (IGS) matK-rbcL, psbM-petN, atpA-psbI, petA-psbL, psbE-petL, and rps7-rps12 differed markedly among Caragana species. In contrast, most protein-coding regions were highly conserved, with the exception of a few genes (accD, ycf2, and rps7). This suggests that IGS regions account for most of the sequence divergence among Caragana species.