Cp genome structure and size of C. nannophylla
In plants, chloroplasts are important organelles for photosynthesis and energy production and are essential for plant growth and development [10]. Chloroplasts have a unique genome and gene expression system that plays a crucial role in metabolism as a source of energy that supports plant life [24]. The complete C. nannophylla cp genome showed great similarities to the majority of angiosperms in terms of GC content and quadripartite architecture, including two inverted repeats (IRs), a large single-copy region (LSC), and a small single-copy region (SSC), which is common in plants [24–26].
Furthermore, the cp genome of C. nannophylla contains 133 genes (89 protein-coding genes, 36 tRNAs, and eight rRNAs), and the GC content of the genome is 38%. High GC content often correlates with an earlier-diverging phylogenetic position (as in Nymphaeaceae and Magnoliaceae) [27]. Overall, the complete cp genome of C. nannophylla is highly similar to other reported cp genomes of Clematis in length, structure, and gene composition [24,26,28]. There was no evidence of rearrangement, and good collinearity was observed. Whole-genome alignment showed that the C. nannophylla cp genome is relatively well conserved; therefore, we concluded that C. nannophylla differentiated earlier among Ranunculaceae.
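As a minimal sketch of how a GC content figure such as the one reported above can be computed from an assembled sequence, the Python snippet below counts G and C among unambiguous bases. The function name `gc_content` and the toy sequence are illustrative assumptions, not part of the study's pipeline.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = sum(seq.count(b) for b in "GC")
    total = gc + sum(seq.count(b) for b in "AT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return gc / total


# Toy sequence for illustration only; not C. nannophylla data.
toy = "ATGCGCATTAAN"
print(f"GC content: {gc_content(toy):.1%}")
```

Ambiguous bases such as N are excluded from the denominator here; a different convention would change the result slightly.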
Cp genome repeat sequence of C. nannophylla
Plant genomes contain numerous repeats. The number, size, type, and location of repeats vary among plants [29], and repeats in the cp genome have been widely used to identify mutation hotspots and determine plant evolutionary relationships [30]. Fifty dispersed repeats were found in C. nannophylla, including 22 forward, seven reverse, and 21 palindromic repeats. The number of dispersed repeats was the same as that in other species of Clematis, and most of them were located in the LSC region. Most dispersed repeats were 20–30 bp in length, indicating that short repeats occurred more frequently than long ones in C. nannophylla. Tandem repeats are generally considered the primary cause of genomic rearrangements and expansions [31]. Tandem repeats of C. nannophylla ranged from 10 bp to 20 bp; most were located in intergenic spacers or introns, and a few fell within the same gene, ycf2 [32].
Simple sequence repeats (SSRs) usually consist of repeating units of 1–6 nucleotides and are recognised as important molecular markers in the study of population variation [33,34]. Because genetic information in the cp genome is inherited only from the maternal progenitor, cp SSRs are sensitive to population genetic effects [35] and have been widely used in studies of population evolution and polymorphism [36]. SSRs vary in number and type among species; 66 SSRs were identified in C. nannophylla, distributed mainly in the LSC and SSC regions. Variation sites were fewer in the IR regions and were concentrated mainly in the single-copy regions [37]. Among the mononucleotide SSRs, A/T repeats were significantly more numerous than G/C repeats; this pattern also exists in other angiosperms [32,38]. The dispersed, tandem, and SSR repeats identified above contribute to cp genome rearrangement, gene duplication, gene expression, and sequence variation, and are helpful in phylogenetic studies. Rearrangement or sequence variation in these repeat units may also lead to substitutions, insertions, and deletions in the cp genome [17,39,40]. Therefore, these repeat sequences are a source of information for developing markers that play an important role in population and phylogenetic studies [32] and can be used in future studies of the genetic structure, differentiation, and species identification of C. nannophylla.
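The idea of an SSR as a 1–6 nucleotide unit repeated in tandem can be illustrated with a short regular-expression scan. This is only a sketch: the minimum repeat counts in `MIN_REPEATS` are assumed for illustration and are not the thresholds used in this study, and the function name `find_ssrs` is hypothetical.

```python
import re

# Minimum number of tandem copies per motif length (1-6 nt).
# These thresholds are assumptions for illustration, not the study's settings.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, copies) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A k-nt unit followed by at least n - 1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT"),
            # so each locus is reported once under its shortest motif.
            if any(k % d == 0 and motif == motif[:d] * (k // d)
                   for d in range(1, k)):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


# Toy sequence for illustration only.
toy = "GGC" + "A" * 12 + "CGG" + "AT" * 6 + "GCC"
for start, motif, copies in find_ssrs(toy):
    print(start, motif, copies)
```

The scan reports only perfect repeats; compound or interrupted SSRs would need additional logic.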
Codon usage bias in the cp genome of C. nannophylla
Codon usage bias is an important feature of genome evolution and is of great significance in the study of molecular gene evolution and exogenous expression [41]. Parity rule 2 (PR2) analysis further confirmed that most genes in C. nannophylla favour T and G over A and C on the coding strand. The direct cause of this base asymmetry is the replication mechanism, and the asymmetry between coding and non-coding strands is also an important cause of nucleotide skew [42]. However, the influence of replication on base bias differs between AT and GC asymmetries: replication generally has a strong effect on GC skew, whereas AT skew is caused by coding sequence-related mechanisms [42,43].
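The PR2 comparison described above can be sketched as follows. This assumes the common convention of using the third position of four-fold degenerate codon families and reporting A3/(A3+T3) and G3/(G3+C3); the source does not state which convention was applied, and the function name `pr2_bias` is hypothetical.

```python
# First two bases of the eight four-fold degenerate codon families
# (Ala, Gly, Pro, Thr, Val, and the four-fold blocks of Arg, Leu, Ser).
FOURFOLD_PREFIXES = {"GC", "GG", "CC", "AC", "GT", "CG", "CT", "TC"}


def pr2_bias(cds):
    """Return (A3/(A3+T3), G3/(G3+C3)) over four-fold degenerate codons.

    A value of 0.5 on both axes means no bias; None is returned for an
    axis with no informative codons.
    """
    cds = cds.upper()
    counts = dict.fromkeys("ACGT", 0)
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon[:2] in FOURFOLD_PREFIXES and codon[2] in counts:
            counts[codon[2]] += 1
    at = counts["A"] + counts["T"]
    gc = counts["G"] + counts["C"]
    a3 = counts["A"] / at if at else None
    g3 = counts["G"] / gc if gc else None
    return a3, g3


# Toy coding sequence for illustration only.
print(pr2_bias("GCTGCTGCAGGGGGC"))
```

Under this convention, a gene with A3/(A3+T3) below 0.5 uses T more often than A at these sites, and G3/(G3+C3) above 0.5 indicates G is used more often than C.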
Codon usage patterns are evolutionary features of the genome. In plants, codon usage bias is related to gene expression and is mainly affected by natural selection and mutation pressure, with differences among species [44]. In the cp genome of C. nannophylla, there are 30 high-frequency codons (RSCU > 1); leucine is the most frequently encoded amino acid and cysteine the least, which is consistent with observations in other higher plants [41,45,46]. The use of synonymous codons is not random, and analysis of codon preferences can provide valuable information for understanding species adaptation and molecular evolution.
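Relative synonymous codon usage (RSCU) is the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally, so RSCU > 1 marks a preferred codon. The sketch below computes it under the standard genetic code and excludes stop codons; both choices, and the function name `rscu`, are assumptions for illustration.

```python
from collections import Counter
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """Return {codon: RSCU} pooled over coding sequences (stop codons excluded)."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    # Group sense codons into synonymous families.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    result = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        for c in codons:
            result[c] = counts[c] * len(codons) / total
    return result


# Toy input for illustration only.
values = rscu(["ATGTTTTTTTTC"])
preferred = sorted(c for c, v in values.items() if v > 1)
print(preferred)
```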
Comparative genomic analysis of the cp genome of C. nannophylla
The IR regions of angiosperm cp genomes are highly conserved. Expansion and contraction of the IR boundaries are common evolutionary events in most angiosperms and may lead to variation in cp genome length, gene duplication or loss, and the origin of pseudogenes [47,48]. This study found that IR expansion and contraction in C. nannophylla were highly similar to those in other Clematis species, with similar gene types and positions in these regions [25]. Only minor differences were observed near the IRb/SSC boundaries. In C. nannophylla and C. florida, trnN rather than ycf1 was located at the IRb/SSC boundary, and infA was not observed near the IRa/LSC boundary, which may be the result of contraction and expansion of the IR region; this is also an important reason for the differences in cp genome length [49]. The infA gene is transcribed as part of a polycistronic mRNA from the ribosomal protein (rpl23) operon, while ycf1 is a functional gene that encodes products essential for cell survival [50]. Therefore, the loss (or pseudogenisation) of infA and ycf1 may result from gene transfer to the nucleus. However, there is no evidence that infA and ycf1 have been transferred from the cp genome to the nuclear genome in Clematis. Further transcriptome studies of these two genes are required to elucidate the effect of length variation in Clematis.
Owing to the highly conserved structure and nucleotide content of cp genomes, mutation hotspots can be quickly and accurately identified by comparative analysis. Mutation hotspots are therefore often used as a basis for highly variable markers (DNA barcodes) in population genetics and phylogenetic studies [51,52]. In this study, we compared the cp genome structure of five Clematis species using mVISTA (with Clematis fruticosa as the reference) and found that non-coding regions were more prone to mutation than coding regions. Furthermore, variation in the SC regions was higher than that in the IR regions, which is similar to the results of previous plant studies [25,51,53]. psbA-atpA, atpI-rpoC2, rpoB-psbD, psbE-petG, clpP, and rpoC2 were the most highly variable regions detected in C. nannophylla. To determine the degree of variation in these regions, nucleotide variability was calculated in DnaSP v6 to identify differences among the Clematis cp genomes and to locate mutation hotspots. Nucleotide diversity (Pi) indicates the degree of variation in the nucleic acid sequences in each species, and sites with high variability can be selected as molecular markers for population genetics [49,54]. In the present study, the nucleotide diversity analysis showed that gene sequences in the LSC and SSC regions were more variable than those in the IR regions, which is consistent with the results found in Asteraceae and Fagaceae plants [49,59].
By analysing cp genome sequence variation among five Clematis species, we identified 13 hypervariable regions (Pi > 0.006) in the LSC and SSC regions, which are of great significance for the study of molecular barcodes; highly variable regions such as ndhF, ccsA, and ndhD have also been found in two Korean endemic Clematis species [25]. The highly variable regions ccsA and rpl32 were also found in Fagus longipetiolata (Fagaceae). The ccsA gene is also considered a locus for understanding cp genome evolution in Fagus longipetiolata [49], Litsea [54], Pterocarpus [51], and Prosopis [55]. Overall, these highly diverse regions provide a wealth of information for the development of molecular markers for the identification of Clematis species, as well as for the analysis of the phylogenetic relationships and population genetics of C. nannophylla.
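A sliding-window Pi scan of the kind used to flag hypervariable regions can be sketched in a few lines. Here Pi is taken as the mean number of pairwise differences per site within each window, and columns with gaps or ambiguous bases are skipped. The window and step defaults and the function names are assumptions for illustration and may differ from the DnaSP settings used in the study; only the Pi > 0.006 cutoff comes from the text above.

```python
from itertools import combinations


def window_pi(aln, start, end):
    """Mean pairwise differences per site in alignment columns start..end-1.

    Columns containing gaps or ambiguous bases are skipped (an assumption;
    other tools may treat such sites differently).
    """
    if len(aln) < 2:
        raise ValueError("need at least two aligned sequences")
    cols = [col for col in zip(*(s[start:end].upper() for s in aln))
            if all(b in "ACGT" for b in col)]
    if not cols:
        return None
    pairs = list(combinations(range(len(aln)), 2))
    diffs = sum(col[i] != col[j] for col in cols for i, j in pairs)
    return diffs / (len(pairs) * len(cols))


def sliding_pi(aln, window=600, step=200):
    """Return [(window_start, pi), ...] across an alignment of equal-length strings."""
    length = len(aln[0])
    if any(len(s) != length for s in aln):
        raise ValueError("sequences must be aligned to equal length")
    last_start = max(length - window, 0)
    return [(start, window_pi(aln, start, start + window))
            for start in range(0, last_start + 1, step)]


# Toy alignment for illustration only; real input would be aligned cp genomes.
toy = ["AAAA", "AAAT", "AATT"]
hot = [(s, pi) for s, pi in sliding_pi(toy, window=2, step=2)
       if pi is not None and pi > 0.006]
print(hot)
```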
Adaptive evolution analysis of the cp genome of C. nannophylla
By comparing C. nannophylla with four other Clematis species, we detected protein-coding genes in C. nannophylla under selection pressure. A base change that alters the encoded amino acid is called a non-synonymous mutation (Ka); one that does not is called a synonymous mutation (Ks). Non-synonymous mutations are usually affected by natural selection [56]. The Ka/Ks ratio is generally used to express the selection pressure on protein-coding genes: Ka/Ks greater than 1 indicates positive selection, and Ka/Ks less than 1 indicates purifying selection [57]. In this study, the Ka/Ks of most genes in C. nannophylla was less than 1 relative to the other four species, indicating that purifying selection has played an important role in the cp genomes of the five Clematis species. Only the ycf1 gene (C. nannophylla and C. florida) had a Ka/Ks greater than 1, indicating that ycf1 was selected during adaptation to the living environment; ycf1 was also positively selected in previous studies [33]. The ycf1 gene, the largest gene in the cp genome and the most promising cp DNA barcode, encodes the ATP-binding (ABC) protein in cp. ycf1 is characterised by species specificity [50,58], a rapid mutation rate, and rapid evolution [57] and has been verified to have classification potential at the subgenus level. In C. nannophylla, genes under strong purifying selection were mainly self-replication genes (proteins of large ribosomal subunits and subunits of RNA polymerase), photosystem genes (subunits of photosystem and NADH dehydrogenase), other genes, and genes of unknown function (ycf), similar to the evolution of cp genes in Pterocarpus, Artemisia maritima, and Artemisia absinthium [51,59], suggesting that strong purifying selection preserves specific gene residues and gene functions in these species.
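The interpretation of Ka/Ks described above reduces to a simple decision rule, sketched below. The function name `classify_selection`, the tolerance `tol`, and the toy values are assumptions for illustration; the sketch does not estimate Ka or Ks from sequences, and pairs with Ks = 0 are reported as undefined rather than assigned a ratio.

```python
def classify_selection(ka, ks, tol=1e-9):
    """Classify selection from pre-computed Ka and Ks values.

    Returns 'positive' for Ka/Ks > 1, 'purifying' for Ka/Ks < 1,
    'neutral' for Ka/Ks equal to 1 within tol, and 'undefined' when
    Ks is zero or negative and the ratio cannot be formed.
    """
    if ks <= 0:
        return "undefined"
    ratio = ka / ks
    if ratio > 1 + tol:
        return "positive"
    if ratio < 1 - tol:
        return "purifying"
    return "neutral"


# Hypothetical (Ka, Ks) pairs for illustration only; not values from this study.
toy = {"geneA": (0.002, 0.020), "geneB": (0.030, 0.015), "geneC": (0.001, 0.0)}
for gene, (ka, ks) in toy.items():
    print(gene, classify_selection(ka, ks))
```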
Phylogenetic analysis of the cp genome of C. nannophylla
Cp genomes contain a large amount of genetic information that is a useful resource for inferring evolutionary and phylogenetic relationships [60]. Many researchers have used complete cp genome sequences to resolve phylogenetic relationships at various taxonomic levels, and a well-supported phylogenetic tree can intuitively represent the relatedness of species and their evolutionary relationships at various scales. The present study reconstructed a phylogenetic tree from the complete cp genomes of 23 species using the maximum likelihood (ML) method, with four Aconitum species and one Magnolia species as outgroups. The results showed that C. nannophylla was more closely related to C. fruticosa and C. songorica but less closely related to C. florida, which is consistent with classification based on morphological characteristics. C. nannophylla, C. fruticosa, C. tomentella, and C. songorica belong to sect. Fruticella, whereas C. florida belongs to sect. Viticella of Clematis [6]. The present study also showed that Clematis is monophyletic and divides into two large subclades, and that Clematis is sister to Aconitum [28].