Features of C. mongolicum Cp Genome
The cp genome of C. mongolicum was 162,124 bp in length and comprised a pair of inverted repeat (IR) regions (IRa and IRb; 30,512 bp each), a large single copy (LSC) region of 87,718 bp and a small single copy (SSC) region of 13,382 bp (Fig. 1). The cp genome was enriched in A/T nucleotides: its A + T content was 62.5%, markedly higher than its G + C content. The A + T content of the IR regions was 58.66%, lower than that of the LSC and SSC regions (64.42% and 67.51%, respectively) (Table 1). Weak base composition asymmetry (A-T, C-G) was found in the C. mongolicum cp genome.
The positions of the 131 functional genes annotated in the C. mongolicum cp genome are shown in Fig. 1. Of these, 78 were protein-coding genes, accounting for more than half (59.5%) of the total. The remaining genes comprised 45 tRNA genes and 8 rRNA genes. According to function, the annotated genes were classified into four classes: photosynthesis, self-replication, biosynthesis, and unknown function (Table S1). Seventeen genes were duplicated in the IR regions, including 6 protein-coding genes, 4 rRNA genes, and 6 tRNA genes.
In the C. mongolicum cp genome, 12 genes (5 tRNA genes and 7 protein-coding genes) contained a single intron and two exons, whereas the protein-coding genes ycf3 and clpP each contained two introns and three exons (Table 2). Among the intron-containing genes, trnK-UUU had the largest intron (2511 bp) and trnL-UAA had the smallest (520 bp).
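The region lengths and composition statistics above can be checked with a few lines of code. The following is a minimal sketch, not the procedure used in this study: it assumes the conventional skew definitions, AT skew = (A - T)/(A + T) and GC skew = (G - C)/(G + C), which the text does not spell out, and the function name is hypothetical.

```python
from collections import Counter


def base_composition(seq: str) -> dict:
    """Return A+T content, G+C content, and AT/GC skew for a DNA sequence.

    Assumption: skews follow the conventional definitions
    (A - T)/(A + T) and (G - C)/(G + C).
    """
    counts = Counter(seq.upper())
    a, t, g, c = (counts[b] for b in "ATGC")
    total = a + t + g + c  # ambiguous bases such as N are ignored
    if total == 0:
        raise ValueError("sequence contains no A/T/G/C bases")
    return {
        "at_content": (a + t) / total,
        "gc_content": (g + c) / total,
        "at_skew": (a - t) / (a + t) if a + t else 0.0,
        "gc_skew": (g - c) / (g + c) if g + c else 0.0,
    }


# Region lengths reported in the text (bp); each IR copy is 30,512 bp.
LSC, SSC, IR = 87_718, 13_382, 30_512
genome_length = LSC + SSC + 2 * IR  # quadripartite structure: LSC + SSC + two IRs
```

Summing the four regions in this way reproduces the reported genome length of 162,124 bp. Applied to a whole-genome sequence, skew values close to zero would correspond to the weak asymmetry described above.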
Synonymous Codon Usage Analysis
A total of 51 coding sequences (CDSs) longer than 300 bp were screened for synonymous codon usage (SCU) bias analysis. In general, the four nucleotides were unevenly represented in the 51 CDSs. Adenine (A) and thymine (T) were the most represented (43.3% and 46.4%, respectively), whereas cytosine (C) and guanine (G) were the least represented (16.7% and 16.9%, respectively). The average GC content of the CDSs was 38.7%. We identified a total of 61 codons, excluding stop codons, among which 18 codons with a relative synonymous codon usage (RSCU) value greater than 1.3 were classified as high-frequency codons and 29 codons with a ΔRSCU value greater than 0.08 were classified as highly expressed codons (Table 3 and Table S2). Seven codons that were both high-frequency and highly expressed (TTT, GGA, CAT, AAA, TTA, AAT and CCT) were identified as the optimal codons.
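For readers unfamiliar with RSCU, the sketch below illustrates how the values could be computed from a set of CDSs. It assumes the standard definition (observed codon count divided by the mean count of all synonymous codons for the same amino acid) and the standard codon assignments; the function name is hypothetical and this is not the software used in this study. The ΔRSCU step, which compares gene sets with high and low expression, is not covered.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard codon assignments, listed in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """Return {codon: RSCU} pooled over all CDSs, excluding stop codons."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            # Skip stop codons and codons containing ambiguous bases.
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1

    # Group the 61 sense codons by the amino acid they encode.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        for c in codons:
            # Observed count divided by the mean count within the family.
            values[c] = counts[c] * len(codons) / total
    return values


def high_frequency_codons(values, threshold=1.3):
    """Codons whose RSCU exceeds the threshold used in the text (1.3)."""
    return {codon for codon, v in values.items() if v > threshold}
```

Under this definition, an RSCU of 1 indicates no bias within a synonymous family, so single-codon amino acids (Met, Trp) always score 1.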
To further analyze the SCU pattern in the C. mongolicum cp genome, parity rule 2 (PR2) plot, ENC-plot, and neutrality plot analyses were combined. The PR2 plot showed that the genes were distributed unevenly among the four quadrants centered on 0.5, with most points lying below the horizontal center line (A3/(A3 + T3) < 0.5) (Fig. 2a). An ENC plot was used to analyze codon usage variation among the 51 CDSs (Fig. 2b). Most points lay away from the expected curve in a relatively concentrated distribution, whereas a few (rpl16, ycf2, ycf3, and others) lay on the curve. We also performed neutrality plot analysis to examine the relationship between GC12 and GC3 (Fig. 2c). Only one gene, ycf2, lay on the standard curve; the remaining genes lay above it.
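The coordinates plotted in these three analyses can be derived per CDS as sketched below. This is an illustrative simplification rather than the pipeline used here: it counts every third codon position, whereas PR2 and GC3 are often restricted to particular codon classes, and it takes the expected ENC curve to be Wright's formula, ENC = 2 + s + 29/(s^2 + (1 - s)^2) with s = GC3. Function names are hypothetical.

```python
def position_stats(cds: str) -> dict:
    """Per-CDS coordinates for neutrality (GC12 vs GC3) and PR2 plots.

    Simplification: all codons are used, including Met, Trp and stop codons.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop any trailing partial codon
    first = cds[0:usable:3]
    second = cds[1:usable:3]
    third = cds[2:usable:3]

    def gc(s: str) -> float:
        return (s.count("G") + s.count("C")) / len(s) if s else 0.0

    a3, t3, g3, c3 = (third.count(b) for b in "ATGC")
    return {
        "gc12": (gc(first) + gc(second)) / 2,  # neutrality plot, y axis
        "gc3": gc(third),                      # neutrality/ENC plot, x axis
        "pr2_x": g3 / (g3 + c3) if g3 + c3 else None,  # G3/(G3 + C3)
        "pr2_y": a3 / (a3 + t3) if a3 + t3 else None,  # A3/(A3 + T3)
    }


def expected_enc(gc3: float) -> float:
    """Expected ENC when codon usage depends on GC3 alone (Wright's curve)."""
    return 2 + gc3 + 29 / (gc3 ** 2 + (1 - gc3) ** 2)
```

A gene whose observed ENC falls well below `expected_enc(gc3)` departs from the expectation based on base composition alone, and a `pr2_y` value below 0.5 corresponds to T being used more often than A at the third codon position.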
Long-Repeat Sequences and Simple Sequence Repeats (SSRs) Analysis
Long repeat sequences in the C. mongolicum cp genome were identified with REPuter. A total of 50 long repeats were detected, of which 44 were forward and 6 were reverse repeats (Table 4). The largest share of long repeats (47%) was located exclusively in intergenic spacers (IGSs), 39% were located in genes, and the remaining 14% were detected in both IGSs and genes. Notably, all six reverse repeats were located in IGSs. In addition, 17, 1, 12, and 10 repeats were confined to the LSC, SSC, IRa, and IRb regions, respectively, and another 10 repeats were detected in two regions simultaneously. The ycf1 CDS contained the highest number of long repeats (14) and the longest repeat (45 bp).
A total of 244 SSRs were found in the C. mongolicum cp genome using the MISA Perl script. Of these, 67.2% were located in the LSC region, and 23.0% and 9.8% were found in the IR and SSC regions, respectively (Fig. 3a). In total, 158 SSRs were located in IGSs, 80 in coding regions, and only 6 in introns (Fig. 3b). The numbers of mono-, di-, tri-, and tetranucleotide repeats were 147, 43, 4, and 7, respectively (Fig. 3c). Mononucleotide repeats were the most frequent, accounting for 60.2% of all SSRs, while dinucleotide repeats accounted for 17.6%; other SSR types were less common. Among all the identified SSRs, 20 were of the G/C type and the remainder were of the A/T type.
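The principle behind SSR detection can be illustrated with a short scan for perfect tandem repeats. The sketch below is not MISA and does not reproduce its behavior (for example, compound SSRs are not handled). The minimum repeat numbers in MIN_REPEATS are hypothetical placeholders, since the thresholds applied in this study are not stated in this section, and the function names are likewise illustrative.

```python
# Hypothetical minimum repeat counts per motif length (NOT taken from the text).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'AA')."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(seq: str, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, min_rep in sorted(min_repeats.items()):
        i = 0
        while i + k <= len(seq):
            motif = seq[i:i + k]
            if set(motif) <= set("ACGT") and is_primitive(motif):
                # Count consecutive copies of the motif starting at i.
                reps = 1
                while seq[i + reps * k:i + (reps + 1) * k] == motif:
                    reps += 1
                if reps >= min_rep:
                    hits.append((i, motif, reps))
                    i += reps * k  # jump past the repeat just recorded
                    continue
            i += 1
    return sorted(hits)
```

Requiring a primitive motif prevents a mononucleotide run such as poly-A from being counted a second time as an "AA" dinucleotide repeat. Classifying each hit by motif length and by its position relative to annotated genes would yield tallies of the kind reported in Fig. 3.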
Phylogenetic Analysis
A phylogenetic tree was constructed from a multiple alignment of the complete cp genome nucleotide sequences of 37 plant species (Fig. 4), with Drosera rotundifolia as the outgroup. The species of the Polygonaceae family formed a clade, within which C. mongolicum clustered closely with R. acetosa, R. palmatum, Oxyria sinensis, and F. esculentum. Of these, R. acetosa was the species most closely related to C. mongolicum.
Comparative Analysis of Genomic Structure
The cp genome of C. mongolicum was compared with those of the closely related species R. acetosa, R. palmatum, and F. esculentum (Table 5). C. mongolicum had the largest cp genome, the largest SSC region and the most tRNA genes. F. esculentum had the smallest cp genome and the largest LSC region. To further examine genome divergence among these four species, sequence identity was compared using mVISTA with C. mongolicum as the reference (Fig. 5). In general, the IR regions were relatively conserved, whereas the LSC and SSC regions were more divergent. Non-coding regions were more divergent than coding regions, for example the IGS regions between rps16 and trnQ-UUG and between ycf3 and trnS-GGA. In addition, marked differences were found in protein-coding genes (petD and ndhA) and in a non-coding RNA gene (trnI-GAU).
IR Contraction and Expansion
The LSC/IR and SSC/IR boundaries of the cp genomes of C. mongolicum and the three related species were compared (Fig. 6). Six different genes were located at the junctions: rps19 and rpl2 at the LSC/IRb border, ndhF at the IRb/SSC border, rps15 and ycf1 at the SSC/IRa border, and rpl2 and trnH at the IRa/LSC border. The ndhF gene crossed the IRb/SSC border, with 62-95 bp located within the IRb region. Compared with the other Polygonaceae species, the IRb/SSC and SSC/IRa borders in C. mongolicum differed markedly. The LSC/IRb and IRa/LSC borders were relatively conserved in C. mongolicum, R. palmatum, and F. esculentum; however, the positions of the rps19 gene at the LSC/IRb border and the trnH gene at the IRa/LSC border in R. acetosa differed from those in the other three species.
Selective Pressure Events
A total of 75 orthologous protein-coding genes were identified in the Polygonaceae family. The ω values of most genes were lower than 1; the exception was the psbK gene in the LSC region, which had an ω value of 1.0556 (Fig. 7). The ω values of some genes, such as psbI, petN, ycf3, psbE, petG, rps12, and ndhE, were 0.
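How such ω values are conventionally read can be sketched as follows. This assumes that ω denotes the usual ratio of nonsynonymous to synonymous substitution rates (dN/dS), which is not defined in this section; the function names and the tolerance are hypothetical.

```python
def omega(dn: float, ds: float):
    """Return dN/dS, or None when dS is zero and the ratio is undefined."""
    return dn / ds if ds > 0 else None


def classify(w, tol: float = 1e-9) -> str:
    """Conventional reading of an omega value (assumed convention)."""
    if w is None:
        return "undefined"
    if w > 1 + tol:
        return "positive selection"
    if w < 1 - tol:
        return "purifying selection"
    return "neutral"
```

Under this convention, an ω of exactly 0 arises when no nonsynonymous substitutions are observed (dN = 0 with dS > 0), which places those genes at the strongly conserved end of the purifying-selection range.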