Cp genome organizations
The chloroplast genome sequences of the three Nekemias species compared in this section have been uploaded to the NCBI database, and the GenBank accession numbers are shown in Table 1. As in other flowering plants, the chloroplast genome of Nekemias has a typical quadripartite structure consisting of two inverted repeats (IRs), a large single-copy region (LSC) and a small single-copy region (SSC). As shown in Fig. 1 and Table 1, the chloroplast genomes of the three species ranged from 161,981 bp to 162,500 bp in length. The genome of Nekemias grossedentata was intermediate at 162,147 bp, and the maximum length difference among the three species was only 519 bp.
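The quadripartite structure implies a simple consistency check on the region lengths: the total length should equal LSC + 2 × IR + SSC. The following minimal sketch applies that check to the values in Table 1; the function and variable names are illustrative only and are not part of the original analysis.

```python
# Region lengths (bp) copied from Table 1: (total, LSC, IR, SSC).
TABLE1 = {
    "N. grossedentata": (162_147, 89_244, 27_232, 18_439),
    "N. cantoniensis": (162_500, 89_333, 27_042, 19_083),
    "N. megalophylla": (161_981, 89_236, 27_153, 18_439),
}


def quadripartite_length(lsc: int, ir: int, ssc: int) -> int:
    """Total length of a genome with one LSC, one SSC and two IR copies."""
    return lsc + 2 * ir + ssc


# Every row of Table 1 should satisfy total = LSC + 2 * IR + SSC.
for species, (total, lsc, ir, ssc) in TABLE1.items():
    assert quadripartite_length(lsc, ir, ssc) == total, species

# Largest difference in total length among the three species.
totals = [row[0] for row in TABLE1.values()]
length_spread = max(totals) - min(totals)
print(length_spread)
```

All three rows of Table 1 pass this check, and the spread of total lengths reproduces the 519 bp difference reported in the text.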
By comparing the GC content of the three species in Tables 1 and 2, we found that the GC content of the IR regions is higher than that of the two single-copy regions, as in most land plants. This is because the eight ribosomal RNA (rRNA) genes of the chloroplast genome are located in the two IR regions, and rRNA genes are rich in GC bases19. In addition, the AT content of the protein-coding regions (CDS) of the three species was greater than their GC content, which is also consistent with other land plants20,21. We mapped and annotated the chloroplast genomes of these three species. Table 2 shows that although the chloroplast genome of N. grossedentata has an intermediate length, it contains 134 genes in total, including 89 protein-coding genes. This is three more protein-coding genes than in N. megalophylla and N. cantoniensis, and it includes one gene absent from the other two species, ycf15. This is a protein-coding gene of unknown function, and no specific functional classification has been reported in similar studies. It may be useful as a molecular marker to identify Nekemias species in the supply of vine tea products. In addition, the three Nekemias species had the same numbers of tRNA (37) and rRNA (8) genes, but the length of the CDS differed. The CDS of N. grossedentata was more than 1,500 bp longer than that of the other two species, but the overall GC content of the CDS did not differ appreciably among the species, at about 38% in each (the sum of the C and G columns in Table 2). The differences in nucleotide composition at each codon position were also small. The GC content at the first, second and third codon positions was 45.57%, 38.19% and 30.19% in N. grossedentata, 45.73%, 38.28% and 30.02% in N. cantoniensis, and 45.71%, 38.28% and 30.04% in N. megalophylla, respectively.
Although the differences among species are small, the GC content is clearly lowest at the third codon position. According to the literature22,23, this AT preference at the third codon position is also found in other angiosperms, which supports the reliability of our data mining and analysis.
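The position-specific GC values can be obtained by splitting an in-frame coding sequence into its first, second and third codon positions and counting G and C at each. The sketch below illustrates this on an invented toy sequence. The function name is hypothetical, and the actual values in Table 2 were computed from the concatenated CDS of each species, which is not reproduced here.

```python
def gc_by_codon_position(cds: str) -> tuple:
    """Return GC percentage at codon positions 1, 2 and 3 of an in-frame CDS."""
    cds = cds.upper()
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("CDS length must be a non-zero multiple of 3")
    percentages = []
    for offset in range(3):
        bases = cds[offset::3]  # every third base, starting at this position
        gc_count = sum(base in "GC" for base in bases)
        percentages.append(100.0 * gc_count / len(bases))
    return tuple(percentages)


# Toy CDS (invented for illustration): codons ATG GCT AAA TGA.
toy_gc = gc_by_codon_position("ATGGCTAAATGA")
print(toy_gc)

# Cross-check against Table 2 for N. grossedentata: GC = C + G at each position.
table2_gc = [round(c + g, 2) for c, g in [(18.83, 26.74), (20.34, 17.85), (13.98, 16.21)]]
print(table2_gc)
```

Summing the C and G columns of Table 2 in this way reproduces the 45.57%, 38.19% and 30.19% reported for N. grossedentata.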
The chloroplast genome of N. grossedentata contains 89 protein-coding genes with a total of 27,007 codons, while the 86 protein-coding genes of N. cantoniensis and N. megalophylla contain 26,487 and 26,482 codons, respectively. Among all codons in the protein-coding regions, AUU (1125/1103/1103), encoding isoleucine (Ile), was the most frequently used in all three species, while UGC (85/80/81), encoding cysteine (Cys), was the least used. As shown in Fig. 2, the relative synonymous codon usage (RSCU) values in the protein-coding regions of the three chloroplast genomes are slightly different but generally similar. All amino acids except methionine (Met) and tryptophan (Trp) are encoded by multiple synonymous codons, up to six in the case of arginine, leucine and serine. Many amino acids have two synonymous codons, for which the preference for a single codon is obvious, whereas amino acids with more codons have several preferred codons at the same time (for example, three arginine codons have RSCU > 1). The relatively high rate of synonymous substitution provides more informative characters for phylogenetic studies.
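For reference, RSCU is the observed count of a codon divided by the mean count of all codons that encode the same amino acid, so a value above 1 indicates a preferred codon. The sketch below shows the calculation under the assumption of the standard genetic code. The codon counts are invented toy values, not the counts behind Fig. 2, and the function name is illustrative.

```python
from collections import Counter
from itertools import product

# Standard genetic code in UCAG order (an assumption of this sketch).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Number of synonymous codons per amino acid (stop codons excluded).
DEGENERACY = Counter(aa for aa in CODON_TABLE.values() if aa != "*")


def rscu(codon_counts: dict) -> dict:
    """RSCU = observed count / mean count over the synonymous codons of the amino acid."""
    by_amino_acid = {}
    for codon, aa in CODON_TABLE.items():
        by_amino_acid.setdefault(aa, []).append(codon)
    values = {}
    for aa, codons in by_amino_acid.items():
        if aa == "*":
            continue  # stop codons are skipped in this sketch
        total = sum(codon_counts.get(codon, 0) for codon in codons)
        if total == 0:
            continue  # amino acid absent from the data
        for codon in codons:
            values[codon] = codon_counts.get(codon, 0) * len(codons) / total
    return values


# Codon totals follow from the CDS lengths in Table 2 (3 bp per codon).
CDS_BP = {"N. grossedentata": 81_021, "N. cantoniensis": 79_461, "N. megalophylla": 79_446}
CODON_TOTALS = {species: bp // 3 for species, bp in CDS_BP.items()}

# Invented toy counts for isoleucine, methionine and tryptophan codons.
toy_rscu = rscu({"AUU": 6, "AUC": 2, "AUA": 1, "AUG": 5, "UGG": 3})
print(toy_rscu["AUU"], toy_rscu["AUG"], CODON_TOTALS)
```

In the toy data AUU is used twice as often as expected under equal usage, while the single-codon amino acids Met and Trp always have an RSCU of 1. Dividing the CDS lengths in Table 2 by three reproduces the codon totals quoted in the text.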
(Fig. 1. Gene map of the N. grossedentata cp genome. The genes inside and outside the outer circle are transcribed in the direction from 0K to 15K. Genes are classified into 14 groups according to their biological function and are shown by different colored boxes. Gene maps of the other two species are provided in the supplementary material.)
Table 1
Summary of the base composition of the cp genomes of three Nekemias species
Species | Accession number | Total length (bp) | LSC region (bp) | IR region (bp) | SSC region (bp) | GC content (%) | GC content in LSC region (%) | GC content in IR region (%) | GC content in SSC region (%) |
N. grossedentata | MT267294.1 | 162,147 | 89,244 | 27,232 | 18,439 | 37.36 | 35.35 | 42.52 | 31.82 |
N. cantoniensis | ON616406.1 | 162,500 | 89,333 | 27,042 | 19,083 | 37.36 | 35.36 | 42.66 | 31.70 |
N. megalophylla | NC_068499.1 | 161,981 | 89,236 | 27,153 | 18,439 | 37.37 | 35.36 | 42.58 | 31.81 |
Table 2
Gene number and CDS nucleotide composition of the cp genomes of the three Nekemias species
Species | Total genes | Unique genes | Protein-coding genes (P-cgs) | tRNAs | rRNAs | CDS (bp) | T(U) in CDS (%) | C in CDS (%) | A in CDS (%) | G in CDS (%) |
N. grossedentata | 134 | 114 | 89 | 37 | 8 | 81,021 | 31.30 | 17.72 | 30.72 | 20.26 |
N. cantoniensis | 131 | 113 | 86 | 37 | 8 | 79,461 | 31.29 | 17.72 | 30.70 | 20.29 |
N. megalophylla | 131 | 113 | 86 | 37 | 8 | 79,446 | 31.28 | 17.72 | 30.71 | 20.29 |
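Species | 1st position T(U) (%) | 1st position C (%) | 1st position A (%) | 1st position G (%) | 2nd position T(U) (%) | 2nd position C (%) | 2nd position A (%) | 2nd position G (%) | 3rd position T(U) (%) | 3rd position C (%) | 3rd position A (%) | 3rd position G (%) |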
N. grossedentata | 23.66 | 18.83 | 30.77 | 26.74 | 32.34 | 20.34 | 29.47 | 17.85 | 37.90 | 13.98 | 31.91 | 16.21 |
N. cantoniensis | 23.57 | 18.85 | 30.70 | 26.88 | 32.31 | 20.38 | 29.41 | 17.90 | 37.98 | 13.94 | 32.00 | 16.09 |
N. megalophylla | 23.58 | 18.83 | 30.71 | 26.87 | 32.30 | 20.38 | 29.42 | 17.90 | 37.95 | 13.93 | 32.00 | 16.11 |
Repeat and simple sequence repeat (SSR) analyses
In this paper, the repeat sequences of the three Nekemias species were analyzed with the online tools REPuter 1.0 and Tandem Repeats Finder v4.09. The detected repeats included forward, palindromic and tandem repeats; other repeat types were excluded from the analysis because they were absent or too few. The results are shown in Fig. 3a–d. The characteristics of simple sequence repeats (SSRs) were compared and analyzed with the online tool MISA, and the results are shown in Figure 4.
(Fig. 3. (a) Repeat types in the three cp genomes; (b) tandem repeats in the three cp genomes; (c) forward repeats in the three cp genomes; (d) palindromic repeats in the three cp genomes. The ordinate indicates the number of repeats, while the abscissa gives the name of the species. In (a), different colors indicate different repeat types; in (b–d), different colors indicate different lengths, as marked in the upper right corner.)
Fig. 3a shows that the number of forward and palindromic repeat sequences is 49 in all three Nekemias species, but N. grossedentata has more tandem repeats (72) than the other two species (69/70). Many of the forward and palindromic repeats were found in the IR regions of the quadripartite structure. Six of the forward and palindromic repeats came from the intronic region extending from the start of the IR region to the rpl23 spacer. In contrast, most of the tandem repeats were located in the LSC region. The longest tandem repeat was 79 bp and was located in the intergenic region between the rps19 and rpl2 genes in the IR region of all three species. Overall, the types and numbers of repeat sequences were similar among the three species, while the lengths of individual repeats differed slightly. Tandem repeats were generally concentrated in the range of 10–30 bp, whereas the lengths of forward and palindromic repeats were more scattered and did not differ significantly among species. However, N. cantoniensis had no repeat sequences longer than 65 bp.
Simple sequence repeats (SSRs) vary at high frequency within a species, so they are often used for genetic map construction, correction and mapping studies24. With the parameter settings used, 217 SSRs were observed in the chloroplast genome of N. grossedentata, comprising 140 mononucleotide, 58 dinucleotide, 6 trinucleotide, 10 tetranucleotide, 2 pentanucleotide and 1 hexanucleotide repeats. The number and types of SSRs in N. megalophylla were very similar to those of N. grossedentata, with only one additional mononucleotide repeat, whereas N. cantoniensis showed a slight quantitative difference. The SSR analysis thus showed that the chloroplast genomes of the three species are very similar and do not show an obvious level of variation. Mononucleotide repeats were the main SSR type in all three species, and A/T mononucleotide repeats were dominant. Repeats with motifs of three or more nucleotides were few in number but diverse in type.
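To illustrate how SSRs of this kind are typically detected, the sketch below scans a sequence for perfect repeats of 1–6 bp motifs with a regular expression. It is not the MISA implementation. The minimum repeat counts are hypothetical placeholders, because the parameter settings are not stated in this section, and the test sequence is invented.

```python
import re
from collections import Counter

# Hypothetical minimum repeat counts per motif length (NOT taken from the paper).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n))


def find_ssrs(seq: str, min_repeats: dict = None) -> list:
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs, 0-based start."""
    thresholds = MIN_REPEATS if min_repeats is None else min_repeats
    seq = seq.upper()
    hits = []
    for unit_len, min_count in sorted(thresholds.items()):
        # Zero-width lookahead, so a rejected match cannot hide a real SSR next to it.
        pattern = re.compile(r"(?=(([ACGT]{%d})\2{%d,}))" % (unit_len, min_count - 1))
        last_end = 0
        for match in pattern.finditer(seq):
            start, tract, motif = match.start(), match.group(1), match.group(2)
            if start < last_end or not is_primitive(motif):
                continue  # inside an accepted tract, or motif built from a shorter unit
            hits.append((start, motif, len(tract) // unit_len))
            last_end = start + len(tract)
    return sorted(hits)


# Invented test sequence: a 12 bp poly-A tract and six AT repeats.
toy_hits = find_ssrs("GC" + "A" * 12 + "GCGC" + "AT" * 6 + "CCG")
print(toy_hits)

# Tally reported in the text for N. grossedentata, keyed by motif length.
REPORTED = Counter({1: 140, 2: 58, 3: 6, 4: 10, 5: 2, 6: 1})
print(sum(REPORTED.values()))
```

The per-type counts reported for N. grossedentata sum to the stated total of 217. Different threshold choices would change the number of SSRs detected, so the counts from such a sketch are only comparable when the same settings are applied to every genome.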
IR expansion and contraction
Expansion and contraction of the IR region boundaries is the main cause of size changes in the cp genome and plays an important role in species evolution25. We compared the IR/SSC and IR/LSC junctions of the three species to determine whether the IR regions had expanded or contracted. Figure 5 shows that the rpl22 gene of N. grossedentata lies entirely within the LSC region at the JLB boundary, while the rpl22 genes of the other two species extend into the IRb region. The position of the ycf1 gene was identical in all three species. This gene is divided into two parts and spans three regions (IRa-SSC-IRb). ycf1 is the longest gene in the genome and its function is still unknown26, but some studies have shown that this gene family is indispensable in plants. It may therefore be a candidate for developing molecular markers for species identification. In addition, there were no significant differences between the chloroplast genomes of N. grossedentata and N. megalophylla, whereas the gene positions of N. cantoniensis were shifted to some extent, indicating evolutionary expansion at its boundaries.
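Whether a gene lies wholly on one side of a junction or extends across it can be expressed as the number of base pairs on each side. The following sketch assumes 1-based inclusive coordinates and a genome record that starts at the first base of the LSC, so that the JLB junction falls after the last LSC base (89,244 for N. grossedentata in Table 1). The gene coordinates are hypothetical and are not taken from the annotations.

```python
def junction_overlap(gene_start: int, gene_end: int, junction: int) -> tuple:
    """Return (bp upstream of the junction, bp downstream of it) for one gene.

    Coordinates are 1-based and inclusive; `junction` is the last base of the
    upstream region (for JLB, the last base of the LSC).
    """
    if gene_start > gene_end:
        raise ValueError("gene_start must not exceed gene_end")
    upstream = max(0, min(gene_end, junction) - gene_start + 1)
    downstream = max(0, gene_end - max(gene_start, junction + 1) + 1)
    return upstream, downstream


JLB = 89_244  # LSC length of N. grossedentata (Table 1), under the assumption above

# Hypothetical gene located entirely in the LSC.
inside_lsc = junction_overlap(88_900, 89_200, JLB)
# Hypothetical gene of the same length that extends into IRb.
spanning = junction_overlap(89_000, 89_300, JLB)
print(inside_lsc, spanning)
```

A gene with zero base pairs downstream of JLB corresponds to the pattern described for rpl22 in N. grossedentata, while a non-zero downstream value corresponds to a gene that extends into IRb.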
Sequence divergence analysis
In Fig. 6, the three complete cp genomes were compared with mVISTA using N. grossedentata as the reference. Similarity was relatively low in the conserved non-coding sequences (CNS), for example in the regions between trnR and atpA, rps19 and rpl2, and psbM and trnD, which indicates that non-coding regions accumulate substitutions faster than coding regions27. It also indicates that the coding regions of closely related species are highly conserved, with an extremely low rate of variation. Notably, the alignment shows that sequence similarity is generally greater than 90%, but the coding region of the ycf1 gene shows a slight degree of divergence. Although the function of this gene is unknown, gene knockout studies have shown that it is essential for plant cell survival28. To characterize the highly variable regions, nucleotide diversity (Pi) values were calculated using DnaSP v.6.10 (Fig. 8). Pi in intergenic regions ranged from 0 to 9.59%, with an average of 0.39%, about sixfold higher than in the CDS regions (0.06% on average). Three divergent intergenic loci (rps19-rpl2, rpl32-trnL-UAG and ccsA-ndhD) had Pi values exceeding 1%, while no gene in the CDS regions did. These three divergence hotspots could be used to develop molecular markers for phylogenetic and phylogeographic analyses, as well as for the identification of Nekemias species. Finally, according to relevant studies29, the phylogenetic relationships of closely related species are often inferred from the molecular evolution rate of their non-coding regions, so these results provide a theoretical basis for phylogenetic analysis and species identification.
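Nucleotide diversity (Pi) is the average number of nucleotide differences per site between pairs of aligned sequences. The sketch below computes it for a small invented alignment. Excluding columns that contain gaps or ambiguous bases is an assumption of this sketch, and the exact site handling in DnaSP may differ.

```python
from itertools import combinations


def nucleotide_diversity(aligned: list) -> float:
    """Average pairwise differences per site across aligned, equal-length sequences."""
    if len(aligned) < 2:
        raise ValueError("at least two sequences are required")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only columns in which every sequence has an unambiguous base.
    columns = [
        column
        for column in zip(*(seq.upper() for seq in aligned))
        if all(base in "ACGT" for base in column)
    ]
    if not columns:
        raise ValueError("no comparable sites in the alignment")
    pairs = list(combinations(range(len(aligned)), 2))
    differences = sum(column[i] != column[j] for column in columns for i, j in pairs)
    return differences / (len(pairs) * len(columns))


# Invented three-sequence alignment of one 10 bp region.
toy_alignment = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAT"]
toy_pi = nucleotide_diversity(toy_alignment)
print(f"{100 * toy_pi:.2f}%")
```

In the toy alignment the three pairwise comparisons differ at 1, 2 and 1 sites over 10 columns, giving Pi = 4/30. Applying the same calculation region by region gives per-locus values of the kind compared between intergenic and CDS regions in the text.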
Phylogenetic analysis
A phylogenetic tree was built from 29 cp genomes (18 Vitis species, 2 Tetrastigma species and 7 Nekemias species, together with two outgroups) using the maximum likelihood (ML) method. Most branches have a bootstrap value of 100%, and in the ML tree the three species studied here (marked in red) form a well-supported monophyletic group. The phylogenetic tree contained 27 branches with support greater than 94%, of which 24 had node values greater than 99%. N. grossedentata and N. megalophylla are sister groups clustered together on one terminal branch, which indicates a close genetic relationship. These results indicate that the three target species are not only closely related to each other, but also separated by a certain genetic distance from the other four species of the same genus. It is worth noting that another species of the same genus, N. chaffanjonii, was also encountered during the preparation of this study. Unfortunately, little genetic information or related research is available for this species, and it deserves more attention in future studies.
HPLC analysis of the biosynthesis related compounds
In terms of chemical constituents, HPLC fingerprints of the three species were obtained at a wavelength of 292 nm, and the common functional constituents were compared by superimposing the fingerprints on that of the control. As shown in Fig. 9, the active components of the three species are almost the same, but their contents differ. All three species showed obvious peaks at the position of the dihydromyricetin standard. The dihydromyricetin contents of N. grossedentata and N. megalophylla were close to each other and much higher than that of N. cantoniensis. Both taxifolin and myricetin were present in the three species, but the taxifolin content was generally low, while the myricetin content was much higher in N. cantoniensis than in the other two species. Overall, the compound compositions of N. grossedentata and N. megalophylla were very similar, with dihydromyricetin as their main functional compound. The compound composition of N. cantoniensis differed from the others, with myricitrin and myricetin as its main functional compounds. This result is consistent with the other results of this study and supports a close correlation between plant functional components and genetic evolution. It both adds to the biochemical information available for these species and further supports the identification of Nekemias species.