The basic structural characteristics of the cp genome of seven species of Nymphaea
After high-throughput sequencing, raw data was obtained. Low quality reads were removed, leaving the remaining paired-end reads of seven Nymphaea species. The paired-end reads ranged from 19,642,319 reads for N. odorata to 26,112,273 reads for N. micrantha. After performing de novo genome sequencing and assembly, seven complete cp genome sequences for N. odorata, N. rubra, N. gigantea, N. potamophila, N. colorata, N. tetragona, and N. micrantha were obtained (Fig 1; Table 1) and submitted to GenBank under the following accession numbers: MT107636, MT107632, MT107637, MT107633, MT107631, MT107634, and MT107635, respectively.
These seven novel cp genome sequences had the classical quadripartite structure that contained one LSC, one SSC, and two IR (IRa and IRb) regions (Fig 1). The cp genomes of seven species of Nymphaea had different sizes, among which the largest was N. gigantea at 160,179 bp long and the smallest was N. potamophila at 159,232 bp (Table 1). The length of the LSC ranged from 89,450–90,266 bp, SSC ranged from 19,340–1,965 bp, and IR ranged from 25,163–25,232 bp (Table 1). Therefore, the variation of the length of the LSC is greater than that of SSC and IR, and the difference in genome length was mainly due to the varying lengths of the LSC and SSC. The GC content of the seven Nymphaea cp genomes was ~39% (Table 1); the GC contents in IR (~43%) of these seven Nymphaea species were higher than that of LSC and SSC regions (~37% and 34%).
A total of 126–129 genes were identified in the cp genome of seven Nymphaea species, with 126, 127, and 128 in N. odorata, N. rubra, and N. tetragona, respectively, and 129 in the other four species (Table 1). Most of the genes are single-copy genes, but the six protein coding genes (rps12, rpl23, rpl2, ycf2, rps7, and ndhB), seven tRNA genes (trnI-CAU, trnL-CAA, trnV-GAC, trnI-GAU, trnA-UGC, trnR-ACG, and trnN-GUU) and four rRNA genes (rrn16s, rrn23s, rrn4.5s, and rrn5s) in the IR region contain two copies (Table 2). In the cp genome of the seven Nymphaea species, the number of rRNAs and tRNAs was the most conservative, with eight and 37, respectively (Table 1). There were 81–84 protein coding genes (Table 1; Table 2), among which ycf1 was missing in N. tetragona, ycf4 and atpF were missing in N. rubra. ndhF, rpoc2, and ycf1 were missing in N. odorata.
The IR and SC boundaries of the cp genomes of the seven Nymphaea species were compared and analyzed. As shown in Figure 2, the LSC/IRb boundaries of the seven Nymphaea species were all within the rpl2 gene, and the length of the overlapping region between the boundary and the rpl2 gene was 15–39 bp. The IRb/SSC boundaries of the six species were between trnN and ndhF, and the length of the ndhF gene entering SSC was 49–80 bp. However, the SSC region of N. odorata lacked an ndhF gene and there was a 522 bp gap between the boundary and the trnN gene in the IRb region. The SSC/IRa boundaries of five species were all located in ycf1gene, the length of the gene in the SSC region was between 5,659–5,719 bp, and was between 155–167 bp in the IRa region. No ycf1 gene was found in the SSC regions of N. odorata or N. tetragona, and there were 521 bp and 545 bp gaps between the trnN gene in the border and IRa regions, respectively.
Codon preference
The codon preferences of the cp genome of seven Nymphaea species were analyzed, and there was no significant difference among the genomes. N. colorata was used as an example to determine the codon preference of the cp genome of Nymphaea. A total of 25,995 codons were detected in 79 protein coding genes of the N. colorata cp genome and their RSCU values were calculated (Additional file 1). Among these codons, 2,636 were used to encode leucine (Leu), accounting for 10.1% of all codons, with UUA being the most commonly used. The content of codons used to code Tryptophan (Trp) was the lowest, 447 in total, accounting for 1.7% of all codons. The RSCU values of 32 codons were greater than 1, of which 28 ended with A/U, indicating that the codon ending preference was A/U. The RSCU of tryptophan (Trp) was equal to 1, indicating that Trp has no codon preference.
Analysis of dispersed repeats and simple repeat sequences (SSRs)
The dispersed repeats of the cp genome include forward, reverse, palindrome, and complementary types. Using Vmatch v2.3.0 software, the dispersed repeat sequences of seven Nymphaea species were identified. For N. odorata, N. rubra, N. gigantea, N. potamophila, N. colorata, N. micrantha, and N. tetragona, 163, 147, 151, 158, 153, 155, and 168 repeat sequences, respectively, with a length of 15–19 bp were detected in the cp genome and were distributed in the coding region, the gene interval region, and the gene intron region (Table 3). There was little difference in the total number of dispersed repeats of seven Nymphaea species, and the length of the sequences was concentrated between 15–19 bp, accounting for 83.4–88.0% of the total dispersed repeats (Table 3).
A total of 979 SSRs had been identified in the cp genome of seven Nymphaea species (Table 3). Among these SSRs, the most abundant was trinucleotide SSR, accounting for 48.2% (472), followed by mononucleotide SSR (44.7%; 438), dinucleotide SSR (3.3%; 32), tetranucleotide SSR (2.9%; 28), pentanucleotide SSR (1.1%; 11), and hexanucleotide SSR (0.1%; 1) (Table 3). Pentanucleotide SSRs were distributed in N. gigantea (AAGAA, ATTTA, TATAT, and TTATA), N. micrantha (CATAA), N. odorata (ATACC), N. rubra (AAAGG and TTCCT), and N. tetragona (ATACC, TATTA, and TTAGC); a hexanucleotide SSR was only distributed in N. gigantea (ATTTAT) (Table 3). Most of the SSRs of the seven species were located in the LSC (56.1–59.7%), followed by the SSC (20.3–21.6%) and the IR (18.8–21.9%).
Collinearity, KaKs, Pi, SNP, and indel analysis
Although the cp genome of plants is conservative, there will be rearrangement over the course of evolution. In this study, we selected the cp genome of N. colorata as the reference sequence for comparison with the other six cp genomes of Nymphaea. The results showed that there was no reverse rearrangement in the cp genomes of the seven species of Nymphaea, indicating a collinear relationship (Fig 3). In addition, the sequence differences of 105–115 kb of the cp genome of the seven species of Nymphaea are large (Fig 3); therefore, it can be used to develop molecular markers for future genetic analyses of Nymphaea.
The synonymous substitution rate (Ka), missense substitution rate (Ks), and Ka/Ks of seven species of Nymphaea were calculated by using the cp genome of N. colorata as the reference sequence (Additional file 2). Most of the cp genes have low Ka and Ks, indicating the conservation of the evolutionary process of the cp genome. Ka/Ks < 1 of most genes indicate that they may undergoing purifying selection, i.e., the mutation is harmful and eliminated in the population. Among the genomes, the Ka/KS of ycf2 in the six comparison groups, N. colorata vs. N. gigantea, N. colorata vs. N. micrantha, N. colorata vs. N. odorata, N. colorata vs. N. potamophila, N. colorata vs. N. rubra, and N. colorata vs. N. tetragona were all greater than 1, indicating that ycf2 was subject to positive selection effect.
The Pi values of eight genes (petN, petG, psaI,psbF,psbL,psbM,psbN,and rps18) in the LSC region, all tRNAs, the psaC gene in SSc region, and five genes (rrn4.5s,rrn5s,rrn16s,rps7,and rpl23) in the IR region were 0 (Additional file 3), indicating that these genes were relatively conserved in the seven Nymphaea species. Some genes had relatively high Pi values, for example, rpl32, rps19, rpl20, and ycf1 had Pi values of 0.012579, 0.011947, 0.009195, and 0.006307, respectively, indicating that the greater the number of polymorphisms in these genes in Nymphaea, the more abundant their genetic diversity. These regions of rpl32–ndhF, rps19–psbN, and atpB–ndhJ in the cp genome of Nymphaea were highly variable, indicating that they may be potential molecular markers (Fig 4).
With N. colorata as the reference sequence, 8,328 SNPs and 1,579 insertions/deletions (indels) were obtained by global genome alignment using MAFFT software (Additional file 4; Additional file 5). SNPs were found in both coding and non-coding regions of cp genome. However, there was a relatively large number of SNPs in some highly variable regions located in rpoA–rpl20, rbcL–ndhC, ndhD–ndhF, and trnN-GUU–ndhA (Fig 4). These regions may represent differentiation hotspots of cp genome of Nymphaea. Moreover, 64.7% of SNPs (5,391) were located in the intergenic region; the remaining SNPs were located in 84 cp genes (Additional file 4). In addition, some genes, such as matK, ndhA, ndhF, rpoC1, rpoc2, rpoA, rpoB, accD, atpA, ycf1, and ycf2, all showed a high density of SNPs (Fig 4; Additional file 4). We obtained 1,579 indels mutations, 93.3% of which were located in the gene interval and only 6.7% (106) of them were located in the protein coding genes (Additional file 5). These 106 indels were distributed in matK, rps11, rpl2, rpl22, rpl32, rpoC2, petB, ndhF, ndhJ, accD, atpE, ycf1, ycf2, and ycf4 genes.
Phylogenetic analysis
Nymphaeaceae was divided into five main branches, Brasenia, Nelumbo, Nuphar, Nymphaea, and Cabomba, but Euryale and Victoria were closely related to Nymphaea (Fig 5). Nymphaea can be further divided into five subbranches, corresponding to the five Nymphaea subgenera (Fig 5). The first branch consists of N. gigantea from the subgenus Anecphya; the second subfamily consists of N. colorata, N. micrantha, N. capensis, and N. ampla which compose subgenus Brachyceras. The third branch is the subgenus Nymphaea, which is composed of N. mexicana, N. odorata, N. alba var. rubra, N. alba, and N. tetragona; N. potamophila and N. jamesoniana compose the fourth branch, subgenus Hydrocallis. The fifth branch is the subgenus Lotos, which contains of N. lotus and N. rubra.