General features of the plastid genome
Using Illumina HiSeq sequencing platforms, 5.38-5.89 Gb of clean data from each Pilea species were obtained, with the number of clean reads ranging from 17,935,118 to 19,627,967 (Additional File 1: Table S1). The plastome was assembled based on these data. The 4 plastomes of Pilea are characterized by a typical circular DNA molecule with a length ranging from 150,398 to 152,327 bp. They all have a conservative quartile structure composed of a large single copy (LSC) region (82,063 to 83,292 bp), a small single copy (SSC) region (17,487 to 18,363 bp) and a pair of inverted repeat (IR) regions (25,180 to 25,356 bp) (Table 1). The lengths of the plastomes are conserved in this genus. GC content analysis showed that the overall GC contents ranged from 36.35% to 36.69% in the 4 plastomes. Note that the GC contents in the IR regions (42.56%-42.73%) are significantly higher than those in the LSC (33.87%-34.36%) and SSC regions (30.01%-30.81%). The raw sequencing data and the four genome sequences have been deposited into the NCBI database (accession numbers: PRJNA675740 and MT726015 to MT726018).
Genome annotation
The plastid genomes of the four Pilea species all comprise 113 unique genes, including 79 protein-coding genes, 4 rRNA genes and 30 tRNA genes (Additional File 1: Table S2). The gene order and gene numbers of these four species are highly similar, showing conserved genomic structures. Figure 1 shows the schematic diagram of the plastome of Pilea. Introns play a significant role in selective gene splicing [23]. Among the 79 protein-coding genes annotated, nine unique genes (rps16, rpoC1, atpF, petB, petD, rpl16, rpl2, ndhB, ndhA) contained one intron, and two unique genes (ycf3, clpP) contained two introns. Moreover, six unique tRNA genes (trnK-UUU, trnG-UCC, trnL-UAA, trnV-UAC, trnI-GAU, trnA-UGC) contain one intron. There are seven protein-coding genes, four rRNA genes, and seven tRNA genes completely duplicated in the IR regions, so they exist as two copies. The rps12 gene is a trans-spliced gene, and the 5’ end is located in the LSC region. However, the 3’ ends are found in the IRa and IRb regions. These results are similar to those in other species in Urticaceae [22].
Repeat analysis
SSRs, also referred to as microsatellite sequences, provide a large amount of genetic information [24-26]. Because of its high genetic polymorphism, SSRs are often used for the development of molecular markers and play an important role in the identification of species [27, 28]. In this study, we detected 68, 75, 71, and 80 SSRs in the 4 analyzed species (Fig. 2A, Additional File 1: Table S3). Most SSRs are mononucleotide homopolymers, particularly A/T, which accounts for 70.75% of the total. Hexanucleotide repeats are detected only in P. mollis. These SSRs showed high polymorphism, suggesting great potential in the identification of Pilea species.
In the plastid genomes of Pilea species, we detected four types of interspersed repeats. Most of them are forward repeats and palindromic repeats (Fig. 2B). By contrast, there are only two reverse repeats and one complementary repeat. The only complementary repeats were found in P. peperomioides. The detailed sequences showed in Additional File 1: Table S4. Moreover, the length of these short interspersed repeats mainly ranged from 30 to 34 bp. We note one forward repeat with a length of 102 bp (detected in P. serpyllacea).
Contraction and expansion analysis of IR regions
The contraction and expansion of IR regions are considered to be an important reason for the length diversity in plastid genomes [29]. In addition, with the expansion/contraction of the IR regions, genes near the border have an opportunity to access IR or SC regions [30]. We retrieved the published plastomes of six species from Urticaceae and compared them with those of the four Pilea species. We found several genes spanning or near the boundary of the IR and SC regions. They include mainly rps19, rpl22, rpl2, ycf1, ndhF and trnH (Fig. 3). Notably, an abnormal expansion of IR regions was observed in Gonostegia hirta. The IR regions are more than 30,000 bp in G. hirta, and more genes can access the IR regions (e.g., rpl36 and rps19). However, the length of IR regions in the other nine species is approximately 25,000 bp, and the rps19 gene spans the LSC/IRb boundary, except in Droguetia iners and Hesperocnide tenella; the rps19 gene in the former is in the LSC region, while that in latter is completely in the IR region. In addition, the trnH gene completely accesses IR regions in H. tenella, obtaining two copies. It can be seen that the genomic structure, gene order and numbers of some species in Urticaceae have changed obviously.
Furthermore, the ycf1 gene crosses the SSC/IRa boundary, most of which is located in the SSC region. The length of the ycf1 gene in the four Pilea species varies widely, indicating the possibility of sequence differences. Surprisingly, we annotated two copies of ycf1 in the four Pilea plants; they cross the IRb/SSC boundary and are not annotated in other species. Sequence alignment found that the two copies of ycf1 exist in other taxa, indicating that the previous annotation is imperfect, although one of the two copies is a fragment of ycf1 and is generally considered to be a pseudogene. Interestingly, a small fraction of the ndhF gene (less than 100 bp) crosses the IRb/SSC regions, which means that the first copy of ycf1 has an overlap with ndhF in Pilea species. The overlapping areas are 108 bp in length.
Genomic divergence
To evaluate the genomic divergence, sequence identity analysis based on mVISTA [31] was performed among the 4 Pilea species, with the reference being the plastome of P. peperomioides. We observed varying degrees of sequence divergence, especially in the LSC and SSC regions. In contrast, the IR regions were more conserved. Most of these highly variable regions were observed in conserved noncoding sequences (CNS) (Fig. 4). However, the regions with the greatest sequence divergence were found in protein-coding regions, in which the gene ycf1 is present. The coding regions of ycf1 in the four Pilea species showed significant differences, and the similarity was even less than 50% for some fragments. Overall, the analyzed genomic sequences showed rather high levels of sequence divergence throughout the genus Pilea.
To quantify the levels of DNA polymorphism, the 4 genomes were aligned and analyzed by using DnaSP v6.0 [32]. We detected 8 hypervariable regions, with Pi values exceeding 0.06 (Fig. 5), petN-psbM (Pi = 0.06067); psbZ-trnG-GCC (Pi = 0.07067); trnT-UGU-trnL-UAA (Pi = 0.06433); accD-psbI (Pi = 0.06003); ndhF-rpl32 (Pi = 0.06100); rpl32-trnL-UAG (Pi = 0.06800); ndhA-intron (Pi = 0.06533), and most regions of the gene ycf1 (Pi values ranging from 0.07367 to 0.17067). The Pi values are listed in parentheses. Notably, most regions of the plastome sequences had Pi values greater than 0.02 (except for IR regions), exhibiting abundant polymorphism of the plastid genome in Pilea.
Nucleotide variations in protein-coding genes
The protein-coding regions are highly conserved in plastid genomes [33]. We analyzed the protein-coding sequences of 79 identified unique orthologous genes in 4 Pilea taxa. Surprisingly, these protein-coding genes also showed high levels of variation (Fig. 6A, Additional File 1: Table S5). Of the 79 shared genes, 63 had a mutation rate of more than 2%, and 30 had a mutation rate of more than 4%. The gene with the highest mutation rates was ycf1 (16.62%), followed by matK (10.54%), ccsA (8.74%) and rps15 (8.42%). Only two genes (psbJ and psbL) showed extreme conservation without any variable sites. Moreover, we observed a total of 11 genes (ycf1, ndhF, rps19, accD, rpoC2, rps16, rpoA, rpl20, ndhD, rpoC1 and ycf2) with InDels in nucleotide sequences by using DnaSP v6.0 [32]. Among these, ycf1 had 35 InDels, followed by ycf2 (9), accD (4) and rpoC2 (3). Considering that the protein-coding regions are highly conserved, protein-coding sequences with high nucleotide mutation rates are usually infrequent in the same genus, and these results showed interspecific diversity within the plastid genome of Pilea.
In this study, synonymous (dS) and nonsynonymous (dN) substitution rates, along with dN/dS, were estimated for the 79 shared genes in parallel by using PAML v4.9 [34]. Among the 79 genes, ycf1, matK, ccsA and rps15 had relatively high dN values, and rps16, rpl32, ndhF and psaJ had relatively high dS values (Fig. 6B, Additional file 1: Table S6). Most genes exhibited considerably low dN/dS values (less than 0.6), implying that most of the protein-coding genes were under purifying selection during evolution. However, the dN/dS ratio of three genes (rpl36, clpP and accD) was between 0.6 and 1.0. Moreover, the dN/dS ratio was greater than 1.0 for petL, rps12, ycf1 and ycf2, indicating that they were under positive selection during evolution. These results clearly indicated that the plastid genes in the different species of Pilea may have been subjected to different selection pressures.
Phylogenetic analysis
In this study, we constructed maximum likelihood (ML) trees by using the complete plastome sequences as data sets (detailed materials are shown in Additional File 1: Table S7). The phylogenetic tree has high bootstrap support in all nodes, showing the reliability of the phylogeny recovered (Fig. 7).
Our phylogenetic tree displayed two clades clearly and then further diversified into four subclades with 100% bootstrap support (ML). These four subclades correspond to four subfamilies: Boehmerioideae, Cecropioideae, Lecanthoideae and Urticoideae. This is consistent with the traditional classification [5]. All 4 Pilea species clustered together (all nodes have BS = 100 for the ML method) and formed a monophyletic group that is a sister group to Elatostema. They all belong to the subfamily Lecanthoideae.