Chloroplast genome features of cultivated tea
Length of the whole genome of cultivated tea were slightly different for CSCA and CSCL, ranged from 157,025 bp to 157,085 bp. Protein-coding regions (CDS), intergenic spacer (IGS), intron regions showed the same results that are 30 bp, 28 bp, 2 bp difference in length, respectively. Both the two kinds of cultivated tea consist of 81 unique CDS genes, 30 tRNA, 4 rRNA and 3 pseudogenes (ycf1, ycf2 and ycf15). Among them, atpF, ndhA, ndhB, petB, petD, rpl2, rpl16, rpoC1, rps16, trnG-GCC, trnI-GAU, trnL-UAA, trnV-UAC contained a single intron, clpP and ycf3 contained two introns (Fig. 1).
Comparison of chloroplast genomes between cultivated tea and wild tea
Chloroplast genomic similarity
We used CSVS as a reference sequence and compared it with cpDNA genomes of two cultivated species and 13 wild relatives (Table 1). Compared with the genome length of CSVS, the average length variation of cultivated species and wild species are 62 bp and 186 bp respectively. Three of the wild species, named C. pitardii, C. cuspidata, C. yunnanensis, have a 500 bp length variation compared with CSVS. Similarly, the number of genes and GC content of cultivated species were more stable than that of wild species. Comparing the gene and intron deletion between cultivated and wild species, the results showed that the introns of rps12 gene were deleted in all cultivated species and some wild species. The orf42 and ycf1 genes of some wild species were deleted, but these two genes are common in cultivated species. The differences of GC content of intron and IGS in cultivated species were about 0.01% to 0.03%, but we found that the differences of intron and IGS in wild species were 0% to 1.5% and 0% to 0.46%, respectively. mVISTA and BRIG were used to compare the genomic sequence identity. The results showed that the cp genomes of these 16 species were highly similar at the genomic level, indicating that their genomes are relatively conservative. The regions with relatively low identity were atpH_atpI, trnE-UCC_trnT-GGU, psaA_ycf3, ycf15_trnL-CAA, ycf1_ndhF and ndhG_ndhI (Figs. 2 & 3). In conclusion, at the genomic level, cultivated species were more conservative than wild species.
The expansion and contraction of IR regions
The locations of IRs regions were extracted via a self-BLASTN search, and the characteristics of IR/LSC and IR/SSC boundary regions were analyzed. The IRs boundary regions of the 16 complete Camellia cp genomes were compared, showing slight differences in junction positions (Fig. 4). In order to detect possible IR border polymorphism, first of all, we compared four IR-SC boundaries among cultivated tea and CSVS cp genome, but no difference was found at the LSC-IRb, IRa-LSC border. At the IRb-SSC, SSC-IRa border, only minor differences were discovered. Then, we compared the cp genome boundaries of wild tea and CSVS. The rps19 gene at the LSC-IRb boundary expanded 52 bp from the LSC region to the IRb side in C. sinensis var. pubilimba, while it stopped at the 46 bp from the LSC region in the rest of the species. On the other side of the IRa-LSC boundary, the lengths of the spacers between the IRa-LSC junction and the rpl2 gene (in IRa) were 112 bp for C. sinensis var. pubilimba while those of the rest species were all 106 bp. Consistently, in all of the comparative cp genomes, the ycf1 gene spanned the SSC-IRb region and the length of ycf1 were from 963bp to 1069bp in IRa. Remarkably, most species have created the ycf1 pseudogene at the IRa-LSC junction while it was not observed in C. sinensis var. assamica, C. taliensis, C. impressinervis, C. pitardii, C. crapnelliana, C. cuspidate, C. yunnanensis. Like most plants, the ndhF gene involved in photosynthesis was located in the SSC region. However, ndhF gene was located at the IRb-SSC boundary of C. reticulate, and there was a 35 bp overlap between ndhF and ψycf1.
Nucleotide diversity
Comparison based on the Pi values of the sixteen species’ cp genomes were presented, including the intergeneric regions (IGS), protein-coding genes and introns (Table S1, Fig. 5). The results showed that the Pi values of the genes, introns and IGS in wild species were higher than cultivated species. The average values for the genes, introns and IGS in wild species were about 16.6, 3.5 and 9.1 times to that of cultivated species.
The genes and IGS in wild species with higher values, such as psbB and psbD, while Pi values were 0 in cultivated species. These results suggest that these genes and IGS were more conserved in cultivated species than in wild species.
The Pi values of genes and IGS, such as psbB, psbD, trnI-CAU_ycf2, trnI-GAU_rrn16, and trnI-CAU_rpl23, were higher of wild species, but they were 0 in cultivated species. Furthermore, except for ndhD, ndhF, ndhH and psbC, the Pi values of the photosynthetic genes of cultivated tea were 0. The Pi values of these genes were smaller than that of wild species. These results indicate that these genes and IGS were more conserved of cultivated species than of wild species. Although the Pi values of cultivated species were less, we still found that the Pi values of rps16, rps4, trnL-UAA_intron, rps4_trnT-UGU, ndhC_trnV-UAC, cemA_petA, rpl33_rps18, psbN_psbH, rpl36_infA, rpl14_rpl16, rps7_rps12, ndhG_ndhI, trnV-GAC_rps12, and rps12_rps7 in cultivated species were higher than that in wild species, and these difference sequences were mainly located in LSC region (Fig. 5).
Phylogenetic analysis of cultivated tea and wild tea
We constructed three phylogenetic trees of cultivated and wild tea, namely, the complete cp genomic tree (complete cp-Tree), all shared coding protein genes among all species tree (SCDS-Tree) and the ycf1 gene tree (ycf1-Tree) (Fig. 6-8). All phylogenetic trees strongly supported that Thea subgenus could be divided into two clades: clade Ⅰ, including CSVS, CSCL, CSCA, CSVA, CGR, CPU and CSVP, and clade Ⅱ, including CIM and CTA. The clade I was strongly supported, because the posterior probabilities or bootstrap values obtained by the neighbor-joining (NJ), the maximum parsimony (MP), the Bayesian inference (BI) and the maximum likelihood (ML) were very high for each lineage. These results suggested that the evolution direction of seven species in the clade I was the same, but not the same as two species in the clade II. All phylogenetic trees proved that CSVS was the closest relative species to CSCA and CSCL, and the three species were in the same branch. In particular, in ycf1-Tree, the posterior probabilities or bootstrap values of CSVS, CSCL and CSCL were lower than that of complete cp-Tree and SCDS-Tree. And the value of CSCL was less than 50%. These results suggested that the ycf1 gene has evolved in cultivated tea.
However, we found conflict in three trees (Fig. 6-8). The topological structures consisted of Protocamella subgenus (CPE and CYU), Camellia subgenus (CPI, CRE, CAZ and CCR), Metacamellia subgenus (CCU) and Thea subgenus (CIM and CTA) were poorly supported by the complete cp-Tree, SCDS-Tree and ycf1-Tree. Because most bootstrap values or posterior probabilities were less than 50% for each lineage. These results may be caused by unbalanced sampling.
The complete cp-Tree showed some structural variations in Camellia cp genomes (Fig. 6). The clade, which made up of CSVS, CSCL, CSCA, CSVA, CGR, CPU, CSVP and CPE, was reported the loss of rps12 intron, the pseudo ycf1 gene, and the pseudo ycf2 gene (except for CSVA). The other species, except for CRE and CAZ, was reported the loss of ycf1 gene and orf42 gene.
Chloroplast genome variation and evolution in cultivated tea
In order to explain the changes of cp genome structure of cultivated tea group, we took CSVS, CSCA and CSCL as objects to research SNPs and indels in the cp genome of cultivated tea. After comparing the whole cp genome of three species, 67 SNPs and 46 indels were found. The LSC, IRb, SSC and IRa regions contain 43, 3, 13, 8 SNPs and 37, 2, 5, 2 indels, respectively (Table S2). Most of SNPs and indels were located in the non-coding protein region (IGS and intron). There were 39 SNPs and 41 Indels in this region, while 28 SNPs and 5 Indels are found in the coding protein region. The two ycf1 genes, which located in the junction of SSC and IRa, contained the most SNPs and indels, with the number of 6 and 2 respectively. For photosynthetic genes, psbC, ndhD, ndhF and ndhH presented SNPs variations, while the psbI gene presented indel variation. For the 14 sequences with higher Pi value in cultivated species than in wild species, trnV-GAC_rps12 and ndhG_ndhI contained the most abundant SNPs, with 5 and 2 respectively (Fig. 5).
In order to have a clear view of the evolution of cultivated species, we used their 80 shared coding protein genes to calculated Ka, Ks and Ka/Ks ratio. The results showed that only 16 coding protein genes had synonymous or non-synonymous mutations (Fig. 9, Table S3). Among them, there are non-synonymous mutations in matK, rps16, rpoC2, rpoB, accD, clpP, rps8, ycf1, ndhD, ndhH and rps15. The genes with the highest rate of non-synonymous mutations are rps16, rps8 and rps15. There are synonymous mutations in rpoB, psbC, rps4, ycf4, rpoA and ndhF. The highest mutation rates are rps4, ycf4 and rpoA. Of the 80 genes, 79 have a Ka / Ks value of 0, and only rpoB, has a Ka/Ks value of 0.3004< 0.5, suggesting very strong purifying selective pressure.
The site specific selection events of 16 genes with synonymous or non-synonymous mutations were analyzed by Bayesian Empirical Bayes (BEB), and found some amino acid sites of ycf1 and rps15 exhibited site-specific selection (Table S4). In ycf1, there were six sites under position selection, and in rps15, there was one site under position selection. For example, in rps15 gene, the codon ACC (threonine) of CSVS mutated to AAC (asparagine) of two cultivated species.