3.1 Chloroplast genome sequencing and features of Atractylodes species
Sequencing of the six Atractylodes species yielded between 10,016,902 and 44,594,826 raw reads, with average coverage ranging from 67X to 1431X (Table 1). The six complete chloroplast genome sequences were deposited in GenBank under accession numbers MT834519 to MT834524. The total chloroplast genome size ranges from 152,294 bp (A. carlinoides) to 153,261 bp (A. macrocephala). The Atractylodes chloroplast genome has a typical quadripartite structure, comprising a pair of IR regions (25,132 bp - 25,153 bp), an LSC region (83,359 bp - 84,281 bp) and an SSC region (18,634 bp - 18,707 bp). The average GC content is 37.7% in the whole chloroplast genome, 43.2% in the IR regions, 35.8%-35.9% in the LSC region and 31.4%-31.6% in the SSC region, with almost no differences among the six Atractylodes chloroplast genomes.
The chloroplast genome of Atractylodes contains 113 genes, including 79 protein-coding genes, 30 transfer RNA genes and four ribosomal RNA genes (Figure 2, Table 2). Six protein-coding genes (ndhB, rpl23, rps7, rps12, ycf2 and rpl2), seven tRNA genes (trnI-CAU, trnL-CAA, trnV-GAC, trnI-GAU, trnA-UGC, trnR-ACG and trnN-GUU) and all four rRNA genes are duplicated in the IR regions. Fourteen genes (atpF, rpoC1, ndhB, petB, rpl2, ndhA, rps12, rps16, trnA-UGC, trnI-GAU, trnK-UUU, trnL-UAA, trnG-GCC and trnV-UAC) contain a single intron, and two genes (clpP and ycf3) contain two introns. The rps12 gene is trans-spliced, with its 5’ end located in the LSC region and its 3’ end located in the IR region. The trnK-UUU gene has the largest intron, which contains the matK gene.
3.2 Indels
A total of 114 indels were identified among the six Atractylodes chloroplast genomes, comprising 30 SSR-related indels (26.3%) and 84 non-SSR-related indels (73.7%). Of these, 74.6% are located in 42 intergenic spacer regions, 18.4% in introns and 7.0% in exons (Figure 3A, Table S2). The trnT-trnL spacer contains six indels; the trnE-rpoB and ndhC-trnM spacers and the ycf1 gene each contain five, followed by the rpl32-ndhF and trnL-rpl32 spacers with four each.
All SSR-related indels are 1 bp in size except one in the ndhB-trnL region, which is 6 bp long. Most SSR-related indels (28 of 30) involve A/T-type SSRs, and all of them are located in non-coding regions.
The non-SSR-related indels range in size from 1 to 971 bp, with 1-bp indels being the most common (Figure 3B). The largest indel (971 bp), located in the ndhC-trnM spacer, is a deletion in A. carlinoides. The second largest indel (30 bp) is in the exon of ycf1; it is a deletion in A. lancea and an insertion in A. coreana. Most non-SSR-related indels are located in non-coding regions (91.67%), with 73.81% in intergenic spacers and 17.86% in introns.
3.3 SSRs
A total of 265 SSRs were detected in the chloroplast genomes of the six Atractylodes species using GMATA. The number of SSRs per genome ranges from 42 (A. carlinoides) to 47 (A. lancea). SSRs are distributed randomly across the chloroplast genome. By region, 210 SSRs are in the LSC, 28 in the SSC and 27 in the IR regions; by location type, 149 are in spacers, 33 in introns and 83 in exons. Within individual genomes, the majority of SSRs were detected in the LSC (ranging from 75.0% in A. lancea to 83.7% in A. japonica) and in spacers (ranging from 54.5% in A. lancea to 59.1% in A. macrocephala) (Figure 3A). The most common SSRs are mononucleotide repeats (71%), followed by tetranucleotide repeats (14%) and dinucleotide repeats (7%) (Figure 4B). Nearly all mononucleotide SSRs (99%) are composed of A or T in all six species. The dinucleotide repeat TA and the tetranucleotide repeat TTTC are the second most common SSRs (Figure 4C).
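To make the notion of an SSR concrete, the following Python sketch scans a sequence for perfect tandem repeats with a regular expression. It is an illustration only and does not reproduce GMATA: the function name find_ssrs and the minimum repeat counts in MIN_REPEATS are assumptions chosen for the example, not the thresholds used in this study.

```python
import re

# Hypothetical minimum repeat counts per motif length (assumed for illustration,
# not the GMATA settings used in the study).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    start is a 0-based position in seq.
    """
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A motif of length k followed by at least n - 1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AA" or "TATA"), so each locus is reported once.
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


# Toy sequence containing one A-type, one TA-type and one TTTC-type repeat.
toy = "GC" + "A" * 12 + "GCGC" + "TA" * 6 + "GG" + "TTTC" * 3 + "G"
ssrs = find_ssrs(toy)
```

Classifying the motif length of each hit (1, 2, 4, ...) and tallying by genomic region would give counts of the kind summarized above, subject to the chosen thresholds.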
3.4 Sequence divergence and hotspots
A comparative analysis of the six Atractylodes chloroplast genomes was performed with mVISTA to determine the level of divergence (Figure 5). The results indicate high sequence similarity across the chloroplast genome, suggesting that the chloroplast genomes are highly conserved. The IR regions and the coding regions are more conserved than the single-copy regions and the non-coding regions. The coding regions of the clpP, ycf1 and rps19 genes are more variable than those of other genes.
Additionally, we compared single nucleotide substitutions and nucleotide diversity in the whole chloroplast genome and in the LSC, SSC and IR regions (Table 3). The alignment of the six Atractylodes chloroplast genomes spans 153,560 bp and contains 445 variable sites (0.29%) and 31 parsimony-informative sites (0.02%). The average nucleotide diversity is 0.001. The IR regions have the lowest nucleotide diversity (0.0003) and the SSC region has the highest (0.0018).
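As an illustration of how such site counts are defined, the sketch below tallies variable and parsimony-informative columns in a toy alignment. It assumes the usual definitions (a variable site has at least two different nucleotides; a parsimony-informative site has at least two nucleotides that each occur in at least two sequences) and ignores gaps and ambiguous characters, which is a simplifying assumption. The function name site_summary is hypothetical and this is not the software used in the study.

```python
from collections import Counter


def site_summary(alignment):
    """Return (variable, parsimony_informative) column counts for aligned sequences."""
    if len({len(s) for s in alignment}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    variable = informative = 0
    for column in zip(*alignment):
        # Count only unambiguous nucleotides; gaps and N are ignored (assumption).
        counts = Counter(b for b in (c.upper() for c in column) if b in "ACGT")
        if len(counts) >= 2:
            variable += 1
            # Informative: at least two states, each seen in two or more sequences.
            if sum(1 for n in counts.values() if n >= 2) >= 2:
                informative += 1
    return variable, informative


toy_alignment = ["ACGTA", "ACGTA", "ATGCA", "ATGCT", "AT-CA"]
n_variable, n_informative = site_summary(toy_alignment)

# Percentages reported in the text, recomputed from the stated counts.
pct_variable = round(445 / 153560 * 100, 2)
pct_informative = round(31 / 153560 * 100, 2)
```

In the toy alignment, columns 2 and 4 are parsimony-informative, while column 5 is variable but not informative because the T occurs in only one sequence.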
Nucleotide diversity (Pi) was calculated with DnaSP to identify mutation hotspot regions across the whole Atractylodes chloroplast genome (Figure 6). Pi values within 600-bp windows vary from 0 to 0.00656 in group A and from 0 to 0.00633 in group B. In the group A dataset, the rpl22-rps19-rpl2 region has the highest Pi value (Pi = 0.00656), followed by three spacer regions (Pi > 0.005): psbM-trnD, trnR-trnT(GGU) and trnT(UGU)-trnL; all of these regions are located in the LSC. Group B shows lower diversity overall, but the rpl22-rps19-rpl2 region still has the highest diversity. The variability of the four identified mutation hotspot regions was tested together with that of three universal chloroplast DNA barcodes (matK, rbcL and trnH-psbA). The universal DNA barcodes had lower variability than the newly identified markers.
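The sliding-window calculation can be sketched as follows, taking Pi as the average proportion of differing sites over all pairs of sequences. This is a minimal illustration and not DnaSP itself: the function names, the step size of 200 bp and the pairwise exclusion of gapped positions are assumptions made for the example; only the 600-bp window length comes from the text.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average pairwise proportion of differing sites (Pi) for aligned sequences."""
    seqs = [s.upper() for s in seqs]
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("at least two sequences are required")
    total = 0.0
    for a, b in pairs:
        compared = diffs = 0
        for x, y in zip(a, b):
            # Compare only positions where both sequences have A, C, G or T
            # (pairwise exclusion of gaps; an assumption of this sketch).
            if x in "ACGT" and y in "ACGT":
                compared += 1
                diffs += x != y
        total += diffs / compared if compared else 0.0
    return total / len(pairs)


def sliding_pi(seqs, window=600, step=200):
    """Return (window_start, Pi) for each window; step is a hypothetical value."""
    length = len(seqs[0])
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        results.append((start, nucleotide_diversity(chunk)))
    return results


toy = ["AAAA", "AAAT", "AATT"]
pi_all = nucleotide_diversity(toy)
pi_windows = sliding_pi(toy, window=2, step=2)
```

Windows with the highest Pi in such a scan correspond to candidate hotspot regions; the exact values depend on the window, step and gap treatment chosen.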
3.5 Phylogenetic analysis
Using whole plastome sequences, we performed a phylogenetic analysis of 37 species of the tribe Cynareae. The topologies of the ML and BI trees are essentially consistent (Figure 7). Atractylodes is sister to the other Cynareae species, and the Atractylodes species form a monophyletic group with 100% support. Within Atractylodes, A. carlinoides is located at the base. A. japonica and A. lancea cluster into a subclade that is sister to the subclade of A. chinensis and A. coreana. The phylogenetic relationships inferred from indels are consistent with those obtained from the whole plastome sequences (Figure S1).