Organization and features of the C. tomentella and C. saxicola cp genomes
The complete C. tomentella chloroplast (cp) genomes were 190,198–190,247 bp long and exhibited the typical circular structure of angiosperm cp genomes, comprising four regions: a large single-copy region (LSC: 96,530–96,701 bp), a small single-copy region (SSC: 9,636–9,664 bp), and a pair of inverted repeats (IR: 41,955–42,002 bp) (Fig. 1). The GC content of the whole genome and of each region was also typical of angiosperm cp genomes. Specific lengths and GC contents are shown in Fig. 1 and Table 1. The two complete C. saxicola genomes were 189,029 bp and 189,155 bp long, slightly shorter than those of C. tomentella. The cp genome structure, the size of each region, and the GC content were similar between the two species (Table 1).
Table 1
Summary of chloroplast genome features of C. tomentella and C. saxicola
Species | Voucher No. | GenBank No. | Length (bp): Total | Length (bp): IR | Length (bp): LSC | Length (bp): SSC | GC content (%): Total | GC content (%): IR | GC content (%): LSC | GC content (%): SSC |
Corydalis tomentella | MHJ1 | MT093187 | 190247 | 41955 | 96701 | 9636 | 40.3 | 42.2 | 39.2 | 35.4 |
Corydalis tomentella | MHJ2 | MT077878 | 190198 | 42002 | 96530 | 9664 | 40.2 | 42.2 | 39.0 | 35.4 |
Corydalis saxicola | YHL1 | MT077877 | 189155 | 42350 | 94744 | 9711 | 40.2 | 42.2 | 39.1 | 35.1 |
Corydalis saxicola | YHL2 | MT077879 | 189029 | 42164 | 94993 | 9708 | 40.3 | 42.2 | 39.1 | 35.1 |
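As a quick consistency check on Table 1, the total length of a quadripartite cp genome should equal LSC + SSC + 2 × IR. The following is a minimal sketch using the values reported in Table 1; the dictionary layout and the function names are illustrative assumptions and not part of the original analysis.

```python
# Region lengths (bp) copied from Table 1; the data layout is illustrative.
TABLE1 = {
    "MHJ1": {"total": 190247, "IR": 41955, "LSC": 96701, "SSC": 9636},
    "MHJ2": {"total": 190198, "IR": 42002, "LSC": 96530, "SSC": 9664},
    "YHL1": {"total": 189155, "IR": 42350, "LSC": 94744, "SSC": 9711},
    "YHL2": {"total": 189029, "IR": 42164, "LSC": 94993, "SSC": 9708},
}


def quadripartite_length(region):
    """Return LSC + SSC + two IR copies (hypothetical helper)."""
    return region["LSC"] + region["SSC"] + 2 * region["IR"]


def ir_fraction(region):
    """Return the share of the genome occupied by the two IR copies."""
    return 2 * region["IR"] / region["total"]


for voucher, region in TABLE1.items():
    # The reported total should match the sum of the four regions.
    assert quadripartite_length(region) == region["total"], voucher
    print(voucher, quadripartite_length(region), f"{ir_fraction(region):.1%}")
```

For all four samples the region lengths sum to the reported total, so the table is internally consistent.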
CPGAVAS2 was used to annotate the cp genomes of C. tomentella and C. saxicola. After duplicate genes were removed, a total of 119 annotated genes, including 78 protein-coding genes, 37 tRNA genes, and four rRNA genes, were identified in the C. tomentella cp genome (Fig. 2, Table 2 and S1). There were 28 genes in the IR region, of which 15 were involved in gene expression. Introns play an important role in the regulation of selective splicing. Nineteen genes in the C. tomentella cp genome contained introns. Most of these genes contained only one intron, while ycf3 and clpP each contained two (Table 2). Twelve introns were longer than 700 bp, and the longest intron-containing gene was trnK-UUU, with a length of 2,478 bp. The gene features of the C. saxicola cp genome were similar to those of C. tomentella. The C. saxicola cp genome contained 120 genes, including 78 protein-coding genes, 38 tRNA genes, and four rRNA genes. Nineteen genes contained introns. The longest intron-containing gene in the C. saxicola cp genome was also trnK-UUU, again with a length of 2,478 bp (Fig. 2, Table 2 and S1).
Table 2
List of genes in the two Corydalis chloroplast genomes
Group of genes | Gene names | Number of Genes |
Photosystem I | psaA, psaB, psaC(× 2), psaI(× 2), psaJ | 5(2) |
Photosystem II | psbA, psbB, psbC, psbD, psbE, psbF, psbI, psbJ, psbK, psbL, psbM, psbN, psbT, psbZ | 14 |
Cytochrome b/f complex | petA, petB*, petD*, petG, petL, petN | 6 |
ATP synthase | atpA, atpB, atpE, atpF*, atpH, atpI | 6 |
NADH-dehydrogenase | ndhA*, ndhB*(× 2), ndhC, ndhD(× 2), ndhE(× 2), ndhF(× 2), ndhG(× 2), ndhH, ndhI(× 2), ndhJ, ndhK | 11(6) |
RubisCO large subunit | rbcL | 1 |
DNA dependent RNA polymerase | rpoA, rpoB, rpoC1*, rpoC2 | 4 |
Small subunit of ribosome | rps2, rps3, rps4, rps7(× 2), rps8, rps11, rps12*(× 2), rps14, rps15, rps16*, rps18, rps19 | 12(2) |
Large subunit of ribosome | rpl2*(× 2), rpl14, rpl16*, rpl20, rpl22, rpl23(× 2), rpl32(× 2), rpl33, rpl36 | 9(3) |
Proteins of unknown function | ycf1, ycf2(× 2), ycf3**, ycf4, ycf15(× 2) | 5(2) |
Other genes | ccsA(× 2), cemA, infA, matK, clpP** | 5(1) |
Transfer RNAs | 37 tRNAs(C. tomentella); 38 tRNAs(C. saxicola) | 37/38 |
Ribosomal RNAs | rrn16S(× 2), rrn23S(× 2), rrn4.5S(× 2), rrn5S(× 2) | 4(4) |
One or two asterisks after a gene name indicate that the gene contains one or two introns, respectively. (× 2) indicates that the gene is present in two copies. The numbers in parentheses in the ‘Number of Genes’ column indicate the number of duplicated genes.
Codon usage bias, SSR analysis, and repeat sequences
Codon usage patterns of the coding sequences in the C. tomentella and C. saxicola cp genomes were calculated on the basis of relative synonymous codon usage (RSCU) values. Codons with RSCU values greater than 1.00 were considered to be used more frequently than expected, and codons with values below 1.00 less frequently. All protein-coding genes in the C. tomentella and C. saxicola cp genomes were encoded by 52,244 codons and 51,125 codons, respectively (Table S2). The most prevalent amino acid was leucine in the cp genomes of both C. tomentella (5,656; 10.83%) and C. saxicola (5,528; 10.81%). Conversely, the least frequently used amino acid in both species was cysteine (591–634; 1.16–1.18%). The third codon position of the coding genes had a high AT content, at 65.83% and 65.91% for C. tomentella and C. saxicola, respectively.
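To make the RSCU definition concrete: for a codon belonging to an amino acid with n synonymous codons, RSCU is the observed codon count divided by the count expected if all n synonymous codons were used equally (family total / n). The following is a minimal sketch under stated assumptions: it uses the standard codon assignments, takes plain in-frame CDS strings as input, and the function names `count_codons`, `rscu`, and `third_position_at` are hypothetical helpers rather than the software used in this study.

```python
from collections import Counter
from itertools import product

# Standard codon assignments in TCAG order (assumed to apply to these CDSs).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codons(cds_list):
    """Count codons over a list of in-frame coding sequences."""
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skip codons with ambiguous bases
                counts[codon] += 1
    return counts


def rscu(counts):
    """RSCU = observed count / (family total / number of synonymous codons)."""
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":  # stop codons are excluded from RSCU
            families.setdefault(aa, []).append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        expected = total / len(codons)
        for c in codons:
            values[c] = counts[c] / expected
    return values


def third_position_at(counts):
    """Fraction of counted codons whose third base is A or T."""
    total = sum(counts.values())
    at = sum(n for codon, n in counts.items() if codon[2] in "AT")
    return at / total if total else 0.0


# Toy CDS (not real data): ATG TTA TTA CTG TAA
toy_counts = count_codons(["ATGTTATTACTGTAA"])
toy_rscu = rscu(toy_counts)
print(toy_rscu["TTA"], toy_rscu["CTG"], third_position_at(toy_counts))
```

In the toy sequence, leucine is encoded three times across six synonymous codons, so the expected count per codon is 0.5 and TTA (observed twice) has an RSCU of 4.0, well above the 1.00 threshold.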
Simple sequence repeats (SSRs) are short tandem repeats of 1–6 bp DNA motifs that are widely distributed throughout the cp genome [17]. In this study, CPGAVAS2 was used to identify and classify SSRs with a length of at least 8 bp, and we analyzed the distribution and types of SSRs in the C. tomentella and C. saxicola cp genomes. A total of 172 SSRs were identified in the whole C. tomentella cp genome (using MHJ1 as an example), including 100 mono-, 34 di-, and one compound nucleotide SSRs. Among all SSR types, A and T were the most common bases, and 116 SSRs in the C. tomentella cp genome had A, T, or AT repeat units (Table 3 and S3). For C. saxicola, 170 SSRs were identified (using YHL2 as an example), including 96 mono-, 36 di-, six tri-, and six compound nucleotide SSRs; 115 of these SSRs had A, T, or AT repeat units (Table 3 and S3).
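The SSR definition above (motifs of 1–6 bp, total length of at least 8 bp) can be illustrated with a small regular-expression search. This is only a sketch of perfect-SSR detection and is not the procedure implemented in CPGAVAS2; the function names, the minimum of two motif copies, and the way the 8 bp threshold is converted into a repeat count per motif length are all assumptions made for illustration.

```python
import re


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(seq, min_len=8, max_motif=6):
    """Return (start, end, motif, copies) for perfect SSRs of >= min_len bp."""
    seq = seq.upper()
    hits = []
    for k in range(1, max_motif + 1):
        # Assumed rule: ceil(min_len / k) copies, and never fewer than two.
        min_repeats = max(2, -(-min_len // k))
        pattern = re.compile(rf"([ACGT]{{{k}}})\1{{{min_repeats - 1},}}")
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip e.g. "AA" as a dinucleotide; it is reported as mono "A".
            if is_primitive(motif):
                copies = (m.end() - m.start()) // k
                hits.append((m.start(), m.end(), motif, copies))
    return sorted(hits)


# Toy sequence (not real data): a 10 bp A run and five AT copies.
toy = "CCG" + "A" * 10 + "CGC" + "AT" * 5 + "GGC"
for start, end, motif, copies in find_ssrs(toy):
    print(start, end, motif, copies)
```

Counting the returned hits by motif length gives the mono-, di-, and tri-nucleotide classes used in the text; compound SSRs, which join adjacent repeats, are not handled by this sketch.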
Table 3
Interspersed repeat sequences and tandem repeat sequences of C. saxicola and C. tomentella
Species | Voucher No. | SSR: Total | SSR: Mono | Interspersed repeat sequences: Total | Interspersed repeat sequences: T | Interspersed repeat sequences: F | Interspersed repeat sequences: P |
Corydalis tomentella | MHJ1 | 172 | 100 | 111 | 61 | 39 | 11 |
Corydalis tomentella | MHJ2 | 174 | 102 | 112 | 62 | 39 | 11 |
Corydalis saxicola | YHL1 | 171 | 96 | 132 | 82 | 23 | 27 |
Corydalis saxicola | YHL2 | 170 | 96 | 133 | 83 | 26 | 24 |
T: tandem repeats; F: forward repeats; P: palindromic repeats.
In addition to SSRs, tandem repeats (T), forward repeats (F), and palindromic repeats (P) were analyzed; forward and palindromic repeats are also called interspersed repeat sequences (length ≥ 30 bp). In the C. tomentella cp genome (MHJ2), there were 112 such repeat sequences, comprising 62 tandem repeats, 39 forward repeats, and 11 palindromic repeats (Table 3). A total of 132 long repeats were present in the C. saxicola cp genome (YHL1), comprising 82 tandem repeats, 23 forward repeats, and 27 palindromic repeats (Table 3). The C. saxicola cp genome thus had a greater total number of repeats than the C. tomentella cp genome, and the repeat content of both cp genomes was considerably higher than that of most species.
IR contraction and expansion
IR regions are the most conserved regions in the plant plastome, and contraction and expansion at their borders are regarded as the major causes of size variation [18–19]. We selected four phylogenetically close species (Papaver rhoeas, Papaver orientale, Papaver somniferum, and Coreanomecon hylomeconoides) and two model species (Nicotiana tabacum and Arabidopsis thaliana) as references for cp genome structure comparisons. Figure 3 shows the boundaries between IR/SSC and IR/LSC in the eight species in detail.
Except in C. tomentella and C. saxicola, the IRb/SSC boundaries were generally positioned in the coding region of the ycf1 gene, resulting in duplication of the 3′ end of this gene. This duplication also produced a ycf1 pseudogene of variable size at the IRa/SSC border, ranging in length from 916 bp to 1,200 bp. In the C. tomentella and C. saxicola cp genomes, however, ycf1 has shifted into the SSC region and become a single-copy gene. Except in C. tomentella, C. saxicola, and N. tabacum, the LSC/IRb borders were located within the rps19 coding region. Correspondingly, a 3′-truncated rps19 pseudogene with a length of 74 bp to 113 bp was located at the IRb/LSC border. In the C. tomentella cp genome, the LSC/IRb border was located in the rpl2 coding region. Additionally, in the C. tomentella and C. saxicola cp genomes, the IRa/SSC boundaries were positioned in the ndhA coding region, and trnN was situated in the IRa and IRb regions, away from the LSC/IRa and IRb/LSC borders. The trnH gene was present in the LSC region, away from the IRb/LSC border.
Comparative genomic analysis and genome sequence divergence
VISTA was used to make multiple comparisons of the C. tomentella and C. saxicola cp genome sequences. The results showed that intra-specific variation was small, whereas some inter-specific differences were present (Fig. 4). Both the coding and non-coding regions of the C. saxicola samples were conserved. In the C. tomentella samples, the coding regions were conserved, but there were differences in several consecutive intergenic regions: rps12-clpP, clpP-psbB, and petB-psbH. Between C. tomentella and C. saxicola, the most divergent regions were observed in both coding and intergenic regions, including rpl20, rrn23S, trnH-GUG, trnN-GUU, rps12-clpP, clpP-psbB, petB-psbH, and ycf1-ndhI. On the basis of morphological features and cluster analysis of DNA barcodes, the two species were found to be closely related and difficult to identify accurately. The cp genome differences between the two species therefore have potential for use as molecular markers for species authentication.
Comparisons with the N. tabacum outgroup and the Papaveraceae species P. rhoeas, P. orientale, P. somniferum, and C. hylomeconoides showed that the C. tomentella and C. saxicola cp genomes have a distinct structure. The differences included genome size, number of genes, and genome structure (Fig. 5). First, the C. tomentella and C. saxicola cp genomes (189.0–190.2 kb) were larger than those of N. tabacum (155.9 kb) and P. somniferum (152.9 kb). Second, the intergenic regions in the C. tomentella and C. saxicola cp genomes were longer than those in N. tabacum and P. somniferum, as seen, for example, in psaI/rpl32 (7 kb) in the IR region and rps12/clpP (5 kb) in the LSC region. Third, the C. tomentella and C. saxicola cp genome structures differed markedly from those of the other six species, with large-scale gene duplication, relocation, and inversion, and changes in the number and arrangement of genes. Fourth, the C. tomentella and C. saxicola IR regions were greatly expanded (41.9–42.5 kb). The ndhF, ndhD, ndhI, ndhG, ndhE, psaC, ccsA, trnL-UAG, and rpl32 genes, which are usually located in the SSC region, had migrated to the IR regions and become double-copy genes (Fig. 1). A few genes, such as rps19 and rpl2, had migrated from the IR region to the LSC region. In particular, in C. tomentella and C. saxicola, a large fragment (containing rpl23, trnI-CAU, ycf2, ycf15, and trnL-CAA) had moved within the IR region. Gene migration increased the length of the IR region and decreased the length of the SSC region. Fifth, the LSC region was highly conserved, but the accD gene was lost and the position of the rbcL gene changed substantially. In short, both the coding and non-coding regions of the C. tomentella and C. saxicola cp genomes differ greatly from those of the other Papaveraceae species and tobacco.
Phylogenetic analysis of Papaveraceae
With C. chinensis and N. tabacum as outgroups, common protein-coding sequences were extracted from 13 cp genome sequences, representing C. saxicola, C. tomentella, and six other Papaveraceae species (P. somniferum: NC029434, P. orientale: NC037832, P. rhoeas: MF943221, Coreanomecon hylomeconoides: NC031446, Macleaya microcarpa: NC039623, and Meconopsis racemosa: MH394401 and NC039625), to build a maximum likelihood (ML) phylogenetic tree (Fig. 6). The ML tree has high bootstrap values at each node, indicating that it is highly credible. In this tree, the Papaveraceae family is monophyletic, and all samples from Papaveraceae are clustered in one clade. Within Papaveraceae, the samples from the genus Papaver (P. somniferum, P. orientale, and P. rhoeas) are clustered in a clade; the samples from Corydalis (C. saxicola and C. tomentella) are clustered in a clade; the samples from Meconopsis (M. racemosa) are clustered in a clade; and C. hylomeconoides and M. microcarpa are clustered in a clade. Except for the genera Coreanomecon and Macleaya, each of which had only one sample, species of the same genus are clustered into one branch, consistent with the previous classification of Papaveraceae genera. At the species level, the C. saxicola and C. tomentella samples are clustered into separate branches, indicating that cp genome clustering analysis can effectively distinguish them, whereas these two closely related species were not monophyletic in the phylogenetic analysis based on short-sequence DNA barcodes. At the same time, C. saxicola and C. tomentella are clustered in a clade that is distant from the other Papaveraceae genera in the ML tree. On the one hand, this shows that C. saxicola and C. tomentella, both from Sect. Thalictrifoliae in Corydalis, have a close genetic relationship. On the other hand, it shows that Corydalis has a relatively distant genetic relationship with the other Papaveraceae genera included in this study.