Analysis and comparison of basic codon characteristic parameters
CodonW 1.4.2 and EMBOSS were used to analyze the chloroplast genomes of 63 species of Magnoliaceae (Supplementary Table 1 and Supplementary Table 2). Supplementary Table 1 shows that the average exon number of the chloroplast genomes of the 63 species was 2869. Magnolia laevifolia had the lowest exon number, only 2640, and Magnolia dodecapetala had the highest, 3152; the more exons there are, the more proteins are translated. Among the 63 species, the highest codon GC value was found in Magnolia dixonii (0.3931) and the lowest in Magnolia liliiflora (0.3909). All GC values were therefore below 0.5, indicating that the chloroplast genome codons of the 63 Magnoliaceae plants preferentially use A or U.
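The GC comparison above can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the CodonW or EMBOSS implementation; the function names and the example sequence are hypothetical.

```python
def gc_content(cds):
    """Fraction of G and C bases over the whole coding sequence."""
    seq = cds.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc3_content(cds):
    """Fraction of G and C bases at third codon positions only."""
    thirds = cds.upper()[2::3]  # every third base, starting at index 2
    return (thirds.count("G") + thirds.count("C")) / len(thirds)


# Hypothetical toy sequence (three codons), not taken from the data set.
example = "ATGGCATAA"
prefers_a_or_u = gc_content(example) < 0.5  # same rule of thumb as in the text
```

A GC value below 0.5 means that A and T (U in the transcript) together make up more than half of the bases, which is the reasoning applied to the 63 genomes.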
In multi-species studies of codon usage bias, the effective number of codons (ENC) is often used to describe how far codon usage deviates from random selection, and it can be used to judge the codon usage bias of a genome or gene; its value lies between 20 and 61. The smaller the ENC value, the stronger the codon preference of the genome or gene, and the larger the value, the weaker the preference [11]. Previous studies have shown that when the ENC value is less than or equal to 35, the codon usage preference of the genome or gene can be considered significant [12]. As shown in Supplementary Table 1, the mean ENC value of the chloroplast genomes of the 63 Magnoliaceae plants was 55.9. The maximum was 56.41 (Magnolia shiluensis) and the minimum was 55.36 (Magnolia pilocarpa). Even in the species with the lowest ENC value, codon preference was not significant. The average ENC value was well above 35, and the maximum differed from the mean by only 0.51, indicating that chloroplast codon usage preference is similar across Magnoliaceae plants. In summary, the codon usage preference of Magnoliaceae is weak.
The codon adaptation index (CAI) measures how closely the synonymous codon usage of a coding region matches the usage frequency of the optimal codons, and its value lies between 0 and 1. If a gene uses exactly the codons used in highly expressed genes, its CAI value is 1; in other words, the larger the value, the higher the translation efficiency and the stronger the adaptation [13]. Supplementary Table 2 shows that among the 63 species of Magnoliaceae, the lowest CAI value was 0.154 (Magnolia martini) and the highest was 0.162 (Magnolia wilsonii). The average CAI value of the 63 species was 0.159, far below 1. These results indicate that Magnoliaceae plants do not fully use the codons used in highly expressed genes, and that translation efficiency is not high. The codon bias index (CBI) reflects the extent to which a gene is composed of the optimal codons of highly expressed genes. The CBI value generally ranges from 0 to 1, with a CBI of 1 indicating maximum codon bias and a CBI of 0 indicating no codon bias, that is, synonymous codons are used equally. If the predetermined optimal codons are used less often than the average, the CBI will be negative.
Supplementary Table 2 shows that the CBI values of all 63 species of Magnoliaceae are negative, indicating that the optimal codons in Magnoliaceae plants are used less often than the average [14]. The frequency of optimal codons (Fop) is the proportion of codons in a gene that are optimal codons, i.e. the codons used most frequently in the highly expressed genes of a species. The value of Fop ranges from 0 to 1, where 1 means that only optimal codons are used and 0 means that no optimal codon is used [15]. Supplementary Table 2 shows that among the 63 species of Magnoliaceae, the lowest Fop value is 0.355 (Magnolialace) and the highest is 0.367 (Magnolia baillonii); the average across the 63 species is 0.361, and the maximum does not exceed 0.4. This indicates that optimal codons are used at low frequency in Magnoliaceae.
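The CAI and Fop definitions above can be made concrete with a small sketch. It assumes CAI is the geometric mean of per-codon relative adaptiveness weights and Fop is the share of optimal codons among codons that have synonyms. The weights, the optimal set and the sequences below are hypothetical toy values, and the code is not the CodonW implementation.

```python
import math


def split_codons(cds):
    """Upper-case the sequence, write it as RNA, and cut it into whole codons."""
    seq = cds.upper().replace("T", "U")
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def cai(cds, weights):
    """Geometric mean of the relative adaptiveness of each scored codon.

    Codons missing from `weights` (for example Met, Trp, stops) are skipped.
    """
    logs = [math.log(weights[c]) for c in split_codons(cds) if c in weights]
    if not logs:
        raise ValueError("no codon in the sequence has a weight")
    return math.exp(sum(logs) / len(logs))


def fop(cds, optimal, synonymous):
    """Share of optimal codons among codons that have synonymous alternatives."""
    considered = [c for c in split_codons(cds) if c in synonymous]
    if not considered:
        raise ValueError("no synonymous codon found in the sequence")
    return sum(c in optimal for c in considered) / len(considered)


# Hypothetical weights for two alanine codons only.
toy_weights = {"GCU": 1.0, "GCC": 0.5}
toy_cai = cai("GCUGCC", toy_weights)  # geometric mean of 1.0 and 0.5
toy_fop = fop("GCUGCCAUG", {"GCU"}, {"GCU", "GCC", "GCA", "GCG"})
```

A gene built only from weight-1 codons scores a CAI of 1, which matches the statement that a gene using exactly the codons of highly expressed genes has CAI = 1.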
In summary, the chloroplast genome codons of the 63 Magnoliaceae plants all preferentially use A or U, and codon usage bias is weak. Magnoliaceae plants do not fully use the codons used in highly expressed genes, translation efficiency is not high, and optimal codons are used at low frequency.
High Frequency Codon Analysis (RSCU)
The relative synonymous codon usage (RSCU) value is the ratio between the observed usage frequency of a codon and the frequency expected if all synonymous codons were used equally, and it is often used as an important parameter for measuring codon preference. RSCU = 1 means that the codon is used as often as its synonymous codons, so there is no bias. RSCU > 1 means that the codon is used more often than expected, and it is considered a high-frequency codon. RSCU < 1 means that the codon is used less often than its synonymous codons. The results (Fig. 1) show that the chloroplast genomes of the 63 species of Magnoliaceae have 33 high-frequency codons (RSCU > 1). Thirteen end in U (UUU, CUU, AUU, GUU, UCU, CCU, ACU, GCU, UAU, CAU, AAU, GAU, UGU), 15 end in A (UUA, AUA, GUA, UCA, CCA, ACA, GCA, UGA, UAA, CAA, AAA, GAA, AGA, CGA, GGA), two end in G (UUG, AGG) and two end in C (UCC, ACC). These results indicate that codons ending in U or A are preferred in the Magnoliaceae chloroplast genome, while codons ending in G or C are non-preferred. Two codons have an RSCU value of 1, AUG and UGG, which encode methionine and tryptophan, respectively; because each of these amino acids is encoded by a single codon, their codon usage cannot be biased. The RSCU values for each amino acid are shown in the figure below.
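The RSCU definition above can be illustrated with a short sketch: for each amino acid, the observed count of a codon is divided by the count expected if all of its synonymous codons were used equally. The code assumes the standard genetic code and treats the three stop codons as one synonymous family; it is an illustration, not the CodonW implementation, and the example sequence is hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order at each position.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """Return {codon: RSCU} pooled over a list of coding sequences."""
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper().replace("T", "U")
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # ignores codons with ambiguous bases
                counts[codon] += 1

    families = {}
    for codon, amino in CODON_TABLE.items():
        families.setdefault(amino, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        for c in codons:
            # observed count / (family total / number of synonymous codons)
            values[c] = counts[c] * len(codons) / total
    return values


# Hypothetical toy input: three GCU, one GCA, one AUG.
toy = rscu(["GCTGCTGCTGCAATG"])
high_frequency = sorted(c for c, v in toy.items() if v > 1)
```

In the toy input, AUG has an RSCU of exactly 1 because methionine has no synonymous codon, which is the same reason AUG and UGG show no bias in the text.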
Optimal Codon Analysis
The codons most frequently used in the chloroplast genome of an organism are generally defined as optimal codons, and they can be determined from ENC and RSCU values. ENC values are used to estimate relative levels of gene expression: in general, highly expressed genes show strong codon usage preference and small ENC values, whereas genes with low expression contain more rare codons and have larger ENC values. The ENC values of the 63 species of Magnoliaceae were sorted, the 10% of genes at each end were taken as the high- and low-expression gene sets, and the RSCU and ΔRSCU values of 12 species of Magnoliaceae were calculated. RSCU > 1 was used as the criterion for high-frequency codons, RSCU > 1 together with ΔRSCU ≥ 0.08 was used as the criterion for high-expression codons, and codons that were both high-frequency and high-expression were defined as the optimal codons of the Magnoliaceae chloroplast genome. As shown in Supplementary Table 3, there are 63 high-frequency codons and two codons with ΔRSCU ≥ 0.08, namely AGC and GCG (Fig. 1); no codon meets both criteria, so no optimal codon was identified in the Magnoliaceae family.
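The screening procedure above can be sketched as follows. The sketch assumes that ΔRSCU is the RSCU in the high-expression set minus the RSCU in the low-expression set, which the text does not state explicitly, and all gene names and RSCU values are hypothetical.

```python
def split_by_enc(enc_by_gene, fraction=0.10):
    """Return (high_expression, low_expression) gene lists.

    Genes are ranked by ENC; the lowest-ENC tail is treated as highly
    expressed and the highest-ENC tail as lowly expressed.
    """
    ranked = sorted(enc_by_gene, key=enc_by_gene.get)
    n = max(1, int(len(ranked) * fraction))
    return ranked[:n], ranked[-n:]


def optimal_codons(rscu_all, rscu_high, rscu_low, delta_cutoff=0.08):
    """Codons with RSCU > 1 overall and delta RSCU >= delta_cutoff."""
    selected = []
    for codon, value in rscu_all.items():
        # Assumed definition: delta RSCU = RSCU(high set) - RSCU(low set).
        delta = rscu_high.get(codon, 0.0) - rscu_low.get(codon, 0.0)
        if value > 1 and delta >= delta_cutoff:
            selected.append(codon)
    return sorted(selected)


# Hypothetical ENC values for ten genes.
enc = {"gene%d" % i: 40 + i for i in range(10)}
high_set, low_set = split_by_enc(enc)

# Hypothetical RSCU tables for three codons.
rscu_all = {"GCU": 1.5, "GCG": 0.4, "UUA": 1.8}
rscu_high = {"GCU": 1.7, "GCG": 0.6, "UUA": 1.8}
rscu_low = {"GCU": 1.4, "GCG": 0.3, "UUA": 1.9}
picked = optimal_codons(rscu_all, rscu_high, rscu_low)
```

In the toy tables, GCG passes the ΔRSCU cutoff but fails RSCU > 1, so it is not selected. A codon has to satisfy both criteria to count as optimal.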
Analysis of influencing factors of codon preference
ENC-plot analysis
The ENC values of the chloroplast genes of the 63 species were plotted on the vertical axis against GC3 on the horizontal axis. The standard curve gives the expected ENC value under the ideal condition in which codon preference depends entirely on base mutation, calculated as ENC = 2 + GC3 + 29/[GC3^2 + (1 - GC3)^2]. As shown in Fig. 2, the points for the 63 species of Magnoliaceae lie close to the standard curve, indicating that codon usage is strongly affected by base mutation and is subject to little selection pressure.
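The standard curve can be evaluated directly from the formula given above. The sketch below also computes a relative deviation, (expected - observed) / expected, as one way to quantify how close a point lies to the curve; that ratio is an addition for illustration and is not described in the text.

```python
def expected_enc(gc3):
    """Expected ENC when codon usage is shaped by base composition alone."""
    return 2 + gc3 + 29 / (gc3 ** 2 + (1 - gc3) ** 2)


def enc_deviation(observed_enc, gc3):
    """Relative distance of an observed ENC value below the standard curve."""
    expected = expected_enc(gc3)
    return (expected - observed_enc) / expected


# The curve peaks at GC3 = 0.5 and falls towards either extreme.
curve = [(x / 10, expected_enc(x / 10)) for x in range(11)]
```

A deviation near zero means the point sits on the curve, which is the pattern interpreted above as mutation-dominated codon usage.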
PR2 Analysis
Analysis of the third codon position of the chloroplast genes of Magnoliaceae showed that G3/(G3 + C3) ranged from 0.48 to 0.51, while A3/(A3 + T3) ranged from 0.48 to 0.50. The gene points of the 63 plants were not spread evenly over the plane but were concentrated near the center, where both ratios equal 0.5. These results suggest that the codon preference of Magnoliaceae chloroplast genes is affected only by base mutation, with the four bases used at nearly equal frequency at the third position. The PR2 plot (Supplementary Fig. 1) indicates that the codon usage pattern was not affected by selection or other driving forces during evolution.
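The two PR2 coordinates above can be computed as in the following sketch. It counts bases at every third codon position, as the text describes; PR2 analyses are often restricted to fourfold degenerate codons, and that restriction is not applied here. The function name and example sequence are hypothetical.

```python
def pr2_point(cds):
    """Return (G3/(G3+C3), A3/(A3+T3)) for one coding sequence."""
    thirds = cds.upper().replace("U", "T")[2::3]
    a, t, g, c = (thirds.count(base) for base in "ATGC")
    if g + c == 0 or a + t == 0:
        raise ValueError("ratio undefined: no G/C or no A/T at third positions")
    return g / (g + c), a / (a + t)


# Hypothetical toy sequence with one A, T, G and C at third positions.
x, y = pr2_point("GCAGCTGCGGCC")
```

A point at (0.5, 0.5) means A and T, and likewise G and C, are used equally at the third position, which is the center of the PR2 plane referred to above.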
Simple sequence repeat analysis of chloroplast genome
Simple sequence repeats (SSRs) were identified with the MISA online tool (https://webblast.ipk-gatersleben.de/misa/index.php?action=1) and analyzed in Excel. The results are given in Supplementary Table 4. The genomes of the 63 Magnoliaceae plants contained compound SSRs as well as mono-, di-, tri-, tetra- and hexanucleotide SSRs. Mononucleotide SSRs were the most numerous, and tri-, tetra- and hexanucleotide repeats were the least common. A trinucleotide repeat was found in Liriodendron tulipifera and Magnolia yunnanensis, and a tetranucleotide repeat was found in Magnolia sieboldii and Magnolia wilsonii. In particular, Magnolia doltsopa has three trinucleotide repeats, eight tetranucleotide repeats and three hexanucleotide repeats, while Magnolia duclouxii has only 32 mononucleotide repeats and two dinucleotide repeats. The SSR repeat types thus differ somewhat among species.
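How an SSR search of this kind works can be sketched with regular expressions. This is not the MISA implementation: the minimum repeat numbers below are assumed values, since the thresholds used in the analysis are not stated, and compound SSRs are not detected.

```python
import re

# Assumed thresholds: {motif length: minimum number of repeats}.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = sequence.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(2)
            # Skip motifs that are repeats of a shorter motif ("AA", "ATAT"),
            # because they are already reported at the shorter length.
            if any(unit == unit[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((match.start(), unit, len(match.group(1)) // unit_len))
    return sorted(hits)


# Hypothetical toy sequence: a 12-base A run and six AT repeats.
toy_seq = "GC" + "A" * 12 + "GCC" + "AT" * 6 + "GG"
toy_hits = find_ssrs(toy_seq)
```

Counting hits by motif length gives the per-type totals (mono-, di-, trinucleotide and so on) that the paragraph above compares between species.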
Comparative analysis of chloroplast genome in Magnoliaceae
To evaluate the degree of difference among the chloroplast genome sequences of Magnoliaceae plants, mVISTA was used to compare the sequences (Supplementary Fig. 2). The sequences of the 63 Magnoliaceae plants were globally aligned against the annotated Liriodendron chinense genome. The IR regions differ less than the LSC and SSC regions, and coding regions differ less than non-coding regions. Most highly variable regions are observed in conserved non-coding sequences. The variable regions within coding genes are matK, atpA, atpI, rps2, ycf15, rps19, rpl19, rpoC2, rpoA, rpoB, psaA, rps11, rps4, ndhJ, ndhK, rbcL, accD, ndhD, ndhF, ycf1, ycf4 and petA. The ycf1 and ycf2 coding regions differed significantly among the 63 Magnoliaceae species, and the similarity of some fragments was below 50%. The trnG-UCC similarity between Magnolia dealbata and Magnolia duperreana was close to 0%, indicating obvious variation in this region. In addition, outside the coding regions, high variation associated with rpoC1, trnH-GUG, psbK, psbI, ycf3, clpP, petB, petD, rpl16 and ndhA is concentrated in the intergenic regions. These include genes that regulate photosynthesis and gene transcription and expression, so cell phenotypes may be affected, and the high variability of these regions may influence species evolution. In the untranslated regions, three genes showed close to 0% similarity, namely rps14, trnS-UGA and trnV-GAC. In summary, the 63 Magnoliaceae chloroplast genomes are conserved over the whole sequence, most coding genes show no obvious variation, and the major variable regions are concentrated in non-coding sequences.
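The similarity percentages quoted above are window-based identity values along an alignment. The sketch below shows the general idea on an already aligned pair of sequences; it is not the mVISTA algorithm, and the window size, the gap handling and the example sequences are assumptions for illustration.

```python
def window_identity(ref, query, window=100, step=100):
    """Percent identity per window for two aligned, equal-length sequences.

    Gap characters ('-') never count as matches.
    """
    if len(ref) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for start in range(0, len(ref) - window + 1, step):
        pairs = zip(ref[start:start + window], query[start:start + window])
        matches = sum(a == b and a != "-" for a, b in pairs)
        profile.append((start, 100.0 * matches / window))
    return profile


# Hypothetical toy alignment, scored in two windows of five columns.
toy_profile = window_identity("ACGTACGTAC", "ACGTACGTTT", window=5, step=5)
low_similarity = [start for start, pct in toy_profile if pct < 50.0]
```

Windows falling below a chosen cutoff, such as the 50% mentioned above, mark the candidate highly variable regions.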
Analysis of contraction and expansion of IR region
The IR region is the most conserved region of the chloroplast genome, and its contraction and expansion is considered an important cause of length variation among chloroplast genomes and a common phenomenon in evolution. As the IR region contracts and expands, genes close to the boundary may move into the IR or SSC region. We analyzed 64 species of Magnoliaceae (Fig. 3) and found genes that cross or lie close to the boundaries between the IR and SSC regions, mainly rps19, rpl2, rpl23, psbA, rpl22, ycf2, ycf1, trnM, trnI and trnH. Notably, in Magnolia grandis the gene rpl2, at the boundary of the IR and SSC regions, is 1113 bp long and spans both regions. Similarly, in Magnolia duclouxii rpl2 is 1476 bp long and spans the LSC and IR regions. In the other species, rpl2 lies entirely within the IR region. The boundary gene rps19 occasionally spans the LSC and IR regions; most of the gene lies in the LSC region, with no more than 5 bp in the IR region. The IRb/SSC and SSC/IRa junctions are variable. Apart from Magnolia chapensis, Magnolia martini, Magnolia wilsonii, Magnolia ventii, Magnolia tamaulipana, Magnolia sinostellata, Magnolia sinica, Magnolia sieboldii and Magnolia shiluensis, the IRb/SSC junctions of the Magnoliaceae chloroplast genomes were all spanned by the ycf1 gene. It is worth noting that in Magnolia doltsopa the ycf2 gene abnormally spans the LSC and IR regions, but no abnormal expansion of the IR region was seen in the other species, whose IR regions were about 25,000 bp long. Therefore, the genome structure, gene order and gene number of most species of Magnoliaceae have changed little.
In summary, contraction and expansion of the inverted repeats were detected in the 64 species of Magnoliaceae. In Fig. 3, genes around each boundary are shown above or below the main line. JLB, JSB, JSA and JLA denote the LSC/IRb, IRb/SSC, SSC/IRa and IRa/LSC junctions, respectively.
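Whether a boundary gene spans a junction, and by how many bases, follows from simple interval arithmetic. The sketch below assumes 1-based inclusive coordinates on one strand, with the junction given as the last base of the upstream region; the coordinates in the example are hypothetical and are not taken from Fig. 3.

```python
def split_at_junction(gene_start, gene_end, junction):
    """Return (bases upstream of the junction, bases downstream of it).

    Coordinates are 1-based and inclusive; `junction` is the last base
    of the upstream region (for example the last base of the LSC at JLB).
    """
    length = gene_end - gene_start + 1
    if gene_end <= junction:
        return length, 0  # gene lies entirely upstream
    if gene_start > junction:
        return 0, length  # gene lies entirely downstream
    return junction - gene_start + 1, gene_end - junction


# Hypothetical rps19-like case: a 279 bp gene with 5 bp past the junction.
upstream, downstream = split_at_junction(100, 378, 373)
spans_junction = upstream > 0 and downstream > 0
```

Applied to an LSC/IRb junction, the second value is the number of bases a gene such as rps19 extends into the IR region.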
Collinearity analysis
Chloroplast genome rearrangement and collinearity among the 63 species of Magnoliaceae were examined with the Mauve multiple genome alignment method. The size of collinear fragments is closely related to the divergence time between species: species that diverged more recently have accumulated fewer variant fragments and retain more ancestral features, whereas species that diverged longer ago have shorter collinear fragments because of accumulated variation. Five locally collinear regions were detected among the chloroplast genomes of the 63 species (Fig. 4). The type, number and order of genes were highly consistent within the family for most species, although a gene was missing in Magnolia martini. Four species are not completely collinear with the others: Magnolia champaca and Magnolia chapensis, and Magnolia grandiflora and Magnolia grandis. No rearrangement or inversion was found in any species, indicating a high degree of genomic similarity among the 63 species and suggesting that they may have followed similar evolutionary paths.
Phylogenetic analysis
Based on the chloroplast sequences of the 63 Magnoliaceae plants in the NCBI database, a phylogenetic tree was constructed by the maximum likelihood method. Using a likelihood function, the maximum likelihood method analyzes the observed character data of the species, infers the relationships among them, and then establishes hierarchical connections to build an evolutionary tree. The results (Fig. 5) show a total of 58 nodes, of which 34 have a bootstrap support of less than 100%, 1 has a bootstrap support of 100%, and the remaining 23 have a bootstrap support of more than 100%. The evolutionary tree was divided into 14 groups, of which Group14 evolved the slowest and Group1 the fastest. Group1 contains 11 species, among which Magnolia zenii and Magnolia chapensis evolve the slowest and Magnolia fordiana and Magnolia shiluensis the fastest. Group2 contains 4 species; Magnolia sieboldii is the most conservative in evolution, while Magnolia aromatica and Magnolia conifera are closely related and evolve rapidly within this group. Group3 contains 7 species; Magnolia alba and Magnolia liliiflora are closely related, and Magnolia pyramidata and Magnolia tamaulipana show higher homology and conservative evolution. Group4 contains 3 species, of which Magnolia wilsonii evolves the slowest. Group5 contains 5 species; Magnolia crassipes and Magnolia ventii show higher homology and the fastest evolution, while Magnolia champaca and Magnolia kwangsiensis evolve more conservatively. Group6 contains 5 species, with Magnolia grandiflora and Magnolia kobus evolving the fastest. Group7 contains 4 species, and Group8 and Group9 contain 2 species each; in these two groups, Magnolia officinalis and Magnolia officinalis subsp. biloba, and Magnolia coco and Magnolia ovata, are closely related.
Of the 5 species in Group10, Magnolia balansae and Magnolia insignis are the most conservative and evolve the slowest. Group11 contains 4 species, with Magnolia salicifolia evolving the slowest. Group12 contains 5 species; Magnolia guangdong and Magnolia yunnanensis evolve the fastest and are closely related to each other. Group13 contains 4 species, with Magnolia dodecapetala the most conservative and slowest, and Magnolia gilbertoi and Magnolia ofeliae the fastest. Group14 contains 2 species; Magnolia martini is the more conservative and Magnolia grandis evolves more rapidly.