Identification and characterization of SR genes in poplar, Arabidopsis, grape, and papaya
According to the method described by Barta and Kalyna [10], 18 Arabidopsis and 22 rice SR protein sequences were used as the query in a BLASTp (e-value cutoff = 1 e − 10) search to identify poplar, grape, and papaya SR protein sequences from the phytozome database (http://www.phytozome.net) [22], and grape genome database (http://genomes.cribi.unipd.it/grape/) [23]. SMART (http://smart.embl-heidelberg.de) [24] and PFAM (http://pfam.janelia.org/) [25] were used to confirm whether the candidate sequence had one or two N-terminal RRMs (RBD; PF00076), and the sequences were manually confirmed to have at least 50 amino acids in the downstream sequence with 20% of the RS or SR dipeptide [10]. Finally, 30 genes in Populus trichocarpa, 14 genes in Vitis vinifera and 9 genes in Carica papaya were identified (Table S1). Basic information describing the primary transcript of each genes from poplar, Arabidopsis, grape, and papaya were listed in Table S1, and include gene ID, NCBI accession number, gene length, CDS length, coding protein length, pI, MW and the prediction of their subcellular location (Table S1). In poplar, the SR genes were located on 14 of 19 chromosomes with no distribution on chromosomes 4, 7, 9, 11 and 17. Among them, the most poplar SR genes distributed on chromosome 2. Arabidopsis SR genes were distributed on all chromosomes. Grape SR genes were distributed on 11 of 19 chromosomes (i.e., 1, 4, 6, 7, 8, 12, 13, 14, 15, 16 and 18) (Table S1). The statistical data showed that VvSCL34 had the longest gene length (18119 bp). CpRS50 coded for the longest protein (447 aa), whereas PtRSZ20 (180 aa) coded for the shortest protein among the 71 SR proteins identified. The pI values of six proteins in the 71 SR protiens were less than 10, whereas that of the remaining proteins were more than 10. Furthermore, the MW of these proteins ranged from 20.46 to 50.42 kDa with an average of 30.46 kDa. Predicted subcellular localization of the SR proteins indicated that most of these proteins were located in the nuclear, which was consistent with their putative roles as splicing factors.
Phylogenetic tree and conserved motifs analysis in poplar, Arabidopsis, grape, and papaya
Based on different structural architectures (Figure S1), the SC subfamily contained proteins with a single RRM followed by an RS domain. The SR subfamily proteins had two RRMs with an evolutionarily conserved SWQDLKD motif in their second RRM followed by an RS domain with a characteristic SR dipeptide. The RSZ subfamily consisted of SR proteins with one Zn knuckle. The plant-specific SCL subfamily (SC35-like) was similar to the SC subfamily but had an N-terminal extension that is rich in Arg, Pro, Ser, Gly and Tyr residues. Proteins of the plant-specific RS2Z subfamily had two Zn knuckles and an additional SP-rich region after the RS domain. The plant-specific RS subfamily contained two RRMs (without the SWQDLKD motif) followed by an RS-rich with many RS dipeptides [10].
And the protein sequences of the 71 putative SR genes were used to construct a phylogenetic tree to study the evolutionary relationship between these proteins (Fig. 1a). For genes with different transcripts, the primary transcript specified by phytozome and grape genome database was selected. The phylogenetic tree showed a similar pattern of evolutionary divergence among poplar, Arabidopsis, papaya and grape, which reflected conserved evolution and function in different SR genes. The SCL subfamily constituted the largest clade, each containing 18 members and accounting for 25.4% of the total SR genes, whereas RS2Z had the lowest number of members (seven members of each group) and accounted for 9.8% of the total SR genes. SR genes in poplar, Arabidopsis, grape and papaya were distributed in all six groups. In poplar, seven genes belonged to the SR subfamily, eight to the SCL subfamily, six to the RS subfamily, three to the RSZ subfamily, two to the RS2Z subfamily and four to the SC subfamily.
We analyzed the conserved motifs of SR proteins using the online MEME software, as shown in Fig. 1b. Ten specific motifs were detected and detail information was provided in Table S2. Each assumed pattern was annotated by searching PFAM and SMART. Motifs 1, 2, and 5 were found to encode the N-terminal RRMs (RBD; PF00076), whereas motif 3 encoded proteins of RRM with an evolutionarily conserved SWQDLKD motif. Since the RRM containing an evolutionarily conserved SWQDLKD motif was only in the SR subfamily, we found that only the SR subfamily had motif 3. Motif 7 and 9 were the RS domain of each SR gene. Other motifs identified had unknown functions. As expected, members of the same subfamily had highly similar motifs, for example, the 3, 6, and 10 Motifs existed only in the SR subfamily, and the motif 5 existed only in the RS subfamily. It was suggested that SR proteins of the same subfamily had functional similarities.
Gene structure in poplar, Arabidopsis, grape, and papaya and AS profile analysis in poplar, Arabidopsis, and grape
For intron/exon structure analysis (Fig. 2b, Fig. 2c), we found that the intron number of SR genes was 4 ~ 13 in general, and the SR subfamily had more introns, the distribution of intron number was 10 ~ 13. Second, the number of introns in the SC subfamily was between 6 and 9. We found that 10, 7 and 5 SR genes had 5 introns in the RS, SCL and RSZ subfamilies, respectively, accounting for 76.9%, 38.8%, and 55.5% of all genes in the subfamily. There were 5 genes in the RS2Z subfamily with 6 introns, accounting for 71.4% of all RS2Z subfamily genes. It was worth noting that some genes contain introns in their 5' or 3' untranslated (UTR) regions. An extreme example was that AtSR34b contained 4 introns in the 3' untranslated region.
Some researches have reported that AS profile is common in SR genes [18]. For each gene, the primary transcript was placed at the top and the other alternative transcripts were listed below using phytozome database and grape genome database annotation files. And the same protein isoform conserved motif analysis was only shown once (Fig. 3, Figure S2, Figure S3). Since papaya had no annotated multi-transcript information, it was not analyzed here. The results showed that a total of 79 transcripts were detected in 30 genes from poplar (Fig. 2, Fig. 3). PtSR34b and PtSCL29a had the most transcripts, i.e., 5. A total of 68 transcripts were detected in 18 genes from Arabidopsis thaliana with AtRS40 having the highest number of transcripts, i.e., 8 (Figure S2). A total of 65 transcripts were detected in 14 genes in grape (Figure S2). VvSR28 was found to have the highest number (16) of transcripts (Figure S3). Multiple transcripts of genes could produce multiple different protein isoforms. For example, AtRS41 had seven transcripts that could produce four protein isoforms (Figure S2), and VvRS26 had ten transcripts that produce five protein isoforms (Figure S3). PtRS29b had three transcripts that produce two protein isoforms (Fig. 3). Two transcripts of PtRS29b had undergone AS in the 5' untranslated (UTR) region, indicating potential mutations that controlled their transcriptional or translational efficiency (Fig. 2, Fig. 3).
Paralogs and orthologs of SR genes in poplar, Arabidopsis, grape, and papaya
To further investigate the evolution of the SR family, duplicated gene pairs (paralogs and orthologs) analysis was used to investigate gene duplication events within the poplar, Arabidopsis, grape, and papaya. We identified 94 paralogs in four dicots (Fig. 4, Table S3). Specifically, there were 51 paralogs within poplar, 22 within Arabidopsis, 15 within grape, and 6 within papaya. In addition, 68 SR genes including 30 poplar SR genes, 14 grape SR genes, 18 Arabidopsis SR genes and six papaya SR genes were shown to have a duplicated relationship, accounting for 95.8% of all SR gene family members. Transposed duplication, and dispersed duplication were found in the paralogs of four dicots, and whole-genome duplication (WGD) was found in the paralogs of poplar, grape, and Arabidopsis. Tandem duplication and proximal duplication were not found in the four dicots (Fig. 4, Table S3). We next identified 408 orthologs among four dicots (Table S4). Specifically, there were 147 orthologs between Arabidopsis and poplar, 66 between Arabidopsis and grape, 44 between Arabidopsis and papaya, 67 between poplar and grape, 45 between poplar and papaya, and 39 between grape and papaya.
To study evolutionary selection process, Ka value, Ks value, and Ka/Ks ratios of all duplicated gene pairs were listed in Table S3 and Table S4. Ks value of poplar paralogs varied from 0.0155 to 4.9695 (Table S3). The Ks value of Arabidopsis, grape and papaya paralogs varied from 0.6421 to 2.9420, 0.9462 to 4.9368 and 1.3729 to 2.4849, respectively. The Ka/Ks ratios of poplar, Arabidopsis, grape and papaya duplicated gene pairs varied from 0.0555 to 1.4837, 0.0863 to 0.3738, 0.0644 to 0.4607 and 0.1128 to 0.4715, respectively. The results showed that all Ka/Ks ratios were smaller than 0.5 except that the ratio of PtSR34/PtSR33 pair was greater than 1, indicating that the SR genes had undergone strong purifying selection. In addition, for orthologs, we found that all Ka/Ks ratios were less than 1, orthologs had undergone positive selection (Table S4).
To further analyze the evolutionary relationship of the SR genes, we analyzed it by the Ks value of the duplicated gene pairs (Ks value could be used to determine the separation time of the duplicated genes). In poplar, according to Tang et al., the median Ks of the duplicated genes associated with the γ triplication event was 1.54, and the total Ks associated with P-WGD was 0.27 [36]. We detected 22 WGDs associated with amplification of the SR gene in the poplar genome. The values of Ks in different WGDs showed two different ranges (Table S3): 0.2249–0.4493, and 0.9350–1.7387. The former Ks range of duplicated genes might come from P-WGD event, the latter Ks range of duplicated genes might come from γ triplication event. In Arabidopsis, the median Ks value associated with β-WGDs and γ triplication event was close to the saturation median Ks value of 2.00 [36]. Therefore, based on the Ks value, β-WGDs and γ triplication event were indistinguishable. The median Ks associated with α-WGD was reported to be 0.86 [36]. The values of Ks in WGD showed one ranges (Table S3): 0.6421-1.040. This might indicate that these Arabidopsis WGD only experience α-WGD. In grape and papaya, the overall median Ks values of γ triplication event were 1.76 and 1.22, respectively [36]. Then, according to the Ks value of the orthologous gene, it could be divided into a paralogous gene formed before polyploidization and a paralogous gene formed after polyploidization. So we mapped the evolutionary relationship of the SR genes of poplar, Arabidopsis, grape, and papaya (Fig. 5). The SR gene, which was not in the same line, indicated that the isolation of these genes was before the γ triplication event. For each row of SR genes of the corresponding species, they were produced by different WGDs. The results showed that most of the SR genes were lost after WGD. For example, in the γ triplication event, CpSC30 did not produce two other corresponding SR genes. Then the presence of two SR genes in the box indicated that the two genes were genes produced by other duplicated types after the WGD. PtSR29 and PtSR34b were SR genes produced by other duplicated types after WGD. In addition, we found that the duplicated genes generated by WGD and the duplicated genes generated by other recent duplicated types had highly similar gene structures, for example PtSR33, PtSR34, PtSR34c, which had the same number of transcripts, protein numbers, and similar intron numbers.
Cis-acting elements analysis in poplar, Arabidopsis, grape, and papaya
The key roles of cis-acting elements in the promoter region have affected the tissue-specific or stress-responsive expression patterns. In this study, we identified four cis-acting elements of environmental stress type, including those directly related to the ABA response element (ABRE), cis-acting element involved in defense and stress responsiveness (TC-rich repeats), low temperature responsive element (LTR) and MYB binding site involved in drought-inducibility (MBS). In poplar, six genes contained ABRE cis-acting elements, 17 contained LTR cis-acting elements, 15 contained MBS cis-acting elements and nine contained TC-rich repeats cis-acting elements. In Arabidopsis, papaya and grape, 32 contained ABRE cis-acting elements, 20 contained LTR cis-acting elements, 16 contained MBS cis-acting elements and 14 contained TC-rich repeats cis-acting elements (Fig. 6). These findings could aid further investigations into the stress-regulatory mechanisms of SR genes in plant.
Expression pattern analysis of SR genes in poplar, Arabidopsis, grape, and papaya
Since abiotic stress may adversely affect plant growth and development, stress-tolerant gene studies of plants are important. We obtained microarray (Arabidopsis and grape) and RNA-seq (poplar and papaya) data under various stress conditions from different tissues of the four dicots [34, 35]. The results showed that only the CpRSZ20 gene was not expressed, and the rest were expressed (Fig. 7). In poplar, we found that three poplar SR genes (PtSCL25, PtSR34a, PtSCL23) were significantly down-regulated (the value of log2-foldchange was larger than 2) in cold stress, whereas under heat stress three poplar SR genes (PtSR35, PtSCL25, PtSCL23) were up-regulated (the value of log2-foldchange was larger than 2) and the expression level of the two poplar SR genes (PtSR33, PtSR34) were declined. No significant changes were observed under drought and salt stress conditions (Fig. 7). Two papaya genes (CpSCL25, CpSCL35a) were significantly upregulated under drought stress. In Arabidopsis, the expression value of AtSCL28 was much lower than that of other Arabidopsis SR genes, but its expression value was significantly up-regulated under heat stress, and other Arabidopsis SR genes were not significantly changed (the value of log2-foldchange was larger than 2). In grape, no significant changes were observed in the expression levels of SR genes (Fig. 7).
In addition, we studied the correlation between transcript number, protein number, intron number, four cis-acting elements, and the expression values of SR genes (Figure S4). The results showed that no correlation was found in Arabidopsis, grape, and papaya. In poplar, transcript number, protein number, intron number showed a negative correlation with partial heat stress. The four cis-acting elements showed no correlation with expression.
Intron retaining in poplar SR genes
In humans, ES accounts for 35.2% of AS, whereas IR represents only 0.01% of AS [37]. In contrast, IR is the most common type in plants [38]. And we obtained information on the IR events of the poplar SR genes from the RNA-seq data (Table S5). The results showed that 26 poplar SR genes underwent IR events; seven IR sites in PtSR34b, six IR sites in PtSR34c and 5 IR sites in PtRS29, PtRS29a, PtRS2Z33 and PtRS2Z34 (Fig. 3, Table S5). These IR events greatly increased the complexity of the SR genes. In addition, PtSCL29b-2, PtSR34c-4, PtSCL29c-4 and PtRS29a-5 were without IR rates under standard conditions, whereas they had IR events under hot and cold stress conditions (Table S5). Under cold stress, the relative IR rates (treated tissues vs. untreated control samples) of PtSR29-1, PtSR29-1, PtSR29-3, PtSCL23-1, PtSCL23-2, PtSCL25-1 and PtSCL25-2 increased significantly (significant change indicated that relative IR rates changed more than 30% under a certain stress), whereas the relative IR rates of PtSR34b-3, PtRS29a-1 and PtRS29a-2 decreased significantly (Fig. 8, Table S5). Under heat stress, the relative IR rates of PtSC27-1 and PtSCL23-1 decreased significantly, whereas the relative IR rates of PtSR34b-3, PtSR34-2 and PtSR34c-2 increased significantly. Under drought stress, the relative IR rates of PtSCL31-1 in the leaf increased, whereas the ratio in xylem and root decreased. Under salt stress, the relative IR rates of PtRS29-2 in the leaf decreased whereas the ratio in the roots increased (Fig. 8, Table S5).