Genome-wide identification of CrRLK1Ls in G. raimondii,G. arboreum, G. hirsutum
In this study, we identified 125, 73, and 71 full-length putative CrRLK1L genes in G. hirsutum (TM-1), G. arboreum (Shixiya 1) and G. raimondii respectively. According to their chromosomal locations, the identified CrRLK1Ls members of the three cotton species were designated as GhCrRLK1L1 to GhCrRLK1L125 in G. hirsutum, GaCrRLK1L1 to GaCrRLK1L73 in G. arboreum, and GrCrRLK1L1 to GrCrRLK1L72 in G. raimondii. The protein length of GhCrRLK1L genes ranged from 402 (GhCrRLK1L111, localized on Ghir_D11 chromosome) to 1360 (GhCrRLK1L81, localized on Ghir_D07 chromosome) amino acids (aa), while GaCrRLK1Ls varied from 476 aa (GaCrRLK1L28) to 1688 aa (GaCrRLK1L1), and GrCrRLK1Ls ranged from 581 aa (GrCrRLK1L34) to 1061 aa (GrCrRLK1L60). The detailed information of CrRLK1Ls in different Gossypium species and their predicted subcellular location and signal peptide were shown in Table S2, S3 and S4.
Phylogenetic analysis of the CrRLK1L gene family
To examine the evolutionary relationships among CrRLK1L proteins, all 125 GhCrRLK1L protein sequences were used to construct phylogenetic trees using neighbor joining method, which contained six clades (I-VI) (Fig.1A). Group I and II were the largest which contained 40 and 37 CrRLK1L members, while groups IV and V and VI contained 13, 27 and 7 CrRLK1L members respectively. Interestingly, subfamily III contained only one member GhCrRLK1L107 which was localized on Ghir_D11 chromosome and contained a protein of 518 aa with one malectin and phosphokinase domain. It was localized on the plasma membrane without a transmembrane domain, having a glycosyl-phosphatidyl inositol (GPI) anchor site in the 482 position of its protein sequence. We extracted the extracellular malectin domain of GhCrRLK1L genes sequences to generate a separate phylogenetic tree (Fig.1B) as it played important roles in their biology functions. These all 125 extracellular domain sequences were also divided into six groups, showing minor differences compared with the tree constructed by the full length protein sequences. These results indicated that the extracellular domain in CrRLK1Ls protein determined their biological functions.
Arabidopsis contained 17 CrRLK1Ls (AtCrRLK1Ls) members which played important roles in cell growth regulation, reproduction, stress responses. So, these 17 AtCrRLK1Ls protein sequences of Arabidopsis were downloaded from Ensembl Plants websites (http://plants.ensembl.org), and were used to construct phylogenetic tree (Fig.2) combined with the all GhCrRLK1Ls members, which were also divided into six groups (I-VI). However, all the 17 AtCrRLK1Ls were distributed in subfamily I and II which were further classified into six subgroup (I-a, I-b, I-c, II-a, II-b and II-c).Among them, 17 AtCrRLK1Ls were distributed in all subgroups except subgroup I-c, which is specific for cotton, containing only two GhCrRLK1Ls including GhCrRLK1L21 and GhCrRLK1L62. Besides, subfamilies III, IV, V, and VI contained 48 GhCrRLK1Ls which were also specific for cotton. This data suggested that 50 GhCrRLK1Ls are specific for cotton, which might evolve in different directions and expanded to exhibit diverse functions among various cotton species.
Gene structure and conserved protein motifs analysis
During the evolution of gene families, the diversification of gene structure is responsible for evolving gene new function to acclimatize in the changing environment. Therefore, we explored the structural diversity of GhCrRLK1L genes, their exon-intron structure and conserved domains (Fig.3B-C). The GhCrRLK1L genes of group I and II contained 1 to 3 exons, except for GhCrRLK1L124 which comprised of 7 exons. Most gene members in group II contained exons varying 1 to 4, except GhCrRLK1L149, GhCrRLK1L58, GhCrRLK1L22 and GhCrRLK1L81. Specially, GhCrRLK1L22 and GhCrRLK1L81 comprised of 21 and 20 exons, respectively. The groups III to VI contained much more exons as compared to groups I and II, the number of exons in these subgroups varied 13 to 25, while GhCrRLK1L111 in group V contained the least exon number (7 exons). The exon numbers were highly diverse among GhCrRLK1Ls, might indicating their various functions. Based on the exon number, these six groups of GhCrRLK1Ls could be divided into two major groups, among which group I and II belong to first major group, while the rest of four groups belong to the second major group. As mentioned above, 17 AtCrRLK1Ls of Arabidopsis distributed in group I and II, belong to the first major group, while all members in groups III to VI were from G. hirsutum. These results illustrated that cotton evolved diverse exon number which may entitle with new functions for GhCrRLK1L family genes.
Furthermore, the protein domain analysis was performed to identify the conserved motifs present in GhCrRLK1Ls. All the GhCrRLK1Ls localized to the cytomembrane, so transmembrane helix was detected among all the members. Results showed that 8 out of the 125 GhCrRLK1Ls have no transmembrane helix domain, however, a GPI anchor site was discovered in each sequence of the 8 GhCrRLK1Ls. These result indicated that GhCrRLK1Ls localized in the plasma membrane by two methods. Signal peptide detection by the SignaIP 5.0 [19] revealed that 89 out of the 125 GhCrRLK1Ls had signal peptides at their N-teiminal ends, and all the signal peptide belong to the secreted signal peptide I (SPI). These results indicated that these proteins may be secreted to the cytomembrane which was consistent with the results of their subcellular location. Furthermore, all the GhCrRLK1L proteins contained a phosphokinase domain and one or two malectin domains (Fig.3B), except for GhCrRLK1L48, GhCrRLK1L54 and GhCrRLK1L128 with a malectin-like domain instead of malectin domain. As the malectin domain is vital for GhCrRLK1Ls functions, we also analyzed the putative motifs in the extracellular domain using MEME programme which identified five conserved motifs (motif1 to motif5) (Fig.S1). These motifs were regularly distributed in the 125 GhCrRLK1Ls. The members of group I and II contained motif2, motif4 and motif5, while motif1, motif2 and motif3 were presented in group III, IV and V, except for GhCrRLK1L109 and GhCrRLK1L111 in group V which contained only motif2 and motif3. Six members of the group VI showed irregular distribution of motifs. Generally, motif1 and motif3 were present only in group III to V, and motif4 and motif5 were specific for group I and II. Collectively, these results suggested that GhCrRLK1Ls possessed similar gene structures, and motifs were clustered in the same group and might have similar functions during the evolution of G. hirsutum.
Distribution and gene duplication events of CrRLK1Ls in Gossypiums
The TBtools [20] was used to map the physical position of CrRLK1Ls (Fig.4). In G. hirsutum, 124 GhCrRLK1Ls were localized in 26 chromosomes (Fig.4A), while one gene (GhCrRLK1L62) was present on scaffold. The distribution of GhCrRLK1Ls on chromosomes was highly uneven i.e. At07 and Dt07 contained high number (13 and 11) of the GhCrRLK1Ls family members respectively. In contrast, A04 and D04 contained only 1 gene. In G. arboreum, 70 GaCrRLK1Ls were localized on 12 chromosomes (Fig.4B), with the exception of A02, while 3 genes (GaCrRLK1L71- GaCrRLK1L73) were localized on scaffolds. A07 of G. arboreum contained most (13) of the GaCrRLK1Ls, followed by A11 with 11 GaCrRLK1Ls. The gene numbers contained by the other chromosomes were varied from 2 to 7. A total of 70 GrCrRLK1Ls were mapped to the 13 chromosomes with one gene localized on scaffold. D07 and D01 of G. raimondii contained 13 and 11 GrCrRLK1Ls genes respectively (Fig.4C). The CrRLK1L gene numbers distributed on each chromosome of Gossypiums was showed in Table S5.
Fragment duplications in the genome region are vital for gene family expansion and occur along with plant genome evolution. In the present study, a gene duplication analysis was performed to investigate the expansion mechanism of the CrRLK1L gene family in the three Gossypium species. Previous studies have shown tandem and segmental duplications were two major events in gene duplication. Based on the open reading frame (ORF) sequence of all genes in each species, MCScan software [21] was used to detect the collinearity relationship between CrRLK1Ls. A total of 83, 8 and 11 gene-pairs with segmental duplication were discovered in G. hirsutum, G. arboreum and G. raimondii, respectively (Fig.5A, Fig.S2, Table S6, S8 and S10). As well, we identified 13, 9 and 6 tandem duplicated gene-pairs in corresponding species (Table S7, S9 and S11). While, there were 5 GhCrRLK1Ls tandem clustered localized on At07 from GhCrRLK1L29 to GhCrRLK1L33, and from GhCrRLK1L108 to GhCrRLK1L111, 4 GhCrRLK1Ls tandem clustered localized on Dt11 (Table S7). All these results indicated that both segmental and tandem duplication played important roles in the expansion of CrRLK1L family proteins in G. arboreum and G. raimondii, while segmental duplication was predominantly present in G. hirsutum. In addition, the relationship between two diploid cottons and G. hirsutum was discovered 57 CrRLK1L protein pairs between G. raimondii and G. hirsutum (Fig.5B, Table S13), and 26 homologous gene-pairs were found between G. arboreum and G. hirsutum (Table S12). These results suggested that most of the CrRLK1L family proteins in tetraploid G. hirsutum were derived from its ancestors diploid cottons, and most (57/72) of CrRLK1Ls in G. raimondii were inherited to the tetraploid cotton, while only 35.6% (26/73) CrRLK1Ls of G. arboreum inherited to G. hirsutum.
GhCrRLK1L6 and GhCrRLK1L7 genes derived from noncorresponding chromosomes
It was reported that G. raimondii and G. arboreum were the ancestors of G. hirsutum [22-25]. According to the CrRLK1Ls distribution within the chromosomes of three Gossypium species, we found that there were two GhCrRLK1Ls including GhCrRLK1L6 and GhCrRLK1L7 localized on At02, while there was no GhCrRLK1Ls on the ancestor corresponding chromosome Ga-A02. In order to find out the origin of GhCrRLK1L6 and GhCrRLK1L7, the collinearity of these two genes was analyzed. GhCrRLK1L70 localized on Dt02 was found to have collinearity with GhCrRLK1L6. However, four genes such as GhCrRLK1L49, GhCrRLK1L73, GhCrRLK1L105 and GhCrRLK1L115 were identified to have collinear relationship with GhCrRLK1L7. Further, multiple sequence alignment was performed and phylogenetic tree was constructed to find their cognate relationship (Fig.S3). GhCrRLK1L7 was found to be most homologous with GhCrRLK1L73 which was localized on Dt03, speculated its duplication from GhCrRLK1L73. Besides, localized site of the GhCrRLK1L7 and GhCrRLK1L73 on chromosome were identified to have segmental duplication, which is similar with localized regions of GhCrRLK1L6 and GhCrRLK1L70 (Fig.6). These results indicated that GhCrRLK1L6 of At02 chromosome duplicated from GhCrRLK1L70 of Dt02 chromosome, and GhCrRLK1L7 of At02 duplicated from GhCrRLK1L73 of Dt03. Further, half of the At02 of tetraploid G. hirsutum originated from A02 of diploid G. arboreum, while other half originated from the other chromosomes Dt03 and Dt02.
Expression patterns of GhCrRLK1Ls in fibers
To explore the possible biological functions of GhCrRLK1Ls during cotton fiber development, the expression patterns of GhCrRLK1Ls in ovules and fiber at different fiber developmental stages were investigated. The ovules of G. hirsutum (TM-1) at different development, -3 days post anthesis (-3 DPA), - 1, 0, 1, 3, 5, 7, 10, 15, 20 and 30 DPA were used for transcriptome analysis (unpublished). A total of 68 GhCrRLK1Ls were found with FPKM > 1 in at least one of the 11 development stages. Based on the previous reported method [26], 46 GhCrRLK1Ls were discovered to be expressed during all 11 stages without any fluctuate change (Fig.S4). However, 22 GhCrRLK1Ls were identified to express at specific fiber development stages. According to their expression patterns, 22 genes were clustered into three groups (A-C) (Fig.7). The group A contained only two members GhCrRLK1L26 and GhCrRLK1L84 belonging to group VI, which preferred to express at 1 DPA to 15 DPA. There were 11 GhCrRLK1Ls in cluster B, which tended to express in 15 DPA to 30 DPA. For instance, GhCrRLK1L119 was specifically expressed at 15 DPA with almost no expression at the other stages, while GhCrRLK1L52 was predominantly expressed only at 20 DPA. The 9 GhCrRLK1Ls in cluster C were inclined to express at the early ovules development stages (-3 DPA to 5 DPA), such as GhCrRLK1L17 exhibited higher expression at -1 DPA. The GhCrRLK1Ls expression pattern suggested that genes in cluster A might play roles during cotton fiber initiation and elongation, while genes in cluster B might be involved in fiber secondary cell wall synthesis, and the cluster C genes were responsible for fiber initiation.
NAC, MYB and WRKY transcription factors might involve in regulating the specific expression of GhCrRLK1Ls during fiber development
To illuminate the regulation network of the 22 specific expressed GhCrRLK1Ls, the 1000 bps upstream regions from the transcript start site (TSS) of the 22 specific expressed and 46 nonspecific expressed GhCrRLK1Ls were extracted and analyzed by homer software. The promoters of the 46 nonspecific expressed GhCrRLK1Ls were used as a background, and 9 conserved motifs were identified in most of the target genes promoters and not in the background sequences. Among the 9 conserved motifs, 6 motifs were predicted to be NAC transcription factor binding sites, while 2 motifs were the MYB transcription factor binding sites and 1 motif was WRKY transcription factor binding site (Table S14). These results suggested that NAC, MYB and WRKY transcription factors might bind to the promoters of GhCrRLK1Ls, leading to their specific expression during fiber development. To further clarify the three transcription factors functions in regulating expression of GhCrRLK1Ls, we selected 2 of each transcription factor in the transcriptome data, and qRT-PCR was performed to identify the expression of the six transcription factors (Ghir_D12G017660.1, Ghir_A13G007770.1, Ghir_A11G026140.1, Ghir_D01G006060.1, Ghir_A06G010870.1, Ghir_A03G015780.1). All of the six transcription factors preferred to express at specific fiber development stage (Fig.8), which were coincident with the 22 GhCrRLK1Ls. This data indicated that GhCrRLK1Ls played important roles during fiber development and their functions are generally regulated by NAC, MYB and WRKY transcription factors.