2.1 Genome-wide identification of GST gene family in Cajanus cajan led to the identification of 68 CcGST genes
A BLASTp search against the C. cajan genome in the LIS database identified a total of 68 GST genes. The identified GST genes were validated for the presence of the thioredoxin (Trx) fold in the N-terminal domain using NCBI Batch CD-Search, the SMART database, and the Pfam database. All 68 genes were found to have the N- and C-terminal domains with the Trx fold. The 68 GST genes were grouped into eight canonical GST classes, i.e. tau, phi, theta, zeta, lambda, DHAR, GHR, and EF1G. The protein, genomic DNA, and mRNA sequences were downloaded from the LIS database. The genes were named by adding the prefix Cc (from Cajanus cajan) to the identifier of the respective GST class: CcGSTU, CcGSTF, CcGSTT, CcGSTZ, CcGSTL, CcDHAR, CcGHR, and CcEF1G. The tau and phi classes were the largest, with 44 (CcGSTU) and 9 (CcGSTF) genes, followed by lambda with 5 genes (CcGSTL) and the remaining classes with 2 genes each (CcGSTT, CcGSTZ, CcDHAR, CcGHR, and CcEF1G). Genes were numbered according to their chromosomal position from top to bottom (Table 1).
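The class tally and naming scheme described above can be checked with a short sketch. The class sizes are taken from the text; the helper name `gene_names` is hypothetical, and the sketch assumes that genes are numbered consecutively from 1 within each class.

```python
# Class sizes as reported in the text (prefix -> number of genes).
class_counts = {
    "CcGSTU": 44,  # tau
    "CcGSTF": 9,   # phi
    "CcGSTL": 5,   # lambda
    "CcGSTT": 2,   # theta
    "CcGSTZ": 2,   # zeta
    "CcDHAR": 2,
    "CcGHR": 2,
    "CcEF1G": 2,
}


def gene_names(prefix: str, n: int) -> list[str]:
    """Build names such as CcGSTU1..CcGSTU44 (assumes consecutive numbering)."""
    return [f"{prefix}{i}" for i in range(1, n + 1)]


# Total number of genes across all eight classes.
total = sum(class_counts.values())

# Flat list of every gene name implied by the naming scheme.
all_genes = [
    name
    for prefix, n in class_counts.items()
    for name in gene_names(prefix, n)
]

print(total, len(all_genes))
```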
2.2 CcGST proteins are highly stable and dominantly localized in the cytoplasm
Among the 68 CcGST proteins, the largest was encoded by CcEF1G1 (385 amino acids; 44.13 kDa) and the smallest by CcGSTU41 (78 amino acids; 9 kDa). The isoelectric point (pI) ranged from 4.65 (CcGSTU26) to 9.69 (CcGSTT2). Of the 68 CcGST proteins, 16 were basic and 52 were acidic. The aliphatic index (AI) ranged from 77.48 (CcGSTF8) to 132.31 (CcGSTU41). The CcGSTs with an AI greater than 100, namely CcGSTU2 (102.34), CcGSTU5 (103.91), CcGSTU21 (104.8), CcGSTU27 (100.59), CcGSTU36 (101.33), CcGSTU37 (103.52), CcGSTU41 (132.31), and CcGSTT2 (106.95), were more hydrophobic than the other GST members because they contained a higher proportion of amino acids with aliphatic side chains, such as alanine, valine, isoleucine, and leucine. The grand average of hydropathicity (GRAVY) was negative for all CcGSTs, indicating that these proteins are hydrophilic overall and interact favorably with water molecules (Table 1). Subcellular localization was predicted with three independent online tools. The results showed that most CcGSTs were localized in the cytoplasm, followed by the mitochondria, chloroplast, endoplasmic reticulum, plasma membrane, and nucleus (Table S1; Fig. 1).
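For readers who want to reproduce indices of this kind, the following is a minimal sketch using the standard definitions: the aliphatic index of Ikai (mole percent of Ala + 2.9 × Val + 3.9 × (Ile + Leu)) and GRAVY as the mean Kyte-Doolittle hydropathy. It is not necessarily the tool used in this study. The example peptide is made up, and the pI > 7 cut-off for calling a protein basic is an assumption.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def aliphatic_index(seq: str) -> float:
    """Ikai aliphatic index: X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)),
    where X is the mole percent of each residue."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")

    def mole_pct(aa: str) -> float:
        return 100.0 * seq.count(aa) / len(seq)

    return mole_pct("A") + 2.9 * mole_pct("V") + 3.9 * (mole_pct("I") + mole_pct("L"))


def gravy(seq: str) -> float:
    """Grand average of hydropathicity: mean Kyte-Doolittle value per residue."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(KD[aa] for aa in seq) / len(seq)


def charge_class(pi: float) -> str:
    """Label a protein by its isoelectric point (assumed cut-off: pI 7)."""
    return "basic" if pi > 7.0 else "acidic"


# Hypothetical peptide, for illustration only (not a CcGST sequence).
example = "MAEVKLLGSWPSPFVRRVE"
print(round(aliphatic_index(example), 2), round(gravy(example), 3))
print(charge_class(4.65), charge_class(9.69))
```

A negative GRAVY value together with a high aliphatic index is not contradictory: GRAVY averages over all residues, whereas the aliphatic index counts only Ala, Val, Ile, and Leu.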
2.3 Thirty-seven CcGST genes were localized on nine Cajanus chromosomes, and tandem and segmental duplication contributed almost equally to CcGST gene family expansion
Among the 68 CcGSTs, only 37 genes were annotated on nine Cajanus chromosomes; the remaining 31 were found on scaffolds with an unknown chromosomal location. Chr 7 carried the most CcGST genes (ten) and Chr 8 the fewest (only one). Chr 2 and 9 contained 5 CcGST genes each, Chr 1 and 11 contained 4 each, and Chr 3 and 6 carried 3 each, whereas Chr 2 possessed only two CcGST genes (Fig. 2). Tandem duplication, segmental duplication, and transposition play an important role in gene family expansion, so duplication events were also analyzed in C. cajan. A total of 19 gene pairs were found to be duplicated, with more than 80% identity between the members of each pair. Ten gene pairs (CcGSTU8/9, CcGSTU9/12, CcGSTU28/29, CcGSTU36/37, CcGSTU38/40, CcGSTF5/6, CcGSTF6/7, CcGSTL3/4, CcGSTL3/5, and CcGSTL4/5) shared a chromosomal or scaffold location and were assigned to tandem duplication, and nine gene pairs (CcGSTU14/41, CcGSTU20/41, CcGSTU34/43, CcGSTU36/38, CcGSTU36/40, CcGSTU37/38, CcGSTU37/40, CcGSTUL2/3, and CcEF1G1/CcEF1G2) had different chromosomal or scaffold locations and were assigned to segmental duplication. The tau class (CcGSTU) played the major role in CcGST gene family expansion, as most duplication events involved tau CcGSTs (Table 2; Fig. 2).
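The classification rule stated above (more than 80% identity; same location for tandem pairs, different locations for segmental pairs) can be sketched as follows. This is a simplified illustration of the criterion as described in the text, not the authors' pipeline. The function name, the identity values, and the locations in the example records are all hypothetical. Stricter definitions of tandem duplication also require the two genes to be physically adjacent, which this sketch does not check.

```python
from typing import Optional


def classify_pair(identity: float, loc_a: str, loc_b: str,
                  min_identity: float = 80.0) -> Optional[str]:
    """Classify a gene pair as 'tandem' or 'segmental', or None if the pair
    does not pass the identity threshold (more than 80% in the text)."""
    if identity <= min_identity:
        return None
    return "tandem" if loc_a == loc_b else "segmental"


# Hypothetical records: (gene A, gene B, % identity, location A, location B).
# The identity values and locations are placeholders, not data from the study.
pairs = [
    ("geneA1", "geneA2", 92.5, "Chr7", "Chr7"),
    ("geneB1", "geneB2", 85.0, "Chr1", "scaffold_12"),
    ("geneC1", "geneC2", 64.0, "Chr3", "Chr3"),
]

for a, b, ident, la, lb in pairs:
    print(a, b, classify_pair(ident, la, lb))
```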
2.4 Phylogenetic tree showed clustering of classes into separate clades
To further understand the relationship between the CcGSTs and the GSTs of other plant species, viz. A. thaliana, G. max, and O. sativa (angiosperms), Physcomitrella patens (a bryophyte), and Larix kaempferi (a gymnosperm), the GST protein sequences of all these plants were aligned with Clustal Omega and a combined phylogenetic tree was constructed using MEGA X. The results showed that the GST genes of these species can be divided into eleven classes, namely tau, phi, theta, zeta, lambda, DHAR, EF1G, GHR, hemerythrin, iota, and Ure2p. The hemerythrin, iota, and Ure2p classes were found only in P. patens, whereas the tau, phi, theta, zeta, lambda, DHAR, EF1G, and GHR classes were common to all plant species. Each GST class formed its own clade, giving eleven clades in total. The two superclades were the plant-specific tau and phi GSTs. The CcGST gene pairs arising from tandem and segmental duplication lay close together in the phylogenetic tree, showing their close relatedness. The tree also indicated that the GST gene family has undergone divergent evolution between dicotyledonous and monocotyledonous plants from a common ancestor (Fig. 3). Additionally, it can be inferred that plant GSTs may have evolved before the divergence of plants into individual groups such as bryophytes, pteridophytes, gymnosperms, and angiosperms (Fig. 3).
2.5 Fifteen conserved motifs were identified and canonical gene architecture was observed in CcGSTs
To investigate the conserved motifs in CcGSTs, the MEME suite was used. Fifteen highly conserved protein motifs were recognized in Cajanus GSTs, ranging from 6 to 50 amino acids in length. Among the classes, CcGSTU contained the highest number of motifs, i.e. motifs 1, 2, 3, 4, 5, 6, 7, 8, 12, 13, and 14. A few motifs were class-specific and a few were found in all CcGST classes. Motif 1 was found in all CcGST classes except CcGHR, whereas motif 3 was present in all CcGSTs except CcEF1G. Motif 5 was observed in all CcGST classes. Motif 9 was found only in CcGSTF, whereas motif 11 was found in CcGSTF, CcGSTT, and CcGSTZ. Motif 10 was found in CcGSTF and CcEF1G. Motif 15 was observed only in CcGSTL. Motifs 1, 3, 4, 11, and 12 were localized at the N-terminus and motifs 6 and 10 at the C-terminus. Motif 3 contained a highly conserved serine residue that was predicted to be the active site residue (Fig. 4).
The gene structure of the 68 CcGST genes was analyzed from the genomic and CDS sequences using the Gene Structure Display Server. The number of exons differed considerably across the CcGST classes, ranging from one to ten. All CcGSTU members had two exons except for CcGSTU14, 31, and 32, which contained three exons, and CcGSTU41, which contained only one exon. All CcGSTF genes contained three exons except for CcGSTF2, which possessed two exons. All CcGSTT genes possessed seven exons; CcGSTZ1 had ten exons and CcGSTZ2 had nine exons. All CcDHAR genes had six exons, whereas CcGSTL1 and CcGSTL3 had eight exons and CcGSTL2 and CcGSTL4 had nine exons. All CcEF1G genes contained six exons, and CcGHR1 and CcGHR2 contained three and six exons, respectively (Fig. 5).
2.6 Ser and Cys are conserved catalytic residues
To confirm the presence of the catalytic residue in the predicted GST protein sequences of Cajanus, the amino acid sequences of each GST class were aligned with the corresponding sequences from Arabidopsis, G. max, and O. sativa (Fig. 6). Ser (S) was observed as the catalytic residue in the N-terminal G-site of the tau, phi, theta, and zeta classes, whereas Cys (C) was observed in the DHAR, lambda, and GHR classes (Fig. 6). However, the positions of the active site residues varied greatly among the CcGST classes. For example, the Ser of tau and theta CcGSTs was found at position 10-20 (Fig. 6a and 6c), whereas in phi CcGSTs it was located at position 60-70 (Fig. 6b). In zeta CcGSTs, it was at position 30-40 (Fig. 6d). In the lambda and DHAR classes, the Cys residue was found at position 100-110 (Fig. 6e) and position 20 (Fig. 6f), respectively. In GHR, the catalytic Cys was found at position 40-50 (Fig). The catalytic residue in the EF1G class was a tyrosine (Tyr), but its position is not confirmed.
2.7 Secondary structure prediction
In Cajanus GSTs, the percentages of secondary structural elements such as α-helices, β-strands, coils, and turns were estimated with the SOPMA tool. Overall, α-helices made up the highest percentage, followed by coils and β-strands. All members of the tau, phi, theta, and zeta classes had α-helix as their most abundant element. CcGSTU42 contained the highest percentage of α-helices (61.29%) and CcDHAR1 the lowest (36.74%). A few cysteinyl GSTs, viz. CcGSTL2, CcGSTL4, CcDHAR1, CcGHR1, and CcGHR2, and the GSTs with a Tyr active site residue, i.e. CcEF1G1 and CcEF1G2, possessed a higher percentage of coils than of α-helices. These structural differences can be correlated with their stability (Table S3; Fig. 7).
2.8 Phosphorylation is the major post-translational modification in CcGSTs
The 68 CcGST amino acid sequences were analyzed for post-translational modifications, namely phosphorylation and glycosylation. Serine (Ser) was found to be the major site of phosphorylation, followed by threonine (Thr) and tyrosine (Tyr), accounting for 45%, 31%, and 25%, respectively (Table S3; Fig. 8). Glycosylation sites were also predicted. Among the 68 CcGSTs, 35 were found to have possible glycosylation sites. CcGHR1 had the maximum number, with 7 glycosylation sites (Table S). A score above 0.70 indicates the most probable glycosylation sites. CcGSTU8, CcGSTU13, CcGSTU44, CcGSTF3, and CcGSTT1 had scores ≥ 0.70 and can be considered to carry likely glycosylation sites (Table S4).
2.9 CcGSTU38 was found to be highly expressed in all developmental stages
To explore the functions of the CcGST genes, their expression levels were analyzed in seventeen anatomical tissues, namely seed, pod, shoot apical meristem, sepal, petal, root, leaf, petiole, stem, nodule, pistil, stamen, bud, embryo, hypocotyl, radicle, and cotyledon, at different developmental stages from germination to senescence. Based on their expression patterns, the CcGST genes can be classified into three types. Expression was analyzed by developmental stage, viz. the reproductive stage, the seedling and germination stages (Fig. 9b), and the vegetative and senescence stages (Fig. 9c). In the reproductive stage, many CcGST genes were expressed ubiquitously in most tissues, including CcGSTU5, CcGSTU22, CcGSTU27, CcGSTU28, CcGSTU32, CcGSTU34, CcGSTU35, CcGSTU38, CcGSTU39, CcGSTU40, CcGSTU44, CcGSTL2, CcGSTL3, CcGSTL4, CcGSTL5, CcGSTF2, CcGSTF5, CcGSTF6, CcGSTF9, CcGSTZ2, CcGHR1, CcGHR2, CcDHAR1, CcDHAR2, CcEF1G1, and CcEF1G2. Among them, the expression levels of CcGSTU38, CcGSTU40, CcEF1G1, CcEF1G2, CcGSTL3, CcGSTL4, CcDHAR2, and CcGSTF6 were highest in all tissues at all developmental stages, except for petals at the reproductive stage (Fig. 9a, b, c). In the radicle at the germination stage, the expression levels of CcGSTU16, CcGSTU17, CcGSTU18, CcGSTU19, and CcGSTF4 were high. In the senescence and vegetative stages, many CcGSTs showed a very low level of expression in most tissues; in particular, CcGSTU2, CcGSTU8, CcGSTU12, CcGSTU13, CcGSTU24, CcGSTU25, CcGSTU41, CcGSTU42, CcGSTU43, CcGSTF1, CcGSTF3, CcGSTF7, CcGSTF8, and CcGSTL1 had very low transcript abundance (Fig. 9c). A marked difference was found between developmental stages: the majority of the genes were expressed in the seedling, germination, and reproductive stages in different anatomical tissues, whereas in the vegetative and senescence stages most genes had an extremely low level of expression in nearly all tissues.
2.10 Molecular docking analyses showed the highest binding affinity of CcGSTU38 with Triapenthenol
In the expression profiling, CcGSTU38 showed the highest expression in all anatomical tissues across all developmental stages; hence, this candidate gene was selected for a molecular docking study with eight of the most commonly used herbicide safeners. The three-dimensional structure of CcGSTU38 was modeled using the SWISS-MODEL workspace. The resulting PDB structure of CcGSTU38 was docked against the safener molecules, namely Fenclorim, Benoxacor, Flurazole, Dichlormid, Oxabetrinil, Fluxofenim, Cyometrinil, and Triapenthenol. The docking study of CcGSTU38 showed different binding energies for the ligands: -3.41 kcal/mol with Benoxacor, -5.03 kcal/mol with Dichlormid, -4.73 kcal/mol with Dietholate, -5.44 kcal/mol with Fenclorim, -5.33 kcal/mol with Flurazole, -5.17 kcal/mol with Fluxofenim, -5.02 kcal/mol with Oxabetrinil, and -5.48 kcal/mol with Triapenthenol. The binding energy of CcGSTU38 was lowest with Triapenthenol (-5.48 kcal/mol), indicating the highest affinity for the protein; Triapenthenol could therefore be a potential substance to enhance the expression level of CcGSTU38 under herbicide treatment (Table 3; Fig. 11).
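As a quick check on the docking comparison, the reported binding energies can be ranked programmatically; a more negative energy corresponds to stronger predicted binding. The sketch below uses the eight values listed above. The conversion to an approximate dissociation constant uses the general thermodynamic relation Kd = exp(ΔG/RT) at an assumed temperature of 298.15 K; it is added here as an illustration and is not a result reported in this study.

```python
import math

# Binding energies (kcal/mol) as listed in the text.
binding_energy = {
    "Benoxacor": -3.41,
    "Dichlormid": -5.03,
    "Dietholate": -4.73,
    "Fenclorim": -5.44,
    "Flurazole": -5.33,
    "Fluxofenim": -5.17,
    "Oxabetrinil": -5.02,
    "Triapenthenol": -5.48,
}

# Rank ligands from most to least favorable (most negative energy first).
ranked = sorted(binding_energy, key=binding_energy.get)
best = ranked[0]

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def approx_kd(delta_g: float, temperature: float = 298.15) -> float:
    """Approximate dissociation constant (mol/L) from a binding free energy,
    using Kd = exp(dG / RT). The temperature is an assumption."""
    return math.exp(delta_g / (R_KCAL * temperature))


for name in ranked:
    dg = binding_energy[name]
    print(f"{name:14s} {dg:6.2f} kcal/mol  Kd ~ {approx_kd(dg):.2e} M")
```

The differences among the top-ranked ligands are a few hundredths of a kcal/mol, which is small compared with the typical uncertainty of docking scores, so the ranking is best read as indicative.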