A schematic representation of the overall experimental design is shown in Figure 1.
Sequence retrieval and similarity identification
The hypothetical protein EMK97_00595 [Litorilituus sediminis] was selected by exploring the NCBI database because it may be of considerable interest to several areas of cancer research in the near future. The sequence of the hypothetical protein (GenBank accession: QBG34344.1; NCBI Reference Sequence: WP_130598461.1), which may contain a tumor suppressor domain, was retrieved in FASTA format and submitted to several prediction servers for in silico characterization. Initially, a similarity search was performed using the NCBI BLASTp program [15] against the non-redundant and Swiss-Prot databases [16] to predict the function of the hypothetical protein.
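As an illustration of the retrieval step, the following minimal sketch builds an NCBI E-utilities EFetch request for the accession given above and parses FASTA text into a dictionary. The text does not state how the sequence was downloaded, so the use of EFetch here is an assumption; the helper names (`efetch_fasta_url`, `parse_fasta`) and the toy record are hypothetical and do not represent the real protein sequence.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession):
    """Build an EFetch URL that returns a protein record as FASTA text."""
    params = {
        "db": "protein",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def parse_fasta(text):
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


url = efetch_fasta_url("WP_130598461.1")
print(url)
# The download itself is omitted so the sketch runs offline;
# urllib.request.urlopen(url).read().decode() would fetch the record.

# Toy record used only to exercise the parser (NOT the real sequence).
toy = ">toy_record placeholder\nMKTAYIAK\nQRQISFVK\n"
print(parse_fasta(toy))
```

The parsed sequence string is the form that the prediction servers listed in the following sections accept as input.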
Multiple sequence alignment and phylogeny analysis
Multiple sequence alignment is used to compare closely related genes or proteins in order to infer their evolutionary relationships and to identify patterns shared among functionally or structurally related sequences. The alignment was performed with the MUSCLE server of EBI [17], and the evolutionary relationship between the hypothetical protein EMK97_00595 and the proteins showing structural similarity to it was inferred with Jalview 2.11 [18].
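Once sequences are aligned, a common first summary of their relatedness is pairwise percent identity. The sketch below is not part of the MUSCLE or Jalview workflow; it only illustrates how identity can be read off two rows of an alignment. The function name and the toy aligned strings are hypothetical, and the choice to ignore columns containing a gap is an assumption (other definitions use a different denominator).

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over alignment columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment rows (placeholders, not real proteins).
row_a = "MKT-AYIAK"
row_b = "MRTQAY-AK"
print(round(percent_identity(row_a, row_b), 1))
```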
Analysis of physicochemical properties
ProtParam [5] is a tool that computes various physical and chemical parameters of protein sequences. The physicochemical properties of the hypothetical protein were predicted with ProtParam on the ExPASy server [19], including molecular weight, theoretical pI, amino acid composition, the total numbers of positively and negatively charged residues, instability index, aliphatic index, and grand average of hydropathicity (GRAVY) [20][21][22].
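Several of these parameters follow directly from the amino acid composition. The sketch below shows how GRAVY (mean Kyte-Doolittle hydropathy), the aliphatic index (Ikai's formula on mole percentages of Ala, Val, Ile, and Leu), and the charged-residue counts (Asp + Glu and Arg + Lys) can be computed. It is an independent illustration rather than ProtParam's own code; the function names and the toy sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand average of hydropathicity: mean hydropathy per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index: X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)),
    where X is the mole percentage of each residue."""
    n = len(seq)

    def mole_percent(aa):
        return 100.0 * seq.count(aa) / n

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


def charged_counts(seq):
    """Return (negatively charged, positively charged) residue counts."""
    negative = seq.count("D") + seq.count("E")
    positive = seq.count("R") + seq.count("K")
    return negative, positive


# Toy sequence for demonstration only (NOT the hypothetical protein).
toy = "MKTAYIAKQRQISFVKDE"
print(gravy(toy), aliphatic_index(toy), charged_counts(toy))
```

A negative GRAVY value indicates an overall hydrophilic sequence and a positive value a hydrophobic one; a higher aliphatic index is generally read as higher thermostability.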
Analysis of the secondary structure
SOPMA [23] and PSIPRED [24] were used to predict the secondary structure of the protein. SOPMA is a general secondary structure prediction tool, whereas PSIPRED is a server for comprehensive protein analysis. SOPMA was employed first, and its result was then validated with PSIPRED.
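One simple way to compare two such predictions is per-residue agreement after mapping both to the same state alphabet. The sketch below assumes that SOPMA reports four states as lowercase letters (h = helix, e = extended strand, t = turn, c = coil) and that PSIPRED reports three states (H, E, C); folding turns into coil is an assumption made here for comparison, not a rule stated in the text. All names and the toy strings are hypothetical.

```python
# Assumed mapping from SOPMA's four states to a three-state alphabet.
SOPMA_TO_THREE_STATE = {"h": "H", "e": "E", "t": "C", "c": "C"}


def to_three_state(sopma_states):
    """Convert a SOPMA state string to H/E/C (turns counted as coil)."""
    return "".join(SOPMA_TO_THREE_STATE[s] for s in sopma_states.lower())


def composition(states):
    """Percentage of residues assigned to each state."""
    return {s: 100.0 * states.count(s) / len(states) for s in sorted(set(states))}


def agreement(states_a, states_b):
    """Percentage of positions where two predictions assign the same state."""
    if len(states_a) != len(states_b):
        raise ValueError("predictions must cover the same residues")
    same = sum(1 for a, b in zip(states_a, states_b) if a == b)
    return 100.0 * same / len(states_a)


# Toy predictions (placeholders, not real server output).
sopma = to_three_state("hhhettcc")
psipred = "HHHHCCCC"
print(composition(sopma), agreement(sopma, psipred))
```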
3D structure modeling and quality assessment
The HHpred server [25], which is based on pairwise comparison of profile hidden Markov models, was used to build the 3-dimensional structure from the best-scoring template. The confidence of the predicted structure was also visualized with SWISS-MODEL [26]. Several quality assessment tools of the SAVES and ProFunc [27] servers were applied to estimate the reliability of the predicted 3D model of the hypothetical protein. The Ramachandran plot for the model was built with the PROCHECK program [28] to visualize the backbone dihedral angles of the amino acid residues. The overall quality of the 3D structure was assessed with the ERRAT server [29], and the Verify 3D server was used to determine the compatibility of the atomic (3D) model with its own amino acid sequence and to compare the results with standard structures [30][31].
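The backbone dihedral angles plotted in a Ramachandran diagram are torsion angles defined by four consecutive backbone atoms: phi by C(i-1), N(i), CA(i), C(i), and psi by N(i), CA(i), C(i), N(i+1). The sketch below shows the underlying geometry for one such angle. It is an illustration of the calculation, not PROCHECK's implementation, and the coordinates are toy points chosen so the result is easy to verify by hand.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees about the p1-p2 bond, in (-180, 180]."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond.
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Toy coordinates: the fourth point is rotated a quarter turn about the
# central bond relative to the first point.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Applying this function to the four atoms of each phi and psi definition, residue by residue, yields the point cloud that the Ramachandran plot displays.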
Active site determination
Computed Atlas of Surface Topography (CASTp) is an online active site determination server [32] that locates, delineates, and measures concave surface regions on 3D structures of proteins. CASTp was used to predict the active site of the hypothetical protein, reporting the binding sites and the amino acid residues lining them together with their area and volume.
Identification of protein subcellular localization and topology
The subcellular location of the hypothetical protein was predicted with the BUSCA web server [33]. BUSCA integrates several tools (DeepSig, TPpred3, PredGPI, BetAware, ENSEMBLE3.0, BaCelLo, MemLoci, and SChloro) to predict protein features related to localization. The result was further checked with Cello [34], PsortB [35], Gneg-mPLoc [36], SOSUIGramN [37], and PSLpred [38]. Signal peptides were predicted with PrediSi [39] and the SignalP-5.0 server [40]. The solubility of the hypothetical protein was evaluated with the Protein-sol [41] and SOSUI [42] web servers. Transmembrane helices were assessed with the HMMTOP [43], TMHMM [44], and Sable [45] web servers. The topology of the hypothetical protein was predicted with the ProFunc server [13].
Prediction of protein domain, superfamily, family, coil, and folding pattern
The domain, superfamily, and family of the hypothetical protein were analyzed with the following servers: CDD from NCBI [46], Pfam [47], SMART [48], Interpro [49], SCOP [50][51], Supfam [52], MotifFinder, ProFunc [27], Phyre 2 [53], and CATH-Gene3D [54]. Among them, CDD, Pfam, SMART, Interpro, SCOP, Supfam, and MotifFinder were employed to predict function from the sequence of the hypothetical protein, whereas ProFunc, Phyre 2, and CATH-Gene3D were used to predict function from its 3-dimensional structure. For protein classification, only the hit with the lowest e-value was considered, since a lower e-value indicates a more significant similarity. The protein folding pattern was determined with the Phyre 2 and PFP-FunDSeqE [55] servers, while the coil nature of the protein was determined with PCoils [56] from the Bioinformatics toolkit server.
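The lowest e-value rule can be expressed as a simple selection over the hits collected from the different servers. The sketch below assumes the hits have already been transcribed by hand into a list of records; the field names, server labels, annotations, and e-values are all placeholders invented for illustration and are not results from this study.

```python
def best_hit(hits):
    """Return the hit with the smallest e-value (most significant match)."""
    if not hits:
        raise ValueError("no hits to choose from")
    return min(hits, key=lambda hit: hit["evalue"])


# Placeholder records (hypothetical values, NOT results from this work).
example_hits = [
    {"server": "server_A", "annotation": "domain_X", "evalue": 3.2e-12},
    {"server": "server_B", "annotation": "domain_X", "evalue": 1.5e-20},
    {"server": "server_C", "annotation": "domain_Y", "evalue": 4.0e-03},
]

chosen = best_hit(example_hits)
print(chosen["server"], chosen["annotation"], chosen["evalue"])
```

Because e-values depend on the size of the database searched, comparing them across servers is only a rough heuristic; the selection above simply mirrors the rule as stated.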
Generation of protein-protein interaction network
As the present investigation seeks a tumor suppressor protein from a microorganism, STRING [57] was used to summarize the network information of the VHL tumor suppressor protein. Because the organism is novel, no specific interaction network is available for it. The human VHL protein was therefore used as a presumptive model, which may provide insight into the VHL protein should the findings be applicable to humans.