3.1. Similarity identification, Multiple sequence alignment, and phylogeny analysis
The results of BLASTp against a non-redundant database revealed similarities with other proteins (Table 1). The FASTA sequences of the hypothetical protein HP33 and homologous identified proteins were aligned using multiple sequence alignment. To corroborate homology assessments of proteins at the complex and subunit levels, phylogenetic analysis was used. The alignment and BLAST results were used to create a phylogenetic tree, which offers a comparable idea about the protein (Figure. 1). The distances between branches are also taken into consideration.
Table 1. Non-redundant sequencing yielded a protein with similar properties
Protein ID
|
HP33
|
Organism
|
Protein Name
|
Identity (%)
|
e value
|
WP_009697735.1
|
Vibrio harveyi
|
hypothetical protein
|
99.33
|
0.00
|
WP_045488642.1
|
Vibrio harveyi
|
hypothetical protein
|
99.00
|
0.00
|
WP_045455267.1
|
Vibrio campbellii
|
hypothetical protein
|
96.99
|
0.00
|
WP_104035359.1
|
Vibrio jasicida
|
hypothetical protein
|
65.02
|
0.00
|
WP_045495220.1
|
Vibrio hyugaensis
|
hypothetical protein
|
64.29
|
0.00
|
3.2. Physicochemical features
ProtParam calculates the molecular weight of proteins by summing the average isotopic masses of amino acids in a protein and one water molecule's isotopic mass (Wilkins et al., 1999). The protein consists of 299 amino acids, among the most abundant was Ser 31 followed by, Leu 27, Glu 23, Gly 20, Thr 20, Asp 19, Val 18, Ile 18, Lys 18, Gln 17, Ala 17, Asn 15, Pro 12, Phe 11, Met 8, Arg 8, Tyr 8, Cys 3 and Trp 3. The computed molecular weight was 32987.00 Da, with a theoretical pI of 4.63, indicating a negatively charged protein. The pKa values of amino acids are used to compute the protein pI. The side chain of amino acid determines its pKa value. It plays a crucial role in determining a protein's pH-dependent properties (Pace et al., 2009). The total number of positively charged (Arg + Lys) and negatively charged (Asp + Glu) residues were discovered to be 26 and 42, respectively. An instability index provides information about how stable your protein is when in a test tube. A protein whose instability index is smaller than 40 is predicted as stable, a value above 40 predicts that the protein may be unstable (Gamage et al., 2019). The protein was classified as unstable by the computed instability index of 37.34. In a protein, the aliphatic index is the number of amino acids that have an aliphatic side chain, such as leucine, alanine, valine, and isoleucine (Nehete et al., 2013). The aliphatic index was 81.84, indicating that proteins are stable across a wide temperature range. The GRAVY value was -0.370. GRAVY with a negative value implies that the protein is nonpolar. To calculate GRAVY values, several hydropathy values of each amino acid residue are added and divided by the number of residues or length of the sequence. The higher the positive score, the more hydrophobic the substance (Zhang et al., 2016). Mammalian reticulocytes (in vitro) were found to have a half-life of 30 hours, yeast, > 20 hours, and Escherichia coli, > 10 hours. And the molecular formula of protein was identified as C1447H2286N382O475S11.
3.3. Hypothetical protein functional annotation
This potential protein sequence was discovered to have only a domain using the conserved domain search tool which is a family of unknown functions (accession No. DUF6068). Two further domain search tools, InterProScan and Pfam, were used to verify the result. Pfam server predicted the Sodium: a family of unknown function at 1–65 and 193-276 amino acid residues with an e-value of 6.8e and 15e respectively. This domain represents a biosynthetic gene cluster member (BGC). MIBiG describes this BGC as an example of NRP (non-ribosomal peptide) and polyketide biosynthesis classes. This family includes a protein from the benzamide biosynthetic gene cluster from Myxococcus virescens and appears to be predominantly found in cystobacterineae (Wenzel et al., 2015). PFP-FunDSeqE server recognizes the protein predicted fold type as immunoglobulin-like. The immunoglobulin (Ig)-like the domain is a protein domain that is similar to the Ig domains of immunoglobulins in amino acid sequence and structure. Ig domains have a specific immunoglobulin fold that consists of 70–110 amino acids. 7–10 β-strands are spread across two sheets with typical structure and connectivity in conventional Ig-like domains. The MEME suite was used to examine the motif of V. harveyi Y6 strain HP33 protein. Five motifs were investigated for HP33. The MEME package can also forecast breadth, locations, and E-value. There were zero or one occurrences (of a contributing motif site) per sequence for HP33. The E-value was discovered to be 7.59e-30. Motif width discovered between 6 wide and 50 wide (inclusive). Figure 2 shows the motif analysis for HP33.
3.3 Evaluation of performance
For the five in silico tools and servers, ROC analysis revealed a high level of dependability and credibility (Kumar et al., 2017). When two or more tools predicted the same result for HP33, the confidence of prediction was judged high. An average of 95.96%, 96.804%, and 97.762% were recorded for accuracy, sensitivity, and specificity respectfully of the tools used in functional annotation (Table 2 and Figure 3) which indicates the results were valid (Ahmed, 2022).
Table 2. ROC curve assessment analysis
SL. NO.
|
Tools/Servers
|
Accuracy (%)
|
Sensitivity (%)
|
Specificity (%)
|
ROC area
|
1.
|
InterProScan
|
100
|
100
|
100
|
1
|
2.
|
Pfam
|
100
|
100
|
100
|
.98
|
3.
|
CDD
|
96.7
|
97.1
|
98.1
|
.79
|
4.
|
SMART
|
89.3
|
90.0
|
93.7
|
0.4
|
5.
|
MEME suites
|
93.8
|
96.92
|
97.01
|
.85
|
Average
|
95.96
|
96.804
|
97.762
|
.804
|
3.4. Nature of subcellular localization
To understand the function of hypothetical proteins, we need to know their subcellular localization, because different cellular locations represent different roles. This information can also be designed to make a drug that targets the target protein (Wang et al., 2005). CELLO predicted subcellular localization analysis, which was confirmed by PSORTb and PSLpred. The HP’s subcellular location was anticipated to be the outer membrane (Table 1). In contrast to THMM, the Outer Membrane protein is predicted nothing to contain transmembrane helices. All of these findings point to the protein being Outer Membrane (Islam, Sanjida, et al., 2022).
Table 3: Sub-cellular localization of hypothetical protein predicting from different servers
No.
|
Analysis
|
Result
|
1.
|
CELLO 2.5
|
Outer Membrane
|
2.
|
PSORTb
|
Outer Membrane
|
3.
|
PSLpred
|
Outer Membrane
|
4.
|
TMHMM 2.0
|
No transmembrane helices present
|
5.
|
HMMTOP
|
One transmembrane helices present (6-25)
|
6.
|
CCTOP
|
No Transmembrane protein
|
3.5. Virulence factor prediction
Pathogenic organisms generate virulence factors that allow them to evade host defense mechanisms, which is critical to causing diseases to hosts. As a result, understanding the molecular underpinnings of virulence is critical for vaccine development and reverse vaccinology (Chaudhuri & Ramachandran, 2014). By using MP3 and VFDB servers, HP33 has been confidently predicted as a virulence factor and pathogenic protein. While the VICMpred server confidently predicted that this protein has a cellular process function (Ahmed, 2022).
3.6. Secondary structure analysis
A secondary structure is formed by intermolecular and intramolecular hydrogen bonding between the amide groups in the primary structure of a protein. The two most important secondary structures in proteins are alpha helices and beta sheets. The right-handed helix configuration of the alpha helix. Hydrogen bonds between the carbonyl (CO) group and the amino (NH) group of the fourth amino acid in the C – terminal amino acid stabilize it. Beta sheets are planar structures made up of beta strands linked together by hydrogen bonds (Han et al., 2017). The proportions of α-helix, β sheet content, coil content, number of sequence alignments used for ab-initio predictions and overall confidence value were 22 %, 32 %, 45%, 1, and 75.3%, respectively, according to the PROTEUS Structure Prediction Server 2.0 study.
3.7. 3D structure prediction, model quality refinement, and assessment
The biochemical or biophysical functions of hypothetical proteins may be inferred from their structures (Bernstein et al., 1977). Uncharacterized and hypothetical proteins can benefit from 3D structure to help with function assignment. Because protein folding patterns are frequently retained during evolution, structure-based comparisons can find homologs where sequence-based comparisons are useless (Sivashankari & Shanmughavel, 2006). As a result, structure-based molecular function assignment is a promising strategy for large-scale biochemical protein assignment and the discovery of novel motifs. The three-dimensional structure of the target protein was predicted using the RaptorX server (http://raptorx.uchicago.edu/) and protein model 1 was chosen. The RaptorX program predicts 3D structures for protein sequences that have no close homologs in the Protein Data Bank (PDB) developed by the Xu group. A sequence input is used to predict secondary and tertiary structures, solvent accessibility, disordered regions, and solvent accessibility, according to RaptorX (Källberg et al., 2014). Tertiary model and refine model 1 were chosen and visualized in Discovery Studio (Supplementary Figure 1 A). Through a Ramachandran plot analysis, PROCHECK evaluated the scalability of the galaxy server refined model, where the distribution of φ and ψ angles according to the model limits are depicted in (Supplementary Figure 1B). A valid model covers 84.9 % of the residues in the most preferred regions. A 3D structure model of the target sequence was validated by Verify3D and ERRAT and then compared against the established model. On the Verify3D graph, 82.23 % of residues have an average 3D-1D score of ≥ 0.2, showing that the model has an excellent environmental profile, and an overall quality factor of 71.6157 in ERRAT indicates that the model is good. The YASARA energy minimization server later modified the 3D structure. Before energy minimization, the computed energy was –61,130.4 kJ/mol, but after energy minimization (by three rounds of steepest descent approach), it was reduced to –254,513.5 kJ/mol, making the modeled structure more stable. In addition, ProSA web server analysis resulted in a Z score of -4.09 which indicates the model validation (Figure 1C).
3.8. Analysis of protein-protein interactions and protein disulfide bonds
To fulfill a similar function, proteins frequently interact with one another in a mutually reliant manner. The transcription factors, for example, interact with one another to cause transcription. As a result, the functions of proteins can be deduced from their interacting partners (Sivashankari & Shanmughavel, 2006). Interactions between residues determine the functionality of proteins. We used the STRING 10.0 algorithm to predict the protein's possible functional interactions [31]. The identified functional partners with scores were shown in Supplementary Table 1. Moreover, the most common identified functional partners proteins with HP33 are not annotated yet in the database thus it can conclude that this is a unique functional protein from V. harveyi Y6 strain. Similarly, disulfide bonds are important for the folding of proteins, and they play an important role from both the structural and functional perspectives. In this way, disulfide bond analysis in proteins plays a crucial role in understanding the higher structure and function of proteins. As a result, disulfide bond analysis in proteins is critical for revealing the higher structure and biological roles of proteins. Furthermore, antibody aggregation can be caused by improper disulfide bond formation or exchange, hence disulfide bonding is crucial for protein characterization in the biopharmaceutical manufacturing process (Zhang et al., 2011). CYSPRED predicted two bonding states and one non-bonding state of the predicted tertiary model of HP33 indicates the quality of the model (Islam, Mou, et al., 2022a). In addition, DIANA predicted three bonded cystine sequences in different locations of the protein (Table-3). These servers predicted that in this HP33 protein bonded cystine formed disulfide bond between three cysteine residues, carefully protected inside of the protein to function as a stabilizer for the high-order structure of the protein, or an active center for its bioactivity.
Table 4: CYSPRED and DIANA predict cysteine residues important in disulfide bonding
CYSPRED
|
DIANA
|
Cysteine
|
Prediction
|
Reliability
|
Distance
|
Score
|
Bonded cysteine
|
CYS 23
|
Bonding State
|
4
|
189
|
0.01073
|
IALYGCGGGGS-ADNAQCKTTWS
|
CYS 212
|
Bonding State
|
8
|
228
|
0.99116
|
IALYGCGGGGS-TYNLVCDGMEL
|
CYS 251
|
NON-Bonding State
|
9
|
39
|
0.01104
|
ADNAQCKTTWS-TYNLVCDGMEL
|
3.9. Ligand binding interactions
Galaxy server ligand binding site predictions were done by matching target models with the PDB file of the best-predicted domain-A model. Three models were predicted by a galaxy server with different ligands. Galaxy server also combines the results into three parts Predicted ligand-binding residues, Predicted binding poses of model, and Templates for protein-ligand complex (Table 5; Fig. 4 (A, B)). Interactions at the predicted ligand-binding site were analyzed using LIGPLOT. The details of the protein-ligand interaction analysis were given in Table 5. The most probable protein-ligand binding poses and templates model for another protein-ligand complex were given in Figure 4. Ligand-binding residues depend on the definition of residue-ligand contact (Islam, Mou, et al., 2022b). If the distance between an amino acid residue and a ligand atom is less than the sum of van der Waals radii of the two atoms + 0.5 A, the residue is considered as a binding site residue (Chen et al., 2016). In addition, ligand HEC is a small molecule commonly known as Ferroheme C (DrugBank Accession Number DB03317). It has a molecular weight of around 684.65 KDa and a Monoisotopic value is 684.152734 with the chemical formula C34H36FeN4O4S2. On the other hand, Valine (DrugBank Accession Number DB00161) is a branched-chain essential amino acid that has stimulant activity. It promotes muscle growth and tissue repair. It is indeed a step in the penicillin biosynthesis process (Arakawa et al., 2010). The chemical formula is C5H11NO2 with a molecular weight of 117.5 KDa having a Monoisotopic value of 117.078978601.
Table 5: Predicted ligand-binding residues
Ligand Name
|
Molecular Weight
|
DrugBank Name
|
Binding Residues
|
HEC
|
618.50
|
Ferroheme C
|
272F 273I 274E 275Q 276V 277E 287R 288E 289T 290K
292T
|
VAL
|
117.15
|
Valine
|
3K 4R 7L 25S 26G 27S 90Q 92D 93P
|
3.10. Homo-oligomer models prediction and Ab initio docking results analysis
The interface area (Ǻ2) between one chain and the other chains is calculated using the Naccess program by the Galaxy server (Baek et al., 2017). Sequence identity between query and template protein is shown in Figure 9. It ranges from 0 (totally different) to 100 (identical). Table 6 shows the oligomer templates, number of subunits, and interface area. For template-based models, sequence identity or structure similarity is displayed, whereas docking score is displayed for models predicted by ab initio docking. Structure similarity between the query and template protein with the docking score (measured by TM-align) is shown in Table 6. The higher the docking score is the better score. If a sequence is supplied, up to five proteins are chosen as templates depending on the ordering of S among those with S >0.2 times the maximum S overall and those with S >0.7 times the highest S for the particular oligomeric state. Additional templates are picked using the monomer structure anticipated by the template-based modeling if the number of detected templates using this sequence-based method is less than five (Ko et al., 2012). The ranking of S among those having monomer structures similar to the given monomer structure (TM-score obtained using TM-align>0.5) and in the specified oligomeric state is used to pick structure-based templates (Zhang & Skolnick, 2005). Response time depends on the total number of residues in homo-oligomer. In this study, template-based method is used (red dots) (Figure 8) and in most cases, the homo-oligomers structure prediction is finished within 2 hours (Baek et al., 2017). Figure 9 depicted the five different models of homo-oligomer of HP33 protein with chains A and B.
Table 6. Ab initio Docking Results of Five homo-oligomer models of HP33
Model No.
|
Number of subunits
|
Interface area Ǻ2
|
Docking score
|
1
|
2 mer
|
1200.9
|
2024.170
|
2
|
2 mer
|
2721.9
|
1951.639
|
3
|
2 mer
|
1179.9
|
1855.651
|
4
|
2 mer
|
1469.9
|
1852.032
|
5
|
4 mer
|
6666.6
|
1269.693
|
3.11. Active Site Detection
As predicted by the CASTp v.3.0 algorithm, the protein modeled contains 41 unique active sites (Supplementary Figure 2 (A)). CASTp is a database server that can recognize regions on proteins, determine their boundaries, compute the area of the areas, and calculate the dimensions of the areas. Vacuums concealed within proteins and pockets on protein surfaces are also involved. To define a pocket and volume spectrum or vacuum, surfaces of solvent-accessible molecules (Richard surface) and molecular surfaces (Connolly surface) are employed. CASTp might be utilized to look at the operational zones and surface properties of proteins. CASTp provides a dynamic, graphical user interface as well as on-the-fly measuring of user-submitted constructs (Wei Tian et al., 2018). Based on the area of 2602.319 and the volume of 4124.259 the top active sites of the model protein were identified (Supplementary Figure 2 (B)).