We first describe the implementation workflow of HIR V2, which is used to predict functional associations between human genes. We then describe the implementation of the GSLA tool, which builds on the high-quality HIR V2 to interpret the potential functional impacts of the observed OTCs. Finally, we describe the backend implementation of the HIR V2/GSLA website.
Data integration for the prediction of functional associations in humans
For the prediction of functional associations between genes in humans, we selected six types of evidence, which were collected from seven public databases for the years prior to 2018: 22,004 expression profiles (Coxpresdb) [12], 288,375 gene annotations (GOC) [13], 59,617 subcellular gene localizations (Compartments) [14], 156,859 domain interactions (IDDI [15] and Pfam [16]), 20,567 phylogenetic profiles (DIOPT) [17], and 9,220 human proteins together with proteins from Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus, Rattus norvegicus, Saccharomyces cerevisiae, and Schizosaccharomyces pombe used to compute interologs (Inparanoid) [18] (Fig. 1). From these six types of evidence, 36 feature values were derived. We used these 36 feature values to measure the strength of functional associations (Additional file 1: Table S1).
In addition to the above six types of evidence, protein-protein interactions were also considered to be evidence of high-strength functional interactions between genes. In this work, we collected 319,696 protein-protein interactions that were reported in experimental studies of humans from two public databases, BioGRID [19] and IntAct [20] (Fig. 1 and Additional file 2: Table S2). To ensure the quality of the experimentally reported protein-protein interactions, we filtered out the interactions that were reported in fewer than two independent studies and reported only in high-throughput experiments. The remaining 4,509 high-quality protein-protein interactions were used for subsequent prediction model training to obtain inferred functional associations that are as strong as protein-protein interactions. In this work, UniProt [21] and BioMart [22] were used to convert different gene IDs to unique HGNC IDs according to the reference gene IDs of the HGNC database [23] (Fig. 1).
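As an illustration, the quality filter described above can be sketched as follows. This is a minimal sketch, not the pipeline used for HIR V2: the `Evidence` record, the function `is_high_quality`, and the gene and publication identifiers are hypothetical, and the code assumes one reading of the filter, namely that an interaction is kept when it is supported by at least two distinct publications and by at least one experiment that is not high-throughput.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Evidence:
    pubmed_id: str         # publication reporting the interaction (hypothetical field)
    high_throughput: bool  # True if the supporting experiment was high-throughput


def is_high_quality(evidence: List[Evidence], min_studies: int = 2) -> bool:
    """Assumed rule: >= min_studies distinct publications and
    at least one supporting experiment that is not high-throughput."""
    studies = {e.pubmed_id for e in evidence}
    has_low_throughput = any(not e.high_throughput for e in evidence)
    return len(studies) >= min_studies and has_low_throughput


# Made-up records keyed by an unordered gene pair.
interactions = {
    frozenset({"GENE_A", "GENE_B"}): [Evidence("pmid1", False), Evidence("pmid2", True)],
    frozenset({"GENE_A", "GENE_C"}): [Evidence("pmid3", True), Evidence("pmid4", True)],
    frozenset({"GENE_B", "GENE_C"}): [Evidence("pmid5", False)],
}
high_quality = {pair for pair, ev in interactions.items() if is_high_quality(ev)}
```

Keying each interaction by a `frozenset` treats A-B and B-A as the same pair, which avoids double counting when records from two databases are merged.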
Computation and evaluation of feature values
Thirty-six feature values of six types of functional association evidence were utilized to characterize the functional interactions between human genes (Additional file 1: Table S1). The detailed equations are on the HIR V2 website. These 36 feature values include 1 homologous interaction feature, 3 phylogenetic profile features, 23 domain interaction features, 4 subcellular colocalization features, 2 coexpression features and 3 shared annotation features (Additional file 3: Table S3).
Not all of these 36 features are suitable for separating protein interactions from random gene pairs. Therefore, only those features showing a strong correlation with functional associations were retained, which increases the signal-to-noise ratio in the subsequent step of functional association inference. To evaluate how well each of the 36 feature values indicates functional associations, we used the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. When a feature value is used to infer protein-protein interactions, different cut-offs lead to a series of sensitivities and specificities. We plotted the sensitivities and specificities obtained at different cut-offs as the ROC curve (X-axis, 1-specificity; Y-axis, sensitivity). Feature values with AUCs higher than 0.6 were considered informative, indicating strong functional associations (Additional file 4: Fig. S1). Eventually, a total of 18 features with AUCs higher than 0.6 were selected for the subsequent prediction of functional associations between human genes (Additional file 3: Table S3 and Additional file 4: Fig. S1).
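The per-feature AUC screen can be illustrated with a short sketch. It relies on the standard identity that the ROC AUC equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The function names and the toy feature values are hypothetical, and the quadratic loop is chosen for clarity rather than for the data volumes used in this work.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as P(positive score > negative score), ties counted as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def select_informative(features, threshold=0.6):
    """Return the names of features whose AUC exceeds the threshold."""
    return [name for name, (pos, neg) in features.items()
            if roc_auc(pos, neg) > threshold]


# Made-up feature values: (scores of interacting pairs, scores of random pairs).
features = {
    "coexpression": ([0.9, 0.8, 0.4], [0.3, 0.5, 0.1]),
    "uninformative": ([0.2, 0.7], [0.2, 0.7]),
}
selected = select_informative(features)
```

In this toy input the second feature has identical score distributions for both classes, so its AUC is 0.5 and it falls below the 0.6 cut-off.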
Inference of functional associations between human genes
The LibSVM package was used to train and predict functional associations [24, 25] (Fig. 1). We chose 4,509 high-quality protein-protein interactions, which were confirmed by experiments and published before 2018, to serve as positive examples representing strong functional associations between human genes. Negative examples were randomly generated gene pairs (gene pairs overlapping with positive examples were removed). Two random genes may have a functional association, although the probability is low. Here, we set the positive-to-negative ratio to 1:100 in the training dataset, which reflects the expectation that only a very small fraction of gene pairs have functional associations and helps to reduce the false positive rate. This functional gene association prediction approach may be considered an implementation of transfer learning. Based on the evidence of functional associations, both protein interactions and functional gene associations may be predicted. Here, protein interactions may be considered one type of strong functional gene interaction. Thus, “knowledge” (i.e., the classification model) gained from predicting protein interactions may be used for the inference of functional associations between genes. Gold-standard protein interactions have been reported by experiments; however, no well-established gold-standard dataset exists for strong functional gene associations. The transfer learning strategy helps us to address this lack of a gold-standard dataset by using the knowledge gained in predicting protein interactions (i.e., a special form of strong functional associations) to infer the functional associations between genes.
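The construction of the negative set can be sketched as below. This is an illustrative sketch under stated assumptions: the function `sample_negatives`, the gene names, and the fixed random seed are hypothetical, and the only properties taken from the text are the 1:100 positive-to-negative ratio and the removal of pairs that overlap with positive examples.

```python
import random


def sample_negatives(genes, positives, ratio=100, seed=0):
    """Draw ratio * len(positives) random gene pairs that are not positives."""
    target = ratio * len(positives)
    max_pairs = len(genes) * (len(genes) - 1) // 2
    if target > max_pairs - len(positives):
        raise ValueError("not enough gene pairs for the requested ratio")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < target:
        pair = frozenset(rng.sample(genes, 2))  # unordered pair of distinct genes
        if pair not in positives:               # drop overlaps with positives
            negatives.add(pair)
    return negatives


# Made-up gene universe and positive pairs.
genes = [f"G{i}" for i in range(60)]
positives = {frozenset(("G0", "G1")), frozenset(("G2", "G3")), frozenset(("G4", "G5"))}
negatives = sample_negatives(genes, positives)
```

Storing pairs in a set removes duplicate draws, so the returned negative set always has exactly `ratio * len(positives)` distinct pairs.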
For the prediction model training, we used the soft-margin Gaussian kernel SVM algorithm. Two parameters, s (kernel width) and C (soft margin), were optimized with 5-fold cross-validation to obtain an optimal harmonic mean of the sensitivity and specificity. We trained the prediction model with the optimized s and C. An external validation dataset consisting of 435 protein interactions (published after December 31, 2017) and randomly generated negative examples was used to validate the prediction model. This model showed a sensitivity of 32.48% and a specificity of 99.98%. Moreover, we evaluated the sensitivity of HPRD, HI-III, HIPPIE, STRING, and UniHI to see how well the interactions in each database covered these new interactions. The comparison results are shown in Additional file 5: Table S4.
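The selection criterion, the harmonic mean of sensitivity and specificity, can be made concrete with a small sketch. The SVM training itself is omitted; the grid of (s, C) values and the confusion-matrix counts are hypothetical placeholders standing in for cross-validation results, and the helper names are not taken from LibSVM.

```python
def sensitivity(tp, fn):
    """Fraction of true interactions that are predicted as interactions."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-interacting pairs that are predicted as non-interacting."""
    return tn / (tn + fp)


def harmonic_mean(a, b):
    """Harmonic mean of two rates; defined as 0 when both are 0."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def score(counts):
    """Selection criterion for one parameter setting, from (tp, fn, tn, fp)."""
    tp, fn, tn, fp = counts
    return harmonic_mean(sensitivity(tp, fn), specificity(tn, fp))


# Made-up cross-validation outcomes: (s, C) -> (tp, fn, tn, fp).
cv_results = {
    (0.1, 1.0): (30, 70, 9990, 10),
    (1.0, 10.0): (35, 65, 9995, 5),
    (10.0, 100.0): (20, 80, 9999, 1),
}
best_params = max(cv_results, key=lambda params: score(cv_results[params]))

# Harmonic mean for the reported validation sensitivity and specificity.
reported = harmonic_mean(0.3248, 0.9998)
```

Because the harmonic mean is dominated by the smaller of the two rates, this criterion penalizes parameter settings that reach very high specificity only by sacrificing sensitivity.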
After we applied this model to all human gene pairs, a total of 83,125 predicted functional associations were obtained. In addition to these inferred functional interactions, we added 4,944 experimentally reported interactions to the HIR V2 dataset, which includes 88,069 interactions. The following equation was used to estimate the proportion of protein-protein interactions that were covered by the predicted functional interactome in humans.
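The equation itself does not appear in this text. A form consistent with the symbol definitions that follow is given here as a reconstruction under that assumption, not as the authors' exact expression. It writes the number of predictions as the expected true positives plus the expected false positives:

```latex
N_{\text{predict}} = N_{\text{interactome}} \times \text{sensitivity}
  + \left( N_{\text{all-pairs}} - N_{\text{interactome}} \right) \times \left( 1 - \text{specificity} \right)
```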
where N_interactome is the expected number of all protein-protein interactions in humans; N_all-pairs is the number of all gene pairs in humans; N_predict is the number of predicted gene associations; and sensitivity and specificity are the accuracy measures obtained when the prediction model was validated with the newly published protein interactions. Solving this equation gives an estimated human protein interactome size of 1.52 × 10^5, which corresponds to 1 protein interaction among 1,230 gene pairs. This result is similar to the reported fraction of protein interactions in yeast (1/775, [26]). Based on the estimated interactome size (1.52 × 10^5) and the estimated sensitivity (32.48%, the more conservative of the training-stage sensitivity (32.88%) and the evaluation-stage sensitivity (32.48%)), the predicted interactions in HIR V2 are expected to include 49,249 protein interactions. Therefore, 57.04% of the HIR V2 functional interactions (49,249 out of 86,359) are expected to represent protein interactions.
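Solving for the interactome size is a one-line rearrangement once the predicted set is split into expected true positives and expected false positives. The sketch below assumes that decomposition, which is an assumption about the form of the equation rather than a statement from the text. The function names and the input values are hypothetical and chosen only so that the round trip is easy to check; they are not the values used for HIR V2.

```python
def expected_predictions(n_interactome, n_all_pairs, sens, spec):
    """Assumed model: predictions = true positives + false positives."""
    true_pos = n_interactome * sens
    false_pos = (n_all_pairs - n_interactome) * (1.0 - spec)
    return true_pos + false_pos


def estimate_interactome_size(n_predict, n_all_pairs, sens, spec):
    """Invert the assumed model for the interactome size."""
    false_pos_rate = 1.0 - spec
    if sens <= false_pos_rate:
        raise ValueError("sensitivity must exceed the false positive rate")
    return (n_predict - n_all_pairs * false_pos_rate) / (sens - false_pos_rate)


# Toy round trip with made-up numbers.
n_predict = expected_predictions(1_000, 1_000_000, sens=0.3, spec=0.999)
n_estimated = estimate_interactome_size(n_predict, 1_000_000, sens=0.3, spec=0.999)
true_fraction = 1_000 * 0.3 / n_predict  # share of predictions that are true interactions
```

The estimate is sensitive to the specificity: with a very large number of gene pairs, a small change in the false positive rate shifts the expected number of false positives substantially.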
Gene set linkage analysis tool
The GSLA web tool was first developed together with the predicted Human Interactome Resource (HIR 2013) [10] to interpret the potential functional impact of the observed OTCs in humans. GSLA tests two hypotheses (Q1 and Q2) to ensure that the reported functional associations between two gene sets are significant (Fig. 2). Q1 tests whether the density of inter-gene-set gene associations between two functionally associated gene sets is higher than the density of background gene associations connecting two random gene sets. Q2 tests whether the high density between functionally associated gene sets is observed only in the biologically correct interactome and not in random interactomes. In other words, when the inter-gene-set density in HIR V2 is compared with that in a random gene association network consisting of the same genes, with each gene having the same number of neighbours, the density in HIR V2 should be higher. In a biological sense, Q1 examines the strength of the functional associations between two gene sets, while Q2 verifies that the observed strong functional association is the result of a biologically correct network topology (i.e., our knowledge of the molecular mechanisms) rather than the result of the compositions of these two gene sets. Some genes, known as hubs, have considerably more neighbours than other genes. Therefore, gene sets that contain many hubs are more likely to connect to other gene sets. By removing this confounding factor of gene set composition, Q2 ensures the biological significance of the functional associations detected between two gene sets. In general, Q1 and Q2 are related but distinct hypotheses. They complement each other so that the GSLA tool can increase its sensitivity and specificity. We set density > 0.01 for Q1 and p < 0.001 for Q2 as the default criteria for GSLA when reporting the functional associations between two gene sets.
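The two tests can be illustrated with a small sketch. The exact statistics used by GSLA are not given in this section, so everything below is an assumption: density is taken as the number of associations between the two sets divided by the number of possible inter-set pairs (for disjoint sets), random interactomes are generated by double-edge swaps that keep every gene's number of neighbours unchanged, and the Q2 p-value is an empirical estimate. All function names and the toy network are hypothetical.

```python
import random
from collections import Counter


def inter_set_density(edges, set_a, set_b):
    """Connected A-B pairs divided by all possible A-B pairs (disjoint sets assumed)."""
    links = sum(1 for u, v in edges
                if (u in set_a and v in set_b) or (u in set_b and v in set_a))
    return links / (len(set_a) * len(set_b))


def degrees(edges):
    """Number of neighbours of each gene in a simple undirected network."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


def degree_preserving_shuffle(edges, n_swaps, rng):
    """Rewire with double-edge swaps (a-b, c-d) -> (a-d, c-b); degrees stay fixed."""
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        # Skip swaps that would create a self-loop or a duplicate edge.
        if len(new1) < 2 or len(new2) < 2 or new1 == new2:
            continue
        if new1 in present or new2 in present:
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


def gsla_sketch(edges, set_a, set_b, n_random=200, seed=0):
    """Return (observed density for Q1, empirical p-value for Q2)."""
    rng = random.Random(seed)
    observed = inter_set_density(edges, set_a, set_b)
    hits = 0
    for _ in range(n_random):
        shuffled = degree_preserving_shuffle(edges, 10 * len(edges), rng)
        if inter_set_density(shuffled, set_a, set_b) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_random + 1)


# Made-up network: two densely linked sets plus unrelated background edges.
edges = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"),
         ("x1", "x2"), ("x2", "x3"), ("x3", "x4"), ("x4", "x5"),
         ("x5", "x6"), ("x6", "x1"), ("a2", "x1"), ("b2", "x4")]
set_a, set_b = {"a1", "a2"}, {"b1", "b2"}
density, p_value = gsla_sketch(edges, set_a, set_b)
```

With an empirical p-value of this form, the smallest attainable value is 1 / (n_random + 1), so a threshold of p < 0.001 requires at least 1,000 random networks; the default of 200 here only keeps the toy example fast.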
Construction of the HIR V2/GSLA website
To deploy the online database, we used the LNMP stack, an integrated system that includes Linux, Nginx, MySQL, and PHP. The MySQL database was used to store data. The web interface of the online database was developed with Laravel, a PHP framework. The front-end was implemented as a single-page application (SPA) with Vue.js, an open-source JavaScript library designed for building SPA web interfaces. Cytoscape [27] was used for the visualization of the functional association networks.