Proteins are essential components of all living organisms and participate in almost every biological process. However, most proteins do not function as a single entity; instead, they often interact with other proteins to form large macromolecules, i.e. protein complexes, that are involved in different cellular functions. Identifying protein complexes allows assigning functions to proteins of yet unknown roles by using the known function of their interacting partners, following the principle of guilt-by-association (Tian, et al., 2008). Moreover, due to the protein structures, proteins are often involved in more than one complex in different subcellular compartments and biological processes. Therefore, studying protein complexes is important to understand the functional principles of the cell system, from signaling to metabolism (Pawson & Nash, 2000; Maslov & Sneppen, 2002; Reyes-Turcu, et al., 2009; Sweetlove & Fernie, 2018), and provide a better understanding hierarchy of intra- and inter-cellular activities (Bauer & Kuster, 2003).
One way to obtain data on protein-protein interactions (PPIs) relies on high-throughput profiling techniques (Ho, et al., 2002; Gavin, et al., 2002). The advances in experimental techniques (Fields & Sternglanz, 1994; Bauer & Kuster, 2003; Fujikawa & Kato, 2007; Lin & Lai, 2017; McBride, et al., 2019) provide us with a plethora of resulting PPI networks from several model organisms (Gavin, et al., 2006; Szklarczyk, et al., 2014; Babu, et al., 2017; McWhite, et al., 2020). In a protein-protein interaction (PPI) network, nodes correspond to proteins and edges represent physical interaction between two proteins. However, due to the noisiness of experimental techniques, the resulting PPI networks contain spurious interactions, which result in false-positive and false-negative interactions (Berger, et al., 2013).
Computational approaches based on graph clustering algorithms are often used to complement the experimental approaches in the identification of protein complexes. Several studies(Li, et al., 2010; Srihari & Leong, 2013; Bhowmick & Seah, 2016; Wu, et al., 2019) have categorized the existing computational approaches for protein complex prediction in multiple groups, such as (i) supervised(Qi, et al., 2008; Shi, et al., 2011) (ii) unsupervised (Spirin & Mirny, 2003; Bader & Hogue, 2003), (iii) using the only topological structure of PPI network(Enright, 2002; Nepusz, et al., 2012) or (iv) integrating additional knowledge or data, such as gene expression (Feng, et al., 2011), functional and evolutionary information (King, et al., 2004; Sharan, et al., 2005; Dost, et al., 2008). Further, several protein complex gold standards of different species such as EcoCyc for Escherichia coli (Keseler, et al., 2016), MIPS, SGD, and CYC2008 for Saccharomyces cerevisiae (Mewes, 2004; Hong, et al., 2007; Pu, et al., 2008), and CORUM for Homo sapiens (Giurgiu, et al., 2018), have been assembled to facilitate the comparison and evaluation of predicted complexes from different approaches.
Due to the incompleteness and noisiness of interactions data, a variety of computational approaches have been proposed as an alternative to experimental tools to predict protein interactions (Zeng, 2016; Kovács, et al., 2019; Wang, et al., 2020). For instance, link prediction algorithms enable us to overcome some of the disadvantages of experimental approaches by identifying false-negative interactions in the PPI network. Therefore, the link prediction and graph clustering algorithms are jointly used to improve the performance of approaches for the prediction of protein complexes. One can employ a link prediction algorithm as a pre-processing step to tune the PPI network and then predict more accurate protein complexes. Alternatively, one can first employ a graph clustering algorithm to group the proteins that are more likely to interact together, and then apply a variety of local or global structure-based similarity measures to compute the possibility of protein interactions in the same cluster (Hu, et al., 2021).
Although the performance of existing computational approaches has gradually increased over time, they still have some notable disadvantages. Overall, the existing computational approaches to solve this problem are based on the idea that protein complexes correspond to highly connected or near-cliques clusters in the PPI network. Therefore, it is most likely that they predict only large and dense protein complexes, while they are incapable of finding sparse and small ones (Srihari & Leong, 2013; Wu, et al., 2019). If an approach solely depends on the PPI network, as mentioned earlier, its performance is expected to be affected by errors and missing interactions in PPI networks. Although the additional biological information might help in identifying protein complexes, this requires wet-lab experiments that are time-consuming, labor-intensive, and the functional annotations of a protein might be outdated or unverified (Li, et al., 2010).
Moreover, the computational approaches for protein complex prediction are parameter-dependent, which renders it difficult to interpret the resulting protein complexes (Omranian, et al., 2021). Finally, the performance of existing computational approaches is mostly evaluated with the PPI network of S. cerevisiae, and few of the existing methods conducted experiments to assess their results across other species, such as bacteria, plants, and humans (Sharma, et al., 2018; Omranian, et al., 2021; Omranian, et al., 2021; Omranian & Nikoloski, 2022).
In contrast to the existing computational approaches, PC2P, GCC-v, and CUBCO (Omranian, et al., 2021; Omranian, et al., 2021; Omranian & Nikoloski, 2022) represent parameter-free algorithms and compare the performance of their results with several state-of-the-art approaches across different species. These approaches detect a protein complex based on partitioning the network into biclique spanned subgraphs, which is also known as coherent network partition (CNP) (Angeleska & Nikoloski, 2019; Angeleska, et al., 2021). PC2P and GCC-v rely on local properties of the network by finding the minimum cut in complement of the second neighborhood of a node (Omranian, et al., 2021) and computing the clustering coefficient for each node to partition the network into biclique spanned subgraphs (Omranian, et al., 2021), respectively. Alternatively, CUBCO (Omranian & Nikoloski, 2022) is based on the global properties of the network and utilizes global minimum cut to partition the network into biclique spanned subgraphs. Moreover, to overcome the incompleteness of PPI networks, CUBCO integrates link prediction (Kovács, et al., 2019) as a pre-processing step to cluster more probable interacting proteins together. The three approaches show consistent performance across different species, in contrast to other approaches that obtain different ranking scores for different combinations of species and the corresponding gold standards.
Here we introduce a new approach, referred to as CUBCO+, that predicts protein complexes based on the same concept as the PC2P, GCC-v, and CUBCO algorithms, i.e. biclique spanned subgraphs. However, CUBCO + not only considers the effect of false-negative interactions, like CUBCO but also evaluates the false-positive interactions by weighing the interactions with Gene Ontology (GO) semantic similarity. To the best of our knowledge, CUBCO + is the first algorithm to take the effect of both false-positive and false-negative interactions into account, while providing predictions of protein complexes with improved performance over the contenders. The rest of the paper is organized as follows: Section 2 presents the proposed complex prediction algorithm followed by comparing the performance of CUBCO + with 17 other state-of-the-art methods based on 12 performance measures; Section 3 contains the related works, the introduction of PPI networks, gold standards, and well-established performance measures that are used in this study; finally, the conclusion of this study with a future scope is presented in Section 4.