Integrated dataset generation of client, driver, and negative proteins in LLPS
To integrate LLPS proteins into complete, category-specific datasets, we compiled data from the most widely recognized LLPS resources. Since different databases provide varying levels of evidence for the collected data, our first step was to design standardized filters aligned with LLPS vocabulary definitions, generating a curated group of proteins with consistent levels of confidence across all protein categories.
First, for databases that collect general LLPS proteins but do not specifically differentiate between client and driver/scaffold proteins, entries were retrieved by applying filters that ensure that those proteins are actually drivers. This means that they have no partner dependency (neither protein nor RNA/DNA) and do not require further modifications, such as PTMs or mutations, to phase separate. This distinction is crucial because even databases specifically developed to collect driver proteins with associated experimental evidence, such as PhaSePro, include partner-dependent proteins.
For databases that already provide both driver and client labels, the first stage involved separating drivers from clients and then classifying only those proteins with at least in vitro experimental evidence, thus ensuring a higher confidence level.
Considering the high context-dependency of LLPS, a critical aspect of this kind of study involves integrating specific negative datasets of proteins not involved in LLPS. These datasets should include disordered proteins (DisProt), which are mostly overlooked in current negative datasets, in addition to globular proteins (PDB), which are often taken as the naive and only negative set (Fig. 1).
The description of confident negative datasets of proteins not involved in LLPS is challenging because of the condition-dependent nature of the process and the lack of dedicated studies on this specific protein trait. However, having well-defined negative datasets is crucial for effective training and benchmarking of unbiased predictive methods [29]. To address this need, here we implemented two independent datasets: ND (DisProt) and NP (PDB). Filters applied to the original DisProt and PDB databases involved selecting clear negative entries with no current evidence of association with LLPS, ensuring that these entries were not present in any of the positive datasets.
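The exclusion step described above can be sketched as a set difference. This is a minimal illustration, not the pipeline used in this work: the function name, the `llps_flagged` argument, and the assumption that every resource has already been mapped to a common accession type are all hypothetical.

```python
def build_negative_set(candidates, positive_sets, llps_flagged=frozenset()):
    """Return candidate accessions with no known association with LLPS.

    candidates    -- accessions from a source such as DisProt or PDB
                     (assumed to be mapped to one common identifier type)
    positive_sets -- iterable of sets, one per positive LLPS dataset
    llps_flagged  -- hypothetical extra set of accessions with any other
                     reported LLPS evidence, removed as a precaution
    """
    # Pool every accession that appears in any positive dataset.
    positives = set().union(*positive_sets)
    # Keep only candidates absent from the positives and from the flagged set.
    return set(candidates) - positives - set(llps_flagged)
```

Applying the same function separately to the DisProt-derived and PDB-derived candidates would yield two independent negative sets, in the spirit of ND and NP.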
When specific category classifications were applied in each independent dataset we generated, the number of final entries was significantly reduced compared to the source databases due to the stringency of the applied filters (Fig. 2). These results suggest that predictive bioinformatics tools trained with generic data from LLPS databases might produce nonspecific models.
Given the multilabel condition of some LLPS participants, unambiguously distinguishing LLPS proteins as either drivers or clients is not trivial. To address this, here we attempt to provide lists of specific and confident datasets of clients and drivers by cross-checking the information from previous datasets (Fig. 3). Exclusive clients (CE) are proteins that appear only in CD-CODE or DrLLPS as clients/members and not as drivers in the rest of the positive datasets. Exclusive drivers (DE) only appear with the scaffold/driver tag and never as clients. Finally, a protein is both a client and a driver if it is tagged with both terms (C_D). The confidence of each category is also assessed by counting the number of appearances of clients and drivers in the original databases. Thus, intersecting clients (C+) are proteins found in both client databases (CD-CODE and DrLLPS), whereas intersecting drivers (D+) are those observed in at least 3 out of the 5 driver databases. All dataset records are deposited into an interactive, user-oriented website (https://llpsdatasets.ppmclab.com), intended to provide an accessible platform for users to browse and filter data intuitively.
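The category definitions above translate naturally into set operations. The following is a minimal sketch under stated assumptions: the function name and the dictionary layout are hypothetical, the driver database names are placeholders, and it is assumed that C+ and D+ are computed over all clients and all drivers rather than only over the exclusive subsets.

```python
from collections import Counter


def classify_llps_proteins(client_dbs, driver_dbs, min_driver_hits=3):
    """Split proteins into CE, DE, C_D, C+ and D+.

    client_dbs -- dict: database name -> set of accessions tagged as clients
    driver_dbs -- dict: database name -> set of accessions tagged as drivers
    min_driver_hits -- minimum number of driver databases for D+ (3 of 5 in the text)
    """
    all_clients = set().union(*client_dbs.values())
    all_drivers = set().union(*driver_dbs.values())

    # Count in how many driver databases each protein appears.
    driver_hits = Counter(p for members in driver_dbs.values() for p in members)

    return {
        "CE": all_clients - all_drivers,    # clients never tagged as drivers
        "DE": all_drivers - all_clients,    # drivers never tagged as clients
        "C_D": all_clients & all_drivers,   # tagged with both terms
        # Assumption: C+ requires presence in every client database supplied.
        "C+": set.intersection(*client_dbs.values()) if client_dbs else set(),
        "D+": {p for p, n in driver_hits.items() if n >= min_driver_hits},
    }
```

With two client databases and five driver databases as inputs, the default threshold reproduces the "at least 3 out of the 5" rule for D+.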
Finally, additional annotations of disorder-related sequence elements (IDRs and PrLDs) have been precalculated. Predicting such sequences from full-length proteins could help disentangle existing biases in LLPS predictions and reveal how certain physicochemical features may vary between datasets. Detailed descriptions of these annotations are provided in Supplementary S1.
Physicochemical analysis of LLPS properties indicates differences between drivers, clients, and negative proteins
The generation of both positive and negative datasets, along with their segmentation into their disordered elements (IDRs and PrLDs), enables a comparative analysis of how certain physicochemical properties may differ in these protein subsets. We evaluate four different physicochemical traits traditionally linked to LLPS sequences: charge distribution (κ), sticker/spacer distribution (κs|s), percentage of tyrosines and arginines (%Y + R) and net charge per residue (NCPR). Additionally, we include three more features with evidence of mediating interactions that can promote condensation: aggregation propensity [33, 34], cryptic amyloidogenicity [35, 36], and the presence of conditionally disordered regions capable of undergoing a disorder-to-order transition upon partner binding [1, 4].
While NCPR is a key feature for LLPS [33, 37], the distribution of charged amino acids along the sequences also influences this behavior [38, 39]. The κ parameter was first introduced to compute the patterning of positively and negatively charged residues [40]. Beyond charges, the sticker and spacer model of LLPS assumes that sticky residues are responsible for establishing the first weak interactions required for condensation, whereas spacer amino acids are intercalated between stickers to regulate droplet formation and properties [41–43]. In this framework, we have introduced a variant of κ, the κ stickers-spacers (κs|s), to evaluate the distribution of sticker (YRF) and spacer (GSQN) residues (Supplementary S2), thus expanding on previous approaches that just consider stickiness [44].
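The two composition-based metrics, NCPR and %Y + R, are simple enough to sketch directly. The code below is illustrative only and is not the implementation used in this study: the function names are hypothetical, and treating K and R as positive, D and E as negative, and histidine as neutral is an assumption. The patterning parameters κ and κs|s are not reproduced here (see Supplementary S2); only the sticker (YRF) and spacer (GSQN) residue sets named above are reused to compute plain fractions.

```python
POSITIVE = "KR"    # assumption: histidine is treated as neutral
NEGATIVE = "DE"
STICKERS = "YRF"   # sticker residues as listed in the text
SPACERS = "GSQN"   # spacer residues as listed in the text


def residue_fraction(seq, residues):
    """Fraction of positions in seq occupied by any of the given residues."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(seq.count(r) for r in residues) / len(seq)


def ncpr(seq):
    """Net charge per residue: (positives - negatives) / length."""
    return residue_fraction(seq, POSITIVE) - residue_fraction(seq, NEGATIVE)


def percent_y_plus_r(seq):
    """Percentage of tyrosine plus arginine residues."""
    return 100.0 * residue_fraction(seq, "YR")
```

Because these functions depend only on composition, they return the same value for any shuffled version of a sequence, which is precisely why patterning parameters such as κ and κs|s carry additional information.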
The %Y + R metric barely reveals significant differences between datasets when considering full-length sequences (FLS) or only IDRs (Fig. 5). This indicates that the percentage of sticky amino acids alone is insufficient to distinguish LLPS proteins (C_D, CE, and DE) from those not found in condensates (NP and ND).
When considering FLS, NP differs from LLPS proteins across the six other properties analyzed, but it also differs significantly from ND. Worryingly, this explains why state-of-the-art LLPS prediction methods, trained solely against NP, approximate LLPS with intrinsic disorder [26, 31]. This highlights the need for caution when using PDB entries alone as the negative dataset in benchmarking exercises. Accordingly, the NP dataset was not considered in the comparisons we outline below.
The κs|s metric discriminates exclusive drivers (DE) from ND and from exclusive clients (CE) when considering FLS, highlighting the importance of the distribution of sticker and spacer residues along the sequence. However, when analyzing IDRs exclusively, κs|s loses its discriminative power. This implies that the distinction between clients and drivers is not confined, or at least not solely confined, to the disordered segments, but is contingent on the entire protein sequence.
NCPR discriminates ND from LLPS proteins, particularly for the CE and DE datasets in FLS. Again, this differentiation is lost when considering IDRs alone. Conversely, κ (the distribution of charged residues) has no discriminatory power in FLS, but in IDRs it discriminates CE from DE as well as CE from C_D. This underscores the relevance of charge distribution for IDR condensation [39] and suggests its potential use to discriminate exclusive clients not only from exclusive drivers but also from ambiguous participants (C_D). These data also indicate that, although NCPR and κ are clearly related, they convey different information, which can be combined for better discrimination between datasets.
Aggregation shows significant differences between NDs and LLPS-positive datasets for FLS (Table 1). This aligns with the hypothesis that aggregation is one of the driving forces for the reversible assembly of proteins in stress granules [14, 33] and plays a key role in the liquid-to-solid transition of condensates [45, 46]. Importantly, aggregation propensity is significantly different between CE and DE, as well as between CE and C_D, providing strong discrimination between the different roles played by LLPS proteins.
Table 1
Significant comparisons observed for specific datasets in their full-length sequences. Physicochemical properties that can significantly discriminate (p ≤ 0.01) are marked with an X.

| Comparison | % Y + R | κss | NCPR | κ | Aggregation propensity | Cryptic amyloidogenicity | Disorder binding |
|---|---|---|---|---|---|---|---|
| ND-CE | | | X | | X | X | X |
| ND-DE | | X | X | | X | X | X |
| CE-DE | | X | | | X | | X |
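The marks in Table 1 come from pairwise comparisons at a threshold of p ≤ 0.01. The statistical test is not named in this section, so the sketch below uses a two-sided permutation test on the difference of means purely as an assumed stand-in; the function names, the number of permutations, and the seed are all hypothetical choices.

```python
import random
from itertools import combinations
from statistics import mean


def permutation_pvalue(a, b, n_perm=10000, seed=0):
    """Two-sided permutation p-value for the difference of means of a and b."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)            # random relabelling of the pooled values
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def significant_pairs(datasets, alpha=0.01, **kwargs):
    """Return 'A-B' labels for dataset pairs whose p-value is <= alpha.

    datasets -- dict: dataset name (e.g. 'ND', 'CE', 'DE') -> list of
                per-protein values of one physicochemical property
    """
    marked = []
    for (name_a, a), (name_b, b) in combinations(datasets.items(), 2):
        if permutation_pvalue(a, b, **kwargs) <= alpha:
            marked.append(f"{name_a}-{name_b}")
    return marked
```

Running `significant_pairs` once per property would produce one column of a table like Table 1. A correction for multiple comparisons may also be appropriate, but whether one was applied is not stated here.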
While hydrophobic aggregation-prone regions in IDRs are traditionally considered deleterious due to their likelihood to nucleate toxic amyloid formation [47, 48], cryptic amyloidogenic regions of a polar nature are widespread in both IDRs and PrLDs [36, 49]. These regions endow disordered proteins with a self-assembly potential to establish interactions while minimizing the risk of pathogenic aggregation. Consistent with this, cryptic amyloidogenicity is the property with the highest levels of significance for discriminating the IDRs in ND from the IDRs of LLPS proteins.
The profile of disorder-binding regions (DBRs) mirrors the significance levels of aggregation propensity. This is expected, because these are likely the regions that contribute the most to LLPS, and, in many instances, DBRs overlap with aggregation-prone regions [35, 50]. Again, differences between datasets are less pronounced or lost when considering IDRs alone.
Interestingly, when considering all the properties together, DE vs C_D is the only pairwise comparison that shows no significant difference in any sequence subset. This implies that C_D proteins are more similar to drivers than to clients, and that the properties needed to discriminate them (if any) go beyond those considered in this study.
A key observation of our analysis is that, although the presence of PrLDs is often assumed to be a trait of LLPS [2, 41, 51, 52], we could not identify any specific physicochemical trait in PrLDs that differs very significantly between ND and any of the datasets linked to LLPS. In the same spirit, IDRs seem to be less informative than entire protein sequences. The properties that seem to be least affected by the loss of full sequence context are cryptic amyloidogenicity and disorder binding, which can still distinguish ND from LLPS-positive proteins in IDRs. Overall, disordered elements per se, when considered individually, carry little discriminative information. These findings support the notion that multivalency extends beyond IDRs [43], and that other sequence traits could be exploited in LLPS prediction to mitigate the intrinsic IDR bias [31].