Confident protein datasets for liquid-liquid phase separation studies

doi:10.21203/rs.3.rs-4594179/v1

Research Article

This work is licensed under a CC BY 4.0 License

Background

Proteins self-organize in dynamic cellular environments by assembling into reversible biomolecular condensates through liquid-liquid phase separation (LLPS). These condensates can comprise single or multiple proteins, with different roles in the ensemble’s structural and functional integrity. Driver proteins form condensates autonomously, while client proteins just localize within them. Although several databases exist to catalog proteins undergoing LLPS, they often contain divergent data that impedes interoperability between these resources. Additionally, there is a lack of consensus on selecting proteins without explicit experimental association with condensates (non-LLPS proteins or negative data). These two aspects have prevented the generation of reliable predictive models and fair benchmarks.

Results

In this work, we used an integrated biocuration protocol to analyze information from all relevant LLPS databases and generate confident datasets of client and driver proteins. In addition, we introduce standardized negative datasets, encompassing both globular and disordered proteins. To validate our datasets, we investigated specific physicochemical traits related to LLPS across different subsets of protein sequences. We observed significant differences not only between positive and negative instances but also among LLPS proteins themselves. The datasets from this study are publicly available as a website at https://llpsdatasets.ppmclab.com and as a data repository at https://github.com/PPMC-lab/llps-datasets.

Conclusions

Our datasets offer a reliable means for confidently assessing the specific roles of proteins in LLPS and identifying key differences in physicochemical properties underlying this process. These high-confidence datasets are poised to train a new generation of multilabel models, build more standardized benchmarks, and mitigate sequential biases associated with the presence of intrinsically disordered regions.

Keywords

liquid-liquid phase separation, datasets, integration, driver, client, negative, proteins, disorder, machine learning, benchmark

The discovery of intracellular membraneless organelles (MLOs) has marked a paradigm shift in our understanding of spatiotemporal cellular organization [1]. These condensates are dynamic supramolecular structures that can concentrate different biomolecules, including proteins and nucleic acids. They act as central hubs for interactions that enable rapid and reversible compartmentalization, critical for diverse biological functions [2–5].

Although it is increasingly evident that numerous proteins can undergo liquid-liquid phase separation (LLPS), the heterogeneous composition of these condensates complicates our understanding of the precise role played by each particular protein in a given MLO. Therefore, the introduction of specific controlled vocabularies for categorizing LLPS participants has been instrumental in the progression of the field [6, 7]. Driver proteins can undergo LLPS on their own, without any partner (protein, DNA, or RNA). In contrast, client proteins are recruited into pre-existing condensates and are not essential for their integrity. Other proteins act as regulators and can influence the behavior of drivers and clients, but they are not physically part of condensates. It is important to highlight that these roles are not mutually exclusive; a driver of a specific condensate can also be a client in another molecular scenario. Similarly, proteins can behave as clients in a given condensate but also phase-separate individually under different conditions. This duality stems from the high context-dependency of LLPS, which can be modulated by environmental conditions [8], crowding agents [9], and additional partners [10]. Indeed, for any multivalent protein, there likely exists a solute condition regime under which self-assembly into condensates will occur [11]. Therefore, an unequivocal categorization of LLPS proteins into drivers and clients requires a cautious examination of both attributes.

Given the biological relevance of LLPS in physiology, aging, and disease [12–14], several databases have been deployed to annotate proteins observed in biomolecular condensates. However, the conceptual strategies followed to build such databases vary significantly. Consequently, the number of entries, their annotations, and the level of experimental evidence seen on each repository are highly divergent [7]. For instance, the PhaSePro database [6] collects only experimentally validated driver proteins or regions. PhaSepDB [15] contains regions with the potential to drive LLPS (psself) but also others that require protein or nucleic acid partners (psother). LLPSDB [16] annotates several protein components and solute conditions across different LLPS experiments. CD-CODE [17] is oriented toward biomolecular condensates and their constituents, making a specific distinction between driver and member proteins for each MLO. Finally, DrLLPS [18], while more protein-centric, also collects the associated condensates for each protein and the role it plays, either as a scaffold, client, or regulator.

Despite the efforts of curators to annotate proteins involved in LLPS, it is clear that different databases are built aiming for different objectives and collecting distinct types of data, eventually diluting important information across sources. Considering this, efforts to unify LLPS data sources are needed for a better understanding of proteins’ role in condensates, as well as to train and benchmark machine learning (ML) models. MLOsMetaDB constitutes a first attempt at centralizing annotations from most LLPS databases while enriching them with external information (disorder, globular domains, function, orthologs) [19]. Still, few attempts have been made to maintain a comparable level of experimental evidence while integrating proteins from different sources, a fact that has hindered data interoperability and noiseless data annotation [20]. Moreover, the evident lack of biologically relevant negative datasets and of an unambiguous distinction between client, driver, and negative proteins poses significant challenges for benchmarking predictive algorithms. This situation motivated us to carefully inspect and process the data collected by LLPS databases to generate reliable datasets of client, driver, and negative proteins that should be useful for building more accurate predictive tools and standardized benchmarks.

LLPS predictive tools such as FuzDrop [21] and catGRANULE [22] are designed to detect protein regions driving the formation of MLOs under standard conditions. In many instances, intrinsically disordered regions (IDRs) [23, 24] or prion-like domains (PrLDs) [25] overlap with these predicted LLPS-promoting regions. However, not all IDRs or PrLDs necessarily engage in LLPS, leading to potential biases in predictions that favor these features over actual domains with multivalent potential to establish the weak interactions necessary for LLPS [26–28]. In an effort to alleviate this issue, beyond the full-length protein, in the present study we also annotated disorder-related sequential elements, including IDRs and PrLDs. We illustrate how the analysis of relevant features commonly linked to LLPS can be applied to identify significant differences between datasets and mitigate sequential overlaps.

Based on current knowledge of the LLPS phenomenon and the harmonization of curation criteria, we have developed high-quality datasets of client and driver proteins involved in LLPS. These datasets should allow a better understanding of the physicochemical properties that distinguish proteins participating in different condensates from proteins that do not. Additionally, they should help in distinguishing the specific roles played by participant proteins in LLPS reactions.

Integrated dataset generation of client, driver, and negative proteins in LLPS

To integrate LLPS proteins into complete, category-specific datasets, we compiled data from the most recognized LLPS resources. Since different databases provide varying levels of evidence for the collected data, our first step was to design standardized filters aligned with LLPS vocabulary definitions to generate a curated group of proteins with consistent levels of confidence for all protein categories.

First, for databases that collect general LLPS proteins but do not specifically differentiate between clients and driver/scaffold proteins, entries were retrieved by applying filters that ensure that those proteins are actually drivers. This means that they have no partner dependency (neither protein nor RNA/DNA) and do not require further modifications, such as PTMs or mutations, to phase separate. This distinction is crucial because even databases specifically developed to collect driver proteins with associated experimental evidence, such as PhaSePro, include partner-dependent proteins.

For databases that already consider both driver and client labels, the first stage involved distinguishing them from one another (drivers from clients) and then classifying only those proteins with at least in vitro experimental evidence, thus ensuring a higher confidence level.

Considering the high context-dependency of LLPS, a critical aspect of this kind of study involves integrating specific negative datasets of proteins not involved in LLPS. These datasets should include disordered proteins (DisProt), which are mostly overlooked in current negative datasets, in addition to globular proteins (PDB), which are often taken as the naive and only negative set (Fig. 1).

The description of confident negative datasets of proteins not involved in LLPS is challenging because of the condition-dependent nature of the process and the lack of dedicated studies on this specific protein trait. However, having well-defined negative datasets is crucial for effective training and benchmarking of unbiased predictive methods [29]. To address this need, here we implemented two independent datasets: ND (DisProt) and NP (PDB). Filters applied to the original DisProt and PDB databases involved selecting clear negative entries with no current evidence of association with LLPS, ensuring that these entries were not present in any of the positive datasets.

When specific category classifications were applied in each independent dataset we generated, the number of final entries was significantly reduced compared to the source databases due to the stringency of the applied filters (Fig. 2). These results suggest that predictive bioinformatics tools trained with generic data from LLPS databases might produce nonspecific models.

Given the multilabel condition of some LLPS participants, unambiguously distinguishing LLPS proteins as either drivers or clients is not trivial. To address this, here we attempt to provide lists of specific and confident datasets of clients and drivers by cross-checking the information from previous datasets (Fig. 3). Exclusive clients (CE) are proteins that appear only in CD-CODE or DrLLPS as clients/members and not as drivers in the rest of the positive datasets. Exclusive drivers (DE) only appear with the scaffold/driver tag and never as clients. Finally, a protein is both a client and a driver if it is tagged with both terms (C_D). The confidence of each category is also assessed by counting the number of appearances of clients and drivers in the original databases. Thus, intersecting clients (C+) are proteins found in both client databases (CD-CODE and DrLLPS), whereas intersecting drivers (D+) are those observed in at least 3 out of the 5 driver databases. All dataset records are deposited into an interactive, user-oriented website (https://llpsdatasets.ppmclab.com), intended to provide an accessible platform for users to browse and filter data intuitively.

Finally, additional annotations of disorder-related sequential elements (IDRs and PrLDs) have been precalculated. Predicting such sequences from full-length proteins could help disentangle existing biases in LLPS predictions and reveal how certain physicochemical features may vary between datasets. Detailed descriptions of these annotations are provided in Supplementary S1.

LLPS-positive proteins and DisProt negative dataset display a similarly low proportion of ordered residues

The generation of the DisProt negative dataset (ND) was paramount as it adds a necessary subset of negative proteins beyond the naive PDB dataset (NP). ND proteins are not present in the condensates-related specific thematic dataset of DisProt or within our positive data.

LLPS proteins often contain a considerable degree of disorder [20] that facilitates multivalent interactions [26] to the point where disorder predictors have turned out to be acceptable LLPS predictors [30]. In other words, there exists an intrinsic bias toward the prediction of IDRs rather than genuine multivalent sequences when forecasting LLPS propensities [31, 32]. This becomes evident when comparing the fraction of ordered residues in ND with that in LLPS-positive proteins, with both datasets displaying a very similar distribution profile (Fig. 4). Considering this unavoidable bias, annotating the fraction of order and disorder for every protein becomes instrumental to uncovering possible stratifications of disorder that could help to identify protein regions contributing the most to condensate formation.

We acknowledge that some proteins in ND might possess LLPS properties that have not yet been evaluated, and thus, future studies might reveal their potential to undergo LLPS under certain conditions.

Physicochemical analysis of LLPS properties indicates differences between drivers, clients, and negative proteins

The generation of both positive and negative datasets, along with their segmentation into their disordered elements (IDRs and PrLDs), enables a comparative analysis of how certain physicochemical properties may differ in these protein subsets. We evaluate four different physicochemical traits traditionally linked to LLPS sequences: charge distribution (κ), sticker/spacer distribution (κ_s|s), percentage of tyrosines and arginines (%Y + R) and net charge per residue (NCPR). Additionally, we include three more features with evidence of mediating interactions that can promote condensation: aggregation propensity [33, 34], cryptic amyloidogenicity [35, 36], and the presence of conditionally disordered regions capable of undergoing a disorder-to-order transition upon partner binding [1, 4].
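The two composition-based traits are simple enough to sketch directly. The snippet below is a minimal illustration, not the authors' script: it computes %Y + R as a residue percentage and NCPR from Henderson-Hasselbalch fractional charges at pH 7.0, as described in the Methods. The side-chain pKa values are textbook approximations chosen here as an assumption (the source does not state which set was used), and termini, cysteine and tyrosine ionization are ignored.

```python
# Assumed side-chain pKa values (textbook approximations); the source does not
# specify which pKa set was used, so these are illustrative only.
PKA_POSITIVE = {"K": 10.53, "R": 12.48, "H": 6.00}
PKA_NEGATIVE = {"D": 3.65, "E": 4.25}


def percent_yr(sequence):
    """Percentage of tyrosine (Y) and arginine (R) residues in a sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    seq = sequence.upper()
    return 100.0 * (seq.count("Y") + seq.count("R")) / len(seq)


def ncpr(sequence, ph=7.0):
    """Net charge per residue from Henderson-Hasselbalch fractional charges.

    Only side chains of K, R, H (positive) and D, E (negative) are counted;
    terminal groups are ignored in this sketch.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    charge = 0.0
    for residue in sequence.upper():
        if residue in PKA_POSITIVE:
            # Fraction of the basic group that is protonated (charge +1).
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            # Fraction of the acidic group that is deprotonated (charge -1).
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return charge / len(sequence)
```

With these assumptions, histidine contributes only about 0.09 positive charge at pH 7.0, which is why a fractional-charge calculation can differ from simply counting K/R against D/E.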

While NCPR is a key feature for LLPS [33, 37], the distribution of charged amino acids along the sequences also influences this behavior [38, 39]. The κ parameter was first introduced to compute the patterning of positively and negatively charged residues [40]. Beyond charges, the sticker and spacer model of LLPS assumes that sticky residues are responsible for establishing the first weak interactions required for condensation, whereas spacer amino acids are intercalated between stickers to regulate droplet formation and properties [41–43]. In this framework, we have introduced a variant of κ, the κ stickers-spacers (κ_s|s), to evaluate the distribution of sticker (YRF) and spacer (GSQN) residues (Supplementary S2), thus expanding on previous approaches that just consider stickiness [44].
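To make the sticker/spacer relabeling concrete, the sketch below computes the unnormalized patterning term (often written δ) on which κ is built, with stickers (YRF) and spacers (GSQN) taking the place of positive and negative charges. It follows our reading of the κ definition [40]: a local asymmetry is computed in sliding blobs and compared with the whole-sequence asymmetry. This is an assumption-laden simplification: it uses a single blob size of 5 and does not perform the normalization by the maximally segregated sequence that turns δ into κ, so its values are not comparable to the κ_s|s values reported here.

```python
STICKERS = set("YRF")   # sticker residues named in the text
SPACERS = set("GSQN")   # spacer residues named in the text


def _asymmetry(segment):
    """Sticker/spacer asymmetry, analogous to the charge asymmetry used for kappa."""
    n_sticker = sum(residue in STICKERS for residue in segment)
    n_spacer = sum(residue in SPACERS for residue in segment)
    if n_sticker + n_spacer == 0:
        return 0.0
    f_sticker = n_sticker / len(segment)
    f_spacer = n_spacer / len(segment)
    return (f_sticker - f_spacer) ** 2 / (f_sticker + f_spacer)


def delta_sticker_spacer(sequence, blob=5):
    """Mean squared deviation of blob asymmetry from whole-sequence asymmetry.

    Unnormalized: dividing by the value of the most segregated arrangement
    (not done here) would be needed to obtain a kappa-like score in [0, 1].
    """
    seq = sequence.upper()
    if len(seq) < blob:
        raise ValueError("sequence shorter than blob size")
    overall = _asymmetry(seq)
    blobs = [seq[i:i + blob] for i in range(len(seq) - blob + 1)]
    return sum((_asymmetry(b) - overall) ** 2 for b in blobs) / len(blobs)
```

A block-segregated toy sequence such as ten Y followed by ten G scores higher than the alternating sequence YGYG..., which is the qualitative behavior a patterning parameter is expected to show.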

The %Y + R metric barely reveals significant differences between datasets when considering full-length sequences (FLS) or only IDRs (Fig. 5). This indicates that the percentage of sticky amino acids alone is insufficient to distinguish LLPS proteins (C_D, CE, and DE) from those not found in condensates (NP and ND).

When considering FLS, NP differs from LLPS proteins across the six other properties analyzed, but it also differs significantly from ND. Worryingly, this explains why state-of-the-art LLPS prediction methods, trained solely against NP, approximate LLPS with intrinsic disorder [26, 31]. This highlights the need for caution when using PDB entries alone as the negative dataset in benchmarking exercises. Accordingly, the NP dataset was not considered in the comparisons we outline below.

The κ_s|s metric discriminates exclusive drivers (DE) from ND and exclusive clients (CE) when considering FLS, highlighting the importance of the distribution of sticker and spacer residues along the sequence. However, when analyzing exclusively IDRs, κ_s|s loses its discriminative property. This implies that the distinction between clients and drivers is not confined, or at least not only confined, to the disordered segments, but is contingent on the entire protein sequence.

NCPR allows discriminating ND from LLPS proteins, particularly for CE and DE datasets in FLS. Again, this differentiation is lost when considering IDRs alone. Conversely, κ (distribution of charged residues) does not have discriminatory power in FLS, but it does for IDRs between CE and DE, as well as between CE and C_D, underscoring the relevance of charge distribution for IDR condensation [39] and suggesting its potential use not only to discriminate exclusive clients from exclusive drivers but also from ambiguous participants (C_D). These data also indicate that, despite NCPR and κ being obviously related, they convey different information, which can be combined for better discrimination between datasets.

Aggregation shows significant differences between NDs and LLPS-positive datasets for FLS (Table 1). This aligns with the hypothesis that aggregation is one of the driving forces for the reversible assembly of proteins in stress granules [14, 33] and plays a key role in the liquid-to-solid transition of condensates [45, 46]. Importantly, aggregation propensity is significantly different between CE and DE, as well as between CE and C_D, providing strong discrimination between the different roles played by LLPS proteins.

Table 1

Significant comparisons observed for specific datasets in their full-length sequences. Physicochemical properties that can significantly discriminate (p ≤ 0.01) are marked with an X.
Comparison	κ_s|s	NCPR	Aggregation propensity	Cryptic amyloidogenicity	Disorder binding
ND-CE		X	X	X	X
ND-DE	X	X	X	X	X
CE-DE	X		X		X

While hydrophobic aggregation-prone regions in IDRs are traditionally considered deleterious due to their likelihood to nucleate toxic amyloid formation [47, 48], cryptic amyloidogenic regions of a polar nature are widespread in both IDRs and PrLDs [36, 49]. These regions endow disordered proteins with a self-assembly potential to establish interactions while minimizing the risk of pathogenic aggregation. Accordingly, cryptic amyloidogenicity is the property with the highest levels of significance for discriminating IDRs in NDs from the IDRs of LLPS proteins.

The profile of disorder-binding regions (DBRs) mirrors the significance levels of aggregation propensity. This is expected, because these are likely the regions that contribute the most to LLPS, and, in many instances, DBRs overlap with aggregation-prone regions [35, 50]. Again, differences between datasets are less pronounced or lost when considering IDRs alone.

Interestingly, when considering all the properties together, DE vs C_D is the only pairwise comparison without a significance level in any sequence subset. This implies that C_D proteins are more similar to drivers, and the properties to discriminate them (if any) go beyond the ones considered in this study.

A key observation of our analysis is that, despite the presence of PrLDs being often assumed to be a trait of LLPS [2, 41, 51, 52], we could not identify any specific physicochemical trait that differs very significantly between ND and any of the datasets linked to LLPS. In the same spirit, IDRs seem to be less informative than entire protein sequences. The properties that seem to be less affected by not considering the full sequence context are cryptic amyloidogenicity and disorder binding, which can still distinguish NDs from LLPS-positive proteins in IDRs. Overall, disordered elements per se, when considered individually, bear poor discriminative information. These findings support the notion that multivalency extends beyond IDRs [43], as other sequential traits could be exploited in LLPS prediction to mitigate the intrinsic IDR bias [31].

The datasets generated in this work allow for a confident evaluation of the role of a given protein in LLPS while integrating information from diverse LLPS sources. A total of 4526 different proteins (755 positives, which are drivers, clients, or both, and 3771 negatives) are classified in the datasets, aiming to provide a realistic context for the LLPS phenomenon. This is significant given that fully annotated LLPS proteins constitute only a small fraction of the entire protein universe. Proteins not included in the positive datasets lack sufficient evidence of undergoing LLPS. Despite the known context-dependency of the process, efforts were made to select reliable structured and disordered negative proteins, resulting in a larger number of negatives compared to positives. Multiple proteins from the original source databases are not included in any of the final datasets (either positive or negative) due to the stringent filtering criteria we used to obtain highly confident driver, client, and negative proteins. Excluded proteins need additional partners (e.g., a protein co-driver or an RNA), undergo post-translational modifications (e.g., phosphorylation), or simply participate as regulators.

The level of annotation of the datasets should allow for specific protein stratifications to perform further analyses. For instance, it is possible to work with exclusive clients or exclusive drivers (category specificity; CE, DE) to uncover additional properties that influence the client-driver distinction [53]. Conversely, working with proteins from ambiguous datasets (e.g. C_D) can prove useful in studying context-dependent LLPS and finding possible associated variables [8, 54]. As observed here, IDRs and PrLDs alone are generally insufficient to significantly discriminate LLPS proteins from negative data, but specific properties such as cryptic amyloidogenicity or disorder binding provide hints on the features that set these sequences apart from negative disordered proteins.

Importantly, these datasets offer an opportunity to reassess the performance of current LLPS predictive methods and train more accurate models. Our data allow the development of both single-label and multi-label models. Single-label models could address problems such as distinguishing between LLPS and non-LLPS proteins, or even specific client prediction [53]. Multi-label models should allow assessing the probability of a protein being a driver, a client, both, or neither, thus identifying the most probable role of each protein. This strategy would provide a more precise and protein-centric perspective compared to other tools that combine independent models for predicting self-assembled and partner-dependent LLPS proteins [55].

Machine learning classifiers built on n-gram features could be used as a first approach to identify multivalent patterns along the sequences, as they have already proven successful in predicting amyloidogenic motifs in protein sequences [56]. Although the modest size of our datasets might constrain the effective usage of deep models that require large training data [57], they could still be valuable for fine-tuning transformer-based models [58].

The incorporation of expanded and confident negative datasets, in addition to the novel client and driver distinction, should establish the basis for setting up fair benchmarks. In particular, the generation of a dedicated disordered negative dataset plus the annotation of proteins’ disorder fraction can help to promote the development and refinement of specific predictive tools that minimize sequential IDR biases [31], advancing towards the implementation of a new generation of LLPS predictors [30].

In this work, we share holistic and rigorously scrutinized datasets to reevaluate the prediction, distinction, and benchmarking of client, driver, and negative proteins in LLPS. We highlight a similarly low proportion of ordered residues between positive and negative data and elucidate significant differences between full-length drivers, clients, and negative proteins in specific physicochemical properties connected to LLPS behavior.

Filtering clients and drivers

To obtain proteins that fulfill the definition of drivers (ability to phase-separate by themselves), we thoroughly filtered the databases to exclude entries with any known partner dependency:

D1: 57 proteins from PhaSePro v1.1.0 with no partner, RNA or PTM dependency.
D2: 116 psself proteins from PhaSepDB v2.1 without LLPS partners (either proteins, RNA or DNA) or regulations (PTMs, repeats, mutations or splicing).
D3: 184 unambiguous natural proteins with one protein component without mutations, repetitions or PTMs obtained from LLPSDB v2.0.
D4: 207 driver proteins from all biomolecular condensates in CD-CODE with in vitro, in cellulo or in vivo evidence (confidence score >= 3).
D5: 130 scaffold proteins from DrLLPS with condensate information and tissue/cell annotations.

To collect client proteins that are recruited into preformed biomolecular condensates, we could only make use of CD-CODE and DrLLPS, since they specifically accommodate the definition of member and client proteins, respectively.

C1: 155 member proteins from all biomolecular condensates with in vitro, in cellulo or in vivo evidence were obtained from CD-CODE v1.
C2: 288 client proteins from DrLLPS with condensate information, tissue/cell defined and evidence descriptions. To avoid possible high throughput annotations, we excluded proteins reported in publications covering more than 10 entries.
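The high-throughput exclusion applied to C2 can be sketched as a count of entries per publication. The snippet below is a minimal illustration under stated assumptions: the input is a list of hypothetical (protein accession, publication identifier) pairs, and a protein is dropped if it is reported in any publication covering more than 10 entries. The source does not say how proteins supported by both a high-throughput and a low-throughput publication were handled, so that choice is ours.

```python
from collections import Counter


def drop_high_throughput(annotations, max_entries=10):
    """Return accessions not reported in any publication with > max_entries entries.

    annotations: iterable of (protein_accession, publication_id) pairs
    (a hypothetical input layout, not the DrLLPS export format).
    """
    pairs = set(annotations)  # de-duplicate repeated annotations
    entries_per_publication = Counter(publication for _, publication in pairs)
    flagged = {
        protein
        for protein, publication in pairs
        if entries_per_publication[publication] > max_entries
    }
    return {protein for protein, _ in pairs} - flagged
```

Under the alternative reading, where one low-throughput publication is enough to keep a protein, the final set difference would be replaced by a check for at least one publication at or below the threshold.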

We did not include regulator proteins in our datasets because they are not physically associated with condensates and are only considered by DrLLPS, precluding a consensus annotation of these types of proteins.

Obtaining unambiguous clients and unambiguous drivers

Category specificity:

CE: 367 exclusive clients are those collected in CD-CODE as member proteins (C1) or DrLLPS as clients (C2) which are not present in any of the five driver datasets (D1, D2, D3, D4, D5).
DE: 358 exclusive drivers are those collected in any of the driver datasets and not present in C1 or C2.
C_D: 59 client and driver proteins appear in C1 or C2 and also in at least one of D1, D2, D3, D4 or D5.

Category intersection:

C+: 17 intersecting clients appear in C1 and C2.
C-: 409 non-intersecting clients appear in only one of C1 or C2.
D+: 77 intersecting drivers appear in at least 3 out of 5 driver datasets.
D-: 340 non-intersecting drivers appear in fewer than 3 of the 5 driver datasets.
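The category definitions above reduce to set operations over the source datasets. The following sketch assumes each dataset is available as a Python set of UniProt accessions; the function name and the returned keys are hypothetical and are not taken from the authors' repository.

```python
from collections import Counter


def categorize(driver_sets, client_sets, min_driver_hits=3):
    """Derive CE, DE, C_D, C+, C-, D+ and D- from source datasets.

    driver_sets: list of sets of accessions (D1..D5)
    client_sets: list of sets of accessions (C1, C2)
    """
    all_drivers = set().union(*driver_sets)
    all_clients = set().union(*client_sets)

    # Category intersection: how many source datasets support each label.
    c_plus = set.intersection(*client_sets)
    driver_hits = Counter(acc for dataset in driver_sets for acc in dataset)
    d_plus = {acc for acc, hits in driver_hits.items() if hits >= min_driver_hits}

    return {
        # Category specificity.
        "CE": all_clients - all_drivers,
        "DE": all_drivers - all_clients,
        "C_D": all_clients & all_drivers,
        "C+": c_plus,
        "C-": all_clients - c_plus,
        "D+": d_plus,
        "D-": all_drivers - d_plus,
    }
```

Note that in this reading the intersection categories are computed over all clients and all drivers, including C_D proteins, which matches the reported counts (17 + 409 = 367 + 59 clients; 77 + 340 = 358 + 59 drivers).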

Generation of negative datasets

NP: 1530 structured proteins from the PDB, with a length between 50 and 5000 residues and a similarity cutoff > 30% [59, 60], that are not present in D1, D2, D3, D4, D5, C1 or C2. This was used as the classical "naive" dataset, which is used in many other publications in the field for benchmarking and/or training models. UniProt accession numbers were obtained with BLASTp. Although specific contacts in globular proteins (many relying on modular interaction domains [1]) have been associated with phase separation, in general terms they are less prone to establishing the weak multivalent interactions required for LLPS. In light of this, globular domains seem to be the most obvious negative dataset and are represented in this first negative group of proteins.
ND: 2379 proteins with annotated disorder collected from DisProt (2023_06 release) that are not present in the ‘Condensates-related proteins’ thematic dataset, not associated with the GO term ‘molecular condensate scaffold activity’, and not present in D1, D2, D3, D4, D5, C1, C2 or PDB. DisProt entries are manually curated from the literature by expert biocurators [61].
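The exclusion logic shared by both negative datasets can be sketched as a filter over candidate entries. The example assumes candidates are given as a mapping from accession to sequence, and that the positive accessions and any other exclusion lists are plain sets; all names are hypothetical. The sequence-redundancy step applied to the PDB entries is not reproduced.

```python
def negative_set(candidates, excluded, min_len=None, max_len=None):
    """Keep candidate accessions that are not excluded and pass optional length bounds.

    candidates: dict mapping accession -> amino acid sequence
    excluded: set of accessions to remove (positives, thematic lists, ...)
    """
    kept = set()
    for accession, sequence in candidates.items():
        if accession in excluded:
            continue
        if min_len is not None and len(sequence) < min_len:
            continue
        if max_len is not None and len(sequence) > max_len:
            continue
        kept.add(accession)
    return kept


def build_negatives(pdb, disprot, positives, condensate_related):
    """NP: length-bounded PDB entries; ND: DisProt entries outside every exclusion list."""
    np_set = negative_set(pdb, positives, min_len=50, max_len=5000)
    nd_set = negative_set(disprot, positives | condensate_related | set(pdb))
    return np_set, nd_set
```

Here `condensate_related` stands in for both the DisProt thematic dataset and the GO-term exclusion, which would need to be collected separately before calling the function.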

Protein disorder/order annotation

Proteins in datasets can have different levels of disorder content. Since IDRs can overlap with LLPS regions, two metrics accounting for the fraction of disorder and order were extracted from MobiDB [62] for all protein datasets. The “disordered fraction” collects curated and derived annotations whereas the “ordered fraction” collects PDB-derived annotations. These metrics allow for possible further stratifications according to the fraction of disorder/order of well-annotated proteins.

General protein annotation with UniProt

The UniProt database was used to collect relevant information, such as the protein cellular location (GO-CC) and the amino acid sequence. The cytoplasmic or nuclear localization of certain proteins involved in LLPS has become pivotal in unveiling the reasons behind their pathogenicity [63, 64]. Therefore, proteins with cytoplasmic (cyto*) or nuclear (nucl*) related GO terms were saved. Proteins without GO information, obsolete entries or isoforms (n=168) were discarded since they are, in most cases, associated with poorly annotated proteins/variants. After UniProt annotation, 4526 unique proteins were integrated from all datasets into a single .tsv file and included in the final website (Supplementary S3).

Disorder-related sequential elements: IDRs and PrLDs

IDRs with at least 10 amino acids were obtained by considering the ‘disorder consensus’ sequences annotated by MobiDB [62]. PrLDs were obtained with the PLAAC algorithm [65] using a core length of 60 amino acids and relative weighting of background probabilities of 100. All sequences from disordered elements are collected in a .json file for each unique protein. Length distributions of IDRs and PrLDs for both positive and negative datasets can be checked in Supplementary S4.
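The IDR length filter can be illustrated with a short helper. This sketch assumes the disorder consensus is available as a list of 1-based, inclusive (start, end) region pairs, which is our assumption about the input layout rather than a documented format; the output dictionary keys are likewise hypothetical.

```python
import json


def extract_idrs(sequence, regions, min_length=10):
    """Return disordered segments of at least min_length residues.

    regions: iterable of (start, end) pairs, 1-based and inclusive (assumed).
    """
    idrs = []
    for start, end in regions:
        segment = sequence[start - 1:end]  # convert to 0-based slicing
        if len(segment) >= min_length:
            idrs.append({"start": start, "end": end, "sequence": segment})
    return idrs


def idrs_to_json(accession, sequence, regions):
    """Serialize the filtered IDRs of one protein as a JSON string."""
    return json.dumps({"accession": accession, "idrs": extract_idrs(sequence, regions)})
```

Regions shorter than the threshold are simply dropped; adjacent short regions are not merged in this sketch.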

Physicochemical property analysis

Each feature was calculated for all independent sequences and disorder-related sequential elements (IDRs and PrLDs) (Supplementary S5). κ and κ_s|s were calculated with localCIDER [66] and an adapted version for stickers and spacers. Briefly, the positively and negatively charged residues used to calculate κ were replaced by sticker (YRF) and spacer (GSQN) residues. NCPR was calculated with the Henderson-Hasselbalch equation at pH 7.0. The %Y+R was calculated as the percentage of tyrosines and arginines. Aggregation propensity was calculated with AGGRESCAN [67], using the Na4vSS derived score. Cryptic amyloidogenicity was calculated using the Waltz algorithm at threshold 85 [50, 68], averaging the score obtained for each region with at least 7 residues. Disorder binding propensity was calculated with ANCHOR2 [69, 70], averaging the per-residue score obtained for each sequence. The statistical significance shown in the heatmaps was assessed by the Mann-Whitney-Wilcoxon two-sided test with Benjamini correction.
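As an illustration of the multiple-testing step, the sketch below adjusts a list of p-values with the Benjamini-Hochberg step-up procedure. Treating the "Benjamini correction" as Benjamini-Hochberg is an assumption on our part, since the text does not name the variant; the raw p-values would come from the two-sided Mann-Whitney-Wilcoxon tests, which are not reimplemented here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

A comparison would then be marked as significant when its adjusted p-value is at or below the chosen threshold, for example 0.01 as in Table 1.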

Author Contribution

C.P-G and O.B generated the datasets and curated the data. C.P-G, O.B, V.I and M.B performed the formal analysis, investigation and designed the figures. E.A-R and M.B designed and developed the website. C.P-G drafted the manuscript. M.B and S.V supervised the project. S.V acquired the funding. All authors contributed to the study's conceptualization, reviewing and editing of the manuscript.

Code availability

Datasets and scripts can be found at https://github.com/PPMC-lab/llps-datasets. The dataset website can also be accessed at https://llpsdatasets.ppmclab.com.

Funding

CP-G was supported by the Secretariat of Universities and Research of the Catalan Government and the European Social Fund (2023 FI_3 00018). OB was supported by the Spanish Ministry of Science and Innovation via a doctoral grant (FPU22/03656). VI was supported by the Polish National Agency for Academic Exchange under the ULAM NAWA Programme (Grant agreement no. BPN/ULM/2023/1/00189/U/00001). MB was supported by the Maria Zambrano grant funded by the European Union-NextGenerationEU. SV was supported by Spanish Ministry of Science and Innovation (PID2022-137963OB-I00), ICREA, ICREA-Academia 2020 and 2021-SGR-00635 AGAUR (Generalitat de Catalunya).

Acknowledgements

We thank Jakub Kołodziejczyk for his valuable help in data curation.

References

Banani SF, Lee HO, Hyman AA, Rosen MK: Biomolecular condensates: organizers of cellular biochemistry. Nat Rev Mol Cell Biol 2017, 18:285-298.
Hutin S, Kumita JR, Strotmann VI, Dolata A, Ling WL, Louafi N, Popov A, Milhiet PE, Blackledge M, Nanao MH, et al: Phase separation and molecular ordering of the prion-like domain of the Arabidopsis thermosensory protein EARLY FLOWERING 3. Proc Natl Acad Sci U S A 2023, 120:e2304714120.
Decker CJ, Parker R: P-bodies and stress granules: possible roles in the control of translation and mRNA degradation. Cold Spring Harb Perspect Biol 2012, 4:a012286.
Brocca S, Grandori R, Longhi S, Uversky V: Liquid-Liquid Phase Separation by Intrinsically Disordered Protein Regions of Viruses: Roles in Viral Life Cycle and Control of Virus-Host Interactions. Int J Mol Sci 2020, 21.
Alberti S, Gladfelter A, Mittag T: Considerations and Challenges in Studying Liquid-Liquid Phase Separation and Biomolecular Condensates. Cell 2019, 176:419-434.
Mészáros B, Erdős G, Szabó B, Schád É, Tantos Á, Abukhairan R, Horváth T, Murvai N, Kovács OP, Kovács M, et al: PhaSePro: the database of proteins driving liquid-liquid phase separation. Nucleic Acids Res 2020, 48:D360-D367.
Farahi N, Lazar T, Wodak SJ, Tompa P, Pancsa R: Integration of Data from Liquid-Liquid Phase Separation Databases Highlights Concentration and Dosage Sensitivity of LLPS Drivers. Int J Mol Sci 2021, 22.
Pintado-Grima C, Bárcenas O, Ventura S: In-Silico Analysis of pH-Dependent Liquid-Liquid Phase Separation in Intrinsically Disordered Proteins. Biomolecules 2022, 12.
André AAM, Yewdall NA, Spruijt E: Crowding-induced phase separation and gelling by co-condensation of PEG in NPM1-rRNA condensates. Biophys J 2023, 122:397-407.
Zhou H, Song Z, Zhong S, Zuo L, Qi Z, Qu LJ, Lai L: Mechanism of DNA-Induced Phase Separation for Transcriptional Repressor VRN1. Angew Chem Int Ed Engl 2019, 58:4858-4862.
Poudyal M, Patel K, Gadhe L, Sawner AS, Kadu P, Datta D, Mukherjee S, Ray S, Navalkar A, Maiti S, et al: Intermolecular interactions underlie protein/peptide phase separation irrespective of sequence and structure at crowded milieu. Nat Commun 2023, 14:6199.
Alberti S, Hyman AA: Biomolecular condensates at the nexus of cellular stress, protein aggregation disease and ageing. Nat Rev Mol Cell Biol 2021, 22:196-213.
Emmanouilidis L, Bartalucci E, Kan Y, Ijavi M, Pérez ME, Afanasyev P, Boehringer D, Zehnder J, Parekh SH, Bonn M, et al: A solid beta-sheet structure is formed at the surface of FUS droplets during aging. Nat Chem Biol 2024.
Batlle C, Yang P, Coughlin M, Messing J, Pesarrodona M, Szulc E, Salvatella X, Kim HJ, Taylor JP, Ventura S: hnRNPDL Phase Separation Is Regulated by Alternative Splicing and Disease-Causing Mutations Accelerate Its Aggregation. Cell Rep 2020, 30:1117-1128.e1115.
Hou C, Wang X, Xie H, Chen T, Zhu P, Xu X, You K, Li T: PhaSepDB in 2022: annotating phase separation-related proteins with droplet states, co-phase separation partners and other experimental information. Nucleic Acids Res 2023, 51:D460-D465.
Li Q, Peng X, Li Y, Tang W, Zhu J, Huang J, Qi Y, Zhang Z: LLPSDB: a database of proteins undergoing liquid-liquid phase separation in vitro. Nucleic Acids Res 2020, 48:D320-D327.
Rostam N, Ghosh S, Chow CFW, Hadarovich A, Landerer C, Ghosh R, Moon H, Hersemann L, Mitrea DM, Klein IA, et al: CD-CODE: crowdsourcing condensate database and encyclopedia. Nat Methods 2023, 20:673-676.
Ning W, Guo Y, Lin S, Mei B, Wu Y, Jiang P, Tan X, Zhang W, Chen G, Peng D, et al: DrLLPS: a data resource of liquid-liquid phase separation in eukaryotes. Nucleic Acids Res 2020, 48:D288-D295.
Orti F, Fernández ML, Marino-Buslje C: MLOsMetaDB, a meta-database to centralize the information on liquid-liquid phase separation proteins and membraneless organelles. Protein Sci 2024, 33:e4858.
Orti F, Navarro AM, Rabinovich A, Wodak SJ, Marino-Buslje C: Insight into membraneless organelles and their associated proteins: Drivers, Clients and Regulators. Comput Struct Biotechnol J 2021, 19:3964-3977.
Hatos A, Tosatto SCE, Vendruscolo M, Fuxreiter M: FuzDrop on AlphaFold: visualizing the sequence-dependent propensity of liquid-liquid phase separation and aggregation of proteins. Nucleic Acids Res 2022, 50:W337-W344.
Bolognesi B, Lorenzo Gotor N, Dhar R, Cirillo D, Baldrighi M, Tartaglia GG, Lehner B: A Concentration-Dependent Liquid Phase Separation Can Cause Toxicity upon Increased Protein Expression. Cell Rep 2016, 16:222-231.
Kato M, Han TW, Xie S, Shi K, Du X, Wu LC, Mirzaei H, Goldsmith EJ, Longgood J, Pei J, et al: Cell-free formation of RNA granules: low complexity sequence domains form dynamic fibers within hydrogels. Cell 2012, 149:753-767.
Nott TJ, Petsalaki E, Farber P, Jervis D, Fussner E, Plochowietz A, Craggs TD, Bazett-Jones DP, Pawson T, Forman-Kay JD, Baldwin AJ: Phase transition of a disordered nuage protein generates environmentally responsive membraneless organelles. Mol Cell 2015, 57:936-947.
Dorone Y, Boeynaems S, Flores E, Jin B, Hateley S, Bossi F, Lazarus E, Pennington JG, Michiels E, De Decker M, et al: A prion-like protein regulator of seed germination undergoes hydration-dependent phase separation. Cell 2021, 184:4284-4298.e4227.
Martin EW, Holehouse AS: Intrinsically disordered protein regions and phase separation: sequence determinants of assembly or lack thereof. Emerg Top Life Sci 2020, 4:307-329.
Ibrahim AY, Khaodeuanepheng NP, Amarasekara DL, Correia JJ, Lewis KA, Fitzkee NC, Hough LE, Whitten ST: Intrinsically disordered regions that drive phase separation form a robustly distinct protein class. J Biol Chem 2023, 299:102801.
Lin Y, Currie SL, Rosen MK: Intrinsically disordered sequences enable modulation of protein phase separation through distributed tyrosine motifs. J Biol Chem 2017, 292:19110-19120.
Sidorczuk K, Gagat P, Pietluch F, Kała J, Rafacz D, Bąkała L, Słowik J, Kolenda R, Rödiger S, Fingerhut LCHW, et al: Benchmarks in antimicrobial peptide prediction are biased due to the selection of negative data. Brief Bioinform 2022, 23.
Vernon RM, Forman-Kay JD: First-generation predictors of biological protein phase separation. Curr Opin Struct Biol 2019, 58:88-96.
Hou S, Hu J, Yu Z, Li D, Liu C, Zhang Y: Machine learning predictor PSPire screens for phase-separating proteins lacking intrinsically disordered regions. Nat Commun 2024, 15:2147.
Shen B, Chen Z, Yu C, Chen T, Shi M, Li T: Computational Screening of Phase-separating Proteins. Genomics Proteomics Bioinformatics 2021, 19:13-24.
Iglesias V, Santos J, Santos-Suárez J, Pintado-Grima C, Ventura S: SGnn: A Web Server for the Prediction of Prion-Like Domains Recruitment to Stress Granules Upon Heat Stress. Front Mol Biosci 2021, 8:718301.
Wallace EW, Kear-Scott JL, Pilipenko EV, Schwartz MH, Laskowski PR, Rojek AE, Katanski CD, Riback JA, Dion MF, Franks AM, et al: Reversible, Specific, Active Aggregates of Endogenous Proteins Assemble upon Heat Stress. Cell 2015, 162:1286-1298.
Santos J, Pallarès I, Iglesias V, Ventura S: Cryptic amyloidogenic regions in intrinsically disordered proteins: Function and disease association. Comput Struct Biotechnol J 2021, 19:4192-4206.
Pintado-Grima C, Santos J, Iglesias V, Manglano-Artuñedo Z, Pallarès I, Ventura S: Exploring cryptic amyloidogenic regions in prion-like proteins from plants. Front Plant Sci 2022, 13:1060410.
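Das S, Lin YH, Vernon RM, Forman-Kay JD, Chan HS: Comparative roles of charge, π, and hydrophobic interactions in sequence-dependent phase separation of intrinsically disordered proteins. Proc Natl Acad Sci U S A 2020, 117:28795-28805.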
Hazra MK, Levy Y: Charge pattern affects the structure and dynamics of polyampholyte condensates. Phys Chem Chem Phys 2020, 22:19368-19375.
Bianchi G, Mangiagalli M, Ami D, Ahmed J, Lombardi S, Longhi S, Natalello A, Tompa P, Brocca S: Condensation of the N-terminal domain of human topoisomerase 1 is driven by electrostatic interactions and tuned by its charge distribution. Int J Biol Macromol 2024, 254:127754.
Das RK, Pappu RV: Conformations of intrinsically disordered proteins are influenced by linear sequence distributions of oppositely charged residues. Proc Natl Acad Sci U S A 2013, 110:13392-13397.
Martin EW, Holehouse AS, Peran I, Farag M, Incicco JJ, Bremer A, Grace CR, Soranno A, Pappu RV, Mittag T: Valence and patterning of aromatic residues determine the phase behavior of prion-like domains. Science 2020, 367:694-699.
Wang J, Choi JM, Holehouse AS, Lee HO, Zhang X, Jahnel M, Maharana S, Lemaitre R, Pozniakovsky A, Drechsel D, et al: A Molecular Grammar Governing the Driving Forces for Phase Separation of Prion-like RNA Binding Proteins. Cell 2018, 174:688-699.e616.
Choi JM, Holehouse AS, Pappu RV: Physical Principles Underlying the Complex Biology of Intracellular Phase Transitions. Annu Rev Biophys 2020, 49:107-133.
Villegas JA, Levy ED: A unified statistical potential reveals that amino acid stickiness governs nonspecific recruitment of client proteins into condensates. Protein Sci 2022, 31:e4361.
Vendruscolo M, Fuxreiter M: Protein condensation diseases: therapeutic opportunities. Nat Commun 2022, 13:5550.
Garcia-Pardo J, Ventura S: Cryo-EM structures of functional and pathological amyloid ribonucleoprotein assemblies. Trends Biochem Sci 2024, 49:119-133.
Langenberg T, Gallardo R, van der Kant R, Louros N, Michiels E, Duran-Romaña R, Houben B, Cassio R, Wilkinson H, Garcia T, et al: Thermodynamic and Evolutionary Coupling between the Native and Amyloid State of Globular Proteins. Cell Rep 2020, 31:107512.
Linding R, Schymkowitz J, Rousseau F, Diella F, Serrano L: A comparative study of the relationship between protein structure and beta-aggregation in globular and intrinsically disordered proteins. J Mol Biol 2004, 342:345-353.
Pintado-Grima C, Bárcenas O, Manglano-Artuñedo Z, Vilaça R, Macedo-Ribeiro S, Pallarès I, Santos J, Ventura S: CARs-DB: A Database of Cryptic Amyloidogenic Regions in Intrinsically Disordered Proteins. Front Mol Biosci 2022, 9:882160.
Pintado-Grima C, Bárcenas O, Ventura S: Expanding the Landscape of Amyloid Sequences with CARs-DB: A Database of Polar Amyloidogenic Peptides from Disordered Proteins. Methods Mol Biol 2024, 2714:171-185.
Gotor NL, Armaos A, Calloni G, Torrent Burgas M, Vabulas RM, De Groot NS, Tartaglia GG: RNA-binding and prion domains: the Yin and Yang of phase separation. Nucleic Acids Res 2020, 48:9491-9504.
Han TW, Kato M, Xie S, Wu LC, Mirzaei H, Pei J, Chen M, Xie Y, Allen J, Xiao G, McKnight SL: Cell-free formation of RNA granules: bound RNAs identify features and components of cellular assemblies. Cell 2012, 149:768-779.
Miyata K, Iwasaki W: Seq2Phase: language model-based accurate prediction of client proteins in liquid-liquid phase separation. Bioinform Adv 2024, 4:vbad189.
Pintado C, Santos J, Iglesias V, Ventura S: SolupHred: a server to predict the pH-dependent aggregation of intrinsically disordered proteins. Bioinformatics 2021, 37:1602-1603.
Chen Z, Hou C, Wang L, Yu C, Chen T, Shen B, Hou Y, Li P, Li T: Screening membraneless organelle participants with machine-learning models that integrate multimodal features. Proc Natl Acad Sci U S A 2022, 119:e2115369119.
Burdukiewicz M, Sobczyk P, Rödiger S, Duda-Madej A, Mackiewicz P, Kotulska M: Amyloidogenic motifs revealed by n-gram analysis. Sci Rep 2017, 7:12961.
García-Jacas CR, Pinacho-Castellanos SA, García-González LA, Brizuela CA: Do deep learning models make a difference in the identification of antimicrobial peptides? Brief Bioinform 2022, 23.
Chandra A, Tünnermann L, Löfstedt T, Gratz R: Transformer-based deep learning for predicting protein properties in the life sciences. Elife 2023, 12.
Saar KL, Morgunov AS, Qi R, Arter WE, Krainer G, Lee AA, Knowles TPJ: Learning the molecular grammar of protein condensates from sequence determinants and embeddings. Proc Natl Acad Sci U S A 2021, 118.
Zhou S, Zhou Y, Liu T, Zheng J, Jia C: PredLLPS_PSSM: a novel predictor for liquid-liquid protein separation identification based on evolutionary information and a deep neural network. Brief Bioinform 2023, 24.
Aspromonte MC, Nugnes MV, Quaglia F, Bouharoua A, Tosatto SCE, Piovesan D, DisProt Consortium: DisProt in 2024: improving function annotation of intrinsically disordered proteins. Nucleic Acids Res 2024, 52:D434-D441.
Piovesan D, Del Conte A, Clementel D, Monzon AM, Bevilacqua M, Aspromonte MC, Iserte JA, Orti FE, Marino-Buslje C, Tosatto SCE: MobiDB: 10 years of intrinsically disordered proteins. Nucleic Acids Res 2023, 51:D438-D444.
Chou CC, Zhang Y, Umoh ME, Vaughan SW, Lorenzini I, Liu F, Sayegh M, Donlin-Asp PG, Chen YH, Duong DM, et al: TDP-43 pathology disrupts nuclear pore complexes and nucleocytoplasmic transport in ALS/FTD. Nat Neurosci 2018, 21:228-239.
Tyzack GE, Luisier R, Taha DM, Neeves J, Modic M, Mitchell JS, Meyer I, Greensmith L, Newcombe J, Ule J, et al: Widespread FUS mislocalization is a molecular hallmark of amyotrophic lateral sclerosis. Brain 2019, 142:2572-2580.
Lancaster AK, Nutter-Upham A, Lindquist S, King OD: PLAAC: a web and command-line application to identify proteins with prion-like amino acid composition. Bioinformatics 2014, 30:2501-2502.
Holehouse AS, Das RK, Ahad JN, Richardson MO, Pappu RV: CIDER: Resources to Analyze Sequence-Ensemble Relationships of Intrinsically Disordered Proteins. Biophys J 2017, 112:16-21.
Conchillo-Solé O, de Groot NS, Avilés FX, Vendrell J, Daura X, Ventura S: AGGRESCAN: a server for the prediction and evaluation of "hot spots" of aggregation in polypeptides. BMC Bioinformatics 2007, 8:65.
Maurer-Stroh S, Debulpaep M, Kuemmerer N, Lopez de la Paz M, Martins IC, Reumers J, Morris KL, Copland A, Serpell L, Serrano L, et al: Exploring the sequence determinants of amyloid structure using position-specific scoring matrices. Nat Methods 2010, 7:237-242.
Dosztányi Z, Mészáros B, Simon I: ANCHOR: web server for predicting protein binding regions in disordered proteins. Bioinformatics 2009, 25:2745-2746.
Erdős G, Pajkos M, Dosztányi Z: IUPred3: prediction of protein disorder enhanced with unambiguous experimental annotation and visualization of evolutionary conservation. Nucleic Acids Res 2021, 49:W297-W303.

No competing interests reported.

supplementaryGenomeBiology.docx

Editorial decision: Revision requested (02 Sep, 2024)
Reviews received at journal (01 Aug, 2024)
Reviewers agreed at journal (23 Jul, 2024)
Reviews received at journal (22 Jul, 2024)
Reviewers agreed at journal (04 Jul, 2024)
Reviewers invited by journal (02 Jul, 2024)
Editor assigned by journal (20 Jun, 2024)
Submission checks completed at journal (18 Jun, 2024)
First submitted to journal (17 Jun, 2024)
