The primary achievement of this study is the development of SFpointer, a novel algorithm designed to identify RNA-Binding Proteins (RBPs) that are significantly enriched in regions associated with differential splicing events. This is accomplished by integrating multiple CLIP-seq databases into a unified resource, enabling comprehensive enrichment analysis of RBPs. The input is the RNA-seq analysis of several experiments. The output is a prioritized list of RBPs that are more likely drivers of alterations in splicing patterns across the experiments, providing valuable insights for further biological validation.
The development of SFpointer involved several key challenges: 1) Limited CLIP-seq Usage: Given that CLIP-seq is less commonly utilized than RNA-seq, the accuracy of predicting splicing factor binding motifs is heavily reliant on the availability of CLIP experiments, necessitating extensive database integration, 2) Statistical Significance: The reliability of our findings is contingent upon both the quality of splicing event calls and the statistical methods employed in the enrichment analysis, 3) Validation Requirements: The results from our statistical pipeline must be validated against experiments with known ground truth.
We prioritized user-friendliness in the algorithm's design to ensure accessibility for the scientific community. By successfully addressing these challenges, we have created a robust tool that enhances researchers' ability to explore the role of RBPs in splicing regulation.
2.1 Included RBPs are increased by 30% with the updated databases
In our updated version, we have integrated POSTAR3, a comprehensive CLIP-seq database containing 1,445 experiments across seven species, covering 348 RNA-binding proteins (RBPs). We included experiments mapped to the human and mouse genomes due to their genetic similarities, with mouse data converted to human genome coordinates using liftover (10). This increased the total number of RBPs included in our analysis to 244, a 25% increase over the previous version (6).
This integrated database provides genomic loci mapped to the human genome (hg38), facilitating the identification of potential splicing events. To explore the relationship between RBP binding sites and splicing events, we constructed an indicator sparse matrix, termed ExS (Events x Splicing Factors), which indicates whether an RBP binding locus is within a 400 nt window of a splicing event. A detailed methodology for the construction of the ExS matrix can be found in our previous publication (6).
2.2 Boosting sensitivity and specificity through a new statistical modeling
In our analysis, we observed that accurate identification of altered splicing events significantly affects the performance of splicing factor (SF) calculations. Careful selection of differential splicing events improves the accuracy of RBP enrichment. To improve this aspect, we have adopted a bootstrap-based statistical approach implemented in the EventPointer package (Figure 1.2), which increases the sensitivity compared to the previous version (11).
In addition, we have implemented four different statistical enrichment analyses on the alternative splicing (AS) events to predict the differential activity of the RBPs (Figure 1.3). The first method uses the Fisher's exact test. This test estimates the enrichment of splicing factors (SFs) using the ExS matrix by setting a threshold on p-values to identify differentially spliced events. This method is commonly used for Gene Ontology (GO) enrichment analysis and was included in the previous version of our algorithm (6, 12, 13).
The data used for enrichment analysis (the ExS matrix) is a potential source of bias because some RBPs bind to a large proportion of splicing events while others bind to very few. Furthermore, certain splicing events may have numerous RBP hits while others have minimal hits. A similar statistical analysis was performed in Discover (14) for the detection of mutually exclusive mutations, and showed that variations in the density of rows and columns in the input matrix (in our case ExS) can introduce bias in naïve analyses based on the hypergeometric distribution.
To address this bias, we used Rediscover (15), an R package that implements the Poisson-Binomial distribution instead of the standard hypergeometric distribution. Both the hypergeometric and Poisson-Binomial methods require the user to select a threshold to determine when a p-value is considered significant.
Regarding the computation of the input data, we noticed that the correct identification of the altered splicing events strongly impacts the performance of the SFs calculation. A proper selection of the differential splicing events makes the enrichment of RBPs more precise. We have switched to a bootstrap-based statistic implemented in the EventPointer package (Figure 1.2). This statistical approach increases the sensitivity compared to the previous version (11).
Secondly, we implemented four different statistical enrichment analyses on the AS events to predict the differential activity of the RBPs (Figure 1.3). The first method is based on the hypergeometric distribution using a Fisher's Exact Test. This test estimates, using the ExS matrix, the enrichment of the splicing factors (SFs) by setting a threshold on the p-values to state which are the differentially spliced events. This method is consistently used to perform GO enrichment analysis and was already implemented in the previous version of the algorithm (6,12,13).
The data used to perform the enrichment analysis (the matrix ExS) shows some potential sources of bias: some RBPs bind to a large proportion of the splicing events (and other RBPs scarcely bind to any splicing event). Furthermore, there are some splicing events with many hits from RBPs and other ones with very few. A similar statistical analysis was performed in Discover (14) for the detection of mutually exclusive mutations. In that study, it was shown that the differences in the density of both the rows and the columns of the input matrix induce a bias in a naïve analysis based on the hypergeometric distribution.
In order to correct for this bias, we used Rediscover(15), an R package that implements the Poison-Binomial distribution instead of a plain hypergeometric distribution. Both the hypergeometric and the Poison-Binomial approaches, require the users to select a threshold to state when a p-value is considered significant.
Importantly, neither of these methods fully exploits the ranking of aberrant AS events; for example, a splicing event ranked first is treated the same as one ranked last, provided its p-value is below the threshold. To improve this analysis, we incorporated Gene Set Enrichment Analysis (GSEA) (16), which is based on the Kolmogorov-Smirnov test and effectively exploits ranking information, demonstrating strong statistical power in GO enrichment analysis. Finally, we also included a standard Wilcoxon test, a non-parametric method that similarly exploits the ranking of events.
Figure 1: SFpointer Pipeline. 1) the ExS matrix is built from POSTAR3 CLIP experiments, where each entry i, j is 1 if RBP “j” binds near splicing event “i” annotated in the reference transcriptome and 0 otherwise. 2) the differentially spliced events are detected using a bootstrap version of EventPointer. 3) SFpointer uses these events and the ExS matrix to estimate RBP enrichment by applying one of four methods: Poisson Binomial, Fisher's Exact Test, GSEA, or Wilcoxon Test, resulting in a ranked list of RBPs with enrichment p-values. This method is implemented as a Shiny app and in Bioconductor.
2.3 SFpointer accurately identifies the RBP causing splicing disruption.
We evaluated the proposed pipeline using four RNA-seq experiments with seven different knocked-down RBPs, which allowed us to assess its accuracy (Figure S1). To facilitate a fair comparison, we reran these experiments using the previous version of SFpointer, using the current ExS matrix from our updated algorithm (Table 1). The primary goal of this comparison was to determine whether the improvements in precision and sensitivity were due to changes in the selection of affected splicing events, updates to the enrichment statistics incorporated into the algorithm, and the expanded data set available from POSTAR3. Detailed results of the enrichment analyses are presented in Supplementary Tables 2-7.
Table 1 shows the ranking positions for the seven knockdown RBP (KD-RBP) conditions. This table contrasts the ranking positions obtained using the original version of SFpointer with the current ExS matrix against those generated with the four new enrichment methods using the updated EventPointer pipeline and ExS. It also considers the database used, comparing POSTAR3 and its predecessor POSTAR2. The lowest ranking positions for each condition are highlighted in bold.
RBP
|
POSTAR2
|
POSTAR3
|
SFPointer Original
(Fisher’s Exact Test)
|
SFpointer Original
(Fisher’s Exact Test)
|
SFpointer New (Fisher’s Exact Test)
|
SFpointer New (Poisson Binomial)
|
SFpointer New (GSEA)
|
SFpointer New (Wilcoxon Test)
|
PRJEB39343
|
PTBP1
|
-
|
1/244|0.99
|
1|0.99
|
1|0.99
|
1|0.99
|
2|0.99
|
PRJEB39343
|
MBNL1
|
-
|
81/244|0.67
|
9|0.96
|
4|0.98
|
21|0.91
|
22|0.91
|
GSE77702
|
FUS
|
11/195|0.94
|
1/244|0.99
|
2|0.99
|
4|0.98
|
2|0.99
|
2|0.99
|
GSE77702
|
TAF15
|
NS
|
125/244|0.49
|
117|0.52
|
152|0.37
|
164|0.32
|
177|0.27
|
GSE77702
|
TARDBP
|
20/195|0.90
|
68/244|0.72
|
45|0.81
|
43|0.82
|
10|0.96
|
30|0.88
|
GSE136366
|
TDP43
|
-
|
1/244|0.99
|
1|0.99
|
1|0.99
|
1|0.99
|
1|0.99
|
GSE75491
|
RBM47
|
-
|
7/244|0.97
|
8|0.97
|
3|0.99
|
10|0.96
|
4|0.98
|
We revisited the analysis presented in reference (6), which evaluated the ability of the previous algorithm to identify RBPs from the GSE77702 dataset. In this dataset, we compared three different contrasts: KD-FUS, KD-TARDBP, and KD-TAF15 against scramble transfection (17). For the KD-FUS contrast, we observed a significant improvement, with its ranking advancing from 11th to the top position, highlighting the improvement of POSTAR3 over POSTAR2. In the case of the KD-TARDBP condition, expression analysis indicated that the knockdown of TARDBP was incomplete, resulting in an insufficient reduction of gene expression levels (Figure S2). Conversely, the KD-TAF15 condition was excluded from the original study because of the minimal effect of TAF15 on alternative splicing regulation as previously demonstrated (6, 7). In the original results using the POSTAR2 database, TARDBP ranked 20th, while TAF15 was not identified as a significant RBP. After re-analysis using POSTAR3 as the reference database, TARDBP was ranked 68th and TAF15 was ranked 125th. Although the analysis with POSTAR3 changed their rankings, both remained as non-significant, consistent with the previously mentioned limitations.
We also analyzed three additional datasets: (i) PRJEB39343, in which three RBPs (PTBP1, ESRP2, and MBNL1) were knocked down in gastric cancer cell lines (18), (ii) GSE136366, in which TDP43 was knocked down in HeLA cell lines (19), and (iii) GSE75491, in which RBM47 was knocked down in H358 cell lines (20). For the PRJEB39343 dataset, we excluded the KD-ESRP2 condition because ESRP2 is not included in the current ExS.
We compared the new version of SFPointer using Fisher's exact test (Table 1, third column) with the previous version, also using Fisher's exact test (Table 1, second column). Notably, the ranking of MBNL1 improved from 81st to 9th. For TAF15 and TARDBP, no significant improvement was observed, which is to be expected given the conditions previously discussed. For the other knockdowns, there was little variability and their rankings remained high. These differences can be attributed to the new SFPointer using the latest version of EP, which is more accurate in identifying differentially spliced events between conditions.
In addition to the traditional Fisher method for enrichment analysis, we introduced three new approaches –GSEA, Wilcoxon, and Poisson Binomial– that improve accuracy in all scenarios. In experiments where the targeted RBPs (PTBP1, MBNL1, FUS, TDP43, and RBM47) were effectively knocked down, these RBPs were consistently in the top 10% across all methods, demonstrating the robust detection capability of SFPointer. In particular, the Poisson binomial method yielded excellent results, accurately ranking the knocked-down RBPs within the top 4 out of 244 positions (Table 1). Full results of the enrichment analyses are provided in Supplementary Tables 2-7.
As expected, the results are highly dependent on the quality of the experiments. The alteration in alternative splicing is small, RBP does not achieve a high ranking, as seen with KD-TAF15 and KD-TARDBP. Interestingly, TARDBP and TDP43 refer to the same gene, but their ranking results differ significantly between GSE77702 and GSE136366, highlighting the influence of experimental quality on enrichment results. In GSE77702, TARDBP expression decreases almost twofold, whereas in GSE136366 it decreases tenfold (Figures S3 and S7).
Finally, for each KD experiment, Figure 2 includes an AS analysis result that illustrates the specific AS changes for each condition that are likely related to the splicing regulatory activity of the RBP. For example, BRWD1 and MALAT-1, which show significant decreases in Ψ in the KD-FUS condition (Figure 2A), have previously been implicated in FUS activity (21). In addition, FLNB in KD-MBNL1 (Figure 2C) has been described as part of the MBNL1-mediated apoptosis pathway (22), and MAPK kinase genes in KD-PTBP1 (Figure 2E) have been reported as inhibitors of the MAPK/ERK pathway (23).
Figure 2. RBP ranking and volcano plot all experiment described in table 1 but KD-TAF15 due to its low impact on alternative splicing. For each condition, the top 8 RBPs (top 10 in KD-TARDBP) and their enrichment p-values are reported, using the method that optimizes the RBP's ranking. Each volcano plot displays in red the AS events with an absolute Delta PSI (ΔΨ) greater than 0.1 and a p-value lower than 1e-3. Significant events with smaller ΔΨ changes are shown in blue, and events with large ΔΨ changes but not significant are shown in green.
2.4 Analysis of the ENCODE database
The application of SFPointer to the ENCODE dataset involved the analysis of 212 experiments related to the knockdown of 106 different RBPs in HEGP2 and K562 cell lines. Most of these experiments included only 2 control and 2 knockdown samples, with shared control samples utilized across multiple experiments. Specifically, the HEGP2 experiments employed 21 distinct types of control sample sets, while the K562 experiments used 29 types.
However, the results obtained from SFPointer were suboptimal. This is likely attributed to the presence of other differentially expressed RBPs in addition to the targeted knockdown RBP (Figures S9). Furthermore, when examining the ΔΨ values, it became evident that the experiments clustered more significantly based on the control sample sets rather than the knockdown effects (Figures S10-S13). This suggests that factors such as the choice of control samples may account for the observed differences in splicing, rather than the intended knockdown of the specific RBP.
This analysis highlights the importance of considering experimental design and control sample selection when interpreting results from RBP knockdown studies, as they can significantly influence the outcomes and conclusions drawn from the data
2.5 Pan-cancer analysis of splicing regulators reveals three groups of tumors with similar RBPs profiles
Several studies have demonstrated the significant role of aberrant splicing in cancer development (12,25,26). Using SFPointer, We conducted a comprehensive pan-cancer AS study to investigate the role of RNA-binding proteins (RBPs) in driving aberrant alternative splicing (AS) across 19 different cancer types, utilizing data from 9,514 patients sourced from the TCGA and TARGET databases (Figure 3a). The results presented in Figure 3 were obtained using only the Poisson Binomial approach. Finally, we clustered the most frequently identified RBPs using STRING (27).
Figure 3b shows the top five AS events for each cancer type from the TCGA and TARGET datasets. Notably, several AS events recur across cancer types, while AS events in childhood cancers are highly specific to each type, with no shared AS events between these and adult cancers. In contrast, two genes –AGRN (ENSG00000188157) and RER1 (ENSG00000157916)– are recurrently differentially spliced in adult tumors, consistently appearing in the top five positions.
AS events involving AGRN are present in four out of sixteen adult cancers: ESCA, KICH, READ, and THCA. AGRN is a gene known for its tissue-specific isoform expression and has recently been implicated in the Hippo pathway in the tumor microenvironment in several cancer types (28,29). Its aberrant splicing is associated with impaired neuromuscular junction synaptogenesis (30), although no current studies directly link its splicing to tumorigenesis.
In the case of RER1, AS events in this gene ranked in the top 5 in 9 out of 16 tumor types. Interestingly, the RER1 gene has been associated with colon and pancreatic cancer (31,32). Specifically, one of its AS events has been reported to be associated with disease recurrence in colorectal cancer (32), and its biological function has been reported to induce carcinogenesis in pancreatic cancer (31).
Furthermore, using the results obtained by SFpointer for each cancer site, we selected the RBPs that appeared to be significantly enriched in at least 5 different cancer types. We performed k-means clustering by RBPs and cancer types with 10-fold cross-validation. The results are shown in Figure 3c. We clustered these results by both columns (cancer types) and rows (enriched RBPs). The column clustering shows 3 different clusters of cancer types according to the number of RBPs disregulated in each condition. The top bar graph shows the number of enriched RBPs for each cancer type. The middle cluster shows that HNSC, STAD, BLCA, BRCA and ESCA are the tumor types with the highest number of altered RBPs. They all have in common the enrichment of splicing sites regulated by DKC1, METTL14, PABPC4 and MKRN1. These RBPs have a strong relationship with cancer development, e.g. DKC1 is related to the expression of tumor suppressors (33), METTL14 mediates tumor progression through SOX4 alteration and WTAP (3), PABPC4 is downregulated in metastatic cells (34), and MKRN1 modulates tumor progression through the AKT pathway (35).
The second cluster (rightmost group) includes relevant cancer types such as COAD or READ and shares the enrichment of PABPN1 and NOL12, both of which are related to tumor progression (36,38). Finally, the third group (leftmost group) includes the tumors with the lowest number of dysregulated genes. This group is characterized by the high presence of altered CELF4 and MOV10 among its samples. Both genes have been implicated in carcinogenesis (39,40).
Regarding the clustering by rows (RBPs), there are two main clusters: the first one (mostly related to AS alterations in adult cancer), bottom cluster in the plot, includes relevant cancer genes such as MKRN1, DKC1 or PABPC4. The second cluster seems to modulate AS at more tissue-specific sites (top part of the plot) and includes relevant oncogenes such as CREBBP (8) or MBNL2 (9).
Finally, these 22 RBPs were clustered the RBPs using STRING MCL methodology (26), finding 6 clusters shown in Figure 3d and Supplementary Table 7, e.g., cluster 1 includes LARP4, ATXN2, MKRN1, PABPC1, PABPC4, MOV10, TNRC6C, and MSI2 RBPs; and cluster 2 contains DKC1, NOL12, GRWD1, and RPS3 RBPs. We observed that 13 out of 19 (about 70%) of the RBPs included in the largest STRING clusters were included in the same group by our method, suggesting that our approach can find functional relationships among RBPs.
Figure 3: A) Tumor types from TCGA and TARGET included in the study, focusing on those with sufficient normal samples. B) Heatmap of pan-cancer alternative splicing analysis, showing the top 5 significant splice events per cancer type. The x-axis lists cancer types, the y-axis lists splicing events, with red indicating negative ΔΨ and green positive ΔΨ. C) Heatmap of 22 RBPs enriched in over 5 tumor types, clustered into two groups, with enrichment shown in blue; includes bar charts of RBP abundance and RBP count per tumor type. D) STRING clustering of RBPs, with colors indicating clusters, bubbles representing RBPs, and lines showing STRING relationship evidence; includes a cluster description table.
2.6 Constructing a pan-cancer splicing regulators resource.
To facilitate exploration of our findings, we developed a Shiny application that integrates the results of our pan-cancer RBP enrichment and alternative splicing (AS) analysis. This application is accessible at (https://gitlab.com/Jferrerb/sfpointer_gui) and allows users to select specific cancer sites while providing a comprehensive ranking of 244 RBPs across 19 different tumor types derived from the TCGA and TARGET databases.
The Shiny app enables users to view the results of alternative splicing analysis for each of the 16 conditions shown in Figure 2a, along with the ability to visualize specific splicing events. In addition, it includes enrichment results for each RBP, allowing users to query the data both at the RBP level –to see which tumors exhibit enrichment– and by condition, to see all RBPs enriched in a particular tumor type. Users can also download graphs and tables directly from the app and perform survival analyses based on each RBP in relation to tumor types, providing insight into the relevance of each RBP in contributing to overall survival.
In addition, the underlying code of SFPointer has been integrated into the EventPointer package already available in the Bioconductor repository. For those interested in the technical details, the code vignettes and a model of the pipeline can be found at https://github.com/clobatofern/SFPointer_testPipeline.git. This integration not only improves accessibility, but also supports researchers in further exploring the implications of our findings in the context of alternative splicing and cancer biology.