Genome-scale CRISPR screen to identify regulators that maintain mESC pluripotency
To establish a function-based PGRN, we first performed a CRISPR-Cas9 mediated genome-wide screen to detect genes essential for self-renewal. mESCs were cultured under Leukaemia inhibitory factors (LIF)/serum condition (L/S), which was commonly used in similar tasks and confer a naïve state to pluripotency18, 19. For a comprehensive screen, the Brie library was chosen, which can target 19,674 genes, with high coverage across the genome20. Cas9-expressing R1 ESCs were infected with lentiviruses containing the library. The cells were propagated in L/S culture and collected on day 0 (P.Sc_0d) and day 14 (P.Sc_14d) post-screen (Fig. 1a). We sequenced the pre-transfected plasmid library and the P.Sc_0d and P.Sc_14d cell samples. The results revealed the presence of 99.79% single-guided RNA (sgRNA) in the plasmid library and a mean of 166 reads per sgRNA (Fig. 1b). In the P.Sc_0d samples, the sgRNA presentations were 99.49% and 99.40% in two biological replicates (Fig. 1b), correlating highly with the plasmid representation (r = 0.73 on average) (Fig. 1c,d). The sgRNA representations of the P.Sc_14d samples also showed high concordance between biological replicates (Fig. 1d). The sgRNAs with significantly increased or decreased abundance were almost exclusively observed for expressed genes (FPKM > 0.5). The abundances of the sgRNAs targeting non-/low- expressed genes (FPKM ≤ 0.5) remained the same as the initial pool (P.Sc_0d) (Fig. 1e).
Using MAGeCK21, we detected 2930 genes whose sgRNAs were depleted, suggesting those as genes required for mESC fitness, as well as 1384 genes whose sgRNAs were enriched, indicating genes harmful to the self-renewal of mESCs (Supplementary Data 1). Despite statistical differences, the positive selection genes showed low fold-change values (only 14 genes with log2-Fold Change ≥ 1) and were mostly related to growth restriction and lineage development. Thus, we focused on the negative selection genes that were essential for self-renewal maintenance of mESCs. These genes were distributed across all chromosomes without enrichment in specific chromosomal regions (Supplementary Fig. 1a). A total of 44.3% of the identified genes encoding proteins localized in the nucleus, 35.2% were located in mitochondria, and the rest were distributed in the cytosol, and among ribosomes and the cytoskeleton (Supplementary Fig. 1b). Twenty-five percent of the genes encoded nucleic acid-binding proteins, TFs and chromosome-associated proteins, while the rest were metabolic enzymes and plasma membrane proteins (Supplementary Fig. 1c). GO analysis of these genes showed that the enrichments were associated with fundamental cellular processes for cell survival (Fig. 1f, Supplementary Fig. 1e, f and Supplementary Data 1). Correspondingly, the majority of the genes involved in RNA transport, ribosomes and DNA replication were identified and included in this group (Fig. 1g).
To examine the correlation between the negative selection genes and pluripotency, we used the low-expression22 and non-essential genes23 as negative controls, the ribosome genes and core TFs in ESCs as the positive controls, and compared the target gene sgRNA abundance in the P.Sc_14d and P.Sc_0d samples. The results showed that the sgRNAs targeting the ribosome and core TFs genes were significantly decreased, whereas little change was observed in those targeting the low-expression and non-essential genes (Fig. 1h). When compared to the non-essential genes, the top100 and top1000 targeted genes ranked by MAGeCK had significantly higher expression levels in mESCs (Fig. 1i). Enrichment analysis also indicated that the negative selection genes were enriched in biological processes and pathways involved in ESC self-renewal (Supplementary Fig. 1d, g). The highest-ranking genes in the screen included the core factors Oct4, Nanog and Sox2, and several genes not previously implicated in ESC self-renewal maintenance, such as Zpr1, Ykt6 and Zbtb17 (Fig. 1j). These genes were all chosen as candidates for constructing the PGRN.
Generation of an extended self-renewal gene set by integrating different screening data
Since results from a single screen might be influenced by the specific CRISPR library used or by other factors, we then compared our list of negative selection genes to those identified in four previous screens performed under the same L/S culture conditions but with different mESC lines and CRISPR libraries13, 24–26. Unexpectedly, there was only 1 common gene between the five screens (Supplementary Fig. 2a). Considering that the different analysis methodologies used may influence the readout, we re-analysed the raw data of these screens with MAGeCK and identified genes that were significantly changed (p < 0.05) under negative selection. Pearson correlation analysis of these normalized data showed high concordance between the Tzelepis and Li screens and our screen. However, the screens by Zhao and Shohat pointed to a unique set of genes (Fig. 2a). In total, 457 (11.26%) genes were identified in all five screens (defined as the “common” gene set), while 3601 (88.74%) genes were identified in four or fewer screens (defined as the “context-specific” gene set) (Fig. 2b, Supplementary Fig. 2b). To test whether the context-specific genes were false positives caused by different screens, we assessed the functional relevance of these genes in self-renewal by examining their expression levels in E14 ESCs using RNA-seq data from GEO22. Comparatively, despite being lower than the core TFs gene set, both the common and context-specific genes showed significantly higher expression levels than the non-essential genes (Fig. 2c). To exclude any bias caused by cell lines and culture conditions, we analysed the single-cell RNA-sequencing (scRNA-seq) data of IB10 ESCs27 and inner cell mass (ICM) cells from E4.5d embryos28. Indeed, we observed high concordance of expression between the common genes, context-specific genes, core TFs and highly expressed genes in mESCs (Supplementary Fig. 2c, d).
To further clarify whether the high expression of the common and context-specific genes was specific to self-renewing mESC, weighed gene co-expression network analysis (WGCNA) was performed to analyse the gene expression profiles during mESC differentiation. As shown in Fig. 2d, six co-expression modules were constructed. Module-trait relationship analysis indicated that the blue module, which contains 86.35% of the common and 72.43% of the context-specific genes, had a high correlation with mESCs (Fig. 2e) and was significantly downregulated during either embryonic body (EB) formation or directional differentiation (Supplementary Fig. 2e), suggesting a strong functional relevance in mESC self-renewal.
Functional Validation Of The Candidate Genes In Maintaining Mesc Self-renewal
To prove the veracity of the screening and integrative analysis, four genes (Ykt6, Polr3a, Adoa, Wdr75) in the common set and six genes (Serbp1, Zbtb17, Zpr1, Usp8, Pi4a and Bap1) in the context-specific set were selected to further validate their functions in mESC self-renewal. For each candidate, two sgRNAs were designed, and two mESC lines, R1 and CCE, were used to assess the self-renewal phenotype. Because both sgRNAs behaved similarly in two cell lines, data is presented for only one. The silencing level of each sgRNA was measured and confirmed by qRT-PCR (Supplementary Fig. 3). Compared to the wild-type (WT) and control (non-targeting sgRNA transduction) cells, all the target gene knockout (KO) cells, except for the Ykt6 KO cells, showed significantly reduced proliferation and colony forming capacity (Fig. 3a, b). Morphological observation and alkaline phosphatase (AP) staining assays showed that all 10 target gene KO cells displayed a differentiation-prone phenotype, i.e., flattened colony morphology, more scattered differentiation-like cells, and fewer AP-positive colonies (Fig. 3c, d). Collectively, these results suggest that both the common and context-specific genes are valid hits and required for the maintenance of self-renewal in mESCs.
Characterization Of The Integrated Self-renewal Related Gene Set
We next combined the core and context-specific genes and defined them as the integrated self-renewal related gene set (iSRGS) (Supplementary Data6). The enrichment analysis showed that the iSRGS genes were mainly associated with fundamental cellular pathways such as DNA replication, proteasome degradation, oxidative phosphorylation, and the cell cycle (Fig. 4a, Supplementary Data6). To ascertain the cellular pathways specific to mESCs, we performed gene set enrichment analysis (GSEA) to analyses the expression profiles of genes in enriched pathways. The results showed that genes involved in the “oxidative phosphorylation”, “ubiquitin proteasome”, “mRNA processing” and “translations” pathways were highly expressed in mESCs and downregulated after differentiation (Fig. 4b). Since these pathways are common across cell types29, 30, these results suggest a possibly specific functional gene set of these cellular pathways in mESCs.
Signal pathways involved in the pluripotent state, such as the Wnt, JAK/Stat, TGFβ, p53 and FGF pathways18, were also significantly enriched. Moreover, our results showed enrichment of pathways that were not reported to be involved in pluripotency regulation, such as the androgen receptor signalling pathway and Epo signalling pathway. In addition, pathways involved in the immune system, including T/B cell receptor signalling and IL-7/IL-6 signalling pathways, were also enriched (Fig. 4a).
Reconstruction Of The Pgrn In Mescs
To re-construct the PGRN, we incorporated the transcriptional regulators in the iSRGS into the known regulatory networks, and clustered the regulatory units according to co-occupancy targets using ChIP-seq datasets that were available in public databases. As a result, we obtained nine sub-classes (Fig. 5a). In contrast to previous reports, the contents of the CORE, PRC and MYC sub-classes were all significantly increased, whereas regulators in the CTCF, REST and P53 classes were almost unchanged31. In addition, three new classes, PCGF, PAF and TBX, were identified and named based on the representative factors in each class. The PCGF class consisted of PcG protein-related transcriptional repressors and methyltransferases. The PAF class included the PAF1 complex, H3K9me3 binding protein and mRNA methyltransferases. The TBX class included TFs of the POU and T-box family, chromatin looping proteins, nucleosome re-modelling proteins, histone-related proteins, RNA-related proteins and DNA methyltransferase (Fig. 5a, Supplementary Data7).
The peak annotation analysis showed that the majority of binding peaks generated by factors in the MYC and PRC classes were more centred at the transcriptional start site (TSS), whereas the CORE, CTCF, P53 and REST classes generally localized further away from the TSS (Supplementary Fig. 4a). The peaks of the newly identified PAF class were located on both promoters and gene bodies, while the PCGF and TBX classes were localized mainly on introns and distal intergenic regions, which may suggest different regulatory modes (Supplementary Fig. 4a).
Since the target site of TFs is often related to specific chromatin marks32, we then examined the association between factor co-occupancy and the histone modification signatures in each class. As shown in Fig. 5b, the CORE class binding regions were highly enriched with putative enhancer histone signature H3K4me1 and active histone signature H3K27ac. The factors in the PRC class harbored both repressive (H3K27me3) and active (H3K4me3) histone signatures on their binding sites, which indicated bivalent chromatin33. Targets occupied by the MYC class showed high levels of the active histone signatures H3K4me3 and H3K27ac. The occupancies of the P53 and REST classes were associated with H3K27ac and H3K4me2, respectively. The PAF class targets were enriched with the active histone markers H3K4me3, H3K79me2 and H3K27ac, while PCGF and TBX occupied targets specifically associated with the repression histone markers H3K9me3 and K4K20me3 (Fig. 5b, Supplementary Fig. 4b). These results suggested that the CORE, P53, REST, PAF and MYC classes were mainly involved in the regulation of transcriptional activation. Among them, the genes in the CORE, P53 and REST classes generally bound distal regulatory elements, whereas those in the PAF and MYC classes preferentially bound to proximal regulatory elements. The PCGF and TBX classes typically correlated with transcriptional repression, by targeting distal silencers. The PRC class factors were enriched on poised genes by occupying proximal promoter regions with bivalent modifications.
Establishment Of Individual Functional Modules Based On The Newly Defined Transcriptional Sub-classes
In a sub-class, TFs and their co-occupied target genes compose a regulatory module, which represents a co-operative function of factors in the sub-class. Nearest gene linkage is the commonly used method for calling target genes. For each binding site identified by ChIP-seq, this approach usually assigns the nearest gene as its potential transcriptional target6, 7. Because some transcriptional sub-classes (CORE, CTCF and TBX) were preferentially located in intergenic regions > 10 kb from the TSS of annotated genes (Supplementary Fig. 4a), we used adaptive sampling and an ensemble model (AdaEnsemble) to assign target genes. This approach integrated gene expression profiles with TF binding profiles and chromatin conformation data to predict high-confidence target genes regulated by both proximal and distal sites34, 35. Accordingly, putative proximal and distal target genes were identified for six major classes (CORE, PRC, MYC, PAF, PCGF and TBX). The other classes were not investigated further, as they consisted of non-specific chromatin looping regulators (CTCF) or just a single factor (REST and P53). As shown in Fig. 6a, while the PCGF module contained a similar ratio of proximal and distal targets, the MYC and PRC modules had more putative proximal target genes (59.5% and 67.6% respectively), and the CORE, PAF and TBX modules had more putative distal target genes (69.4%, 60.1% and 58.2% respectively). These results were in line with the distribution characteristics of binding sites of the individual sub-classes (Supplementary Fig. 4a).
Lists of the module gene sets were summarized in Supplementary Data9. The PRC, PCGF and TBX modules showed clear separation, whereas the PAF, MYC and CORE modules shared many targets with each other (Fig. 6b, c). It was shown that there were 143 intersecting genes between the PAF and MYC modules and 106 intersecting genes between the PAF and CORE modules (Fig. 6c). As the CORE, PAF and MYC classes were mainly involved in the regulation of transcriptional activation, these results indicate that the promoters of active genes may always be bound by multiple factors, whereas repressed genes were regulated by fewer factors. To test whether the modules were functionally separable, we performed GO analysis and found that each module was primarily involved in different biological processes (Fig. 6d), suggesting that the modules are functionally independent.
Since the same CORE, MYC and PRC modules were defined in a previous report6, we examined the similarity of two gene sets in each module. Comparatively, 454 CORE, 341 MYC and 516 PRC genes were identified in this study, whereas 111, 503 and 560 corresponding genes were identified by Kim et al. (hereafter referred to as CORE-Kim, MYC-Kim and PRC-Kim). Nonetheless, only 40, 62 and 233 genes overlapped between the respective modules (Supplementary Fig. 5a). The functional enrichment analysis also showed that different GO terms and pathways were identified in each respective module (Supplementary Fig. 5b-d). Some known pluripotency-associated pathways, such as the Wnt, PI3K/Akt and Hippo/Yap pathways, were specifically enriched in the current CORE module. In contrast, in the CORE-Kim module, more development-related pathways were enriched (Supplementary Fig. 5b). These results indicate remarkably different gene compositions and module functions between the previous and current studies.
For validating the newly defined PAF, PCGF and TBX3 modules, the ChIP assay have been performed using anti-Flag antibody in Flag-tagged Ctr9 (for PAF module), Tbx3 (for TBX module), Dpf2 (for TBX module) and Pcgf6 (for PCGF module) transfected mESCs. Then 11 target genes in different modules were chosen for qPCR analysis. The results were consistent with previous reports36–38 and showed that all the four transcriptional factors preferentially occupied the targets of their own modules (Supplementary Fig. 6), indicating the specific and reliable assignment of target genes for each module. Furthermore, functional analyses were performed to assess the impact of the newly constructed modules on mESCs self-renewal. The major transcriptional factor of each module (Ctr9, Pcgf6 and Tbx3) was knocked down individually by RNAi in a Nanog-GFP reporter ESC line39 (Supplementary Fig. 7a). Compared to the WT ESCs, Ctr9, Pcgf6 and Tbx3 KD cells displayed an impaired self-renewal phenotype characterized by slower proliferation rate, decreased proportion of AP-positive colonies and fewer Nanog-GFP positive cells (Supplementary Fig. 7b-d). These results indicated that PAF, PCGF and TBX3 modules were essential for mESC self-renewal. Moreover, we compared the RNA-seq data from mESCs in which the major component of CORE, MYC, PAF, PRC, PCGF and TBX3 module were silenced respectively. The GO analysis for DEGs in each gene KD cells showed that distinct prominent terms were enriched, despite all of them were involved in development regulation (Supplementary Fig. 7e), which further indicated that different module may have separate functions in mESC self-renewal maintenance.
Mapping The Module Activity Patterns During Mesc Differentiation And Embryo Development
Next, we tested the module activities in mESCs and differentiated cells. GSEA revealed that the genes in the CORE, MYC, and PAF modules were highly expressed in mESCs and downregulated after differentiation. In contrast, the genes in the PRC, TBX and PCGF modules were repressed in mESCs (Fig. 7a). We additionally tested the activity of each module during a time course of directed ectoderm, endoderm and mesoderm differentiation of mESCs. As expected, the CORE, MYC and PAF modules were highly active in mESCs and became repressed gradually with time during differentiation, whereas the PRC, TBX and PCGF modules showed the opposite activity pattern (Fig. 7b).
We then mapped the spatiotemporal module activity patterns during mouse embryo development from E2.5 to E7.5 based on published scRNA-seq data40, 41. As shown in Fig. 7c, the CORE, MYC and PAF modules were highly active in pluripotent cells (MOR in E2.5, ICM in E3.5, epiblast in E4.5-E7.5) and repressed in differentiated cells (E and M in E5.5-E7.5), whereas the activities of the PCGF and TBX modules displayed almost opposite patterns. In comparison, the PRC module genes showed a similar expression pattern to those of the PCGF and TBX module genes except for the expression in the early stages of the pre-implantation embryo (MOR in E2.5 and ICM in E3.5) (Fig. 7c), suggesting a possible stage-specific function of this module in early and late embryo development. Together, these data reveal consistent correlations between the module activities and the pluripotent states of cells both in vitro and in vivo.
Module Activity In Hescs
hESCs have a multilineage differentiation potential similar to that of mESCs42. In addition, core mESC TFs, such as Oct4 and Tbx3, are active in hESCs and directly participate in pluripotency maintenance43, 44. Using the gene expression profiles of the hESCs and EB samples, we tested whether the module activity pattern was similar between mESCs and hESCs. The results showed that the activities of the six modules in hESCs were comparable to those in mESCs (Fig. 7d). To exclude cell line-specific effects, we performed analyses in both H1 and H9, and consistent results were obtained. These observations suggest conserved roles for these modules in human and mouse ESCs. Since hESCs are usually believed to be at the primed pluripotent state45, these results are in accordance with the data obtained from in vivo development and indicate limited variation in the module activity pattern in different pluripotent states.
Module Activity In Human Cancers
According to previous reports, human tumours, especially poorly differentiated tumours display an ESC-like expression signature that may result from re-wiring of stem cell regulatory circuits46, 47. Therefore, ESC-like gene modules have been widely used in the assessment of cancer gene signatures. To test the activity of the re-defined ESC modules and establish relevance with human cancers, we first analysed the expression profile of 750 gliomas, which included 200 low-grade gliomas (LGG; astrocytomas and oligodendrogliomas of grades 2 and 3) and 550 glioblastoma multiforme of grade 4 (GBM). As shown in Fig. 8a, we observed that the average activation levels of the CORE, MYC and PAF modules were higher in gliomas than in normal brain samples and showed a positive correlation with tumour grade. Conversely, the PRC, TBX and PCGF modules showed low activation in gliomas (Fig. 8a, Supplementary Fig. 8a). In GBMs, the same module activities as in ESCs were observed, in which the CORE, MYC and PAF modules were highly active, whereas the other modules were repressed. In LGG and normal tissues, however, these modules displayed an opposite activity pattern (Fig. 8b). These results indicate that the CORE, MYC and PAF modules may also be involved in the maintenance of the malignant phenotype of cancers.
To assess whether the activity of the modules is associated with tumour prognosis, we performed Kaplan-Meier analyses of the progression-free interval (PFI) for patients. As expected, tumours that displayed the strongest CORE, MYC or PAF module activity (top 50% of the samples) were associated with significantly worse survival outcomes than tumours with the weakest module activity. In contrast, tumours with increased activities of the TBX and PCGF modules tended to be associated with better survival outcomes, whereas the PRC module did not show significant correlations (Fig. 8c).
Next, we examined whether the module activity patterns of mESCs were present in other cancers. Analyses of the public gene expression profiling datasets of bladder, breast and non-small-cell lung cancer revealed that the high-grade tumours displayed high activities of the CORE, MYC and PAF modules with repressed expression of the the PRC, TBX and PCGF modules (Supplementary Fig. 8b-f). Together, these results suggest that ESC-specific signatures are shared by various human cancers. Nonetheless, we also observed inconsistent activity of the PRC module in lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD) (Supplementary Fig. 8b), indicating underlying impacts from tumour origin or cell heterogeneity.
Furthermore, for assessing the function of CORE, MYC and PAF modules in cancer cells, Nanog (for CORE module), c-Myc (for MYC module) and Ctr9 (for PAF module) were knocked down in glioma cell line U87 respectively (Fig. 9a). As expected, all of the three gene KD cells displayed a decreased proliferation rate alongside limited colony formation capacity when compared to the control cells (Fig. 9b-d). Consistently, the same phenotypes were observed in the respective gene KD lung adenocarcinoma A549 cells (Fig. 9). These findings supported previous analysis results on clinical tumor samples (Fig. 8), demonstrating the importance of the CORE, MYC, and PAF modules in cancer progression.