High stability of REOs in single-cell samples
REOs are the features that describe the relative expression orders of gene pairs within a sample. One attribute of their features is that the REOs of RPKM won’t change after within-sample normalization, such as TPM or log transformation. Indeed, the REOs of RPKM showed 100% overlap with those after TPM or log transformation in 11 public human ESC datasets (Fig. 1A, Table S1). Another attribute is that the REOs stably showing same pattern in the same type of samples, which we called stable REOs (see Methods), could be retained in other independent bulk datasets, even in paraffin-embedded bulk samples with relative low gene expressions [21]. Thus, we inferred that the stable REOs could also be stably retained in the single-cell samples with low gene expressions. Indeed, we found that the stable REOs recognized in each of the 11 single-cell datasets exhibited high consistency among datasets, which was similar for bulk datasets (consistency, 0.99 for single-cell datasets and 0.92–0.96 for bulk datasets Fig. 1B). Besides, there was also a high consistency of stable REOs between single-cell datasets and bulk datasets (consistency, 0.97–0.99, Fig. 1B). In addition, we found that increasing the number of datasets can improve the stability of REOs in independent datasets (Fig. 1C). The larger the number of the merged datasets, the more stable the stable REOs are.
Considering that the previous REO-based stemness index did not approach the set value 1 in single-cell ESC samples (Figure S1), it was necessary to add single-cell data to train a signature more suitable for single-cell data. All above results inspired us to add bulk samples to the single-cell samples to build a stemness index that is more suitable for single-cell samples and more robust to batch effects.
Development and validation of StemSC
The development procedure of StemSC is shown in Fig. 2. We collected both bulk and single-cell datasets as training sets because of the shortage of single-cell data and the high consistency of REOs between bulk and single-cell datasets (Fig. 1B). Firstly, to reduce redundant REOs, we identified 437 stemness-related genes by choosing the genes which were significantly associated with differentiation time in all five datasets with different differentiation directions (Spearman, FDR < 0.05, Table S2). Naturally, these genes showed significant enrichment in the pathways related to cell renewal and differentiation, such as cell cycle, Ribosome biogenesis in eukaryotes (Figure S2). Then, we respectively identified 19,937 and 50,827 gene pairs with stable REOs in 92 single-cell and 47 bulk ESC samples from 13 datasets (Table S3). For the 16,848 shared gene pairs of the above two gene pair lists, 99.9% (16,839/16,848) of them had the same REOs, which showed the high consistency of REOs between single-cell and bulk samples again. Finally, the stemness index values of the given human single cells, StemSC, were calculated by the percentage of gene pairs with the same REOs as ESC samples in 16,839 gene pairs (see Methods).
In all five independent validation datasets (Fig. 3A, Table S4), there were strong negative correlations between StemSC values and differentiation time (Spearman correlation; |r|, 0.43–0.85, Fig. 3B, D-H), which was greatly higher than that between CytoTRACE and differentiation time (Spearman correlation; |r|, 0.14–0.84, Fig. 3B). Further, the robustness of StemSC to batch effect was showed in three following aspects. Firstly, the median StemSC values of ESCs were all centered around the 1 in two independent validation datasets (median StemSC, 0.990 for GSE85066 and 0.984 for GSE109979). Secondly, after combining the two batches of dataset GSE102066, the correlation between StemSC and differentiation time was higher in the merged data than in one of the single batch data (Fig. 3C, I), but not for CytoTRACE. Thirdly, StemSC showed negative correlations with the differentiation time in all validation sets, but CytoTRACE showed a positive correlation in the dataset GSE109979 (Fig. 3B). In addition, to demonstrate that our method can be applied to the cells with different origins, we deleted the ESC samples from the above two datasets with ESCs (GSE85066 and GSE109979). Results showed that there was still a higher negative correlation between the StemSC values and differentiation time than CytoTRACE (r, -0.798 and − 0.583 for StemSC; -0.761 and 0.864 for CytoTRACE). This feature could allow the StemSC to be used in the cancer cells with multiple origins.
The ability of the StemSC to identify stemness- associated genes and construct cellular differentiation trajectories
Given the high correlation between StemSC values and differentiation time, we next explored the ability of the StemSC to identify the stem makers or differentiation factors. Firstly, by ranking all genes according to their correlations with StemSC, we found the enrichment of the stemness-associated genes (the top 100 genes positively correlated with differentiation time) in the positive region and differentiation-associated genes (the top 100 genes negatively correlated with differentiation time) in the negative region in all the validation sets (Fig. 4A). Besides, the majority of the most positive or negative genes showed their role in stemness or differentiation (Fig. 4B, Table S6). For example, L1TD1, the most positively correlated gene with stemness, was reported to be a marker for undifferentiated human ESCs [22]. For another example, CDH2, the gene trigging the endodermal germ-layer formation [23], showed the most negative correlation with stemness in the endodermal differentiation samples. All above results showed the potential of StemSC to recognize the tissue-specific and stemness-associated genes.
Cell lineage trajectory can be determined by using transcriptome-based branch detection tools, such as Monocle 2 [16]. However, users need to enter the starting point of the biological processes. For example, when applied to the dataset GSE109979, Monocle 2 constructed seven possible cell trajectories with different roots (Fig. 4C). On the contrary, our method could identify the correct root by choosing the state with the largest average of StemSC values (Fig. 4C), which was similar to time-based lineage trajectory (Fig. 4D). However, CytoTRACE chose the wrong root in this dataset. Similarly, StemSC recognized the correct roots in the four remaining validation sets (Figure S3). The above results showed a better method to automatically construct cellular differentiation trajectories by combining StemSC and the branch detection tools Monocle 2.
Applicability of StemSC on tumor tissue cells
To validate the applicability of StemSC in tumor cells, we took the single-cell data of colorectal cancer and glioma as examples (Table S4). For three single-cell datasets of colorectal cancer (Table S4), we found that the 30 intestinal stem markers [24] were significantly enriched in the gene list ranked by their correlations with the StemSC values (Fig. 5A-C, Table S7) and the StemSC values were significantly correlated with the sum of expression values of these 30 markers (spearman, p < 0.05, Fig. 5D-F). Besides, many of the stem markers showed significantly positive correlations with the StemSC values (Fig. 5G-I). Especially, histological grade reflects the dedifferentiation of cancer tissue. For the dataset GSE81861 with grade information, significantly higher StemSC values were found not only in the 290 tumor tissue cells than the 170 normal tissue cells but also in the 230 low differentiation cells (grade 2) than the 60 high differentiation cells (grade 1) (student T-test, p = 6E-9, Fig. 5J, K), which was more significant than the values calculated by CytoTRACE (student T-test, p = 0.007, Fig. 5L).
Similarly, for four single-cell datasets of glioma (Table S4), we firstly derived stem markers from dataset GSE57872 by choosing the top 200 genes with the largest log fold change (FC) values between Glioblastoma stem-like cells and the differentiated cells (edgeR, FDR < 0.05, Table S7). Then, we found these genes were not only enriched in the StemSC-ranked gene list (Fig. 6A-D) but also correlated with StemSC values (p < 0.05, Fig. 6E-L) in both the dataset GSE57872 and the other three independent datasets. Besides, for the dataset GSE117891 with grade information, significantly higher StemSC values were not only in the cells with label Grade IV than those with label Grade III − IV but also in the tumoral cells than the peritumoral cells (student T-test, p < 0.05, Fig. 6M, N). Particularly, there were significantly lower StemSC values of the 563 differentiated tumor cells than those of the 134 cancer stem cells in the dataset GSE57872 (student T-test, p < 0.05, Fig. 6O). Besides, the mixed StemSC values between differentiated tumor cells and cancer stem cells reminded us the existence of high-stemness cells in the differentiated tumor cells. Thus, we divided the 563 differentiated tumor cells into low- and high-stemness cells by the median StemSC value (0.866) of the cancer stem cells. Enrichment analysis showed that the stemness markers were significantly enriched in the gene sets ranked by the FCs between high- and low-stemness cells (Fig. 6P), which revealed that the StemSC threshold 0.866 could be used to distinguish the high-stemness cells in glioma. All above results supported the applicability of StemSC on tumor tissue cells.
The effect of stemness on tumor immune microenvironment
Cancer progression is accompanied by the acquisition of stemness, which greatly affects the immune response of the tumor cells. As reported, the resistance to immune-mediated destruction was shown to be an intrinsic property of cancer stem cells [25]. Similarly, at the bulk tissue level, pervasive negative associations were found between cancer stemness and anticancer immunity [4]. However, at the single-cell level, the effect of stemness on the interaction between tumor cells and tumor microenvironment remains incompletely understood.
Here, we further divided the cells into different kinds of immune cells, low-stemness, and high-stemness tumor cells in the above four glioma datasets except dataset GSE57872, which only has tumor cells. For dataset GSE117891, we used inferCNV (see Methods) to infer the copy number variations (CNV) of all the 4623 tumor tissue cells. Further, we divided these cells into 2724 tumor cells and the 1899 normal cells without obvious CNV by using the hierarchical cluster (Fig. 7A, see Methods). For the identified normal cells, we used SingleR [18] to identify the four immune cell types with more than 50 cells, which was further confirmed by the corresponding cell markers (Fig. 7B). For the identified tumor cells, we used the above threshold 0.866 to classify these cells into 2202 low-stemness and 522 high-stemness cells, which was confirmed by the enrichment of the 200 stemness markers in the FC-ranked gene sets between high- and low-stemness cells (Fig. 7C). Further, cell-cell communication analysis (see Methods) showed that, contrary to the low-stemness cells, the high-stemness cells had fewer connections with each other and fewer interactions with the four types of immune cells (Fig. 7F). A similar result could be found in the two additional glioma datasets (Fig. 7D, E, G, H), which implied the resistance of high-stemness cells to the immunotherapy. To validate this result, we further collected the bulk RNA-seq samples of 13 immunotherapy-treated patients as the mean values of single-cell samples (dataset PRJNA482620). And we found that the median StemSC values of non-responders were significantly higher than those of responders (student T-test, p < 0.05, Fig. 7I). Besides, we used the median values to divide the 13 samples into high- and low-stemness groups. Survival analysis showed that the high-stemness group had marginally significantly worse overall survival than the low-stemness group (HR = 6.20; C-index = 0.63; p = 8.18E-2, Fig. 7J), which validated the negative effect of stemness on immunotherapy.