Identification of Kidney Cell Types in scRNA-seq and snRNA-seq Data Using Machine Learning Algorithms

doi:10.21203/rs.3.rs-3814951/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Background

Single-cell RNA sequencing (scRNA-seq) and single-nucleus RNA sequencing (snRNA-seq) provide valuable insights into the cellular states of kidney cells. However, the annotation of cell types often requires extensive domain expertise and time-consuming manual curation, limiting scalability and generalizability. To facilitate this process, we tested the performance of five supervised classification methods for automatic cell type annotation.

Results

We analyzed publicly available sc/snRNA-seq datasets from five expert-annotated studies, comprising 62,120 cells from 79 kidney biopsy samples. Datasets were integrated by harmonizing cell type annotations across studies. Five different supervised machine learning algorithms (support vector machines, random forests, multilayer perceptrons, k-nearest neighbors, and extreme gradient boosting) were applied to automatically annotate cell types using four training datasets and one testing dataset. Performance metrics, including accuracy (F1 score) and rejection rates, were evaluated. All five machine learning algorithms demonstrated high accuracies, with a median F1 score of 0.94 and a median rejection rate of 1.8%. The algorithms performed equally well across different datasets and successfully rejected cell types that were not present in the training data. However, F1 scores were lower when models trained primarily on scRNA-seq data were tested on snRNA-seq data.

Conclusions

Our findings demonstrate that machine learning algorithms can accurately annotate a wide range of adult kidney cell types in scRNA-seq/snRNA-seq data. This approach has the potential to standardize cell type annotation and facilitate further research on cellular mechanisms underlying kidney disease.

Keywords: Kidney, RNA-seq, machine learning, classification, cell identity, annotation

The human kidney is a highly complex organ composed of various cell types with distinct functions. Recent advancements in single-cell (sc) and single-nucleus (sn) RNA sequencing (RNA-seq) have provided researchers with the ability to examine the transcriptome of individual cells (1, 2). This technological breakthrough enables a detailed understanding of the components and functional processes of distinct kidney cell types and presents opportunities for therapeutic interventions targeted at these cell types (3). Consequently, the field of kidney medicine is poised to undergo a transformative shift towards a data-driven, precision-based approach.

Despite the improvements in the clustering of cell types made possible by these sophisticated techniques, the task of annotating the resulting data remains predominantly manual (4, 5, 6, 7, 8, 9). Researchers typically rely on a combination of personally identified biomarkers to identify specific cell populations, a laborious and non-standardized process that necessitates expertise in navigating the intricate transcriptomic diversity of the human kidney (4, 5, 6, 7, 8, 9). Consequently, this manual annotation introduces subjectivity into an otherwise data-driven analysis and restricts the ability of researchers to conduct cross-study and validation analyses and scale up these investigations due to inconsistent ontologies (4, 5, 6, 7, 8, 9, 10).

Modern machine learning tools offer a potential solution for addressing the challenge of cell type annotation. Various algorithms have been developed specifically for cell type annotation by leveraging scRNA-seq data. For instance, in one study, researchers successfully employed an extreme gradient boosting (XGBoost) algorithm as part of a machine learning pipeline to classify and predict cardiac developmental cell types (11). Another comprehensive study conducted by Abdelaal et al. (2019) compared several supervised machine learning algorithms, such as linear discriminant analysis, nearest mean classifiers, support vector machines (SVM), random forests (RF), and k-nearest neighbors (KNN), across 27 distinct scRNA-seq datasets encompassing brain, pancreas, and peripheral blood mononuclear cells from both human and mouse samples (4). The results demonstrated that all the algorithms exhibited high median F1 scores and low rejection rates. Notably, the SVM classifier with a linear kernel demonstrated the best performance in their analysis (4). However, the study by Abdelaal et al. (2019) did not specifically examine kidney cell types, leaving the applicability of machine learning algorithms for accurately predicting kidney cell types uncertain (4). Furthermore, relatively few studies compare machine learning methods for cell type annotation using snRNA-seq data (12).

In this study, we aimed to assess and compare the effectiveness of various machine learning algorithms for automating kidney cell type annotations. To achieve this, we utilized five publicly available scRNA-seq and snRNA-seq datasets that had been previously annotated by experts. We pooled author-identified cell types into harmonized cell types, applied five different machine learning algorithms to predict harmonized cell type annotations, and evaluated the performance of the different machine learning models using F1 scores and the rate at which models labeled cells as “unknown.” Findings from our study build on ongoing efforts toward the development and implementation of standardized cell type ontologies, and more broadly serve to improve our understanding of kidney physiology.

Harmonization of cell type annotations across datasets

Our dataset encompasses a diverse collection of kidney cell-specific transcriptomic data, consisting of a total of 62,120 cells obtained from 79 kidney biopsy samples originating from 40 healthy donors across 5 different studies. We deliberately included data from donors of varying ages, spanning the cortex and medulla, and obtained through multiple sequencing technologies, as outlined in Table 1.

Table 1

Metadata from the 5 different sc/snRNA-seq datasets analyzed in this study.
Study (PMID)	Number of Cells	Number of Donors	Number of Samples	Donor Age Range	Donor Sex: M	Donor Sex: F	Donor Sex: Not Reported	Sampling Location: Cortex	Sampling Location: Medulla	Sampling Location: Both	Sampling Location: Unknown/Other	Sequencing Method
Menon (32107344)	22,264	22	24	< 50: 2; ≥ 50: 13; Unknown: 7	7	6	9	NA	NA	NA	24	(sc) 10X
Young (30093597)	6,197	5	17	49–72	3	2	0	14	0	1	2	(sc) 10X
Liao (31896769)	16,145	2	2	59–65	1	1	0	NA	NA	NA	2	(sc) 10X
Wu (29980650)	4,259	1	1	70	1	0	0	NA	NA	NA	1	(sn) InDrops
Lake (31249312)	13,255	14	35	< 50: 4; ≥ 50: 7; Unknown: 3	10	4	0	15	14	6	0	(sn) Drop-Seq
Total	62,120	40	79		22	13	9	29	14	7	29	

The age range of the donors spanned from under 30 to over 70 years old. Of the 79 samples, 29 consisted only of cortical tissue, 14 consisted only of medullary tissue, 7 consisted of both, and the sampling location of the remaining 29 was either unknown or ureteral tissue (n = 1). As shown in Fig. 1a, among cells of known sampling location, 8,698 (44.7%) were from the cortex alone, 7,742 (39.8%) were from the medulla alone, 2,050 (10.5%) were from the corticomedullary junction, and 962 (4.95%) were from the ureter. Among the five datasets incorporated, three utilized the 10X single-cell technology (13), one used the InDrops single-nucleus technology (14), and the remaining one employed the Drop-Seq single-nucleus technology (15), as illustrated in Fig. 1b. Additionally, our dataset included at least 22 males and 13 females, who contributed 31,838 (51.2%) and 15,753 (25.4%) cells, respectively, as shown in Fig. 1c. For comprehensive details regarding the donors, we refer readers to the original publications associated with each dataset (5, 6, 7, 8, 9).

To ensure the quality of our dataset, we performed quality control measures by leveraging published data and code from each study. We pre-processed each study dataset individually, including performing pertinent transformations and dropping of samples and/or cells as described in Methods. We validated the original author cell type annotations using the uniform manifold approximation and projection for dimension reduction (UMAP) visualization technique. The UMAP visualizations can be found in Figure S2.

Overall, we identified a total of 85 unique cell type annotations across all of the cohorts. It is important to note that all these cells were derived from healthy, adult human kidneys. To consolidate the annotations and establish a unified cell type nomenclature, we leveraged the transcriptomic data and observed high correlations between individual study annotations with respect to expression of marker genes. Consequently, cell type annotations that exhibited strong correlation patterns were grouped together into harmonized cell types. Figure 2 illustrates the results of this analysis, revealing 16 distinct harmonized kidney cell types based on transcriptomic data. For instance, annotations from different studies that included the term "podocyte" were highly correlated with each other, leading us to assign them to a single harmonized cell type referred to as "Podocyte." This consolidation approach was applied consistently across the remaining 15 harmonized cell types.

The number of individual cells included in each harmonized cell type varied across studies. As depicted in Fig. 3a, the “Proximal Tubule” harmonized cell type encompassed the largest number of cells, totaling 23,177. On the other hand, the “Mast” harmonized cell type had the smallest cell count, with only 22 cells identified. Rare cell types, such as “Fibroblasts” and “B, Plasma, & Plasmacytoid” benefitted from the inclusion of multiple studies in our dataset, compensating for their low cell counts in individual studies (Fig. 3b).

By combining multiple datasets, we were able to overcome the limitations of each individual study regarding the inclusion of specific cell types. This was observed even among harmonized cell types that contained a substantial number of cells. For instance, although Lake et al. had only 16 cells in the “Monocytes, Macrophages, & Other Myeloid” cell type, the inclusion of these cells from three of the other datasets compensated for this omission, resulting in 2,429 “Monocytes, Macrophages, & Other Myeloid”-labeled cells in our final integrated dataset. In a few cases, certain cell types were only present in a single study, as exemplified by the “Neutrophil,” “Mast,” and “Urothelium” harmonized cell types. This highlights the significance of incorporating multiple studies in our data to complement one another and achieve a comprehensive coverage of healthy, adult human kidneys in our training dataset.

Prior to integrating all five datasets, we examined the combined UMAP visualization of the datasets and observed the presence of batch effects, as shown in Fig. 4a. To address this issue, we employed rPCA from the Seurat package to mitigate the batch effects, resulting in a batch-corrected UMAP plot illustrated in Fig. 4b. Following batch correction, we observed a more even distribution of harmonized cell types across the different studies, as depicted in Fig. 4c. However, it's important to note that despite the batch correction, we encountered instances where certain cells did not align perfectly with the harmonized cell types based on the original authors' annotations. To address this discrepancy, we trained a support vector machine (SVM) model using all 62,120 cells. When we applied the trained model on the same data, it predicted the wrong harmonized cell type label for a subset of 4,256 cells (6.9%). Consequently, we categorized these 4,256 cells as low-quality and excluded them from further analyses, as illustrated in Figure S3. The choice of SVM for cell type classification was based on its demonstrated high performance in previous studies (4, 16, 17, 18, 19, 20). Our final integrated dataset included 57,864 cells that were not identified as low-quality cells. The distribution of these cells by harmonized cell type and study along with metadata regarding each cell and sample can be found in Table S1.

Prediction of harmonized cell types

Next, we employed five distinct supervised learning methods to predict the harmonized cell type annotations in our integrated dataset. These methods included a support vector classifier (SVC), a random forest classifier (RF), a multilayer perceptron (MLP), a k-nearest neighbors classifier (KNN), and an extreme gradient boosting (XGB) model. To evaluate the performance of these models, we adopted an inter-dataset evaluation scheme. This involved using each combination of four of the five datasets as the training data and the remaining fifth dataset as the testing data. By employing this approach, we aimed to reduce the risk of overfitting by ensuring that the testing data was not used during the training process of the model. Figure 5a demonstrates that all the employed algorithms exhibited a median F1 score of 0.94 or higher when tested on each of the individual datasets. These high median F1 scores indicate the strong performance of the algorithms in accurately identifying harmonized cell type annotations using transcriptomic data.
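The inter-dataset evaluation scheme described above can be sketched as a leave-one-dataset-out split. This is a minimal illustration using only the Python standard library, not the pipeline's actual code; the function name `leave_one_dataset_out` and the eight-cell toy list are hypothetical.

```python
from typing import Iterator, List, Tuple


def leave_one_dataset_out(
    cell_study: List[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_study, train_indices, test_indices), once per study.

    cell_study[i] is the study of origin of cell i.
    """
    for held_out in sorted(set(cell_study)):
        train_idx = [i for i, s in enumerate(cell_study) if s != held_out]
        test_idx = [i for i, s in enumerate(cell_study) if s == held_out]
        yield held_out, train_idx, test_idx


# Toy example: study of origin for eight hypothetical cells.
cell_study = ["Menon", "Menon", "Young", "Liao", "Wu", "Lake", "Lake", "Young"]
splits = list(leave_one_dataset_out(cell_study))
for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", len(train_idx), "test:", len(test_idx))
```

Each of the five splits trains on cells from four studies and tests on cells from the held-out study only, so no test cell is seen during training. In scikit-learn, `LeaveOneGroupOut` produces the same kind of split when the study labels are passed as `groups`.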

Performance evaluation of classifiers across different harmonized cell types and datasets

Upon comparing the performance of each algorithm across all datasets, we observed that the median F1 scores varied depending on the specific dataset used for testing. For instance, the XGB algorithm achieved the highest median F1 score across cell types when the Menon dataset was used for testing. On the other hand, the KNN algorithm achieved the highest median F1 score across cell types when the Wu dataset was used for testing and the lowest median F1 score across cell types when the Young dataset was used for testing, and the MLP algorithm attained the highest median F1 score when the Lake dataset was used for testing (Fig. 5a). However, it is noteworthy that none of the five machine learning algorithms significantly outperformed the others across the five testing sets. This observation is supported by the results of Kruskal-Wallis tests showing that the p-values for the Menon, Lake, Liao, Wu, and Young datasets were 0.62, 0.97, 0.94, 0.85, and 1, respectively (Table S3).
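For readers unfamiliar with the Kruskal-Wallis test, the comparison of five algorithms on one testing dataset can be sketched as follows. This is an illustrative standard-library implementation under stated assumptions: each group is assumed to hold one algorithm's per-cell-type F1 scores, the scores below are invented, and the p-value uses a closed form of the chi-square survival function that is valid only for an even number of degrees of freedom (five groups give four). In practice a library routine such as `scipy.stats.kruskal` would normally be used.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Return 1-based ranks (ties share their average rank) and tie-group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        tie_sizes.append(end - start + 1)
        start = end + 1
    return ranks, tie_sizes


def chi2_sf_even_df(x: float, df: int) -> float:
    """Chi-square survival function; closed form valid only for even df."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


def kruskal_wallis(groups: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (H, p) for the Kruskal-Wallis test with the standard tie correction."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks, tie_sizes = average_ranks(pooled)
    rank_sum_term, offset = 0.0, 0
    for g in groups:
        group_rank_sum = sum(ranks[offset:offset + len(g)])
        rank_sum_term += group_rank_sum ** 2 / len(g)
        offset += len(g)
    h = 12 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    if correction == 0:  # every value identical: no evidence of a difference
        return 0.0, 1.0
    h = max(0.0, h / correction)
    return h, chi2_sf_even_df(h, len(groups) - 1)


# Invented per-cell-type F1 scores for five algorithms on one testing dataset.
f1_by_algorithm = {
    "SVC": [0.95, 0.90, 0.97],
    "RF": [0.93, 0.88, 0.96],
    "MLP": [0.94, 0.91, 0.98],
    "KNN": [0.92, 0.89, 0.95],
    "XGB": [0.96, 0.87, 0.99],
}
h_stat, p_value = kruskal_wallis(list(f1_by_algorithm.values()))
print(f"H = {h_stat:.3f}, p = {p_value:.3f}")
```

A large p-value here means the rank distributions of the algorithms' scores are not distinguishable, which is the sense in which no algorithm "significantly outperformed" the others.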

In some instances, certain harmonized cell types were only present in a specific study, such as “Neutrophil,” “Mast,” and “Urothelium” in the Young dataset. When the Young dataset was used as the testing dataset, models trained on the other four datasets were unable to predict these cell types, resulting in an F1 score of 0. Consequently, we also compared the rejection rates, which represent the percentage of cells labeled as "Unknown," across the different machine learning algorithms and datasets to assess their effectiveness (Fig. 5b). Interestingly, none of the five machine learning algorithms significantly outperformed each other in terms of rejection rates across the five testing sets. The results of the Kruskal-Wallis tests yielded Holm-adjusted p-values of 0.178, 1, 0.178, 1, and 1 for the Menon, Lake, Liao, Wu, and Young datasets, respectively (Table S3). However, it is worth noting that SVC exhibited the lowest rejection rate across all five testing datasets, although this difference was not statistically significant compared to the other machine learning algorithms.
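The Holm adjustment mentioned above can be illustrated with a short standard-library sketch. It assumes the family of tests being corrected is the set of five per-dataset Kruskal-Wallis tests; the raw p-values below are invented for illustration and are not the study's values.

```python
from typing import List


def holm_adjust(p_values: List[float]) -> List[float]:
    """Holm step-down adjustment; returns adjusted p-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity: adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.50, 0.20]  # invented raw p-values, one per test
adjusted_p = holm_adjust(raw_p)
print(adjusted_p)
```

Because the smallest p-values are multiplied by the largest factors and every adjusted value is capped at 1, several adjusted p-values of exactly 1 are a common outcome when most raw p-values are large.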

When considering the Young dataset as the testing dataset, the best model is one that accurately rejects cells in the “Neutrophil,” “Mast,” and “Urothelium” cell types as these cell types are not present in the training data. Figure 5c demonstrates that the RF model had the highest rejection rate for cells in the “Urothelium” type, while the MLP model had the highest rejection rate for cells in the “Neutrophil” type and the KNN model had the highest rejection rate for cells in the “Mast” type when the Young dataset was used as the testing dataset.
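A rejection option of this kind is commonly implemented by labeling a cell "Unknown" when the classifier's highest class probability falls below a cutoff. The sketch below assumes that mechanism; the 0.7 threshold, the function names, and the toy probabilities are all hypothetical, since the text does not state the cutoff used for rejection.

```python
from collections import Counter
from typing import Dict, List

UNKNOWN = "Unknown"


def predict_with_rejection(
    probabilities: List[Dict[str, float]], threshold: float
) -> List[str]:
    """Most probable label per cell, or UNKNOWN if its probability is below threshold."""
    predictions = []
    for probs in probabilities:
        label, p = max(probs.items(), key=lambda kv: kv[1])
        predictions.append(label if p >= threshold else UNKNOWN)
    return predictions


def rejection_rate_by_type(
    true_labels: List[str], predictions: List[str]
) -> Dict[str, float]:
    """Fraction of cells of each true type that were labeled UNKNOWN."""
    totals = Counter(true_labels)
    rejected = Counter(t for t, p in zip(true_labels, predictions) if p == UNKNOWN)
    return {t: rejected[t] / n for t, n in totals.items()}


# Toy test set: "Urothelium" and "Mast" are absent from the (hypothetical)
# training labels, so the ideal outcome for those cells is rejection.
true_labels = ["Podocyte", "Urothelium", "Urothelium", "Mast"]
probabilities = [
    {"Podocyte": 0.95, "Principal": 0.05},
    {"Podocyte": 0.20, "Principal": 0.80},  # confidently mislabeled, not rejected
    {"Podocyte": 0.45, "Principal": 0.55},
    {"Podocyte": 0.50, "Principal": 0.50},
]
predictions = predict_with_rejection(probabilities, threshold=0.7)
rates = rejection_rate_by_type(true_labels, predictions)
overall_rate = predictions.count(UNKNOWN) / len(predictions)
print(predictions, rates, overall_rate)
```

The second toy cell shows the failure mode discussed in the text: an unseen cell type that resembles a known type can receive a confident but wrong label instead of being rejected.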

The performance of the machine learning algorithms also varied across different harmonized cell types. For instance, when Menon was used as the testing dataset, the "Distal Convoluted Tubule and Connecting Tubule" harmonized cell type exhibited lower F1 scores across the machine learning algorithms compared to harmonized cell types such as "Natural Killer & T," "Monocytes, Macrophages, & Other Myeloid," "Proximal Tubule," or "Endothelium," which had higher F1 scores across the algorithms (Fig. 6a). Specifically, the XGB model incorrectly labeled 208 out of 745 (27.9%) cells belonging to the "Distal Convoluted and Connecting Tubule" harmonized cell type as belonging to the "Ascending Loop of Henle" harmonized cell type (Table S2). It is worth noting that the "Distal Convoluted and Connecting Tubule" and "Ascending Loop of Henle" harmonized cell types exhibited a high degree of correlation, as depicted in Fig. 2.

The lowest F1 scores across machine learning algorithms were observed when Lake and Wu were used as the testing datasets (Fig. 5a). The harmonized cell types with the lowest F1 scores when tested on the Lake dataset were "Fibroblasts," "Parietal Epithelium, Late Proximal Tubule, & Descending Loop of Henle," and "Proximal Tubule" (Fig. 6b). Notably, in the Menon dataset, the F1 scores for the "Fibroblasts" cell type exceeded 0.97 across machine learning algorithms, whereas in the Lake dataset, the F1 scores for this cell type ranged from 0.65 to 0.8, indicating misclassification of cells of this type. For instance, the KNN model misclassified these cells as belonging to the "Perivascular & Mesangium" harmonized cell type in 18.3% of cases and as "Endothelium" in 14.6% of cases (Table S2). When the Wu dataset was used for testing, the “Ascending Loop of Henle” harmonized cell type had the lowest F1 scores across algorithms (Fig. 6c). For example, 291 cells belonging to the “Parietal Epithelium, Late Proximal Tubule, & Descending Loop of Henle” harmonized cell type were misclassified by the MLP model as belonging to the “Ascending Loop of Henle” cell type, driving the low F1 scores for this cell type.

When Young was used as the testing dataset, the average F1 scores for cells belonging to the "Principal" or "Proximal Tubule" harmonized cell types were below 0.35 (Fig. 6d). This can be attributed to the low precision of all the machine learning algorithms in predicting cells of the “Proximal Tubule” type and the low recall in predicting cells of the “Urothelium” type. The mislabeling of cells in “Urothelium” as cells in “Principal” instead of rejecting them and the mislabeling of “Endothelium” and “Ascending Loop of Henle” cells as “Proximal Tubule” cells were the main factors contributing to these low F1 scores. Detailed information can be found in Table S2.

In this study, we applied several machine learning algorithms including SVC, RF, MLP, KNN, and XGB to accurately classify kidney cell types using publicly available scRNA-seq and snRNA-seq datasets. Overall, the performance of the machine learning algorithms was satisfactory, with high median F1 scores and low rejection rates for most harmonized cell types across different testing datasets. This suggests that the machine learning algorithms successfully annotated the majority of cells and achieved a high level of concordance with the actual harmonized cell type annotations.

No single machine learning algorithm demonstrated clear superiority in classifying kidney cell types. Each algorithm had its strengths and limitations across different datasets. XGB and SVC consistently performed well but had relative difficulty identifying urothelial cells, neutrophils, and mast cells as novel. RF models had lower median F1 scores overall but had the highest rejection rates for urothelial cells and some other cell types. MLP and KNN models achieved a balanced performance overall but encountered challenges with specific cell types in certain datasets. Notably, some cell types with low cell counts posed difficulties for MLP and KNN models with respect to classification.

We observed that the overall performance of the machine learning algorithms varied in different scenarios. For example, the algorithms struggled to differentiate cell types with highly correlated transcription profiles, such as distal convoluted tubule and ascending loop of Henle cells (Table S2). Additionally, cell types with smaller sample sizes, such as fibroblasts and principal cells, posed challenges for accurate classification. To improve the performance for cell types with limited cell numbers, we recommend investigating the correlation between clusters with regard to transcription profiles and merging highly correlated clusters into a single cluster. Notably, some small harmonized cell type clusters, such as intercalated cells in Liao et al., were classified with higher accuracy, potentially due to lower correlation with other harmonized cell types.

Lower F1 scores were observed when training datasets consisted primarily of scRNA-seq data and testing datasets consisted exclusively of snRNA-seq data, as observed when Wu and Lake were used as testing datasets. This discrepancy may stem from inherent differences between the two sequencing methods, as well as differences in protocols across studies. Previous studies have highlighted variations in the detected kidney cell types based on sample storage and processing, leading to differences in gene enrichment and subsequent cell type annotations (21, 22, 23). For instance, snRNA-seq has been associated with reduced enrichment of leukocytes, including T cells, B cells, and natural killer cells, which are often indicative of underlying inflammatory states (21, 22). Notably, Wu et al. specified in their study that they were unable to detect stromal or leukocyte populations, possibly due to dissociation bias or cell frequency below the limit of detection (6). Another study comparing scRNA-seq and snRNA-seq in adult mouse kidney models reported an enrichment of specific kidney cell types, such as podocytes, mesangial cells, and endothelial cells, exclusively in snRNA-seq data (24). These discrepancies between the sequencing methods contribute to the overall lower performance of machine learning models when tested on data derived from a sequencing method primarily different from the one on which they were trained.

Within the realm of biomarker ontologies, it is crucial to consider the diversity of the datasets analyzed in our study, which originated from distinct studies utilizing varying pipelines, ontologies, and manual annotations by experts. Despite these differences, our machine learning models, trained on standardized cell type labels, exhibited strong performance. This indicates that expert-derived annotations can be effectively harmonized across studies with several implications. First, harmonization of cell types across studies can allow for greater sample sizes in future transcriptomic analysis and allow for comparison between studies. Consequently, we believe that our approach of identifying and labeling matching cell types across studies will facilitate the adoption of standardized cell labels for identical cell populations in future research endeavors. This promotes consistency and comparability in the field of biomarker ontologies, enabling more comprehensive and cohesive analyses across diverse studies.

In the field of kidney research, there are ongoing efforts to establish standardized ontologies. The Kidney Precision Medicine Project (KPMP) is actively developing the Kidney Tissue Atlas Ontology, aiming to create a unified system that incorporates clinical, pathological, imaging, and molecular data (25). This ontology seeks to standardize labels for biomarkers, phenotypes, disease states, cell types, and anatomical structures in the kidney across both healthy and diseased conditions (25). By utilizing scRNA-seq and snRNA-seq, KPMP aims to identify gene, metabolite, and protein biomarkers that differentiate cell types and contribute to disease pathways.

KPMP builds upon previous ontological projects in the kidney, such as the Genitourinary Development Molecular Anatomy Project and the Chronic Kidney Disease Ontology, which focused on specific disease states or cell types rather than encompassing all kidney cell types (25). The collaboration between KPMP and the Human BioMolecular Atlas Program (HuBMAP) resulted in the publication of the Anatomical Structures, Cell Types, and Biomarkers (ASCT+B) tables in 2019 (25, 26). These tables aid in the annotation of anatomical structures, cell types, and biomarkers in the kidney. Furthermore, the HuBMAP initiative, which includes KPMP and other data consortia, is actively working on the Human Reference Atlas (HRA), which aims to develop biomarker ontologies for various organs in the human body (26). Additionally, the Human Cell Atlas (HCA) initiative has introduced the Cell Annotation Platform (CAP), a data visualization tool intended to facilitate the visualization and integration of annotation data from multiple published studies (27). Moreover, our work complements the exceptional work done by the Tabula Sapiens Consortium and HuBMAP’s Azimuth team, as well as generative AI models in this space such as scGPT, by utilizing general-purpose machine learning algorithms such as SVM, which were demonstrated by Abdelaal et al. to have better overall performance with faster computation time than scRNA-seq-specific algorithms (4, 28, 29, 30). Our research aligns with these ongoing initiatives by providing insights that can contribute to the less labor-intensive compilation of independent datasets, enhance interoperability, increase cell sample sizes, and strengthen the utilization of machine learning-derived cell type annotations.

In our study, we focused on analyzing kidney cell types in healthy individuals and did not include patients with kidney disease. The decision was based on the heterogeneity of kidney diseases and the incomplete understanding of disease mechanisms and expression patterns (26, 31). However, recent studies have made efforts to investigate gene expression patterns in various kidney disease states. Some studies have specifically focused on certain diseases, such as hypertensive or diabetic kidney disease, while others have utilized murine models to identify biomarkers and analyze cell type enrichment (32, 33, 34). Lake et al. (2021) took a different approach by leveraging data from HuBMAP, KPMP, and HCA, including cells from both healthy and diseased kidneys, to characterize differential gene expression in disease states using spatial transcriptomics (35). Their findings revealed associations between disease states, elevated cytokine production, and tubular regeneration and differentiation, as well as increased expression of inflammatory and fibrotic cell markers (35).

In the context of machine learning algorithms, one potential application in the study of kidney disease is to utilize the rejection rate of the models trained on healthy kidney cells. By identifying cells that are more likely to represent a disease state based on higher rejection rates, researchers can target those cells for further analysis of differential gene expression patterns. As databases of kidney disease continue to expand over time, similar approaches to the ones described in our study can be applied to enhance our understanding of kidney diseases.

In conclusion, it is crucial for the identification of kidney cell types using sc/snRNA-seq technologies to be both accurate and standardized. Our study demonstrated the ability of machine learning algorithms to successfully classify kidney cell types. We also investigated the limitations and challenges associated with these algorithms, highlighting situations where they excelled and areas where improvements are needed, such as when differentiating highly correlated cell types, labeling cell types with small sample sizes, or labeling cell types derived from a different sequencing method than the one on which the models were primarily trained. Our study methodology can be applied to harmonize cell types derived from scRNA-seq and snRNA-seq data across other validation cohorts, as well as to automatically annotate cell types in novel data using machine learning algorithms.

To promote the widespread adoption of these methods, it is essential for the research community to work together in standardizing cell type annotations and preparation protocols to reduce variability across different centers. KPMP is at the forefront of these efforts, aiming to enhance the accessibility of modern data-driven precision medicine technologies and advance our understanding and management of kidney disease.

To facilitate the expansion of our research by other scientists, we have made our entire pipeline available, including detailed documentation for adding new training datasets or implementing alternative machine learning algorithms. In the interest of reproducibility, all the code for our project can be found in our GitHub repository, and our data is accessible on Zenodo. By leveraging the power of machine learning algorithms and fostering collaborative efforts, we can accelerate the discovery of novel insights into kidney cell types and drive advancements in precision medicine for kidney diseases.

Data Collection and Quality Control:

We initially identified five studies of sc/snRNA-seq data on kidney cells from the GEO database. The selection criteria included studies with publicly available data that could be replicated using the methods described in this section or the code provided on our GitHub repository. Subsequently, we filtered the data to include only normal, healthy cells with “well-annotated” cell types as described in the sections corresponding to each original study below. Our analysis pipeline was implemented using Snakemake, and a visual representation of the pipeline can be found in Figure S1. The complete code for our analysis, including the pipeline, is available on our GitHub repository. Additionally, the data used in this study can be accessed on Zenodo. To ensure data quality, we performed UMAP analyses, which are illustrated in Figure S2, and compared these to the UMAPs presented in the original publications (36).

Lake et al. (8):

The normalized data from Lake et al. was generously shared with us. Several of the cells in this dataset were also included in the data from Menon et al., and these duplicates were removed. Additionally, we excluded cells marked as “distressed” or “unassigned.”

Liao et al. (9):

The raw data from Liao et al. was downloaded from GSE131685 (9). Our replication of the original dataset used an adaptation of the original analysis code available on GitHub (37). Sample ‘kidney1’ was removed due to its uniformly high mitochondrial expression and the misalignment of cells from ‘kidney1’ with those of ‘kidney2’ and ‘kidney3,’ as visualized by UMAP (Figure S6). The rest of the cells from this study were included.

Menon et al. (7):

The normalized data from this study was downloaded from GSE140989 (7). The original annotations were recreated using the published description of their workflow from the methods section of their paper. All cells from this dataset were used.

Wu et al. (6):

The raw data from Wu et al. was downloaded from GSE114156. Our replication of the original dataset was based on the instructions provided in the supplementary files to the original publication. No cells were excluded from this analysis prior to the SVM quality control step. As the authors did not provide a file with marker genes to label clusters, cluster labeling was performed using the marker genes listed in Fig. 3 of the original manuscript.

Young et al. (5):

The raw data from this study was downloaded from the supplementary files of the study. Annotations were replicated with the metadata provided in their supplement and an adaptation of their original code, which is available on GitHub (38). Samples derived from children and annotated as tumor samples were excluded using the cell manifest prior to reading the data. We then removed all cells annotated as ‘junk’, ‘private’, or ‘nephron epithelium’. This step removed several cells that clustered with proximal tubular cells, so this cell type is less well represented in this dataset.

Batch Correction:

Batch correction was performed using Seurat v4 rPCA integration (39). The resulting integrated assay was then scaled, reduced in dimensionality, clustered, and visualized with the standard Seurat functions (40).

Harmonized Cell Type Labeling:

The cell type annotations from the original datasets were grouped into 16 harmonized cell type classes, which were determined by the pattern of their PCA-coordinate Pearson correlations, implemented with Scanpy (41). These classes were named after the expert annotations in the original datasets.
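As an illustration of the idea, the sketch below shows one plausible reading of a PCA-coordinate Pearson correlation between annotations: each original label is summarized by the mean PCA coordinates of its cells, and the labels are then compared pairwise. It uses plain NumPy rather than Scanpy, and the function names, label strings, and toy data are hypothetical and are not taken from our pipeline.

```python
import numpy as np

def label_centroids(pca_coords, labels):
    """Return the sorted unique labels and the mean PCA vector of each label."""
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    centroids = np.vstack(
        [pca_coords[labels == name].mean(axis=0) for name in names]
    )
    return names, centroids

def centroid_correlations(pca_coords, labels):
    """Pearson correlation between the PCA centroids of every pair of labels."""
    names, centroids = label_centroids(pca_coords, labels)
    # np.corrcoef treats each row as one variable, so entry (i, j) is the
    # correlation between the centroids of labels i and j.
    return names, np.corrcoef(centroids)

# Toy data (assumption): two studies annotate the same cell type under
# different names, and a third label marks an unrelated cell type.
rng = np.random.default_rng(0)
pt_profile = np.array([5.0, -3.0, 1.0, 0.5, -2.0])
tc_profile = np.array([-4.0, 2.0, 3.0, -1.0, 0.0])
coords = np.vstack([
    pt_profile + rng.normal(scale=0.1, size=(30, 5)),
    pt_profile + rng.normal(scale=0.1, size=(30, 5)),
    tc_profile + rng.normal(scale=0.1, size=(30, 5)),
])
labels = (["StudyA:PT"] * 30
          + ["StudyB:Proximal tubule"] * 30
          + ["StudyA:T cell"] * 30)

names, corr = centroid_correlations(coords, labels)
```

In this toy example, the two proximal tubule labels correlate strongly and would be candidates for a shared harmonized class, whereas the T cell label does not correlate with either.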

SVM Outlier Detection:

We removed outlier cells from each harmonized cell type by training and testing an SVM model on the integrated dataset. Cells that were classified with low probability (< 0.6) were removed. SVM was chosen for this task due to its previously shown high performance in outlier detection (4, 16, 17, 18, 19, 20).
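The filtering rule can be sketched as follows. The sketch assumes that “classified with low probability” refers to the highest class probability assigned to a cell, and it takes a precomputed probability matrix as input (for example, the output of `predict_proba` from a scikit-learn `SVC` fitted with `probability=True`). The function name and the toy probabilities are hypothetical.

```python
import numpy as np

def keep_confident_cells(probabilities, threshold=0.6):
    """Return a boolean mask that is True for cells to keep.

    probabilities: array of shape (n_cells, n_classes) with one row of
    class probabilities per cell. A cell is kept when its highest class
    probability is at least `threshold` (assumed reading of the rule).
    """
    probabilities = np.asarray(probabilities, dtype=float)
    return probabilities.max(axis=1) >= threshold

# Toy probabilities for four cells and three harmonized classes.
probs = np.array([
    [0.90, 0.05, 0.05],  # confident, kept
    [0.40, 0.35, 0.25],  # ambiguous, removed
    [0.10, 0.60, 0.30],  # exactly at the threshold, kept
    [0.55, 0.45, 0.00],  # below the threshold, removed
])
mask = keep_confident_cells(probs)
filtered = probs[mask]
```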

Supervised Learning:

We evaluated five popular machine learning algorithms: a support vector classifier (SVC), a random forest classifier (RF), a multi-layer perceptron (MLP), a k-nearest neighbors classifier (KNN), and XGBoost (XGB), each implemented in the scikit-learn Python library (42). We trained each algorithm on four datasets and tested it on the fifth. We repeated this process five times, holding out a different dataset for testing in each run. The performance of the machine learning algorithms was evaluated using F1 scores and rejection rates. The overall F1 score and rejection rate for each algorithm were calculated as the median across all individual harmonized cell types.
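The evaluation scheme can be sketched in plain Python as follows. This is a minimal illustration rather than the code in our repository: the `"unknown"` label for rejected cells, the function names, and the toy predictions are hypothetical, and the sketch assumes that a rejected cell counts as a false negative for its true cell type.

```python
from statistics import median

UNKNOWN = "unknown"  # hypothetical label assigned to rejected cells

def leave_one_dataset_out(dataset_names):
    """Yield (training datasets, testing dataset), one pair per held-out dataset."""
    for held_out in dataset_names:
        yield [d for d in dataset_names if d != held_out], held_out

def f1_for_type(y_true, y_pred, cell_type):
    """F1 score for one cell type; rejected cells count as false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cell_type and p == cell_type for t, p in pairs)
    fp = sum(t != cell_type and p == cell_type for t, p in pairs)
    fn = sum(t == cell_type and p != cell_type for t, p in pairs)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def rejection_rate_for_type(y_true, y_pred, cell_type):
    """Fraction of cells of one true type that were labeled as unknown."""
    preds = [p for t, p in zip(y_true, y_pred) if t == cell_type]
    return sum(p == UNKNOWN for p in preds) / len(preds)

def summarize(y_true, y_pred):
    """Median F1 score and median rejection rate across the true cell types."""
    types = sorted(set(y_true))
    f1_scores = [f1_for_type(y_true, y_pred, c) for c in types]
    rejections = [rejection_rate_for_type(y_true, y_pred, c) for c in types]
    return median(f1_scores), median(rejections)

# The five studies used here, each held out once for testing.
splits = list(leave_one_dataset_out(["Lake", "Liao", "Menon", "Wu", "Young"]))

# Toy predictions for a single testing dataset.
y_true = ["PT"] * 4 + ["TC"] * 4 + ["Podo"] * 2
y_pred = ["PT", "PT", "PT", UNKNOWN,
          "TC", "TC", "PT", "TC",
          "Podo", "Podo"]
median_f1, median_rejection = summarize(y_true, y_pred)
```

In the toy example, the per-type F1 scores are 0.75 (PT), 6/7 (TC), and 1.0 (Podo), so the reported summary is their median rather than their mean, which limits the influence of a single poorly classified cell type.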

Declarations

Ethics Approval and Consent to Participate: Not applicable.

Consent for Publication: Not applicable.

Availability of Data and Materials: The datasets supporting the conclusions of this article are available in Zenodo [doi: 10.5281/zenodo.8303415, https://zenodo.org/record/8303415]. Our results are reproducible with the code available in our GitHub repository (https://github.com/smadapoosi/IKCTML) and our Snakemake pipeline.

Competing Interests: MK reports grants from the NIH/NIDDK and JDRF in support of this manuscript. Grants and contracts outside the submitted work through the University of Michigan with the NIH, the Chan Zuckerberg Initiative, AstraZeneca, NovoNordisk, Eli Lilly, Gilead, Goldfinch Bio, Janssen, Boehringer-Ingelheim, Moderna, the European Union Innovative Medicine Initiative, Certa, Chinook, amfAR, Angion, RenalytixAI, Travere, Regeneron, and IONIS. MK reports consulting fees through the University of Michigan from Astellas, Poxel, Janssen, and UCB. In addition, MK has a patent licensed (PCT/EP2014/073413 “Biomarkers and Methods for Progression Prediction for Chronic Kidney Disease”). ASN has received consulting fees from CareDx for participating in external advisory boards, which are unrelated to this work. SE reports grant support from AstraZeneca, NovoNordisk, Eli Lilly, Gilead, Janssen, Moderna, Certa, Chinook, amfAR, Angion, and IONIS outside of the submitted work. FA. Eli Lilly, AstraZeneca, NovoNordisk, Janssen, Chinook, IONIS, Genentech, outside of the submitted work.

Funding: The KPMP is funded by the following grants from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK): U01DK133081, U01DK133091, U01DK133092, U01DK133093, U01DK133095, U01DK133097, U01DK114866, U01DK114908, U01DK133090, U01DK133113, U01DK133766, U01DK133768, U01DK114907, U01DK114920, U01DK114923, U01DK114933, U24DK114886, UH3DK114926, UH3DK114861, UH3DK114915, UH3DK11493.

Authors’ Contributions: AT wrote the code, downloaded the data, performed quality control, and drafted the manuscript. SM replicated the data, including the integration and machine learning results, generated the final figures, expanded on and revised the manuscript to its final form, and updated the Zenodo, Google Colab, and GitHub resources. SB created the Docker and Singularity images and the Snakemake pipeline, wrote the machine learning scripts, and drafted the manuscript. JR replicated the results, including developing and running Jupyter notebooks, and predicted the kidney cell types in the new datasets. SE, LM, AN, CL, PM, RM, BL, SR, CP, MK, and AM were involved in the writing group and provided expert feedback on the manuscript. FA oversaw the entirety of the project, including code generation, data selection, generation, and quality control, and manuscript preparation.

Acknowledgments: The authors would like to thank NIH-DDK and HCA for funding the studies utilized in this project, the authors of Menon et al. (2020), Lake et al. (2019), Liao et al. (2020), Wu et al. (2019), and Young et al. (2018) for generously sharing their data and code, and the kidney sample donors for their contribution to science.

References

Ju W, Greene CS, Eichinger F, Nair V, Hodgin JB, Bitzer M, et al. Defining cell-type specificity at the transcriptional level in human disease. Genome Res. 2013;23(11):1862–73.
Shen-Orr SS, Tibshirani R, Khatri P, Bodian DL, Staedtler F, Perry NM, et al. Cell type-specific gene expression differences in complex tissues. Nat Methods. 2010;7(4):287–9.
Gawel DR, Serra-Musach J, Lilja S, Aagesen J, Arenas A, Asking B, et al. Correction to: A validated single-cell-based strategy to identify diagnostic and therapeutic targets in complex diseases. Genome Med. 2020;12(1):37.
Abdelaal T, Michielsen L, Cats D, Hoogduin D, Mei H, Reinders MJT, et al. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol. 2019;20(1):194.
Young MD, Mitchell TJ, Vieira Braga FA, Tran MGB, Stewart BJ, Ferdinand JR, et al. Single-cell transcriptomes from human kidneys reveal the cellular identity of renal tumors. Science. 2018;361(6402):594–9.
Wu H, Malone AF, Donnelly EL, Kirita Y, Uchimura K, Ramakrishnan SM, et al. Single-Cell Transcriptomics of a Human Kidney Allograft Biopsy Specimen Defines a Diverse Inflammatory Response. J Am Soc Nephrol. 2018;29(8):2069–80.
Menon R, Otto EA, Hoover P, Eddy S, Mariani L, Godfrey B et al. Single cell transcriptomics identifies focal segmental glomerulosclerosis remission endothelial biomarker. JCI Insight. 2020;5(6).
Lake BB, Chen S, Hoshi M, Plongthongkum N, Salamon D, Knoten A, et al. A single-nucleus RNA-sequencing pipeline to decipher the molecular anatomy and pathophysiology of human kidneys. Nat Commun. 2019;10(1):2832.
Liao J, Yu Z, Chen Y, Bao M, Zou C, Zhang H, et al. Single-cell RNA sequencing of human kidney. Sci Data. 2020;7(1):4.
Kameneva P, Artemov AV, Kastriti ME, Faure L, Olsen TK, Otte J, et al. Single-cell transcriptomics of human embryos identifies multiple sympathoblast lineages with potential implications for neuroblastoma origin. Nat Genet. 2021;53(5):694–706.
Galdos FX, Xu S, Goodyer WR, Duan L, Huang YV, Lee S, et al. devCellPy is a machine learning-enabled pipeline for automated annotation of complex multilayered single-cell transcriptomic data. Nat Commun. 2022;13(1):5271.
Le H, Peng B, Uy J, Carrillo D, Zhang Y, Aevermann BD, et al. Machine learning for cell type classification from single nucleus RNA sequencing data. PLoS ONE. 2022;17(9):e0275070.
Ziegenhain C, Vieth B, Parekh S, Reinius B, Guillaumet-Adkins A, Smets M, et al. Comparative Analysis of Single-Cell RNA Sequencing Methods. Mol Cell. 2017;65(4):631–43e4.
Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;161(5):1187–201.
Lake BB, Chen S, Sos BC, Fan J, Kaeser GE, Yung YC, et al. Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nat Biotechnol. 2018;36(1):70–80.
Zhao P, Xu Z, Chen J, Ren Y, King I. Single Cell Self-Paced Clustering with Transcriptome Sequencing Data. Int J Mol Sci. 2022;23(7):3900. 10.3390/ijms23073900.
Zhu X, Wolfgruber TK, Tasato A, Arisdakessian C, Garmire DG, Garmire LX. Granatum: a graphical single-cell RNA-Seq analysis pipeline for genomics scientists. Genome Med. 2017;9(1):108. 10.1186/s13073-017-0492-3.
Li WV, Li JJ. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat Commun. 2018;9(1):997. 10.1038/s41467-018-03405-7.
Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics. 2000;16(10):906–14. 10.1093/bioinformatics/16.10.906.
Kim B-H, Yu K, Lee PCW. Cancer classification of single-cell gene expression data by neural network. Bioinformatics. 2020;36(5):1360–6. 10.1093/bioinformatics/btz772.
Denisenko E, Guo BB, Jones M, Hou R, de Kock L, Lassmann T, et al. Systematic assessment of tissue dissociation and storage biases in single-cell and single-nucleus RNA-seq workflows. Genome Biol. 2020;21(1):130.
Deleersnijder D, Callemeyn J, Arijs I, Naesens M, Van Craenenbroeck AH, Lambrechts D, et al. Current Methodological Challenges of Single-Cell and Single-Nucleus RNA-Sequencing in Glomerular Diseases. J Am Soc Nephrol. 2021;32(8):1838–52.
Habib N, Avraham-Davidi I, Basu A, Burks T, Shekhar K, Hofree M, et al. Massively parallel single-nucleus RNA-seq with DroNc-seq. Nat Methods. 2017;14(10):955–8.
Wu H, Kirita Y, Donnelly EL, Humphreys BD. Advantages of Single-Nucleus over Single-Cell RNA Sequencing of Adult Kidney: Rare Cell Types and Novel Cell States Revealed in Fibrosis. J Am Soc Nephrol. 2019;30(1):23–32.
Ong E, Wang LL, Schaub J, O'Toole JF, Steck B, Rosenberg AZ, et al. Modeling kidney disease using ontology: insights from the Kidney Precision Medicine Project. Nat Rev Nephrol. 2020;16(11):686–96.
Börner K, Teichmann SA, Quardokus EM, Gee JC, Browne K, Osumi-Sutherland D, et al. Anatomical structures, cell types and biomarkers of the Human Reference Atlas. Nat Cell Biol. 2021;23(11):1117–28.
Hao Y, Hao S, Andersen-Nissen E, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573–3587e29. 10.1016/j.cell.2021.04.048.
Tabula Sapiens Consortium*, Jones RC, Karkanias J, et al. The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science. 2022;376(6594):eabl4896. 10.1126/science.abl4896.
Osumi-Sutherland D, Xu C, Keays M, Levine AP, Kharchenko PV, Regev A, et al. Cell type ontologies of the Human Cell Atlas. Nat Cell Biol. 2021;23(11):1129–35.
Cui H, Wang C, Maan H, Wang B. scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI. bioRxiv. 10.1101/2023.04.30.538439. Preprint.
Hansen J, Sealfon R, Menon R, et al. A reference tissue atlas for the human kidney. Sci Adv. 2022;8(23):eabn4965. 10.1126/sciadv.abn4965.
Obradovic A, Chowdhury N, Haake SM, Ager C, Wang V, Vlahos L, et al. Single-cell protein activity analysis identifies recurrence-associated renal tumor macrophages. Cell. 2021;184(11):2988–3005e16.
Conway BR, O'Sullivan ED, Cairns C, O'Sullivan J, Simpson DJ, Salzano A, et al. Kidney Single-Cell Atlas Reveals Myeloid Heterogeneity in Progression and Regression of Kidney Disease. J Am Soc Nephrol. 2020;31(12):2833–54.
Fu J, Akat KM, Sun Z, Zhang W, Schlondorff D, Liu Z, et al. Single-Cell RNA Profiling of Glomerular Cells Shows Dynamic Changes in Experimental Diabetic Kidney Disease. J Am Soc Nephrol. 2019;30(4):533–45.
Lake BB, Menon R, Winfree S, et al. An atlas of healthy and injured cell states and niches in the human kidney. Nature. 2023;619(7970):585–94. 10.1038/s41586-023-05769-3.
Madapoosi S. (2023). Automatic Identification of Kidney Cell Types in scRNA-seq and snRNA-seq Data Using Machine Learning Algorithms - Datasets [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7810913.
Yu Z, Lessonskit. 2019. Available from: https://github.com/lessonskit/Single-cell-RNA-sequencing-of-human-kidney.
Young MD. constantAmateur. 2018. Available from: https://github.com/constantAmateur/scKidneyTumors.
Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19(1):15.
Hao Y, Hao S, Andersen-Nissen E, Mauck WM, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573–87e29.
Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol. 2015;33(5):495–502.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12:2825–30.

Supplementary Files

Additional File 1 (Supplemental_Figures.docx): Figures S1-S6.
Additional File 2 (Table_S1_Distribution_of_Cells_by_Study.xlsx): Table S1. Distribution of samples and 57,864 renal cells used in our analyses, following SVM-based exclusion of low-quality cells.
Additional File 3 (Table_S2_Confusion_Matrices.xlsx): Table S2. Confusion matrices comparing predicted cell type class labels with actual cell type class labels from each of the testing datasets. Each sheet has one confusion matrix and is labeled as “[Algorithm]_[Testing Dataset]”.
Additional File 4 (Table_S3_Model_Comparisons.xlsx): Table S3. Median F1 scores, unknown percentages, and p-values from Kruskal-Wallis tests comparing F1 scores and unknown rejection rates across all five algorithms for each testing dataset.
