Atlas preprocessing
Each atlas is preprocessed using the same steps to prepare it for use in the ArchMap machine learning pipeline. First, the atlas cells and genes are filtered to conform with the attributes of the respective trained model. Then, the atlas is minified. In particular, the latent posterior parameters are calculated and stored in the .obsm attribute of the AnnData object. This step is done before any mapping to save computing the parameters during a query mapping, reducing compute time drastically. Additionally, the count matrix is removed from the atlas AnnData object before mapping to further reduce compute time and the launch time of the CellxGene instance. The combined reference and query count matrix is then re-added to the h5ad file containing the final joint results for download.
Query requirements
All query requirements are outlined on the query upload page under the heading “Query Upload Requirements!!” when creating a mapping on the website. The query data file needs to fulfil certain requirements upon upload by the user. The file should be of h5ad format. Furthermore, the batch/study key of the data is mandatory and should be labelled as “batch”. This is done by creating a new column in the obs data frame of the h5ad file. Users can make use of the colab notebook here, and referenced on the websites and in the docs to add the batch key to their h5ad file.
Furthermore, the size of the query has a limit of 50 000 cells due to limitations in server capacity. If your query exceeds this limit, you can split your query into multiple batches and create separate mapping projects. The resulting h5ad files can then be concatenated post-mapping. This procedure will not harm the mapping performance (Extended Data Fig. 3); however, it is important that all cells with the same batch label remain in the same query subset upon splitting. You can use the following linked colab notebooks to separate and concatenate your query and results. These links are also referenced on the website and in the docs.
Users should also ensure that raw expression counts are saved in .X of the query AnnData object, unless otherwise specified, and that either ensemble IDs or gene names are stored in the var_names AnnData object attribute of your query. This is dependent on the choice of atlas. Some atlases require ensemble IDs, while others require gene names. The requirement on which label should be stored in your query is given on the query upload page of the website.
During preprocessing in the ArchMap pipeline, query genes of the uploaded query are subsetted to match the genes of the atlas and for any missing genes, expression values are padded with zeros. This may lead to inaccuracy in results based on the number of missing genes in the query. The user will be made aware of this on the website during query processing. Specifically, the user will receive a message on the frontend that communicates the percentage of shared genes between the query and reference. The user can then use this information to scrutinise the accuracy of the resulting mapping. We recommend that mappings where more than 20% of the reference genes are missing in the query should be treated with caution.
Mapping procedure
The mapping procedure in ArchMap is outlined as follows: during the query upload, the batch key of the query is checked to make sure that it is labelled as “batch”. If not, a message is returned to the user stating that the batch key should be renamed to “batch”.
Next, the query data is preprocessed based on the model that was used for the atlas integration. For the case of scVI and scANVI, the prepare_query_anndata method for the respective model is run to add zero padding for missing genes in the query and to sort the genes to match the atlas. Subsequently, the setup_anndata method is used to map between the data fields used by the model and their locations in the query anndata. These methods prepare the query to be loaded to the model using the method load_query_data. For the case of scPoli, the preprocessing steps mentioned above for scVI and scANVI are incorporated into scPoli’s implementation of the load_query_data method.
After the query is loaded and the model is initialised with the new data, the model is trained using 10 epochs to update the reference model with the query data and obtain a joint embedding. The query latent embedding is then calculated by feeding the input query data through the updated model encoder. The query embedding is then combined with the pre-computed reference embedding and stored in the obsm anndata attribute of the joint anndata object under the key name “latent_rep”. Before combining the query with the reference, we label the query and reference cells as either “query” or “reference” in the “type” key of the obs anndata attribute to be used for visualisation in CellxGene.
Visualisation and data download
CellxGene Annotate is used to visualise the query mapping. Users can colour their mapped query by the label transfer results saved under the category on the left hand side that ends with “prediction_knn” or “prediction_xgboost” (depending on the classifier that was chosen by the user).
Before a CellxGene instance is created for visualisation of the mapping results, the joint reference and query cells are subsetted to speed up computation of the UMAP and to guarantee conformity to CellxGene cell number requirements. The user has the ability to download the full mapping; however, the neighbour graph will need to be calculated by the user for the full data if visualisation is desired on the downloaded data. A colab notebook has been prepared here which can be used by the user to calculate the neighbour graph of their results.
Label Transfer and Uncertainty Quantification
The XGBoost and KNN pre-trained classifiers were trained on each atlas and evaluated using an 80/20 split strategy. Comparing the F1 scores for the KNN, XGBoost and native (either scPoli or scANVI dependent on the model used to integrate the atlas) classifiers, KNN performed the best for all annotation levels of all atlases.
If the pre-trained classifier chosen by the user is KNN or XGBoost, the pre-trained classifier and label encoding for the user-specified atlas is downloaded from GCP. If the native scANVI or scPoli classifier is chosen, the respective instantiated model that has been updated on the query data is used for the classification.
ArchMap uses Euclidean and Mahalanobis uncertainty to quantify the label transfer uncertainty. The Euclidean uncertainty calculated for a cell is based on the Euclidean distance between that cell and its 15 nearest neighbours. The Mahalanobis uncertainty is based on the Mahalanobis distance between a query cell and all the centroids obtained from a K means clustering of the reference embedding, where K is equal to the number of cell types. Mahalanobis uncertainty accounts for potential correlation between the latent features of the cells. This is done by rescaling the latent features such that the covariance between any two correlated latent features is removed. The Euclidean distance is then used to calculate the distances between each query cell and the centroids of each cell type cluster.
Performance metrics
During processing, performance metrics to quantify the quality of the mapping are calculated. These metrics include the percentage of reference genes in the query data, the percentage of cells labelled “Unknown”, the percentage of query cells with anchors, and the cluster preservation score. For the percentage of reference genes in the query data, a value less than 85% may contribute to poor mapping quality. The percentage of cells that are labelled as "Unknown" during cell type label transfer is based on the Euclidean uncertainty score for each query cell. A query cell is classified as having an "Unknown" cell type if its Euclidean uncertainty score is above 0.5. A value of 0.5 is used here to take into account query datasets where cells of the same type may look different from the reference atlas (e.g. when the query data consists of cells from diseased tissue.
The percentage of query cells with anchors represents query cells that have at least one mutual nearest neighbour among the cells of the reference dataset. These mutual nearest neighbours are termed anchors. A lower percentage of query cells with anchors may not necessarily mean poor mapping quality. A lower score may be returned for query datasets from the same tissue as the reference, but with a large proportion of cells from diseased donors. For example, mapping a dataset consisting of samples from healthy tissue25 to the HLCA gives a percentage of query cells with anchors of 63%. However, mapping a dataset consisting of samples from diseased tissue24 to the HLCA gives a percentage of query cells with anchors of 54%.
The cluster preservation score assesses how well Leiden clustering of the query dataset is preserved after the mapping process. The mean entropy of cluster labels within each neighbourhood is computed for both the original and integrated query. The median of the differences in mean entropy between the original and integrated query is then calculated. Scores are scaled between 0 and 5 with 5 being the best.
Presence score
The presence score is based on the weighted shared nearest neighbours between the reference and query cells. A high presence score for a reference cell indicates a high frequency and likelihood of such a cell state or one similar being in the query. This complements the uncertainty scores given to the query cells.
Query mapping onto HLCA
ArchMap can be used for the identification and validation of disease-specific cell states. To show this we mapped a single-cell dataset of human samples from healthy donors and patients with idiopathic pulmonary fibrosis to the Human Lung Cell Atlas reference and used the resulting uncertainty scores calculated in the ArchMap server to identify potential disease-associated cell states and analyse marker genes.
The percentage of reference genes in the query data is 96.85%. The percentage of cells labelled “Unknown” is 8.99% and the percentage of query cells with anchors is 53.96%. The lower percentage of anchors may be a result of the query data containing a large portion of cells from diseased donors, while the reference only contains cells from healthy donors. The cluster preservation score is 5 out of 5, implying perfect preservation of clusters in the query.
Once the user’s query data has been mapped onto the reference using the ArchMap server, the mapping can be visualized in CellxGene. The embedding can be coloured based on query vs. reference and based on the cell type label transfer predictions. To quantify the uncertainty of the cell type label transfer, the Euclidean uncertainty is used. Colouring by this uncertainty metric, the user can identify clusters of interest that warrant further analysis. Adventitial fibroblasts can be seen as one of the cell type clusters with the most varied uncertainty scores. We expect that clusters with both low and high uncertainty contain both healthy and diseased cells.
Upon filtering, we split the cluster of interest into two groups: (1) all reference cells and all query cells with uncertainty scores less than 0.4 and (2) all query cells with uncertainty scores greater or equal to 0.4. Investigating the correlation between the uncertainty scores and the ground truth cell labels of healthy and diseased, we find a positive correlation between higher uncertainty scores and cells from diseased donors. This suggests that the uncertainty scores can be used as a starting point to identify disease-associated cell states (Fig. 1d).
Differential expression analysis can be performed in CellxGene to compare low and high uncertainty groups. In Fig. 1d, the embedding of adventitial fibroblasts are coloured by the gene expression of CTHRC1 and SERPINF1, two genes from the top 40 most differentially expressed genes for the high and low uncertainty group, respectively.
Atlas Upload Revision
To support community uploads of reference atlases, a platform user must first be an admin. In order to become an admin, the user must reach out to ArchMap’s development team. This can be done by checking the box “Request for Beta feature” upun registration to ArchMap. Once approved, the user can start contributing by providing additional atlases. A user-uploaded atlas must first go through a revision process before it is publicly available on the website. The revision process consists of the following evaluation steps: (1) benchmarking the user uploaded model used for integration with scVI, scPoli with prototype learning, scPoli without prototype learning, and PCA (no batch effect removal). Benchmarking is done using scIB4 (2) Comparing the KNN classifier F1 scores between classifiers trained on the latent representations taken from the user integrated model, scVI, scPoli with prototype learning, scPoli without prototype learning and PCA.
Framework Architecture
Archmap’s backend is written in Express.js, running on a node.js server. The machine learning pipeline is a stand-alone service, primarily based on the scArches package for reference mapping. The pipeline is written in Python.
The backend as well as the machine learning pipeline are hosted on GCP (Google Cloud Platform). The backend is primarily running on Google App Engine. Google Cloud Storage is used to store persistent data such as project mapping results and annotation files. The machine learning pipeline as well as ephemeral CellxGene visualisation instances are run on Google Cloud Run.
The code for the frontend and backend implementation of ArchMap is available at theislab/archmap (github.com).
The ArchMap website is available at https://www.archmap.bio.