The code for genomic analyses in this paper is available at https://github.com/tgen/banovichlab/
scRNA-seq samples. scRNA-seq data were obtained from published data with samples in the “VUMC/TGen” dataset from Habermann et. al. (2020) (GEO accession GSE135893), samples in the “Yale/BWH” dataset came from Adams et. al. (2020) (GEO accession number GSE136831), samples in the “Pittsburg” dataset from Morse et. al. (2019) (GEO accession GSE128033) and samples in the “Northwestern” dataset from Reyfman et al. (2019) (GEO accession GSE122960) (Supplementary Table 1, 2). Additionally, there are 39 unpublished scRNA-seq samples in the “VUMC/TGen” dataset that were collected under Vanderbilt IRB #’s 060165, 171657 and Western IRB # 20181836.
scRNA-seq data processing. Seurat v3.1.5 package 74 was used to process the scRNA-seq data. Specifically, for the Pittsburgh, Northwestern datasets and 39 unpublished samples from the VUMC/TGen, the CellRanger (10X Genomics) output files were read into Seurat using the function Read10X, the remaining datasets were already in Seurat format and were loaded using the function readRDS. To eliminate low-quality/dying cells or empty droplets, we removed any cells containing fewer than 1,000 genes or more than 25% mitochondrial genes. Due to the large size of the joint dataset, we performed SCTransform75 for normalization and scaling of each dataset separately, split into four major cell populations using known markers: EPCAM+ (Epithelial), PECAM1 + PTPRC - (Endothelial), PTPRC + (Immune) and EPCAM- PECAM- PTPRC- (Mesenchymal). Each population from the four datasets was then merged together to generate four merged Seurat objects (Endothelial, Epithelial, Immune and Mesenchymal). Next, each object was SCTransformed with “dataset” being used as batch_var to eliminate batch effects between datasets. Cell type annotation was manually performed on each object using known cell-type specific markers (Supplementary Fig. 1, 32). For each cell population, cell type annotation was performed at four levels, ranging from the most general to more granular annotation. After removing doublet cells, all four population datasets were merged to generate the final ILD object containing a total of 605,904 cells and 38 distinct cell types (Supplementary Table 2, Supplementary Fig. 1).
Integrated analysis of joint dataset. To calculate the percentage of single positive or double positive cells for ACE2 and other cofactors, we counted the number of cells with > 0 transcripts of corresponding genes. For double positive, cells have > 0 transcripts of both genes of interest. Tukey Honest Significant Difference (Tukey_HSD) statistical test from the R package rstatix with a confidence level of 0.95 was used to test statistical dependence of cells expressing the SARS-CoV-2 mediators among chronic disease subsets.
To assess the expression profile of SARS-CoV-2 mediators (ACE2, BSG, NRP1, HSPA5), the corresponding proteases (TMPRSS2, CTSL, FURIN) and other candidate genes involved in SARS-CoV-2 infection in different chronic lung disease subset (COPD, IPF or Other-ILD), we ran the function FindMarkers in Seurat package using the negative binomial test. To account for batch effects, we used the parameter “latent_vars” to incorporate the four variables (Age, Ethnicity, Smoking status and Dataset) into the negative binomial model. For the binary heatmap, the differential expression analysis was performed between the Disease (including all chronic disease subset) and Control samples. Then, log2 fold-change was converted into 0 (downregulated in the disease samples) or 1 (upregulated in the disease samples) regardless of the Bonferroni adjusted p-values. Heatmaps were generated from the adjusted log2FC values using the heatmap.2 function of the gplots R package 76, and exact binomial statistics tests were performed to test whether the selected genes are upregulated in the disease samples across cell types. For the boxplots, count numbers of selected genes were plotted using the geom_boxplot and geom_jitter function of the ggplot2 R package 77. Significant differences in gene expression in the boxplots were the Bonferroni adjusted p-values calculated from the FindMarkers function between Control and Disease groups (COPD, IPF, Other ILD) using the fitted negative binomial model and latent_vars parameters as described above.
Gene module score. To calculate the combined expression of genes, we used the AddModuleScore in Seurat v3.1.5 package. SARS-CoV-2 entry gene scores were calculated on SARS-CoV-2 receptors and mediators: ACE2, BSG (CD147), NRP1, HSPA5(GRP78), TMPRSS2, CTSL, FURIN and ADAM17. HLA type II score includes HLA-DRA, HLA-DQA1, HLA-DQA2, HLA-DPA1, HLA-DRB1, HLA-DPB1, HLA-DQB2, HLA-DRB5, HLA-DQB1, HLA-DMA, HLA-DMB. IFN score includes ISG15, IFI44, IFI27, CXCL10, RSAD2, IFIT1, IFI44L, CCL8, XAF1, GBP1, IRF7, CEACAM1. IL6 scores were calculated on six tocilizumab responsive genes: ARID5A, SOCS3, PIM1, BCL3, BATF, MYC that are associated with the IL-6 pathway 56. Cytotoxicity associated genes include PRF1, GZMH, IFNG, NKG7, KLRG1, PRF1 56 and GNLY, GZMB, GZMK 78. Exhaustion gene set: LAG3, TIGIT, PDCD1, CTLA4, HAVCR2, TOX 58 and PRDM1, MAF 56. Significant differences between different groups were calculated using the Tukey_HSD statistic test in the R package rstatix with a confidence level of 0.95 (Supplementary Table 5).
Gene correlation analysis. To identify genes that correlate with ACE2 in the AT2 ACE2 + cells, we performed Spearman correlation coefficient analysis on the log-transformed and normalized data using the function cor.test in the R stats v3.6.1 package with Benjamini-Hochberg corrections for p_adjusted values.
Immunohistochemistry of ACE2 and anti-αvβ6 integrin. Formalin-fixed paraffin-embedded histological sections of human lung were cut at 5-microns and dewaxed in xylene prior to rehydration in decreasing concentrations of ethanol. The tissue samples were obtained after informed consent and local ethics approval (South East Scotland SAHSC Bioresource-reference number 06/S1101/41; Brompton Node samples - reference number 15/SC/0101; Papworth Node Samples; non-diseased controls- reference number (Q)GM030404 and Nottingham BRC samples- reference number 08/H0407/1). IHC staining was performed using the Novocastra Novolink™ Polymer Detection Systems kit (Code: RE7280-K, Leica, Biosystems, Newcastle, UK) as previously described 79. Heat-induced citrate antigen retrieval (pH 6.0) and pepsin antigen retrieval was performed for Rabbit monoclonal ACE2 (ab108252, EPR4435(2) Abcam, UK) and the anti-αvβ6 integrin antibody (6.2A1; Biogen, Cambridge, MA, USA), respectively. Rabbit monoclonal ACE2 (1:400) and anti-αvβ6 integrin (1:3000) was diluted in Leica antibody diluent (RE AR9352, Leica, Biosystems, UK) and incubated with the sections overnight at 40C. Novolink DAB substrate buffer plus was used as the chromogen and the slides were counterstained with Novolink haematoxylin for 6 min, dehydrated and cover slipped. A negative control without the application of the primary antibody, and was also used to ensure staining was only related to the presence of the antibody.
The immunohistochemically stained slides were scanned using a ScanScope XT Slide Scanner (Leica Aperio Technologies, Vista, CA, USA) under 20x objective magnification (0.5 µm resolution) using Pannoramic Viewer (3DHISTECH Ltd Budapest, Hungary) slide viewing software. Both the percentage of staining and staining intensity of ACE2 expression in lung sections were individually assessed. For ACE2 quantification, the following scoring system of six high-power fields at X40. per tissue section were used:
The coding was performed prior to scoring/analysis as: 0-Negative; 1 − 0–⩽10%; 2–11–⩽25%; 3-⩽26%. Statistical analyses were completed using GraphPad Prism 7.0 (GraphPad Software, San Diego, CA, USA). One-way analysis of variance was used for comparison of more than two data sets and significant differences between diagnosis groups were calculated using Tukey HSD test.