Retrospective Clinical Study
Cohort and Study Description
In this observational cohort study, we used a data warehouse derived from electronic health records (EHRs) from 11,116 patients treated at New York-Presbyterian/Columbia University Irving Medical Center for suspected cases of SARS-CoV-2 infection. For these patients we collected contemporary data from their current encounter (i.e. the encounter associated with their suspected SARS-CoV-2 infection) as well as historical data, if available, from their previous encounters. Contemporary data (data collected between February 1, 2020 and April 12, 2020) included insurance billing information, laboratory measurements, procedures, and SARS-CoV-2 diagnostic test results. These data were derived from the data warehouse tables in Epic. 6,927 patients have historical data (data collected prior to September 24, 2019) available from an OMOP v5 instance stored using MySQL, which included all of the standard tables for recording condition, procedure, medication, and measurement data (among others). Of these we used the insurance billing information from the condition occurrence table and demographics from the person table. See Preparation of data for modeling for further details on data preparation.
We used the contemporary data to define inclusion criteria and outcomes (requiring mechanical respiration and mortality) and used historical data to define patient comorbidities. We defined three hypothesized comorbidity covariates, macular degeneration, complement deficiency disorders, and disorders of coagulation. We used historical data to define these comorbidities, age, and sex. We did not include race and ethnicity data in the modeling as we have previously found issues with the data quality20. The race/ethnicity data we do have is included in the tables for reference. We also modeled other comorbidities previously associated with morbidity and mortality (Zhou et al and others), including history of cardiovascular disease, hypertension, obesity, and diabetes (Table 1, Table S1) -- all derived from the historical data. Coded covariate definitions, as well as lists of which diagnosis codes are most common in each group, are available in the supplemental materials and methods. We used established institutional procedures and an institutional clinical data warehouse to extract all data from the EHR.
Defining patient outcomes
Outcome definitions were defined by data derived from the electronic health record between February 1, 2020 and April 12, 2020. Mortality is derived from a death note filed by a resident or primary provider that records the date and time of death. Intubation was used as an intermediary endpoint and is a proxy for a patient requiring mechanical respiration. We used note types that were developed for patients with SARS-CoV-2 infection to record that this procedure was completed. We validated outcome data derived from notes against the patient’s medical record using manual review.
Ethics and Data Governance Approval
The study is approved by the Columbia University Irving Medical Center Institutional Review Board (IRB# AAAL0601) and the requirement for an informed consent was waived. A data request associated with this protocol was submitted to the Tri-Institutional Request Assessment Committee (TRAC) of New-York Presbyterian, Columbia, and Cornell and approved. The research on the UK Biobank data has been conducted using the UK Biobank Resource under Application Number 41039.
Preparation of data for modeling
We used MySQL and python libraries (pymysql, pandas) to extract and prepare the data for modeling. The code for data preparation is available in the github (https://github.com/tatonetti-lab/complementcovid) as a Jupyter Notebook titled Data Setup. We begin by creating a master list of suspected covid patients. These are patients that are either diagnosed with the disease, as indicated by a ICD10 code for SARS-CoV-2 infection, in their billing data or a patient that was tested for the presence of the virus using RT-PCR as indicated by a “lab” order for the test. We found 2,821 using the former method and 11,116 patients using the latter. We then extracted birthdates, death dates (if the patient had died or a null value otherwise), and sex codes (1 for female, 2 for male). Patients which had sex codes for non-binary genders were excluded from our analysis. We then define a “first diagnosis date” for each patient as either their first diagnosis date (by billing code) or the first date that they tested positive for SARS-CoV-2, whichever comes first. Next, we calculate each patient’s age at the time of this “first diagnosis date.” Each of the outcomes and covariates are extracted from their respective tables as detailed in the github. Whenever possible, we use the highest-level ancestor code (from the structured vocabulary in OMOP) that represents the concept we want to model. We then use the concept ancestor tables to grab all the descendant codes. Note that diabetic kidney disease was considered for inclusion and so is represented in the data preparation script, however, it was never modeled. Cough is included as a covariate as a reference symptom for comparison. The last step in the preparation process was to compute the censor dates. To do, we iterated through each patient in our master list and computed their time (in days) to intubation (if they required mechanical respiration) or death (if they died). If not, then the study end date (April 25, 2020) was used as the patient’s censored time (in days). Finally, for any patients that were not SARS-CoV-2 positive, their time-to-event values were set to a null indicator to be dropped from the dataset later. Finally, the data are all combined in a pandas (version 1.0.3) dataframe and saved to disk as a pickle file for efficient loading.
Statistical Model
Our patient timelines may be censored since our study cohort included patients that were being treated at the time of analysis. We performed survival analysis on the intubation orders and death using a Cox proportional-hazards model and visualized the risk using Kaplan-Meier curves using the lifelines python package (version 0.24.4). Error estimates on the Kaplan-Meier curves are estimated using Greenwood’s Exponential Formula21. We fit both univariate models and models fit on the covariate, age, and sex and used log-likelihood to assess significance. We reported Cox proportional hazards coefficients and their 95% confidence intervals (Table 1). We modeled whether or not a patient had macular degeneration, a complement deficiency disorder, or a coagulation disorder as binary variables (1=yes, 0=no). Code definitions provided in Table S1. We also included other significant comorbidities suggested by previous studies, CAD, hypertension, T2DM, or obesity as binary variables (1=yes, 0=no), sex as a binary variable (0=female, 1=male), age as quantitative variable, older age (65+), and outcome as a binary variable (1=yes, 0=no). The outcome of interest was coded as 0 until the day it occurred (the date of the first intubation order following admission or the death date) or the date of analysis, whichever occurred first. Survival curves are generated for the indicated variables by setting all other variables to their respected averages within the training data. Note that we dropped patients who experienced the outcome before their initial diagnosis. This is either due to patients being hospitalized prior to infection (in the case of intubation) or errors in the coded data. We dropped 121 patients for intubation prior to infection and 12 patients for prior death. We also restricted the study to 90 days from the start date. One patient was removed for having an event outside of this range.
Covariate Correlations
Using the data prepared as discussed above, we computed pairwise statistical correlations between age, sex as well as history of macular degeneration, complement deficiency disorders, coagulation disorders, HTN, T2DM, obesity, and CAD. We computed them using data from all suspected patients (tested both positive and negative) as well as only those patients who tested positive. We chose spearman rho as our measure of correlation.
Statistical Software
Models were generated each day that data was available beginning on March 23rd, 2019 with data from patients available through that day. We used Jupyter Notebooks (jupyter-client version 5.3.4 and jupyter-core version 4.6.1) running Python 3.7 and all fit models using the python lifelines package (version 0.24.4).
UK Biobank Genetic Analysis
Data Source
UK Biobank subjects that were of White British descent, in the UK Biobank PCA calculations and therefore without 3rd degree and above relatedness and without aneuploidy, were used in this study, totaling 337,147 subjects (181,032 females and 156,115 males) (Bycroft 2018). Of the nearly 500,000 participants, approximately 50,000 subjects were genotyped on the UK BiLEVE Array by Affymetrix while the rest were genotyped using the Applied Biosystems UK Biobank Axiom Array, with over 800,000 markers using build GRCh37 (hg19). The arrays share 95% marker coverage. We extracted markers with a minor allele frequency greater than 0.005, INFO score greater than 0.3, and Hardy-Weinberg equilibrium test mid-p value greater than 10-10 using PLINK222. UKBB version 3 Imputation combined the Haplotype Research Consortium with the UK10K haplotype resource using the software IMPUTE4 (UK Biobank White paper). Association analyses were performed using a logistic regression model with additive gene dosage and covariates including age at 2018, sex, first 10 principal components (provided by the UK Biobank), and the genotyping array the sample was carried out on. We adjusted for multiple testing with FDR-corrected p-values using the Benjamini-Hochberg method.
Genetic Association Studies
We performed three study-wide association analyses: (i) comparing variants for SARS-CoV-2 positive patients against the entire population of 337,147 subjects, (ii) comparing positive patients who required hospitalization against the entire population, and (iii) comparing patients who tested negative versus the entire population.
Targeted Gene Set Definition
We identified a set of 69 genes relating to the complement and coagulation cascades from the Kyoto Encyclopedia of Genes and Genomes (KEGG accession id: hsa04610). For each gene, we used the transcriptional start and stop site from the hg19 build of the human genome to define a catchment window of 60kbp. From the 805,426 variants profiled in the UK Biobank genotyping data after quality control, 4,248 variants within the transcribed region of the genes of interest or within 60kbp flanking the transcribed region. After applying additional QC filters using PLINK2 (see Data Source above), 2,097 SNPs remained for analysis. We calculated counts for each variant for each of our groups of interest listed in Genetic Association Studies above.
SNP Set Empirical Statistical Evaluation
To assess the probability of getting 10 study-wide significant hits (using BH corrected p-value < 0.05), we used empirical sampling to generate 100 sets of randomly chosen SNPs. In each sample, 69 genes were chosen at random from the genome and mapped to nearby SNPs (within a 60kbp flanking region), resulting in sets of SNPs sized 1712 to 2945 – similar to the 2097 that resulted from our complement and coagulation cascade set. We then repeated the association study and counted the number of significant hits (using BH corrected p-value < 0.05). We fit the empirical data to a Poisson distribution and used the derived lambda to compute p-values for our observations of 10, 4, and 1 hit (corresponding to the number of significant results from our severe analysis, positive analysis, and negative, respectively). We performed a chi-squared goodness-of-fit test to verify the data were consistent with a Poisson.
Software
We used PLINK v2.00a2LM 64-bit Intel (26 Aug 2019) to run the genetic association analysis.