Data source
Our study uses de-identified United States medical claims records from Change Healthcare collected from April 1, 2018, to December 31, 2020, encompassing over 50 million records from over 2 million patients. Every claims record contains information about the diagnoses (in the form of ICD-10 codes), the procedures performed, and the drugs prescribed. The claims in our dataset are primarily open claims, together with a subset of closed payer claims that are normalized for analytics purposes and provide sound directional insight for this study. The open claims are derived from broad-based healthcare sources and consist of all the medical claims that Change Healthcare processes and has the rights to use. The closed claims are derived from the payer and capture nearly all events that occur during the patient’s enrollment period. Roughly 95% of the claims used for this study are commercial and 5% are Medicare Advantage or other types of plans.
In addition to medical claims, we use patient-level social determinants of health (SDOH) data from Change Healthcare. The SDOH attributes included in this study are: i) race, ii) gender, iii) age, iv) income, v) education level, and vi) veteran status. Of these attributes, gender and age are obtained from patient claims. SDOH data (other than gender and age) are available for 43.91% of the individuals in the data.
Study population
Our dataset includes all COVID-19 positive patients, identified by the ICD-10 diagnosis codes U07.1 (COVID-19, virus identified, lab confirmed) or U07.2 (COVID-19, virus not identified but clinically diagnosed) as the principal diagnosis. We defined a subject’s index date as the date of the SARS-CoV-2 diagnosis and only included patients whose index date was between March 1, 2020, and September 30, 2020. For these patients, we had claims data available from April 1, 2018, to December 31, 2020. The total size of our study population was 2.7 million, reduced to 1.37 million after discarding records with missing fields. Of this group, we possess supplementary SDOH data for 602,025 patients. Henceforth, we refer to the cohort of patients for whom we possess the SDOH data as the ‘SDOH cohort’ and to the other patients as the ‘non-SDOH cohort’. The non-SDOH cohort is used first to define the long-term effects of interest, as described in the statistical analysis section. We then test the association of certain long-term effect outcomes with the SDOH variables using the SDOH cohort. The descriptive statistics of both cohorts can be found in Table 1. The populations are qualitatively similar in terms of age and gender.
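The inclusion rule above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the field names `patient_id`, `principal_dx`, and `service_date` are hypothetical, and taking the earliest qualifying claim as the index date is an assumption, since the text does not say how multiple COVID-19 claims for one patient were handled.

```python
from datetime import date

COVID_CODES = {"U07.1", "U07.2"}   # principal-diagnosis codes named in the text
WINDOW_START = date(2020, 3, 1)    # earliest admissible index date
WINDOW_END = date(2020, 9, 30)     # latest admissible index date


def index_dates(claims):
    """Map patient_id -> index date for patients who meet the inclusion rule.

    `claims` is an iterable of dicts with the hypothetical keys
    'patient_id', 'principal_dx' and 'service_date'.
    Assumption: the index date is the earliest claim whose principal
    diagnosis is a COVID-19 code.
    """
    first_dx = {}
    for claim in claims:
        if claim["principal_dx"] in COVID_CODES:
            pid, day = claim["patient_id"], claim["service_date"]
            if pid not in first_dx or day < first_dx[pid]:
                first_dx[pid] = day
    # Keep only patients whose index date falls inside the study window.
    return {pid: day for pid, day in first_dx.items()
            if WINDOW_START <= day <= WINDOW_END}
```

A patient whose first COVID-19 claim falls outside the March to September 2020 window is excluded entirely under this sketch, even if a later claim falls inside it.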
Table 1
Descriptive statistics of the study cohort
| Variable | Category | SDOH fraction | non-SDOH fraction | All fraction |
| --- | --- | --- | --- | --- |
| Age | 0-20 | 0.009333996 | 0.15870638 | 0.09311697 |
| | 21-30 | 0.080847434 | 0.118384518 | 0.10190198 |
| | 31-40 | 0.120001865 | 0.116660258 | 0.11812756 |
| | 41-50 | 0.149404386 | 0.12315018 | 0.1346784 |
| | 51-60 | 0.208195612 | 0.153282147 | 0.17739465 |
| | 61-70 | 0.199574833 | 0.147831305 | 0.17055189 |
| | 71-80 | 0.14482287 | 0.109365417 | 0.12493478 |
| | 80+ | 0.087819005 | 0.072619797 | 0.07929377 |
| Gender | Female | 0.611106506 | 0.5813309 | 0.59440537 |
| | Male | 0.388893494 | 0.41866777 | 0.40559389 |
| Veteran status | Non-veteran | 0.798827767 | x | x |
| | Veteran | 0.201172233 | x | x |
| Race | Asian | 0.027929459 | x | x |
| | Black | 0.119549412 | x | x |
| | Hispanic | 0.189645049 | x | x |
| | White | 0.66287608 | x | x |
| Income | Less than $15,000 | 0.101988374 | x | x |
| | $15,000 - $19,999 | 0.071895086 | x | x |
| | $20,000 - $29,999 | 0.105519923 | x | x |
| | $30,000 - $39,999 | 0.107134593 | x | x |
| | $40,000 - $49,999 | 0.103200671 | x | x |
| | $50,000 - $74,999 | 0.201743843 | x | x |
| | $75,000 - $99,999 | 0.115691476 | x | x |
| | $100,000 - $124,999 | 0.061549115 | x | x |
| | Greater than $124,999 | 0.131276918 | x | x |
| Education | Completed High School | 0.607909937 | x | x |
| | Completed College | 0.269532168 | x | x |
| | Completed Graduate School | 0.115415918 | x | x |
| | Attended Vocational/Technical | 0.007104643 | x | x |
Study design
We utilize a self-controlled cohort design (SCCD)6 in this study. In this design, event rates during a time window after SARS-CoV-2 diagnosis are compared to event rates during a time window prior to diagnosis, where the study population is restricted to patients diagnosed with SARS-CoV-2. The outcome period is defined as beginning 2 months after the index date and continuing through January 31, 2021, the last date for which reliable claims data are present (see Supplementary figure 1). The control period is defined as the three-month period from 10 months to 7 months prior to the index date. This control period begins during the same calendar month as the outcome period, and so should reduce possible confounding by seasonal variations in incidence of events of interest.
Pre-existing comorbidities were defined based on ICD-10 codes assigned to medical encounters during the six-month period from 16 months to 10 months prior to the index date (see Supplementary figure 1). This period does not overlap with the control period, so events during the control period will not also be counted as comorbidities. The Elixhauser comorbidity index7 was used to define comorbid conditions and their corresponding ICD-10 codes9.
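The three per-patient windows described in this section (comorbidity, control, outcome) follow mechanically from the index date. The sketch below shows one way to compute them with the standard library. The helper names are hypothetical, and two details are assumptions because the text does not specify them: how month offsets are handled for days that do not exist in the target month (clamped to the month's last day here), and whether window boundaries are inclusive.

```python
import calendar
from datetime import date

DATA_END = date(2021, 1, 31)  # end of the outcome period as stated in the text


def shift_months(d, months):
    """Shift a date by a whole number of months (negative = earlier).

    Assumption: a day that does not exist in the target month
    (e.g. the 31st) is clamped to that month's last day.
    """
    total = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def study_windows(index_date):
    """Return (start, end) date pairs for each window of the design."""
    return {
        # 16 to 10 months before the index date
        "comorbidity": (shift_months(index_date, -16), shift_months(index_date, -10)),
        # 10 to 7 months before the index date
        "control": (shift_months(index_date, -10), shift_months(index_date, -7)),
        # 2 months after the index date through the end of the data
        "outcome": (shift_months(index_date, 2), DATA_END),
    }
```

For an index date of June 15, 2020, this gives a control window starting in August 2019 and an outcome window starting in August 2020, which matches the calendar-month alignment described above.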
Statistical Analysis
Identification of statistically significant ICD-10 codes that define long-term effects – Following common practice, we grouped the ICD-10 codes by their first three characters, which approximately represent high-level health conditions. Relative abundances for each condition (represented by a three-character ICD-10 code) were calculated for both the control and post-covid periods. Conditions that occurred in less than 0.01% of the post-covid population were discarded to limit the analysis to conditions that were present in a large enough population. A one-sided two-proportion z-test was performed to identify conditions that were significantly more frequent in the post-covid period than in the control period. The significance level was set to 0.05 with multiple testing correction using the Bonferroni method, for this and all subsequent analyses unless mentioned otherwise. This analysis was done on the non-SDOH cohort. However, for the purposes of validation, we also replicated the same analysis on the SDOH cohort.
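The screening procedure in this paragraph can be illustrated with a short, standard-library-only sketch. It is not the authors' code: the function names are hypothetical, the test uses the usual pooled-variance form of the two-proportion z-test, and the Bonferroni divisor is assumed to be the number of conditions that survive the 0.01% prevalence filter.

```python
import math
from collections import Counter


def one_sided_two_proportion_z(x_post, n_post, x_ctrl, n_ctrl):
    """Return (z, p) for H1: post-period proportion > control proportion."""
    p_post, p_ctrl = x_post / n_post, x_ctrl / n_ctrl
    pooled = (x_post + x_ctrl) / (n_post + n_ctrl)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_post + 1 / n_ctrl))
    if se == 0:
        return 0.0, 1.0  # degenerate case: treat as no evidence
    z = (p_post - p_ctrl) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p_value


def condition_counts(patient_codes):
    """Count patients per three-character ICD-10 category.

    `patient_codes` holds one iterable of ICD-10 codes per patient;
    each patient is counted at most once per category.
    """
    counts = Counter()
    for codes in patient_codes:
        counts.update({code[:3] for code in codes})
    return counts


def elevated_conditions(post, ctrl, n_post, n_ctrl, alpha=0.05, min_prev=1e-4):
    """Return {condition: p} for conditions significantly elevated post-covid.

    Assumption: the Bonferroni correction divides alpha by the number of
    conditions that pass the prevalence filter.
    """
    kept = [c for c, x in post.items() if x / n_post >= min_prev]
    threshold = alpha / len(kept) if kept else alpha
    results = {}
    for c in kept:
        _, p = one_sided_two_proportion_z(post[c], n_post, ctrl.get(c, 0), n_ctrl)
        if p < threshold:
            results[c] = p
    return results
```

Counting each patient once per category keeps the proportions interpretable as the fraction of patients with the condition, rather than the fraction of claims.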
Identification of month-wise long-term effects – To study the month-wise prevalence of the long-term effects that we identified, we performed the same analysis as described in the previous section on one-month-long post-covid and matched control periods, shown in Supplementary figure 1. The analysis was done for months 3, 4 and 5 post-covid. Since we had used the non-SDOH cohort to identify the long-term effects, we performed this analysis on the SDOH cohort to prevent ‘double dipping’.
Identification of co-occurring long-term effects – Identification of frequently co-occurring conditions was done using a data mining technique known as market-basket analysis or affinity analysis7. Briefly, affinity analysis identifies co-occurring items (long-term effects in our case) in the data by comparing the observed co-occurrence frequency with the expected co-occurrence frequency (assuming that the co-occurrence was purely random). We first performed affinity analysis (support \(\ge\) 0.01, lift \(\ge\) 1) on the post-covid period to identify co-occurring conditions. We then identified the relative proportion of patients who experienced each ‘basket’ of conditions in the post-covid and control periods. Finally, we performed a one-sided two-proportion z-test to identify which baskets were significantly overrepresented in the post-covid period compared to the control period. This analysis was performed on the non-SDOH cohort.
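To make the support and lift thresholds concrete, the sketch below computes both quantities for pairs of conditions. It is a simplified illustration with hypothetical names: it is restricted to pairs, whereas a full market-basket analysis (for example, an Apriori-style search) would also consider larger baskets, and the text does not say which implementation was used.

```python
from collections import Counter
from itertools import combinations


def pair_support_and_lift(baskets, min_support=0.01, min_lift=1.0):
    """Return {(a, b): (support, lift)} for pairs of co-occurring conditions.

    `baskets` holds one collection of conditions per patient.
    support(a, b) = fraction of patients with both a and b
    lift(a, b)    = support(a, b) / (support(a) * support(b))
    A lift above 1 means the pair co-occurs more often than expected
    if the two conditions were independent.
    """
    n = len(baskets)
    single = Counter()
    pair = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # de-duplicate within a patient
        single.update(items)
        pair.update(combinations(items, 2))  # every unordered pair once
    results = {}
    for (a, b), count in pair.items():
        support = count / n
        lift = support / ((single[a] / n) * (single[b] / n))
        if support >= min_support and lift >= min_lift:
            results[(a, b)] = (support, lift)
    return results
```

The proportion of patients with each retained pair in the post-covid and control periods can then be compared with the same one-sided z-test used for single conditions.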
Studying associations of SDOH variables with long-term effects – Association testing of SDOH variables with each significant long-term effect was done using a logistic regression model, which adjusted for comorbidities and presence of the same long-term conditions in the control period (prior events). The mathematical model can be expressed as:
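In generic form, a logistic model with these adjustments can be written as follows. All symbols other than \(p_m\) are assumed here for illustration and are not taken from the source: \(\mathbf{x}_{S}\) denotes the dummy-coded SDOH variables, \(\mathbf{x}_{C}\) the comorbidity indicators, and \(z_m\) an indicator for the presence of condition \(m\) in the control period.

\[
\log\frac{p_m}{1-p_m} = \beta_{0,m} + \boldsymbol{\beta}_{S,m}^{\top}\mathbf{x}_{S} + \boldsymbol{\beta}_{C,m}^{\top}\mathbf{x}_{C} + \gamma_m z_m
\]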
where \(p_m = \Pr(Y_m = 1)\) is the probability of long-term effect \(m\) occurring. Prior to performing the logistic regression, we performed feature selection using a chi-squared test of independence between each outcome and independent variable. Only variables that met a significance level of 0.05 were used in the logistic regression. However, a Bonferroni corrected p-value (correcting for \(m\) outcomes) was used to determine significant associations in the logistic regression model. The selected baseline categories were: Race: White; Education: Completed College; Income: Greater than $124,999; Gender: Male; Veteran status: Non-veteran; Age: 31-40.
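The screening step before the regression can be sketched as below. This is a deliberately reduced illustration with hypothetical names: it handles only a binary outcome against a binary variable (a 2x2 table, one degree of freedom, no continuity correction), where the chi-squared p-value has a closed form. Multi-level variables such as income or age group need the general chi-squared distribution, for which a statistics library such as SciPy (`scipy.stats.chi2_contingency`) would normally be used.

```python
import math


def chi2_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 count table.

    `table` is [[a, b], [c, d]]; returns (statistic, p_value).
    No continuity correction is applied (an assumption of this sketch).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if 0 in rows or 0 in cols:
        return 0.0, 1.0  # a margin is empty: the test is uninformative
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            stat += (observed - expected) ** 2 / expected
    # Survival function of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def screen_features(tables, alpha=0.05):
    """Keep variables whose 2x2 table against the outcome has p < alpha."""
    return [name for name, t in tables.items() if chi2_2x2(t)[1] < alpha]


def bonferroni_threshold(alpha, n_outcomes):
    """Per-test significance threshold after correcting for n_outcomes."""
    return alpha / n_outcomes
```

Note the two thresholds play different roles: the uncorrected 0.05 level decides which variables enter the model, while the Bonferroni-corrected level decides which fitted associations are reported as significant.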