Prediction performance
We developed our primary results on the INSIGHT cohort and used the OneFlorida+ cohort as a validation cohort. Both cohorts comprised patients who had at least one PCR or antigen test for SARS-CoV-2 infection from March 2020 to November 2021; the inclusion-exclusion cascade is provided in Fig. 1. The INSIGHT cohort included 35,275 adult patients with lab-confirmed SARS-CoV-2 infection and 326,126 non-infected control patients.
The current definition of PASC in the RECOVER protocols is ongoing, relapsing, or new symptoms, or other health effects, occurring four or more weeks after the acute phase of SARS-CoV-2 infection.21 We compiled a broad list of potential PASC conditions in terms of Clinical Classifications Software Refined (CCSR) categories22 based on our previous findings14 and evidence from other literature.3,4 Here we studied incident PASC conditions, i.e., conditions ascertained from 31 to 180 days after the index date (the start of the acute SARS-CoV-2 infection) that were absent in the period from three years to one week before the index date (see Methods for how the list was compiled and Supplementary Table 1 for details).
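The ascertainment window described above can be illustrated with a minimal sketch. This is not the study's implementation: the function name `is_incident`, the inclusive window boundaries, and the approximation of three years as 3 × 365 days are all assumptions made for illustration.

```python
from datetime import date, timedelta


def is_incident(index_date, diagnosis_dates):
    """Return True if a condition counts as incident for one patient.

    Assumptions (not taken from the paper): boundaries are inclusive,
    'one week' is 7 days, and 'three years' is 3 * 365 days.
    """
    # Post-acute follow-up window: 31 to 180 days after the index date.
    follow_start = index_date + timedelta(days=31)
    follow_end = index_date + timedelta(days=180)
    # Baseline window: three years to one week before the index date.
    base_start = index_date - timedelta(days=3 * 365)
    base_end = index_date - timedelta(days=7)

    in_followup = any(follow_start <= d <= follow_end for d in diagnosis_dates)
    in_baseline = any(base_start <= d <= base_end for d in diagnosis_dates)
    # Incident: recorded in follow-up and not already present at baseline.
    return in_followup and not in_baseline
```

Under these assumptions, a diagnosis recorded two months after the index date counts as incident only if the same condition has no record in the baseline window.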
We built a list of 89 covariates that are potentially associated with PASC based on a revised list of Elixhauser comorbidities, recommendations of our RECOVER clinician team, and the severity of acute SARS-CoV-2 infection. These covariates included basic demographics (e.g., age, gender, race, ethnicity), socioeconomic status in terms of the Area Deprivation Index (ADI)21, healthcare utilization history, body mass index, the period of infection, comorbidities, and the care settings in the acute phase, including hospitalization and ICU admission. For each categorical covariate, we used the same reference group as prior studies of acute SARS-CoV-2 infection (see the covariates section of Methods for details).6 We built different machine learning models to predict the individual risk of encountering each incident condition using these covariates. The prediction performance of a regularized Cox model, measured by the concordance index (C-index)23 with a 95% confidence interval, is shown in Fig. 2 (results for other machine learning models are provided in the Sensitivity analysis section).
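For readers unfamiliar with the C-index, the following is a minimal sketch of Harrell's concordance index for right-censored data, written in plain Python. It is an illustration only, not the code used in this study: the function name `concordance_index` is our own, tied event times are simply skipped, and no confidence interval is computed.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data (simplified).

    A pair (i, j) is comparable when subject i has an observed event and a
    strictly shorter time than subject j. The pair is concordant when
    subject i also has the higher predicted risk; ties in predicted risk
    count as half. Pairs with tied times are ignored (a simplification).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means the predicted risks order every comparable pair correctly, 0.5 corresponds to uninformative predictions, and values in between indicate partial discrimination.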
Figure 2 shows that different incident conditions were associated with heterogeneous predictive performance. Conditions such as dementia, malnutrition, stroke, non-specific PASC (U099/B948), kidney failure, myopathy, and pressure sores had a C-index > 0.8. Diabetes, thromboembolic disease, and COPD were moderately predictable, with a C-index > 0.7, whereas other conditions such as fatigue, anxiety disorders, and sleep disorders were less predictable, with a C-index < 0.6.
Associations between risk factors and specific PASC conditions
Furthermore, we analyzed the associations between the covariates and the risk of developing any incident condition from our list. We estimated the unadjusted hazard ratio (HR) and the fully adjusted hazard ratio (aHR) for each covariate. A covariate was identified as a potential risk factor for a particular condition if it satisfied the following three criteria: (1) its aHR with respect to the target condition was larger than 1 when compared with the reference group (Methods covariates section and Extended Data Table 1); (2) the association was statistically significant after multiple-testing correction (P < 0.000562); and (3) the associated risk was higher in the SARS-CoV-2 infected population than in the non-infected population. Criterion (3) ensures that an identified association is not a common one that also exists widely in patients without COVID-19; the technical details of its implementation are provided in Methods. Overall, among the 35,275 SARS-CoV-2 infected patients in the INSIGHT cohort, 17,571 (49.8%) had at least one incident potential PASC condition (Table 1). The associations between the covariates and the risk of at least one PASC condition are summarized in Extended Data Table 1. Figure 3 depicts the associations between the identified risk factors and specific PASC conditions, which we elaborate on below.
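The three criteria can be sketched as a simple filter. Two points here are assumptions rather than statements from the text: first, that the threshold 0.000562 corresponds to a Bonferroni correction of 0.05 across the 89 covariates (the arithmetic is consistent, since 0.05 / 89 ≈ 0.000562); second, that criterion (3) can be approximated by directly comparing the aHR in infected and non-infected patients. The actual procedure is described in Methods and may differ. The `Association` class and its field names are hypothetical.

```python
from dataclasses import dataclass

N_COVARIATES = 89
ALPHA = 0.05
# Assumed Bonferroni-style threshold: 0.05 / 89 is approximately 0.000562.
P_THRESHOLD = ALPHA / N_COVARIATES


@dataclass
class Association:
    """Hypothetical container for one covariate-condition association."""
    covariate: str
    condition: str
    ahr_infected: float      # aHR among SARS-CoV-2 infected patients
    p_value: float           # P value of the association in infected patients
    ahr_noninfected: float   # aHR among non-infected control patients


def is_potential_risk_factor(a, p_threshold=P_THRESHOLD):
    """Apply the three screening criteria to a single association."""
    elevated = a.ahr_infected > 1.0                     # criterion (1)
    significant = a.p_value < p_threshold               # criterion (2)
    excess_risk = a.ahr_infected > a.ahr_noninfected    # criterion (3), simplified
    return elevated and significant and excess_risk
```

Dropping the third check reproduces the relaxed screening that uses only the hazard ratio and significance criteria.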
The severity of acute infection. Increased severity of the acute SARS-CoV-2 infection (according to the care settings) was associated with a higher risk of being diagnosed with new incident conditions in the post-acute period. Overall, a higher risk of any incident diagnosis was observed in patients who were hospitalized during the acute phase (1.29 (1.24–1.33)) or admitted to the ICU (1.40 (1.32–1.49)) compared to patients who were not hospitalized during the acute phase (the reference group; see Extended Data Table 1). Figure 3 further shows the associations between acute-phase severity and a range of potential PASC conditions. Specifically, compared to non-hospitalized patients, ICU patients showed a 4.7-fold higher risk of being diagnosed with myopathy, a 2.5-fold higher risk of pressure ulcers, a 2.3-fold higher risk of thromboembolism, and a 2.1-fold higher risk of malaise and fatigue. In addition, patients who were hospitalized or admitted to the ICU during the acute phase had a higher risk of being diagnosed with the general PASC codes U099/B948, with 4.3- and 2.2-fold increases compared to non-hospitalized patients.
Age. Patients aged 75 or older showed an increased risk of being diagnosed with a wide range of potential PASC conditions in the post-acute infection phase, including dementia (5.8-fold higher), COPD (2.2-fold), cerebral ischemia (2.1-fold), malnutrition (1.8-fold), pressure ulcer (1.8-fold), anemia (1.6-fold), and cognitive problems (1.6-fold), compared to patients who were 55–64 years old (the reference group). Patients aged 65 to 74 years showed an increased risk of being diagnosed with dementia (2.7-fold), heart failure (1.6-fold), and diabetes mellitus (1.6-fold) compared to the reference group. By contrast, younger patients aged 20–39 years exhibited an increased risk of milder potential PASC conditions, including acute pharyngitis (1.7-fold), headache (1.4-fold), and anxiety disorder (1.4-fold), compared to the reference group.
Gender and race. Female patients exhibited 4.3- and 1.3-fold increased risks of being diagnosed with incident hair loss and anxiety disorder, respectively, in the post-acute infection period compared to male patients. Black patients exhibited a 1.9-fold increased risk of being diagnosed with incident diabetes mellitus compared to white patients.
Body mass index. Patients who were underweight (BMI < 18.5 kg/m²) or obese (BMI ≥ 30 kg/m²) were at higher risk of being diagnosed with certain potential PASC conditions than those with normal BMI (18.5 to 24.9 kg/m²). Specifically, underweight patients were at a 1.6-fold increased risk of being diagnosed with heart failure and diabetes mellitus, and a 1.4-fold increased risk of being diagnosed with malnutrition, compared to patients with normal BMI. Obese patients showed a 1.8-fold increased risk of being diagnosed with diabetes mellitus and a 1.3-fold increased risk of being diagnosed with a sleep disorder.
Period of infection. Patients who were infected from July 2021 to November 2021, a period dominated by the Delta variant of SARS-CoV-2,24 showed an increased risk of being diagnosed with incident pharyngitis (3.2-fold), chest pain (1.9-fold), abdominal pain (1.7-fold), and dyspnea (1.6-fold), as well as with general PASC symptoms and signs recorded with the U099/B948 ICD codes (5-fold), in the post-acute infection period compared to patients who were infected from March 2020 to June 2020 (the first wave, used as the reference period).
Pre-existing conditions. As shown in Fig. 3, having one or more baseline conditions was associated with a higher risk of potential PASC diagnoses, including malnutrition, fluid disorders, anemia, and chest pain. Specifically, cancer patients showed an increased risk of a broad range of post-acute conditions, including malnutrition, atelectasis, fever, anemia, pulmonary fibrosis, constipation, and fibromyalgia, compared to those without cancer diagnoses at baseline. Patients with baseline chronic kidney disease showed an increased risk of being diagnosed with heart failure and anemia. Those with baseline cirrhosis showed a 3-fold increased risk of gastroparesis, a 2-fold increased risk of atelectasis, and a 1.8-fold increased risk of anemia. Those with baseline coagulopathy showed a higher risk of thromboembolism and cognitive problems. Patients with end-stage renal disease showed a higher risk of COPD and malnutrition. Those with baseline mental health disorders exhibited a higher risk of dementia and anxiety disorders in the post-acute period. Patients with Parkinson’s disease showed a 2.2-fold increased risk of encephalopathy. Pregnant females showed a 2.4-fold increased risk of anemia in the post-acute period. Those with a baseline pulmonary circulation disorder showed a 3.3-fold increased risk of pulmonary embolism and a 1.9-fold increased risk of heart failure. Patients with weight loss at baseline were at a higher risk of being diagnosed with pressure ulcers, COPD, constipation, and general PASC (U099/B948) in the post-acute phase.
Sensitivity analysis
We examined the impact of the criterion requiring an identified association to carry a higher risk in SARS-CoV-2 infected patients than in non-infected patients. Extended Data Fig. 1 depicts the associations identified after we lifted this requirement, i.e., when we required only the adjusted hazard ratio and statistical significance criteria. More associations were identified than in Fig. 3, and many of them may not be relevant to SARS-CoV-2 infection. For example, pre-existing cancer was associated with a higher risk of being diagnosed with fluid disorders, acute kidney failure, thromboembolism, encephalopathy, edema, malaise, and fatigue in the post-acute period after SARS-CoV-2 infection. However, these associations can also be identified in non-infected cancer patients. Therefore, the excess-risk criterion is necessary to filter out associations that are not specific to SARS-CoV-2 infection.
We also tested to what extent the predictability of incident potential PASC conditions is affected by the choice of machine learning model. We investigated a range of models of different complexity, including regularized logistic regression models, gradient boosting machines, and feed-forward deep neural networks. As shown in Extended Data Fig. 2, we observed little difference in performance across these models, and the heterogeneous predictability patterns persisted, i.e., conditions that were difficult to predict in Fig. 2 still showed low predictive performance with more complex models.
Lastly, we studied whether different feature engineering affects the prediction results for different PASC conditions. Instead of using pre-defined baseline comorbidities, we took a data-driven approach, using the first three digits of the ICD-10 codes of all diagnoses and the ingredient-level RxNorm codes of all medications in the baseline period to predict PASC. The predictive performance of different machine learning models using this large feature set is reported in Extended Data Fig. 3; it shows no substantial differences from the performance in Extended Data Fig. 2 or Fig. 2, and the heterogeneous predictability patterns remain the same.
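As an illustration of this data-driven feature construction, the sketch below truncates ICD-10 codes to their first three characters and combines them with medication ingredient identifiers into a set of binary features. The function names, the `dx:`/`rx:` prefixes, and the handling of dots and letter case are assumptions made for illustration; the study's actual preprocessing may differ.

```python
def icd10_category(code):
    """Reduce an ICD-10 code to its first three characters, e.g. 'E11.9' -> 'E11'.

    Assumption: codes may contain a dot and mixed case, which are normalized here.
    """
    return code.strip().replace(".", "").upper()[:3]


def build_features(diagnosis_codes, ingredient_codes):
    """Build a set of binary baseline features for one patient.

    diagnosis_codes: ICD-10 codes recorded in the baseline period.
    ingredient_codes: ingredient-level medication identifiers, assumed to be
        already mapped to the ingredient level upstream.
    """
    features = {f"dx:{icd10_category(c)}" for c in diagnosis_codes}
    features |= {f"rx:{c}" for c in ingredient_codes}
    return features
```

Truncation merges codes that share a three-character category, so two distinct diagnosis codes under the same category yield a single feature.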