Identifying Long COVID Definitions, Predictors, and Risk Factors using Electronic Health Records: A Scoping Review

doi:10.21203/rs.3.rs-3689967/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Objective

Long COVID, or post-COVID condition, is characterized by a range of physical and psychological symptoms and complications that persist beyond the acute phase of the coronavirus disease of 2019 (COVID-19). However, this condition still lacks a clear definition. This scoping review explores the potential of electronic health records (EHR)-based studies to characterize long COVID.

Methods

We screened all peer-reviewed publications in the English language from PubMed/MEDLINE, Scopus, and Web of Science databases until September 14, 2023. We identified studies that defined or characterized long COVID based on EHR data, regardless of geography or study design. We synthesized these articles based on their definitions, symptoms, and predictive factors or phenotypes to identify common features and analytical methods.

Results

We identified only 20 studies meeting the inclusion criteria, with a significant majority (n = 17, 85%) conducted in the United States. Respiratory conditions were significant in all studies, followed by poor well-being features (n = 17, 85%) and cardiovascular conditions (n = 14, 70%). Some articles (n = 8, 40%) used a long COVID-specific marker to define the study population, relying mainly on International Classification of Diseases, Tenth Revision (ICD-10) codes and clinical visits for post-COVID conditions. Among studies exploring plausible long COVID (n = 12, 60%), reverse transcription-polymerase chain reaction and antigen tests were the most common identification methods. The time delay for EHR data extraction post-test varied, ranging from four weeks to more than three months; however, most studies considering plausible long COVID used a waiting period of 28 to 31 days.

Conclusion

Our findings suggest a limited global utilization of EHR-derived data in defining or characterizing long COVID, with 60% of these studies incorporating a validation step. Future meta-analyses are essential to assess the homogeneity of results across different studies.

Bioinformatics

Electronic Health Records

Long COVID

Phenotypes

Post-acute COVID-19

Post-COVID conditions, known as long COVID, refer to symptoms that manifest about four or more weeks after the initial infection by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that causes the coronavirus disease of 2019 (COVID-19) [1, 2]. Patients with long COVID experience many symptoms and conditions affecting various organ systems, including the respiratory, circulatory, nervous, endocrine, and digestive systems [3]. Moreover, chronic medical conditions and certain risk factors such as high blood pressure, diabetes, and obesity can exacerbate symptoms [4]. The increase in medical spending in the United States (US) as a result of long COVID is estimated at $528 billion total (about $1,570 per person), about 2.5% of the total gross domestic product in 2019 [5]. A historical cohort study in Israel estimated the excess cost of long COVID on healthcare utilization as an additional 7.6% of control patient care [6]. Another study estimated productivity-based losses to the German economy at $3.7 billion [7]. Due to the debilitating conditions of long COVID, functional impairment, low work productivity, and long-term health complications are anticipated [8, 9].

At present, little is known about the pathophysiology of these multisystem complications. While several studies have attempted to characterize long COVID [10–13], comparing their findings is difficult because of heterogeneity in symptoms and study periods. Further, the accepted definition of long COVID keeps evolving [14]. The largest initiative in the US to define long COVID has been the National Institutes of Health (NIH) Researching COVID to Enhance Recovery (RECOVER) program. The program identified the 12 most prevalent symptom patterns among those with long COVID: post-exertional malaise, fatigue, brain fog, dizziness, gastrointestinal symptoms, heart palpitations, changes in sexual desire or capacity, loss of taste or smell, thirst, chronic cough, chest pain, and abnormal movements [15]. As of November 2023, about 450 studies listed on ClinicalTrials.gov are investigating long COVID, including the response to antivirals, lithium therapy, and nitrate supplements, to discover possible treatments [16]. Several studies identified predictors or risk factors of long COVID [17–19] but used survey-based approaches that are prone to question heterogeneity and voluntary response bias. Identified examples included increasing age and body mass index (BMI), female sex, frailty, experiencing more than five symptoms or a hospital or emergency room visit during acute COVID-19 illness, and comorbidities such as asthma and heart disease [17–19].

Electronic health records (EHR) can provide a reliable, cost-effective, and comprehensive overview of the medical histories of a large population of patients, including previous infections and pre-existing conditions, thus enabling the tracking of symptoms and conditions [20]. Within EHR systems, a valuable resource for studying disease etiology is the International Classification of Diseases, Tenth Revision (ICD-10), used to code and classify symptoms, procedures, and diagnoses [21]. Unexplained symptoms following COVID-19 were coded as B94.8, a placeholder signifying long COVID, until September 30, 2021; the code was subsequently replaced by U09.9 [22].

While previous review studies attempted to define long COVID [23–26], these efforts are constrained by their heterogeneous designs and lack of specific diagnostic definitions. For example, a previous meta-analysis aimed at characterizing long COVID included 39 studies of diverse study designs, such as cohort, cross-sectional, and case-control [23]. Among these studies, the majority (n = 34) had either a moderate or high risk of bias. Likewise, Kelly et al. [27] identified a wide spectrum of symptoms and significant heterogeneity across the studies, noting that the population was limited to hospitalized patients. An earlier review by Iqbal et al. [28] incorporated 35 articles until March 2021 and highlighted limitations pertaining to various study designs, a limited number of countries, and questionnaire-based cross-sectional studies that fail to capture the evolution of symptoms over time. However, to our knowledge, no review article has comprehensively explored the potential of EHR-based studies to characterize long COVID. Thus, this scoping review aims to collate the studies that defined long COVID based on phenotypes or identified predictors or risk factors derived from EHR data. We also identified the analytical methods and summarized the common significant phenotypes, predictors, or risk factors.

We used the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to conduct this scoping review.

2.1. Data Source

We searched three databases, PubMed/MEDLINE, Scopus, and Web of Science, to identify peer-reviewed articles published until September 14, 2023, regardless of geography or study design.

2.2. Search Strategy

The initial screening of abstracts and titles included all articles that contained our key terms pertaining to long COVID, and those that contained EHR or electronic medical records (EMR) (Table 1). Studies were considered eligible if they were written in English, were peer-reviewed, and defined or characterized long COVID phenotypes or its predictors or risk factors. Review articles, pre-prints, case reports, editorials, abstracts, and articles published in early 2020 and before were excluded. Notably, studies that used non-EHR-based features of long COVID were excluded.

Table 1

Search terms input into PubMed, Scopus, and Web of Science.
Theme	Key terms
Long COVID	"long COVID" OR "long-term COVID" OR "post-acute sequelae" OR "late-stage COVID" OR "SARS-CoV-2 post-recovery" OR "post-COVID" OR "PASC" OR "long-haul COVID" OR "Chronic COVID" OR "persistent COVID" OR "prolonged COVID" OR "extended COVID" OR "post-recovery COVID" OR "Aftermath COVID" OR "survivorship COVID" OR "late effects COVID*" OR "long-term effects of COVID" OR "post-acute COVID-19" OR "post-acute sequelae of SARS-CoV-2 (PASC)" OR "ICD-10-CM" OR "ICD-10"
Electronic health records	"Electronic Medical Record" OR "Electronic Health Record" OR "Electronic Patient Record*" OR "EHR" OR "EMR" OR "N3C" OR "All of US"
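The way such a search string can be assembled from the two themes in Table 1 is sketched below. This is an illustration only: the term lists are abbreviated, the helper names `or_group` and `build_query` are hypothetical, and joining the two themes with AND is an assumption about how the themes were combined rather than a documented detail of the search.

```python
# Illustrative sketch: build a Boolean search string from two term groups.
# Term lists are abbreviated from Table 1. Combining the themes with AND
# is an assumption, not a documented detail of the review's search.

def or_group(terms):
    """Quote each term and join the group with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(*groups):
    """Join several OR-groups with AND."""
    return " AND ".join(or_group(g) for g in groups)


long_covid_terms = ["long COVID", "post-acute sequelae", "PASC"]
ehr_terms = ["Electronic Health Record", "EHR", "EMR"]

query = build_query(long_covid_terms, ehr_terms)
print(query)
```

Each database has its own field tags and wildcard syntax, so a string built this way would still need adapting before use in PubMed, Scopus, or Web of Science.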

2.3. Study Selection

We imported all the retrieved abstracts and titles into the Covidence online tool (https://www.covidence.org/), where duplicates were removed. Two reviewers (AM and GSC) independently screened abstracts and titles and excluded the articles not about long COVID. Next, two different reviewers (RAL and GSJ) screened the full texts of the selected articles. In this step, articles were excluded if they did not have a clear definition of long COVID or did not identify risk factors, phenotypes, or predictors. Conflicts were resolved through follow-up discussions to reach a consensus on the final selection of articles. We also conducted a reference check of the selected articles for possible inclusion. Figure 1 shows the process of study selection.

2.4. Data Extraction

We extracted information from the selected articles, including author names, study dates, sample size, types of data collected (qualitative and quantitative), study population inclusion criteria, identified long COVID features, methods used to identify such features, methods used to validate results, and study limitations.

2.5. Narrative Synthesis

A summary of long COVID labels assigned to samples in each study was created, and data was categorized based on clinical phenotypes and databases used in each of these papers. We synthesized various methods used to identify the important characteristics of long COVID phenotypes, predictors, and risk factors.

Initially, we identified 543 articles from PubMed, Web of Science, and Scopus databases. Of these, 327 studies were duplicates and removed, resulting in 216 unique studies for the initial screening. After the abstract and title screening of those articles, another 113 studies were excluded. We conducted the full-text review of the remaining 103 studies and excluded 93 studies due to a lack of a clear definition of long COVID and/or missing identification of predictors or risk factors. We identified ten additional studies via a reference check of the articles. Therefore, the final selection comprised 20 studies. Table 2 provides a summary of the selected studies.
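The selection counts reported above can be checked with simple arithmetic. The short sketch below recomputes each stage from the numbers in the text; the function name `prisma_flow` and the dictionary keys are hypothetical labels chosen for illustration.

```python
# Recompute the study-selection flow from the counts reported in the text.

def prisma_flow(identified, duplicates, excluded_screening,
                excluded_full_text, added_from_references):
    """Return the number of records remaining at each selection stage."""
    unique = identified - duplicates
    full_text = unique - excluded_screening
    eligible = full_text - excluded_full_text
    included = eligible + added_from_references
    return {
        "unique": unique,
        "full_text": full_text,
        "eligible": eligible,
        "included": included,
    }


flow = prisma_flow(
    identified=543,
    duplicates=327,
    excluded_screening=113,
    excluded_full_text=93,
    added_from_references=10,
)
print(flow)
```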

The total sample size of long COVID study and control populations ranged from 11,209 to 5,213,885. The Veterans Affairs Electronic Healthcare database (n = 5,808,018), the Consortium for the Clinical Characterization of COVID-19 by EHR (4CE) (n = 5,434,528), and INSIGHT Network, New York City Health System (n = 5,346,357) provided the largest databases of potential patients. The size of the 4CE database of positive COVID-19 patients varied among studies as new data were added monthly. Additionally, some datasets used inpatient and outpatient records to identify positive cases of COVID-19. The geographic distribution of articles revealed that 85% of the selected studies utilized datasets composed of US patients. The remaining studies were carried out on patient data from France, Germany, Italy, and Singapore. The authors used qualitative (20%), quantitative (20%), or a mixture (30%) of qualitative and quantitative data. The remaining studies (30%) did not state the data type explicitly, or their description was unclear.

Figure 2 shows the methods for identifying candidate patient labels, classified as long COVID (n = 8/20) or plausibly long COVID (n = 12/20). Among articles that used a long COVID-specific marker (Fig. 2a), most (6/8, 75%) chose an ICD-10 code for post COVID-19 condition, unspecified, or sequelae of other specified infectious diseases. Among the studies that used acute COVID-19 illness to define plausibly long COVID patients (n = 12/20), reverse transcription-polymerase chain reaction (RT-PCR) and antigen tests were most common (Fig. 2b). These twelve studies extracted features to study from the EHR after a waiting period following a recorded acute COVID-19 positive test. The waiting period shown in Fig. 2b varied from 28 days to more than three months [29, 30], but most used a one-month delay.
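The two labeling routes described above can be sketched as a simple rule. This is a minimal illustration under stated assumptions, not the method of any included study: the record layout (a list of code and date pairs), the function name `label_patient`, and the rule that any record after the waiting period makes a patient "plausible" are all simplifications introduced here. The two ICD-10 codes correspond to the categories named in the text (U09.9 for post COVID-19 condition, unspecified, and B94.8 for sequelae of other specified infectious diseases), and the 28-day wait is the lower bound of the reported range.

```python
from datetime import date, timedelta

# ICD-10 codes used as long COVID-specific markers (as named in the review).
LONG_COVID_CODES = {"U09.9", "B94.8"}
# Lower bound of the reported waiting periods; individual studies vary.
WAITING_PERIOD = timedelta(days=28)


def label_patient(diagnoses, positive_test_date=None):
    """Assign a candidate label to one patient.

    diagnoses: list of (icd10_code, record_date) tuples (hypothetical layout).
    positive_test_date: date of a recorded positive test, or None.
    """
    # Route 1: a long COVID-specific diagnosis code is present.
    if any(code in LONG_COVID_CODES for code, _ in diagnoses):
        return "long COVID"
    # Route 2: a positive test plus any record after the waiting period.
    if positive_test_date is not None:
        window_start = positive_test_date + WAITING_PERIOD
        if any(day >= window_start for _, day in diagnoses):
            return "plausible long COVID"
    return "unlabeled"


coded = label_patient([("U09.9", date(2022, 1, 10))])
late = label_patient([("R53", date(2021, 3, 1))], date(2021, 1, 15))
early = label_patient([("R05", date(2021, 1, 20))], date(2021, 1, 15))
print(coded, late, early, sep=" | ")
```

In practice, the included studies applied further inclusion criteria (for example, required pre- and post-infection visits), which this sketch omits.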

Table 2

Summary of the included studies.
Study*	Year	Study Design	Total Sample Size	Total Participants (n = study population)	EHR Source	Main Methods	Method of Validation
Al-Aly et al. [31]	2021	Observational Study	5,213,885	73,435	US Department of Veterans Affairs electronic healthcare databases	Cox regression	N/A
Baskett et al. [32]	2022	Retrospective Cohort Study	N/A	17,487	Cerner Real-World data set	Logistic regression; Propensity score matching	N/A
Dagliati et al. [29]	2023	Retrospective Cohort Study	30,422	12,424	4CE	Clustering; MLHO	Clinical expert review
Estiri et al. [30]	2021	Retrospective Cohort Study	96,025	22,475	Mass General Brigham Hospital	MLHO; Multivariate time series analysis	Clinical expert review
Fritsche et al. [33]	2023	Case-Control	63,675	1,724	Michigan Medicine EHR data	Logistic regression	Sensitivity analysis; Logistic regression; AAUC
Haupert et al. [34]	2022	Case-Crossover Design	204,597	44,198	Michigan Medicine Health System	Logistic regression	N/A
Jiang et al. [35]	2022	Retrospective Cohort Study	85,196	28,558	N3C EHR repository	Deep neural networks; PCA	AUC; F1 score
Kessler et al. [36]	2023	Retrospective Exploratory Study	272,588	5,440	IQVIA Disease Analyzer Database	Light gradient boosting (decision tree)	AUC; Precision; Recall; Specificity; F2 Score
Khullar et al. [37]	2023	Retrospective Cohort Study	310,220	62,339	INSIGHT Network; New York City Health Systems	Logistic regression	Sensitivity analysis
Lorman et al. [38]	2023	Observational Study	14,399	1,309	RECOVER PEDSnet EHR	Propensity score matching; Decision trees	Sensitivity analysis
Nasir et al. [39]	2023	Observational Study	11,209	4,091	Health Choice Network	Bayesian structural time series modeling	N/A
Pfaff et al. [40]	2023	Retrospective Cohort Study	36,880	33,782	N3C	Clustering; network analysis	N/A
Pfaff et al. [22]	2022	Retrospective Cohort Study	1,793,604	73,972	N3C	Extreme gradient boosting (decision tree)	Cross-validation; AUC; Precision; Recall; F-score
Rao et al. [41]	2022	Exploratory, Retrospective Cohort Study	659,286	59,893	PEDSnet	Cox regression; logistic regression	N/A
Reese et al. [42]	2023	Retrospective Cohort Study	5,434,528	20,532	N3C	NLP; k-means clustering	N/A
Sengupta et al. [43]	2022	Retrospective Cohort Study	49,950	7,511	N3C	Convolutional and LSTM neural networks	AUC
Wang et al. [44]	2022	Observational Study	51,485	26,117	Mass General Brigham Hospital	Rule-based NLP	Precision
Zang et al. [45]	2023	Observational Study	361,401 (INSIGHT) 199,351 (OneFlorida+)	35,275 (INSIGHT) 22,341 (OneFlorida+)	INSIGHT CRN and OneFlorida+ CRN	Propensity score matching	Sensitivity analysis
HG Zhang et al. [46]	2022	Retrospective Cohort Study	2,745,130	414,602	4CE	Random-effects meta-analysis	N/A
H Zhang et al. [47]	2022	Retrospective Cohort Study	34,605	20,881	INSIGHT CRN and OneFlorida+ CRN	Clustering; PFA topic modeling	Topic Coherence; Sensitivity analysis

*Abbreviations: 4CE: Consortium for Clinical Characterization of COVID-19 by EHR; AAUC: area under the covariate-adjusted receiver operating characteristic curve; AUC: area under the receiver operating characteristic curve; ANOVA: analysis of variance; CRN: clinical research network; MLHO: Machine Learns Health Outcomes; PEDSnet: a National Pediatric Learning Health System; N3C: National COVID Cohort Collaborative; NLP: natural language processing; LSTM: long short-term memory; PCA: principal component analysis; PFA: Poisson factor analysis.

Figure 3 depicts the candidates used to define long COVID or identify predictors or risk factors. All studies used diagnostic categories as candidates. Many studies organized their diagnostic category candidates by grouping ICD-10 codes into diagnostic clusters using a version of Clinical Classifications Software Refined. In contrast, some used the R package PheWAS to map ICD-10 codes to unique PheCodes. Notably, seven (35%) studies included medications, and six (30%) studies used a combination of social determinants of health (SDOH) and patient demographics. Only one study [35] used quantitative data based on biophysical metrics recorded during acute COVID-19 hospitalizations.
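The idea of collapsing individual ICD-10 codes into broader diagnostic categories can be illustrated with a toy grouping. The sketch below groups codes by their ICD-10 chapter letter only; it is deliberately much coarser than Clinical Classifications Software Refined or PheCodes and does not reproduce either mapping. The names `CHAPTER_BY_LETTER`, `categorize`, and `category_counts`, and the example codes, are hypothetical.

```python
from collections import Counter

# Coarse, illustrative grouping by ICD-10 chapter letter.
# This is NOT the CCSR or PheCode mapping used by the included studies.
CHAPTER_BY_LETTER = {
    "I": "circulatory",         # chapter I00-I99
    "J": "respiratory",         # chapter J00-J99
    "R": "symptoms and signs",  # chapter R00-R99
}


def categorize(code):
    """Map one ICD-10 code to a broad category via its first letter."""
    return CHAPTER_BY_LETTER.get(code[:1].upper(), "other")


def category_counts(codes):
    """Count how many codes fall into each broad category."""
    return Counter(categorize(c) for c in codes)


example_codes = ["J96.0", "J45.9", "R53", "I25.1", "E11.9"]
counts = category_counts(example_codes)
print(counts)
```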

The main data analysis methods used by the studies are presented in Fig. 4, which depicts only the methods used by at least two studies; Table 2 lists the methods of each study. Due to the relatively large sample sizes, most studies used data science techniques. Ten (50%) articles discussed the implementation of machine learning for feature down-selection or identification of significant features. Various clustering and dimensionality-reduction methods were utilized, including principal component analysis, the Louvain algorithm, probabilistic topic modeling, and hierarchical agglomerative clustering. Moreover, several studies implemented decision trees, two of which used the extreme gradient boosting algorithm.

All articles incorporated statistical tests, but approaches varied. Nearly all articles (n = 17, 85%) reported results using hazard ratios, risk ratios, odds ratios, or p-values. Hazard ratios were often found using Cox survival models, and reported p-values often underwent the Bonferroni correction. Some studies reduced their initial pool of candidates to a final set based on feature importance criteria such as hazard ratios greater than 1 or p-values less than 0.05. In contrast, others merely discussed the top 20 or 30 features.
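Two of the quantities mentioned above, the odds ratio and the Bonferroni-corrected p-value, can be sketched in a few lines. The counts and p-values below are invented toy inputs for illustration and are not taken from any included study; the function names are hypothetical.

```python
# Toy illustration of an odds ratio and a Bonferroni correction.
# All input numbers are made up for demonstration purposes.

def odds_ratio(exposed_cases, exposed_controls,
               unexposed_cases, unexposed_controls):
    """Odds ratio from the four cells of a 2x2 contingency table."""
    return (exposed_cases * unexposed_controls) / (
        exposed_controls * unexposed_cases
    )


def bonferroni(p_values, alpha=0.05):
    """Multiply each p-value by the number of tests (capped at 1.0).

    Returns the adjusted p-values and a flag per test indicating whether
    the adjusted value is below alpha.
    """
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return adjusted, [p < alpha for p in adjusted]


ratio = odds_ratio(30, 70, 10, 90)
adjusted, significant = bonferroni([0.001, 0.02, 0.04])
print(round(ratio, 2), adjusted, significant)
```

With many candidate features, the correction becomes strict: a feature with a raw p-value of 0.02 no longer passes a 0.05 threshold once three tests are accounted for.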

We categorized the phenotypes, predictors, and risk factors of long COVID identified by the articles (Fig. 5). As some articles used hundreds of candidates in their analyses, Fig. 5 only shows the categories of most significant findings. The studies varied significantly in how they clustered and summarized their results; we grouped these by broad categories to facilitate comparison. All articles found respiratory conditions significant, nearly all studies (n = 17, 85%) reported poor well-being features as significant, and most (n = 14, 70%) included cardiovascular conditions. Examples for these top three categories were: respiratory conditions (e.g., respiratory failure, asthma), poor general well-being (e.g., fatigue, pain), and cardiovascular conditions and diseases (e.g., chest pain, coronary disease). Fifty-five percent of the definitions, predictors, and risk factors used to understand long COVID were qualitative; 15% of the articles used a quantitative approach, 5% were qualitative and quantitative, and 25% had either an incomplete or nonexistent description.

Twelve of the studies (60%) included a validation step to test their purported long COVID phenotype, predictors, or risk factors on unlabeled data. One-fourth used their entire suite of candidate features in a long COVID prediction task and measured their success with the area under the receiver operating characteristic curve (AUC) or F scores. A quarter of the studies (25%) used sensitivity analysis to test the robustness of their results. Wang et al. [44] used natural language processing (NLP) to validate their reduced symptom lexicon on a test dataset and achieved an average precision and estimated recall of 0.94 and 0.84, respectively. Dagliati et al. and Estiri et al. [29, 30] employed clinical experts to review the phenotypes identified by their method.
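For readers less familiar with the validation metrics named above, the sketch below computes precision, recall, an F-beta score, and AUC from toy inputs. The numbers are invented and do not reproduce any study's results; the AUC is computed with the pairwise-ranking definition (the probability that a randomly chosen positive scores higher than a randomly chosen negative), which is one standard way to obtain it. Function names are hypothetical.

```python
# Toy illustration of common validation metrics; inputs are made up.

def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Precision, recall, and F-beta from confusion-matrix counts.

    beta=1 gives the F1 score; beta=2 weights recall more heavily (F2).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


def auc(scores_pos, scores_neg):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))


precision, recall, f1 = precision_recall_fbeta(tp=8, fp=2, fn=2)
area = auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
print(precision, recall, f1, round(area, 3))
```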

This is the first scoping review for identifying long COVID phenotypes, predictors, and risk factors based on EHR data. We found 20 articles that met the eligibility criteria. The articles were classified based on how long COVID was defined, the methodologies used, and the significant risk factors or phenotypes identified. ICD-10 codes (U09.9 or B94.8) were the most common markers. A large majority of the studies reported poor general well-being, respiratory conditions, and cardiovascular conditions as significant features.

The studies collectively illuminate the heterogeneous nature of long COVID, revealing that it does not manifest as a uniform condition. Instead, it presents diverse phenotypic clusters, which include respiratory, neuropsychiatric, cardiovascular, and pain and fatigue subtypes [42]. These subtypes are characterized by distinct clinical features, patient demographics, and associations with various organ systems. Ten of the twelve symptom patterns identified by the NIH RECOVER initiative were considered as features by the studies and fell under broad, significant categories (Fig. 5). Both thirst and changes in sexual desire were absent as individual features but may fall under broad symptom characterizations [42]. Phenotypic clustering utilizing machine learning methods may provide a unique approach to examining commonalities and marked differences among phenotypes, and may enhance our understanding of long COVID's heterogeneity. While several of the phenotypes found by the studies are commonly used to describe long COVID, latent phenotypes such as substance use disorders (e.g., opioid use) and conditions of the genital organs offer additional ways to define the disease.

Beyond the symptoms of long COVID, some studies reported predictors and risk factors for developing the condition. We noted that SDOH and demographic information cannot plausibly be used to form a clinical definition of long COVID but may indicate the likelihood of developing the condition. Such studies included patients who recovered quickly from their acute infection in their test population, therefore diluting the conclusions that can be drawn from observed symptoms, diagnoses, and medicines.

Surprisingly, our research found limited global use of EHR data in defining or characterizing long COVID, with 85% of the included studies utilizing datasets composed of US patients. These geographical disparities in research distribution could mainly be related to privacy regulations governing access to EHR data across different countries. The lack of diverse EHR-based studies may hinder our understanding of long COVID's complex nature and the generalizability of findings beyond the studied regions. Thus, it is recommended that future international studies harness the full potential of EHR data for long COVID research, enabling the development of more effective interventions.

The studies identified a broad range of symptoms, complications, and clinical conditions significantly associated with long COVID. However, only one study [39] examined causality, using Bayesian structural time series models. This complexity underscores the need for a more comprehensive understanding of clinical conditions to elucidate the causal relationships. Moreover, while the studies primarily relied on ICD-10 codes to define long COVID, only one study leveraged the potential of NLP for extracting information from unstructured clinical notes in EHR data. Combined with longitudinal EHR data, this approach can enhance our understanding of patterns, progression, treatment, and management of long COVID.

While the terms EHR, EMR, and electronic patient record (EPR) are often used interchangeably, the systems differ in scope and usage. EHRs were the most frequently encountered systems in this review. They serve as digital repositories of a patient's complete medical history from all healthcare providers and are intended for sharing with other healthcare entities, practices, and hospitals. EMRs focus on a single practice or hospital. Some EMR systems offer integrated care, possibly within larger hospital corporations. Lastly, the term EPR is used primarily in Europe and did not appear in the included articles. The localized usage of the term EPR may indicate that other terminologies are used globally that we did not account for in our literature search strategy, as our search was confined to specific health information systems. For future studies, it would be beneficial to include broader search terms, such as data hubs, data lakes, registries, and repositories, to encompass a more comprehensive scope of relevant literature.

The included studies shared some common limitations. There is no widely accepted definition for long COVID to date, so the disease was defined based on various arbitrary time intervals. Additionally, the symptoms and conditions of long COVID were examined mostly in the initial months. Most studies failed to differentiate between incident and prevalent symptoms, contributing to ambiguity in characterizing long COVID. To address these challenges, a standardized research protocol that facilitates systematic tracking of symptoms seems imperative. Leveraging longitudinal EHR-derived data can provide a more comprehensive understanding of the emerging symptoms and help monitor longer-term trends and effects.

Nearly all studies acknowledged data-related limitations. The use of diagnosis codes is a limitation because they may not capture all signs, symptoms, or laboratory results found in clinical notes. This reliance on diagnostic codes could result in missing information and biases. Additionally, the intensity of the pandemic and misinformation might introduce confirmation bias among healthcare providers and patients. Some articles [34, 45] acknowledged the need for replication studies in other cohorts. Several articles excluded hospitalized COVID-19 patients, so their findings might not reflect the complete spectrum of long COVID. This limitation highlights the importance of including more hospitalized patients in future studies.

Temporal bias is an issue because the choice of time windows for analysis varies among studies. Some studies expressed concerns about not accounting for within-person time-varying confounders, such as changes in health-seeking behavior during the pandemic. The studies’ requirement of pre- and post-COVID-19 visits is acknowledged as potentially biased toward patients with more complex health histories. The cohort case window ratio is a subjective parameter, indicating variability in the study design. Additionally, some studies mentioned potential timing biases related to developing and implementing specific diagnosis codes. Some touched on the issue of data representation concerning various population groups and the inability to generalize the data. The absence of medication information used for COVID-19 therapy, particularly in severe cases, is noted as a potential limitation, as is the lack of data on viral variants for individual patients.

Our study has several limitations. First, the exclusion of non-English articles may introduce language bias. Second, given the evolving nature of the topic, the omission of pre-prints and conference articles could result in the loss of crucial information. Additionally, it is possible that some articles were missed during the screening process, although we mitigated this by conducting reference checks and backward searches. Another recurring limitation of the included studies was the lack of comprehensive insight into the underlying mechanisms that lead to long COVID. Furthermore, the studies did not explore the indirect effects of long COVID, such as the social, economic, and behavioral changes that may arise due to the condition. This knowledge gap poses a significant challenge in fully understanding and addressing long COVID.

In conclusion, this scoping review has provided a comprehensive overview of state-of-the-art research on long COVID that uses EHR data as the primary source for defining and characterizing this condition. The findings suggest that while a consensus on the definition of long COVID remains elusive, ICD-10 codes are commonly used for identification, and poor general well-being, respiratory conditions, and cardiovascular conditions are consistently associated with long COVID. The use of EHR data for characterizing long COVID is primarily concentrated in the US, highlighting the need for more international studies. Moreover, while data science techniques are widely employed, validation was absent in 40% of the studies and causality was assessed in only one, highlighting the need for more robust methodologies. The complex nature of long COVID, encompassing various symptoms and clinical conditions, underscores the need for more in-depth studies, including those leveraging longitudinal EHR data. Despite the gaps, this review serves as a foundation for future research efforts aimed at harnessing the potential of EHR data to better understand the epidemiology of long COVID.

Competing interests:

None.

Funding:

None.

Author contributions:

Study concept and design: AM; Title and abstract screening: GSC and AM; Full-text review: RAL and GSJ; Data extraction: All authors; Writing and editing and final approval: All authors.

Crook H, Raza S, Nowell J, Young M, Edison P. Long covid—mechanisms, risk factors, and management. bmj. 2021;374.
Devi KP, Pourkarim MR, Thijssen M, Sureda A, Khayatkashani M, Cismaru CA, et al. A perspective on the applications of furin inhibitors for the treatment of SARS-CoV-2. Pharmacological Reports. 2022:1–6.
Garg M, Maralakunte M, Garg S, Dhooria S, Sehgal I, Bhalla AS, et al. The conundrum of 'long-COVID-19': a narrative review. International journal of general medicine. 2021:2491–506.
Makhoul E, Aklinski JL, Miller J, Leonard C, Backer S, Kahar P, et al. A review of COVID-19 in relation to metabolic syndrome: obesity, hypertension, diabetes, and dyslipidemia. Cureus. 2022;14.
Cutler DM. The economic cost of long COVID: An update. Published online July. 2022.
Sagy YW, Feldhamer I, Brammli-Greenberg S, Lavie G. Estimating the economic burden of long-Covid: the additive cost of healthcare utilisation among COVID-19 recoverees in Israel. BMJ Global Health. 2023;8:e012588.
Gandjour A. Long COVID: Costs for the German economy and health care and pension system. BMC Health Services Research. 2023;23:1–7.
Sanchez-Ramirez DC, Normand K, Zhaoyun Y, Torres-Castro R. Long-term impact of COVID-19: a systematic review of the literature and meta-analysis. Biomedicines. 2021;9:900.
Ham DI. Long-haulers and labor market outcomes. Federal Reserve Bank of Minneapolis; 2022.
Amin-Chowdhury Z, Ladhani SN. Causation or confounding: why controls are critical for characterizing long COVID. Nature Medicine. 2021;27:1129–30.
Barizien N, Le Guen M, Russel S, Touche P, Huang F, Vallée A. Clinical characterization of dysautonomia in long COVID-19 patients. Scientific reports. 2021;11:14042.
Davis HE, Assaf GS, McCorkell L, Wei H, Low RJ, Re'em Y, et al. Characterizing long COVID in an international cohort: 7 months of symptoms and their impact. EClinicalMedicine. 2021;38.
Deer RR, Rock MA, Vasilevsky N, Carmody L, Rando H, Anzalone AJ, et al. Characterizing long COVID: deep phenotype of a complex condition. EBioMedicine. 2021;74.
Taquet M, Dercon Q, Luciano S, Geddes JR, Husain M, Harrison PJ. Incidence, co-occurrence, and evolution of long-COVID features: A 6-month retrospective cohort study of 273,618 survivors of COVID-19. PLoS medicine. 2021;18:e1003773.
Thaweethai T, Jolley SE, Karlson EW, Levitan EB, Levy B, McComsey GA, et al. Development of a definition of postacute sequelae of SARS-CoV-2 infection. JAMA. 2023;329:1934–46.
Bonilla H, Peluso MJ, Rodgers K, Aberg JA, Patterson TF, Tamburro R, et al. Therapeutic trials for long COVID-19: A call to action from the interventions taskforce of the RECOVER initiative. Front Immunol. 2023;14:1129459.
Jones R, Davis A, Stanley B, Julious S, Ryan D, Jackson DJ, et al. Risk predictors and symptom features of long COVID within a broad primary care patient population including both tested and untested patients. Pragmatic and Observational Research. 2021:93–104.
Knight DR, Munipalli B, Logvinov II, Halkar MG, Mitri G, Hines SL. Perception, prevalence, and prediction of severe infection and post-acute sequelae of COVID-19. The American Journal of the Medical Sciences. 2022;363:295–304.
Sudre CH, Murray B, Varsavsky T, Graham MS, Penfold RS, Bowyer RC, et al. Attributes and predictors of long COVID. Nature Medicine. 2021;27:626–31.
Mollalo A, Hamidi B, Lenert L, Alekseyenko AV. Characterizing Patient Phenotypes and Emerging Trends in Application of Spatial Analysis in Individual-Level Health Data. 2023.
Kurbasic I, Pandza H, Masic I, Huseinagic S, Tandir S, Alicajic F, et al. The advantages and limitations of international classification of diseases, injuries and causes of death from aspect of existing health care system of Bosnia and Herzegovina. Acta Informatica Medica. 2008;16:159.
Pfaff ER, Girvin AT, Bennett TD, Bhatia A, Brooks IM, Deer RR, et al. Identifying who has long COVID in the USA: a machine learning approach using N3C data. The Lancet Digital Health. 2022;4:e532–e41.
Michelen M, Manoharan L, Elkheir N, Cheng V, Dagens A, Hastie C, et al. Characterising long COVID: a living systematic review. BMJ Global Health. 2021;6:e005427.
Akbarialiabad H, Taghrir MH, Abdollahi A, Ghahramani N, Kumar M, Paydar S, et al. Long COVID, a comprehensive systematic scoping review. Infection. 2021:1–24.
Aiyegbusi OL, Hughes SE, Turner G, Rivera SC, McMullan C, Chandan JS, et al. Symptoms, complications and management of long COVID: a review. Journal of the Royal Society of Medicine. 2021;114:428–42.
Michelen M, Manoharan L, Elkheir N, Cheng V, Dagens D, Hastie C, et al. Characterising long-term covid-19: a rapid living systematic review. medRxiv. 2020;10:08.20246025.
Kelly JD, Curteis T, Rawal A, Murton M, Clark LJ, Jafry Z, et al. SARS-CoV-2 post-acute sequelae in previously hospitalised patients: systematic literature review and meta-analysis. European Respiratory Review. 2023;32.
Iqbal FM, Lam K, Sounderajah V, Clarke JM, Ashrafian H, Darzi A. Characteristics and predictors of acute and chronic post-COVID syndrome: A systematic review and meta-analysis. EClinicalMedicine. 2021;36.
Dagliati A, Strasser ZH, Abad ZSH, Klann JG, Wagholikar KB, Mesa R, et al. Characterization of long COVID temporal sub-phenotypes by distributed representation learning from electronic health record data: a cohort study. EClinicalMedicine. 2023;64.
Estiri H, Strasser ZH, Brat GA, Semenov YR, Patel CJ, Murphy SN. Evolving phenotypes of non-hospitalized patients that indicate long COVID. BMC Medicine. 2021;19:1–10.
Al-Aly Z, Xie Y, Bowe B. High-dimensional characterization of post-acute sequelae of COVID-19. Nature. 2021;594:259–64.
Baskett WI, Qureshi AI, Shyu D, Armer JM, Shyu C-R. COVID-specific long-term sequelae in comparison to common viral respiratory infections: an analysis of 17 487 infected adult patients. Open Forum Infectious Diseases: Oxford University Press US; 2023. p. ofac683.
Fritsche LG, Jin W, Admon AJ, Mukherjee B. Characterizing and predicting post-acute sequelae of SARS CoV-2 infection (PASC) in a large academic medical center in the US. Journal of Clinical Medicine. 2023;12:1328.
Haupert SR, Shi X, Chen C, Fritsche LG, Mukherjee B. A Case-Crossover Phenome-wide association study (PheWAS) for understanding Post-COVID-19 diagnosis patterns. Journal of Biomedical Informatics. 2022;136:104237.
Jiang S, Loomba J, Sharma S, Brown D. Vital Measurements of Hospitalized COVID-19 Patients as a Predictor of Long COVID: An EHR-based Cohort Study from the RECOVER Program in N3C. 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM): IEEE; 2022. p. 3023–30.
Kessler R, Philipp J, Wilfer J, Kostev K. Predictive Attributes for Developing Long COVID—A Study Using Machine Learning and Real-World Data from Primary Care Physicians in Germany. Journal of Clinical Medicine. 2023;12:3511.
Khullar D, Zhang Y, Zang C, Xu Z, Wang F, Weiner MG, et al. Racial/ethnic disparities in post-acute sequelae of SARS-CoV-2 infection in New York: an EHR-based cohort study from the RECOVER program. Journal of General Internal Medicine. 2023;38:1127–36.
Lorman V, Rao S, Jhaveri R, Case A, Mejias A, Pajor NM, et al. Understanding pediatric long COVID using a tree-based scan statistic approach: an EHR-based cohort study from the RECOVER Program. JAMIA Open. 2023;6:ooad016.
Nasir M, Cook N, Parras D, Mukherjee S, Miller G, Ferres JL, et al. Using Data Science and a Health Equity Lens to Identify Long-COVID Sequelae Among Medically Underserved Populations. Journal of Health Care for the Poor and Underserved. 2023;34:521–34.
Pfaff ER, Madlock-Brown C, Baratta JM, Bhatia A, Davis H, Girvin A, et al. Coding long COVID: characterizing a new disease through an ICD-10 lens. BMC Medicine. 2023;21:58.
Rao S, Lee GM, Razzaghi H, Lorman V, Mejias A, Pajor NM, et al. Clinical features and burden of postacute sequelae of SARS-CoV-2 infection in children and adolescents. JAMA Pediatrics. 2022;176:1000–9.
Reese JT, Blau H, Casiraghi E, Bergquist T, Loomba JJ, Callahan TJ, et al. Generalisable long COVID subtypes: findings from the NIH N3C and RECOVER programmes. EBioMedicine. 2023;87.
Sengupta S, Loomba J, Sharma S, Brown DE, Thorpe L, Haendel MA, et al. Analyzing historical diagnosis code data from NIH N3C and RECOVER Programs using deep learning to determine risk factors for Long Covid. 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM): IEEE; 2022. p. 2797–802.
Wang L, Foer D, MacPhaul E, Lo Y-C, Bates DW, Zhou L. PASCLex: A comprehensive post-acute sequelae of COVID-19 (PASC) symptom lexicon derived from electronic health record clinical notes. Journal of Biomedical Informatics. 2022;125:103951.
Zang C, Zhang Y, Xu J, Bian J, Morozyuk D, Schenck EJ, et al. Data-driven analysis to understand long COVID using electronic health records from the RECOVER initiative. Nature Communications. 2023;14:1948.
Zhang HG, Dagliati A, Shakeri Hossein Abad Z, Xiong X, Bonzel C-L, Xia Z, et al. International electronic health record-derived post-acute sequelae profiles of COVID-19 patients. NPJ Digital Medicine. 2022;5:81.
Zhang H, Zang C, Xu Z, Zhang Y, Xu J, Bian J, et al. Data-driven identification of post-acute SARS-CoV-2 infection subphenotypes. Nature Medicine. 2023;29:226–35.
