This is the first scoping review to identify long COVID phenotypes, predictors, and risk factors based on EHR data. We found 20 articles that met the eligibility criteria. The articles were classified by how they defined long COVID, the methodologies they used, and the significant risk factors or phenotypes they identified. ICD-10 codes (U09.9 or B94.8) were the most common markers. A large majority of the studies reported poor general well-being, respiratory conditions, and cardiovascular conditions as significant features.
The studies collectively illuminate the heterogeneous nature of long COVID, revealing that it does not manifest as a uniform condition. Instead, it presents as diverse phenotypic clusters, which include respiratory, neuropsychiatric, cardiovascular, and pain and fatigue subtypes [42]. These subtypes are characterized by distinct clinical features, patient demographics, and associations with various organ systems. Ten of the twelve symptom patterns identified by the NIH RECOVER initiative were considered as features by the studies and fell under broad, significant categories (Fig. 5). Thirst and changes in sexual desire were absent as individual features but may fall under broader symptom characterizations [42]. Phenotypic clustering with machine learning methods may provide a useful approach to examining commonalities and marked differences among phenotypes, and may enhance our understanding of long COVID's heterogeneity. While several of the phenotypes found by the studies are commonly used to characterize long COVID, latent phenotypes such as substance use disorders (e.g., opioid use) and conditions of the genital organs offer distinctive ways to define the disease.
Beyond the symptoms of long COVID, some studies reported predictors and risk factors for developing the condition. We noted that SDOH and demographic information cannot plausibly be used to form a clinical definition of long COVID but may indicate the likelihood of developing the condition. Such studies included patients who recovered quickly from their acute infection in their test populations, thereby diluting the conclusions that can be drawn from observed symptoms, diagnoses, and medications.
Surprisingly, we found limited global use of EHR data in defining or characterizing long COVID, with 85% of the included studies utilizing datasets composed of US patients. This geographical disparity in research distribution may be related mainly to the privacy regulations that govern access to EHR data in different countries. The lack of diverse EHR-based studies may hinder our understanding of long COVID's complex nature and limit the generalizability of findings beyond the studied regions. We therefore recommend that future international studies harness the full potential of EHR data for long COVID research, enabling the development of more effective interventions.
The studies identified a broad range of symptoms, complications, and clinical conditions significantly associated with long COVID. However, only one study [39] examined causality, using Bayesian structural time series models. The breadth of associated conditions, together with the scarcity of causal analyses, underscores the need for a more comprehensive understanding of these clinical conditions and the causal relationships among them. Moreover, while the studies primarily relied on ICD-10 codes to define long COVID, only one study leveraged the potential of NLP for extracting information from unstructured clinical notes in EHR data. Combined with longitudinal EHR data, this approach can enhance our understanding of the patterns, progression, treatment, and management of long COVID.
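The code-based definition discussed above can be illustrated with a minimal Python sketch. It assumes diagnoses are available as flat (patient ID, ICD-10 code) pairs, a hypothetical layout chosen only for illustration and not a schema used by any of the reviewed studies. The function and variable names are likewise illustrative.

```python
# ICD-10 codes named in this review as the most common long COVID markers.
LONG_COVID_CODES = {"U09.9", "B94.8"}


def normalize_icd10(code: str) -> str:
    """Uppercase and strip spaces/dots so 'B 94.8', 'b94.8' and 'B948' compare equal."""
    return code.upper().replace(" ", "").replace(".", "")


_NORMALIZED_MARKERS = {normalize_icd10(c) for c in LONG_COVID_CODES}


def flag_long_covid(diagnoses):
    """Return the set of patient IDs with at least one marker code.

    `diagnoses` is an iterable of (patient_id, icd10_code) pairs. This flat
    layout is an assumption for illustration, not a real EHR schema.
    """
    flagged = set()
    for patient_id, code in diagnoses:
        if normalize_icd10(code) in _NORMALIZED_MARKERS:
            flagged.add(patient_id)
    return flagged


# Hypothetical example rows (not data from any study).
rows = [("p1", "U09.9"), ("p2", "J45.909"), ("p3", "b 94.8"), ("p1", "I10")]
flagged_patients = flag_long_covid(rows)
```

A definition of this kind only captures patients who received one of the marker codes, which is why information recorded solely in clinical notes would be missed.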
While the terms EHR, EMR, and electronic patient record (EPR) are often used interchangeably, the systems they describe differ in scope and usage. EHRs were the most frequently encountered systems in this review. They serve as digital repositories of a patient's complete medical history from all healthcare providers and are intended for sharing with other healthcare entities, practices, and hospitals. EMRs focus on a single practice or hospital. Some EMR systems offer integrated care, possibly within larger hospital corporations. Lastly, the term EPR is used primarily in Europe and did not appear in the included articles. The localized usage of the term EPR may indicate that other terminologies are used globally that we did not account for in our literature search strategy, as our search was confined to specific health information systems. Future studies would benefit from including broader search terms, such as data hubs, data lakes, registries, and repositories, to capture a more comprehensive scope of relevant literature.
The included studies shared some common limitations. There is no widely accepted definition of long COVID to date, so the disease was defined based on various arbitrary time intervals. Additionally, the symptoms and conditions of long COVID were examined mostly in the initial months. Most studies failed to differentiate between incident and prevalent symptoms, contributing to ambiguity in characterizing long COVID. To address these challenges, a standardized research protocol that facilitates systematic tracking of symptoms seems imperative. Leveraging longitudinal EHR-derived data can provide a more comprehensive understanding of emerging symptoms and help monitor longer-term trends and effects.
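The distinction between incident and prevalent symptoms can be made concrete with a short Python sketch. It labels a symptom recorded after infection as prevalent if it was also recorded during a lookback window before the index date, and as incident otherwise. The function name, the input layout, and the default windows (365 days of lookback, follow-up starting 28 days after the index date) are assumptions made for illustration; they are not taken from the reviewed studies, which used varying time intervals.

```python
from datetime import date, timedelta


def classify_symptoms(index_date, observations, lookback_days=365, min_days_after=28):
    """Label each post-infection symptom as 'incident' or 'prevalent'.

    observations: iterable of (symptom, date) pairs for one patient.
    lookback_days and min_days_after are illustrative defaults, not
    values taken from the reviewed studies.
    """
    lookback_start = index_date - timedelta(days=lookback_days)
    followup_start = index_date + timedelta(days=min_days_after)

    before, after = set(), set()
    for symptom, seen_on in observations:
        if lookback_start <= seen_on < index_date:
            before.add(symptom)   # recorded before the infection
        elif seen_on >= followup_start:
            after.add(symptom)    # recorded in the post-acute period

    # Symptoms seen only during the acute phase or only before infection are ignored.
    return {s: ("prevalent" if s in before else "incident") for s in after}


# Hypothetical patient timeline (not data from any study).
labels = classify_symptoms(
    date(2021, 1, 10),
    [
        ("fatigue", date(2020, 11, 1)),
        ("fatigue", date(2021, 3, 1)),
        ("dyspnea", date(2021, 4, 1)),
        ("cough", date(2021, 1, 15)),
    ],
)
```

In this example, fatigue would be labeled prevalent and dyspnea incident, while the cough recorded during the acute phase is excluded. Changing either window changes the labels, which illustrates why arbitrary time intervals make results hard to compare across studies.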
Nearly all studies acknowledged data-related limitations. The use of diagnosis codes is a limitation because they may not capture all signs, symptoms, or laboratory results found in clinical notes. This reliance on diagnostic codes could result in missing information and biases. Additionally, the intensity of the pandemic and misinformation might have introduced confirmation bias among healthcare providers and patients. Some articles [34, 45] acknowledged the need for replication studies in other cohorts. Several articles excluded hospitalized COVID-19 patients, so their cohorts might not reflect the complete spectrum of long COVID. This limitation highlights the importance of including more hospitalized patients in future studies.
Temporal bias is an issue because the choice of time windows for analysis varies among studies. Some studies expressed concerns about not accounting for within-person time-varying confounders, such as changes in health-seeking behavior during the pandemic. The studies’ requirement of pre- and post-COVID-19 visits was acknowledged as potentially biasing cohorts toward patients with more complex health histories. The cohort case window ratio is a subjective parameter, which introduces variability in study design. Additionally, some studies mentioned potential timing biases related to the development and implementation of specific diagnosis codes. Some raised concerns about how well the data represented various population groups and, consequently, about the generalizability of the findings. The absence of information on medications used for COVID-19 therapy, particularly in severe cases, was noted as a potential limitation, as was the lack of data on viral variants for individual patients.
Our study has several limitations. First, the exclusion of non-English articles may introduce language bias. Second, given the evolving nature of the topic, the omission of pre-prints and conference articles could result in the loss of crucial information. Additionally, it is possible that some articles were missed during the screening process, although we mitigated this by conducting reference checks and backward searches. Another recurring limitation of the included studies was the lack of comprehensive insight into the underlying mechanisms that lead to long COVID. Furthermore, the studies did not explore the indirect effects of long COVID, such as the social, economic, and behavioral changes that may arise due to the condition. This knowledge gap poses a significant challenge in fully understanding and addressing long COVID.
In conclusion, this scoping review has provided a comprehensive overview of state-of-the-art research on long COVID that uses EHR data as the primary source for defining and characterizing the condition. The findings suggest that while a consensus on the definition of long COVID remains elusive, ICD-10 codes are commonly used for identification, and poor general well-being, respiratory conditions, and cardiovascular conditions are consistently associated with long COVID. The use of EHR data for characterizing long COVID is concentrated primarily in the US, highlighting the need for more international studies. Moreover, while data science techniques are widely employed, validation and causality assessments are largely lacking, which points to the need for more robust methodologies. The complex nature of long COVID, encompassing various symptoms and clinical conditions, underscores the need for more in-depth studies, including those leveraging longitudinal EHR data. Despite the gaps, this review serves as a foundation for future research efforts aimed at harnessing the potential of EHR data to better understand the epidemiology of long COVID.