In the United States (US), the transition from paper to electronic health records (EHR) was notably accelerated by initiatives such as the Meaningful Use program under the HITECH Act1 from the Centers for Medicare & Medicaid Services. Beginning in 2009, this program was intended to enhance data capture and improve patient outcomes.2 Before this, the Veterans Health Administration (VHA), the largest integrated healthcare system in the US, had already independently developed and deployed the Veterans Health Information Systems and Technology Architecture (VistA), which captures medical records for millions of Veterans daily.3,4 Despite the widespread adoption of EHR systems in the US, the translation of EHR data into meaningful insights has been limited.5,6
Traditionally, EHR research and assessment have focused on structured data such as diagnostic codes, demographics, and lab results, leading to numerous insights in epidemiology, as well as advancements in clinical disease prediction and the identification of drug repurposing targets.7–11 Despite these notable achievements, structured data often provide only a limited understanding of a patient’s clinical presentation.6,12 Additional information within the medical record, often found in unstructured clinical text, can supply crucial details that would otherwise be unavailable. Prior estimates have even suggested that approximately 80% of medical data remains unstructured and unutilized following documentation.13
Previously, traditional Natural Language Processing (tNLP) techniques (e.g., rule-based approaches) have been used to extract insights from unstructured data.14 However, tNLP, while effective, requires extensive, time-intensive preprocessing and lacks contextual understanding.15,16 Building on work dating back to the 1960s,17 and more recently on the development of Google’s transformer model in 201718 and OpenAI’s release of ChatGPT in 2022,19 the emergence of foundational Large Language Models (LLMs) has enabled more scalable and flexible analysis of unstructured text. This scalability is particularly crucial within the VHA, which operates across more than 1,255 healthcare facilities, including 170 medical centers and over 1,074 outpatient sites.20 The diversity in language and documentation practices across these numerous sites necessitates an information extraction approach that can adapt to varying terminologies and nuances, making LLMs invaluable for extracting meaningful insights from the vast and varied unstructured clinical text within the VHA.
LLMs, which are trained on vast amounts of textual data and designed to understand context, are able to understand and generate text based on specific requirements. As such, these models are well suited to efficiently process and interpret natural language, adapt to medical terminologies, and analyze unstructured data with greater flexibility and efficiency than tNLP methods.21–23 The robustness of LLMs, based on their design and training data, enables more nuanced information extraction from medical records.24 By enhancing the ability to analyze unstructured text, these models hold the potential to significantly improve clinical decision-making, patient outcomes, and overall healthcare delivery.
Objective:
We introduce a comprehensive and scalable framework for leveraging LLMs to extract population-level medical insights from EHRs with a focus on unstructured data. Using female infertility as an illustrative example, we outline a methodology that includes (1) data collection and preparation, (2) theme discovery, (3) data integration, and (4) next steps for data analyses. We discuss the integration of LLM outputs with structured data to build comprehensive multimodal datasets that can be used to enhance more traditional predictive models. Theme discovery is particularly crucial for our framework, as the risk factors for female infertility are still not fully understood. Our approach leverages LLMs to uncover hidden patterns and themes within unstructured notes, providing new insights into potential risk factors that might otherwise remain undetected. This secure and scalable framework also facilitates the analysis of vast amounts of medical records within a reasonable timeframe. Given the significant push to prioritize women’s health research within VHA,25 this work addresses a critical gap where lack of attention and dedicated research efforts have left many questions unanswered. A visual outline of our proposed methods can be found in Fig. 1.
An underlying goal is to demonstrate how LLMs can be effectively integrated into the VHA’s existing infrastructure to unlock deeper insights from unstructured EHR data. By refining our understanding of conditions such as female infertility and improving data integration methods, this approach aims to enhance the precision of healthcare delivery and inform more targeted strategies within the VHA, ultimately improving patient outcomes across the Veteran population.
[Insert Fig. 1 Approximately here]
LLM Framework:
Data Collection and Preparation:
Defining Outcome
For our outcome of interest, we explore the medical condition of female infertility, defined by the inability to conceive after 12 months of unprotected sexual intercourse.26 Research on predicting female infertility has drawn the interest of several multidisciplinary groups in recent years.27 Causes for female infertility have been shown to be influenced by various factors such as deficient ovulation, physical disorders, ovarian diseases, endometriosis, cervical trauma, and defective implantation, among others.28 Risk prediction in female infertility has traditionally relied on structured EHR data (i.e., laboratory results, diagnostic codes, etc.). However, relying solely on structured data may overlook critical nuances and contextual information embedded within clinical narratives, which often capture subtleties in patient history, symptom progression, and physician observations. By incorporating unstructured data from clinical notes, our approach aims to reveal these hidden details, offering a more comprehensive understanding of the factors contributing to female infertility. This integration is essential for identifying less obvious risk factors and ensuring that the models developed are both thorough and reflective of the complexities inherent in clinical care. This paper details steps and strategies for using LLMs to create a multimodal dataset aimed at improving the detection of early risk factors for female infertility, which could, in turn, inform strategies for risk mitigation and improving patient outcomes.
Population Selection
For this example, we utilized a retrospective observational analysis framework using data from the VHA EHR database, accessed through an Azure Databricks v13.3 data lake.29 EHR data between January 18th, 2006, and January 18th, 2024, were assessed for this project.
Female patients aged 18 to 35 with at least one primary care encounter within the last three years of the observational period were included. We excluded those over the age of 35, as this is considered “advanced maternal age,” a known risk for fertility complications.26 Using the diagnostic code for female infertility (ICD-9 code: 628.9 or ICD-10 code: N97), we identified patients with a documented infertility diagnosis within their medical records. Female patients of the same age range with no diagnosis of infertility were included in this project as our controls, thus establishing our case-control framework. As a proof of concept, a random sample of 100 infertility cases and 100 controls was identified and examined.
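The inclusion rules above can be sketched in a few lines. This is a minimal illustration rather than the project's extraction code: the `Patient` record layout, the helper names, and the choice to match any code beginning with `N97` (rather than an exact code) are assumptions made for the example.

```python
import random
from dataclasses import dataclass, field

# Hypothetical record layout; the real VHA schema is not described in the text.
@dataclass
class Patient:
    patient_id: int
    sex: str
    age: int
    recent_primary_care: bool  # at least one primary care encounter in the last 3 years
    dx_codes: list = field(default_factory=list)

def has_infertility_dx(codes):
    """True if any code is ICD-9 628.9 or falls under ICD-10 N97 (prefix match assumed)."""
    return any(c == "628.9" or c.startswith("N97") for c in codes)

def eligible(p):
    """Female, aged 18 to 35, with a recent primary care encounter."""
    return p.sex == "F" and 18 <= p.age <= 35 and p.recent_primary_care

def sample_case_control(patients, n_per_group, seed=0):
    """Split eligible patients into cases and controls, then sample each group."""
    pool = [p for p in patients if eligible(p)]
    cases = [p for p in pool if has_infertility_dx(p.dx_codes)]
    controls = [p for p in pool if not has_infertility_dx(p.dx_codes)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return (rng.sample(cases, min(n_per_group, len(cases))),
            rng.sample(controls, min(n_per_group, len(controls))))
```

Calling `sample_case_control(patients, 100)` would mirror the 100 cases and 100 controls used in this proof of concept; fixing the seed is one way to keep the draw reproducible across runs.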
Data Collection
Among our sample population, we extracted structured data from the VHA’s EHR, including patient demographics, diagnostic codes, medications, and laboratory results. In total, over 450 structured variables were extracted (list of structured variables can be found in Supplement Table 1). Unstructured data collection focused on clinical notes from primary care and women’s health clinic encounters. For infertility cases, we randomly extracted one progress note per patient within 5 years prior to their first infertility diagnosis; for controls, we randomly extracted one progress note within 5 years before their latest primary care or women’s health clinic encounter.
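The note-sampling rule (one random progress note within 5 years before each patient's index date) might look like the sketch below. The note dictionaries, the function names, the approximation of five years as 5 × 365 days, and the exclusion of notes dated on the index date itself are all assumptions; the index date would be the first infertility diagnosis for cases and the latest qualifying encounter for controls.

```python
import random
from datetime import date, timedelta

LOOKBACK = timedelta(days=5 * 365)  # assumption: "5 years" approximated in days

def pick_note(notes, index_date, rng):
    """Return one randomly chosen note dated within the lookback window
    before index_date, or None if the patient has no qualifying note."""
    window = [n for n in notes
              if index_date - LOOKBACK <= n["date"] < index_date]
    return rng.choice(window) if window else None

def sample_notes(notes_by_patient, index_dates, seed=0):
    """Map patient_id -> one sampled note (patients without a note are skipped)."""
    rng = random.Random(seed)
    sampled = {}
    for pid, index_date in index_dates.items():
        note = pick_note(notes_by_patient.get(pid, []), index_date, rng)
        if note is not None:
            sampled[pid] = note
    return sampled
```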
Scalable Analytical Tools
For this proof-of-concept methodology, we focused on analyzing unstructured data with Meta’s Llama-3-8b,30 an open-source LLM, using the foundation model, with no additional fine-tuning, from HuggingFace’s Transformers.31 Models were served and run for inference on the ND A100 v4-series virtual machine.32 To scale analyses, the Ray open-source unified compute framework was used to distribute workloads among Databricks GPU clusters.33 This configuration optimized the analysis and reduced processing time. All analyses operated within the Databricks Azure environment on the VA Enterprise Cloud.
Ethics Statement
This quality assessment project received a determination of non-research from the Stanford Institutional Review Board (Stanford University, Stanford, CA, USA), Protocol #74380.
Theme Discovery:
Theme discovery is a pivotal step in achieving the primary objective of this work – creating a comprehensive multimodal dataset that integrates both structured and unstructured data to enhance predictive models. By identifying and understanding the underlying themes within unstructured clinical notes, we can extract valuable insights that may not be captured through structured data alone. This section details the initial process for identifying themes within the unstructured clinical notes. By comparing the prevalence of themes among our infertility and control sample populations, we identified key differentiating themes between groups.
Prompt Engineering
We employed iterative prompt engineering to develop and refine our prompts. Prompt engineering is the process of designing and refining the questions or instructions provided to a language model to guide its output.34 This process not only uncovers new themes, but also refines previously identified ones, ensuring a robust and comprehensive thematic exploration.
To extract high-level themes from the text, we developed initial prompts that instructed the model to identify and extract themes present within the clinical note. We used zero-, one-, and few-shot prompting. Zero-shot prompting obtained initial results from the model without additional influence, serving as a baseline for theme identification. One- and few-shot prompting enabled in-context learning by providing one or a few sample notes with identified themes, which helped the model better understand the context and nuances of the task. Figure 2a details an example of a zero-shot prompt used to extract these themes from each clinical note, and Figure 2b illustrates an example of a one-shot prompt we tested. Based on sampled results, we iteratively refined our prompts, enabling the nuanced extraction of themes.
This iterative refinement of prompts was guided by subject matter experts (SMEs), including medical doctors (MDs) and data scientists, ensuring that the extracted themes were both clinically relevant and aligned with the goals of our assessment. The MDs were all board-certified, with extensive knowledge in reproductive medicine. They played a crucial role in reviewing the model’s outputs by providing expert feedback on the clinical validity of the identified themes, thereby helping to ensure that the themes were reflective of real-world clinical scenarios. The data scientists provided technical explanations of the model’s behavior, ensuring that the refinements were both clinically sound and technically feasible. During this process, SMEs engaged in discussions to resolve any differing opinions. In cases where discrepancies arose, a consensus was reached through collaborative discussion, with each expert providing insights based on their knowledge and clinical practice.
As we refined the prompts, the few-shot prompt approach yielded more precise and meaningful theme extraction. In this context, “meaningful” themes refer to those that can offer valuable clinical insights, contributing to a deeper understanding of female infertility by highlighting potential risk factors or patterns that might otherwise be overlooked.
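Zero-, one-, and few-shot prompts differ only in how many worked examples precede the target note, which a small helper can make concrete. The sketch below is illustrative: the instruction wording, the `Note:`/`Themes:` layout, and the `build_prompt` helper are hypothetical, and the prompts actually used are those shown in Figure 2.

```python
# Illustrative instruction only; the prompts actually used appear in Figure 2.
INSTRUCTION = (
    "You are reviewing a clinical progress note. "
    "List the high-level themes present in the note, one per line."
)

def build_prompt(note, examples=()):
    """Assemble a prompt with zero, one, or several worked examples.

    examples: sequence of (example_note, example_themes) pairs, where
    example_themes is a list of strings. An empty sequence gives a
    zero-shot prompt; one pair gives one-shot; more give few-shot.
    """
    parts = [INSTRUCTION]
    for ex_note, ex_themes in examples:
        parts.append(f"Note:\n{ex_note}\nThemes:\n" + "\n".join(ex_themes))
    # The target note ends with an open "Themes:" slot for the model to complete.
    parts.append(f"Note:\n{note}\nThemes:")
    return "\n\n".join(parts)
```

Keeping the instruction fixed and varying only the `examples` argument makes it easier to attribute changes in the extracted themes to the in-context examples rather than to rewording.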
[Figure 2 Approximately here]
Class-Specific Themes Analysis
Once our few-shot prompt approach identified and extracted relevant themes, we conducted a comparison analysis to determine which themes were prevalent among each sample group. By calculating the absolute difference in the prevalence of themes among the case and control cohorts, we identified ‘infertility concerns,’ ‘pregnancy planning and counseling,’ and ‘ovulation disorders’ to be most divergent between case and control groups (Table 1). These findings were further validated with SMEs, ensuring they accurately align with established clinical understandings of female infertility. If more granularity was needed, for instance if one of the identified themes was related to ‘ovulation disorders,’ we would delve into the specific keywords associated with this theme, such as ‘polycystic ovary syndrome (PCOS),’ ‘irregular menstrual cycles,’ or ‘anovulation’ for further adjudication.
Table 1
Top 20 Themes Based upon Prevalence Difference Among Groups
High Level Theme | Infertility Group, n | Control Group, n | Difference |
Infertility concerns | 17 | 3 | 14 |
Pregnancy planning and counseling | 12 | 2 | 10 |
Ovulation disorders | 15 | 5 | 10 |
Hormonal concerns | 9 | 0 | 9 |
HPV* results and screening | 5 | 14 | 9 |
Screening recommendations | 20 | 25 | 5 |
Medication review | 10 | 15 | 5 |
Allergy and sinus conditions | 0 | 5 | 5 |
Appointment management | 12 | 8 | 4 |
Mental health concerns | 10 | 14 | 4 |
Abnormal pap smear test results** | 8 | 4 | 4 |
Patient education and counseling | 1 | 5 | 4 |
Pain in upper extremities | 0 | 4 | 4 |
Headaches and migraines | 4 | 1 | 3 |
Bacterial vaginosis | 4 | 7 | 3 |
Painful intercourse | 1 | 4 | 3 |
Toxic exposure concerns | 0 | 3 | 3 |
Treatment plan | 0 | 3 | 3 |
Iron deficiency | 3 | 1 | 2 |
Hematological abnormalities | 2 | 0 | 2 |
*HPV = human papillomavirus **Pap smear = Papanicolaou test is a method of cervical screening used to detect potentially precancerous and cancerous processes in the cervix |
[Table 1 Approximately here]
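The ranking behind Table 1 reduces to counting, per group, how many notes mention each theme and sorting by the absolute difference. The sketch below assumes each note has already been reduced to a list of theme labels; the function name and the alphabetical tie-break are choices made for the example.

```python
from collections import Counter

def theme_prevalence_gap(case_themes, control_themes, top_k=5):
    """Rank themes by absolute difference in note counts between groups.

    case_themes / control_themes: one collection of theme labels per note.
    Each theme is counted at most once per note.
    """
    case_n = Counter(t for note in case_themes for t in set(note))
    ctrl_n = Counter(t for note in control_themes for t in set(note))
    rows = [(t, case_n[t], ctrl_n[t], abs(case_n[t] - ctrl_n[t]))
            for t in set(case_n) | set(ctrl_n)]
    # Largest gap first; ties broken alphabetically for a stable order.
    rows.sort(key=lambda r: (-r[3], r[0]))
    return rows[:top_k]
```

Because the ranking uses an absolute difference, it also surfaces themes that are more common among controls (as with HPV results and screening in Table 1), so the direction of each difference still needs to be read from the group counts.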
We identified five themes that most effectively distinguished between the infertility cases and controls: (1) Infertility concerns, (2) Pregnancy planning and counseling, (3) Ovulation disorders, (4) Hormonal concerns, and (5) Abnormal pap smear test results. We selected a cutoff of five themes for further analysis to strike a balance between comprehensiveness and manageability. This decision was influenced by the upcoming formal thematic analysis, where it was essential to focus on a set number of themes that could be effectively processed by the model. Limiting the number of themes to five allowed us to maintain analytical depth without overwhelming the model, ensuring that each theme could be thoroughly explored and integrated into future analyses.
Formal Thematic Analysis:
Once we identified the key themes, we developed additional prompts to determine if these themes were present within the clinical notes. Employing a similar approach to prompt engineering as described earlier, we utilized zero-, one-, and few-shot prompting techniques. For instance, Fig. 3 illustrates the instructions for a zero-shot prompt. Subsequent iterations and consultations with SMEs refined our few-shot prompting method, enhancing the model’s ability to identify and confidently assert the presence of specific themes within the clinical text. Figure 4 showcases a sample output from the model, using fictitious patient data for illustrative purposes, detailing both the prevalence and the confidence levels of the identified themes.
[Figure 3 Approximately here]
[Figure 4 Approximately here]
By configuring the model to estimate the likelihood of a theme’s presence within the text, rather than simply categorizing presence as a binary variable, we generated numerical probabilities. This methodology facilitates a more nuanced analysis and can contribute significantly to the development of predictive models that are both more sophisticated and informative.
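Turning a model response into per-theme probabilities requires a parsing step that tolerates malformed output. The sketch below assumes the model was instructed to reply with a JSON object mapping theme names to numbers; that response format, the `THEMES` list, and the function name are assumptions for illustration and may differ from the output shown in Figure 4.

```python
import json

THEMES = [
    "infertility concerns",
    "pregnancy planning and counseling",
    "ovulation disorders",
]  # subset of the selected themes, for illustration

def parse_theme_probabilities(raw_output, themes=THEMES):
    """Turn a model response into {theme: probability in [0, 1]}.

    Assumes the model was asked to reply with a JSON object; themes that are
    missing or unparseable are recorded as None rather than guessed.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    result = {}
    for theme in themes:
        value = data.get(theme)
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            result[theme] = min(1.0, max(0.0, float(value)))  # clamp to [0, 1]
        else:
            result[theme] = None
    return result
```

Recording `None` for unparseable themes, instead of defaulting to zero, keeps a failed extraction distinguishable from a theme the model judged to be absent.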
Data Integration:
The final step in our analysis preparation involved transforming the data into a tabular format, where each row corresponded to a patient record and the columns represented the numerical probabilities that the extracted themes were present within the unstructured text. This data configuration allowed us to merge the newly created probability-encoded variables with our curated structured dataset, which included diagnostic codes, laboratory results, and demographic information for each patient. The resultant multimodal dataset allows for the simultaneous analysis of variables derived from both structured and unstructured data sources. Table 2 details an example of the multimodal dataset we were able to develop.
Table 2
Example of multimodal dataset with extracted theme variables.
| Structured Data | Unstructured Data |
Patient ID | Race | Ethnicity | BMI | Infertility dx | Pap Smear* Date | Pap Result | …. | Theme: infertility concerns | Theme: ovulation disorders | Theme: abnormal pap | …. |
1 | African American | Hispanic | 39.8 | 1 | 06/17/2023 | Normal | …. | 0.95 | 0.20 | 0.0 | …. |
2 | White | Non-Hispanic | 26.2 | 0 | 08/26/2021 | Normal | …. | 0.0 | 0.10 | 0.0 | …. |
3 | White | Hispanic | 40.1 | 1 | 02/02/2019 | Abnormal | …. | 0.81 | 0.62 | 0.75 | …. |
4 | Asian | Non-Hispanic | 35.0 | 1 | 04/07/2024 | Normal | …. | 0.91 | 0.76 | 0.0 | …. |
5 | Pacific Islander | Non-Hispanic | 29.8 | 0 | 03/15/2018 | Abnormal | …. | 0.05 | 0.11 | 0.80 | …. |
6 | White | Hispanic | 31.5 | 0 | 06/18/2022 | Abnormal | …. | 0.06 | 0.08 | 0.70 | …. |
7 | African American | Non-Hispanic | 27.8 | 0 | 10/29/2023 | Normal | …. | 0.01 | 0.05 | 0.05 | …. |
8 | White | Non-Hispanic | 24.6 | 1 | 11/05/2021 | Normal | …. | 0.96 | 0.80 | 0.10 | …. |
9 | Native American | Hispanic | 41.8 | 1 | 05/12/2023 | Normal | …. | 0.80 | 0.67 | 0.05 | …. |
10 | Asian | Non-Hispanic | 29.1 | 0 | 07/10/2021 | Abnormal | …. | 0.03 | 0.06 | 0.85 | …. |
…. | …. | …. | …. | …. | …. | …. | …. | …. | …. | …. | …. |
*Pap smear = Papanicolaou test is a method of cervical screening used to detect potentially precancerous and cancerous processes in the cervix |
[Table 2 Approximately here]
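The merge that produces a table like Table 2 is a left join of theme probabilities onto the structured rows, keyed by patient. A minimal standard-library sketch follows; the field names (`patient_id`, the `theme: ` column prefix) and the function name are assumptions, and a DataFrame join would serve equally well in practice.

```python
def merge_modalities(structured_rows, theme_probs):
    """Left-join theme probabilities onto structured rows by patient_id.

    structured_rows: list of dicts, each with a "patient_id" key.
    theme_probs: {patient_id: {theme: probability}}.
    Patients without a processed note get None for every theme column.
    """
    theme_cols = sorted({t for probs in theme_probs.values() for t in probs})
    merged = []
    for row in structured_rows:
        probs = theme_probs.get(row["patient_id"], {})
        out = dict(row)  # copy so the input rows are not mutated
        for theme in theme_cols:
            out[f"theme: {theme}"] = probs.get(theme)
        merged.append(out)
    return merged
```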
This approach, from class-specific theme identification through data structuring for pattern analysis, emphasizes the potential of unsupervised learning techniques in extracting meaningful insights from complex, unstructured medical datasets. Through careful application of these methods, we can advance our understanding of specific medical conditions, like female infertility, and contribute valuable knowledge that can be used to improve patient outcomes.
Next Steps:
Once we have established the multimodal dataset, both advanced and traditional statistical analyses can be employed. For instance, we may choose to conduct exploratory data analysis (EDA) to further identify key associations between our extracted features and the diagnosis of infertility. By applying statistical tests, such as Pearson’s r or chi-square tests, we can compare the prevalence of specific features between our case and control cohorts.
We can additionally leverage machine learning techniques to enhance our understanding of female infertility, identify those at highest risk, or potentially even identify previously unknown modifiable risk factors. These techniques could include traditional approaches such as multivariate logistic regression, or more complex algorithms such as random forests and gradient boosting machines, known for their predictive accuracy and robustness against overfitting.35 For instance, Lasso logistic regression could identify the most predictive features of female infertility by utilizing the algorithm's inherent feature selection capabilities. Additionally, models capable of processing sequential data, such as time-series models36 or Long Short-Term Memory (LSTM) networks,37,38 could analyze temporal patterns in clinical notes, potentially uncovering longitudinal risk factors for infertility.
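As one concrete instance of such a comparison, a chi-square test on a 2×2 table (theme present or absent, by group) can be computed directly. The sketch below implements the Pearson statistic without continuity correction; the function name is hypothetical, and the p-value uses the identity that, with one degree of freedom, the upper-tail probability equals erfc(√(χ²/2)).

```python
import math

def chi_square_2x2(case_yes, case_total, ctrl_yes, ctrl_total):
    """Pearson chi-square test (no continuity correction) on a 2x2 table.

    Returns (statistic, p_value). With 1 degree of freedom the upper-tail
    probability is erfc(sqrt(statistic / 2)).
    """
    a, b = case_yes, case_total - case_yes
    c, d = ctrl_yes, ctrl_total - ctrl_yes
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table sums to zero")
    stat = n * (a * d - b * c) ** 2 / denom
    return stat, math.erfc(math.sqrt(stat / 2))
```

For example, taking the ‘infertility concerns’ counts from Table 1 (17 versus 3) and assuming 100 notes per group, `chi_square_2x2(17, 100, 3, 100)` gives a statistic of about 10.89. For themes with very small counts, an exact test may be more appropriate than this approximation.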