This section provides an overview of the findings from our scoping review, organized into the subsections that emerged during our analysis. We begin with the characteristics of the studies and the types of data they used. Next, we analyze the studies from a technical perspective, including data preprocessing techniques, SSRL model types, models for downstream tasks, evaluation metrics, and interpretability techniques. Table 1 summarizes the key technical features of the studies, and Table 2 provides essential information from the medical perspective.
Study characteristics
As illustrated in Fig. 2, most of the research (n = 33, 72%) was conducted by interdisciplinary teams of medical experts and data scientists. The United States led in the number of published studies (n = 21, 46%), followed by China (n = 9, 20%) and the United Kingdom (n = 4, 9%). Despite this geographic diversity, only a few studies (n = 11, 24%) involved international collaborations. For details on the authors and research teams, refer to Table S2.
Type of model and trend
Five main model types have been identified for representing EHR categorical data: Transformer-based models (n = 20, 43%), Autoencoder (AE) based models (n = 13, 28%), Graph Neural Network (GNN) based models (n = 8, 17%), Word-embedding models (n = 3, 7%), and Recurrent Neural Network (RNN) based models (n = 3, 7%). Studies that combine two or more model types are counted once for each corresponding model type. To assess their impact on research, we analyzed the number of citations for each model type.
Figure 3 shows the papers published from January 2019 to December 2023, their citation counts by July 2024, and their corresponding model types. Based on the number of citations, Transformers, RNN, and GNN models are the most impactful, with Transformer models showing particularly high citation counts for papers published from 2020 to 2023.
Type of data
Studies utilize various data types to represent patients and medical knowledge. Typically, patient representation is derived from EHRs, incorporating both categorical and non-categorical data. Additionally, external medical knowledge can be integrated into models through data collected beyond EHRs. For detailed information on the modalities used across studies, see Table S3.
EHR
Among the categorical data types in EHRs, diagnosis codes are the most frequently used (n = 45, 98%), including ICD-9, ICD-10-CM, and SNOMED-CT. Medication codes (n = 32, 70%), such as ATC and SNOMED-CT, and procedure codes (n = 20, 43%), such as CPT and ICD-10-PCS, are also widely used. To enhance patient representation, non-categorical data may also be included. The most common non-categorical data types are patient age (n = 19, 41%), clinical measurement values (n = 15, 33%) such as BMI, heart rate, and systolic blood pressure, and clinical narratives from physicians and practitioners (n = 7, 15%).
The integration of external data sources can further enrich patient profiles. Medical knowledge graphs and ontologies provide rich hierarchical information, while medical text corpora contain expert medical knowledge. These external sources offer a comprehensive understanding of clinical concept interactions. Among external data sources, ontologies are the most used (n = 7, 15%); they are employed to obtain medical concept embeddings22–28 and for SSRL training tasks23. Other notable external data sources include medical knowledge graphs25,29 and medical text corpora30.
Table 1
A summary of the published year, data type, patient number, model type, and task type of the studies.
Columns are grouped as follows: the Dataset group covers categorical data (procedure, diagnosis, medication, lab. tests), numerical data (age, measurement), patient number (unlabeled, labeled), and dataset type (private, public); the remaining groups are model type (AE, Transformer, GNN, RNN, Word embedding, Others) and task type (classification, clustering, regression).
| model name | year | procedure | diagnosis | medication | lab. tests | age | measurement | unlabeled | labeled | private | public | AE | Transformer | GNN | RNN | Word embedding | Others | classification | clustering | regression |
Liang et al.31 | 2019 | | x | x | x | | | < 1k | < 1k | x | | | | | | | a | x | | |
de Lusignan et al.32 | 2019 | | x | | | x | | 11k | | x | | x | | | | | | | x | |
G-BERT22 | 2019 | | x | x | | | | 83k | 6k | | 1 | | x | x | | | | x | | |
Ruan et al.33 | 2019 | | x | x | x | x | x | 5k | 5k | x | | x | | | | | | x | x | |
BEHRT34 | 2020 | | x | | | x | | 1.6M | 700k | x | | | x | | | | | x | x | |
ConvAE35 | 2020 | x | x | x | x | | | 1.6M | | x | | x | | | | | | | x | |
Enhanced Reg36 | 2020 | | x | | x | x | x | 104.4k | 73k | x | | x | | | | | | x | | |
CLMBR14 | 2021 | x | x | x | x | x | | 3.4M | 131k | x | | | | | x | | | x | | |
EDisease30 | 2021 | | x | | x | x | | 1M | 816k | x | | | x | | | | | x | x | |
ME2Vec37 | 2021 | x | x | x | | | | 111k | 11k | x | 2 | | | x | | | | x | x | |
PLGMNN38 | 2021 | | | x | x | | x | < 1k | < 1k | x | 1 | | | | | | b | x | | |
Med-BERT39 | 2021 | | x | | | | | 28.4M | 43k | x | | | x | | | | | x | | |
Huang et al.40 | 2021 | x | x | x | x | x | x | 105k | | x | | | | | | x | | | x | |
BRLTM41 | 2021 | x | x | x | | x | | 44k | 10k | x | | | x | | | | | x | | |
Phe2vec42 | 2021 | x | x | x | x | | | 300k | | x | | | | | | x | | | x | |
CEHR-BERT43 | 2021 | x | x | x | | | | 2.4M | 591k | x | | | x | | | | | x | x | |
DICE44 | 2021 | | x | x | x | x | x | 1k | 1k | x | | x | | | | | | x | x | |
Chushig-Muzo et al.45 | 2021 | | x | x | | | | 6.5k | | x | | x | | | | | | | x | |
Poulain et al.46 | 2021 | | x | x | | | x | 7k | | | 3 | | x | | | | | | | x |
Kumar et al.29 | 2022 | x | x | x | x | | | 29k | 29k | | 1 | x | | | | | | x | | |
Shao et al.47 | 2022 | | x | x | x | | | 30k | | x | | x | | | | | | | x | |
Claim-PT23 | 2022 | x | x | x | | x | | 1.9M | 1k | x | | | x | | | | | x | | |
Navaz et al.48 | 2022 | | x | | | x | x | 5k | | | 4,5 | x | | | | | | x | x | |
CEHR-GAN-BERT49 | 2022 | x | x | x | | x | | 55k | < 1k | | 3,6 | | x | | | | | x | | |
CEF-CL50 | 2022 | x | x | | | | | 48k | 48k | x | 3 | | | | | | c | x | | |
ADADIAG51 | 2022 | | x | | x | | | 28k | 6k | x | 6 | | x | | | | | x | x | |
Manzini et al.52 | 2022 | | x | x | | x | x | 11k | | x | | x | | | | | | | x | |
Herp et al.53 | 2023 | x | x | x | | | | 19k | 19k | x | | x | | | | | | x | x | |
MMMGCL28 | 2023 | x | x | | x | | | 14k | 4k | | 1,2 | | | x | | | | x | | |
MedM-PLM26 | 2023 | | x | x | | | | 40k | 5k | | 1 | | x | x | | | | x | | |
Ta et al.54 | 2023 | x | x | x | x | | x | 11k | | x | | | | | | x | | | x | |
Hi-BEHRT55 | 2023 | x | x | x | x | x | x | 2.8M | 406k | x | | | x | | | | | x | | |
CLMBR-256 | 2023 | x | x | x | x | | x | 1.8M | 157k | x | | | x | | x | | | x | | |
Sherbet24 | 2023 | | x | | | | | 46k | 7k | | 1,2 | | | x | | | | x | | |
Ru et al.57 | 2023 | x | x | x | x | | | 299k | 31k | x | | | x | | | | | x | | |
SeqCare25 | 2023 | x | x | x | x | | | 14k | 2k | x | 1 | | | x | | | | x | | |
Liu et al.27 | 2023 | | x | | | | | 2k | | | 1 | x | | | | | | | x | |
IPDM58 | 2023 | | x | | | | | 119k | 24k | | 1,7 | | x | | | | | x | x | |
ExMed-BERT13 | 2023 | | x | x | x | x | x | 3.5M | 80k | x | | | x | | | | | x | | |
Pellegrini et al.59 | 2023 | | x | x | x | x | x | 22k | 22k | | 1,8,9 | | x | x | | | | x | x | |
Jones et al.60 | 2023 | | x | x | | | | 27k | 11k | x | | x | | | | | | x | | |
TransformEHR61 | 2023 | | x | | | x | | 6.5M | 10k | x | | | x | | | | | x | | |
CLMBR-362 | 2023 | x | x | x | | | x | 242k | 18k | x | | | | | x | | | x | | |
Profile model63 | 2024 | | x | | | x | | 1M | 53k | x | | | x | | | | | x | x | |
Seki et al.64 | 2024 | | x | x | x | | x | 32k | 15k | x | | | | x | | | | x | x | |
Foresight65 | 2024 | x | x | x | | x | | 710k | 37k | x | | | x | | | | | x | | |
Other model type: a: Deep belief network, b: local-global memory neural network, c: contrastive learning
Public dataset: 1: MIMIC-III, 2: eICU, 3: All of Us Program, 4: epidemiological COVID-19 data, 5: Framingham offspring heart study, 6: MIMIC-IV, 7: Alzheimer’s Disease Neuroimaging Initiative (ADNI), 8: TADPOLE, 9: Sepsis Prediction Dataset
Technical aspects
Most models treat each data element as a distinct unit or token (n = 44, 95%). The identified data preprocessing techniques address various aspects such as numerical data, categorical data, data cleaning, and data shuffling. Some studies (n = 7, 15%) performed categorization by converting exact ages into intervals and clinical measurements into categories like high, normal, and low, based on clinical evaluation standards33,36,40,46,55,56,62. When maintaining the numerical nature of data, missing value imputation30,52,59 and value normalization33,38,44,59 have also been employed.
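As an illustration, the categorization step above might look like the following sketch. The interval width and the reference bounds are hypothetical, not taken from any reviewed study; real pipelines use clinically validated thresholds.

```python
def bin_age(age: int, width: int = 10) -> str:
    """Map an exact age to a decade-interval token, e.g. 42 -> 'age_40_49'."""
    lo = (age // width) * width
    return f"age_{lo}_{lo + width - 1}"

def categorize_measurement(value: float, low: float, high: float) -> str:
    """Map a clinical measurement to 'low'/'normal'/'high' given reference bounds."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

# Hypothetical systolic blood pressure reference range (mmHg).
tokens = [bin_age(42), categorize_measurement(150.0, low=90.0, high=140.0)]
# tokens -> ['age_40_49', 'high']
```

Converting continuous values into such tokens lets the same embedding machinery handle categorical codes and numerical measurements uniformly.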
Some studies standardize data elements by mapping them to known ontologies23,35,51,56. A common approach to reducing dimensionality and data sparsity is to use only the first digits of codes, effectively replacing each code with its parent node in the hierarchical ontology (n = 15, 33%).
In terms of data cleaning, typical practices include the removal of rare medical terms14,35,36,62,63,65 and the elimination of duplicated terms within a specific time range22,35,42,54. Additionally, shuffling the order of medical concepts within a time window40,54 was shown to help the model generalize better by mitigating the impact of arbitrary sequencing and emphasizing co-occurrence over specific order. This method can also be considered a form of data augmentation. Detailed information on data preprocessing across studies can be found in Table S4.
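A minimal sketch of code truncation, within-window deduplication, and within-window shuffling might look as follows. The ICD-9-style codes and the three-digit truncation are illustrative; real pipelines map codes through the full ontology hierarchy.

```python
import random

def truncate_code(code: str, digits: int = 3) -> str:
    """Keep only the first digits of a code, replacing it with its parent node."""
    return code[:digits]

def dedup_window(codes: list[str]) -> list[str]:
    """Drop duplicated codes within one time window, preserving first occurrence."""
    seen, out = set(), []
    for c in codes:
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out

def shuffle_window(codes: list[str], rng: random.Random) -> list[str]:
    """Shuffle codes within a window to de-emphasize arbitrary ordering."""
    out = codes[:]
    rng.shuffle(out)
    return out

window = ["4280", "25000", "4280", "25002"]
window = dedup_window([truncate_code(c) for c in window])  # -> ['428', '250']
augmented = shuffle_window(window, random.Random(0))
```

Shuffling produces a new, equally valid ordering of the same window, which is why it doubles as a data-augmentation strategy.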
Self-supervised learning models
There are two primary self-supervised learning training strategies: generative and contrastive. Generative tasks involve models predicting parts of the data from other parts, which may be incomplete, transformed, masked, or corrupted. These tasks, such as autoregressive prediction and masked modeling, help the model learn to recover whole or partial features of its original input17,66. Contrastive tasks, on the other hand, focus on distinguishing between similar and dissimilar data points, helping the model capture discriminative features that are essential for understanding different types of data66. Both task types are crucial for training models to generate rich, generalized representations from unlabeled data66,67, and they are applied across various model architectures. The objective of these models is to capture essential patterns and features in the data and output a learned representation, typically a fixed-length, high-dimensional vector that condenses large amounts of information. Five major architecture types have been identified in the studies, each trained on unlabeled data with different training tasks. Details of the SSRL models used and the temporality monitored in each study are provided in Table S5.
Transformer-based models are among the most impactful model types in the studies. In the medical domain, most transformer-based models treat patients as documents, visits as sentences, and medical concepts as tokens, capturing detailed patient histories. BERT68 is a transformer encoder-only model that effectively learns data representations by processing and contextualizing complex sequences of information. BERT models can be trained using various techniques: training with only the Masked Language Model (MLM) objective, predicting randomly masked medical concepts in each EHR sequence34,41,46,51,57,63, enhances contextual understanding, while training with both MLM and auxiliary tasks13,22,39,43,49,59 further refines the model's representations by guiding it with specific medical insights. Additionally, self-contrastive learning techniques help improve BERT's robustness and accuracy in capturing meaningful patterns in medical data30,55. Other transformer-based training tasks include next visit code prediction23,56,61,65, medical code category prediction23, medication-diagnosis cross prediction26, and token replacement detection (ELECTRA)58.
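The MLM-style input construction can be sketched as follows. This is a simplified illustration with a fixed mask rate (0.5 here for visibility; 15% is typical), and it omits BERT's keep/replace refinements for selected tokens.

```python
import random

MASK = "[MASK]"

def mask_sequence(tokens, mask_prob=0.15, rng=None):
    """Randomly mask medical-concept tokens; labels keep only masked positions."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)       # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)      # position ignored by the loss
    return inputs, labels

# Toy EHR sequence of medical-concept tokens (illustrative identifiers).
ehr = ["ICD9_428", "ICD9_250", "ATC_C09", "CPT_93000"]
inputs, labels = mask_sequence(ehr, mask_prob=0.5, rng=random.Random(1))
```

The model is then trained to predict each masked concept from the surrounding visit context, which is what forces it to learn clinically meaningful co-occurrence patterns.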
AE-based models are encoder-decoder models that aim to reconstruct the input, enabling the learning of data representations in a compressed, lower-dimensional space. AEs are designed to learn the most salient features of the data, which can be particularly useful for capturing the underlying structure of categorical EHR data. Several variants of the AE were applied in the studies: stacked autoencoders32,36, denoising autoencoders45, and autoencoders with RNN units such as GRU33 and LSTM44,48,52,53,60. Additionally, AEs can be combined with other models such as collective matrix factorization29, CNNs, and clustering algorithms27,47.
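For the denoising variant, training pairs can be built by corrupting a multi-hot visit vector and asking the decoder to reconstruct the clean one. A schematic sketch, with an illustrative vocabulary and dropout rate:

```python
import random

def multi_hot(codes, vocab):
    """Encode one visit as a multi-hot vector over the code vocabulary."""
    idx = {c: i for i, c in enumerate(vocab)}
    vec = [0] * len(vocab)
    for c in codes:
        vec[idx[c]] = 1
    return vec

def corrupt(vec, drop_prob, rng):
    """Randomly zero out present codes; the clean vector is the target."""
    return [0 if (v == 1 and rng.random() < drop_prob) else v for v in vec]

vocab = ["ICD9_428", "ICD9_250", "ATC_C09", "CPT_93000"]
target = multi_hot(["ICD9_428", "ATC_C09"], vocab)            # clean input
noisy = corrupt(target, drop_prob=0.3, rng=random.Random(0))  # corrupted input
```

The (noisy, target) pair is one training example; reconstructing dropped codes pushes the bottleneck representation to capture which codes tend to occur together.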
GNN-based models use graph learning to represent medical ontologies, hospital visits, and disease co-occurrence. Nodes represent the medical concepts and personal entities, linked by edges indicating their relationships. Graph attention models were used to learn the medical concept embeddings within medical ontologies22,26, with these embeddings frequently serving as initializations for further model training. Random walk technique is used to embed doctors according to their specialty37. Graph contrastive learning25,28 generates multiple views of augmented hospital visit graphs by modifying the original graph with node or edge perturbations, allowing the model to learn robust representations by contrasting positive pairs against negative pairs. These approaches ensure that the learned embeddings accurately reflect the complex relationships inherent in medical data67.
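Edge-perturbation augmentation can be sketched as follows: two randomly edge-dropped views of the same visit graph form a positive pair, while views of other visits serve as negatives. The toy graph and drop rate are illustrative.

```python
import random

def drop_edges(edges, drop_prob, rng):
    """Create an augmented view by randomly removing edges from the visit graph."""
    return [e for e in edges if rng.random() >= drop_prob]

# Toy visit graph: edges link a visit node to the medical concepts it contains.
edges = [("visit_1", "ICD9_428"), ("visit_1", "ATC_C09"), ("visit_1", "CPT_93000")]
view_a = drop_edges(edges, 0.3, random.Random(0))
view_b = drop_edges(edges, 0.3, random.Random(1))
# (view_a, view_b) is a positive pair for the contrastive objective.
```

Because both views come from the same underlying visit, pulling their embeddings together while pushing other visits away yields representations robust to missing or noisy codes.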
Word-embedding-based models convert words into numerical vectors, allowing computers to understand their meanings and relationships from their context in a sequence of words. The model learns to map each word or concept to a dense vector representation, capturing semantic similarities based on co-occurrence patterns. Patient EHR data, composed of a sequence of medical concepts ordered by time, are used to train the representation model to predict medical concepts based on their surrounding context, helping the model understand relationships between concepts. Various algorithms were identified, such as GloVe42, Word2vec40,42,54, and FastText42.
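The context-prediction setup can be sketched as skip-gram pair generation over a time-ordered code sequence. The window size and codes are illustrative.

```python
def skipgram_pairs(codes, window=2):
    """Generate (center, context) training pairs from a time-ordered code sequence."""
    pairs = []
    for i, center in enumerate(codes):
        lo, hi = max(0, i - window), min(len(codes), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, codes[j]))
    return pairs

seq = ["ICD9_250", "ATC_A10", "ICD9_428"]
pairs = skipgram_pairs(seq, window=1)
# -> [('ICD9_250', 'ATC_A10'), ('ATC_A10', 'ICD9_250'),
#     ('ATC_A10', 'ICD9_428'), ('ICD9_428', 'ATC_A10')]
```

Each pair becomes one prediction task, so codes that repeatedly appear in the same temporal neighborhood (for example a diagnosis and its usual medication) end up with nearby embedding vectors.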
RNN-based models are designed to capture temporal dependencies in sequential data, making them well-suited for tasks involving time-series EHR data. These models are trained with the objective of predicting future medical events based on a patient's historical data. Studies14,56,62 use a specific type of RNN, the GRU. The models were trained to predict the set of medical codes on day t from the medical codes of previous days. To better capture temporality, these studies also included time-gap information in the input.
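This objective can be sketched as building (history, time gap, next-day code set) examples from a patient timeline. The record format below is illustrative, not taken from any specific study.

```python
def next_day_examples(timeline):
    """From [(day, codes), ...] sorted by day, build autoregressive examples:
    predict the code set of day t from all earlier codes plus the time gap."""
    examples, history, prev_day = [], [], None
    for day, codes in timeline:
        if history:
            examples.append({
                "history": list(history),
                "gap_days": day - prev_day,   # temporal signal fed to the GRU
                "target": set(codes),
            })
        history.extend(codes)
        prev_day = day
    return examples

timeline = [(0, ["ICD9_250"]), (30, ["ICD9_428", "ATC_C09"]), (45, ["CPT_93000"])]
examples = next_day_examples(timeline)
```

A GRU would consume the history (with the gaps encoded alongside each step) and be trained with a multi-label loss over the target code set.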
Predictive models for classification use the trained SSRL model as a backbone, to which a task-specific classification head is added. These predictive models require labeled data for training on specific tasks. Among the articles that describe the predictive models used for classification, different model types were identified, predominantly simple architectures that are easy to train. Some studies employ a single linear layer23,59,61,63, logistic regression (LR) (n = 8, 17%), or support vector machines (SVM)31,33. Models that can capture more complex data patterns, such as feedforward neural networks (n = 12, 26%) and RNNs13,37–39,43,53 (n = 6, 13%), are also applied.
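A logistic-regression head on top of frozen patient embeddings can be sketched as follows. The 2-d toy vectors stand in for the SSRL model's output; real studies use the learned high-dimensional representations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(X, y, lr=0.5, epochs=200):
    """Fit a logistic-regression head on fixed embedding vectors via SGD."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            g = p - yi                      # gradient of the log loss
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

# Toy frozen "embeddings": class 1 sits to the upper right of class 0.
X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y = [0, 0, 1, 1]
w, b = train_head(X, y)
preds = [int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) > 0.5) for xi in X]
```

Only the head's weights are updated here; the backbone embeddings stay fixed, which is what keeps these downstream models cheap to train on limited labeled data.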
Clustering and visualization models are used with the data representation vector as input. We identified several techniques employed across the literature. T-distributed Stochastic Neighbor Embedding (t-SNE) emerged as the most frequently used model for data representation visualization and cluster interpretation (n = 12, 26%). In terms of clustering techniques, K-means40,52–54 was found to be the most common method. These clustering models take the embedding vectors generated by trained representation learning models as input.
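A minimal K-means over embedding vectors can be sketched as follows, with 2-d toy vectors standing in for the learned patient representations and a deterministic initialization for reproducibility.

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10):
    """Lloyd's algorithm: assign each point to its nearest centroid, then
    recompute centroids as cluster means (spread init over input order, k >= 2)."""
    centroids = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: dist2(p, centroids[c]))].append(p)
        centroids = [[sum(d) / len(cl) for d in zip(*cl)] if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    return [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]

# Two well-separated groups of toy "patient embedding" vectors.
points = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
labels = kmeans(points, k=2)
```

In practice the input `points` are the fixed-length vectors produced by the trained representation model, and the resulting cluster labels feed the phenotyping analyses described below.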
Classification evaluation Since most of the classification tasks were binary, the most frequently used classification metrics were AUROC (n = 21, 46%), followed by AUPRC (n = 14, 30%), accuracy (n = 10, 22%), and F1 (n = 9, 20%); other metrics, such as precision (n = 6, 13%) and sensitivity (n = 5, 11%), were used less frequently. A few studies evaluated multi-class classification tasks, reporting metrics such as average precision31,34, precision at k63,65, macro-F129,37, and weighted F124,29.
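AUROC can be computed directly from the rank statistic (the Mann-Whitney U formulation); a dependency-free sketch on toy scores:

```python
def auroc(scores, labels):
    """AUROC as the probability that a randomly chosen positive is scored
    above a randomly chosen negative (ties counted as 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 1]
# Two of the three positives outrank the single negative -> AUROC = 2/3
```

This rank-based view also explains AUROC's popularity for imbalanced clinical outcomes: it is insensitive to the choice of decision threshold.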
Clustering evaluation Despite the abundance of clustering studies, only a few employed dedicated clustering metrics. Silhouette analysis (n = 4, 9%) was the most frequently used, followed by the Davies-Bouldin index40,44 (n = 2, 4%) and the purity score35,47 (n = 2, 4%).
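Purity can be sketched in a few lines: each cluster is credited with its majority class, and the score is the fraction of correctly assigned points. The phenotype labels below are illustrative.

```python
from collections import Counter

def purity(cluster_ids, class_labels):
    """Fraction of points belonging to the majority class of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, class_labels):
        by_cluster.setdefault(c, []).append(y)
    majority_total = sum(Counter(ys).most_common(1)[0][1]
                         for ys in by_cluster.values())
    return majority_total / len(class_labels)

clusters = [0, 0, 0, 1, 1]
phenos = ["T2D", "T2D", "HF", "HF", "HF"]
# Cluster 0 majority is T2D (2/3), cluster 1 majority is HF (2/2) -> purity 0.8
```

Purity requires ground-truth class labels, which partly explains why so few clustering studies report it.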
Interpretability in machine learning is defined as the extraction of relevant knowledge from a machine-learning model concerning relationships either contained in data or learned by the model69. Attention weight analysis was used in several studies (n = 6, 13%). Statistical analysis of the clusters was employed in some papers (n = 3, 6%). For post-hoc interpretability, methods such as Integrated Gradients13 and gradient-based saliency65 were utilized. Most of the papers interpreted their results using visualizations computed by t-SNE (n = 12, 26%) and Uniform Manifold Approximation and Projection (UMAP) (n = 3, 6%). Ten papers involved medical expert interpretation. Overall, only two papers applied post-hoc interpretability methods to trained models. Refer to Table S7 for detailed information on the interpretability methods used in the studies.
Clinical aspects
Table 2
A summary of medical domain, interpretability, and evaluation tasks of the selected studies.
Columns are grouped as follows: medical domain (Cardiology through other domains), interpretability (attention weight analysis through medical expert interpretation), and evaluation task (disease prediction through other tasks).
| model name | Cardiology | General & multiple diseases | Neurology & Psychiatry | Primary Care | Oncology | other domains | attention weight analysis | statistical analysis on cluster | post-hoc interpretability | embedding visualization | medical expert interpretation | disease prediction | mortality prediction | readmission prediction | length of stay prediction | patient similarity | hospitalization | other tasks |
Liang et al.31 | x | | | | | a | | | | | | m | | | | | | |
de Lusignan et al.32 | | | | x | | | | | | | x | | | | | | | 8 |
G-BERT22 | | | | x | | | | | | | | | | | | | | 1 |
Ruan et al.33 | x | | | | | | | | | x | | m | x | | | | | |
BEHRT34 | | x | | | | | x | | | x | x | m | | | | | | 8 |
ConvAE35 | | | x | | x | d | | x | | x | x | | | | | | | 9 |
Enhanced Reg36 | x | | | | | c | | | | | | x | | x | | | | |
CLMBR14 | | x | | | | | | | | | | | x | x | x | | | 2 |
EDisease30 | | | | x | | | | | | x | x | | x | | | x | x | |
ME2Vec37 | | | | x | x | | | | | x | | m | | x | | | | 4 |
PLGMNN38 | x | | | | x | b,d | | | | | | | | | | | | 1, 10 |
Med-BERT39 | x | | | | x | | x | | | | x | x | | | | | | |
Huang et al.40 | | | x | | | | | | | x | | | | | | x | | |
BRLTM41 | | | x | | | | x | | | | | x | | | | | | |
Phe2vec42 | | x | | | | | | | | x | x | | | | | x | | |
CEHR-BERT43 | x | x | | | | | | | | | | x | x | x | | | x | |
DICE44 | x | | | | | | | x | | x | x | m | | | | x | | |
Chushig-Muzo et al.45 | x | | | | | a | | | | | | | | | | | | 5 |
Poulain et al.46 | x | | | | | | | | | | | | | | | | | 11 |
Kumar et al.29 | | | | x | | | | | | | | m | x | | | | | |
Shao et al.47 | | | | | | c | | x | | x | | | | | | x | | |
Claim-PT23 | | | x | | | c | | | | | | x | | | | | | |
Navaz et al.48 | x | | | | | b | | | | | | | x | | | x | | |
CEHR-GAN-BERT49 | x | | | x | | | | | | | | x | x | | | | | |
CEF-CL50 | | x | | | | | | | | | | x | | | | | | |
ADADIAG51 | x | | | | | | x | | | x | | x | | | | | | |
Manzini et al.52 | | | | | | a | | | | | x | | | | | | | 5 |
Herp et al.53 | | | | | x | | | | | | | m | | | | | | 10 |
MMMGCL28 | | | | x | | | | | | | | | x | x | x | | | |
MedM-PLM26 | | | | x | | | x | | | | | | | x | | | | 1, 3 |
Ta et al.54 | | | | | | b | | | | | x | | | | | | | 5 |
Hi-BEHRT55 | x | | x | | | a,e | | | | | | x | | | | | | |
CLMBR-256 | | x | | | | | | | | | | | x | x | x | | x | |
Sherbet24 | x | | | x | | | | | | | | m | | | | | x | |
Ru et al.57 | x | | | | | | | | | | | | | x | | | | |
SeqCare25 | | x | | | | | | | | | | m | | | | | | |
Liu et al.27 | | | | | | e | | | | x | | | | | | | | 8 |
IPDM58 | | | x | | | | | | | x | | m | x | | | | | 10 |
ExMed-BERT13 | | | | | | c | | | x | | | x | | | | | | |
Pellegrini et al.59 | | | x | | | b | x | | | x | | m | x | | x | | | |
Jones et al.60 | | | x | | | | | | | | | x | | | | | x | 6, 7 |
TransformEHR61 | | | x | | x | | | | | | | x | | | | | | |
CLMBR-362 | | x | | | | | | | | | | x | | | | | | |
Profile model63 | | x | | | | | | | | x | | m | | | | | | |
Seki et al.64 | | x | | | | | | | | x | | | | x | | | | |
Foresight65 | | x | | | | | | | x | | x | m | | | | | | |
Other medical domain: a: Endocrinology, b: Infectious Diseases, c: Respiratory, d: Gastroenterology, e: Nephrology.
Other evaluation tasks: 1: medication recommendation, 2: ICU transfers, 3: ICD coding, 4: doctor recommendation, 5: patient subtyping, 6: emergency department visit, 7: high medical resource utilization, 8: characterization of clusters, 9: patient stratification, 10: prognosis analysis, 11: multiregression, m: multilabel
Our scoping review identified various tasks across the articles. These tasks were distributed across clinical domains, with Cardiology24,31,33,36,38,39,43–46,48,49,51,55,57 (n = 15, 33%), General & multiple diseases (n = 11, 24%), and Neurology & Psychiatry and Primary Care (n = 9, 20% each) being the most frequently studied areas. Oncology (n = 6, 13%) followed, while Infectious Diseases38,48,54,59, Endocrinology35,45,52,55, and Respiratory13,23,36,47 each had four downstream tasks (n = 4, 9%). Gastroenterology35,38 and Nephrology27,55 had the lowest number of downstream tasks (n = 2, 4%). A detailed overview of the clinical events and their corresponding clinical domain mapping can be found in Table S1.
Upon training, the deep learning models have developed an intrinsic representation of the data, which can be general (multiple tasks) or task-specific (a single task or a few similar tasks). Representation quality is evaluated on various clinical tasks, including predictive tasks and patient phenotyping. For detailed information on the evaluation tasks in the studies, see Table S7.
Predictive tasks Among the 73 predictive tasks, the primary focus was on disease prediction (n = 27, 59%), followed by mortality prediction (n = 11, 24%), readmission prediction14,26,28,36,37,43,56,57,64 (n = 9, 20%), hospitalization (n = 5, 11%), and length of stay prediction (n = 4, 9%).
Additional tasks included medication recommendation22,26,38 (n = 3, 7%), ICD coding49, doctor recommendation37, ICU transfers14, emergency department visits60, and high medical resource utilization60.
Patient Phenotyping Of the 33 patient phenotyping tasks, clustering was primarily used for visualization (n = 15, 33%), patient similarity assessment (n = 8, 24%), characterization of clusters (n = 3, 9%), patient subtyping (n = 2, 6%), and patient stratification (n = 1, 3%).
Medical expert involvement
Medical experts were involved across different stages of the studies, with varying degrees of participation. Among the reviewed publications, expert involvement was most prominent in study design (n = 14, 30%) and result interpretation (n = 13, 28%). Feature selection also saw substantial expert input (n = 10, 22%), while dataset extraction had more limited expert participation (n = 4, 9%).