In this study we applied several variants of QL, a well-known off-policy algorithm for RL. While RL algorithms outside the healthcare field are typically employed in online settings, where agents interact with an environment to learn the optimal policy, implementation in healthcare must rely on offline data. This is due to safety concerns and regulatory limitations surrounding the active modification of clinical decisions for algorithm learning purposes. Our offline data consisted of a large volume of clinical decisions previously made by physicians across a broad spectrum of patients and clinical circumstances. We employed traditional QL, DQN, and Conservative Q-learning (CQL) to optimize treatments for CAD based on reward functions derived from downstream clinical outcomes. We searched different hyperparameter spaces for each method and compared the extracted optimal RL policies with the policy established by the behavior of practicing physicians. In this section, we provide an overview of the data resources and techniques utilized in the study.
Data Sources
This study used a cohort of 41,328 unique adult (≥ 18 years old) patients with obstructive CAD undergoing 49,521 unique cardiac catheterizations. Data were sourced from the Alberta Provincial Project for Outcome Assessment in Coronary Heart Disease (APPROACH) registry 43,44 and included index encounters from January 2009 to 2019. APPROACH is among the world's most extensive databases for CAD management, linking unique disease phenotypes with detailed clinical data and relevant patient outcomes for individuals in Alberta, Canada who have undergone diagnostic cardiac catheterization and/or revascularization procedures. The data schema of APPROACH included core demographics, point-of-service clinical health variables (inclusive of comorbidities, laboratory tests, medications, referral indications for coronary catheterization, and vital signs), and all current and prior procedural parameters collected during catheterization and/or revascularization procedures. Standardized outputs of the CARAT™ (Cohesic Inc., Calgary, AB) reporting tool were also provided; this specialized CAD reporting software gives clinicians an interactive means of generating patient-specific coronary anatomic models that characterize the location, extent, and specific features of coronary lesions and the corresponding revascularization procedures. Obstructive CAD was defined as any stenosis ≥ 50% in the left main coronary artery, ≥ 70% in any other coronary artery, or both. Index presentations for STEMI were excluded a priori given established guideline recommendations to deliver immediate primary PCI of the culprit lesion.
Alberta provincial health records of the studied cohort were collected through Alberta Health Services (AHS) administrative databases. The dataset consisted of various tables, as outlined in Supplemental Table 1. The utilization of data in this study was approved by the Conjoint Health Research Ethics Board at the University of Calgary (REB20-1879).
Data Preprocessing
Decision models were developed using a Markov Decision Process (MDP) framework and were provided access to an expert-reviewed selection of candidate patient health features across all data resources to define patient states. The majority of selected patient features came from the APPROACH data schema, including core demographic information, comorbidities, and standardized angiography findings provided by the CARAT™ system. Select additional features from administrative databases, inclusive of specific ICD-10-CA codes, medication ATC codes, and routinely performed laboratory tests, were included. ICD-10-CA and ATC codes were selected based on our previous study 45, while the lab tests were chosen for their relevance to CAD and their availability. These features were aggregated over a one-year period prior to each catheterization encounter. In total, 402 features were presented to the decision models, with details provided in Supplementary Table 2.
The cohort dataset was divided into three parts at a patient level: 70% for training, 20% for testing, and 10% for validation purposes (for neural networks). We normalized the continuous features using a Min-Max scaler and one-hot encoded all categorical variables. Missing data were imputed using the median value of each feature.
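As a concrete illustration of these preprocessing steps, the sketch below shows one way to perform the patient-level split, median imputation, Min-Max scaling, and one-hot encoding with scikit-learn. The column names (e.g., `patient_id`) and the convention of fitting statistics on the training split only are assumptions for illustration, not details reported in the study.

```python
# Minimal preprocessing sketch; column names and the fit-on-training-only
# convention are illustrative assumptions, not the study's actual pipeline.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import MinMaxScaler

def split_by_patient(df, train=0.7, test=0.2, val=0.1, seed=0):
    """Split encounters at the patient level so no patient spans two sets."""
    gss = GroupShuffleSplit(n_splits=1, test_size=1 - train, random_state=seed)
    train_idx, rest_idx = next(gss.split(df, groups=df["patient_id"]))
    rest = df.iloc[rest_idx]
    gss2 = GroupShuffleSplit(n_splits=1, test_size=val / (test + val), random_state=seed)
    test_idx, val_idx = next(gss2.split(rest, groups=rest["patient_id"]))
    return df.iloc[train_idx], rest.iloc[test_idx], rest.iloc[val_idx]

def preprocess(train_df, other_df, continuous_cols, categorical_cols):
    """Median-impute and Min-Max scale continuous features; one-hot encode categoricals."""
    medians = train_df[continuous_cols].median()
    scaler = MinMaxScaler().fit(train_df[continuous_cols].fillna(medians))
    out = []
    for d in (train_df, other_df):
        cont = pd.DataFrame(scaler.transform(d[continuous_cols].fillna(medians)),
                            columns=continuous_cols, index=d.index)
        cat = pd.get_dummies(d[categorical_cols].astype("category"))
        out.append(pd.concat([cont, cat], axis=1))
    # Align one-hot columns between splits (unseen categories become zero columns).
    return out[0], out[1].reindex(columns=out[0].columns, fill_value=0)
```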
MDP Modeling
Each coronary catheterization encounter represented a step in our MDP model with patient features available at time of catheterization used to determine each patient's state. The revascularization treatment decision made at each step (i.e., MT, PCI or CABG) was considered the action for the model. A composite MACE outcome at three years following each action was used to calculate the reward. The transitions in this model included moving from one catheterization to another, the patient's death, or passing three years from the treatment without further catheterization or death.
We defined patient episodes based on these transitions, ending them either with the patient's death or at the end of the three-year post-treatment follow-up period. Notably, if a patient underwent catheterization more than three years after the previous episode, a new episode was initiated.
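The sketch below illustrates how a single MDP step and an episode could be represented in code under the definitions above; the field names and action encoding are hypothetical rather than taken from the study's implementation.

```python
# Illustrative data structure for one MDP step; names are hypothetical.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Transition:
    state: np.ndarray                  # patient features at time of catheterization
    action: int                        # 0 = MT only, 1 = PCI, 2 = CABG
    reward: float                      # derived from 3-year MACE outcomes (see Reward Function)
    next_state: Optional[np.ndarray]   # None for terminal transitions
    done: bool                         # True on death, or on three event-free years
                                       # without a further catheterization

# An episode is a list of Transitions; a new episode starts if the next
# catheterization occurs more than three years after the previous one.
Episode = list[Transition]
```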
State Determination
Our MDP models relied on defining unique patient states and learning outcomes from how each patient transitioned between them. This made it crucial to first define representative states based on the available data sources. Because one of our focuses was on the traditional Q-learning (QL) method, we needed to determine a finite number of states that was small enough to be statistically meaningful (e.g., below 1,000 states) yet adequately representative of the differences observed across patient presentations. Conversely, for deep reinforcement learning methods, such as the Deep Q-network (DQN), the use of neural network-based encoders to estimate the Q-function allowed us to use the input features directly. The terminal states, in all methods, were defined either by the death of the patient or by completing three years after a treatment action without death. Figure 1 illustrates the possible transitions in our MDP model.
For the traditional QL approach, we employed the K-Means clustering algorithm to categorize patients into N subgroups, defining each cluster as a state. Given the uncertainty regarding the optimal N, we searched the space of \(N \in [2, 1000)\), using each state space in a different model. Since DQN does not require predefined states and can utilize the features directly, we used the processed features as its state space.
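A minimal sketch of this state-definition step is shown below, assuming preprocessed feature matrices `X_train` and `X_test`; the clustering settings are illustrative.

```python
# Cluster the processed training features into N groups and use the cluster
# index as the discrete state for traditional QL.
from sklearn.cluster import KMeans

def build_discrete_states(X_train, X_test, n_states, seed=0):
    kmeans = KMeans(n_clusters=n_states, random_state=seed, n_init=10).fit(X_train)
    return kmeans, kmeans.predict(X_train), kmeans.predict(X_test)

# One tabular model is trained per candidate state space, e.g. for N in range(2, 1000).
```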
Reward Function
The reward function of our MDP model was predominantly determined by clinical outcomes. Patients received a reward of +1 at the end of each episode if they completed a 3-year period without experiencing an adverse event. If a patient died within this period, the reward value was adjusted based on the time they lived. We applied a negative reward of -1 for each adverse event occurring following treatment, adjusted for the number of days following the decision (up to three years), using the formula presented in Fig. 2. Under these conditions, the earlier the event occurred, the larger the penalty.
The adverse events that were considered by the model were based on a 5-point MACE 46, defined as:
- Acute Myocardial Infarction (AMI)
- Stroke
- Repeat Revascularization (PCI or CABG)
- CV mortality within 3 years after treatment
- All-cause mortality within 90 days after treatment
When two or more outcomes were experienced, the earlier event was used for reward calculations.
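The sketch below illustrates the structure of this reward calculation. The exact time-adjustment formula is presented in Fig. 2 and is not reproduced here; the linear scaling used in the sketch is only a placeholder consistent with the description above (a penalty of magnitude 1 immediately after the decision that shrinks toward zero near the three-year horizon, and +1 for an event-free three years).

```python
# Placeholder reward sketch; the actual time-adjustment formula is given in
# Fig. 2 of the paper and the linear form below is only illustrative.
from typing import Optional

THREE_YEARS_DAYS = 3 * 365

def reward_for_step(days_to_event: Optional[float]) -> float:
    """days_to_event: days from the treatment decision to the earliest adverse
    event (including death); None if no event occurred within three years."""
    if days_to_event is None or days_to_event >= THREE_YEARS_DAYS:
        return 1.0  # completed the three-year window event-free
    # Placeholder time adjustment: earlier events receive a larger penalty.
    return -1.0 * (1.0 - days_to_event / THREE_YEARS_DAYS)
```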
It is important to note that in all models trained or evaluated in this study, we applied essentially no discounting of future rewards or penalties, using \(\gamma = 0.99\) for all MDP models, as we wanted future clinical outcomes to be weighted nearly as heavily as early ones.
Action Space
The action space in this study consisted of three treatment options for obstructive CAD: MT only, PCI, or CABG. MT only was treated as the default action, taken whenever PCI or CABG was not performed. Actions were always taken following coronary catheterization.
Evaluation of the Policies
Evaluating policies derived from RL algorithms in de-novo patients is challenging due to safety and ethical concerns. Therefore, we used a well-known 40,41,47 off-policy evaluation technique called Weighted Importance Sampling (WIS) to compare the expected values gained by our recommended policies to the physicians’ policy using trajectories collected under the physicians’ policy (i.e., the decisions of physicians).
WIS is an improved version of importance sampling. While importance sampling provides an unbiased estimation of the expected reward, it suffers from high variance. WIS helps alleviate this problem through a normalization step that adjusts the weights of each sampled trajectory so that they sum to one, effectively stabilizing the estimator. This normalization not only mitigates the high variance typically associated with traditional importance sampling but also ensures that the weighted samples better represent the distribution of the target policy 48,49.
We should also note that for all WIS evaluations in the current study, we used bootstrapping with 1,000 samples and 1,000 steps to calculate confidence intervals. Additionally, we presented WIS-driven expected rewards for both stochastic and greedy policies. The stochastic policy refers to a set of probabilities for taking each possible action (summing to 1), while the greedy policy involves only choosing the action with the highest probability (probability of 1 for that action and zero for the rest).
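A minimal sketch of the WIS estimator with a bootstrap confidence interval is given below, assuming trajectories stored as lists of (state, action, reward) tuples and callable policies `pi_e` (evaluation) and `pi_b` (behavior) that return action probabilities; the bookkeeping details are illustrative rather than the study's exact implementation. A greedy policy is obtained from a stochastic one by assigning probability 1 to its highest-probability action.

```python
# Weighted Importance Sampling with bootstrap confidence intervals (sketch).
import numpy as np

def wis_estimate(trajectories, pi_e, pi_b, gamma=0.99):
    weights, returns = [], []
    for traj in trajectories:
        w, g, discount = 1.0, 0.0, 1.0
        for (s, a, r) in traj:
            w *= pi_e(s, a) / max(pi_b(s, a), 1e-8)   # cumulative importance ratio
            g += discount * r
            discount *= gamma
        weights.append(w)
        returns.append(g)
    weights, returns = np.array(weights), np.array(returns)
    # Normalization step: weights are rescaled so they sum to one.
    return np.sum(weights * returns) / max(np.sum(weights), 1e-8)

def wis_bootstrap_ci(trajectories, pi_e, pi_b, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(trajectories), len(trajectories))
        estimates.append(wis_estimate([trajectories[i] for i in idx], pi_e, pi_b))
    lo, hi = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return wis_estimate(trajectories, pi_e, pi_b), (lo, hi)
```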
Behavior Policy
The behavior policy for this study is essentially a learned function that converts patient states into actions (i.e., revascularization decisions). The design of a behavior policy directly influences the results of the WIS, and an uncalibrated function may lead to incorrect estimations of the expected rewards. Raghu et al. 50 showed that simple models, such as k-nearest neighbor (KNN), may perform better as behavior policies for off-policy evaluations, as the probabilities of the actions in these models better represent the stochastic nature of the problem.
In this study, we used k-means clustering to determine potential states of patients. These states (i.e., clusters) behave much like models such as KNN, with the probabilities of actions in each cluster more likely to be stochastically calibrated. We chose the optimal number of clusters (K) in k-means by evaluating the predictive accuracy of each cluster configuration. Accuracy was determined by how well the most frequent treatment action in each cluster predicted ground-truth physician decisions. We selected the smallest K that provided the highest predictive accuracy as the optimal number of states for the physician policy. This method balanced model simplicity with predictive performance and yielded a stochastically calibrated behavior policy. We refer to this behavior policy as \(\pi_{B_{best}}\) in the rest of the paper. The chosen number of clusters, 177, was the smallest that achieved the maximum accuracy of 69.48% in predicting the physicians' true actions.
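The sketch below illustrates this search over candidate numbers of clusters, assuming a feature matrix `X` and an integer array `actions` of physician decisions; variable names and settings are illustrative.

```python
# For each candidate K: cluster the states, take the most frequent physician
# action per cluster, and keep the smallest K reaching the best accuracy.
import numpy as np
from sklearn.cluster import KMeans

def behavior_policy_search(X, actions, candidate_ks, seed=0):
    best_k, best_acc, best_model = None, -1.0, None
    for k in candidate_ks:                      # assumed ascending, so ties keep the smallest K
        km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X)
        labels = km.labels_
        majority = {c: np.bincount(actions[labels == c]).argmax() for c in range(k)}
        acc = np.mean([majority[c] == a for c, a in zip(labels, actions)])
        if acc > best_acc:
            best_k, best_acc, best_model = k, acc, km
    return best_k, best_acc, best_model

# The stochastic behavior policy pi_B(a|s) is then the empirical action
# frequency within the cluster containing state s.
```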
Learning Scheme
In this study, we used various types of QL to find optimal treatment decisions, including traditional QL based on states derived from k-means clustering, DQN, and Conservative Q-learning (CQL), which is a special type of DQN. Each method is briefly described below:
Traditional Q-Learning
We used different sets of states derived from k-means clustering and optimized 998 different Q-functions using the training episodes. We drew random samples (with replacement) from the patient episodes and calculated the Q-function using the QL algorithm with at least 1,000,000 steps (we increased this number as the number of states increased) 32,40.
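A minimal sketch of this tabular update is shown below, assuming episodes stored as lists of (state, action, reward, next_state, done) tuples with integer state and action indices; the learning rate and other settings are illustrative.

```python
# Tabular Q-learning over the clustered states (sketch).
import numpy as np

def tabular_q_learning(episodes, n_states, n_actions, lr=0.1, gamma=0.99,
                       min_steps=1_000_000, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    steps = 0
    while steps < min_steps:
        # Sample an episode with replacement from the training data.
        episode = episodes[rng.integers(len(episodes))]
        for (s, a, r, s_next, done) in episode:
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += lr * (target - Q[s, a])   # standard QL update
            steps += 1
    return Q

# The optimal policy takes the argmax over actions: policy = Q.argmax(axis=1)
```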
Then, we determined the action with the highest Q-value for each state to extract the optimal policy. One of the advantages of QL based on a small number of states is its ease of use and interpretability. With this method, we can analyze the characteristics of each state (a cluster of patients) and identify the important contributing factors that lead to a specific action being chosen as optimal. Inspired by the RL interpretation methods of Zhang et al. 51, we fitted an XGBoost model to map the features to the associated clusters and calculated the Shapley values to represent the importance of the features.
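The sketch below outlines this interpretation step, assuming the feature matrix `X_train` and the `cluster_labels` produced by the k-means step; the XGBoost settings are illustrative.

```python
# Fit a classifier mapping raw features to the k-means cluster labels, then
# use SHAP values to rank feature importance (sketch).
import shap
import xgboost as xgb

clf = xgb.XGBClassifier(n_estimators=200, max_depth=4)
clf.fit(X_train, cluster_labels)

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_train)   # per-cluster feature attributions
shap.summary_plot(shap_values, X_train)
```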
One issue with QL methods is the potential overestimation of value for rare actions in certain states. To mitigate this and make the policies safer, aligning them closely with physicians’ policies is advisable. A practical approach is to use a weighted average of both the traditional and physicians’ policies to avoid an overly optimistic prediction for rare state-actions. In our study, we executed this by calculating a simple average of the traditional and physicians’ policies for our best model.
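A minimal sketch of this blending step, assuming `pi_rl` and `pi_b` are (n_states, n_actions) arrays of action probabilities whose rows sum to one:

```python
import numpy as np

def blend_policies(pi_rl: np.ndarray, pi_b: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Weighted average of the RL and behavior (physician) policies; rows remain
    valid probability distributions because both inputs sum to 1 over actions."""
    return w * pi_rl + (1.0 - w) * pi_b
```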
Deep Q-network
DQN can be considered an extension of traditional Q-learning that uses neural networks to estimate Q-values instead of a Q-matrix. In our study, we utilized this approach to handle the complexity of high-dimensional state spaces. The neural network receives input features representing states and predicts Q-values, which are compared, through a loss function, to the target Q-values calculated from the Q-learning formula. Gradient descent is then used to update the network's parameters (\(\theta\)). Additionally, DQN maintains a copy of the main network, updated less frequently, to calculate the target Q-values, enhancing the stability of the model 33,52. We used a feed-forward neural network structure and tuned the hyperparameters using grid search. The details of the hyperparameter space are provided in the supplementary information. Similar to QL, the optimal policy was calculated by choosing the action that maximized the Q-values for each state.
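A minimal PyTorch sketch of this update is shown below; the network architecture and hyperparameters are illustrative placeholders rather than the tuned values from the grid search.

```python
# DQN loss with a separate, slowly updated target network (sketch).
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_features: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, x):
        return self.net(x)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():   # target Q-values come from the less frequently updated copy
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)

# Periodically: target_net.load_state_dict(q_net.state_dict())
```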
One of the key advantages of DQN over traditional QL is its ability to generalize across similar states due to the function approximation provided by neural networks. This generalization capability allows DQN to scale effectively to larger state spaces and more complex environments, facilitating the discovery of nuanced patterns and strategies within the data.
Conservative Q-Learning
CQL is a special type of DQN in which a term is added to the loss function to discourage the model from overestimating the Q-values of state-action pairs that were not observed in past experiences. This modification makes CQL an ideal solution for offline RL, where rare state-action pairs cannot be explored directly and the model may overestimate their expected reward when training from past experiences. The update formula of CQL is very similar to that of DQN; the difference is a regularization term that lower-bounds the Q-function. This regularization term induces the conservatism (i.e., acting close to the behavior policy) and is controlled by the \(\alpha\) parameter. As \(\alpha\) increases, the new policy becomes more similar to the behavior policy 52,53.
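The sketch below adds a conservative regularizer of this form to the hypothetical `dqn_loss` from the previous sketch; the specific penalty (log-sum-exp over all actions minus the Q-value of the observed action) follows the standard CQL formulation, and the settings are illustrative.

```python
# CQL loss = DQN loss + alpha * conservative regularizer (sketch).
import torch

def cql_loss(q_net, target_net, batch, alpha=1.0, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    base = dqn_loss(q_net, target_net, batch, gamma)        # standard DQN term
    q_all = q_net(states)                                   # (batch, n_actions)
    q_data = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Conservative term: lower-bounds the learned Q-function; a larger alpha
    # keeps the learned policy closer to the behavior (physician) policy.
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return base + alpha * conservative
```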
Figure 3 summarizes the methodology employed in this study to train and evaluate different RL models for optimizing CAD treatments.