Study Design and Overview
This study follows a basic workflow with several steps: hypothesis definition, study cohort and population, data dictionary creation, data preprocessing, data processing, data analytics, validation and results, and finally future steps. Figure 1 illustrates the major components of the study, from hypothesis definition to future plans.
We hypothesize that discrepancies exist between genders in:
- diagnosis and the time of diagnosis.
- procedures, including invasive and noninvasive procedures.
- time interval between diagnosis, medication order, and procedure.
Study Cohort and Population
Our data analytics were built using EMR data on 960,129 patients admitted to UCSF between July 2011 and December 2018. This study does not involve human subjects or an experimental protocol. All data came from the De-Identified Clinical Data Warehouse (De-ID CDW); access to these data as “de-identified” was authorized by the University of California San Francisco, and all IDs and metadata (e.g., location) have been removed. All methods were carried out in accordance with relevant guidelines and regulations at UCSF.
De-ID CDW is a de-identified database copy of high-value EHR data. These data are therefore not subject to HIPAA restrictions on research use, and the requirements for IRB approval, an honest broker intermediary, and informed consent were waived by the UCSF Research data team committee. The De-ID CDW system accelerates the research process by permitting UCSF investigators to locate research data and by encouraging an exploratory approach to hypothesis generation. The De-ID CDW is available to the UCSF research community.
After authorization to access “de-identified” EMR data for research, in consultation with cardiac, thoracic, and vascular surgeons, cardiologists and cardiovascular epidemiologists, the following cohort identification criteria were developed:
- Diagnosis of Coronary Artery Disease (CAD), commonly referred to as Ischemic Heart Disease (IHD), based on ICD10 codes I20–I25.
- Patients with a missing ICD10 code were excluded.
- Patients recorded as unknown or unspecified were excluded.
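The criteria above can be sketched as a simple record filter. This is a minimal illustration, not the study's code: the field names (`patient_id`, `icd10`, `gender`) are hypothetical, and it assumes that the "unknown" and "unspecified" labels refer to the gender field.

```python
import re

# ICD10 codes I20-I25, with an optional subgroup suffix such as ".0" or ".A1".
CAD_CODE = re.compile(r"^I2[0-5](\.[0-9A-Z]+)?$")

# Assumption: these labels appear in the gender field; an empty value is
# treated the same way.
EXCLUDED_LABELS = {"", "unknown", "unspecified"}


def in_cad_cohort(record):
    """Return True if one diagnosis record satisfies the cohort criteria."""
    code = (record.get("icd10") or "").strip().upper()
    gender = (record.get("gender") or "").strip().lower()
    if not code:                   # missing ICD10 code: excluded
        return False
    if gender in EXCLUDED_LABELS:  # unknown / unspecified: excluded
        return False
    return bool(CAD_CODE.match(code))


def select_cohort(records):
    """Return the set of patient IDs with at least one qualifying record."""
    return {r["patient_id"] for r in records if in_cad_cohort(r)}
```

A patient enters the cohort as soon as one of their diagnosis records passes the filter, so repeated diagnoses do not inflate the cohort size.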
To be included in this cohort, patients needed to meet the above criteria, leading to a cohort size of 32,904 CAD patients. Vitals such as cholesterol (HDL), cholesterol (LDL), cholesterol (total), systolic blood pressure, diastolic blood pressure, BMI, and age were considered, as were demographic characteristics such as ethnicity. Smoking status (never smoked, current every-day smoker, former smoker, passive smoke exposure) was also considered as an important characteristic. Co-morbidities (e.g. hypertension, liver disease, hyperlipidemia, diabetes, dialysis) were calculated for patients with CAD of both genders. All vitals, characteristics, and co-morbidities are shown in Fig. 2. The data set consisted of the de-identified patient ID, demographic information (e.g. gender), and diagnosis based on ICD10 code, as shown in Table 1. The details of the ICD codes are described in supplementary material Table S1 (ICD10details). For procedures, we used the Current Procedural Terminology (CPT) code and the date of procedure services for both invasive and non-invasive procedures. For medications, we used the medication code, medication name, and date of the orders.
Table 1
ICD10 codes I20–I25 for CAD. Table S1 (ICD10codes) in the supplementary materials shows all details for the ICD10 codes.
ICD10 | Definition | Subgroups |
I20 | Angina pectoris | I20.0, I20.1, I20.8, I20.9 |
I21 | Acute myocardial infarction | I21.0(I21.01,I21.02,I21.09), I21.1(I21.11,I21.19), I21.2(I21.21,I21.29), I21.3, I21.4, I21.9, I21.A(I21.A1,I21.A9) |
I22 | Subsequent ST elevation and non-ST elevation myocardial infarction | I22.0, I22.1, I22.2, I22.8, I22.9 |
I23 | Certain current complications following ST elevation and non-ST elevation myocardial infarction | I23.0, I23.1, I23.2, I23.3, I23.4, I23.5, I23.6, I23.7, I23.8 |
I24 | Other acute ischemic heart diseases | I24.0, I24.1, I24.8, I24.9 |
I25 | Chronic ischemic heart disease | I25.1, I25.2, I25.3, I25.4, I25.5, I25.6, I25.7, I25.8 |
Data Dictionary
We manually created a data dictionary for procedures (e.g. CPT codes for diagnostic cardiac catheterization, treatment cardiac catheterization, cardiac CT scan, echo, EKG, myocardial lab, and stress test). Refer to supplementary material Table S2 (Procedure Dictionary) for the full list, including all codes and names for procedures. To create a dictionary for medications, the different medications were classified into the following main classes: anticoagulants, antiplatelet, aspirin, beta-blocker, calcium antagonist, cardiac drug, cardiovascular drug, nitrate, ranolazine, and statin. Refer to supplementary material Table S3 (Medication Dictionary) for the full dictionary, including all codes and names for medications.
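As an illustration of how such a medication dictionary might be applied, the sketch below maps a medication order name to one of the main classes by keyword lookup. The entries are a handful of well-known drugs chosen for illustration only and are not taken from Table S3; the function name and the substring-matching strategy are likewise assumptions.

```python
# Illustrative entries only; the study's full dictionary is in Table S3.
MEDICATION_CLASS = {
    "aspirin": "aspirin",
    "clopidogrel": "antiplatelet",
    "warfarin": "anticoagulants",
    "metoprolol": "beta-blocker",
    "amlodipine": "calcium antagonist",
    "nitroglycerin": "nitrate",
    "ranolazine": "ranolazine",
    "atorvastatin": "statin",
}


def classify_medication(order_name):
    """Return the main class for a medication order name, or None if unmatched."""
    lowered = order_name.lower()
    for keyword, main_class in MEDICATION_CLASS.items():
        if keyword in lowered:
            return main_class
    return None


print(classify_medication("Atorvastatin 40 mg tablet"))
```

Orders that match no entry return `None`, which makes them easy to set aside for manual review instead of silently assigning a class.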
Data Preprocessing
Patients whose medical history did not include at least one element from the set of CPT codes were eliminated from the initial cohort, which reduced the number of patients to ~23,000. Before proceeding, the CPT codes were mapped and translated to procedure names (e.g. EKG, CABG for Coronary Artery Bypass Grafting) based on our dictionary. The medication history data set contains the patient ID, medication name, medication code, therapeutic class, pharmaceutical class, pharmaceutical subclass, date the medication was ordered, and gender. A similar translation was done for medications based on the medication dictionary, and medications were assigned to the main classes.
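The filtering and translation step for procedures might look like the following sketch. The two CPT entries are assumed, illustrative examples and are not taken from Table S2; the row layout `(patient_id, cpt_code, date)` and the function name are also hypothetical.

```python
# Assumed illustrative entries; the study's procedure dictionary is in Table S2.
PROCEDURE_NAME = {
    "93000": "EKG",
    "33533": "CABG",
}


def translate_procedures(procedure_rows):
    """Map CPT codes to procedure names and drop patients with no match.

    procedure_rows: iterable of (patient_id, cpt_code, date) tuples.
    Returns a dict patient_id -> list of (date, procedure_name).
    """
    translated = {}
    for patient_id, cpt_code, date in procedure_rows:
        name = PROCEDURE_NAME.get(cpt_code)
        if name is not None:
            translated.setdefault(patient_id, []).append((date, name))
    # Patients without any dictionary code never receive a key,
    # so they are eliminated from the result.
    return translated
```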
Data Processing and Statistical Analytics
Our approach was based on patients’ time series data. Because of the large and diverse patient cohort at UCSF, we could follow each patient from the initial interaction with the UCSF medical system through any medication orders and invasive/noninvasive CAD-related procedures over months and years of treatment. For each patient, the sequence of events was created from the time of initial presentation to the UCSF medical system to the last invasive procedure recorded as of the date of data extraction (e.g. CABG as one of the important targets). We implemented methods to determine the first suspicion of CAD by providers (primary care and/or cardiologist). We measured the time between different events (e.g. the time between prescribing aspirin or any other medication and ordering the EKG test, or from EKG test to CABG) and found the sequence of events for each patient and for groups of similar patients.
One of the novelties of this study is tracing multiple dimensions of patients’ treatment over time: we look at both medications and procedures over the course of treatment. We merged the medication orders and procedures into a single time series sequence from the time of admission to the end of treatment as recorded in the EMR. Event time was defined as the interval from the date of one event (e.g. prescribing aspirin, ordering a stress test) to the date of the next event (e.g. ordering an EKG test), and so on for each subsequent event. Any medication or procedure from the dictionary can count as the first event in a patient's record. We explored all existing event paths (e.g. aspirin ⇒ EKG test ⇒ diagnostic Cardiac Catheterization ⇒ CABG) for individual patients. Then we calculated the time interval, in days, between each pair of consecutive events. Table 2 shows a few examples of the events. Table S4 (Time Intervals) and Table S5 (All Paths) in the supplementary materials show all paths and time intervals for all possible sequences of events. The data set was divided into separate data sets for men and women. For each set, we grouped the rows with the same “Path” and compiled the days spanned into a list containing the days from different patients. Once the list of days for each path was complete, the mean, standard deviation, and number of patients (in effect, the length of the list of days) were calculated for both the men and women data sets. As the last step, the men and women sets were merged (concatenated) on the same paths. Then, two-sample t-tests were performed for each row to evaluate whether the difference between the average delay in days for men and women was statistically significant. Differences in delay time between groups were assessed with the p-value. Table 3 shows a few examples of the results of the data analytics.
Table S5 (All Paths) in the supplementary materials shows the data analytic results for all paths and gender-based differences for all patients.
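The interval and grouping steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the input layout (a dict of patients with a `gender` label and dated events), the function names, and the sample patients are all hypothetical, and a path is taken to be a pair of consecutive events in date order.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, median, stdev


def consecutive_intervals(events):
    """events: list of (date, event_name); returns (path, days) per adjacent pair."""
    ordered = sorted(events)  # chronological order (ties broken by name)
    return [
        (f"{first} => {second}", (d2 - d1).days)
        for (d1, first), (d2, second) in zip(ordered, ordered[1:])
    ]


def summarize_by_path(patients):
    """Group interval lengths by (path, gender) and compute summary statistics.

    patients: dict patient_id -> {"gender": str, "events": [(date, name), ...]}
    """
    days_by_group = defaultdict(list)
    for record in patients.values():
        for path, days in consecutive_intervals(record["events"]):
            days_by_group[(path, record["gender"])].append(days)
    return {
        key: {
            "mean": mean(days),
            "median": median(days),
            # Sample standard deviation needs at least two values.
            "sd": stdev(days) if len(days) > 1 else float("nan"),
            "n": len(days),
        }
        for key, days in days_by_group.items()
    }


# Hypothetical patients, for illustration only.
patients = {
    "a": {"gender": "men", "events": [(date(2012, 1, 1), "Aspirin"), (date(2012, 1, 11), "EKG")]},
    "b": {"gender": "men", "events": [(date(2012, 3, 1), "Aspirin"), (date(2012, 3, 21), "EKG")]},
    "c": {"gender": "women", "events": [(date(2012, 4, 11), "diagnostic Cardiac Catheterization"),
                                        (date(2012, 4, 22), "CABG")]},
}
summary = summarize_by_path(patients)
```

Keying the summary on `(path, gender)` keeps the men and women statistics for one path side by side, which is the shape needed for the per-path comparison.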
Table 2
Example of the time interval and days between events for each patient. Table S4 (Time Intervals) in supplementary materials shows all paths and time intervals for all possible sequences of events.
deidentified Patient ID | Path of events | Time interval | Days |
**1(deID PID) | Aspirin ⇒ EKG | [2011-09-05,2012-01-21] | 76 |
**2(deID PID) | EKG ⇒ diagnostic Cardiac Catheterization | [2012-01-21,2012-04-11] | 80 |
**3(deID PID) | diagnostic Cardiac Catheterization ⇒ CABG | [2012-04-11,2012-04-22] | 11 |
Table 3
Example of gender-based time interval calculation for individual paths. It includes the statistical analysis (average days between events in the path as average days; median as MD; standard deviation as SD; number of patients as #n; and p-value) for patients who went through the path of interest. The full table is in supplementary materials Table S5 (All Paths).
Path | Men: average days | Men: MD | Men: SD | Men: #n | Women: average days | Women: MD | Women: SD | Women: #n | p-value |
Aspirin ⇒ EKG | 160.97 | 28.0 | 305.17 | 2457 | 178.03 | 33.0 | 317.29 | 1371 | 0.106062 |
EKG ⇒ diagnostic Cardiac Catheterization | 304.70 | 51.0 | 471.16 | 2010 | 368.29 | 105.0 | 496.86 | 1033 | 0.000682 |
diagnostic Cardiac Catheterization ⇒ CABG | 77.06 | 15.0 | 231.72 | 237 | 127.18 | 17.0 | 329.98 | 64 | 0.257025 |
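A two-sample t-test can be computed directly from summary statistics of the kind reported in Table 3. The text specifies two-sample t-tests but not the variance assumption, so the sketch below assumes the unequal-variance (Welch) form. The function names are hypothetical, and the p-value is obtained by numerically integrating the t density only to stay within the Python standard library; it is an illustration, not the authors' implementation.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t_stat = (mean1 - mean2) / math.sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, dof


def t_pdf(x, dof):
    """Density of Student's t distribution with dof degrees of freedom."""
    log_c = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
             - 0.5 * math.log(dof * math.pi))
    return math.exp(log_c - (dof + 1) / 2 * math.log1p(x * x / dof))


def two_sided_p(t_stat, dof, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    central_mass = total * h / 3  # probability of 0 <= T <= |t|
    return max(0.0, 1.0 - 2.0 * central_mass)


# Aspirin => EKG row of Table 3 (men vs. women).
t_stat, dof = welch_t(160.97, 305.17, 2457, 178.03, 317.29, 1371)
p_value = two_sided_p(t_stat, dof)
print(f"t = {t_stat:.3f}, df = {dof:.1f}, p = {p_value:.4f}")
```

Working from the mean, SD, and count of each group means the test can be rerun for any row of Table 3 or Table S5 without access to the patient-level lists of days.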