The data source for this project is the healthcare claims patient level database, with a study time period from January 31, 2019 to December 31, 2019. Two patient cohorts, study target and control, were established using endometriosis ICD 10 diagnosis codes. As endometriosis is a female-only condition, female patients aged 18 and older formed the study target cohort. A control cohort provides a patient sample to compare with the study target cohort and is selected using cohort matching algorithms. For both the study target and control cohorts, 36 months of patient medical history prior to the first disease event in 2019 were extracted. The healthcare claims patient level data includes diagnosis codes, medical and surgical codes, and therapeutics and treatments prescribed, all at the transactional level.
A number of analytical methods were leveraged for the analysis, from rules-based patient qualification criteria to Machine Learning algorithms that derive the probability of endometriosis onset. The following sub-sections of the article present a detailed explanation of each of the selected methods. The healthcare claims patient level dataset considered in the analysis is specific to the US healthcare market.
3.1. Healthcare claims patient level database
The healthcare claims patient level database is an anonymous longitudinal patient data set that can be used by organizations that are directly or indirectly associated with healthcare [9, 41]. Interest in patient-level data has been increasing, as researchers, healthcare providers, and pharmaceutical companies realize the potential for better comparisons of treatment outcomes through analysis of longitudinal data that represent individual patients' experiences and interactions with the US healthcare system [42].
The healthcare claims patient level database leveraged for this study consists of medical, hospital, and prescription claims across all payment types [10, 44]. The database covers more than 317 million patients in the US, spans more than 17 years of medical health history, and includes more than 1.9 million healthcare providers [43]. Figure 1 presents the summary of information in the database.
3.2. Cohort selection
For this study, we identified 314,101 confirmed endometriosis patients in 2019 in the healthcare claims patient level database, using predefined ICD 10 diagnosis codes (Table 1). Female patients aged 18 and above were selected for the study target cohort. For the control cohort, a random sample of 3 million female patients meeting the same age criterion was extracted from the database.
Table 1
ICD 10 diagnosis codes of endometriosis
Diagnosis Codes | Diagnosis Long Description |
N80.0 | Endometriosis of uterus |
N80.1 | Endometriosis of ovary |
N80.2 | Endometriosis of fallopian tube |
N80.3 | Endometriosis of pelvic peritoneum |
N80.4 | Endometriosis of rectovaginal septum and vagina |
N80.5 | Endometriosis of intestine |
N80.6 | Endometriosis in cutaneous scar |
N80.8 | Other endometriosis |
N80.9 | Endometriosis, unspecified |
To select a control cohort equal in size to the study target cohort from the 3 million sampled patients, a technique known as ‘propensity score matching’ was used [18]. The propensity matching algorithm [19], a statistical technique, selects control patients whose characteristics, or covariates, are similar to those observed in the study target cohort. The covariates considered for selection were patient age and medical history [20]. Table 2 compares the distributions of the study target and control cohorts by age and Census geography. The patient age variable was created by grouping ages into ranges, and US states were grouped into regions.
Table 2
Comparison between target and control cohort by age and region respectively
Age Group | Target | Control | | Region | Target | Control |
18–24 | 6.45% | 6.55% | | South | 39.90% | 39.90% |
25–34 | 25.01% | 25.24% | | Midwest | 22.78% | 22.76% |
35–44 | 37.57% | 37.08% | | Northeast | 18.82% | 18.84% |
45–54 | 23.13% | 23.18% | | West | 17.02% | 17.02% |
55–64 | 6.22% | 6.31% | | Other | 1.48% | 1.48% |
65+ | 1.62% | 1.64% | | | | |
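The matching step described above can be illustrated with a minimal sketch. It assumes a logistic regression propensity model and greedy 1:1 nearest-score matching without replacement; the study does not state which matching variant was used, and the function name `match_controls`, the synthetic covariates, and the cohort sizes below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def match_controls(X_target, X_pool):
    """Return one index into X_pool per target patient (greedy 1:1 matching).

    Assumes the pool holds at least as many patients as the target cohort.
    """
    X = np.vstack([X_target, X_pool])
    is_target = np.r_[np.ones(len(X_target)), np.zeros(len(X_pool))]

    # Propensity score: estimated probability of belonging to the target
    # cohort given the covariates.
    model = LogisticRegression(max_iter=1000).fit(X, is_target)
    scores = model.predict_proba(X)[:, 1]
    ps_target, ps_pool = scores[: len(X_target)], scores[len(X_target):]

    available = np.ones(len(ps_pool), dtype=bool)
    matched = []
    for p in ps_target:
        # Distance in propensity score; already-used controls are excluded.
        dist = np.abs(ps_pool - p)
        dist[~available] = np.inf
        j = int(np.argmin(dist))
        available[j] = False
        matched.append(j)
    return np.array(matched)


# Synthetic covariates (illustrative only): age and a count of prior events.
rng = np.random.default_rng(0)
X_target = np.column_stack([rng.normal(38, 9, 200), rng.poisson(6, 200)])
X_pool = np.column_stack([rng.normal(45, 15, 2000), rng.poisson(4, 2000)])

idx = match_controls(X_target, X_pool)
X_control = X_pool[idx]  # control cohort, same size as the target cohort
```

Comparing the covariate distributions of `X_target` and `X_control`, as Table 2 does for age and region, is the usual check that the matching achieved balance.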
3.3. Data extraction
The next step in the analysis process was to extract the entire medical history of the patients from the available information in the healthcare claims patient level database. In order to ensure extraction of healthcare history data prior to the first condition event, the event date for the target cohort was established for each patient. In the case of the control cohort, the first activity in 2019 was considered as the event date.
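The event-date rule and the look-back window can be sketched as follows. This is an illustration under stated assumptions, not the study's extraction code: claims are represented as hypothetical `(patient_id, service_date, code)` tuples, any code beginning with `N80` is treated as an endometriosis diagnosis (Table 1), and the helper names are invented for the example.

```python
import calendar
from datetime import date


def months_before(d, months):
    """Date `months` calendar months before d, clamping the day to month end."""
    total = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def event_date(claims, is_target, period_start, period_end):
    """First qualifying claim date inside the study period, or None."""
    in_period = [c for c in claims if period_start <= c[1] <= period_end]
    if is_target:
        # Target cohort: first endometriosis diagnosis (ICD 10 N80.x).
        in_period = [c for c in in_period if c[2].startswith("N80")]
    # Control cohort: first activity of any kind in the period.
    return min((c[1] for c in in_period), default=None)


def history_window(claims, index_date, months=36):
    """Claims dated within `months` months strictly before the event date."""
    start = months_before(index_date, months)
    return [c for c in claims if start <= c[1] < index_date]


# Hypothetical claims for a single target-cohort patient.
claims = [
    ("P1", date(2015, 1, 10), "R10.2"),
    ("P1", date(2017, 5, 2), "N94.6"),
    ("P1", date(2019, 3, 15), "N80.1"),
    ("P1", date(2019, 6, 1), "N80.1"),
]
index_date = event_date(claims, True, date(2019, 1, 31), date(2019, 12, 31))
history = history_window(claims, index_date)
```

Here the event date is 2019-03-15, so only the 2017 claim falls inside the 36-month window; the 2015 claim is too old and the 2019 claims are not prior history.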
Using the event date of each patient, 36 months of medical history data were extracted. The historical data covered all the medical events in patient history, including diagnoses for comorbid conditions, medical and surgical procedures, therapeutics, and treatments prescribed to patients. Only the top 1,000 diagnosis codes, top 800 medical and surgical procedures, and top 500 prescribed drugs were considered for further analysis, as these top codes constituted more than 80% of the total data. A pivot table was created in which transaction-level data was aggregated by the anonymized patient ID. After the historical medical claims data was preprocessed for each cohort independently, the two datasets were integrated into a single data frame. The integrated data frame had more than 2,600 features. The dataset was further standardized and split into a training set and a test set using a 70:30 ratio [21]. The training set is used to identify the key features of endometriosis onset, while the test set is used to validate whether these features accurately predict condition onset [22]. Splitting the data into train and test sets helps to assess model performance and its ability to generalize to unseen data [23].
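The aggregation, standardization, and 70:30 split can be sketched with pandas and scikit-learn. The code names (`dx_0001`, `px_0001`, and so on), the patient IDs, and the labels are hypothetical placeholders. One design choice here is an assumption rather than something the text specifies: the scaler is fitted on the training set only, so that no information from the test set enters the standardization.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction-level claims: one row per (patient, code) event.
codes = ["dx_0001", "px_0001", "rx_0001", "rx_0002"]
rows = [
    (f"P{i}", code)
    for i in range(10)
    for j, code in enumerate(codes)
    for _ in range((i + j) % 3)  # 0, 1, or 2 claims per patient and code
]
claims = pd.DataFrame(rows, columns=["patient_id", "code"])

# Hypothetical cohort labels: 1 = study target, 0 = control.
labels = pd.Series({f"P{i}": i % 2 for i in range(10)}, name="is_target")

# Pivot: one row per patient, one column per code, cell = number of claims.
features = pd.crosstab(claims["patient_id"], claims["code"])
y = labels.loc[features.index]

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    features, y, test_size=0.3, random_state=0
)

# Standardize using statistics learned from the training set only.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
```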
3.4. Machine Learning algorithms’ overview
Machine Learning algorithms can be grouped into two categories: supervised and unsupervised learning.
3.4.a. Supervised learning algorithms
Supervised learning is the process of training machine learning algorithms to learn a mapping from an input space (X) to an output space (Y), i.e. Y = f(X) [25]. The major objective is to approximate the mapping function (f) so that, when a new data point (x) is added, the outcome (y) can be predicted [26]. Supervised learning algorithms are mainly used for classification and prediction problems [32]. The most popular supervised algorithms include logistic regression, decision trees (DTs), random forest (RF), extreme gradient boosting, support vector machines (SVMs), Naïve Bayes, adaptive boosting (AdaBoost), and artificial neural networks (ANNs) [31].
3.4.b. Unsupervised learning algorithms
Unsupervised learning algorithms, on the other hand, try to learn the hidden patterns within the input dataset (X) [28]. These models are called unsupervised because, unlike in supervised learning, there is no labeled output to guide them [29]. The algorithms are left to their own devices to learn, discover, and showcase the patterns in the input data (X). They are widely used for tasks such as discovering natural clusters, dimension reduction, and anomaly detection. k-Means clustering, principal component analysis (PCA), factor analysis (FA), singular value decomposition (SVD), and the apriori algorithm (association rules) are some popular examples of unsupervised learning algorithms [31].
Depending on the study objectives and the available data, algorithms are explored, tested for performance and data type fit, and selected accordingly. We framed endometriosis onset prediction as a supervised classification problem. SVM, RF, AdaBoost, ANN, etc. are other options that were explored in disease prediction; however, Logistic Regression and extreme gradient boosting (XGB) were selected to develop a highly predictive algorithm of the disease onset. Logistic Regression allows study of the odds of endometriosis occurrence for a given medical event [15], while XGB offers more flexibility in fine-tuning the hyper-parameters than other tree-based algorithms [11].
Logistic Regression
Logistic Regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist [14, 15]. Mathematically, a binary logistic model has a dependent variable with two possible values, where the two values are labeled "0" and "1" [33]. Outputs with more than two values are modeled by multinomial logistic regression. Logistic Regression is used in various fields, including healthcare and social sciences [34].
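A small worked example makes the link between the logistic function and the odds concrete. The coefficient and intercept values below are hypothetical and chosen only for illustration; the point is that, for a binary feature, the exponentiated coefficient equals the ratio of the odds with and without the feature.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function, mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """P(Y = 1 | x) under a binary logistic model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return logistic(z)


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


# Hypothetical model with one binary feature (a medical event present or not).
weights, bias = [0.9], -2.0
p_with = predict_proba([1], weights, bias)
p_without = predict_proba([0], weights, bias)

# The odds ratio for the feature equals exp(coefficient).
odds_ratio = odds(p_with) / odds(p_without)
```

This is the property that allows the odds of endometriosis occurrence to be read directly from the fitted coefficient of a given medical event.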
Extreme Gradient Boosting
Gradient boosting is a machine learning algorithm that builds an ensemble of weak prediction models, mostly decision trees [11]. An individual tree is a simple, often unreliable, model, but when multiple trees are grouped together they can create a robust algorithm [12]. XGB starts by creating a first simple tree [35] and then progresses sequentially, building upon the weaker learners, with each iteration correcting the errors of the previous trees until a stopping point is reached, such as the specified number of trees (estimators) [36].
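The sequential idea can be shown with a deliberately simplified sketch: plain gradient boosting of one-split trees (stumps) on the log loss, written with NumPy. It is not the XGBoost implementation, which adds regularization, second-order updates, and deeper trees; the function names and the synthetic data are hypothetical.

```python
import numpy as np


def fit_stump(X, residual):
    """Best single-feature threshold split for the residuals (least squares)."""
    best = None
    for j in range(X.shape[1]):
        # Exclude the largest value so the right-hand side is never empty.
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            lv, rv = residual[left].mean(), residual[~left].mean()
            err = ((residual - np.where(left, lv, rv)) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, t, lv, rv)
    return best[1:]


def predict_stump(stump, X):
    j, t, lv, rv = stump
    return np.where(X[:, j] <= t, lv, rv)


def boost(X, y, n_estimators=100, learning_rate=0.5):
    """Fit stumps sequentially; each one targets the current errors."""
    score = np.zeros(len(y))  # raw log-odds, starting at p = 0.5
    stumps = []
    for _ in range(n_estimators):
        p = 1.0 / (1.0 + np.exp(-score))
        residual = y - p  # negative gradient of the log loss
        j, t, lv, rv = fit_stump(X, residual)
        stump = (j, t, learning_rate * lv, learning_rate * rv)
        score += predict_stump(stump, X)
        stumps.append(stump)
    return stumps


def predict_proba(stumps, X):
    score = sum(predict_stump(s, X) for s in stumps)
    return 1.0 / (1.0 + np.exp(-score))


# Synthetic binary problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(size=(120, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(float)

model = boost(X, y)
accuracy = ((predict_proba(model, X) > 0.5) == (y == 1)).mean()
```

Each stump on its own is a weak learner; the ensemble improves because every new stump is fitted to what the previous ones still get wrong.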
Chi-Square Test
The Chi-square test is one of the most widely used non-parametric tests [37]. It compares the observed frequencies in a contingency table with the frequencies expected under a null hypothesis, and is often utilized to test the independence of two attributes; a closely related form is popularly known as the ‘goodness of fit’ test [38]. In this work, the Chi-square test is used to identify the top significant features given the dependent variable (Y) [40].
Logistic Regression, being the simplest of the machine learning algorithms, was selected as the base model for the analysis and used as the benchmark for other models’ performance. Both Logistic Regression and XGB models were trained, and the top 1,000 features from each algorithm were selected out of the more than 2,600 features used in the model runs. To decrease the number of data elements and to select only the variables most important for predicting the condition onset, we also used a Chi-Square test to identify the top 1,000 features. As a next step, the unique features from each model were utilized to train the final machine learning model to predict the endometriosis occurrence probability. Algorithms were trained in Python 3.5 using the ‘scikit-learn’ and ‘xgboost’ libraries.
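One possible reading of the feature-combination step is a union of the top-ranked features from the three methods, sketched below. The scores, feature names, and the use of absolute coefficient size to rank Logistic Regression features are assumptions made for illustration; the text does not specify how each ranking was produced.

```python
def top_k(scores, k):
    """Names of the k features with the largest scores."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return {name for name, _ in ranked[:k]}


# Hypothetical per-method importance scores for five features.
lr_scores = {"dx_a": 2.1, "dx_b": 0.2, "rx_c": 1.4, "px_d": 0.1, "rx_e": 0.05}
xgb_scores = {"dx_a": 0.5, "dx_b": 0.05, "rx_c": 0.1, "px_d": 0.35, "rx_e": 0.0}
chi_scores = {"dx_a": 40.0, "dx_b": 12.0, "rx_c": 3.0, "px_d": 1.0, "rx_e": 0.5}

k = 2  # the study used 1,000; a small value keeps the example readable
selected = top_k(lr_scores, k) | top_k(xgb_scores, k) | top_k(chi_scores, k)
```

In this toy case the union keeps four of the five features, since each method contributes one feature the others ranked lower, and `rx_e` is dropped because no method ranks it in its top two.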