Development of the ML algorithm
To develop the ML algorithm, we adopted a supervised ML approach. The following steps were performed: i. selection of the final data set, ii. target definition, iii. coding of variables for a given time window, iv. splitting of the final data into training and test data sets, v. variable selection, vi. model training, vii. validation of the model with the test data set and viii. selection of the final model.
i. Selection of the final data set
We selected a training data set from a population-based epidemiological cohort (CONSTANCES) to develop an algorithm to estimate the incidence of diabetes. Participants were recruited into CONSTANCES between January 1, 2012, and December 31, 2014. Upon completion, this cohort comprises a nationally representative, randomly selected sample of 50,954 individuals aged between 18 and 69 years (inclusive) and living in France [14, 15]. Participants are randomly selected from the beneficiaries of the National Health Insurance Fund (CNAM [Caisse Nationale d’Assurance Maladie]). In this cohort, data are collected using a self-administered questionnaire (SAQ) and a medical questionnaire (MQ), and these data were used to define known diabetes cases and pharmacologically treated diabetes [16]. For known diabetes cases, participants reported having diabetes in the SAQ through the item: “Have you ever been told by a doctor or other health care professional that you had diabetes?” In the medical questionnaire, completed during the medical examination, the physician asked each participant whether they had diabetes. For pharmacologically treated diabetes, two SAQ questions addressed diabetes treatment: “Are you currently being treated for diabetes with oral medication?” and “Are you currently being treated for diabetes with one or more insulin injections?” [16].
After completing the SAQ on health status, lifestyle factors, and socioeconomic and demographic characteristics, participants attended their assigned health screening center for a medical examination, which included the medical questionnaire, a physical examination and blood sampling. This previously collected information was linked with the French National Health Data System (SNDS). We excluded pregnant women, women who reported a prior diagnosis of gestational diabetes mellitus, and participants without SNDS data.
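As an illustration, a minimal sketch of the exclusion step in Python is shown below, assuming the cohort extract is held in a pandas DataFrame; the file and column names (is_pregnant, gestational_diabetes, has_snds_link) are hypothetical placeholders, not actual CONSTANCES variable names.

```python
import pandas as pd

# Hypothetical cohort extract; all column and file names are placeholders,
# not CONSTANCES fields.
cohort = pd.read_csv("constances_extract.csv")

final = cohort[
    ~cohort["is_pregnant"]             # exclude pregnant women
    & ~cohort["gestational_diabetes"]  # exclude prior gestational diabetes mellitus
    & cohort["has_snds_link"]          # keep only participants with SNDS data
]
```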
ii. Target definition
Diabetes status was defined according to CONSTANCES as described above. Diabetes cases treated for the first time within the 12 months before the SAQ date were defined as incident cases (target 1). These cases included both type 1 and type 2 diabetes. Participants with no diabetes treatment during the 12 months before the SAQ date were defined as non-diabetes cases (target 0). All remaining diabetes cases were excluded (see Figure 1).
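To make the labeling rule concrete, the following minimal Python sketch derives the two targets from per-participant treatment dates; saq_date, first_treatment_date and last_treatment_date are hypothetical column names, and the file name is illustrative only.

```python
import pandas as pd

# Hypothetical per-participant table; all column and file names are placeholders.
df = pd.read_parquet("cohort_with_treatment_dates.parquet")

window_start = df["saq_date"] - pd.DateOffset(months=12)

# Target 1 (incident): first diabetes treatment falls within the 12 months before the SAQ.
incident = df["first_treatment_date"].between(window_start, df["saq_date"])

# Target 0 (non-diabetes): no diabetes treatment dispensed in that 12-month window.
untreated = df["last_treatment_date"].isna() | (df["last_treatment_date"] < window_start)

df.loc[incident, "target"] = 1
df.loc[~incident & untreated, "target"] = 0

# All remaining (prevalent) diabetes cases fall outside both groups and are dropped.
df = df.dropna(subset=["target"])
```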
iii. Coding of variables for a given time window
In CONSTANCES, we coded only those variables that were also available in the SNDS, so that the resulting ML algorithm could be applied to the SNDS to estimate the incidence of diabetes. A total of 3,483 continuous variables were coded and standardized (mean = 0, standard deviation = 1) over the 24 months preceding the SAQ date. The rationale for a 24-month window before the SAQ was to provide a period long enough to capture changes in diagnostic procedures, hospitalizations and drug consumption, allowing the incidence of diabetes to be estimated with high accuracy. The main categories of variables were: number of medical consultations (50 variables); drugs dispensed, coded at the fifth level of the Anatomical Therapeutic Chemical classification [ATC level 5] (461 variables); biological tests (747 variables); medical acts (i.e., X-ray, surgery, etc.) (2,135 variables); all hospitalizations (5 variables); hospitalizations with a procedure (i.e., dialysis, radiotherapy, etc.) (5 variables); hospitalizations without a procedure (5 variables); and hospitalizations related to the following associated health conditions: diabetes, heart failure, stroke, heart attack, foot ulcer, lower limb amputation, ischemic heart disease, transient ischemic attack, end-stage renal failure, diabetic coma, diabetic ketoacidosis and cancer (75 variables).
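A minimal sketch of the standardization step is shown below, assuming the 3,483 variables have already been aggregated per participant over the 24-month window; the file name is a placeholder.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: one row per participant, one column per variable,
# each aggregated over the 24 months preceding that participant's SAQ date.
features = pd.read_parquet("features_24_months.parquet")

scaler = StandardScaler()  # standardizes each variable to mean = 0, SD = 1
X = pd.DataFrame(scaler.fit_transform(features),
                 columns=features.columns, index=features.index)
```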
iv. Splitting of the final data set into training and test data sets
The final data set was randomly split into a training data set (80%) and a test data set (20%). In the training data set, there was a substantial class imbalance between positive targets (target 1, treated diabetes cases) and negative targets (target 0, non-diabetes cases). To avoid biasing the ML algorithm and skewing the class distribution, we performed random undersampling of the target 0 group to obtain the same number of individuals in both target groups. Variable selection and model selection were performed using the training data only; the test data were used solely to assess the final model's performance.
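Below is a minimal sketch of the split and undersampling, assuming X is the standardized feature matrix and y the target vector from the sketches above; the stratified split and the fixed random seeds are assumptions, not stated in the text.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 80/20 split; stratification on the target and the seed are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Random undersampling: draw as many target-0 rows as there are target-1 rows.
train = X_train.assign(target=y_train)
n_pos = int((train["target"] == 1).sum())
balanced = pd.concat([
    train[train["target"] == 1],
    train[train["target"] == 0].sample(n=n_pos, random_state=42),
])
X_bal, y_bal = balanced.drop(columns="target"), balanced["target"]
```

The same balancing can also be done with RandomUnderSampler from the imbalanced-learn package.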
v. Variable selection
First, we removed all variables with zero variance, and then the ReliefF exp score was estimated to quantify the relevance of each variable for differentiating between target 1 and target 0. The ReliefF expRank method is noise tolerant and is not affected by feature interactions [17-19]. All variables were ranked according to their ReliefF exp score; for continuous variables, score values range from 0 to 1 [18]. The cutoff score of 0.01 was selected by visual inspection of the ordered plot of ReliefF values for all variables (the “elbow plot” approach). Variables with a ReliefF exp score of 0.01 or higher were retained to train the models, and variables scoring below 0.01 were excluded.
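As an illustration, the sketch below approximates this step with the ReliefF implementation from the scikit-rebate package, reusing X_bal and y_bal from the previous sketch; the paper's ReliefF expRank estimator (available, e.g., in the R package CORElearn) is not identical to this implementation, and the neighborhood size is an assumption.

```python
import numpy as np
from skrebate import ReliefF  # stand-in for the ReliefF expRank estimator

X_arr, y_arr = X_bal.to_numpy(), y_bal.to_numpy().astype(int)

# Remove zero-variance variables first, as described above.
keep = X_arr.var(axis=0) > 0
X_arr = X_arr[:, keep]

relief = ReliefF(n_neighbors=100)  # neighborhood size is an assumption
relief.fit(X_arr, y_arr)

# Keep variables at or above the elbow-plot cutoff of 0.01.
selected = relief.feature_importances_ >= 0.01
X_selected = X_arr[:, selected]
```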
Steps vi to viii: Model training, validation of the model with the test data set, and model selection
The following four models were applied to the training data set: 1. linear discriminant analysis (LDA), 2. logistic regression (LR), 3. flexible discriminant analysis (FDA) and 4. a decision tree model (C5). For each model, we compared performance in terms of the area under the receiver operating characteristic (AROC) curve. A first validation of the models was performed using repeated k-fold cross-validation (three repeats of five-fold) on the training data set. The models' performances were then assessed on the test data set. Next, we automated the model selection process using a set of predefined metrics: sensitivity, specificity, positive predictive value, negative predictive value, F1-score and kappa. Finally, a single model was retained based on its performance and its transferability to other databases.
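A minimal sketch of the comparison loop is shown below, reusing X_selected, y_bal, X_test, y_test, keep and selected from the previous sketches. FDA and C5.0 have no direct scikit-learn equivalents, so FDA is omitted and a CART tree stands in for C5; the random seeds are assumptions.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.metrics import confusion_matrix, f1_score, cohen_kappa_score

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "LR": LogisticRegression(max_iter=1000),
    "CART (C5 stand-in)": DecisionTreeClassifier(random_state=42),
}

y_arr = y_bal.to_numpy().astype(int)
# Three repeats of five-fold cross-validation on the training data.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
# Apply the same zero-variance filter and ReliefF selection to the test data.
X_test_sel = X_test.to_numpy()[:, keep][:, selected]

for name, model in models.items():
    aroc = cross_val_score(model, X_selected, y_arr, cv=cv, scoring="roc_auc")
    y_pred = model.fit(X_selected, y_arr).predict(X_test_sel)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print(f"{name}: CV AROC={aroc.mean():.3f}, "
          f"sens={tp/(tp+fn):.3f}, spec={tn/(tn+fp):.3f}, "
          f"PPV={tp/(tp+fp):.3f}, NPV={tn/(tn+fn):.3f}, "
          f"F1={f1_score(y_test, y_pred):.3f}, "
          f"kappa={cohen_kappa_score(y_test, y_pred):.3f}")
```

The original model list suggests a workflow in which FDA and C5.0 are directly available (e.g., R), so this Python sketch should be read as an approximation of the comparison, not the authors' exact toolchain.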