Patients
The study subjects (n = 34) were patients recruited to the prospective KISS trial investigating ePRO follow-up in cancer patients receiving ICIs [19]. In brief, the trial included patients with advanced cancers treated with anti-PD-(L)1 agents in outpatient settings who had internet and email access. At the initiation of the treatment phase (within 0–2 weeks of the first anti-PD-(L)1 infusion), patients received an email notification to complete an electronic questionnaire covering 18 symptoms at baseline and weekly thereafter until treatment discontinuation or six months of follow-up. Data on the irAEs (nature of the AE, dates of onset and resolution, dates of changes in AE severity, and the highest grade based on the CTCAE classification) were collected prospectively in the trial.
The study was approved by the Pohjois-Pohjanmaan sairaanhoitopiiri (PPSHP) ethics committee (number 9/2017) and by Valvira (number 361); details of the study are publicly available at clinicaltrials.gov (NCT3928938). The study was conducted in accordance with the Declaration of Helsinki and Good Clinical Practice guidelines.
Prediction models
The aim of this study was to create models for predicting the presence (is the predicted irAE truly an irAE) and the onset (is an irAE developing) of irAEs based on evolving patient-reported symptoms collected digitally, in a prospective manner, from cancer patients receiving ICI therapies. For both modelling cases, the output of the prediction model is a continuous value in the range [0, 1] depicting the probability of the positive event, i.e., the presence or onset of an irAE. With a classification threshold (0.5 was used for both models), the continuous probabilities were converted into binary outcomes: when the predicted probability of the positive event exceeded 0.5, the prediction was labeled positive (irAE onsetting or present), and otherwise negative (irAE not onsetting or present). Hence, the modelling methodology used in this study follows the general framework of binary classification in machine learning (ML). The first dataset included 16 540 reported symptoms from 34 ICI-treated cancer patients in outpatient settings, covering the 18 symptoms monitored with the Kaiku Health digital platform. The second dataset included physician-confirmed, prospectively collected irAE data from the same 34 patients, recorded in the eCRFs of the trial and containing the initiation and end dates, CTCAE class, and severity of 26 irAEs. The timelines of the ePROs and irAEs were synchronized by date. Of note, some patients experienced multiple irAEs; the incidence of irAEs in this patient cohort was ~40% (n = 14). Multiple observations from the same patients were used to create a timeline of irAEs; however, the parameters differ at every analyzed time point, so each time point constitutes a new sample. Furthermore, the gradient boosting trees algorithm used can handle intercorrelated observations and features.
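To make the thresholding step concrete, the following minimal Python sketch (the function name and example probabilities are hypothetical, not taken from the study code) converts predicted probabilities into binary irAE labels using the 0.5 threshold:

```python
import numpy as np

def to_binary(probabilities, threshold=0.5):
    """Convert predicted irAE probabilities in [0, 1] into binary labels:
    1 = positive (irAE onsetting/present), 0 = negative."""
    return (np.asarray(probabilities) > threshold).astype(int)

# Example: three consecutive weekly predictions for one patient (hypothetical)
print(to_binary([0.12, 0.57, 0.93]))  # -> [0 1 1]
```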
Two ML models were built using the open-source Python library XGBoost, which offers a widely used, high-performance implementation of gradient boosting, an established algorithm well suited for classification problems. Gradient boosting is an ensemble-learning algorithm, i.e., it combines many, usually tens or hundreds, of decision trees. These decision trees, i.e., classification trees, are weak learners, but when combined with the gradient boosting approach they form a strong learner capable of capturing complex relationships in the training data. By combining the ePRO data and the clinical data, the first model was trained to detect the presence of irAEs and the second model their onset (0–21 days prior to the diagnosis). The dataset was split into training (70% of the data) and test (30% of the data) sets by random allocation at the patient-observation level. The test set was left out of model training and tuning and was used only to evaluate model performance. Hyperparameter tuning for both prediction models was done using grid search with repeated, stratified 5-fold cross-validation (five repeats). The model features included patient information (age, sex, and time from treatment initiation) and ePRO data from the 18 monitored symptoms. For each symptom, the three most recent values, linearly scaled by the time difference from the latest report, and the latest change in symptom severity were included as features. This yielded 75 features in total (18 × 4 symptom features plus 3 patient-level features) for both models.
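As an illustration of this workflow, the sketch below pairs XGBoost with scikit-learn's grid search and repeated, stratified 5-fold cross-validation. The placeholder data, hyperparameter grid, and random seeds are assumptions for demonstration only and do not reproduce the study's actual feature matrix or search space:

```python
import numpy as np
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     train_test_split)
from xgboost import XGBClassifier

# Synthetic stand-in for the 75 engineered features and binary irAE labels
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 75))
y = rng.integers(0, 2, size=500)

# 70/30 split; the test set is held out from training and tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Repeated, stratified 5-fold cross-validation with five repeats
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)

# Illustrative grid; the study's actual search space is not reported here
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    XGBClassifier(objective="binary:logistic", eval_metric="logloss"),
    param_grid, cv=cv, scoring="roc_auc")
search.fit(X_train, y_train)

# Continuous probabilities for the positive class on the held-out set
test_probabilities = search.predict_proba(X_test)[:, 1]
```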
The prediction performance of the models was evaluated using four commonly used metrics: accuracy, Area Under the Curve (AUC), F1-score, and the Matthews correlation coefficient (MCC), which are briefly described below. Accuracy is the percentage of predictions that are correct; 100% indicates perfect classification. AUC is a performance metric for binary classification ranging from 0 to 1, where a value of 0.5 corresponds to random guessing and 1 to a perfect classifier. The F1-score is the harmonic mean of precision (how many of the cases predicted as positive are actually positive) and recall (how many of the positive cases are detected) and takes values between 0 and 1. The MCC summarizes all four possible outcomes of binary prediction: true and false positives, and true and false negatives. MCC is also suitable for analyzing imbalanced datasets, where one class is much rarer than the other. It can be interpreted as a correlation coefficient between the observed and predicted classifications and takes values between −1 and 1, where 1 is perfect classification, 0 is random guessing, and −1 indicates a completely contradictory classification.
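For reference, all four metrics are available in scikit-learn; the sketch below computes them on hypothetical labels and predictions (not study data), with AUC computed from the continuous probabilities and the other metrics from the thresholded labels:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

y_true = [0, 0, 1, 1, 0, 1]              # observed irAE status (hypothetical)
y_prob = [0.2, 0.6, 0.8, 0.4, 0.1, 0.9]  # predicted probabilities
y_pred = [int(p > 0.5) for p in y_prob]  # 0.5 classification threshold

print("Accuracy:", accuracy_score(y_true, y_pred))
print("AUC:",      roc_auc_score(y_true, y_prob))
print("F1-score:", f1_score(y_true, y_pred))
print("MCC:",      matthews_corrcoef(y_true, y_pred))
```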