Data sources.
Data for this study were collected from the EMR scheduling system of CMUH located in Taichung, Taiwan. The data set contained detailed information on patients, surgeries, specialties and surgical teams. A total of 170,748 cases performed between Jan 1, 2017, and Dec 31, 2019, were used for model development. Additionally, 8,672 cases performed between Mar 1 and Apr 30, 2020, were used for external model evaluation in this study. Over 400 different types of procedures across 25 surgical specialties were included in the data set. Institutional review board approval (CMUH109-REC1-091) was obtained from CMUH before carrying out this study.
Exclusion criteria, data processing and feature selection.
Emergent and urgent surgical cases were removed since these two types of surgeries cannot be scheduled in advance. Cases in which the surgeon was younger than 28 years, or in which the surgical case duration was longer than 10 hours or shorter than 10 minutes, were also removed. Surgical records with missing values were excluded, as were patients who were pregnant, who underwent two or more surgical procedures at the same time, or who were under 20 years of age. The exclusion criteria are shown in Fig. S2. This process yielded a data set of 142,448 cases that were used for model training and testing. Applying the same criteria to the data of Mar 1 to Apr 30, 2020, left 7,231 cases after exclusion.
Features were selected from the available data sources based on a literature review and discussions with surgeons and administrators of CMUH. Although model performance could be enhanced by some postoperative information (e.g., total blood loss), such parameters cannot be used as features for model training because before surgery they are either unavailable or only roughly estimated by surgeons. Therefore, only variables that were available before surgery were selected for model development.
The procedure type and International Classification of Diseases (ICD) code variables each contained hundreds to thousands of categories. To reduce the dimensionality introduced by one-hot encoding of categorical features, we merged categories with fewer than 50 cases in the training set into a single category labeled ‘Others’. Similarly, categories of primary surgeon’s ID, specialty, anesthesia type and room number with fewer than 50 cases were merged into the ‘Others’ category.
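The rare-category merging described above can be sketched as follows. This is an illustrative Python/pandas sketch (the study itself used R); the function name `lump_rare_categories` and the example ICD codes are hypothetical.

```python
import pandas as pd

def lump_rare_categories(series: pd.Series, min_count: int = 50) -> pd.Series:
    """Replace categories with fewer than `min_count` cases by 'Others'."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), "Others")

# Hypothetical ICD codes: 'A01' is frequent, 'B02' is rare.
icd = pd.Series(["A01"] * 60 + ["B02"] * 3)
print(lump_rare_categories(icd).unique())  # → ['A01' 'Others']
```

In practice the rare set would be determined on the training data only and then applied to the testing and external-evaluation data, so that unseen levels do not create new one-hot columns.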
In addition, because surgical case duration can be related to a surgeon’s performance, which in turn is affected by working time, we analyzed each primary surgeon’s previous surgical events. The features included the number of previous surgeries and the total surgical minutes performed by the same primary surgeon on the same day and within the preceding 7 days, as well as the number of urgent and emergent operations the same surgeon had performed prior to the current case. In total, 24 predictor variables were included for predictive model building in this study. These predictors can be categorized into 5 groups: patient, surgical team, operation, facility and primary surgeon’s prior events (see Table 1).
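The same-day and 7-day workload features might be derived as in the following Python/pandas sketch (the study used R; the column names `surgeon_id`, `start_time` and `duration_min` are hypothetical, and the urgent/emergent count is omitted for brevity).

```python
import pandas as pd

def surgeon_prior_events(df: pd.DataFrame) -> pd.DataFrame:
    """For each case, count the same surgeon's earlier cases and surgical
    minutes on the same day and within the preceding 7 days."""
    df = df.sort_values(["surgeon_id", "start_time"]).copy()
    groups = []
    for _, g in df.groupby("surgeon_id"):
        g = g.copy()
        same_day_n, week_n, week_min = [], [], []
        for _, row in g.iterrows():
            earlier = g[g["start_time"] < row["start_time"]]
            same = earlier[earlier["start_time"].dt.date == row["start_time"].date()]
            week = earlier[earlier["start_time"] >= row["start_time"] - pd.Timedelta(days=7)]
            same_day_n.append(len(same))
            week_n.append(len(week))
            week_min.append(week["duration_min"].sum())
        g["prior_same_day_n"] = same_day_n
        g["prior_7d_n"] = week_n
        g["prior_7d_min"] = week_min
        groups.append(g)
    return pd.concat(groups)
```

A vectorized rolling-window implementation would be preferable at the scale of 142,448 cases; the explicit loop above is kept only to make the window definitions clear.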
Model development and training.
We applied multiple ML methods to predict surgical case duration, defined as the total time (in minutes) from the patient entering the OR to exiting the OR. Historic averages of case duration based on surgeon-specific or procedure-specific data from the EMR system were used as baseline models for comparison. We first performed multivariate linear regression (Reg) to predict surgical case duration. However, the distribution of surgical case duration was skewed to the right (Fig. S1 in the Supplementary info), so we performed a logarithmic transformation on the surgical case duration to reduce the skewness. The model built from log-transformed multivariate linear regression (logReg) outperformed Reg on all evaluation indexes, and the subsequent ML algorithms were therefore also trained using the log-transformed case duration as the target.
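The effect of the log transformation can be illustrated with a short Python sketch (the study used R). The duration values below are hypothetical stand-ins for the right-skewed distribution in Fig. S1; predictions made on the log scale are back-transformed to minutes before evaluation.

```python
import numpy as np

def skewness(x):
    """Sample skewness (third standardized moment)."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / np.std(x) ** 3

# Hypothetical right-skewed case durations in minutes.
durations = np.array([25, 40, 45, 60, 90, 120, 300, 600])
y = np.log(durations)            # log-transformed target used by logReg and the ML models
pred_minutes = np.exp(y.mean())  # back-transform a log-scale prediction to minutes
```

Taking the logarithm compresses the long right tail, so the transformed target is closer to symmetric and better suited to least-squares fitting.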
The first ML algorithm that we tested was random forest (RF), a tree-based supervised learning algorithm. RF uses bootstrap aggregation, or bagging, for regression by constructing a multitude of decision trees from the training data and outputting the mean predicted value of the individual trees [21]. The bagging technique is unlikely to result in overfitting; in other words, it reduces the variance without increasing the bias. Tree-based techniques suited our data, which include a large number of categorical variables (e.g., ICD code and procedure type), most of which were sparse. The number of trees was set to 50 in this study. The extreme gradient boosting (XGB) algorithm is the other supervised ML algorithm that was tested for comparison with RF. Recently, the XGB algorithm has gained popularity within the data science community due to its ability to overcome the curse of dimensionality and to capture interactions between variables [22].
XGB is also a decision tree-based algorithm but is more computationally efficient for real-time implementation than RF. The XGB and RF algorithms differ in the way the trees are built. It has been shown that XGB performs better than RF if its parameters are tuned carefully; otherwise, it is more likely to overfit noisy data [23, 24]. We adopted a 5-fold cross-validation strategy to tune the best number of iterations, using η = 0.5 (step size shrinkage to prevent overfitting), a maximum tree depth of 3, γ = 0.3 (minimum loss reduction, where a larger γ represents a more conservative algorithm) and α = 1 (L1 regularization weighting term, where a larger value indicates a more conservative model).
A data-splitting strategy was used in training all the models to prevent overfitting. We randomly separated the data into training and testing subsets at a ratio of 4:1. The training data were used to build the different predictive models as well as to extract important predictor variables. The testing data were used for internal evaluation of the models. In addition to internal evaluation, external evaluation of all the models was performed using the data from Mar 1 to Apr 30, 2020, which were not included in the original data set for ML model training. The results of the external evaluation are thus better suited to verifying the robustness of the trained models in making accurate predictions. Historic averages of case duration for surgeon- or procedure-specific data calculated from EMR data were also evaluated on the same internal and external testing sets to ensure a fair and uniform comparison across all models. Data processing and cleaning as well as model development in this study were performed using R software. The packages “xgboost” and “randomForest” were used to implement the XGB and RF algorithms in R, respectively [25, 17].
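The random 4:1 split might look as follows (an illustrative Python sketch; the study used R, and the seed value and function name are assumptions).

```python
import numpy as np

def split_4_to_1(n_cases: int, seed: int = 42):
    """Randomly partition case indices into training and testing sets at 4:1."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_cases)
    cut = int(n_cases * 0.8)
    return idx[:cut], idx[cut:]

# 142,448 cases remained after applying the exclusion criteria.
train_idx, test_idx = split_4_to_1(142448)
```

The external-evaluation cases from Mar to Apr 2020 sit entirely outside this split, so they never influence training or model selection.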
Model evaluation.
Multiple predictive models were built to predict surgical case duration, and several standards can be applied to evaluate their predictive performance. The three key metrics used to evaluate model performance in this study were (1) R-square (R2), (2) mean absolute error (MAE), and (3) the percentages of cases over, under and within a tolerance threshold.
R2 is the coefficient of determination; it represents the proportion of the variance for the actual case duration that is explained by predictor variables in our models.
MAE measures the average of errors between the actual case durations and the predictions.
Percentage overage indicates the percentage of cases whose actual duration exceeded both the prediction + 10 % tolerance threshold (i.e., 1.1 ∗ prediction) and the prediction + 15 minutes. Likewise, percentage underage is the percentage of cases whose actual duration fell below both the prediction − 10 % tolerance threshold (0.9 ∗ prediction) and the prediction − 15 minutes. Percentage within therefore equals 100 % − (percentage overage + percentage underage).
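The three metrics can be computed as in the following Python sketch (the study used R). The reading of the tolerance band, where a case counts as over or under only when it breaches both the 10 % and the 15-minute threshold, is our interpretation of the definition above.

```python
import numpy as np

def evaluate(actual, pred):
    """Return R2, MAE, and % overage / underage / within the tolerance band."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    ss_res = np.sum((actual - pred) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    mae = np.mean(np.abs(actual - pred))
    upper = np.maximum(1.1 * pred, pred + 15)  # must exceed both thresholds
    lower = np.minimum(0.9 * pred, pred - 15)  # must fall below both thresholds
    overage = 100 * np.mean(actual > upper)
    underage = 100 * np.mean(actual < lower)
    within = 100 - overage - underage
    return r2, mae, overage, underage, within
```

For example, with a constant 100-minute prediction, a 130-minute case counts as overage (it exceeds both 110 and 115 minutes) while a 112-minute case counts as within.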
Data availability
The minimum dataset (March to April 2020) used in external evaluation for this study is available from our web site: https://cmuhopai.azurewebsites.net/. The dataset required to replicate model training and internal evaluation contains personal data and is not publicly available, in keeping with the Data Protection Policy of CMUH.
Code availability
The code used in this study is currently unavailable but may become available in the future from the corresponding author on reasonable request.