2.1 Data
Tampere University Hospital is an academic hospital located in Tampere, Finland serving a population of 900,251 patients and providing level 1 trauma center equivalent capabilities. The ED has 70 beds and approximately 100,000 patients are treated annually. For this study, data on all registered ED visits were obtained from hospital database created during the sample period from December 1, 2014 to June 19, 2019. The requested dataset contained timestamps for admission and discharge, diagnosis, patient’s age, gender, identifier of the physician and patient’s municipality of residence. Overcrowding is relatively common and an internal protocol has been implemented to help the staff to react to these situations. Two levels of overcrowding are recognized: level 1 when occupancy exceeds 80 patients (70% occupancy) and level 2 when occupancy exceeds 100 patients (88% occupancy). Overcrowding annotations were acquired from a patient logistics system (Uoma, Unitary Healthcare Ltd., Tampere, Finland) and these were used to validate the occupancy reconstruction as described below. This was a retrospective study using anonymized data acquired from the hospital electronic health record. According to Medical Research Act of Finland, retrospective studies base on health records are exempted from review by Ethics committee and it was approved by Tampere University Hospital Research, Development and Innovation Centre.
2.2 Study design
This study consist of two phases, both of which aim to predict events one day ahead. In the first phase we perform predictions for total daily arrivals. In the second phase DPO is used as the target variable. Daily Peak Occupancy (DPO) is defined as the highest occupancy observed on any given day. Occupancy of the emergency department is not continuously recorded in the hospital database and thus it was not contained in the requested dataset. We reconstructed the hourly occupancy by utilizing an algorithm that used the timestamps of admission and discharge as inputs to output hourly occupancy. Once the hourly occupancy had been reconstructed we resampled the data into daily resolution by extracting the maximum occupancy of each day. The resulting vector was validated by comparing it to overcrowding alerts given by ED staff and recorded using a dedicated patient logistics system as described above. The busiest 25% of the days were labeled as crowded since this threshold has been previously documented to be associated with increased 10-day mortality (8).
Data was missing for a time period from December 1, 2014 to May 1, 2015. This time period was excluded and the time period from June 1, 2015 to June 19, 2019 was used in further analysis. We also observed significant discontinuity in the reconstructed hourly occupancy for a time period from December 1, 2014 to June 1, 2016 potentially due to changes in the internal coding of the different sections of the ED. Thus, the time period from June 1, 2016 to June 19, 2019 was used in phase 2.
2.3 Analysis
Statistical analyses were performed using Python 3.7.3. Jupyter Lab 1.0.4 powered by iPython 7.8.0 was used in the exploratory data analysis (9). Analytical tools and data structures were provided by Pandas 0.25.0 (10). Visualizations were created using Matplotlib 3.1.1. Predictive models were created using StatsModels 0.10.1 (11), Scikit-learn (12) and Prophet (13). SARIMA hyperparameters were optimized using pmdarima (14). The dependencies were managed using Anaconda 4.7.12. An exhaustive list of all dependencies is provided in the Appendix.
2.4 Predictive models
Both simple naïve and seasonal naïve models were included as benchmarks. In case of simple naïve, the value of the previous day is used as the prediction for tomorrow. Seasonal naïve on the other hand works by observing the corresponding value a season ago and using this value for predicting the future. For example, when forecasting the occupancy peak of the coming Tuesday, the peak of last Tuesday was used. Some earlier publications utilize the seasonal naïve as the baseline to compare different methods to (15). Others suggest linear regression as the benchmark method (16).
Seasonal Autoregressive Moving Average (SARIMA) is widely used in both econometrics and ED forecasting and has been established as a benchmark model in international forecasting competitions (17)(7). SARIMA consists of 7 independent components denoted by (p, d, q)x(P, D, Q)s. The parameters p, d, and q describe the autocorrelation, differencing, and size of the moving average window of the non-seasonal component of the models, whereas P, D, and Q describe the same values for the seasonal component. S parameter describes the length of the season and was assigned a value of 7 due to the well-known weekly seasonality of the target variables. We report results for both SARIMA and SARIMAX, the latter of which was trained using weekdays and months as additional independent variables.
General Linear Model (GLM) has been widely used in predicting ED visits and is regarded as the benchmark against which other models are compared (16). Weekdays and months were used as independent variables and a Poisson distribution was assumed for both target variables.
Prophet is a modular regression model created by Facebook. It has three main components: trend, seasonality, and holidays and also supports adding exogenous predictors into the linear component of the model. The potential of Prophet in ED forecasting has been mentioned but as far as we know no results have so far been published in the peer-reviewed literature although results in flu epidemic peak intensity prediction have been promising (18,19). Prophet is tested both with and without calendar variables.
2.5 Cross-validation
Both SARIMA and Prophet evaluate the complete historical vector to perform predictions into the future. We utilized a time series cross-validation to simulate moving forward in time while iteratively updating the models with new data. In both phases the data was divided into train and validation sets using December 31, 2017 as the split date. The resulting train sets were used to evaluate the initial hyperparameters of SARIMA and Prophet. We then iterated through the resulting validation set from January 1, 2018 to June 19, 2019 with steps of one day, performing predictions one day forward on each iteration and updating model parameters. GLM was trained using a simple holdout using the train and validation sets described above.
2.6 Accuracy measures
Mean Absolute Percentage Error (MAPE) is the most used accuracy metric in the field of ED forecasting (7) and is widely used in time series forecasting in general. MAPE has the advantage of being scale-independent, thus allowing model comparisons between different datasets. For these reasons, MAPE was selected as the preferred continuous error measure in phases 1 and 2. The formula for MAPE is as follows:
In phase 2 we evaluate the accuracy of the models as binary predictors with the goal of forecasting whether the following day will be crowded or not which calls for discrete error measures. We report Receiver operating characteristics (ROC) and calculate the Area under the curve (AUC) measure along with reporting overall accuracy, sensitivity and specificity of the models.