Key findings
A common problem in the application of machine learning is availability of data at the point of decision making. The present study aimed at using routine data readily available at admission to predict aspects relevant to the organisation of psychiatric hospital care. A further aim was to compare the results of a machine learning approach to those obtained using a traditional method and those obtained using a naive baseline classifier.
The models’ performance, as measured by the area under the ROC, varied strongly between the predicted outcomes, with relatively high performance in the prediction of coercive treatment and 1:1 observations and relatively poor performance in the prediction of short LOS and non-response to treatment. The GBM performed slightly better than logistic regression. Both approaches were substantially better than a naive prediction based solely on basic diagnostic grouping.
The present results confirm previous studies suggesting inadequacy of the area under the ROC as a measure for predictive performance in unbalanced data, in our case data with many more negatives than positives (36). The area under the ROC gave a misleadingly positive impression of the models for 1:1 observation and crisis intervention, while the precision and recall plots revealed a lack of sufficient precision for a clinically meaningful application.
Furthermore, we found relatively large differences in the areas under the ROCs between different hospital sites (Figure 3). As described in the methods section, we trained each model in 8 hospital and used it to predict outcomes of patients from the left-out ninth hospital. Therefore, if the remaining ninth hospital had very different patients or provided care in a different way, the model would perform worse in this hospital than in the other hospitals that were more common to each other. This was probably the reason for the very low performance of the naive baseline classifier in predicting 1:1-observations in two study sites with areas under the curves below 0.5 (0.4 and 0.1, respectively), where the incidences of this event were very low (0.3% and 0.07%).
It is still unclear, which predictive performance is sufficient for beneficial application in routine clinical practice and this was out of the scope of the present study (37,38). Furthermore, different clinical applications might require their own trade-off decisions between reducing false alerts and increasing coverage of actual positives. We have chosen different configurations for comparison of model performance, borrowing from Tomašev et al in their prediction of acute kidney injuries (13). For instance, our GBM model for the prediction of coercive treatment, operationalised with a precision of at least 0.2 (see Table 2), gave a warning for 26% of all episodes at admission. Thereof, four false alerts were caused for each true alert and warnings were given in advance for 73% of all actual positive cases. The same model could be operationalised at a precision of 0.25, which gave a warning for 13% of all episodes, resulting in three false alerts for each positive alert and a warning for 48% of all actual positive cases.
Just because we can predict future events does not mean we should (39). Very few clinical prediction models have undergone formal impact analysis, i.e. studying the impact of using the predictions on patient outcomes (40). Patients’ benefit depends on how predictions are translated into effective decision making (41). Predictions must be reasonably included in the clinical processes to create an actual benefit from better informed decisions. This requires a range of steps at the individual hospital level, such as integration into current IT, human resources and financial investment systems.
Even a perfectly integrated model must be used responsibly in clinical practice, and the exact framework for such application is currently under a broad discussion (42–44). For instance, caregivers have to be trained in using the provided results, patients‘ access to care has to remain equitable, real-world performance must be constantly scrutinised and responsibilities in case of errors have to be clear. Furthermore, predictions must not become self-fulfilling. Instead, a warning at admission for coercive treatment could be used to intensify non-invasive care with the aim to avoid coercive approaches, for instance.
Our study in comparison to previous research
Less distinct diagnostic concepts (6–8), less standardization of care (9) and a broader spectrum of acceptable therapeutic regimes (10) make the prediction of outcomes in psychiatry more complex than in other medical disciplines (3–5). An infamous example for these difficulties was the failure of the Medicare DRG system for psychiatry due to the inability to predict length of stay and associated hospital costs (45,46). Recent studies have often used a broad range of feature variables in studies restricted to specific settings and patients. Leigthon et al. (47) predicted remission after 12 months in 79 patients with first episode of psychosis with a wide range of demographic, socioeconomic and psychometric feature variables and reached an area under the ROC of 0.65. Koutsouleris et al (48) also investigated remission in first episode of psychosis and reached a sensitivity of 71%, a specificity of 72% and a precision of 93% in 108 unseen patients with their top ten demographic, socioeconomic and psychometric predictor variables. Lin et al (49) tried to distinguish treatment responders from non-responders prior to antidepressant therapy in 455 patients with major depression. They used single nucleotide polymorphisms from genetic analyses and other clinical data and reached an area under the AUC of 0.82. Common traits of these studies were the restriction to specific patient groups and the relatively small sample sizes. Furthermore, they mainly used data that might not be available during routine patient admission.
Strengths and weaknesses of our study
A strength of this study was the large sample size over two distinct years and at nine study sites. This allowed us to include a broad range of the present spectrum of psychiatric inpatients and to develop models that should be applicable in most hospitals. Furthermore, we were able to test our models in patients that were treated in another calendar year and a different hospital and thereby reduce information leakage. A further strength of the present study was the restrictive inclusion of only feature variables that should be available at admission in most hospitals. Therefore, it should be possible to implement the present models in many hospitals without additional documentation effort.
A potential weakness of our study was the retrospective use of administrative routine data which entails potential validity concerns. The validity of routine hospital data for health services research is a frequently discussed topic (50,51), and studies found both low (52) and high validity of such data (53). However, the development of models for application in routine clinical practice necessitated the use of routinely generated data including the inherent caveats. A further limitation was the lack of time stamps for the diagnostic groupings. Patients were grouped in one of five basic diagnostic groups at admission and these groupings remained stable during an episode. However, we were not able to entirely rule out that these groupings might have been changed during the stay by staff in rare cases. A further limitation was the restriction to hospitals from one large provider of inpatient psychiatric services in the region of Hesse, Germany, which raises the question whether the predictive performance of our models would remain stable if applied in psychiatric hospitals with different circumstances.