This prediction model study is reported in accordance with the TRIPOD checklist [22].
Aim, design, setting and population
The aim was to develop and test a clinically useful model for predicting hospital admissions of older persons based on routine healthcare data. This was a prospective cohort study that included all residents aged 75–109 years in the county of Östergötland (n = 40,728) in south-east Sweden. This age group constitutes 9.6% of the county population, close to the national proportion of 9.2%. In Östergötland, healthcare for the elderly is provided mainly by 43 primary care healthcare centres and four hospitals, one of which is the University Hospital of Linköping.
Data source and study variables
The 12-month data were obtained between November 2015 and October 2016 from the computerized information system of the County Council of Östergötland, where statistics for all healthcare in the county are stored. For example, for the whole population there are records of the number of visits to primary or hospital care, the number of days in hospital, and diagnostic codes for each visit. The dependent variable was unplanned inpatient hospital stays between November 2016 and October 2017. This period was chosen because the predicted cases were included in an intervention study starting in November 2017 [21]. The candidate predictors were number of physician visits, number of non-physician visits (to nurses, occupational therapists or physiotherapists), number of previous inpatient hospital stays, number of emergency room (ER) visits, age, gender, and International Classification of Diseases, 10th Revision (ICD-10) codes grouped by two digits. For each diagnosis, two variables were constructed: one based on outpatient visits and one based on hospital visits. To obtain good precision in the coefficient estimates and a model that remains reliable over time, variables with fewer than 40 observations were excluded. All diagnosis variables were dichotomized into yes or no. People who died during the prediction period were included in the analysis.
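The analyses in this study were performed in R, but this data-preparation step can be illustrated with a minimal Python sketch on synthetic data. All column names are hypothetical stand-ins, and "fewer than 40 observations" is interpreted here as fewer than 40 individuals with the diagnosis:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200

# Synthetic diagnosis-count columns; names are hypothetical stand-ins
# for ICD-10 groups split into outpatient and inpatient variables
diag = pd.DataFrame({
    "J4_outpatient": rng.poisson(0.6, n),       # common diagnosis group
    "I2_inpatient": rng.poisson(0.4, n),        # common diagnosis group
    "C5_outpatient": rng.binomial(1, 0.02, n),  # rare diagnosis group
})

# Dichotomize every diagnosis variable into yes (1) / no (0)
binary = (diag > 0).astype(int)

# Exclude diagnosis variables observed in fewer than 40 individuals
kept = binary.loc[:, binary.sum() >= 40]
```

The rare diagnosis column is dropped before modelling, while the common ones survive as binary yes/no indicators.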
Model development
The data were randomly divided into two halves: a training data set and a validation data set. The training set was used to build the prediction model and the validation set was used to validate it. The prediction model was developed using multivariable logistic regression with forward selection (see Statistical analysis below). The aim was to identify participants aged 75 or older who were likely to be hospitalized within the next 12 months.
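The split-and-fit procedure can be sketched as follows. The study itself used R; this Python/scikit-learn version runs on synthetic data with hypothetical variable names and is purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins for the routine-care predictors (names hypothetical)
df = pd.DataFrame({
    "age": rng.integers(75, 110, n),
    "er_visits": rng.poisson(1.0, n),
    "prev_stays": rng.poisson(0.5, n),
})
true_logit = -4.0 + 0.03 * df["age"] + 0.5 * df["er_visits"] + 0.8 * df["prev_stays"]
df["admitted"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

predictors = ["age", "er_visits", "prev_stays"]

# Random 50/50 split into a training half and a validation half
train, valid = train_test_split(df, test_size=0.5, random_state=1)

# Fit the logistic model on the training half ...
model = LogisticRegression(max_iter=1000)
model.fit(train[predictors], train["admitted"])

# ... and score the held-out validation half
risk = model.predict_proba(valid[predictors])[:, 1]
```

The validation-half risk scores are what the discrimination measures described below are computed on.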
Statistical analysis and external validation
The first step was to calculate the univariable association of each variable with 12-month unplanned hospital admission. Because the large number of observations could yield statistical significance even for rather weak associations, only variables with p-values below 0.001 were carried forward to the multivariable analysis.
Multivariable logistic regression was then used to identify significant predictors of unplanned hospital admission over a 12-month period. The model-building process consisted of three steps: selecting the variables, building the model, and validating the model. Candidate models were compared by the change in the Akaike information criterion (AIC), with a penalty factor of five used to avoid overfitting and to reduce the number of variables in the final model. Collinearity was assessed by calculating the variance inflation factor (VIF) for each variable in the final model, and variables with a VIF above five were excluded. After the final model was established, further tests were performed in an attempt to improve it: first, all two-way interactions; second, log-transformation of all numerical variables; and finally, non-linearity of the numerical variables using restricted cubic splines. If no improvement in AUC was achieved, the simplest model was chosen, as we wanted a robust model that was easy to implement. Risk scores were then calculated for all individuals.
Model performance measures: Overall discrimination was assessed using the c-statistic, a measure of goodness of fit for binary outcomes in a logistic regression model. The area under the receiver operating characteristic (ROC) curve (AUC) was used to quantify how well the model discriminated between the binary outcomes (hospital admission or not). The ROC curve plots sensitivity against 1 minus specificity across all possible threshold cut-off points. The AUC reflects the accuracy of a predictive model and can be compared across models: an AUC of 0.5 means the model has no discrimination (the proportions of true positive and false positive cases are equal), whereas an AUC of 1.0 means perfect discrimination [23]. Five sensitivity analyses were performed to assess how the prediction model changed in different settings: the first model included both unplanned and planned hospital admissions; the second excluded people who died within the 12-month follow-up period; the third and fourth used shorter follow-up periods of 3 and 6 months; and the fifth used the least absolute shrinkage and selection operator (lasso) as an alternative variable selection method.
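The c-statistic and its interpretation can be made concrete with a toy example (the study used R's pROC; this illustrative Python version uses scikit-learn, and the risk values are invented for the example):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy risk scores for six persons (values purely illustrative)
y_true = np.array([0, 0, 1, 1, 0, 1])      # 1 = unplanned admission
y_risk = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# c-statistic / AUC: the probability that a randomly chosen admitted
# person is ranked above a randomly chosen non-admitted person;
# here 8 of the 9 case/non-case pairs are ordered correctly
auc = roc_auc_score(y_true, y_risk)

# The ROC curve underlying the AUC: sensitivity (tpr) against
# 1 - specificity (fpr) over all possible cut-off points
fpr, tpr, thresholds = roc_curve(y_true, y_risk)
```

A model assigning every admitted person a higher risk score than every non-admitted person would reach an AUC of 1.0, while random scores hover around 0.5.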
External validation was also performed in two additional data sets: one covering the same time period as above but including ages 65–74 (n = 51,104), and one covering the 75+ age group in 2012, predicting unplanned hospital admission over the following 12 months (n = 38,121).
All statistical analyses were performed using R version 3.5.2 (R Core Team, Vienna, Austria). The MASS package was used to fit the logistic model, the pROC package to estimate the AUC, and the glmnet package to fit the lasso model.
Ethical aspects
The study has been subject to ethical evaluation and was approved by the regional ethical review board in Linköping (Dnr 2016/347-31).