Our corpus of works on LOS–P forecasting consists mostly of journal papers (89%), written by 125 authors who each contributed a single publication to the topic. Studies appeared in the proceedings of 3 conferences and in 21 different journals, only two of which published more than one study: Journal of Perinatology (n=2) and Journal of the American Academy of Child & Adolescent Psychiatry (n=2). The lack of an outlet concentrating studies on the topic is evident. Research took place in nine different countries, four of which account for more than one publication: USA (n=17), Germany (n=2), Brazil (n=2), and Canada (n=2). The studies belong to several knowledge areas, with a prevalence of Pediatrics (n=8) and Psychiatry (n=4), followed by medical specialties (e.g., cardiology and neurology) and medical departments (e.g., emergency and intensive care). Only one study is in the area of Computer Science, indicating that research on LOS–P prediction emphasizes healthcare applications rather than forecasting methods. The evolution of publications per year shows an increase in studies over the decades, beginning in the 1980s (n=1) and with the most recent publications in 2020 (n=2). In the 1990s (n=6), 2000s (n=9), and 2010s (n=10), the number of studies increased considerably, with a shift towards Machine Learning modeling techniques.
In the thematic analysis, we divided studies according to three dimensions: the technical approach used to generate the forecasts, the medical department where the study took place, and the population analyzed. Table 2 summarizes our findings and serves as a guide to what has already been reported in the literature, addressing RQ1 and RQ2.
Table 2 – Summary of studies regarding forecasting approach, investigated department, and population
| Department | Population | Regression | Machine Learning | Others | Nº of articles |
|---|---|---|---|---|---|
| Emergency | Babies and children with bronchiolitis | | [42] | | 2 |
| | Pediatric trauma patients | | [19] | | |
| Neonatal Intensive Care units | Premature newborns | [38], [41], [32] | | | 11 |
| | Newborns | [26], [39], [29], [23] | | [28] | |
| | Chronically underweight newborns | [36], [40], [18] | | | |
| Pediatric Intensive Care units | Pediatric patients | [27] | | | 1 |
| Pediatric unit or hospital | Babies undergoing the bidirectional Glenn procedure | [21] | | | 3 |
| | Children with hematological diseases complicated by febrile neutropenia | [31] | | | |
| | Pediatric patients with respiratory diseases | | [2] | | |
| Psychiatric unit or hospital | Children | [30], [25], [3] | | [20] | 6 |
| | Teenagers | [37], [3], [24] | | [20] | |
| | Young adults | [37] | | | |
| Not specified | Premature newborns | [35] | [35] | | 5 |
| | Newborns and babies undergoing cardiac surgery | [33] | | | |
| | Babies admitted for gastroenteritis | [43] | | | |
| | Pediatric patients | | [34] | | |
| | Pediatric victims of ATV accidents | [22] | | | |
| Nº of articles | | 21 | 5 | 2 | |
References: [21] Anderson et al. (2009); [34] Balan et al. (2019); [36] Bannwart et al. (1999); [28] Bender et al. (2012); [37] Browning (1986); [30] Gold et al. (1993); [38] Hintz et al. (2009); [20] Höger et al. (2002); [41] Jeremic and Tan (2008); [25] Kavanaugh et al. (2019); [26] Khoshnood et al. (1996); [43] Lee et al. (2005); [39] Lee et al. (2016); [3] Leon et al. (2006); [27] Levin et al. (2012); [2] Ma et al. (2020); [40] Marshall et al. (2012); [22] Nagarsheth et al. (2011); [33] Parkman and Woods (2005); [31] Pastura et al. (2004); [32] Paul et al. (2020); [29] Pearlman et al. (1992); [23] Pepler et al. (2012); [18] Rendina (1998); [24] Stewart et al. (2013); [19] Walczak and Scorpio (2000); [42] Walsh et al. (2004); [35] Zernikow et al. (1999).
The technical approaches were divided into Regression Analysis (used in 75% of the studies), Machine Learning techniques, and Others. The departments where the studies took place were divided into six categories, one of which is dedicated to articles that did not convey that information. The largest numbers of studies took place in Neonatal Intensive Care units (39.29%) and Psychiatric units or hospitals (21.43%), which are highly controlled areas with abundant LOS data. In those environments, regression analysis was the predominant forecasting technique. In contrast, LOS data from Emergency departments were modeled exclusively with Machine Learning techniques. Regarding the population analyzed, most studies used data from newborn patients (42.86%) or patients in specific situations (28.57%), e.g., victims of ATV accidents and children with hematologic malignancies complicated by febrile neutropenia. Fewer studies (32%) involved adolescent patients or young adults.
Table 3 characterizes the datasets used in the studies, addressing RQ3. The information presented includes the country of origin, sampling period, sample size, number of hospitals included in the sample, and the mean or median LOS–P.
Hospitalizations of pediatric patients were sampled in 13 countries; the USA (n=14) has the highest representation among the studies (50%). Data were collected between 1987 and 2017, over periods ranging from 9 to 120 months; data were mostly collected during the 1990s (n=10) and 2000s (n=10). Four studies do not specify the sampling period, and the majority performed the sampling over a period of one year or more (n=22). Sample sizes range from 41 to 23,551 observations. Most studies (n=15) took place at a single location, indicating a low concentration of multicenter studies. Reported LOS–P averages or medians range from 3.39 days to 18.02 months, with more than 40% of the articles not reporting this information (n=12). The longest LOS values occurred in Psychiatric units or hospitals (ranging from 2 to 18 months), indicating more extended stays and lower turnover in those types of services. In Neonatal Intensive Care units, LOS–P averages vary from 23 to 54.8 days, with the highest averages (54.8 and 52.8 days) concentrated in the population of very-low-weight neonates.
Table 3 – Characteristics of datasets used in the studies

| Reference | Country | Sampling period | Sample size | Hospitals | LOS–P |
|---|---|---|---|---|---|
| [21] | USA | July 2001 to December 2007 | 100 | 1 | Median: 20 days |
| [34] | USA | 2016 | 5,236 | 4,200 | Not informed |
| [36] | Not informed | January 1992 to December 1993 | 97 | 1 | Mean: 52.8 days |
| [28] | USA | August 1999 to October 1999, and April 2002 to September 2002 | 908 | 1 | Not informed |
| [37] | Not informed | Not informed | 41 | 1 | Mean: 18.02 months |
| [30] | USA | May 1988 to December 1989 | 96 | 1 | Mean: 71.6 days |
| [38] | USA | July 2002 to December 2005 | 2,254 | Not informed | Not informed |
| [20] | Germany | Not informed | 1,001 | 13 | Median: 104 days |
| [41] | Canada | Not informed | 186 | 1 | Not informed |
| [25] | Not informed | 2010 to 2015 | 96 | 1 | Mean: 18.56 days |
| [26] | USA | 1990 | 558 | 1 | Mean: 23 days |
| [43] | Australia | 1995 | 514 | 58 | Mean: 3.39 days |
| [39] | USA | 2008 to 2011 | 23,551 | 125 | Not informed |
| [3] | USA | 1998 to 2001 | 1,930 | 44 | Mean: 10.4 days |
| [27] | USA | Not informed | 2,062 | 1 | Mean: 3.5 days |
| [2] | China | January 2014 to April 2016 | 11,206 | 1 | Not informed |
| [40] | Argentina, Chile, Paraguay, Peru, and Uruguay | January 2001 to December 2008 | 7,599 | 20 | Not informed |
| [22] | USA | January 2000 to December 2009 | 420 | Not informed | Not informed |
| [33] | Not informed | September 1993 to December 1997 | 458 | 1 | Not informed |
| [31] | Brazil | February 2001 to May 2002 | 62 | 1 | Mean: 10 days |
| [32] | USA | November 2014 to March 2017 | 152 | 14 | Not informed |
| [29] | USA | October 1987 to July 1988 | 393 | 1 | Mean: 23.8 days |
| [23] | South Africa | January 2007 to December 2008 | 3,794 | 15 | Mean: 17.9 days |
| [18] | USA | January 1994 to December 1996 | 314 | 2 | Mean: 54.8 days |
| [24] | Canada | October 2005 to March 2010 | 2,445 | 69 | Mean: 16.31 days |
| [19] | USA | April 1994 to December 1997 | 7,665 | Not informed | Mean: 3.98 days |
| [42] | Ireland | 1999 | 119 | 1 | Not informed |
| [35] | Not informed | October 1989 to January 1996 | 2,144 | 1 | Not informed |
Methods used to build the forecasting models are divided into three groups, according to the stage of model development they address: (i) pre-processing, (ii) variable selection, and (iii) cross-validation. Pre-processing consists of preparing the data prior to modeling, avoiding noise due to outliers, missing data, multicollinearity, and the lack of variable standardization. Variable selection methods focus on optimizing the forecasting model, improving its precision and interpretability by using only the most informative variables. Cross-validation is used to evaluate the performance of the model on different datasets. Table 4 presents the studies' main methods and the results obtained from their application, addressing RQ1. The pre-processing methods reported in our corpus may be divided into (i) data cleaning methods, which avoid modeling noise, and (ii) methods that prepare and transform data to remove scale effects. The main data cleaning method is the collinearity test, which evaluates the level of correlation between independent variables; it was reported in eight studies [18], [19], [20], [21], [22], [23], [24], [25]. To avoid noise due to an excessive number of observations with missing values in the independent variables, five studies excluded incomplete observations from their datasets [26], [18], [19], [27], [2], and two adopted data imputation strategies [28], [2]. Two studies mention the removal of outliers from the datasets before modeling [23], [25].
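As an illustration of these data-cleaning steps, the sketch below combines listwise deletion of incomplete observations with a collinearity check based on variance inflation factors (VIF); the column names, data values, and VIF threshold are illustrative assumptions, not taken from any reviewed study.

```python
# Minimal sketch of two pre-processing steps: listwise deletion of
# incomplete rows and a VIF-based collinearity check.
# Column names and values are hypothetical, not from the reviewed studies.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({
    "gestational_age": [38, 30, np.nan, 35, 28, 40, 33, 36],
    "birth_weight":    [3.2, 1.4, 2.1, 2.5, 1.1, 3.5, 1.9, 2.8],
    "apgar_score":     [9, 6, 7, np.nan, 5, 10, 6, 8],
})

df = df.dropna()                 # exclude incomplete observations
X = sm.add_constant(df)          # intercept column for the VIF computation
vif = {col: variance_inflation_factor(X.to_numpy(), i)
       for i, col in enumerate(X.columns) if col != "const"}
# A VIF far above ~10 is a common heuristic flag for problematic collinearity.
print({k: round(v, 2) for k, v in vif.items()})
```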
The main pre-processing method for data transformation is the logarithmic transformation of the LOS–P values, which corrects the positively skewed distribution of the dependent variable; it was adopted in ten studies [29], [30], [26], [20], [31], [22], [23], [18], [24], [32]. The logarithmic transformation also reduces the effect of outliers, helping to ensure the normality of the residuals and to stabilize the variance. Machine Learning-based approaches did not transform the dependent variable.
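The sketch below illustrates this transformation: a linear model is fitted to log(LOS) and its forecasts are exponentiated back to days. The data are synthetic and the model choice is an assumption for illustration only.

```python
# Sketch of modeling log-transformed LOS and back-transforming forecasts.
# Synthetic, right-skewed LOS data; not from any reviewed study.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))                                  # hypothetical predictors
los = np.exp(1.0 + 0.8 * X[:, 0] + rng.normal(0.0, 0.4, 400))  # skewed LOS in days

model = LinearRegression().fit(X, np.log(los))  # fit on the log scale
pred_days = np.exp(model.predict(X))            # back-transform to days
print(f"median forecast: {np.median(pred_days):.1f} days")
```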
Table 4 – Forecasting modeling approaches
| Reference | Pre-processing | Variable selection | Cross-validation | Performance |
|---|---|---|---|---|
| [21] | Collinearity test | | | R² = 0.43 |
| [34] | Data rescaling | | Traditional holdout | R² = 0.9415 |
| [36] | Variable categorization | Forward stepwise; ANOVA | | R² = 0.63 – 0.82 |
| [28] | Missing data treatment | Significance test | Temporal holdout; k-fold | R² = 0.66 – 0.79 |
| [37] | | Significance test | | R² = 0.36 – 0.43 |
| [30] | Logarithmic transformation | ANOVA | | Variance = 30.7% – 57% |
| [38] | | Backward stepwise; significance test | Traditional holdout | R² = 0.38 |
| [20] | Logarithmic transformation; collinearity test | ANOVA | Traditional holdout | R² = 0.097 – 0.237 |
| [41] | | | Traditional holdout | AMSE* ≅ 0.08 – 0.38 days |
| [25] | Collinearity test; outlier removal | Correlation analysis | | R² = 0.242 – 0.278 |
| [26] | Logarithmic transformation; coding of categorical variables; missing data treatment | | | R² = 0.66 |
| [43] | | | | – |
| [39] | | Backward stepwise | Temporal holdout | RMSE = 6.2 – 18.8; MAE = 4.2 – 14.6 |
| [3] | Coding of categorical variables; feature engineering | Significance test | Traditional holdout | R² = 0.22 – 0.30 |
| [27] | Missing data treatment | Forward stepwise | Traditional holdout | Predictions within 12 h = 27% – 46% |
| [2] | Coding of categorical variables; missing data treatment; feature engineering | | Traditional holdout; temporal holdout | R² = 0.694 – 0.831; RMSE = 0.296 – 0.588 |
| [40] | | Stepwise multiple Cox regression | | Correlation between forecasting models = 0.92 |
| [22] | Logarithmic transformation; collinearity test | | | R² = 0.329 |
| [33] | Coding of categorical variables; variable categorization | Significance test | | R² = 0.04 – 0.225 |
| [31] | Logarithmic transformation; variable categorization | ANOVA | | R² = 0.47 |
| [32] | Logarithmic transformation | Backward stepwise; correlation analysis | | R² = 0.4464 |
| [29] | Logarithmic transformation; coding of categorical variables | Significance test | | R² = 0.78 |
| [23] | Logarithmic transformation; collinearity test; outlier removal | ANOVA; PCA | Temporal holdout | R² = 0.7027 |
| [18] | Logarithmic transformation; coding of categorical variables; collinearity test; missing data treatment | | | R² = 0.51 |
| [24] | Logarithmic transformation; coding of categorical variables; collinearity test | | | R² = 0.1287 |
| [19] | Coding of categorical variables; collinearity test; missing data treatment; feature engineering | | Temporal holdout | MAE = 2.5 – 4.26 days; PP* = 12.9% – 21.2%; P1* = 34% – 51.4% |
| [42] | | | K-fold | Mean correct performance = 60% – 80% |
| [35] | Data rescaling | Forward stepwise | Traditional holdout | CPR* = 0.85 – 0.92 |

Frequency of each method across the corpus: logarithmic transformation (10), coding of categorical variables (8), data rescaling (2), collinearity test (8), variable categorization (3), missing data treatment (6), feature engineering (3), outlier removal (2); backward stepwise selection (3), forward stepwise selection (3), stepwise multiple Cox regression (1), correlation analysis (2), ANOVA (5), significance test (6), PCA (1); traditional holdout (8), temporal holdout (5), k-fold (2).

*AMSE = average MSE; *PP = perfect predictions; *P1 = predictions up to one day; *CPR = correlation between the predicted and actual LOS–P.
A large number of studies (n=8) code categorical variables, either through dummy variables [18], [3], [2], [24] or through specific codings that vary according to need [29], [26], [19], [33]. Two studies rescale the data, through normalization [34] and linear rescaling [35]. Other data preparation methods include the categorization of variables [36], [31], [33] and feature engineering [19], [3], [2], which creates new variables by combining those available in the dataset.
Regarding the variable selection methods, most studies (n=17) propose reducing the number of variables in the model to keep only the most informative ones. Variable selection aims to balance model simplicity and performance; it is noteworthy, however, that most Machine Learning-based studies do not use any variable selection method (except for Zernikow et al. [35], who propose the Forward Stepwise method). The most popular variable selection methods are the Analysis of Variance [30], [36], [20], [31], [23] and the variable significance test [37], [29], [33], [3], [38], [28]. Both are straightforward statistical methods aimed at verifying the relationship between the dependent and independent variables.
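A univariate screen in this spirit can be sketched with scikit-learn's f_regression, which computes a per-variable F statistic and p-value against the dependent variable; the synthetic data and the 0.05 cutoff are illustrative assumptions, not the procedure of any particular study.

```python
# Sketch of a univariate ANOVA/significance-style variable screen.
# Synthetic data; the 0.05 cutoff is an illustrative assumption.
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))              # hypothetical predictors
y = 2.0 * X[:, 0] + rng.normal(size=300)   # LOS driven by the first predictor

f_stats, p_values = f_regression(X, y)     # per-feature F-test against y
selected = np.where(p_values < 0.05)[0]    # keep only significant predictors
print("selected feature indices:", selected)
```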
Stepwise variable selection methods (backward or forward) perform the selection in stages, adding or removing variables and reassessing the model's performance at each iteration. Three studies used the backward stepwise method [38], [39], [32], which starts with all variables in the model and removes the least significant one at each iteration. Three studies used the forward stepwise method [36], [35], [27], which starts with no variables and adds the most significant one at each iteration. Less prevalent methods include correlation analysis [25], [32], stepwise multiple Cox regression [40], and Principal Component Analysis (PCA) [23].
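The sketch below shows one way to implement forward stepwise selection, greedily adding the variable that most improves validation R²; the linear model, scoring metric, and stopping rule are illustrative assumptions rather than the procedure of any reviewed study.

```python
# Sketch of forward stepwise selection scored by validation R².
# Model choice, metric, and stopping rule are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def forward_stepwise(X, y, min_gain=1e-3):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
    selected, best_score = [], -np.inf
    while len(selected) < X.shape[1]:
        scores = {}
        for j in range(X.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            model = LinearRegression().fit(X_tr[:, cols], y_tr)
            scores[j] = r2_score(y_val, model.predict(X_val[:, cols]))
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best_score < min_gain:   # stop when gains stall
            break
        selected.append(j_best)
        best_score = scores[j_best]
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = 3 * X[:, 2] - 2 * X[:, 5] + rng.normal(size=500)
print(forward_stepwise(X, y))   # typically recovers features 2 and 5
```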
Cross-validation allows measuring model stability. Half of the studies in our corpus used the procedure to validate model results and assess their generalizability. The cross-validation approaches are divided into three categories: traditional holdout [35], [20], [3], [41], [38], [27], [34], [2], temporal holdout [19], [23], [28], [39], [2], and k-fold [42], [28].
The holdout method divides the dataset into two mutually exclusive partitions (training and testing), whose sizes vary according to the analyst's preferences. Traditional holdout splits the dataset randomly, assuming that the frequency distributions do not change over time; temporal holdout splits the dataset taking the temporal evolution of the data into account. Except for two studies that do not mention the percentage of the dataset used in each partition [40], [2], all other studies used traditional holdout, varying the proportions of the training and testing partitions as follows: 80% – 20% [27], [34], 75% – 25% [35], 70% – 30% [38], and 50% – 50% [20], [3]. The k-fold method randomly divides the dataset into k unique parts of the same size, trains the forecast model with k − 1 parts, and uses the remaining part for validation. The process is repeated k times, such that every part is used once in the validation step. Walsh et al. [42] use 5-fold cross-validation, with the dataset divided into training, testing, and validation in five different ways, while Bender et al. [28] do not give details on the k-fold scheme used.
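A minimal sketch of the three schemes follows, assuming scikit-learn and synthetic data; the 70%/30% split ratio and the five folds are illustrative choices, not those of any particular study.

```python
# Sketch of traditional holdout, temporal holdout, and k-fold splits.
# Data and split ratios are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))           # hypothetical predictors
y = rng.lognormal(mean=2.0, size=500)   # hypothetical LOS values (days)

# Traditional holdout: random 70% - 30% split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Temporal holdout: assuming rows are ordered by admission date,
# train on the earliest 70% and test on the most recent 30%.
cut = int(0.7 * len(X))
X_tr_t, X_te_t, y_tr_t, y_te_t = X[:cut], X[cut:], y[:cut], y[cut:]

# K-fold: each of the k parts serves once as the validation set.
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    X_tr_k, X_val_k = X[train_idx], X[val_idx]
```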
The performance of the LOS–P forecasting models was measured using several metrics, with only one study not detailing the applied model's performance [43]. The most used performance metric (n=19) is the coefficient of determination (R²), which measures the proportion of the variability in the dependent variable captured by the model. In studies measuring model performance with R², indicated in the last column of Table 4, values ranged from 0.04 to 0.9415, with eight studies reporting values greater than 0.5. Other reported performance metrics were the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE). RMSE measures the standard deviation of the model errors; studies that used this metric reported values between 0.296 and 18.8 days [39], [2]. MAE measures the average absolute difference between forecasts and actual observations; studies that used this metric reported values between 2.5 and 14.6 days [19], [39].
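The three metrics can be computed as in the sketch below; y_true and y_pred are synthetic placeholders, not values from the corpus.

```python
# Illustration of the three performance metrics reported in the corpus.
# y_true and y_pred are synthetic placeholders.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 7.0, 12.0, 30.0, 54.0])   # actual LOS in days
y_pred = np.array([4.1, 6.2, 10.5, 33.0, 50.0])   # model forecasts

r2 = r2_score(y_true, y_pred)                       # variance explained
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # std. deviation of errors
mae = mean_absolute_error(y_true, y_pred)           # mean absolute deviation
print(f"R2={r2:.3f}  RMSE={rmse:.2f} days  MAE={mae:.2f} days")
```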
Performance metrics reported in only one study each were the Average Mean Squared Error (AMSE) [41], the proportion of forecasts within 12 hours [27] and 24 hours [19] of the actual values, mean correct performance [42], the correlation between forecasts and actual values [35], the proportion of perfect forecasts [19], the correlation between predictions based on variables available at birth and 30 days after birth [40], and the variance representing the amount of information captured by each independent variable in the model [30].
All but eight studies [37], [30], [18], [20], [21], [22], [24], [34], which focused solely on model performance, reported benefits of LOS–P forecasts for the hospital ecosystem. We identified five dimensions positively affected by the use of forecasting models, partially addressing RQ4: (i) patient care, (ii) costs, (iii) hospital management, (iv) quality measurement, and (v) the updating of medical practices.
In the patient care dimension, the two main reported benefits are providing families with information about the expected discharge date [29], [36], [35], [33], [43], [40], [28], [23] and preventing complications associated with prolonged hospitalizations [43], [40], [25]. Hintz et al. [38] suggested that LOS–P prediction allows a better understanding of the risk factors associated with prolonged stays. Identifying such patients may direct hospitals towards more aggressive treatments and the provision of specialized care to prevent complications [43], [25].
Benefits associated with the cost dimension are the estimation of hospitalization expenses [43] and cost reduction for the hospital [36], [28], [26], [40], [32]. LOS–P forecasts contribute to the hospital's strategic planning and guide medical care, reducing the length of the patient's stay and, consequently, hospitalization costs [36], [40].
Benefits in the hospital management dimension affect the hospital as a whole and are directly related to the other dimensions. Studies report management areas positively affected by LOS–P forecasts, such as resource allocation and planning [28], [39], [2], [23], [19], [35], patient flow management [27], [19], hospital bed management [43], [27], [2], optimization of decision making [41], [27], [2], [42], [35], and staff shift scheduling [28], [35]. The availability of LOS forecasts at a patient's admission allows an efficient allocation of resources [39]; identifying LOS predictors may also contribute to the effective management of scarce medical resources [2].
As for the benefits associated with measuring the quality of patient care, studies advocate monitoring hospital performance [43], [31], [23], [35] and standardizing care across hospitals [39], [3], [40], [35]. Hospital performance may be assessed by measuring the effect of hospital-related variables in the LOS–P model [43], by using predicted LOS values as a reference [23], or by using the difference between expected and actual LOS values as a service quality indicator [35]. Pepler et al. [23] suggest that performance measurement based on LOS–P forecasts should be part of a quality program monitored by hospital managers to improve patient care quality. Leon et al. [3] argue that efforts to maintain quality should be directed towards understanding variations in practice standards across hospitals, while Lee et al. [39] suggest standardizing the treatment of premature babies across institutions. By comparing LOS–P values across centers, benchmarking analyses may be performed, contributing to hospitals' strategic planning [40].
The last dimension of benefits is related to the detection of variations in historical patterns due to changes in medical practices resulting from updating LOS–P models [42]. Walczak and Scorpio [19] report that the use of neural network models makes solutions non-static; as medical practices evolve, the prediction model is quickly adapted through continuous learning based on new datasets.
To start addressing RQ5, we list the limitations and barriers reported in the LOS–P forecasting literature: (i) lack of model generalizability due to differences across hospitals [29], [26], [36], [20], [31], [33], [3], [38], [28], [27], [40]; (ii) lack of data on potentially useful LOS–P predictors [36], [31], [42], [33], [3], [40], [24], [39], [25]; (iii) small sample sizes used to fit the forecasting models [36], [42], [31], [38], [22], [32]; and (iv) samples that do not adequately represent the entire population [42], [3], [21], [25].
Other limitations cited by the investigated authors include missing data in the dependent variable [3], [33], imperfect data collection resulting in noisy samples [37], [28], [27], inconsistencies in parameter estimates [2], the lack of consensus on the minimum precision level that would justify using the model to support decision making [27], multicollinearity between independent variables [20], and poor model performance when predicting large LOS–P values [35].