Included studies
By applying the search strategy, 855 papers were identified and screened (Fig. 1). Of these, 63 were excluded by title and 254 by abstract, leaving 538 for full-text assessment. A further 237 papers were excluded after full-text review, leaving 301 papers to be included in the final review (12–312).
The 301 studies were published across a 34-year span, from 1985 to 2019. Cardiology was the most frequently reported clinical area, with 62 (20.6%) studies, for example using these methods to analyse recurrent heart-failure-related admissions. Oncology was the second most common area, with 45 (15.0%) studies modelling tumour recurrences, such as recurrences of breast cancer (80, 84, 85, 182, 183, 233), bladder cancer (162, 255, 299, 301), rectal cancer (44, 172, 225), and oesophageal cancer (141), amongst other cancer types (23, 151). The full list of clinical areas is given in Table 3 in Appendix 2.
The majority of studies, 173 (57.5%), used data from a cohort design. The remaining studies used data from RCTs (55 (18.3%)), case-control studies (12 (4.0%)) or cross-sectional studies (7 (2.3%)). In 45 (15.0%) studies, model development was the primary focus of the paper, rather than reporting the analysis results of a clinical dataset.
A summary of the included studies, organised according to the aims of the review, is given below.
Statistical approaches to modelling recurrent events
The most frequently reported method for analysing recurrent events was the Andersen-Gill (AG) model (313), used in 152 (50.5%) of the 301 included papers (Table 1). This model extends the Cox model with robust standard errors to account for dependence between events within a subject. Frailty models (187) were used in 116 (38.5%) studies. A variety of frailty models were applied depending on the assumed distribution, and these are summarised in Table 1; the most frequent was the gamma frailty model, used in 63 studies (39.9% of the frailty models applied).
Table 1
Summary of methods identified from the data extraction
Method | Frequency N (%)
Recurrent Event Methods: |
Andersen-Gill (AG) (313) | 152 (50.50%)
Frailty Model (187) 1 | 116 (38.54%)
  Gamma | 63 (39.87%)
  Unspecified | 35 (22.15%)
  Gaussian | 18 (11.39%)
  Log-Normal | 15 (9.49%)
  Weibull | 10 (6.33%)
  Exponential | 8 (5.06%)
  Log-Logistic | 3 (1.90%)
  Poisson | 3 (1.90%)
  Compound Poisson | 1 (0.63%)
  Gompertz | 1 (0.63%)
  Logistic | 1 (0.63%)
Prentice, Williams and Peterson Models (314) 2 | 41 (13.62%)
  Prentice, Williams and Peterson-Total Time (PWP-TT) | 27 (8.97%)
  Prentice, Williams and Peterson-Gap Time (PWP-GT) | 22 (7.31%)
Wei, Lei and Weissfeld (WLW) (315) | 33 (10.96%)
Bayesian Methods | 11 (3.65%)
Multi-State Model (MSM) | 9 (2.99%)
Lin, Wei, Ying and Yang (LWYY) (106) | 2 (0.66%)
Lee, Wei and Amato (LWA) (316) | 1 (0.33%)
Lawless and Nadeau marginal model (LN) (317) | 1 (0.33%)
Liang, Self and Chang (LSC) (318) | 1 (0.33%)
Multilevel Survival Model (285) | 1 (0.33%)
Papers which used multiple recurrent event methods | 48 (15.95%)
1 Some papers applied more than one type of frailty model, so the frailty subtypes do not sum to the 116 papers which used a frailty model.
2 Some papers used both the PWP-TT and PWP-GT variation.
Forty-eight (16.0%) papers used more than one method to analyse recurrent events.
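The AG and PWP models above share a counting-process data layout, in which each subject contributes one row per at-risk interval. A minimal sketch of that layout, using a hypothetical subject and pandas (not from the review itself), is:

```python
import pandas as pd

def to_counting_process(event_times, follow_up):
    """Convert one subject's recurrent event times into (start, stop, event)
    rows: the counting-process layout used by the Andersen-Gill model.
    An 'enum' column (event number) supports PWP-style stratification."""
    rows, start = [], 0.0
    for k, t in enumerate(sorted(event_times), start=1):
        rows.append({"start": start, "stop": t, "event": 1, "enum": k})
        start = t
    if start < follow_up:  # censored interval after the last event
        rows.append({"start": start, "stop": follow_up, "event": 0,
                     "enum": len(event_times) + 1})
    return pd.DataFrame(rows)

# Hypothetical subject with events at t=2 and t=5, followed up to t=8
df = to_counting_process([2.0, 5.0], follow_up=8.0)
```

For a PWP gap-time analysis the clock resets at each event, so each row's time is `stop - start`, whereas the WLW model instead treats each event number as a separate marginal time measured from study entry.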
Quality of analysis assessment
Selected aspects of the PROBAST ‘analysis’ domain, as described in the methods section, are now considered.
Were there a reasonable number of participants with the outcome? (PROBAST 4.1)
The number of events per 100 person-years was calculated by dividing the overall number of recurrent events in the study by the total person-years of follow-up and multiplying by 100. Person-years of follow-up were reported directly in 31 (10.3%) studies and approximated using the median (99 (33.0%) studies) or mean (42 (14.0%) studies) length of follow-up.
The total number of recurrent events was reported in 227 (75.4%) studies and the number of patients who experienced recurrent events reported in 191 (63.5%) studies.
There were 114 (47.8%) studies which provided the event rate in the paper and 134 (44.5%) studies where the event rate could be calculated manually using either person years or the mean/median length of follow-up. The median (Interquartile-range (IQR)) event rate was 26.1 (5.9–59.3) per 100-person years based on these 248 studies.
PROBAST states that the number of Events Per Variable (EPV) included in a model should be at least 20 for a study to have a low chance of overfitting and thus be graded as low risk of bias (11). EPV could be calculated for 216 (71.8%) included studies; in the remainder, either the number of predictor levels included in the model or the number of events within the dataset was not specified. Of these 216 studies, 27 (12.5%) had an inadequate EPV of less than 20. The median (IQR) EPV was 128.6 (33.5–419.5).
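The EPV check amounts to simple arithmetic; a sketch with hypothetical numbers (not drawn from any included study):

```python
def events_per_variable(n_events, n_model_parameters):
    """EPV: number of events divided by the number of predictor parameters
    in the model (each level of a categorical predictor counts separately)."""
    return n_events / n_model_parameters

# Hypothetical model: 300 events and 10 predictor parameters gives EPV of 30
epv = events_per_variable(300, 10)
adequate = epv >= 20  # PROBAST low risk-of-bias threshold used in this review
```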
Were continuous and categorical predictors handled appropriately? (PROBAST 4.2)
Studies which categorise continuous predictors are usually rated as high risk of bias in the PROBAST assessment unless a clear clinical rationale for the categorisation is provided. There were 62 (20.6%) studies which categorised continuous predictors without rationale.
Were participants with missing data handled appropriately? (PROBAST 4.4)
The majority of studies, 215 (71.4%), did not adequately report a specific approach for handling missing data in either the outcome or covariates. Of the 86 (28.6%) which did, 45 (52.3%) used complete case analysis and 19 (22.4%) used multiple imputation; 11 (57.9%) of these 19 reported the number of imputations used. Additionally, two (0.7%) studies minimised the loss of observations by creating an extra category for missing data in each affected variable, one (0.3%) study excluded variables with more than 10% missing data, and one (0.3%) study only included variables with less than 20% missing data.
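The most common of these devices can be sketched in pandas with hypothetical data (the variable names are invented for illustration; multiple imputation itself would need a dedicated library and is not shown):

```python
import pandas as pd

# Hypothetical clinical covariates, with one missing categorical value
df = pd.DataFrame({
    "age": [64, 71, 69, 58],
    "nyha_class": ["II", None, "III", "II"],
})

# Complete case analysis: drop any row containing a missing value
complete_cases = df.dropna()

# Extra-category approach: recode missingness as its own level,
# so no observations are lost
df["nyha_class_cat"] = df["nyha_class"].fillna("Missing")

# Threshold rules seen in single studies: drop any variable whose
# missingness exceeds some proportion (10% or 20% in the review)
missing_frac = df[["age", "nyha_class"]].isna().mean()
kept = [c for c in missing_frac.index if missing_frac[c] < 0.20]
```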
Was selection of predictors based on univariable analysis avoided? (PROBAST 4.5)
Univariable screening was the sole method used to choose predictors for inclusion in the final model in 44 (14.6%) included studies, a characteristic associated with high risk of bias. A further 23 (7.6%) studies used stepwise selection (backwards or forwards elimination) to choose the final predictors in the model.
Were relevant model performance measures evaluated appropriately? (PROBAST 4.7)
Calibration statistics were reported in 37 (12.3%) included studies, and measures of discrimination in 30 (10.0%). The PROBAST checklist requires internal validation and the reporting of both calibration and discrimination statistics for a study to be rated as low risk of bias (11).
Multiple measures of internal validation, including several measures of calibration and discrimination, were reported in 75 (24.9%) papers. External validation was used far less often, in only three (1.0%) studies (51, 162, 293), although notably models may have been externally validated in separate publications that were not picked up by our review.
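For survival-type models, discrimination is most often summarised with a concordance (c) index. A minimal version of Harrell's c-index for right-censored data, shown on toy inputs rather than any study's data, is:

```python
import numpy as np

def concordance_index(time, event, risk):
    """Harrell's c-index: among comparable pairs (the subject with the
    earlier time had an observed event), count the pairs where the higher
    predicted risk failed first. Tied risks count as half-concordant."""
    time, event, risk = map(np.asarray, (time, event, risk))
    concordant = comparable = 0.0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if time[i] < time[j] and event[i] == 1:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy data: higher predicted risks fail earlier, so discrimination is perfect
c = concordance_index([1, 2, 3, 4], [1, 1, 1, 0], [4.0, 3.0, 2.0, 1.0])
```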
Were model overfitting, underfitting, and optimism in model performance accounted for? (PROBAST 4.8)
Model overfitting and optimism were accounted for following internal validation in 74 (24.6%) studies, which were consequently graded as low risk of bias. Bootstrap resampling was used in 20 (6.6%) studies, and cross-validation methods in 8 (2.7%).
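Bootstrap optimism correction broadly follows the sketch below, in which a toy threshold "model" and accuracy stand in for whatever model and performance measure a study actually used; the data and settings are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit(x, y):
    """Toy 'model': classify as positive when x exceeds the training mean."""
    return x.mean()

def accuracy(threshold, x, y):
    return ((x > threshold).astype(int) == y).mean()

# Hypothetical data: outcome driven by x plus noise
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(int)

apparent = accuracy(fit(x, y), x, y)

# Harrell-style optimism: refit on each bootstrap sample, then compare its
# apparent performance with its performance on the original data
optimisms = []
for _ in range(200):
    idx = rng.integers(0, len(x), len(x))
    thr = fit(x[idx], y[idx])
    optimisms.append(accuracy(thr, x[idx], y[idx]) - accuracy(thr, x, y))

corrected = apparent - float(np.mean(optimisms))
```

The corrected estimate discounts the apparent performance by the average amount the model "re-learned" its own sample, which is the quantity PROBAST 4.8 asks authors to account for.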
Do predictors and their assigned weights in the final model correspond to the results from the reported multivariable model? (PROBAST 4.9)
Of the 297 (98.7%) studies which reported the number of predictors and the levels of each, 202 (67.1%) reported the full results for all included predictors, as required for a study to be graded as low risk of bias.
Additional information
Few studies calculated additional statistics specific to recurrent event models (currently outside the scope of PROBAST): seven (2.3%) reported the Root Mean Square Error (RMSE), one (0.3%) the Mean Absolute Percentage Error (MAPE), two (0.7%) the Mean Square Error (MSE), and none the Root Mean Square Percentage Error (RMSPE). The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) were reported in 23 (7.6%) and 9 (3.0%) studies respectively. Additionally, the Deviance Information Criterion (DIC) was used as a measure of model performance in 7 (2.3%) studies.
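For reference, AIC and BIC are both simple functions of the maximised log-likelihood, with k parameters and n observations; the fitted values in the example are invented:

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 * log-likelihood."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k * ln(n) - 2 * log-likelihood."""
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical fit: log-likelihood of -120.0, 5 parameters, 300 observations
a = aic(-120.0, 5)       # 250.0
b = bic(-120.0, 5, 300)  # ~268.5; BIC penalises parameters more when n > 7
```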