An adaptive sample size determination procedure is specific to the development of a clinical prediction model in the modeling and data context at hand. Our adaptive stopping rules led to much higher sample sizes than the 10 EPP rule; in some cases even more than 20 EPP were needed. These results are consistent with the finding that EPP requirements increase with the event fraction.8 The required sample size was also slightly larger than that obtained with the fixed calculation method of Riley and colleagues.6 Both the choice of modeling strategy and the specific stopping rule affected the required sample size, although the impact depended on the modeling context. We observed considerable variability in model performance, particularly at low sample sizes.
Perhaps surprisingly, the inclusion of variable selection reduced the required sample size for the ovarian cancer data. This may be caused by strong preselection of predictors (Figure S6), and by the relationship between the maximum diameter of the lesion and that of the largest solid component. These diameters are clearly correlated, with the latter bounded by the former. The variable selection typically excluded the maximum lesion diameter.
The adaptive sample size procedure monitors model performance during data collection. Its main strength is that it can incorporate more complex modeling scenarios than existing methods for sample size estimation. It can, for example, account for imputation of missing data, modeling of nonlinear relations, variable selection, and penalization algorithms. Thus, one can further tailor the final estimate of the required sample size to the specific modeling context. Moreover, our method can nicely complement Riley’s method for fixed sample size calculation. We recommend providing a reasonable estimate of the sample size upfront (N0 in the adaptive procedure above), so that the feasibility of collecting this amount of data is ensured. Riley’s method is an important tool for doing so. The adaptive approach can then be used to adjust the initial estimate if needed. Whereas this upfront calculation focuses on the minimal sample size at which the desired performance can be expected, adaptive sample size monitoring can help to find the sample size at which there is empirical support for the desired performance.
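To make the monitoring logic concrete, the following Python sketch illustrates one possible adaptive loop under simplifying assumptions: plain logistic regression as the modeling strategy, simulated recruitment instead of a real data stream, a crude bootstrap estimate of the calibration slope as the performance measure, and illustrative values for Nstart, Nadd, and the target slope. It is a minimal outline, not our implementation.

# Minimal sketch of adaptive sample size monitoring. The data generator,
# batch sizes, and target below are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def recruit(n, n_pred=5):
    """Hypothetical data source standing in for ongoing recruitment."""
    X = rng.normal(size=(n, n_pred))
    lp = X @ np.array([1.0, 0.7, 0.5, 0.0, 0.0]) - 1.5
    return X, rng.binomial(1, 1.0 / (1.0 + np.exp(-lp)))

def calib_slope(model, X, y):
    """Slope of a logistic recalibration of y on the model's linear predictor."""
    lp = model.decision_function(X).reshape(-1, 1)
    return LogisticRegression(C=1e6, max_iter=1000).fit(lp, y).coef_[0, 0]

def expected_slope(X, y, n_boot=50):
    """Crude bootstrap estimate of the calibration slope expected in new data."""
    slopes = []
    for _ in range(n_boot):
        idx = rng.integers(len(y), size=len(y))
        m = LogisticRegression(C=1e6, max_iter=1000).fit(X[idx], y[idx])
        slopes.append(calib_slope(m, X, y))  # bootstrap model, original data
    return float(np.mean(slopes))

# Adaptive loop: start with Nstart cases, add Nadd per step, and stop once the
# estimated calibration slope reaches the target on two consecutive assessments.
N_start, N_add, target = 200, 100, 0.90
X, y = recruit(N_start)
hits = 0
while hits < 2:
    slope = expected_slope(X, y)
    hits = hits + 1 if slope >= target else 0
    print(f"n = {len(y):4d}, estimated calibration slope = {slope:.3f}")
    if hits < 2:
        X_new, y_new = recruit(N_add)
        X, y = np.vstack([X, X_new]), np.concatenate([y, y_new])
print(f"Stopping rule met at n = {len(y)}")

In practice, the fitting and assessment steps would be replaced by the prespecified modeling strategy (including, for example, imputation, nonlinear terms, or penalization) and by the chosen optimism-corrected performance estimates.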
Our adaptive sample size procedure for prediction models bears resemblance to the group-sequential design for randomized studies. Differences are that (1) randomized trials test a tool rather than develop one and (2) significance testing is not at stake in prediction model development. Early stopping for superiority in group-sequential trials may lead to inflated estimates of the effect of the intervention.30 Analogously, stopping early in our procedure for prediction modeling may lead to less robust models with lower performance on new data from the same population. In the context of prediction modeling, adaptive monitoring is more flexible because significance testing is not an issue. Nevertheless, the modeling strategy and preferably also the stopping rule should be specified in advance, and the learning curves should be reported.
Values for Nstart and Nadd have to be set. These values can be chosen depending on the situation, using arguments such as N0, the anticipated or even the actual recruitment rate, and the effort needed to prepare data for analysis. Stopping rules other than the ones we used can be derived. Although the calibration slope and ΔAUC are useful performance measures, others such as the Brier score or R-squared may be used to quantify overall performance.6,18 Whether our additional requirement to achieve the target calibration slope and ΔAUC on two consecutive assessments is needed may, for example, depend on the chosen value of Nadd: the larger Nadd, the lower the need for such a requirement may be. The key issue is that these choices are transparent and justified where possible.
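As an illustration of how such rules could be formalized, the snippet below encodes a stopping rule requiring both targets to hold on a number of consecutive assessments. The thresholds are hypothetical, and ΔAUC is assumed here to denote the optimism in the AUC, so that smaller values are better.

# Illustrative stopping rule over the assessment history; thresholds are
# hypothetical and should be prespecified for the study at hand.
def stopping_rule(history, slope_target=0.90, delta_auc_target=0.02, consecutive=2):
    """history: list of (calibration_slope, delta_auc) pairs, one per assessment.
    Returns True when both targets are met on the last `consecutive` assessments."""
    if len(history) < consecutive:
        return False
    return all(slope >= slope_target and d_auc <= delta_auc_target
               for slope, d_auc in history[-consecutive:])

# Targets are met on the last two assessments only:
print(stopping_rule([(0.80, 0.05), (0.92, 0.015), (0.91, 0.01)]))  # True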
Apart from application in prospective studies, this procedure can also be applied to retrospective studies on existing patient cohorts (similar to our two illustrative datasets, in fact). Preparing data from existing cohorts for prediction modeling is not always straightforward, for example when biomarkers have to be quantified from available blood samples, or when extensive data cleaning is required. The adaptive sample size procedure can then be applied to determine how many cases have to be prepared. For retrospective applications, cases should be added in reverse chronological order. This ensures that the most recent available data are not the ones left unused in the end.
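As a small illustration, reverse chronological ordering of an existing cohort could look as follows; the data frame and the inclusion_date column are hypothetical.

# Sketch of preparing an existing cohort in reverse chronological order, so that
# early stopping leaves only the oldest (not the most recent) cases unused.
import pandas as pd

cohort = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "inclusion_date": pd.to_datetime(
        ["2015-03-01", "2019-06-12", "2017-11-30", "2021-01-15", "2020-08-02"]),
})

# Most recent cases first; batches for data preparation are taken from the top.
ordered = cohort.sort_values("inclusion_date", ascending=False).reset_index(drop=True)
first_batch = ordered.head(3)  # e.g. Nstart = 3 cases prepared first
print(first_batch["patient_id"].tolist())  # [4, 5, 2]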
A limitation of the adaptive procedure is that the final sample size is not set in advance, which may lead to practical and logistical challenges. For example, more data cleaning and computational effort is required, and studies may take longer to complete if the stopping rule is met at a higher sample size than anticipated. On the other hand, although using a fixed sample size does not have this drawback, it is uncertain how reasonable the fixed sample size turns out to be in the end. Another consequence of our procedure is that, for prospective studies, continuous data monitoring and data cleaning are required. This additional effort is probably more an advantage than a limitation, because continuous evaluation of incoming data tends to save time later on and allows data collection issues to be spotted and remedied in a timely manner. Finally, the adaptive procedure is most attractive for settings where outcomes are known immediately (diagnostic research) or after a short period of follow-up (e.g. complications after surgery, or 30-day mortality).
A limitation of the resampling study may be that we sampled from the datasets without replacement rather than with replacement. We deliberately opted to sample without replacement to mimic real-life recruitment. However, this may have led to an underestimation of the variability between learning curves (Figures S11-12).
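For clarity, the difference between the two sampling schemes can be sketched as follows; the dataset and sample sizes are arbitrary.

# Drawing a learning-curve sample of size n from a dataset of size N without
# replacement (mimicking recruitment) versus with replacement (bootstrap-style).
import numpy as np

rng = np.random.default_rng(0)
N, n = 1000, 300
without = rng.choice(N, size=n, replace=False)   # each case used at most once
with_repl = rng.choice(N, size=n, replace=True)  # cases can recur, adding variability
print(len(set(without)), len(set(with_repl)))    # 300 vs. typically fewer than 300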
Future research should focus on learning curves to further study how the required sample size is affected by contextual characteristics such as modeling choices (type of algorithm, amount of a priori and data-driven variable selection), case mix (distribution of predictors and outcomes), and predictive strength. Although not addressed systematically in this work, the predictive strength of the included predictors, as expressed by the AUC, plays a role. The ovarian cancer (AUC around 0.9) and CAD (AUC around 0.7) case studies clearly differ in this respect.