The study was based on risk factor and outcome data collected from January 2006 to December 2019 from transrectal systematic 10–12-core biopsies in 10 PBCG cohorts spanning North America and Europe, used for training, and one PBCG European cohort used for validation (Fig. 1, Fig. 2). The risk factors collected comprised the standard risk factors used in clinical practice for prostate cancer diagnosis, along with other less commonly used risk factors with proven associations with prostate cancer. All PBCG data were collected following local institutional review board (IRB) approval from the University of Texas Health Science Center at San Antonio, Memorial Sloan Kettering Cancer Center (MSKCC), Mayo Clinic, University of California San Francisco, Hamburg-Eppendorf University Clinic, Cleveland Clinic, Sunnybrook Health Sciences Centre, Veterans Affairs (VA) Caribbean Healthcare System, VA Durham, San Raffaele Hospital, and University Hospital Zurich. Analyses for this retrospective study were approved by the Technical University of Munich Rechts der Isar Hospital ethics committee, and all methods were performed in accordance with the guidelines and regulations of the committee. As the data collected were anonymized and obtained as part of standard clinical care, consent was waived by all IRBs, except for second-degree prostate cancer and first-degree breast cancer family history at the VA Durham. Written consent for these variables was obtained and documented as part of a larger separate study at the VA Durham prior to the beginning of this study. All institutional PBCG IRB approvals are maintained by the MSKCC central data coordinating center and IRB.
The 10 cohorts used for training the model followed the PBCG prospective protocol for data collection, whereas the external validation cohort supplied retrospective data to the PBCG from a single institution that performs a high annual number of prostate biopsies [2, 3]. Included data came from patients who had received a prostate biopsy following a PSA test under local standard of care and may be seen as representative of patients in North America, including Puerto Rico, and Europe. MRI biopsies, as well as prostate biopsies from patients with a prior prostate cancer diagnosis, were excluded. Clinically significant prostate cancer was defined as Gleason grade group ≥ 2 on biopsy [13]. For users of the developed risk calculator, two risk factors were mandatory: PSA and age. Ten risk factors were optional: DRE, prostate volume, prior negative biopsy, 5-alpha-reductase-inhibitor use, prior PSA screen (yes/no), African ancestry, Hispanic ethnicity, first-degree prostate cancer, second-degree prostate cancer, and first-degree breast cancer family history.
We performed a literature search to identify the six most commonly used approaches for handling missing data in multivariable logistic regression modeling with single or multiple cohorts, as encountered in this study. All of the approaches could be implemented in the R statistical package. Our aim was to identify the most accurate approach for implementation in the online tool. To increase the flexibility of the tool, we tailored each method to the specific list of risk factors available for an individual; that is, for a validation set, the algorithms were applied separately to each individual in the validation set. All algorithms return logistic-regression-based expressions for the probability of clinically significant prostate cancer; the cohort ensemble approach averages these over the individual cohorts. The methods are summarized in Table 1 and Table S1.
Table 1
Methods for fitting individual predictor-specific risk models for members of a test set by combining data from multiple cohorts. All individuals in the training and test cohorts have two predictors measured, PSA and age, plus any subset, including none, of 10 additional predictors, for a total of 12 predictors, denoted by \(\text{X}\). The set of predictors available for the new individual is denoted by \({\text{X}}^{*}\). All models use logistic regression to predict clinically significant prostate cancer. MICE = multiple imputation by chained equations; BIC = Bayesian information criterion, defined here as the log likelihood – (number of covariates) \(\times\) log(sample size).
Method | Definition
--- | ---
Available cases | Pool individual-level data that have \({\text{X}}^{*}\) measured across all cohorts and fit a model including \({\text{X}}^{*}\) as main effects.
Iterative BIC selection | Same as available cases, but with iterative stepwise BIC-based model selection to determine the optimal subset of \({\text{X}}^{*}\) and interactions.
Cohort ensemble | A separate model is fit to each cohort using the variables collected by that cohort that coincide with those available for the patient; cohort-specific risks are averaged.
Categorization | All individuals in all cohorts are used. Predictors are categorized, with missing as one of the categories, so that the complete list of predictors \(\text{X}\) is used.
Missing indicator | Include an indicator for a missing continuous predictor value and its interaction with the predictor as additional variables in the analysis. Otherwise similar to categorization.
Imputation | Impute missing covariates in the training set using the MICE method. Mean imputation for missing values at prediction.
The available cases algorithm pooled individual-level data from the training cohorts with information on the variables the end-user had available, fit a main-effects logistic regression model for clinically significant prostate cancer to the training data, and used the coefficients in a tailored prediction model for the target patient. The iterative Bayesian information criterion (BIC) selection method added stepwise BIC-based model selection to the available cases algorithm, allowing two-way interactions to be included. If a risk factor was not chosen in the optimal model by the selection process, the procedure was restarted excluding that risk factor, allowing a greater number of individuals from the training set to be included in model development.
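As an illustrative sketch only (the authors' implementation was in R), the two steps above amount to pooling the training rows in which every requested predictor was measured, and scoring candidate models with the penalized log-likelihood defined in Table 1. All function and field names here are hypothetical.

```python
import math

def available_cases(training_rows, requested):
    """Keep only rows in which every requested predictor was measured
    (None marks an unmeasured risk factor)."""
    return [r for r in training_rows
            if all(r.get(p) is not None for p in requested)]

def bic_score(log_likelihood, n_covariates, sample_size):
    # Penalized log-likelihood as defined in Table 1:
    # log likelihood - (number of covariates) * log(sample size).
    return log_likelihood - n_covariates * math.log(sample_size)

rows = [
    {"psa": 4.2, "age": 65, "dre": "normal"},
    {"psa": 6.0, "age": 70, "dre": None},      # DRE not measured
    {"psa": 2.9, "age": 58, "dre": "abnormal"},
]
# Only the two rows with DRE measured remain in the pooled training set.
pooled = available_cases(rows, ["psa", "age", "dre"])
```

In the iterative variant, a model whose selection step drops a predictor would trigger a re-run of `available_cases` without that predictor, enlarging the pooled sample.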
Rather than pooling data across cohorts, the cohort ensemble method constructed a separate model for each cohort, restricted to risk factors both available to the end-user and collected by that training cohort. A risk factor was considered available in a training cohort if it was measured in 40% or more of participants; otherwise it was considered missing and excluded, so as not to prohibitively reduce the sample size for constructing the cohort-specific model. Because models were fit to single cohorts and some cohorts had small sample sizes, the information from an individual cohort could be limited or considered inadequate for robust multivariable model construction, as for cohort 10 with only 243 biopsies. Such cohorts were not excluded because, while they may lack power for obtaining statistical significance of individual coefficients, the goal here was optimizing out-of-sample prediction. Cohort-specific risks were averaged over the cohorts for the result provided to the end-user.
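The averaging step can be sketched as follows; the per-cohort logistic models are stubbed here as single-predictor (intercept, coefficient) pairs with illustrative, not fitted, values, whereas the real models are multivariable.

```python
import math

def logistic_risk(intercept, coef, x):
    """Predicted probability from a one-predictor logistic model."""
    return 1.0 / (1.0 + math.exp(-(intercept + coef * x)))

# One stub model per training cohort that collected the patient's predictors.
cohort_models = [(-2.0, 0.8), (-1.5, 0.5), (-2.5, 1.0)]

def ensemble_risk(models, x):
    # Average the cohort-specific risks; this mean is what the
    # end-user receives.
    risks = [logistic_risk(a, b, x) for a, b in models]
    return sum(risks) / len(risks)

risk = ensemble_risk(cohort_models, 1.2)  # e.g. a standardized PSA value
```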
The categorization algorithm returned to pooling data across all training cohorts, and additionally transformed all continuous risk factors to categorical so that missing could be added as an extra category. For inherently categorical risk factors, such as DRE, categories were coded as normal, abnormal, and missing. Prostate volume was stratified to < 30, 30–50 and > 50 cc, as previously suggested, so that it could be obtained by pre-biopsy DRE or TRUS, before adding the additional category of missing [14]. The advantage of this approach was that only one model was fit and needed by the end-user. The missing indicator algorithm was similar to the categorization algorithm but did not require categorization of continuous variables [15]. Instead, it introduced an indicator equal to 1 if the corresponding risk factor was missing and 0 otherwise. The model included the indicator and its interaction with the risk factor. Since prostate volume was the only continuous risk factor that was sometimes missing, the missing indicator algorithm differed from the categorization algorithm in only one variable. Second-degree prostate cancer and first-degree breast cancer family history were either both collected or both uncollected by the individual cohorts; adding a missing category to each would therefore induce multi-collinearity. To avoid this, they were combined into a single new 5-category risk factor: second-degree prostate cancer family history only, first-degree breast cancer family history only, both present, neither present, or missing.
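The two encodings for prostate volume can be sketched as below. The cut-points follow the text; the fill value used by the missing-indicator encoding is a hypothetical training-set mean, not a value from the study.

```python
def categorize_volume(volume_cc):
    """Categorization: missing becomes its own category."""
    if volume_cc is None:
        return "missing"
    if volume_cc < 30:
        return "<30"
    if volume_cc <= 50:
        return "30-50"
    return ">50"

def missing_indicator(volume_cc, fill=45.0):  # fill: hypothetical mean
    """Missing indicator: return (indicator, filled value, interaction),
    the three model terms described in the text."""
    m = 1 if volume_cc is None else 0
    x = fill if volume_cc is None else volume_cc
    return (m, x, m * x)
```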
Multiple imputation has been recommended for fitting statistical models to training data with either outcomes or risk factors missing at random (MAR) [16]. In the case here, the outcome of clinically significant prostate cancer was not missing for any individual, so imputation was applied only to missing risk factors. Data were pooled across all ten cohorts to form the training set, and imputation was applied to the pooled set rather than by cohort. For a patient in the training set with multiple missing risk factors, multiple imputation by chained equations (MICE) sequentially imputes the missing values according to full conditional models appropriate to each risk factor's data type, using all other available risk factors and the outcome as covariates, with the conditional models fit to the observed cases in the training set [16, 17]. The R mice package defaults to 5 imputations, and the literature has also recommended 10 [16, 18]. We implemented 30 imputations, matching the average percentage of missing values across all risk factors in the training set, and averaged the models built on the 30 imputed data sets to obtain the final training set risk model. For an end-user or member of the validation set missing a risk factor, the algorithm imputed its value using mean values from the training set only, and not from other members of a validation set, as the latter would not be available in practice [17].
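The prediction-time rule (fill a new patient's missing risk factor with the training-set mean, never with validation-set information) can be sketched as follows; rows and field names are illustrative.

```python
def training_means(rows, predictors):
    """Mean of each predictor over the training rows where it was observed."""
    means = {}
    for p in predictors:
        observed = [r[p] for r in rows if r.get(p) is not None]
        means[p] = sum(observed) / len(observed)
    return means

def impute_for_prediction(patient, means):
    """Fill a new patient's missing values with training-set means only."""
    return {p: (patient[p] if patient.get(p) is not None else m)
            for p, m in means.items()}

train = [{"psa": 4.0, "volume": 30.0},
         {"psa": 6.0, "volume": None},
         {"psa": 5.0, "volume": 60.0}]
means = training_means(train, ["psa", "volume"])  # volume mean from 2 rows
patient = impute_for_prediction({"psa": 7.5, "volume": None}, means)
```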
External validation on the European cohort, which was not used for training, was measured by discrimination, using the area under the receiver-operating-characteristic curve (AUC) along with its 95% confidence interval (CI); calibration-in-the-large (CIL), which evaluates the average difference between the predicted risk and the binary clinically significant prostate cancer outcome across patients in the validation set; and calibration-in-the-small, via calibration curves of observed versus predicted risk according to deciles of predicted risk. Internal leave-one-cohort-out cross-validation using the same metrics was also performed, by in turn holding out one of the 10 PBCG cohorts used for training the model as a test set and training the models on the remaining 9 cohorts. Distributions of AUCs and CILs from the 10 test validations were visualized by violin plots showing smoothed histograms and boxplots showing medians and inter-quartile ranges. All analyses were performed in the R statistical package [19].
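The two scalar metrics can be sketched as below: AUC via the rank (Mann-Whitney) formulation, and CIL as the mean predicted risk minus the observed event rate. The sign convention for CIL is an assumption; the paper defines it only as an average difference.

```python
def auc(risks, outcomes):
    """Probability that a random case outranks a random control
    (Mann-Whitney formulation of the ROC area), ties counted as 0.5."""
    pos = [r for r, y in zip(risks, outcomes) if y == 1]
    neg = [r for r, y in zip(risks, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def calibration_in_the_large(risks, outcomes):
    """Mean predicted risk minus observed event rate (assumed sign)."""
    return sum(risks) / len(risks) - sum(outcomes) / len(outcomes)

risks = [0.9, 0.8, 0.3, 0.2]
outcomes = [1, 1, 0, 0]
# Perfect ranking gives AUC = 1.0; CIL here is 0.55 - 0.5 = 0.05.
```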