The decision between active surveillance and surgical intervention in PCa hinges on the risk of GG upgrading, particularly upgrading to clinically significant disease such as from ISUP GG 2 to 3 (6, 7). Despite extensive research into potential factors, the exact features that contribute to upgrading remain controversial. Our study therefore integrated both candidate characteristics and time factors to predict upgrading, considering a maximum waiting time for RP of approximately 3 to 4 months, as shown in Table 1. Notably, the median surgical waiting time in Class 0 (non-upgrading) was 2 weeks longer than in Class 1 (upgrading).
Contrary to expectations, our findings revealed no specific features with a strong association with GG upgrading. Both Pearson correlation analysis and PCA failed to identify distinct factors that contributed significantly to upgrading. Surgical waiting time and most other features showed no correlation with upgrading, suggesting that a delay of 3–4 months did not significantly affect the risk of upgrading. This is consistent with previous studies (13, 14) and existing guidelines (29), which support delaying RP for around 3 months as safe. However, three features exhibited weak correlations with upgrading: negative correlations for biopsy GG and PI-RADS, and a positive correlation for percent positive cores. These weak correlations may reflect the rarity of upgrading events relative to the more prevalent non-upgrading events, which attenuates the observed associations. These findings are consistent with previous studies on percent positive cores (9, 10, 12, 30) and PI-RADS (12), and the machine learning literature (30, 31) likewise recognizes the value of imaging characteristics in predicting GG upgrading. Furthermore, despite its comprehensive nature, PCA offered no additional insight beyond Pearson correlation. The scree plot in Fig. 1A, in which 7 PCs were required to explain more than 80% of the variance, indicates that individual features alone are insufficient for accurate prediction. The limited discriminatory power illustrated in Fig. 1B underscores the difficulty of identifying distinctive features and makes model training on our dataset challenging.
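As a minimal sketch of the explained-variance check behind such a scree plot (using synthetic data rather than our cohort; scikit-learn is assumed, and the threshold of 80% mirrors the criterion above), the number of PCs needed to reach the threshold can be computed as follows:

```python
# Minimal sketch of a PCA explained-variance check (synthetic data, not the study dataset).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))                 # hypothetical feature matrix

X_std = StandardScaler().fit_transform(X)      # PCA is scale-sensitive, so standardize first
pca = PCA().fit(X_std)

cumvar = np.cumsum(pca.explained_variance_ratio_)
n_pcs_80 = int(np.argmax(cumvar >= 0.80)) + 1  # smallest number of PCs reaching 80%
print(f"PCs needed for >=80% explained variance: {n_pcs_80}")
```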
Therefore, we employed 10-fold cross-validation with an additional 3-fold cross-validation nested within each fold, helping the models learn from challenging cases. To address class imbalance during preprocessing, we applied both over-sampling and under-sampling (32). Note that statistical significance testing for comparing model performance across resampling strategies was not applicable, as the evaluation relied primarily on performance metrics such as accuracy, precision, recall, and F1 score. Interestingly, both over-sampling and under-sampling improved the performance of most models, with the exceptions of NN and LR. For NN, the decrease in performance is attributed to overfitting induced by the resampling techniques, as indicated by high training accuracy but low testing accuracy. For LR, as discussed by van den Goorbergh et al. (33), imbalance corrections such as resampling can adversely affect parameter estimation. Despite a slight decline in performance, reflecting the mild imbalance in our study, LR remained the most accurate model under all three strategies, while NN performed best in terms of F1 score. Furthermore, among the ML models trained without any resampling, XGBoost, RF, and LR achieved testing accuracies exceeding our baseline of 82.1%.
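As an illustrative sketch of comparing the three resampling strategies under cross-validation (our actual workflow was built in Orange Data Mining, and the nested 3-fold inner loop is omitted here for brevity), the following uses scikit-learn and imbalanced-learn with a synthetic, imbalanced dataset:

```python
# Illustrative sketch: no sampling vs. over-sampling vs. under-sampling with
# 10-fold cross-validated logistic regression (not the study's actual Orange workflow).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                # hypothetical features
y = (rng.random(200) < 0.18).astype(int)     # ~18% minority ("upgrading") class

strategies = {
    "none": None,
    "over": RandomOverSampler(random_state=0),
    "under": RandomUnderSampler(random_state=0),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, sampler in strategies.items():
    steps = ([("sampler", sampler)] if sampler else []) + [
        ("clf", LogisticRegression(max_iter=1000))
    ]
    pipe = Pipeline(steps)  # resampling is applied only within each training fold
    scores = cross_validate(pipe, X, y, cv=cv, scoring=["accuracy", "f1"])
    print(name,
          "accuracy:", round(scores["test_accuracy"].mean(), 3),
          "F1:", round(scores["test_f1"].mean(), 3))
```

Placing the sampler inside the pipeline ensures resampling never touches the held-out fold, which would otherwise inflate the testing metrics.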
Because LR and NN were the best-performing models in the present study (by accuracy and F1 score, respectively), we performed an explanatory analysis of both using the Explain add-on in Orange Data Mining to assess the features influencing the model output (Fig. 2). Consistent with previous studies (8–10, 12), the main contributors common to the LR and NN predictions were biopsy GG, age, and percent positive cores; however, these characteristics did not have a strong impact on the models, in line with the weak correlations in Fig. 1 and Table 2. For clinical application, the model should correctly identify all cases labeled 1 (upgrading), i.e., achieve a recall close to 1.0. Despite the challenges, our empirical goal was a minimum precision of 0.8 for label 1, which corresponds to an F1 score threshold of 0.89. However, our models did not reach this F1 threshold, indicating that they do not yet meet the performance required for clinical use and highlighting the need to explore additional features to enhance predictive capability.
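This threshold follows from the standard F1 formula, taking precision P = 0.8 and, as implied by the requirement to identify all upgrading cases, recall R = 1.0:

$$F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.8 \times 1.0}{0.8 + 1.0} \approx 0.89$$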
Concerning the surgical waiting time, both our study and existing research suggest that analysis at a single time point does not yield clear predictions. We propose exploring alternative models, such as the Cox proportional hazards model, to enhance time-associated predictions by considering the entire waiting duration rather than a single time point, as sketched below. Given the weak or absent correlations between our features and the target, and the failure to reach the F1 threshold, further exploration of supplementary features that could serve as indicators for this problem is warranted. In clinical practice, even with weak correlations, clinicians should be attentive when a patient with a low biopsy GG also has high PSA levels, is older, or has a high percentage of positive cores, as these suggest an increased risk of upgrading. Future studies should consider that predicting upgrading may not rely solely on specific individual features; the physician's experience and the characteristics of the patient are likely important factors influencing the decision about the next step in individualized treatment.
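As a minimal, illustrative sketch of this proposal (not an analysis we performed), assuming a hypothetical per-patient table with columns such as waiting_time_weeks, upgrade, biopsy_gg, psa, age, and percent_positive_cores, a Cox proportional hazards model could be fitted with the lifelines library as follows:

```python
# Minimal sketch of the proposed time-to-event analysis (hypothetical data and column names).
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("cohort.csv")  # hypothetical file with one row per patient

cph = CoxPHFitter()
cph.fit(
    df[["waiting_time_weeks", "upgrade", "biopsy_gg", "psa", "age", "percent_positive_cores"]],
    duration_col="waiting_time_weeks",  # time from biopsy to RP
    event_col="upgrade",                # event of interest: GG upgrading found at RP
)
cph.print_summary()  # hazard ratios for each predictor over the full waiting period
```

Such a model would estimate how the hazard of upgrading changes over the waiting period as a function of each predictor, rather than comparing outcomes at a single fixed time point.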
This study has limitations. First, as a single-center study conducted in Thailand, its generalizability may be limited, particularly with respect to racial or genetic factors. Second, with surgical waiting times of only around 3–4 months, its relevance to settings with longer waiting times may be limited. Third, as a retrospective chart review, the available data and time-dependent factors that influenced decisions in practice could not be fully controlled, potentially affecting the analysis.