According to the inclusion and exclusion criteria, we identified 826 patients with newly diagnosed hormone-sensitive prostate cancer treated at Yunnan Cancer Hospital between January 2017 and January 2022. Of these, 293 patients with missing data were excluded; the missing items were mainly the Gleason score, the pathological type, and the PSA level at initial diagnosis. The final cohort consisted of 533 patients.
We modeled the data characteristics of these patients and adjusted the parameters described above. The feature subsets and baseline data of the patients are given in Table 1 and Table (S1). Figure 2 is a heat map of the subset attributes. The distribution of patient indicators is shown in Figure (S1). We used Python 3.8.6 to run the Shapiro-Wilk test to assess the normality of measurement data. Measurement data conforming to a normal distribution were compared with the two independent samples t-test and described as "mean ± standard deviation". Measurement data that did not conform to a normal distribution were compared with the Wilcoxon rank sum test and described as "median (25%-75%) [M (P25-P75)]". Count data and grade data are expressed as frequency (%); the chi-square test was used to compare differences between groups for the other bivariate variables.
The significance level was α = 0.05; P < 0.05 was considered statistically significant.
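The descriptive formats and the chi-square comparison described above can be illustrated with a minimal sketch that uses only the Python standard library. This is not the code used in the study: the function names and the example values are hypothetical, the chi-square version shown is the uncorrected Pearson test for a 2x2 table only, and the Shapiro-Wilk, t-test, and Wilcoxon rank sum steps are not reproduced here (in practice they are typically taken from a statistics package such as scipy.stats).

```python
import math
import statistics


def describe_normal(values):
    """Format data treated as normally distributed as 'mean ± SD'."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return f"{mean:.2f} ± {sd:.2f}"


def describe_skewed(values):
    """Format data treated as non-normal as 'median (P25-P75)'."""
    p25, median, p75 = statistics.quantiles(values, n=4, method="inclusive")
    return f"{median:.2f} ({p25:.2f}-{p75:.2f})"


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test without continuity correction for the
    2x2 table [[a, b], [c, d]]; returns (statistic, p_value), df = 1."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical values, for illustration only (not study data).
sample = [1, 2, 3, 4, 5]
print(describe_normal(sample))   # 3.00 ± 1.58
print(describe_skewed(sample))   # 3.00 (2.00-4.00)

stat, p = chi_square_2x2(10, 20, 20, 10)
print(f"chi2 = {stat:.3f}, significant at alpha 0.05: {p < 0.05}")
```

Whether a variable is summarized with `describe_normal` or `describe_skewed` would follow from the normality test result, as stated in the text.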
Figure 1 shows our technical path.
Model representation
KNN model
The KNN model had an accuracy (ACC) of 75.4%, a sensitivity (SEN), i.e. recall, of 80%, a specificity (SPE) of 70.1%, a precision (PRE) of 75%, and an F1-score of 77.42%, as shown in Figure 3. The area under the ROC curve (AUC) was 0.75, as shown in Figure 4.
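For reference, the metrics reported for each model follow from the confusion matrix by their standard definitions. The sketch below is illustrative and is not the evaluation code used in the study; the function names and the confusion-matrix counts are hypothetical. Only the last line uses values from the text: the KNN precision (75%) and recall (80%), from which the reported F1-score can be recomputed.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)           # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "ACC": accuracy,
        "SEN": sensitivity,
        "SPE": specificity,
        "PRE": precision,
        "F1": f1_score(precision, sensitivity),
    }


def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not study data).
metrics = classification_metrics(tp=40, fp=10, tn=35, fn=15)
print({name: round(value, 3) for name, value in metrics.items()})

# Recomputing the KNN F1-score from the reported precision and recall.
print(round(f1_score(0.75, 0.80) * 100, 2))  # 77.42
```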
Naive Bayes
The naive Bayes model had an accuracy (ACC) of 71.1%, a sensitivity (SEN), i.e. recall, of 83.1%, a specificity (SPE) of 62.65%, a precision (PRE) of 61.25%, and an F1-score of 70.5%, as shown in Figure 5. The area under the ROC curve (AUC) was 0.76, as shown in Figure 6.
Ensemble learning
Ranked by accuracy from highest to lowest, the ensemble learning models were XGBoost, AdaBoost, and random forest. The XGBoost model had an accuracy (ACC) of 88.02%, a sensitivity (SEN), i.e. recall, of 90.9%, a specificity (SPE) of 84.6%, a precision (PRE) of 87.5%, and an F1-score of 89%. The AdaBoost model had an accuracy (ACC) of 86.6%, a sensitivity (SEN), i.e. recall, of 89.6%, a specificity (SPE) of 83.1%, a precision (PRE) of 86.25%, and an F1-score of 87.9%. The random forest model had an accuracy (ACC) of 85.2%, a sensitivity (SEN), i.e. recall, of 87.3%, a specificity (SPE) of 82.5%, a precision (PRE) of 86.25%, and an F1-score of 86.79%, as shown in Figure 7. Ranked by area under the ROC curve (AUC) from highest to lowest, the models were AdaBoost (0.93), random forest (0.92), and XGBoost (0.89), as shown in Figure 8.
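The ranking by AUC differs from the ranking by accuracy, which is possible because accuracy is computed at a single decision threshold, whereas AUC summarizes how well the predicted scores separate the two classes over all thresholds. As a minimal sketch, AUC can be computed as the fraction of positive-negative pairs in which the positive case receives the higher score, with ties counted as one half. The function name and the labels and scores below are hypothetical, and this is not the code used to produce the reported values.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted scores, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a cohort of a few hundred patients; a rank-based formulation would scale better for large data sets.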
Analysis of importance based on ensemble learning
The feature importance ranking from the random forest is shown in Figure 9. The top five features were tumor burden, the treatment plan in the hormone-sensitive stage, lactate dehydrogenase (LDH), alkaline phosphatase, and the presence of bone metastasis at first diagnosis. At the bottom of the list were the presence or absence of hematuria, the Gleason score, and the pathological type. A visualization of the random forest model is shown in Figure (S4).
The importance of features in the XGBoost model is shown in Figure 10. The XGBoost ranking is calculated from the sum of the error reduction achieved by the splits on each variable. LDH and prostate volume ranked highest in feature importance, while bone metastasis and the Gleason score ranked lowest. A visualization of the XGBoost model is shown in Figure (S6).
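The idea of ranking features by the summed error reduction of their splits can be sketched as follows. This is an illustration of the principle only, not the XGBoost implementation or the code used in the study: the function name, the feature names, and the gain values are hypothetical, and the normalization to fractions of the total is an added convenience.

```python
from collections import defaultdict


def total_gain_importance(splits):
    """Rank features by the summed error reduction (gain) of their splits.

    `splits` is an iterable of (feature_name, gain) pairs, one per split
    across all trees. Returns (feature, share_of_total_gain) pairs sorted
    from most to least important.
    """
    totals = defaultdict(float)
    for feature, gain in splits:
        totals[feature] += gain
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(feature, gain / grand_total) for feature, gain in ranked]


# Hypothetical splits, for illustration only (not study data).
splits = [
    ("feature_a", 4.0),
    ("feature_b", 1.0),
    ("feature_a", 2.0),
    ("feature_c", 3.0),
]
for feature, share in total_gain_importance(splits):
    print(f"{feature}: {share:.1f}")
```

A feature that is used in many splits with small gains can therefore rank as high as one used in a few splits with large gains, which is worth keeping in mind when comparing this ranking with the random forest ranking.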