Demographics
2010-2015 dataset:
The dataset included 24,481 cases, with a mean patient age of 67.95 years (standard deviation 11.21 years). The average tumor size was 34.79 mm (standard deviation of 20.74 mm), ranging from 1 mm to 150 mm. The sex distribution indicated a male predominance: 18,299 males (74.8%) and 6,182 females (25.2%). The racial composition was primarily white (21,842 cases, 89.2%), followed by black (1,562 cases, 6.4%), Asian (877 cases, 3.6%), and other races (200 cases, 0.8%).
The most common tumor grade was Grade IV, observed in 11,425 patients (46.7%). Stage 0a was the most common tumor stage (10,588 patients, 43.3%), and Ta was the most prevalent T classification (10,588 patients, 43.3%). In the N classification, N0 was predominant (22,727 cases, 92.8%), and in the M classification, M0 was observed in 23,668 cases (96.7%), with the remainder classified as M1.
Radiation treatment was administered to 1,361 patients (5.6%), whereas 23,120 patients (94.4%) did not receive radiation. The chemotherapy data revealed that 15,800 patients (64.6%) did not undergo chemotherapy, whereas 8,681 patients (35.4%) did.
Patient outcomes revealed that 17,256 patients (70.5%) were alive, whereas 7,225 patients (29.5%) died of transitional cell carcinoma (TCC).
2018+ dataset:
The dataset included 32,348 cases, with a mean patient age of 70.61 years (standard deviation of 10.73 years). The average tumor size was 36.62 mm (standard deviation 48.50 mm), ranging from 0 mm to 990 mm. The sex distribution indicated a male predominance: 24,657 males (76.2%) and 7,691 females (23.8%). The racial composition was primarily white (28,777 cases, 89.0%), followed by Asian or Pacific Islander (1,743 cases, 5.4%), black (1,701 cases, 5.3%), and American Indian/Alaskan Native (127 cases, 0.4%).
The most common tumor grade was Grade IV (10,788 cases, 33.4%), followed by Grade III (10,553 cases, 32.6%), Grade I (5,559 cases, 17.2%), and Grade II (5,448 cases, 16.8%). Tumor stage "0a" was most frequently noted (15,930 cases, 49.2%), followed by Stage I (8,016 cases, 24.8%), Stage II (4,140 cases, 12.8%), Stage III (2,330 cases, 7.2%), Stage 0 (1,054 cases, 3.3%), and Stage IV (878 cases, 2.7%).
In the T classification, Ta was the most prevalent (15,930 cases, 49.2%), followed by T1 (8,208 cases, 25.4%), T2 (2,905 cases, 9.0%), and various other classifications in smaller numbers. For the N classification, N0 was predominant (30,802 cases, 95.2%), with smaller counts for N2, N1, and N3. For the M classification, M0 was observed in 31,559 cases (97.6%), with the remainder classified as M1.
Radiation treatment was administered to 1,926 patients (6.0%), whereas 30,422 patients (94.0%) did not receive radiation. The chemotherapy data revealed that 19,480 patients (60.2%) did not undergo chemotherapy, whereas 12,868 patients (39.8%) did.
Patient outcomes revealed that 29,341 patients (90.7%) were alive, whereas 3,007 patients (9.3%) died from urinary bladder cancer.
Model performance
Performance on the test dataset:
The deep neural network model had a sensitivity of 64.55% (95% CI 61.64% to 67.38%). Its specificity was 91.95% (95% CI 90.84% to 92.98%). The positive likelihood ratio was 8.02 (95% CI 6.99 to 9.21), and the negative likelihood ratio was 0.39 (95% CI 0.36 to 0.42). The positive predictive value was 77.43% (95% CI 74.93% to 79.74%), whereas the negative predictive value was 85.85% (95% CI 84.84% to 86.80%). The model's accuracy was 83.75% (95% CI 82.51% to 84.93%). The ROC-AUC was 0.8945 (95% CI 0.8826 to 0.9065). The Brier score was 0.1154.
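The likelihood ratios follow directly from sensitivity and specificity, so the reported values can be cross-checked by hand. The snippet below is a minimal sketch using the rounded deep neural network figures from the paragraph above; the function name is illustrative and not taken from the study's code.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a binary classifier.

    LR+ = sensitivity / (1 - specificity)
    LR- = (1 - sensitivity) / specificity
    """
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg

# Deep neural network on the test dataset (values from the text, as fractions).
lr_pos, lr_neg = likelihood_ratios(0.6455, 0.9195)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")  # LR+ = 8.02, LR- = 0.39
```

Both results agree with the reported point estimates.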
The logistic regression model had a sensitivity of 64.45% (95% CI 61.54% to 67.29%). The specificity was 91.61% (95% CI 90.47% to 92.65%). The positive likelihood ratio was 7.68 (95% CI 6.71 to 8.79). The negative likelihood ratio was 0.39 (95% CI 0.36 to 0.42). The positive predictive value was 76.65% (95% CI 74.15% to 78.98%). The negative predictive value was 85.77% (95% CI 84.76% to 86.73%). The accuracy was 83.47% (95% CI 82.23% to 84.66%). The ROC-AUC was 0.8862 (95% CI 0.8741 to 0.8987). The Brier score was 0.1196.
The gradient boosting model had a sensitivity of 66.09% (95% CI 63.21% to 68.89%). The specificity was 91.37% (95% CI 90.22% to 92.43%). The positive likelihood ratio was 7.66 (95% CI 6.71 to 8.75). The negative likelihood ratio was 0.37 (95% CI 0.34 to 0.40). The positive predictive value was 76.61% (95% CI 74.15% to 78.90%). The negative predictive value was 86.31% (95% CI 85.29% to 87.26%). The accuracy was 83.80% (95% CI 82.57% to 84.98%). The ROC-AUC was 0.8938 (95% CI 0.8820 to 0.9057). The Brier score was 0.1157.
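For readers unfamiliar with the Brier score, it is the mean squared difference between the predicted probability and the observed binary outcome, so lower values are better. The sketch below uses made-up predictions for four hypothetical patients purely to illustrate the calculation; none of the numbers come from the study.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    pairs = list(zip(probabilities, outcomes))
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)

# Made-up example: 1 = died, 0 = alive.
outcomes = [1, 0, 1, 0]
confident = brier_score([0.9, 0.2, 0.6, 0.1], outcomes)
uninformative = brier_score([0.5, 0.5, 0.5, 0.5], outcomes)
print(round(confident, 3), round(uninformative, 3))
```

In this toy case the reasonably confident predictions score about 0.055, whereas always predicting 0.5 scores 0.25.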
Performance on the internal validation dataset:
The deep neural network model had a sensitivity of 62.69% (95% CI 59.71% to 65.59%). The specificity reached 91.88% (95% CI 90.77% to 92.91%). The positive likelihood ratio was 7.72 (95% CI 6.73 to 8.86). The negative likelihood ratio was 0.41 (95% CI 0.38 to 0.44). The positive predictive value was 76.10% (95% CI 73.52% to 78.51%). The negative predictive value was 85.66% (95% CI 84.67% to 86.59%). The model's accuracy was 83.36% (95% CI 82.12% to 84.55%). The ROC-AUC was 0.8893 (95% CI 0.8774 to 0.9006). The Brier score was 0.1175.
The logistic regression model had a sensitivity of 61.66% (95% CI 58.67% to 64.58%). The specificity was 92.15% (95% CI 91.05% to 93.16%). The positive likelihood ratio was 7.86 (95% CI 6.83 to 9.04). The negative likelihood ratio was 0.42 (95% CI 0.39 to 0.45). The positive predictive value was 76.42% (95% CI 73.80% to 78.84%). The negative predictive value was 85.36% (95% CI 84.37% to 86.29%). The accuracy was 83.25% (95% CI 82.00% to 84.45%). The ROC-AUC was 0.8827 (95% CI 0.8703 to 0.8944). The Brier score was 0.1205.
The gradient boosting model had a sensitivity of 64.55% (95% CI 61.61% to 67.42%). The specificity was 91.00% (95% CI 89.83% to 92.07%). The positive likelihood ratio was 7.17 (95% CI 6.30 to 8.17). The negative likelihood ratio was 0.39 (95% CI 0.36 to 0.42). The positive predictive value was 74.73% (95% CI 72.20% to 77.11%). The negative predictive value was 86.16% (95% CI 85.16% to 87.11%). The accuracy was 83.28% (95% CI 82.03% to 84.47%). The ROC-AUC was 0.8883 (95% CI 0.8761 to 0.8995). The Brier score was 0.1176.
Deep neural network performance on the external validation dataset:
The deep neural network model had a sensitivity of 69.40% (95% CI 67.72% to 71.05%). Its specificity was 85.32% (95% CI 84.91% to 85.72%). The positive likelihood ratio was 4.73 (95% CI 4.56 to 4.90), and the negative likelihood ratio was 0.36 (95% CI 0.34 to 0.38). The positive predictive value was 32.63% (95% CI 31.84% to 33.44%), whereas the negative predictive value was 96.46% (95% CI 96.27% to 96.64%). The model's accuracy was 83.84% (95% CI 83.43% to 84.24%). The ROC-AUC was 0.8758 (95% CI 0.8698 to 0.8816). The Brier score was 0.1182.
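The much lower positive predictive value on the external validation dataset is what Bayes' theorem predicts when the same sensitivity and specificity are applied to a population with fewer deaths. The sketch below assumes the external validation set is the 2018+ dataset described under Demographics (3,007 deaths among 32,348 cases); that mapping, the function name, and the second prevalence used for comparison are assumptions made for illustration.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity, and outcome prevalence."""
    tp = sensitivity * prevalence                   # true-positive fraction
    fp = (1.0 - specificity) * (1.0 - prevalence)   # false-positive fraction
    tn = specificity * (1.0 - prevalence)           # true-negative fraction
    fn = (1.0 - sensitivity) * prevalence           # false-negative fraction
    return tp / (tp + fp), tn / (tn + fn)

# Assumed prevalence: deaths / cases in the 2018+ dataset.
prevalence = 3007 / 32348
ppv, npv = predictive_values(0.6940, 0.8532, prevalence)
print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")

# Same sensitivity and specificity at the 2010-2015 mortality rate (29.5%),
# shown only to illustrate how strongly PPV depends on prevalence.
ppv_high, _ = predictive_values(0.6940, 0.8532, 0.295)
print(f"PPV at 29.5% prevalence = {ppv_high:.1%}")
```

Under these assumptions the computed values come out at roughly 32.6% and 96.5%, in line with the reported 32.63% and 96.46%.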
For a summary of model performance, see Table 1 and Figure 1.
Table 1. Model performance comparison on the test, internal validation, and external validation datasets
| Model and dataset | Sensitivity | Specificity | Positive Likelihood Ratio | Negative Likelihood Ratio | Positive Predictive Value | Negative Predictive Value | Accuracy | ROC-AUC | Brier Score |
|---|---|---|---|---|---|---|---|---|---|
| DNN Test | 64.55% (95% CI 61.64% to 67.38%) | 91.95% (95% CI 90.84% to 92.98%) | 8.02 (95% CI 6.99 to 9.21) | 0.39 (95% CI 0.36 to 0.42) | 77.43% (95% CI 74.93% to 79.74%) | 85.85% (95% CI 84.84% to 86.80%) | 83.75% (95% CI 82.51% to 84.93%) | 0.8945 (95% CI 0.8826 to 0.9065) | 0.1154 |
| Logistic Regression Test | 64.45% (95% CI 61.54% to 67.29%) | 91.61% (95% CI 90.47% to 92.65%) | 7.68 (95% CI 6.71 to 8.79) | 0.39 (95% CI 0.36 to 0.42) | 76.65% (95% CI 74.15% to 78.98%) | 85.77% (95% CI 84.76% to 86.73%) | 83.47% (95% CI 82.23% to 84.66%) | 0.8862 (95% CI 0.8741 to 0.8987) | 0.1196 |
| Gradient Boosting Test | 66.09% (95% CI 63.21% to 68.89%) | 91.37% (95% CI 90.22% to 92.43%) | 7.66 (95% CI 6.71 to 8.75) | 0.37 (95% CI 0.34 to 0.40) | 76.61% (95% CI 74.15% to 78.90%) | 86.31% (95% CI 85.29% to 87.26%) | 83.80% (95% CI 82.57% to 84.98%) | 0.8938 (95% CI 0.8820 to 0.9057) | 0.1157 |
| DNN Internal Validation | 62.69% (95% CI 59.71% to 65.59%) | 91.88% (95% CI 90.77% to 92.91%) | 7.72 (95% CI 6.73 to 8.86) | 0.41 (95% CI 0.38 to 0.44) | 76.10% (95% CI 73.52% to 78.51%) | 85.66% (95% CI 84.67% to 86.59%) | 83.36% (95% CI 82.12% to 84.55%) | 0.8893 (95% CI 0.8774 to 0.9006) | 0.1175 |
| Logistic Regression Internal Validation | 61.66% (95% CI 58.67% to 64.58%) | 92.15% (95% CI 91.05% to 93.16%) | 7.86 (95% CI 6.83 to 9.04) | 0.42 (95% CI 0.39 to 0.45) | 76.42% (95% CI 73.80% to 78.84%) | 85.36% (95% CI 84.37% to 86.29%) | 83.25% (95% CI 82.00% to 84.45%) | 0.8827 (95% CI 0.8703 to 0.8944) | 0.1205 |
| Gradient Boosting Internal Validation | 64.55% (95% CI 61.61% to 67.42%) | 91.00% (95% CI 89.83% to 92.07%) | 7.17 (95% CI 6.30 to 8.17) | 0.39 (95% CI 0.36 to 0.42) | 74.73% (95% CI 72.20% to 77.11%) | 86.16% (95% CI 85.16% to 87.11%) | 83.28% (95% CI 82.03% to 84.47%) | 0.8883 (95% CI 0.8761 to 0.8995) | 0.1176 |
| DNN External Validation | 69.40% (95% CI 67.72% to 71.05%) | 85.32% (95% CI 84.91% to 85.72%) | 4.73 (95% CI 4.56 to 4.90) | 0.36 (95% CI 0.34 to 0.38) | 32.63% (95% CI 31.84% to 33.44%) | 96.46% (95% CI 96.27% to 96.64%) | 83.84% (95% CI 83.43% to 84.24%) | 0.8758 (95% CI 0.8698 to 0.8816) | 0.1182 |
Feature importance
The SHAP summary plots highlight the features that most influenced the prediction of patient mortality. The plots show the average impact of each feature on the model's output; Ta, Age, Stage I, Size, and Grade IV were the most influential.
Among the tumor stage features, Ta and Stage I had the greatest impact on predicted mortality. Their presence was associated with a lower risk of mortality, whereas their absence increased the risk. Age was also an important predictor, with higher age increasing the risk of mortality. Tumor size contributed substantially as well, with larger tumors indicating a greater risk of mortality. Grade IV had a smaller but still notable effect: patients with anaplastic tumors had a greater chance of death (see Figure 2).
The partial dependence plot revealed that the relationship between age and mortality was not linear but approximately quadratic, with greater age markedly increasing the risk of mortality. In contrast, tumor size showed a more linear relationship with mortality. Age and tumor size were also positively related to each other (see Figure 3).
Deep neural network architecture
The deep neural network began with an input layer that accepted 31 features. The input was passed through three dense layers, each with 100 units, using the leaky ReLU activation function, L2 regularization, and normal weight initialization, followed by batch normalization. The output layer was a dense layer with a single unit and a sigmoid activation function, which is suitable for binary classification.
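To give a sense of the model's size, the parameter count implied by this description can be worked out by hand. The sketch below assumes that every dense layer has a bias term and that batch normalization follows each of the three hidden layers with four vectors per layer (scale, shift, moving mean, moving variance); the text does not state these details, so the total is indicative only.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer (bias assumed present)."""
    return n_in * n_out + n_out

def batch_norm_params(n_units):
    """Scale, shift, moving mean, and moving variance: four values per unit."""
    return 4 * n_units

n_features, width, n_hidden = 31, 100, 3  # sizes taken from the text

total = 0
n_in = n_features
for _ in range(n_hidden):
    # Assumption: one batch-normalization layer after each hidden dense layer.
    total += dense_params(n_in, width) + batch_norm_params(width)
    n_in = width
total += dense_params(n_in, 1)  # single sigmoid output unit
print(total)  # 24701
```

Under these assumptions the network has about 24,700 parameters, of which the 600 moving statistics are not updated by gradient descent.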
The model was compiled with the stochastic gradient descent (SGD) optimizer using a learning rate of 0.001, a momentum of 0.9, and Nesterov momentum enabled. The loss function was binary cross-entropy, and the evaluation metrics were accuracy and AUC.
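The optimizer settings can be illustrated with a single-parameter update. The sketch below implements one common formulation of SGD with Nesterov momentum and applies it to a toy quadratic loss; the exact update rule used by the study's framework is not stated in the text, so both the formulation and the toy loss are assumptions.

```python
def nesterov_sgd_step(w, velocity, grad, learning_rate=0.001, momentum=0.9):
    """One SGD update with Nesterov momentum for a scalar parameter."""
    velocity = momentum * velocity - learning_rate * grad
    # Nesterov look-ahead: apply the momentum-scaled velocity plus a fresh gradient step.
    w = w + momentum * velocity - learning_rate * grad
    return w, velocity

# Toy loss f(w) = w**2 with gradient 2*w, starting from w = 1.
w, v = 1.0, 0.0
first_w, first_v = nesterov_sgd_step(w, v, grad=2.0 * w)
for _ in range(2000):
    w, v = nesterov_sgd_step(w, v, grad=2.0 * w)
print(first_w, abs(w) < 1e-3)
```

With a learning rate of 0.001 the first step moves the parameter only from 1.0 to about 0.9962, which is why momentum is useful for making steady progress at such a small step size.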
Early stopping and learning-rate reduction on plateau were used to optimize the training process. Training was stopped early if the validation loss did not improve for 8 epochs, and the best weights were restored. The learning rate was reduced when the validation loss plateaued, helping to fine-tune the model further.
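The interaction of the two training controls can be sketched by replaying them over a validation-loss history. In the example below, the early-stopping patience of 8 epochs comes from the text, whereas the learning-rate patience (3 epochs), the reduction factor (0.5), and the loss curve itself are hypothetical values chosen for illustration. The logic is also simplified: both controls share a single record of the best loss.

```python
def simulate_callbacks(val_losses, lr=0.001, stop_patience=8,
                       lr_patience=3, lr_factor=0.5):
    """Replay early stopping and plateau-based LR reduction on a loss history.

    Returns (best_epoch, last_epoch_run, final_lr); epochs are 0-indexed.
    """
    best_loss, best_epoch, last_epoch = float("inf"), None, None
    stop_wait = lr_wait = 0
    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        if loss < best_loss:
            # Improvement: remember this epoch (stands in for saving weights).
            best_loss, best_epoch = loss, epoch
            stop_wait = lr_wait = 0
            continue
        stop_wait += 1
        lr_wait += 1
        if lr_wait >= lr_patience:
            lr *= lr_factor   # plateau: shrink the learning rate
            lr_wait = 0
        if stop_wait >= stop_patience:
            break             # stop; weights from best_epoch would be restored
    return best_epoch, last_epoch, lr

# Made-up loss curve: improves for four epochs, then plateaus.
history = [0.50, 0.45, 0.42, 0.40] + [0.41] * 10
best_epoch, last_epoch, final_lr = simulate_callbacks(history)
print(best_epoch, last_epoch, final_lr)
```

On this made-up curve the best epoch is index 3, the learning rate is halved twice during the plateau, and training stops at index 11, eight epochs after the last improvement.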