Study Design
This retrospective study was approved by the institutional review board at Johns Hopkins Hospital with a waiver of written informed consent (IRB00349673). Data were collected from three groups of PCa patients who underwent [18F]DCFPyL PET/CT imaging: 275 patients in Cohort 1 (January 2015 to December 2018) enrolled in a research setting from a previously described study [17]; the first 64 consecutive patients in Cohort 2 (October 2021 to November 2022) following clinical approval of the radiotracer at our institution; and 19 patients from an external institution in Cohort 3 (January 2017 to December 2023). All patients had [18F]DCFPyL PET/CT imaging and pathological/clinical confirmation of PCa diagnosis. Cases with poor image quality, artifacts, or absence of lesions with radiotracer uptake were excluded, resulting in a final dataset for PSMA-RADS scoring of 238 patients from Cohort 1, 36 patients from Cohort 2, and 19 patients from Cohort 3. In Cohort 1, patients were randomly assigned to training (n = 172) and internal test sets (n = 66), while Cohort 2 and Cohort 3 served as prospective and external test sets, respectively.
Each patient’s chart was reviewed for pathology reports and/or follow-up imaging to categorize lesions on the initial PSMA PET/CT as benign or malignant and to assess treatment response. Of 191 patients with available follow-up data for malignancy classification, 9 were excluded because lesion removal precluded assessment of lesion progression, leaving 182 patients eligible for treatment response evaluation and survival analysis. A detailed flowchart of patient inclusion and exclusion is shown in FIGURE 1. Additionally, clinical variables such as age, race, height, weight, body mass index, PSA levels, Gleason scores, imaging indications, relapse, survival status, therapeutic lines, and the interval between baseline and follow-up scans were collected. These clinical variables were compared across datasets (SUPPLEMENTARY TABLES 1–4).
Lesion segmentation
All lesions were segmented in the axial plane using Mirada DBx software on a per-slice basis, as previously published [17]. Two radiologists (LZ, YM) performed manual lesion segmentation, which was subsequently reviewed and revised as needed by a third radiologist (HB).
PSMA-RADS scoring and malignancy evaluation
Each lesion was assigned a PSMA-RADS score (PSMA-RADS version 1.0) [9] by two radiologists (HW, LZ). Disagreements were resolved by a third radiologist (YM). For binary PSMA-RADS classification, lesions were grouped by score: PSMA-RADS-1 and -2 lesions formed one group, and PSMA-RADS-3, -4, and -5 lesions formed the other. The training, internal, prospective, and external test sets comprised 2125, 915, 300, and 223 lesions, respectively.
For malignancy evaluation, two radiologists (LZ and HB) labeled lesions as malignant based on pathology confirmation or follow-up imaging (MRI/CT/PET) showing size changes greater than 2 mm, newly enlarged lymph nodes (over 10 mm), or bone destruction/formation [18]. Lesions not meeting these criteria were labeled as benign. For the benign versus malignant classification task, the training, internal, prospective, and external test sets comprised 1217, 370, 168, and 210 lesions, respectively. The distribution of PSMA-RADS scores and malignancy categories across the datasets is detailed in SUPPLEMENTARY TABLE 5.
Lesion treatment response and survival evaluation
For treatment response assessment, two radiologists (LZ and HB) labeled lesions as progressive if follow-up imaging (CT, MRI, or PET/CT) showed enlargement of over 2 mm, newly enlarged lymph nodes (over 10 mm), or bone destruction/formation [18]. Lesions that remained stable or shrank by more than 2 mm were labeled as non-progressive. Survival status at the endpoint (the date of last follow-up) was also collected for each patient.
Deep learning model training and visualization
The models were implemented using PyTorch [19] and MONAI [20] and trained on an NVIDIA GeForce RTX 3090 GPU. For PSMA-RADS score classification and benign-malignant categorization, seven models were developed using PET and CT data (SUPPLEMENTARY FIGURE 1). Two models used single-modality (PET or CT) input, while five models applied fusion strategies to combine the modalities. For the treatment response and survival prediction tasks, three models were developed using PET, CT, or clinical data individually. Two additional models combined either PET or CT with clinical data, and one model integrated both imaging modalities with clinical data.
For single-modality models, a 3D DenseNet architecture [21] was initialized with one input channel for image features. The model began with a 3D convolutional layer, followed by batch normalization, ReLU activation, and max-pooling layers. After feature extraction in this initial stage, six densely connected layers formed the first dense block. Each dense layer consisted of two convolutional layers with batch normalization and ReLU activation in between. This densely connected structure facilitated the propagation and reuse of features across the network, enhancing representational power. Transition blocks incorporating average pooling separated the dense blocks. The subsequent dense blocks had 12, 24, and 16 layers, respectively, with transition blocks in between. Following the fourth dense block, the model applied batch normalization, ReLU activation, global average pooling, and flattening, and the resulting features entered a linear layer for classification with two output features (SUPPLEMENTARY FIGURE 1).
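As an illustration of the dense-block design described above, a minimal pure-PyTorch sketch of one 3D dense block follows; the class names, growth rate, and bottleneck width are illustrative choices, not the exact configuration used in the study.

```python
import torch
import torch.nn as nn

class DenseLayer3D(nn.Module):
    """One dense layer: two 3D convolutions with batch normalization
    and ReLU activation, producing `growth` new feature channels."""
    def __init__(self, in_ch: int, growth: int = 16, bottleneck: int = 4):
        super().__init__()
        inter = bottleneck * growth
        self.body = nn.Sequential(
            nn.BatchNorm3d(in_ch), nn.ReLU(inplace=True),
            nn.Conv3d(in_ch, inter, kernel_size=1, bias=False),
            nn.BatchNorm3d(inter), nn.ReLU(inplace=True),
            nn.Conv3d(inter, growth, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # Concatenating input and new features is what enables
        # feature propagation and reuse across the block.
        return torch.cat([x, self.body(x)], dim=1)

class DenseBlock3D(nn.Sequential):
    """A dense block: each layer sees the concatenated outputs of all
    previous layers, so channel count grows by `growth` per layer."""
    def __init__(self, num_layers: int, in_ch: int, growth: int = 16):
        super().__init__(*[DenseLayer3D(in_ch + i * growth, growth)
                           for i in range(num_layers)])
```

A full model would stack four such blocks (6, 12, 24, and 16 layers) with average-pooling transition blocks in between, as described above.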
Using this DenseNet framework, three late-fusion models were developed. PET and CT data were processed separately using the described architecture, and their features were fused prior to the classification layer using one of three strategies: (1) Multi-Layer Perceptron and Self-Attention [22], (2) Squeeze and Excitation (SE) with Sigmoid Activation [23], or (3) convolution blocks. These strategies were termed Output Transformer, Output SE, and Output Convolution, respectively. Two early-fusion models combined the two modalities prior to the first dense block using either (1) 3D convolution or (2) concatenation, termed Input Convolution and Input Concatenation, respectively.
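The Output SE strategy can be sketched as follows, assuming pooled per-modality feature vectors of a fixed dimension; the class name, feature dimension, and reduction ratio are illustrative, not the study's exact implementation.

```python
import torch
import torch.nn as nn

class SEFusion(nn.Module):
    """Late fusion: concatenate pooled PET and CT feature vectors, then
    re-weight the fused channels with a squeeze-and-excitation gate
    (sigmoid-activated) before the final classification layer."""
    def __init__(self, feat_dim: int = 256, num_classes: int = 2,
                 reduction: int = 8):
        super().__init__()
        fused = 2 * feat_dim
        self.gate = nn.Sequential(
            nn.Linear(fused, fused // reduction), nn.ReLU(inplace=True),
            nn.Linear(fused // reduction, fused), nn.Sigmoid(),
        )
        self.classifier = nn.Linear(fused, num_classes)

    def forward(self, pet_feat, ct_feat):
        z = torch.cat([pet_feat, ct_feat], dim=1)  # (batch, 2*feat_dim)
        return self.classifier(z * self.gate(z))   # gated channel weighting
```

The early-fusion variants differ only in where the merge happens: the two volumes are concatenated (or convolved together) channel-wise before the first dense block instead of at the feature-vector stage.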
The models were trained with AdamW optimizer [24] using a learning rate of 1e-5, a batch size of 10, and 1000 epochs. PET/CT images from both datasets were preprocessed uniformly, including conversion of PET intensities to Standardized Uptake Value corrected for body weight (SUVbw) and normalization to the [0,1] range. Volumes were resampled to a 2 mm slice thickness and cropped to a size of 96×96×96 voxels, focusing on normal tissues, PSMA lesions, and surrounding areas. Final models were selected based on classification accuracy on a training subset.
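The preprocessing steps can be sketched with NumPy as follows; for brevity the SUV conversion omits decay correction and assumes a tissue density of 1 g/mL, and the function names are illustrative.

```python
import numpy as np

def to_suv_bw(activity_bq_ml, injected_dose_bq, weight_kg):
    """Convert PET activity concentration to body-weight SUV:
    SUVbw = activity (Bq/mL) * body weight (g) / injected dose (Bq).
    Simplified: no decay correction, density assumed 1 g/mL."""
    return activity_bq_ml * (weight_kg * 1000.0) / injected_dose_bq

def min_max_normalize(vol):
    """Scale a volume to the [0, 1] range."""
    lo, hi = vol.min(), vol.max()
    return (vol - lo) / (hi - lo) if hi > lo else np.zeros_like(vol)

def center_crop(vol, size=96):
    """Crop a cubic patch of `size` voxels around the volume centre."""
    starts = [max(0, (s - size) // 2) for s in vol.shape]
    return vol[tuple(slice(st, st + size) for st in starts)]
```

Resampling to the 2 mm grid would precede the crop; in practice this is typically done with a MONAI spacing transform rather than by hand.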
For PSMA-RADS score classification and benign-malignant categorization, prediction probability scores and Uniform Manifold Approximation and Projection (UMAP) [25] feature reduction analysis were applied. The final layer of the DL model provided prediction probabilities for the PSMA-RADS group or malignancy status, and the argmax function determined the predicted class. Probability scores for the PSMA-RADS-3, -4, and -5 group and the malignancy class were visualized in UMAP space.
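The probability-and-argmax step can be sketched as below; the logits are hypothetical, and the UMAP projection (via the umap-learn package) is indicated only as a comment.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over the model's final-layer logits."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical logits for two lesions
# (columns: PSMA-RADS-1/2 group vs. PSMA-RADS-3/4/5 group).
logits = np.array([[2.0, 0.5],
                   [-1.0, 1.5]])
probs = softmax(logits)       # per-class probability scores
preds = probs.argmax(axis=1)  # argmax gives the predicted class

# For visualization, penultimate-layer features would be projected to 2D,
# e.g. umap.UMAP(n_components=2).fit_transform(features), and coloured by
# the probability score of the class of interest.
```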
The treatment response model incorporated both image-based and clinical data. For image-based inputs, PSMA segmentation masks were used to create masked CT or PET images, which were processed through a DenseNet architecture pre-trained on ImageNet, with four additional predictive layers. Clinical data were processed through a neural network with dense layers of 16, 32, and 2 nodes, utilizing 13 clinical variables to differentiate between non-progressive and progressive treatment outcomes. The final model combined predictions from the image-based models (CT, PET, or both) and the clinical data-based model through a weighted sum.
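A minimal sketch of the clinical branch and the weighted-sum fusion is shown below, assuming softmax probabilities from each branch; the fusion weight and the placeholder image-branch output are illustrative, as their values are not reported here.

```python
import torch
import torch.nn as nn

# Clinical branch: 13 clinical variables -> dense layers of 16, 32, 2 nodes.
clinical_net = nn.Sequential(
    nn.Linear(13, 16), nn.ReLU(),
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 2),
)

def fuse(image_probs, clinical_probs, w=0.5):
    """Weighted sum of image-based and clinical branch probabilities.
    The weight w = 0.5 is a hypothetical value for illustration."""
    return w * image_probs + (1 - w) * clinical_probs

x_clin = torch.randn(4, 13)                        # 4 hypothetical patients
p_clin = torch.softmax(clinical_net(x_clin), dim=1)
p_img = torch.full((4, 2), 0.5)                    # placeholder image branch
p_final = fuse(p_img, p_clin)                      # final class probabilities
```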
For survival prediction, a time-to-event model was employed to estimate the probability of reaching critical outcomes (i.e., death). The image-based model extracted 256-dimensional features from the dense layer of the treatment prediction model, while the clinical data model used the same 13 clinical variables as before. These features were fed into a survival forest model to calculate survival probability scores. The final survival probability for each patient was derived from the weighted sum of the image-based and clinical-data-based risk scores, and model performance was evaluated using time-to-event analysis across different configurations of image and clinical data models.
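The survival labels and the risk-score combination can be sketched as follows; the structured-label format matches what a survival forest implementation such as scikit-survival's RandomSurvivalForest expects, and all values, including the weight, are hypothetical.

```python
import numpy as np

# Right-censored survival labels as a NumPy structured array:
# (event indicator, time-to-event); times here are illustrative months.
y = np.array([(True, 14.0), (False, 30.5), (True, 7.2)],
             dtype=[("event", bool), ("time", float)])

# Hypothetical per-patient risk scores from the two branches, e.g. as
# produced by survival forests fit on image features and clinical data.
image_risk = np.array([0.8, 0.2, 0.9])
clinical_risk = np.array([0.6, 0.3, 0.7])

# Final risk: weighted sum of the two branches (weight is illustrative).
w = 0.5
final_risk = w * image_risk + (1 - w) * clinical_risk
```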
Statistical analysis
Statistical analysis and data preprocessing were performed using Python v. 3.10.12. For PSMA-RADS score classification, several performance metrics were calculated including accuracy, area under the receiver operating characteristic curve (AUROC), weighted F1 score, precision, recall, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The best-performing model in the internal and prospective test sets, based on accuracy and AUROC, was selected for evaluation on the external test set. Bootstrap resampling (1000 samples) was employed to calculate 95% confidence and tolerance intervals for ROC curves and accuracy.
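The percentile bootstrap for the 95% confidence intervals can be sketched as below; this minimal NumPy version covers accuracy only (the tolerance-interval computation is omitted), and the function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05):
    """Percentile-bootstrap confidence interval for a classification
    metric: resample cases with replacement, recompute the metric,
    and take the alpha/2 and 1 - alpha/2 quantiles."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        stats.append(metric(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def accuracy(y_true, y_pred):
    return (y_true == y_pred).mean()
```

The same resampling loop applies to AUROC by substituting a metric that takes predicted probabilities instead of class labels.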
The impact of demographic and clinical variables on classification accuracy was assessed using Chi-square tests for categorical variables and t-tests or ANOVA for continuous variables. Consistency between PSMA-RADS scores, model outputs, and ground truth malignancy was evaluated using Intra-class Correlation Coefficient (ICC).
For survival prediction models, accuracy was evaluated with the concordance index (C-index) to account for right-censored data, correlating treatment response with predicted survival probabilities. Patient stratification for survival analysis was performed using the Kaplan-Meier method, with statistical significance assessed via the log-rank test, based on predicted survival probabilities. Time-dependent ROC-AUC was calculated to evaluate survival prediction models over time, and precision-recall curves were generated to quantify precision, recall, and F1 scores. Statistical significance was defined as P < 0.05.
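For reference, a naive O(n²) sketch of the C-index for right-censored data is given below; it covers only the basic comparable-pair definition, whereas published analyses would typically use a library implementation such as lifelines or scikit-survival.

```python
import numpy as np

def concordance_index(times, events, risk):
    """Naive C-index for right-censored data. A pair (i, j) is comparable
    when subject i has an observed event before time j; the pair is
    concordant when the earlier event has the higher predicted risk.
    Tied risk scores count as 0.5."""
    times, events, risk = map(np.asarray, (times, events, risk))
    num = den = 0.0
    for i in range(len(times)):
        if not events[i]:          # censored subjects cannot anchor a pair
            continue
        for j in range(len(times)):
            if times[j] > times[i]:
                den += 1
                if risk[i] > risk[j]:
                    num += 1
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den
```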