After selecting non-metastatic, primary oropharyngeal squamous cell carcinomas (OPSCC) from four datasets on The Cancer Imaging Archive (TCIA) [5], the training set consisted of 1223 volumetric CTs with associated primary GTV reference segmentations delineated by radiation oncologists (please see Data Availability). A holdout validation subset of 20% of each collection was created by random sampling without replacement (holdout sample sizes: HN1 = 18, Montreal = 48, OPC = 101 and MDA = 80). A single private institutional dataset, “HN3” (n = 154), was used entirely as an independent test set; HN3 was drawn from the same demographic population as HN1 (please refer to the Data Availability statement).
A deep learning architecture was built on top of a squeeze-and-excitation normalization model that has previously been investigated for HNC GTV segmentation [2,6]. In brief, the model was a 3D U-Net with ResNet elements, in which a squeeze-and-excitation normalization block followed each convolutional block, allowing the network to learn a suitable normalization of Hounsfield units during training. We added attention gates at each spatial resolution level after each skip connection, with the aim of accelerating localization of the GTV and enhancing neuron activations in relevant regions [7]. Additional up-sampling paths were added so that low-resolution features could contribute further along the network. Finally, network attention was visualized using Grad-CAM++ [8].
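For illustration, a minimal PyTorch sketch of a squeeze-and-excitation style convolutional stage of the kind described above is given below; the module names, channel counts and reduction ratio are illustrative only and do not reproduce the exact published architecture [2,6].

```python
import torch
import torch.nn as nn

class SEBlock3D(nn.Module):
    """Squeeze-and-excitation channel recalibration for 3D feature maps."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)      # "squeeze": one descriptor per channel
        self.fc = nn.Sequential(                 # "excitation": per-channel gating weights
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c = x.shape[:2]
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1, 1)
        return x * w                             # rescale each feature channel

class ConvSEStage3D(nn.Module):
    """One encoder stage: convolution, normalization, then SE recalibration."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.se = SEBlock3D(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.se(self.conv(x))
```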
To simulate model development and weight updating via multi-institutional federated deep learning, we intentionally isolated each of the aforementioned training datasets. Identical copies of the deep learning architecture were trained entirely separately on each dataset, and only the model weights of the four partial models were combined into one global model using the synchronous federated averaging (FedAvg) algorithm [9]. The averaged weights were then used as the starting state for the next epoch of training. During each training epoch, each partial model iterated through all of its training samples. The experiments thus consisted of performing the federated averaging (i) every epoch (“FedAvg1”), (ii) every 5 epochs (“FedAvg5”) and (iii) every 10 epochs (“FedAvg10”), for a total of 100 epochs in each experiment.
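A minimal sketch of this synchronization step is given below (PyTorch; the function names and the sample-size weighting follow the original FedAvg formulation [9], while the local training routine is left as a hypothetical callback and is not the exact implementation used in this work).

```python
import copy
from typing import Dict, List
import torch

def federated_average(partial_states: List[Dict[str, torch.Tensor]],
                      n_samples: List[int]) -> Dict[str, torch.Tensor]:
    """Combine per-institution model weights into one global model (FedAvg).

    Each partial state dict is weighted by the size of its local training set,
    as in the original FedAvg formulation [9].
    """
    total = float(sum(n_samples))
    global_state = copy.deepcopy(partial_states[0])
    for key in global_state:
        global_state[key] = sum(
            state[key] * (n / total) for state, n in zip(partial_states, n_samples)
        )
    return global_state

def federated_training(models, loaders, n_samples, local_train_fn,
                       total_epochs=100, avg_every=1):
    """Outer schedule: isolated local training, synchronized every `avg_every` epochs."""
    for epoch in range(total_epochs):
        for model, loader in zip(models, loaders):
            local_train_fn(model, loader)            # one local epoch per institution
        if (epoch + 1) % avg_every == 0:             # avg_every = 1, 5 or 10 (FedAvg1/5/10)
            global_state = federated_average(
                [m.state_dict() for m in models], n_samples)
            for m in models:
                m.load_state_dict(global_state)      # averaged weights seed the next epoch
```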
The median and interquartile range of the Dice similarity coefficient (DSC) scores are summarized in Table 1. The holdout-validation performance was marginally best for FedAvg1 in HN1 and Montreal, and for FedAvg10 in OPC and MDA. In the external test dataset HN3, FedAvg1 performed best. Overall, the performance of FedAvg1 was functionally equivalent to the situation where all the training data were combined into one single set and trained on in its entirety (i.e. “centralized”). The best 95th percentile Hausdorff distance (HD95) in holdout validation was observed for FedAvg10: 7.2, 9.0, 9.9, 9.7 and 11.9 mm for HN1, OPC, MDA, Montreal and HN3 (test), respectively. In comparison, the corresponding metrics from centralized training were 7.9, 9.2, 10.4, 13.8 and 9.2 mm, respectively.
Table 1
Median Dice similarity coefficients (DSC) of predicted versus reference GTV segmentations in holdout validation subjects, as a function of synchronous federated averaging every 1, 5 or 10 epochs. The interquartile range of the DSC is given in parentheses after each median value. The asterisk (*) indicates the best median DSC in each dataset. The centralized training DSC results are given for comparison. HN1, OPC, MDA and Montreal are publicly available on TCIA. HN3 is a completely separate independent test set; it is a private dataset that may be requested from the institution.
| Model | HN1 | OPC | MDA | Montreal | HN3 (test) |
| --- | --- | --- | --- | --- | --- |
| FedAvg1 | 0.73* (0.16–0.77) | 0.65 (0.55–0.73) | 0.67 (0.45–0.78) | 0.63* (0.46–0.75) | 0.62* (0.33–0.76) |
| FedAvg5 | 0.72 (0.60–0.74) | 0.66 (0.55–0.73) | 0.65 (0.49–0.79) | 0.52 (0.41–0.72) | 0.55 (0.30–0.73) |
| FedAvg10 | 0.72 (0.59–0.79) | 0.68* (0.54–0.76) | 0.69* (0.50–0.77) | 0.62 (0.46–0.73) | 0.59 (0.30–0.74) |
| Centralized | 0.68 (0.20–0.77) | 0.69 (0.52–0.76) | 0.69 (0.42–0.72) | 0.61 (0.42–0.72) | 0.65 (0.41–0.74) |
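For reference, the two reported metrics can be computed from binary segmentation masks as in the following sketch (numpy/scipy; the surface definition and function names are illustrative and are not the exact evaluation code used in this work; both masks are assumed non-empty and the voxel spacing is given in mm).

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice_coefficient(pred: np.ndarray, ref: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    return 2.0 * np.logical_and(pred, ref).sum() / denom if denom else 1.0

def hd95(pred: np.ndarray, ref: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> float:
    """95th percentile symmetric Hausdorff distance (mm) between mask surfaces."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    # Surface voxels: foreground voxels removed by a single erosion step.
    pred_surf = pred & ~binary_erosion(pred)
    ref_surf = ref & ~binary_erosion(ref)
    # Euclidean distance of every voxel to the nearest surface voxel of each mask.
    d_to_ref = distance_transform_edt(~ref_surf, sampling=spacing)
    d_to_pred = distance_transform_edt(~pred_surf, sampling=spacing)
    distances = np.concatenate([d_to_ref[pred_surf], d_to_pred[ref_surf]])
    return float(np.percentile(distances, 95))
```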
The DSC of each subject in the holdout and test sets is presented in Fig. 1. Statistically significant differences (p < 0.05) in DSC were detected neither within datasets across the training methods nor between centralized training and any of the federated averaging methods. P-values were calculated using two-sided Mann-Whitney-Wilcoxon tests with Bonferroni correction for multiple testing.
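A minimal sketch of these pairwise comparisons within one dataset is shown below (scipy; the dictionary structure is hypothetical, and the Bonferroni correction is applied to the significance threshold rather than to the p-values, which is equivalent).

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

def pairwise_dsc_tests(dsc_by_method: dict, alpha: float = 0.05) -> dict:
    """Two-sided Mann-Whitney-Wilcoxon tests between all pairs of training methods.

    `dsc_by_method` maps a method name (e.g. "FedAvg1", "Centralized") to the
    list of per-subject DSC values for one dataset.
    """
    pairs = list(combinations(dsc_by_method, 2))
    corrected_alpha = alpha / len(pairs)          # Bonferroni correction
    results = {}
    for a, b in pairs:
        _, p = mannwhitneyu(dsc_by_method[a], dsc_by_method[b],
                            alternative="two-sided")
        results[(a, b)] = (p, p < corrected_alpha)
    return results
```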
However, the differences in performance as a function of total GTV size were statistically significant. Figure 2 shows the DSC for holdout validation subjects (and test subjects in the case of HN3) in each dataset, grouped according to total GTV size; only results of the FedAvg1 model are presented in the figure. The GTV size groupings were GTV ≥ 10 cm³ versus GTV < 10 cm³. Model performance on the smaller GTVs was poorer in terms of both median DSC and interquartile spread, which had the overall effect of dragging the median DSC results downwards. The median scores for the smaller versus the larger tumors are summarized in Table 2. The effect is clearly unrelated to federated learning, since the same size dependence is observed with centralized training. In terms of DSC, the FedAvg1 model is functionally equivalent to the centrally-trained model across both the GTV ≥ 10 cm³ and GTV < 10 cm³ subgroups.
Table 2
Median Dice similarity coefficients (DSC) of the FedAvg1, FedAvg5, FedAvg10 and centrally-trained models in the holdout validation subjects, grouped according to GTV size ≥ 10 cm³ versus GTV < 10 cm³. Interquartile ranges are given in parentheses, as in Table 1. The differences in DSC between larger and smaller tumors were statistically significant, but differences due to averaging every 1, 5 or 10 epochs within a given tumor-size subgroup were not.
| Model | GTV size | HN1 | OPC | MDA | Montreal | HN3 (test) |
| --- | --- | --- | --- | --- | --- | --- |
| FedAvg1 | ≥ 10 cm³ | 0.77 (0.74–0.80) | 0.70 (0.61–0.76) | 0.72 (0.56–0.80) | 0.67 (0.53–0.76) | 0.72 (0.55–0.78) |
| FedAvg1 | < 10 cm³ | 0.01 (0.01–0.20) | 0.51 (0.40–0.60) | 0.28 (0.20–0.42) | 0.45 (0.30–0.66) | 0.36 (0.01–0.59) |
| FedAvg5 | ≥ 10 cm³ | 0.74 (0.72–0.77) | 0.69 (0.64–0.77) | 0.69 (0.59–0.79) | 0.61 (0.42–0.72) | 0.70 (0.55–0.76) |
| FedAvg5 | < 10 cm³ | 0.21 (0.10–0.66) | 0.52 (0.40–0.61) | 0.45 (0.15–0.50) | 0.45 (0.36–0.53) | 0.27 (0.12–0.44) |
| FedAvg10 | ≥ 10 cm³ | 0.77 (0.75–0.79) | 0.69 (0.64–0.78) | 0.73 (0.60–0.80) | 0.65 (0.54–0.74) | 0.69 (0.53–0.76) |
| FedAvg10 | < 10 cm³ | 0.48 (0.14–0.66) | 0.52 (0.42–0.62) | 0.36 (0.29–0.58) | 0.45 (0.40–0.62) | 0.29 (0.00–0.54) |
| Centralized | ≥ 10 cm³ | 0.76 (0.69–0.79) | 0.71 (0.62–0.77) | 0.71 (0.61–0.78) | 0.61 (0.48–0.75) | 0.72 (0.60–0.78) |
| Centralized | < 10 cm³ | 0.06 (0.05–0.35) | 0.52 (0.40–0.66) | 0.30 (0.21–0.44) | 0.45 (0.27–0.68) | 0.41 (0.24–0.63) |
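The GTV volume used for this grouping can be obtained directly from the reference segmentation mask and the CT voxel spacing; a minimal illustrative sketch (numpy; function names and the default spacing are hypothetical) is:

```python
import numpy as np

def gtv_volume_cm3(mask: np.ndarray, spacing_mm=(1.0, 1.0, 3.0)) -> float:
    """Total GTV volume in cm³ from a binary mask and its voxel spacing in mm."""
    voxel_volume_mm3 = float(np.prod(spacing_mm))
    return mask.astype(bool).sum() * voxel_volume_mm3 / 1000.0   # 1 cm³ = 1000 mm³

def size_group(mask: np.ndarray, spacing_mm) -> str:
    """Assign a subject to the two size strata used in Table 2 and Fig. 2."""
    return ">= 10 cm³" if gtv_volume_cm3(mask, spacing_mm) >= 10.0 else "< 10 cm³"
```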
Examples of automated segmentations produced by the FedAvg1 model in validation subjects are provided in Fig. 3. Overall, the model accurately located the center of the GTV, but the outermost boundaries of the tumor were often not in exact agreement with the radiation oncologists’ delineations. Grad-CAM activation maps were only partially helpful: the vast majority of subjects showed activations near the GTV, but the GTV was not always consistently contained within the strongly activating region.
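To illustrate how such activation maps can be generated, a minimal Grad-CAM-style sketch for a 3D segmentation network is given below; the published visualizations used Grad-CAM++ [8], which additionally reweights the gradients with higher-order terms, and the `model` and `target_layer` handles here are hypothetical.

```python
import torch
import torch.nn.functional as F

def gradcam_volume(model, target_layer, ct_volume):
    """Gradient-weighted activation map for a 3D segmentation model.

    `model` returns GTV logits of shape (1, 1, D, H, W), `target_layer` is the
    convolutional block to inspect, and `ct_volume` is a (1, 1, D, H, W) input.
    Returns an attention volume normalized to [0, 1].
    """
    store = {}

    def fwd_hook(module, inputs, output):
        store["acts"] = output.detach()

    def bwd_hook(module, grad_input, grad_output):
        store["grads"] = grad_output[0].detach()

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        logits = model(ct_volume)
        score = logits.sigmoid().sum()     # scalar target: total predicted GTV probability
        model.zero_grad()
        score.backward()
    finally:
        h1.remove()
        h2.remove()

    weights = store["grads"].mean(dim=(2, 3, 4), keepdim=True)   # per-channel importance
    cam = torch.relu((weights * store["acts"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=ct_volume.shape[2:], mode="trilinear",
                        align_corners=False)
    return cam / (cam.max() + 1e-8)
```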
Lastly, Fig. 4 shows the training and validation loss curves obtained during the centrally-trained, FedAvg1, FedAvg5 and FedAvg10 experiments. A small, periodically repeating spike appears every 1, 5 or 10 epochs due to the federated averaging step itself, in which the partial models trained on the disparate datasets are combined into a single global model that is then used as the starting state for the next training iteration. In addition, a larger spike occurs periodically, roughly every 10 epochs, as can be seen in the training and validation loss curves of FedAvg1. This larger spike is independent of the frequency of the federated averaging and appears to be a feature of the training mechanics that is not controllable with the typical hyperparameters. The pattern overlaps with the intentional averaging every 10 epochs in the FedAvg10 training, but manifests differently in the FedAvg5 training, with a significant jump near epoch 60. These transients were highly reproducible, being observed over three repetitions of all the experiments, and were furthermore independent of the weight initialization. Nonetheless, the training transient did not appear to affect the performance of the final selected model.