Neural networks evaluated herein include the UNet-based StarDist [19], Mask R-CNN [11], and the region-based UNet technique ANCIS [13]. The segmentation results from each technique were compared. Additionally, we tested various image pre- and post-processing techniques, and trained a UNet to detect and remove noise from our test images.
A. Initial Testing
We first trained our networks using the broad, publicly available Kaggle dataset. The Kaggle dataset includes a large set of 2D microscopic images of cell nuclei from cell culture and tissue across various imaging modalities, including conventional fluorescence, histology and bright-field microscopy. Although super-resolution fluorescence microscopy is a type of fluorescence microscopy, super-resolved images may or may not prove compatible with CNNs trained on conventional fluorescence microscopy images. Our goal here was to determine whether training directly on a super-resolution STORM image dataset is necessary. Each network was trained using the optimal number of epochs, steps and other factors determined by the authors of each method. Trained networks were then applied to our resized STORM images from the 512x512 tissue and cell line test sets. As shown in Figure 1A and Table 1, the best results were achieved using Mask R-CNN, but overall performance was poor. The F1 scores on the tissue and cell line datasets were only 0.181 and 0.073, respectively.
Table 1. Average test accuracy scores for Mask R-CNN trained on the Kaggle dataset and tested on super-resolution imagery
| Test Set     | Pre-Processing | F1-Score | FN    | Hausdorff |
|--------------|----------------|----------|-------|-----------|
| Colon Tissue | 512x512        | 0.181    | 0.819 | 14.07     |
|              | 256x256        | 0.268    | 0.619 | 8.68      |
|              | 256x256 Blur   | 0.262    | 0.635 | 8.03      |
|              | 256x256 HEq    | 0.352    | 0.501 | 8.22      |
| Cell Line    | 512x512        | 0.073    | 0.924 | 12.83     |
|              | 256x256        | 0.475    | 0.473 | 6.9       |
|              | 256x256 Blur   | 0.555    | 0.268 | 5.92      |
|              | 256x256 HEq    | 0.628    | 0.201 | 6.07      |
Average F1-Score, false negative percent (FN) and Hausdorff distance for a Mask R-CNN segmentation network model trained on the Kaggle dataset, and applied to both our super-resolution colon tissue and DNA labelled cell line datasets. The network was applied to our Colon Tissue and Cell Line image test sets (512x512 resolution), as well as to the downsized versions of each test set (256x256 resolution), and to Gaussian blurred (Blur) and histogram equalized (HEq) versions.
Next, we applied a set of pre-processing methods to improve performance. The STORM images have two unique aspects compared to conventional fluorescence images: inherently discontinuous structural features at the nearly molecular-scale resolution, and nearly zero "intensity" values in most background regions. We deliberately lowered the image "resolution" by downsizing to 256x256 and either blurring or altering the image contrast by histogram equalization (Figure 1B). These pre-processing methods did improve our test accuracy overall, while reducing the false negative percentage (Table 1). However, none of the Kaggle-trained models attained an average F1-Score exceeding 0.5 on the tissue dataset, and a top mark of 0.628 was achieved on the cell line dataset. Notable exceptions to the reported averaged results can be found when analyzing individual images. The Kaggle-trained Mask R-CNN performed markedly better on STORM images containing dense or uniform nuclear texture with clear borders in both cell line and tissue images (Supplementary Figure 1). The F1-scores of these individual segmentations were lower than those obtained with the STORM-trained Mask R-CNN network, but they demonstrate the potential of CNNs for the segmentation of super-resolution STORM images.
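The three pre-processing steps above (downsizing, Gaussian blurring, histogram equalization) can be sketched with generic numpy/scipy stand-ins. The specific kernel sizes and resampling implementations used in the paper are not stated, so the parameters below are illustrative only:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def downsize(img, size=256):
    """Resize a square image to size x size (bilinear interpolation)."""
    factor = size / img.shape[0]
    return zoom(img, factor, order=1)

def blur(img, sigma=1.5):
    """Gaussian blur to smooth the discontinuous STORM texture."""
    return gaussian_filter(img.astype(float), sigma=sigma)

def hist_equalize(img, nbins=256):
    """Global histogram equalization for 8-bit intensity images."""
    hist, bin_edges = np.histogram(img.flatten(), bins=nbins, range=(0, 255))
    cdf = hist.cumsum()
    cdf = 255 * cdf / cdf[-1]  # normalize the CDF to [0, 255]
    return np.interp(img.flatten(), bin_edges[:-1], cdf).reshape(img.shape)

# Synthetic 512x512 "STORM-like" sparse image for demonstration
rng = np.random.default_rng(0)
img = (rng.random((512, 512)) > 0.98).astype(np.uint8) * 255
small = downsize(img)                # 256x256 version
smooth = blur(small)                 # blurred version
equalized = hist_equalize(small)     # histogram-equalized version
```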
B. Optimization
Due to the generally poor results when applying the Kaggle-trained networks, we next trained each network directly on our STORM image datasets. We determined parameters for instance segmentation via a process of training and testing, and used test accuracy as the determining factor for best performance. Test accuracy, for optimization purposes, was assessed using the F1-Score at an IoU threshold of 0.7. Parameters varied included the number of epochs and number of training images for all networks, as well as the number of steps for Mask R-CNN and StarDist. Additional parameters, including learning rate and other network-specific variables, were optimized as well.
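The F1-Score at an IoU threshold of 0.7 can be sketched as follows. The exact matching procedure used in the paper is not specified; this version uses a simple greedy one-to-one matching between predicted and ground-truth instance masks:

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection-over-union of two boolean instance masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

def f1_at_iou(preds, truths, thresh=0.7):
    """A prediction counts as a true positive if it overlaps an
    as-yet-unmatched ground-truth instance with IoU >= thresh."""
    matched = set()
    tp = 0
    for p in preds:
        for i, t in enumerate(truths):
            if i not in matched and iou(p, t) >= thresh:
                matched.add(i)
                tp += 1
                break
    fp = len(preds) - tp
    fn = len(truths) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy example: one perfect detection and one spurious ground truth
truth = np.zeros((10, 10), bool); truth[2:8, 2:8] = True
pred = truth.copy()
score = f1_at_iou([pred], [truth])   # perfect match
```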
i. Tissue dataset
A typical trend observed was a quick rise in accuracy with increasing number of epochs, followed by fluctuation and eventual settling. After a larger number of epochs, network accuracy leveled off or dropped, likely due to overfitting (Figure 2A). Plotting test accuracy against number of steps showed a similar trend, with networks reaching an accuracy saturation point when using more than a few hundred steps. Note that ANCIS did not provide the ability to vary steps, but rather utilized a two-part training conducted first on the region-based localization network, then on the instance segmentation network.
For Mask R-CNN and StarDist, the optimal number of epochs occurred at about 400, whereas 200 epochs produced the peak F1-Score for ANCIS. Both UNet methods, ANCIS and StarDist, showed some decrease in test accuracy as epochs increased beyond the optimal range. Mask R-CNN, on the other hand, appeared to fluctuate. Increasing steps improved the accuracy of Mask R-CNN more linearly beyond 100 steps, without significant fluctuation until beyond 500 steps, suggesting nonlinear effects when using too many or too few epochs (Supplementary Figure 2). StarDist reached peak accuracy at 300 steps and then dropped off, as with increasing epochs, perhaps due to overfitting.
Increasing the size of the training set was expected to improve test accuracy, and this was generally found to be the case (Supplementary Figure 3). All networks improved significantly when increasing from 10 to 20 training images, with each image containing an average of 22 instances. ANCIS and StarDist continued to improve in a nearly linear trend beyond 20 images, whereas Mask R-CNN once again demonstrated a fluctuating trend. All networks performed best when using the entire available training set of 77 images, though satisfactory results could be obtained using fewer. False negative and false positive counts tended to be higher when using a smaller dataset, and overlapping detections occurred with greater frequency.
Varying the learning rate within a limited, though commonly used, range of 1e-3 to 1e-5 did not produce a great deviation in test accuracy; however, a lower learning rate tended to require more epochs to achieve the same accuracy. Since all learning rates within this range provided similar results, we selected a rate in the middle of the range, resulting in a common rate of 1e-4 for all networks.
ii. Cell Line
The trend for test accuracy versus number of epochs for the cell line data proved similar to that observed for the tissue dataset, but the F1-Score values were higher overall with less fluctuation (Figure 2B). More uniform shapes, less noise and greater spacing between instances (i.e., less clustering) may help account for the increased accuracy of nuclei segmentation on the cell line dataset versus the tissue dataset. The smoother accuracy-versus-epoch curves may also be accounted for by the reduced variability between target instances. The optimal number of epochs, steps and learning rate were found to be similar to, but not the same as, those from the tissue dataset. The optimal number of epochs was 400 for both StarDist and ANCIS, but 600 for Mask R-CNN. Steps versus test accuracy, however, progressed similarly to the results found for the tissue training set.
The effect of training set size was also determined for the cell line dataset, which consists of 65 training images. Performance overall improved for all networks with increasing dataset size (Supplementary Figure 3), although Mask R-CNN dropped in accuracy when using the entire dataset, from 0.869 to 0.831. Both ANCIS and StarDist fluctuated between 40 and 60 images, but maintained an overall upward trend in accuracy versus number of images. ANCIS demonstrated both the highest scores and least variation. Indeed, the F1-Score for ANCIS when trained on only 10 images was nearly 0.9, with each image containing an average of 4.2 nuclei. However, the false negative percent of this model was much higher than that of the model trained on the full dataset, 12.5% for the former versus 3% for the latter, as was the Hausdorff distance, 8.53 versus 6.39, respectively. Mask R-CNN demonstrated a similar trend, scoring fairly high even when trained on only 10 images, though its scores were not as high as those of the ANCIS model.
C. Network Testing
Following network training and optimization, nuclei segmentation was conducted on all test image sets (Figure 3). The tissue dataset included the STORM images of nuclei labeled with the heterochromatin marker H3K9me3 from both colon and prostate tissue at different pathological states (normal, low-grade and high-grade pre-cancerous lesions and invasive cancer). When evaluating the cell line dataset, the test set included images with various labeled molecular targets (H3K27me3, H3K4me3, DNA, RNA polymerase II) from different cell lines under normal and treated conditions. Test accuracy was again assessed using the F1-Score of instances that achieved an IoU of 0.7. Additionally, we calculated the percentage of false negatives and the average Hausdorff distance, to provide an estimate of border and instance positioning accuracy.
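The border-accuracy metric can be sketched with SciPy's `directed_hausdorff`, taking the symmetric Hausdorff distance between the boundary pixels of a predicted and a ground-truth mask. The boundary-extraction step below (erosion-based) is an assumption; the paper does not state how boundaries were sampled:

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import directed_hausdorff

def boundary_points(mask):
    """Coordinates of a boolean mask's boundary pixels."""
    edge = mask & ~binary_erosion(mask)
    return np.argwhere(edge)

def hausdorff(mask_a, mask_b):
    """Symmetric Hausdorff distance between two instance boundaries."""
    u, v = boundary_points(mask_a), boundary_points(mask_b)
    return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])

# Toy example: a square instance and a copy shifted 2 pixels right
a = np.zeros((10, 10), bool); a[2:8, 2:8] = True
b = np.zeros((10, 10), bool); b[2:8, 4:10] = True
d = hausdorff(a, b)   # worst-case boundary mismatch of the shift
```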
i. Tissue dataset
The STORM images of the colon tissue dataset include various pathological states: normal tissue, precancerous lesions (adenoma, high-grade dysplasia) and invasive cancer. Nuclear texture varies significantly between pathological phenotypes; nuclei from precancerous (adenoma and high-grade dysplasia) and cancerous tissue exhibit dramatically fragmented chromatin texture with highly disrupted borders. Comparing network performance when trained and tested on the colon tissue dataset, we found that the Mask R-CNN model provided the highest test accuracy (Table 2). All networks demonstrated an optimal F1 score of at least 0.72 on the tissue dataset, but none achieved a score greater than 0.80. Additionally, Mask R-CNN produced the lowest average Hausdorff distance, implying the greatest average instance border accuracy. Mask R-CNN also demonstrated the lowest false negative rate, but had the highest number of false positives, suggesting some degree of over-segmentation. ANCIS, on the other hand, had the highest false negative rate and lowest number of false positives, suggesting under-segmentation (Figure 3). StarDist performed similarly to ANCIS, although with slightly lower accuracy, fewer false negatives and a higher Hausdorff distance.
Table 2. Average test accuracy scores for CNNs trained and tested on super-resolution imagery
(Columns are grouped by network, Mask R-CNN, ANCIS and StarDist, each reporting F1, FN/FP and H.)

| Train        | Test          | F1    | FN/FP       | H     | F1    | FN/FP       | H     | F1    | FN/FP       | H     |
|--------------|---------------|-------|-------------|-------|-------|-------------|-------|-------|-------------|-------|
| Colon Tissue | Colon         | 0.793 | 0.177/0.225 | 9.76  | 0.739 | 0.315/0.122 | 10.68 | 0.725 | 0.253/0.126 | 10.62 |
| Colon Tissue | Prostate      | 0.646 | 0.221/0.121 | 8.82  | 0.505 | 0.496/0.118 | 9.64  | 0.673 | 0.357/0.071 | 9.17  |
| Colon Tissue | Cell Downsize | 0.872 | 0.053/0.17  | 5.65  | 0.601 | 0.201/0.134 | 13.21 | 0.847 | 0.137/0.105 | 6.39  |
| Cell A       | Cell          | 0.832 | 0.11/0.3    | 8.21  | 0.902 | 0.101/0.078 | 7.05  | 0.799 | 0.107/0.263 | 8.87  |
| Cell B       | Cell          | 0.831 | 0.125/0.116 | 7.91  | 0.952 | 0.03/0.128  | 6.39  | 0.859 | 0.076/0.31  | 8.11  |
| Cell B       | Colon Upsize  | 0.423 | 0.607/0.424 | 17.43 | 0.489 | 0.551/0.076 | 14.51 | 0.39  | 0.467/0.645 | 17.06 |
| Combine      | Colon         | 0.676 | 0.169/0.564 | 10.59 | 0.753 | 0.305/0.11  | 11.13 | 0.612 | 0.396/0.397 | 11.39 |
| Combine      | Cell          | 0.885 | 0.021/0.236 | 5.54  | 0.943 | 0.041/0.107 | 6.76  | 0.92  | 0.037/0.174 | 5.45  |
| Colon Blur   | Colon Blur    | 0.752 | 0.246/0.25  | 9.87  | 0.729 | 0.349/0.136 | 11.06 | 0.658 | 0.387/0.141 | 11.02 |
| Colon HEQ    | Colon HEQ     | 0.769 | 0.183/0.287 | 10.01 | 0.733 | 0.328/0.127 | 11.17 | 0.696 | 0.348/0.123 | 10.94 |
| Cell A Blur  | Cell Blur     | 0.791 | 0.12/0.333  | 6.87  | 0.858 | 0.112/0.118 | 6.83  | 0.761 | 0.153/0.288 | 9.22  |
| Cell A HEQ   | Cell HEQ      | 0.81  | 0.105/0.363 | 6.64  | 0.867 | 0.098/0.126 | 6.57  | 0.785 | 0.14/0.285  | 8.99  |
F1-Score (F1), false negative percent and false positive percent (FN/FP), and Hausdorff distance (H) for Mask R-CNN, ANCIS and StarDist network models trained on the STORM colon tissue dataset and cell line datasets A & B. An additional combined dataset was created for training, comprising the downsized cell line dataset A, colon tissue and Kaggle datasets. Training was also conducted on histogram equalized (HEQ) and blurred (Blur) versions of the datasets. Testing was conducted on the 512x512 colon and prostate tissue test sets, the 512x512 cell line test set, the downsized (256x256) cell line set and the upsized (1024x1024) colon dataset.

In addition, we evaluated the effect of further image pre-processing, including Gaussian blur and histogram equalization, on segmentation accuracy. The networks were trained on pre-processed versions of the original tissue and cell line datasets for both the training and test images. The results indicate that the original data provided the best test accuracy over the pre-processed images in all cases, suggesting no advantage to be gained from these processes (Table 2).
Further, we evaluated whether networks trained on the dataset from one type of biological sample (e.g., the cell line dataset) can be directly used on another type (e.g., the tissue dataset), to determine cross-compatibility between trained models. We tested the networks trained on the cell line dataset on the colon tissue images. Accuracy scores were low, but improved somewhat when the tissue images were resized to 1024x1024 (upsized to make nuclear sizes similar to those in the cell line images). False negatives and Hausdorff distances were also much higher than when segmenting with a tissue-trained model. Generally, the models trained on the cell line dataset did not perform well when applied to tissue images.
We also briefly compared results between normal nuclei and those at different pathological states within our colon tissue test set. Segmentation test accuracy was significantly better on the normal nuclear phenotypes (F1=0.919) than on the pathological phenotypes (low-grade F1=0.825, high-grade F1=0.779 and invasive adenocarcinoma F1=0.676) when training on the STORM colon tissue dataset using Mask R-CNN (Supplementary Figure 4). Scores for ANCIS (normal F1=0.871, low-grade F1=0.783, high-grade F1=0.676 and invasive F1=0.653) and StarDist (normal F1=0.895, low-grade F1=0.702, high-grade F1=0.678 and invasive F1=0.532) followed a similar pattern. The enhanced performance on normal tissue is likely due to the dense nuclear texture and more well-spaced nuclei observed in those images, compared to the more clustered, irregular nuclei and disrupted nuclear texture found in pathological tissue sample images.
Lastly, we evaluated cross-compatibility between different tissue types. We applied the networks trained on the original colon tissue dataset to the prostate tissue test set, labeled with the same nuclear marker (H3K9me3) and expressing multiple pathological phenotypes (normal, low-grade and high-grade prostatic intraepithelial neoplasia and invasive cancer). The network segmentation was found to be acceptable across phenotypes (F1=0.646, Mask R-CNN), but the accuracy was significantly lower than for segmentation on the colon tissue test set (F1=0.793) (Supplementary Figure 5). Potential causes for the reduced accuracy were variations in nuclear shape (more circular) and texture (consisting of more discrete fragments) compared to the colon tissue nuclei. The prostate tissue images also contained denser noise between nuclei.
ii. Cell Line Dataset
Training was initially conducted on cell line dataset A, those images with discrete nuclear texture (e.g., those labeled with RNA polymerase II). Testing of the trained models was conducted on a subset of the STORM images of nuclei also with discrete texture, as well as on a set of STORM images of nuclei with dense or diffuse texture (e.g., those labeled with DNA, H3K4me3) (Figure 4). Nuclei segmentation achieved higher test accuracy on the cell line images than on the tissue dataset, likely due to reduced cell clustering and more regular cell shapes. Unlike with the tissue dataset, where Mask R-CNN performed the best overall, top marks for the cell line dataset were achieved using ANCIS. When trained using the cell line dataset with dense nuclear texture, all networks achieved F1-Scores above 0.8 (Table 2). Additionally, false negative percent was less than 10 percent for all trained network models, and Hausdorff distances were also less than 10. Results improved further for ANCIS and StarDist when training was conducted on cell line dataset B, containing cell nuclei with both discrete and dense/diffuse texture. The top results, using ANCIS, achieved an F1-Score of 0.952, with a false negative percent of 3 and a Hausdorff distance of 6.39. Networks were also trained and tested on blurred and histogram equalized versions of the images in cell line dataset A. As with the colon tissue set, the original dataset provided the best results; neither blurring nor histogram equalization improved the accuracy.
Lastly, we evaluated cross-compatibility between cell line and tissue datasets. Tissue-trained network models were applied to segment the cell line imagery. In general, the results were sub-optimal, due largely to under-segmentation of the larger cell line nuclei. However, when the cell images were resized down to 256x256, the test accuracy improved, particularly for Mask R-CNN (0.872) and StarDist (0.847). The percent false negatives remained higher for StarDist on the downsized dataset, whereas the Mask R-CNN tissue model performed better than its cell line trained models (Table 2). These results led us to contemplate whether a combined training set, incorporating STORM images from both tissue and cell line datasets, could achieve even better performance.
iii. Combined dataset
We combined the Kaggle, colon tissue and downsized (256x256) cell line dataset B to create a potentially more robust dataset. Downsizing was conducted on the cell line images to roughly match the nuclear sizes to those found in the colon tissue STORM images. Testing on the tissue dataset resulted in improved accuracy over the network trained on the Kaggle dataset alone (Table 1), but worse than the networks trained directly on the STORM tissue images for both StarDist and Mask R-CNN (Table 2). ANCIS, on the other hand, experienced a boost in nuclei segmentation accuracy for the tissue dataset (F1=0.753) compared to the ANCIS model trained on tissue data alone (F1=0.739). Since ANCIS had previously demonstrated a higher false negative percent, this boost in accuracy was likely due to an increase in the ability of the newly trained model to accept a greater degree of variability in instance identification, learned from the broader dataset with diverse image features. Mask R-CNN, on the other hand, suffered from this same increase in variability, since that network model demonstrated a trend towards over-selection of instances. Test results on the cell line data, however, were surprisingly robust across all networks. Optimal or near-optimal F1-Scores, false negatives and Hausdorff distances were found for all three broadly trained network models (Table 2). The Hausdorff distances for StarDist in particular showed significant improvement when trained on the combined dataset, versus when trained on either cell line dataset (Supplementary Figure 6).
D. Image Processing of Test Images
i. Noise Removal
STORM images often contain “noisy regions” due to non-specific binding or unbound fluorophores, out-of-focus fluorescence and autofluorescence signals. Such “noisy” regions are more prominent in the tissue dataset. To further improve accuracy in nuclei segmentation, we conducted noise removal on the test images by training a UNet to semantically recognize and segment noisy regions, as shown in Figure 5. Test accuracy was slightly improved for all models due to a reduction in false positives. The F1-Scores for Mask R-CNN improved from 0.788 to 0.793, for ANCIS from 0.735 to 0.756 and for StarDist from 0.71 to 0.725, when applied to the tissue test images segmented using tissue trained models. Importantly, false negative percent was markedly reduced, by 5 percent for both Mask R-CNN and StarDist, and by 12 percent for ANCIS. The greater improvement for ANCIS, and lesser for Mask R-CNN, is likely due, in part, to the number of total detections. Indeed, noise removal eliminated just as many false positives with Mask R-CNN as with ANCIS. Hausdorff distance was little affected by noise removal. However, it is worth noting that networks trained using more optimal parameters tend to detect less noise, without any additional processing. Using an additional network for noise detection and removal can supplement optimization, but does not replace it.
Applying the UNet trained on tissue-dataset noise to cell line dataset A resulted in the removal of some nuclei along with the noise, as nuclei with discrete, sparse texture were erroneously recognized as noise. Therefore, independently trained UNets were required for tissue and cell line images, due to the variation in noise density between the two image sets. F1-Scores and false negative percentages were less affected when noise removal was applied to the cell line dataset. On average, for cell line data, the F1-Scores improved by less than 1 percent, and false negative percent was reduced by less than 2 percent, when segmentation was also conducted by cell line trained network models.
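Once the noise-segmentation UNet has produced a per-pixel noise map, applying it to a test image before instance segmentation reduces to masking. A minimal sketch, assuming the UNet outputs a probability map (the `noise_prob` array below is a hypothetical stand-in for that output, and the 0.5 threshold is illustrative):

```python
import numpy as np

def remove_noise(image, noise_prob, threshold=0.5):
    """Zero out pixels flagged as noise by the semantic noise mask.

    `noise_prob` stands in for the per-pixel noise probability map
    the trained noise UNet would output (hypothetical values here)."""
    return np.where(noise_prob < threshold, image, 0)

# Toy example: a uniform image whose right half is flagged as noise
image = np.full((4, 4), 100, dtype=np.uint8)
noise_prob = np.zeros((4, 4))
noise_prob[:, 2:] = 0.9              # "noisy" columns
cleaned = remove_noise(image, noise_prob)
```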
ii. Small Instance Detection and Overlap Removal
The post-processing steps for the test images, applied after instance segmentation, included overlap removal and small instance removal. These processes made a greater difference in test accuracy for ANCIS and Mask R-CNN (Figure 6); StarDist already utilized a built-in module to eliminate overlaps and only benefitted from small instance removal. The elimination of overlaps and small instance detections improved the best Mask R-CNN F1-Score on the colon tissue dataset from 0.774 to 0.793, and the best ANCIS score for tissue data from 0.73 to 0.756. Removal of small instances from the StarDist tissue segmentation results improved the F1-Score only from 0.716 to 0.725. When applied to cell line dataset A with discrete texture, post-processing improved the F1-Score from 0.813 to 0.832 for Mask R-CNN, from 0.883 to 0.902 for ANCIS and from 0.778 to 0.799 for StarDist. Additionally, on the tissue dataset, the false negative percent was reduced by 5 and 17 percent for Mask R-CNN and ANCIS, respectively, and by 4 percent for StarDist. The false negative percent for the cell line test results improved by less than 2 percent for all network models. Interestingly, the Hausdorff distance was not improved by more than 0.25 for any test set.
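The two post-processing rules above can be sketched as a single pass over the predicted instance masks. The area and overlap thresholds below are illustrative assumptions; the paper does not give the exact values, and the greedy keep-the-larger-instance rule is one plausible implementation of overlap removal:

```python
import numpy as np

def postprocess(masks, min_area=50, max_overlap=0.5):
    """Drop tiny detections, then greedily suppress any instance that
    overlaps an already-kept (larger) instance by more than max_overlap
    of its own area. Thresholds are illustrative, not the paper's."""
    masks = [m for m in masks if m.sum() >= min_area]       # small instance removal
    masks.sort(key=lambda m: m.sum(), reverse=True)         # largest first
    kept = []
    for m in masks:
        area = m.sum()
        if all(np.logical_and(m, k).sum() / area <= max_overlap for k in kept):
            kept.append(m)                                  # overlap removal
    return kept

# Toy example: one large nucleus, a duplicate detection inside it,
# a one-pixel speck, and a separate valid nucleus
big = np.zeros((20, 20), bool);  big[0:10, 0:10] = True     # area 100
dup = np.zeros((20, 20), bool);  dup[0:10, 0:8] = True      # fully inside big
tiny = np.zeros((20, 20), bool); tiny[15, 15] = True        # below min_area
sep = np.zeros((20, 20), bool);  sep[11:19, 11:19] = True   # area 64, disjoint
kept = postprocess([big, dup, tiny, sep])
```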