Neural networks evaluated herein include the UNet-based StarDist [19], Mask R-CNN [11], and the region-based UNet technique ANCIS [13]. The segmentation results from each technique were compared. Additionally, we tested various image pre- and post-processing techniques, and trained a UNet to detect and remove noise from our test images.
A. Initial Testing
We first trained our networks using the broad, publicly available Kaggle dataset. The Kaggle dataset includes a large set of 2D microscopic images of cell nuclei from cell culture and tissue across various imaging modalities, including conventional fluorescence, histology and bright-field microscopy. Although super-resolution fluorescence microscopy is a type of fluorescence microscopy, super-resolved images may or may not prove compatible with CNNs trained on conventional fluorescence microscopy images. Our goal here was to determine whether training directly on a super-resolution STORM image dataset is necessary. Each network was trained using the optimal number of epochs, steps and other factors determined by the authors of each method. Trained networks were then applied to our resized STORM images from the 512x512 tissue and cell line test sets. As shown in Figure 1A and Table 1, the best results were achieved using Mask R-CNN, but overall performance was poor. The F1 scores on the tissue and cell line datasets were only 0.181 and 0.073, respectively.
Table 1. Average test accuracy scores for Mask R-CNN trained on the Kaggle dataset and tested on super-resolution imagery
| Test Set     | Pre-Processing | F1-Score | FN    | Hausdorff |
|--------------|----------------|----------|-------|-----------|
| Colon Tissue | 512x512        | 0.181    | 0.819 | 14.07     |
|              | 256x256        | 0.268    | 0.619 | 8.68      |
|              | 256x256 Blur   | 0.262    | 0.635 | 8.03      |
|              | 256x256 HEq    | 0.352    | 0.501 | 8.22      |
| Cell Line    | 512x512        | 0.073    | 0.924 | 12.83     |
|              | 256x256        | 0.475    | 0.473 | 6.9       |
|              | 256x256 Blur   | 0.555    | 0.268 | 5.92      |
|              | 256x256 HEq    | 0.628    | 0.201 | 6.07      |
Average F1-Score, false negative percent (FN) and Hausdorff distance for a Mask R-CNN segmentation network model trained on the Kaggle dataset, and applied to both our super-resolution colon tissue and DNA labelled cell line datasets. The network was applied to our Colon Tissue and Cell Line image test sets (512x512 resolution), as well as to the downsized versions of each test set (256x256 resolution), and to Gaussian blurred (Blur) and histogram equalized (HEq) versions.
Next, we applied a set of pre-processing methods to improve performance. The STORM images have two unique aspects compared to conventional fluorescence images: inherently discontinuous structural features at the nearly molecular-scale resolution, and nearly zero "intensity" values in most background regions. We deliberately lowered the image "resolution" by downsizing to 256x256 and either blurring or altering the image contrast by histogram equalization (Figure 1B). These pre-processing methods did improve our test accuracy overall, while reducing the false negative percentage (Table 1). However, none of the Kaggle-trained models attained an average F1-Score exceeding 0.5 on the tissue dataset, and a top mark of 0.628 was achieved on the cell line dataset. Notable exceptions to the reported averaged results can be found when analyzing individual images. The Kaggle-trained Mask R-CNN performed markedly better on STORM images containing dense or uniform nuclear texture with clear borders in both cell line and tissue images (Supplementary Figure 1). The F1-scores of these individual segmentations were lower than those obtained with the STORM-trained Mask R-CNN network, but they demonstrate the potential of CNNs for the segmentation of super-resolution STORM images.
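The three pre-processing steps above (downsizing, Gaussian blurring, histogram equalization) can be sketched with generic numpy/scipy stand-ins. The specific kernel sizes and resampling implementations used in the paper are not stated, so the parameters below are illustrative only:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def downsize(img, size=256):
    """Resize a square image to size x size (bilinear interpolation)."""
    factor = size / img.shape[0]
    return zoom(img, factor, order=1)

def blur(img, sigma=1.5):
    """Gaussian blur to smooth the discontinuous STORM texture."""
    return gaussian_filter(img.astype(float), sigma=sigma)

def hist_equalize(img, nbins=256):
    """Global histogram equalization for 8-bit intensity images."""
    hist, bin_edges = np.histogram(img.flatten(), bins=nbins, range=(0, 255))
    cdf = hist.cumsum()
    cdf = 255 * cdf / cdf[-1]  # normalize the CDF to [0, 255]
    return np.interp(img.flatten(), bin_edges[:-1], cdf).reshape(img.shape)

# Synthetic 512x512 "STORM-like" sparse image for demonstration
rng = np.random.default_rng(0)
img = (rng.random((512, 512)) > 0.98).astype(np.uint8) * 255
small = downsize(img)                # 256x256 version
smooth = blur(small)                 # blurred version
equalized = hist_equalize(small)     # histogram-equalized version
```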
B. Optimization
Due to the generally poor results when applying the Kaggle-trained networks, we next trained each network directly on our STORM image datasets. We determined parameters for instance segmentation via a process of training and testing, and used test accuracy as the determining factor for best performance. Test accuracy, for optimization purposes, was assessed using the F1-Score at an IoU threshold of 0.7. Parameters varied included the number of epochs and number of training images for all networks, as well as the number of steps for Mask R-CNN and StarDist. Additional parameters, including learning rate and other network-specific variables, were optimized as well.
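The F1-Score at an IoU threshold of 0.7 can be sketched as follows. The exact matching procedure used in the paper is not specified; this version uses a simple greedy one-to-one matching between predicted and ground-truth instance masks:

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection-over-union of two boolean instance masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

def f1_at_iou(preds, truths, thresh=0.7):
    """A prediction counts as a true positive if it overlaps an
    as-yet-unmatched ground-truth instance with IoU >= thresh."""
    matched = set()
    tp = 0
    for p in preds:
        for i, t in enumerate(truths):
            if i not in matched and iou(p, t) >= thresh:
                matched.add(i)
                tp += 1
                break
    fp = len(preds) - tp
    fn = len(truths) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy example: one perfect detection and one spurious ground truth
truth = np.zeros((10, 10), bool); truth[2:8, 2:8] = True
pred = truth.copy()
score = f1_at_iou([pred], [truth])   # perfect match
```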
i. Tissue dataset
A typical trend observed was a quick rise in accuracy with increasing number of epochs, followed by fluctuation and eventual settling. After a larger number of epochs, network accuracy leveled off or dropped, likely due to overfitting (Figure 2A). Plotting test accuracy against number of steps showed a similar trend, with networks reaching an accuracy saturation point when using more than a few hundred steps. Note that ANCIS did not provide the ability to vary steps, but rather utilized a two-part training conducted first on the region-based localization network, then on the instance segmentation network.
For Mask R-CNN and StarDist, the optimal number of epochs occurred at about 400, whereas 200 epochs produced the peak F1-Score for ANCIS. Both UNet methods, ANCIS and StarDist, showed some decrease in test accuracy as epochs increased beyond the optimal range. Mask R-CNN, on the other hand, appeared to fluctuate. Increasing steps improved the accuracy of Mask R-CNN more linearly beyond 100 steps, without significant fluctuation until beyond 500 steps, suggesting nonlinear effects when using too many or too few epochs (Supplementary Figure 2). StarDist reached peak accuracy at 300 steps and then dropped off, as with increasing epochs, perhaps due to overfitting.
Increasing the size of the training set was expected to improve test accuracy, and this was generally found to be the case (Supplementary Figure 3). All networks improved significantly when increasing from 10 to 20 training images, with each image containing an average of 22 instances. ANCIS and StarDist continued to improve in a nearly linear trend beyond 20 images, whereas Mask R-CNN once again demonstrated a fluctuating trend. All networks performed best when using the entire available training set of 77 images, though satisfactory results could be obtained using fewer. False negative and false positive counts tended to be higher when using a smaller dataset, and overlapping detections occurred with greater frequency.
Varying the learning rate within a limited, though commonly used, range of 1e-3 to 1e-5 did not produce a great deviation in test accuracy; however, a lower learning rate tended to require more epochs to achieve the same accuracy. Since all learning rates within this range provided similar results, we selected a rate in the middle of the range, resulting in a common rate of 1e-4 for all networks.
ii. Cell Line
The trend for test accuracy versus number of epochs for the cell line data proved similar to that observed for the tissue dataset, but the F1-Score values were higher overall with less fluctuation (Figure 2B). More uniform shapes, less noise and greater spacing between instances (i.e., less clustering) may help account for the increased accuracy of nuclei segmentation on the cell line dataset versus the tissue dataset. The smoother accuracy-versus-epoch curves may also be accounted for by the reduced variability between target instances. The optimal number of epochs, steps and learning rate were found to be similar to, but not the same as, those from the tissue dataset. The optimal number of epochs was 400 for both StarDist and ANCIS, but 600 for Mask R-CNN. Steps versus test accuracy, however, progressed similarly to the results found for the tissue training set.
The effect of training set size was also determined for the cell line dataset, which consists of 65 training images. Performance overall improved for all networks with increasing dataset size (Supplementary Figure 3), although Mask R-CNN dropped in accuracy when using the entire dataset, from 0.869 to 0.831. Both ANCIS and StarDist fluctuated between 40 and 60 images, but maintained an overall upward trend in accuracy versus number of images. ANCIS demonstrated both the highest scores and least variation. Indeed, the F1-Score for ANCIS when trained on only 10 images was nearly 0.9, with each image containing an average of 4.2 nuclei. However, the false negative percent of this model was much higher than that of the model trained on the full dataset, 12.5% for the former versus 3% for the latter, as was the Hausdorff distance, 8.53 versus 6.39, respectively. Mask R-CNN demonstrated a similar trend, scoring fairly high even when trained on only 10 images, though its scores were not as high as those of the ANCIS model.
C. Network Testing
Following network training and optimization, nuclei segmentation was conducted on all test image sets (Figure 3). The tissue dataset included the STORM images of nuclei labeled with the heterochromatin marker H3K9me3 from both colon and prostate tissue at different pathological states (normal, low-grade and high-grade pre-cancerous lesions and invasive cancer). When evaluating the cell line dataset, the test set included images with various labeled molecular targets (H3K27me3, H3K4me3, DNA, RNA polymerase II) from different cell lines under normal and treated conditions. Test accuracy was again assessed using the F1-Score of instances that achieved an IoU of 0.7. Additionally, we calculated the percentage of false negatives and the average Hausdorff distance, to provide an estimate of border and instance positioning accuracy.
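The border-accuracy metric can be sketched with SciPy's `directed_hausdorff`, taking the symmetric Hausdorff distance between the boundary pixels of a predicted and a ground-truth mask. The boundary-extraction step below (erosion-based) is an assumption; the paper does not state how boundaries were sampled:

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import directed_hausdorff

def boundary_points(mask):
    """Coordinates of a boolean mask's boundary pixels."""
    edge = mask & ~binary_erosion(mask)
    return np.argwhere(edge)

def hausdorff(mask_a, mask_b):
    """Symmetric Hausdorff distance between two instance boundaries."""
    u, v = boundary_points(mask_a), boundary_points(mask_b)
    return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])

# Toy example: a square instance and a copy shifted 2 pixels right
a = np.zeros((10, 10), bool); a[2:8, 2:8] = True
b = np.zeros((10, 10), bool); b[2:8, 4:10] = True
d = hausdorff(a, b)   # worst-case boundary mismatch of the shift
```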
i. Tissue dataset
The STORM images of the colon tissue dataset include various pathological states: normal tissue, precancerous lesions (adenoma, high-grade dysplasia) and invasive cancer. Nuclear texture varies significantly between pathological phenotypes; nuclei from precancerous (adenoma and high-grade dysplasia) and cancerous tissue exhibit dramatically fragmented chromatin texture with highly disrupted borders. Comparing network performance when trained and tested on the colon tissue dataset, we found that the Mask R-CNN model provided the highest test accuracy (Table 2). All networks demonstrated an optimal F1 score of at least 0.72 on the tissue dataset, but none achieved a score greater than 0.80. Additionally, Mask R-CNN produced the lowest average Hausdorff distance, implying the greatest average instance border accuracy. Mask R-CNN also demonstrated the lowest false negative rate, but had the highest number of false positives, suggesting some degree of over-segmentation. ANCIS, on the other hand, had the highest false negative rate and lowest number of false positives, suggesting under-segmentation (Figure 3). StarDist performed similarly to ANCIS, although with slightly lower accuracy, fewer false negatives and a higher Hausdorff distance.
Table 2. Average test accuracy scores for CNNs trained and tested on super-resolution imagery
(Columns are grouped by network, Mask R-CNN, ANCIS and StarDist, each reporting F1, FN/FP and H.)

| Train        | Test          | F1    | FN/FP       | H     | F1    | FN/FP       | H     | F1    | FN/FP       | H     |
|--------------|---------------|-------|-------------|-------|-------|-------------|-------|-------|-------------|-------|
| Colon Tissue | Colon         | 0.793 | 0.177/0.225 | 9.76  | 0.739 | 0.315/0.122 | 10.68 | 0.725 | 0.253/0.126 | 10.62 |
| Colon Tissue | Prostate      | 0.646 | 0.221/0.121 | 8.82  | 0.505 | 0.496/0.118 | 9.64  | 0.673 | 0.357/0.071 | 9.17  |
| Colon Tissue | Cell Downsize | 0.872 | 0.053/0.17  | 5.65  | 0.601 | 0.201/0.134 | 13.21 | 0.847 | 0.137/0.105 | 6.39  |
| Cell A       | Cell          | 0.832 | 0.11/0.3    | 8.21  | 0.902 | 0.101/0.078 | 7.05  | 0.799 | 0.107/0.263 | 8.87  |
| Cell B       | Cell          | 0.831 | 0.125/0.116 | 7.91  | 0.952 | 0.03/0.128  | 6.39  | 0.859 | 0.076/0.31  | 8.11  |
| Cell B       | Colon Upsize  | 0.423 | 0.607/0.424 | 17.43 | 0.489 | 0.551/0.076 | 14.51 | 0.39  | 0.467/0.645 | 17.06 |
| Combine      | Colon         | 0.676 | 0.169/0.564 | 10.59 | 0.753 | 0.305/0.11  | 11.13 | 0.612 | 0.396/0.397 | 11.39 |
| Combine      | Cell          | 0.885 | 0.021/0.236 | 5.54  | 0.943 | 0.041/0.107 | 6.76  | 0.92  | 0.037/0.174 | 5.45  |
| Colon Blur   | Colon Blur    | 0.752 | 0.246/0.25  | 9.87  | 0.729 | 0.349/0.136 | 11.06 | 0.658 | 0.387/0.141 | 11.02 |
| Colon HEQ    | Colon HEQ     | 0.769 | 0.183/0.287 | 10.01 | 0.733 | 0.328/0.127 | 11.17 | 0.696 | 0.348/0.123 | 10.94 |
| Cell A Blur  | Cell Blur     | 0.791 | 0.12/0.333  | 6.87  | 0.858 | 0.112/0.118 | 6.83  | 0.761 | 0.153/0.288 | 9.22  |
| Cell A HEQ   | Cell HEQ      | 0.81  | 0.105/0.363 | 6.64  | 0.867 | 0.098/0.126 | 6.57  | 0.785 | 0.14/0.285  | 8.99  |
F1-Score (F1), false negative percent and false positive percent (FN/FP), and Hausdorff distance (H) for Mask R-CNN, ANCIS and StarDist network models trained on the STORM colon tissue dataset and cell line datasets A & B. An additional combined dataset was created for training, comprising the downsized cell line dataset A, colon tissue and Kaggle datasets. Training was also conducted on histogram equalized (HEQ) and blurred (Blur) versions of the datasets. Testing was conducted on the 512x512 colon and prostate tissue test sets, the 512x512 cell line test set, the downsized (256x256) cell line set and the upsized (1024x1024) colon dataset.

In addition, we evaluated the effect of further image pre-processing, including Gaussian blur and histogram equalization, on segmentation accuracy. The networks were trained on pre-processed versions of the original tissue and cell line datasets for both the training and test images. The results indicate that the original data provided the best test accuracy over the pre-processed images in all cases, suggesting no advantage to be gained from these processes (Table 2).
Further, we evaluated whether networks trained on the dataset from one type of biological sample (e.g., the cell line dataset) can be directly used on another type (e.g., the tissue dataset), to determine cross-compatibility between trained models. We tested the networks trained on the cell line dataset on the colon tissue images. Accuracy scores were low, but improved somewhat when the tissue images were resized to 1024x1024 (upsized to make nuclear sizes similar to those in the cell line images). False negatives and Hausdorff distances were also much higher than when segmenting with a tissue-trained model. Generally, the models trained on the cell line dataset did not perform well when applied to tissue images.
We also briefly compared results between normal nuclei and those at different pathological states within our colon tissue test set. Segmentation test accuracy was significantly better on the normal nuclear phenotypes (F1=0.919) than on the pathological phenotypes (low-grade F1=0.825, high-grade F1=0.779 and invasive adenocarcinoma F1=0.676) when training on the STORM colon tissue dataset using Mask R-CNN (Supplementary Figure 4). Scores for ANCIS (normal F1=0.871, low-grade F1=0.783, high-grade F1=0.676 and invasive F1=0.653) and StarDist (normal F1=0.895, low-grade F1=0.702, high-grade F1=0.678 and invasive F1=0.532) followed a similar pattern. The enhanced performance on normal tissue is likely due to the dense nuclear texture and more well-spaced nuclei observed in those images, compared to the more clustered, irregular nuclei and disrupted nuclear texture found in pathological tissue sample images.
Lastly, we evaluated cross-compatibility between different tissue types. We applied the networks trained on the original colon tissue dataset to the prostate tissue test set, labeled with the same nuclear marker (H3K9me3) and expressing multiple pathological phenotypes (normal, low-grade and high-grade prostatic intraepithelial neoplasia and invasive cancer). The network segmentation was found to be acceptable across phenotypes (F1=0.646, Mask R-CNN), but the accuracy was significantly lower than for segmentation on the colon tissue test set (F1=0.793) (Supplementary Figure 5). Potential causes for the reduced accuracy were variations in nuclear shape (more circular) and texture (consisting of more discrete fragments) compared to the colon tissue nuclei. The prostate tissue images also contained denser noise between nuclei.
ii. Cell Line Dataset
Training was initially conducted on cell line dataset A, those images with discrete nuclear texture (e.g., those labeled with RNA polymerase II). Testing of the trained models was conducted on a subset of the STORM images of nuclei also with discrete texture, as well as on a set of STORM images of nuclei with dense or diffuse texture (e.g., those labeled with DNA, H3K4me3) (Figure 4). Nuclei segmentation achieved higher test accuracy on the cell line images than on the tissue dataset, likely due to reduced cell clustering and more regular cell shapes. Unlike with the tissue dataset, where Mask R-CNN performed the best overall, top marks for the cell line dataset were achieved using ANCIS. When trained using the cell line dataset with dense nuclear texture, all networks achieved F1-Scores above 0.8 (Table 2). Additionally, false negative percent was less than 10 percent for all trained network models, and Hausdorff distances were also less than 10. Results improved further for ANCIS and StarDist when training was conducted on cell line dataset B, containing cell nuclei with both discrete and dense/diffuse texture. The top results, using ANCIS, achieved an F1-Score of 0.952, with a false negative percent of 3 and a Hausdorff distance of 6.39. Networks were also trained and tested on blurred and histogram equalized versions of the images in cell line dataset A. As with the colon tissue set, the original dataset provided the best results; neither blurring nor histogram equalization improved the accuracy.
Lastly, we evaluated cross-compatibility between cell line and tissue datasets. Tissue-trained network models were applied to segment the cell line imagery. In general, the results were sub-optimal, due largely to under-segmentation of the larger cell line nuclei. However, when the cell images were resized down to 256x256, the test accuracy improved, particularly for Mask R-CNN (0.872) and StarDist (0.847). The percent false negatives remained higher for StarDist on the downsized dataset, whereas the Mask R-CNN tissue model performed better than its cell line trained models (Table 2). These results led us to contemplate whether a combined training set, incorporating STORM images from both tissue and cell line datasets, could achieve even better performance.
iii. Combined dataset
We combined the Kaggle, colon tissue and downsized (256x256) cell line dataset B to create a potentially more robust dataset. Downsizing was conducted on the cell line images to roughly match the nuclear sizes to those found in the colon tissue STORM images. Testing on the tissue dataset resulted in improved accuracy over the network trained on the Kaggle dataset alone (Table 1), but worse than the networks trained directly on the STORM tissue images for both StarDist and Mask R-CNN (Table 2). ANCIS, on the other hand, experienced a boost in nuclei segmentation accuracy for the tissue dataset (F1=0.753) compared to the ANCIS model trained on tissue data alone (F1=0.739). Since ANCIS had previously demonstrated a higher false negative percent, this boost in accuracy was likely due to an increase in the ability of the newly trained model to accept a greater degree of variability in instance identification, learned from the broader dataset with diverse image features. Mask R-CNN, on the other hand, suffered from this same increase in variability, since that network model demonstrated a trend towards over-selection of instances. Test results on the cell line data, however, were surprisingly robust across all networks. Optimal or near-optimal F1-Scores, false negatives and Hausdorff distances were found for all three broadly trained network models (Table 2). The Hausdorff distances for StarDist in particular showed significant improvement when trained on the combined dataset, versus when trained on either cell line dataset (Supplementary Figure 6).
D. Image Processing of Test Images
i. Noise Removal
STORM images often contain “noisy regions” due to non-specific binding or unbound fluorophores, out-of-focus fluorescence and autofluorescence signals. Such “noisy” regions are more prominent in the tissue dataset. To further improve accuracy in nuclei segmentation, we conducted noise removal on the test images by training a UNet to semantically recognize and segment noisy regions, as shown in Figure 5. Test accuracy was slightly improved for all models due to a reduction in false positives. The F1-Scores for Mask R-CNN improved from 0.788 to 0.793, for ANCIS from 0.735 to 0.756 and for StarDist from 0.71 to 0.725, when applied to the tissue test images segmented using tissue trained models. Importantly, false negative percent was markedly reduced, by 5 percent for both Mask R-CNN and StarDist, and by 12 percent for ANCIS. The greater improvement for ANCIS, and lesser for Mask R-CNN, is likely due, in part, to the number of total detections. Indeed, noise removal eliminated just as many false positives with Mask R-CNN as with ANCIS. Hausdorff distance was little affected by noise removal. However, it is worth noting that networks trained using more optimal parameters tend to detect less noise, without any additional processing. Using an additional network for noise detection and removal can supplement optimization, but does not replace it.
Applying the UNet trained on tissue-dataset noise to cell line dataset A resulted in the removal of some nuclei along with the noise, as nuclei with discrete, sparse texture were erroneously recognized as noise. Therefore, independently trained UNets were required for tissue and cell line images, due to the variation in noise density between the two image sets. F1-Scores and false negative percentages were less affected when noise removal was applied to the cell line dataset. On average, for cell line data, the F1-Scores improved by less than 1 percent, and false negative percent was reduced by less than 2 percent, when segmentation was also conducted by cell line trained network models.
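Once the noise-segmentation UNet has produced a per-pixel noise map, applying it to a test image before instance segmentation reduces to masking. A minimal sketch, assuming the UNet outputs a probability map (the `noise_prob` array below is a hypothetical stand-in for that output, and the 0.5 threshold is illustrative):

```python
import numpy as np

def remove_noise(image, noise_prob, threshold=0.5):
    """Zero out pixels flagged as noise by the semantic noise mask.

    `noise_prob` stands in for the per-pixel noise probability map
    the trained noise UNet would output (hypothetical values here)."""
    return np.where(noise_prob < threshold, image, 0)

# Toy example: a uniform image whose right half is flagged as noise
image = np.full((4, 4), 100, dtype=np.uint8)
noise_prob = np.zeros((4, 4))
noise_prob[:, 2:] = 0.9              # "noisy" columns
cleaned = remove_noise(image, noise_prob)
```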
ii. Small Instance Detection and Overlap Removal
The post-processing steps for the test images, applied after instance segmentation, included overlap removal and small instance removal. These processes made a greater difference in test accuracy for ANCIS and Mask R-CNN (Figure 6); StarDist already utilized a built-in module to eliminate overlaps and only benefitted from small instance removal. The elimination of overlaps and small instance detections improved the best Mask R-CNN F1-Score on the colon tissue dataset from 0.774 to 0.793, and the best ANCIS score for tissue data from 0.73 to 0.756. Removal of small instances from the StarDist tissue segmentation results improved the F1-Score only from 0.716 to 0.725. When applied to cell line dataset A with discrete texture, post-processing improved the F1-Score from 0.813 to 0.832 for Mask R-CNN, from 0.883 to 0.902 for ANCIS and from 0.778 to 0.799 for StarDist. Additionally, on the tissue dataset, the false negative percent was reduced by 5 and 17 percent for Mask R-CNN and ANCIS, respectively, and by 4 percent for StarDist. The false negative percent for the cell line test results improved by less than 2 percent for all network models. Interestingly, the Hausdorff distance was not improved by more than 0.25 for any test set.
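The two post-processing rules above can be sketched as a single pass over the predicted instance masks. The area and overlap thresholds below are illustrative assumptions; the paper does not give the exact values, and the greedy keep-the-larger-instance rule is one plausible implementation of overlap removal:

```python
import numpy as np

def postprocess(masks, min_area=50, max_overlap=0.5):
    """Drop tiny detections, then greedily suppress any instance that
    overlaps an already-kept (larger) instance by more than max_overlap
    of its own area. Thresholds are illustrative, not the paper's."""
    masks = [m for m in masks if m.sum() >= min_area]       # small instance removal
    masks.sort(key=lambda m: m.sum(), reverse=True)         # largest first
    kept = []
    for m in masks:
        area = m.sum()
        if all(np.logical_and(m, k).sum() / area <= max_overlap for k in kept):
            kept.append(m)                                  # overlap removal
    return kept

# Toy example: one large nucleus, a duplicate detection inside it,
# a one-pixel speck, and a separate valid nucleus
big = np.zeros((20, 20), bool);  big[0:10, 0:10] = True     # area 100
dup = np.zeros((20, 20), bool);  dup[0:10, 0:8] = True      # fully inside big
tiny = np.zeros((20, 20), bool); tiny[15, 15] = True        # below min_area
sep = np.zeros((20, 20), bool);  sep[11:19, 11:19] = True   # area 64, disjoint
kept = postprocess([big, dup, tiny, sep])
```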