Distinct nuclear morphology of ovarian cancer tissues:
Formalin-fixed paraffin-embedded normal and diseased (ovarian cancer) tissues were obtained from Tata Medical Center following the ethical guidelines. Tissues were stained for lamin A and lamin B following proper antigen retrieval and imaged under a confocal microscope. Fields, each containing approximately 110-150 nuclei, were captured from each of the subsets (lamin A stained normal and ovarian cancer tissues, lamin B stained normal and ovarian cancer tissues) under similar acquisition parameters. One representative field from each of the tissue sets is shown in Figure 2. A visibly prominent enlargement of the cancer nuclei with respect to the normal nuclei was observed in both lamin A and lamin B stained tissues. Two tissue microarrays, each containing 40 samples, were obtained from Tata Medical Center; we were blinded to their clinical details. The arrays were stained for lamin A and lamin B following the same procedure, and the images acquired from the TMA slides were subsequently used for validation of the best working model for this problem. The 40 samples in a tissue microarray slide stained with the lamin A antibody are shown in Figure 3.
Data Augmentation by SMOTE and Analysis of data points in the sample space:
We started our experiment with 262 fields, each containing about 150 nuclei, of ovarian cancer tissues (majority class) and 52 fields, each containing about 110 nuclei, of normal ovarian epithelial tissue (minority class). The distribution between the two classes was therefore far from equal: the majority class comprised almost 84% of the dataset, making it an imbalanced one. Using an imbalanced dataset to build a deep learning classifier would bias it towards the majority class; unless the dataset is synthetically matched, the classifier would have a strong tendency to predict the majority class for unknown samples. Therefore, we applied the Synthetic Minority Oversampling Technique (SMOTE)30, which uses vector interpolation in the high dimensional feature space to generate synthetic samples of the minority class. Different properties of SMOTE have various implications for high dimensional data33.
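The interpolation step at the heart of SMOTE can be sketched as follows. This is an illustrative re-implementation, not the authors' pipeline (in practice a library such as imbalanced-learn would typically be used), and the array sizes below are hypothetical stand-ins for the image-derived feature vectors.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples: pick a minority sample,
    pick one of its k nearest minority neighbours, and interpolate a
    random fraction of the way between the two."""
    rng = rng or np.random.default_rng(0)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours
    base = rng.integers(0, len(X_min), n_new)   # random base samples
    nb = nn[base, rng.integers(0, k, n_new)]    # one random neighbour each
    gap = rng.random((n_new, 1))                # interpolation fraction
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# e.g. grow a 52-sample minority class to match a 262-sample majority
minority = np.random.default_rng(1).normal(size=(52, 8))
synthetic = smote(minority, 262 - 52)
balanced_minority = np.vstack([minority, synthetic])  # 262 samples
```

Because each synthetic point lies on a segment between two real minority points, the oversampled class stays inside the minority region of the feature space rather than duplicating existing samples.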
Image Pre-processing:
The pre-processing algorithm consists of two parts: applying a segmentation mask based on key visual properties of the nuclei (area, perimeter, circularity, eccentricity, foci distance, loop length, maximum curvature and normalised curvature), followed by image sharpening. In the first part, the segmentation mask was created from the image Hue Saturation Value (HSV) using a sensitivity factor and subsequently made prominent by a morphological closing operation with an elliptical kernel. The elliptical kernel was used to adapt to the shape of the cells and capture the maximum possible relevant information; a rectangular kernel had been tested previously but yielded no more than 93% accuracy. In the second part, Gaussian blurring and weighted addition of the blurred image ensured uniform sharpening of the masked images, which were then converted to grey-scale so that the key visual features became more prominent and easier for the deep learning algorithm to unravel. Background information was removed entirely to emphasize the morphological properties of the nuclei (Figure 4).
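The two stages can be sketched as below, using scipy.ndimage as a stand-in for the authors' actual toolchain. The kernel semi-axes, blur sigma and sharpening weight are illustrative choices, and the HSV thresholding that produces the initial mask is assumed to have been done already.

```python
import numpy as np
from scipy import ndimage

def elliptical_kernel(a, b):
    """Boolean elliptical structuring element with semi-axes a (rows) and b (cols)."""
    y, x = np.ogrid[-a:a + 1, -b:b + 1]
    return (x / b) ** 2 + (y / a) ** 2 <= 1.0

def preprocess(gray, mask, amount=1.5):
    """Close the segmentation mask with an elliptical kernel, zero the
    background, then sharpen the masked image by unsharp masking."""
    closed = ndimage.binary_closing(mask, structure=elliptical_kernel(5, 7))
    masked = np.where(closed, gray, 0.0)        # background removed entirely
    blurred = ndimage.gaussian_filter(masked, sigma=2.0)
    # weighted-sum sharpening: out = (1 + amount)*img - amount*blurred
    sharp = (1 + amount) * masked - amount * blurred
    return np.clip(sharp, 0.0, 1.0)
```

The closing operation fills small gaps in the mask so that each nucleus forms one solid region, while the unsharp-masking step boosts edge contrast inside the retained foreground.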
Morphometric comparisons between images before and after pre-processing:
The minor and major axes of each nucleus were measured manually using ImageJ (bundled with 64-bit Java 1.8.0_112). Careful investigation revealed that the hallmark of the diseased tissues was prominent nuclear enlargement, as reported earlier34. We quantified these changes by considering every nucleus an ellipse; eight parameters (area, perimeter, eccentricity, circularity, foci distance, loop length, maximum curvature and normalized maximum curvature) were measured for each nucleus using the formulae mentioned earlier29. With these sets of images, a gross morphometric analysis was performed based on the distribution of lamin A and lamin B proteins in the nucleus. Following pre-processing, all eight parameters were reanalysed from the lamin A and lamin B stained nuclei on randomly selected cancer and normal pre-processed tissue images. Histograms were generated for each parameter using the ROOT data analysis framework (Version 6, Release 6.08/06-2017-03-02), where the X axis denotes the normalized number of nuclei with respect to the total number of nuclei calculated for the defined parameter and the Y axis denotes the measure of the parameter. It was evident from the plots (Figure 5) that the perimeter of most of the cancer nuclei in the total population showed an increase of 55-62% compared to most of the normal nuclei for both lamin A and lamin B stained tissues before pre-processing, and the observation was similar in the pre-processed counterparts (Figure 5 A1, A2, A3, A4). A similar phenomenon was observed for area: the area of most of the cancer nuclei was more than twice that of most of the normal nuclei in both raw and pre-processed images of cancer and normal tissue (Figure 5 B1, B2, B3, B4). Both observations indicated an increase in size of the cancerous nuclei, and this feature was unaltered by pre-processing.
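As a worked sketch, the standard ellipse quantities among the eight parameters can be computed directly from the measured semi-axes. The loop length and normalised-curvature formulae follow the earlier reference29 and are not reproduced here; the function below is illustrative, not the authors' exact implementation.

```python
import math

def ellipse_params(a, b):
    """Standard ellipse geometry for semi-major axis a and semi-minor axis b."""
    assert a >= b > 0
    c = math.sqrt(a * a - b * b)          # focal length (centre to one focus)
    area = math.pi * a * b
    # Ramanujan's approximation to the ellipse perimeter
    h = ((a - b) / (a + b)) ** 2
    perimeter = math.pi * (a + b) * (1 + 3 * h / (10 + math.sqrt(4 - 3 * h)))
    return {
        "area": area,
        "perimeter": perimeter,
        "eccentricity": c / a,                           # 0 for a circle
        "circularity": 4 * math.pi * area / perimeter ** 2,
        "foci_distance": 2 * c,                          # 2 x focal length
        "max_curvature": a / (b * b),                    # at the major-axis tips
    }
```

For a circle (a = b) the eccentricity and foci distance vanish and the circularity is exactly 1, so enlargement and elongation of nuclei move these quantities in opposite, interpretable directions.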
However, in the cancer nuclei, shifts of only around 3% and 12% from the normal values were observed in circularity and eccentricity respectively, which was not significant enough to denote a prominent change in shape (Supplementary figure 1 A1, A2, A3, A4, B1, B2, B3, B4). Eccentricity depends on the focal length (the distance from the centre to one focus) and the semi-major axis. To validate further, the foci distance (2 × focal length) was also measured, where the shift associated with eccentricity was expected to double according to the formulae. We found a small increase in the foci distance values of the cancer nuclei in comparison to the normal nuclei, denoting an increase in the distance between the foci and thereby a more elliptical shape (Supplementary figure 1 C1, C2, C3, C4). Another common parameter in ellipse geometry is the loop length, which also depends on the focal length; accordingly, a rise was evident in the loop length of the cancer nuclei, once again denoting an increase in size (Supplementary figure 1 D1, D2, D3, D4). Next, to study the change in surface architecture, maximum curvature and normalised curvature were measured, but no significant shift was observed from which to draw a conclusion (Supplementary figure 1 E1, E2, E3, E4, F1, F2, F3, F4). These observations were consistent in the pre-processed images as well. Since the tumor microenvironment harbours a heterogeneous cell population, including cells at different stages of malignancy as well as some normal cells, the analysis spanned a large range of parametric measures to accommodate all the nuclei in the population. Some shifts were visibly clear and prominent while others were not, so the change may not be specified with distinct values. Overall, these measurements confirmed prominent morphological alteration in the cancer nuclei, or in nuclei approaching malignancy, with respect to the normal nuclei and gave a gross idea of the direction of change.
This experiment concluded that morphometric alteration, in the form of altered distribution of lamins in the nuclei, can potentially be used as a signature to classify cancer and normal nuclei or to study the progress towards malignancy; since this feature is unaltered by pre-processing, it can serve as a potential discriminating feature for the deep learning model to distinguish between cancer and normal nuclei.
A1. Comparative distribution of the number of normal (mean ± SEM: 17.82 ± 0.3032; SD: 5.574 ± 0.2144) and ovarian cancer (mean ± SEM: 27.59 ± 0.333; SD: 6.626 ± 0.2354) nuclei based on perimeter values acquired from lamin A stained tissue images before pre-processing.
A2. Comparative distribution of the number of normal (mean ± SEM: 16.11 ± 0.1259; SD: 3.352 ± 0.08) and ovarian cancer (mean ± SEM: 26.21 ± 0.3628; SD: 7.814 ± 0.2565) nuclei based on perimeter values acquired from lamin B stained tissue images before pre-processing.
A3. Comparative distribution of the number of normal (mean ± SEM: 13.75 ± 0.1804; SD: 3.125 ± 0.1276) and ovarian cancer (mean ± SEM: 23.1 ± 0.289; SD: 5.445 ± 0.2043) nuclei based on perimeter values acquired from lamin A stained tissue images after pre-processing.
A4. Comparative distribution of the number of normal (mean ± SEM: 12.85 ± 0.1732; SD: 3.015 ± 0.1225) and ovarian cancer (mean ± SEM: 23.66 ± 0.3121; SD: 5.635 ± 0.2207) nuclei based on perimeter values acquired from lamin B stained tissue images after pre-processing.
B1. Comparative distribution of the number of normal (mean ± SEM: 23.47 ± 0.7129; SD: 13.11 ± 0.0541) and ovarian cancer (mean ± SEM: 51.62 ± 1.153; SD: 22.86 ± 0.8153) nuclei based on area values acquired from lamin A stained tissue images before pre-processing.
B2. Comparative distribution of the number of normal (mean ± SEM: 19.94 ± 0.316; SD: 8.414 ± 0.2234) and ovarian cancer (mean ± SEM: 49.31 ± 1.234; SD: 26.35 ± 0.8725) nuclei based on area values acquired from lamin B stained tissue images before pre-processing.
B3. Comparative distribution of the number of normal (mean ± SEM: 14.66 ± 0.3791; SD: 6.577 ± 0.268) and ovarian cancer (mean ± SEM: 41.38 ± 1.021; SD: 19.24 ± 0.722) nuclei based on area values acquired from lamin A stained tissue images after pre-processing.
B4. Comparative distribution of the number of normal (mean ± SEM: 13.02 ± 0.3677; SD: 6.401 ± 0.26) and ovarian cancer (mean ± SEM: 42.44 ± 1.006; SD: 18.13 ± 0.7111) nuclei based on area values acquired from lamin B stained tissue images after pre-processing.
Training a Deep Hybrid Learner
After the pre-processed images were acquired, the data were split into a training set and a validation set in a 75:25 ratio. The training set was used to train the supervised binary classification model, and the validation set was used for hyper-parameter tuning, to ensure that the model was not over-fitting on the training set and remained generalized. To train a Deep Hybrid Learner, we first trained the 21-layered CNN used for feature extraction. It was trained for 250 epochs; the learning and loss curves obtained are shown in Figure 6. The training and validation curves clearly showed that the model was neither over-fitting nor under-fitting, i.e. a very good fit. In the learning curve, we used the AUC score as the metric for determining the fitness of the algorithm. The training and validation AUC scores gradually increased with training iterations or epochs; after 250 epochs the maximum training score was 0.998 and the validation score 0.997. The loss curves showed training and validation loss gradually decreasing with increasing epochs, indicating that the model was learning with each training iteration. The absence of any statistically significant gap between training and validation loss indicated the absence of over-fitting. These scores and the extremely small loss values, together with the graphical interpretation of the learning and loss curves, supported the conclusion that the model was well generalized and could be expected to perform well on the test dataset.
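The split-and-monitor procedure can be illustrated with a minimal sketch. A logistic-regression stand-in replaces the 21-layered CNN (which would require the full image dataset), and the data here are synthetic; only the 75:25 stratified split and the AUC comparison mirror the procedure described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
# toy labels driven by one feature plus noise
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

# 75:25 train/validation split, stratified to preserve the class balance
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
auc_tr = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
auc_va = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
# a well-generalised model keeps training and validation AUC close
```

Tracking this pair of scores per epoch yields exactly the learning curves discussed above: a widening gap between them is the signature of over-fitting.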
Selection of Model Evaluation Metrics
Since we had a highly imbalanced dataset, accuracy alone would never be a good metric and could be misleading: a model that always predicted the majority class label would achieve very high accuracy while remaining highly biased towards the majority class. We therefore needed metrics that reveal the impact of true positives and false positives on an imbalanced dataset. We used the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) score (Figure 6) and the confusion matrix, which clearly highlights the true positives, true negatives, false positives and false negatives, with the positive class being detection of cancer cells and the negative class detection of normal cells. In our case, since the AI-driven approach is used as an automated pre-check for cancer detection, after which detailed tests and inspection would be performed, false positives are comparatively less expensive than false negatives, as false negatives would delay the detection of cancer. A model with the set of hyper-parameters giving the maximum AUC-ROC score and minimum false positives and false negatives should therefore be selected, with a lower proportion of false negatives than of false positives preferred.
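A minimal sketch of these two metrics, with hypothetical scores (1 = cancer, 0 = normal, as in the paper's annotation), shows how the decision threshold trades false negatives for false positives:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1])  # toy scores

# AUC-ROC is threshold-free: it measures how well scores rank cancers
# above normals across all thresholds.
auc = roc_auc_score(y_true, y_prob)

# Lowering the decision threshold converts false negatives into false
# positives -- preferable when a missed cancer is the expensive error.
y_pred = (y_prob >= 0.35).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```

With the 0.35 threshold the one low-scoring cancer (0.4) is still caught, at the cost of one false positive (the normal sample scored 0.6), which matches the stated preference for fewer false negatives.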
Model Evaluation on Test Data:
The clinical details of the TMA samples were revealed: the arrays contained a mixed cohort of tissues from non-cancer ovary as well as from the omentum and adjacent areas of patients diagnosed with ovarian cancer undergoing debulking surgery. The normal and cancer tissue samples were grouped into 7 sub-classes (PDS adjacent Normal, PDS Tumor, IDS good response adjacent Normal, IDS good response Tumor, IDS poor response Normal, IDS poor response Tumor, Non-cancer ovary) (Figure 7). One representative image from each sub-class was chosen as a test image to evaluate the model performances. We compared the performance of the deep hybrid learning models (Deep Hybrid Learning with Random Forest35 and Deep Hybrid Learning with XGBoost36) with other models: a conventional deep neural network (DNN) model, and DenseNet20131, ResNet50, InceptionNetv3 and VGG1632, each with transfer learning37. The main difference between DHL and a conventional DNN is that the conventional DNN uses a fully connected neural network for the final classification, whereas DHL uses a classical machine learning algorithm. In the transfer learning models we used pretrained weights from ImageNet. The training, validation and test datasets were the same for all approaches, and the number of epochs and the batch size were also consistent across approaches. Confusion matrices were obtained for the 7 representative images from the 7 sub-classes. The Deep Hybrid Learners exhibited 100% validation accuracy, recognising the 4 images from the 4 normal sub-classes (PDS adjacent Normal, IDS good response adjacent Normal, IDS poor response Normal and Non-cancer ovary) as true negatives (Normal) and the 3 images from the 3 tumor sub-classes (PDS Tumor, IDS good response Tumor, IDS poor response Tumor) as true positives (Cancer), whereas the other models could not recognise all 7 images accurately, resulting in lower validation accuracies.
‘Normal’ and ‘Cancer’ have been annotated as 0 and 1 respectively in the confusion matrices. These images were completely unknown to the model, and the clinical details were not revealed to the person performing the tests, ensuring unbiased validation and impartial selection of the most accurate model based on performance. The matrices were used as score cards to evaluate model performance, and the deep hybrid learners were found to be the best working models for this specific problem (Figure 8). For this research work, the choice of the ideal model architecture depended on two main criteria: generalization and efficiency. A model is generalized when it is not over-fitting on the training data and its evaluation scores are consistent across the training, validation and test sets. From the above results, the Deep Hybrid Learners (both the Random Forest and the XGBoost variant) showed almost consistent results across the training, validation and testing phases. The model was also extremely efficient with low variance: the AUC scores on the training, validation and test datasets were 0.99, 0.99 and 1.0 respectively (Table 1). The conventional deep learning model trained from scratch without transfer learning had high training scores but showed high variance, as the scores were much lower on the validation and test sets. This indicated that the model was over-fitting on the training data and was not generalized, hence its poor performance on the validation and test datasets. This behaviour could be explained by our earlier hypothesis that the dataset used for this research work was not favourable for a conventional deep learning approach, which would require more training samples to learn and improve generalization.
Hence, a more sophisticated approach like Deep Hybrid Learning, which uses a CNN for feature extraction and classical machine learning algorithms for the final classification, was more efficient and robust for this type of microscopic image dataset. We also applied transfer learning37 with more sophisticated deep learning architectures, namely DenseNet20131, ResNet50, InceptionNetv3 and VGG1632, but the results showed over-fitting, lack of generalization and much lower model efficiency than the Deep Hybrid Learners. One plausible reason could be that these transfer learning models were trained using pretrained weights from ImageNet, whose images differ significantly, and may have a significant statistical difference, from microscopic images, making the transfer learning approach ineffective in this case.
Thus, we can conclude that our Deep Hybrid Learning approach was successful and performed much better than the other deep learning algorithms on these types of microscopic image datasets for automated detection of ovarian cancer.
Deep Hybrid Learning
The performance of the trained and hyper-parameter-tuned model on the test set showed how well the model generalized, without any unwanted bias. Since an imbalanced dataset is not suitable for building highly accurate, generalized supervised classifiers with classical deep learning models, we devised the Deep Hybrid Learning (DHL) algorithm, which uses a deep convolutional neural network to extract features from the pre-processed samples and feeds the extracted feature vectors to classical machine learning algorithms such as Random Forest35 and XGBoost36 to build the final classifier. Ensemble learning techniques such as boosting are known to work well with high dimensional data, combining weak learners that identify the "hard" data points into a very strong and efficient classifier38. Similarly, ensemble methods like Random Forests work very well on smaller but high dimensional datasets for binary classification problems and are known to produce generalized results35. The results obtained with the Deep Hybrid Learning approaches were extremely promising (Figure 8, Table 1): so far they have performed much better than the conventional approaches and are comparable to, or even better than, human-level performance on this classification problem.
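The DHL pipeline can be sketched as follows. The CNN feature extractor is mocked here by a fixed random projection so the example stays self-contained (in the paper it is the output of the trained 21-layered CNN), and all data are synthetic; only the structure (deep features in, ensemble classifier out) mirrors the approach.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def extract_features(images, W):
    """Stand-in for the CNN feature extractor: flatten each image and
    apply a ReLU-style projection, as a deep network's hidden layer would."""
    flat = images.reshape(len(images), -1)
    return np.maximum(flat @ W, 0.0)

images = rng.normal(size=(200, 8, 8))                  # toy "images"
labels = (images.mean(axis=(1, 2)) > 0).astype(int)    # toy labels
W = rng.normal(size=(64, 16))                          # mock learned weights

feats = extract_features(images, W)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(feats, labels)
# the fitted forest replaces the CNN's fully connected classification head
```

Swapping `RandomForestClassifier` for an XGBoost classifier at the last step yields the other DHL variant compared in Table 1; the feature-extraction stage is unchanged.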
Table 1: Comparison of model evaluation metrics