In this study, we present SCC-Net, a new convolutional neural network designed specifically for the automatic segmentation of head and neck SCC. The proposed method successfully segmented the entire UADT, including the oral cavity, oropharynx, hypopharynx, and larynx, using a single model with high mIOU and DSC.
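For reference, both reported metrics quantify the overlap between the predicted tumor mask and the expert-annotated ground truth. The following is a minimal, illustrative sketch of how IoU and DSC are conventionally computed from binary masks; it is not the exact evaluation code used in this study.

```python
import numpy as np

def iou_and_dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7):
    """Overlap metrics for binary segmentation masks (True = tumor pixel)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = intersection / (union + eps)                       # |P ∩ G| / |P ∪ G|
    dice = 2 * intersection / (pred.sum() + gt.sum() + eps)  # 2|P ∩ G| / (|P| + |G|)
    return iou, dice

# mIOU is then the mean IoU taken over images (or over classes):
# miou = np.mean([iou_and_dice(p, g)[0] for p, g in zip(preds, gts)])
```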
Endoscopy plays a pivotal role in the diagnosis and management of head and neck cancer. However, applying deep neural networks to endoscopic image analysis poses significant challenges due to variable viewing angles, the presence of bubbles and fluid, optical artifacts such as light reflections and shadows, and variability in image quality, including inadequate resolution, lack of sharpness, and variations in RGB resolution. A robust training dataset encompassing these obstacles could address them, but no diverse and representative dataset capturing the variability of oral cancer cases has been published. A systematic review examined 332 published articles and datasets on oral cancer image analysis and identified only one publicly available oral cancer image dataset.[25] This scarcity contrasts sharply with the abundance of colonoscopy and panendoscopy datasets, which have been widely used in artificial intelligence research precisely because of their availability (Table 3). The lack of large oral cancer datasets poses a significant challenge for deep neural network training in this domain. Efforts to bridge this gap are crucial to unlock the full potential of artificial intelligence in oral cancer management, enabling more accurate and efficient diagnosis, treatment planning, and patient monitoring.
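Short of collecting a fully representative dataset, one common way to harden a model against such artifacts is aggressive photometric and geometric augmentation during training. The sketch below uses the open-source albumentations library; the specific transforms and parameters are illustrative assumptions, not the augmentation pipeline of this study.

```python
import albumentations as A

# Illustrative augmentations approximating endoscopic nuisances:
# variable viewing angles, illumination shifts, blur, and low resolution.
train_transform = A.Compose([
    A.Rotate(limit=30, p=0.5),          # variable scope angles
    A.RandomBrightnessContrast(p=0.5),  # illumination variability
    A.HueSaturationValue(p=0.3),        # RGB/color variation
    A.GaussianBlur(p=0.3),              # lack of sharpness
    A.Downscale(p=0.2),                 # inadequate resolution
])

# Applied jointly so the tumor mask stays aligned with the image:
# augmented = train_transform(image=image, mask=mask)
```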
Table 3. Openly available clinical photo datasets

| Domain | Dataset |
|---|---|
| Panendoscopy | Esophageal Endoscopy Images |
| | Capsule endoscopy dataset |
| | KVASIR, N=8,000 |
| | ERS, N=6,000 |
| Glottis | Benchmark for Automatic Glottis Segmentation (BAGLS), N=59,250 |
| Colonoscopy | MICCAI 2017 |
| | CVC colon DB, N=612 |
| | SUN-SEG, N=158,690 |
| | GLRC dataset |
| | KUMC dataset |
| Oral cancer | SHIVAM BAROT Oral Cancer images (tongue & lip), N=87 |
Segmentation is a crucial prerequisite for autonomous diagnosis and for various computer- and robot-aided interventions. However, head and neck cancer segmentation from endoscopic images is considered a challenging task because tumor size and shape vary widely among patients. Although oropharyngeal and glottic tumors are rare, the miss rate for smaller tumors during endoscopic examination, particularly in the oropharynx and hypopharynx, is another issue that needs to be addressed. An automatic algorithm that segments malignant lesions during endoscopic examination could therefore aid diagnosis, especially during pharyngeal examination.
In recent years, convolutional neural network (CNN)-based classification of endoscopic images has garnered significant attention. Song et al. developed a smartphone-based system for automatic classification of oral cavity (OC) lesions using CNNs, achieving an accuracy of 87%, sensitivity of 85%, and specificity of 89% by evaluating dual-modality images (white light and autofluorescence).[26] Mascharak et al. used naïve Bayesian classifiers trained on low-level image features to automatically detect and quantitatively analyze oropharyngeal SCC in narrow-band imaging (NBI) multispectral data, demonstrating higher diagnostic accuracy than conventional white-light video-endoscopy.[27] Ren et al. collected a large dataset of 24,667 laryngoscopic images and trained a CNN-based classifier that outperformed clinical visual assessment by 12 otolaryngologists, achieving an overall accuracy of 96.24%.[28] Inaba et al. employed RetinaNet to detect superficial laryngopharyngeal cancer against normal pharyngeal mucosa, achieving an accuracy of 97.3%.[29] Kono et al. used a combination of 1,243 white-light images and 3,316 NBI images to train a Mask R-CNN model for cancer detection, yielding sensitivity, specificity, PPV, NPV, and accuracy of 92%, 47%, 55%, 89%, and 66%, respectively.[30] Heo et al. trained 12 CNN classification algorithms on 5,576 tongue endoscopic images, including 1,941 pathologically proven cancer lesions; the deep learning model achieved an accuracy of 84.7%, while general physicians and oncology specialists achieved 75.9% and 91.2%, respectively.[10] Recently, Flügge et al. used a Swin Transformer-based deep learning approach to automatically detect oral SCC on clinical photographs; with a classification accuracy of 0.986 and an AUC of 0.99, the method shows promise for assisting clinicians in the early detection of oral cancer.[31]
Despite the successful detection and classification of oral SCC, publications on segmentation of oral SCC in clinical photographs remain scarce. This study represents a significant advancement as the first attempt to validate a deep learning segmentation model capable of achieving accurate results across the oral cavity, oropharynx, hypopharynx, and glottis in SCC. To date, only two publications have described deep learning segmentation of the UADT. Paderno et al. analyzed 34 and 45 narrow-band imaging endoscopic videos of oral cavity and oropharynx lesions, respectively, and reported a DSC of 0.7603.[13] Muhammad et al. published results from a novel deep learning segmentation model (SegMENT), reporting segmentation of laryngeal SCC with median values of 0.68 for intersection over union (IoU), 0.81 for dice similarity coefficient (DSC), 0.95 for recall, 0.78 for precision, and 0.97 for accuracy, with additional results for oral cavity and oropharynx SCC shown in Table 4. Our proposed SCC-Net demonstrated superior mIOU and recall compared with all previous deep learning results. However, its precision did not perform as well, for which we offer two possible explanations. First, because the model is intended as a screening tool in which high sensitivity (recall) is crucial, precision (positive predictive value) may have been compromised to avoid missing potential lesions. Second, our dataset consisted primarily of oral cavity photographs encompassing anatomically complex subsites such as the lip, tongue, buccal mucosa, gingiva, retromolar trigone, and hard palate; precision may have been affected by variability across these subsites. This observation aligns with previously reported results, in which the oral cavity dataset exhibited a poor precision of 0.602. Nevertheless, surgical management of SCC requires an adequate safety margin for excision; a predicted mask slightly larger than the ground truth should therefore not pose a risk of inadequate excision if used for surgical planning.
Table 4. Comparison of head and neck squamous cell carcinoma segmentation performance

| Publication | Task | mIOU | DSC | Recall | Precision |
|---|---|---|---|---|---|
| Paderno [13] | Laryngeal SCC NBI lesion | | 0.7603 | | |
| Muhammad [32] | Laryngeal SCC | 0.686 | 0.814 | 0.951 | 0.785 |
| Muhammad [32] | OCSCC | 0.749 | 0.598 | 0.905 | 0.602 |
| Muhammad [32] | OPSCC | 0.784 | 0.879 | 0.907 | 0.933 |
| SCC-Net | OC/OP/HP/Larynx SCC | 0.872 | 0.868 | 0.9715 | 0.69 |

SCC: squamous cell carcinoma; OC: oral cavity; OP: oropharynx; HP: hypopharynx
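The screening-oriented tradeoff discussed above can be made concrete: a pixel-wise segmentation network typically outputs a tumor probability map that is binarized at a decision threshold, and lowering that threshold raises recall at the expense of precision. The following is a minimal, hypothetical sketch of this behavior, not SCC-Net's actual post-processing.

```python
import numpy as np

def precision_recall(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7):
    """Pixel-level precision and recall for binary masks."""
    tp = np.logical_and(pred, gt).sum()
    return tp / (pred.sum() + eps), tp / (gt.sum() + eps)

# Lowering the threshold marks more pixels as tumor, trading precision
# for recall -- the desirable direction for a screening tool.
# prob_map: per-pixel tumor probabilities output by the network.
# for t in (0.5, 0.3, 0.1):
#     p, r = precision_recall(prob_map >= t, gt_mask)
```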
In the era of multidisciplinary diagnosis and treatment for cancer patients, those with head and neck cancer routinely undergo check-ups across multiple subspecialties. While CT and MRI have been the gold standard for cancer staging, clear gross images of mucosal lesions remain indispensable for clinical decisions regarding surgical and adjuvant treatment. Our study proposes a new method for oral cancer image segmentation that combines neural architecture search (NAS) with U-Net for the automatic segmentation of tumors in the oral cavity, oropharynx, hypopharynx, and glottis. This represents the largest cohort of pathologically confirmed cancer segmentation from endoscopic photographs to date.
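For readers unfamiliar with the backbone topology, the sketch below shows a minimal U-Net-style encoder-decoder in PyTorch. It is purely illustrative and is not SCC-Net itself: in a NAS-based design the operations inside each block are selected from a search space rather than fixed as the plain 3x3 convolutions used here, and the deployed network is deeper and wider.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # In a NAS setting, these fixed 3x3 convolutions would be replaced
    # by operations chosen by the architecture search.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Two-level U-Net: contracting path, bottleneck, expanding path with skip connections."""
    def __init__(self, in_ch=3, num_classes=1, base=32):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return torch.sigmoid(self.head(d1))  # per-pixel tumor probability map
```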
Our current study has several limitations. First, the datasets were relatively small and partially unbalanced across patients, implying a high level of variability. In addition, the ground truth was defined through secondary review opinions. Visual confounders such as shadows, reflections, blurriness, and varying illumination can adversely affect the performance of convolutional neural networks (CNNs) and the quality of segmented oral tumors; we observed that all CNNs tended to detect malignant areas more readily under certain illumination conditions. Independent annotation by multiple experts may yield a more accurate definition of endoscopic tumor margins.