4.1 Overview
In this section we describe the HSI image dataset, the machine learning task and training process, and our performance evaluation setup. The models are trained to recognize the tissue type at any spatial location within HSI images, among a set of pre-defined tissue types. We solve this problem by training state-of-the-art models using supervised learning, which involves two stages. First is the training stage, where the model parameters are automatically optimized to maximize recognition performance on a training dataset with known tissue classes provided by an expert surgeon. Second is the inference stage, where predictive performance is evaluated with LOPOCV on HSIs of subjects that are not present in the training dataset. This evaluation measures the performance and generalization capability of the model.
4.3 Animal Characteristics
In the present study 8 adult pigs (Large White) were included. This experiment was part of the ELIOS project (Endoscopic Luminescent Imaging for Oncology Surgery), approved by the local Ethical Committee on Animal Experimentation (ICOMETH No. 38.2016.01.085), and by the French Ministry of Superior Education and Research (MESR) (APAFIS#8721-2017013010316298-v2). All animals were managed according to French laws for animal use and care, and according to the directives of the European Community Council (2010/63/EU) and ARRIVE guidelines[33].
A 24-hour preoperative fast with free access to water was observed. Premedication was administered 10 minutes preoperatively with an intramuscular injection of ketamine (20 mg/kg) and azaperone (2 mg/kg) (Stresnil, Janssen-Cilag, Belgium). Anesthesia was induced with intravenous propofol (3 mg/kg) combined with rocuronium (0.8 mg/kg), and maintained with 2% isoflurane. At the end of the procedures, the pigs were sacrificed under 5% isoflurane anesthesia with an intravenous injection of pentobarbital sodium (40 mg/kg) (Exagon®, AXIENCE, France).
4.3 Surgical Procedure and Hyperspectral Data Acquisition
A mid-line neck incision was performed, and the neurovascular bundle of the neck was carefully dissected. The vagal nerve, common carotid artery, and common jugular vein were then isolated bilaterally, and the nerves were marked using a 3-0 polypropylene suture thread. The surgical scene was stabilized using self-retaining retractors. The hyperspectral imager used during the experiment was a compact push-broom scanning complementary metal–oxide–semiconductor (CMOS) system (TIVITA®, Diaspective Vision GmbH, Germany) with a spatial resolution of 640x476 pixels, operating in a spectral range from 500 to 1000 nm (5 nm spectral resolution). The camera was placed approximately 45 cm from the surgical field for each acquisition. To avoid any potential bias, environmental lights were switched off and ventilation was paused for the few seconds required for each acquisition (<10 seconds).
4.4 Imaging Postprocessing and Data Annotation
Immediately after each acquisition, in the OR and with the surgical scene still available, the operating surgeon (MB) manually annotated the resulting RGB image using image manipulation software (GIMP, GNU Image Manipulation Program, open source). The annotated classes were vagal nerve, jugular vein, carotid artery, subcutaneous fat, muscle, skin, and metal of the retractor. The RGB images for each subject, which are provided by the camera device and synthesized from the HSI, are shown in Figure 2 (a). As visible in the pictures, the nerves and vessels are rather small and difficult to distinguish from the RGB images alone. For this reason, the annotation was carried out by the operating surgeon while still looking at the surgical field, in which the structures were well distinguishable given the precise anatomical position of the dissected elements of the neck neurovascular bundle (carotid artery, jugular vein, and vagal nerve). These images are visualized in greyscale with annotations overlaid as colored regions in Figure 2 (b), cropped and oriented similarly for better visualization. It is not feasible to correctly annotate every pixel in each image; consequently, the annotation of each class is a subset of the pixels for which the human annotator is certain of the class type. The machine learning models are trained and evaluated using only the annotated regions. One of the primary challenges is that the HSI camera has a limited resolution of 640 x 476 pixels, which makes thin structures (in particular nerves) challenging to recognize.
4.5 Image Pre-processing and Spectral Curve Distributions
The HSI images are calibrated by the camera device to account for the illumination source and the dark current of the sensor; therefore, no additional intensity calibration was required as a pre-processing step. The images have 100 wavelength bands in the range 500 nm (visible green light) to 1000 nm (near infra-red). Normalization techniques are commonly applied as a pre-processing stage to reduce spectral variability caused by tissue morphology effects (changes in illumination angle and non-flat structures) [34, 35]. The most popular normalization technique, Standard Normal Variate (SNV), was tested: each individual spectral curve is transformed to have zero mean and unit standard deviation. SNV makes a spectral curve invariant to linear transforms caused by, e.g., surface orientation variability. However, SNV may adversely affect recognition performance because of the potential loss of discriminative information. We therefore trained models both with and without SNV normalization to assess its benefits and limits.
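SNV normalization as described above can be sketched in a few lines of NumPy (the toy spectrum below is illustrative, not from our dataset); the assertion shows the invariance to a linear gain/offset transform:

```python
import numpy as np

def snv(spectrum):
    """Standard Normal Variate: zero mean, unit standard deviation per curve."""
    s = np.asarray(spectrum, dtype=float)
    return (s - s.mean()) / s.std()

# A linear transform (gain and offset) of a curve is removed by SNV,
# mimicking e.g. a surface-orientation effect on measured intensity.
curve = np.linspace(0.2, 0.8, 100)   # toy 100-band spectrum
scaled = 1.7 * curve + 0.3
assert np.allclose(snv(curve), snv(scaled))
```

Because the transform is applied per curve, it is independent of the other pixels in the image.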
Before applying machine learning, a qualitative analysis of the spectral curves of each tissue class was performed to understand the intrinsic difficulty of the recognition problem. These curves are plotted in Figure 5 and Figure 6, which show the spectral curve distributions for each class with and without SNV normalization. Classes that are easier for a machine to recognize have two general characteristics: i) low intra-class variation (the spectral distribution is similar within the same class) and ii) high inter-class variation (the spectral distribution differs between classes). Intra-class variation is primarily caused by differences in the same tissue class across subjects. In Figure 5 and Figure 6, intra-class variation is represented by the spectral curve spread (one standard deviation from the mean, illustrated by the grey zone), and inter-class variation is visualized by differences in the mean spectral curves (solid black lines). Inter-class variability is tissue specific: vein and skin are the most dissimilar tissues, whereas muscle, nerve, and fat are relatively similar, indicating that recognizing them with machine learning is not trivial. The metal class has a strongly different profile compared to the tissue classes; its large intra-class variation can be explained by strong specular reflection.
SNV normalization clearly reduces intra-class variability, but it also reduces inter-class variability. The mean spectral curves of all classes are plotted together in the right-most graphs of Figure 6, without and with SNV normalization (top and bottom, respectively).
4.6 Machine Learning Recognition Problem
Given a hyperspectral image, for each spatial coordinate (pixel), a sub-volume centered on the pixel is extracted, which is then input to a predictive machine learning model. This model produces a score for each tissue class, where a higher score indicates more likelihood of the tissue class being present at the spatial coordinate. Finally, the pixel is assigned to the class with the highest predictive score. This process repeats for each pixel of interest. This machine learning recognition problem is illustrated in Figure 7.
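The per-pixel recognition process above can be sketched as follows. The `toy_model` and the patch size are placeholders for illustration; any model that maps a sub-volume to per-class scores fits this pattern:

```python
import numpy as np

def classify_pixel(hsi, row, col, model, patch=5):
    """Extract the patch x patch x bands sub-volume centred on (row, col)
    and return the class with the highest predictive score."""
    h = patch // 2
    sub = hsi[row - h:row + h + 1, col - h:col + h + 1, :]  # e.g. 5x5x100
    scores = model(sub)                # one score per tissue class
    return int(np.argmax(scores))      # pixel assigned to top-scoring class

# Toy stand-in model: score class k by the mean absorption in band k.
toy_model = lambda sub: [sub[..., k].mean() for k in range(7)]
hsi = np.random.rand(476, 640, 100)    # camera resolution from the text
label = classify_pixel(hsi, 100, 200, toy_model)
```

Repeating this over every pixel of interest yields a segmentation map of the surgical scene.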
4.7 Machine Learning Models
Various machine learning models have been applied to other HSI image classification problems, including medical imaging and remote sensing, and there is no single model that performs best on all datasets[36, 37]. Consequently, the two most successful models for HSI image segmentation are studied in this work: Support Vector Machines (SVMs)[37-39] and Convolutional Neural Networks (CNNs)[36, 40].
4.8 Support Vector Machine (SVM)
SVMs have been shown to work well for HSI classification problems, in particular remote sensing[41-43]. They are simple to train, and they offer good performance when the training set size is limited[37] and when the input data is high-dimensional, as is the case for our dataset. An SVM classifier fits a decision boundary (hyperplane) in a multidimensional feature space. The feature space is constructed by transforming the data points using a kernel, such as the radial basis function. Training then determines a hyperplane that separates the classes in feature space. Once trained, a new spectral curve at a given pixel is first transformed to feature space and then classified using the hyperplane. SVMs generalize naturally to multi-class classification problems such as ours, and they have been shown to outperform other classical machine learning models in HSI classification problems, such as k-nearest neighbor (KNN) and decision trees[44]. We construct the feature space using the relative absorption of each wavelength in the HSI sub-volume centered at a given pixel. Consequently, the feature space has 100 dimensions (one dimension per wavelength).
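As a concrete illustration (with synthetic Gaussian "spectra" standing in for our data), the following sketch fits a scikit-learn SVC with the default RBF kernel to 100-dimensional feature vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic "tissue" classes of 100-band spectra (hypothetical values).
rng = np.random.default_rng(0)
class_a = rng.normal(0.3, 0.05, size=(50, 100))
class_b = rng.normal(0.7, 0.05, size=(50, 100))
X = np.vstack([class_a, class_b])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf", C=1.0)   # RBF kernel, C=1: sklearn's defaults
clf.fit(X, y)

# Classify a new spectrum drawn from the first class distribution.
pred = clf.predict(rng.normal(0.3, 0.05, size=(1, 100)))
```

Multi-class support (one-vs-one internally) comes for free with `SVC`, so extending this to our seven classes requires no extra code.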
4.9 Convolutional Neural Network (CNN)
The second model is a CNN, which has been shown to work well with many kinds of image data including HSI[36, 40, 45, 46]. A CNN is a special kind of feed-forward neural network in which neurons are arranged in a series of hidden layers. The image is fed into the first layer, whose purpose is to extract meaningful features that aid classification. These features are fed into the second layer, where higher-level features are computed from them, and the process repeats until the last layer. Finally, a classification decision is made using the features at the last layer. CNNs are specially designed so that the features are extracted using convolutional filters. The filter weights are trainable parameters of the model, and during learning they automatically adapt to extract the features most pertinent to the specific classification task.
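A minimal PyTorch sketch of this idea is shown below. The layer sizes here are hypothetical and much smaller than the actual architecture used in this work; the sketch only illustrates the structure of stacked convolutional feature extractors followed by a fully connected classifier over 5x5x100 sub-volumes:

```python
import torch
import torch.nn as nn

class TissueCNN(nn.Module):
    """Illustrative 2D-CNN over 5x5 spatial patches with 100 spectral bands
    treated as input channels (layer widths are hypothetical)."""
    def __init__(self, bands=100, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(bands, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Linear(16 * 5 * 5, n_classes)

    def forward(self, x):                     # x: (batch, 100, 5, 5)
        f = self.features(x)                  # convolutional feature maps
        return self.classifier(f.flatten(1))  # one score per class

model = TissueCNN()
scores = model(torch.randn(4, 100, 5, 5))     # batch of 4 sub-volumes
pred = scores.argmax(dim=1)                   # highest score wins
```

During training, backpropagation adapts the convolutional filter weights so the extracted features discriminate the tissue classes.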
4.10 Implementation and Training
For each HS image in the dataset, we sample a sub-volume centered on every annotated pixel. Spatial patches of 5x5 pixels are extracted to form 5x5x100 sub-volumes, where the third dimension corresponds to the 100 wavelength bands. The resulting dataset comprises 213,203 samples.
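The sub-volume sampling step might look as follows; the annotation-mask convention (`-1` for unlabelled pixels, class id otherwise) is an assumption made for this sketch:

```python
import numpy as np

def extract_subvolumes(hsi, annotated, patch=5):
    """Collect a patch x patch x bands sub-volume for every annotated pixel.

    hsi:       (H, W, bands) calibrated hyperspectral cube
    annotated: (H, W) integer mask, -1 = unlabelled, otherwise class id
    """
    h = patch // 2
    X, y = [], []
    rows, cols = np.nonzero(annotated >= 0)
    for r, c in zip(rows, cols):
        # Skip pixels too close to the border for a full patch.
        if h <= r < hsi.shape[0] - h and h <= c < hsi.shape[1] - h:
            X.append(hsi[r - h:r + h + 1, c - h:c + h + 1, :])
            y.append(annotated[r, c])
    return np.stack(X), np.array(y)   # (N, 5, 5, 100) and (N,)
```

Applied to all eight subjects, this yields the sample/label pairs used for training and evaluation.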
These sub-volumes are then split into training and testing sets following the Leave-One-Patient-Out Cross-Validation (LOPOCV) process to train and evaluate model performance. LOPOCV is standard practice in medical image classification with small datasets, and it ensures that a model is never trained and tested on data from the same subject. Performance statistics with LOPOCV therefore measure the generalization performance of a classifier on new subjects. LOPOCV was performed by partitioning the data into 8 subsets, S1, S2, ..., S8, where each subset Si is formed of every sub-volume extracted from all subject HS images excluding the ith subject. The models are trained on subset Si and then tested on the ith subject's HS image. The process is repeated 8 times, once for each subject.
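This splitting scheme corresponds to scikit-learn's `LeaveOneGroupOut`, with the subject id as the group label (the data below are random placeholders):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical data: subject id (1..8) attached to each sub-volume.
N = 40
subjects = np.repeat(np.arange(1, 9), N // 8)
X = np.random.rand(N, 100)
y = np.random.randint(0, 7, size=N)

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=subjects):
    held_out = np.unique(subjects[test_idx])
    assert len(held_out) == 1                     # one test subject per fold
    assert held_out[0] not in subjects[train_idx]  # never seen in training
```

Each of the 8 folds trains on seven subjects and tests on the held-out one, exactly the LOPOCV protocol described above.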
The SVM classifier was implemented with Python's Scikit-learn library[47]. Auto-scaling was applied to each spectral band (unit variance and mean centering)[48] to eliminate the influence of variable dimension. Default SVM parameters were used from Python's sklearn.svm.SVC class (Radial Basis Function (RBF) kernel with regularization parameter C=1). The CNN was implemented in PyTorch v1.2 using an established neural network architecture[45] with a publicly available implementation. The CNN processes a sub-volume centered around a given pixel using a 5x5 spatial window, and it has 32,628 trainable parameters in 7 hidden layers. The first 6 hidden layers are convolutional and the last is fully connected. The final output layer has 7 neurons, where 7 is the number of classes (6 tissue classes and the metal class). Each output neuron computes the predictive score of its class, and the class with the highest score is selected. The architecture of the CNN is provided in Figure 8 for full reproducibility. The CNN is trained using PyTorch's backpropagation implementation with Adam optimization (learning rate 0.001)[49]. Because the dataset is highly imbalanced, with some classes significantly more represented than others, training used a weighted cross-entropy loss with weights inversely proportional to the number of training samples per class. This is standard practice to prevent the classifier from maximizing performance only for the most represented class (in our case, the skin class).
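The class-weighting scheme maps directly onto PyTorch's `CrossEntropyLoss` `weight` argument; the per-class counts below are invented for illustration (the real dataset is skin-dominated):

```python
import torch
import torch.nn as nn

# Hypothetical per-class training sample counts for the 7 classes.
counts = torch.tensor([500., 400., 300., 2000., 1500., 9000., 600.])
weights = 1.0 / counts              # inversely proportional to class size
weights = weights / weights.sum()   # normalisation (optional)

loss_fn = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(8, 7)              # class scores for a batch of 8 pixels
targets = torch.randint(0, 7, (8,))     # their annotated class ids
loss = loss_fn(logits, targets)
```

With these weights, mistakes on rare classes (e.g. nerve) cost more than mistakes on abundant ones (e.g. skin), counteracting the imbalance.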
Training was performed on a DGX-1 (NVIDIA Corp.) equivalent deep learning server (Cisco Inc.), taking approximately 15 hours for the CNN and 25 hours for the SVM. Training was performed 8 times in order to implement LOPOCV: each training session used seven subjects' images for training and one for testing, so that every image was used exactly once as a test image.
4.11 Performance Metrics and Statistical Methods
Performance was evaluated with standard machine learning metrics, implemented in Python's scikit-learn (version 0.23, https://scikit-learn.org). Multi-class Type I and Type II errors are evaluated and presented as confusion matrices. The Area Under the Receiver Operating Characteristic curve (AUROC), sensitivity, specificity, and the Sørensen–Dice coefficient score (DCS) were used to evaluate class-specific performance using 'one versus all' comparisons and macro averaging. Statistically significant differences were computed with a two-tailed paired t-test (α=0.05), similarly to previous works in HSI classification analysis[30], using Excel (Microsoft Office 365, Microsoft, U.S.A.).
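For reference, the same metrics can be reproduced with scikit-learn and SciPy; all labels, scores, and per-fold values below are toy data, not our results (note that the one-vs-all Dice score coincides with the binary F1 score):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, f1_score
from scipy.stats import ttest_rel

# Toy ground-truth and predicted labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, y_pred)    # rows: true, columns: predicted

# One-vs-all Dice score for class 1 (equivalent to binary F1).
dice_class1 = f1_score(y_true == 1, y_pred == 1)

# Macro-averaged one-vs-rest AUROC from per-class scores
# (degenerate one-hot "probabilities", purely for illustration).
scores = np.eye(3)[y_pred]
auroc = roc_auc_score(y_true, scores, multi_class="ovr", average="macro")

# Two-tailed paired t-test on two models' per-fold metrics (toy values).
model_a = [0.81, 0.79, 0.85, 0.80, 0.83, 0.78, 0.82, 0.84]
model_b = [0.74, 0.71, 0.78, 0.70, 0.75, 0.69, 0.73, 0.76]
t_stat, p_value = ttest_rel(model_a, model_b)
```

Pairing the per-fold (per-subject) metrics in the t-test accounts for the fact that both models are evaluated on the same LOPOCV folds.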