Fig 1 shows the architecture of our modified co-learning technique based on ten Convolutional Neural Network (CNN) models. The purpose of each of the ten CNN models is to derive the image features that are most relevant to each specific image. Each CNN model takes a PET image and a CT image as input. The modified co-learning technique uses the modality-specific features produced by the ten CNN models to derive a spatially varying fusion map that weights the modality-specific features at different locations. Finally, the reconstruction component integrates the modality-specific fused features across multiple scales to produce the final prediction map.
A. Creation of modified co-learning technique
Let G = X * Y + c be the output feature map of a CNN model, where * is the convolution operation, Y is the input to the CNN model, X is the learned weight, and c is the learned bias. A batch normalization layer is used to normalize every output feature dimension of G to a distribution with zero mean and unit variance. The Leaky rectified linear unit (Leaky ReLU) activation function is applied after feature map normalization:

$$A_i(g) = \begin{cases} g, & g > 0 \\ i\,g, & g \leq 0 \end{cases}$$

where g is a normalized feature and i is a parameter controlling the 'leakiness' of the activation function, subject to the constraint 0 < i < 1. The Leaky ReLU activation avoids the dead neuron problem that can occur with the standard ReLU function, where some weights in X can be updated to a value at which their training gradients are forever stuck at 0, preventing those weights from ever being updated again. The parameter i introduces a small non-zero gradient when g < 0, thereby preventing the weights from becoming stuck at an unrecoverable value. For simplicity of notation, the output of a convolutional layer is denoted by G = A_i(X * Y + c), a feature map generated from Y after convolution, batch normalization, and activation.
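As a concrete illustration, the sketch below implements one such convolution-normalization-activation layer in PyTorch. The framework choice, the channel arguments, and the 3 × 3 kernel size are our assumptions for illustration, not details specified above.

```python
import torch.nn as nn

# A minimal sketch of one encoder layer G = A_i(X * Y + c):
# convolution, batch normalization, then Leaky ReLU.
class ConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels, leak=0.01):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=3, padding=1)  # X * Y + c
        self.bn = nn.BatchNorm2d(out_channels)           # zero mean, unit variance per channel
        self.act = nn.LeakyReLU(negative_slope=leak)     # A_i with i = leak, 0 < i < 1

    def forward(self, y):
        return self.act(self.bn(self.conv(y)))
```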
B. Modified co-learning technique based on ten CNN models
The modified co-learning technique contains two parts:
(i) a modified co-learning unit based on the CNN models, which learns to derive spatially varying fusion maps, and (ii) a fusion operation, which utilizes the fusion maps to prioritize different features. Fig 2 shows a conceptual example of the modified feature co-learning and fusion unit based on the ten CNN models. The inputs to the modified feature co-learning unit are two feature maps G_CT and G_PET (one from each modality-specific CNN model) of size w × h × c, with width w, height h, and c channels. These feature maps are stacked to form Y_multi of size w × h × m × c, where m = 2 is the number of modalities. The channels of Y_multi are then convolved with the channels of a learnable 3D kernel X_multi of size k × k × m, where k is the width and height of the kernel and m = 2 is again the number of modalities.
By performing this convolution without padding the modality dimension, we obtain, for a given channel c, a feature map with a singleton third dimension in which the value at location (a, b) is determined from the neighborhoods of both G_CT(a, b) and G_PET(a, b):

$$(X_{multi} * Y_{multi})(a, b) = \sum_{u=1}^{k} \sum_{v=1}^{k} \sum_{n=1}^{m} X_{multi}(u, v, n)\, Y_{multi}(a+u, b+v, n)$$
We then squeeze the singleton third dimension to obtain an output feature map X_multi * Y_multi of size w × h × 2c: the same width and height as the two modality-specific input feature maps G_CT and G_PET, with double the number of channels, which is necessary for weighting the modality-specific feature maps by the modified co-learned fusion maps.
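A minimal PyTorch sketch of this co-learning convolution follows, assuming NCHW tensors; the framework and the reading that the 2c output channels come from 2c learned kernels are our assumptions:

```python
import torch
import torch.nn as nn

# A sketch of the co-learning convolution. g_ct and g_pet each have shape
# (N, c, h, w); stacking along a new modality axis gives Y_multi of shape
# (N, c, m, h, w) with m = 2. A Conv3d kernel that spans the full modality
# depth, with no padding on that axis, leaves a singleton dimension that is
# then squeezed away.
class CoLearningConv(nn.Module):
    def __init__(self, c, k=3, m=2):
        super().__init__()
        self.conv3d = nn.Conv3d(
            in_channels=c, out_channels=2 * c,  # 2c kernels -> w x h x 2c output
            kernel_size=(m, k, k),              # spans both modalities at once
            padding=(0, k // 2, k // 2))        # pad space, never the modality axis

    def forward(self, g_ct, g_pet):
        y_multi = torch.stack([g_ct, g_pet], dim=2)  # (N, c, m=2, h, w)
        out = self.conv3d(y_multi)                   # (N, 2c, 1, h, w)
        return out.squeeze(2)                        # (N, 2c, h, w)
```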
The modified co-learned fusion map controls the level of importance given to information from each modality at each location, in contrast to the global fusion ratio used in PET-CT pixel intermixing [22]–[24]. The modified co-learned fusion maps therefore directly affect the input distribution of the learnable layers that immediately follow the modified co-learning unit. Hence, we do not normalize the output of the CNN model within the modified co-learning unit. As with the CNN models, we use a Leaky ReLU activation function to obtain the multi-modality modified co-learned fusion map:

$$G_{CNN} = A_i(X_{multi} * Y_{multi} + c_{multi})$$

where c_multi are the learned biases. The multi-modality fusion map G_CNN is thus produced by the modified co-learning unit based on the ten CNN models. The fusion operation then integrates the modality-specific feature maps according to the values (coefficients) in the multi-modality fusion map as follows:
$$G_{fused} = (G_{CT} \circledast G_{PET}) \odot G_{CNN}$$

where G_CNN is the modified co-learned fusion map, ⊛ is the stacking operation, and ⊙ is element-wise multiplication. This process merges the two modality-specific feature maps G_CT and G_PET and weights them by the modified co-learned fusion map, in a manner similar to pixel intermixing. Our modified co-learning technique based on ten CNN models generates fused feature maps, one for each PET-CT image pair.
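The sketch below combines these pieces into a fusion unit, reusing the hypothetical CoLearningConv module from the previous sketch; note that the fusion map receives only a Leaky ReLU, with no batch normalization:

```python
import torch
import torch.nn as nn

# A sketch of the fusion operation G_fused = (G_CT (*) G_PET) (.) G_CNN.
class FusionUnit(nn.Module):
    def __init__(self, c, k=3, leak=0.01):
        super().__init__()
        self.colearn = CoLearningConv(c, k)           # hypothetical module, see above
        self.act = nn.LeakyReLU(negative_slope=leak)  # no batch norm inside this unit

    def forward(self, g_ct, g_pet):
        g_cnn = self.act(self.colearn(g_ct, g_pet))  # fusion map, (N, 2c, h, w)
        stacked = torch.cat([g_ct, g_pet], dim=1)    # stacking: (N, 2c, h, w)
        return stacked * g_cnn                       # element-wise weighting
```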
C. Reconstruction
The reconstruction part of our CNN creates a prediction map of the ROIs within the PET-CT image. It does this by integrating the modified co-learned feature maps from each of the ten CNN models. The concept behind the reconstruction block is to generate higher-dimensional feature maps that better correspond to the features of different ROIs, by merging lower-dimensional information with features that were fused from multiple image modalities. As with the modality-specific encoders, we use batch normalization [20] and Leaky ReLU [21] activations. After the last reconstruction block, the output feature map has the same width and height as the input PET-CT image, with 128 channels in the third dimension. This is analogous to a final 128-dimensional feature vector for each pixel in the original image. The ten CNN models utilize a 1 × 1 convolution to map these feature vectors into R + 1 feature maps, where R is the number of ROIs. Finally, these activations are transformed, using the softmax function [25], into a probability (prediction) map corresponding to the likelihood of each pixel belonging to a particular class:

$$Q_j(p) = \frac{e^{p_j}}{\sum_{r=1}^{R+1} e^{p_r}}$$
where Q_j(p) is the probability that the pixel with observation vector p belongs to region j, and p_j is the j-th element of p, i.e., the activation corresponding to region j.
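A minimal PyTorch sketch of this prediction head follows; num_rois and the spatial size are illustrative values, not details from the text:

```python
import torch
import torch.nn as nn

num_rois = 4  # R, an illustrative value
head = nn.Conv2d(128, num_rois + 1, kernel_size=1)  # 1x1 conv to R + 1 maps

features = torch.randn(1, 128, 256, 256)  # output of the last reconstruction block
logits = head(features)                   # per-pixel activations p, (N, R + 1, h, w)
probs = torch.softmax(logits, dim=1)      # Q_j(p): softmax over the class dimension
```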