Our proposal focuses on enhancing the partial volumetric representation, which is inherently limited compared to the original volumetric information, by utilizing the relational information among data. In contrast to existing works, which simply expose partial 3D spatial information to the model through basic slice extraction16,17 and feature combination19,20 without addressing how this information is utilized in lesion analysis, our method provides the relationships among data samples, derived from the 3D spatial information, as a prior. We term this approach of determining the input for the 2D student model 'partial input restriction'. Figure 2 provides a comprehensive schematic illustration of the experiment and the model's structure. In the following sections, we discuss in detail the 2D projections we adopted and the structure used for 3D-to-2D KD.
2D projection alteration for 3D-to-2D KD
To demonstrate the effectiveness of 3D-to-2D KD, we prepared the model's input information through the following process. From the given 3D volumetric image, we re-sample the striatal region at its median level, as pre-defined by slice indices, along the three axes (axial, coronal, and sagittal). When adjacent slices are included, we acquire one neighboring slice on each side of the median slice19. We refer to this process as 'slice extraction'. The volumetric features contained in the resulting images are handled in three major ways:
Single slice as input: The extracted slices are used directly as non-i.i.d. input data for training the diagnostic model. In this case, to prevent data leakage caused by multiple slices extracted from one sample being distributed across the training, validation, and test sets, a subject-level data split is performed before conducting slice extraction36.
Aggregated slices with early fusion (EF): The extracted slices are assigned to the RGB bands (i.e., channel-level concatenation), thereby conveying thicker volume information to the model than a single slice. This method allows for a more comprehensive representation of the volumetric features within the model19.
Aggregated feature with joint fusion (JF): As in EF, the extracted adjacent slices for each plane are assigned to the RGB bands. Each plane image is then fed into a 2D CNN to encode plane-wise features. These plane-wise features are concatenated along the channel dimension and passed through a Feed-Forward Network (FFN) to encode a comprehensive volumetric feature. This process integrates detailed spatial information from the different planes, resulting in a more robust representation of the volumetric characteristics in the model20.
In previous studies, different network parameters are used for each plane to learn plane-wise features. In our approach, however, we designed the model to share parameters across planes, recognizing that the striatal patterns appearing in each plane are homogeneous targets for the prediction model to capture. This design choice was also driven by the need to minimize model parameters in our experimental setup. A detailed explanation is provided in the '2D Student Network for diagnostic system' section. The shared-parameter strategy not only reduces the model's complexity but also ensures a more consistent learning process across the planes.
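To make the shared-parameter design concrete, the following is a minimal PyTorch sketch of joint fusion with a single backbone shared across the three planes, assuming a torchvision ResNet18; the class name, feature dimension, and two-layer FFN head are illustrative choices rather than the exact implementation:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


class PlaneSharedJF(nn.Module):
    """Joint fusion with a single 2D backbone shared across the three planes.

    Each plane input is a 3-channel stack (median slice plus its neighbors);
    the same ResNet18 parameters encode all planes, and an FFN fuses the
    concatenated plane-wise features into a diagnostic prediction.
    """

    def __init__(self, feat_dim: int = 512, num_classes: int = 2):
        super().__init__()
        backbone = resnet18(weights=None)   # trained from scratch
        backbone.fc = nn.Identity()         # expose the 512-d pooled feature
        self.backbone = backbone           # one parameter set for all planes
        self.ffn = nn.Sequential(
            nn.Linear(3 * feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, num_classes),
        )

    def forward(self, axial, coronal, sagittal):
        # Encode every plane with the SAME backbone (shared parameters).
        feats = [self.backbone(p) for p in (axial, coronal, sagittal)]
        f = torch.cat(feats, dim=1)        # concatenated plane-wise feature
        return self.ffn(f), f              # prediction and feature for KD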
Automatic projection via rank pooling
In previous studies applying the 2D projection method, particularly for AD diagnosis, researchers heuristically determined the representative slice based on the findings to be observed, such as cortical atrophy in structural MRI or amyloid plaque load in beta-amyloid PET. These findings are expected to appear across multiple regions, making it easier for researchers to select slices arbitrarily. However, in cases like detecting cerebral malignancies or PD, where the lesion areas to be observed are localized, manually extracting diagnostic slices involves significant time and effort, and the selection of representative slices can be subjectively influenced by the researcher. To overcome this, we propose a method that treats a 3D volumetric image as a dynamic image whose scenes change along an arbitrary axis, thereby summarizing 3D spatial information without practitioner intervention. This approach achieves automatic 2D projection, reducing both the subjective bias and the manual workload of selecting representative slices.
Bilen et al.37 initially proposed the concept of a 'dynamic image' as a compact representation for video analysis. A dynamic image is obtained by applying rank pooling to the frames of a video, which effectively turns standard 2D CNNs into dynamic-aware models through fine-tuning on video data. Prior research compares the construction operations of the rank-pooler, contrasting approximated rank-pooling with modified rank-pooling, which directly ranks the feature frames. These studies report that while modified rank-pooling is about 45 times slower than approximated rank-pooling, it yields approximately 3% higher accuracy. Therefore, we adopted modified rank-pooling as our automatic projection technique to better encapsulate the rich volumetric information during condensation. First, given \(N\) volumetric images \({x}_{T}\in {\mathbb{R}}^{D\times H\times W}\), \({X}_{T}=\{{x}_{T1},\dots ,{x}_{TN}\}\), each containing \(D\) slices, we enumerate the slices \({I}_{1},{I}_{2},\cdots ,{I}_{D}\) of each volume \({x}_{Ti}\) \((i=1,\dots ,N)\) along a chosen axis.
$$\widehat{\rho }\left( {I}_{1},{I}_{2}, \cdots , {I}_{D} \right)= \sum _{d=1}^{D}{\alpha }_{d}{I}_{d}$$
1
$${\alpha }_{d}=2d-D-1$$
2
Rank-pooling operations, as described in Eq. (1), produce a dynamic image by multiplying each slice by a coefficient and summing the weighted slices. In modified rank-pooling, \(\rho\) is a function that scores a sequence of \(D\) slices by reflecting their rank and maps it to a single value. In other words, the optimized rank in modified rank-pooling is derived from a weighted sum determined by the linear weighting function \({\alpha }_{d}\) corresponding to depth \(d\) along an arbitrary axis. This incorporates the sequential and spatial information contained in the slices into a single, more nuanced representation of the volumetric data. We obtain plane-wise dynamic images by applying rank-pooling directly along each axis of the 3D volumetric image. Subsequently, we compare the results of applying EF or JF to these plane-wise dynamic images, along with channel-level concatenation of representative slices for each axis as per the 2D + e approach. This allows us to comprehensively evaluate how effectively the different fusion techniques capture the complex spatial information inherent in 3D volumetric imaging.
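For concreteness, the linear-coefficient rank pooling of Eqs. (1) and (2) can be sketched in a few lines of NumPy; the function name is ours, and intensity rescaling of the resulting image is omitted:

```python
import numpy as np


def dynamic_image(volume: np.ndarray, axis: int = 0) -> np.ndarray:
    """Project a 3D volume to a 2D dynamic image via rank pooling (Eqs. 1-2).

    Each slice along `axis` is weighted by alpha_d = 2d - D - 1 (d = 1..D),
    so early and late slices receive opposite-signed weights, and the
    weighted slices are summed into a single 2D image.
    """
    volume = np.moveaxis(volume, axis, 0)         # slices first: (D, H, W)
    D = volume.shape[0]
    alphas = 2.0 * np.arange(1, D + 1) - D - 1    # linear ranking coefficients
    return np.tensordot(alphas, volume, axes=1)   # weighted sum over slices


# Plane-wise dynamic images along the three axes of a (D, H, W) volume:
# dyn_axial, dyn_coronal, dyn_sagittal = (dynamic_image(vol, a) for a in range(3))
```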
Building volumetric prior knowledge through training the 3D teacher network
To enable the 2D student network, which will be used in the diagnostic system, to understand the original 3D space from the partial volumetric features it observes, we employ volumetric prior knowledge. To form this prior knowledge, we begin by using a 3D teacher network to encode the original 3D space. The architecture of the 3D teacher network fundamentally follows that of ResNet1838, but with all 2D convolutional layers replaced by 3D convolutional layers. ResNet's residual block repeats a sequence of a convolutional layer, a normalization layer, and a ReLU activation twice, and adds the input features back through an identity mapping before the final ReLU activation. Our structural adaptation differs from the conventional ResNet18, which stacks two residual blocks per group, repeated four times; instead, we stack four individual residual blocks. This modification is driven by the need for a 3D CNN that incorporates the validated structure of 2D CNNs while remaining lightweight enough for training on the typically smaller datasets characteristic of medical data, balancing the model's complexity and depth to suit medical imaging analysis. Our internal experiments revealed that reducing the number of residual blocks in ResNet18 led to higher performance on both the validation and test sets than the standard ResNet18. After the sequence of residual blocks, we apply Adaptive Average Pooling and encode the output into a 1024-dimensional vector. Finally, we use a single-layer neural network to map this vector to a 2-dimensional output representing Healthy Control (HC) and PD. This network calculates the probability distribution \(P\left(k|{x}_{T}\right)={\widehat{y}}_{T}\) for the diagnostic label, where \(k\) is the random variable associated with the diagnostic label.
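The sketch below illustrates such a teacher in PyTorch. The four individual 3D residual blocks, the adaptive average pooling into a 1024-dimensional vector, and the single-layer classifier follow the description above, while the channel widths, strides, and single-channel input are our illustrative assumptions:

```python
import torch
import torch.nn as nn


class Residual3DBlock(nn.Module):
    """(conv3d-norm-ReLU) x 2 with an identity shortcut before the last ReLU."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm3d(out_ch),
        )
        # Projection shortcut when the identity mapping cannot match shapes.
        self.shortcut = (
            nn.Identity()
            if stride == 1 and in_ch == out_ch
            else nn.Sequential(
                nn.Conv3d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm3d(out_ch),
            )
        )

    def forward(self, x):
        return torch.relu(self.body(x) + self.shortcut(x))


class Teacher3D(nn.Module):
    """Four individual 3D residual blocks -> 1024-d pooled feature -> 2 classes."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        widths, in_ch, blocks = [128, 256, 512, 1024], 1, []
        for w in widths:
            blocks.append(Residual3DBlock(in_ch, w))
            in_ch = w
        self.blocks = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(1024, num_classes)    # single-layer HC/PD head

    def forward(self, x):                          # x: (B, 1, D, H, W)
        f = self.pool(self.blocks(x)).flatten(1)   # penultimate feature f_T
        return self.fc(f), f
```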
We train the 3D teacher network from scratch on the given dataset \({D}_{T}\). In our experiment, the dataset \({D}_{T}=\{{X}_{T}, {Y}_{T}\}\) consists of \({X}_{T}\), the 3D volumetric images \({x}_{T}\in {\mathbb{R}}^{D\times H\times W}\) with \({X}_{T}=\{{x}_{T1},\dots ,{x}_{TN}\}\), and \({Y}_{T}\), the corresponding diagnostic labels evaluated according to the PPMI criteria for each volumetric image. Here, \(H\), \(W\), and \(D\) need not align with the standard neuroimaging coordinate system; we define \(H\) as the axis running from left to right of the brain, \(W\) as the axis from the front to the back of the brain, and \(D\) as the brain's longitudinal axis. The 3D teacher network is trained with the cross-entropy loss between the diagnostic label and the predicted probability distribution \({\widehat{y}}_{T}\):
$${L}_{CL{S}^{T}}\left({\widehat{y}}_{T},{y}_{T} \right)={L}_{CE}\left({\widehat{y}}_{T},{y}_{T}\right)=-\frac{1}{B}\sum _{i=1}^{B}{y}_{Ti}\log\left({\widehat{y}}_{Ti}\right)$$
3
Once the 3D teacher network’s model parameters are trained, they are frozen during the training of the student model.
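A minimal sketch of this procedure, assuming the illustrative `Teacher3D` above (returning logits and the penultimate feature) together with hypothetical `loader_T` and `opt` objects:

```python
import torch.nn.functional as F

# Train the 3D teacher on D_T = {X_T, Y_T} with cross-entropy (Eq. 3).
for x_t, y_t in loader_T:
    logits_t, _ = teacher(x_t)
    loss = F.cross_entropy(logits_t, y_t)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Freeze the trained teacher for the distillation stage: parameters stop
# receiving gradients, and eval() fixes the normalization statistics.
teacher.requires_grad_(False)
teacher.eval()
```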
2D Student Network for diagnostic system
To distill the 3D volumetric prior knowledge embedded in the 3D teacher network, a 2D student network, using a standard ResNet18 as its backbone, learns from the partially restricted input from scratch while simultaneously mimicking the knowledge representation of the 3D teacher network. This process is akin to the 2D student network piecing together fragmented volumetric information to reconstruct complete 3D volumetric knowledge. Consequently, the effectiveness of 3D-to-2D KD can vary depending on how rich the input information is in volumetric knowledge. The structure of the 2D student network is therefore designed to adapt to alterations in the partial input restriction. For example, when the 2D + e or rank-pooling-based projection method is combined through EF, the input channel size is 3. In the JF setup, as illustrated in Fig. 2, we encode the projected inputs for each plane using shared backbone parameters. The encoded features are then concatenated, and a single-layer neural network maps them into a 2-dimensional vector. This process predicts the probability distribution \(P\left(k|{x}_{S}\right)={\widehat{y}}_{S}\) for HC and PD.
We train the 2D student network from scratch using a dataset \({D}_{S}=\{{X}_{S}, {Y}_{S}\}\) created by applying the 2D projection method to the dataset \({D}_{T}\). Here, \({X}_{S}\) represents the projected 2D images \({x}_{S}\in {\mathbb{R}}^{C\times H\times W}\), with \({X}_{S}=\{{x}_{S1},\dots ,{x}_{SN}\}\), generated by a predefined partial input restriction function \(R\) (i.e., \({x}_{S}=R\left({x}_{T}\right)\)). In our experimental setup, \({Y}_{S}\) is used as the Ground Truth (GT) label, identical to \({Y}_{T}\), and we ensure that \({D}_{T}\) and \({D}_{S}\) share the same subject IDs. Here, \(C\) is the channel size of the model's input images, which varies depending on the 2D projection method used. The objective function for the 2D student network is formulated as follows:
$${L}_{CL{S}^{S}}\left({\widehat{y}}_{S},{y}_{S} \right)={L}_{CE}\left({\widehat{y}}_{S},{y}_{S}\right)=-\frac{1}{B}\sum _{i=1}^{B}{y}_{Si}\log\left({\widehat{y}}_{Si}\right)$$
4
Aligning volumetric feature representation with 3D-to-2D Knowledge Distillation
As mentioned earlier, although 2D projection methods provide partial volumetric information to the 2D CNN model in their own distinct ways, they do not necessarily convey the context in which these information fragments are used in the original 3D space. Furthermore, with a limited dataset and purely data-driven feature learning under the given objective function, the visual representation created by the 2D student network is unlikely to coincidentally align with the original 3D volumetric representation produced by the 3D teacher network. To minimize the modality gap between the original 3D data and the projected 2D data, we rectify the graph-level representations created by the teacher and student networks for minibatch data. By having the 2D student network closely mimic the similarity matrix between data samples, calculated from the 3D volumetric information discovered by the 3D teacher network, the ability of the 2D network to handle fragmented volumetric features is significantly enhanced.
In the initial KD approach, the knowledge representation used is the soft prediction, obtained by applying a temperature-scaled softmax function to the model's output logits. This soft prediction is also referred to as a soft target. The KD loss using soft targets, denoted as \({L}_{s.t.}\), is defined as follows:
$${L}_{s.t.}={L}_{CE}(\sigma \left(\frac{{z}_{S}}{T}\right), \sigma \left(\frac{{z}_{T}}{T}\right))$$
5
In this context, \({z}_{S}\) and \({z}_{T}\) represent the logits produced by the student and teacher networks, respectively, while \(\sigma\) denotes the softmax function. The temperature parameter \(T\) smooths the computed class probabilities: it adjusts the sharpness of the distribution, preventing the softmax output from becoming too extreme. In this setup, the soft targets generated by the fixed (non-updating) teacher network act as additional pseudo-labels for training the student model.
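For reference, Eq. (5) can be written out as follows; the temperature value is illustrative, and the \(T^{2}\) loss rescaling sometimes used in practice is omitted:

```python
import torch
import torch.nn.functional as F


def soft_target_loss(z_s: torch.Tensor, z_t: torch.Tensor, T: float = 4.0):
    """Soft-target KD loss of Eq. (5): cross-entropy between the
    temperature-softened teacher and student class distributions."""
    p_t = F.softmax(z_t / T, dim=1)          # teacher soft targets (pseudo labels)
    log_p_s = F.log_softmax(z_s / T, dim=1)  # softened student log-probabilities
    return -(p_t * log_p_s).sum(dim=1).mean()
```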
We adopt the flattened features in the penultimate layer, corresponding to the volumetric representations encoded by the networks involved in 3D-to-2D KD across the 3D and 2D modalities, as the distilled representation for 3D-to-2D KD. \({f}_{T}\in {\mathbb{R}}^{B\times {C}_{T}}\) and \({f}_{S}\in {\mathbb{R}}^{B\times {C}_{S}}\) denote the feature vectors from the penultimate layer of each network, where \({C}_{T}\) and \({C}_{S}\) are the feature dimensions of the teacher and student networks, respectively. For a given minibatch, the interrelationships of the data points in the embedding space formed by the feature vectors of the 3D teacher network are expressed through a similarity matrix, computed as follows:
$$\widetilde{{f}_{T}}=\frac{{f}_{T}}{{\Vert {f}_{T}\Vert }_{2}};\quad {S}_{T}=\widetilde{{f}_{T}} \cdot {\widetilde{{f}_{T}}}^{\top }$$
6
For the distilled representations, we first apply the \(l2\)-norm, and the similarity matrices \({S}_{T}, {S}_{S}\in {\mathbb{R}}^{B\times B}\) are computed through linear affinity. In our internal experiments, we considered various similarity measures, such as linear affinity (i.e., simple matrix multiplication), the Radial Basis Function (RBF), and k-nearest-neighbor (kNN)-based affinity; since no significant differences were observed, we opted for linear affinity for clarity. The similarity matrix for the penultimate-layer feature representation of the 2D student network is calculated analogously:
$$\widetilde{{f}_{S}}=\frac{{f}_{S}}{{\Vert {f}_{S}\Vert }_{2}};\quad {S}_{S}=\widetilde{{f}_{S}} \cdot {\widetilde{{f}_{S}}}^{\top }$$
7
We distill the volumetric prior knowledge of the teacher network by directly reducing the difference between the similarity matrices created from the encoded features produced by the teacher and student networks. We define the 3D-to-2D KD loss based on volumetric features as follows:
$${L}_{fg}\left({S}_{T},{S}_{S}\right)=\frac{1}{{B}^{2}}\sum _{\left(i,j\right)\in I}{\left({S}_{T,ij}-{S}_{S,ij}\right)}^{2}$$
8
\(I\) represents the set containing all pairs of data points included in the minibatch input.
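A compact PyTorch rendering of Eqs. (6)-(8), assuming the penultimate features have already been flattened into \(B\times C\) matrices:

```python
import torch
import torch.nn.functional as F


def similarity_kd_loss(f_t: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
    """Feature-graph KD loss of Eqs. (6)-(8).

    f_t: (B, C_T) penultimate features of the frozen 3D teacher.
    f_s: (B, C_S) penultimate features of the 2D student.
    Rows are l2-normalized, turned into B x B linear-affinity similarity
    matrices, and matched by their mean squared element-wise difference.
    """
    s_t = F.normalize(f_t, dim=1) @ F.normalize(f_t, dim=1).T  # S_T, Eq. (6)
    s_s = F.normalize(f_s, dim=1) @ F.normalize(f_s, dim=1).T  # S_S, Eq. (7)
    return ((s_t - s_s) ** 2).mean()      # (1 / B^2) * sum over pairs, Eq. (8)
```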
Finally, the total loss \({L}_{total}\) used for training the 2D student network is defined as follows:
$${L}_{total}={L}_{CL{S}^{S}}\left({\widehat{y}}_{S}, {y}_{S}\right)+{L}_{fg}({S}_{T}, {S}_{S})$$
9
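Putting the pieces together, one student update under Eq. (9) might look as follows; `student`, `teacher`, and `similarity_kd_loss` refer to the illustrative sketches above, and both networks are assumed to return a (logits, penultimate feature) pair:

```python
import torch
import torch.nn.functional as F

# One training step of the 2D student: supervised cross-entropy on the
# projected input plus the feature-graph distillation term.
logits_s, f_s = student(x_s)              # x_s = R(x_t), the restricted input
with torch.no_grad():                     # teacher is frozen
    _, f_t = teacher(x_t)
loss = F.cross_entropy(logits_s, y_s) + similarity_kd_loss(f_t, f_s)
loss.backward()
```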
Figure 2. Illustration of 3D-to-2D Knowledge Distillation. \({L}_{fg}\) is the 3D-to-2D KD loss based on volumetric features, \({L}_{{CLS}^{S}}\) is the cross-entropy loss between the diagnostic label \({y}_{S}\) and the probability distribution \({\widehat{y}}_{S}\), \({\Phi }\) is the similarity measure, FFN is the Feed-Forward Network, \({f}_{T}\) is the feature vector in the penultimate layer of the 3D teacher network, and \({f}_{S}\) is the feature vector in the penultimate layer of the 2D student network.