Design
This section describes the overall process of using the expanded dataset to train ACRnet to identify chest X-ray images. As shown in Fig. 1, we first use DCGAN to synthesize cardiomegaly and emphysema chest X-ray images, and then feed the expanded dataset into ACRnet. After several rounds of feature extraction, the model recognizes the input image and outputs its category label. We refer to module A as a block; ACRnet is composed of multiple blocks in series.
Data description
The images we use come from the public Chest X-ray 14 dataset, each carrying one or more pathological labels. Radiological reports show that the accuracy of these labels exceeds 90% [25]. From Chest X-ray 14, we select images labeled only with cardiomegaly or emphysema and add normal images for this study. We divide all data into 80% for training and 20% for testing. Table 1 shows the number of chest X-ray images screened from Chest X-ray 14 for training and testing. Because the original dataset contains fewer than 1000 images each of cardiomegaly and emphysema, which is not enough to train the network well, we use DCGAN to expand the original dataset, synthesizing 2000 cardiomegaly and emphysema images. Since Chest X-ray 14 contains more than 60,000 normal chest radiographs, we also select 2000 normal images to add alongside the synthesized images. The same test set is used whether a model is trained on the original dataset or ACRnet is trained on the expanded dataset. To meet the model's input-size requirement, we resize both the original 1024 × 1024 images and the 256 × 256 images synthesized by DCGAN to 224 × 224.
Table 1. Number of the chest X-ray images in training and test.
Disease      | Train set | Test set
cardiomegaly | 840       | 210
emphysema    | 720       | 180
normal       | 800       | 200
Deep convolutional generative adversarial networks (DCGAN)
A deep convolutional generative adversarial network is a deep learning model that combines a convolutional neural network (CNN) with a GAN [26]. DCGAN consists of two networks: a generator that synthesizes images and a discriminator that distinguishes real from synthetic images. During training, the discriminator and generator are trained simultaneously. The generator we use consists of 7 transposed convolution layers, 6 ReLU layers, 3 batch normalization layers, and a final Tanh layer. The discriminator consists of 7 convolution layers, 6 LeakyReLU layers, 3 batch normalization layers, and a final Sigmoid layer. All convolution and transposed convolution layers use 4 × 4 kernels, with channel counts in multiples of 16. The DCGAN thus compensates for the insufficient training caused by a small dataset. We train DCGAN on all the emphysema and cardiomegaly chest radiographs in the original data. After 1000 iterations, we use it to synthesize 2000 chest X-ray images of emphysema and cardiomegaly to expand the training dataset. Table 2 shows the model structure and parameters of the generator and discriminator in DCGAN.
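The simultaneous training of the two networks can be sketched as a single adversarial step, assuming PyTorch. The tiny linear networks below are stand-ins for the real generator and discriminator of Table 2; the learning rate and momentum values are common DCGAN defaults, not taken from the paper:

```python
# Minimal sketch of one DCGAN training step (PyTorch assumed; toy networks
# stand in for the Table 2 generator/discriminator).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 16), nn.Tanh())
D = nn.Sequential(nn.Linear(16, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCELoss()

real = torch.randn(8, 16)   # a batch of real images (flattened stand-in)
z = torch.randn(8, 100)     # latent noise fed to the generator

# Discriminator step: push D(real) toward 1 and D(G(z)) toward 0.
opt_d.zero_grad()
loss_d = bce(D(real), torch.ones(8, 1)) + bce(D(G(z).detach()), torch.zeros(8, 1))
loss_d.backward()
opt_d.step()

# Generator step: push D(G(z)) toward 1 (fool the discriminator).
opt_g.zero_grad()
loss_g = bce(D(G(z)), torch.ones(8, 1))
loss_g.backward()
opt_g.step()
```

Repeating this step over the cardiomegaly and emphysema radiographs for 1000 iterations yields a generator from which synthetic images can be sampled.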
Overview of convolutional neural networks (CNNs)
CNN training begins in a feedforward fashion: input information is transmitted from the first layer to the output layer, and errors are then propagated backward from the last layer [27]. In this section, we introduce the four neural networks used in our experiments. VGG is a typical and effective neural network with 6 configurations, of which VGG16 and VGG19 are most commonly used [27]. A VGG model consists of 5 blocks, each containing several convolution layers and a pooling layer. VGG16 includes 13 convolution layers, 3 fully connected layers, and 5 pooling layers; its convolution layers use 3×3 kernels and its pooling layers use 2×2 windows. InceptionV2 is an upgraded version of GoogLeNet (InceptionV1) [28]. InceptionV1 is a 22-layer network with four parallel branches, using convolution kernels of 1×1, 3×3, and 5×5. Since kernels of different sizes have different receptive fields, InceptionV1 can better learn features at different scales.
In addition, InceptionV1 aggregates visual information at different scales so that subsequent layers can extract features from multiple scales. InceptionV2 first proposed batch normalization to accelerate network training and prevent vanishing gradients [29]. It also replaces every 5×5 convolution kernel in InceptionV1 with two 3×3 kernels, which reduces the parameters and enhances nonlinearity while maintaining the same receptive field. ResNet is a widely used neural network with 18-, 34-, 50-, 101-, and 152-layer variants [30]. The residual block in ResNet uses skip connections to alleviate the vanishing gradients caused by increasing network depth. The residual structure not only improves accuracy but also makes the model easier to optimize. As neural networks have deepened, it has become clear that improving the flow of information through a deep model makes it easier to train. In a CliqueNet block [31], each layer is both an input and an output of the other layers; the layers form a loop structure and are updated alternately. This structure feeds high-level visual information back to earlier layers for feature reuse. In addition, an attention mechanism suppresses irrelevant neurons representing background and noise, further improving the model's recognition ability.
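The residual skip connection described above can be sketched as a minimal PyTorch block, assuming the common two-convolution design (the layer sizes here are illustrative, not the paper's configuration):

```python
# A minimal ResNet-style residual block (PyTorch assumed): the skip
# connection adds the input back after two 3x3 convolutions, so gradients
# can flow around the convolutions during backpropagation.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # the skip ("jump") connection

x = torch.randn(1, 16, 32, 32)
y = ResidualBlock(16)(x)
print(y.shape)  # torch.Size([1, 16, 32, 32])
```

Because the input and output shapes match, such blocks can be stacked to arbitrary depth, which is what makes the 101- and 152-layer variants trainable.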
Table 2. The model structure and parameters of the generator and discriminator in DCGAN.
Generator       |         |         | Discriminator |         |
Layer name      | Input   | Output  | Layer name    | Input   | Output
ConvTranspose2d | 100     | 16 × 32 | Conv2d        | 3       | 16
BatchNorm2d     |         |         | LeakyReLU     |         |
ReLU            |         |         | Conv2d        | 16      | 16 × 2
ConvTranspose2d | 16 × 32 | 16 × 16 | LeakyReLU     |         |
BatchNorm2d     |         |         | Conv2d        | 16 × 2  | 16 × 4
ReLU            |         |         | LeakyReLU     |         |
ConvTranspose2d | 16 × 16 | 16 × 8  | Conv2d        | 16 × 4  | 16 × 8
BatchNorm2d     |         |         | BatchNorm2d   |         |
ReLU            |         |         | LeakyReLU     |         |
ConvTranspose2d | 16 × 8  | 16 × 4  | Conv2d        | 16 × 8  | 16 × 16
ReLU            |         |         | BatchNorm2d   |         |
ConvTranspose2d | 16 × 4  | 16 × 2  | LeakyReLU     |         |
ReLU            |         |         | Conv2d        | 16 × 16 | 16 × 32
ConvTranspose2d | 16 × 2  | 16      | BatchNorm2d   |         |
ReLU            |         |         | LeakyReLU     |         |
ConvTranspose2d | 16      | 3       | Conv2d        | 16 × 32 | 1
Tanh            |         |         | Sigmoid       |         |
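The generator column of Table 2 can be transcribed into PyTorch as below. The channel widths (multiples of 16) and layer order follow the table; the strides and paddings are assumptions, using the standard DCGAN convention in which the first layer maps the 100-dimensional noise vector to a 4 × 4 map and each subsequent layer doubles the resolution, reaching the 256 × 256 output mentioned earlier:

```python
# Generator transcribed from Table 2 (4x4 kernels; channel widths 16*k).
# Strides/paddings are assumed DCGAN defaults, not stated in the table.
import torch
import torch.nn as nn

ngf = 16
G = nn.Sequential(
    nn.ConvTranspose2d(100, ngf * 32, 4, 1, 0), nn.BatchNorm2d(ngf * 32), nn.ReLU(),
    nn.ConvTranspose2d(ngf * 32, ngf * 16, 4, 2, 1), nn.BatchNorm2d(ngf * 16), nn.ReLU(),
    nn.ConvTranspose2d(ngf * 16, ngf * 8, 4, 2, 1), nn.BatchNorm2d(ngf * 8), nn.ReLU(),
    nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1), nn.ReLU(),
    nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1), nn.ReLU(),
    nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1), nn.ReLU(),
    nn.ConvTranspose2d(ngf, 3, 4, 2, 1), nn.Tanh(),  # Tanh bounds pixels to [-1, 1]
)

z = torch.randn(1, 100, 1, 1)  # latent noise vector
print(G(z).shape)  # torch.Size([1, 3, 256, 256])
```

Under these assumed strides, the six doubling layers take the initial 4 × 4 map to 256 × 256, matching the stated size of the synthesized images.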
Constitution of ACRnet
This section introduces the structure of ACRnet, a 97-layer neural network. In ACRnet, we construct an adaptive cross-transfer residual structure to extract feature information and improve the model's recognition efficiency.
In a convolutional neural network, the feature information extracted by a convolution kernel is, to some extent, tied to its channel. Adaptive learning obtains the importance of different features through autonomous learning, then suppresses secondary features and enhances main features according to that importance [32]. In the adaptive module, an adaptive global average pooling layer (Adaptive Avg Pool) compresses the features along the spatial dimensions, transforming each two-dimensional feature map into a single value that has a global receptive field and reflects the feature distribution. We then use a 1×1 convolution (input channels a, output channels a/4) for dimension reduction, a ReLU activation to add nonlinearity, and another 1×1 convolution (input channels a/4, output channels a) to restore the dimension, which reduces the computational cost. In the last layer of the adaptive structure, a Sigmoid function produces a value between 0 and 1 as the output. Passing these adaptive values to subsequent layers is equivalent to multiplying the feature matrix by a weight coefficient.
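The adaptive module just described can be sketched directly, assuming PyTorch (class and variable names here are illustrative, not taken from the paper):

```python
# Sketch of the adaptive module: global average pooling, 1x1 conv a -> a/4,
# ReLU, 1x1 conv a/4 -> a, Sigmoid, then the input is reweighted per channel.
import torch
import torch.nn as nn

class AdaptiveModule(nn.Module):
    def __init__(self, a):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # spatial compression to 1x1
        self.excite = nn.Sequential(
            nn.Conv2d(a, a // 4, 1),            # dimension reduction
            nn.ReLU(),                          # added nonlinearity
            nn.Conv2d(a // 4, a, 1),            # dimension restoration
            nn.Sigmoid(),                       # weights in (0, 1)
        )

    def forward(self, x):
        w = self.excite(self.squeeze(x))  # one weight coefficient per channel
        return x * w                      # suppress secondary, enhance main features

x = torch.randn(2, 64, 28, 28)
out = AdaptiveModule(64)(x)
print(out.shape)  # torch.Size([2, 64, 28, 28])
```

The a → a/4 → a bottleneck is what keeps the extra computation small relative to the convolutions it reweights.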
To extract image features better, we combine the adaptive and residual structures into a deep neural network model. In ACRnet, the residual structure transmits the features of layers 0 and 1 to the ends of layers 3 and 4, while the adaptive structure transmits the features of layers 0 and 2 to the ends of layers 2 and 4. This design avoids the residual structure weakening the adaptive function, which happens when both structures transmit features from the same layer to the same subsequent layer. The adaptive structure generates a coefficient between 0 and 1, while the residual structure generates a matrix; delivering both from the same layer to the same target layer is equivalent to adding the residual feature matrix to a matrix multiplied by the adaptive coefficient, which dilutes the adaptive weighting. Fig. 2 shows ACRnet's adaptive cross-transfer residual block.
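The dilution effect motivating the cross-transfer design can be illustrated with toy tensors (these are not the real ACRnet layers, just a numeric sketch of the argument):

```python
# Toy illustration: when a residual skip and an adaptive weight target the
# same layer, the output is w * f + r, so the unscaled residual term r
# dominates and masks the gating. Cross-transferring them to different
# layers lets the gating act undiluted at its own target.
import torch

f = torch.ones(4) * 2.0   # features at the target layer
r = torch.ones(4) * 2.0   # residual skip from an earlier layer
w = torch.tensor(0.1)     # adaptive coefficient in (0, 1)

same_target = w * f + r   # gating largely masked by the residual addition
cross_target = w * f      # gating acts alone at this layer
print(same_target, cross_target)
```

With a small coefficient like 0.1, `same_target` stays near the residual value 2.0 while `cross_target` drops to 0.2: the multiplicative suppression survives only when the two structures feed different layers.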
Hyper-parameters tuning
In our work, we estimate the influence of various hyperparameters on the performance of the deep model. We divide all parameters into external and internal parameters and fine-tune them for each model. The external parameters include the optimizer, learning rate, and batch size, all of which strongly affect performance. Among optimizers, we find that Adam is significantly better than Adadelta or SGD. The learning rate affects the convergence speed of the model, but an excessive learning rate reduces accuracy. We tried seven learning rates {1e-3, 1e-4, 1e-5, 3e-3, 3e-4, 3e-5, 3e-6} and found 1e-4 and 3e-4 to work best. After repeated experiments, we select a batch size of 32 from {16, 24, 32, 48, 64}. The internal parameters include the convolution kernel size, stride, pooling window size, channel count, and dropout. For VGG16, InceptionV2, ResNet101, and CliqueNet, we keep the original internal parameters. Table 3 shows the structure and parameters of the ACRnet model.
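The selected external parameters translate into a training setup like the following, assuming PyTorch (the linear model and random tensors below are placeholders for ACRnet and the chest X-ray data):

```python
# Sketch of the chosen external parameters: Adam optimizer, learning rate
# 1e-4 (3e-4 also worked well), batch size 32. The model and data are
# stand-ins, not the real ACRnet or dataset.
import torch
import torch.nn as nn

model = nn.Linear(224 * 224 * 3, 3)  # placeholder for the 3-class ACRnet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

dataset = torch.utils.data.TensorDataset(
    torch.randn(64, 224 * 224 * 3),      # fake flattened 224x224x3 images
    torch.randint(0, 3, (64,)),          # fake labels in {0, 1, 2}
)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

for x, y in loader:                      # one epoch over the toy data
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```

Swapping `torch.optim.Adam` for `Adadelta` or `SGD`, or varying `lr` over the listed grid, reproduces the comparison described above.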
Table 3. Structure and parameters of ACRnet.
Layer                                 | Output size | Kernel size | Channels
Input (224 × 224 × 3)                 |             |             |
block × 4                             | 112 × 112   | 3           | 60
block × 4                             | 56 × 56     | 3           | 120
block × 4                             | 28 × 28     | 3           | 240
block × 6                             | 14 × 14     | 3           | 480
block × 6                             | 7 × 7       | 3           | 960
AvgPool (kernel_size = 7, stride = 1) |             |             |
Linear (output = 3)                   |             |             |
Performance Evaluation
We choose the Inception Score (IS) [33] and Fréchet Inception Distance (FID) [34] as indexes to evaluate the quality of the synthetic images; both use InceptionV3 [37], pre-trained on ImageNet [36], to classify the generated images. The higher the IS and the lower the FID, the better the image quality. We use accuracy (ACC) [38], precision, recall, and F1 score [39] as indicators to evaluate the performance of the detection model. The formulas for these indicators are as follows:

ACC = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
Here TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively. To further evaluate model performance [41, 42], we use the confusion matrix [40] and plot the receiver operating characteristic (ROC) curve, reporting the area under the curve (AUC). The higher the AUC, the better the performance.
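The four indicators follow directly from the TP/FP/TN/FN counts; as a worked sketch (the counts below are illustrative, not results from the paper):

```python
# Computing ACC, precision, recall, and F1 from prediction counts.
# The counts are made up for illustration only.
TP, FP, TN, FN = 190, 15, 380, 20

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.942 0.927 0.905 0.916
```

Note that F1 simplifies to 2·TP / (2·TP + FP + FN), which is why it balances the two error types that precision and recall each capture separately.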