We developed a platform with a client-server architecture to annotate potential COVID–19 lesion areas on radiography images. The platform can be deployed on a private cloud for security and local sharing. X-data from 212 patients diagnosed with COVID–19 were analyzed by two experts to determine the lesion areas. The X-data were collected from the public covid-chestxray-dataset35, and the images were resized to 512 × 512 pixels. Each image contained 1–2 suspected areas with inflammatory lesions (SAs). CT-data from 95 patients diagnosed with COVID–19 and 50 patients diagnosed with influenza were annotated by the same two experts using a rapid keystroke-entry format. The CT scans were acquired with a PHILIPS Brilliance iCT 256 system at a slice thickness of 5 mm, and the CT-data images were 512 × 512-pixel grayscale images. Each case contained 2–5 annotated SAs, ranging from 16 × 16 to 64 × 64 pixels. Five clinical indicators (white blood cell count, neutrophil percentage, lymphocyte percentage, procalcitonin, and C-reactive protein) were also obtained, as shown in Supplementary Table 1. As a control, we randomly selected 5,000 normal cases from a public dataset (Kaggle RSNA)36. The X-data of the normal cases (XNDS) and of the COVID–19 cases (XCDS) constituted the X dataset (XDS). We collected additional CT-data of 120 cases from a public lung CT dataset (LUNA–16, a large dataset for automatic nodule detection in the lungs)37; two experienced radiologists confirmed that these cases contained no lesion areas of COVID–19 or influenza. The CT-data of the COVID–19 cases (CTCDS), the influenza cases (CTIDS), and the normal cases (CTNDS) constituted the clinically diagnosed CT dataset (CTDS). The images of the SAs and the clinical indicator data constituted the correlation analysis dataset (CADS).
We split the XDS, CTDS, and CADS into a training-validation (train-val) part and a test part. The details of the three datasets are shown in Table 1. The train-val part of the CTDS is referred to as CTTS, and the test part as CTVS. The same naming scheme was adopted for the XDS and CADS, i.e., XTS, XVS, CATS, and CAVS.
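The partitioning into train-val and test parts can be sketched as a simple random split. The 80/20 ratio and the fixed seed below are illustrative assumptions only; the actual sizes of the parts are those reported in Table 1.

```python
import random

def train_val_split(cases, train_frac=0.8, seed=0):
    """Randomly partition a list of case IDs into train-val and test parts.

    The 80/20 ratio and the seed are hypothetical choices for illustration;
    the paper reports the actual split sizes in Table 1.
    """
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Example: split 212 COVID-19 X-ray case IDs into a train-val part (XTS-like)
# and a test part (XVS-like).
xts, xvs = train_val_split(list(range(212)))
```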
PCA was used to visually compare the characteristics of the medical images (X-data, CT-data) of the COVID–19 cases with those of the normal and influenza cases across the five sub-datasets (XNDS, XCDS, CTCDS, CTIDS, and CTNDS). Figure 2 shows the mean image of each sub-dataset and the five eigenvectors representing the principal components of PCA in the corresponding feature space. Significant differences are observed between the COVID–19, influenza, and normal cases, indicating that COVID–19 cases can potentially be distinguished from normal and influenza cases.
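The mean image and the leading eigenvectors ("eigenimages") underlying Fig. 2 can be computed with a plain SVD-based PCA. A minimal sketch, in which the image size and sample count are toy values rather than the paper's 512 × 512 data:

```python
import numpy as np

def pca_eigenimages(images, n_components=5):
    """Compute the mean image and the leading PCA eigenvectors of a
    stack of flattened grayscale images.

    images: array of shape (n_samples, H*W). The sizes in the example
    below are toy values; the paper uses 512x512 images per sub-dataset.
    """
    X = np.asarray(images, dtype=np.float64)
    mean = X.mean(axis=0)
    Xc = X - mean                               # center the data
    # SVD of the centered data: rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return mean, Vt[:n_components]

# Toy example: 20 random 8x8 "images", 5 principal components.
rng = np.random.default_rng(0)
imgs = rng.random((20, 64))
mean_img, comps = pca_eigenimages(imgs, n_components=5)
```

Each row of `comps`, reshaped back to the image dimensions, is one of the eigenimages visualized alongside the mean image in a figure such as Fig. 2.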
The CNN-based classification framework exhibited excellent performance when validated against expert assessments on multi-modal data. The structure of the proposed framework, consisting of the stage I and stage II sub-frameworks, is shown in Fig. 5-a, where Q, L, M, and N are hyper-parameters of the framework for general use cases. In this study, the values of Q, L, M, and N were 1, 1, 2, and 2, respectively; this configuration is referred to as the CNNCF. The stage I and stage II sub-frameworks were designed to extract features corresponding to different optimization goals in the analysis of the medical images. The performance of the CNNCF was evaluated on multi-modal datasets (X-data and CT-data) to assess the generalization and transferability of the model, and several evaluation indicators were used (sensitivity, precision, specificity, F1, and kappa). The salient features of the images extracted by the CNNCF were visualized in heatmaps (four examples are shown in Supplementary Fig. 1). Four experiments were conducted in this study, with the following results.
- Experiment-A. The results of the five evaluation indicators for the comparison of the COVID–19 and normal cases in the XVS are shown in Table 2. Three experts also evaluated the images: a 7th-year respiratory resident (Respira.), a 3rd-year emergency resident (Emerg.), and a 1st-year respiratory intern (Intern). The CNNCF performed excellently, with an F1 score of 96.72%, a kappa of 95.40%, a specificity of 99.33%, and a precision of 98.33%. Its sensitivity was 95.16%, which was higher than that of the Intern (93.55%) and lower than that of the Respira. (100%) and Emerg. (100%). The receiver operating characteristic (ROC) results for the CNNCF and the experts are plotted in Fig. 3-a; the area under the ROC curve (AUROC) of the CNNCF is 0.9961. The precision-recall results are plotted in Fig. 3-d; the area under the precision-recall curve (AUPRC) of the CNNCF is 0.9910.
- Experiment-B. The results of the five evaluation indicators for the CTVS are shown in Table 3. The CNNCF achieved the best score on all five evaluation indices. The ROC results are plotted in Fig. 3-b; the AUROC of the CNNCF is 1.0. The precision-recall results are shown in Fig. 3-e; the AUPRC of the CNNCF is 1.0.
- Experiment-C. The results of the five evaluation indicators for the CTVS are shown in Table 3. The CNNCF exhibited good performance on the five evaluation indices, similar to that of the Respira. and higher than that of the Intern and the Emerg. The ROC results are plotted in Fig. 3-c; the AUROC of the CNNCF is 1.0. The precision-recall results are shown in Fig. 3-f; the AUPRC of the CNNCF is 1.0.
- Experiment-D. Boxplots of the kappa coefficient and the specificity are shown in Fig. 4, and those of the precision and sensitivity in Supplementary Fig. 4. A bootstrapping method38 was used to calculate the empirical distributions, and McNemar's test39 was used to analyze the differences between the CNNCF and the experts. The p-values of McNemar's test (Supplementary Tables 2–4) for the five evaluation indicators were all 1.0, indicating no statistically significant difference between the CNNCF results and the expert evaluations.
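The five evaluation indicators used in Experiments A–D follow the standard binary confusion-matrix definitions. A minimal sketch; the counts in the example are a hypothetical reconstruction chosen to be consistent with the Experiment-A percentages, not values reported in the paper:

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute the five evaluation indicators from binary
    confusion-matrix counts, using the standard definitions
    (the paper's exact implementation is not shown).
    """
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)            # recall on positive cases
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    po = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (po - pe) / (1 - pe)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1, "kappa": kappa}

# Hypothetical counts consistent with the Experiment-A results
# (sensitivity 95.16%, specificity 99.33%, precision 98.33%,
#  F1 96.72%, kappa 95.40%).
m = binary_metrics(tp=59, fp=1, tn=148, fn=3)
```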
Introspection studies identify the salient features of COVID–19. In clinical practice, a clinician's diagnostic decision relies on the identification of the SAs in the medical images by radiologists. The statistical results show that the performance of the CNNCF in identifying COVID–19 is as good as that of the experts. A two-part comparison was performed to evaluate the discriminatory ability of the CNNCF. In the first part, we used Grad-CAM, a non-intrusive method for extracting the salient features in medical images, to create heatmaps of the CNNCF results. Supplementary Fig. 1 shows the heatmaps of four example COVID–19 cases from the XDS and CTDS. In the second part, we used density-based spatial clustering of applications with noise (DBSCAN) to calculate the center pixel coordinates (CPC) of the salient features corresponding to COVID–19; all CPCs were normalized to the range of 0 to 1. Subsequently, we used a significance test (ST)40 to analyze the relationship between the CPCs of the CNNCF output and the CPCs annotated by the experts. Good agreement was obtained, with a mean square error (MSE) of 0.0108, a mean absolute error (MAE) of 0.0722, a root mean squared error (RMSE) of 0.1040, a correlation coefficient (r) of 0.9761, and a coefficient of determination (R2) of 0.8801.
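The CPC step above can be sketched as clustering the salient-pixel coordinates with DBSCAN and normalizing each cluster's mean coordinate by the image size. The implementation below is a minimal from-scratch version for illustration (the paper does not state its DBSCAN parameters; the `eps`, `min_samples`, and toy coordinates here are assumptions):

```python
import numpy as np

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN (illustration only): returns a label per point, -1 = noise."""
    n = len(points)
    labels = np.full(n, -1)
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    core = [len(nb) >= min_samples for nb in neighbors]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            if not core[j]:
                continue                  # border points do not expand the cluster
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster
                    stack.append(k)
        cluster += 1
    return labels

def cluster_centers(points, labels, image_size=512):
    """Center pixel coordinates (CPC) per cluster, normalized to [0, 1]."""
    return np.array([points[labels == c].mean(axis=0) / image_size
                     for c in sorted(set(labels) - {-1})])

# Toy example: two compact blobs of salient-pixel coordinates in a 512x512 image.
pts = np.array([(100 + dx, 100 + dy) for dx in range(3) for dy in range(3)] +
               [(400 + dx, 300 + dy) for dx in range(3) for dy in range(3)],
               dtype=float)
labels = dbscan(pts, eps=2.0, min_samples=4)
cpcs = cluster_centers(pts, labels)
```

The resulting normalized CPCs can then be compared against expert-annotated CPCs with MSE, MAE, RMSE, r, and R2 as in the text.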
A strong correlation was observed between the lesion areas detected by the proposed framework and the clinical indicators. In clinical practice, multiple clinical indicators are analyzed to determine whether further examinations (e.g., medical imaging) are needed; these indicators can be used to assess the predictive ability of the model. In addition, various examinations are required for an accurate diagnosis, but the correlations between their results are often unclear. We used the stage II sub-framework and the regressor block of the CNNRF to conduct a correlation analysis between the lesion areas detected by the framework and five clinical indicators of COVID–19 (white blood cell count, neutrophil percentage, lymphocyte percentage, procalcitonin, and C-reactive protein) using the CADS. The inputs of the CNNRF were the lesion area images of each case, and the output was a 5-dimensional vector describing the correlation between the lesion areas and the five clinical indicators.
The MAE, MSE, RMSE, r, and R2 were used to evaluate the results, and the ST and the Pearson correlation coefficient (PCC)41 were used to determine the correlation between the lesion areas and the clinical indicators. A strong correlation was obtained, with MSE = 0.0163, MAE = 0.0941, RMSE = 0.1172, r = 0.8274, and R2 = 0.6465. At a significance level of 0.001, the value of r was 1.27 times the critical value of 0.6524, indicating a high and significant correlation between the lesion areas and the clinical indicators. The PCC of 0.8274 falls in the 0.8–1.0 range, indicating a strong correlation. The CNNRF was trained on the CATS and evaluated on the CAVS, with an initial learning rate of 0.01, the stochastic gradient descent (SGD) optimizer42, and Xavier initialization43 of the parameters.
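The significance check above amounts to computing the Pearson correlation coefficient and comparing it against the critical value at the chosen significance level. A minimal sketch, using the critical value of 0.6524 reported in the text and toy data in place of the CAVS predictions:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

# Critical value of r at the 0.001 significance level, as reported in the text.
CRITICAL_R = 0.6524

def is_significant(r, critical=CRITICAL_R):
    """Declare the correlation significant if |r| exceeds the critical value."""
    return abs(r) > critical

# Toy data with a strong linear relationship (stand-in for predicted vs.
# measured clinical-indicator values; not the paper's data).
x = np.arange(10, dtype=float)
y = 2 * x + np.array([0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 0.0, -0.3, 0.1, 0.0])
r = pearson_r(x, y)
```

With the paper's reported r = 0.8274, `is_significant` returns True, since 0.8274 / 0.6524 ≈ 1.27.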