This study was approved by the ethics committee of Guangdong Second Provincial General Hospital, and all participants provided written informed consent after receiving a complete description of the study.
Here, we present a support vector machine (SVM) method for classifying patients with COVID–19 and patients with other pneumonias within a radiomics framework. The workflow of the proposed method is shown in Figure 1. First, the lung infection areas (regions of interest, ROIs) in the CT images of each sample were manually delineated. Second, thirty-two texture features and five histogram features were extracted from the ROI data using a quantitative radiomics feature model. Finally, an SVM classifier trained on these quantitative radiomics features from the training data was used to distinguish COVID–19 patients from patients with other pneumonias. The methodology behind each step is described in detail below.
- Participants and Data Acquisition
Ninety patients with COVID–19 (56 males, 34 females; mean ± standard deviation age, 45.36 ± 11.58 years) were recruited, and 90 patients with other pneumonias (COVID–19-negative; 58 males, 32 females; mean ± standard deviation age, 46.54 ± 8.40 years) were recruited as a control group.
Chest CT images of all participants were acquired using a 16-slice CT scanner (Philips). Each examination took about 2 min and used a helical scan of the chest with the following parameters: reconstruction slice thickness = 2 mm; reconstruction slice increment = 2 mm. Each CT volume consisted of 98−165 slices of 512 × 512 pixels.
Radiomics texture analysis was proposed in the early 1980s as a method for extracting relevant information that characterizes tissue types in various medical images. Previous studies [11], [12] hypothesized that texture features reflect heterogeneity within tumors, which is of great significance in cancer research. Texture analysis is a key component of radiology [13].
The gray level co-occurrence matrix (GLCM) [14] considers the arrangement of voxel pairs to calculate texture indices. The GLCM is calculated in 13 different directions in 3D, with a δ-voxel distance (‖𝑑⃗‖) between the neighboring voxels of each pair. Each index value is the average of that index over the 13 spatial directions (X, Y, Z). From this matrix, seven textural indices (homogeneity, energy, contrast, correlation, entropy_log10, entropy_log2, and dissimilarity) are computed. The gray level run length matrix (GLRLM) [15] gives the size of the homogeneous runs for each gray level. The matrix is calculated in 13 different directions in 3D (4 in 2D). Eleven texture indices are computed from this matrix: Short-Run Emphasis, Long-Run Emphasis, Low Gray-level Run Emphasis, High Gray-level Run Emphasis, Short-Run Low Gray-level Emphasis, Short-Run High Gray-level Emphasis, Long-Run Low Gray-level Emphasis, Long-Run High Gray-level Emphasis, Gray-Level Non-Uniformity for run, Run Length Non-Uniformity, and Run Percentage.
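The GLCM averaging over 13 directions described above can be sketched as follows. This is a minimal illustration in NumPy, not the LIFEx implementation: it assumes a distance of one voxel, treats the whole array as the ROI, and uses common textbook definitions of five of the seven indices, which may differ in detail from the conventions in LIFEx. All function names are hypothetical.

```python
import itertools
import numpy as np

def directions_3d():
    """Return the 13 unique neighbor offsets in 3D (one per opposite pair)."""
    offsets = itertools.product((-1, 0, 1), repeat=3)
    # Keeping only offsets that are lexicographically positive drops the
    # zero offset and one member of every opposite pair (26 / 2 = 13).
    return [d for d in offsets if d > (0, 0, 0)]

def glcm(volume, offset, levels):
    """Symmetric, normalized co-occurrence matrix for one offset."""
    matrix = np.zeros((levels, levels), dtype=float)
    src, dst = [], []
    for size, step in zip(volume.shape, offset):
        # Slice pairs so that volume[src] and volume[dst] are `offset` apart.
        src.append(slice(max(0, -step), size - max(0, step)))
        dst.append(slice(max(0, step), size - max(0, -step)))
    a = volume[tuple(src)].ravel()
    b = volume[tuple(dst)].ravel()
    np.add.at(matrix, (a, b), 1.0)
    matrix += matrix.T  # count each voxel pair in both orders
    return matrix / matrix.sum()

def glcm_features(p):
    """Textbook GLCM indices for a normalized co-occurrence matrix p."""
    i, j = np.indices(p.shape)
    nonzero = p[p > 0]
    return {
        "homogeneity": float(np.sum(p / (1.0 + np.abs(i - j)))),
        "energy": float(np.sum(p ** 2)),
        "contrast": float(np.sum((i - j) ** 2 * p)),
        "dissimilarity": float(np.sum(np.abs(i - j) * p)),
        "entropy_log2": float(-np.sum(nonzero * np.log2(nonzero))),
    }

def mean_glcm_features(volume, levels):
    """Average each index over the 13 directions."""
    per_direction = [glcm_features(glcm(volume, d, levels)) for d in directions_3d()]
    return {k: float(np.mean([f[k] for f in per_direction])) for k in per_direction[0]}

# Usage on a small synthetic volume with 4 gray levels.
rng = np.random.default_rng(0)
demo_volume = rng.integers(0, 4, size=(4, 4, 4))
demo_features = mean_glcm_features(demo_volume, levels=4)
```

The input must already be discretized into integer gray levels in the range 0 to `levels - 1`, since the values are used directly as matrix indices.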
The neighborhood gray level difference matrix (NGLDM) [16] corresponds to the difference in gray level between a voxel and its 26 neighbors in 3D (8 in 2D). Three texture indices (coarseness, contrast, and busyness) are computed from this matrix. The gray level zone length matrix (GLZLM) [17] provides information about the size of the homogeneous zones for each gray level in 3D (or 2D). Eleven texture indices are computed from this matrix: Short-Zone Emphasis, Long-Zone Emphasis, Low Gray-level Zone Emphasis, High Gray-level Zone Emphasis, Short-Zone Low Gray-level Emphasis, Short-Zone High Gray-level Emphasis, Long-Zone Low Gray-level Emphasis, Long-Zone High Gray-level Emphasis, Gray-Level Non-Uniformity for zone, Zone Length Non-Uniformity, and Zone Percentage.
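To make the notion of a homogeneous zone concrete, the sketch below finds connected zones of equal gray level with a breadth-first search and derives three of the zone indices. It is an illustration under stated assumptions rather than the LIFEx code: it uses full connectivity (8 neighbors in 2D, 26 in 3D), treats the whole array as the ROI, and uses textbook forms of Short-Zone Emphasis, Long-Zone Emphasis, and Zone Percentage. The function names are hypothetical.

```python
from collections import deque
import itertools
import numpy as np

def zone_sizes(volume):
    """Return a list of (gray_level, zone_size) for every connected zone."""
    # All neighbor offsets except the zero offset: 8 in 2D, 26 in 3D.
    neighbors = [d for d in itertools.product((-1, 0, 1), repeat=volume.ndim) if any(d)]
    visited = np.zeros(volume.shape, dtype=bool)
    zones = []
    for start in np.ndindex(volume.shape):
        if visited[start]:
            continue
        level = volume[start]
        visited[start] = True
        queue, size = deque([start]), 0
        while queue:
            voxel = queue.popleft()
            size += 1
            for d in neighbors:
                nxt = tuple(v + s for v, s in zip(voxel, d))
                inside = all(0 <= n < m for n, m in zip(nxt, volume.shape))
                if inside and not visited[nxt] and volume[nxt] == level:
                    visited[nxt] = True
                    queue.append(nxt)
        zones.append((int(level), size))
    return zones

def zone_emphasis(zones):
    """Short-Zone Emphasis, Long-Zone Emphasis, and Zone Percentage."""
    sizes = np.array([s for _, s in zones], dtype=float)
    return {
        "short_zone_emphasis": float(np.mean(1.0 / sizes ** 2)),
        "long_zone_emphasis": float(np.mean(sizes ** 2)),
        # Number of zones divided by the number of voxels.
        "zone_percentage": float(len(sizes) / sizes.sum()),
    }

# Usage on a small 2D image with three zones of three pixels each.
demo_image = np.array([[0, 0, 1],
                       [0, 1, 1],
                       [2, 2, 2]])
demo_zones = zone_sizes(demo_image)
demo_indices = zone_emphasis(demo_zones)
```

The GLZLM itself is the table that counts, for each gray level and zone size, how many zones of that kind occur; the list of (gray level, size) pairs returned here carries the same information.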
All texture analysis in this article was performed on the LIFEx (Local Image Features Extraction) platform [18]. Three attending physicians with training in imaging delineated the lung infection area (region of interest, ROI) on each slice of the CT image of each sample. A senior physician then reviewed and, where necessary, modified the delineations; finally, a three-dimensional ROI was obtained for each CT image (Figure 2).
The voxels of the 3D ROI in each CT image were then spatially resampled to 1 mm × 1 mm × 0.5 mm. The initial voxel values were resampled into 256 gray levels spanning the range from mean − 3·SD to mean + 3·SD of the ROI content, where mean and SD are the mean and standard deviation of the voxel values within the ROI, respectively. Finally, the 32 texture features described above were calculated from the ROI of each participant. We also built a histogram for each CT image and calculated five radiomic histogram features related to histogram skewness, kurtosis, and entropy.
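The intensity discretization and the histogram descriptors can be illustrated with a short NumPy sketch. It assumes that values outside mean ± 3·SD are clipped to the nearest bound and that skewness, kurtosis, and entropy follow their usual moment-based and Shannon definitions; the text does not list the five histogram features individually, so the quantities below are examples only, and the function names are hypothetical.

```python
import numpy as np

def discretize_roi(values, levels=256, n_sd=3.0):
    """Map ROI voxel values to integer gray levels over mean +/- n_sd * SD."""
    values = np.asarray(values, dtype=float)
    lo = values.mean() - n_sd * values.std()
    hi = values.mean() + n_sd * values.std()
    if hi == lo:  # constant ROI: every voxel falls in the same bin
        return np.zeros(values.shape, dtype=int)
    clipped = np.clip(values, lo, hi)  # assumption: out-of-range values are clipped
    bins = np.floor((clipped - lo) / (hi - lo) * levels).astype(int)
    return np.minimum(bins, levels - 1)  # the upper bound belongs to the last bin

def histogram_features(gray_levels):
    """Skewness, kurtosis, and Shannon entropy of discretized gray levels.

    Assumes the ROI is not constant (standard deviation greater than zero).
    """
    x = np.asarray(gray_levels, dtype=float)
    mu, sigma = x.mean(), x.std()
    _, counts = np.unique(gray_levels, return_counts=True)
    p = counts / counts.sum()
    return {
        "skewness": float(np.mean((x - mu) ** 3) / sigma ** 3),
        "kurtosis": float(np.mean((x - mu) ** 4) / sigma ** 4),
        "entropy_log2": float(-np.sum(p * np.log2(p))),
        "entropy_log10": float(-np.sum(p * np.log10(p))),
    }

# Usage with synthetic voxel values (hypothetical Hounsfield-like numbers).
rng = np.random.default_rng(0)
demo_values = rng.normal(loc=-600.0, scale=150.0, size=5000)
demo_levels = discretize_roi(demo_values)
demo_hist = histogram_features(demo_levels)
```

Fixing the range relative to the mean and standard deviation of each ROI makes the gray levels comparable across scans whose absolute intensities differ.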
- Diagnostic Classification
This study used a machine learning method, the support vector machine (SVM). The concept of the SVM was first proposed by Vapnik and Cortes [19] in 1995. It is based on the Vapnik-Chervonenkis (VC) dimension theory of statistical learning and the principle of structural risk minimization. It has many advantages for small-sample, nonlinear, and high-dimensional pattern recognition problems. The SVM finds the hyperplane that maximizes the margin, i.e., the distance between the hyperplane and the sample points of each class that lie closest to it.
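For reference, the margin maximization can be written as the standard soft-margin optimization problem. The notation is the usual textbook one and is not taken from the original article: x_i are the feature vectors, y_i ∈ {−1, +1} the class labels, φ the feature map induced by the kernel, ξ_i the slack variables, and C the penalty coefficient that controls the tolerance for misclassified training samples.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i \\
\text{subject to}\quad & y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,
\qquad \xi_i \ge 0,\quad i = 1,\dots,n
\end{aligned}
```

Because the margin width equals 2/‖w‖, minimizing ‖w‖² maximizes the margin, while a larger C penalizes margin violations more heavily.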
After calculating the textural and histogram features of each sample, we obtained a 180 × 37 feature matrix, where 180 is the number of subjects (90 patients with COVID–19 and 90 patients with other pneumonias) and 37 is the number of extracted textural and histogram features. Using this feature matrix as input, SVMs with different kernels (linear, radial basis function (RBF), polynomial (Poly), and sigmoid) were trained to classify COVID–19 patients versus patients with other pneumonias. The classification models were trained and tested with 10-fold cross-validation (CV). Within each training set, an inner 10-fold CV was used to tune the penalty coefficient C (fault tolerance). This process was repeated 20 times, and the average of the test results (accuracy, sensitivity, specificity, and area under the ROC curve (AUC)) over the 20 rounds of 10-fold CV was taken as the final SVM classification performance. All machine learning training and testing was carried out in PyCharm (http://www.jetbrains.com/pycharm/, JetBrains PyCharm Community Edition 2018.2.4 x64).
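The repeated nested cross-validation described above might be sketched as follows. Several points are assumptions rather than details from the text: the use of scikit-learn (the text names only the PyCharm environment), the standardization step, the candidate grid for C, the fold seeds, and the label coding (1 for COVID–19, 0 for other pneumonias). The function name is hypothetical.

```python
import numpy as np
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def repeated_nested_cv(X, y, kernel="rbf", n_repeats=20, n_folds=10,
                       c_grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Average test metrics over repeated outer CV with an inner CV that tunes C."""
    scoring = {
        "accuracy": "accuracy",
        "sensitivity": "recall",  # recall of the positive class (label 1)
        "specificity": make_scorer(recall_score, pos_label=0),
        "auc": "roc_auc",
    }
    per_repeat = {name: [] for name in scoring}
    for repeat in range(n_repeats):
        inner = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=repeat)
        outer = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=1000 + repeat)
        # Assumption: features are standardized inside each training fold.
        model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
        # The inner CV selects C using the training portion of each outer fold only.
        search = GridSearchCV(model, {"svc__C": list(c_grid)}, cv=inner, scoring="accuracy")
        scores = cross_validate(search, X, y, cv=outer, scoring=scoring)
        for name in scoring:
            per_repeat[name].append(scores[f"test_{name}"].mean())
    return {name: float(np.mean(values)) for name, values in per_repeat.items()}

# Usage on synthetic, well-separated data (not the study data), with small
# settings so that the example runs quickly.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (30, 5)), rng.normal(3.0, 1.0, (30, 5))])
y_demo = np.array([0] * 30 + [1] * 30)
demo_metrics = repeated_nested_cv(X_demo, y_demo, kernel="linear", n_repeats=2, n_folds=3)
```

Placing the scaler and the search for C inside the outer loop keeps the test folds out of every tuning decision, which is what makes the outer estimates usable as performance measures.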
- Statistical Analysis and Correlation Analysis
The demographic data for all participants were analyzed using SPSS 22. Differences in age between COVID–19 patients and patients with other pneumonias were compared using the Wilcoxon rank-sum test. Gender differences were assessed using the chi-squared test.
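The same two comparisons can be reproduced outside SPSS; the sketch below uses SciPy purely as an illustration and is not the procedure used in the study. The sex counts are those reported in the text, whereas the age lists are hypothetical placeholders, because per-patient ages are not given. Note that `chi2_contingency` applies Yates' continuity correction to 2 × 2 tables by default, so its P value may differ from an uncorrected test.

```python
from scipy.stats import chi2_contingency, ranksums

def compare_groups(ages_a, ages_b, sex_counts):
    """Return P values for the age (rank-sum) and sex (chi-squared) comparisons."""
    age_p = ranksums(ages_a, ages_b).pvalue
    chi2, sex_p, dof, expected = chi2_contingency(sex_counts)
    return {"age_p": float(age_p), "sex_p": float(sex_p)}

# Sex counts reported in the text: rows are groups, columns are (male, female).
sex_counts = [[56, 34], [58, 32]]
# Hypothetical ages for illustration only.
ages_covid = [31, 38, 42, 45, 47, 52, 58, 63]
ages_other = [35, 39, 44, 46, 48, 50, 55, 57]
p_values = compare_groups(ages_covid, ages_other, sex_counts)
```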
Nonparametric permutation tests were used to estimate the statistical significance of the average classification performance, i.e., to determine whether it exceeded chance level. The class labels of the training data were randomly permuted 1,000 times before training, and the 20 rounds of the 10-fold CV procedure were repeated for each permutation. The P value of the permutation test was defined as: 𝑃 = (𝑁exceeds + 1)/(𝑁substitution + 1). Here, 𝑁exceeds is the number of times the permuted performance exceeded the performance obtained with the true labels, and 𝑁substitution is the number of permutations.
In the COVID–19 patients, correlation analysis was also conducted to determine whether the textural and histogram features correlated with the blood laboratory test indices, i.e., blood oxygen (SPO2H), white blood cell count (WBC), lymphocytes (LYM), neutrophils (NE), C-reactive protein (CRP), hypersensitive C-reactive protein (hs-CRP), and erythrocyte sedimentation rate (ESR).
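The P value formula above translates directly into code. In this minimal sketch, `score_fn` is a hypothetical placeholder for the full procedure (20 rounds of 10-fold CV on the permuted labels) and is not part of the original article; the seed is likewise an assumption added for reproducibility.

```python
import random

def permutation_p_value(true_score, permuted_scores):
    """P = (N_exceeds + 1) / (N_substitution + 1)."""
    n_exceeds = sum(1 for score in permuted_scores if score > true_score)
    return (n_exceeds + 1) / (len(permuted_scores) + 1)

def permuted_label_scores(labels, score_fn, n_permutations=1000, seed=0):
    """Score the classifier on n_permutations random shufflings of the labels."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        scores.append(score_fn(shuffled))
    return scores

# Usage with made-up scores: no permuted score exceeds the true one,
# so P takes its smallest possible value, 1 / (1000 + 1).
p_min = permutation_p_value(0.90, [0.50] * 1000)
```

Adding 1 to both the numerator and the denominator counts the true labeling as one of the permutations, so the estimated P value can never be exactly zero.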