Study design, population, and data collection:
This retrospective study was approved by our Institutional Review Board with a waiver of informed consent. The study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guidelines for cross-sectional studies (http://www.equator-network.org/reporting-guidelines/strobe/).
Figure 1 shows the flowchart of patient selection. At the time of this study, our registry of patients presenting to the emergency department (ED) with suspected COVID-19 (also known as persons under investigation) consisted of 5,766 patients seen from February 7, 2020, to June 30, 2020. Subsets of clinical variables from this cohort had previously been analyzed with various methods and published, but those studies addressed entirely different questions [26, 27]. Only patients diagnosed with COVID-19 by real-time polymerase chain reaction (RT-PCR) for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) were included. Inclusion criteria were SARS-CoV-2-positive patients requiring invasive mechanical ventilation (IMV). Patients younger than 18 years of age were excluded. To ensure a consistent cohort with at least five consecutive days of data, patients with fewer than five days of data were excluded. The final sample after exclusions consisted of 110 IMV survivors and 76 IMV non-survivors prior to discharge.
Input variables
The input variables include serial portable chest radiographs (pCXR); demographic information (age, sex, ethnicity, and race); chronic comorbidities (smoking, diabetes, hypertension, asthma, chronic obstructive pulmonary disease, coronary artery disease, heart failure, cancer, immunosuppression, and chronic kidney disease); serial vital signs (heart rate, respiratory rate, pulse oxygen saturation, systolic blood pressure, diastolic blood pressure, mean arterial pressure, temperature, FiO2, pCO2, HCO3, pH, pO2, hematocrit, potassium, and sodium); and serial laboratory tests (C-reactive protein, D-dimer, ferritin, lactate dehydrogenase, lymphocyte count, procalcitonin, alanine aminotransferase, aspartate aminotransferase, brain natriuretic peptide, troponin, and white blood cell count).
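For clarity, the three input streams consumed by the model can be organized as follows; this grouping is illustrative, and the shorthand field names are ours, not the study's actual variable names.

```python
# Illustrative grouping of the input variables into the three streams
# described above; names are shorthand placeholders, not study field names.
INPUT_STREAMS = {
    "serial_imaging": ["pCXR"],  # one portable chest radiograph per day
    "serial_non_imaging": [
        # serial vital signs and related serial values
        "heart_rate", "respiratory_rate", "spo2", "sbp", "dbp", "map",
        "temperature", "fio2", "pco2", "hco3", "ph", "po2", "hematocrit",
        "potassium", "sodium",
        # serial laboratory tests
        "crp", "d_dimer", "ferritin", "ldh", "lymphocytes", "procalcitonin",
        "alt", "ast", "bnp", "troponin", "wbc",
    ],
    "static": [
        # demographics
        "age", "sex", "ethnicity", "race",
        # chronic comorbidities
        "smoking", "diabetes", "hypertension", "asthma", "copd", "cad",
        "heart_failure", "cancer", "immunosuppression", "ckd",
    ],
}
```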
Statistical methods
Statistical analyses were performed in SPSS v26 (IBM, Armonk, NY). Categorical variables are presented as frequencies and percentages and were compared between groups using χ2 tests. Continuous variables are presented as medians and interquartile ranges (IQR) and were compared between groups using Mann-Whitney U tests.
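As an illustration of these group comparisons, the following sketch reproduces the same tests in Python with SciPy rather than SPSS; the variable names and data are synthetic placeholders.

```python
# Minimal sketch of the group comparisons described above, using SciPy in
# place of SPSS; the variables (sex, age, survived) and data are synthetic.
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu

rng = np.random.default_rng(0)
survived = rng.integers(0, 2, size=186)  # 0 = non-survivor, 1 = survivor
sex = rng.integers(0, 2, size=186)       # example categorical variable
age = rng.normal(60, 15, size=186)       # example continuous variable

# Categorical variable: chi-squared test on the 2x2 contingency table.
table = np.array([[np.sum((sex == s) & (survived == g)) for g in (0, 1)]
                  for s in (0, 1)])
chi2, p_cat, _, _ = chi2_contingency(table)

# Continuous variable: Mann-Whitney U test between outcome groups,
# reported as median [IQR] per group.
u, p_cont = mannwhitneyu(age[survived == 1], age[survived == 0])
med, q1, q3 = np.percentile(age[survived == 1], [50, 25, 75])
print(f"sex: p={p_cat:.3f}; age (survivors): {med:.0f} [{q1:.0f}-{q3:.0f}], p={p_cont:.3f}")
```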
Outcome measures: The two outcomes were in-hospital mortality and the duration of IMV in days (i.e., the number of days a patient required IMV). A total of 15 prediction configurations were evaluated, crossing five temporal windows with three input combinations. Mortality and duration on IMV were predicted using either: a) data from the first day of IMV only (day 1 data), b) data from the fifth day of IMV only (day 5 data), c) data from the first three consecutive days of IMV (day 1–3 data), d) data from consecutive days 3–5 of IMV (day 3–5 data), or e) data from the first five consecutive days of IMV (day 1–5 data). For each window, predictions were made using: i) pCXR data alone, ii) non-imaging data alone, and iii) both pCXR and non-imaging data.
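The count of 15 follows directly from crossing the five windows with the three input combinations, as the short enumeration below sketches; the labels are ours.

```python
# The 15 prediction configurations arise from crossing the five temporal
# windows with the three input combinations; this enumeration is a sketch,
# not code from the study.
from itertools import product

windows = ["day 1", "day 5", "days 1-3", "days 3-5", "days 1-5"]
inputs = ["pCXR only", "non-imaging only", "pCXR + non-imaging"]

configs = list(product(windows, inputs))
assert len(configs) == 15  # 5 windows x 3 input sets
for window, features in configs:
    print(f"{window:>8} | {features}")
```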
Deep-learning architecture: The architecture of the deep-learning algorithm (Fig. 2) has three main inputs: serial pCXRs, serial non-imaging features, and demographics/comorbidities. A 2D convolutional neural network (CNN) module designed to capture image patterns from pCXRs was based on VGG-16, a classical CNN architecture that has been widely proven effective [28]. The 2D CNN module consists of five convolutional blocks: the first two blocks have two convolutional layers each, while the last three have three convolutional layers each. The last convolutional layer in each block uses a stride of two to replace the max-pooling of the original VGG architecture, which has been shown to improve non-linear fitting ability [29]. To balance the computational burden and effectiveness of the system, the number of channels was reduced from 64-128-256-512-512 in the original VGG network to 16-32-64-128-128 in our system. Each convolutional layer is followed by a ReLU activation function to introduce non-linearity, and batch normalization layers are deployed to stabilize training and reduce the risk of overfitting.
After normalization, longitudinal features, including serial vital signs and serial laboratory tests, are concatenated with the image patterns extracted from the pCXRs and fed into one long short-term memory (LSTM) layer, a deep-learning technique for processing time-series data. The LSTM layer is a type of recurrent neural network (RNN) layer; compared with traditional RNN layers, an LSTM can control memory over time and the flow of information into and out of the layer through three "gates": the input, output, and forget gates [30]. The LSTM layer uses 200 hidden units with the hyperbolic tangent (tanh) activation function.
Non-longitudinal features, including demographic information and chronic comorbidities, are processed by three fully connected layers and concatenated with the LSTM output; all features extracted from the three sources are then fed into three more fully connected layers to make the final predictions. Between the last three fully connected layers, two dropout layers with a dropout rate of 0.1 are deployed to prevent overfitting.
For training, the SGD optimizer was used with a learning rate of 1e-4, and Nesterov momentum (momentum = 0.9) was applied to avoid local minima of the loss. Categorical cross-entropy was used as the loss function to measure the difference between predicted results and the ground truth. The training process lasted 20 epochs.
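A minimal Keras sketch of the described network follows. Only the elements stated above (reduced VGG-style blocks with channels 16-32-64-128-128, strided final convolutions, batch normalization, ReLU, a 200-unit tanh LSTM, three fully connected layers per branch, dropout rate 0.1, and the SGD configuration) come from the text; the input shapes, pooling choice, fully connected layer widths, and feature counts are assumptions for illustration.

```python
# A minimal sketch of the described architecture; shapes, feature counts,
# dense-layer widths, and the pooling choice are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

T, H, W = 5, 224, 224      # assumed: up to 5 daily pCXRs at 224x224
N_LONG, N_STATIC = 26, 14  # assumed counts of longitudinal / static features

def conv_block(x, filters, n_convs):
    """VGG-style block; the last convolution uses stride 2 in place of pooling."""
    for i in range(n_convs):
        stride = 2 if i == n_convs - 1 else 1
        x = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x

# Per-day image encoder: reduced-width VGG-16 (channels 16-32-64-128-128).
img_in = layers.Input((H, W, 1))
x = img_in
for filters, n_convs in [(16, 2), (32, 2), (64, 3), (128, 3), (128, 3)]:
    x = conv_block(x, filters, n_convs)
cnn = Model(img_in, layers.GlobalAveragePooling2D()(x))  # pooling choice assumed

# Three inputs: serial pCXRs, serial non-imaging data, static data.
cxr_seq = layers.Input((T, H, W, 1))
long_seq = layers.Input((T, N_LONG))
static = layers.Input((N_STATIC,))

img_feats = layers.TimeDistributed(cnn)(cxr_seq)   # per-day image features
seq = layers.Concatenate()([img_feats, long_seq])  # per-day fusion
seq = layers.LSTM(200, activation="tanh")(seq)     # 200 hidden units

d = static
for units in (64, 32, 16):                         # widths assumed
    d = layers.Dense(units, activation="relu")(d)  # three FC layers

h = layers.Concatenate()([seq, d])
h = layers.Dense(64, activation="relu")(h)         # three more FC layers,
h = layers.Dropout(0.1)(h)                         # with two dropout layers
h = layers.Dense(32, activation="relu")(h)
h = layers.Dropout(0.1)(h)
out = layers.Dense(2, activation="softmax")(h)     # mortality classes

model = Model([cxr_seq, long_seq, static], out)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9, nesterov=True),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```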
Performance evaluation: To predict mortality (a binary variable), ROC analysis was employed, with area under the curve (AUC), accuracy, precision, recall, and F1 score reported. Results are reported using stratified five-fold cross-validation. Figure 3 shows a diagram of the cross-validation process. In stratified five-fold cross-validation, the dataset was split into five subsets of equal size with an equal ratio of samples from each outcome class. Four-fifths of the data were used to train the model while the remaining one-fifth was held out for validation, creating an 80%:20% training:validation split. This was repeated five times so that each fold served as the validation set once, and the reported performance is the average over the five folds. Each fold was evaluated with the same metrics: accuracy, AUC, specificity, and sensitivity. Individual values for each fold were not reported; rather, the mean with standard deviation is shown in Tables 2 and 3. The held-out validation set was treated as a test set and was not touched during training. There was no external validation on data from another institution because of the difficulty of obtaining such detailed data. The DeLong test was used to compare AUC differences between groups; a p-value < 0.05 was considered statistically significant. To predict duration on IMV (a continuous variable), correlation analysis was employed; slopes, intercepts, correlation coefficients, p-values, and mean absolute errors (MAE) were calculated.
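A sketch of this stratified five-fold evaluation is given below, using scikit-learn utilities; build_model() is a hypothetical stand-in for the network above, and the metric set shown is a subset of those reported.

```python
# Sketch of stratified five-fold cross-validation with per-fold metrics,
# reported as mean +/- standard deviation; build_model() is hypothetical.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, accuracy_score

def evaluate(X, y, build_model, n_splits=5, seed=0):
    # Stratified splits preserve the outcome-class ratio in every fold.
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs, accs = [], []
    for train_idx, val_idx in skf.split(X, y):
        model = build_model()                  # fresh model per fold
        model.fit(X[train_idx], y[train_idx])  # held-out fold never seen in training
        p = model.predict_proba(X[val_idx])[:, 1]
        aucs.append(roc_auc_score(y[val_idx], p))
        accs.append(accuracy_score(y[val_idx], p > 0.5))
    return (np.mean(aucs), np.std(aucs)), (np.mean(accs), np.std(accs))
```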
To minimize overfitting, we employed the following approaches: 1) batch normalization layers (with ReLU activations) were deployed throughout the network, 2) five-fold cross-validation was used, 3) regularization was applied, 4) training was stopped early when no improvement was seen for 10 epochs (as sketched below), and 5) only clinical variables shown to be relevant to predicting mortality in our previous studies were used.
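A minimal example of the early-stopping rule in item 4, expressed as a Keras callback, follows; monitoring validation loss and restoring the best weights are assumptions, while the patience of 10 epochs comes from the text.

```python
# Early stopping after 10 epochs without improvement; the monitored
# quantity (val_loss) and weight restoration are assumptions.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=10,
                           restore_best_weights=True)
# model.fit(..., epochs=20, validation_data=val_data, callbacks=[early_stop])
```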