Table 1 Summary of machine learning methods used in COVID-19 prediction tasks¹
Category | Method family | Methods
Non-deep learning methods | Statistics | LDA [38], NB [40], k-NN [42]
Non-deep learning methods | Regression | LR [36,37], Cox [32]
Non-deep learning methods | SVM | [36]
Non-deep learning methods | Decision tree | RF [36], XGBoost [1]
Deep learning methods | Basic NN | BPNN [28,32,36], GRNN [29], RBFNN [29], PNN [29]
Deep learning methods | CNNs | Basic CNN [30], Transfer CNN [31], GDCNN [33], COVID-Net [34], COVNet [35]
Deep learning methods | RNNs | /
¹ We use abbreviations for the methods; the full names are listed in Table 1.
Table 2 Demographic, laboratory and outcome information of the 375 samples in the training dataset
Characteristics* | Statistics+ | Characteristics* | Statistics+
Demographics | | Outcomes |
Age, mean (years) | 58.83 | Survival, count and rate | 201, 53.6%
Gender | Male 224; Female 151 | Mortality, count and rate | 174, 46.4%
Lab test, mean (min, max) of patients' last measurements | | |
cTnI | 747.76 (1.9, 43905.19) | Glucose | 8.37 (2.43, 32.37)
Hemoglobin | 124.89 (6.4, 178.0) | Neutrophils count | 7.03 (0.85, 31.43)
Serum chloride | 102.73 (77.7, 138.2) | Direct bilirubin | 9.5 (1.7, 216.3)
Prothrombin time | 16.01 (11.69, 84.22) | Mean platelet volume | 10.89 (9.04, 14.0)
Procalcitonin | 0.99 (0.02, 49.34) | Ferritin | 1634.37 (26.8, 42402.91)
Eosinophils(%) | 0.56 (0.0, 5.61) | RBCW-SD | 41.78 (32.5, 83.3)
sIL-2R | 961.52 (61.0, 5608.04) | Thrombin time | 18.17 (13.0, 133.67)
Alkaline phosphatase | 84.86 (37.18, 481.5) | Lymphocyte(%) | 16.45 (0.6, 54.83)
Albumin | 33.06 (19.1, 45.07) | Anti-HCV | 0.13 (0.03, 1.85)
Basophil(%) | 0.21 (0.0, 1.38) | D-D dimer | 6.2 (0.21, 29.82)
Interleukin 10 | 12.5 (5.0, 500.0) | Total cholesterol | 3.66 (0.66, 6.11)
Total bilirubin | 16.28 (2.95, 276.0) | AST | 56.53 (8.0, 1858.0)
Platelet count | 187.78 (1.2, 472.5) | Uric acid | 297.47 (57.0, 1001.0)
Monocytes(%) | 6.66 (0.62, 31.62) | HCO3- | 22.59 (6.3, 29.7)
Antithrombin | 87.57 (20.0, 136.0) | Calcium | 2.1 (1.35, 2.5)
Interleukin 8 | 88.08 (5.0, 6385.85) | NT-proBNP | 3166.26 (5.57, 70000.0)
Indirect bilirubin | 6.85 (1.0, 79.3) | LDH | 492.12 (143.0, 1867.0)
RDW | 12.94 (10.91, 22.91) | Platelet large cell ratio | 31.63 (16.48, 54.1)
Neutrophils(%) | 76.09 (15.9, 98.1) | Interleukin 6 | 130.37 (1.5, 5000.0)
Total protein | 66.06 (36.7, 79.33) | FDP | 44.29 (4.0, 182.4)
Anti-TP | 0.17 (0.02, 8.74) | Monocytes count | 0.54 (0.08, 22.75)
Prothrombin activity | 80.97 (16.3, 136.64) | PLT distribution width | 12.97 (8.42, 25.3)
HBsAg | 10.35 (0.0, 250.0) | Globulin | 32.96 (13.8, 46.04)
Mean corpuscular volume | 89.75 (62.3, 110.73) | γ-GT | 51.79 (10.6, 555.25)
Hematocrit | 37.0 (19.9, 51.3) | INR | 1.3 (0.86, 10.51)
WBC | 15.18 (0.8, 1726.6) | Basophil count(#) | 0.02 (0.0, 0.12)
Tumor necrosis factor α | 12.07 (4.0, 168.0) | 2019-nCoV nucleic acid | -1.0 (-1.0, -1.0)
MCHC | 344.17 (286.0, 464.33) | MCH | 30.9 (20.8, 47.67)
Fibrinogen | 4.38 (0.5, 8.89) | APTT | 41.72 (25.59, 100.27)
Interleukin 1β | 6.56 (5.0, 79.44) | hs-CRP | 72.49 (0.2, 320.0)
Urea | 9.74 (2.17, 68.4) | Anti-HIV | 0.1 (0.05, 0.26)
Lymphocyte count | 1.06 (0.13, 35.53) | Serum sodium | 141.06 (122.8, 171.4)
pH value | 6.47 (5.0, 7.54) | Thrombocytocrit | 0.21 (0.05, 0.49)
Red blood cell count | 7.58 (0.1, 749.5) | ESR | 32.33 (2.0, 102.0)
Eosinophil count | 0.03 (0.0, 0.33) | GPT | 42.6 (5.0, 1508.0)
Corrected calcium | 2.34 (1.78, 2.67) | eGFR | 81.07 (2.0, 164.7)
Serum potassium | 4.45 (3.1, 9.86) | Creatinine | 120.38 (39.25, 1497.0)
* Characteristics are of three types: demographics (age and gender), outcomes (survival and mortality) and laboratory tests (74 items).
+ Statistics are the summary values for the corresponding characteristics, such as rate, mean and range. The statistic reported is described in each column.
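The "mean (min, max)" entries for laboratory tests are computed over each patient's last measurement. A minimal sketch of that summary is given below, assuming a hypothetical record layout (a dict mapping a patient ID to a list of `(day, value)` pairs for one test); the function names and the toy values are illustrative and are not taken from the dataset.

```python
from statistics import mean


def summarize_last_measurements(records):
    """Return (mean, min, max) over each patient's most recent value.

    `records` maps a patient ID to a list of (day, value) pairs for a
    single laboratory test. This layout is an assumption for illustration.
    """
    last_values = []
    for measurements in records.values():
        if not measurements:
            continue  # this patient never had the test
        # Keep only the value recorded on the latest day.
        _, value = max(measurements, key=lambda m: m[0])
        last_values.append(value)
    if not last_values:
        raise ValueError("no measurements available")
    return round(mean(last_values), 2), min(last_values), max(last_values)


def format_summary(summary):
    """Format a summary tuple in the 'mean (min, max)' style of the table."""
    avg, low, high = summary
    return f"{avg} ({low}, {high})"


# Toy data: three patients, one of them without any measurement.
toy_records = {
    "p1": [(0, 250.0), (5, 300.0)],
    "p2": [(2, 410.0)],
    "p3": [],
}
toy_summary = summarize_last_measurements(toy_records)
print(format_summary(toy_summary))  # 355.0 (300.0, 410.0)
```

Patients without a measurement for a test are skipped here; how missing tests were actually handled is not stated in the table, so that choice is also an assumption.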
Table 3 AUC-ROC of COVID-19 mortality prediction results by using baselines
Model | 0 days early* | 3 days early* | 6 days early | 9 days early | 12 days early
Cox¹ | 0.955 ± 0.06 | 0.992 ± 0.02 | 0.87 ± 0.01 | 0.85 ± 0.01 | 0.810
k-NN¹ | 0.950 ± 0.02 | 0.909 ± 0.01 | 0.89 ± 0.02 | 0.84 ± 0.02 | 0.816
SVM¹ | 0.969 ± 0.04 | 0.954 ± 0.02 | 0.930 ± 0.03 | 0.895 ± 0.04 | 0.857
DT¹ | 0.974 | 0.959 ± 0.03 | 0.924 ± 0.00 | 0.897 ± 0.01 | 0.869
BPNN¹ | 0.98 ± 0.02 | 0.954 ± 0.05 | 0.9331 | 0.894 | 0.878
PNN¹ | 0.985 ± 0.01 | 0.961 ± 0.02 | 0.9402 | 0.889 | 0.889
RNN¹ | 0.985 | 0.96 ± 0.01 | 0.931 ± 0.00 | 0.910 | 0.871
LSTM¹ | 0.99 ± 0.01 | 0.961 | 0.937 ± 0.02 | 0.920 | 0.897
T-LSTM² | 0.997 ± 0.00 | 0.969 ± 0.01 | 0.947 | 0.921 | 0.914
* n days early: The models make predictions n days before the final death/survival time. They use sequence data from day 0 up to n days before the last time point.
¹ Cox: Cox's proportional hazards regression model is a semiparametric regression model that can analyze the influence of multiple factors on outcomes. It is used in [32].
¹ k-NN: The k-Nearest Neighbors method makes predictions based on the k nearest samples in the training set. In this mortality prediction task, the most accurate results were obtained with k = 3.
¹ SVM: Support Vector Machines classify by solving for the separating hyperplane that divides the training data correctly and has the largest geometric margin.
¹ DT: A decision tree is a simple classifier consisting of sequences of hierarchically organized binary decisions. It is used in [6].
¹ BPNN: A Back-Propagation Neural Network propagates the signal forward and the error backward. It is used in [28].
¹ PNN: A Probabilistic Neural Network is a feedforward network that uses Bayesian decision-making and does not need back propagation to optimize its parameters. It is used in [29].
¹ RNN: The Recurrent Neural Network is introduced in the ‘T-LSTM’ section.
¹ LSTM: Long Short-Term Memory is also introduced in the ‘T-LSTM’ section. Here, the hyperparameter settings are the same as for T-LSTM.
² T-LSTM: Time-aware LSTM is the model used in this paper. Its inputs are three-dimensional vectors together with the time intervals. The three dimensions are the LDH, lymphocyte and hs-CRP values from patients’ blood tests. Its output is the binary result 0/1, where 0 indicates survival and 1 indicates death. The hidden states in its units are 64-dimensional, and the fully connected layer has 32 dimensions.
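All models in this table are compared by AUC-ROC on a binary outcome (0 for survival, 1 for death). As a reminder of what the metric measures, the sketch below computes it through the pairwise-ranking definition: the fraction of (death, survival) pairs in which the death case receives the higher score, with ties counted as one half. The function name and the toy labels and scores are illustrative assumptions, not values from the experiments.

```python
def auc_roc(labels, scores):
    """AUC-ROC via pairwise comparison of positive and negative scores.

    labels: iterable of 0 (survival) or 1 (death)
    scores: predicted probability or score of death, same length as labels
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one sample of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy example: 3 deaths and 3 survivals with hypothetical scores.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
toy_auc = auc_roc(toy_labels, toy_scores)
print(round(toy_auc, 3))  # 0.778 (7 of 9 pairs ranked correctly)
```

This quadratic-time form is only meant to make the definition explicit; a library routine would normally be used on real data.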
Table 4 AUC-ROC of COVID-19 mortality prediction results by using T-LSTM on different sets at different timestamps
Prediction time | Training+ | Validation+ | Test+
0 days early* | 0.996 ± 0.01 | 0.997 ± 0.01 | 0.997 ± 0.00
3 days early | 0.989 ± 0.00 | 0.987 ± 0.01 | 0.969 ± 0.01
6 days early | 0.960 | 0.9572 | 0.947
9 days early | 0.944 | 0.935 ± 0.01 | 0.921
12 days early | 0.926 ± 0.01 | 0.9242 | 0.914
15 days early | 0.891 ± 0.01 | 0.883 ± 0.01 | 0.863
18 days early | 0.852 ± 0.01 | 0.8342 | 0.8192
* n days early: The model makes a prediction n days before the final death/survival time. It uses sequence data from day 0 up to n days before the last time point.
+ We use the records of 375 patients as the training set; the ratio of training set to validation set is 0.8:0.2. The records of 110 patients make up the test set. The experiment is conducted with 5-fold cross-validation.
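With 375 records, a 0.8:0.2 ratio corresponds to 300 training and 75 validation records in each of the 5 folds. The sketch below shows one way such folds could be generated. The shuffling, the seed and the function name `k_fold_splits` are assumptions for illustration; the footnote does not say how the folds were actually drawn (for example, whether they were stratified by outcome).

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds.

    Indices are shuffled once with a fixed seed, then dealt into k folds;
    each fold serves as the validation set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 375 patient records, 5 folds: each fold trains on 300 and validates on 75.
splits = list(k_fold_splits(375, k=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))  # 300 75
```

The 110 test records would be held out entirely and never passed through this splitting step.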
Table 5 AUC-ROC of COVID-19 mortality prediction results by using T-LSTM with 40 or 3 laboratory tests+
Input | 0 days early* | 3 days early* | 6 days early | 9 days early | 12 days early
40 features | 0.949 ± 0.01 | 0.920 ± 0.03 | 0.915 ± 0.01 | 0.910 ± 0.01 | 0.903 ± 0.01
3 features | 0.997 ± 0.00 | 0.969 ± 0.01 | 0.947 | 0.921 | 0.914
* n days early: The model makes a prediction n days before the final death/survival time. It uses sequence data from day 0 up to n days before the last time point.
+ The inputs of T-LSTM are time series of 40 laboratory tests or 3 laboratory tests (LDH, lymph and hs-CRP).
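The "n days early" setting can be read as truncating each patient's series before it is passed to the model. The following sketch illustrates that reading under stated assumptions: each record is a `(day, features)` pair sorted by day, `outcome_day` is the day of the final death/survival time, and the feature triple holds (LDH, lymph, hs-CRP). The function names and the toy values are hypothetical.

```python
def truncate_n_days_early(sequence, outcome_day, n_days):
    """Keep only records taken at least n_days before the outcome day."""
    cutoff = outcome_day - n_days
    return [(day, feats) for day, feats in sequence if day <= cutoff]


def time_intervals(sequence):
    """Elapsed days between consecutive records (0 for the first record)."""
    days = [day for day, _ in sequence]
    return [0] + [later - earlier for earlier, later in zip(days, days[1:])]


# Toy patient: (day, (LDH, lymph %, hs-CRP)) with hypothetical values.
toy_sequence = [
    (0, (300.0, 18.0, 40.0)),
    (3, (340.0, 15.0, 52.0)),
    (7, (380.0, 11.0, 66.0)),
    (10, (450.0, 7.0, 80.0)),
    (12, (500.0, 4.0, 85.0)),
]

three_days_early = truncate_n_days_early(toy_sequence, outcome_day=12, n_days=3)
print([day for day, _ in three_days_early])  # [0, 3, 7]
print(time_intervals(three_days_early))      # [0, 3, 4]
```

The intervals returned by `time_intervals` are the kind of irregular gaps a time-aware model takes as an extra input alongside the feature vectors.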
Table 6 Feature statistics of patients in different stages of COVID-19 disease progression
Survival class (general)
Mean value | Stage 1 | Stage 2 | Stage 3 | Stage 4
Mortality rate (%) | 7.57 | 0 | 1.91 | 0
Time distance (days) | 22.65 | 14.08 | 7.98 | 2.75
Lymph (%) | 16 | 18 | 21 | 31
LDH (U/l) | 328 | 301 | 245 | 199
hs-CRP (mg/l) | 43 | 39 | 21 | 3
Indirect Bilirubin (μmol/l) | 7 | 6 | 4 | 3
Creatinine (μmol/l) | 98 | 75 | 89 | 76
INR | 2 | 1 | 1 | 1
Serum Sodium (mmol/l) | 138 | 139 | 140 | 137
eGFR (ml/min) | 79 | 109 | 112 | 111
Serum Chloride (mmol/l) | 97 | 99 | 103 | 102
Albumin (g/l) | 41 | 39 | 38 | 40
Death class (critical)
Mean value | Stage 1 | Stage 2 | Stage 3 | Stage 4
Mortality rate (%) | 76.32 | 88.76 | 91.28 | 100
Time distance (days) | 26.96 | 18.76 | 9.32 | 2.05
Lymph (%) | 15 | 10 | 9 | 4
LDH (U/l) | 338 | 364 | 375 | 499
hs-CRP (mg/l) | 48 | 55 | 69 | 84
Indirect Bilirubin (μmol/l) | 8 | 9 | 14 | 23
Creatinine (μmol/l) | 104 | 106 | 120 | 125
INR | 2 | 2 | 3 | 2
Serum Sodium (mmol/l) | 140 | 140 | 135 | 129
eGFR (ml/min) | 75 | 71 | 70 | 57
Serum Chloride (mmol/l) | 96 | 103 | 104 | 105
Albumin (g/l) | 40 | 32 | 33 | 32
Table 7 Ranking of average KL divergence values of top 40 features
Ranking | Feature | Average KL | Ranking | Feature | Average KL
1 | Lymph | 0.0421 | 21 | Total Cholesterol | 0.0041
2 | LDH | 0.0392 | 22 | Interleukin 6 | 0.0032
3 | hs-CRP | 0.0376 | 23 | sIL-2R | 0.0031
4 | Indirect Bilirubin | 0.0324 | 24 | cTnI | 0.0030
5 | Creatinine | 0.0302 | 25 | RBCW-SD | 0.0024
6 | INR | 0.0235 | 26 | Uric Acid | 0.0023
7 | Serum Sodium | 0.0232 | 27 | Corrected Calcium | 0.0022
8 | eGFR | 0.0225 | 28 | Interleukin 8 | 0.0019
9 | Serum Chloride | 0.0224 | 29 | Prothrombin Time | 0.0018
10 | Albumin | 0.0193 | 30 | Serum Potassium | 0.0017
11 | Globulin | 0.0177 | 31 | Interleukin 1β | 0.0017
12 | Hematocrit | 0.0122 | 32 | D-D dimer | 0.0016
13 | Hemoglobin | 0.0091 | 33 | FDP | 0.0016
14 | Fibrinogen | 0.0079 | 34 | Antithrombin | 0.0015
15 | γ-GT | 0.0079 | 35 | Procalcitonin | 0.0010
16 | ESR | 0.0078 | 36 | Platelet Count | 0.0009
17 | NT-proBNP | 0.0074 | 37 | WBC | 0.0006
18 | APTT | 0.0053 | 38 | Ferritin | 0.0005
19 | Eosinophils | 0.0051 | 39 | Interleukin 10 | 0.0004
20 | Basophil | 0.0049 | 40 | PLT | 0.0004
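For readers unfamiliar with the ranking criterion, the sketch below shows how a KL divergence between two discrete distributions can be computed and used to order features. It is a generic illustration only: which distributions are compared for each feature, and how the values are averaged to give the "Average KL" column, is not specified in this table, so the pairing of distributions, the function names and the toy numbers are all assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing; q_i is floored at `eps` so an
    empty bin in Q does not cause a division by zero.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


def rank_by_kl(feature_distributions):
    """Sort features by KL divergence, largest first.

    `feature_distributions` maps a feature name to a (P, Q) pair of
    distributions (hypothetical layout).
    """
    scores = {
        name: kl_divergence(p, q)
        for name, (p, q) in feature_distributions.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy example: feature_a separates the two distributions, feature_b does not.
toy_features = {
    "feature_a": ([0.5, 0.5], [0.9, 0.1]),
    "feature_b": ([0.5, 0.5], [0.5, 0.5]),
}
toy_ranking = rank_by_kl(toy_features)
print([name for name, _ in toy_ranking])  # ['feature_a', 'feature_b']
```

A feature whose two distributions are identical scores 0, so larger values indicate features that differ more between the compared groups.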