This section describes the data preparation, the parameters and metrics used for COVID-Predictor, and the outcome of the predictor.
Table 1. Statistics of the refined datasets of corona and other viruses
Virus Name | Source of Sequences | Length Filter (sequences retained) | No. of Sequences | Max Length of Sequence (bp) | Min Length of Sequence (bp) | Avg Length of Sequence (bp)
SARS-CoV-1 | NCBI | > 20K bp | 515 | 32759 | 21221 | 29608
MERS | NCBI | > 20K bp | 291 | 30150 | 27121 | 29983
SARS-CoV-2 | GISAID | > 20K bp | 2369 | 29986 | 20008 | 29520
Other virus | NCBI | > 10K bp | 600 | 19897 | 10735 | 15316
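Length statistics of the kind reported in Table 1 could be derived from a FASTA file roughly as follows. This is only a minimal sketch: the helper names read_fasta and length_stats are hypothetical, the strict "greater than" threshold follows the wording of the text, and rounding the average to an integer is an assumption based on the integer values in the table.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a {header: sequence} dict.

    Records sharing the same header overwrite each other in this sketch.
    """
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def length_stats(records, min_length):
    """Keep sequences longer than min_length and summarise their lengths."""
    lengths = [len(seq) for seq in records.values() if len(seq) > min_length]
    if not lengths:
        return None
    return {
        "count": len(lengths),
        "max": max(lengths),
        "min": min(lengths),
        # Assumption: the average is rounded to the nearest integer.
        "avg": round(sum(lengths) / len(lengths)),
    }
```

For the coronavirus classes the threshold would be 20000, and for the other viruses 10000, as stated in the text.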
Data Preparation
The datasets of SARS-CoV-1, MERS and other viruses such as Ebola and Dengue were downloaded from NCBI, while the SARS-CoV-2 sequences were downloaded from GISAID, in FASTA format on 28th March 2020. The proposed predictor does not require sophisticated data preprocessing; it only requires complete genome sequences of the viruses. As a result, 515, 291 and 2369 sequences of SARS-CoV-1, MERS and SARS-CoV-2 respectively, each longer than 20K bp, and 600 sequences of other viruses such as Ebola and Dengue, each longer than 10K bp, are considered in our experiment. The statistics of the refined consolidated datasets are shown in Table 1, while the country-wise statistics of SARS-CoV-2 are reported in Table 2. In order to visualise the virus sequences, t-distributed Stochastic Neighbor Embedding (tSNE) [15] is applied to the count vectors generated by the k-mer and n-gram techniques. k-mers are now an essential part of many methods in bioinformatics, such as genome and transcriptome assembly, metagenomic sequencing and error correction of sequence reads [16]. Solis-Reyes et al. [11] have explained that k-mer works better than other popular methods such as REGA [17], SCUEAL [18] and COMET [19]. The embedded representations of all four virus classes and of the top 21 country-specific sequences of SARS-CoV-2 are shown in Figures 1 and 2.
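The count vector mentioned above can be sketched in plain Python. The text does not spell out the construction, so this is only one plausible reading: overlapping k-mers (stride 1) are treated as words, and n consecutive words are joined into a single n-gram term. The function names kmers, ngram_counts and count_vector are hypothetical.

```python
from collections import Counter


def kmers(sequence, k):
    """Split a sequence into overlapping k-mers (stride 1 is an assumption)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def ngram_counts(sequence, k, n):
    """Count n-grams formed from n consecutive k-mer 'words'."""
    words = kmers(sequence, k)
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


def count_vector(counts, vocabulary):
    """Project the counts onto a fixed, ordered vocabulary of n-gram terms."""
    return [counts.get(term, 0) for term in vocabulary]


# Toy example with k = 3 and n = 2 on a very short sequence.
counts = ngram_counts("ATGCATGC", k=3, n=2)
vocabulary = sorted(counts)
vector = count_vector(counts, vocabulary)
```

In practice the vocabulary would be built once from the training sequences, so that every sequence maps to a vector of the same length before it is passed to tSNE or to a classifier.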
Table 2. Statistics of country wise refined sequences of SARS-CoV-2
Country | No. of Sequences | Country | No. of Sequences | Country | No. of Sequences | Country | No. of Sequences
USA | 590 | Spain | 27 | Chile | 7 | Mexico | 1
Iceland | 343 | Congo | 19 | Ireland | 6 | Nepal | 1
China | 275 | Scotland | 18 | Vietnam | 6 | Nigeria | 1
Netherlands | 190 | Canada | 17 | Kuwait | 4 | Northern Ireland | 1
England | 160 | Italy | 17 | Slovakia | 4 | Pakistan | 1
Wales | 107 | Taiwan | 17 | Czech Republic | 3 | Panama | 1
Japan | 83 | Singapore | 14 | Saudi Arabia | 3 | Peru | 1
France | 75 | Finland | 13 | Fujian | 2 | Poland | 1
Australia | 64 | South Korea | 13 | Hungary | 2 | Russia | 1
Belgium | 45 | Georgia | 10 | India | 2 | South Africa | 1
Portugal | 44 | Luxembourg | 10 | Thailand | 2 | Sweden | 1
Brazil | 34 | Denmark | 9 | Cambodia | 1 | Turkey | 1
Switzerland | 31 | Malaysia | 8 | Colombia | 1 | |
Hong Kong | 30 | New Zealand | 8 | Ecuador | 1 | |
Germany | 27 | Norway | 8 | Lithuania | 1 | |
Parameters setting and Metrics
The experiments have been performed using Python 3.6 and executed on a machine with an Intel Core i5-2410M CPU at 2.30 GHz, 8 GB RAM and the Windows 7 operating system. The required input parameters are set experimentally: the number of trees for RF is 100, the split criterion for RF is “gini”, the alpha value used as the smoothing factor of MNB is 0.1, and the kernel used in GSVM is “rbf”. To evaluate the results of COVID-Predictor, the popular performance metrics Accuracy, Precision, Recall and F1-Score are used.
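As a hedged illustration of the metrics named above, the following sketch computes accuracy together with support-weighted precision, recall and F1-Score in plain Python. The weighting scheme is an assumption: the text does not state which averaging was used, although the identical Accuracy and Recall columns in Table 3 are consistent with it, since support-weighted recall always equals accuracy. The function name weighted_metrics is hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    total = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total

    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / actual if actual else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        # Assumption: each class is weighted by its share of true samples.
        weight = actual / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels only; these are not data from the experiments.
metrics = weighted_metrics(["a", "a", "b", "b", "c"],
                           ["a", "b", "b", "b", "c"])
```

With this averaging, a class that is rarely predicted correctly lowers precision and F1-Score in proportion to its size, while recall stays tied to overall accuracy.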
Table 3. Classification performance of different machine learning techniques after performing 10-fold cross validation with different values of k-mer and n-gram on 1000 genome sequences of SARS-CoV-1, MERS, SARS-CoV-2 and Other virus samples
Method | k-mer | n-gram = 2 Accuracy | n-gram = 2 Precision | n-gram = 2 Recall | n-gram = 2 F1-Score | n-gram = 3 Accuracy | n-gram = 3 Precision | n-gram = 3 Recall | n-gram = 3 F1-Score | n-gram = 4 Accuracy | n-gram = 4 Precision | n-gram = 4 Recall | n-gram = 4 F1-Score | n-gram = 5 Accuracy | n-gram = 5 Precision | n-gram = 5 Recall | n-gram = 5 F1-Score | Aggregated Score
MNB | 2 | 0.99810 | 0.99817 | 0.99810 | 0.99810 | 0.99810 | 0.99817 | 0.99810 | 0.99810 | 0.99810 | 0.99817 | 0.99810 | 0.99810 | 0.99905 | 0.99910 | 0.99905 | 0.99905 | 0.99835
GSVM | 2 | 0.94857 | 0.95725 | 0.94857 | 0.94952 | 0.96762 | 0.97151 | 0.96762 | 0.96795 | 0.98190 | 0.98324 | 0.98190 | 0.98191 | 0.99238 | 0.99276 | 0.99238 | 0.99237 | 0.97359
RF | 2 | 0.99429 | 0.99458 | 0.99429 | 0.99428 | 0.99429 | 0.99458 | 0.99429 | 0.99428 | 0.99429 | 0.99457 | 0.99429 | 0.99428 | 0.99619 | 0.99632 | 0.99619 | 0.99618 | 0.99482
MNB | 3 | 0.99810 | 0.99817 | 0.99810 | 0.99810 | 0.99810 | 0.99817 | 0.99810 | 0.99810 | 0.99905 | 0.99910 | 0.99905 | 0.99905 | 0.99810 | 0.99817 | 0.99810 | 0.99810 | 0.99835
GSVM | 3 | 0.96762 | 0.97151 | 0.96762 | 0.96795 | 0.98190 | 0.98324 | 0.98190 | 0.98191 | 0.99238 | 0.99276 | 0.99238 | 0.99237 | 0.99905 | 0.99909 | 0.99905 | 0.99905 | 0.98561
RF | 3 | 0.99429 | 0.99458 | 0.99429 | 0.99428 | 0.99524 | 0.99548 | 0.99524 | 0.99523 | 0.99810 | 0.99816 | 0.99810 | 0.99809 | 0.99714 | 0.99725 | 0.99714 | 0.99714 | 0.99623
MNB | 4 | 0.99810 | 0.99817 | 0.99810 | 0.99810 | 0.99905 | 0.99910 | 0.99905 | 0.99905 | 0.99810 | 0.99817 | 0.99810 | 0.99810 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 0.99882
GSVM | 4 | 0.98190 | 0.98324 | 0.98190 | 0.98191 | 0.99238 | 0.99276 | 0.99238 | 0.99237 | 0.99905 | 0.99909 | 0.99905 | 0.99905 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 0.99344
RF | 4 | 0.99429 | 0.99456 | 0.99429 | 0.99428 | 0.99714 | 0.99725 | 0.99714 | 0.99714 | 0.99714 | 0.99725 | 0.99714 | 0.99714 | 0.99810 | 0.99817 | 0.99810 | 0.99810 | 0.99670
MNB | 5 | 0.99905 | 0.99910 | 0.99905 | 0.99905 | 0.99810 | 0.99817 | 0.99810 | 0.99810 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 0.99929
GSVM | 5 | 0.99238 | 0.99276 | 0.99238 | 0.99237 | 0.99905 | 0.99909 | 0.99905 | 0.99905 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 0.99905 | 0.99908 | 0.99905 | 0.99905 | 0.99765
RF | 5 | 0.99619 | 0.99633 | 0.99619 | 0.99618 | 0.99714 | 0.99725 | 0.99714 | 0.99714 | 0.99810 | 0.99816 | 0.99810 | 0.99809 | 0.99810 | 0.99816 | 0.99810 | 0.99809 | 0.99740
MNB | 6 | 0.99810 | 0.99817 | 0.99810 | 0.99810 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 0.99905 | 0.99908 | 0.99905 | 0.99905 | 0.99929
GSVM | 6 | 0.99905 | 0.99909 | 0.99905 | 0.99905 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 0.99905 | 0.99908 | 0.99905 | 0.99905 | 0.99714 | 0.99724 | 0.99714 | 0.99714 | 0.99882
RF | 6 | 0.99714 | 0.99725 | 0.99714 | 0.99714 | 0.99810 | 0.99816 | 0.99810 | 0.99809 | 0.99810 | 0.99816 | 0.99810 | 0.99809 | 0.99714 | 0.99724 | 0.99714 | 0.99714 | 0.99764
MNB | 7 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 0.99905 | 0.99908 | 0.99905 | 0.99905 | 0.99905 | 0.99908 | 0.99905 | 0.99905 | 0.99953
GSVM | 7 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 0.99905 | 0.99908 | 0.99905 | 0.99905 | 0.99714 | 0.99724 | 0.99714 | 0.99714 | 0.99619 | 0.99632 | 0.99619 | 0.99619 | 0.99811
RF | 7 | 0.99810 | 0.99816 | 0.99810 | 0.99809 | 0.99810 | 0.99816 | 0.99810 | 0.99809 | 0.99714 | 0.99724 | 0.99714 | 0.99714 | 0.99810 | 0.99816 | 0.99810 | 0.99809 | 0.99787
Outcome of the Predictor
The dataset, consisting of all four types of virus sequences (SARS-CoV-1, MERS, SARS-CoV-2 and other viruses), has been divided into two sets: one for training and the other for validation. Stratified sampling has been applied to prepare the training dataset to ensure that representatives of all four virus classes are present. As a result, 1000 virus sequences are used in training. Moreover, data samples are carefully selected from each category to avoid the class imbalance problem. The validation dataset contains those sequences which are not present in the training dataset. The training dataset is used with three independent machine learning techniques, viz. MNB, GSVM and RF. For each machine learning technique, the motifs of virus sequences are created using the k-mer method. Thereafter, such motifs are combined using the n-gram technique to create the count vector which is used to train the classifiers. In our experiments, the value of k in k-mer varies from 2 to 7, while the value of n-gram varies from 2 to 5. Each classifier has been evaluated with 10-fold cross validation, followed by further validation on an unseen dataset taken from NCBI and GISAID on 8th April 2020.
The performance metrics of each machine learning technique with 10-fold cross validation for different values of k-mer and n-gram have been reported in Table 3. The four quantitative metrics are further consolidated into a single aggregated score for ease of comparison. The aggregated score has been computed simply by taking the average of all the scores, following an approach similar to that used in [20]. The aggregated score lies in [0,1], where a higher value signifies a better result. It is evident from Table 3 that the MNB based COVID-Predictor produces the highest aggregated score, i.e. 0.99953, for a k-mer value of 7. For every other value of k-mer, MNB also attains the best aggregated score among the three classifiers. Thus, according to the results, we have prepared the pre-trained model of COVID-Predictor with 1000 genomic sequences of four virus classes for values of k-mer and n-gram as 7 and 3 respectively.
To gain further confidence, we have used an additional validation set of sequences, as reported in Table 4. When validated with 2043 samples, 159 cases are observed to be false positives, taking a SARS-CoV-2 prediction as positive. Further investigation has found that these 159 sequences are SARS-CoV-1 sequences misclassified by COVID-Predictor as SARS-CoV-2. As our primary objective is to predict SARS-CoV-2, we further wanted to examine the rate of false negatives. For this purpose, two additional sets of SARS-CoV-2 sequences are used separately, one with 493 samples from NCBI and another with 4747 samples from GISAID. In both cases, COVID-Predictor predicted SARS-CoV-2 with 100% accuracy. This experiment establishes that COVID-Predictor with the proposed feature building approach has the potential to predict SARS-CoV-2 with high accuracy. The same pre-trained model is used to build the web based application, where unknown sequences can be uploaded to predict the class of coronavirus. Screenshots of the web based predictor are shown in Figures 3 and 4.
Table 4. Classification performance of COVID-Predictor on validation data
Source | Data Samples | Accuracy | Precision | Recall | F1-Score
NCBI + GISAID | 2043 Sequences (262 SARS-CoV-1, 41 MERS, 1440 SARS-CoV-2, 300 Other virus) | 0.92217 | 0.92991 | 0.92217 | 0.90726
NCBI | 493 Sequences (Only SARS-CoV-2) | 1.00000 | 1.00000 | 1.00000 | 1.00000
GISAID | 4747 Sequences (Only SARS-CoV-2) | 1.00000 | 1.00000 | 1.00000 | 1.00000
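The aggregated score can be checked with a few lines of Python. The sketch below averages the sixteen scores (four metrics for each of the four n-gram values) transcribed from the MNB row for k-mer = 7 in Table 3. The function name aggregated_score is hypothetical. The second calculation assumes that the 159 false positives are the only errors among the 2043 validation samples, which is consistent with the accuracy reported in Table 4.

```python
from statistics import mean

# Accuracy, Precision, Recall, F1-Score for MNB with k-mer = 7 (Table 3),
# keyed by the n-gram value.
mnb_k7 = {
    2: [1.00000, 1.00000, 1.00000, 1.00000],
    3: [1.00000, 1.00000, 1.00000, 1.00000],
    4: [0.99905, 0.99908, 0.99905, 0.99905],
    5: [0.99905, 0.99908, 0.99905, 0.99905],
}


def aggregated_score(scores_by_ngram):
    """Average every metric over every n-gram setting."""
    return mean(v for scores in scores_by_ngram.values() for v in scores)


score = round(aggregated_score(mnb_k7), 5)

# Assumption: the 159 misclassified SARS-CoV-1 sequences are the only errors.
validation_accuracy = round((2043 - 159) / 2043, 5)
```

Here score reproduces the 0.99953 listed in Table 3, and validation_accuracy reproduces the 0.92217 listed in Table 4.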