Variations. The full genomes of 41 isolates from six continents were compared with a reference seed genome, and numerous codon alterations were detected in all open reading frames (orfs) except orf10. Orf1ab, which contains 15 different coding regions, was the largest fragment in the genome; codon alterations were detected at 66 positions, resulting in 39 amino acid changes. In orf3a, codon alterations were detected at eight positions, seven of which caused amino acid changes. The orf6, orf7a and orf8 fragments contained codon alterations at one, two and two positions, respectively. All of these alterations caused amino acid changes except the one in orf6.
Codon alterations were also detected in the gene fragments encoding the structural proteins S, E, M and N. Only the alterations detected in structural proteins are presented in detail in Tables 1, 2 and 3. In the S gene fragment, codon alterations were detected at eleven positions, nine of which caused amino acid changes. In particular, the codon alteration caused by an adenine-to-guanine substitution at position 1841 was prevalent, being detected in 17 isolates from 14 countries (Table 1). The altered codon changed aspartic acid to glycine. In the N gene fragment, codon alterations were detected at 12 positions. Three of these were prevalent, and two of the three were detected in the same six isolates from five countries (Georgia, Greece, Argentina, Colombia and Nigeria). One was caused by consecutive guanine-to-adenine transitions at positions 608 and 609 and resulted in an arginine-to-lysine substitution, whereas the other was caused by a guanine-to-cytosine substitution at position 610 and resulted in a glycine-to-arginine substitution. The third prevalent variation, a cytosine-to-thymine substitution, was detected in 36 isolates from 29 countries and did not alter the amino acid (Table 2).
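The relationship between a nucleotide position within a gene and the resulting amino acid change can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the reference codons (GAT, AGG, GGA) are assumptions taken from the commonly used reference genome and are not stated in the text above, and the function names are hypothetical. Positions are treated as 1-based within each gene fragment.

```python
# Minimal codon table: only the codons needed for this illustration.
CODON_TO_AA = {
    "GAT": "D", "GGT": "G",   # aspartic acid, glycine
    "AGG": "R", "AAA": "K",   # arginine, lysine
    "GGA": "G", "CGA": "R",   # glycine, arginine
}


def codon_number(nt_pos):
    """1-based codon index for a 1-based nucleotide position within a gene."""
    return (nt_pos - 1) // 3 + 1


def offset_in_codon(nt_pos):
    """0-based offset (0, 1 or 2) of the nucleotide inside its codon."""
    return (nt_pos - 1) % 3


def describe_change(ref_codon, substitutions):
    """Return a label such as 'D614G' for substitutions within ONE codon.

    ref_codon      -- assumed reference codon (hypothetical input)
    substitutions  -- {nucleotide position in gene: new base}
    """
    codon_numbers = {codon_number(pos) for pos in substitutions}
    if len(codon_numbers) != 1:
        raise ValueError("all substitutions must fall in the same codon")
    bases = list(ref_codon)
    for pos, new_base in substitutions.items():
        bases[offset_in_codon(pos)] = new_base
    alt_codon = "".join(bases)
    return f"{CODON_TO_AA[ref_codon]}{codon_numbers.pop()}{CODON_TO_AA[alt_codon]}"


# S gene: A -> G at position 1841 (assumed reference codon GAT)
print(describe_change("GAT", {1841: "G"}))
# N gene: G -> A at positions 608 and 609 (assumed reference codon AGG)
print(describe_change("AGG", {608: "A", 609: "A"}))
# N gene: G -> C at position 610 (assumed reference codon GGA)
print(describe_change("GGA", {610: "C"}))
```

Under these assumptions, position 1841 falls in codon 614 of S, and positions 608 to 610 span codons 203 and 204 of N, which matches the variant labels used in the later subsections.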
The E and M gene fragments were smaller than those of S and N. In the E gene fragment, two variations changed the amino acid sequence, one of which was detected in two isolates, whereas the codon alteration detected in the M gene fragment did not result in an amino acid change (Table 3).
Phylogenetic analysis. SARS-CoV-2 sequences from Turkey (EPI_ISL_424366), Canada-Ontario (EPI_ISL_418384) and Australia-Western Australia (EPI_ISL_420539) were grouped in one cluster, while the Wuhan SARS-CoV-2 sequence (NC_045512) and two sequences from Kuwait (EPI_ISL_416458, EPI_ISL_416541) were grouped in another; the two clusters diverged from the same node on the phylogenetic tree. The China-Yunnan sequence (MT049951) clustered with sequences from East and South Asia and Oceania, including New Zealand (EPI_ISL_416538), Australia-Queensland (EPI_ISL_420879), India (MT050493), Japan (LC528233) and Singapore (EPI_ISL_414379), the one exception being a sequence from Chile (EPI_ISL_415661) that also fell in this cluster. The sequences from Chile (EPI_ISL_415661) and Japan (LC528233) clustered together. Two sequences from Greece (EPI_ISL_418263, EPI_ISL_418264) clustered together, while two sequences from Georgia (EPI_ISL_420140, EPI_ISL_420144) clustered separately from each other (Fig. 1).
Physico-chemical parameters. The number of amino acids varied from 75 to 1273 among structural proteins. The largest was the S protein (~142 kDa), whereas the E protein (~8.4 kDa) was the smallest (Table 4). Among non-structural proteins other than orf1ab, the number of amino acids varied from 43 to 275. Orf3a (~32 kDa) was the largest protein, whereas orf7b (5.2 kDa) was the smallest (Table 4). Each non-structural protein encoded by orf1ab was also analysed, and the number of amino acids varied from 83 to 1945. Nsp-3, with a molecular weight of ~218 kDa, was one of the largest proteins, whereas the smallest was nsp-7 at ~9.3 kDa (Table 4). Across all proteins encoded by the full genome, the theoretical pI ranged from 4.6 to 10.07. Among structural proteins, only the S protein was negatively charged, whereas the E, M and N proteins were positively charged. In addition, the orf7a, orf10, nsp-6, nsp-9, nsp-13, nsp-14 and nsp-16 proteins were positively charged, whereas the remaining proteins were negatively charged, except nsp-4 and nsp-8, which were neutral. The estimated half-life was 30 h for all proteins except those encoded by orf1ab, among which only nsp-1 had an estimated half-life of 30 h. According to the instability index, the N protein was unstable, while the S, E and M structural proteins and most of the non-structural proteins were stable. The aliphatic index varied widely, ranging from 52.53 to 144 among all proteins. The grand average of hydropathicity (GRAVY) value was negative for the S and N proteins as well as for most of the non-structural proteins encoded by orf1ab (Table 4).
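Two of the indices reported above have simple closed-form definitions, sketched below. This is an illustration under stated assumptions, not the tool used to produce Table 4: it assumes the Kyte-Doolittle hydropathy scale for the grand average of hydropathicity and Ikai's formula for the aliphatic index (mole percent of Ala, plus 2.9 times Val, plus 3.9 times the sum of Ile and Leu). The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values per amino acid (assumed scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def aliphatic_index(sequence, a=2.9, b=3.9):
    """Aliphatic index from mole percentages of Ala, Val, Ile and Leu."""
    n = len(sequence)

    def mole_percent(aa):
        return 100.0 * sequence.count(aa) / n

    return (mole_percent("A")
            + a * mole_percent("V")
            + b * (mole_percent("I") + mole_percent("L")))


# Demonstration on a short peptide; in practice the full protein
# sequence would be passed in.
peptide = "FLAFVVFLLV"
print(round(gravy(peptide), 2))            # 3.42
print(round(aliphatic_index(peptide), 1))  # 214.0
```

A negative value from `gravy` indicates an overall hydrophilic sequence, and a positive value an overall hydrophobic one.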
Secondary structure. In the structural proteins, the alpha helix content ranged from ~22 to 47%, the extended strand content from ~10 to 22%, and the random coil content from ~40 to 60%. In the non-structural proteins, the alpha helix content ranged from 0 to 69%, the extended strand content from ~3 to 47%, and the random coil content from ~28 to 58% (Table 5).
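Percentages of this kind are typically derived from a per-residue state string returned by a predictor. The sketch below assumes a hypothetical output format in which each residue is labelled `h` (alpha helix), `e` (extended strand) or `c` (random coil); the actual predictor output may use other letters or additional states, and the example string is a toy input rather than a real prediction.

```python
from collections import Counter

# Assumed one-letter state codes (hypothetical output format).
STATE_NAMES = {"h": "alpha helix", "e": "extended strand", "c": "random coil"}


def structure_percentages(states):
    """Percentage of residues in each secondary-structure state."""
    counts = Counter(states.lower())
    total = len(states)
    return {
        name: 100.0 * counts.get(code, 0) / total
        for code, name in STATE_NAMES.items()
    }


# Toy state string, for illustration only.
print(structure_percentages("hhhheecccc"))
```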
Antigenicity. All structural proteins were predicted to be probable antigens, with antigenicity values ranging from 0.4638 to 0.6298. The E protein and its variant L37H had the highest antigenicity value, whereas the S protein had the lowest. Interestingly, all non-structural proteins were also predicted to be probable antigens, except nsp-16 encoded by orf1ab. In addition, orf7b had the highest antigenicity value among all proteins, at 0.8462 (Table 6).
Solubility. According to the solubility prediction, the S, E and N proteins were soluble. Among non-structural proteins, orf3a as well as nsp-2, 4, 7, 10, 12, 13, 14, 15 and 16 encoded by orf1ab were predicted to be insoluble. The solubility prediction for nsp-3, also encoded by orf1ab, could not be retrieved because of its large fragment size (Table 6).
Subcellular localisation and transmembrane helices. The number of transmembrane helices varied from 0 to 3 among structural proteins, being lowest in the N protein and highest in the M protein. Among non-structural proteins, the number of transmembrane helices varied from 0 to 8, although most had none (Table 6). In the subcellular localisation predictions, the S, M and N proteins were predicted to localise to the host endoplasmic reticulum. The E and M proteins were also predicted to localise to the host cell membrane. Non-structural proteins were predicted to localise to the cell membrane, endoplasmic reticulum, cytoplasm or nucleus (Table 6).
Signal peptide. In the signal peptide prediction based on four different parameters, only the S protein and its variant were predicted to have a signal peptide among the structural proteins. Among non-structural proteins, orf7a, orf8, the orf8 variant and nsp-10 were predicted to have a signal peptide (Table 7).
Allergenicity. None of the proteins analysed showed allergenic properties in the MEME/MAST motif and IgE epitope analyses (Table 8).
BetaWrap motifs. Among all proteins analysed, only S protein and its variant (D614G) were predicted to contain BetaWrap motifs (Table 8).
Similarity with host proteome. No significant similarity was predicted between the analysed viral proteins and host proteins (Table 8).
B cell epitopes. Numerous linear B cell epitopes were predicted for the S, variant S (D614G), N and variant N (S197L and R203K/G204R), E and variant E (L37H), orf8 and nsp-10 proteins using Bcepred and IEDB. Epitopes that were predicted by both Bcepred and IEDB and identified as probable antigens are presented in Table 9. Nearly all epitopes had higher antigenicity values than their parent proteins. Among the analysed proteins, the highest antigenicity value (1.4530) was predicted for an epitope (GGDGKMKD) belonging to the N protein and its variants. Another epitope (THTGTGQ), with a high antigenicity value of 1.0789, was predicted in nsp-10 encoded by orf1ab. No antigenic epitope was predicted for the M and orf7a proteins.
MHC-I and MHC-II epitopes. Numerous MHC-I epitopes were predicted to be probable antigens (Table 10). The antigenicity values of the epitopes were generally higher than those of their parent proteins. Among structural proteins, the epitope with the highest antigenicity value (2.6927), KLNDLCFTNV, was predicted in the S protein and its variant (D614G). Among non-structural proteins, an epitope in orf7a (ITLCFTLKRK) had the highest antigenicity value (2.5150). No antigenic epitope was predicted for nsp-10. On the other hand, the epitopes KWPWYIWLGF in S (including variant D614G), FLAFVVFLLV in E (including variant L37H), FARTRSMWSF and RNRFLYIIKL in M, AQFAPSASAF in N (including variants S197L and R203K/G204R) and LGIITTVAAF in orf8 (including variant L84S) had an IC50 value lower than 10 and a percentile rank ranging from 0.02 to 0.1, indicating strong binding between the epitope and MHC-I alleles.
Similarly, numerous MHC-II epitopes were predicted to be probable antigens (Table 11), and nearly all of them had higher antigenicity values than their parent proteins. Among structural proteins, the PTNFTISVTTEILPV and VTLAILTAHRLCAYC epitopes, predicted in the S protein (including variant D614G) and the E variant L37H, respectively, had the highest antigenicity values. Among non-structural proteins, orf7a had an epitope (IVFITLCFTLKRKTE) that was predicted to be a probable antigen with a high antigenicity value (1.8597). No antigenic epitope was predicted for nsp-10. Although many MHC-II epitopes had a low percentile rank, only one epitope, detected in orf8, had an IC50 value lower than 10, indicating strong binding between the epitope and MHC-II alleles.
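The selection of strong binders described for the MHC-I and MHC-II epitopes can be sketched as a simple filter over prediction records. The IC50 cutoff of 10 comes from the text; everything else is an assumption made for illustration: the record layout, the optional percentile-rank cutoff `max_rank`, and the placeholder peptides, alleles and values, which are not taken from Tables 10 and 11.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One epitope/allele binding prediction (hypothetical record layout)."""
    peptide: str
    allele: str
    ic50: float             # predicted IC50, lower means stronger binding
    percentile_rank: float  # lower means stronger binding


def strong_binders(predictions, max_ic50=10.0, max_rank=None):
    """Keep predictions with IC50 below max_ic50.

    If max_rank is given (an assumed, user-chosen cutoff), the percentile
    rank must also be at or below it. Results are sorted strongest first.
    """
    kept = [
        p for p in predictions
        if p.ic50 < max_ic50
        and (max_rank is None or p.percentile_rank <= max_rank)
    ]
    return sorted(kept, key=lambda p: (p.ic50, p.percentile_rank))


# Placeholder records, NOT real predictions.
example = [
    Prediction("PEPTIDE_A", "ALLELE_1", ic50=5.0, percentile_rank=0.05),
    Prediction("PEPTIDE_B", "ALLELE_1", ic50=250.0, percentile_rank=1.5),
    Prediction("PEPTIDE_C", "ALLELE_2", ic50=8.0, percentile_rank=0.4),
]
for hit in strong_binders(example, max_rank=0.1):
    print(hit.peptide, hit.allele, hit.ic50, hit.percentile_rank)
```

Keeping the rank cutoff optional reflects that the text gives an explicit threshold only for IC50, while the percentile ranks are reported as an observed range.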
Post-translational modifications. The S protein and its variant (D614G) were predicted to have many N-glycosylation and phosphorylation sites as well as a few O-glycosylation and acetylation sites. The M, E (including L37H), orf7a and nsp-10 proteins were predicted to have N-glycosylation and phosphorylation sites, and orf7a was additionally predicted to have an acetylation site. Orf8 and its variant (L84S) were predicted to have N-glycosylation and phosphorylation sites, whereas two additional phosphorylation sites, one located on the exposed surface and the other buried, were predicted only in variant L84S. In addition, the N protein and its two variants (S197L and R203K/G204R) were predicted to have N-/O-glycosylation, phosphorylation and acetylation sites. When the N protein and its variants were compared, O-glycosylation or phosphorylation sites were not detected in variant S197L, whereas an extra acetylation site was detected in variant R203K/G204R.
Docking analysis. Not all probable antigenic epitopes with a low IC50 value and percentile rank could be docked with their MHC-I or MHC-II alleles, because of limitations in the MHC-I and MHC-II allele variants available in the data bank or server. Accordingly, the epitopes KWPWYIWLGF and KLNDLCFTNV in S (including variant D614G), FLAFVVFLLV in E (including variant L37H), LIFLWLLWPV in M, MEVTPSGTWL in N (including variants S197L and R203K/G204R), FLIVAAIVFI in orf7a and LEYHDVRVVL in orf8 (including variant L84S) were docked with the receptors of selected MHC-I alleles (Figs. 2, 3, and 4).
In the docking analysis with MHC-II alleles, the core regions of the PTNFTISVTTEILPV, SIIAYTMSLGAENSV and GYFKIYSKHTPINLV epitopes in the S protein were docked with the receptor of HLA-DRB1*07:01. The core region of another S protein epitope (QDLFLPFFSNVTWFH) was docked with the receptor of HLA-DRB1*15:01. In the M protein, the core regions of the ASFRLFARTRSMWSF, RTLSYYKLGASQRVA and PKEITVATSRTLSYY epitopes were docked with the receptor of HLA-DRB1*07:01. The core region of an epitope (QIAQFAPSASAFFGM) in the N protein was also docked with the receptor of HLA-DRB1*07:01. Similarly, the core region of an epitope (VTLAILTAHRLCAYC) in the E variant L37H was docked with the receptor of HLA-DRB1*15:01.
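For context on what a core region is: MHC-II binding predictors commonly report a 9-residue binding core within each 15-residue epitope. The sketch below only enumerates the candidate 9-residue windows of an epitope; it does not predict which window is the actual core, and the text does not state how the docked cores were selected. The function name and the default core length are assumptions for illustration.

```python
def candidate_cores(epitope, core_length=9):
    """List every contiguous window of core_length residues in an epitope."""
    if len(epitope) < core_length:
        return []
    return [
        epitope[start:start + core_length]
        for start in range(len(epitope) - core_length + 1)
    ]


# A 15-residue epitope yields seven candidate 9-residue cores.
for core in candidate_cores("PTNFTISVTTEILPV"):
    print(core)
```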