Number of mutations across geographical areas
We first catalogued the mutations that had occurred in order to understand their incidence rates and to identify potentially essential mutations statistically. In total, 6394483, 6177403, 5841477 and 895738 sequences of the E, M, N and S proteins, respectively, qualified for analysis and were examined for the number of mutations.
According to the data, 96.40% of the E amino-acid (AA) sequences (AASs) carried no mutation (Fig 2). The corresponding values for the M, N and S AASs were 36.76%, 2.20% and 2.11%, respectively. The incidence rate of exactly one mutation was 3.56%, 59.64%, 5.68% and 26.86% for the E, M, N and S AASs, respectively, and 0.02%, 2.80%, 7.11% and 26.15% of the E, M, N and S AASs carried two mutations. The incidence rates of three and four mutations for the E, M, N and S AASs are shown in Fig 2.
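As a minimal sketch of how such a distribution could be tabulated, the following Python snippet counts substitutions per sequence against a reference and reports the percentage of sequences in each category. It assumes sequences already aligned to the reference with no insertions or deletions; the function names and the toy sequences are illustrative and are not taken from the study's pipeline.

```python
from collections import Counter


def count_substitutions(reference: str, sequence: str) -> int:
    """Count positions where an aligned sequence differs from the reference."""
    if len(reference) != len(sequence):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for ref_aa, seq_aa in zip(reference, sequence) if ref_aa != seq_aa)


def mutation_count_distribution(reference, sequences, cap=4):
    """Percentage of sequences with 0, 1, ..., cap-or-more substitutions."""
    counts = Counter(min(count_substitutions(reference, s), cap) for s in sequences)
    total = len(sequences)
    return {k: 100.0 * counts.get(k, 0) / total for k in range(cap + 1)}


# Toy data (hypothetical, for illustration only)
reference = "MYSFVSEETG"
samples = ["MYSFVSEETG", "MYSFVSEEIG", "MYSFVSEETG", "MFSFVSEEIG"]
distribution = mutation_count_distribution(reference, samples)
print(distribution)
```

Capping the count at `cap` groups all sequences with four or more substitutions into one category, which mirrors the "four and more" class reported for the S protein.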
For the E protein, 77.72% of the AASs from Africa and 95.72% of the AASs from Asia carried no mutation (Fig 2). In Europe, North America, South America and Oceania, the corresponding values were 96.10%, 97.44%, 97.30% and 96.86%, respectively. For the M protein, by comparison, 49% of the AASs from Africa carried no mutation, as did 36.44%, 36.62%, 35.03%, 60.45% and 39.04% of the AASs from Asia, Europe, North America, South America and Oceania, respectively. One M-protein mutation was found in 47.35%, 61.19%, 60.27%, 60.36%, 37.87% and 56% of the AASs from Africa, Asia, Europe, North America, South America and Oceania, respectively. Two mutations were found in 1.43%, 1.9% and 2.13% of the M AASs from Africa, Asia and Europe, and in 4.01%, 1.33% and 2.35% of those from North America, South America and Oceania, respectively.
In Africa, 4.53% of the N AASs carried no mutation; the corresponding values for Asia, Europe, North America, South America and Oceania were 2.98%, 1.80%, 2.36%, 1.14% and 4.31%, respectively. Africa showed a one-mutation incidence rate of 19.35%, whereas the rate in the other five areas was noticeably lower and, except in Oceania, almost the same across areas. The percentage of N AASs with two mutations was higher in Oceania (28.81%) and Africa (16.55%) than in Asia, Europe, North America and South America (7.07%, 4.8%, 9.53% and 4.95%, respectively). For the S protein, only 0.46% of the AASs from South America carried no mutation, and about 82% carried four or more mutations. Oceania had the highest no-mutation incidence rate for this protein, at 8.45%. The one-mutation incidence rate among the S AASs was 36.01%, 35.81%, 20.75%, 31.70%, 6.95% and 13.78% in Africa, Asia, Europe, North America, South America and Oceania, respectively. In descending order, 81.96%, 35.92%, 24.07%, 17.14%, 16.99% and 2.52% of the S AASs from South America, North America, Asia, Africa, Europe and Oceania carried four or more mutations. In Africa, Asia and Europe, AASs with one mutation were more prevalent than any other category (Fig 2), whereas the most prevalent categories in Oceania and the Americas were two mutations and more than three mutations, respectively.
We then drew a heat map of the mutations to show their frequency overall and within each area. Relative to the total number of AASs, the regions with the most mutations in the E, M, N and S AASs were 7 to 14 AA (0.0018 frequency), 66 to 88 AA (0.0279 frequency), 164 to 205 AA (0.0294 frequency) and 508 to 635 AA (0.0079 frequency), respectively (Fig 3). The regions with the second-highest mutation frequency in the E, M, N and S AASs were 56 to 63 AA (0.0006 frequency), 1 to 22 AA (0.0010 frequency), 205 to 246 AA (0.0201 frequency) and 1 to 127 AA (0.0048 frequency), respectively. The heat map is needed because mutations vary both in how often they appear and in how they are dispersed among the samples in which they occur.
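The regional frequencies behind a heat map of this kind could be computed along the following lines. This is only a sketch: the text does not spell out how the frequency was normalised, so the snippet assumes that each region's frequency is the number of substitutions observed in the region divided by the number of sequences times the region width. The fixed window size, the function name and the toy sequences are likewise assumptions.

```python
def windowed_mutation_frequency(reference, sequences, window):
    """Return (start, end, frequency) per window, with 1-based inclusive bounds.

    Assumed normalisation: substitutions in the window divided by
    (number of sequences * number of positions in the window).
    """
    n_windows = -(-len(reference) // window)  # ceiling division
    hits = [0] * n_windows
    for seq in sequences:
        for pos, (ref_aa, seq_aa) in enumerate(zip(reference, seq)):
            if ref_aa != seq_aa:
                hits[pos // window] += 1
    result = []
    for w in range(n_windows):
        start = w * window
        end = min(start + window, len(reference))
        denominator = len(sequences) * (end - start)
        result.append((start + 1, end, hits[w] / denominator))
    return result


# Toy data (hypothetical, for illustration only)
reference = "MYSFVSEETG"
samples = ["MYSFVSEETG", "MYSFVSEEIG", "MYSFVSEETG", "MFSFVSEEIG"]
regions = windowed_mutation_frequency(reference, samples, window=5)
for start, end, freq in regions:
    print(f"{start} to {end} AA: {freq:.4f}")
```

Running the same function on the subset of sequences from each area would give one heat-map row per area.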
Mutation features by geographical area
In the next step, the locations of the mutations in the protein structure and their frequencies were examined to characterize the mutations further. As shown in Fig 4, the most frequent mutation in the E protein was T9I, with a frequency rate of 0.0128, followed by P71L (0.0068 frequency), V62F (0.0066 frequency), L21F/L21V (0.0017/0.0003 frequencies) and V58F (0.0013 frequency). The locations of the three most frequent mutations are shown in Fig 5 section A. T9I was the most frequent mutation in Europe, Oceania, North America and South America, with frequency rates of 0.0187, 0.0249, 0.0066 and 0.0049, respectively. In Africa and Asia, however, P71L was the most frequent mutation, with frequency rates of 0.1643 and 0.0146, respectively. V62F was among the ten most frequent mutations in Asia (0.0118 frequency), Europe (0.0016 frequency) and North America (0.0024 frequency), whereas in Africa (0.0011 frequency), Oceania (0.0004 frequency) and South America (0.0012 frequency) it ranked eighth, sixth and ninth, respectively.
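Mutation labels such as T9I combine the reference residue, the 1-based position and the observed residue. A hedged sketch of how per-area frequencies and rankings of such labels could be derived is given below; it assumes aligned sequences tagged with their area of origin, and the function names and toy records are illustrative rather than the study's own.

```python
from collections import Counter, defaultdict


def substitutions(reference, sequence):
    """Label each difference as <reference AA><1-based position><observed AA>."""
    return [
        f"{ref_aa}{pos}{seq_aa}"
        for pos, (ref_aa, seq_aa) in enumerate(zip(reference, sequence), start=1)
        if ref_aa != seq_aa
    ]


def frequency_by_area(reference, records):
    """records: iterable of (area, aligned_sequence).

    Returns {area: [(label, frequency), ...]} sorted from most to least frequent,
    where frequency = sequences carrying the mutation / sequences from that area.
    """
    totals = Counter()
    hits = defaultdict(Counter)
    for area, seq in records:
        totals[area] += 1
        hits[area].update(substitutions(reference, seq))
    return {
        area: sorted(
            ((label, count / totals[area]) for label, count in hits[area].items()),
            key=lambda item: (-item[1], item[0]),
        )
        for area in totals
    }


# Toy data (hypothetical, for illustration only)
reference = "MYSFVSEETG"
records = [
    ("Europe", "MYSFVSEEIG"),
    ("Europe", "MYSFVSEETG"),
    ("Africa", "MFSFVSEEIG"),
    ("Africa", "MYSFVSEEIG"),
]
ranking = frequency_by_area(reference, records)
print(ranking)
```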
For the M protein, the five most frequent mutations were I82T (0.6015 frequency), D3G (0.0077 frequency), A63T (0.0073 frequency), Q19E (0.0072 frequency) and A2S (0.0033 frequency), in that order (Fig 4). I82T was the most frequent mutation in all six areas (Fig 5 section B). D3G, in contrast, was not the second most frequent mutation in Asia and North America, where F28L (0.020 frequency) and A81S (0.0083 frequency), respectively, held the second position. In Africa and Europe, A63T was the third most frequent mutation, with frequencies of 0.0224 and 0.0091, respectively. The third most frequent mutations in Asia, Oceania, North America and South America were D3G (0.0048 frequency), Q19E (0.0259 frequency), S197N (0.0069 frequency) and R164H (0.004 frequency), respectively.
Analysis of the N AASs showed that R203M/R203K, with frequencies of 0.6084/0.2489, held the first position among frequent mutations (Fig 5 section C). Globally, D377Y (0.6134 frequency) ranked second, D63G (0.6002 frequency) third, G215C (0.5479 frequency) fourth and G204R/G204P (0.2352/0.0134 frequencies) fifth (Fig 4). In all continents except South America, the four most frequent mutations matched the global results, whereas the data from South America gave a different order. The frequency of R203M was higher than that of R203K in all continents except South America. The R203M/R203K frequencies were 0.4195/0.1965 in Africa, 0.6033/0.3052 in Asia, 0.6074/0.2826 in Europe, 0.6310/0.1776 in North America and 0.6143/0.3008 in Oceania, but 0.3570/0.5700 in South America. South America also differed from the other areas in its second and third most frequent mutations, which were G204R (0.5685 frequency) and P80R (0.4184 frequency), respectively.
For the S AASs, D614G was the most frequent mutation worldwide, with a frequency of 0.9756, followed by L18F (0.1680 frequency), A222V (0.1579 frequency), E484K (0.1454 frequency) and N501Y (0.1120 frequency) in second to fifth place (Fig 4). The most frequent mutation in the S AASs was the same in all six geographical areas, although its frequency differed between them (Fig 5 section D). The frequency of D614G in Africa, Asia, Europe, North America, South America and Oceania was 0.8884, 0.9579, 0.9743, 0.9835, 0.9959 and 0.9047, respectively. In Africa, P681R/P681H (0.0674/0.0396 frequencies) ranked second, Q677H/Q677K (0.0661/0.0163 frequencies) third, L452R (0.0728 frequency) fourth and S477N (0.0712 frequency) fifth. In Asia, E484K/E484Q (0.1112/0.0211 frequencies) ranked second, and P681R/P681H (0.1127/0.0150 frequencies), W152L (0.0897 frequency) and G769V (0.0903 frequency) ranked third to fifth. In Europe, A222V (0.4217 frequency) was the second most frequent mutation, and L18F (0.1702 frequency), S477N (0.0999 frequency) and S98F (0.0466 frequency) were third to fifth, respectively. In North America, E484K (0.1393 frequency) ranked second, L18F (0.1227 frequency) third, P681H/P681R (0.0849/0.0268 frequencies) fourth and L452R (0.1021 frequency) fifth, a set of frequent mutations that differs from that of South America. In South America, the frequent mutations after D614G were V1176F (0.8592 frequency), E484K (0.8268 frequency), N501Y (0.7745 frequency) and L18F (0.7742 frequency), in that order. In Oceania, the frequent mutations following D614G were S477N (0.6864 frequency), P681R (0.0229 frequency), V1068F (0.0196 frequency) and N439K (0.0179 frequency), respectively.
Additional data are listed in Additional files 1, 2, 3 and 4.
Evolutionary trends in the emergence and distribution of the top ten mutations over time and across geographical areas
For a more practical view of the mutations, the trend of their emergence and the timeline of their spread can be considered. This part of the study may help in recognizing the factors that influence the efficiency or inefficiency of drugs and vaccines. The distribution pattern of each of the top ten mutations, based on the time of collection, is illustrated in Fig 6, and supplemental data are shown in Additional files 5, 6, 7 and 8.
The T9 mutation, the most frequent mutation in the E AASs worldwide, began to prevail from October 2021 and, up to January 2022, was present at a frequency rate of 0.0693. In comparison, the P71 mutation gained in prevalence in May 2020, decreased in August 2020 and began to increase again from September 2020, reaching its maximum frequency rate (0.0257) in March 2021. Although the V62 mutation was present from the first days of the pandemic, it increased from August 2021 and reached its maximum frequency rate (0.0058) in October 2021. In Africa, the P71 mutation was first detected in August 2020; it then increased until April 2021, when it reached its highest frequency rate (0.3673), and declined until September 2021. Accordingly, the highest frequency of the P71 mutation in all other continents was almost identical to that in Africa. In Asia, the V62 mutation increased noticeably from August 2021 and reached its maximum frequency rate (0.0901) in October 2021. In South America, the V58 mutation grew at the beginning of the pandemic and reached its highest frequency rate (0.0457) in May 2020, but it declined from July 2020.
I82, the most frequent mutation in the M AASs worldwide, had a notable frequency rate in January 2021 (0.1095). The second global peak of I82 prevalence began in May 2021, and the mutation reached its highest frequency rate (0.9969) in October 2021 (Fig 6). Except in South America, the evolutionary trend in the distribution of the I82 mutation followed an almost identical pattern in all continents. Although I82 was detectable in South America from the beginning of the pandemic, its frequency rate stayed almost constant and near zero before April 2021, and it began to prevail from July 2021. In all areas, the Q19 mutation increased from October 2021, at the time when the I82 mutation began to decrease worldwide.
The prevalence of mutations among the N AASs fluctuated. The R203 mutation showed a peak in frequency rate in January 2020 and began to increase from February 2020 until August 2020. Globally, it started to prevail again from November 2020 and kept growing thereafter, reaching a frequency rate of 0.9907 in January 2022 (Fig 6). The evolutionary trends in the individual areas showed a similar pattern of R203 prevalence, except in South America, where the frequency rate of R203 was almost steady from April 2020 to January 2022. The evolutionary trends of the D63 and G215 mutations followed approximately identical patterns, and both began to increase from April 2021 in all continents. In addition, the P80 mutation, one of the frequent mutations exclusive to South America, increased from November 2020 and began to decrease from June 2021.
The D614 mutation began its upward trend worldwide in February 2020. In contrast to the other areas, in Africa this mutation did not follow a steady pattern and fluctuated. The L18 mutation increased worldwide from August 2020 and began to decrease from July 2021. The E484 and N501 mutations showed the same pattern globally (Fig 6). The A222 mutation, in contrast, followed a different trend: it gained in prevalence from July 2020 and began to decrease from October 2020. The S477 mutation, one of the top ten frequent mutations in Oceania, began to increase in May 2020 and decreased just three months later (August 2020). In South America, except for D614, nearly all other mutations clearly began to increase from November 2020.
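The monthly frequency curves described in this section could be obtained with a calculation along these lines. The sketch assumes that each record carries a collection date in `YYYY-MM-DD` form and an aligned sequence, and it scores a position (for example T9) as mutated whenever the residue differs from the reference, regardless of the substituted residue. The function name and the toy records are hypothetical.

```python
from collections import Counter


def monthly_frequency(records, position, reference_aa):
    """records: iterable of (collection_date 'YYYY-MM-DD', aligned_sequence).

    Returns {'YYYY-MM': fraction of that month's sequences whose residue at the
    1-based position differs from reference_aa}, ordered by month.
    """
    totals = Counter()
    mutated = Counter()
    for date, seq in records:
        month = date[:7]
        totals[month] += 1
        if seq[position - 1] != reference_aa:
            mutated[month] += 1
    return {month: mutated[month] / totals[month] for month in sorted(totals)}


# Toy data (hypothetical, for illustration only)
records = [
    ("2021-10-03", "MYSFVSEETG"),
    ("2021-10-20", "MYSFVSEEIG"),
    ("2021-11-02", "MYSFVSEEIG"),
    ("2021-11-15", "MYSFVSEEIG"),
]
trend = monthly_frequency(records, position=9, reference_aa="T")
print(trend)
```

Filtering the records by area before calling the function would yield one curve per continent.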
Protein-Protein Interaction (PPI) network presentation
The PPI network, with 57 nodes and 153 edges, presents the interactions between the SARS-CoV-2 E, M, N and S proteins and human proteins (Fig 7; see Additional file 9). In the ranking analysis, Ras GTPase-activating protein-binding protein 1 (G3BP1) was identified as a highly ranked human gene (Fig 8). Additional data are provided in Additional file 10. The network linked the M protein gene cluster with the E and N members through the human gene A-kinase anchoring protein 8 like (AKAP8L), which acts as a bottleneck. In this network, Zinc Finger DHHC-Type Palmitoyltransferase 5 (ZDHHC5) and Golgin A7 (GOLGA7) were the human genes with the highest interaction with the S protein.
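A bottleneck node in a network is commonly taken to be one with high betweenness centrality, that is, one lying on many shortest paths between other nodes. The text does not state which ranking method or tool was used, so the following is only an illustrative sketch: it implements Brandes' algorithm for an undirected, unweighted graph and applies it to a small made-up network. The node names and edges are placeholders and do not represent real interactions.

```python
from collections import deque


def build_adjacency(edges):
    """Build {node: set(neighbours)} from an undirected edge list."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def betweenness(adjacency):
    """Unnormalised betweenness centrality (Brandes' algorithm, BFS version)."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        stack = []
        preds = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}   # number of shortest paths from source
        dist = {v: -1 for v in adjacency}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:              # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adjacency}
        while stack:                         # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


# Hypothetical toy network: two clusters joined only through "human_hub"
edges = [
    ("viral_1", "viral_2"), ("viral_1", "human_hub"), ("viral_2", "human_hub"),
    ("human_1", "human_2"), ("human_1", "human_hub"), ("human_2", "human_hub"),
]
centrality = betweenness(build_adjacency(edges))
bottleneck = max(centrality, key=centrality.get)
print(bottleneck, centrality[bottleneck])
```

In this toy graph every shortest path between the two clusters passes through `human_hub`, so it receives the highest score while all other nodes score zero.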