Analysis of mutations
Of the ten samples from patients with a confirmed diagnosis for COVID-19, 8 were successful in the new generation sequencing procedure, generating complete genomic sequences with mean coverage level > 99%. The analysis of mutations in the genome of these strains demonstrated the presence of a total of 22 alterations in different sites, one of which was found in a non-coding region. Among those found in coding regions, 12 are classified as non-synonymous mutations. They are found in 7 viral proteins: nsp1, nsp12, Spike, ORF3a, ORF6, ORF8 and Nucleoprotein. The complete list of mutations found, and the percent frequency of occurrence is shown in Table 1.
Table 1. Mutations found in strains of SARS-CoV-2 isolated in Rondônia
Mutation
|
Place of occurrence
|
Type of mutation
|
Proteic alteration
|
Freq.
|
Gene
|
Protein
|
(position at the gene level)
|
C241T
|
5' UTR
|
-
|
-
|
-
|
100%
|
C364A
|
ORF1ab
|
nsp1
|
Nonsynonymous
|
D33E
|
25%
|
C3037T
|
ORF1ab
|
nsp3
|
Synonymous
|
-
|
100%
|
C11563T
|
ORF1ab
|
nsp7
|
Synonymous
|
-
|
12.50%
|
C14265T
|
ORF1ab
|
nsp12 RdRp
|
Synonymous
|
-
|
12.50%
|
C14408T
|
ORF1ab
|
nsp12 RdRp
|
Nonsynonymous
|
P4715L
|
100%
|
C15324T
|
ORF1ab
|
nsp12 RdRp
|
Synonymous
|
-
|
25%
|
C16428T
|
ORF1ab
|
nsp13 Helicase
|
Synonymous
|
-
|
25%
|
T22156C
|
S
|
Spike / Surface
|
Synonymous
|
-
|
12.50%
|
C23244A
|
S
|
Spike / Surface
|
Nonsynonymous
|
P561H
|
25%
|
A23403G
|
S
|
Spike / Surface
|
Nonsynonymous
|
D614G
|
100%
|
C23917T
|
S
|
Spike / Surface
|
Synonymous
|
-
|
12.50%
|
T25036C
|
S
|
Spike / Surface
|
Synonymous
|
-
|
12.50%
|
G25855T
|
ORF3a
|
|
Nonsynonymous
|
D155Y
|
25%
|
A26045G
|
ORF3a
|
|
Nonsynonymous
|
Q218R
|
25%
|
T27299C
|
ORF6
|
|
Nonsynonymous
|
I33T
|
75%
|
A28108C
|
ORF8
|
|
Nonsynonymous
|
Q72P
|
12.50%
|
G28881A
|
N
|
Nucleoprotein
|
Nonsynonymous
|
R203K
|
75%
|
G28882A
|
N
|
Nucleoprotein
|
Synonymous
|
-
|
75%
|
G28883C
|
N
|
Nucleoprotein
|
Nonsynonymous
|
G204R
|
75%
|
T29148C
|
N
|
Nucleoprotein
|
Nonsynonymous
|
I292T
|
75%
|
C29367T
|
N
|
Nucleoprotein
|
Nonsynonymous
|
P365L
|
37.50%
|
Four mutations were found in 100% of the isolates: C241T, C3037T, C14408T and A23408G. With the exception of alteration C14408T, the others were classified as signature mutations for super spreader group 4 (SS4) identified by Yang and collaborators (2020). Therefore, according to this information, all samples from the present study that were isolated belong to the SS4 group. Phylogenetic analysis to determine evolutionary groups (figure 2) confirms this classification.
The vast majority of the mutations found have no clinical/virological significance described in the literature, with some being considered unique. Among the known mutations in the ORF1ab gene, C14408T was found in 100% of the samples, which results in the replacement of a proline amino acid with a leucine at position 323 (P323L) of nsp12 RdRp (RNA-dependent RNA polymerase). Alterations in viral enzymes of this nature raise a level of concern, since they can cause resistance to drugs that have RdRp as a target, as previously described for hepatitis C, Influenza and also for one Coronavirus infection in mice treated with Rendesivir 31–33.
However, the P323L alteration results in an amino acid with an isoelectric point similar to the wild type amino acid 34, which may mean a not-so-significant change in the molecular structure of this protein. In addition, it is located outside the nsp12 RdRp catalytic region. However, it is located in a region equivalent to the SARS-CoV RdRp interface domain. This domain is supposedly implicated in the interaction with other viral proteins that can regulate the processivity of RdRp during the activity of Replicase Transcriptase Complex (RTC) 35,36. RTC, in turn, has an interaction with the exonuclease nsp14, the protein responsible for reviewing viral RNA synthesis, and this interaction is important in the control of accurate RNA replication 35. Thus, it is assumed that there is the possibility of an indirect influence of C14408T on the viral mutation rate.
In addition, it was recently proposed in a pre-printed study with over 11,200 sequences that this alteration may be associated with an increase in the rate of viral mutations 37. In addition to this study, another also observed an extremely high rate of nucleotide substitutions in a group of viruses descending from an isolated parent strain in Germany (Germany/BavPat1/2020|EPI_ISL_406862), the SS4 group, which acquired the C14408T mutation later and was more widespread to other European countries 22. Thus, considering the information raised about it and the high frequency of occurrence of this substitution in the analyzed samples, further studies are needed to assess the role of this mutation in the fidelity of RdRp, whose errors can directly affect the long-term effectiveness of a vaccine and specific antiviral drugs.
Among the modifications found in the S gene, A23403G stands out. This non-synonymous mutation was found in 100% of the samples and results in the replacement of an aspartate amino acid with a glycine at position 614 (D614G) of the Spike protein. This protein, through its receptor binding domain (RBD), mediates the interaction of the virus with the host cell by binding to ACE-2, which consequently facilitates membrane fusion and viral penetration 38,39. The substitution results in an amino acid with an isoelectric point different from the wild type 34, which may provide greater conformational freedom in the structure of the protein, improving local entropy and affecting the recognition interaction via RBD, through positioning of waste involved in this process 40. This is one of the reasons this mutation has been a source of debate in the scientific literature regarding an association with possible higher transmissibility of SARS-CoV-2 34,40.
Other reasons have also been pointed out to justify this association: 1) Structural reason: the resulting conformational change can improve the membrane fusion step by facilitating the separation of the S1 domain from the S2 domain of the Spike protein bound to the receptor; 2) Immune reason: because it is a residue located in a region equivalent to the target epitope of antibody-dependent improvement in SARS-CoV, it is assumed that the binding with the antibody may alter the conformation of the protein and increase its interaction with the ACE-2 receptor and; 3) Genetic epidemiological reason: an increase in the isolation frequency of strains that contain the D614G mutation has been detected in several regions of the world, including detection of the G614 variant's prevalence in a matter of weeks in places where the D614 variant was previously prevalent 34,41.
Considering that this mutation was first identified on January 28, 2020 (Germany/BavPat1/2020|EPI_ISL_406862), this increase in frequency was also found in a direct genomic analysis of all 1539 SARS-CoV-2 genomes deposited in the GISAID platform between February 29th and March 26th. There was a prevalence of 56% of isolates belonging to the SS4 group, which hosts the D614G mutation as a signature characteristic of the group, showing the rapid dissemination of this variant over time 22.
A recently published in vitro study that is currently in prepress performed comparisons of the functional properties of the D614 and G614 variations of the Spike protein, finding greater efficiency of infectivity with the G614 variant in the replication of pseudotyped retroviruses in cells that express ACE-2. The improvement was associated with a possible marked incorporation of Spike protein into the final structure of the virus, which may therefore improve the transmission of SARS-CoV-2 between different hosts 42. Daniloski, Guo and Sanjana (2020) also came to the similar conclusion of a higher proportion of Spike protein per virion. In contrast, a pre-printed analysis of 15,691 SARS-CoV-2 genomes indicated that recurrent mutations in this protein did not increase viral transmissibility 44.
Clinically, D614G does not appear to be associated with the severity of the disease 41. This can be justified by the fact that the G614 variation provides limitations to the rate and efficiency of intra-host replication 42. Although the number of samples in the present study may be considered low, none of the individuals with the strains analyzed died of the infection and only one (12.5%) developed a clinical condition considered severe (dyspnoic). This percentage rate is similar to the general epidemiological rate of clinical resolution of COVID-19, showing no link between D614G and the severity of the disease.
Two changes in different genes were found together in 75% of the samples: T27299C and T29148C. Both are classified as non-synonymous mutations that result in the substitution of an isoleucine amino acid with a threonine at positions 33 (I33T) and 292 (I292T) of the viral proteins ORF6 and N, respectively. According to the study by Candido et al. (2020), the coexistence of these two mutations corresponds to the signature of one of three main clades of SARS-CoV-2 spread in Brazil, named clade 2. This clade is the most spatially disseminated strain in the country, with isolated representatives in a total of 16 of the 26 Brazilian states as of the end of April 2020. However, this same study did not obtain a good representation of the northern region of the country, with the absence of genetic data of circulating strains in Rondônia. Therefore, we have detected the spread of this clade to yet another Brazilian state. Phylogenetic analyses developed in the present study that will be presented and discussed later showed that this group of samples corresponds to pangolin line B.1.1., although clade 2 identified by Candido et al. (2020) may include other pangolin lineages. The authors did not specify the pangolin lines included in each of the three main clades detected.
Other changes were also found in the N gene of the strains analyzed. Three sequential nucleotide changes are highlighted: G28881A, G28882A and G28883C, which were found together in 75% of the samples. They result in the replacement of two amino acids of the viral Nucleoprotein, R203K and G204R (R - arginine; K - lysine; G - glycine). The potential effect of these mutations on viral and host processes has been investigated, and it has been observed that they result in considerable changes in the predicted binding with some miRNAs, which may play a role in influencing the progress of the infection. Some of the miRNAs that bind to this mutated type of nucleoprotein may be under-regulated in several types of cancer. This increases the possibility that cancer patients may have a high susceptibility to the mutated variant due to a reduced ability to contain the virus, compared to the wild-type infection 40.
Another alteration detected in the same gene is C29367T, found in three (37.5%) of the eight samples in the study. It is a nonsynonymous mutation that results in a P365L substitution. This mutation has not yet been described in the scientific literature. When looking for other sequences with this mutation in Dataset B, it was observed that none of the strains included show this change. This leads to the assumption that it has appeared more recently in the viral evolutionary history of SARS-CoV-2. Because it was detected only in some sequences in Rondônia, it may have appeared after the virus entered the state and can be used as a marker to study viral spread among different municipalities in the state.
Evolutionary analysis
Temporal signal evaluation
For both Dataset A and Dataset B, it was possible to observe a linear regression curve that shows a positive correlation between genetic diversity and sampling time, showing the existence of sufficient time signal in the data sets to justify a molecular clock approach (figure 1, A and B). Although the time signal level may be considered low for Dataset B, as evidenced by the R2 value (0.2058), this parameter should not be used to test the statistical significance of the regression, because the individual data points in the graph are not distributed independently, but are partially correlated due to shared phylogenetic ancestry 28.
Determination of evolutionary group
Phylogenetic analysis to determine evolutionary groups of SARS-CoV-2 strains in the present study, using dataset A, confirmed one of the conclusions previously obtained with the study of mutations: all samples were identified as belonging to the SS4 group, with a posterior support of the 100% cladistic distribution (figure 2). The best distribution of the non-correlated relaxed clock was “exponential”, chosen through the analysis of convergence of MCMC run parameters and tree topology.
According to the evolutionary history of the SS groups pointed out by Yang et al. (2020), strains descended from the original virus were transmitted to various locations in the world and were dominant for a period of time, during the early outbreak of COVID-19. However, with continuous transmission in different environments, the virus has evolved into four large super-spreading clusters, along with other variants derived directly from the original virus. SS group members became dominant, with different variants prevailing in different regions of the world, in mid-February and March.
The SS1 strains first emerged and were transmitted mainly in Asia, South Korea and the USA. They persisted in China during the post-initial outbreak phase, being less prevalent in other parts of the world. Groups SS2 and SS3 were transmitted mainly in mid-January and February, in Asian countries other than China, as well as Europe and Brazil, specifically in the State of São Paulo during the initial phase of the outbreak. Finally, group SS4 emerged in late January and was reported for the first time in Germany. It was primarily responsible for the outbreak of a pandemic on the European continent, replacing the previous dominance of strains SS2 and SS3 in the region. From this continent, this variant has spread to several other locations around the world, as already discussed in relation to the D614G substitution. It also arrived in South America where, in mid-March, it entered the State of Rondônia.
This analysis allowed us to observe that at least two different events of entry occurred in the State, both of European descent. It also showed a deficiency of phylogenetic signal to differentiate strains from groups SS2 and SS3. In fact, for identification through direct genomic observation, both groups have only one signature mutation each (G26144T for SS2 and G11083T for SS3), which may show little phylogenetically useful difference for differentiating strains from these groups, when considering the integral size of the SARS-CoV-2 genome and its biological tendency to maintain conservation. In addition, we observed some Brazilian strains deposited in Genbank (MT126808.1 and MT350282.1) that have both of the aforementioned substitutions. Therefore, we suggest the union of groups SS2 and SS3 in the classification of super spreaders. Fortunately, this question does not negatively influence the determination of the samples as descendants of the SS4 group.
Detailed evolutionary history
Phylogenetic analysis to detail the evolutionary history of the SARS-CoV-2 strains from the present study was performed based on the relaxed correlated molecular clock model using dataset B. The “lognormal” distribution was chosen through the convergence analysis of MCMC run parameters and tree topology. The inferred tree allowed us to observe that 75% of the strains isolated in the State of Rondônia belong to pangolin lineage B.1.1.; while the remaining 25% belong to line B.1. (figure 3, A and B). This classification was supported by 100% of subsequent probability in determining the lineage at some cladistic level of the tree and provides support for the previous conclusion of the occurrence of at least two SARS-CoV-2 entry events in the State.
In order to avoid inaccurate conclusions regarding the introduction of SARS-CoV-2 in the State, details of the evolutionary path were obtained from information on clades that included samples with posterior support greater than 85%. Therefore, considering this criterion, it was not possible to fully detail the evolutionary history of the introduction of the B.1.1 strain. In addition to being of European descent, B.1.1. strains from the state of Rondônia also descend from an ancestral strain that circulated in Argentina around the transition from February to March, with a differentiation date of February 25th (95% HPD between February 14th and 29th (figure 3A). It was not possible to draw any further conclusions about the detailed path between the transmission from Argentina to Rondônia, nor whether it occurred directly between these localities.
However, another interpretation is also possible. This group of sequences share a common ancestor, descended from an older one (dated February 15th, with 95% HPD between January 28th and February 26th) that gave rise to isolated strains in the middle of March in the state of Minas Gerais and the Federal District. Therefore, it is possible that strains circulating in these states have spread to Argentina and Rondônia. A previous study identified the transmission of B.1.1. strains to some South American countries, including Argentina 48. This supports the second hypothesis surrounding the introduction of this lineage into the State.
The detailing of the evolutionary path regarding the introduction of the B.1. line provided more detailed information about this process. Just like for B.1.1., B.1. strains also share ancestry with a parental strain that circulated in Argentina, having differentiated from a common ancestor on February 29th (95% HPD between February 26th and March 15th). Another more recently shared common ancestor gave rise to strains that circulated in the Brazilian states of Minas Gerais and Ceará, dated March 9th (95% HPD between March 8th and 21st) (figure 3B). This last dating does not represent the exact period of arrival of this lineage in the State, but a period close to this event. It should be noted that the first confirmed case in the state of Rondônia occurred on March 20th.
Three pairs of samples are lined up in the analysis in a monophyletic manner with 100% posterior support, showing a very high degree of similarity between them. This shows the expected effect of sustained community transmission of the virus in the state. The monophyletic relationship of B.1.1. strains of the sample pairs 01-03 and 07-08 may provide relevant information about the viral dissemination profile in the State. With the exception of sample 07, the others have the aforementioned C29367T alteration in their genome, which presumably arose after the introduction of SARS-CoV-2 in the State and which can be used as a source of information to study the form of dissemination. Therefore, it is presumed that this alteration occurred after passing, not necessarily directly, from 07 to 08 in the city of Porto Velho (place of residence of their respective carriers). Subsequently, there was a continuation of the transmission of strains that carry this mutation before reaching the list of samples 01-03. Since sample 03 was isolated from a patient residing in Porto Velho, and sample 01 was isolated from a patient residing in the municipality of Jaru (about 290 kilometers away from the capital), it is assumed that a strain was transmitted from Porto Velho to Jaru.