Fine structure of a partition in the spike glycoprotein encoded in the SARS-CoV-2 genome

doi:10.21203/rs.3.rs-2236542/v1

This work is licensed under a CC BY 4.0 License

Version 1

The gene encoding the spike (S) glycoprotein of SARS-CoV-2, the virus that causes COVID-19, was analyzed using two types of periodic table of the genetic code (standard and cube) to uncover the internal fine structure of the S-protein. The analysis was performed on the Wuhan-Hu-1 SARS-CoV-2 sequence (GenBank accession number NC_045512.2). A partition was detected between codon numbers (three-letter code numbers) 47 and 48 of the codons that encode amino acids in the S-protein. The population distributions of the organized codes and of the amino acid replacements in the S-protein differed greatly between two regions of the cube-type periodic table. The genetic codes with codon numbers 48–63 (4th plane of the cube table) occurred more frequently than the genetic codes of each of the other three planes (1st–3rd planes). Planes-linkage structures involved in the partition were also analyzed, and a simplified model of the S-protein gene was obtained in which a planes-linkage of the 4th plane and a planes-linkage of the 1st–3rd planes are linked together in alternate shifts. The large code population in the 4th plane and the multiformity of its planes-linkages gave additional support to the partition between codon numbers 47 and 48 in the S-protein gene. Analysis of real lineages of SARS-CoV-2 with the cube-type periodic table identified distinguishing features of the Omicron lineage: not only a large code population within the receptor-binding domain of the S-protein, but also large percentage rises in the population of amino acid replacements in the 1st and 2nd planes.

Biological sciences/Biochemistry

Biological sciences/Chemical biology

Biological sciences/Molecular biology

Here, a chemistry-focused view of the genome of SARS-CoV-2, the virus that causes COVID-19, is presented. Just as atoms and molecules are studied through the periodic table of the elements, this study is devoted exclusively to a fundamental description of the SARS-CoV-2 genome that does not require advanced mathematical and computational tools such as molecular dynamics. Specifically, a periodic table of the genetic code, first proposed in 2002 ¹, was applied to analyze the spike (S)-protein encoded in the SARS-CoV-2 genome. Because the periodic table of the elements is a useful tool in materials science, the usefulness of a periodic table of the genetic code was examined here by analysis of the S-protein gene. The results showed that the periodic table of the genetic code can provide a simple and clear picture of the S-protein. If this table comes to be widely used to analyze the genetic code in the SARS-CoV-2 genome, it will offer a new analytical tool.

Brief overviews of the SARS-CoV-2 virus are available elsewhere ²,³. The SARS-CoV-2 genome is a positive-sense single-stranded RNA that contains approximately 30,000 letters of genetic code. The RNA strand is surrounded by a membrane that is composed of non-structural proteins and four structural proteins. One of the four structural proteins is the S-protein, which the virus uses to gain entry to human epithelial cells. Here, the focus is on the S-protein because it binds to the host angiotensin-converting enzyme 2 (ACE2) receptor, and changes in the genetic code of the S-protein gene can result in the virus becoming more infectious and/or spreading more easily among people.

Data analysis

Two types of periodic table of the genetic code, the standard type (Table 1) and the cube type (Fig. 1), were used to analyze the S-protein gene in the SARS-CoV-2 genome. A codon number (three-letter code number) was assigned to each codon, and these numbers had a central role in analyzing the S-protein gene sequence.

The data used for the analysis were obtained from the CoV-GLUE Amino acid variation database ⁴, Outbreak.info Variants ⁵, and NCBI’s GenBank NC_045512.2 ⁶.

Table 1: Periodic table of the standard-type genetic code ¹,⁷.

		1st base					1st base
	2nd	C	A	G	T	2nd	C	A	G	T	3rd
I	C	0P-3	1T-5	2A-1	3S-3	A	4H-3	5N-5	6D-1	7Y-3	C
II	G	8R-3	9S-5	10G-1	11C-3	T	12L-3	13I-5	14V-1	15F-3
III	C	16P-5	17T-3	18A-3	19S-1	A	20Q-5	21K-3	22E-3	23Stop	A
IV	G	24R-4	25R-3	26G-3	27Stop	T	28L-5	29I-3	30V-2	31L-1
V	C	32P-1	33T-3	34A-3	35S-5	A	36Q-1	37K-3	38E-3	39Stop	G
VI	G	40R-1	41R-3	42G-3	43W-5	T	44L-1	45M-3	46V-3	47L-5
VII	C	48P-3	49T-3	50A-3	51S-3	A	52H-3	53N-3	54D-3	55Y-3	T
VIII	G	56R-3	57S-3	58G-3	59C-3	T	60L-3	61I-3	62V-3	63F-3
Hydrogen bonds		6	5	6	5		5	4	5	4

0P-3 indicates that for genetic code CCC, the codon number is 0, the coded amino acid is P, and the inversion number is 3. The inversion number has periodicity in each period I–VIII, and the same amino acids appear in the same columns. For further details about inversion numbers, see Morimoto (2002) ¹. For related periodic tables and figures proposed by other scientists, see Morimoto (2009) ⁷.
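The numbering in Table 1 can be reproduced with a simple positional rule. The following is a minimal sketch, assuming (from the column and row order of Table 1) that each base is indexed C = 0, A = 1, G = 2, T = 3 and that the codon number is 16 × (3rd base) + 4 × (2nd base) + (1st base). The function name `codon_number` is illustrative and is not taken from the original work.

```python
# Base order read off the headers of Table 1 (C, A, G, T).
# Assumption: the same order applies to all three codon positions.
BASE_INDEX = {"C": 0, "A": 1, "G": 2, "T": 3}


def codon_number(codon: str) -> int:
    """Return the Table 1 codon number (0-63) of a three-letter DNA codon."""
    if len(codon) != 3:
        raise ValueError("a codon must have exactly three letters")
    first, second, third = (BASE_INDEX[base] for base in codon.upper())
    # The 3rd base selects a block of 16, the 2nd base a block of 4.
    return 16 * third + 4 * second + first


if __name__ == "__main__":
    for codon in ("CCC", "TCC", "TTC", "ATG", "TTT"):
        print(codon, codon_number(codon))
```

Under these assumptions CCC, TCC, TTC, ATG, and TTT map to 0, 3, 15, 45, and 63, in agreement with the entries 0P, 3S, 15F, 45M, and 63F in Table 1.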

Fine structure of the population distribution of the codons for the S-protein

First, the specific structure of the S-protein gene was described using the cube-type periodic table. Each of the 1274 genetic codes organized in the S-protein gene ⁶ was allocated to one of the four planes in Fig. 1. The results for the first 24 codes (amino acids) are shown in Table 2.

Table 2: Allocation of the first 24 codes (amino acids 1–24) in the S-protein gene to the planes in Fig. 1.

Amino acid No.	1	2	3	4	5	6	7	8	9	10	11	12
Codon	ATG	TTT	GTT	TTT	CTT	GTT	TTA	TTG	CCA	CTA	GTC	TCT
Codon No. and amino acid	45M	63F	62V	63F	60L	62V	31L	47L	16P	28L	14V	51S
Plane No. in Fig. 1	3rd	4th	4th	4th	4th	4th	2nd	3rd	2nd	2nd	1st	4th

Amino acid No.	13	14	15	16	17	18	19	20	21	22	23	24
Codon	AGT	CAG	TGT	GTT	AAT	CTT	ACA	ACC	AGA	ACT	CAA	TTA
Codon No. and amino acid	57S	36Q	59C	62V	53N	60L	17T	1T	25R	49T	20Q	31L
Plane No. in Fig. 1	4th	3rd	4th	4th	4th	4th	2nd	1st	2nd	4th	2nd	2nd

Table 2 shows that amino acids 2–6, 12–13, and 15–18 were seamlessly allocated to the 4th plane in Fig. 1, that is, consecutive codons fell in the same plane (amino acid 14, codon CAG, falls in the 3rd plane). Seamless allocation to the same plane was observed throughout the S-protein gene for some of the 1274 genetic codes. Seamless allocation to the 4th plane, in particular, was much more frequent than seamless allocation to the other three planes. (See Table S1 in the Supplementary Material for details.) Counting, for each of the four planes, the planes-linkages that share the same linkage number gave the results shown in Table 3.
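As a hedged illustration of how the allocation in Table 2 could be reproduced, the sketch below assigns each codon to a plane from its third base (planes 1–4 for C, A, G, T, which is equivalent to grouping codon numbers 0–15, 16–31, 32–47, and 48–63) and collapses consecutive codons in the same plane into runs. The names `plane` and `plane_runs` are illustrative assumptions and are not part of the original analysis; the codon list is copied from Table 2.

```python
from itertools import groupby

# Third-base order assumed from Table 1: C, A, G, T -> planes 1, 2, 3, 4.
BASE_INDEX = {"C": 0, "A": 1, "G": 2, "T": 3}


def plane(codon: str) -> int:
    """Plane (1-4) of the cube-type table; it depends only on the 3rd base."""
    return BASE_INDEX[codon[2].upper()] + 1


def plane_runs(codons):
    """Collapse a codon list into (plane, first_position, run_length) tuples."""
    runs, position = [], 1
    for value, group in groupby(plane(c) for c in codons):
        length = len(list(group))
        runs.append((value, position, length))
        position += length
    return runs


# First 24 codons of the S-protein gene as listed in Table 2.
FIRST_24 = ("ATG TTT GTT TTT CTT GTT TTA TTG CCA CTA GTC TCT "
            "AGT CAG TGT GTT AAT CTT ACA ACC AGA ACT CAA TTA").split()

if __name__ == "__main__":
    for value, start, length in plane_runs(FIRST_24):
        print(f"plane {value}: amino acids {start}-{start + length - 1}")
```

For these 24 codons the longest run is the five-codon stretch in the 4th plane at amino acids 2–6. Applied to the full gene, the run lengths in each plane would correspond to the linkage numbers summarized in Table 3, assuming a planes-linkage is defined as such a run.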

Table 3: Summary of constituent-codon members of the same planes-linkage value in the 1st–4th planes shown in Fig. 1.

Plane No. in Figure 1	The 1st plane	The 2nd plane			The 3rd plane	The 4th plane						Sum
Linkage number	3	5	4	3	3	8	7	6	5	4	3
Total number	8	4	5	18	1	2	1	8	9	20	33	109
Sum	8(7%)	27(25%)			1(1%)	73(67%)

See Table S2 in the Supplementary Material for more details.

Many planes-linkages of three or more members were observed in the 1st–4th planes, especially in the 4th plane (73/109, or 67%), as shown in Table 3. Both the number of planes-linkages and their multiformity (linkage numbers from 3 to 8 in Table 3) were much larger in the 4th plane than in each of the other three planes.

The results in Tables 2 and 3 together provide data for a simplified model of the S-protein gene in the SARS-CoV-2 genome, including a well-ordered set of specific codon planes-linkages. Importantly, the model indicates that a planes-linkage of the 4th plane and a planes-linkage of the 1st–3rd planes link together in alternate shifts as follows: the ATG initiation codon (3rd plane)–five-member planes-linkage of the 4th plane–(some-member planes-linkage of the 1st–3rd planes)–(some-member planes-linkage of the 4th plane)–(some-member planes-linkage of the 1st–3rd planes). Consequently, there is a partition between the 1st–3rd planes and the 4th plane, that is, between codon numbers 47 and 48. If the genetic codes that form the planes-linkages can be analyzed successfully, the results may provide clues to a rule governing the composition of the genetic code of the S-protein gene.

Second, further support for the concept of a partition in the S-protein gene is described. The aggregate population of the genetic code organization in the S-protein gene showed that the 1st, 2nd, 3rd, and 4th planes held 203 (16%), 344 (27%), 137 (11%), and 590 (46%) of the 1274 genetic codes, respectively. Clearly, the code population in the 4th plane in Fig. 1 was much higher than the code population in each of the other three planes. (See Table S3 in the Supplementary Material for more details.) These results confirm the finding that there is a partition between the 1st–3rd planes and the 4th plane, or between codon numbers 47 and 48, in the S-protein gene.
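The percentages quoted above follow directly from the plane counts. The short sketch below recomputes them; the counts are those given in the text, whereas the variable names are illustrative only.

```python
# Plane populations quoted in the text for the codons of the S-protein gene.
PLANE_COUNTS = {1: 203, 2: 344, 3: 137, 4: 590}

total = sum(PLANE_COUNTS.values())
# Share of each plane, rounded to the nearest whole percent.
shares = {p: round(100 * n / total) for p, n in PLANE_COUNTS.items()}
# Combined population of the 1st-3rd planes (codon numbers 0-47).
planes_1_to_3 = total - PLANE_COUNTS[4]

if __name__ == "__main__":
    print("total codons:", total)
    for p, pct in shares.items():
        print(f"plane {p}: {PLANE_COUNTS[p]} ({pct}%)")
    print("planes 1-3 combined:", planes_1_to_3)
```

The four counts sum to 1274, and the 4th plane alone (590 codes) approaches the combined population of the other three planes (684 codes).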

How each genetic code mutates

First, each single-letter mutation in the three-letter code will theoretically occur within one of the four planes in Fig. 1. However, in some exceptional cases, an amino acid replacement (e.g., H146Q) can be produced by more than one mutant codon, each probably arising from a single point mutation. In such cases, the genetic code mutates only to a code facing it vertically outward from its own plane in Fig. 1 (the cube-type table), with the exception of variant W152R. For example, for the three variants CCC 0→3 TCC (P25S), TCC 3→15 TTC (S94F), and CAC 4→20 CAA or 4→36 CAG (H146Q), the figure next to each genetic code is its codon number and the arrow indicates the mutation that produces the amino acid replacement. Hence, the amino acid replacement P25S is the result of a mutation from codon number 0 (code CCC) to codon number 3 (code TCC). Consequently, the single-letter mutations in the amino acid replacements P25S and S94F occur within the 1st plane in Fig. 1, whereas the single-letter mutation in variant H146Q moves vertically outward from code CAC (codon number 4 in the 1st plane) to code CAA (codon number 20 in the 2nd plane) or code CAG (codon number 36 in the 3rd plane).
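The distinction between within-plane and plane-vertically outward mutations can be sketched as follows. This is an illustration under the assumption that the codon number is 16 × (3rd base) + 4 × (2nd base) + (1st base) with bases indexed C, A, G, T = 0, 1, 2, 3, so that the plane is fixed by the third base; `describe_mutation` is a hypothetical helper, not the author's procedure.

```python
BASE_INDEX = {"C": 0, "A": 1, "G": 2, "T": 3}


def codon_number(codon: str) -> int:
    """Codon number (0-63) under the assumed Table 1 ordering."""
    first, second, third = (BASE_INDEX[b] for b in codon.upper())
    return 16 * third + 4 * second + first


def describe_mutation(ref: str, alt: str):
    """Return (ref_no, alt_no, ref_plane, alt_plane, kind) for a one-letter change."""
    changed = [i for i in range(3) if ref[i] != alt[i]]
    if len(changed) != 1:
        raise ValueError("expected exactly one changed letter")
    ref_no, alt_no = codon_number(ref), codon_number(alt)
    ref_plane, alt_plane = ref_no // 16 + 1, alt_no // 16 + 1
    # A change in the 3rd letter moves the codon to another plane.
    kind = "within plane" if ref_plane == alt_plane else "plane-vertically outward"
    return ref_no, alt_no, ref_plane, alt_plane, kind


if __name__ == "__main__":
    examples = {"P25S": ("CCC", "TCC"), "S94F": ("TCC", "TTC"),
                "H146Q (CAA)": ("CAC", "CAA"), "H146Q (CAG)": ("CAC", "CAG")}
    for name, (ref, alt) in examples.items():
        print(name, describe_mutation(ref, alt))
```

With the example codons from the text, P25S (0→3) and S94F (3→15) stay within the 1st plane, whereas the two H146Q routes (4→20 and 4→36) leave the 1st plane for the 2nd and 3rd planes, respectively.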

Second, each genetic code in Table 1 can be expressed as a period and a group. The Roman numerals indicate periods, and the smallest codon number in each column of Table 1 is used as the group number of that column. For example, the genetic code CCC is denoted as I-0, TCC as I-3, TTC as II-7, CAA as III-4, CAG as V-4, and so on; therefore, variant P25S can also be expressed as I-0→I-3.
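Because each period (row) of Table 1 holds eight consecutive codon numbers, the period and group can be read off a codon number by integer division. The sketch below assumes exactly this layout (period = codon number divided by 8, group = remainder, which equals the smallest codon number in the column); `period_group` is an illustrative name.

```python
ROMAN = ("I", "II", "III", "IV", "V", "VI", "VII", "VIII")


def period_group(codon_no: int) -> str:
    """Return the period-group label (e.g. 'III-4') of a codon number 0-63."""
    if not 0 <= codon_no <= 63:
        raise ValueError("codon number must be in the range 0-63")
    period = ROMAN[codon_no // 8]   # eight codon numbers per period (row)
    group = codon_no % 8            # smallest codon number in the same column
    return f"{period}-{group}"


if __name__ == "__main__":
    # Codon numbers of CCC, TCC, CAA, and CAG taken from Table 1.
    for name, number in (("CCC", 0), ("TCC", 3), ("CAA", 20), ("CAG", 36)):
        print(name, period_group(number))
```

Under this assumption CCC, TCC, CAA, and CAG are labelled I-0, I-3, III-4, and V-4, matching the examples in the text.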

In the standard-type table (Table 1), the effect of a single-letter mutation theoretically occurs within two periods that correspond to each plane in Fig. 1. Specifically, a mutation within the 1st plane in Fig. 1 occurs between two codes in periods I and II in Table 1, and a mutation within the 2nd plane occurs in periods III and IV, and so on. In some exceptional cases, a single-letter mutation occurs within the same group between different periods. For example, variant H146Q occurs within the 4th group between periods I and III or between periods I and V.

Fine structure of the population distribution of amino acid replacements in the S-protein

The fine structure of the population distribution of amino acid replacements was investigated. The Global Initiative on Sharing All Influenza Data (GISAID) provides data about new variants of the SARS-CoV-2 virus that have been detected worldwide and enables rapid and open access to epidemic and pandemic virus data ⁸. The amino acid replacements table in GISAID contains 46,251 variants of SARS-CoV-2, and the first 2000 replacements include 154 different variants of the S-protein relative to the Wuhan-Hu-1 SARS-CoV-2 S-protein ⁶.

The fine structure of each codon that encoded each of the 154 amino acid replacements was analyzed. Specifically, the following were investigated: which plane in Fig. 1 contained each mutated codon, which codon numbers were involved in the mutation, whether the mutated letter was the first or the second letter, and which bases were exchanged. The results are summarized in Table 4.

Table 4: Plane dependence of variant population in the 154 amino acid replacements of the S-protein, and the codon numbers and constituent bases related to the mutations.

Plane No. in Fig. 1		The 1st plane		The 2nd plane		The 3rd plane		The 4th plane
Three letters codes	Bases	Codon numbers	Popu-lation	Codon numbers	Popu- lation	Codon numbers	Popu- lation	Codon numbers	Popu- lation	Total
The 1st letter	(C, A, G, T)	(0,1,2,3) (4,5,6,7) (8,9,10,11) (12,13,14,15)	15 (11%)	(16,17,18,19) (20,21,22,23) (24,25,26,27) (28,29,30,31)	10 (7%)	(32,33,34,35) (36,37,38,39) (40,41,42,43) (44,45,46,47)	2	(48,49,50,51) (52,53,54,55) (56,57,58,59) (60,61,62,63)	32 (24%)	59 (43%)
The 2nd letter	C↔G	(0,8) (1,9) (2,10) (3,11)	0	(16,24)(17,25) (18,26)(19,27)	1	(32,40)(33,41) (34,42)(35,43)	0	(48,56)(49,57) (50,58)(51,59)	2	3
	A↔T	(4,12) (5,13) (6,14) (7,15)	0	(20,28)(21,29) (22,30)(23,31)	1	(36,44)(37,45) (38,46)(39,47)	0	(52,60)(53,61) (54,62)(55,63)	1	2
	C↔A	(0,4) (1,5) (2,6) (3,7)	0	(16,20)(17,21) (18,22)(19,23)	0	(32,36)(33,37) (34,38)(35,39)	1	(48,52)(49,53) (50,54)(51,55)	2	3
	C↔T	(0,12) (1,13) (2,14) (3,15)	5	(16,28)(17,29) (18,30)(19,31)	15 (11%)	(32,44)(33,45) (34,46)(35,47)	3	(48,60)(49,61) (50,62)(51,63)	18 (13%)	41 (30%)
	A↔G	(4,8) (5,9) (6,10) (7,11)	1	(20,24)(21,25) (22,26)(23,27)	3	(36,40)(37,41) (38,42)(39,43)	1	(52,56)(53,57) (54,58)(55,59)	4	9
	G↔T	(8,12) (9,13) (10,14)(11,15)	4	(24,28)(25,29) (26,30)(27,31)	2	(40,44)(41,45) (42,46)(43,47)	1	(56,60)(57,61) (58,62)(59,63)	12 (9%)	19 (14%)
Total			25 (18%)		32 (23%)		8 (6%)		71 (52%)	136 (100%)

Each amino acid replacement is counted separately; for example, CCT 48→51 TCT (P251S, P330S, P384S, P479S, P631S) counts as 5, not 1. The total number of variants in the table is 136, not 154, because 18 variants are plane-vertically outward mutations. The population percentages are calculated relative to the 136 variants. See Table S4 in the Supplementary Material for more details of the amino acid replacements.

Table 4 contains all possible cases of single-letter mutations except plane-vertically outward mutations. Accordingly, almost all the 154 amino acid replacements occupy a well-defined position in Table 4. For example, the amino acid replacement of variant P681H (CCT 48→52 CAT) is a mutation within the 4th plane in which the second letter C is replaced by A. Consequently, the pair of codon numbers (48, 52) is found in the C↔A row of the 4th plane in Table 4, and the variant P681H, together with the variant T76N (ACT 49→53 AAT), accounts for the population of 2 at that position.
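The placement of a within-plane mutation in Table 4 can be illustrated with a small classifier. This is a sketch under stated assumptions: the base order C, A, G, T is taken from Table 1, the row labels follow those printed in Table 4, and `table4_cell` is a hypothetical helper rather than the method used to build the table.

```python
BASE_INDEX = {"C": 0, "A": 1, "G": 2, "T": 3}


def codon_number(codon: str) -> int:
    """Codon number (0-63) under the assumed Table 1 ordering."""
    first, second, third = (BASE_INDEX[b] for b in codon)
    return 16 * third + 4 * second + first


def table4_cell(ref: str, alt: str):
    """Return (plane, row label, codon-number pair) for a within-plane change."""
    ref, alt = ref.upper(), alt.upper()
    if ref[2] != alt[2]:
        # Third-letter changes are plane-vertically outward and not in Table 4.
        raise ValueError("third-letter change is not covered by Table 4")
    plane = BASE_INDEX[ref[2]] + 1
    pair = (codon_number(ref), codon_number(alt))
    if ref[0] != alt[0] and ref[1] == alt[1]:
        return plane, "1st letter", pair
    if ref[1] != alt[1] and ref[0] == alt[0]:
        # Order the two bases as in Table 4 (C before A before G before T).
        low, high = sorted((ref[1], alt[1]), key=BASE_INDEX.get)
        return plane, f"2nd letter {low}↔{high}", pair
    raise ValueError("expected exactly one changed letter")


if __name__ == "__main__":
    print("P681H", table4_cell("CCT", "CAT"))
    print("T76N ", table4_cell("ACT", "AAT"))
    print("P251S", table4_cell("CCT", "TCT"))
```

For the examples in the text, P681H and T76N both land in the 4th plane, row C↔A, with codon-number pairs (48, 52) and (49, 53), and P251S lands in the 4th plane, 1st-letter row, with the pair (48, 51).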

As already mentioned, the aggregate population of the genetic code organization in the S-protein gene showed that the 4th plane held 590 (46%) of the 1274 genetic codes. (See Table S3 in the Supplementary Material for more details.) Among the genetic codes involved in amino acid replacements, the 4th plane likewise held the majority, with a population of 71 (52%) (Table 4). Thus, the 4th plane dominates both the genetic code organization of the S-protein gene and the genetic codes involved in amino acid replacements. These fine structures also support the presence of a partition between the 1st–3rd planes and the 4th plane, or between codon numbers 47 and 48, in the S-protein gene.

Analysis of real lineages

Here, the cube-type periodic table was used to analyze real Lineage Comparison data obtained from Outbreak.info Variants ⁵. The results are shown in Table 5.

Table 5: Plane dependence of the code population of amino acid replacements in the S-protein in real lineages, and the constituent bases related to the mutations.

Plane No. in Fig. 1		The 1st plane	The 2nd plane	The 3rd plane	The 4th plane
Three letters codes	Bases	Population				Total
The 1st letter	(C, A, G, T)	1(α) 1(ο)	1(α)1(β) 1(γ)1(ο)	0	1(α)1(β)1(γ)5(γ) 1(δ)1(ο)2(ο)	18 (35%)
The 2nd letter	C↔G	0	1(δ)	0	1(δ)	2 (4%)
	A↔T	0	0	1(λ)	0	1 (2%)
	C↔A	1(γ)	1(δ)2(ο)	1 (γ)	2(α)1(β) 1(λ)1(ο)	10 (19%)
	C↔T	1(ο)	1(α) 1(β)	0	1 (γ) 1(λ)1(λ)	6 (12%)
	A↔G	1(ο)	1(ο)	1(δ)	1(α)2(β)1(γ)2(δ) 1(λ) 1(ο)2(ο)	13 (25%)
	G↔T	0	0	1(δ)	1(λ)	2 (4%)
Total		5 (9%)	11 (21%)	4 (8%)	32 (62%)	52 (101%)

1(α) is the population of lineage Alpha, 1(β) is the population of lineage Beta, and so on. Boldface indicates amino acid replacements within the receptor-binding domains of the S-proteins. See Table S5 in the Supplementary Material for more details of amino acid replacements in real lineages.

The 4th plane, with a population of 32 (62%), held the clear majority of the genetic codes involved in amino acid replacements (Table 5). Among the other three planes, the 2nd plane had a relatively large population of 11 (21%). Among the single-letter mutations of the three-letter code, mutations in the first and second letters had populations of 18 (35%) and 34 (65%), respectively, which suggests that the second letter may be preferentially mutated.

Table 6 shows the population of amino acid replacements in the receptor-binding domain (RBD) of the S-protein (amino acids 319 (code AGA) to 541 (code TTC)) ⁹, extracted from Table 5 and divided into lineage Omicron and the other lineages.

Table 6: Population of amino acid replacements in the RBD of the S-protein of real lineages.

	Population
Plane No. in Fig. 1	The 1st plane	The 2nd plane	The 3rd plane	The 4th plane	Total
Lineages other than Omicron
The 1st letter mutation	0	1(β) 1(γ)	0	1(α) 1(β) 1(γ)	1(α) 2(β) 2(γ)
The 2nd letter mutation	0	1(δ)	1(γ) 1(δ) 1(λ)	1(λ)	1(γ) 2(δ) 2(λ)
Total	0	β, γ, δ: 1 each	γ, δ, λ: 1 each	α, β, γ, λ: 1 each	10
Lineage Omicron
The 1st letter mutation	1(ο)	1(ο)	0	1(ο)	3
The 2nd letter mutation	2(ο)	3(ο)	0	1(ο)	6
Total	3(ο)	4(ο)	0	2(ο)	9

The population of amino acid replacements in the RBD shows that lineages Alpha, Beta, and Gamma had two mutations each in the 1st letter of the three-letter code on the whole, lineages Delta and Lambda had two mutations each in the 2nd letter on the whole, and lineage Omicron had three and six mutations in the 1st and 2nd letters, respectively. These results suggest that the positions of single-letter mutations are likely to shift from the 1st to the 2nd letters within the RBD as the virus continues to evolve.

Not only was the population of amino acid replacements within the RBD larger in lineage Omicron (population 9) than in each of the other lineages (populations of approximately 2), but the populations in the 1st and 2nd planes also showed marked percentage changes compared with those in the other five lineages. Indeed, lineage Omicron had three mutations in the 1st plane, whereas the other five lineages had no mutations in this plane. Lineage Omicron had four mutations in the 2nd plane, whereas lineages Beta, Gamma and Delta had one mutation each in this plane, and the other two lineages (Alpha and Lambda) had no mutations. In the 3rd and 4th planes, the population of mutations in lineage Omicron did not differ greatly from those in the other five lineages. These large changes in the population of mutations in the 1st and 2nd planes may have contributed to the changes that led to the evolution of the highly contagious Omicron virus. The variants in the 2nd letter of the genetic code in the 1st and 2nd planes were S375F, S477N, T478K, E484A and Q498R. Of these, only T478K was shared with lineage Delta, which may correspond to the patient population of the two lineages.
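Selecting the replacements that fall inside the RBD is a matter of reading the position from each replacement label. The sketch below is a minimal illustration that uses the RBD range of amino acids 319–541 quoted in the text; the helper names `parse_replacement` and `in_rbd` are assumptions introduced here and do not come from the data sources used in the study.

```python
import re

# RBD boundaries (amino acid positions) as quoted in the text.
RBD_START, RBD_END = 319, 541


def parse_replacement(label: str):
    """Split a label such as 'T478K' into (reference, position, replacement)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"unrecognised replacement label: {label!r}")
    ref, position, alt = match.groups()
    return ref, int(position), alt


def in_rbd(label: str) -> bool:
    """True if the replaced amino acid lies within the RBD range."""
    return RBD_START <= parse_replacement(label)[1] <= RBD_END


if __name__ == "__main__":
    for label in ("S375F", "S477N", "T478K", "E484A", "Q498R", "P681H"):
        print(label, in_rbd(label))
```

The five Omicron replacements listed above all fall inside the range, whereas P681H, used earlier as a 4th-plane example, lies outside it.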

A periodic table of the genetic code is proposed as a new tool for the analysis of genomes without the need for advanced mathematical and computational tools such as direct coupling analysis and neural network-based methods. Here, a cube-type periodic table was used to analyze the SARS-CoV-2 S-protein gene, and a partition between codon numbers 47 and 48, or between the 1st–3rd planes and the 4th plane, of the S-protein was discovered. This finding indicates that there were large differences in the population distributions of the genetic codes and amino acid replacements in the S-protein gene between two regions of the cube, the 1st–3rd planes and the 4th plane. A planes-linkage structure involved in the partition was also discovered. The code linkage multiformity supports the existence of the partition in the S-protein gene. Consequently, a simplified model for the S-protein gene was considered, in which a planes-linkage of the 4th plane and one of the 1st–3rd planes are linked together in alternate shifts; i.e., the ATG initiation codon (3rd plane)–five-member planes-linkage of the 4th plane–(some-member planes-linkage of the 1st–3rd planes)–(some-member planes-linkage of the 4th plane)–(some-member planes-linkage of the 1st–3rd planes). Structural analysis of the codes that constitute the planes-linkages may provide a new approach to studies of the S-protein gene.

Analysis of real lineages showed that, for the Omicron lineage, the population percentage of amino acid replacements within the RBD of the S-protein changed markedly in the 1st and 2nd planes compared with those for the other five lineages analyzed. These differences may have contributed to the changes that led to the evolution of the highly contagious Omicron virus. Because lineages Omicron and Delta are both highly contagious viruses, the amino acid replacement T478K that they share in the RBD may be associated with their high infectivity.

Future studies that combine the proposed new analytical tool with medical information and mathematical and computational methods, e.g., dynamical symmetry algebra and direct coupling analysis, are needed to confirm the results.

Data Availability

The data that support the findings of this study are openly available.

Acknowledgments

I thank CoV-GLUE, Outbreak.info, and NCBI for making SARS-CoV-2 genome information publicly available, and Springer Nature for permitting the reuse of Figure 1. I thank Margaret Biswas, PhD, from Edanz (https://jp.edanz.com/ac) for editing a draft of this manuscript.

ADDITIONAL INFORMATION

Conflicts of Interest

The author declares that there is no conflict of interest regarding the publication of this article.

Funding Statement

This research received no specific funding but was performed as part of the author’s employment by the Institute for Science Education, Shiga 520-0531, Japan.

References

1. Morimoto, S. A periodic table for genetic codes. J. Math. Chem. 32, 159–200 (2002).
2. Border, P. SARS-CoV-2 virus variants: a year into the COVID-19 pandemic. Rapid response, UK Parliament POST (published 27 January 2021). https://post.parliament.uk/.
3. Lamers, M.M. & Haagmans, B.L. SARS-CoV-2 pathogenesis. Nat. Rev. Microbiol. 20, 270–284 (2022).
4. CoV-GLUE, Amino acid variation database. http://cov-glue.cvr.gla.ac.uk/#/replacement.
5. Outbreak.info, SARS-CoV-2 (hCoV-19) Mutation Reports. https://outbreak.info/compare-lineages.
6. NCBI GenBank. https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2.
7. Morimoto, S. Application of a periodic table for the genetic code to influenza A/H3N2 virus. Nat. Prec. (2009). https://doi.org/10.1038/npre.2009.428.2.
8. Global Initiative on Sharing All Influenza Data (GISAID). https://www.gisaid.org/.
9. Huang, Y., Yang, C., Xu, X., Xu, W. & Liu, S. Structural and functional properties of SARS-CoV-2 spike protein: potential antivirus drug development for COVID-19. Acta. Pharm. Sinic. 41, 1141–1149 (2020).
