Fine structure of the population distribution of the codons for the S-protein
First, the specific structure of the S-protein gene was described using the cube-type periodic table. Each of the 1274 genetic codes organized in the S-protein gene [6] was allocated to one of the four planes in Fig. 1. The results for the first 24 codes (amino acids) are shown in Table 2.
Table 2: Allocation of the first 24 codes (amino acids 1–24) in the S-protein gene to the planes in Fig. 1.
| Amino acid No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Codon | ATG | TTT | GTT | TTT | CTT | GTT | TTA | TTG | CCA | CTA | GTC | TCT |
| Codon No. and amino acid | 45M | 63F | 62V | 63F | 60L | 62V | 31L | 47L | 16P | 28L | 14V | 51S |
| Plane No. in Fig. 1 | 3rd | 4th | 4th | 4th | 4th | 4th | 2nd | 3rd | 2nd | 2nd | 1st | 4th |

| Amino acid No. | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Codon | AGT | CAG | TGT | GTT | AAT | CTT | ACA | ACC | AGA | ACT | CAA | TTA |
| Codon No. and amino acid | 57S | 36Q | 59C | 62V | 53N | 60L | 17T | 1T | 25R | 49T | 20Q | 31L |
| Plane No. in Fig. 1 | 4th | 3rd | 4th | 4th | 4th | 4th | 2nd | 1st | 2nd | 4th | 2nd | 2nd |
Table 2 shows that amino acids 2–6 and 15–18 were seamlessly allocated to the 4th plane in Fig. 1; amino acids 12 and 13 also lie in the 4th plane, but amino acid 14 (CAG) lies in the 3rd plane and interrupts that run. Such seamless allocation to a single plane was observed throughout the 1274 genetic codes organized in the S-protein gene, and it was significantly more frequent for the 4th plane than for the other three planes. (See Table S1 in the Supplementary Material for details.) The planes-linkages were then grouped by linkage number, that is, the number of consecutive codes on the same plane, and counted for each of the four planes; the results are shown in Table 3.
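The codon numbers and plane assignments in Table 2 appear to follow a simple positional rule. The sketch below is an inference from the values in Table 2 and from the base order (C, A, G, T) used in Table 4, not a formula stated in the text: with C, A, G, T indexed 0–3, the codon number is 16 × (3rd letter) + 4 × (2nd letter) + (1st letter), so the 3rd letter alone fixes the plane. The names `codon_number` and `plane` are illustrative.

```python
# Assumed base order, inferred from Table 2 and Table 4: C=0, A=1, G=2, T=3.
BASE_INDEX = {"C": 0, "A": 1, "G": 2, "T": 3}

def codon_number(codon: str) -> int:
    """Codon number 0-63 under the inferred positional rule."""
    first, second, third = (BASE_INDEX[base] for base in codon.upper())
    return 16 * third + 4 * second + first

def plane(codon: str) -> int:
    """Plane 1-4 in Fig. 1; each plane holds 16 consecutive codon numbers."""
    return codon_number(codon) // 16 + 1

# First 24 codons of the S-protein gene, as listed in Table 2.
FIRST_24 = ("ATG TTT GTT TTT CTT GTT TTA TTG CCA CTA GTC TCT "
            "AGT CAG TGT GTT AAT CTT ACA ACC AGA ACT CAA TTA").split()

for position, codon in enumerate(FIRST_24, start=1):
    print(position, codon, codon_number(codon), plane(codon))
```

Under this rule every codon number and plane listed in Table 2 is reproduced, for example ATG gives codon number 45 on the 3rd plane.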
Table 3: Summary of constituent-codon members of the same planes-linkage value in the 1st–4th planes shown in Fig. 1.
| Plane No. in Fig. 1 | Linkage number | Total number | Sum |
|---|---|---|---|
| The 1st plane | 3 | 8 | 8 (7%) |
| The 2nd plane | 5 | 4 | 27 (25%) |
| | 4 | 5 | |
| | 3 | 18 | |
| The 3rd plane | 3 | 1 | 1 (1%) |
| The 4th plane | 8 | 2 | 73 (67%) |
| | 7 | 1 | |
| | 6 | 8 | |
| | 5 | 9 | |
| | 4 | 20 | |
| | 3 | 33 | |
| Sum | | 109 | |
See Table S2 in the Supplementary Material for more details.
Many planes-linkages of three or more codes were observed in the 1st–4th planes, especially in the 4th plane (73/109, or 67%), as shown in Table 3. Both the number of planes-linkages and the variety of their linkage numbers were much larger in the 4th plane than in any of the other three planes.
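The tally behind Table 3 can be sketched as a run-length count. This is a minimal illustration that assumes a planes-linkage is a run of at least three consecutive codes on the same plane (the smallest linkage number listed in Table 3); the function name `planes_linkages` is hypothetical, and only the 24 plane values from Table 2 are used because the full 1274-code sequence is not reproduced here.

```python
from collections import Counter
from itertools import groupby

def planes_linkages(planes, min_length=3):
    """Tally runs of consecutive codes on the same plane.

    Returns a Counter keyed by (plane, run length) for runs of at least
    min_length codes (min_length=3 is an assumption taken from Table 3).
    """
    tally = Counter()
    for plane_no, run in groupby(planes):
        length = sum(1 for _ in run)
        if length >= min_length:
            tally[(plane_no, length)] += 1
    return tally

# Plane numbers of amino acids 1-24, copied from Table 2.
TABLE2_PLANES = [3, 4, 4, 4, 4, 4, 2, 3, 2, 2, 1, 4,
                 4, 3, 4, 4, 4, 4, 2, 1, 2, 4, 2, 2]

for (plane_no, length), count in sorted(planes_linkages(TABLE2_PLANES).items()):
    print(f"plane {plane_no}: {count} linkage(s) of length {length}")
```

For these 24 codes the only runs of three or more lie on the 4th plane: one of length 5 (amino acids 2–6) and one of length 4 (amino acids 15–18).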
The results in Tables 2 and 3 together provide data for a simplified model of the S-protein gene in the SARS-CoV-2 genome as a well-ordered set of specific codon planes-linkages. Importantly, the model indicates that planes-linkages of the 4th plane alternate with planes-linkages of the 1st–3rd planes, as follows: ATG initiation codon (3rd plane) → five-member planes-linkage of the 4th plane → planes-linkage of some length in the 1st–3rd planes → planes-linkage of some length in the 4th plane → planes-linkage of some length in the 1st–3rd planes, and so on. Consequently, there is a partition between the 1st–3rd planes and the 4th plane, that is, between codon numbers 47 and 48. If the genetic codes that form the planes-linkages can be analyzed successfully, the results may provide clues to a rule governing the composition of the genetic code of the S-protein gene.
Second, further support for a partition in the S-protein gene is described. The aggregate population of the genetic code organization in the S-protein gene showed that the 1st, 2nd, 3rd, and 4th planes contained 203 (16%), 344 (27%), 137 (11%), and 590 (46%) of the 1274 genetic codes, respectively. Clearly, the code population in the 4th plane in Fig. 1 was much higher than the code populations in the other three planes. (See Table S3 in the Supplementary Material for more details.) These results support the finding that there is a partition between the 1st–3rd planes and the 4th plane, or between codon numbers 47 and 48, in the S-protein gene.
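The percentages quoted above can be checked with a few lines of arithmetic. The counts are those given in the text; the comparison with an even 25% split per plane is added here only as a reference point and is not part of the original analysis.

```python
# Code counts per plane, as stated in the text (Table S3 is not reproduced here).
PLANE_COUNTS = {1: 203, 2: 344, 3: 137, 4: 590}
TOTAL = sum(PLANE_COUNTS.values())

def share_percent(count, total=TOTAL):
    """Share of the total, rounded to the nearest whole percent."""
    return round(100 * count / total)

for plane_no, count in PLANE_COUNTS.items():
    print(f"plane {plane_no}: {count} codes, {share_percent(count)}% "
          f"(an even split would give 25%)")
```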
How each genetic code mutates
First, each single-letter mutation in the three-letter code will theoretically occur within one of the four planes in Fig. 1. However, in some exceptional cases, an amino acid replacement (e.g., H146Q) can arise from more than one mutant code, each probably produced by a single point mutation. In such cases, the genetic code mutates only to a code that faces it vertically outward from its own plane in Fig. 1 (the cube-type table), with the exception of variant W152R. Consider the three variants CCC 0→3 TCC (P25S), TCC 3→15 TTC (S94F), and CAC 4→20 CAA, 36 CAG (H146Q), where the number next to each genetic code is its codon number and the arrow indicates the mutation underlying the amino acid replacement. Hence, the amino acid replacement P25S is the result of a mutation from codon number 0 (code CCC) to codon number 3 (code TCC). The single-letter mutations in the amino acid replacements P25S and S94F therefore occur within the 1st plane in Fig. 1, whereas the single-letter mutation in variant H146Q goes from code CAC (codon number 4 in the 1st plane) to one of the vertically outward codes CAA (codon number 20 in the 2nd plane) or CAG (codon number 36 in the 3rd plane).
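The distinction between within-plane and vertically outward mutations can be sketched as follows. The codon-number rule used here (C, A, G, T indexed 0–3; number = 16 × 3rd letter + 4 × 2nd letter + 1st letter) is inferred from the codon numbers quoted in the text, and under that rule a change in the 1st or 2nd letter stays within a plane whereas a change in the 3rd letter moves to another plane. The function `classify_mutation` is illustrative only.

```python
# Assumed base order, inferred from the codon numbers in the text.
BASE_INDEX = {"C": 0, "A": 1, "G": 2, "T": 3}

def codon_number(codon):
    first, second, third = (BASE_INDEX[base] for base in codon)
    return 16 * third + 4 * second + first

def classify_mutation(old, new):
    """Return (mutated letter, kind, old codon number, new codon number)."""
    changed = [i for i in range(3) if old[i] != new[i]]
    if len(changed) != 1:
        raise ValueError("not a single-letter mutation")
    letter = changed[0] + 1
    # Only the 3rd letter changes the plane under the assumed numbering.
    kind = "vertically outward" if letter == 3 else "within plane"
    return letter, kind, codon_number(old), codon_number(new)

# The examples from the text: P25S, S94F, and the two routes to H146Q.
for old, new in [("CCC", "TCC"), ("TCC", "TTC"), ("CAC", "CAA"), ("CAC", "CAG")]:
    print(old, "->", new, classify_mutation(old, new))
```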
Second, each genetic code in Table 1 can be expressed as a period and a group. Roman numerals indicate the periods, and the smallest codon number in each column of Table 1 is used as the group number of the corresponding group. For example, the genetic code CCC is denoted as I-0, TCC as I-3, TTC as II-15, CAA as III-4, CAG as V-4, and so on; variant P25S can therefore also be expressed as I-0→I-3.
In the standard-type table (Table 1), a single-letter mutation theoretically occurs within the two periods that correspond to one plane in Fig. 1. Specifically, a mutation within the 1st plane in Fig. 1 occurs between two codes in periods I and II in Table 1, a mutation within the 2nd plane occurs between two codes in periods III and IV, and so on. In some exceptional cases, a single-letter mutation occurs within the same group between different periods. For example, variant H146Q occurs within group 4, between periods I and III or between periods I and V.
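The period and group notation can be related to the codon number with a short sketch. Table 1 is not reproduced here, so the following rests on two assumptions drawn from the examples in the text: the group is the codon number modulo 16 (CAA, number 20, and CAG, number 36, both belong to group 4), and plane k corresponds to periods 2k−1 and 2k. Which of the two periods a given code falls in is deliberately left open, and the function name is hypothetical.

```python
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]

def group_and_periods(number):
    """Return (group, (period_a, period_b)) for a codon number 0-63.

    Assumes group = number mod 16 and two periods per plane; the exact
    period within the pair would have to be read from Table 1.
    """
    if not 0 <= number <= 63:
        raise ValueError("codon number must be between 0 and 63")
    plane_index = number // 16          # 0 for the 1st plane, 3 for the 4th
    periods = (ROMAN[2 * plane_index], ROMAN[2 * plane_index + 1])
    return number % 16, periods

# H146Q: CAC (4) mutates to CAA (20) or CAG (36), all in group 4.
for number in (4, 20, 36):
    print(number, group_and_periods(number))
```

This is consistent with the notation in the text: codon 20 (CAA, III-4) falls in the period pair III and IV, and codon 36 (CAG, V-4) in the pair V and VI.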
Fine structure of the population distribution of amino acid replacements in the S-protein
The fine structure of the population distribution of amino acid replacements was investigated. The Global Initiative on Sharing All Influenza Data (GISAID) provides data on new variants of the SARS-CoV-2 virus detected worldwide and enables rapid and open access to epidemic and pandemic virus data [8]. The amino acid replacements table in GISAID contains 46,251 variants of SARS-CoV-2, and the first 2000 replacements include 154 different variants of the S-protein relative to the Wuhan-Hu-1 SARS-CoV-2 S-protein [6].
The fine structure of the codon underlying each of the 154 amino acid replacements was analyzed. Specifically, the following were investigated: which plane in Fig. 1 contained each mutated codon, which codon numbers were involved in the mutation, whether the mutated letter was the first or the second letter, and which bases were exchanged. The results are summarized in Table 4.
Table 4: Plane dependence of variant population in the 154 amino acid replacements of the S-protein, and the codon numbers and constituent bases related to the mutations.
| Three-letter code | Bases | The 1st plane: codon numbers | The 1st plane: population | The 2nd plane: codon numbers | The 2nd plane: population | The 3rd plane: codon numbers | The 3rd plane: population | The 4th plane: codon numbers | The 4th plane: population | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| The 1st letter | (C, A, G, T) | (0,1,2,3) (4,5,6,7) (8,9,10,11) (12,13,14,15) | 15 (11%) | (16,17,18,19) (20,21,22,23) (24,25,26,27) (28,29,30,31) | 10 (7%) | (32,33,34,35) (36,37,38,39) (40,41,42,43) (44,45,46,47) | 2 | (48,49,50,51) (52,53,54,55) (56,57,58,59) (60,61,62,63) | 32 (24%) | 59 (43%) |
| The 2nd letter | C↔G | (0,8) (1,9) (2,10) (3,11) | 0 | (16,24) (17,25) (18,26) (19,27) | 1 | (32,40) (33,41) (34,42) (35,43) | 0 | (48,56) (49,57) (50,58) (51,59) | 2 | 3 |
| | A↔T | (4,12) (5,13) (6,14) (7,15) | 0 | (20,28) (21,29) (22,30) (23,31) | 1 | (36,44) (37,45) (38,46) (39,47) | 0 | (52,60) (53,61) (54,62) (55,63) | 1 | 2 |
| | C↔A | (0,4) (1,5) (2,6) (3,7) | 0 | (16,20) (17,21) (18,22) (19,23) | 0 | (32,36) (33,37) (34,38) (35,39) | 1 | (48,52) (49,53) (50,54) (51,55) | 2 | 3 |
| | C↔T | (0,12) (1,13) (2,14) (3,15) | 5 | (16,28) (17,29) (18,30) (19,31) | 15 (11%) | (32,44) (33,45) (34,46) (35,47) | 3 | (48,60) (49,61) (50,62) (51,63) | 18 (13%) | 41 (30%) |
| | A↔G | (4,8) (5,9) (6,10) (7,11) | 1 | (20,24) (21,25) (22,26) (23,27) | 3 | (36,40) (37,41) (38,42) (39,43) | 1 | (52,56) (53,57) (54,58) (55,59) | 4 | 9 |
| | G↔T | (8,12) (9,13) (10,14) (11,15) | 4 | (24,28) (25,29) (26,30) (27,31) | 2 | (40,44) (41,45) (42,46) (43,47) | 1 | (56,60) (57,61) (58,62) (59,63) | 12 (9%) | 19 (14%) |
| Total | | | 25 (18%) | | 32 (23%) | | 8 (6%) | | 71 (52%) | 136 (100%) |
A codon change that underlies several amino acid replacements, such as CCT 48→51 TCT (P251S, P330S, P384S, P479S, P631S), is counted once per replacement (here 5, not 1). The total number of variants in Table 4 is 136 rather than 154 because 18 variants are plane-vertically outward mutations. The population percentages are calculated relative to the 136 variants. See Table S4 in the Supplementary Material for more details of the amino acid replacements.
Table 4 contains all possible cases of single-letter mutations except plane-vertically outward mutations. Accordingly, almost all of the 154 amino acid replacements occupy a well-defined position in Table 4. For example, the amino acid replacement of variant P681H (CCT 48→52 CAT) is a mutation within the 4th plane in which the second letter C is replaced by A. Consequently, the pair of codon numbers (48, 52) has a well-defined position in Table 4, and variant P681H, together with variant T76N (ACT 49→53 AAT), accounts for the population of 2 at that position.
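Locating a replacement in Table 4 can be sketched as a small lookup. The code below assumes the codon-number rule inferred from the tables (C, A, G, T indexed 0–3; number = 16 × 3rd letter + 4 × 2nd letter + 1st letter) and uses the row labels of Table 4; the function `table4_cell` and the example list are illustrative, not part of the original analysis.

```python
from collections import Counter

# Assumed base order, inferred from Table 4: C=0, A=1, G=2, T=3.
BASE_INDEX = {"C": 0, "A": 1, "G": 2, "T": 3}

def codon_number(codon):
    first, second, third = (BASE_INDEX[base] for base in codon)
    return 16 * third + 4 * second + first

def table4_cell(old, new):
    """Return (mutated letter, bases label, plane, codon-number pair)."""
    changed = [i for i in range(3) if old[i] != new[i]]
    if len(changed) != 1 or changed[0] == 2:
        # 3rd-letter (plane-vertically outward) changes are not in Table 4.
        raise ValueError("Table 4 covers single 1st- or 2nd-letter mutations only")
    i = changed[0]
    if i == 0:
        bases = "(C, A, G, T)"           # Table 4 pools all 1st-letter changes
    else:
        bases = "↔".join(sorted((old[i], new[i]), key=BASE_INDEX.get))
    plane_no = codon_number(old) // 16 + 1
    numbers = tuple(sorted((codon_number(old), codon_number(new))))
    return i + 1, bases, plane_no, numbers

# Examples quoted in the text: P681H, T76N, and P251S.
EXAMPLES = {"P681H": ("CCT", "CAT"), "T76N": ("ACT", "AAT"), "P251S": ("CCT", "TCT")}
cell_counts = Counter(table4_cell(old, new)[:3] for old, new in EXAMPLES.values())
for name, (old, new) in EXAMPLES.items():
    print(name, table4_cell(old, new))
```

P681H and T76N fall into the same cell (2nd letter, C↔A, 4th plane), which matches the population of 2 reported there.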
As already mentioned, the aggregate population of the genetic code organization in the S-protein gene showed that the 4th plane contained 590 (46%) of the 1274 genetic codes. (See Table S3 in the Supplementary Material for more details.) The 4th plane likewise accounted for the majority, 71 (52%), of the genetic codes involved in amino acid replacements (Table 4). The 4th plane thus dominates both the genetic code organization of the S-protein gene and the codes involved in amino acid replacements. These fine structures also support the presence of a partition between the 1st–3rd planes and the 4th plane, or between codon numbers 47 and 48, in the S-protein gene.
Analysis of real lineages
Here, the cube-type periodic table was used to analyze real Lineage Comparison data obtained from Outbreak.info Variants [5]. The results are shown in Table 5.
Table 5: Plane dependence of the code population of amino acid replacements in the S-protein in real lineages, and the constituent bases related to the mutations.
| Three-letter code | Bases | The 1st plane (population) | The 2nd plane (population) | The 3rd plane (population) | The 4th plane (population) | Total |
|---|---|---|---|---|---|---|
| The 1st letter | (C, A, G, T) | 1(α) 1(ο) | 1(α) 1(β) 1(γ) 1(ο) | 0 | 1(α) 1(β) 1(γ) 5(γ) 1(δ) 1(ο) 2(ο) | 18 (35%) |
| The 2nd letter | C↔G | 0 | 1(δ) | 0 | 1(δ) | 2 (4%) |
| | A↔T | 0 | 0 | 1(λ) | 0 | 1 (2%) |
| | C↔A | 1(γ) | 1(δ) 2(ο) | 1(γ) | 2(α) 1(β) 1(λ) 1(ο) | 10 (19%) |
| | C↔T | 1(ο) | 1(α) 1(β) | 0 | 1(γ) 1(λ) 1(λ) | 6 (12%) |
| | A↔G | 1(ο) | 1(ο) | 1(δ) | 1(α) 2(β) 1(γ) 2(δ) 1(λ) 1(ο) 2(ο) | 13 (25%) |
| | G↔T | 0 | 0 | 1(δ) | 1(λ) | 2 (4%) |
| Total | | 5 (9%) | 11 (21%) | 4 (8%) | 32 (62%) | 52 (100%) |
1(α) is the population of lineage Alpha, 1(β) is the population of lineage Beta, and so on. Boldface indicates amino acid replacements within the receptor-binding domains of the S-proteins. See Table S5 in the Supplementary Material for more details of amino acid replacements in real lineages.
The 4th plane accounted for the large majority, 32 (62%), of the genetic codes involved in amino acid replacements (Table 5). Among the other planes, the 2nd plane had a relatively large population of 11 (21%). Among the single-letter mutations of the three-letter code, mutations in the first and second letters had populations of 18 (35%) and 34 (65%), respectively, which suggests that the second letter may be preferentially mutated.
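The totals quoted above follow directly from the cell populations in Table 5. The sketch below re-enters the per-plane population of each row (summed over lineages) and recomputes the plane totals and the split between 1st-letter and 2nd-letter mutations; the dictionary layout is an illustrative choice.

```python
# Per-plane populations from Table 5, summed over lineages.
# Each list holds the 1st, 2nd, 3rd, and 4th plane values for one row.
TABLE5 = {
    ("1st letter", "(C, A, G, T)"): [2, 4, 0, 12],
    ("2nd letter", "C↔G"): [0, 1, 0, 1],
    ("2nd letter", "A↔T"): [0, 0, 1, 0],
    ("2nd letter", "C↔A"): [1, 3, 1, 5],
    ("2nd letter", "C↔T"): [1, 2, 0, 3],
    ("2nd letter", "A↔G"): [1, 1, 1, 10],
    ("2nd letter", "G↔T"): [0, 0, 1, 1],
}

plane_totals = [sum(column) for column in zip(*TABLE5.values())]
grand_total = sum(plane_totals)
first_letter = sum(sum(row) for (letter, _), row in TABLE5.items()
                   if letter == "1st letter")
second_letter = grand_total - first_letter

print("plane totals:", plane_totals, "grand total:", grand_total)
print(f"1st letter: {first_letter} ({round(100 * first_letter / grand_total)}%)")
print(f"2nd letter: {second_letter} ({round(100 * second_letter / grand_total)}%)")
```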
Table 6 shows the population of amino acid replacements in the receptor-binding domain (RBD) of the S-protein (amino acids 319 (code AGA) to 541 (code TTC)) [9], extracted from Table 5 and divided into lineage Omicron and the other lineages.
Table 6: Population of amino acid replacements in the RBD of the S-protein of real lineages.
| Lineages | Plane No. in Fig. 1 | The 1st plane | The 2nd plane | The 3rd plane | The 4th plane | Total |
|---|---|---|---|---|---|---|
| Lineages other than Omicron | The 1st letter mutation | 0 | 1(β) 1(γ) | 0 | 1(α) 1(β) 1(γ) | 1(α) 2(β) 2(γ) |
| | The 2nd letter mutation | 0 | 1(δ) | 1(γ) 1(δ) 1(λ) | 1(λ) | 1(γ) 2(δ) 2(λ) |
| | Total | 0 | β, γ, δ: 1 each | γ, δ, λ: 1 each | α, β, γ, λ: 1 each | 10 |
| Lineage Omicron | The 1st letter mutation | 1(ο) | 1(ο) | 0 | 1(ο) | 3 |
| | The 2nd letter mutation | 2(ο) | 3(ο) | 0 | 1(ο) | 6 |
| | Total | 3(ο) | 4(ο) | 0 | 2(ο) | 9 |
The population of amino acid replacements in the RBD shows that, overall, lineages Alpha, Beta, and Gamma had one, two, and two mutations, respectively, in the 1st letter of the three-letter code, lineages Delta and Lambda had two mutations each in the 2nd letter, and lineage Omicron had three and six mutations in the 1st and 2nd letters, respectively. These results suggest that the positions of single-letter mutations are likely to shift from the 1st to the 2nd letter within the RBD as the virus continues to evolve.
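The shift from 1st-letter to 2nd-letter mutations can be made explicit by computing, for each lineage, the fraction of its RBD replacements that involve the 2nd letter. The counts are the row totals of Table 6; the dictionary and function names are illustrative, and the fractions rest on very small counts, so they are descriptive only.

```python
# (1st-letter, 2nd-letter) RBD mutation counts per lineage, from Table 6.
RBD_COUNTS = {
    "Alpha": (1, 0),
    "Beta": (2, 0),
    "Gamma": (2, 1),
    "Delta": (0, 2),
    "Lambda": (0, 2),
    "Omicron": (3, 6),
}

def second_letter_fraction(first, second):
    """Fraction of a lineage's RBD replacements that change the 2nd letter."""
    return second / (first + second)

for lineage, (first, second) in RBD_COUNTS.items():
    print(f"{lineage}: {first + second} RBD replacements, "
          f"{second_letter_fraction(first, second):.2f} in the 2nd letter")
```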
In lineage Omicron, the population of amino acid replacements within the RBD was larger (9) than in any of the other lineages (approximately 2 each), and the populations in the 1st and 2nd planes also differed markedly from those in the other five lineages. Indeed, lineage Omicron had three mutations in the 1st plane, whereas the other five lineages had no mutations in this plane. Lineage Omicron had four mutations in the 2nd plane, whereas lineages Beta, Gamma, and Delta had one mutation each in this plane, and the other two lineages (Alpha and Lambda) had none. In the 3rd and 4th planes, the population of mutations in lineage Omicron did not differ greatly from those in the other five lineages. These large changes in the population of mutations in the 1st and 2nd planes may have contributed to the evolution of the highly contagious Omicron virus. The variants in the 2nd letter of the genetic code in the 1st and 2nd planes were S375F, S477N, T478K, E484A, and Q498R. Of these, only T478K was shared with lineage Delta, which may correspond to the patient populations of the two lineages.