Recovery of SARS-CoV-2 proteomes
Proteomes from 92 isolates of SARS-CoV-2 were recovered in FASTA format using the GenBank database for reference. Metadata related to the collection date, city, host, and source of isolation were also downloaded. Most isolates were col- lected by nasopharyngeal (n = 31) and oropharyngeal (n = 12) swabs, and to a lesser extent by bronchoalveolar lavage (n = 11). Sequencing was performed from December 2019 (the Wuhan-Hu-1 strain) to March 11, 2020 (the SARS-CoV- 2/human/USA/PC00101P/2020 isolate). Most sequences came from the USA (n = 51) and China (n = 27). Only one isolate came from LATAM (Brazil, February 2020). The entire metadata set of proteomes used, as well as access identification, is shown in (Supplementary Table 2).
Identification of structural and non-structural proteins, their mechanism of action in SARS-CoV-2, and their conservation in proteomes
"The spike protein acts like a Trojan horse".
SP plays a fundamental role in SARS-CoV-2 infection because it mediates the entry of the virus into host cells through its S1 domain, which binds to ACE2 receptor making the subsequent cleavage of the S2 domain. The increased glycosylation of SP allows it to evade the adaptive immune response and protect epitopes from recognition and neutralisation by antibodies. The immune response through viral and RNA receptors is the first line of defence against SARS-CoV-2 infection, producing cytokines and IFN-139. Some proteins encoded by the coronavirus genome are capable of interfering with the innate immune response from the early phase of infection, which affects various signalling pathways important in maintaining immunity40 (Supplementary Fig. S2). Thus, identifying the mechanisms of action, characteristics of structural and NSPs, and accessory transcripts within the process of recognition, entrance, invasion and replication is important for finding potential vaccine targets. When given a set of proteins and knowing their mechanism of action inside the cell, identifying the NSPs may be an early approach to avoiding extensive immune involvement. In particular, NSPs necessary for replication and produced at an early stage should be considered. Considering structural protein production is also important as it may be an essential means by which to cover the SARS-CoV-2 life cycle.
We considered the structural, non-structural and accessory proteins of 92 proteomes to identify conserved regions according to frequency and alignment. Our findings are shown in (Supplementary Table 3). The conserved sequence blocks were used in their entirety for the prediction of conserved epitopes.
Most frequently identified HLA-I and HLA-II alleles in LATAM
We identified 168 of the most frequent HLA-I alleles in 13 countries. The HLA-A*02:01 allele was found at an above average frequency in Argentina, Brazil, Chile, Colombia, Cuba, Ecuador, Nicaragua, Peru and Venezuela. The alleles A*24:02 and B*40:02 were identified above the 80th percentile; they rank first in frequency in most countries. Some countries had missing data and small sample sizes, e.g. Guatemala (where only the B*53:01 allele was found above the mean) and Trinidad and Tobago (where the only available allele, C*16:02, was used for complementary analyses).
As shown in (Supplementary Table 4), the frequencies found between the HLA-I loci were A (n = 126), B (n = 264) and C (n = 64); those above the mean frequencies were A (n = 48), B (n = 91) and C (n = 27). For HLA-II, the allele and allele groups had the following order of frequency: DRB1*04, DRB1*13, DRB1*07, DRB1*11, DRB1*08: 02, DRB1*15, DRB1*08, DRB1*01, DRB1*03 and DRB1*16:02. However, all DRB1 alleles listed by NetMHC 3.2, spanning 16 countries, had above average frequencies (Supplementary Table 5). (Supplementary Fig. S3) shows the distribution of the main alleles and allelic groups by country according to HLA-I and HLA-II.
HLA systems and SARS-CoV-2 proteomes
HLA systems in humans are located in the most polymorphic region of the genome, which is related to continuous selection in its interaction with extracellular and intracellular pathogens. Furthermore, this diversity corresponds to rapid adaptation to environmental change, as well as records of social conditions. This has become a hallmark of various human populations after their first few migrations41.
When identifying vaccine targets for SARS-CoV-2 and other emerging pathogens of probable zoonotic origin, it may be possible to generate an immunological response that develops memory. While this would only work in certain populations as it is restricted to HLA molecules, it could occur because of the diverse affinities that HLA-I and II molecules have for different sets of peptides that are processed from SARS-CoV-2.
We propose a scenario in which all the proteins encoded by the SARS-CoV-2 genomes may be susceptible to recognition by HLA-I and II molecules. This is possible because post-transcriptional changes do not occur in the proteins encoded by the viral genome. However, post-translational modifications may occur, which is inconvenient for experimental interpretation of results obtained from immunogenic in silico analyses.
Proteins synthesised in the cytoplasm without prediction of the transmembrane domain may be less likely to be changed by glycosylation. However, in studies of coronaviruses closely related to SARS-CoV-2, the polyprotein pp1a/b synthesised in the cytoplasm can result in NSP3, NSP4 and NSP6, which despite being non-structural proteins are susceptible to glycosylation given their proximity to the endoplasmic reticulum42. The structural proteins SP, MG and envelope protein can harbour a greater number of post-translational modifications, with N-glycosylations being most likely. These modifications can vary among proteins and are not necessarily related to the evasion of the immune system. NP, although a structural protein, has modifications related to phosphorylation that optimise its recognition of RNA. It is therefore an important vaccine target because of its greater expression during the infection42.
In the SP of SARS-CoV-2, glucan shields with densities that limit probable antibody recognition have not been identified43. However, its interaction with innate immunity remains unknown. With these modifications, HLA may be able to recognise and present the peptides derived from antigenic processes when they are captured by dendritic cells44. This has been most frequently studied in cancer, evidenced in T and B cell primers, which are in some cases even more immunogenic45–47. Because they are naturally recognised as epitopes, their in vitro recognition can be included in complementary studies. It quickly identifies the immunogenic regions susceptible to being glycosylated and therefore optimises the potential vaccine formulations.
The best option for producing a SARS-CoV-2 vaccine in a short period is not only the construction of the best targets with potential immunogenicity but also the development of libraries of potential vaccine peptides, including the identification of experimental post-translational changes, especially those affecting immune recognition. Affinity and antigenicity should be selected as the most important characteristics to initiate specific antigenic recognition. The proteins responsible for virus replication and those that perform cleavage to become functional are important vaccine and pharmacology targets since they take part in the early stages of infection. The antigen presentation process is shown in (Supplementary Fig. S4), with special emphasis placed on SARS-CoV-2. Identification of potential epitopes (proteins or antigenic peptides with a high affinity for HLA-I or II molecules) from a certain population with more frequent HLAs could lead to adequate antigen presentation of TLs. This may be used to activate a favourable adaptive response and generate immunological memory. HLA-I molecules present epitopes to TL CD8+, which are cytotoxic, lyse infected cells, initiate apoptosis and prevent the formation of new virions. When HLA-II molecules present epitopes to TL CD4+1,48, they promote the formation and maturation of new lineages of TLs for the production of specific antibodies by B lymphocytes (BLs).
Recent experimental studies on SARS-CoV in patients with COVID-19 have identified the structural proteins NP, SP and MG of SARS-CoV-2 as targets for the antibodies in the serum of patients with a neutralising capacity49. These proteins are the main targets for stimulating the creation of antibodies by BLs. These characteristics are included in our vaccine proposal, which is summarised in Figs. 2 and 3.
Potential epitopes of TLs, CD4+, CD8+ and BLs are contained in the conserved sequences of 92 SARS- CoV-2 proteomes
To identify and select the potential HLA-II epitopes, we used 19 HLA-II molecules from the available DRB gene alleles to perform an agglutination prediction using NetMHCII 2.3. We excluded the conserved sequences that were <15 mers from SP, NP, and MG proteins. In total, 199 peptides with strong binding capacities and antigens were identified. The proteins with the highest number of agglutinations was SP (n = 114), followed by MG (n = 63) and NP (n = 23). Discrimination on the origin of predictions from the conserved sequence is included in (Supplementary Table 6).
HLA-II alleles found to be strong binders of various peptides (promiscuous alleles) were identified in the following order of frequency: DRB1*04:02 (n = 50), DRB1*09:01 (n = 35), DRB1*08:01 (n = 33), DRB1*16:02 (n = 33), DRB1*04:03 (n = 30), DRB1*07:01 (n = 24), DRB1*01:01 (n = 23) and DRB1*04:05 (n = 20). (Supplementary Table 7) summarises the conserved sequences present in the SP binding domain of interest; the DRB1*01:01, DRB1*04:05, DRB1*10:01 and DRB1*16:02 alleles were found to be promiscuous in this domain (predicted to be strong binders). The allele DRB1*16:02 is found relatively more frequently in LATAM; therefore, it may be related to clinical outcomes and deserve to be studied in more detail.
The 15 predicted mers, with antigenic and strong binding characteristics, were evaluated for allergenic and toxic character- istics using AllerTOP and Toxin Ped, respectively. To achieve more precise identification of strong binders from the 167 most frequent HLA-I molecules in LATAM, we used two algorithms based on artificial neural networks identified as being the most accurate by IEBD (as of March 15, 2019).
A set of 944 peptides were classified as strong binders and antigens; these were the result of a consensus from the two algorithms. The proteins with the highest number of predictions in order of frequency were as follows: the ORF1 transcript (n= 573), SP (n = 163), MG (n = 51), the ORF-3a transcript (n = 43) and NP (n = 34). The contribution made by each conserved sequence to these proteins is included in (Supplementary Table 8).
A prediction was not possible only for the HLA-B*51:10 allele because it was not found on the available list. In addition to the antigenicity and strong binding characteristics of these groups, we also considered those with vaccine potential, and potential immunogenic, non-allergenic and non-toxic characteristics to be on our list. Adaptive immunity protection is now known to change the natural course of the disease by neutralising SARS-CoV-250. The most recent experimental structures of the SP and NP proteins, found in RCSB PDB (https://www.rcsb.org/) with the following IDs, were used: 6lzg, 6m0j, 6vw1, 6w41, 6yla, 6yor, 6csb, 6vxx, 6vyb, 6m3m, 6vyo and 6wkp. These structures have special characteristics, such as binding to ACE2 receptor, an antibody extracted from a convalescent SARS patient named CR302251, and conformational pre-fusion in the open and closed state. Analyses were performed using the Discotope 2.0 server adjusted to 80% specificity.
We used the prediction algorithm on all proteins based on their available 3D structure, using the specific SP or NP chains. The predictions, including the characteristics of the complexes found, the intervals, and frequency with which probable epitopes they were identified in each structure, are shown in (Supplementary Table 9). For SP we have indicated three possible regions of continuous epitopes and one discontinuous that may be important due to the extracellular access. The areas between residues 443-450, 487-494, 496-506 and discontinuous 454-459-460-469-471 seem to be more frequently related to probable interactions with BL. A principal feature of this predictions, is absence of glycosylation in this residues.
For NP, we identified five regions of possible continuous epitopes and two discontinuous ones for RNA binding domain. On average, the intervals are longer than SP. The regions comprise the intervals in the residues: 59-64, 91-106, 120-130, 136-148, 150-156 and discontinuous between different intervals in the residues 66-82, 115-130, 163-171. NP and SP with linear and discontinues epitopes zones described above, are illustrated in (Supplementary Fig. S5).
The linear epitope intervals of SP and NP described in the previous structural approach were taken into account to identify longer sequences (>15 mers), which would be capable of evoking a response mediated by the BL receptor. For this approach, we used the ABCpred server and adjusted specificity to 90%.
Other filters, which included predicting epitopes in the functional domains using RDB or the RNA binding site for NP, antigenicity, toxicity, allergenicity and the identification of conserved sequences in SARS-CoV with experimental immunological evidence available at IEBD, were used to identify two ideal options. These are presented in (Supplementary Table 10). After the filters were applied, the best options were selected to choose the BL candidates to be incorporated into the multi-epitope construction. These sequences are summarised in (Supplementary Table 11).
Peptides with vaccine potential are characterised by more of the most frequent alleles of LATAM in con- served sequences from SARS-CoV-2 proteomes
Peptide matrices with vaccine potential were constructed to identify peptides associated with the most frequent HLA-I and HLA-II molecules in LATAM. The least number of peptides found to cover the most frequent LATAM alleles is summarised in Table 1. These ‘potential promiscuous vaccine peptides’ (PPVPs) share characteristics for the optimised construction of multi-epitopes, including being antigenic, non-toxic and non-allergenic. This group of peptides contain conserved and partial sequences that are experimentally proven to bind to various molecular targets that mediate the specific immune response against SARS-CoV. For HLA-II molecules, DDSEPVLKGVKLHYT, the peptide conserved in SARS-CoV and contained in the SP, is likely the best candidate for an antigenic peptide; it is also valid because of experimental evidence of its binding to targets in restricted HLA, TLs and BLs. Other partial sequences with experimental validity were identified in a 90% BLAST of structural proteins including MG and SP.
For HLA-I molecules, the majority of predictions corresponded to the ORF1 transcript, in which the KVKYLYFIK peptide was experimentally valid and conserved in SARS-CoV with a high antigenicity score. Only two other non-ORF1 peptides, ITLCFTLKR conserved in ORF7a and WTAGAAAYY of SP, were found in this group.
The ORF1 transcript and the structural proteins SP and MG have the highest recognition by the most frequent alleles in LATAM
The peptides found with the highest promiscuity in HLA-I molecules were MPYFFTLLL (n = 66), which is from the ORF1 transcript, followed by epitopes conserved in SARS-CoV SP, FAMQMAYRF (n = 58), and ORF1, FLLNKEMYL (n = 42). Those with the highest promiscuity in HLA-II molecules were a partially conserved SARS-CoV epitope found in MG, SFRLFARTRSMWSFN (n = 7), followed by two predictions from the SP protein, QSIIAYTMSLGAENS (n = 4) and VLSFELLHAPATVCG (n = 4). The latter was partially recognised by tests on BL and HLA molecules in SARS-CoV.
The alleles with the highest promiscuity for all the conserved proteins in SARS-CoV-2 proteomes (restricted to 9 mers) were preferentially found in the C locus. The alleles found in order of frequency were as follows: HLA-C*08:01-HLA-C*08:03 (n = 98), HLA-C*16:01 (n = 97), HLA-C*03:05 (n = 94), HLA-C*03:03-HLA-C*03:04 (n = 92), HLA-C*03:02-HLA-C*15:03 (n = 90) and HLA-C*16:02-HLA-C*12:03-HLA-C*01:06-HLA-C*15:02 (n = 89). Alleles with relatively low, but higher than average, frequencies belonged to the countries Colombia-Brazil, Colombia-NIC and Colombia-Costa Rica-NIC. Those around the 50th percentile belonged to the countries Brazil-Colombia-NIC-Peru, Colombia-Venezuela and Brazil-Colombia-NIC, respectively with the alleles previously mentioned. The alleles most frequently found in LATAM included HLA-A*02:01, HLA-A*24:02 and HLA-B*40:02. Based on their capacity to recognise peptides derived from SARS-CoV-2 these alleles were ranked 81st, 119th and 154th, respectively.
Flexible peptide-protein coupling signals favourable energy and anchorage residues to HLA molecules
To estimate the conformation of the complexes, we used as receptors the experimental structures of the alleles DRB1_0401, DRB1_1101, HLA-B*35:01 and HLA-C*07:02 (IDs in RCSB-PDB: 5NI9, 6CPN, 1XH3 and 5VGE, respectively). As ligands of the peptides, we used SFRLFARTRSMWSFN and FAMQMAYRF.
The best models were those contained in clusters with the highest density, in accordance with the average RSMD obtained from each simulation and their adequate position in the HLA cleft. The results are shown in Fig. 4. A map of the contacts between peptides and HLA-I molecules is shown with residues found to more frequently interact in the clusters (functioning as an anchor and generally located in the first and last residues) included. They also display some interactions with β -2 globulin, which serve to stabilise the α1 and α2 chains where the binding groove is formed and facilitate peptide bonding52.
In HLA-II, the regions that most frequently interact as anchors are FARTRSMWS for DRB1_0401 and FRLFARTRS for DRB1_1101 (although they differ from the core sequence of the prediction). The residues 4, 5 and 7 most frequently interact, giving a greater number of anchors than those in HLA-I; this results in flanking regions with small deviations in the peptide backbone and differing HLA cleft accommodation.
Construction of the proposed multi-epitope vaccine
In the N terminal, β -defensins have a range of immune responses related to the maturation of cells that mediate innate immunity, such as dendritic cells and TLs, as well as antiviral activities53,54. Another adjuvant used was the universal memory TL helper peptide (TpD), an auxiliary peptide that can aid memory generation as a target of TL CD4+55. The adjuvant Pan DR T helper epitope (PADRE) was also attached to the construct; it is relevant in multi-epitope constructs given its ability to potently stimulate the innate and humoral immune systems through generation of high and specific IgG titers. It can also overcome barriers, indicated by the high diversity among HLA molecules; hence, it reaches a larger population and is safe56,57. Towards the C terminal, a peptide domain capable of interacting with M cells and mediating up to an 8-fold increase in intestinal absorption was added58.
The linkers used as spacers between the HTL GPGPG and AYY epitopes are recognised as suitable spacers that facilitate antigen presentation by directly interacting with transport and assembly mediators to HLA molecules59. The di-lysine KK that separates epitopes from BLs is located close to the C terminal.
Among the adjuvants used were the EAAAK linkers: efficient separators between the domains present in the multi-epitope construct60. The following linkers were used in the vaccine, which contained 510 amino acids: 6 EAAK, 6 GPGPG, 11 AAY and 5 KK. The proposed order of the construct is presented in Fig. 5.
The physicochemical properties of the multi-epitope construct are consistent with the requirements for generating an immune response in an experimental model
To generate a construct that was safe, stable and capable of evoking an immune response, the antigenic, non-allergenic and non-toxic properties of the multi-epitope construct were established using Vaxijen 2.0, Allergen FP 1.0 and ToxinPred, respectively. The physicochemical characteristics of the multi-epitope construct, along with special consideration for the final CTGKSC peptide, are shown in Table 2. In addition to generating a safe construct, it was necessary to identify a thermostable multi-epitope construct, indicated by the aliphatic index, for laboratory testing. Solubility and thermostability are associated with adequate overexpression in Escherichia coli (E. coli), which is the bacteria most commonly used to produce recombinant proteins61.
Given the parameters from the primary structure of the multi-epitope construct, adequate production in vitro can be inferred because of overexpression in E. coli and observed safety in immunological studies.
Three-dimensional structure modelling and validation of the multi-epitope vaccine construct
The predicted structural conformation of our construct can be correlated with functional annotation and other multi-epitope constructs. The secondary structure was analysed using the SOPMA and PSIPRED servers, which revealed the presence of 43% α-helix, 21% β -foil, 30% coils and 6% β -turns in the vaccine’s construction (Supplementary Figs. S6 and S7). A reliable approximation of the three-dimensional structure of the construct can be used for further in-depth studies. Molecular dynamics can be used to analyse coupling stability with immunological targets in innate recognition of viral elements as membrane TLR receptors with the production capacity of IFN-1. Therefore, we approximated the tertiary structure using Robbeta, based on homology models, and we used analysis of the Ramachandran diagram and ERRAT to validate the quality of this approach. The results are summarised in (Supplementary Fig. S8).
Molecular docking, interactions with heterodimer TLR-4/MD-2
By using the ClusPro 2.0 server to simulate molecular docking between the vaccine construct and TLR-4, 30 models were generated (Supplementary Table 12); these were classified by cluster size according to their representative position. The lowest energy -1192.6 (Supplementary Fig. S9), was found in the sixth cluster with 22 members. The first cluster contained 35 members, which indicates an acceptable probability for the native pose of the complex. This cluster with the balanced adjustment was chosen because the pose of this was closer than those of adjustments 2 and 3, which were related to a majority of hydrophobic and electrostatic interfaces62.
This model positions the N terminal of the multi-epitope construct and a larger interface area towards the concave side of TLR-4, where its ectodomain forms mostly hydrogen and salt bridges. The free convex side is broken into the N terminal of TLR-4, which indicates the formation of hydrogen and salt bridges resulting from interactions with the terminal C of the multi-epitope construct.
In addition, interactions are present in the internal part and running towards Myeloid Differentiation Factor 2 (MD-2); these are mostly hydrogen and some saline bridges, the latter being maintained primarily by the ARG 157 and ARG 159 residues of the SFRLFARTRSMWSFN peptide. These are illustrated in (Supplementary Figs. S9-S11).
Molecular Dynamics
The root mean square deviation RMSD for the vaccine-receptor complex was evaluated during the complete simulation time (Fig. 6 A). The RMSD plot showed a considerably increase until it stabilizes at around 20 ns, with an average of about 0.8 nm, for the rest of the simulation time. These changes indicate that the vaccine attempts to finding the best position with respect to its receptor. After 20 ns from beginning of the simulation it is remained stable, which is an indicator that the vaccine reaches the best conformation to form a stable complex. In addition, the root mean square fluctuation RMSF of all residue, for vaccine and its receptor was evaluated. Residues from the vaccine start at GLY-1483 and finalize at residue CYS-1992 (see Fig. 6B), The RMSF showed a stable conformation from residue GLY-1483 to LYS1683 from the vaccine segment, which make part of the β -defensin 3 to TpD of the vaccine construct, Fig. 5a. This stabilization, is made due to the salt bridges formed into residue ASP-1424:ARG1639, ASP-756:ARG-1641, GLU-1180:ARG-1524, GLU-1183:ARG1520 and GLU-1556:ARG930 which
showed a regular contacts in the whole simulation (Supplementary Fig.12). On the other hand, a segment of the vaccine, which corresponds to residue PHE-1684 to CYS-1992 (Fig. 6B), showed a higher RMSF, displaying a pronounced motion compared with the GLY-1483 to LYS1683 segment, while TLR-4 receptor keeps stable. Beside this motion, the complex remains stable during the 68 ns of simulation.
The radius of gyration Rg was calculated to determine the compactness of vaccine-receptor complex system during the simulation, Fig. 6C. It does not show a significant change. The complex exhibit a compact folded structure in the whole simulation, with an average Rg of about 4.8 nm. This folded complex structure is stabilized by around 9 H-bonds in average (Fig. 6B ), and 5 salt-bridges which were described above. Thus, the Vaccine-TLR4 complex showed a stable conformation during the 68 ns of simulation.
Optimisation of cDNA from the vaccine construct for optimal expression of the vaccine product
For insertion of the vaccine construct into a plasmid vector, the CTGKSC+ protein sequence of 510 amino acids was inversely translated to a cDNA of 1530 nucleotides in length. The host system of expression varies, and the cDNA must adapt according to the use of the host codon. For optimal expression of the multi-epitope construct in the E. coli K 12 host, the resulting cDNA was codon-optimised according to the JCAT server. Furthermore, during optimisation the rho-independent transcription terminator and prokaryotic ribosomal binding sites in the middle of the cDNA sequence were avoided to generate optimal and complete protein expression. To insert the construct into the pET28a(+) cloning vector, the BamHI and HindIII cleavage sites were also avoided. The results are shown in (Supplementary Fig. S13).
Expression of the multi-epitope vaccine construct in E. coli K 12 by in silico cloning
The E. coli K 12 strain was selected as the organism for cloning purposes since multiple-epitopes vaccines are expressed and purified more easily in this bacterium. For this purpose, the expression vector pET28a(+) was used and excised with the restriction enzymes BamHI and HindIII. The optimised cDNA was then inserted near the ribosome binding site using Snapgene (Supplementary Fig. S13).
Immune system simulation
To identify the immunogenic profile of the vaccine construct, we used the C-IMMSIM immune server. As shown in Fig. 7, the secondary and tertiary responses showed greater global responses and the presence of memory T and B cells. This may have been due to the cumulative effect of cells at the serum level that possess a memory profile exceeding the total in injection 3. This is also associated with decreased antigenic concentrations and normal immunoglobulin levels. Cell-specific lineages were also stimulated, with the general activation of CD4+ and CD8+ T cells shown. In addition, HTL-1 cell-based immunity was predominant. This is associated with the production of the cytokines and interleukins IFN-1, TNF-β and IL-2, as well as the activation of professional antigen-presenting cells. These results are consistent with the construct having a natural immunogenic capacity, which was applied without LPS in the simulation. A more extensive immune simulation corroborates the memory formation indicated in this brief approach. It was conducted until day 311 with 12 injections, in a step of time beyond 460 days, adjusting the intervals that did not surpass 4 weeks, 12, 94, 178, 262, 346, 430, 514, 598, 682, 766, 850, 934 (Supplementary Fig. S14).