SARS-CoV-2 has had a huge impact on human production, daily life and even survival, and has become a major challenge confronting the whole world. Vaccine development is one of the effective long-term means of preventing and treating the virus. Epitope vaccines are a trend in vaccine development because they are highly specific, cause fewer toxic and side effects, and are easy to transport and store [21]. Determining epitopes is the basis for vaccine development and application, as well as for clinical diagnosis and treatment. Currently, the main methods used are X-ray crystal diffraction, immunological experiments and bioinformatics. The first two are time-consuming and laborious, and the bioinformatics approach is gaining increasing credibility among researchers [3, 21, 22]. Many factors must be considered when predicting epitopes by bioinformatics, such as the surface probability and flexibility of the epitopes. At the same time, it is necessary to exclude structurally stable, non-deformable α-helices and β-sheets, as well as glycosylation sites that may obscure the epitopes or alter their antigenicity [23]. Even so, predicted epitopes are still inaccurate [4]. Compared with current studies on SARS-CoV-2, this work adopted a variety of prediction methods and 3D structure databases developed in recent years, such as ABCpred, BepiPred 2.0, SEPPA 3.0 and IEDB, which are based on artificial neural networks, Hidden Markov Models (HMM), Support Vector Machines (SVM) and other approaches. Compared with prediction by a single method [24], on the basis of a single protein [25] or on the basis of SARS epitopes [26], these methods and databases greatly improved the accuracy of prediction and gave the results more bioinformatic meaning.
We comprehensively analyzed the prediction results from widely used tools and set up screening criteria based on primary, secondary and tertiary structure, so that the prediction results would be more accurate and reliable.
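The kind of screening described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the pipeline used in this work: the function names, the per-residue secondary structure encoding ('H' helix, 'E' sheet, 'C' coil), the surface probability cutoff of 1.0 and the restriction to N-linked glycosylation sequons (Asn, any residue except Pro, then Ser or Thr) are all hypothetical choices made for the example.

```python
import re

# N-linked glycosylation sequon: Asn, any residue except Pro, then Ser or Thr.
# The zero-width lookahead lets overlapping sequons be reported as well.
SEQUON = re.compile(r"(?=N[^P][ST])")


def glycosylation_sites(sequence):
    """Return 0-based positions of Asn residues that start an N-X-S/T sequon."""
    return [m.start() for m in SEQUON.finditer(sequence)]


def passes_screen(start, end, sequence, secondary, surface, min_surface=1.0):
    """Screen one candidate epitope covering residues [start, end), 0-based.

    secondary: per-residue states, 'H' helix, 'E' sheet, 'C' coil (assumed encoding).
    surface: per-residue surface probability scores (assumed scale).
    min_surface: hypothetical cutoff on the mean surface probability.
    """
    # Reject candidates that contain any helix or sheet residue.
    if any(state in "HE" for state in secondary[start:end]):
        return False
    # Reject candidates that contain a predicted glycosylation site.
    if any(start <= pos < end for pos in glycosylation_sites(sequence)):
        return False
    # Require the candidate to be, on average, surface exposed.
    mean_surface = sum(surface[start:end]) / (end - start)
    return mean_surface >= min_surface
```

A candidate is kept only if it passes all three checks. In practice the cutoff and the structural annotations would come from the prediction tools themselves rather than from fixed values like these.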
The S, E and M proteins are surface proteins of SARS-CoV-2 and have potential as antigenic molecules. However, because the S protein has been reported to be the molecule through which SARS-CoV-2 binds directly to ACE2 [28], current studies on epitope prediction for SARS-CoV-2 [27] focus mainly on the S protein, with few studies on the E and M proteins. In this work, we analyzed the S, E and M proteins and predicted their epitopes. On this basis, 7 B cell epitopes were predicted, comprising 2 conformational and 6 linear B cell epitopes, of which one conformational epitope and one linear epitope coincide. Epitopes A, B, C and D are all located on the surface of the tail of the S protein, which is relatively easy to bind. Epitope E is located on the head of the S protein, the key area where the S protein recognizes and binds to ACE2 [28, 29], and therefore has the potential to block the infection process. Epitopes F and G are located at the end of the head of the E protein, and the two epitopes coincide; this may be because both are consecutive and their secondary structure avoids α-helices and β-sheets. Epitope H is derived from the M protein; its structure and conservation could not be determined because no reliable structure could be predicted. However, the surface probability scores indicate that epitope H is likely to be located on the surface of the M protein.
The higher the conservation score calculated by the ConSurf Server, the more likely the site is to mutate during evolution. A score < 1 indicates that the site is likely to be conserved; a score between 1 and 2 indicates that the site is relatively prone to mutation; a score > 2 indicates that the site is prone to mutation [30]. Among the 7 epitopes obtained, all epitopes of the S, E and M proteins were absolutely conserved across all SARS-CoV-2 sequences. For the human coronavirus dataset and the coronavirus dataset, only epitope D had an average conservation score higher than 1, indicating that it is prone to mutation. Epitope D should therefore not be used as an epitope of the S protein. The conservation of epitope H could not be calculated from a PDB file, so its application value needs further experimental verification. Although the epitopes as a whole could be considered conserved, individual residues within them could still be prone to mutation. Except for epitope E, all 6 dominant epitopes contain 1–2 residues with a conservation score higher than 1 (Table 3C), indicating that these residues are likely to be mutation-prone sites. These residues are mostly located at the head or tail of the epitopes; therefore, attention should be paid to mutations of these residues, and the length of the epitopes should be adjusted according to the actual effect in application. The scores of the epitopes differed between datasets, which could be due to the number of sequences in each dataset and to the structures being analyzed under different conditions.
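The score thresholds and the per-residue check discussed above can be expressed as a short Python sketch. The function names and the example scores are hypothetical, and treating the boundary values 1 and 2 as belonging to the middle class is an assumption, since the text only says "between 1 and 2".

```python
def classify_site(score):
    """Map a conservation score to the three classes described in the text.

    Assumption: scores of exactly 1 or 2 fall into the middle class.
    """
    if score < 1:
        return "conserved"
    if score <= 2:
        return "relatively variable"
    return "variable"


def summarize_epitope(scores):
    """Summarize per-residue conservation scores of one epitope.

    Returns the mean score, the class of that mean, and the 0-based positions
    of residues whose own score is higher than 1 (candidate mutation sites).
    """
    mean = sum(scores) / len(scores)
    variable_positions = [i for i, s in enumerate(scores) if s > 1]
    return {
        "mean": mean,
        "class": classify_site(mean),
        "variable_positions": variable_positions,
    }


# Illustrative scores only, not values from this study.
summary = summarize_epitope([0.2, 0.4, 1.6, 0.3])
```

With these made-up scores the epitope as a whole is classed as conserved while one residue is still flagged, which mirrors the situation in which an epitope is conserved on average but contains 1–2 residues scoring above 1.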
Epitope detection in glycoproteins is important for studying the immune response to SARS-CoV-2, but it is challenging and less reliable because of the presence of glycans [25]. In addition, SARS-CoV-2 mutates frequently and the predicted epitopes might mutate too, so the conserved epitopes analyzed in the present study might be more reliable. However, this work has limitations. Without molecular dynamics analysis, the binding between epitopes and antibodies was not simulated to further assess the usability of the epitopes. Nevertheless, research from different perspectives can provide more epitope choices for subsequent studies.