We analysed the NE descriptors of turns calculated from three distinct subsets of protein types: turns in α + β structures, turns in all-α structures only, and turns in all-β structures only. We did not analyse the particular case of proteins characterized as “random coil proteins” because the universe encompassing only such structures is significantly smaller and would not yield statistically meaningful results (we considered a total of 137,739 eligible PDB structures, against 325 random coil structures [as of April 2022]). Previous work on a variety of protein districts demonstrated that the NE is, in general, better described when the MANOVA test is applied to the corresponding descriptors than when the univariate test is applied (17, 18). To compare the NE of turns with the NE of α-helices and β-sheets, we first applied the same type of tests to define the list of most relevant NE descriptors (MRND) and then compared it with the MRNDs of α-helices and β-sheets, as explained below.
In Fig. 5, we present the sequence logo showing the most frequently found amino acid residues per position along the turn. This is relevant because it identifies the four most frequently found residues at selected positions (3–8) in the turns: G, D, P, and L. Figure 5 shows that proline is the most commonly found residue at position 2 and, for longer turns, at positions 5 and 6. The presence of aspartic acid is interesting, as this negatively charged residue confers a special function on turns: promoting charged interactions with other functional protein domains.
Figure 5 Sequence logo showing the most frequently found amino acid residues per position along the turn extension.
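The per-position counts that underlie a sequence logo such as the one in Fig. 5 can be sketched as follows. This is a minimal illustration only: the function names and the toy turn sequences are hypothetical and are not taken from the PDB-derived dataset.

```python
from collections import Counter

def position_frequencies(turns):
    """Count residues at each position over all turns long enough to reach it."""
    longest = max(len(seq) for seq in turns)
    counts = [Counter() for _ in range(longest)]
    for seq in turns:
        for pos, residue in enumerate(seq):
            counts[pos][residue] += 1
    return counts

def most_common_per_position(turns):
    """Most frequent residue at each 1-based position.

    Ties are resolved in favour of the residue encountered first.
    """
    return {pos + 1: counter.most_common(1)[0][0]
            for pos, counter in enumerate(position_frequencies(turns))}

# Hypothetical toy turns of different lengths (one-letter residue codes).
toy_turns = ["GPDG", "DPGL", "NPDGS", "GPLGPD"]
top_residues = most_common_per_position(toy_turns)
print(top_residues)
```

Because turns of different lengths are pooled, later positions are counted over fewer sequences; a real analysis would probably build one logo per turn length.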
Univariate test for the turns in the α + β dataset
A univariate test for the dataset of turns, selected as described above from the whole PDB, was set up to consider the 67 STING descriptors (Table 4) and the 6 different turn lengths (Fig. 3), giving 402 (67 × 6) possible tests. Of the 402 completed tests, 357 (88.8%) showed p-values < 1e-6. Of the 67 descriptors, 51 (76.2%) had p-values below 1e-6 in all of the tests (for all the turn sizes examined). This effectively reduced the list of descriptors to be considered in the final analysis, and the new list is presented in Table 4. In Fig. 6, we present the differences between the results of the univariate tests using the STING descriptors for the turns versus the α-helices (full line), the turns versus the β-sheets (dashed line) and the β-sheets versus the α-helices (two dashed lines).
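The bookkeeping behind these counts (one test per descriptor and turn length, a threshold of 1e-6, and the subset of descriptors that pass at every length) can be sketched as below. The function name, the descriptor names and the p-values are hypothetical placeholders; only the threshold and the six turn lengths come from the text.

```python
from itertools import product

ALPHA = 1e-6                       # significance threshold used in the text
TURN_LENGTHS = (3, 4, 5, 6, 7, 8)  # the six turn lengths

def summarise(p_values, descriptors, lengths=TURN_LENGTHS, alpha=ALPHA):
    """Summarise a grid of univariate tests.

    p_values maps (descriptor, turn_length) -> p-value of one test.
    """
    trials = list(product(descriptors, lengths))
    significant = [key for key in trials if p_values[key] < alpha]
    # Descriptors whose test is significant for every turn length.
    always = [d for d in descriptors
              if all(p_values[(d, n)] < alpha for n in lengths)]
    return {
        "n_tests": len(trials),
        "n_significant": len(significant),
        "pct_significant": 100.0 * len(significant) / len(trials),
        "always_significant": always,
    }

# Hypothetical p-values for three made-up descriptors.
descriptors = ["desc_a", "desc_b", "desc_c"]
p_values = {(d, n): 1e-9 for d in descriptors for n in TURN_LENGTHS}
p_values[("desc_b", 8)] = 0.03     # one test fails the threshold
report = summarise(p_values, descriptors)
print(report)
```

With 67 descriptors instead of three, the same grid has 67 × 6 = 402 entries, which is the number of tests reported above.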
In Fig. 6, one may easily notice general differences (line trends), which we will refer to from now on as TDt/α, TDt/β and TDα/β (Trend Difference: turn/α-helix, turn/β-sheet and α-helix/β-sheet). The “difference” here is the percentage of cases, for a given descriptor, in which the p-value was found to be less than 1e-6 for the NE of the turn, minus the corresponding percentage for the NE of the α-helices (and likewise for the remaining combinations of TDs). As shown in Fig. 6, in the case of turns versus α-helices, there are 29 descriptors with differences ranging between 50% and 75% and 6 descriptors with a difference of more than 75%. The latter are those such as aromatic contacts, charge repulsive contacts, hydrogen bond main chain – water – side chain, hydrogen bond main chain – water – water – main chain, hydrogen bond main chain – water – water – side chain, hydrophobic WNA by distance, and hydrophobic WNA at the surface.
In the case of the turns versus the β-sheets, there are 15 descriptors with a difference between 50% and 75% and no descriptor with a difference greater than 75%.
Comparatively, for β-sheets versus α-helices, there are only three descriptors with a difference between 50% and 75% and none with a difference greater than 75%.
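The trend difference and the two bands used above (50% to 75%, and more than 75%) can be illustrated with the following sketch. It assumes that the bands are applied to the magnitude of the difference; the function names and the percentages are hypothetical and do not reproduce the values plotted in Fig. 6.

```python
def trend_difference(pct_a, pct_b):
    """Per-descriptor difference, in percentage points, between two SSE types.

    Each argument maps descriptor -> percentage of tests with p < 1e-6.
    """
    return {d: pct_a[d] - pct_b[d] for d in pct_a if d in pct_b}

def bin_differences(diff):
    """Group descriptors by the magnitude of their difference."""
    bins = {"50-75": [], ">75": []}
    for name, delta in diff.items():
        size = abs(delta)  # assumption: bands refer to the absolute difference
        if size > 75:
            bins[">75"].append(name)
        elif size >= 50:
            bins["50-75"].append(name)
    return bins

# Hypothetical percentages for three made-up descriptors.
turn_pct = {"d1": 100.0, "d2": 90.0, "d3": 50.0}
helix_pct = {"d1": 20.0, "d2": 30.0, "d3": 50.0}
td_turn_helix = trend_difference(turn_pct, helix_pct)
bands = bin_differences(td_turn_helix)
print(td_turn_helix, bands)
```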
Decreasing the flanking region size from 32 to only 16 AAs, we obtained essentially the same results: of the 402 tests, 354 (88.1%) showed p-values < 1e-6, and 50 descriptors (74.6%) appeared in all of the tests with p-values below 1e-6. However, when we decreased the flanking region size to only 8 AAs before and 8 AAs after the turns, we found what we consider “better” results, with 358 of the 402 tests (89.1%) showing p-values < 1e-6. The set of descriptors that appeared in all of the positive tests (those resulting in a p-value < 1e-6) is the same as that shown in Table 5.
Although different in intensity and number, the general trend for the turns minus the α-helices for the studied variables is similar to that for the turns minus the β-sheets. Both of these cases are also somewhat similar, as far as the line trend is concerned, to the case of α-helix minus β-sheet. We will discuss the possible implications of this finding, shown in Fig. 6, in the Discussion section.
In our Dictionary of Internal Proteins NE (https://www.proteinnanoenvironments.cnptia.embrapa.br), in the SSE chapter, we provide comparative plots for the three SSE NEs with respect to the MRND.
Figure 6 The trend in variation of percent-wise participation of a selected descriptor appearing in the cases where the p-value was less than 1e-6 for the univariate tests. Here, we calculated and presented the difference between the cases: the turns versus the α-helices (shown in solid line), the turns versus the β-sheets (shown in dashed line) and the β-sheets minus the α-helices (shown in point-dash line) NE. δ indicates the difference in percent-wise participation in the cases with p-values < 1e-6: the turns versus the α-helices, the turns versus the β-sheets, and the β-sheets versus the α-helices.
The univariate test for the turns in the all-α dataset
The turns found in protein types defined as all-α structures have six different lengths, totalling 402 possible tests (67 selected descriptors × 6 possible lengths). In 309 cases (76.9% of the total), the p-values were < 1e-6. Forty-five descriptors (67.2%) had p-values less than 1e-6 in all the tests (lengths 3, 4, 5, 6, 7, 8) (Table 5).
For the flanking region with 16 AA before and 16 AA after the turns, 299 of the 402 tests (74.4%) had p-values < 1e-6. Forty-one descriptors (65.7%) had p-values less than 1e-6 in all the tests (lengths 3, 4, 5, 6, 7, 8) (Table 5).
Considering 8 AA before and 8 AA after the turns, 295 of the 402 tests (73.1%) had p-values < 1e-6, and 41 of the 67 descriptors (61.2%) had p-values less than 1e-6 in all the tests (lengths 3, 4, 5, 6, 7, 8) (Table 5).
The univariate test for the turns in the all β-sheet dataset
The turns in the all-β dataset had 6 different lengths, totalling 402 tests. Of the 402 tests, 284 had a p-value < 1e-6, representing 70.6% of the total. In this case, 37 descriptors (55.2%) had p-values less than 1e-6 in all tests (lengths 3, 4, 5, 6, 7, 8) (Table 5).
As in the previous cases, we limited the flanking region to 16 AA before and 16 AA after the turns. In this case, 282 of the 402 tests (70.1%) had p-values < 1e-6, and 37 descriptors had p-values less than 1e-6 in all tests (lengths 3, 4, 5, 6, 7, 8) (Table 5).
Using 8 AA before and 8 AA after the turns, 284 of the 402 tests (70.6%) had a p-value < 1e-6. Table 5 shows which descriptors had a p-value less than 1e-6 in all tests (lengths 3, 4, 5, 6, 7, 8).
Table 5
The STING descriptors with 100% of conducted univariate Kolmogorov‒Smirnov tests resulting in p-values < 1e-6 in the “α + β”, the “all-α”, and the “all-β” databases, and for flanking region sizes of 32, 16, and 8 AA. The white-filled boxes show the missing position for the selected descriptor (line) within a specific protein type and flanking region size (column).
| Descriptor | turns in the α + β, 32 AA | turns in the α + β, 16 AA | turns in the α + β, 8 AA | turns in the all-α, 32 AA | turns in the all-α, 16 AA | turns in the all-α, 8 AA | turns in the all β-sheet, 32 AA | turns in the all β-sheet, 16 AA | turns in the all β-sheet, 8 AA |
|---|---|---|---|---|---|---|---|---|---|
| 1. Accessible_Surface_in_Isolation | | | | | | | | | |
| 2. aromatic | | | | | | | | | |
| 3. aromatic_WNADist | | | | | | | | | |
| 4. aromatic_WNASurf | | | | | | | | | |
| 5. ch_attractive_WNADist | | | | | | | | | |
| 6. Ch_attractive_WNASurf | | | | | | | | | |
| 7. ch_repulsive_WNADist | | | | | | | | | |
| 8. ch_repulsive_WNASurf | | | | | | | | | |
| 9. Clash | | | | | | | | | |
| 10. Cross_Link_Order_CA | | | | | | | | | |
| 11. Cross_Pres_Order_CA | | | | | | | | | |
| 12. Dihedral_Chi1 | | | | | | | | | |
| 13. Dihedral_Chi2 | | | | | | | | | |
| 14. Dihedral_Chi3 | | | | | | | | | |
| 15. Electrostatic_Potential_at_CA | | | | | | | | | |
| 16. Electrostatic_Potential_@_CA_WNADist | | | | | | | | | |
| 17. Electrostatic_Potential_@_CA_WNASurf | | | | | | | | | |
| 18. Electrostatic_Potential_at_LHA | | | | | | | | | |
| 19. Electrostatic_Potential_@_LHA_WNADist | | | | | | | | | |
| 20. Electrostatic_Potential_at_LHA_WNASurf | | | | | | | | | |
| 21. Electrostatic_Potential_Average | | | | | | | | | |
| 22. Electrostatic_Potent._Average_WNADist | | | | | | | | | |
| 23. Electrostatic_Potent._Average_WNASurf | | | | | | | | | |
| 24. hbmm | | | | | | | | | |
| 25. hbmm_WNADist | | | | | | | | | |
| 26. hbmm_WNASurf | | | | | | | | | |
| 27. hbms | | | | | | | | | |
| 28. hbms_WNADist | | | | | | | | | |
| 29. hbms_WNASurf | | | | | | | | | |
| 30. hbmwm | | | | | | | | | |
| 31. hbmwm_WNADist | | | | | | | | | |
| 32. hbmwm_WNASurf | | | | | | | | | |
| 33. hbmws | | | | | | | | | |
| 34. hbmws_WNADist | | | | | | | | | |
| 35. hbmws_WNASurf | | | | | | | | | |
| 36. hbmwwm | | | | | | | | | |
| 37. hbmwwm_WNADist | | | | | | | | | |
| 38. hbmwwm_WNASurf | | | | | | | | | |
| 39. hbmwws | | | | | | | | | |
| 40. hbmwws_WNADist | | | | | | | | | |
| 41. hbmwws_WNASurf | | | | | | | | | |
| 42. hbss | | | | | | | | | |
| 43. hbss_WNADist | | | | | | | | | |
| 44. hbss_WNASurf | | | | | | | | | |
| 45. hydrophobic | | | | | | | | | |
| 46. hydrophobic_WNADist | | | | | | | | | |
| 47. hydrophobic_WNASurf | | | | | | | | | |
| 48. Hydrophobicity_KDI | | | | | | | | | |
| 49. Density_IFR_CA_3 | | | | | | | | | |
| 50. Density_Internal_CA_3 | | | | | | | | | |
| 51. Number_Unused_Contact | | | | | | | | | |
| 52. Number_Unused_Contact_WNADist | | | | | | | | | |
| 53. Number_Unused_Contact_WNASurf | | | | | | | | | |
| 54. Percent_of_Space_Clash | | | | | | | | | |
| 55. Temperature_Factor_CA | | | | | | | | | |
According to Table 5, which shows the results of the univariate KS tests, the MRNDs for the turns in the α + β, all-α, and all β-sheet datasets, with flanking regions of 32 AA, 16 AA and 8 AA, form a set composed of the following classes of descriptors: Surface Accessibility (Accessible_Surface_in_Isolation), Contacts (aromatic, charge attractive and charge repulsive, hydrogen bonds), Space Clash (Clash, Percent_of_Space_Clash), Structural (Cross Link and Cross Presence, Dihedral Angles, Temperature_Factor), Physical Chemical (Electrostatic Potential, Hydrophobicity), Density (IFR and Internal) and Unused Contacts (Number_Unused_Contact).
MANOVA
Previous work demonstrated that the Kolmogorov‒Smirnov test is not the best way to analyse the NE of secondary structure elements (19, 20). Although we use this method to select some of the best descriptors for characterizing each secondary structure element type (as a first approximation), its results are not always satisfactory and usually have low coverage. As shown in Table 5, in the best case, 55 descriptors (82.1% of the total [67] descriptors used for this analysis) appear in 100% of tests where the p-value is < 1e-6 (the whole PDB dataset). That is the case for turns in the “α + β” type of proteins. In the case of turns in all-α structures, we have 46 descriptors with a p-value < 1e-6 in 100% of the tests. Finally, in the case of turns in all-β structures, we have 37 descriptors with a p-value < 1e-6 in 100% of the tests.
As expected, this result corroborates the previous observation shown in Fig. 6 that turns are much more different (in terms of necessary descriptors to differentiate one from the other) from α-helices and less so from β-sheets. Consequently, one is required to find more descriptors for overall good coverage in classifying turns in α-helices and in (α + β) + (α/β), while much fewer descriptors are necessary to distinguish turns from β-sheets in (α + β) + (α/β).
Multivariate tests for the flanking region of 32 AAs
MANOVA tests for the same descriptor set (as shown in Table 4) are described below. Figure 7 shows the results for the four test statistics provided by the MANOVA algorithm: Pillai, Wilks, Hotelling-Lawley, and Roy. These tests are available in the R manova function (26) and were employed in all the analytical procedures of this part of our work.
Pillai’s trace is a statistical test whose value ranges from 0 to 1. Increasing values indicate that the effects are contributing more to the model; the null hypothesis must be rejected for large values (30) (31).
In the Wilks' lambda test, the null hypothesis must be rejected when Wilks' lambda is close to zero, although this must be done in combination with a small p-value. Lambda measures the percentage of variance in the dependent variables that is not explained by differences in the levels of the independent variable. A value of zero means that there is no variance left unexplained by the independent variable. Therefore, the closer the statistic is to zero, the more the variable in question contributes to the model (32).
In the Hotelling-Lawley test, which generalizes Hotelling's T-squared test to more than two groups, the test statistic is calculated and compared with a tabulated critical value; if the calculated value is greater than the tabulated one, the null hypothesis must be rejected (33).
Roy's largest root is a positive-valued multivariate test statistic. Increasing values of the statistic indicate increasing contributions of the effects to the model in question. The null hypothesis must be rejected for large values (34).
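All four statistics are functions of the same quantities, namely the eigenvalues of E⁻¹H, where H and E are the hypothesis and error sums-of-squares-and-cross-products matrices. The sketch below computes them from a given list of eigenvalues. It is an illustration under stated assumptions rather than the R implementation: the function name and the example eigenvalues are hypothetical, and Roy's statistic is taken to be the largest eigenvalue itself.

```python
import math

def manova_statistics(eigenvalues):
    """Four MANOVA test statistics from the eigenvalues of E^-1 H.

    H is the hypothesis SSCP matrix and E the error SSCP matrix.
    """
    lams = [float(lam) for lam in eigenvalues]
    if not lams or any(lam < 0 for lam in lams):
        raise ValueError("expected a non-empty list of non-negative eigenvalues")
    return {
        # Pillai's trace: sum of lambda / (1 + lambda)
        "Pillai": sum(lam / (1.0 + lam) for lam in lams),
        # Wilks' lambda: product of 1 / (1 + lambda); small values argue against H0
        "Wilks": math.prod(1.0 / (1.0 + lam) for lam in lams),
        # Hotelling-Lawley trace: sum of the eigenvalues
        "Hotelling-Lawley": sum(lams),
        # Roy's largest root, taken here as the largest eigenvalue itself
        "Roy": max(lams),
    }

# Hypothetical eigenvalues, for illustration only.
stats = manova_statistics([1.0, 0.25])
print(stats)
```

Large eigenvalues push Pillai, Hotelling-Lawley and Roy up and Wilks' lambda towards zero, which is why the rejection rule for Wilks points in the opposite direction from the other three.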
After eliminating correlated (nonorthogonal) descriptors and removing the data that did not follow a normal distribution, we were able to execute 6 tests for turns in the α + β dataset, 4 tests for turns in the all-α dataset, and 2 tests for turns in the all-β dataset. Table 6 presents the frequency of each descriptor in the MANOVA tests for the turns in the α + β, all-α, and all-β datasets, counted per unique turn size (there are six such sizes, as previously described).
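One simple way to eliminate correlated descriptors before a MANOVA test is a greedy filter on the pairwise Pearson correlation, sketched below. This is only an assumption about how such a step could look: the 0.9 cut-off, the function names and the toy columns are hypothetical, and the normality screening mentioned above is not covered.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(columns, threshold=0.9):
    """Keep a descriptor only if |r| with every kept descriptor is <= threshold.

    columns maps descriptor name -> list of values; earlier entries win.
    The threshold is a hypothetical choice, not the one used in the paper.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy descriptor columns.
toy_columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],  # almost a multiple of "a"
    "c": [5, 1, 4, 2, 3],
}
kept_descriptors = drop_correlated(toy_columns)
print(kept_descriptors)
```

The result of a greedy filter depends on the order in which descriptors are visited, so a real pipeline would need a documented ordering rule.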
Multivariate tests for the flanking region of 16 AAs
As mentioned above, we also ran the tests after limiting the flanking region to only 16 AAs before and 16 AAs after the turns, and subsequently to 8 AAs before and 8 AAs after the turns, applying MANOVA tests under these conditions as well. Using the flanking region of 16 AA, 83.3% of the tests had p-values < 1e-6 in the α + β structures, 66.7% in the all-α structures, and 100% in the all-β structures (Fig. 7). Table 6 gives the frequency of each descriptor in the MANOVA tests for the three datasets (the turns in the α + β, all-α, and all-β datasets).
Multivariate tests for the flanking region of 8 AAs
For the flanking region of 8 AA, 83.3% of the tests had p-values < 1e-6 in the α + β structures and 50% in the all-α structures, with no results for the all-β structures (Fig. 7). The lack of results in the all-β dataset arises because no tests remained after eliminating the correlated descriptors and removing the data that were not normally distributed. Table 6 shows the frequency of each descriptor used in the MANOVA tests for the three datasets: the turns in the α + β dataset, the turns in the all-α dataset, and the turns in the all-β dataset. There are six turn sizes (3, 4, 5, 6, 7, 8), and the MANOVA test was performed for each size. For example, for turns in the α + β proteins, the frequency of each descriptor refers to how many tests it participated in, with the maximum possible number being six. For the flanking region of 32 AAs, the most frequently encountered descriptors were ch_attractive, Dihedral_Chi3, and hbswws_WNADist; each appeared in three tests. For the flanking region of 16 AAs, the most frequent descriptor was aromatic_WNADist, which appeared in all six tests. Finally, for the flanking region of 8 AAs, the most frequent descriptors were aromatic_WNADist and aromatic_WNASurf, each appearing in all six tests.
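The descriptor frequencies and the percentages of significant tests discussed above can be obtained with simple counting, as in the following sketch. The function names, descriptor names and p-values are hypothetical; the example only illustrates that, if six tests are run, 83.3% corresponds to five of them falling below the threshold.

```python
from collections import Counter

def descriptor_frequency(tests_by_size):
    """Number of MANOVA tests in which each descriptor took part.

    tests_by_size maps turn size -> descriptors used in that size's test.
    """
    freq = Counter()
    for used in tests_by_size.values():
        freq.update(set(used))  # count a descriptor once per test
    return freq

def pct_significant(p_values, alpha=1e-6):
    """Percentage of tests with a p-value below alpha."""
    return 100.0 * sum(p < alpha for p in p_values) / len(p_values)

# Hypothetical descriptor sets for three turn sizes.
toy_tests = {
    3: ["desc_a", "desc_b"],
    4: ["desc_a"],
    5: ["desc_c", "desc_a"],
}
freq = descriptor_frequency(toy_tests)

# Hypothetical p-values of six tests, five of them below 1e-6.
share = pct_significant([1e-9, 1e-8, 1e-12, 1e-7, 1e-10, 0.2])
print(freq, round(share, 1))
```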
Figure 7 Results of the MANOVA test applied to the turn datasets for the flanking regions of 32, 16 and 8 AA. In the case of the α + β proteins, the best results were for flanking regions of 16 AA and 8 AA, with 83.3% of the tests having p-values below 1e-6. However, in the case of the all-α proteins, 100% of the tests had p-values below 1e-6 when we worked with a flanking region of 32 AA, and in the case of the all-β proteins, 100% of the tests had p-values below 1e-6 for the 32 AA and 16 AA flanking regions. There are no results for the all-β proteins with the 8 AA flanking region because, in this case, no test was performed after the data preparation phase.
Table 6
The frequency of descriptor appearance registered in the results of the MANOVA tests. The numbers indicate how many times the specific descriptor was used in the MANOVA test. The analysis considered 6 turn sizes (3, 4, 5, 6, 7, 8), and the MANOVA test was performed once for each size. Consequently, the maximum number of tests in which one descriptor may appear is six. We then grouped the listed descriptors into five general classes: A) contacts, B) structural, C) electrostatic potential, D) hydrophobicity and E) unused contacts.
| Descriptors | 32 AA (α + β) | 32 AA (all-α) | 32 AA (all-β) | 16 AA (α + β) | 16 AA (all-α) | 16 AA (all-β) | 8 AA (α + β) | 8 AA (all-α) | 8 AA (all-β) |
|---|---|---|---|---|---|---|---|---|---|
| aromatic | 1 | 2 | | 2 | | | 3 | | |
| ch_attractive | 4 | 1 | | 6 | | | 5 | | |
| ch_repulsive | 2 | 4 | 1 | 4 | 3 | | 4 | 3 | |
| ch_repulsive_WNADist | | 1 | | 4 | | | 4 | | |
| ch_repulsive_WNASurf | 1 | | | 1 | | | | | |
| Cross_Pres_Order_CA | | | | 1 | | | | | |
| Dihedral_Chi1 | 3 | | | 4 | | | 4 | 1 | |
| Dihedral_Chi2 | | | 1 | | | | | | |
| Dihedral_Chi3 | 2 | | | | | | 4 | | |
| Dihedral_Chi4 | 1 | | | 4 | | | 4 | | |
| disulfide | | | | | | | | 1 | |
| disulfide_WNADist | 1 | | | | | | 2 | 1 | |
| disulfide_WNASurf | | | | 1 | 1 | | | | |
| Electrostatic_Potential_at_CA_WNADist | | | | 1 | | | 2 | | |
| Electrostatic_Potential_at_LHA | | | 2 | | | | 2 | | |
| Electrostatic_Potential_Average | | | | | | | 1 | | |
| Electrostatic_Potential_Average_WNASurf | 1 | | | 1 | | | 1 | | |
| hb-mwm | | | 1 | | | 1 | | 1 | 1 |
| hb-mwm_WNASurf | | | | | | | 1 | | |
| hb-mws | 1 | | 2 | 3 | | | 3 | | |
| hb-mws_WNADist | 1 | | | 1 | | | 1 | | |
| hb-mws_WNASurf | | | | 1 | | | 1 | | |
| hb-mwwm | | | 1 | | | | | | |
| hb-mwwm_WNADist | 1 | | | | | | | | |
| hb-mwwm_WNASurf | | | | | | | 1 | | |
| hb-mwws | | | | 2 | | | 3 | | |
| hb-mwws_WNADist | | | | 1 | | | 1 | | |
| hb-ss | | 3 | 1 | 1 | | | 3 | | |
| hb-sws | 1 | 1 | | 1 | | | 3 | | |
| hb-sws_WNADist | 1 | | | 1 | | | | | |
| hb-sws_WNASurf | 1 | | | 1 | | | 2 | | |
| hb-swws | 1 | | 1 | 2 | | | 2 | | |
| hb-swws_WNADist | 1 | | | 2 | | | 2 | | |
| hb-swws_WNASurf | | 1 | | 2 | 1 | | 4 | 1 | |
| hydrophobic | | | | | | | | | |
| hydrophobic_WNADist | | 3 | 1 | 2 | 1 | 1 | 1 | 2 | 1 |
| hydrophobic_WNASurf | 1 | 2 | 2 | 2 | 2 | | 2 | 2 | |
| Number_Unused_Contact_WNADist | | | | 1 | | | | | |
| Number_Unused_Contact_WNASurf | 1 | | | | | | 2 | | |