Comprehensive analysis of the distinct nano environments characteristics containing the different secondary structure elements: α-helices, β-sheets, and turns

doi:10.21203/rs.3.rs-3427687/v1

This work is the third part of our initiative to fully describe the internal protein nano environments (NEs) for the three existing types of secondary structure elements (SSE). In our previous work, the NEs of both the α-helix and the β-sheet were analysed. The knowledge acquired in this research is important considering that secondary structure element formation is a crucial step in protein folding and an important phase that precedes the final 3D protein structure. In the current paper, the STING database of physical-chemical and structural descriptors was used to gather the necessary information to characterize the NE of loops, or, as they are often called, turns. Given that approximately 20% of all protein-type residues form turns, research in this field is essential, and analysis of the obtained results will further contribute to our comprehension of how proteins fold. In addition, the results in this paper will contribute to the better training of algorithms that evaluate the degree of overall protein structure quality and, consequently, structure prediction. This is currently very important given that we are witnessing a revolution in algorithms employing artificial intelligence for protein structure prediction. Powered by the STING database (a wide-ranging protein structure information source), statistical testing was used to retrieve a set of descriptors that fully delineate the NE of turns. By collecting such data, it is then possible to list the differences with respect to the NEs of α-helices and β-sheets and, by doing so, establish the most relevant NE descriptors (MRND) for each of the three SSEs. The results show that the α-helical and β-sheet NEs, as well as the amino acid residue composition, all behave in a fashion similar to a “key and lock” system. 
In other words, it is necessary for a set of specific descriptors to assume respective specific values (within the bounds of a very definite value region) to construct the specific secondary structure element NE at a certain protein location. Consequently, there is a set of descriptors that act together that are required to satisfy specific conditions for secondary structure element occurrences. The very same requirement, we found, occurs in the case of turns.

According to the Stride definition, 38.88% of all protein-type residues are found in α-helices, 22.16% in β-sheets, 19.06% in coils, and 19.90% in turns (1). A turn is defined as any region between two regular secondary structures (α-helices or β-sheets) that is at least three residues and at most eight residues in length (2) (3) and satisfies the known geometrical restrictions for turns (see Table 1). The coils correspond to residues not associated with α-helices, β-strands, or turns. According to Toniolo (4), the separation between the two end residues classifies the turns into the following groups:

α-turn, when the end residues are separated by four peptide bonds (i → i ± 4)
β-turn, separated by three bonds (i → i ± 3)
γ-turn, separated by two bonds (i → i ± 2)
δ-turn, separated by one bond (i → i ± 1), which is sterically unlikely
π-turn, separated by five bonds (i → i ± 5)

β-turns are the most common turns, corresponding to approximately 25–30% of all turns (5), and there are nine different types of β-turns, distinguished by their ideal residue-to-residue angles: I, II, VIII, I’, II’, VI_b, VI_a1, VI_a2, and IV (6) (7), as shown in Fig. 1. Table 1 shows the dihedral angles for β-turns (8). In this third part of our secondary structure NE characterization investigation, we looked for parameters fully describing the turn’s NE. While we recognize the existence of a variety of turn subclasses, in this work, we decided to treat all turns as a single class so that we might pick the most general characteristics of the respective NE.

The complete knowledge of the NE specific for each of the secondary structure elements may be useful to estimate, for example, with more precision where one helix ends and the primary protein structure folds into a turn.

Figure 1 Frequency of the turn types among proteins as of April 2016 (PDB with 117,929 protein structures). (7)

Table 1

β-turn types with their dihedral angles, in degrees (8)
Turn Type	ϕ(i+1)	ψ(i+1)	ϕ(i+2)	ψ(i+2)
I	-60	-30	-90	0
I'	60	30	90	0
II	-60	120	80	0
II'	60	-120	-80	0
IV	-61	10	-53	17
VIa1	-60	120	-90	0
VIa2	-120	120	-60	0
VIb	-135	135	-75	160
VIII	-60	-30	-120	120
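As an illustration of how the ideal angles in Table 1 can be used, the following is a minimal sketch that assigns an observed pair of residues (i+1, i+2) to the closest ideal β-turn type. The function names, the "largest single-angle deviation" criterion and the 40° tolerance are assumptions made for this example only; they are not the classification procedure used in this work or in references (6)–(8).

```python
# Ideal dihedral angles in degrees, copied from Table 1:
# (phi_i+1, psi_i+1, phi_i+2, psi_i+2)
IDEAL_BETA_TURNS = {
    "I": (-60, -30, -90, 0),
    "I'": (60, 30, 90, 0),
    "II": (-60, 120, 80, 0),
    "II'": (60, -120, -80, 0),
    "IV": (-61, 10, -53, 17),
    "VIa1": (-60, 120, -90, 0),
    "VIa2": (-120, 120, -60, 0),
    "VIb": (-135, 135, -75, 160),
    "VIII": (-60, -30, -120, 120),
}


def angular_difference(a, b):
    """Smallest absolute difference between two angles, in degrees (0..180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def closest_beta_turn(phi1, psi1, phi2, psi2, tolerance=40.0):
    """Return the Table 1 type whose ideal angles are closest, or None.

    Closeness is the largest deviation over the four angles (an assumption);
    `tolerance` is a hypothetical cut-off in degrees.
    """
    observed = (phi1, psi1, phi2, psi2)
    best_type, best_deviation = None, float("inf")
    for turn_type, ideal in IDEAL_BETA_TURNS.items():
        worst = max(angular_difference(o, i) for o, i in zip(observed, ideal))
        if worst < best_deviation:
            best_type, best_deviation = turn_type, worst
    return best_type if best_deviation <= tolerance else None
```

The angular difference wraps around ±180°, so that, for example, 170° and -170° are treated as 20° apart rather than 340°.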

Most Relevant Nano environment Descriptors – specific case, SSE: α-helix and β-sheet

The internal NE concept for proteins was formally introduced by Neshich (9) to describe the local protein structure environment, corresponding to the selected internal protein district. A protein district is a functional and/or constitutional part of the protein structure, responsible for carrying out a specific task, such as interactions with specific partners (with some districts on other proteins or from the very same protein, but different chain, for example), as well as engaging in vital folding steps that spawn local protein structure constituents, etc. The protein district concept was analysed in detail in (10). The NE role is related to maintaining the functional purpose of different protein districts. According to Mazoni (11), the ten most studied NEs are as follows:

Protein Interfaces (12);

Antibody and antigen interfaces (13);

Protein surface hot spots (14);

Interfaces between proteins and DNA or RNA;

Interfaces between proteins and ligands (15);

Interfaces between proteins and membranes;

Amino acid residues from catalytic sites (16);

Allosteric sites (or exosites);

Secondary structure elements (17);

Maximum distance reach (MDR) for detection of amino acid residue presence (18)

For all the NEs, we aim to define a unique corresponding MRND set. At the public site https://www.proteinnanoenvironments.cnptia.embrapa.br/ one may easily find which MRND sets correspond to the NEs we have studied thus far.

We hypothesize that each studied NE can be characterized by its most relevant nano environment descriptors (MRND), defined as the descriptor set that describes, with high specificity and broad coverage, the NE selected for analysis. These descriptors can be understood as the necessary product of a spatially clustered set of amino acids, not necessarily contiguous in the primary sequence, which gives rise to the general conditions suitable for composing a specific NE and thereby determining a protein district that has a very specific role/activity in the protein.

As indicated, this paper is the third in a series about the nano environments of secondary structure elements. Two previous papers separately considered α-helices (19) and β-sheets (20).

The initial step of the approach for the SSE NE analysis is a “position classification” that is based on dividing the data to be analysed into two sets: data collected from the region considered inside the secondary structure element expanse and data from the region considered as belonging to outside the secondary structure element expanse. The starting hypothesis is that these two datasets of physical-chemical and structural descriptors are significantly different. If this hypothesis is true, then there must exist a set of descriptors that characterizes the NE of interest in a unique way. In our earlier study on the NE of α-helices (19), it was demonstrated that the physical-chemical and structural descriptors of the STING database uniquely characterize that particular NE. Table 2 lists those descriptors for which the Kolmogorov‒Smirnov (KS) test yielded a p-value lower than 1e-6 in more than 80% of the cases of the secondary structures. We applied the tests to sets of proteins classified as either all-α or as α-helices in (α + β) + (α/β) (19).

The STING descriptors for which the Kolmogorov‒Smirnov (KS) test yielded p-values lower than 1e-6 in more than 80% of the cases of the α-helical structures

Table 2

The STING descriptors for which the Kolmogorov‒Smirnov (KS) test yielded a p-value lower than 1e-6 in more than 80% of the cases for the secondary structures. In proteins of the “all α-helices type” dataset, the descriptors “hydrogen bonds between the main chain and main chain”, “hydrogen bonds between the main chain and main chain in weighted neighbour averages by distance”, and the same weighted neighbour averages at the “surface” appear to be the most relevant.
α in all-α	[%]
The hydrogen bond between the main chain–main chain atoms	85.71
The hydrogen bond between the main chain–main chain atoms, weighted neighbour averages by distance	85.71
The hydrogen bond between the main chain–main chain atoms, weighted neighbour averages at the surface	82.85
α in α + β	[%]
The hydrogen bond between the main chain–main chain atoms, weighted neighbour averages by distance	84.09

As one can observe from Table 2, the main-chain-to-main-chain hydrogen bonding in the α-helices is of fundamental importance for the definition of the corresponding NE.

The same approach was used for characterizing the β-sheet NE. Using the KS test, we identified which descriptors describe the β-sheet NE in the following dataset cases: A) all-β structures; B) β-sheet in (α + β) + (α/β) type of proteins, and its variations: C) the parallel strands only; D) the anti-parallel strands only; and E) one strand only.

Table 3 lists those descriptors for which the Kolmogorov‒Smirnov (KS) test yielded a p-value lower than 1e-6 in more than 80% of the cases. In two cases, 1) β-sheet in (α + β) + (α/β) parallel only and 2) β-sheet in (α + β) + (α/β) with one strand only, we do not have descriptors with more than 80% of tests showing p-values lower than 1e-6. In these cases, we still present the best-ranked results. All the other descriptors showed a p-value lower than 1e-6, with coverage in under 60% of the cases (20).

For easier comprehension, we recommend reading the full description of all the STING NE descriptors mentioned; here, we explain only two of them. Amino acid residues that are far apart in the primary structure can make contact with each other in the three-dimensional structure of the protein. This characteristic is called the Cross Link Order, and the order is the number of such contacts made by a given residue. The Cross Presence Order counts all the amino acid residues that lie inside a spherical probe of radius 3.5 Å, 5 Å or 8.5 Å, even if these residues do not make contact with each other; here, the order is the number of amino acid residues within this sphere.
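To make the Cross Presence Order idea concrete, the sketch below counts the residues that fall inside a spherical probe centred on a chosen residue. It assumes that each residue is represented by its α-carbon coordinates and that the residue at the centre is not counted; both are simplifications made for illustration, and the exact STING definition (for example, any minimum sequence separation) may differ.

```python
import math


def residues_within_probe(ca_coords, centre_index, radius):
    """Count residues whose CA atom lies within `radius` (in Å) of the
    CA atom of the residue at `centre_index`, excluding that residue.

    `ca_coords` is a list of (x, y, z) tuples, one per residue.
    """
    centre = ca_coords[centre_index]
    count = 0
    for j, other in enumerate(ca_coords):
        if j != centre_index and math.dist(centre, other) <= radius:
            count += 1
    return count


# Toy coordinates (not taken from any real structure).
toy_chain = [(0, 0, 0), (3, 0, 0), (0, 4.9, 0), (0, 0, 8), (10, 10, 10)]
orders = {r: residues_within_probe(toy_chain, 0, r) for r in (3.5, 5.0, 8.5)}
```

Evaluating the same residue with the three probe radii mentioned above gives one count per radius, which grows or stays equal as the sphere widens.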

STING descriptors for which the Kolmogorov‒Smirnov (KS) test yielded p-values lower than 1e-6 in more than 70% of the cases for the β-sheet structures

Table 3

The STING descriptors for which the Kolmogorov‒Smirnov (KS) test yielded p-values lower than 1e-6 in more than 70% of the cases for the β-sheet structure. We can see that the hydrogen bond main chain-main chain contact descriptor, the structural descriptors Cross Presence Order and Cross Link Order, and the number of unused contacts descriptor are fundamental for the β-sheet NE.
β-sheets in all-β	[%]
The hydrogen bond between the main chain—main chain atoms, weighted neighbour averages by distance	91.66
Cross_Presence_Order_CA	87.50
Number_Unused_Contacts_WNADist	83.33
β-sheets in (α + β) + (α/β)
Cross_Pres_Order_CA	92.85
The hydrogen bond between the main chain—main chain atoms, weighted neighbour averages by distance	85.71
Cross_Link_Order_CA	85.71
β-sheet in (α + β) + (α/β) parallel only
Cross_Pres_Order_CA	73.91
β-sheet in (α + β) + (α/β) anti-parallel only
Cross_Pres_Order_CA	92.85
Cross_Link_Order_CA	85.71
The hydrogen bond between the main chain—main chain atoms, weighted neighbour averages by distance	82.14
β-sheet in (α + β) + (α/β) with one strand only
The hydrogen bond between the main chain—main chain atoms, weighted neighbour averages by distance	70.58

According to Table 3, the “main-chain to main-chain hydrogen bonding” (as previously noted for the α-helices) is also of fundamental importance for the definition of the β-sheet NE. In addition, for the case of β-sheets only, “Cross Presence” and “Cross Link” are also necessary, as well as the number of “Unused Contacts”.

The complete set of physical-chemical and structural descriptors necessary for in-depth SSE NE analysis was extracted from STING_RDB (21), comprising the specific SSE turns DataMart. The database STING_RDB has nearly 12 billion records organized in 98 key tables (data from January 17, 2022). Using an optional variety of parameterizations, STING is currently capable of calculating 1,307 types of protein structural descriptors. After eliminating the descriptors that are nonorthogonal to each other, the 67 most representative descriptors were selected from the STING_RDB to be used to fully describe the NE where the nucleation and maintenance of secondary structure elements (in this specific case, the turns) occurs, ultimately resulting in descriptors from ten different classes (22) (Table 4). In the Contacts class, the “hb” acronym means hydrogen bond; “m” means the main chain; “s” means side chain; “w” means water; “ch” means charge. Descriptors in rows #2–10 refer to “hydrogen bonds” (hb) established between the “main-chain” atoms (Lines 2, 3, and 4); “main chain and side chain atoms” (Lines 5, 6, and 7), and “side chains” atoms (Lines 8, 9, and 10). The interacting atoms belong to two different amino acid residues. Cases are listed for both “no water” molecule intervention and one water or two water molecules included in the hb formation (w or ww). Descriptors #16–43 refer to the same contact descriptors as above; however, they are weighted by distances to surrounding neighbours (Lines 16–29) or weighted by distances measured at the surface (Lines 30–43). Descriptors #44–45 refer to the molecular density at the protein interface (Line 44) and internal protein structure density (Line 45). Descriptors #46–48 refer to the electrostatic potential at the α-carbon (Line 46), average value over residue atoms (Line 47), and EP value at the last heavy atom (Line 48). Descriptor #49 refers to hydrophobicity defined by using the Kyte-Doolittle scale. 
Descriptors #50–55 refer to electrostatic potential descriptors, weighted by the neighbouring residue distance and surface. Descriptors #56–57 refer to the number of clashes among residues and the percent of clashes, respectively. Descriptors #58–64 refer to structural parameters: the cross-link (Line 58) and cross-presence (Line 59) order, the dihedral angles CHI (Lines 60–63), and the temperature factor at the α-carbon (Line 64). Descriptor #65 refers to the number of unused contacts, i.e., the difference between the maximum number of contacts available (defined as the maximum number of contacts that this type of amino acid was found, in the whole PDB, to be able to establish) and the number of actual contacts established. Finally, descriptors #66–67 refer to the number of unused contacts, weighted by the neighbouring residue distance (Line 66) and surface (Line 67). Classes terminated by WNA mean Weighted Neighbour Averages, a value that can be defined by Distance (WNADist) or at the corresponding Surface (WNASurf). Turn definitions were extracted from the DSSP (23) and Stride (24) algorithms.

Regarding the two algorithms used for the classification of the residues within the turns, it is appropriate to briefly describe them here. DSSP is an algorithm that works by recognizing the patterns of hydrogen bonds and geometric features extracted from the coordinates of the atoms that make up each amino acid. The Stride algorithm uses, in addition to the recognition of hydrogen bond patterns, information about the dihedral angles. We have chosen to work exclusively with data provided by Stride, as it is more “consistent”. For example, DSSP and Stride treat the very same loop from PDB 1a4b, chain A, residues 71–79, differently (Fig. 2). In the DSSP definition, some of the residues in the region 71–79 are classified as “turn”, others as “bend” or “coil”. However, in the Stride definition, all of the residues are classified as “turn”. As of April 11, 2022, out of a total of 179,288 PDB structures, Stride identifies “turns” in 172,007 structures and DSSP identifies “turns” in 173,301 structures. The resulting and corresponding datasets used in this work are available at http://www.cbi.cnptia.embrapa.br/m318309/TurnDataset/.

Figure 2 Definition by the DSSP (A) and Stride (B) algorithms for the secondary structure element composed of residues covering the region 71–79 (in yellow) of the PDB “1a4b”, chain “A”.

Table 4

The STING descriptors used for the statistical analysis of the turn NE. IFR means interface forming residues. CA means α carbon and LHA means the last heavy atom. KDI refers to the Kyte-Doolittle hydrophobicity scale. Classes terminated by WNA mean Weighted Neighbour Averages, a value that can be defined by Distance (WNADist) or at the corresponding Surface (WNASurf).
Descriptor Name	#	Descriptor Name	#
Accessible_Protein Surface_in_Isolation	1	hb-mws_WNASurf	34
hb-mm	2	hb-mwws_WNASurf	35
hb-mwm	3	hb-ss_WNASurf	36
hb-mwwm	4	hb-sws_WNASurf	37
hb-ms	5	hb-swws_WNASurf	38
hb-mws	6	hydrophobic_WNASurf	39
hb-mwws	7	aromatic_WNASurf	40
hb-ss	8	ch_attractive_WNASurf	41
hb-sws	9	ch_repulsive_WNASurf	42
hb-swws	10	disulfide_WNASurf	43
hydrophobic	11	Density at IFR_CA_3	44
aromatic	12	Density at Internal_CA_3	45
ch_attractive	13	Electrostatic_Potential_at_CA	46
ch_repulsive	14	Electrostatic_Potential_Average	47
disulfide	15	Electrostatic_Potential_at_LHA	48
hb-mm_WNADist	16	Hydrophobicity_KDI	49
hb-mwm_WNADist	17	Electrostatic_Potential_at_CA_WNADist	50
hb-mwwm_WNADist	18	Electrostatic_Potential_Average_WNADist	51
hb-ms_WNADist	19	Electrostatic_Potential_at_LHA_WNADist	52
hb-mws_WNADist	20	Electrostatic_Potential_at_CA_WNASurf	53
hb-mwws_WNADist	21	Electrostatic_Potential_Average_WNASurf	54
hb-ss_WNADist	22	Electrostatic_Potential_at_LHA_WNASurf	55
hb-sws_WNADist	23	Number of Clash	56
hb-swws_WNADist	24	Percent of Clash	57
hydrophobic_WNADist	25	Cross_Link_Order_CA	58
aromatic_WNADist	26	Cross_Pres_Order_CA	59
ch_attractive_WNADist	27	Dihedral_Chi1	60
ch_repulsive_WNADist	28	Dihedral_Chi2	61
disulfide_WNADist	29	Dihedral_Chi3	62
hb-mm_WNASurf	30	Dihedral_Chi4	63
hb-mwm_WNASurf	31	Temperature_Factor_CA	64
hb-mwwm_WNASurf	32	Number_Unused_Contact	65
hb-ms_WNASurf	33	Number_Unused_Contact_WNADist	66
hb-mws_WNASurf	34	Number_Unused_Contact_WNASurf	67

Considering the 179,288 PDB files we had available in the initial phase of this analysis, we identified 172,007 PDB files containing a total of 3,702,448 turns. However, these are absolute values, which include turns that are identical sequence-wise. To eliminate sequence redundancy, we first extracted from the datamart the sequences of turns of sizes 3 to 8 and then, using the cd-hit software (25), eliminated the turns with 100% sequence redundancy, that is, those with 100% identical sequences. As a result, we obtained 137,739 PDB files containing 1,707,238 turns. Those are the structures we included in the subsequent analysis. Then, the turn structures were aligned by position and length. Additionally, the analysis included 32 AA residues before the NE extension of the studied turn and 32 AA residues after the C-terminus of the aligned turns. In fact, in this particular work, as loops are generally much shorter (in terms of the number of constituent amino acid residues) than α-helices and β-sheets, we considered two other sizes for flanking regions in addition to the classic size of 32 amino acids: the 16- and 8-amino acid flanking regions. In cases where the provided structure did not have 32, 16 or 8 AA residues before or after the turn, we filled the missing positions with gaps.
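The two data-preparation steps described above (removing sequence-identical turns and extracting fixed-size flanking regions padded with gaps) can be sketched as follows. This is only an illustration under stated assumptions: the exact-match filter stands in for the cd-hit run at 100% identity and is not equivalent to it in general, and the function names, the gap character "-" and the toy chain are all hypothetical.

```python
GAP = "-"  # hypothetical gap symbol for missing flanking positions


def unique_turn_sequences(turn_sequences):
    """Drop exact duplicate turn sequences, keeping first-seen order.

    A simple stand-in for the 100% redundancy filter; cd-hit itself is
    not called here.
    """
    return list(dict.fromkeys(turn_sequences))


def turn_with_flanks(chain, start, length, flank):
    """Return (before, turn, after) for a turn starting at 0-based `start`.

    Each flanking region has exactly `flank` positions; positions that
    fall outside the chain are filled with GAP.
    """
    before = chain[max(0, start - flank):start]
    turn = chain[start:start + length]
    after = chain[start + length:start + length + flank]
    return before.rjust(flank, GAP), turn, after.ljust(flank, GAP)


# Toy example: an 11-residue chain with a 3-residue turn at positions 5-7.
before, turn, after = turn_with_flanks("MKTAYGDPLNE", start=5, length=3, flank=8)
```

Because every flanking region has the same width after padding, turns of equal length can be stacked position by position, which is what the alignment by position and length requires.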

To obtain more accurate results from the descriptor analysis, we assumed that the general type of protein structure (all-α or all-β type proteins) helps to single out a set of descriptors of a specific kind in each protein type. This is achieved by simply eliminating diverse “signals” arising from the presence of one or two SSEs found in that particular protein. Consequently, we made three subdatasets: turns in α + β, turns in all-α, and turns in all-β structures. In previous work, we identified 4,376 chains placed in the type of proteins named “all-α protein structures” and 50,803 chains in the “all-β structures” type (20). Random coiled structures were not particularly useful for our study because there were only 325 such structures in the PDB as of April 2022. Figure 3 shows the distribution of the number of turns grouped by length in these four cases on a log10 scale.

Figure 3 Distribution of the number of turns grouped by the length in α + β structures, in all-α structures, in all-β structures, and random coiled structures using the log10 scale. Turns formed by 3 AAs are most common in all of the cases.

In the α-helices, the average size of the helix element is 11.9 amino acids, and among all the PDB structures containing α-helices we found lengths ranging from 5 to 109 AAs (19). In the β-sheet case, the average size is 7.58 AAs, although we have identified β-sheets with up to 36 AAs in length (20). In both cases, we used a flanking region of 32 AA before and 32 AA after the secondary structure element. However, in the case of the turn, considering the range of lengths from 3 to 8, the average size of the turns is 4.27 AAs (Fig. 4). For that reason, we reduced the flanking region first to 16 AA and then to only 8 AA. Statistical analysis of the amino acid descriptors in both regions (inside and outside the SSE expanse) was performed by using the R programming language, and a one- or two-sample Kolmogorov‒Smirnov test was applied (26). The Kolmogorov‒Smirnov test is used to decide whether a sample comes from a population with a specific distribution. Unlike Student's or Welch's t-test, the KS test makes no assumptions about the distribution of the data (27). In this work, we applied the test to determine whether the values of the physical-chemical and structural descriptors of the AA residues present in the turns were significantly different from the values found in the AA residues around them. After aligning the turns by position and length (starting the alignment on the N-terminus and finishing on the C-terminus of turns, as if all turns with the same length were “stacked”), we observed the values of the selected descriptors for each region (before, inside, and after the turns). As in our previous work (19) (20), we used the threshold of 1e-6 for the p-value to determine whether the values of the descriptors inside and outside of the turn expanse were significantly different. 
If the answer is positive, the descriptor is clearly important to the process of construction and maintenance of that particular secondary structure element.
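The inside-versus-outside comparison can be illustrated with a small two-sample Kolmogorov‒Smirnov test. The analysis in this work was carried out in R; the sketch below is a standard-library Python version written only for illustration, and it uses the common asymptotic approximation for the p-value (with a small-sample correction to the scaling factor), which is an assumption of this example and may differ numerically from the R output. The function names and the toy samples are hypothetical.

```python
import math

P_THRESHOLD = 1e-6  # significance threshold used in the text


def kolmogorov_tail(lam, terms=100):
    """Asymptotic tail probability Q(lam) of the Kolmogorov distribution."""
    if lam < 0.2:  # Q is indistinguishable from 1 in this range
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
    return min(1.0, max(0.0, 2.0 * total))


def ks_two_sample(sample_a, sample_b):
    """Return (D, approximate p-value) for two samples of numbers."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk both sorted samples and track the largest gap between the
    # two empirical cumulative distribution functions.
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    root = math.sqrt(n * m / (n + m))
    lam = (root + 0.12 + 0.11 / root) * d
    return d, kolmogorov_tail(lam)


# Toy descriptor values (not real data): inside vs outside the turn.
inside = [float(v) for v in range(100)]
outside = [float(v) for v in range(1000, 1100)]
d_stat, p_value = ks_two_sample(inside, outside)
is_significant = p_value < P_THRESHOLD
```

With completely separated samples such as these, the statistic D reaches its maximum of 1 and the descriptor would be flagged as significant; with identical samples D is 0 and the p-value is 1.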

Figure 4 Maximum, minimum and average number of AAs per secondary structure element.

Although the Kolmogorov‒Smirnov test (a univariate test) is suitable for selecting important descriptors one at a time, the combined statistical evaluation of two or more descriptors requires multivariate tests. For this purpose, the MANOVA test (a multivariate test) achieves greater precision and thus matches the desired search for the NE’s unique characteristics. For this particular analysis, it was necessary to first prepare the data for the MANOVA tests. The initial step was to apply the Shapiro test to verify whether the selected descriptor values were normally distributed (23), using only descriptors that were found to have a p-value < 0.05. This is important because the statistical tests used in MANOVA rely on the assumption of normality to make inferences about the population from the sample data. Without this assumption, the results of MANOVA may not be reliable or accurate, potentially leading to incorrect conclusions about the relationships between variables (28). Such a procedure differs from the one applied for the Kolmogorov‒Smirnov test, which is a nonparametric test and therefore does not use the values themselves but their order; because the values are not used, there is no need for the data to be normally distributed. The second step was the correlation test, in which highly correlated variables were removed using the Kendall and Spearman correlation coefficients. Although the presence of these variables increases the sample universe, they introduce white noise in the statistical analysis and “pull” the data to one side, and therefore must be removed (29).
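The correlation-filtering step can be sketched as below, using the Spearman coefficient only (Kendall is omitted for brevity). The greedy "keep the first, drop later descriptors that correlate with it" strategy and the 0.9 cut-off are assumptions made for this illustration; the text does not state the threshold or the removal order that were actually used.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no rank information
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def drop_correlated(columns, threshold=0.9):
    """Keep each descriptor only if |rho| < threshold against all kept ones.

    `columns` maps a descriptor name to its list of values.
    """
    kept = []
    for name in columns:
        if all(abs(spearman(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy values: "b" is a monotone function of "a" and is therefore dropped.
toy = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 11], "c": [3, 1, 4, 1, 5]}
selected = drop_correlated(toy)
```

Because the Spearman coefficient works on ranks, any monotone relationship between two descriptors is detected, not only a linear one.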

We analysed the turns’ NE descriptors calculated from the three distinct subsets of protein types: turns in α + β structures, turns in all-α structures only, and turns in all-β structures only. We did not analyse the particular case of those proteins characterized as “random coil proteins” because the universe encompassing only such types of structures is significantly smaller and would not yield statistically meaningful results. (We considered a total of 137,739 eligible PDB structures, against only 325 random coil structures [as of April 2022].) Previous work on a variety of protein districts clearly demonstrated that the NE, in general, is better described when using the MANOVA test on corresponding descriptors than when the univariate test is applied (17, 18). To compare the NE of turns with the α-helix and β-sheet NEs, in this work, we first applied the same type of tests to define the list of most relevant NE descriptors (MRND) and then compared those to the α-helix and β-sheet MRNDs, as explained below.

In Fig. 5, we present the sequence logo showing the most frequently found amino acid residues per position along the turn expanse. This is relevant, as it shows the 4 most frequently found residues at selected positions (3–8) in the turns: G, D, P, and L. Figure 5 clearly shows that proline is the most commonly found residue at position 2 and, for longer turns, at positions 5 and 6. The presence of aspartic acid is interesting, as this negatively charged residue confers on turns a special function: promoting charged interactions with other functional protein domains.

Figure 5 Sequence logo showing the most frequently found amino acid residues per position along the turn extension.

Univariate test for the turns in the α + β dataset

A univariate test for the dataset of turns, selected as described above from the whole PDB, was set up to consider the 67 STING descriptors (Table 4) and the 6 different lengths of turns (Fig. 3), giving 402 (67 × 6) possible tests. Of the 402 completed tests, 357 showed p-values < 1e-6 (88.8%). Out of the 67 total descriptors, 51 (76.2%) appeared in all of the tests (for all the sizes of turns examined) with corresponding p-values less than 1e-6. This effectively reduced the list of descriptors to be considered in the final analysis, and the new list is presented in Table 5. In Fig. 6, we present the differences between the results of the univariate tests using the STING descriptors for the turns versus the α-helices (full line), the turns versus the β-sheets (dashed line) and the β-sheets versus the α-helices (two dashed lines).
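The bookkeeping behind these counts (one test per descriptor and turn length, the share of tests below the threshold, and the descriptors that pass at every length) can be sketched as follows. The data structure, the function name and the toy p-values are assumptions for illustration; only the 67 × 6 = 402 grid, the 1e-6 threshold and the 357-of-402 figure come from the text.

```python
P_THRESHOLD = 1e-6
TURN_LENGTHS = (3, 4, 5, 6, 7, 8)


def summarise_tests(p_values):
    """Summarise a grid of KS tests.

    `p_values` maps (descriptor, turn_length) -> p-value. Returns the
    number of tests, the number below the threshold, that share in
    percent, and the descriptors below the threshold at every length.
    """
    total = len(p_values)
    significant = sum(1 for p in p_values.values() if p < P_THRESHOLD)
    by_descriptor = {}
    for (descriptor, _length), p in p_values.items():
        by_descriptor.setdefault(descriptor, []).append(p < P_THRESHOLD)
    always = sorted(d for d, flags in by_descriptor.items() if all(flags))
    return total, significant, 100.0 * significant / total, always


# Toy grid with two descriptors (the p-values are made up).
toy_p = {}
for length in TURN_LENGTHS:
    toy_p[("hb-mm", length)] = 1e-9
    toy_p[("disulfide", length)] = 1e-9 if length <= 5 else 0.2
total, significant, percent, always = summarise_tests(toy_p)

# Arithmetic quoted in the text: 67 descriptors x 6 lengths, 357 of 402.
n_tests = 67 * len(TURN_LENGTHS)
share_alpha_beta = round(100.0 * 357 / n_tests, 1)
```

In the toy grid, only the descriptor that passes at all six lengths is returned in the final list, mirroring how a descriptor enters Table 5.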

In Fig. 6, one may easily notice general differences (line trends), which, from now on, we will refer to as TDt/α, TDt/β and TDα/β (Trend Difference: turn/α-helix, turn/β-sheet and α-helix/β-sheet). The “difference” here is the percentage of cases, for a given descriptor, in which the p-value was found to be less than 1e-6 for the NE of the turn, minus the corresponding percentage for the NE of the α-helices (and likewise for the remaining combinations of TDs). As shown in Fig. 6, in the case of turns versus α-helices, there are 29 descriptors with differences ranging between 50% and 75% and 6 descriptors with a difference of more than 75%. The latter are those such as aromatic contacts, charge repulsive contacts, hydrogen bond main chain – water – side chain, hydrogen bond main chain – water – water – main chain, hydrogen bond main chain – water – water – side chain, hydrophobic WNA by distance, and hydrophobic WNA at the surface.

In the case of the turns versus the β-sheets, there are 15 descriptors with a difference between 50% and 75% and no descriptor with a difference greater than 75%.

Comparatively, for β-sheets versus α-helices, there are only three descriptors with a difference between 50% and 75% and none with a difference greater than 75%.
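The trend difference (TD) and the two bands used above can be expressed compactly. In this sketch the per-descriptor coverage values are invented toy numbers, the function names are hypothetical, and binning on the absolute value of the difference is an assumption, since the text does not say whether the sign is taken into account.

```python
def trend_difference(coverage_a, coverage_b):
    """Per-descriptor difference in percent of tests with p < 1e-6.

    Both arguments map descriptor name -> percentage (0..100); only
    descriptors present in both are compared.
    """
    return {
        name: coverage_a[name] - coverage_b[name]
        for name in coverage_a
        if name in coverage_b
    }


def bin_differences(differences):
    """Group descriptors into the 50-75% and >75% bands by |difference|."""
    bins = {"50-75": [], ">75": []}
    for name, delta in differences.items():
        size = abs(delta)
        if size > 75:
            bins[">75"].append(name)
        elif size >= 50:
            bins["50-75"].append(name)
    return bins


# Toy coverage values (percent), not results from this work.
turn_cov = {"aromatic": 100.0, "hb-mm": 100.0, "disulfide": 60.0}
helix_cov = {"aromatic": 10.0, "hb-mm": 85.0, "disulfide": 5.0}
td_turn_helix = trend_difference(turn_cov, helix_cov)
bands = bin_differences(td_turn_helix)
```

Counting the entries in each band for the three pairings (turn/α-helix, turn/β-sheet, α-helix/β-sheet) gives the kind of summary reported in the three paragraphs above.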

Decreasing the flanking region size from 32 to only 16 AAs, we obtained essentially the same results: of the 402 tests, 354 showed p-values < 1e-6 (88.1%), and 50 descriptors (74.6%) appeared in all of the tests with p-values less than 1e-6. However, when we decreased the flanking region size to only 8 AAs before and 8 AAs after the turns, we found what we consider “better” results, with 358 tests out of a total of 402 (89.1%) showing p-values < 1e-6. The set of descriptors that appeared in all of the positive tests (those resulting in a p-value < 1e-6) is the same as that shown in Table 5.

Although different in intensity and number, the general trend for the turns minus the α-helices for the studied variables is similar to that for the turns minus the β-sheets. Both of these cases are also somewhat similar, as far as the line trend is concerned, to the case of α-helix minus β-sheet. We will discuss the possible implications of this finding, shown in Fig. 6, in the Discussion section.

In our Dictionary of Internal Proteins NE (https://www.proteinnanoenvironments.cnptia.embrapa.br) in the SSE chapter, we demonstrate comparative plots for three SSE NEs with respect to MRND.

Figure 6 The trend in variation of percent-wise participation of a selected descriptor appearing in the cases where the p-value was less than 1e-6 for the univariate tests. Here, we calculated and presented the difference between the cases: the turns versus the α-helices (shown in solid line), the turns versus the β-sheets (shown in dashed line) and the β-sheets minus the α-helices (shown in point-dash line) NE. δ indicates the difference in percent-wise participation in the cases with p-values < 1e-6: the turns versus the α-helices, the turns versus the β-sheets, and the β-sheets versus the α-helices.

The univariate test for the turns in the all-α dataset

The turns found in protein types defined as all-α structures have six different lengths, totalling 402 possible tests (67 selected descriptors × 6 possible lengths). In 309 cases, the p-values were < 1e-6 (76.9% of the total). Forty-five descriptors (67.2%) appeared in all the tests (lengths 3, 4, 5, 6, 7, 8) with p-values less than 1e-6 (Table 5).

For the flanking region with 16 AA before and 16 AA after the turns, 299 of the 402 tests (74.4%) had p-values < 1e-6. Forty-one descriptors (65.7%) appeared in all the tests (lengths 3, 4, 5, 6, 7, 8) with p-values less than 1e-6 (Table 5).

Considering 8 AA before and 8 AA after the turns, 295 of the 402 tests (73.1%) had p-values < 1e-6, and 41 of the 67 descriptors (61.2%) had p-values less than 1e-6 in all the tests (lengths 3, 4, 5, 6, 7, 8) (Table 5).

The univariate test for the turns in the all β-sheet dataset

The turns in the dataset named all-β structures had 6 different lengths, totalling 402 tests. Of the 402 tests, 284 had a p-value < 1e^− 6, representing 70.6% of the total. In this case, 37 descriptors (55.2%) reached all tests (lengths 3, 4, 5, 6, 7, 8) with p-values less than 1e-6 (Table 5).

Similar to the previous cases, we limited the flanking region to 16 AA before and 16 AA after the turns. In this case, 282 of the 402 tests (70.1%) had p-values < 1e-6, and 37 descriptors reached all tests (lengths 3, 4, 5, 6, 7, 8) with p-values less than 1e-6 (Table 5).

Using 8 AA before and 8 AA after the turns, 284 of the 402 tests (70.6%) had a p-value < 1e-6. Table 5 shows which descriptors had a p-value less than 1e-6 in all tests (lengths 3, 4, 5, 6, 7, 8).
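The univariate tallies reported above (one test per descriptor and turn length, then the share of tests with p < 1e-6) can be illustrated with a minimal sketch. It is not the pipeline used in this work: the data are synthetic, the descriptor names and the helper functions `ks_two_sample` and `percent_significant` are hypothetical, the comparison of values inside versus outside turns is an assumption, and the p-value uses the asymptotic Kolmogorov series, so it is only approximate for small samples.

```python
import math
import random

ALPHA = 1e-6  # significance threshold used throughout the text


def ks_two_sample(a, b):
    """Two-sample Kolmogorov-Smirnov test: return (D, approximate p-value)."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Step both empirical CDFs past every value equal to x (handles ties).
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.05:  # the series below converges too slowly here; p is ~1
        return d, 1.0
    # Asymptotic series: p = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2)
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


def percent_significant(p_values, alpha=ALPHA):
    """Percentage of tests whose p-value falls below alpha."""
    return 100.0 * sum(p < alpha for p in p_values) / len(p_values)


# Synthetic stand-in data: one descriptor whose distribution shifts inside
# turns and one that does not, tested once per turn length.
random.seed(0)
turn_lengths = [3, 4, 5, 6, 7, 8]
shift = {"shifted_descriptor": 1.0, "unshifted_descriptor": 0.0}  # hypothetical
always_significant = []
for name, delta in shift.items():
    p_values = []
    for _ in turn_lengths:
        inside = [random.gauss(delta, 1.0) for _ in range(500)]
        outside = [random.gauss(0.0, 1.0) for _ in range(500)]
        p_values.append(ks_two_sample(inside, outside)[1])
    if all(p < ALPHA for p in p_values):
        always_significant.append(name)

print(always_significant)
print(round(100.0 * 309 / 402, 1))  # the all-α tally quoted in the text: 76.9
```

A descriptor enters Table 5 under the same rule as `always_significant` here: its p-value must be below the threshold for every one of the six turn lengths.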

Table 5

The STING descriptors with 100% of conducted univariate Kolmogorov‒Smirnov tests resulting in p-values < 1e-6 in the “α + β”, the “all-α”, and the “all-β” databases, and for flanking region sizes of 32, 16, and 8 AA. The white-filled boxes show the missing position for the selected descriptor (line) within a specific protein type and flanking region size (column).
Dataset	turns in the α + β			turns in the all-α			turns in the all β-sheet
Flanking region	32 AA	16 AA	8 AA	32 AA	16 AA	8 AA	32 AA	16 AA	8 AA
1. Accessible_Surface_in_Isolation
2. aromatic
3. aromatic_WNADist
4. aromatic_WNASurf
5. ch_attractive_WNADist
6. Ch_attractive_WNASurf
7. ch_repulsive_WNADist
8. ch_repulsive_WNASurf
9. Clash
10. Cross_Link_Order_CA
11. Cross_Pres_Order_CA
12. Dihedral_Chi1
13. Dihedral_Chi2
14. Dihedral_Chi3
15. Electrostatic_Potential_at_CA
16. Electrostatic_Potential_@_CA_WNADist
17. Electrostatic_Potential_@_CA_WNASurf
18. Electrostatic_Potential_at_LHA
19. Electrostatic_Potential_@_LHA_WNADist
20. Electrostatic_Potential_at_LHA_WNASurf
21. Electrostatic_Potential_Average
22. Electrostatic_Potent._Average_WNADist
23. Electrostatic_Potent._Average_WNASurf
24. hbmm
25. hbmm_WNADist
26. hbmm_WNASurf
27. hbms
28. hbms_WNADist
29. hbms_WNASurf
30. hbmwm
31. hbmwm_WNADist
32. hbmwm_WNASurf
33. hbmws
34. hbmws_WNADist
35. hbmws_WNASurf
36. hbmwwm
37. hbmwwm_WNADist
38. hbmwwm_WNASurf
39. hbmwws
40. hbmwws_WNADist
41. hbmwws_WNASurf
42. hbss
43. hbss_WNADist
44. hbss_WNASurf
45. hydrophobic
46. hydrophobic_WNADist
47. hydrophobic_WNASurf
48. Hydrophobicity_KDI
49. Density_IFR_CA_3
50. Density_Internal_CA_3
51. Number_Unused_Contact
52. Number_Unused_Contact_WNADist
53. Number_Unused_Contact_WNASurf
54. Percent_of_Space_Clash
55. Temperature_Factor_CA

According to Table 5, which shows the results of the univariate KS tests, the MRNDs for the turns in the α + β, the all-α, and the all-β datasets, for flanking regions of 32 AA, 16 AA and 8 AA, form a set composed of the following classes of descriptors: Surface Accessibility (Accessible_Surface_in_Isolation), Contacts (aromatic, charge attractive, charge repulsive, and hydrogen bonds), Space Clash (Clash, Percent_of_Space_Clash), Structural (Cross Link and Cross Presence, Dihedral Angles, Temperature_Factor), Physical Chemical (Electrostatic Potential, Hydrophobicity), Density (IFR and Internal) and Unused Contacts (Number_Unused_Contact).

MANOVA

Previous work demonstrated that the Kolmogorov‒Smirnov test is not the best way to analyse the NE of secondary structure elements (19) (20). Although we use this method to select some of the best descriptors for characterizing each secondary structure element type (as a first approximation), its results are not always satisfactory, usually having low coverage. As shown in Table 5, in the best case, 55 descriptors (82.1% of the 67 descriptors used for this analysis) appear in 100% of the tests with a p-value < 1e-6 (the whole PDB dataset). That is the case for turns in the “α + β” type of proteins. In the case of turns in all-α structures, we have 46 descriptors with a p-value < 1e-6 in 100% of the tests. Finally, in the case of turns in all-β structures, we have 37 descriptors with a p-value < 1e-6 in 100% of the tests.

As expected, this result corroborates the previous observation shown in Fig. 6 that turns are much more different (in terms of necessary descriptors to differentiate one from the other) from α-helices and less so from β-sheets. Consequently, one is required to find more descriptors for overall good coverage in classifying turns in α-helices and in (α + β) + (α/β), while much fewer descriptors are necessary to distinguish turns from β-sheets in (α + β) + (α/β).

Multivariate tests for the flanking region of 32 AAs

MANOVA tests for the same descriptor set (as shown in Table 4) are described below. Figure 7 shows the results for the four test statistics provided by the MANOVA algorithm: Pillai, Wilks, Hotelling-Lawley, and Roy. These tests are available in the R manova function (26) and were employed in all the analytical procedures of this part of our work.

Pillai’s trace is a test statistic whose value ranges from 0 to 1. Increasing values indicate that the effects contribute more to the model; the null hypothesis is rejected for large values (30) (31).

In Wilks' lambda test, the null hypothesis is rejected when Wilks' lambda is close to zero, in combination with a small p-value. Lambda measures the proportion of variance in the dependent variables that is not explained by differences in the levels of the independent variable. A value of zero means that no variance is left unexplained by the independent variable. Therefore, the closer the statistic is to zero, the more the variable in question contributes to the model (32).

In the Hotelling-Lawley test, a multivariate generalization of Hotelling's T-squared test, the objective is to calculate the value of the statistic and compare it to a tabulated critical value; if the calculated value is greater than the tabulated one, the null hypothesis must be rejected (33).

Roy's largest root is a positive-valued multivariate test statistic. Increasing values of the statistic indicate increasing contributions of the effects to the model in question. The null hypothesis must be rejected for large values (34).
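The four statistics described above can be written compactly in terms of the eigenvalues of a single matrix. The notation below (H, E, λ_i, s) is introduced here only for illustration and follows the standard textbook definitions; it is not taken from the analysis reported in this work.

```latex
% H: hypothesis (between-group) SSCP matrix; E: error (within-group) SSCP matrix
% \lambda_1 \ge \dots \ge \lambda_s > 0: nonzero eigenvalues of E^{-1}H
\begin{align*}
\text{Pillai's trace:} \quad & V = \operatorname{tr}\!\left[H(H+E)^{-1}\right] = \sum_{i=1}^{s} \frac{\lambda_i}{1+\lambda_i} \\
\text{Wilks' lambda:} \quad & \Lambda = \frac{\det E}{\det(H+E)} = \prod_{i=1}^{s} \frac{1}{1+\lambda_i} \\
\text{Hotelling-Lawley trace:} \quad & U = \operatorname{tr}\!\left[HE^{-1}\right] = \sum_{i=1}^{s} \lambda_i \\
\text{Roy's largest root:} \quad & \theta = \lambda_1
\end{align*}
```

Each term of V lies between 0 and 1, so V is bounded by 1 when only one eigenvalue is nonzero (s = 1), as in a comparison of two groups. As the separation between groups grows, Λ approaches 0 while V, U and θ increase, which matches the rejection rules stated above. Some software reports Roy's statistic as λ_1/(1 + λ_1) rather than λ_1.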

After eliminating the correlated (nonorthogonal) descriptors and removing the data that did not follow a normal distribution, we were able to execute 6 tests for turns in the α + β dataset, 4 tests for turns in the all-α dataset, and 2 tests for turns in the all-β dataset. Table 6 presents the frequency of each descriptor in the MANOVA tests for the turns in the α + β, all-α, and all-β datasets, counted per unique turn size (there are six such sizes, as previously described).
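The text does not state how correlated descriptors were identified, so the following is only a sketch of one common way to carry out such a filtering step. It assumes a Pearson correlation criterion, an arbitrary threshold of 0.9, and a greedy pass in column order; the descriptor names and values are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, threshold=0.9):
    """Greedily keep descriptors whose |r| with every kept one is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor columns (one value per residue).
columns = {
    "desc_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "desc_b": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly 2 * desc_a, so redundant
    "desc_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(drop_correlated(columns))  # ['desc_a', 'desc_c']
```

Because the pass is greedy, the descriptor that is kept from a correlated pair depends on the column order; a different tie-breaking rule would be an equally valid assumption.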

Multivariate tests for the flanking region of 16 AAs

As mentioned above, we also limited the flanking region to 16 AAs before and 16 AAs after the turns, and subsequently to 8 AAs before and 8 AAs after the turns, and applied the MANOVA tests under these conditions as well. Using the flanking region of 16 AA, 83.3% of the tests had p-values < 1e-6 in the α + β structures, 66.7% in the all-α structures, and 100% in the all-β structures (Fig. 7). Table 6 gives the frequency of each descriptor in the MANOVA tests for the three datasets (the turns in the α + β, all-α, and all-β datasets).

Multivariate tests for the flanking region of 8 AAs

For the flanking region of 8 AA, we obtained 83.3% of the tests with p-values < 1e-6 in the α + β structures, 50% with p-values < 1e-6 in the all-α structures, and no results for the all-β structures (Fig. 7). The lack of results in the all-β dataset is a consequence of no tests remaining after eliminating the correlated descriptors and removing the data that were not normally distributed. Table 6 gives the frequency of each descriptor used in the MANOVA tests for the three datasets: the turns in the α + β, the all-α, and the all-β datasets. There are six turn sizes (3, 4, 5, 6, 7, 8), and the MANOVA test was performed for each size. For example, for turns in the α + β proteins, the frequency of each descriptor refers to how many tests it participated in, with the maximum possible number being six. For the flanking region of 32 AAs, the most frequently encountered descriptors were ch_attractive, Dihedral_Chi3, and hbswws_WNADist; each appeared in three tests. For the flanking region of 16 AAs, the most frequent descriptor was aromatic_WNADist, which appeared in all six tests. Finally, for the flanking region of 8 AAs, the most frequent descriptors were aromatic_WNADist and aromatic_WNASurf, each appearing in all six tests.
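The descriptor frequencies discussed above are simple counts over the per-size MANOVA tests. A minimal sketch of that bookkeeping follows; the descriptor sets assigned to each turn size are invented for illustration and do not reproduce Table 6.

```python
from collections import Counter

# Hypothetical descriptor sets entering the MANOVA test for each turn size.
tests_by_turn_size = {
    3: ["aromatic_WNADist", "ch_attractive", "Dihedral_Chi1"],
    4: ["aromatic_WNADist", "ch_attractive"],
    5: ["aromatic_WNADist", "Dihedral_Chi1", "hydrophobic_WNASurf"],
    6: ["aromatic_WNADist", "ch_attractive"],
    7: ["aromatic_WNADist"],
    8: ["aromatic_WNADist", "hydrophobic_WNASurf"],
}

# One count per test in which the descriptor participated (maximum: six).
frequency = Counter(
    name for used in tests_by_turn_size.values() for name in used
)

for name, count in frequency.most_common():
    print(f"{name}\t{count}")
```

With one test per turn size, no descriptor can reach a count above six, which is the upper bound stated in the caption of Table 6.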

Figure 7 Results of the MANOVA tests applied to the turn datasets for the flanking regions of 32, 16 and 8 AA. In the case of the α + β proteins, the best results were for the flanking regions of 16 AA and 8 AA, with 83.3% of the tests having p-values below 1e-6. In the case of the all-α proteins, 100% of the tests had p-values below 1e-6 for the flanking region of 32 AA, and in the case of the all-β proteins, 100% of the tests had p-values below 1e-6 for the 32 AA and 16 AA flanking regions. There are no results for the all-β proteins with the 8 AA flanking region because, in this case, no test could be performed after the data preparation phase.

Table 6

The frequency of descriptor appearance in the MANOVA tests. The numbers indicate how many times the specific descriptor was used in the MANOVA tests. The analysis considered 6 turn sizes (3, 4, 5, 6, 7, 8), and the MANOVA test was performed once for each size. Consequently, the maximum number of tests in which one descriptor can appear is six. We then grouped the listed descriptors into five general classes: A) contacts, B) structural, C) electrostatic potential, D) hydrophobicity and E) unused contacts.
Descriptors	Flanking regions size (protein class type)
	32 AA			16 AA			8 AA
	α + β	all-α	all-β	α + β	all-α	all-β	α + β	all-α	all-β
aromatic	1	2		2			3
ch_attractive	4	1		6			5
ch_repulsive	2	4	1	4	3		4	3
ch_repulsive_WNADist		1		4			4
ch_repulsive_WNASurf	1			1
Cross_Pres_Order_CA				1
Dihedral_Chi1	3			4			4	1
Dihedral_Chi2			1
Dihedral_Chi3	2						4
Dihedral_Chi4	1			4			4
disulfide								1
disulfide_WNADist	1						2	1
disulfide_WNASurf				1	1
Electrostatic_Potential_at_CA_WNADist				1			2
Electrostatic_Potential_at_LHA			2				2
Electrostatic_Potential_Average							1
Electrostatic_Potential_Average_WNASurf	1			1			1
hb-mwm			1			1		1	1
hb-mwm_WNASurf							1
hb-mws	1		2	3			3
hb-mws_WNADist	1			1			1
hb-mws_WNASurf				1			1
hb-mwwm			1
hb-mwwm_WNADist	1
hb-mwwm_WNASurf							1
hb-mwws				2			3
hb-mwws_WNADist				1			1
hb-ss		3	1	1			3
hb-sws	1	1		1			3
hb-sws_WNADist	1			1
hb-sws_WNASurf	1			1			2
hb-swws	1		1	2			2
hb-swws_WNADist	1			2			2
hb-swws_WNASurf		1		2	1		4	1
hydrophobic
hydrophobic_WNADist		3	1	2	1	1	1	2	1
hydrophobic_WNASurf	1	2	2	2	2		2	2
Number_Unused_Contact_WNADist				1
Number_Unused_Contact_WNASurf	1						2

Figure 6 shows that the same number of descriptors was used in each comparison and that all of them are important for determining the NEs in which the different secondary structure elements are formed and maintained; however, the level of importance is higher for the turns versus the α-helices than for the turns versus the β-sheets. This is because the geometrical and structural similarities between the turns and the α-helices are, according to Fig. 6, much higher. Namely, from a structural point of view, turns may be unsuccessful initiations of α-helices. Therefore, more characterizing descriptors are needed to distinguish one from the other. The difference in the number of high-impact descriptors between the turns and the β-sheets is, according to the plot in Fig. 6, smaller. Again, from the structural point of view, the classical consideration is that turns are more similar to α-helices than to β-sheets. Similarly, the statistical analysis conducted here indicates that many more descriptors are needed to distinguish the turns from the helices than to distinguish the turns from the β-sheets.

According to Table 5, the Kolmogorov‒Smirnov hypothesis test demonstrated 7 major classes of NE descriptors: 1. Accessibility, 2. Contacts, 3. Dihedral Angles, 4. Electrostatic Potential, 5. Hydrophobicity, 6. Number of Unused Contacts and 7. Temperature Factor. All are essential for maintaining the existence of turns and are therefore crucial in the characterization of the NE of turns.

The MANOVA tests were used to identify the ten most relevant NE descriptors (MRND). Figure 8 shows these descriptors for the α-helices (19), the β-sheets (20), and the turns, together with their percentage of appearance in the tests. We can conclude from Fig. 8 that:

1) The NE of α-helices is defined principally by three descriptor classes:

a) Number_Unused_Contacts,

b) Electrostatic_Potential, and

c) Hbms,

2) The NE of β-sheets is defined principally by three descriptor classes:

a) hydrogen bonds,

b) hydrophobic interactions, and

c) charge_attractive interactions, and

3) The NE of turns is defined principally by three descriptor classes:

a) charge attractive interactions,

b) geometry of bonds, and

c) hydrogen bonds.

In Fig. 8, there is a more detailed overview of the MRND; the underlined numbers mark the specific descriptors (descriptors appearing only in a single SSE category, which we considered specific/unique). The descriptor ranking is annotated in parentheses, with the five highest percentages taken as the five most relevant.

Figure 8 The MRND for α-helices, β-sheets, and turns. The SSEs for each descriptor are presented in the following order: red bar (α-helix), green bar (β-sheet), and blue bar (turn). The absence of a given colour bar means that the descriptor is not part of the MRND for the corresponding SSE. We can see that electrostatic potential and unused contacts are more important for maintaining the α-helices; that hydrogen bonds of the m-(1 or 2 water)-m type and hydrophobic contacts are more important for maintaining the β-sheets; and that contacts of the aromatic, charge attractive and charge repulsive types, hydrogen bonds of the s-w-w-s type, and dihedral angles are more important for maintaining turns.

Comparing the three distinct secondary structure elements, we observe that the interatomic contacts among the amino acid residues are omnipresent and critical for all three SSE NE definitions. However, the geometry of the bonds among the amino acids and their side chains is much more prevalent in defining the turn NE than in the cases of the α-helices or β-sheets. The electrostatic potential and the number of unused contacts are critically important for, and more specific to, the description of the α-helix NE. At the same time, hydrophobic interactions stand out as uniquely determining the NE specific to β-sheets.

From the thorough analysis described above, it follows that different NEs are described by different sets of descriptors. For example, the ten most relevant descriptors for the protein‒protein interfaces are hydrogen bond, sponge, the density of contacts, electrostatic potential, hydrophobicity, pockets, density, secondary structure element, curvature, and order of cross-link (10). For the catalytic site residues, the ten most relevant descriptors are the order of cross-presence, order of cross-link, contact energy density, sponge, density, hydrogen bond side chain-side chain, local closeness, distance to the centre of mass, hydrogen bond main chain-water-side chain, and hydrophobic contacts WNA on the surface (14). In this work, we added one more piece to the puzzle that describes the complete universe of the 10 most relevant and most studied internal protein NEs: the one for the SSE named turns or loops. By doing so, we hope to facilitate the work and motivate new undertakings that will include NE descriptors in the training of neural networks, which will hopefully be employed in predicting protein structure with better performance. However, our work is not yet complete, and we are already preparing a report on protein‒DNA interfaces and the MRND for that specific NE. Work is also in progress on protein‒ligand interfaces and the respective MRND, as well as on defining the exosite/allosteric site MRNDs.

The 10 MRND for all three secondary structure elements are now available in the Dictionary of Internal Protein NEs (DIPN) at https://www.proteinnanoenvironments.cnptia.embrapa.br/index.html.

Abbreviations

CA α carbon

IFR Interface forming residues

KDI Kyte-Doolittle hydrophobicity scale

KS Kolmogorov‒Smirnov

LHA Last heavy atom

MRND Most relevant nano environments descriptors

NE Nano environments

SSE Secondary structure elements

WNA Weighted Neighbour Averages

WNADist Weighted Neighbour Averages defined by Distance

WNASurf Weighted Neighbour Averages defined by Surface

Availability of data and materials

The dataset analysed during the current study is available from http://www.cbi.cnptia.embrapa.br/m318309/TurnDataset/

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests

Funding

This work was supported by the Embrapa Digital Agriculture, Campinas, SP, Brazil.

Authors' contributions

I.M. wrote the article, J.A.S. and J.L.C. contributed to the creation of the database, L.B. and F.R.M. contributed to the statistical analyses, G.N. supervised the research and reviewed the article.

References

1. BORNOT A, DE BREVERN AG. Protein beta-turn assignments. Bioinformation. 2006: p. 153.
2. CHOI Y, AGARWAL S, DEANE CM. How long is a piece of loop? PeerJ. 2013; 1: p. e1.
3. DONATE LE, et al. Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Science. 1996; 5(12): p. 2600-2616.
4. TONIOLO C, BENEDETTI E. Intramolecularly hydrogen-bonded peptide conformation. Critical Reviews in Biochemistry. 1980; 9(1): p. 1-44.
5. GURUPRASAD K, RAJKUMAR S. Beta- and gamma-turns in proteins revisited: a new set of amino acid turn-type dependent positional preferences and potentials. J Biosci. 2000 Jun 25: p. 143-56.
6. VENKATACHALAM CM. Stereochemical criteria for polypeptides and proteins. V. Conformation of a system of three linked peptide units. Biopolymers: Original Research on Biomolecules. 1968; 6(10): p. 1425-1436.
7. DE BREVERN AG. Extension of the classical classification of β-turns. Scientific Reports. 2016; 6(1): p. 1-15.
8. HUTCHINSON EG, THORNTON JM. A revised set of potentials for β-turn formation in proteins. Protein Science. 1994: p. 2207-2216.
9. NESHICH G, et al. Using Structural and Physical–Chemical Parameters to Identify, Classify, and Predict Functional Districts in Proteins—The Role of Electrostatic Potential. In ROCCHIA W, SPAGNUOLO M. Computational Electrostatics for Biological Applications. Springer; 2015. p. 227-254.
10. NESHICH G, et al. Using Structural and Physical–Chemical Parameters to Identify, Classify, and Predict Functional Districts in Proteins—The Role of Electrostatic Potential. In ROCCHIA W, SPAGNUOLO M. Computational Electrostatics for Biological Applications. Springer, Cham; 2015. p. 227-254.
11. MAZONI I, NESHICH G. DPIN: um dicionário dos nanoambientes internos das proteínas e seu potencial para transformação em ativos para a agricultura. In Agricultura Digital - Pesquisa, Desenvolvimento e Inovação nas Cadeias Produtivas. Embrapa Agricultura Digital; 2020. p. 219-233.
12. DE MORAES FR, et al. Improving predictions of protein-protein interfaces by combining amino acid-specific classifiers based on structural and physicochemical descriptors with their weighted neighbor averages. PLoS One. 2014; 9(1): p. e87107.
13. VIART B, et al. EPI-peptide designer: a tool for designing peptide ligand libraries based on epitope–paratope interactions. Bioinformatics. 2016; 32(10): p. 1462-1470.
14. DE CARVALHO PEREIRA JG. Caracterização dos aminoácidos da interface proteína-proteína com maior contribuição na energia de ligação e sua predição a partir dos dados estruturais. Tese de Doutorado. Master's thesis. Campinas: UNICAMP, Genética e Biologia Molecular; 2012.
15. BORRO L, et al. Binding affinity prediction using a nonparametric regression model based on physicochemical and structural descriptors of the nano-environment for protein-ligand interactions. In STRUCTURAL BIOINFORMATICS AND COMPUTATIONAL BIOPHYSICS; 2016; Orlando.
16. SALIM A. Aplicação de técnicas de reconhecimento de padrões usando os descritores estruturais de proteínas da base de dados do software STING para discriminação do sítio catalítico de enzimas. Master's thesis. Campinas: UNICAMP, Faculdade de Engenharia Elétrica e Computação; 2015.
17. MAZONI I. ANÁLISE DO NANO-AMBIENTE PROPÍCIO PARA NUCLEAÇÃO E MANUTENÇÃO DOS ELEMENTOS DA ESTRUTURA SECUNDÁRIA NO CONTEXTO ESTRUTURAL DAS PROTEÍNAS FUNCIONAIS. PhD Thesis. Campinas: UNICAMP, Instituto de Biologia; 2018.
18. DA SILVEIRA CH, et al. Protein cutoff scanning: A comparative analysis of cutoff dependent and cutoff free methods for prospecting contacts in proteins. Proteins: Structure, Function, and Bioinformatics. 2009; 74(3): p. 727-743.
19. MAZONI I, et al. Study of specific nanoenvironments containing α-helices in all-α and (α+β)+(α/β) proteins. PLoS One. 2018; 13(7): p. e0200018.
20. MAZONI I, et al. A comparison between internal protein nanoenvironments of α-helices and β-sheets. PLoS One. 2020; 15(12): p. e0244315.
21. OLIVEIRA SdM, et al. Sting_RDB: a relational database of structural parameters for protein analysis with support for data warehousing and data mining. Genetics and Molecular Research. 2007: p. 911-22.
22. MAZONI I. ANÁLISE DO NANO-AMBIENTE PROPÍCIO PARA NUCLEAÇÃO E MANUTENÇÃO DOS ELEMENTOS DA ESTRUTURA SECUNDÁRIA NO CONTEXTO ESTRUTURAL DAS PROTEÍNAS FUNCIONAIS. PhD Thesis. Campinas: UNICAMP, Instituto de Biologia; 2018.
23. KABSCH W, SANDER C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983: p. 2577-2637.
24. HEINIG M, FRISHMAN D. STRIDE: a Web server for secondary structure assignment from known atomic coordinates of proteins. Nucl. Acids Res. 2004: p. W500-2.
25. LI W, GODZIK A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006: p. 1658-1659.
26. LOGAN M. Biostatistical design and analysis using R: a practical guide. John Wiley & Sons; 2011.
27. CHAKRAVARTI IM, LAHA RG, ROY J. Handbook of Methods of Applied Statistics. John Wiley and Sons; 1967.
28. ZAR JH. Biostatistical Analysis. Prentice Hall Inc.; 1999. Chapters 10 and 16.
29. ZAR JH. Biostatistical Analysis. Prentice Hall Inc.; 1999. Chapter 19.
30. PILLAI KCS. Some new test criteria in multivariate analysis. Ann Math Stat. 1955; 26(1): p. 117–21.
31. SEBER GAF. Multivariate Observations. New York: John Wiley and Sons; 1984.
32. NATH R, PAVUR R. A new statistic in the one way multivariate analysis of variance. Computational Statistics and Data Analysis. 1985; 2: p. 297–315.
33. HOTELLING H. The generalization of Student’s ratio. Ann Math Stat. 1931: p. 360–378.
34. JOHNSTONE IM, NADLER B. Roy’s largest root test under rank-one alternatives. Biometrika. 2017 Mar: p. 181–193.
