Datasets
For a fair comparison, we use the same training, validation, and test sets as SPOT-1D and SAINT. The training set TR10029 contains 10 029 proteins, and the validation set VAL983 contains 983 proteins. We benchmark our model on three test sets: TEST2016 with 1213 proteins, TEST2018 with 250 proteins, and CASP-12 with 49 proteins (see [13] and [14] for details about these datasets).
Metric for the imbalanced secondary structure classification problem
Some protein secondary structures, e.g., alpha-helices, are much more frequent than others (Figure 5). This leads to the class imbalance problem [23], which is rarely mentioned or addressed in the literature on SS prediction. Assessing the performance of SS classifiers plays a vital role in their construction. The most commonly used metrics of SS prediction performance are the overall accuracies Q3 and Q8 [5,9,24], which are not appropriate for imbalanced problems [25,26]. Using them may lead to the accuracy paradox, where high accuracy is not necessarily an indicator of good classification performance [26]; e.g., a classifier that always predicts class H will have ten times better accuracy than a classifier that always predicts class G (see Figure 5).
Existing popular measures proposed for imbalanced learning, such as the geometric mean or the F-score, can still result in suboptimal models [17]. For these reasons, we used the Adjusted Geometric Mean (AGM), which is well suited to imbalanced bioinformatics problems [16]. It has been shown, both analytically and empirically, to perform better than the F-score, and it has no parameters (unlike the beta in the F-score). It is given by Eq. 1, where GM is the geometric mean, SP is specificity, \({N}_{n}\) is the proportion of negative samples, and SE is sensitivity.
\(AGM=\begin{cases}\dfrac{GM + SP\cdot {N}_{n}}{1+{N}_{n}}, & SE>0\\ 0, & SE=0\end{cases}\)  (1)
AGM’s purpose is to increase sensitivity while keeping the reduction in specificity to a minimum. Moreover, the higher the degree of imbalance, the stronger the reaction to changes in specificity. It returns values between 0 (the worst prediction) and 1 (a perfect prediction).
We calculate AGM for each structure separately. To assess the overall quality, we use macro-averaged F1 and AGM scores, i.e., the unweighted average of the per-structure scores. This way we do not favor the more frequent classes.
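Eq. 1 and the macro-averaging described above can be sketched as small Python functions. This is an illustrative stand-alone implementation, not code from the ProteinUnet2 capsule; the function names and the one-vs-rest binarization of labels are our own choices.

```python
import math

def agm(y_true, y_pred, positive):
    """Adjusted Geometric Mean (Eq. 1) for one class treated as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    se = tp / (tp + fn) if tp + fn else 0.0   # sensitivity
    sp = tn / (tn + fp) if tn + fp else 0.0   # specificity
    if se == 0:
        return 0.0                             # AGM = 0 when SE = 0
    gm = math.sqrt(se * sp)                    # geometric mean of SE and SP
    nn = (tn + fp) / len(y_true)               # proportion of negative samples
    return (gm + sp * nn) / (1 + nn)

def macro_agm(y_true, y_pred, classes):
    """Unweighted mean of per-class AGM, so rare classes count equally."""
    return sum(agm(y_true, y_pred, c) for c in classes) / len(classes)
```

The macro-average deliberately ignores class frequencies, which is exactly what prevents the dominant helix class from masking poor performance on rare states.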
Significance testing and effect size
Null hypothesis significance testing (NHST) is a commonly used statistical method for comparing classifier performance [26,27], although these authors also note its caveats. When the test datasets are not random samples (as with the benchmark datasets used to evaluate SS prediction), classical NHST is problematic [26]. Random permutation tests based on the Fisher-Pitman model of inference [28] are an alternative that we strongly recommend in that case. In our experiments, we used a one-sided paired-sample permutation test for the difference in mean classifier performance (the perm.paired.loc function from the wPerm R package). The tests are performed at the sequence level. Tests for individual structures are performed only on the subsets of sequences for which the given metric could be calculated (e.g., the structure is present in the ground truth or the prediction).
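The paired-sample permutation test can be sketched in Python (the study itself uses perm.paired.loc from the R package wPerm; this is only an analogue under the standard construction): under the null hypothesis the per-sequence score differences are symmetric around zero, so we repeatedly flip their signs at random and count how often the permuted mean matches or exceeds the observed one.

```python
import random

def paired_permutation_pvalue(scores_a, scores_b, n_perm=10000, seed=0):
    """One-sided p-value for H1: mean(A) > mean(B), via random sign flips."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    count = 0
    for _ in range(n_perm):
        # Under H0, each paired difference is equally likely to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if sum(flipped) / len(flipped) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one correction avoids p = 0
```

In the paper's setting, `scores_a` and `scores_b` would hold per-sequence metric values (e.g., AGM) for the two classifiers being compared.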
Here (to our knowledge, for the first time) we propose a new methodology for comparing the significance of differences in classifier performance. Significance testing, and permutation tests alone, do not resolve the problem of inferential interpretation. Statistical significance shows only that an effect exists; practical significance - the effect size - shows that the effect is large enough to be meaningful in the real world. Statistical significance alone can be misleading because it is influenced by the sample size: increasing the sample size always makes it more likely to find a statistically significant effect, no matter how small the effect is in the real world. Effect sizes are independent of the sample size and are an essential component when evaluating the strength of a statistical claim. Some authors [29] proposed using confidence intervals to estimate effect size, but these require a random sample to enable inference. Cohen’s effect size d [30], which we propose to use in our study, can be calculated for paired samples by dividing the mean difference by the standard deviation of the differences. Whether an effect size should be interpreted as negligible (d < 0.01), very small (d < 0.2), small (d < 0.5), medium (d < 0.8), or large (d < 1.2) depends on the context (application) and its operational definition [31]. Thus, we propose to report statistical significance (denoted by p-values) together with practical significance represented by effect sizes (here, Cohen’s d for paired samples).
ProteinUnet2 architecture
U-Net architectures have proven extremely effective in image segmentation tasks [32,33]. The U-shaped architecture of ProteinUnet2 is based on our previous ProteinUnet for secondary structure prediction [15] (for which the results are presented in Supplementary Table S1). The new architecture was adjusted to handle multiple inputs by using multiple contractive paths, one for each input (Fig. 6). After each down-block, the features of all inputs are concatenated and passed to the corresponding up-block via a skip connection. Two output layers with softmax activations are connected to the last up-block, one for SS3 and one for SS8. In ProteinUnet2, we reduced the maximum supported sequence length from 1024 to 704 to further improve training and inference times without losing accuracy; SPOT-1D and SAINT were not trained on proteins longer than 700 residues, and there are no proteins longer than 704 in our datasets. The input features and the number of filters were selected experimentally, as described in the next section.
To mitigate the increased number of inputs and parameters of the network, in the final ProteinUnet2 architecture (Figure 6) we modified the architecture to resemble the Attention U-Net [34]. That is, we decreased the number of convolutions in each down-block from 3 to 2, added dropout layers with a rate of 0.1 between convolutions in all blocks, and applied attention gates right before the concatenation operations. ProteinUnet2 was implemented in Python 3.8 with TensorFlow 2.4, accelerated by CUDA 11.0 and cuDNN 8.0. The code and trained models are available on the CodeOcean platform (https://codeocean.com/capsule/0425426), ensuring high reproducibility of the results.
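To give a rough feel for the attention-gate idea borrowed from the Attention U-Net [34], the following NumPy sketch applies an additive gate to encoder (skip) features before they would be concatenated into the decoder path. This is only an illustration: the weights below are random stand-ins for learned parameters, and the real model uses 1D convolutions over the sequence rather than plain matrix products.

```python
import numpy as np

def attention_gate(skip, gate, w_x, w_g, psi):
    """Re-weight skip-connection features by attention coefficients.

    skip: (L, F) encoder features; gate: (L, F) decoder (gating) features.
    w_x, w_g: (F, F) projection weights; psi: (F, 1) scoring weights.
    """
    q = np.tanh(skip @ w_x + gate @ w_g)   # joint feature map
    alpha = 1.0 / (1.0 + np.exp(-(q @ psi)))  # sigmoid attention in (0, 1), shape (L, 1)
    return skip * alpha                     # suppress irrelevant skip features
```

Because the coefficients lie in (0, 1), the gate can only attenuate skip features, letting the decoder focus on residues relevant to the current prediction.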
Feature representation and selection
ProteinUnet2 takes a sequence of feature vectors \(X=\left({x}_{1}, {x}_{2}, {x}_{3},\dots , {x}_{N}\right)\) as input, where \({x}_{i}\) is the feature vector corresponding to the ith residue, and it returns two sequences of structure probability vectors \(Y=\left({y}_{1}, {y}_{2}, {y}_{3},\dots , {y}_{N}\right)\) as output, where \({y}_{i}\) is the vector of 3 or 8 probabilities of the ith residue being in one of the SS3 or SS8 states. The 8 states are specified by the secondary structure assignment program Define Secondary Structure of Proteins (DSSP) [35]. There are three helix states: 310-helix (G), alpha-helix (H), and pi-helix (I); two strand states: beta-bridge (B) and beta-strand (E); and three coil types: high-curvature loop (S), beta-turn (T), and coil (C). These 8 classes are converted into a 3-class problem by grouping the states: G, H, and I into H; B and E into E; and S, T, and C into C.
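The DSSP 8-to-3 state grouping described above amounts to a simple lookup table; a minimal sketch:

```python
# SS8 -> SS3 grouping per DSSP: helices to H, strands to E, coils to C.
SS8_TO_SS3 = {
    "G": "H", "H": "H", "I": "H",  # 310-helix, alpha-helix, pi-helix
    "B": "E", "E": "E",            # beta-bridge, beta-strand
    "S": "C", "T": "C", "C": "C",  # high-curvature loop, beta-turn, coil
}

def to_ss3(ss8_sequence):
    """Map a string of SS8 labels to the corresponding SS3 labels."""
    return "".join(SS8_TO_SS3[s] for s in ss8_sequence)
```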
Similar to SPOT-1D, our final model contains 20 features from PSSM [10] and 30 features from HHM profiles [11]. These features were standardized to zero mean and unit standard deviation based on the training data. Additionally, we use contact maps generated by SPOT-Contact [36]. We use the same windowing scheme as described in SPOT-1D, but we do not standardize the contact maps as they already lie in the range [0, 1]. The window size of 50 was selected experimentally based on the results in Supplementary Table S1, which shows F1 scores and accuracies on the largest test set, TEST2016, for a single ProteinUnet trained with different input features on TR10029 and validated on VAL983. Supplementary Table S1 suggests that the SPOT-Contact features alone gave better SS8 prediction results than any other single input. The worst results were obtained for the 7 physicochemical properties [37]; thus, we did not investigate them further in ProteinUnet2.
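The standardization step can be sketched as follows. This is a pure-Python illustration with hypothetical helper names: the means and standard deviations are fitted per feature column on the training set only, and then applied to any split, while contact-map features would bypass this step entirely.

```python
import statistics

def fit_standardizer(train_columns):
    """Return (mean, sd) per feature column, computed on training data only."""
    return [(statistics.mean(col), statistics.stdev(col)) for col in train_columns]

def standardize(columns, stats):
    """Apply the fitted training-set statistics to any dataset split."""
    return [[(v - m) / s for v in col] for col, (m, s) in zip(columns, stats)]
```

Fitting on the training data alone avoids leaking test-set statistics into the model's inputs.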
Supplementary Table S2 shows the F1 scores and accuracies on TEST2016 for our proposed ProteinUnet2 trained with different combinations of input features and different numbers of filters in the down-blocks. It reveals that the SPOT-Contact features alone outperformed the combined PSSM and HHblits features. However, the combination of all three feature sets (keeping the same number of filters) increased the F1 scores for all SS8 structures compared with any single feature set. Most of our results are better for the higher number of filters, but we did not test numbers above 64, to avoid overfitting and to keep the number of filters in all blocks the same as in the original ProteinUnet. Thus, we decided to investigate further only the PHSA 64 attention combination from Supplementary Table S2. The architecture for this combination is presented in Fig. 6.
Training procedures and ensembling
For the initial experiments presented in Supplementary Tables S1 and S2, the single models were trained on the TR10029 dataset and validated on VAL983. For the final model, however, TR10029 and VAL983 were combined and then divided into 10 stratified folds to ensure a similar ratio of each SS8 structure in each fold. There were nine stratification factors: the sequence length (shorter/longer than the mean sequence length), and one factor for the occurrence of each of the 8 structures (fewer/more occurrences than the mean number of occurrences per chain). We trained 10 models, each time using a different fold as the validation set and the rest as the training set. The models were trained to optimize the categorical cross-entropy loss using the Adam optimizer [38] with batch size 8 and an initial learning rate of 0.001. The learning rate was reduced by a factor of 0.1 when the validation loss did not improve for 4 epochs, and training stopped when the validation loss did not improve for 7 epochs. Finally, the ensemble was created from the model with the lowest validation loss on each fold by taking the average of their softmax outputs, forming the final ProteinUnet2 prediction.
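The final ensembling step, averaging the fold models' softmax outputs element-wise, can be sketched as follows (a minimal illustration; the real models emit per-residue SS3 and SS8 probability vectors):

```python
def ensemble_softmax(per_model_probs):
    """Average per-class probabilities across models.

    per_model_probs: list (one entry per fold model) of
    [residue][class] probability lists, all with identical shapes.
    """
    n_models = len(per_model_probs)
    n_res = len(per_model_probs[0])
    n_cls = len(per_model_probs[0][0])
    return [
        [sum(m[i][c] for m in per_model_probs) / n_models for c in range(n_cls)]
        for i in range(n_res)
    ]
```

Since the inputs are valid probability distributions, their arithmetic mean is one as well, so the ensemble output can be argmaxed directly into SS labels.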