2.1 SCOP Benchmark Dataset
The Structural Classification of Proteins (SCOP) database has been widely used to evaluate protein classification methods, for example ProDec-BLSTM. This work therefore uses the SCOP 1.67 dataset (the same dataset used for ProDec-BLSTM), which is accessible online.
Positive and negative samples for the training and testing data are randomly selected for each of the 102 families in our dataset, with an average of 9077 sequences in each training set. The dataset contains 507,119 different sequences in total, with a minimum length of 13 and a maximum length of 1264. Sequences shorter than 400 residues account for 96% of the dataset. Hence, the sequence length is capped at 400 residues in this study: sequences longer than 400 residues are truncated at the 400th residue.
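The length cap described above can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: the function names and the toy sequences are hypothetical, and only the cap of 400 positions comes from the text.

```python
MAX_LEN = 400  # length cap stated in the text


def truncate_sequence(seq, max_len=MAX_LEN):
    """Return the first max_len positions of seq; shorter sequences are unchanged."""
    return seq[:max_len]


def fraction_within(seqs, max_len=MAX_LEN):
    """Fraction of sequences whose length does not exceed max_len."""
    if not seqs:
        return 0.0
    return sum(len(s) <= max_len for s in seqs) / len(seqs)


# Toy sequences (hypothetical, not taken from SCOP 1.67).
toy = ["MKV" * 5, "A" * 400, "G" * 1264]
truncated = [truncate_sequence(s) for s in toy]
lengths = [len(s) for s in truncated]
```

A helper such as `fraction_within` can be used to check how much of a dataset is left untouched by a given cap before fixing the input length of the network.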
2.2 Sequence Representation
Since the physicochemical properties of a protein depend on the physicochemical properties of its amino acids, this study uses the physicochemical properties of amino acids to represent protein sequences. Table 1 lists all 12 types of physicochemical properties [37] of amino acids used in our work, including chemical composition of the side chain (P1), polar requirement (P2), hydropathy index (P3), isoelectric point (P4), molecular volume (P5), polarity (P6), aromaticity (P7), aliphaticity (P8), hydrogenation (P9), hydroxythiolation (P10), pK1(-COOH) (P11) and pK2() (P12).
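To make the representation concrete, the sketch below encodes a sequence as a matrix with one row per position and one column per property. It rests on several assumptions that are not in the text: only the hydropathy column (P3) is filled, using the Kyte-Doolittle scale as a stand-in that may differ from Table 1; the other 11 columns are zero placeholders; unknown residues map to zeros; and short sequences are zero-padded to a fixed length of 400.

```python
# Kyte-Doolittle hydropathy values, used only as a stand-in for column P3.
# The values in the paper's Table 1 may differ (assumption).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

NUM_PROPERTIES = 12  # P1..P12 as listed in the text
P3_INDEX = 2         # zero-based column of the hydropathy index


def property_vector(residue):
    """Hypothetical lookup: P3 is filled, the other columns are placeholders."""
    vec = [0.0] * NUM_PROPERTIES
    if residue in HYDROPATHY:
        vec[P3_INDEX] = HYDROPATHY[residue]
    return vec


def encode_sequence(seq, max_len=400):
    """Encode seq as max_len rows of NUM_PROPERTIES values (truncate, then zero-pad)."""
    rows = [property_vector(r) for r in seq[:max_len]]
    rows += [[0.0] * NUM_PROPERTIES for _ in range(max_len - len(rows))]
    return rows


encoded = encode_sequence("MKV")  # 400 rows x 12 columns
```

With a full property table, each row would carry all 12 values for the residue at that position, giving a fixed-size input for the convolutional layers.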
2.3 Deep Neural Network Architecture Combining Inception and Resnet
This section describes our proposed ConvRes (shown in Fig. 1), which combines a variant Inception block and a Resnet block. Input data are fed into the variant Inception block, which extracts abstract features of protein sequences by using various kernel sizes. The Inception block enhances the features of protein sequences because different kernel sizes act as different window sizes over the protein sequence. The Resnet block is then employed as a detector that takes these features as input. Finally, the architecture determines whether the input sequence belongs to a certain family. More details are given in the following subsections.
2.3.1 1-D Inception Block
The Inception network is a frequently used structure in the field of Convolutional Neural Networks (CNNs), which extracts features with several kernels of different sizes. It can extract more abstract features even when the objects of interest appear at different sizes across images. For biological sequences, the window size strongly affects classification accuracy. However, no previous study has established the optimal window size. This paper therefore adopts a variant Inception, called the 1-D Inception block, which combines the Inception structure with 1-dimensional convolution (shown in Fig. 1). Since the input of the Inception block is [Due to technical limitations, this equation is only available as a download in the supplemental files section], the output of this block can be described as follows,
[Due to technical limitations, this equation is only available as a download in the supplemental files section.] (1)
where Conv1D( ) represents a 1-dimensional convolution operation with filter size f, k stands for the number of different kernels in this block, and + denotes the concatenation operation.
The enhanced features extracted from this block will be concatenated by channels, and sent to the following Resnet classifier to generate the final result.
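The block described above can be sketched in plain Python: one 1-D convolution per kernel size, with the branch outputs concatenated along the channel axis. This is an illustration under assumptions, not the authors' Keras implementation. Zero "same" padding (so that every branch keeps the input length), the untrained averaging weights from the hypothetical helper `mean_kernel`, and the channel counts are all assumptions; the kernel sizes are the ones given in Section 2.3.3.

```python
KERNEL_SIZES = (1, 3, 5, 9, 15, 21)  # kernel sizes listed in Section 2.3.3


def conv1d_same(x, kernel):
    """1-D convolution (cross-correlation) with zero 'same' padding.

    x: list of L positions, each a list of c_in channel values.
    kernel: nested list of shape [f][c_in][c_out], with f odd.
    Returns L positions with c_out channel values each.
    """
    f, c_in, c_out = len(kernel), len(kernel[0]), len(kernel[0][0])
    half = f // 2
    out = []
    for pos in range(len(x)):
        row = [0.0] * c_out
        for j in range(f):
            src = pos + j - half
            if 0 <= src < len(x):  # positions outside the sequence count as zeros
                for ci in range(c_in):
                    for co in range(c_out):
                        row[co] += x[src][ci] * kernel[j][ci][co]
        out.append(row)
    return out


def inception_1d(x, kernels):
    """Apply one convolution per kernel and concatenate the results by channel."""
    branches = [conv1d_same(x, k) for k in kernels]
    return [sum((b[pos] for b in branches), []) for pos in range(len(x))]


def mean_kernel(f, c_in, c_out):
    """Hypothetical untrained kernel that averages over its window."""
    w = 1.0 / (f * c_in)
    return [[[w] * c_out for _ in range(c_in)] for _ in range(f)]


x = [[1.0, 2.0] for _ in range(30)]                   # toy input: L = 30, 2 channels
kernels = [mean_kernel(f, 2, 4) for f in KERNEL_SIZES]
y = inception_1d(x, kernels)                          # L = 30, 6 branches * 4 = 24 channels
```

Because every branch preserves the sequence length, the concatenated output has one row per input position and a channel count equal to the sum of the branch channel counts, which is the form a downstream classifier can consume directly.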
2.3.2 Resnet classifier
The deep residual network (Resnet) is an extension of the conventional CNN, formed by several convolutional layers with a residual connection between every two layers. The residual connections alleviate the vanishing gradient problem to some degree, so Resnet achieves better performance than a conventional CNN.
The input feature of the ith convolutional layer is [Due to technical limitations, this equation is only available as a download in the supplemental files section], so the input of the next layer is as follows,
[Due to technical limitations, this equation is only available as a download in the supplemental files section.] (2)
where Conv( ) represents the convolution operation of the ith layer, and + stands for the concatenation operation.
This paper employs an 18-layer Resnet as the classifier for protein remote homology detection. This Resnet classifier contains an independent convolutional layer followed by a max-pooling layer, and 4 residual blocks (with 2 convolutional layers in each block) followed by an average-pooling layer and a fully connected layer. The concatenated features extracted by the 1-D Inception block are sent to this Resnet classifier, in which each layer takes both the extracted features and the initial input of the previous layer as input and produces features at a higher level of abstraction. The final dense layer determines whether the input sequence belongs to the current family.
2.3.3 Implementation details
This network is implemented in Keras 2.2.4 with a TensorFlow 1.9.0 backend. Six different kernel sizes are adopted in our 1-D Inception block: 1, 3, 5, 9, 15 and 21. The parameters of the Resnet block are the same as those of the standard Resnet-18 [36]. A separate binary classification network is trained and tested for each protein family. Each model is trained for 150 epochs.
2.3.4 Performance evaluation
In this paper, the area under the receiver operating characteristic curve (AUROC) is used to evaluate the performance of our method and the existing methods. The Receiver Operating Characteristic (ROC) curve is plotted with the false positive rate on the x axis and the true positive rate on the y axis as the classification threshold varies. AUROC is the area under the ROC curve, and its value lies between 0 and 1. The better the classifier performs, the closer its AUROC score is to 1.
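As an illustration of the metric, the sketch below computes AUROC from labels and prediction scores using its pairwise interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. This is equivalent to the area under the ROC curve. The function name and the toy data are hypothetical, and this is not necessarily how the scores in this paper were computed.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 1 (positive) or 0 (negative).
    scores: iterable of predicted scores, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy predictions (hypothetical): 8 of the 9 positive-negative pairs are ranked correctly.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
toy_auroc = auroc(toy_labels, toy_scores)
```

The pairwise form is quadratic in the number of samples, so for large test sets a sort-based or library implementation would be preferable; the quadratic form is used here only because it makes the definition explicit.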