The overall workflow of this research is shown as a flowchart in Figure 4. Details of each step are explained in the following sections.
Dataset preparation
In this research, three datasets are constructed: a positive dataset, negative dataset 1 and negative dataset 2. The positive dataset consists of experimentally validated anticancer peptides, collected from the LEE dataset (total: 422) [13], the Tyagi dataset (total: 450) [11], APD (total: 225) [16] and CancerPPD (total: 422) [17]. Negative dataset 1 is a collection of peptides that are antimicrobial but not anticancer, adapted from the dbAMP dataset (total: 4057) [18] and the Tyagi dataset (total: 1372). Negative dataset 2 contains non-ACP peptides collected from UniProt (total: 2635). Since anticancer peptides have been shown to be effective small molecules (<50 amino acids) [19], peptides longer than 50 amino acids are removed from the datasets, as are peptides containing artificial amino acids. After this filtering step, 1492 peptide sequences remain in the positive dataset, 4433 in negative dataset 1 and 2635 in negative dataset 2. To remove identical or highly similar peptide sequences, the CD-HIT program [20] is applied; the results are shown in Table 6.
A 100% sequence-identity cut-off is first applied to all three datasets. The processed positive dataset is then compared with processed negative dataset 1 and processed negative dataset 2 separately using CD-HIT-2D [20], which identifies and removes sequences in the negative datasets whose similarity to any positive sequence exceeds a threshold of 40%. In addition, peptides containing non-natural amino acids are removed. To balance the datasets, some peptide sequences are removed from the negative datasets at random. Ultimately, as shown in Table 7, each of the three datasets contains 563 peptide sequences. Each dataset is then divided randomly into two subsets: the one containing 463 peptides is used as the training dataset and the other, containing 100 peptides, as the testing dataset.
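The length and natural-amino-acid pre-filters described above can be sketched as follows. CD-HIT and CD-HIT-2D are external programs and are not reproduced here; the helper name and the peptide strings are illustrative examples, not entries from the actual datasets.

```python
# Sketch of the pre-filtering step: keep peptides of at most 50 residues
# that contain only the 20 natural amino acids.
NATURAL_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 natural amino acids

def keep_peptide(seq, max_len=50):
    """Keep peptides of at most max_len residues made only of natural amino acids."""
    return len(seq) <= max_len and set(seq.upper()) <= NATURAL_AA

# Made-up examples: a valid peptide, an over-length one, one with artificial residues.
peptides = ["GLFDIVKKVVGALG", "X" * 60, "GLFDBZ"]
filtered = [p for p in peptides if keep_peptide(p)]  # only the first survives
```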
| Identity cut-off | Positive | Negative 1 | N1-P | Negative 2 | N2-P |
| --- | --- | --- | --- | --- | --- |
| Original | 1492 | 4433 | / | 2635 | / |
| 1.0 | 565 | 2753 | 2697 | 1585 | / |
| 0.9 | 398 | 2055 | 2559 | 1178 | / |
| 0.8 | 306 | 1664 | 2426 | 892 | / |
| 0.7 | 249 | 1358 | 2290 | 724 | / |
| 0.6 | 201 | 1097 | 2091 | 624 | / |
| 0.5 | 159 | 765 | 1667 | 531 | / |
| 0.4 | 107 | 439 | 1101 | 399 | 1494 |

Table 6. CD-HIT results of datasets
| | Positive | Negative 1 | Negative 2 |
| --- | --- | --- | --- |
| Training set | 463 | 463 | 463 |
| Testing set | 100 | 100 | 100 |

Table 7. Number of peptides in each dataset
Features Investigation
In order to apply machine learning methods to peptide sequences, features of the sequences must first be extracted. In this research, four features are considered: amino acid composition (AAC), N5C5, k-space and position-specific scoring matrix (PSSM).
AAC
The AAC is the proportion of each amino acid in a given peptide sequence. It summarizes the peptide information in a vector of 20 dimensions. The AAC method has been successfully and widely applied in sequence-based classifications[21].
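A minimal sketch of the AAC computation; the helper name `aac` and the example peptide are illustrative, not taken from the original study.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # fixed ordering of the 20 dimensions

def aac(seq):
    """Return the 20-dimensional amino acid composition (fraction of each residue)."""
    counts = Counter(seq)
    return [counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS]

vec = aac("AAKK")  # A and K each make up half of this toy peptide
```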
N5C5
Five amino acids from both the N-terminal and the C-terminal end of a given peptide are cut off and concatenated into a new sequence. The proportion of each amino acid in these new N5C5 sequences is then calculated. Furthermore, to better analyze the N5C5 sequences and visualize the results, heatmaps showing the frequency of each amino acid at each position are drawn.
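The N5C5 extraction can be sketched as follows (illustrative helper; note that for peptides shorter than ten residues the two terminal windows overlap, and how the original study handles that case is not specified).

```python
def n5c5(seq):
    """Concatenate the first five (N-terminal) and last five (C-terminal) residues.
    For peptides shorter than ten residues the two windows overlap."""
    return seq[:5] + seq[-5:]

s = n5c5("GLFDIVKKVVGALGSL")  # 16-residue toy peptide -> 10-residue N5C5 sequence
```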
K-space
The k-space method extracts pairs of amino acids that are separated by k (k = 0, 1, …) positions from a given peptide sequence. In total, (N − k − 1) pairs are extracted from a peptide sequence of N amino acids. After gathering all amino-acid pairs, the frequency of each pair type is counted. To explore the k-space differences between the positive dataset and the two negative datasets, the difference between the k-space frequency of each pair in the positive dataset and that in the negative datasets is calculated. Finally, these difference values are sorted, and the ten pairs with the highest difference values are listed.
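A sketch of k-space pair counting under these definitions; the helper name and the example sequence are illustrative.

```python
from collections import Counter

def kspace_pairs(seq, k):
    """Count residue pairs separated by exactly k positions (N - k - 1 pairs in total)."""
    return Counter(seq[i] + seq[i + k + 1] for i in range(len(seq) - k - 1))

pairs0 = kspace_pairs("AKAK", 0)  # adjacent pairs: AK, KA, AK
pairs1 = kspace_pairs("AKAK", 1)  # one-gap pairs: A_A and K_K
```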
PSSM
A PSSM is generated from a group of sequences previously aligned according to structural or sequence similarity. The PSSM for a given protein is an N × 20 matrix P = {P_ij : i = 1…N, j = 1…20}, where N is the length of the protein sequence. It assigns a score P_ij to the j-th amino acid at the i-th position of the query sequence. A large value indicates a highly conserved position, while a small value indicates a weakly conserved position [22].
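As an illustration, a simplified position-specific frequency matrix can be built from equal-length aligned sequences as below. This is only a stand-in for the log-odds PSSM described above (which in practice is typically produced by alignment tools such as PSI-BLAST); the function name and toy alignment are assumptions for the example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def frequency_pssm(aligned):
    """Build an N x 20 position-specific frequency matrix from equal-length
    aligned sequences: entry [i][j] is the fraction of sequences carrying
    the j-th amino acid at position i."""
    n_seqs = len(aligned)
    matrix = []
    for i in range(len(aligned[0])):
        column = Counter(seq[i] for seq in aligned)
        matrix.append([column.get(aa, 0) / n_seqs for aa in AMINO_ACIDS])
    return matrix

pssm = frequency_pssm(["AK", "AR", "AK"])  # toy alignment -> 2 x 20 matrix
```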
Model construction by machine learning techniques
In this study, a supervised learning technique is applied to sequence data for classification. To this end, a support vector machine (SVM) [14] is used in combination with sequential minimal optimization (SMO) [15]. For model construction, the WEKA software (version 3.8.4) [23] is used, together with the LIBSVM package (version 3.24) [24] and WEKA's SMO package (with default parameters).
SVM is a data-driven supervised algorithm that constructs separating hyperplanes in a high-dimensional space and selects the maximum-margin hyperplane for classification [25]. Based on its solid theoretical foundations, SVM has been successfully applied in various recognition and classification studies, including the text classification [26] employed in this research. SVM has also been widely used for high-dimensional biological data, including the examination of gene expression profiles [27], mass spectra and genomics projects [28]. Compared with other classifiers, such as artificial neural networks, SVM shows higher accuracy, particularly when the number of features is large [28]. Furthermore, to improve the performance of the SVM models, a program is designed in this research to determine the optimal weight vector for each model. For tuning the gamma and cost values, a program from the LIBSVM package [24] is applied to each model.
However, SVM does have some drawbacks, including its complexity and slow training speed on large-scale data. To address these problems, another algorithm, SMO, is also applied for classification and shows both faster speed and better performance. SMO is an algorithm for training SVMs that breaks the large quadratic programming (QP) optimization problem, a significant obstacle in the original SVM algorithm, into a series of smallest-possible QP subproblems. By solving these small QP subproblems analytically, the time-consuming numerical QP optimization in the inner loop is avoided, and thus the computational time is shortened. The SVM (dual) maximization problem is:

$$\max_{\lambda}\ \sum_{i=1}^{n}\lambda_i-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} y_i y_j \lambda_i \lambda_j\,(x_i\cdot x_j), \qquad \text{subject to}\quad 0\le\lambda_i\le C,\ \ \sum_{i=1}^{n}\lambda_i y_i=0$$
where $\lambda_i$ are the Lagrange multipliers, $x_i$ the input vectors, $y_i\in\{-1,+1\}$ the class labels and $C$ the regularization constant. In SMO, two Lagrange multipliers are optimized at each step while all other multipliers are held constant, using this update [29]:

$$\lambda_2^{\text{new}}=\lambda_2+\frac{y_2\,(E_1-E_2)}{\eta}, \qquad \eta = x_1\cdot x_1 + x_2\cdot x_2 - 2\,x_1\cdot x_2$$

where $E_i$ is the prediction error of the current model on the $i$-th chosen example.
Moreover, since SMO requires only a linear amount of memory, it can handle very large training sets [15], which aligns well with the needs of biological data analysis. To compute a linear SVM, only a single weight vector needs to be stored, and this stored weight vector can be easily updated to reflect the new Lagrange multiplier values by:

$$w^{\text{new}} = w + y_1\,(\lambda_1^{\text{new}}-\lambda_1)\,x_1 + y_2\,(\lambda_2^{\text{new}}-\lambda_2)\,x_2$$
This algorithm has shown success in some biological applications, such as metabolism studies[30], genomics[31] and molecular studies[32].
Performance evaluation
To evaluate the performance of the machine learning models, four metrics are calculated: accuracy, specificity (SP), sensitivity (SN) and the Matthews correlation coefficient (MCC), defined as:

$$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}, \qquad SN=\frac{TP}{TP+FN}, \qquad SP=\frac{TN}{TN+FP}$$

$$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
where TP (true positive) is the number of correctly predicted positive samples, TN (true negative) the number of correctly predicted negative samples, FP (false positive) the number of negative samples wrongly predicted as positive, and FN (false negative) the number of positive samples wrongly predicted as negative by the classifier. In addition to these evaluation metrics, a receiver operating characteristic (ROC) curve is generated during the weight-adjustment step to visualize the relationship between the true positive rate and the false positive rate, and is used to compare model performance.
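These four metrics follow directly from the confusion counts; a minimal sketch (the counts in the example are made up, not results from the study):

```python
import math

def evaluate(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity and MCC from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)  # sensitivity: fraction of positives recovered
    sp = tn / (tn + fp)  # specificity: fraction of negatives recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, sn, sp, mcc

accuracy, sn, sp, mcc = evaluate(tp=45, tn=40, fp=10, fn=5)  # made-up counts
```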
Cross-validation and independent testing sets
To enhance the robustness of the prediction model, ten-fold cross-validation is applied during model training. In addition, to evaluate the model built in this research and compare its performance with that of existing tools, independent testing datasets are constructed in the dataset preparation step.
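Ten-fold cross-validation partitions the training data into ten disjoint folds, each serving once as the validation set while the remaining nine are used for training. WEKA performs this internally; the pure-Python sketch below only illustrates the fold assignment for the 463 training peptides (function name and seed are assumptions).

```python
import random

def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k disjoint, roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

folds = kfold_indices(463, k=10)  # 463 training peptides -> 10 folds of 46-47 each
```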