2.1 Benchmark datasets
In this study, we collected 3,536 experimentally validated siRNAs with corresponding potency values, as suggested in the work of Han et al. [23]. They come from the studies of Huesken [17], Reynolds [24], Vickers [12], Harborth [25], Takayuki [26], and Ui-Tei [27]. Since the datasets from Reynolds, Vickers, Harborth, and Ui-Tei are all rather small, we merged them into a single dataset named dataset-RVHU. The dataset from Takayuki is the only complete dataset targeting all positions of a single mRNA (EGFP), and we name it dataset-T. The dataset from Huesken is abbreviated as dataset-H. The details of these datasets are summarized in Table 1. Some of the approaches used for comparison, such as DSIR, lack publicly available source code and are hard to reproduce, so we used the i-score website to obtain their inhibition predictions. However, the i-score website discards the first two siRNAs in dataset-T, so we removed those two siRNAs here as well. In addition, two siRNAs in dataset-RVHU could not be mapped to target-binding sites on the reported mRNAs, so we also discarded them.
Table 1. The details of the benchmark datasets. Two siRNAs were removed from dataset-T due to the limitation of the i-score website, and two were discarded from dataset-RVHU for lacking target sites.

| Dataset      | Num of siRNAs | Num of target mRNAs |
|--------------|---------------|---------------------|
| Dataset-H    | 2431          | 34                  |
| Dataset-RVHU | 405 (407)     | 11                  |
| Dataset-T    | 700 (702)     | 1                   |
The inhibitions of dataset-H, dataset-RVHU, and dataset-T range from 0 to 134.1, -27.8 to 98.9, and 0 to 97, respectively. After normalizing each dataset individually to the range 0 ~ 1, we sorted the values in inhibition-descending order. As shown in Fig. 1, the normalized inhibitions of all three datasets approximately follow a normal distribution, with a mean of 0.543 and a standard deviation of 0.187.
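The per-dataset normalization and sorting described above can be sketched as follows; the function name is ours, and the three-value example merely illustrates dataset-T's reported range (0 to 97).

```python
def normalize_and_sort(inhibitions):
    """Min-max normalize one dataset's inhibition values to [0, 1],
    then sort them in inhibition-descending order."""
    lo, hi = min(inhibitions), max(inhibitions)
    normalized = [(v - lo) / (hi - lo) for v in inhibitions]
    return sorted(normalized, reverse=True)

# Illustrative values spanning dataset-T's range of 0 to 97:
print(normalize_and_sort([0.0, 48.5, 97.0]))  # → [1.0, 0.5, 0.0]
```

Each dataset is normalized against its own minimum and maximum, so the three datasets end up on a common 0 ~ 1 scale without mixing their raw inhibition ranges.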
2.2 The siRNA inhibition predictor
The deep learning-based siRNA inhibition predictor consists of three modules (shown in Fig. 2): feature extraction, convolution and pooling, and full connection.
2.2.1 Feature Extraction Module
Here, we mainly consider three types of features, derived from the sequence context, thermodynamic properties, and other prior expert knowledge.
1) Encoding of siRNA and local mRNA sequence
A relevant study has shown that the flanking regions around the binding site on the mRNA affect siRNA potency [28]. Here, we take 20 nucleotides from the upstream and downstream regions of the target site separately to form the local mRNA sequence, as suggested in previous work [23]. Note that the binding part usually refers to the core region of the AS, composed of 19 nucleotides from its 5’ end. Therefore, the length of the local target mRNA adds up to 59 (20 + 19 + 20). We use one-hot encoding and pretrained RNA-FM encoding here. RNA-FM is a self-supervised model trained on 23 million ncRNA sequences from the RNAcentral database, and the embeddings it produces contain functional, sequential, structural, and evolutionary information [22]. Thus, we applied pretrained RNA-FM to help encode the structural and functional information of the siRNA sequence. Each nucleotide is transformed into a four-dimensional discrete vector by one-hot encoding and a 640-dimensional continuous vector via pretrained RNA-FM encoding.
Overall, the 21-nucleotide AS is transformed into two feature matrices with sizes of 21*4 and 21*640. To reduce the computational cost of RNA-FM, the 59-nucleotide local mRNA is encoded into only one matrix with a size of 59*4 using one-hot encoding. If the binding site is located near the end of the mRNA, the upstream or downstream region contains fewer than 20 nucleotides. In this case, we encode each missing nucleotide with the special vector [0.05, 0.05, 0.05, 0.05].
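The one-hot encoding of the local mRNA, including the [0.05, 0.05, 0.05, 0.05] padding for truncated flanks, can be sketched as below; the function and variable names are ours.

```python
NT_TO_ONEHOT = {
    "A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0], "U": [0, 0, 0, 1],
}
PAD = [0.05, 0.05, 0.05, 0.05]  # special vector for missing flank positions

def one_hot_local_mrna(site19, upstream, downstream, flank=20):
    """Encode the 59-nt local mRNA (20 + 19 + 20) as a 59 x 4 matrix.
    Flanks shorter than 20 nt (binding site near an mRNA end) are
    padded with the PAD vector."""
    up = [PAD] * (flank - len(upstream)) + [NT_TO_ONEHOT[b] for b in upstream]
    core = [NT_TO_ONEHOT[b] for b in site19]
    down = [NT_TO_ONEHOT[b] for b in downstream] + [PAD] * (flank - len(downstream))
    return up + core + down
```

The 19-nt core always occupies rows 20-38; only the flanks are ever padded, so the matrix shape is fixed regardless of where the target site falls on the mRNA.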
2) Encoding of siRNA thermodynamic properties
The thermodynamic stability of siRNA duplexes plays a vital role in the inhibition potency and longevity of siRNAs [26]. The absolute and relative thermodynamic stabilities of bases at the 5’ and 3’ ends of siRNAs determine which strand is bound to the RISC and hybridized with the target mRNA [2] [11] [29].
Here, we mainly consider the Gibbs free energy of the core region of the AS. There are 20 thermodynamic parameters in total, as shown in Table 2, including the energies of the 18 pairs of adjacent bases, the difference between the 5’ and 3’ ends, and the overall energy.
Table 2. The description of the thermodynamic properties (Gibbs free energy) in the core region of the AS.

| Feature description               | Number |
|-----------------------------------|--------|
| Every two adjacent base pairs     | 18     |
| Difference between 5’ and 3’ ends | 1      |
| Overall energy                    | 1      |
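A minimal sketch of assembling the 20 features in Table 2 from a dinucleotide free-energy lookup is shown below. The numeric values in `NN_DG` are illustrative placeholders, not the published nearest-neighbor parameters, and the function name is ours.

```python
# Illustrative nearest-neighbor free energies (kcal/mol); a real
# implementation would substitute published nearest-neighbor parameters.
NN_DG = {"AA": -0.9, "AU": -1.1, "UA": -1.3, "UU": -0.9,
         "GC": -3.4, "CG": -2.4, "GG": -2.9, "CC": -2.9,
         "GA": -2.3, "AG": -2.1, "GU": -2.1, "UG": -2.1,
         "CA": -2.1, "AC": -2.2, "CU": -2.1, "UC": -2.3}

def thermo_features(core19):
    """20 features for the 19-nt AS core: the 18 adjacent-pair
    energies, the 5'-minus-3' end difference, and the overall sum."""
    pairs = [NN_DG[core19[i:i + 2]] for i in range(len(core19) - 1)]  # 18 values
    end_diff = pairs[0] - pairs[-1]  # relative stability of the two ends
    return pairs + [end_diff, sum(pairs)]
```

The 19-nt core yields exactly 18 sliding dinucleotides, which is why the first row of Table 2 contributes 18 of the 20 parameters.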
3) Encoding of other biological features from expert knowledge
In addition to sequence context and thermodynamic properties, we extract four groups of features from expert knowledge.
Frequencies of multiple motifs. The sequence composition plays a primary role in determining the position of Dicer cleavage in both dsRNAs and short hairpin RNAs [30] [31]. Here, we count the occurrences of the 4, 16, and 64 possible motifs of lengths 1, 2, and 3 (1-mers, 2-mers, and 3-mers) in the siRNA sequence. This provides a simple way to encode the presence of specific nucleotides and short motifs.
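The k-mer counting above can be sketched in a few lines; the function name is ours.

```python
from itertools import product

def kmer_frequencies(seq, k):
    """Occurrence counts of all 4**k possible k-mers in seq, in a
    fixed motif order (4, 16, and 64 features for k = 1, 2, 3)."""
    motifs = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = {m: 0 for m in motifs}
    for i in range(len(seq) - k + 1):   # slide a length-k window
        counts[seq[i:i + k]] += 1
    return [counts[m] for m in motifs]
```

Concatenating the outputs for k = 1, 2, 3 gives the 4 + 16 + 64 = 84 motif-frequency features.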
Position-Specific Scoring Matrix (PSSM). This is a matrix specifying the scores for observing particular nucleotides at specific positions. In our study, this matrix is deduced from the training set. It has 4 rows and 21 columns, corresponding to the 4 kinds of bases and the 21 positions of the aligned ASs. When scoring a new AS sequence from the test set, we sum the negative log-odds scores of that AS matching the motifs in the PSSM. We use this score to represent the similarity between a new test sample and the overall training samples.
Activity of siRNA oligonucleotides. This is affected by the secondary structure of the target mRNA, especially at the target site [12] [13]. Unstructured siRNA duplexes can mediate more active silencing than those with base-paired ends, and the structure of the AS also affects potency [32]. We used RNAfold from the ViennaRNA package to predict the secondary structure of the AS and mRNA sequences, including the minimum free energy (MFE), dot-bracket notation (DBN), and centroid structure [33]. We used the MFE and the ratios of paired and unpaired bases in the DBN as features to describe the secondary structure of the AS and the local mRNA. A smaller MFE means a higher affinity of target binding and thus more effective gene silencing. In addition, we took the secondary structure of the global mRNA into account to estimate the accessibility of the target sites more accurately.
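Given a dot-bracket string from RNAfold, the paired/unpaired ratios used as features can be computed with stdlib code alone; the function name is ours, and obtaining the DBN itself still requires the ViennaRNA package.

```python
def dbn_pairing_ratios(dbn):
    """Fractions of paired ('(' or ')') and unpaired ('.') bases in a
    dot-bracket notation string produced by RNAfold."""
    paired = sum(1 for c in dbn if c in "()")
    unpaired = dbn.count(".")
    n = len(dbn)
    return paired / n, unpaired / n
```

For a structure like `((..))`, two thirds of the bases are paired and one third unpaired, so the features would be (0.667, 0.333).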
GC content and GC stretch. GC content is the ratio of bases G and C in the AS. The GC stretch is the maximum length of continuous bases G or C. A related study showed that the GC content of functional siRNAs tends to fall within 30%-60% (Das et al. 2013). This might affect potency via thermodynamic properties, since the Gibbs free energy of an A-U pair is much higher (less stable) than that of a G-C pair. The GC stretch determines the lower bound of the overall stability.
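Both quantities have straightforward definitions; a sketch (function names ours):

```python
def gc_content(seq):
    """Fraction of G and C bases in the AS."""
    return sum(1 for b in seq if b in "GC") / len(seq)

def gc_stretch(seq):
    """Length of the longest run of consecutive G or C bases."""
    best = run = 0
    for b in seq:
        run = run + 1 if b in "GC" else 0
        best = max(best, run)
    return best
```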
2.2.2 Convolution and Pooling Module
The predictor mainly uses a CNN to detect potentially decisive motifs for predicting inhibition. In the one-dimensional convolution and pooling module, there are 45 different kernels with multiple sizes, following [23]. The kernel sizes range from 5 to 20 for the one-hot encoding of the AS (M1), from 5 to 20 for the RNA-FM encoding of the AS (M2), and from 6 to 21 for the one-hot encoding of the local mRNA (M5). In each convolution operation, the kernel slides over all windows of adjacent bases. We use the ReLU activation function to increase the nonlinear mapping capacity. Average pooling and maximum pooling then follow each convolution operation. The purpose of pooling is to reduce the number of parameters, suppress noise, and collect representative patterns. With 45 different convolutions, this module produces a 90-dimensional vector (M7 + M8 + M9).
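The per-kernel computation can be sketched without a deep learning framework, using the one-hot case for simplicity; the function name is ours, and a real implementation would batch this over all 45 kernels on GPU.

```python
def conv_pool_1d(onehot, kernel):
    """One 1-D convolution of a length-L x 4 one-hot matrix with a
    k x 4 kernel, ReLU, then average- and max-pooling. Each of the
    45 kernels thus contributes 2 of the module's 90 output features."""
    k = len(kernel)
    acts = []
    for i in range(len(onehot) - k + 1):           # slide over all windows
        dot = sum(kernel[j][c] * onehot[i + j][c]
                  for j in range(k) for c in range(4))
        acts.append(max(0.0, dot))                  # ReLU
    return sum(acts) / len(acts), max(acts)         # (avg-pool, max-pool)
```

Average pooling summarizes how often a motif-like pattern occurs along the sequence, while max pooling records its single strongest match.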
2.2.3 Full Connection Module
In the fully connected module, we concatenate the pooling output (M7 + M8 + M9) and all other features together, including thermodynamic parameters (M3) and features extracted from expert knowledge (M4, M6). After batch normalization, the output is fed into four sequential submodules. Each submodule consists of one linear layer with ReLU activation and dropout. Finally, we use sigmoid activation to generate the prediction score, ranging from 0 to 1, the same as the inhibition label.
2.4 Experimental Setup
2.4.1 Sampling and training
It is well known that machine learning-based algorithms face the problem of out-of-distribution (OOD) generalization. Therefore, we sorted the siRNA samples in inhibition-descending order. During 10-fold cross-validation, we choose the samples with indices i, i + 10, i + 20, ... (i = 1, 2, ..., 10) to form the test set, so that the training and test sets share a similar inhibition distribution. The Adam optimizer is used to train the model with an initial learning rate of 0.01. As this is a regression task, we train the model with the mean absolute error (MAE) loss.
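The interleaved fold assignment can be sketched as follows; the function name is ours.

```python
def interleaved_folds(sorted_samples, n_folds=10):
    """After sorting by inhibition (descending), fold i holds samples
    i, i + 10, i + 20, ..., so every fold spans the full inhibition
    range and train/test distributions stay aligned."""
    return [sorted_samples[i::n_folds] for i in range(n_folds)]
```

Because each fold takes every tenth sample of the sorted list, each fold contains both high- and low-inhibition siRNAs rather than a contiguous slice of the distribution.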
2.4.2 Evaluation metrics
To assess the prediction capacity, we use two statistical metrics to evaluate our method: the Pearson correlation coefficient (PCC) and the Spearman correlation coefficient (SPCC). PCC evaluates the linear correlation between two sets of continuous values, while SPCC assesses the correlation of their discrete rankings. A high SPCC is more meaningful than a high PCC in siRNA design, since candidate siRNAs are selected by rank. They can be calculated with the following equations:
$$\mathrm{PCC} = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}} \qquad (1)$$

$$\mathrm{SPCC} = 1 - \frac{6\sum d_{i}^{2}}{n^{3} - n} \qquad (2)$$
where x denotes the predicted values, y the true values, n the sample size, and d_i the difference between the rank of x and the rank of y for the i-th pair of data.
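Equations (1) and (2) translate directly into stdlib code; the function names are ours, and the Spearman formula below, like Eq. (2), assumes no tied ranks.

```python
import math

def pcc(x, y):
    """Pearson correlation coefficient, Eq. (1)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    num = n * sum(a * b for a, b in zip(x, y)) - sx * sy
    den = math.sqrt((n * sum(a * a for a in x) - sx ** 2)
                    * (n * sum(b * b for b in y) - sy ** 2))
    return num / den

def spcc(x, y):
    """Spearman correlation coefficient, Eq. (2), assuming no ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, 1):
            r[i] = rank
        return r
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n ** 3 - n)
```

For perfectly linearly related values, pcc returns 1.0; spcc depends only on the rank differences d_i, so any monotonic transformation of the predictions leaves it unchanged.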