Defining the HetEnc architecture
HetEnc is inspired by the domain separation network[23] developed for image analysis. The domain separation network extracts image representations into two subspaces: one private component, and another component shared by different domains. Here, we implemented a similar idea in HetEnc to represent the gene expression, to show the platform-shared (or platform-independent) information by organizing different platforms’ data into the designated encoding networks.
The entire HetEnc architecture is composed of two modules. The first feature representation module is the key module, which is designed to extract the gene expression representation into different subspaces via different representing or encoding networks. The first module involves three distinct encoding networks; Autoencoder (AE), CombNet and CrossNet (Figure 2a), for extracting different subspaces of the feature representation, respectively. The second module of HetEnc is the modeling step, which is basically a six-layer deep neural network (named 6-DNN) used to predict targeted endpoints using the intermediate features (Figure 2b).
In the feature representation step, the most differences between three networks (AE, CombNet and CrossNet) are the definition of input and output data, which could be the same or different platform. In all, there are four different combinations of microarray and RNA-seq, as shown in Figure 2a. For example, if microarray is the primary input platform, AE will use type (a). CombNet will use the input-output combination of (a) and (b). CrossNet will use the combination of (c) and (d). On the other hand, If RNA-seq is the primary input platform, AE will use type (b). CombNet will use the combination of (a) and (d), and CrossNet will use the combination of (c) and (d). Note that CombNet and CrossNet will not change when primary platform changed. The three intermediate feature sets generated by three encoding networks were named Feature-A, Feature-B and Feature-C for AE, CombNet and CrossNet, respectively.
The hyper-parameters of AE are described as follows: in total, nine layers were formed, with the number of nodes, p, 4096, 3072, 2048, 1024, 2048, 3072, 4096, p, respectively, where p was the number of input (and reconstructed) gene features. The first four layers are the encoding network and the last four layers are the decoding network. The compressed feature set, therefore, is from the bottleneck (i.e., intermediate) layer, with 1024 nodes. We used the hyperbolic tangent (Tanh) activation function for all AE hidden layers, and the sigmoid (logistic) activation function for the output layer. For denoising, we also added one drop-out layer between sets of two layers, and the drop-out ratio was set to 0.2.
In the modeling step, the six-layer feed-forward deep neural network (6-DNN) is depicted in Figure 2b. The hyper-parameters of 6-DNN are listed here: (1) Network size: sizes for each layer (i.e., node) in the network are x, 1024, 512, 256, 128 and 2, respectively; where x is the size of the intermediate features set, and 2 is a categorical endpoint for a binary endpoint. For most of this study, x=3072; when using one or two AE models in comparative analysis, the input shape would also change to 1024, 2048, respectively. (2) Activation function: we used the Rectified Linear Unit (RELU) activation function for all dense hidden layers, and Softmax activation for the output layer, as a classification task. (3) Regularization: between two hidden (dense) layers, batch normalization was added for purposes of regularization, as depicted in Figure 2b. Due to concern over introducing bias when using batch normalization and drop-out simultaneously (Li, et al., 2018), the drop-out layer is not implemented in the 6-DNN network structure.
Predictive Performance on SEQC Neuroblastoma Dataset
We evaluated our model on the SEQC neuroblastoma dataset. In total, six predictive models were trained for endpoints FAV, OS_All and OS_HR, and two data platforms (Microarray, RNA-seq), respectively. Because the first step (feature representation) is unsupervised, the encoding networks (i.e., AE, CombNet, CrossNet) generated by the first step would be shared between three endpoints in the modeling step.
We first applied Principle Component Analysis on the intermediate features generated by three encoding networks. Latent features from each encoding network will be analyzed both independently and combined as HetEnc features, as shown in Figure 3a. Microarray and RNA-seq samples were combined in PCA analysis. For AE features, latent features (Feature-A) of Microarray and RNA-seq samples were generated by different AE models; For CombNet and CrossNet, the latent features (feature-B and C) were generated by the same model. As a result, we observed in all PCA results, Microarray and RNA-seq samples are highly distinguished from each other along Principle Component 1 (PC1), indicating the large inherent differences (platform-related variance) between these two platforms. Compared to AE, CombNet and CrossNet has a closer distance between Microarray and RNA-seq samples on PC1. On the other hand, Microarray and RNA-seq samples fall into a similar range of PC2 particularly in CombNet and CrossNet, implying PC2 reflected some common properties (i.e., platform-independent variance) between two platforms.
To further reveal the platform-independent variance in PC2, a correlation analysis was performed on PC2 between Microarray and RNA-seq of the same sample. As shown in Figure 3b, the pair-wise correlation of PC2 from AE between Microarray and RNA-seq is not significant (r2=0.289). On the other side, there are high linear correlation of PC2 from CombNet and CrossNet between RNA-seq and Microarray, as r2 reached 0.921 and 0.89 respectively. Further, when combined latent features from AE, CombNet and CrossNet together as HetEnc, the linear correlation between two platforms became higher (r2=0.949). This PCA and pair-wise PC2 correlation analysis result indicated that CombNet and CrossNet could represent platform-independent features from the raw dataset.
The predictive models were first constructed and evaluated by five-fold cross-validation within the training dataset. In the five-fold cross-validation, we randomly separated the entire training dataset into five subgroups; where in each run, four subgroups of samples were used to train the model, and the remaining one was used as a testing set for evaluating the model’s performance. Each subgroup was tested once. By picking different random seeds, we repeated the five-fold cross-validation 20 times; therefore, a total of 100 (5*20) sub-models were built for each endpoint.
After evaluating the model via cross-validation, we trained one model using all 249 training samples for each endpoint. The model was finally evaluated on the test dataset with the other 249 samples. Similarly, we set different random seeds to run the whole modeling process 100 times to retrieve an average predicting performance. We measured the model’s performance for three endpoints (FAV, OS_All, OS_HR). As shown in Table 2, we observed a small standard deviation among 100 repeats in both cross-validation and external testing, indicating initial random seed had little influence on the overall performance. We also observed that the external evaluation performance was very close to the cross-validation result, confirming that our HetEnc model did not overfit the training dataset.
We compared our model performance to three machine learning models - support vector machine (SVM), nearest shrunken centroids (NSC) and k-nearest neighbors (KNN) -using the same training/testing sample distribution and data preprocessing (i.e., using the same 10042 genes). A detailed modeling process of SVM, NSC and KNN can be found elsewhere[24]. Furthermore, we compared our result to the submitted best models from all analysis teams in the SEQC/MAQC consortium[4], which held the same data distribution (i.e., training/testing split), but had no restrictions for machine learning methods (i.e, data normalization, feature selection, modeling algorithm, etc.) or expression datasets. The SEQC predictive models were built during the SEQC project, where a total of 6 microarray and 54 RNA-seq models were constructed, and we used Area under the Receiver Operating Characteristic Curves (AUC) for comparison.
The performances (AUC) of HetEnc, KNN, NSC, SVM and the best model from SEQC/MAQC for three Endpoints (FAV, OS_All and OS_HR) with two platforms (RNA-seq and Microarray) was shown in Table 2. Since OS_HR in average showed a low performance regardless of the platform and modeling algorithm, we notated this endpoint is not predictable by current dataset and its performance would not affect the comparison result. After all, we observed that these best models from SEQC analysis teams showed better overall performance than the models constructed by KNN, NSC and SVM in the previous study. One possible explanation could be the restriction of genes to those with one-to-one mapping between RNA-seq and microarray. However, this restriction did not have any detrimental effect for our HetEnc model. As a result, our HetEnc model still showed a significantly better predicting performance (p<0.01) than the best fine-tuned predictive models from the SEQC community.