Differentially expressed miRNAs were identified by the random forest algorithm.
To predict chemotherapy response from serum miRNAs, we used the GSE56264 dataset to select chemotherapy-associated feature miRNAs. We then validated the candidate miRNAs in an independent cohort of 15 chemotherapy responders and 15 non-responders (Fig. 1).
To characterize miRNA expression in platinum-based chemotherapy responders and non-responders, we performed differential expression analysis of the GSE56264 dataset. We first checked for batch effects between samples using boxplots (Supplementary Fig. 1). Because of the large batch effect between samples, we removed miRNA probes that had zero or missing values in more than 10 samples. After removal of these non-expressed probes, 249 miRNAs were retained for quantile normalization. After cleaning and normalizing the dataset, we performed principal component analysis (PCA) of all samples in GSE56264 to explore the heterogeneity in miRNA expression between chemotherapy responders and non-responders. PCA showed that chemotherapy responders had a different miRNA expression pattern from non-responders (Fig. 2A, Supplementary Fig. 2A). Differential expression analysis revealed eight miRNAs that were significantly differentially expressed between responders and non-responders (Fig. 2B). Notably, although the log fold changes exceeded the cut-off we set, violin plots showed that the differences between responders and non-responders were not large enough for these miRNAs to serve as predictive biomarkers, because the expression of the differentially expressed miRNAs was zero in most cases in both groups.
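The probe filtering and quantile normalization steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the pipeline used in the study: the function names, the dictionary layout (probe name mapped to per-sample values), the replacement of remaining missing values with zero, and the tie handling are all assumptions made for the example.

```python
from statistics import mean


def filter_probes(expr, max_bad=10):
    """Keep probes with zero or missing values in at most max_bad samples.

    expr maps a probe name to a list of per-sample values (None = missing).
    Assumption for this sketch: missing values that remain are set to 0.0.
    """
    kept = {}
    for probe, values in expr.items():
        n_bad = sum(1 for v in values if v is None or v == 0)
        if n_bad <= max_bad:
            kept[probe] = [0.0 if v is None else float(v) for v in values]
    return kept


def quantile_normalize(expr):
    """Force every sample (column) onto the same reference distribution.

    The reference value for rank k is the mean of the k-th smallest value
    across samples. Ties are broken by probe order rather than averaged.
    """
    probes = list(expr)
    n_samples = len(expr[probes[0]])
    columns = [[expr[p][j] for p in probes] for j in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    reference = [mean(col[k] for col in sorted_cols) for k in range(len(probes))]

    normalized = {p: [0.0] * n_samples for p in probes}
    for j, col in enumerate(columns):
        # Indices of probes ordered from lowest to highest value in sample j.
        order = sorted(range(len(probes)), key=lambda i: col[i])
        for rank, i in enumerate(order):
            normalized[probes[i]][j] = reference[rank]
    return normalized
```

With `max_bad=10`, `filter_probes` mirrors the threshold stated in the text; after `quantile_normalize`, every sample contains the same set of values, assigned according to each probe's rank within that sample.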
To further investigate the chemotherapy miRNA signature and overcome the limitations of conventional differential expression analysis, we applied a random forest machine learning algorithm to the dataset. All 249 miRNAs retained after data cleaning were entered into the random forest model, which identified the 20 most important chemotherapy feature miRNAs (Fig. 3A). Interestingly, there was no overlap between the miRNAs identified by random forest and those identified by differential expression analysis. The PCA plot also showed slightly greater separation along the first principal component than that obtained with differential expression analysis, suggesting that the machine learning algorithm could identify features more accurately than conventional differential expression analysis (Fig. 3B).
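To illustrate how a random forest ranks features, the sketch below builds a deliberately simplified forest: each tree is a single split (depth one) fitted on a bootstrap sample with a random subset of features, and importance is the mean decrease in Gini impurity. This is an assumption-laden teaching example, not the model used in the study; a real analysis would typically rely on an established library with fully grown trees. The function names and the default parameters are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(X, y, features):
    """Return (feature, impurity decrease) for the best single-threshold split."""
    parent = gini(y)
    n = len(y)
    best = (None, 0.0)
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i in range(n) if X[i][f] <= t]
            right = [y[i] for i in range(n) if X[i][f] > t]
            if not left or not right:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            if parent - child > best[1]:
                best = (f, parent - child)
    return best


def forest_importance(X, y, n_trees=200, seed=0):
    """Rank features by mean impurity decrease over depth-one trees.

    Returns (ranking, importance): feature indices sorted from most to
    least important, and the importance score of each feature.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = max(1, int(p ** 0.5))  # number of features tried per tree
    importance = [0.0] * p
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feats = rng.sample(range(p), m)              # random feature subset
        f, gain = best_split([X[i] for i in rows], [y[i] for i in rows], feats)
        if f is not None:
            importance[f] += gain / n_trees
    ranking = sorted(range(p), key=lambda f: importance[f], reverse=True)
    return ranking, importance
```

Taking the first 20 entries of `ranking` would correspond to selecting the 20 most important features, as described above. Because features are scored jointly across many resampled trees rather than by fold change alone, the resulting list need not overlap with a differential expression result.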
The predictive value of the chemotherapy miRNA signature was confirmed by three different machine learning algorithms.
Although random forest ranked the importance of each miRNA, we still needed to determine how many miRNAs to include in the chemotherapy response signature. To this end, we evaluated the predictive capacity of the miRNAs using naïve Bayes (NB), the latent Dirichlet allocation (LDA) and support vector machine (SVM) models. In all three models, prediction accuracy did not increase when more than five miRNAs were included, suggesting that the five most important miRNAs achieved the best predictive accuracy (Fig. 4). Therefore, we selected the top five miRNAs (miR-196b, miR-34c-5p, miR-181b, miR-27b and miR-26a) as the chemotherapy predictive miRNA signature. A heatmap of these five miRNAs revealed a difference between chemotherapy responders and non-responders (Fig. 5A). We also found that these five miRNAs were detectable by microarray (Fig. 5B), suggesting that they were ideal for biomarker design.
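The idea of choosing the signature size by tracking accuracy as more top-ranked miRNAs are added can be sketched as below. The example uses a small Gaussian naïve Bayes classifier with leave-one-out cross-validation; both the validation scheme and the variance floor are assumptions for illustration, since the text does not specify them, and the function names are hypothetical.

```python
import math
from statistics import mean, pvariance


def fit_gaussian_nb(X, y):
    """Estimate per-class prior, feature means and feature variances."""
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        cols = list(zip(*rows))
        model[label] = (
            len(rows) / len(y),
            [mean(c) for c in cols],
            [pvariance(c) + 1e-6 for c in cols],  # floor avoids zero variance
        )
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(prior, mus, variances):
        total = math.log(prior)
        for v, mu, var in zip(x, mus, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(*model[label]))


def loo_accuracy(X, y):
    """Leave-one-out accuracy: train on all samples but one, test on that one."""
    hits = 0
    for i in range(len(y)):
        model = fit_gaussian_nb(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict_gaussian_nb(model, X[i]) == y[i]
    return hits / len(y)


def accuracy_by_signature_size(X, y, ranking, max_k):
    """Accuracy using only the top-k ranked features, for k = 1..max_k."""
    return [
        loo_accuracy([[row[f] for f in ranking[:k]] for row in X], y)
        for k in range(1, max_k + 1)
    ]
```

The returned list gives one accuracy per signature size; the smallest size beyond which accuracy stops improving is then a natural cut-off, which is the reasoning applied above to arrive at five miRNAs.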
Functional annotation analysis showed that the five miRNAs were enriched in biosynthesis.
To identify the functional pathways enriched for the five miRNAs, we performed KEGG and GO functional annotation analysis using DIANA. GO Biological Process (BP) terms for the five miRNAs included cellular nitrogen compound metabolic process; biosynthetic process; Fc-epsilon receptor signalling pathway; gene expression; symbiosis, encompassing mutualism through parasitism; and L-fucose catabolic process. The GO Cellular Component (CC) term was organelle. GO Molecular Function (MF) terms included ion binding; nucleic acid binding transcription factor activity; and molecular function. KEGG pathways included Glycosphingolipid biosynthesis, Hematopoietic cell lineage, Synaptic vesicle cycle, Glycosphingolipid biosynthesis, Mucin type O-Glycan biosynthesis, MicroRNAs in cancer and Thyroid cancer (Fig. 6, Supplementary Fig. 3).
The predictive value of miR-196b and miR-34c-5p for chemotherapy response was validated in patient serum.
Although the five-miRNA signature showed excellent predictive capacity, only miR-196b and miR-34c-5p showed large effect sizes between responders and non-responders. Because both miR-196b and miR-34c-5p play critical roles in lung carcinogenesis, we included only these two miRNAs in the clinical validation. In addition, because blood (serum)-based liquid biopsy has shown potential clinical utility, we hypothesized that serum miR-196b and miR-34c-5p levels could predict the chemotherapy response of lung adenocarcinoma patients. A total of 30 patients with pathologically diagnosed lung adenocarcinoma were included in this study (Supplementary Table 1). As expected, we observed a significant difference in miR-196b expression between responders and non-responders (p = 2.365 × 10⁻⁵). The expression of miR-34c-5p was significantly higher in non-responders than in responders (Fig. 7A). More importantly, both miR-196b and miR-34c-5p showed high predictive accuracy (0.898 and 0.931, respectively) in our cohort (Fig. 7B). Taken together, these results show that serum miR-196b and miR-34c-5p are differentially expressed between chemotherapy responders and non-responders and are potential predictive biomarkers of chemotherapy response in patients with lung adenocarcinoma.
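One common way to summarize how well a single serum marker separates two groups is the area under the ROC curve (AUC). The text does not state which metric underlies the reported accuracy values, so the sketch below is an assumption offered for illustration only; the function name and the example values are hypothetical and are not study data.

```python
def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve for one marker.

    Computed as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical expression levels, with the higher-expressing group as "positive".
example_auc = roc_auc([3, 4, 5], [1, 2, 3])
```

An AUC of 0.5 indicates no separation and 1.0 indicates perfect separation. For a marker that is lower in the group of interest, the two arguments can be swapped so that the value stays above 0.5.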