To obtain DEGs expressed in healthy individuals at baseline and after 26 weeks of treatment with 1.8 g of EPA+DHA mixture and HOSF as a negative control, publicly available microarray data set (GSE12375) was retrieved from the GEO repository. The unnormalized gene expression data (Supplementary Figure 1 [A, B]) was tested for quality assessment purposes calculating normalized unscaled standard error (NUSE). This quality control analysis, designed only for comparing arrays within one dataset, has revealed no signs of low-quality arrays with significantly elevated NUSE values (Supplementary Figure 2 [A, B]). Subsequently, data preprocessing was performed to eliminate the effect of background noise and to normalize and summarize the gene expression values per each probe of the database (Supplementary material 3 [A, B]) according to the standard protocol published elsewhere [15].
To evaluate the overall structure of the RNA-seq dataset, we performed the unsupervised dimensionality reduction of gene expression data (23,941 genes) at baseline and after 26 weeks of treatment with EPA+DHA mixture using PCA. As a result, no clustering was observed between EPA+DHA and HOSF samples (Figure 2 [A, B]) at these particular time points, indicating the possible limitation of the PCA methodology, which might be due to the effect size of the biological signal as well as on the fraction of samples containing this signal [16].
Next, the total number of DEGs was evaluated (Table 1 and Figure 3 [A]), showing a significant increase in the number of these genes after 26 weeks of treatment (1,805) with EPA+DHA mixture compared to the baseline (779). Similar patterns were observed in the study of Bouwens and coauthors where the EPA/DHA and HOSF consumptions using the same conditions and time points resulted in gene expression changes of 1,040 genes induced by PUFAs and 298 genes induced by the sunflower oil. Of these genes, 140 were overlapping between the groups, and 900 were belonged to the unique genes in the EPA/DHA group [5]. We identified 847 and 312 up-regulated unique genes and 753 and 262 unique down-regulated DEGs at baseline and after 26 weeks of PUFA treatment (Figure 3 [B, C]), where the number of up-regulated genes is slightly dominated over the down-regulated ones except for the overlapping (common) genes (97.5 ± 0.71 vs. 107.5 ± 0.71), showing the inverse dynamics (Figure 3 [C]).
To further investigate the functional impact of up-regulated and down-regulated DEGs upon the treatment with EPA+DHA mixture vs. control at different time points, these genes were mapped onto the DAVID database to identify the specific metabolic or disease pathways. This database is known to be one of the most popular tools in the field of high-throughput functional annotation bioinformatics and microarray analysis [17]. Subsequently, the most significantly enriched KEGG pathways were linked to the neurotrophin and mitogen-activated protein kinase (MAPK) signaling, complement/coagulation cascades, and axone guidance for the up-regulated DEGs and cancer or acute myeloid leukemia pathways, long-term depression, fructose/mannose and arachidonic acid metabolism for the down-regulated genes (Table 2). Further, all the genes related to these pathways were subjected to the PPI network analysis to observe their involvement in the interconnections of different pathways (Figure 4 [A-D]).
Previously, some PUFAs, such as arachidonic acid, were discovered to activate MAPK in rat liver epithelial WB cells by a protein kinase C-dependent mechanism [18]. Moreover, it has been already demonstrated in many in vitro, in vivo and in clinical studies that PUFAs can downregulate cancer-related cellular proteins and modify signaling to inhibit tumor growth and metastatic rate and to extend patients’ survival duration [19, 20]. Furthermore, PUFAs can promote the axonal outgrowth by translational regulation of Tau and collapsin response mediator protein 2 expression and probably for axonal guidance as well by the modulation of lipid rafts, which are essential for this particular pathway [21-23]. EPA and DHA are also well-known in alleviating symptoms of different mental illnesses starting from anxiety and depression to bipolar disorder and schizophrenia [24]. Recently, the PUFA effect on preoperative bleeding has been estimated in the randomized placebo-controlled clinical trial to be associated with a lower risk of bleeding at higher fish oil levels [25]. This information might support our findings that PUFAs could potentially trigger the upregulation of complement/coagulation cascade genes, reconsidering current recommendations not to use fish oil before cardiac surgery [26]. The negative impact of some PUFA dietary components on sugar metabolism has been also examined in the study, where the authors evaluated them for the prevention and treatment of type 2 diabetes [26]. The evidence might suggest that consumption of fish oil supplements at high doses could lead to a further worsening of glucose metabolism [27]. Another metabolic change caused by EPA and DHA is most likely related to the observed effects of marine n‐3 PUFAs on eicosanoid profiles, resulting in a decreased arachidonic acid metabolism via the inhibition of phospholipase A2 and cyclooxygenase [28].
To construct the PPI network and maximize the genome coverage, the human interactome was implemented as a nonredundant and undirected binary data set comprising 16,018 unique HGNC-curated protein IDs and 299, 018 interactions compiled from the literature [15]. As an outcome, the PPI networks belonging to cancer/leukemia pathways and MAPK/neurotrophin signaling were identified as the subnetworks with a high number of nodes and edges, where the less expressed proteins were located on the periphery of the network. Interestingly, the gene expression of neurotrophins and their target receptors were determined to be differentially up-regulated by n-3 PUFAs via an age-dependent mechanism in the C57BL/6 mice cerebral cortex [29]. On the contrary, DHA has previously been shown to inhibit the induction and progression of acute myeloid leukemia cell lines (KG1a and HL-60) via the DHA-induced apoptosis, increasing the expression of the pro-apoptotic Bax protein [30, 31].
Before applying supervised ML algorithms, the inferential statistics, such as one-way ANOVA, was performed on DEGs comprising the analyzed subnetworks, to show that most of the gene expression data were statistically significantly different (p-value < 0.05) except for the plakoglobin (JUP), lymphoid enhancer-binding factor 1 (LEF1) genes, and MYC proto-oncogene from cancer/leukemia pathways (Table 3). The logistic regression, naïve Bayes, and DNN models were implemented to solve the binary (EPA/DHA vs. HOSF) classification problem together with the receiver operating characteristics (ROC) analysis. The ROC analysis is considered to be valuable in evaluating predictive models, including gene expression pipelines and molecular docking experiments, since it captures the trade-off between sensitivity and specificity over a continuous range [32-34]. The ROC curves were plotted for the ML models (Supplementary Figure 4 and 5 [A-C]) to assess the true and false positive rates as the total number of correct positive results predicted among all the positive samples and the total number of incorrect positive predictions among all negative samples in the dataset. Besides, the area under the ROC curve (AUC) was calculated as a numerical value that can be used to compare the logistic regression and naïve Bayes models getting maximal AUC values for the former model (Tables 4 and 5). Ideally, the best prediction model produces a curve on the top-left corner (0, 1) indicating perfect classification (100% sensitivity and specificity), whereas our ROC curves occurred in the top half of the graph, giving only a better-than-average model prediction. In our case, the ROC/AUC analysis cannot be used for screening and diagnostic purposes as is it lacks a minimum sample size (≈ 300) required to estimate the ML performance more precisely [35].
The DNN classifier with three hidden layers provided the worst performance on all the datasets with the lowest accuracy and harmonic mean of the precision and recall (F-score) values (Table 6) due to the small sample size (n = 96). Indeed, logistic regression models for binary classification can provide better performance than DNN because they are less prone to overfitting and not so difficult to train [36, 37]. Overall, the models exhibited their best performance values for the MAPK signaling pathway except for the Naïve Bayes model classifier, where the highest performance values were detected for cancer/leukemia pathway. Moreover, the “refined” model performances for this pathway were significantly improved after the exclusion of the JUP, LEF1, and MYC genes (p-value ≥ 0.05), which has contributed by decreasing the accuracy and F-score values, especially in the performance of DNN model. Finally, the decision surface analysis was utilized using the top two feature importance determinants for DEGs that contributed towards classifying the EPA/DHA and HOSF samples (Supplementary Figures 6 and 7 [A-C]). From Figures 5 and 6, it is clear that the logistic regression and naïve Bayes methods were able to separate the majority of the EPA/DHA samples from the HOSF ones, learning the underlying patterns quite well based on the most important DEG features. Even though the naïve Bayes classifier is considered to be one of the most popular classifiers for class prediction or pattern recognition from microarray gene expression data, it is highly sensitive to any outliers with the classical estimates of the location and scale parameters [38]. In fact, a discrepancy was observed between the Naïve Bayes model performance and data clustering for the cancer/leukemia pathway before the model “refinement”, where the good model performance produced no data clustering (Figure 6 [C]). This phenomenon might be explained by the interference of high DEG number (n = 12), small relative importance scores for the top feature contributors, and the presence of statistically insignificant genes. In addition, the outlier analysis for the logistic regression and naïve Bayes models confirmed by the previous results on the model performances, where the former model outperformed the latter one by a decreasing number of outliers observed in cancer/leukemia pathway clustering (Table 7). Indeed, the outlier detection and removal procedure previously published elsewhere was able to improve the ML classification accuracy from 63% to 76% by reducing the variance of the training data and matching the accuracy of clinical judgment of medical experts [39]. Given that there is a limited number of medical facilities providing highly specialized and complex health care, this approach may be useful to improve the diagnostic process for patients without access to these facilities.