Gene expression is fundamental to biological processes and is the pathway through which genetic information is converted into biological function. Its regulation is essential for maintaining normal physiological activity in organisms and plays a central role in cellular function (Kapuria et al., 2012; Strober et al., 2019; White, 1954). In disease research, abnormal gene expression is often associated with the onset and progression of disease. For example, aberrant expression of genes such as P53 and PTEN has been closely linked to the development of multiple cancers (Di Cristofano & Pandolfi, 2000; Liu et al., 2022; Marei et al., 2021). Accurate prediction of gene expression can therefore deepen our understanding of the molecular mechanisms of disease and provide a theoretical foundation for disease prevention, diagnosis, and treatment.
DNA methylation, an important epigenetic modification, regulates gene expression through the addition of methyl groups to cytosine residues in DNA. This modification plays a crucial role in gene silencing and activation, thereby influencing cellular function and development. Methylation patterns can also be inherited during cell division, shaping the gene expression patterns of progeny cells, and they play a vital role in many biological processes, including embryonic development, cell differentiation, and the onset and progression of disease (Bird, 2002). Changes in methylation patterns, particularly abnormal methylation in gene promoter regions, are often associated with the development of various diseases and can serve as biomarkers for tumors (Jones & Baylin, 2007). In-depth study of the impact of methylation on gene expression therefore not only helps to explain the complex mechanisms of gene regulation but may also provide new perspectives and methods for treating a wide range of diseases, particularly cancers and genetic disorders.
To better understand the relationship between methylation and gene expression, it is crucial to study distal regulatory regions as well as promoters. Enhancers play a significant role in the regulation of gene expression in cancers (Sur & Taipale, 2016), and may be located far from the genes they regulate, in some cases megabases away (Mora et al., 2016). For instance, in T-cell acute lymphoblastic leukemia, a super-enhancer for the MYC gene lies 1.47 Mb from the gene's transcription start site (TSS) (Herranz et al., 2014). This highlights the importance of considering the methylation status of gene-distal regions in gene regulation and related disease research. Seal et al. (2020) used deep learning to integrate DNA methylation and copy number variation data to predict gene expression, but their analysis was limited to promoter regions within 1,500 bp of the TSS and overlooked methylation in other regions. geneEXPLORE (Kim et al., 2020) emphasized the importance of long-distance DNA methylation in predicting gene expression and proposed that methylation at sites distant from a gene may be more informative than proximal methylation. However, its prediction accuracy (R²) was only 0.491, which leaves room for improvement.
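The difference between a promoter-only window and a long-range window can be made concrete with a minimal sketch. It is an illustration only, not code from any of the cited studies: the `CpGSite` record, the `sites_within` helper, the probe identifiers, and all coordinates are hypothetical.

```python
from typing import List, NamedTuple


class CpGSite(NamedTuple):
    probe_id: str  # hypothetical probe identifier
    chrom: str     # chromosome name
    position: int  # genomic coordinate of the CpG site


def sites_within(sites: List[CpGSite], chrom: str, tss: int,
                 radius_bp: int) -> List[CpGSite]:
    """Return the sites on `chrom` lying within `radius_bp` of the TSS."""
    return [s for s in sites
            if s.chrom == chrom and abs(s.position - tss) <= radius_bp]


# Made-up coordinates, chosen only to illustrate the two window sizes.
sites = [
    CpGSite("cg_a", "chr8", 1_000_500),  # 500 bp from the TSS
    CpGSite("cg_b", "chr8", 1_400_000),  # 400 kb from the TSS
    CpGSite("cg_c", "chr8", 2_470_000),  # 1.47 Mb from the TSS
    CpGSite("cg_d", "chr1", 1_000_100),  # different chromosome
]
tss = 1_000_000

promoter_sites = sites_within(sites, "chr8", tss, 1_500)      # promoter-only window
long_range_sites = sites_within(sites, "chr8", tss, 1_000_000)  # 1 Mb radius
```

In this toy example the 1,500 bp window keeps only `cg_a`, whereas the 1 Mb window also captures the distal site `cg_b`.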
In recent years, deep learning has been applied extensively in biology, including genomics (Alharbi & Rashid, 2022; Montesinos-López et al., 2021; Zou et al., 2019), gene regulation (Gan et al., 2019; Li et al., 2019), protein structure prediction and functional analysis (Baek & Baker, 2022; Bileschi et al., 2022; Jisna & Jayaraj, 2021; Pakhrin et al., 2021), drug discovery (Jiménez-Luna et al., 2020; Nag et al., 2022; Pandey et al., 2022; Tropsha et al., 2024), and biomarker discovery (Echle et al., 2021; Liang et al., 2023; Mandair et al., 2023; Steyaert et al., 2023). Feature extraction is of paramount importance in these applications, and deep learning, particularly the convolutional neural network (CNN), has demonstrated robust capabilities in this area. Gunasekaran et al. (2021) used a CNN to extract features from DNA sequences and then processed these features with bidirectional long short-term memory networks, with the aim of capturing sequence context and thereby increasing the accuracy of DNA sequence classification. DeepRHD (Routray et al., 2022) employs a multi-level feature extraction approach that uses CNNs as feature map extractors and integrates additional machine learning models to optimize classification. Tests on multiple standard datasets show that this method significantly enhances the accuracy and efficiency of remote homology detection in proteins, particularly for protein sequences with complex structures. More broadly, deep learning is increasingly used in biomedical fields to handle large-scale data and extract complex patterns and features. These capabilities are crucial for understanding the intricacies of gene expression regulation and for predicting gene expression accurately.
To further improve the accuracy of predicting gene expression from methylation, we propose DeepMethyGene, which applies conventional deep learning methods, namely variable convolutional kernels and ResNet blocks, to this prediction task. DeepMethyGene predicted the expression levels of 13,982 genes in TCGA breast cancer data and achieved an R² of 0.64 under five-fold cross-validation. DeepMethyGene significantly outperformed geneEXPLORE in predicting gene expression, with the largest gains in prediction accuracy for genes that had few methylation sites within a 1 Mb radius of the gene and a small average distance between those sites and the TSS.
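For readers unfamiliar with the metric, the sketch below shows one common way to compute a five-fold cross-validated R², defined here as one minus the ratio of the residual to the total sum of squares. It is an assumption-laden illustration rather than the DeepMethyGene evaluation code: the fold layout, the averaging of per-fold scores, and the one-feature least-squares model `fit_line` (a stand-in for any predictor) are all hypothetical choices.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot  # requires observed values that are not all equal


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous test folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(xs, ys):
    """Hypothetical stand-in model: one-feature least-squares regression."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    return lambda a: my + slope * (a - mx)


def cross_validated_r2(x, y, fit, k=5):
    """Average the held-out R^2 over k folds."""
    scores = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        predict = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        preds = [predict(x[i]) for i in test_idx]
        scores.append(r_squared([y[i] for i in test_idx], preds))
    return sum(scores) / len(scores)


# Toy data: a single methylation-like feature and a perfectly linear response.
x = [float(v) for v in range(10)]
y = [2.0 * v + 1.0 for v in x]
score = cross_validated_r2(x, y, fit_line, k=5)
```

Because the toy response is exactly linear, the held-out R² is essentially 1 in every fold; on real expression data the per-fold scores would vary, and other aggregation schemes (for example, pooling predictions across folds before scoring) are equally plausible.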