Machine learning models predict the immunotherapy response in tumor based on DNA methylation

doi:10.21203/rs.3.rs-4832764/v1

Research Article

This work is licensed under a CC BY 4.0 License

Background

Epigenetic control of immune responses plays a crucial role in the development and progression of cancer. Response rates to tumour immunotherapy are currently low, so new biomarkers and predictive models are needed to estimate response reliably.

Methods

We screened for CpG loci whose methylation differed between patients who did and did not respond to immunotherapy, and performed enrichment analysis of the pathways relevant to this response. We investigated the methylation levels of immunotherapy-associated loci across tissues and summarized the QTL features associated with several CpG loci. We examined the relationship between TMB, NeoAg, and PD-L1 levels and the effectiveness of immunotherapy. Motif analysis was used to identify base preferences in DNA sequences and to reveal sequence patterns linked with DNA methylation. We built seven machine learning models (Lasso regression, XGBoost, SVM, random forest, KNN, Naive Bayes, and decision tree), compared their performance, and chose the best model.

Results

The five CpG loci most significantly associated with response to tumour immunotherapy were cg00045061, cg00107488, cg00056433, cg00090974, and cg00072957. By screening differentially methylated sites, we identified an immunotherapy-associated pathway, the ubiquitination-proteasome system. Most of the differentially methylated CpG loci were located in the N Shore region of CpG islands. GO enrichment analysis identified regulation of microvillus length and CXCR4 chemokine receptor binding as the top two terms. Overall, the random forest model was considered the best of the machine learning models (precision: 0.859, F1 score: 0.907, recall: 0.941, ROC AUC: 0.654).

Conclusion

Tumour methylation sites have the potential to serve as biomarkers for predicting the effectiveness of tumour immunotherapy and to support future clinical applications. Among the machine learning algorithms compared, the random forest model performed best at predicting immunotherapy response from methylation sites.

In recent years, immunotherapy, represented by immune checkpoint inhibitors (ICIs) such as PD-1/PD-L1 inhibitors, has become increasingly significant in the treatment of tumors. When targeted therapy is not an option, either because no driver gene alteration is detected or because no targeted medication is currently available, immunotherapy may be a useful option for certain patients with advanced cancer.

Anti-tumor immunity is governed by many factors, including immune-related genes, the tumor immune microenvironment, and epigenetics. Immunotherapy does not benefit all patients, because some tumors are inherently resistant to these drugs. Even in cancers that can be treated effectively with immunotherapy, only about 20%-40% of patients respond[1]. The effectiveness of immunotherapy is further reduced in patients with driver gene alterations such as EGFR and ALK. Furthermore, most patients eventually develop adaptive resistance and may experience severe toxic side effects.

DNA methylation is a chemical modification of DNA that can alter gene expression without changing the DNA sequence. It refers to the covalent addition of a methyl group to the carbon-5 position of cytosine in genomic CpG dinucleotides by DNA methyltransferases[2]. Regions rich in CpG sites are known as CpG islands (CGIs) and are mainly located in the promoter and first exon regions of structural genes. The methylation status of certain immune-related genes can affect the tumor response to immunotherapy; for example, the methylation status of the PD-L1 gene can regulate its expression, thereby affecting the effectiveness of immune checkpoint inhibitors[4].

The immunotherapy biomarkers currently approved by the U.S. Food and Drug Administration (FDA) are PD-L1, tumor mutation burden (TMB), and microsatellite instability/mismatch repair deficiency (MSI/dMMR)[3]. MSI and dMMR are causally related: malfunctions in the mismatch repair (MMR) system result in microsatellite instability. Tumors with high microsatellite instability (MSI-H) or dMMR respond better to immune checkpoint drugs, particularly PD-1/PD-L1 inhibitors. These standard biomarkers may have inadequate sensitivity and specificity, resulting in false positive or false negative results. For instance, while high PD-L1 expression is typically linked to a favorable response to immunotherapy, not all patients with high expression respond to treatment[5], and conversely, not all responders have high expression. Other marker assays, such as TMB assays, are expensive and require complex technical equipment. Moreover, the lack of standardization across laboratories and assays can make results difficult to compare. Immunohistochemical tests for PD-L1 can differ significantly between laboratories, which can affect clinical decision-making[21]. Tumour heterogeneity also affects the precision of biomarkers, so additional effective biomarkers for predicting immunotherapy response are needed to support clinical decision-making.

Previous studies have shown that methylation sites can serve as potential biomarkers for predicting the efficacy of tumor immunotherapy[6]. In recent years, cfDNA methylation testing of peripheral blood has become an important modality for the non-invasive diagnosis of early-stage tumors[7]. DNA methylation controls the activity of immune-related genes, such as antigen-presentation genes, immune checkpoint genes, and cytokine genes. The methylation status of these genes determines their expression levels, which in turn affects the immune response. Furthermore, the methylation status of the MGMT promoter in certain tumors, such as glioblastoma, can serve as a potential biomarker: it is associated with sensitivity to specific chemotherapeutic agents and may also affect the response to immunotherapy. Similarly, the methylation status of the PD-L1 gene promoter can influence its expression and consequently affect the tumor's response to PD-1/PD-L1 inhibitors. In this study, we compare the methylation profiles of patients who did and did not respond to immunotherapy, and we build prediction models with several machine learning approaches to compare their ability to predict treatment response.

Datasets

This study used two previously published datasets from the Gene Expression Omnibus (GEO) database, GSE175699 and GSE126043. The platform for GSE175699 was GPL23976 and the platform for GSE126043 was GPL21145. GSE175699 covers melanoma patients and GSE126043 covers lung cancer patients. In the GSE175699 protocol, bisulphite-converted DNA was amplified, fragmented, and hybridised to the Illumina Infinium MethylationEPIC BeadChip using the standard Illumina protocol. The clinical data include gender, age, brafmut, nrasmut, ICI response, brain metastasis, and tissue in relation to ICI treatment. We combined the CpG loci from the two datasets for the subsequent analyses. The combined dataset included 78 patients, of whom 28 responded to immunotherapy and 50 did not.

Differential methylation analysis

We used the limma package to perform differential analysis between the immunotherapy response and non-response groups, and the dplyr package for data processing, in RStudio (version 4.3.1). We further utilized the EWAS Catalog website (https://ewascatalog.org/) to analyze the location and region of the methylation sites. We mapped the methylation sites to gene names and then analyzed the protein-protein interaction (PPI) networks of the separate and combined datasets on the STRING website (https://cn.string-db.org). We clustered the protein interaction networks and then analyzed the corresponding pathways. We investigated the correlation between the differentially methylated sites and clinical characteristics, tumour mutational burden (TMB), and prognosis using data from the TCGA database and other datasets. We compared the methylation levels of individual CpG loci between the groups that did and did not respond to immunotherapy, and plotted the results as box plots (Figure 3 B-E).
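To make the group comparison concrete, the following Python sketch ranks CpG loci by how strongly their methylation differs between responders and non-responders. It is not the limma pipeline used in the study: limma fits moderated t-statistics, whereas this sketch swaps in a plain Welch t-statistic, and the locus names and methylation values are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances.

    Assumes each group has at least two samples and non-zero pooled variance.
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def rank_cpgs(beta, is_responder):
    """Rank CpG loci by |t| between responders and non-responders.

    beta: dict mapping CpG id -> list of methylation values (one per sample)
    is_responder: list of booleans aligned with the sample order
    """
    rows = []
    for cpg, values in beta.items():
        resp = [v for v, r in zip(values, is_responder) if r]
        non = [v for v, r in zip(values, is_responder) if not r]
        delta = mean(resp) - mean(non)  # difference in mean methylation
        rows.append((cpg, delta, welch_t(resp, non)))
    return sorted(rows, key=lambda row: abs(row[2]), reverse=True)

# Hypothetical toy data: 3 responders followed by 3 non-responders.
toy_beta = {
    "cpg_A": [0.80, 0.82, 0.78, 0.40, 0.42, 0.38],  # clearly different
    "cpg_B": [0.50, 0.55, 0.45, 0.52, 0.48, 0.50],  # similar in both groups
}
toy_labels = [True, True, True, False, False, False]
ranking = rank_cpgs(toy_beta, toy_labels)
```

In this toy example the locus with the large group difference (cpg_A) is ranked first. A real analysis would also convert each statistic into a p-value and correct for multiple testing.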

QTL analysis

DNA methylation quantitative trait loci (meQTLs) are regions of the genome containing DNA sequence variants that influence methylation levels. Single nucleotide polymorphisms (SNPs) are the most common variants in the genome. We used the Pancan-meQTL database (http://gong_lab.hzau.edu.cn/Pancan-meQTL/) for QTL analysis. This database contains methylation 450k array data from 7242 TCGA samples. Pancan-meQTL aims to systematically predict the effects of SNPs on methylation levels, providing cis-meQTLs (SNPs that affect the methylation levels of local CpG sites), trans-meQTLs (SNPs that affect the methylation levels of distant CpG sites), and GWAS-eQTLs (Genome-Wide Association Study - Expression Quantitative Trait Loci) across cancer types, using thousands of genotype and methylation profiles from The Cancer Genome Atlas (TCGA). Across 23 cancer types, we selected the most significant SNP, i.e., the one with the minimum p-value (P < 0.05); the results are listed in Table 3.

Pathway enrichment analysis

We performed GO (Gene Ontology) enrichment analysis on the differentially methylated sites to obtain the relevant pathways, and performed gene localization enrichment using the EWAS Data Hub database (https://ngdc.cncb.ac.cn/ewas/). Gene localization enrichment refers to a situation in which genes are significantly concentrated in a specific genomic region (e.g., a portion of a chromosome or a specific genomic segment). Furthermore, we obtained the chromatin states and histone modifications. Chromatin state is the structural and functional state of chromatin in different regions of the nucleus, reflecting the density of DNA packaging, modification patterns, and associated transcriptional activity. Histone modifications are covalent chemical modifications of histones that can affect chromatin structure and gene expression.

Motif analysis

Methylation motif analysis identifies and analyzes specific DNA sequence patterns (motifs) associated with DNA methylation in the genome. Motifs are short, specific nucleotide sequences that are often important elements in the regulation of gene expression and other genomic functions. Identifying these motifs helps clarify the role of DNA methylation in gene regulation. We performed motif analysis of the methylation sites associated with immunotherapy efficacy on the EWAS Data Hub website (https://ngdc.cncb.ac.cn/ewas/datahub). The resulting sequence patterns may serve as binding sites for transcription factors, thereby providing insight into the mechanisms of gene expression control.

TMB, PD-L1, NeoAg correlation analysis

We used a breast and ovarian cancer ICB cohort whose data include TMB, PD-L1 expression, and survival outcomes (OS and PFS). Correlation analyses between variables were conducted in RStudio (version 4.3.1). We calculated Pearson correlation coefficients between TMB or NeoAg and the response to immunotherapy. The correlation plots show the distribution of the data points and the regression line, which helps to visualize the relationship between TMB, NeoAg, and the response to immunotherapy. The association between PD-L1 expression and response to immunotherapy (effective or ineffective) was assessed with a chi-square test. The p-value of the chi-square test indicates the significance of this association: a p-value below 0.05 indicates a significant association between PD-L1 expression and response.
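The two tests described above can be sketched in a few lines of Python. This is an illustration rather than the R code used in the study: the input values are hypothetical, and the chi-square function assumes a 2x2 table with Yates' continuity correction (the default behaviour of R's chisq.test for 2x2 tables). For one degree of freedom, the p-value has the closed form erfc(sqrt(X²/2)).

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def chi_square_2x2(table, yates=True):
    """Chi-square test of independence for a 2x2 table (df = 1)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals, col_totals = (a + b, c + d), (a + c, b + d)
    stat = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(obs - expected)
            if yates:
                diff -= min(0.5, diff)  # continuity correction
            stat += diff ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2))  # exact for df = 1
    return stat, p_value

# Hypothetical counts: rows = PD-L1 positive / negative,
# columns = responder / non-responder.
toy_table = [[10, 5], [5, 10]]
stat, p_value = chi_square_2x2(toy_table)

# Hypothetical perfectly linear data, so r equals 1.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```

With these toy counts the p-value is above 0.05, so the null hypothesis of independence would not be rejected.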

Machine learning process

We used machine learning methods to classify methylation profiles and predict immunotherapy response, aiming to find the best prediction model and, on that basis, to identify methylation sites that are meaningful as biomarkers for predicting immunotherapy response. The machine learning methods were Lasso regression, XGBoost, KNN, support vector machine (SVM), Naive Bayes, random forest, and decision tree. For each model we calculated the precision, F1 score, recall, and ROC curve. Precision assesses the accuracy of a recommender or classification system: it measures how many of the items recommended or predicted as positive by the system are actually relevant. Recall assesses the completeness of a recommender or classification system: it measures the proportion of all relevant items that the system recommends or predicts. The relevant quantities are defined below:

TP (True Positives): the number of positive samples correctly classified as positive.

TN (True Negatives): the number of negative samples correctly classified as negative.

FP (False Positives): the number of negative samples incorrectly classified as positive (false alarms).

FN (False Negatives): the number of positive samples incorrectly classified as negative (missed cases).

The formula for precision is:

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]

In this equation, TP is the number of true positives, i.e., cases in which a positive sample is correctly classified as positive, and FP is the number of false positives, i.e., cases in which a negative sample is incorrectly classified as positive.

The formula for recall is:

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]

In this formula, TP is again the number of true positives, while FN is the number of false negatives, i.e., cases in which a positive sample is incorrectly classified as negative. A ratio close to 1 indicates that TP is much larger than FN: the model identifies the vast majority of samples that are actually positive, and few positive samples are misclassified as negative.

The F1 score evaluates the balance between precision and recall in a recommender or classification system. It is the harmonic mean of precision and recall, providing a single score that considers both:

\[ F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]

The principles of the machine learning models we constructed are described in the subsections that follow.
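Before turning to the individual models, the metrics defined above can be made concrete with a short Python sketch. The confusion-matrix counts and scores are hypothetical. The ROC AUC is computed in its rank form, as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, which is one standard way to compute the area under the ROC curve.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that the model recovers."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)

def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 1 (positive) or 0 (negative); ties count as half a pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts: 16 true positives, 7 false positives, 0 false negatives.
p = precision(16, 7)
r = recall(16, 0)
f1 = f1_score(p, r)

# Hypothetical scores: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

When recall equals 1, the F1 score reduces to 2P/(1+P). This reproduces the test-set values reported later for the Lasso model: a precision of 0.696 with a recall of 1 gives an F1 score of about 0.821.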

Lasso regression

Lasso regression is a linear modelling method for feature selection and parameter estimation. It adds an L1 regularization term (the sum of the absolute values of the coefficients) to the loss function, which shrinks some coefficients exactly to zero and thereby simplifies the model. Its main advantage is effective variable selection, especially when there are many predictor variables: it retains the variables with the greatest impact on the dependent variable. We used the R glmnet package to construct the Lasso regression model.
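For intuition on why the L1 penalty sets coefficients exactly to zero, consider the simplifying assumption of an orthonormal design matrix. Under that assumption, minimizing (1/2)||y - Xb||² + λ||b||₁ gives each Lasso coefficient as the soft-thresholded least-squares coefficient. The Python sketch below illustrates only this shrinkage step; it is not the glmnet fit used in the study, and the coefficient values and penalty λ are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical least-squares coefficients for four CpG loci.
ols_coefs = {"cpg_1": 0.90, "cpg_2": -0.15, "cpg_3": 0.05, "cpg_4": -0.60}
lam = 0.2  # hypothetical penalty strength

lasso_coefs = {name: soft_threshold(b, lam) for name, b in ols_coefs.items()}

# Loci whose coefficient survives the penalty are the selected features.
selected = [name for name, b in lasso_coefs.items() if b != 0.0]
```

Here the two loci with small coefficients are dropped and the two large coefficients are shrunk by λ, which is the variable-selection behaviour described above.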

XGBoost

XGBoost is a scalable gradient boosting algorithm that supports parallel computation and adds a regularization term to the loss function to prevent overfitting. It is based on gradient-boosted trees, an ensemble learning method that trains a series of weak learners (usually decision trees) iteratively, with each iteration attempting to correct the errors of the previous one, and finally combines these weak learners into one strong learner. We built an XGBoost model to predict immunotherapy efficacy from methylation sites. Hyperparameters were tuned by grid search, a systematic optimization method that finds the best hyperparameter settings by traversing predefined combinations of hyperparameters.
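Grid search itself is independent of the model being tuned, as the following Python sketch shows. The grid values and the scoring function are hypothetical stand-ins: in practice the score would be a cross-validated metric from a fitted model, and the study performed its search with the caret package in R rather than with this code.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid; the names mirror common XGBoost hyperparameters.
grid = {"max_depth": [3, 6], "eta": [0.01, 0.1], "subsample": [0.8, 1.0]}

def toy_score(params):
    """Stand-in for a cross-validated score; peaks at depth 6, eta 0.01, subsample 0.8."""
    return (-abs(params["max_depth"] - 6)
            - abs(params["eta"] - 0.01)
            - abs(params["subsample"] - 0.8))

best_params, best_score = grid_search(grid, toy_score)
```

The cost grows multiplicatively with the number of values per hyperparameter (here 2 x 2 x 2 = 8 evaluations), which is why grids are usually kept coarse.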

KNN

The K-Nearest Neighbors (KNN) algorithm is a supervised learning algorithm widely used in classification and regression tasks. It classifies or predicts by measuring the distance between feature vectors: the K nearest neighbors of a new data point are found by calculating its distance to every point in the training dataset, and in classification the new point is assigned the majority class among those neighbors.
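A minimal Python sketch of KNN classification follows, assuming Euclidean distance and a simple majority vote. The two-feature methylation vectors and labels are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature methylation profiles.
train_x = [(0.10, 0.20), (0.20, 0.10), (0.80, 0.90), (0.90, 0.80), (0.85, 0.85)]
train_y = ["non-responder", "non-responder", "responder", "responder", "responder"]

pred_low = knn_predict(train_x, train_y, (0.15, 0.15), k=3)
pred_high = knn_predict(train_x, train_y, (0.80, 0.80), k=3)
```

The choice of K trades off noise sensitivity (small K) against blurring of class boundaries (large K), which is why K is typically tuned.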

Decision tree

Decision Tree (DT) is a supervised learning algorithm based on a tree structure for classification and regression tasks. A decision tree divides data into different categories or predicts continuous values through a series of decision rules. The basic idea is to divide the data according to the different values of the features, and eventually form a tree structure with leaf nodes representing decision results or predicted values.
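The core step of growing a tree, choosing where to split a feature, can be sketched as follows in Python. The sketch assumes Gini impurity as the split criterion (a common choice for classification trees) and considers a single feature; the methylation values and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means the node is pure)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Find the threshold on one feature that minimises weighted Gini impurity."""
    best_thr, best_imp = None, float("inf")
    points = sorted(set(values))
    for lo, hi in zip(points, points[1:]):
        thr = (lo + hi) / 2  # candidate threshold midway between neighbours
        left = [l for v, l in zip(values, labels) if v <= thr]
        right = [l for v, l in zip(values, labels) if v > thr]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if imp < best_imp:
            best_thr, best_imp = thr, imp
    return best_thr, best_imp

# Hypothetical methylation values for one CpG locus; 1 = responder, 0 = non-responder.
values = [0.2, 0.3, 0.4, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(values, labels)
```

A full tree applies this search recursively to each resulting subset and across all features, stopping according to a complexity rule.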

Naive Bayes

Naive Bayes is a classification method based on Bayes' theorem with the assumption of conditional independence of features. Its core principle is that, for a sample to be classified, the posterior probability of each category given the sample's features is calculated, and the sample is assigned to the category with the highest probability.
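The posterior calculation can be sketched in Python as follows, assuming that each continuous feature follows a Gaussian distribution within each class (a common choice for numeric predictors). The data are hypothetical, and the variance floor is an illustrative safeguard rather than part of the method.

```python
import math
from statistics import mean, pstdev

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/std for each class."""
    model = {}
    for cls in set(y):
        rows = [x for x, label in zip(X, y) if label == cls]
        cols = list(zip(*rows))
        model[cls] = {
            "prior": len(rows) / len(X),
            "mean": [mean(c) for c in cols],
            "std": [max(pstdev(c), 1e-6) for c in cols],  # floor avoids zero variance
        }
    return model

def log_gaussian(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def predict_nb(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {}
    for cls, p in model.items():
        log_lik = sum(log_gaussian(v, m, s) for v, m, s in zip(x, p["mean"], p["std"]))
        scores[cls] = math.log(p["prior"]) + log_lik  # independence: log-likelihoods add
    return max(scores, key=scores.get)

# Hypothetical two-feature methylation profiles; R = responder, NR = non-responder.
X = [(0.80, 0.20), (0.70, 0.30), (0.90, 0.25), (0.30, 0.70), (0.20, 0.80), (0.25, 0.75)]
y = ["R", "R", "R", "NR", "NR", "NR"]
nb_model = fit_gaussian_nb(X, y)
pred_a = predict_nb(nb_model, (0.75, 0.30))
pred_b = predict_nb(nb_model, (0.20, 0.70))
```

Working in log space keeps the product of many small per-feature probabilities numerically stable, which matters when the number of CpG features is large.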

SVM

Support Vector Machine (SVM) is a supervised learning algorithm widely used in machine learning and is especially good at classification problems. Its core idea is to separate the classes with an optimal hyperplane that maximizes the margin, i.e., the distance between the hyperplane and the nearest data points of each class.
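For a linear SVM with weight vector w and bias b, the decision value is w·x + b, the margin width is 2/||w||, and training penalizes points that fall inside the margin through the hinge loss. The Python sketch below evaluates these quantities for a fixed, hypothetical hyperplane; it does not train a model and does not cover the kernel SVM used in the study.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the two margin boundaries w.x + b = +1 and -1."""
    return 2 / math.sqrt(sum(wi ** 2 for wi in w))

def hinge_loss(w, b, X, y):
    """Mean hinge loss; labels in y must be +1 or -1."""
    losses = [max(0.0, 1 - yi * decision_value(w, b, xi)) for xi, yi in zip(X, y)]
    return sum(losses) / len(X)

# Hypothetical hyperplane and two points lying exactly on the margin boundaries.
w, b = (2.0, 0.0), -1.0
X = [(0.0, 0.3), (1.0, 0.7)]
y = [-1, 1]

width = margin_width(w)
loss = hinge_loss(w, b, X, y)
```

Maximizing the margin corresponds to minimizing ||w|| subject to low hinge loss; the cost parameter C controls the balance between the two.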

After merging the two ICB therapy datasets, we obtained a total of 3317 CpG loci. Of these, 91 were differentially methylated with respect to immunotherapy efficacy (p < 0.05). A heat map of the differentially methylated sites is shown in Figure 3A.

By differential analysis, we obtained the top ten differential CpG loci: cg00045061, cg00107488, cg00056433, cg00090974, cg00072957, cg00002810, cg00005423, cg00011460, cg00005734 and cg00040566. Their P-values, logFC, chromosome locations and regions are listed in Table 1. Among the top-ranked CpG loci, methylation of cg00045061, cg00056433 and cg00090974 was higher in the immunotherapy response group than in the non-response group (Figure 3 B-E). For cg00045061, the P value was 9.31E-05; the locus is at chr8:59496063, in the N Shore region.

The PPI network revealed a cluster including UBE2E1, UBA2, UBE3A, SHANK2 and NELFCD. The PPI enrichment p-value was 3.22e-06 and the average local clustering coefficient was 0.833 (Figure 2B). The enriched KEGG pathway was ubiquitin mediated proteolysis, with a strength of 1.95 and a false discovery rate of 0.0011. GO enrichment showed that the top three terms were regulation of microvillus length, CXCR4 chemokine receptor binding and brush border (Figure 2C). Genomic location enrichment showed significant enrichment in the S_Shore and N_Shore regions, i.e., a higher log2(odds ratio) for these regions (Figure 2D). The TSS200, Intergenic, Body, and OpenSea regions showed significant depletion, i.e., a low log2(odds ratio). The 5'UTR, 1stExon and 3'UTR regions were also somewhat enriched, but less so than S_Shore and N_Shore.

The chromatin state heatmap (Figure 2A) shows different cell types across the various chromatin states. The colour indicates the level of enrichment: the redder the colour, the higher the enrichment of the cell type in that chromatin state; the bluer the colour, the lower the enrichment. These colour differences reveal the distinct attributes of the cell types in different chromatin states. Many cell types appear blue or light blue in the Quiescent/Low state, indicating that this quiescent or weakly active chromatin state is less frequent in these cells. Most cells in the Weak Transcription state appear red or light red, indicating that the weakly transcribed chromatin state is more enriched in these cells. In promoter-associated chromatin states (e.g., Active TSS, Bivalent/Poised TSS), many cell types appear red, indicating that these promoter regions are generally active across cell types. Most cells appear blue or light blue in the Repressed/PolyComb state, indicating that these regions are repressed in most cell types.

Table 2 lists the average methylation level of the sites in each tissue. Tao(hyper) represents tissue-specific hypermethylation and Tao(hypo) represents tissue-specific hypomethylation. The most significant site was cg00066925, with the lowest methylation in testis (0.014) and the highest methylation in bone (0.926). The distribution across tissues provides comprehensive information on the methylation sites.

The motif analysis result was “A T G C T AGC A GCT AGCT TGAC GACT TCAG TACG GTCA CTGA ATCG TAGC GACT CAGT AGT C AGCT TCGA ATCG TGCA TGCA” (Figure 4A). These motif sequences may correspond to specific transcription factor binding sites. Transcription factors are important proteins that regulate gene expression: they modulate the transcriptional activity of genes by binding to specific DNA sequences. The sequence logo shows base preferences at each position across multiple DNA sequences; the size of a letter indicates how frequently that base occurs at that position, so a larger letter indicates a more frequent base. The P value was 1.00E-02, the value for Target Sequences with Motif was 0.05, and the value for Background Sequences with Motif was 0.013.

On the MEXPRESS website, we explored the relationship between the methylation-related gene NSMAF and clinical features. Pearson correlation coefficients were calculated between clinical characteristics and NSMAF expression: number_pack_years_smoked and overall survival time (OS) were negatively correlated with NSMAF expression, whereas copy number variation (CNV) showed a significant positive correlation with NSMAF expression (p = 2.96e-24, r = 0.472) (Figure 6). In addition, we explored the association between NSMAF expression and survival in different tumors (Figure 4B). The figure shows that LUAD patients with high NSMAF expression have a greater likelihood of survival than those with low expression. This observation is consistent with the finding that NSMAF is linked to the locus cg00045061, whose methylation is higher in patients who respond to immunotherapy. We also analyzed the QTLs (cis-eQTLs, trans-eQTLs and GWAS-eQTLs) related to specific genes (Table 3). The machine learning analyses are described below, and their results are presented in Table 4.

Lasso regression

We used the ‘glmnet’ package to construct the model. The data were split into training and test sets at a ratio of 7:3. The precision was 0.618 on the training set and 0.696 on the test set. The F1 score was 0.764 on the training set and 0.821 on the test set. Recall was 1 on both sets. The AUC was 0.5625. We obtained a methylation importance ranking for this model; the top 10 CpG loci were cg00027808, cg00088026, cg00097626, cg00026230, cg00000769, cg00025138, cg00017441, cg00083765, cg00090800 and cg00066468 (Figure 1A).

XGBoost

We used the ‘xgboost’ package for model building. The data were split into training and test sets at a ratio of 7:3, and the optimal XGBoost parameters were obtained by grid search. Before hyperparameter tuning, the precision was 0.982 on the training set and 0.601 on the test set, the F1 score was 0.986 on the training set and 0.727 on the test set, and recall was 1 on the training set and 0.8 on the test set; the AUC was 0.683. The parameters of the optimal model after the caret grid search were nrounds = 16, max_depth = 100, gamma = 0, colsample_bytree = 0.6, min_child_weight = 5, subsample = 0.8 and eta = 0.01. After tuning, the precision was 0.821 on the training set and 0.652 on the test set, the F1 score was 0.878 on the training set and 0.789 on the test set, and recall was 1 on both sets. The AUC was 0.708 (Figure 1D).

KNN

We used the ‘class’ and ‘caret’ packages to build the KNN (K-Nearest Neighbors) model. The data were split into training and test sets at a ratio of 7:3. After tuning, the best K value was 23. The precision was 0.636 on the training set and 0.652 on the test set. The F1 score was 0.778 on the training set and 0.789 on the test set. Recall was 1 on both sets. The AUC was 0.566.

Decision tree

We used the rpart function to train the decision tree model. After tuning, the optimal cp value was 0.091. The precision was 0.856 on the training set and 0.304 on the test set. The F1 score was 0.879 on the training set and 0.333 on the test set. Recall was 0.829 on the training set and 0.267 on the test set. The AUC was 0.742 (Figure 1B).

Naive Bayes

We used the e1071 package to construct the Naive Bayes model, with the optimal parameters chosen by grid search. The precision was 1 on the training set and 0.522 on the test set. The F1 score was 1 on the training set and 0.667 on the test set. Recall was 1 on the training set and 0.733 on the test set. The AUC for Naive Bayes was 0.675 (Figure 1C).

Random Forest

We used the randomForest package to build a random forest model. Random forest is an ensemble learning algorithm of the bagging type, which produces the final prediction by combining the predictions of multiple decision trees. Accuracy was used to select the optimal model (the largest value), and the final value used for the model was mtry = 1658. The precision was 1 on the training set and 0.522 on the test set. The F1 score was 1 on the training set and 0.8 on the test set. The AUC was 0.654 (Figure 1F).
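The bagging idea, training each tree on a bootstrap resample and combining predictions by majority vote, can be sketched in Python as follows. This is a deliberately reduced illustration: the base learners are one-feature threshold classifiers rather than full decision trees, the random feature subsampling controlled by mtry is omitted, only two classes are supported, and the data are hypothetical.

```python
import random
from collections import Counter
from statistics import mean

def fit_stump(rows):
    """Fit a one-feature threshold classifier on (value, label) rows (two classes)."""
    by_class = {}
    for value, label in rows:
        by_class.setdefault(label, []).append(value)
    if len(by_class) < 2:  # the bootstrap sample contained a single class
        only = next(iter(by_class))
        return lambda x: only
    lo, hi = sorted(by_class, key=lambda c: mean(by_class[c]))
    threshold = (mean(by_class[lo]) + mean(by_class[hi])) / 2
    return lambda x: hi if x > threshold else lo

def fit_bagged(rows, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap resample of rows."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]

def predict_bagged(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical methylation values for one locus; NR = non-responder, R = responder.
rows = [(0.20, "NR"), (0.25, "NR"), (0.30, "NR"), (0.35, "NR"), (0.40, "NR"),
        (0.60, "R"), (0.65, "R"), (0.70, "R"), (0.75, "R"), (0.80, "R")]
forest = fit_bagged(rows)
pred_r = predict_bagged(forest, 0.72)
pred_nr = predict_bagged(forest, 0.28)
```

Averaging over resamples reduces the variance of the individual learners, which is the main reason ensembles of trees tend to generalize better than a single tree.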

SVM

The svm function of the e1071 package was used to build the SVM model. The kernel parameter determines the type of kernel function, while the cost parameter governs the strength of regularisation. After tuning by grid search, the best values were sigma = 4 and C = 0.03125. The precision was 0.891 on the training set and 0.652 on the test set. The F1 score was 0.921 on the training set and 0.789 on the test set. Recall was 1 on both sets. The AUC was 0.567 (Figure 1E).

In the correlation analysis between tumor TMB and immunotherapy response, Pearson's product-moment correlation gave t = 0.66263, df = 34, p-value = 0.512, with a 95 percent confidence interval of -0.2239296 to 0.4256585 and an estimated correlation of 0.112913 (Figure 5A). The Pearson correlation coefficient between NeoAg and immunotherapy response was 0.2216325, with a p value of 0.207768 (Figure 5B). For the association between PD-L1 expression and response to immunotherapy (effective or ineffective), the chi-squared test gave X-squared = 1.1663, df = 1, p-value = 0.2802. The chi-square statistic measures the deviation between the observed and expected values, and a p-value above 0.05 means that the null hypothesis cannot be rejected at the 0.05 significance level. We therefore did not find sufficient evidence of a statistically significant association between PD-L1 expression and response to immunotherapy. The distribution of responding and non-responding patients by PD-L1 expression status is shown as a stacked bar chart (Figure 4C). While this does not rule out a potential effect of PD-L1 expression on immunotherapy response, we cannot conclude that there is a statistically significant association between the two.
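The reported statistics are internally consistent, which can be checked with two standard identities: the Pearson test statistic is t = r·sqrt(df / (1 - r²)), and for a chi-square statistic with one degree of freedom the p-value is erfc(sqrt(X²/2)). The Python sketch below only re-derives values from the numbers quoted above; it does not use the underlying patient data.

```python
import math

# Reported Pearson estimate and degrees of freedom for TMB vs. response.
r, df = 0.112913, 34
t_stat = r * math.sqrt(df / (1 - r ** 2))

# Reported chi-square statistic for PD-L1 vs. response (df = 1).
chi_sq = 1.1663
p_chi = math.erfc(math.sqrt(chi_sq / 2))

print(f"t = {t_stat:.5f}, chi-square p = {p_chi:.4f}")
```

Both derived values agree with the reported t statistic and chi-square p-value to about three decimal places.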

In this article, we analyzed differential methylation and built seven machine learning models to predict response versus non-response to tumor immunotherapy. In the differential analysis, one of the most significantly different methylation sites was cg00045061; its corresponding gene is NSMAF, and it is located at chr8:59496063 in the N Shore region. Research on the cg00045061 locus is still scarce. The associated gene, NSMAF, encodes a protein associated with the activation of neutral sphingomyelinase, which is mainly involved in intracellular signaling and lipid metabolism. Patients with elevated NSMAF expression have a more favourable prognosis, which is consistent with our findings regarding the methylation sites.

Our analysis of the clinical characteristics and prognosis associated with NSMAF highlighted the role of CNV. CNV can directly affect gene expression levels by increasing or decreasing gene copy number, and such changes can enhance or inhibit the function of genes associated with the immune response[8]. The correlation analysis suggests that CNV may promote better immunotherapy efficacy. This may reflect its influence on the expression of antigen-presentation genes (e.g., MHC genes) or on immune checkpoint genes (e.g., PD-1, CTLA-4), which can modulate the expression of these genes and thereby the efficacy of immune checkpoint inhibitors[9-10]. The methylation site-associated gene NSMAF has been shown to affect immunotherapy by regulating tumor cell apoptosis, the immune microenvironment, and the inflammatory response[11]; NSMAF and CNV may therefore jointly influence the immune infiltration profile.

We found that most of the differentially methylated sites were located in the N Shore region of CpG islands. The methylation status of a CpG island and its flanking regions is closely related to the transcriptional activity of genes. Methylation of the N Shore region may affect the accessibility of the promoter region and thus regulate transcription initiation[12].

The molecular mechanisms relevant to immunotherapy can be explored through the enrichment of methylation-related pathways. Microvilli are minute finger-like protrusions on the cell surface and are found on immune cells such as lymphocytes. These structures have a significant impact on cell signalling and immune responses[13]. Modulating the length and structure of microvilli may enhance the efficacy of immunotherapy[14].

CXCR4, also known as C-X-C chemokine receptor 4, is a G-protein-coupled receptor present in several cell types. When it interacts with its ligand CXCL12, also referred to as SDF-1, it initiates a cascade of signalling pathways that control cell migration, proliferation, and survival[15]. Modulating CXCR4 can boost the efficacy of PD-1 immunotherapy by improving the capacity of T cells to identify and eliminate tumour cells. Blocking CXCR4 not only directly suppresses the proliferation of cancer cells but also indirectly boosts the efficacy of PD-1 immunotherapy by enhancing the function of immune cells within the tumour microenvironment[16].

We conducted a KEGG enrichment analysis of the ubiquitination-proteasome system pathway and determined that it plays a significant role in the methylation pathway.In antigen presentation and immune recognition, the ubiquitination-proteasome system (UPS) plays a key role in antigen processing and presentation. After proteins are tagged by ubiquitination, they are degraded by the proteasome into small peptides, which are bound by MHC class I molecules and presented to the cell surface for recognition by T cells. Thus, UPS plays an important role in tumor antigen presentation, affecting the recognition and killing efficiency of T cells and thus the efficacy of immunotherapy.

Because the ubiquitination-proteasome system (UPS) underlies antigen processing and presentation, abnormal UPS function disrupts antigen presentation and may reduce the efficacy of immune checkpoint inhibitors, which rely on T cells recognizing and attacking tumor cells. The UPS is particularly relevant to PD-1/PD-L1 immune checkpoint inhibitors. It regulates the stability of PD-L1: specific E3 ubiquitin ligases mediate PD-L1 ubiquitination, which marks the protein for proteasomal degradation[17]. By regulating PD-L1 degradation, the UPS can indirectly influence the effects of PD-1/PD-L1 inhibitors. Some tumor cells have been found to increase the stability of PD-L1 by decreasing its ubiquitination, thereby enhancing immune escape. One study revealed a previously unrecognized immunoregulatory function of RNF8, which downregulates galectin-3 (gal-3) expression through K48-linked polyubiquitination and promotes gal-3 degradation via the ubiquitin-proteasome system[18]. That study suggests that facilitating immune cell infiltration in combination with anti-PD-L1 treatment can substantially improve melanoma treatment.

In addition, the UPS affects immune cell function and the tumor microenvironment by modulating signaling pathways such as NF-κB and JAK/STAT. Activation of the NF-κB signaling pathway, which is essential for immune and inflammatory responses, depends on the ubiquitination and degradation of IκB proteins[19]. A previous study developed a ubiquitin-proteasome system gene signature to estimate prognosis in hepatocellular carcinoma (HCC) and to assist in individualized treatment[20].

Our correlation analysis revealed only a weak correlation between TMB or PD-L1 and immunotherapy efficacy. This suggests that using TMB and PD-L1 alone as predictive biomarkers may result in inaccurate predictions; improving accuracy typically requires more informative biomarkers or a combination with other clinical and experimental results. NeoAg and TMB were both positively correlated with immunotherapy response, but the linear correlation was very weak. No statistically significant association was found between PD-L1 expression and response to immunotherapy.
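
As an illustration of the kind of correlation analysis described above, the following minimal sketch computes a point-biserial correlation, that is, the Pearson correlation between a binary response label and a continuous biomarker such as TMB. The source does not state which correlation coefficient was used, so this choice is an assumption, and the patient values are invented purely for demonstration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical data (not from the study): 1 = responder, 0 = non-responder,
# paired with an invented tumour mutational burden for each patient.
response = [1, 0, 1, 1, 0, 0, 1, 0]
tmb = [12.0, 9.5, 7.0, 15.0, 11.0, 6.5, 10.0, 8.0]

r = pearson(response, tmb)
print(f"point-biserial r = {r:.3f}")
```

With a binary variable on one side, Pearson's r is the point-biserial coefficient. A value near 0 corresponds to the "very weak linear correlation" wording used in the text. A rank-based coefficient would be a reasonable alternative if the biomarker distribution is strongly skewed.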

Examining the distribution of methylation sites across tissues may help identify particular tissues from which to extract methylation targets, thereby improving the rate of successful extraction. By screening for differentially methylated sites and comparing their methylation levels between the immunotherapy-effective and immunotherapy-ineffective groups, these sites can serve as biomarkers: detecting their methylation status allows a patient's response to immunotherapy to be predicted, supporting personalized treatment.

Taking all measurements together, Random Forest demonstrates the best overall performance. Although it does not have the highest ROC value (0.654), it has the highest precision (0.859, tied with Naive Bayes) and the highest F1 score (0.907), together with a recall of 0.941, and therefore emerges as the best choice overall. For specific needs, Random Forest remains the best choice when high precision and F1 score are required. When high recall is the priority, Lasso, Xgboost, KNN and SVM all reach a recall of 1, with Random Forest close behind at 0.941. If the main focus is the ROC value, Decision Tree has the highest value (0.742), but its other metrics are relatively low.

Because methylation datasets from tumor patients who have undergone immunotherapy are limited, we studied only the methylation profiles of lung cancer and skin melanoma patients when building the models. The models already show reasonable generalization performance, but the machine learning results indicate that their parameters need further adjustment to adapt to different tumor types. More datasets should be included in future studies to improve the predictive ability of the model.
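
The trade-off between recall and precision discussed above can be made concrete with a small sketch. The code computes precision, recall and F1 for the positive (responder) class from true and predicted labels. The cohort and the two sets of predictions are hypothetical and are not taken from the study; the function name is likewise an illustrative choice.

```python
def classification_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical cohort: 6 responders (1) and 4 non-responders (0).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

# A classifier that labels every patient as a responder.
always_positive = [1] * len(y_true)

# A more selective classifier: one missed responder, one false alarm.
selective = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

for name, y_pred in [("always positive", always_positive),
                     ("selective", selective)]:
    p, r, f1 = classification_metrics(y_true, y_pred)
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

In this toy case the always-positive classifier reaches a recall of 1 while its precision falls to the responder fraction, whereas the selective classifier gives up some recall for higher precision and F1. This is why a recall of 1 on its own is not treated as decisive when comparing models.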

When larger methylation datasets from immunotherapy cohorts become available, researchers can choose the most suitable methylation sites for model fitting by examining differentially methylated sites and comparing feature-importance rankings across machine learning models. This is highly relevant for clinical applications: targeting specific methylation sites to predict the effectiveness of immunotherapy can reduce costs while ensuring adequate sequencing depth. Whole-genome bisulfite sequencing provides methylation data at single-base resolution, but at greater expense.
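
One simple way to compare importance rankings across models, sketched here under stated assumptions, is to rank the sites within each model and average the ranks. The CpG identifiers are taken from Table 1, but the importance scores, the model keys and the `mean_rank` helper are hypothetical; the source does not describe a specific aggregation method.

```python
def mean_rank(importances_by_model):
    """Average each feature's rank (1 = most important) across models.

    Assumes every model reports a score for the same set of features.
    """
    totals = {}
    for scores in importances_by_model.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, feature in enumerate(ordered, start=1):
            totals[feature] = totals.get(feature, 0) + rank
    n_models = len(importances_by_model)
    averaged = [(feature, total / n_models)
                for feature, total in totals.items()]
    return sorted(averaged, key=lambda item: item[1])


# Hypothetical importance scores (for Lasso, absolute coefficients).
importances = {
    "random_forest": {"cg00045061": 0.30, "cg00107488": 0.25,
                      "cg00056433": 0.10, "cg00090974": 0.35},
    "xgboost": {"cg00045061": 0.40, "cg00107488": 0.20,
                "cg00056433": 0.15, "cg00090974": 0.25},
    "lasso": {"cg00045061": 0.50, "cg00107488": 0.05,
              "cg00056433": 0.15, "cg00090974": 0.30},
}

for site, avg in mean_rank(importances):
    print(f"{site}: mean rank {avg:.2f}")
```

Sites that rank consistently high across models are natural candidates for a reduced, lower-cost targeted panel. Averaging ranks rather than raw scores avoids comparing importance values that are on different scales in different models.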

A limitation of our study is that it could not include a larger number of tumour patients. Because methylation sequencing data from patients receiving immunotherapy are limited, our findings may only be valid for the cancer types in our dataset. In conclusion, our research can inform future work on applying methylation site detection and machine learning models to clinical decision-making. Such work is needed because methylation is still rarely used in clinical practice, and further research is required to improve the generalisation ability of machine learning models to accommodate more patient types.

Availability of data and materials:

The raw data that support the findings of this study are available on request from the corresponding author. The datasets used for machine learning in the current study are available in GSE175699 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE175699) and GSE126043 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126043).

Ethical approval and consent to participate

Not applicable

Consent for publication

Not applicable

Competing interests: All authors reported no competing interests.

Funding

Not applicable

Authors' contributions:

Jing Ai and Fei Wu were responsible for processing data and variance analysis. Erle Deng performed methylation analysis. Zheng Gu built the machine learning models and wrote the main manuscript. Qiang Su and Junxian Yu were responsible for reviewing articles.

Acknowledgements

Not applicable

Zhang H, Lee S, Muthakana RR, et al. Intragenic Rearrangement Burden Associates with Immune Cell Infiltration and Response to Immune Checkpoint Blockade in Cancer. Cancer Immunol Res. 2024 Mar 4;12(3):287-295.
Li S, Peng Y, Panchenko AR. DNA methylation: Precise modulation of chromatin structure and dynamics. Curr Opin Struct Biol. 2022 Aug;75:102430.
Wang Y, Tong Z, Zhang W, et al. FDA-Approved and Emerging Next Generation Predictive Biomarkers for Immune Checkpoint Inhibitors in Cancer Patients. Front Oncol. 2021 Jun 7;11:683419.
Cho JW, Hong MH, Ha SJ, et al. Genome-wide identification of differentially methylated promoters and enhancers associated with response to anti-PD-1 therapy in non-small cell lung cancer. Exp Mol Med. 2020 Sep;52(9):1550-1563.
Cha JH, Chan LC, Li CW, et al. Mechanisms Controlling PD-L1 Expression in Cancer. Mol Cell. 2019 Nov 7;76(3):359-370.
Ye F, Liang Y, Hu J, et al. DNA Methylation Modification Map to Predict Tumor Molecular Subtypes and Efficacy of Immunotherapy in Bladder Cancer. Front Cell Dev Biol. 2021 Dec 3;9:760369.
Bian Y, Gao Y, Lin H, et al. Non-invasive diagnosis of esophageal cancer by a simplified circulating cell-free DNA methylation assay targeting OTOP2 and KCNA3: a double-blinded, multicenter, prospective study. J Hematol Oncol. 2024 Jun 18;17(1):47.
Zheng S, He A, Chen C, et al. Predicting immunotherapy response in melanoma using a novel tumor immunological phenotype-related gene index. Front Immunol. 2024 Mar 20;15:1343425.
Yang X, Hu Y, et al. Cell-free DNA copy number variations predict efficacy of immune checkpoint inhibitor-based therapy in hepatobiliary cancers. J Immunother Cancer. 2021 May;9(5):e001942.
Yang X, Hu Y, Yang K, et al. Cell-free DNA copy number variations predict efficacy of immune checkpoint inhibitor-based therapy in hepatobiliary cancers. J Immunother Cancer. 2021 May;9(5):e001942.
Li Z, Song W, Rubinstein M, et al. Recent updates in cancer immunotherapy: a comprehensive review and perspective of the 2018 China Cancer Immunotherapy Workshop in Beijing. J Hematol Oncol. 2018 Dec 21;11(1):142.
Joo JE, Mahmood K, Walker R, et al. Identifying primary and secondary MLH1 epimutation carriers displaying low-level constitutional MLH1 methylation using droplet digital PCR and genome-wide DNA methylation profiling of colorectal cancers. Clin Epigenetics. 2023 Jun 3;15(1):95.
Jung Y, Wen L, Altman A, et al. CD45 pre-exclusion from the tips of T cell microvilli prior to antigen recognition. Nat Commun. 2021 Jun 23;12(1):3872.
Park JS, Kim JH, Soh WC, et al. Trogocytic molting of T cell microvilli upregulates T cell receptor surface expression and promotes clonal expansion. Nat Commun. 2023 May 24;14(1):2980.
Zhou W, Guo S, Liu M, et al. Targeting CXCL12/CXCR4 Axis in Tumor Immunotherapy. Curr Med Chem. 2019;26(17):3026-3041.
Wu A, Maxwell R, Xia Y, et al. Combination anti-CXCR4 and anti-PD-1 immunotherapy provides survival benefit in glioblastoma through immune cell modulation of tumor microenvironment. J Neurooncol. 2019 Jun;143(2):241-249.
Liu Y, Yang J, Wang T, et al. Expanding PROTACtable genome universe of E3 ligases. Nat Commun. 2023 Oct 16;14(1):6509.
Guo Y, Shen R, Yang K, et al. RNF8 enhances the sensitivity of PD-L1 inhibitor against melanoma through ubiquitination of galectin-3 in stroma. Cell Death Discov. 2023 Jun 30;9(1):205.
Mooney EC, Sahingur SE. The Ubiquitin System and A20: Implications in Health and Disease. J Dent Res. 2021 Jan;100(1):10-20.
Liu ZY, Li YH, Zhang QK, et al. Development and validation of a ubiquitin-proteasome system gene signature for prognostic prediction and immune microenvironment evaluation in hepatocellular carcinoma. J Cancer Res Clin Oncol. 2023 Nov;149(14):13363-13382.
Layfield LJ, Zhang T, Esebua M. PD-L1 immunohistochemical testing: A review with reference to cytology specimens. Diagn Cytopathol. 2023 Jan;51(1):51-58.

Table 1. The top 10 differentially methylated sites between immunotherapy response groups

rank	id	logFC	P.Value	Gene	location	Region
1	cg00045061	-0.0243	9.31E-05	NSMAF	chr8:59496063	North shore
2	cg00107488	0.0859	0.000643	COMT;TXNRD2	chr22:19930437	South shore
3	cg00056433	-0.136	0.00287	-	chr1:161391807	Open sea
4	cg00090974	-0.0932	0.00338	-	chr19:36404802	Open sea
5	cg00072957	-0.0161	0.00464	KIAA0146	chr8:48573717	Open sea
6	cg00002810	-0.0610	0.00618	DAB1	chr1:57888707	Island
7	cg00005423	0.117	0.00806	TMEM176B;TMEM176A	chr7:150499080	South shore
8	cg00011460	0.0592	0.00813	RBM47	chr4:40439822	Island
9	cg00005734	0.0934	0.00985	RALA	chr7:39662048	North shore
10	cg00040566	0.0683	0.00991	EBF3	chr10:131637099	Island

Table 2. Hypermethylation and hypomethylation of probes in different tissues

Probe	tau (hyper)	tau (hypo)	tau (|hyper-hypo|)	Tissue with lowest methylation	Tissue with highest methylation
cg00066925	0.014	0.926	0.912	testis	bone
cg00011460	0.017	0.925	0.908	testis	pituitary
cg00072957	0.037	0.89	0.853	thyroid	bone
cg00087906	0.874	0.024	0.849	cartilage	small_intestine
cg00001687	0.024	0.852	0.828	brain-cerebrum	fallopian_tube
cg00087420	0.023	0.756	0.732	testis	adrenal_gland
cg00023288	0.752	0.029	0.722	pituitary	endometrium
cg00002810	0.763	0.062	0.7	brain-cerebellum	colon
cg00078318	0.761	0.063	0.698	brain-cerebellum	small_intestine
cg00107890	0.064	0.724	0.661	brain-cerebellum	fallopian_tube
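
The source does not describe how the tau columns in Table 2 were computed. As a hedged illustration, the sketch below uses one common definition of the tissue-specificity index, tau = sum(1 - x_i / x_max) / (n - 1), where x_i is the value in tissue i and n is the number of tissues. Applying it to beta values for hypermethylation specificity and to 1 - beta for hypomethylation specificity is an assumption made here, and the per-tissue beta values are invented.

```python
def tau_index(values):
    """Tissue-specificity index: 0 = uniform across tissues, 1 = one tissue only."""
    if len(values) < 2:
        raise ValueError("tau needs at least two tissues")
    peak = max(values)
    if peak <= 0:
        raise ValueError("tau is undefined when no value is positive")
    return sum(1 - v / peak for v in values) / (len(values) - 1)


# Hypothetical methylation beta values of one probe in four tissues.
betas = {"testis": 0.05, "bone": 0.90, "thyroid": 0.88, "colon": 0.91}

tau_hyper = tau_index(list(betas.values()))
tau_hypo = tau_index([1 - b for b in betas.values()])

print(f"tau (hyper) = {tau_hyper:.3f}")
print(f"tau (hypo)  = {tau_hypo:.3f}")
print(f"|hyper - hypo| = {abs(tau_hyper - tau_hypo):.3f}")
```

A probe that is weakly methylated in a single tissue and highly methylated elsewhere gives a high hypomethylation tau and a lower hypermethylation tau under this definition, which is the pattern shown by the first rows of Table 2.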

Table 3. Cis-eQTLs, trans-eQTLs and GWAS-eQTLs related to CpG genes

Gene	Cis-eQTLs	Trans-eQTLs	GWAS-eQTLs
NSMAF	rs10093116	rs10938982	rs9297994
COMT;TXNRD2	rs5993933;rs1544325	rs17002510;rs7286782	rs4680
KIAA0146	rs4873779	rs1536348	rs2287654
DAB1	rs2406284	_	rs6679454
TMEM176B;TMEM176A	rs4463336	rs2304677	rs2888674
RBM47	rs16852140	rs10081568	_
RALA	rs10486803	rs2723995	rs10464366
EBF3	rs1015605	_	rs477692

Table 4. Results of the machine learning methods

ML method	Precision	F1 score	Recall	ROC
Lasso	0.641	0.781	1	0.5625
Xgboost	0.771	0.851	1	0.708
KNN	0.641	0.781	1	0.566
Decision tree	0.693	0.718	0.663	0.742
Naive Bayes	0.859	0.718	0.663	0.675
Random Forest	0.859	0.907	0.941	0.654
SVM	0.820	0.882	1	0.567

Reviewers agreed at journal
11 Sep, 2024
Reviewers invited by journal
31 Aug, 2024
Editor invited by journal
02 Aug, 2024
Editor assigned by journal
02 Aug, 2024
Submission checks completed at journal
02 Aug, 2024
First submitted to journal
31 Jul, 2024
