Datasets
The data used in this study comprised two previously published datasets from the Gene Expression Omnibus (GEO) database, GSE175699 and GSE126043. GSE175699 was generated on platform GPL23976 and GSE126043 on platform GPL21145. GSE175699 contains samples from melanoma patients and GSE126043 contains samples from lung cancer patients. For GSE175699, bisulphite-converted DNA was amplified, fragmented and hybridised to the Illumina Infinium MethylationEPIC BeadChip using the standard Illumina protocol. The clinical data include gender, age, BRAF mutation status (brafmut), NRAS mutation status (nrasmut), ICI response, brain metastasis and tissue in relation to ICI treatment. We combined the CpG loci from the two datasets for the subsequent analyses. A total of 78 patients were included in the combined dataset, of whom 28 responded to immunotherapy and 50 did not.
Differential methylation analysis
We used the limma package to perform differential analysis between the immunotherapy response and non-response groups, and the dplyr package for data handling, in RStudio (version 4.3.1). We further used the EWAS Catalog website (https://ewascatalog.org/) to analyze the location and region of the methylation sites. We mapped the methylation sites to gene names and then analyzed the protein-protein interaction (PPI) networks of the separate and combined datasets on the STRING website (https://cn.string-db.org). We applied clustering to group the PPI networks and subsequently analyzed the corresponding pathways. We investigated the correlation between the differential methylation sites and clinical characteristics, tumour mutational burden (TMB), and prognosis using data from the TCGA database and other datasets. We compared the methylation levels of the differential CpG loci between the groups that responded to immunotherapy and those that did not, and illustrated the results with box plots (Figure 3B-E).
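As a rough illustration of a per-site group comparison, the sketch below computes the mean methylation difference and Welch's t-statistic for each CpG site. It is written in Python for illustration only and is not limma's moderated t-test, which additionally shrinks per-site variances using information from all sites. The site identifiers and beta values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t-statistic for two lists of methylation values."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical beta values (fraction methylated) per CpG site.
responders = {
    "cg_example_1": [0.82, 0.79, 0.85, 0.80],
    "cg_example_2": [0.40, 0.45, 0.38, 0.43],
}
non_responders = {
    "cg_example_1": [0.55, 0.60, 0.52, 0.58],
    "cg_example_2": [0.41, 0.44, 0.39, 0.42],
}

results = []
for site in responders:
    delta = mean(responders[site]) - mean(non_responders[site])
    t_stat = welch_t(responders[site], non_responders[site])
    results.append((site, delta, t_stat))

# Rank sites by absolute t-statistic, largest first.
results.sort(key=lambda r: abs(r[2]), reverse=True)
```

In this toy input only the first site differs between the groups, so it ranks first.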
QTL analysis
DNA methylation quantitative trait loci (meQTLs) are regions of the genome containing DNA sequence variants that influence methylation levels. Single nucleotide polymorphisms (SNPs) are the most common variants in the genome. We used the Pancan-meQTL database (http://gong_lab.hzau.edu.cn/Pancan-meQTL/) for QTL analysis. This database contains methylation 450K array data from 7242 TCGA samples. Pancan-meQTL aims to systematically predict the effects of SNPs on methylation levels, providing cis-meQTLs (SNPs that affect the methylation levels of local CpG sites), trans-meQTLs (SNPs that affect the methylation levels of distant CpG sites) and GWAS-eQTLs (Genome-Wide Association Study - Expression Quantitative Trait Loci) across cancer types, using thousands of genotype and methylation profiles from The Cancer Genome Atlas (TCGA). Across 23 cancer types, we selected the most significant SNP, i.e. the one with the minimum p-value (P < 0.05); the results are listed in Table 3.
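The selection rule described above (the SNP with the smallest p-value below 0.05 within each cancer type) can be sketched as follows. The record layout, SNP and CpG identifiers, and p-values are hypothetical placeholders, not entries from Pancan-meQTL.

```python
# Hypothetical meQTL records: (cancer_type, snp_id, cpg_id, p_value).
records = [
    ("SKCM", "rs_example_1", "cg_example_1", 3.2e-8),
    ("SKCM", "rs_example_2", "cg_example_1", 4.0e-3),
    ("LUAD", "rs_example_3", "cg_example_2", 1.5e-5),
    ("LUAD", "rs_example_4", "cg_example_2", 0.20),
    ("BRCA", "rs_example_5", "cg_example_3", 0.30),
]

def top_snp_per_cancer(records, alpha=0.05):
    """Keep, for each cancer type, the record with the smallest p-value below alpha."""
    best = {}
    for cancer, snp, cpg, p_value in records:
        if p_value >= alpha:
            continue  # not significant at the chosen threshold
        if cancer not in best or p_value < best[cancer][2]:
            best[cancer] = (snp, cpg, p_value)
    return best

top = top_snp_per_cancer(records)
```

Cancer types with no SNP below the threshold (here the hypothetical BRCA record) simply do not appear in the result.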
Pathway enrichment analysis
We performed Gene Ontology (GO) enrichment analysis on the differentially methylated sites to identify the relevant pathways, and performed gene localization enrichment using the EWAS Data Hub database (https://ngdc.cncb.ac.cn/ewas/). Gene localization enrichment refers to a situation in which genes are significantly concentrated within a specific genomic region (e.g., a portion of a chromosome or a specific genomic segment). Furthermore, we obtained the chromatin state and histone modification annotations of these sites. Chromatin state is the structural and functional state of chromatin in different regions of the nucleus, reflecting the density of DNA packaging, modification patterns, and associated transcriptional activity. Histone modification refers to covalent chemical modifications of histones that can affect chromatin structure and gene expression.
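GO enrichment is commonly assessed with an over-representation test. The text does not state which test was applied here, so the sketch below assumes a one-sided hypergeometric test; the function name and the example counts are illustrative only.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn from n_background genes,
    of which n_annotated carry the GO term of interest."""
    upper = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy example: 10 background genes, 3 annotated to the term,
# 3 genes selected, all 3 of them annotated.
p_enriched = hypergeom_enrichment_p(10, 3, 3, 3)
p_trivial = hypergeom_enrichment_p(10, 3, 3, 0)
```

A small p-value indicates that the selected gene set contains more term-annotated genes than expected by chance; in practice the p-values across all GO terms would also need multiple-testing correction.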
Motif analysis
Methylation motif analysis is used to identify and analyze specific DNA sequence patterns (motifs) associated with DNA methylation in the genome. Motifs are short, specific nucleotide sequences that are often recognized as important elements in the regulation of gene expression and other genomic functions. By identifying these motifs, the role of DNA methylation in gene regulation can be better understood. We performed motif analysis of the methylation sites associated with immunotherapy efficacy on the EWAS Data Hub website (https://ngdc.cncb.ac.cn/ewas/datahub). The related DNA sequence patterns may serve as binding sites for transcription factors, thereby providing insights into the mechanisms of gene expression control.
TMB, PD-L1 and NeoAg correlation analysis
We utilized a breast and ovarian cancer ICB cohort whose data include TMB, PD-L1 expression and survival outcomes (OS and PFS). We conducted correlation analyses between variables using RStudio (version 4.3.1). We performed a Pearson correlation analysis of TMB and NeoAg. The correlation plot shows the distribution of the data points and the regression line, which helps to understand the relationship between TMB, NeoAg, and the response to immunotherapy. The association between PD-L1 expression and response to immunotherapy (effective or ineffective) was assessed using the chi-square test. The p-value of the chi-square test indicates the significance of this association: a p-value less than 0.05 means that there is a significant association between the two variables.
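The two statistics used in this section can be sketched as below: the Pearson correlation coefficient and the Pearson chi-square test for a 2x2 table. The input values are hypothetical. The chi-square function applies no continuity correction, whereas R's chisq.test applies Yates' correction to 2x2 tables by default, so the numbers may differ from an R run.

```python
from math import erfc, sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom, no correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the upper tail probability equals erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))

# Hypothetical TMB and NeoAg values for five patients.
tmb = [1.0, 2.0, 3.0, 4.0, 5.0]
neoag = [2.0, 4.0, 6.0, 8.0, 10.0]
r = pearson_r(tmb, neoag)

# Hypothetical counts: rows = PD-L1 positive / negative, columns = effective / ineffective.
chi2, p_value = chi_square_2x2([[10, 5], [5, 10]])
```

With these toy counts the p-value falls above 0.05, so the association would not be called significant under the threshold stated above.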
Machine learning process
We used machine learning methods to classify methylation profiles for the prediction of immunotherapy response, aiming to find the best prediction model and, on this basis, to identify methylation sites that are meaningful biomarkers for predicting immunotherapy response. The machine learning methods included lasso regression, XGBoost, KNN, support vector machine (SVM), Naive Bayes, random forest and decision tree. We calculated precision, recall, F1 score and ROC curves. Precision assesses the accuracy of a classification system: it measures the proportion of samples predicted as positive that are truly positive. Recall assesses the completeness of a classification system: it measures the proportion of truly positive samples that are predicted as positive. The relevant quantities are:
TP (True Positives): the number of positive samples correctly classified as positive.
TN (True Negatives): the number of negative samples correctly classified as negative.
FP (False Positives): the number of negative samples incorrectly classified as positive (false alarms).
FN (False Negatives): the number of positive samples incorrectly classified as negative (missed cases).
The formula for precision is:
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]
In this equation, TP is the number of true positives, i.e., cases in which a positive sample is correctly classified as positive, and FP is the number of false positives, i.e., cases in which a negative sample is incorrectly classified as positive.
The formula for recall is:
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]
In this formula, TP is again the number of true positives, while FN is the number of false negatives, i.e., cases in which a positive sample is incorrectly classified as negative. A ratio close to 1 indicates that TP is much larger than FN: the model identifies the vast majority of samples that are actually positive and rarely misclassifies a positive sample as negative.
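A minimal sketch of how these metrics follow from the TP, FP and FN counts is given below, with the F1 score included as the harmonic mean of precision and recall. The label vectors are hypothetical (1 = responder, 0 = non-responder), and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1 computed from paired true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: TP = 3, FP = 1, FN = 1, TN = 3.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```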
The F1 score evaluates the balance between precision and recall in a classification system. It is the harmonic mean of precision and recall, F1 = 2 × Precision × Recall / (Precision + Recall), providing a single score that considers both. The principles of the machine learning models we constructed are as follows:
Lasso regression
Lasso regression is a linear modelling method for feature selection and parameter estimation that compresses the model by including an L1 regularization term (i.e., the sum of the absolute values of the coefficients), which shrinks certain coefficients to exactly zero. The main advantage of lasso regression is that it allows effective variable selection, especially in the presence of a large number of predictor variables, and can filter out the variables that have the greatest impact on the dependent variable. We used the R glmnet package to construct the lasso regression model.
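To show how the L1 penalty drives coefficients to exactly zero, the sketch below fits a lasso model by cyclic coordinate descent with soft-thresholding. It is a simplified illustration under stated assumptions (squared-error loss, no intercept, toy data), not the glmnet implementation; a binary response such as immunotherapy response would normally be fitted with a logistic loss.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals y - X b, kept up to date
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            scale = sum(v * v for v in col) / n
            if scale == 0.0:
                continue  # constant-zero feature carries no information
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new_beta = soft_threshold(rho, lam) / scale
            delta = new_beta - beta[j]
            resid = [r - c * delta for c, r in zip(col, resid)]
            beta[j] = new_beta
    return beta

# Toy data: y = 2 * x1 + 0.1 * x2, with orthogonal features.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.1, -1.9, 1.9, -2.1]
beta = lasso_coordinate_descent(X, y, lam=0.5)
```

With this penalty the weak second feature is removed entirely (its coefficient is zero), while the strong first feature is kept but shrunk from 2 toward 1.5.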
XGBoost
XGBoost is a scalable gradient boosting algorithm that supports parallel model building and adds a regularization term to the loss function to prevent overfitting. It is based on gradient-boosted trees, an ensemble learning method that trains a series of weak learners (usually decision trees) iteratively, with each iteration attempting to correct the errors of the previous one, and finally combines these weak learners into one strong learner. We built an XGBoost model to predict immunotherapy efficacy from the methylation sites. Grid search is a systematic hyperparameter optimization method that finds the optimal hyperparameter setting by traversing predefined combinations of hyperparameters.
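Grid search itself is independent of the model being tuned, which the following sketch illustrates. The grid values and the scoring function are hypothetical stand-ins: in a real run the score would be a cross-validated metric (for example AUC) of an XGBoost model trained with each parameter combination.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid over two commonly tuned boosting hyperparameters.
param_grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}

def toy_score(params):
    # Stand-in for a cross-validated score; it peaks at max_depth=4, learning_rate=0.1.
    return -abs(params["max_depth"] - 4) - abs(params["learning_rate"] - 0.1)

best_params, best_score = grid_search(param_grid, toy_score)
```

The cost grows with the product of the grid sizes (here 3 x 3 = 9 evaluations), which is why grids are usually kept coarse.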
KNN
The K-Nearest Neighbors (KNN) algorithm is a commonly used supervised learning algorithm for classification and regression tasks. KNN performs classification or prediction by measuring the distance between feature vectors. The K nearest neighbors are found by calculating the distance between a new data point and all the data points in the training dataset; for classification, the new point is then assigned the most common class among these neighbors.
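The procedure can be sketched in a few lines, assuming Euclidean distance and a majority vote among the neighbors. The two-feature training points and labels are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical methylation values at two CpG sites per patient.
train_X = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15),
           (0.80, 0.90), (0.90, 0.80), (0.85, 0.85)]
train_y = ["non-responder"] * 3 + ["responder"] * 3

prediction_high = knn_predict(train_X, train_y, (0.80, 0.80))
prediction_low = knn_predict(train_X, train_y, (0.20, 0.20))
```

Because every prediction scans the whole training set, KNN has no training phase but becomes slow on large datasets.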
Decision tree
Decision Tree (DT) is a supervised learning algorithm based on a tree structure for classification and regression tasks. A decision tree divides data into different categories or predicts continuous values through a series of decision rules. The basic idea is to divide the data according to the different values of the features, and eventually form a tree structure with leaf nodes representing decision results or predicted values.
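A single splitting step of such a tree can be sketched as follows. The text does not name the splitting criterion, so this example assumes Gini impurity and a single numeric feature; the values and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Threshold on one feature that minimises the weighted Gini impurity."""
    best_threshold, best_impurity = None, float("inf")
    unique = sorted(set(values))
    for low, high in zip(unique, unique[1:]):
        threshold = (low + high) / 2  # midpoint between adjacent observed values
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity

# Hypothetical methylation values at one CpG site; 1 = responder, 0 = non-responder.
values = [0.10, 0.20, 0.25, 0.75, 0.80, 0.90]
labels = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(values, labels)
```

A full tree repeats this search over all features and recurses into the left and right subsets until a stopping rule is met.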
Naive Bayes
Naive Bayes is a classification method based on Bayes' theorem with the assumption of conditional independence among features. For a sample to be classified, it calculates the posterior probability of each category given the features of that sample and assigns the sample to the category with the highest probability.
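The sketch below illustrates this rule for continuous features, assuming each feature follows a Gaussian distribution within each class (the Gaussian variant is an assumption; the text does not specify which variant was used). The training data are hypothetical, and a small constant is added to the variances to avoid division by zero.

```python
from math import log, pi
from statistics import mean, pvariance

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-feature means and variances."""
    model = {}
    for cls in set(y):
        rows = [x for x, label in zip(X, y) if label == cls]
        columns = list(zip(*rows))
        model[cls] = {
            "prior": len(rows) / len(y),
            "means": [mean(col) for col in columns],
            "vars": [pvariance(col) + 1e-9 for col in columns],
        }
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior for sample x."""
    def log_posterior(params):
        total = log(params["prior"])
        for value, mu, var in zip(x, params["means"], params["vars"]):
            total += -0.5 * log(2 * pi * var) - (value - mu) ** 2 / (2 * var)
        return total
    return max(model, key=lambda cls: log_posterior(model[cls]))

# Hypothetical methylation values at two CpG sites; 1 = responder, 0 = non-responder.
X = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15),
     (0.80, 0.90), (0.90, 0.80), (0.85, 0.85)]
y = [0, 0, 0, 1, 1, 1]
nb_model = fit_gaussian_nb(X, y)
nb_prediction = predict_gaussian_nb(nb_model, (0.80, 0.80))
```

Summing per-feature log densities is where the conditional independence assumption enters: the joint likelihood is treated as a product over features.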
SVM
Support vector machine (SVM) is a widely used supervised learning algorithm that is particularly well suited to classification problems. Its core idea is to separate the categories of data by finding an optimal hyperplane that maximizes the distance from the nearest data points of each category to that hyperplane, i.e., the margin.
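The notion of margin can be made concrete by computing, for a given hyperplane w·x + b = 0, the smallest signed distance from any training point to it. The sketch below only evaluates candidate hyperplanes; it does not solve the SVM optimization problem. The points, labels (+1 / -1) and hyperplanes are hypothetical.

```python
from math import sqrt

def geometric_margin(w, b, X, y):
    """Smallest signed distance from the labelled points to the hyperplane w.x + b = 0.
    A positive value means every point lies on the correct side."""
    norm = sqrt(sum(wi * wi for wi in w))
    return min(
        label * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, label in zip(X, y)
    )

# Hypothetical linearly separable data.
X = [(0.0, 0.0), (0.0, 1.0), (2.0, 0.0), (2.0, 1.0)]
y = [-1, -1, 1, 1]

margin_centered = geometric_margin((1.0, 0.0), -1.0, X, y)  # hyperplane x1 = 1
margin_shifted = geometric_margin((1.0, 0.0), -0.5, X, y)   # hyperplane x1 = 0.5
```

Both hyperplanes separate the two classes, but the centered one has the larger margin and is the one an SVM would prefer.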