Data collection
All cancer types with more than 5 normal samples and available metastasis information were selected from TCGA (https://portal.gdc.cancer.gov/). DNA methylation data (Illumina Human Methylation 450) for 16 main cancer subtypes were downloaded for this study: bladder urothelial carcinoma (BLCA), breast invasive ductal carcinoma (D_BRCA), breast invasive lobular carcinoma (L_BRCA), colon adenocarcinoma (COAD), esophageal adenocarcinoma (ESCA), head and neck squamous cell carcinoma (HNSC), renal clear cell carcinoma (KIRC), renal papillary cell carcinoma (KIRP), hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), pancreatic adenocarcinoma (PAAD), prostate adenocarcinoma (PRAD), rectal adenocarcinoma (READ), follicular thyroid carcinoma (F_THCA) and papillary thyroid carcinoma (P_THCA).
The probe annotation file was downloaded from GEO (GPL13534, HumanMethylation450_15017482, Illumina Inc.).
Clinical data were downloaded from cBioPortal for Cancer Genomics (http://www.cbioportal.org/) [11] through its web API.
Data processing
In the AJCC Cancer Staging Manual, the tumor-node-metastasis (TNM) system is used as a general criterion to classify cancers by the size and extent of the primary tumor (T), the involvement of regional lymph nodes (N), and the presence or absence of distant metastases (M) [12]. Based on the TNM system, tumor samples in our study were classified into two main groups according to their metastasis state: localized tumors (N0 and M0: no regional lymph node metastases and no distant metastases) and metastatic tumors (regional lymph node metastases, distant metastases, or both). The sample compositions of all 16 cancer types are listed in Supplementary Table S1. Early diagnosis biomarkers in our study were defined as biomarkers that can effectively distinguish localized tumors from normal tissues.
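The grouping rule above can be illustrated with a minimal sketch. The function name `classify_tumor`, the plain "N0"/"M1" string format (no "p" or "c" prefixes), and the choice to return `None` for indeterminate staging such as NX or MX are assumptions made for illustration; the source does not describe how such samples were handled.

```python
from typing import Optional


def classify_tumor(n_stage: str, m_stage: str) -> Optional[str]:
    """Assign a tumor sample to 'localized' or 'metastatic' from its N and M stage.

    Returns None when the staging does not allow a decision (assumption:
    such samples would be excluded rather than guessed).
    """
    n = n_stage.strip().upper()
    m = m_stage.strip().upper()

    # Regional lymph node metastases (N1-N3) or distant metastases (M1).
    n_positive = n.startswith(("N1", "N2", "N3"))
    m_positive = m.startswith("M1")

    if n_positive or m_positive:
        return "metastatic"
    if n.startswith("N0") and m.startswith("M0"):
        return "localized"
    return None
```

For example, `classify_tumor("N0", "M0")` gives `"localized"`, while a sample staged N2 is grouped as metastatic whatever its M stage.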
Methylation levels were measured as β values for all 485,577 probes. The β value, calculated as the ratio of the methylated probe signal to the total probe signal, ranges from 0 (entirely unmethylated) to 1 (entirely methylated). The M value, calculated as the log2 ratio of the methylated probe signal to the unmethylated probe signal, can be derived from the β value by equation (1). M values have been reported to be more statistically valid [13] and were therefore used in our methylation analyses.
$$M = \log_2\left(\frac{\beta}{1-\beta}\right) \tag{1}$$
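Because β is the methylated fraction of the total signal and M is the log2 ratio of methylated to unmethylated signal, the conversion is M = log2(β / (1 − β)). A minimal sketch of this conversion follows. The function names and the small clipping constant `eps`, which keeps β values of exactly 0 or 1 finite, are illustrative assumptions and are not taken from the original pipeline.

```python
import math


def beta_to_m(beta: float, eps: float = 1e-6) -> float:
    """Convert a beta value in [0, 1] to an M value.

    beta is clipped to [eps, 1 - eps] so that 0 and 1 do not produce
    infinite M values (the clipping is an assumption of this sketch).
    """
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m: float) -> float:
    """Inverse conversion: beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)
```

A β of 0.5 maps to M = 0, and β values above 0.5 give positive M values.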
For each cancer type, probes with >50% missing data in normal or tumor samples were removed. For cancer types with fewer than 10 normal samples, only probes with no missing data in normal samples were used. The number of probes retained for further steps ranged from 395,529 (READ) to 396,062 (BLCA and KIRC). Then, the ‘impute.knn’ function from the R package ‘impute’ (1.48.0) [14] was used to estimate missing data by averaging over the 10 nearest neighbors. Finally, probes located on sex chromosomes and cross-reactive probes (probes that hybridize to alternate sequences) [15] were removed.
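The missing-data filter described above can be sketched as follows. This is an illustration only, not the R code used in the study: missing values are represented as `None`, the names `missing_fraction` and `keep_probe` are hypothetical, and non-empty sample lists are assumed. The imputation and the removal of sex-chromosome and cross-reactive probes are not covered.

```python
from typing import List, Optional

Row = List[Optional[float]]  # one probe's values across samples; None = missing


def missing_fraction(values: Row) -> float:
    """Fraction of missing entries in a non-empty list of values."""
    return sum(v is None for v in values) / len(values)


def keep_probe(normal: Row, tumor: Row,
               max_missing: float = 0.5,
               small_normal_n: int = 10) -> bool:
    """Return True if the probe passes the missing-data filter.

    A probe is removed when more than max_missing of its values are missing
    in either group. With fewer than small_normal_n normal samples, no
    missing value is tolerated in the normal group.
    """
    if len(normal) < small_normal_n:
        normal_ok = missing_fraction(normal) == 0.0
    else:
        normal_ok = missing_fraction(normal) <= max_missing
    return normal_ok and missing_fraction(tumor) <= max_missing
```

Note that a probe with exactly 50% missing values is kept, since only probes with more than 50% missing data are removed.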
Data analysis and Biomarker selection
The R package ‘limma’ (3.30.13) [16] was used to compare normal samples (N), localized tumor samples (LT) and metastatic tumor samples (MT) within each of the 16 cancer types, based on their M values. Potential biomarkers had to satisfy the following criteria:
- Significant methylation difference between N and LT (FDR < 0.05);
- Large mean β difference (> 0.1) between N and LT;
- Significant methylation difference between N and MT (FDR < 0.05);
- No correlation between factors (sex, race and age) and methylation level (FDR > 0.05).
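The four criteria above can be combined into a single filter, sketched below. The per-probe statistics are assumed to have been computed already (for example, FDR values from ‘limma’); the `ProbeStats` container and its field names are hypothetical. Treating the mean β difference as an absolute difference, so that both hyper- and hypomethylated probes qualify, is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ProbeStats:
    fdr_n_vs_lt: float           # FDR for normal vs localized tumors
    fdr_n_vs_mt: float           # FDR for normal vs metastatic tumors
    mean_beta_normal: float      # mean beta in normal samples
    mean_beta_localized: float   # mean beta in localized tumors
    covariate_fdr: Dict[str, float]  # FDR of association with sex, race, age


def is_potential_biomarker(stats: ProbeStats,
                           fdr_cutoff: float = 0.05,
                           min_delta_beta: float = 0.1) -> bool:
    """Apply the four selection criteria to one probe in one cancer type."""
    delta_beta = abs(stats.mean_beta_localized - stats.mean_beta_normal)
    no_covariate_effect = all(f > fdr_cutoff
                              for f in stats.covariate_fdr.values())
    return (stats.fdr_n_vs_lt < fdr_cutoff
            and delta_beta > min_delta_beta
            and stats.fdr_n_vs_mt < fdr_cutoff
            and no_covariate_effect)
```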
After potential biomarkers had been identified for all cancer types, two types of biomarkers with biological and clinical significance were further selected:
- Pan-cancer biomarkers were defined as biomarkers that were present in ≥ 80% of cancer types and showed the same direction of change from N to LT in all of those cancer types.
- Cancer-specific biomarkers were defined as biomarkers that were present in ≥ 60% of cancer types and for which one cancer type (two subtypes of the same cancer were also allowed) showed the opposite direction of change from N to LT compared with the other cancer types.
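One possible reading of these two definitions is sketched below. It assumes that, for each probe, the direction of change from N to LT is recorded as +1 (gain) or -1 (loss) for every cancer type in which the probe is a potential biomarker. The function name, the `parent` mapping that links subtypes to a common cancer (for example D_BRCA and L_BRCA), and the reading of "opposite trend" as a minority of one cancer type or two subtypes are all illustrative assumptions.

```python
from typing import Dict, Optional


def classify_marker(directions: Dict[str, int],
                    n_cancer_types: int = 16,
                    parent: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Label a probe as 'pan-cancer', 'cancer-specific', or None.

    directions maps cancer type -> +1 or -1 for the N-to-LT change, and
    contains only cancer types where the probe is a potential biomarker.
    parent maps a subtype to its parent cancer (hypothetical mapping).
    """
    parent = parent or {}
    coverage = len(directions) / n_cancer_types
    up = {c for c, d in directions.items() if d > 0}
    down = set(directions) - up

    # Pan-cancer: broad coverage and a single direction of change.
    if coverage >= 0.8 and (not up or not down):
        return "pan-cancer"

    # Cancer-specific: one cancer type (or two subtypes of one cancer)
    # changes in the opposite direction to all the others.
    if coverage >= 0.6 and up and down:
        minority = up if len(up) <= len(down) else down
        parents = {parent.get(c, c) for c in minority}
        if len(parents) == 1 and len(minority) <= 2:
            return "cancer-specific"
    return None
```

With 16 cancer types, the 80% threshold corresponds to at least 13 cancer types and the 60% threshold to at least 10.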
Biomarker validation
Validation datasets were selected from GEO based on two criteria. First, the data were produced on the same platform (Illumina Human Methylation 450). Second, each dataset contained more than 5 normal tissues and tumor tissues.
A total of 11 GEO validation datasets from 7 main cancer types were used: GSE60185 (46 normal and 208 BRCA), GSE69914 (50 normal and 305 BRCA), GSE42752 (19 normal and 22 COAD-READ), GSE48684 (17 normal and 64 COAD-READ), GSE61441 (46 normal and 46 paired KIRC), GSE56588 (10 normal and 224 LIHC), GSE77269 (20 normal and 20 paired LIHC), GSE66836 (19 normal and 164 LUAD), GSE49149 (29 normal and 167 PAAD), GSE76938 (63 normal and 73 PRAD), GSE112047 (16 normal and 31 PRAD).
GEO datasets were processed in essentially the same way as the TCGA data: missing values were imputed, and M values were then compared between normal and tumor tissues with the ‘limma’ package. Biomarkers were considered validated only if they showed a similar methylation variation pattern in the TCGA and GEO datasets. Validated pan-cancer biomarkers were required to show consistent, significant methylation variation (FDR < 0.05) between normal tissues and tumors in ≥ 80% of GEO datasets. Validated cancer-specific biomarkers were required to show consistent, significant methylation variation (FDR < 0.05) in ≥ 60% of GEO datasets and to match the specific cancer type identified in TCGA.
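The validation thresholds can be expressed as a simple fraction check, sketched below. Each GEO dataset is assumed to contribute one record per probe holding its FDR, the direction of change observed in GEO, and the direction observed in TCGA for the matching cancer type. The `GeoResult` record and the function names are hypothetical, and interpreting "consistent" as "same direction as in TCGA" is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeoResult:
    fdr: float            # FDR for normal vs tumor in this GEO dataset
    geo_direction: int    # +1 or -1, direction of change in GEO
    tcga_direction: int   # +1 or -1, direction in the matching TCGA cancer type


def fraction_consistent(results: List[GeoResult],
                        fdr_cutoff: float = 0.05) -> float:
    """Fraction of datasets with a significant change in the TCGA direction."""
    hits = sum(1 for r in results
               if r.fdr < fdr_cutoff and r.geo_direction == r.tcga_direction)
    return hits / len(results)


def is_validated(results: List[GeoResult], min_fraction: float) -> bool:
    """min_fraction is 0.8 for pan-cancer and 0.6 for cancer-specific markers."""
    return fraction_consistent(results) >= min_fraction
```

With 11 GEO datasets, a pan-cancer biomarker therefore needs consistent, significant results in at least 9 of them.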
Model construction and validation
Logistic regression models were constructed based on methylation level (β) to measure the predictive ability of the biomarkers. To build the model from the smallest set of critical biomarkers, we used leave-one-out cross-validation (LOOCV). For a dataset with n samples, we left one sample out at a time and, from the remaining n-1 samples, built the model with the lowest Akaike Information Criterion (AIC). After summarizing all n models, we selected the biomarkers included in every model as the critical ones. The final model was constructed from the critical biomarkers only.
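The leave-one-out step can be outlined as follows. This sketch covers only the resampling and the intersection of selected biomarkers. The AIC-based model search is represented by a caller-supplied function, `select_model`, which is a hypothetical placeholder expected to return the biomarkers of the lowest-AIC logistic regression model fitted on the training samples it receives.

```python
from typing import Callable, Iterable, List, Optional, Set, TypeVar

Sample = TypeVar("Sample")


def critical_biomarkers(
    samples: List[Sample],
    select_model: Callable[[List[Sample]], Iterable[str]],
) -> Set[str]:
    """Return the biomarkers selected in every leave-one-out model.

    For each of the n samples, the sample is left out, select_model is run
    on the remaining n-1 samples, and the selected sets are intersected.
    """
    critical: Optional[Set[str]] = None
    for i in range(len(samples)):
        training = samples[:i] + samples[i + 1:]
        chosen = set(select_model(training))
        critical = chosen if critical is None else critical & chosen
    return critical if critical is not None else set()
```

A biomarker that drops out of even one of the n models is excluded, which keeps only the biomarkers whose selection does not depend on any single sample.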
Four datasets from GEO were downloaded and used as validation datasets: GSE47915 (4 Gleason-6 prostate tumors and 4 benign prostate tissues), GSE76938 (63 normal and 73 PRAD), GSE112047 (16 normal and 31 PRAD) and GSE52955 (kidney: 6 normal and 17 tumor samples; bladder: 5 normal and 25 tumor samples; prostate: 5 normal and 25 tumor samples).
Pyrosequencing sensitivity test in urine samples
LNCaP cells were cultured in RPMI-1640 medium supplemented with 10% FBS. One control group (30 mL normal urine) and two experimental groups (30 mL normal urine spiked with 10³ or 10⁴ LNCaP cells) were prepared. DNA was extracted with a urine DNA extraction kit (Ningbo AJcore, China) and then treated with the Qiagen EpiTect Bisulfite Kit (Qiagen, 59104) for bisulfite conversion. After PCR amplification, the methylation level of the cg26140475 locus was measured by pyrosequencing.
The authors state that they have obtained approval from Hangzhou Medical College’s medical ethics committee and have followed the principles outlined in the Declaration of Helsinki for all human experimental investigations. In addition, for investigations involving human subjects, informed consent has been obtained from the participants involved.