Patient Selection and plasma samples
Patient plasma samples were obtained from the Ontario Health Study (OHS) with protocols approved by the University of Toronto Health Sciences research ethics board (protocol #34088). Peripheral blood was drawn from OHS participants upon recruitment to the study, and 1.6 mL plasma was separated and collected within 48 hours, and immediately cryopreserved at the OHS Biobank. Participants in OHS that had developed breast cancer following recruitment to the study were identified by linking individuals who had provided a blood sample to the Ontario Cancer Registry through Cancer Care Ontario (CCO). Variables that were used for linkages included health insurance number, age, sex and name. At the time of linkages, cancer registry data had been made available through the Ontario Cancer Registry up until December 2017. Breast cancers were confirmed by histological analyses of tissue biopsies at the time of diagnosis, and immuno-histochemical tests for hormone receptor status were reported in the pathology records of breast cancer cases. A total of 167 OHS participants that donated biologics developed breast cancer following study recruitment. Blood plasma from 110 breast cancer participants was available and pulled from the biobank. Additionally, 1.6 mL plasma from 108 cancer-free controls matched to cases by age, sex, date of biologic collection, ethnicity, smoking status, and alcohol consumption frequency were also selected. Control participants that have not had a history of cancer prior to or following study enrolment and did not pass away from other comorbidities up to December 2017 were retained.
Next-generation Sequencing Library Construction and cfMeDIP-Seq protocol
The cfDNA was extracted from plasma using the QIAamp Circulating Nucleic Acid Kit (Qiagen). 5-10ng of cfDNA was used as input to generate methylated cfDNA libraries (IP libraries) along with an input control library (IC libraries). Quality of incoming cfDNA was assessed using the Fragment Analyzer (Agilent) following the manufacturers guidelines. 0.1ng of Arabadopsis thaliana DNA was added to samples prior to library preparation. Combined samples were prepared using the KAPA Hyper Prep library protocol (Roche), with standard End Repair & A-tailing and ligation of xGen Duplex Seq Adapter (IDT), followed by incubation at 4°C overnight. Unmethylated lambda (λ) DNA was added to partially completed IP libraries and enriched for methylated DNA using the MagMeDip Kit (Diagenode) and purified with the IPure Kit v2 (Diagenode). Sample indices were added to IP and IC libraries via PCR. Completed libraries were quantified by Qubit (Life Technologies) and Fragment Analyzer (Agilent). Both IP and IC libraries underwent shallow sequencing (~20,000 reads) on the MiSeq platform as a quality control step. IP libraries were sequenced to approximately 60M read pairs in 2x50bp mode on Novaseq platform (Illumina). All breast cancer samples were batched together with controls to mitigate batch effects between sequencing runs.
Raw Sequencing File Processing
Following sequencing, the FASTQ raw reads were adapter trimmed, with unique molecular identifiers (UMIs) appended to fastq headers using UMI Tools (version 0.3.3). The reads were then aligned to hg19 using Bowtie233 (version 2.3.5.1) in paired end mode at default settings. Aligned SAM files were converted to BAM file format, indexed, and sorted using SAM tools (version 1.9)34. Aligned reads were subsequently deduplicated according to alignment positions and UMIs using UMI Tools.
Quality Control and Sample Inclusion
One control sample was excluded from our study owing to mortality from non-cancer related causes during study follow up. Six controls were excluded due to diagnoses of cancer pre-disposing conditions that were identified from study follow up questionnaires during follow up. Three control samples were excluded owing to diagnosis of another cancer following sample collection and processing. Following library preparation, four samples were removed as no reads were generated during the MiSeq quality control step. We retained and analysed all samples with more than 10 million deduplicated reads. 39 samples were removed owing to Novaseq sequencing instrument failure that resulted in poor sequencing yields. To assess enrichment efficiency, the number of methylated and unmethylated Arabidopsis spike-ins aligned to F19K16 and F24B22 respectively were counted, and the proportion of methylated spike-ins generated out of the total spike-ins were calculated. Seven samples with less than 95% of spike-in reads that were methylated were excluded. An additional eight were samples owing to poor CpG enrichment assessed through GoGe (< 1.75) and relH enrichment scores (< 2.7) calculated using MEDIPS (R package version 1.12.0) were also removed36. See Extended Data Table 1 for quality control metrics and sample information among remaining samples.
Computing cfMeDIP-Seq methylation signals
To identify regions with differential methylation between pre-diagnosis breast cancers and control cfDNA, coverage profiles were generated for each sample across 300 bp non-overlapping binned tiled windows using MEDIPS. To reduce background signal from non-tumour-derived cfDNA and reduce the feature search space, we leveraged publicly accessible data to filter for potentially informative regions. Regions frequently methylated in haematopoietic cells were inferred using whole genome bisulfite sequencing data of peripheral blood leukocytes (n = 78) from the International Human Epigenetics Consortium (IHEC)36. We averaged the level of methylation across all CpG sites within the same 300 bp non-overlapping tiled window for each sample to infer the level of methylation within a specified region. Regions with an average of methylation level greater than 0.4 across all PBL samples were excluded. Remaining 300-bp bins with at least six or more CpG sites located at CpG islands, shores and shelves, or in FANTOM5 annotated promoters and enhancers were tested for differential methylation37.
Repeated cross validation for differential methylation calling and predictive modelling
A 10-fold cross validation (CV) approach was used to evaluate the discriminatory performance of a methylation biomarker. First, the pre-cancer cases and control samples were divided into 10 approximately equal sized sets using stratified sampling, balancing the proportion of pre-diagnosis cases by years prior to diagnosis following blood collection in each fold set. Iteratively, for each fold in the CV procedure, one set was selected as the test set and the remaining nine sets were designated as the train set (comprising 10% and 90% of participants respectively). Within the train set, differential methylation calling was performed using a Wald test of the regression coefficient from a negative binomial regression of cfMeDIP-Seq methylation level on train set case and control status using DESeq2 (R package version 1.30.1), adjusting for batch and age38. Additionally, we filtered out regions lowly methylated in cancer-free participants in the train set by identifying regions with a mean count across train set controls less than the mean count across all regions among train set samples. The remaining features with a p < 0.05 were retained and considered significant within each subsampling iteration. Across the 2000 (200 replicates of 10-fold CV) subsampled DMR calls, regions that passed all filter thresholds and were significant across at least 200 DMR calls (> 10%; 485 regions) were retained and investigated.
Within each subsampling iteration, the top 200 hypermethylated features were used to construct a random forest model with Caret (R package 6.0)39 from methylation counts of train set samples normalised by library size using DESeq2. The model performance was then assessed by applying the predictive model to the held-out test to obtain risk scores that reflect the proportion of decision trees that classify the sample as breast cancer. The 10-fold CV procedure was repeated 200 times to estimate the uncertainty in the analysis results. To assess the performance across the 200 repeats, we averaged the AUC and other performance metrics across the 10-folds for each CV repeat and calculated overall average performance and confidence intervals across the 200 repeated procedures.
TCGA Breast Cancer and Pan-cancer DNA Methylation Array
To identify whether cfDNA pre-diagnostic DMRs overlapped with DMRs between breast cancer and other tissue types, we conducted differential methylation analysis on TCGA 450k methylation array data from 976 paired healthy and normal tissue biopsy spanning 12 cancer types, and between publicly accessible PBL 450k methylation array data and TCGA breast cancer data40. Solid cancer and normal tissue raw IDAT files were downloaded from the TCGA data portal, and PBL from the GeoExpression Omnibus (GSE87571 and GSE42861). IDAT files were processed to generate beta methylation values from IDAT files using Minfi (1.36.0 R package)41, and normalised using the preprocessFunnorm function. To test for differentially methylated CpG sites between paired healthy and tumour biopsies, an F-test was performed using the DMPFinder function from Minfi across 485,512 CpG sites. To avoid imbalances in DMR calling towards specific cancer types with more samples, a resampling without replacement of 5 paired healthy and tumour biopsies from each of the 12 cancer types was repeated 1000 times, and a median p-value across the 1000 repeats was calculated per probe site. An FDR correction was applied to median p-values, and CpG sites with a median FDR q-value below 0.01 and median absolute difference in methylation of greater than 0.1 were retained as candidate markers for discriminating between tumour and healthy tissue. Differential methylation calling between all TCGA breast cancer and TCGA breast normal tissue, as well as between breast cancer tissue and PBL was also conducted using the DMPFinder function between all samples from each respective group to identify additional breast cancer specific markers. To infer whether cfDNA markers were reflective of breast cancer tissue methylomes, we identified 112 out of 207 significant DMRs among pre-diagnostic cfDNA markers overlapping with 247 CpG sites on 450k methylation array probes to cluster breast cancer, breast normal, PBL and TCGA tissue samples using tSNE.
Transcription factor and gene set enrichment
Transcription factor binding sites among significant DMRs were located using LOLA (R package version 1.20.0)42, with ChIP-Seq data from Encode used as a reference for known transcription factor binding sites43. Permutation testing was performed by randomly subsampling 485 regions 200 times across the genome to obtain a distribution of expected number of overlaps with TFBS of interest. The number of overlapping binding sites among randomly subsampled regions and among significant DMRs for each transcription factor of interest were z-score normalised. Gene set enrichment analysis among the 485 significantly hypermethylated regions was performed using the R package rGREAT (R package version 1.22.0) with default settings to identify enriched gene ontology molecular functions44. Gene sets with a binomial test and FDR adjusted p-value < 0.1 were considered significant.
Time-dependent model assessment
Owing to the outcome and age dependent sampling used in the study, the artificial case-to-non-case ratio in our sample was not representative of the Canadian adult population. We calculated sampling weights to adjust for this sampling bias in the time-dependent model assessment analysis using age specific cumulative breast cancer incidence rates from the Canadian Cancer Registry, in addition to all-cause mortality rates in Ontario reported by Statistics Canada. Often, case-non-case samples are analyzed without regard for the variable length of follow-up (time-to-event for cases, and time-to-censoring or loss to follow-up for non-cases). In cohort study samples, event times of cases may be longer than non-case follow-up times, and classification accuracy is assessed for intermittent follow-up times, both of which require time-dependent AUCs to summarize the discriminatory capacity of a model or marker. The predictive performance was assessed using time-dependent receiver operating characteristic (ROC(t)) curves, corresponding area under the curve (AUC(t)) estimates, accounting for the sampling weights. Cross-validated AUC estimates were obtained by taking the average of the AUC estimates from all “left-out” folds within a 10-fold CV replicate. The mean and the 2.5% and 97.5% quantiles were obtained from the AUC estimates across the set of 200 replicates of the CV procedure. As a summary measure of the classification accuracy across all follow-up times observed in the sample, we calculated a weighted concordance index; defined as the probability that, for any two randomly chosen participants, the observation with the shorter time to diagnosis also had the larger risk score assigned by the random forest classifiers. Pairs within the calculation were weighted to adjust for the study sampling45. Concordance, like the area under the curve (AUC) statistic, measures a model’s ability to discriminate between cases and non-cases, but does not address absolute probabilities assigned to each class.