Methylation Capture Sequencing (MC-seq)
DNA Samples Description. DNA was extracted from de-identified PBMC collected from four individuals. Genomic DNA quality was determined by estimating the A260/A280 and A260/A230 ratios by spectrophotometry and concentration by fluorometry. DNA integrity and fragment size were confirmed using a microfluidic chip run on an Agilent Bioanalyzer. To assess the reproducibility of MC-seq by DNA quantity, DNA samples from each participant were profiled in triplicate times with high (>1000ng), medium (300-1000ng) and low (150-300ng) DNA input. In total, 12 DNA samples were measured by MC-seq. Bisulfate conversion was conducted for each DNA sample as described below.
Methyl-Seq Target Enrichment Library Prep. Indexed paired-end whole genome sequencing libraries were prepared using the SureSelect XT Methyl-Seq kit (Agilent, part#G9651B). Genomic DNA was sheared to a fragment length of 150-200 bp using focused acoustic energy delivered by the Covaris E220 system (Covaris, part#500003). Fragmented sample size distribution was determined using the Caliper LabChip GX system (PerkinElmer, Part#122000). Fragmented DNA ends were repaired with T4 DNA Polymerase and Polynucleotide Kinase and “A” base was added using Klenow fragment in a single reaction followed by AMPure XP bead-based purification (Beckman Coulter, part#A63882). The methylated adapters were ligated using T4 DNA ligase followed by AMPure XP bead purification. Quality and quantity of adapter-ligated DNA were assessed using the Caliper LabChip GX system. Samples yielding >350 ng were enriched for targeted methylation sites by using the custom SureSelect Methyl-Seq Capture Library. Hybridization was performed at 65°C for 16 hours using a C1000 Thermal Cycler (BIO-RAD, part# 1851197). Once the enrichment was completed the samples were mixed with streptavidin-coated beads (Thermo Fisher Scientific, part#65602) and washed with a series of buffers to remove non-specific bound DNA fragments. DNA fragments were eluted from beads with 0.1 M NaOH. Unmethylated C residues of enriched DNA were modified by bisulfite conversion using the EZ DNA Methylation-Gold Kit (Zymo Research, part#D5005). The SureSelect enriched, bisulfite-converted libraries were PCR amplified using custom-made indexed primers (IDT, Coralville, Iowa). Dual-indexed libraries were quantified by quantitative polymerase chain reaction (qPCR) using the Library Quantification Kit (KAPA Biosystems, Part#KK4854) and inserts size distribution was assessed using the Caliper LabChip GX system. Samples with a yield of ≥2 ng/ul were proceeded to sequencing.
Flow Cell Preparation and Sequencing. Sample concentrations were normalized to 10 nM and loaded onto an Illumina NovaSeq flow cell at a concentration that yields 40 million passing filter clusters per sample. Samples were sequenced using 100bp paired-end sequencing on an Illumina HiSeq NovaSeq according to Illumina standard protocol. The 10bp dual index was read during additional sequencing reads that automatically follows the completion of the first read. Data generated during sequencing runs were simultaneously transferred to the Yale Center for Genome Analysis high-performance computing cluster. A positive control (prepared bacteriophage Phi X library) provided by Illumina was spiked into every lane at a concentration of 0.3% to monitor sequencing quality in real time.
Preprocessing and Quality Control. Signal intensities were converted to individual base calls during a run using the system's Real Time Analysis (RTA) software. Sample de-multiplexing and alignment to the human genome was performed using Illumina's CASAVA 1.8.2 software suite. The sample error rate was required to be less than 1% and the distribution of reads per sample in a lane was required to be within reasonable tolerance.
Quality control (QC) on MC-seq was conducted following standard procedure as previously described(18). Quality of sequence data was examined by using FastQC (ver. 0.11.8). Adapter sequences and fragments at 5’ and 3’ (phred score <30) with poor quality were removed by Trim_galore (ver. 0.6.3_dev). We used Bismark pipelines (ver. v0.22.1_dev) to align the reads to the bisulfite human genome (hg19) with default parameters(19). Quality-trimmed paired-end reads were transformed into a bisulfite converted forward strand version (C->T conversion) or into a bisulfite treated reverse strand (G->A conversion of the forward strand). Duplicated reads were removed from the Bismark mapping output and CpG, CHG and CHH (where H = A, T or C) were extracted.
All CpG sites were grouped by sequencing coverage, also known as read depth. The groups with coverage of 1x to 100x were used to test the relationship between coverage and number of CpG sites. Only the CpG sites with coverage > 10x depth were used for final comparisons to ensure MC-seq data quality. Genes were annotated using Homer annotatePeaks.pl, including intergenic, 5’UTR, promoter, exon, intron, 3’UTR, transcription start site (TTS), and non-coding categories. CpG island, shore, shelf, and open sea annotation was defined by locally developed bash and R scripts based on genomic coordinates (hg19) of CpG islands from the UCSC genome browser. Definition of CpG shores was defined as up to 2 kb from CpG islands and CpG shelf was defined as up to 2 kb from a CpG shore.
Assessment of Reproducibility. We assessed CpG- and participant-based reproducibility for MC-seq among 12 samples with DNA quantity of high, medium, and low input in two ways. First, CpG-based reproducibility was assessed by calculating Pearson correlations using the CpG sites in common from the samples from the same participant with different input DNA quantities. Scatterplots were rendered showing 10,000 randomly selected common CpG sites comparing samples with high and medium, high and low and medium and low DNA input. Second, participant-based reproducibility was assessed by comparing methylation profiles among pairs of participants using the samples with high DNA input, by calculating Pearson correlations of common CpG sites.
EPIC Array Data Preprocessing
The Infinium Methylation EPIC array (Illumina, San Diego, CA, USA) was used to measure PBMC DNA methylation profiles from the same four participants. These four samples with DNA input of 1000 ng were preprocessed using standard procedures as previously described (20). Briefly, the predicted sex based on methylome was consistent with self-reported sex for all samples. All samples had a call rate greater than 0.15. A total of 19,090 CpG sites on X chromosomes and 537 CpG sites on Y chromosomes were filtered. Probes within 10 base pairs of a single nucleotide polymorphism were removed. A total of 846,464 CpG sites passed quality control.
Comparison of Methylation at Each CpG site Between MC-seq and EPIC Array
The overall distribution of gene annotation in terms in relation to CpG island and genetic region between MC-seq and EPIC array data among samples from the four participants were compared. CpG sites common between MC-seq and EPIC array assays were defined according to genomic coordinates. Pearson correlation and the absolute beta-difference value (Db) were calculated among common CpG sites between MC-seq methylation percentage values and EPIC methylation beta values by using R (ver. 3.5.1). If median Db of the common CpG site was > 0.1, it was defined as a discordant CpG pair between MC-seq and EPIC; otherwise, the CpG site was defined as a concordant CpG pair. The density plot of Db and a Manhattan plot showing the distribution of Db across epigenome were illustrated. Scatterplots were rendered showing the correlation of b values from 10,000 randomly selected CpG sites measured by both MC-seq and EPIC array.