Setting up the benchmarking
Methods to deconvolute DNA methylome profiles of heterogeneous samples into their constituent cell or tissue types can be broadly categorized into linear methods and more complex machine learning models. For our benchmarking, we selected 13 commonly used or recently developed methods (Table 1). Since many of these make different underlying assumptions about data structure and distribution, we also tested seven data normalization approaches (Table S1), resulting in a total of 91 algorithm-normalization combinations. Deconvolution is typically performed on a limited subset of marker CpGs. For most analyses, we therefore used a fixed number of markers per cell type (n = 100 for each source), based on pre-defined criteria (see methods), such that each cell or tissue type was equally represented in the reference. To deconvolute mixtures of four tissue types, we thus identified 400 marker CpGs from reference datasets, and 600 marker CpGs for mixtures of six leukocyte types (Table S2). The same set of CpGs was used for every comparison. As a ground truth, 200 in silico mixtures were generated by combining DNA methylation profiles of defined tissues or cell types (Table S3) in specified but randomly generated proportions. We provide a detailed description of all in silico mixtures in Table S4. Next, we tested the performance of each algorithm-normalization combination by computing measures of accuracy between deconvoluted and actual proportions. We assessed deconvolution performance by quantifying the root mean square error (RMSE), reflecting the absolute error between true and predicted values, as well as the R2, which quantifies the correlation between true and predicted values but is less sensitive to systematic biases. Evaluation of both metrics is required, as some deconvolutions show high linearity (R2) but also a large RMSE, suggesting systematic deviations (see below).
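The mixture-generation and evaluation scheme can be sketched as follows. The reference matrix and proportions below are synthetic stand-ins, not the actual datasets used in this study; the point is that a constant bias leaves R2 untouched while inflating RMSE, which is why both metrics are reported.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic reference: beta values for 400 marker CpGs x 4 tissue types.
n_markers, n_types, n_mixtures = 400, 4, 200
reference = rng.uniform(0, 1, size=(n_markers, n_types))

# Ground-truth proportions: random but recorded, one column per mixture.
true_props = rng.dirichlet(np.ones(n_types), size=n_mixtures).T

# An in silico mixture is the proportion-weighted sum of the pure profiles.
mixtures = reference @ true_props  # shape: (markers, mixtures)

def rmse(true, pred):
    """Root mean square error: absolute deviation between true and predicted."""
    return np.sqrt(np.mean((true - pred) ** 2))

def r2(true, pred):
    """Squared Pearson correlation: captures linearity, but is blind to a
    constant systematic offset."""
    return np.corrcoef(true.ravel(), pred.ravel())[0, 1] ** 2

# A prediction with a constant +0.05 bias keeps R2 at 1 but inflates RMSE.
biased = true_props + 0.05
print(round(rmse(true_props, biased), 3), round(r2(true_props, biased), 2))
# prints: 0.05 1.0
```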
R2 and RMSE values for the various analyses are compiled in Tables S5 and S6, respectively. When we applied deconvolution to in silico mixtures that were generated from the same dataset that served as the deconvolution reference, most methods produced near-perfect results, as expected (Figure S1). This differs from a real-world scenario, where reference profiles are generated from different samples, often also profiled in different laboratories. To evaluate real-world usage, we therefore selected reference datasets from the same cell or tissue sources, but which were independently generated (Fig. 1). These datasets differed markedly from those used to generate our in silico mixtures, as was evident in a head-to-head comparison of DNA methylation levels at 100 marker CpGs (Fig. 2a, Figure S2a).
Tissue fraction deconvolution
As a first means of benchmarking all algorithm-normalization combinations, we focused on a relatively straightforward deconvolution problem, assessing deconvolution performance for mixtures of four distinct tissues: small intestine (smallest fraction = 0.35%, largest fraction = 69.12%), blood (smallest fraction = 0.12%, largest fraction = 66.31%), kidney (smallest fraction = 1.2%, largest fraction = 74.04%) and liver (smallest fraction = 0.4%, largest fraction = 56.53%) (Figure S2), profiled using 450K microarrays (HumanMethylation450K BeadChips; Illumina) (Fig. 2a). The tissue types profiled vary in the specificity of the marker CpGs identified from the reference samples. This can be quantified by computing the mean difference in methylation between one cell type and all others at their respective marker CpGs. These values ranged from 57% for blood to 42% for small intestine (Fig. 2b). Specificity of marker CpGs was also evident in the dataset used for in silico mixture generation (Fig. 2a-b). Indeed, although tissue type fractions were in general accurately predicted (RMSE = 0.11, R2 = 0.65), we observed larger deviations for the small intestine, with a significantly higher RMSE (RMSE = 0.17; P < 10⁻¹⁶) (Fig. 2c-d). These deviations are likely in part driven by the lack of concordance between the small intestine datasets used for generating the in silico mixtures and for generating the DNA methylation reference (Figure S2a). This emphasizes the need to use concordant reference datasets to achieve the best deconvolution performance.
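The specificity metric above can be sketched as follows; the beta values and the assumption that markers are stored in contiguous 100-row blocks per tissue are illustrative simplifications, not the actual reference data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic reference betas: 400 marker CpGs x 4 tissues (100 markers each).
tissues = ["blood", "kidney", "liver", "small_intestine"]
betas = rng.uniform(0, 1, size=(400, 4))
marker_of = np.repeat(np.arange(4), 100)  # tissue each marker belongs to

def marker_specificity(betas, marker_of, t):
    """Mean absolute methylation difference between tissue t and the mean of
    all other tissues, computed over tissue t's own marker CpGs."""
    rows = marker_of == t
    others = np.delete(betas[rows], t, axis=1).mean(axis=1)
    return float(np.abs(betas[rows, t] - others).mean())

for t, name in enumerate(tissues):
    print(name, round(marker_specificity(betas, marker_of, t), 2))
```

On real marker CpGs this statistic is high by construction (42-57% in the four-tissue setting); on the random data above it merely demonstrates the computation.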
Disregarding the effect of normalization, MethylResolver and EMeth-Laplace were the poorest- and best-performing deconvolution methods, respectively, though differences in performance were not significant. Most normalization methods also performed comparably; only log transformation performed significantly worse than all other normalization methods (Fig. 2d-e and S2b). The best RMSE value was achieved when combining ridge regression with quantile normalization (RMSE = 0.08; R2 = 0.71), while the worst RMSE value was produced by combining log normalization with EMeth-Normal deconvolution (RMSE = 0.19; R2 = 0.35) (Fig. 2d-e).
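A minimal sketch of a regression-based deconvolution step of the kind benchmarked here, using closed-form ridge regression with non-negativity and sum-to-one constraints imposed post hoc. The reference, noise level, and regularization strength are illustrative assumptions; the benchmarked implementations (and the quantile-normalization step) may differ in detail:

```python
import numpy as np

def ridge_deconvolve(reference, bulk, lam=0.1):
    """Closed-form ridge regression of the bulk profile on the reference,
    followed by clipping negative coefficients and renormalizing the
    estimated fractions to sum to one."""
    A = reference
    coef = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ bulk)
    coef = np.clip(coef, 0, None)
    return coef / coef.sum()

rng = np.random.default_rng(2)
reference = rng.uniform(0, 1, size=(400, 4))        # marker CpGs x tissues
true = np.array([0.5, 0.3, 0.15, 0.05])             # known mixing fractions
bulk = reference @ true + rng.normal(0, 0.02, 400)  # noisy in silico mixture
print(np.round(ridge_deconvolve(reference, bulk), 2))
```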
Impact of cell type similarity
Next, we investigated how these algorithms perform on a more difficult and more relevant deconvolution task, namely deconvolution of relatively homogeneous cell type fractions that share a common developmental origin. Specifically, we investigated how reliably blood cell types can be deconvolved, by constructing pseudo-bulks from 450K microarray profiles of fluorescence-activated cell sorting (FACS)-purified neutrophils (smallest fraction = 0.07%, largest fraction = 51.74%), monocytes (smallest fraction = 0.07%, largest fraction = 44.25%), CD4+ (smallest fraction = 0.07%, largest fraction = 47.89%) and CD8+ T cells (smallest fraction = 0.18%, largest fraction = 45.88%), natural killer cells (smallest fraction = 0.28%, largest fraction = 51.59%) and B cells (smallest fraction = 0.3%, largest fraction = 41.60%) (Figure S3), and applying the same strategy described above. We identified 100 marker CpGs for each cell type, resulting in a total of 600 loci (Fig. 3a and S3a). Overall, markers showed a similar specificity across cell types (Fig. 3b). Even though EpiDISH was slightly outperformed by other algorithms in some settings, it was the only algorithm that consistently predicted proportions at an RMSE below 0.07. Notably, deconvolution was more accurate for some cell types than for others: quantification of natural killer and CD8+ T cell abundance performed poorly (P < 0.001), perhaps because both are cytotoxic effector cells sharing functional activities, despite originating from divergent lineages. Furthermore, DNA methylation at natural killer cell loci differed significantly between reference and validation datasets (P < 10⁻¹⁶; Figure S3b-c). To further validate these rankings using an in vivo dataset, we evaluated 85 DNA methylome profiles from blood samples26, comparing the cell type fractions predicted by deconvolution to the cell type fractions measured by flow cytometry.
Overall deconvolution performance was lower for in vivo data, perhaps reflecting the additional measurement uncertainty introduced by flow cytometry (Fig. 3c-d). Notably, EpiDISH performed consistently well on both in silico and in vivo data (in silico: RMSE = 0.06; in vivo: RMSE = 0.04; Fig. 3e-f). None of the normalization methods significantly improved deconvolution over non-normalized data, but log normalization performed significantly worse (P < 10⁻¹⁶), similar to what we described above.
Larger array size improves deconvolution
We next assessed the impact of array size on deconvolution efficiency, by analyzing DNA methylome profiles generated for the same cell types using EPIC microarrays (Infinium MethylationEPIC v1.0 BeadChip; Illumina), which encompass over 850,000 probes. These include most probes represented on 450K microarrays, as well as about 350,000 additional probes, targeting more enhancer CpGs and fewer CpG island CpGs than the 450K microarray27. Of note, all 600 marker CpGs we identified from 450K arrays above were also present on the EPIC arrays. Pseudo-bulks were again produced for neutrophils (smallest fraction = 0.5%, largest fraction = 37.11%), monocytes (smallest fraction = 0.5%, largest fraction = 54.70%), CD4+ (smallest fraction = 0.07%, largest fraction = 42.37%) and CD8+ T cells (smallest fraction = 0.04%, largest fraction = 49.18%), natural killer cells (smallest fraction = 0.09%, largest fraction = 43.98%) and B cells (smallest fraction = 0.24%, largest fraction = 48.73%) (Figure S5). When applying the same marker CpGs identified from 450K data to EPIC array data, the performance of all 91 algorithm-normalization combinations improved significantly (RMSE of 0.03 versus 0.06; P < 10⁻⁶), suggesting a higher concordance between the EPIC array data used for generating in silico mixtures and for deconvolution (Figure S4).
We next identified 600 new marker CpGs from the EPIC array data, 484 of which do not overlap with those represented on the 450K array (Fig. 4a and S5a). Of these 484 additional marker CpGs, 246 overlap with known enhancer regions. As before, natural killer cells and CD8+ T cells were not separated as accurately as other cell types (Fig. 4b), likely due to the significant discordance in methylation ratios between reference and validation data at natural killer cell marker loci (P < 10⁻¹²; Figure S5b-c). The relative performance rankings of algorithms and normalization methods were nevertheless comparable to the analyses using 450K array probes described above (Fig. 4c). However, when comparing deconvolution CpGs selected from the EPIC probes to those selected from 450K array data, fraction estimates improved slightly for all cell types, except for CD8+ T cells (Fig. 4d).
To further validate these in silico analyses, we next assessed performance on DNA extracted from different cell types and mixed in vitro at prespecified ratios14. Here, deconvolution performance was comparable to that on our in silico generated mixtures, validating the relative rankings of deconvolution and normalization methods, as well as our strategy for generating pseudo-bulks (Fig. 4d-f).
Impact of the number of marker CpGs on deconvolution
All analyses described above rely on 100 marker CpGs per cell or tissue type. Depending on the method used to analyze DNA methylation, a lower number of marker CpGs may be preferable (e.g., when cost is to be minimized). To assess the impact of the number of marker CpGs on deconvolution, we repeated our performance assessment while varying the number of marker CpGs included. Specifically, we selected the 2 to 500 top-ranked marker CpGs for each cell type and assessed performance for each algorithm on unnormalized data (normalization did not improve performance; Fig. 5a-b). For many algorithms, performance increased consistently when 5 or more marker CpGs were included. Other algorithms, such as EMeth-Binomial and EMeth-Normal, reached an optimum around 50 CpGs and performed worse as more CpGs were included. The EMeth algorithms appeared to perform relatively well for low numbers of marker CpGs (n = 2 to 10). Finally, the EMeth-Laplace method was top-performing, both when using only a few marker CpGs (n = 2 to 10) and when many marker CpGs were included (n = 300–500).
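The marker-count sweep can be sketched as follows, using a synthetic pre-ranked reference and a plain least-squares solver as a stand-in for the benchmarked algorithms; the block layout of markers and the noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n_types, per_type = 6, 500
reference = rng.uniform(0, 1, size=(per_type * n_types, n_types))
# Assume rows [t*500, (t+1)*500) hold cell type t's markers, best-ranked
# first (a stand-in for the ranking criteria described in the methods).
true = rng.dirichlet(np.ones(n_types))
bulk = reference @ true + rng.normal(0, 0.02, reference.shape[0])

def deconvolve_top_n(n):
    """Deconvolve using only the top-n markers of each cell type."""
    rows = np.concatenate([np.arange(t * per_type, t * per_type + n)
                           for t in range(n_types)])
    est, *_ = np.linalg.lstsq(reference[rows], bulk[rows], rcond=None)
    est = np.clip(est, 0, None)
    return est / est.sum()

for n in (2, 10, 100, 500):
    err = np.sqrt(np.mean((deconvolve_top_n(n) - true) ** 2))
    print(n, round(float(err), 3))
```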
Deconvoluting small fractions
For generating in silico mixtures, we generated randomly specified fractions between 0 and 75%. However, in many instances, rare cell types are present or of particular interest. As these cell types are less abundant in the mixture, their profile contributes less to the bulk signal and deconvolution becomes more difficult. To test how predictions differ for these smaller fractions, we reassessed performance exclusively for cell type contributions below 3%. Accuracy at this threshold was noticeably lower than for larger fractions (Fig. 6a-b). Interestingly, adding reference CpGs improved the average R2 from 0.88 to 0.97 (P < 10⁻¹⁶) for all fractions, while for small fractions an average increase from 0.07 to 0.40 (P < 10⁻¹⁶) was observed, indicating that addition of reference CpGs is particularly beneficial for predicting small fractions, but also that small fractions are difficult to predict accurately, irrespective of the algorithm used (Fig. 6c and 5a). In conclusion, deconvolution of small fractions performs inadequately for all methods tested, but this can be mitigated to some extent by enlarging the reference marker panel.
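The small-fraction evaluation amounts to restricting the error metric to entries whose true fraction falls below the 3% threshold; a minimal sketch on made-up numbers:

```python
import numpy as np

def small_fraction_rmse(true, pred, threshold=0.03):
    """RMSE computed only over entries whose true fraction is below the
    threshold, isolating performance on rare cell types."""
    mask = true < threshold
    return float(np.sqrt(np.mean((true[mask] - pred[mask]) ** 2)))

true = np.array([0.01, 0.02, 0.50, 0.47])
pred = np.array([0.03, 0.00, 0.49, 0.48])
# Only the two fractions below 3% enter the calculation.
print(round(small_fraction_rmse(true, pred), 3))  # -> 0.02
```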
Impact of incomplete or over-extensive reference
In many cases, the reference used for deconvolution can be inaccurate, including more or fewer cell types than are present in a bulk sample. This can introduce noise into the deconvolution experiment. We therefore compared deconvolution performance between algorithms when one cell type was missing from the reference, or when a cell type was included in the reference but absent from the in silico mixture. Removing cell types from the reference generally tended to improve deconvolution accuracy, except when the cell type left out was similar to another one included in the reference (e.g., CD4+ and CD8+ cell types; Figure S6a). Furthermore, regularization-based deconvolution methods performed very well in this experiment, significantly outperforming MethAtlas (P < 0.01), which is otherwise a generally well-performing algorithm (Figure S6b). Including more cell types in the reference had a similar but less pronounced effect, with slightly worse deconvolution accuracy for cell types that are highly similar (Figure S6c). In this setting, EMeth-Laplace, MethAtlas and ordinary least squares produced the best deconvolution results, significantly outperforming EMeth-Binomial (P < 0.01) and ridge deconvolution (P < 10⁻¹⁶; Figure S6d). In general, an over-inclusive reference performs better than an incomplete one. Cell types absent from the in silico mixture were erroneously assigned fractions of up to 5.6%, with the largest such fraction found for natural killer cells by EMeth-Binomial, and the lowest for CD8+ T cells by MethylResolver (Figure S6e). Finally, we noted that MethylResolver produces multiple predictions below 0, which is biologically impossible.
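The incomplete-reference experiment can be sketched as follows, with a synthetic reference and a simple least-squares solver standing in for the benchmarked algorithms; the mixture is noiseless so the complete reference recovers the truth exactly, while dropping one column forces its signal onto the remaining cell types:

```python
import numpy as np

rng = np.random.default_rng(5)
reference = rng.uniform(0, 1, size=(600, 6))   # marker CpGs x 6 cell types
true = np.array([0.25, 0.20, 0.20, 0.15, 0.10, 0.10])
bulk = reference @ true                        # noiseless in silico mixture

def deconvolve(A, b):
    """Least squares with clipping and sum-to-one renormalization."""
    est, *_ = np.linalg.lstsq(A, b, rcond=None)
    est = np.clip(est, 0, None)
    return est / est.sum()

# Complete reference: all six cell types recovered (near-)exactly.
full = deconvolve(reference, bulk)
# Incomplete reference: the last cell type is dropped, so its 10% of the
# signal must be absorbed by the remaining columns, biasing their estimates.
reduced = deconvolve(reference[:, :5], bulk)
print(np.round(full, 2), np.round(reduced, 2))
```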
Deconvolution of DNA methylation sequencing data
DNA methylation is increasingly being profiled by sequencing-based methods such as whole-genome, reduced representation, targeted or amplicon bisulfite sequencing (BS-seq), or third-generation nanopore-based sequencers. Here, DNA methylation levels are quantified by calculating the fraction of reads with a methylated CpG over all reads at a given locus, yielding data comparable to array-based measurements. These profiles nevertheless differ from array-based profiles in that they are count-based, quantifying the exact number of sequencing reads showing CpG methylation, rather than providing a percentage-based estimate of the fraction of methylated CpGs. Selection of marker CpGs also differs, with a much larger search space: all 28 million CpGs in the human genome can putatively serve as marker CpGs in whole-genome bisulfite sequencing (WGBS) data, versus only ~ 450,000 or ~ 850,000 CpGs available on microarrays. In addition, instead of parsing individual CpGs, average DNA methylation over entire genomic regions can be leveraged for deconvolution, as DNA methylation at flanking CpGs is often highly correlated28.
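The count-based, region-level quantification described above can be sketched as follows; the positions and read counts are toy values for a single hypothetical marker region:

```python
import numpy as np

def region_methylation(positions, meth_reads, total_reads, start, end):
    """Pool read counts over all CpGs in [start, end) and return the
    fraction of methylated reads: a count-based, region-level estimate."""
    mask = (positions >= start) & (positions < end)
    return meth_reads[mask].sum() / total_reads[mask].sum()

# Toy CpG-level counts within a single 100-bp marker region.
pos   = np.array([10, 35, 60, 80])   # CpG positions within the region
meth  = np.array([ 8, 10,  9,  3])   # methylated reads per CpG
total = np.array([10, 12, 10,  8])   # total reads per CpG
print(region_methylation(pos, meth, total, 0, 100))  # 30/40 -> 0.75
```

Pooling counts over a region in this way exploits the correlation between flanking CpGs and damps per-CpG sampling noise, which is the rationale for region-based markers.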
Here, we tested the performance of the same deconvolution methods described above (Table 1) on sequencing-based DNA methylation data. We performed deconvolution using non-overlapping 100-bp genomic regions as markers. Including multiple flanking CpGs reduced measurement error and improved overall deconvolution performance (RMSE = 0.041 vs 0.052, P < 10⁻¹⁶; Fig. 7 and S7). When selecting differentially methylated regions (DMRs), larger DNA methylation differences were observed between cell types than for the marker CpGs identified in array data (64% vs 56%; Figure S8a-c). Deconvolution was performed on 100 in silico mixtures comprising six immune cell types, with proportions ranging between 0.86% and 36.06% for neutrophils, 1.8% and 54.7% for monocytes, 0.1% and 36.3% for CD4+ T cells, 0.04% and 35.9% for CD8+ T cells, 0.1% and 39.4% for natural killer cells, and 0.3% and 46.2% for B cells (Figure S8d). CD8+ T-cell and B-cell fractions were predicted less accurately than other cell types (P < 0.001; Fig. 7b). Furthermore, we noted a lower overall deconvolution performance for WGBS at 50× sequencing depth than for EPIC array data (RMSE = 0.04 vs 0.02, P < 10⁻¹⁶; Fig. 7c). The underlying reason is unclear. It may be due to experimental differences, as WGBS protocols are often far less standardized between research groups, but an alternative explanation may be the difference in approach taken to generate in silico mixtures: reads were mixed for WGBS, whereas DNA methylation levels were calculated by proportional weight-summing for array-derived data. Similar to the array-based observations, the most consistently accurate predictions were produced by the EpiDISH algorithm (RMSE = 0.05; Fig. 7b). Furthermore, normalization did not positively impact overall deconvolution performance (Fig. 7c-d), with unnormalized data producing top-performing results (RMSE = 0.04).
Finally, deconvolution experiments are often based on targeted BS-seq29,30, where both the number of regions included and the sequencing depth can have a significant impact on the analysis cost. We set out to test the impact of these variables, first by varying the number of marker regions for deconvoluting six immune cell types on the same 100 in silico mixtures. This revealed that a local optimum was reached when 100 to 200 regions were included for deconvolution, irrespective of the deconvolution method used (Fig. 8a). We next assessed the impact of sequencing depth. We simulated depths between 60× and 0.5× by downsampling, assessing the performance of the EpiDISH deconvolution algorithm on 20 unnormalized in silico mixtures while varying the number of marker regions between 2 and 500 (Fig. 8b). Interestingly, all simulated sequencing depths exceeding 18-fold yielded a similar performance, suggesting that a depth of ~ 24× suffices for deconvolution and that higher sequencing depths are unlikely to boost accuracy for a similar deconvolution task.
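Depth downsampling of this kind is commonly implemented by binomially thinning read counts; a minimal sketch under the assumption that each read is retained independently (the actual read-level downsampling used in the study may differ):

```python
import numpy as np

def downsample_counts(meth, total, keep, rng):
    """Simulate a lower sequencing depth by retaining each read
    independently with probability `keep` (e.g. going from 60x to 24x
    corresponds to keep = 0.4)."""
    new_meth = rng.binomial(meth, keep)
    new_unmeth = rng.binomial(total - meth, keep)
    return new_meth, new_meth + new_unmeth

rng = np.random.default_rng(6)
total = np.full(1000, 50)            # 50x coverage at 1000 CpGs
meth = rng.binomial(total, 0.7)      # true methylation level of 70%
m, t = downsample_counts(meth, total, 0.4, rng)
# Thinning preserves the pooled methylation estimate at the reduced depth.
print(round(float(t.mean()), 1), round(float(m.sum() / t.sum()), 2))
```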
Comparing robustness to technical variables
To confidently apply deconvolution, a method should be robust to technical variables such as varying wet-lab protocols, data sources and data formats. As the datasets included here differ in each of these three variables, we compared deconvolution accuracy across all datasets for every deconvolution method. Both EMeth-Normal and EMeth-Binomial performed significantly worse than the other algorithms (P < 10⁻⁴; Fig. 9a). Though not significantly better, EpiDISH and EMeth-Laplace were top-performing (column Z-score-normalized RMSE = −0.91 and −0.78, respectively). For deconvolution of small fractions, EMeth-Laplace slightly outperformed EpiDISH (Fig. 9b). Relative to the other algorithms, MethylResolver performed remarkably better on small fractions than on all fractions, indicating that MethylResolver might be preferable for more fine-grained deconvolution. Finally, variance across datasets differed greatly between deconvolution methods. For example, EMeth-Laplace performed very well in terms of mean normalized RMSE, but its variance between datasets was large. Minfi, bounded-variable least squares and non-negative least squares showed lower variance and might therefore be more reliable.
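The column Z-score normalization used for this cross-dataset comparison can be sketched as follows; the RMSE matrix below is hypothetical, not the study's actual results:

```python
import numpy as np

def column_zscore(rmse):
    """Z-score each dataset (column) so methods are compared by their
    relative performance within a dataset, not by dataset difficulty."""
    return (rmse - rmse.mean(axis=0)) / rmse.std(axis=0)

# Hypothetical RMSE matrix: rows = methods, columns = datasets.
rmse = np.array([[0.04, 0.10, 0.03],
                 [0.06, 0.12, 0.05],
                 [0.09, 0.20, 0.08]])
z = column_zscore(rmse)
# The mean z-score across datasets summarizes robustness; lower is better.
print(np.round(z.mean(axis=1), 2))
```

Because each column is centered and scaled separately, a method that is consistently second-best on easy and hard datasets alike scores better than one that excels on some datasets but fails on others.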