Authorizations, patient cohorts, cell collection and sorting
This study was conducted at the Mayo Clinic in Rochester, Minnesota, after approval from the Mayo Clinic Institutional Review Board (IRB #20-005400, IRB #16-004173). In all cases, diagnosis was according to the 2016 iteration of the WHO classification of myeloid malignancies.45 PB and BM samples were collected in EDTA tubes after informed consent.
Target Capture Chip Assay for Bulk Sequencing
Sample DNA was extracted from peripheral blood mononuclear cells isolated by gradient centrifugation and re-suspended in a concentration of 500 ng in 50µl of low TE buffer. Paired-end indexed libraries were prepared using the Sureselect XT Low Input Library prep protocol on the Agilent Bravo liquid handler following the manufacturer’s protocol (NewEngland Biolabs, Ipswitch, MA, and Agilent Technologies, Ankeny, IA). Briefly, 200ng of target DNA was fragmented using the Covaris LE220 plus sonicator. The settings of duty factor 30%, peak incident power (PIP) 450, cycles per burst 200, time 180 seconds, generated double-stranded DNA fragments with blunt or sticky ends with a fragment size mode of between 150-200bp. The ends were repaired using the Sureselect End-Repair-A-Tailing enzyme mix. Adapter ligated DNA fragments were size-selected to enrich for 200 bp inserts (~ 320 bp total library size) using AMPURE XP bead purification. The size-selected adapter-modified fragments were enriched, and specific indexes were added by 12 cycles of PCR using universal index primers. The concentration and size distribution of the libraries was determined on an Agilent Bioanalyzer DNA 1000 chip.
The Custom Capture hybrid-target enrichment probes were designed using Agilent SureSelect design software (Agilent Technologies, Santa Clara, CA). The targeted gene panel was comprised of 62,962 single probes with size 1.805Mbp, and covered the coding regions, UTRs, and overlapping intron/exon regions for 205 genes described and / or enriched for CH mutations (see Appendix: Target Gene Panel). The custom capture was carried out using the Agilent Bravo liquid handler following Agilent’s SureSelect XT Low. Purified capture products were then amplified using the SureSelect Post-Capture primer mix for 14 cycles. Libraries were validated and quantified on the Agilent Bioanalyzer. Samples were sequenced by 150 paired end reads, 21 samples to a Flow Cell, on an Illumina NovaSeq 6000 SP (Illumina, SanDiego, CA).
Secondary bioinformatics analysis included quality assessment and alignment to the hg19 build reference genome using Novoalign (Novocraft Technologies, Malaysia), followed by GATK based single nucleotide and small insertion / deletion variant calling, structural variation discovery, and annotation. The quality of sequencing chemistry was evaluated using FastQC. After alignment, PCR duplication rates and percent reads mapped on target were used to assess the quality of the sample preparations. Realignment and recalibration steps were implemented in the GATK.46 Somatic single nucleotide variations (SNVs) were then genotyped using SomaticSniper, whereas insertions and deletions were called by GATK Somatic Indel Detector.46, 47 Each variant in coding regions was functionally annotated by SnpEff, SAVANT, ClinVar, dbNSFP, OMIM, and the Human Gene Annotation Database to predict biological effects.48, 49, 50, 51, 52, 53 Each variant was also annotated with allele frequency from the Exoma Aggregation Consortium.54 Variants of significant interest were visually inspected using IGV.55 The total list of all variants was internally compared for internal duplicates indicative of false positives, and variants of concern are additionally hand annotated for identification using Alamut Software. Interpretation for relevant alterations included absence in international normal variant allele databases (GnomAD, ExAC), deleterious effect on protein function by multiple phenotype prediction models, somatic and functional annotation in literature, consequence of variant (nonsense, truncating, etc.) and location proximal to important domains.
Single-Cell Proteogenomics
Single-cell DNAseq + proteogenomics was performed using the Mission Bio Tapestri platform according to the manufacturer’s specifications. Briefly, a cocktail of pre-titered oligo-linked antibodies targeting 42 unique cell-surface markers and 3 antibodies for isotype control, TotalSeq™ D Human Heme Oncology Kit (BioLegend®, San Diego, CA [https://www.biolegend.com/en-us/totalseq/single-cell-dna]) was utilized to capture the cell immunophenotype, along with targeted myeloid scDNAseq genotyping panel targeted against 45 genes with 312 amplicons covering relevant genes dysregulated in myeloid disorders (designed and manufactured by Mission Bio, San Francisco, CA [https://designer.missionbio.com/catalogpanels/Myeloid]). See Appendices: Myeloid Panel Coverage and Myeloid Panel Design for specific genomic coordinates. Cryopreserved PBMC patient samples were gently thawed, washed, and run through a Dead Cell Removal Kit (Miltenyi, Auburn, CA). The resulting viable cell fraction was counted, blocked with Human TruStain FcX™ (BioLegend®) and re-suspended at a concentration of 25,000 cells/µL. This fraction was incubated with BioLegend TotalSeq Kit Cocktail described, washed, and filtered through a Flowmi cell strainer (MilliporeSigma, St. Louis, MO), and quantified using a Countess II cell counter. The cells were then diluted to a concentration of 4,000 cells per µL in Cell Buffer. Next, 35 µL of cell suspension was loaded onto a microfluidics cartridge and cells were encapsulated on the Tapestri instrument followed by cell lysis with protease digestion followed by heat inactivation using a thermal cycler. The cell lysate was reintroduced into the cartridge and processed such that each cell possessed a unique molecular barcode. Amplification of the targeted DNA regions was performed by incubating the barcoded DNA emulsions in a thermocycler as follows: 98°C for 6 min (4°C per s); ten cycles of 95°C for 30s, 72°C for 10s, 61°C for 9 min, 72°C for 20s (1°C per s); ten cycles of 95°C for 30s, 72°C for 10s, 48°C for 9 min, 72°C for 20s (1°C per s); and 72°C for 6 min (4°C per s). Emulsions were broken, DNA digested and purified with 0.72X AMPure XP beads (Beckman Coulter). The beads were pelleted and washed with 80% ethanol and the generated by amplifying DNA libraries with Mission Bio V2 index primers in the thermocycler using the following program: 95°C for 3 min; 10 cycles of 98°C for 20s for DNA libraries and 16–20 cycles for protein libraries, 62°C for 20s, 72°C for 45s; 72°C for 2 min. Final libraries were purified with 0.69X AMPure XP beads. All libraries were sized and quantified using an Agilent Bioanalyzer and pooled for sequencing on an Illumina NovaSeq 6000 SP with 2 x 150bp multiplexed runs. FASTQ files generated by sequencers were processed using the Tapestri Pipeline V2 which handles adapter trimming, sequence alignment (BWA), barcode correction, cell finding and variant calling (using GATK 4.1.7 haplotype caller). Generated loom and H5 files were then processed with Tapestri Insights v2.2 (Mission Bio) and/or the Python-based Mosaic package (GitHub). Tapestri Insights analysis used default filter criteria (for example, genotype quality ≥ 30 and reads per cell per target ≥ 10) or whitelisting of known variants and annotation-based information (including ClinVar and DANN). Only cells with complete genotype information for all variants (previously detected in bulk sample) were included for downstream processing.
Cell type identification for the proteogenomics platform
A rigorous quality control was performed on antibody-derived tag (ADT) count profiles of the Tapestri single-cell proteogenomic data. We first filtered out cells with less than 200 or more than 50,000 total ADT counts, and then removed cells with less than 10 unique measured ADTs. After ADT-based filtering, one sample was excluded from the analysis as there were less than 500 cells remaining for the sample. ADT counts for each surface protein were then scaled using counts per 10 million with a pseudocount of + 1 and normalized using the centered-log ratio (CLR) transformation. Subsequently, following step II of the “dsb” (denoised and scaled by background) protocol, normalized ADT profiles of antibodies for isotype control were used to remove cell-to-cell technical noise.56 In result, we obtained a normalized and denoised ADT count matrix representing 42 different surface protein expression profiles of 36,557 single cells from 17 patients. This full matrix was used to generate UMAP plots following dimensionality reduction using PCA.
To identify cell types for cells profiled by the Tapestri platform, we compared ADT count profiles of the Tapestri data with that of a reference CITE-seq data, i.e., Azimuth reference.57 Among the 42 cell-surface markers targeted by our Tapestri analysis, 36 were also targeted by the reference CITE-seq analysis. In the CITE-seq analysis, expression of 8 out of the 36 shared markers were assayed using two different antibodies targeting the same antigen (CD3, CD4, CD11b, CD38, CD44, CD45, CD56, and CD138). For these antigens, we used geometric means of ADT counts for both antibodies. CD5, CD7, CD10, CD33, CD62L, and FceRIa were the markers that were targeted by the Tapestri platform but not by the reference CITE-seq analysis. ADT count matrices of the 36 shared markers for the Tapestri and CITE-Seq analyses were scaled, normalized, and denoised as described above (for denoising CITE-seq data, 3 IgG antibodies were selected as isotype controls). Subsequently, for each cell from the Tapestri data, 20 nearest neighbor cells from the CITE-seq data were identified using Euclidian distance, and cell type was called based on their identities using majority vote.
Single-cell profiling of PBMCs (single-cell RNA-seq and 10X Genomics Multiome)
Frozen PBMCs were thawed in a 37°C water bath for 3–5 min until no ice was visible. Cells were washed twice with 1 mL PBS + 0.04% BSA and pelleted (300×g for 5 min at 4°C). Dead cells were removed according to the Demonstrated Protocol: Removal of Dead Cells from Single Cell Suspensions for Single Cell RNA Sequencing (10X Genomics). Using the MACS Dead Cell Removal Kit (Miltenyi Biotec), the pellet was resuspended in 100 µL Dead Cell Removal MicroBeads and incubated for 15 minutes at room temperature. After incubation, the cell suspension was diluted with 1X Binding Buffer and applied to an MS column. The dead cells were retained on the column and the live cells passed through the column and were collected. After dead cell removal, the samples were washed twice with 1 mL PBS + 0.04% BSA and the cell concentration was determined using a Cellometer K2 cell counter (Nexcelom Biosciences). The cells were then aliquoted for scRNA-seq and Multiome (see Methods section).
Single-cell RNA-seq
The cells were first counted and measured for viability using either the Vi-Cell XR Cell Viability Analyzer (Beckman-Coulter, Brea, CA) or a basic hemocytometer and light microscope. The barcoded Gel Beads were thawed from − 80°C and the cDNA master mix was prepared according to the manufacturer’s instruction for Chromium Next GEM Single Cell 3’ Library and Gel Bead Kit (10x Genomics, Pleasanton, CA). Based on the desired number of cells to be captured for each sample, a volume of live cells was mixed with the cDNA master mix. A per sample concentration of 400,000 cells per milliliter or better is required for the standard targeted cell recovery of approximately 4000 cells. The stock concentration requirements would not change for higher cell recovery numbers. The cell suspension and master mix, thawed Gel Beads and partitioning oil were added to a Chromium Single Cell G chip. The filled chip was loaded into the Chromium Controller, where each sample was processed and the individual cells within the sample were captured into uniquely labeled GEMs (Gel Beads-In-Emulsion). The GEMs were collected from the chip and taken to the bench for reverse transcription, GEM dissolution, and cDNA clean-up. The resulting cDNA contained a pool of uniquely barcoded molecules. A portion of the cleaned and measured pooled cDNA continued to library construction, where standard Illumina sequencing primers and a 10x Genomics unique i7 Sample index were added to each cDNA pool.
All cDNA pools and resulting libraries were measured using Qubit High Sensitivity assays (Thermo Fisher Scientific, Waltham, MA) and Agilent Bioanalyzer High Sensitivity chips (Agilent, Santa Clara, CA).
Libraries were sequenced at between 40,000 and 50,000 fragment reads per cell following Illumina’s standard protocol using the Illumina cBot and HiSeq 3000/4000 PE Cluster Kit (Illumina, San Diego, CA). The flow cells were sequenced as 100 X 2 paired end reads on an Illumina HiSeq 4000 HD using HiSeq 3000 / 4000 sequencing kit and HCS v3.4.0.38 collection software. Base-calling was performed using Illumina’s RTA version 2.7.7.
Single-cell Multiome
After approximately 4,000 cells were aliquoted for scRNA-seq, the remainder were used for single-cell Multiome ATAC + Gene Expression (10X Genomics). Nuclei were isolated according to the Demonstrated Protocol: Nuclei Isolation for Single Cell Multiome ATAC + Gene Expression (10x Genomics, CG000365 Rev A). Briefly, cells were added to a 2.0 mL low binding tube and centrifuged (300×g for 5 min at 4°C) using a swinging bucket rotor. The supernatant was removed, and the cell pellet was resuspended in 100 µL of chilled 10x Genomics Lysis Buffer (10 mM Tris-HCl pH 7.4, 10 mM NaCl, 3 mM MgCl2, 0.1% Tween-20, 0.1% NP-40 Substitute, 0.01% digitonin, 1% BSA, 1 mM DTT, 1 U/µL RNase inhibitor 40 U/mL) by pipette-mixing 10 times. Cells were incubated on ice for 3 min, followed by dilution with 1 mL of chilled Wash Buffer (10 mM Tris-HCl pH 7.4, 10 mM NaCl, 3 mM MgCl2, 0.1% Tween-20, 1% BSA, 1 mM DTT, 1 U/mL RNase inhibitor 40 U/mL). Nuclei were then centrifuged (500×g for 3 min at 4°C), and the supernatant was slowly removed. The nuclei were washed one additional time with 1 mL Wash Buffer. Nuclei were resuspended in chilled diluted nuclei buffer (1X Nuclei Buffer (10x Genomics), 1 mM DTT, 1 U/mL RNase inhibitor 40 U/mL); the concentration was determined using a Cellometer K2 cell counter (Nexcelom Biosciences) and the samples were adjusted to a concentration appropriate for our targeted nuclei recovery. The single-cell ATAC library construction and gene expression library construction were carried out as described in the Chromium Next GEM Single Cell Multiome ATAC + Gene Expression User Guide (CG000338 Rev A). ATAC and GEX libraries were sequenced separately on an HiSeq 4000 (Illumina) before demultiplexing, alignment to the reference genome, and post-alignment quality control.
DNA Methylation
Genomic DNA was isolated and checked for quality by standard protocols. 1 µg genomic DNA then underwent bisulfite treatment using the TrueMethyl oxBS Module (Tecan Genomics, Männedorf, Switzerland) according to the manufacturer’s specifications. Briefly, DNA was purified using magnetic beads, denatured, and underwent bisulfite conversion followed by desulfonation and purification. The TrueMethyl converted DNA samples were then eluted in 10 µL and then processed through the Illumina Infinium MethylationEPIC BeadChip array (Illumina, San Diego, CA) protocol. In brief, 7 µL of converted DNA was denatured with 1 µL of 0.4N sodium hydroxide prior to whole genome amplification on the MSA4 plate. All other steps were followed as per the manufacturer’s guidelines.
Quality control of Infinium MethylationEPIC BeadChips was performed via the Genome Studio Methylation Module (Illumina). Subset-quantile Within Array Normalization (SWAN) was performed on the Infinium MethylationEPIC BeadChip IDAT files via the R package “minfi”.58, 59 The resultant β-values are used to identify changes in DNA methylation (Δβ) between groups. Unless otherwise noted, a change in absolute methylation level of 10% (Δβ > |0.1|) and a p value of < 0.01 were considered significant.
CpG site relation within chromatin states was annotated using Bedtools v2.27.1 to the genome annotations provided for PBMCs by the Roadmap Epigenomics project.60 The functional annotation of differentially methylated CpGs located within gene bodies and promoters was generated through the use of QIAGEN IPA (QIAGEN Inc., https://digitalinsights.qiagen.com/IPA). Gene ontology associated with the differentially methylated CpGs within non-coding regions was performed using GREAT.61
Genotyping of Targeted loci with single-cell Chromatin Accessibility (GoT-ChA)
The assay was performed following the published protocol (Myers et al., 2022) with some modifications. 10,000 nuclei were captured for each sample. The following primers were utilized to specifically amplify genotyping fragment (DNMT3A R882). GoT-ChA R1N-F, 5’-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGATGACTGGCACGCTCCAT-3’; GoT-ChA-R, 5’-CTAAGCAGGCGTCAGAGGAG-3’; GoT-ChA_nested-R, 5’-BiosG/CCTTGGCACCCGAGAATTCCATCCTGCTGTGTGGTTAGACG-3’. The underlined sequences are locus-specific. After index PCR, DNAs were digested with AatII and MscI to monitor the specificity of amplified DNAs. Final libraries were quantified using a Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific, #Q32854) and were analyzed by the Fragment Analyzer (Advanced Analytical Technologies; AATI; Ankeny, IA) using the High Sensitivity NGS Fragment Analysis Kit (Cat. #DNF-486). The libraries were sequenced on a NovaSeq 6000 at the Mayo Clinic Genome Analysis Core. ATAC libraries were sequenced to a depth of 25,000 read pairs per nucleus and GoT-ChA libraries were sequenced to 5,000 read pairs per nucleus.
Olink
We used Olink Explore 1536 panel assay (Olink Proteomics [Uppsala, Sweden]), which uses proximity extension assay technology coupled to a readout methodology based on next generation sequencing (Illumina NovaSeq 6000, NextSeq 550, and NextSeq 2000; all manufactured by Illumina; appendix 1 p 2), to quantify protein targets, as previously described.62
Statistical analysis
Distribution of continuous variables was statistically compared using Mann-Whitney or Kruskal-Wallis tests, while nominal or categorical variables were compared using the Chi-Square or Fischer’s exact test. Time to event analyses used the method of Kaplan-Meier for univariate comparisons using the log-rank test. OS was calculated from the date of diagnosis to date of death or last follow-up, while AML-free survival (LFS) was calculated from date of diagnosis to date of AML transformation or death. Poisson regression models were fit to compare the numbers of different cell types across groups. These Poisson models incorporated an offset term to reflect the total number of cells for a given patient (to be able to compare cell types across consistent intervals in the setting of varying total cell counts: \(\text{ln}\left(Y\right)= {\text{ln}\left(N\right)+\beta }_{0}+{\beta }_{1}{X}_{1}+ϵ\)).
Single-cell data analysis
Single-cell RNA-seq: Sequenced reads from the droplet libraries were processed using 10x Genomics Cell Ranger v6.0.1.63 The reads were aligned to the pre-built human reference transcriptome GRCh38 - v2020-A (July 7, 2020) provided by 10X Genomics. Read trimming, alignment, UMI counting, and cell calling were performed by Cell Ranger. Doublet prediction on scRNA-seq data was done using Scrublet v0.2.1 with default parameters.64 Downstream processing was done using Seurat v4.0.4.65 Count matrices from all samples were combined and batch-corrected using Seurat v4 integration method. Count matrix from each sample was log-normalized, scaled to mean 0 and variance 1, and dimensionality reduced using PCA on the top 2000 variable genes across all samples. The reciprocal PCA and reference-based integration options were applied in the anchor finding step due to large data size. Four patient samples, one from each sex and each condition (COVID-19+ / CH− and COVID-19+ / CH+), were chosen as references, and PCs 1–50 were used for the reference-based integration. The integrated data included a total of 85,019 cells from 24 patient samples. Cells with more than 50% of reads mapped to mitochondrial genes, those with less than 200 unique genes detected and those that were predicted as doublets by Scrublet were removed. After QC filtering the number of cells was reduced to 78,083. Genes that were not expressed in at least 3 cells were also removed from downstream analysis. Uniform manifold approximation and projection (UMAP) was made using the top 50 PCs obtained by running PCA on the integrated (batch-corrected) gene expression matrix. Cell type identification was done with SingleR v1.10.0 using immune data from celldex v1.6.0 as reference.66, 67
Single-cell Multiome: Sequenced reads from the gene expression (GEX) and DNA accessibility (ATAC) droplet libraries of the Multiome assay were processed using 10x Genomics Cell Ranger ARC v2.0.0. The reads were aligned to the pre-built human reference genome GRCh38 - v2020-A-2.0.0 (May 3, 2021) provided by 10X Genomics. Read trimming, alignment, duplicate marking (ATAC), UMI counting (GEX), peak calling (ATAC) and joint cell calling were performed by Cell Ranger. Downstream processing was done using Seurat v4.0.4 and Signac v1.4.0.68 GEX and ATAC count matrices were integrated (batch-corrected) independently using Seurat. Sample GEX count matrices were integrated following the same steps as used for the scRNA-seq data. Default options (CCA and pairwise anchor-finding) were used in the integration anchor finding step, since the data size was smaller than our scRNA-seq data. To merge the ATAC data from all samples, a common peak set was created by merging peaks from all samples using the reduce function from the R package GenomicRanges. Peaks that were smaller than 20 base pairs or larger than 10000 base pairs after merging were removed. The count matrix for each sample with the new common peak set was then recalculated using Signac. The count matrices were normalized using Term Frequency - Inverse Document Frequency (TF-IDF) and dimensionality reduced using singular value decomposition (SVD) using only peaks with non-zero counts in at least 20 cells, these two steps together known as latent semantic indexing (LSI) generating LSI components. The samples were then integrated using Seurat V4 integration with the reciprocal LSI (rLSI) method used on LSI components 2 to 50 (since the first LSI component correlates with sequencing depth) in the pairwise anchor finding step. The UMAP was calculated using the integrated LSI components 2 to 50. Seurat’s weighted nearest neighbor (WNN) algorithm was used on principal components 1 to 50 (GEX) and integrated LSI components 2 to 50 (ATAC) together to obtain a combined UMAP projection of both GEX and ATAC counterparts of the complete scMultiome dataset. The integrated data included 36,343 cells from 11 patient samples. Cells with more than 50% of reads mapped to mitochondrial genes, those with less than 200 unique genes detected (GEX), those with less than 200 unique peaks detected (ATAC) and those with transcription start site (TSS) enrichment score (as calculated by Signac) less than 1 were removed. After QC filtering the number of cells was reduced to 25,725. Cell type identification was done by using Azimuth algorithm to map the scMultiome GEX data to the scRNA-seq data since they were generated from the same cohort. The labels were then transferred from the scRNA-seq data to the single-cell Multiome data.
Differential gene expression analysis: Differential gene expression testing was done on the log-normalized counts using Seurat’s FindMarkers function with default parameters unless specified otherwise. The statistical test applied was Wilcoxon Rank Sum test with p-values adjusted using Bonferroni correction based on the total number of genes in the dataset. Statistically significant differentially expressed genes were selected by keeping only genes that fall below the adjusted p-value threshold of 0.05. Differential gene expression testing comparing any two conditions (e.g., COVID-19+ / DNMT3AMT versus COVID-19+ / TET2MT) was always done for each cell type independently (although shown together in the volcano plots for efficient visualization), unless specified otherwise. To make sure that the results were not driven by a single patient sample we applied a leave-one-out approach on all tests where we removed cells from one sample at a time redoing the tests and keeping only the genes that passed the adjusted p-value threshold in all tests. Pathway analysis was done using the Ingenuity Pathway Analysis platform on differentially expressed genes.
Differential DNA accessibility and motif analysis: The differentially accessible peaks were identified by comparing the TF-IDF normalized cut-site counts of any two pair of cell populations using Seurat’s FindMarkers function. Here, the method used was the logistic regression framework along with testing for the number of fragments in peaks as a latent variable. Motif enrichments in peaks were estimated using ChromVAR enrichment scores on the JASPAR2020 motif matrix set.69, 70 Enrichment of cut-sites in monocyte and macrophage specific ChIP-seq peaks of select DNA binding proteins were also estimated using ChromVAR by providing ChIP-seq peaks from ReMap2022 as input.28 Co-accessibility scores between pairs of peaks were calculated using Cicero v1.3.5.27
GoT-ChA29: Sequencing reads from the GoT-Cha experiments on two samples (DNMT3A mutants with the R882H mutation) were examined for sufficient sequencing depth. The GoT-Cha experiments yielded 168 million and 151 million reads for samples 1 and 2, respectively. For each sample, we generated four FASTQ files (*_I1_001.fastq.gz, *_R1_001.fastq.gz, *_R2_001.fastq.gz, and *_R3_001.fastq.gz). To precisely locate the cell barcode and mutation site, we randomly selected 5M reads from each FASTQ file and utilized the "sc_seqLogo.py" script from our RSeQC package to generate sequence logos.71 Our analysis revealed that the "I1" FASTQ file contains an 8-nt sample barcode (Supp. Figure 6B), while the "R1" FASTQ file contains the 51-nt targeted DNA sequences, with the last three nucleotides representing the genotype of the R882 codon (Supp. Figure 6C). Notably, the codon is CGC (Arginine) for the wildtype and CAC (Histidine) for the mutant, given that the DNMT3A gene is located on the reverse strand. We confirmed that reads from the "R1" file could uniquely map to the targeted DNMT3A locus (chr2:25234324–25234374) on the human reference genome GRCh38/hg38. Furthermore, the "R2" FASTQ file contains the 16-nt cell barcode (Supp. Figure 6D), while the sequences from the "R3" FASTQ files are unknown (Supp. Figure 6E). We then combined the "R1" and "R2" FASTQ files into a single FASTA file, using the cell barcodes as the DNA sequence identifiers, thereby explicitly linking the cell barcode to the DNA sequence (Supp. Figure 7). Any DNA sequence with Phred-scaled quality scores < 30 at the mutation site will be discarded. This preprocessing step not only significantly reduces the file size but also enhances computation speed. For read-level genotyping we used the following method. We utilized an IUPAC (International Union of Pure and Applied Chemistry) string to represent both wildtype and mutation genotypes. In this study, "C" and "T" are denoted as "Y" (Supp. Figure 6C). Then, we employed the motility C + + library (https://github.com/ctb/motility) to match this IUPAC string to each read. Utilizing IUPAC instead of a PWM (Position Weight Matrix) could dramatically enhance computation speed since the differences between wildtype and mutant reads are minimal (only 1 nucleotide change). We found that allowing for 1-mismatch could rescue an additional 10 to 12 million reads (~ 7%) in each sample, as compared exact match. Consequently, 77.5% and 75.6% of reads were successfully genotyped for the two samples, respectively. To distinguish real cells from background noise, we employed a methodology similar to CellRanger's approach in single-cell RNA sequencing analysis. After segregating reads based on cell barcode, we conducted nonparametric kernel density estimation using the "gaussian" kernel function (Supp. Figure 8A, B). The cell barcode density plot revealed three distinct modes likely corresponding to "cell", "cell-free DNA", and "empty droplets", respectively. To determine the cutoff point for cell calling, we identified the local minima (i.e., the changing point where the curve bends) from the density plot. For instance, in sample 1, the local minima for the cell mode is 3.0712, corresponding to 10^3.0712 = 1178 reads. This suggests that a cell with fewer than 1178 reads will be categorized as background (Supp. Figure 8A). A comparison with the Knee plot (or barcode rank plot) demonstrates that the detected threshold aligns well with the elbow point (Supp. Figure 8A, B). As a result, we identified 15185 and 5679 valid cells from samples 1 and 2, respectively. For cell-level genotyping we used the following method. We first calculated the mutant allele fraction (MAF) for each cell using formula described in code deposited on GitHub. Cells with MAF ≤ 0.2 were categorized as homozygous wildtype (WT/WT), cells with MAF ≥ 0.8 were classified as homozygous mutants (Mut/Mut), likely due to Loss of Heterozygosity (LOH), and the remaining cells were designated as heterozygotes (WT/Mut). 40% and 38% of cells in Samples 1 and 2, respectively, were classified as WT/WT (Supp. Figure 8C). The scATAC-seq data from the two samples were analyzed using the cellranger-atac workflow (version 2.1.0), using the reference file (refdata-cellranger-arc-GRCh38-2020-A-2.0.0) downloaded from https://support.10xgenomics.com/single-cell-atac/software/downloads/latest. 8858 and 6255 high-quality cells were identified in Samples 1 and 2, respectively.