Patient recruitment and statistical analysis for clinical data
The samples were obtained from patients who visited two tertiary health centers, Seoul National University Hospital and Samsung Medical Center. The study protocol was conducted after approval by the institutional review board of each center (No. 1805-122-948, 2013-09-009). The study was conducted in accordance with Declaration of Helsinki. Patients aged 18 years or older with newly diagnosed MM were included. The exclusion criteria for patients were a history of previous hematologic malignancy or solid cancer excluding localized skin squamous cell carcinoma and uterine cervical cancer, history of cancer treatment, history of hematopoietic stem cell transplantation, and those who did not agree to enroll. After recruiting patients, medical records including demographic data, laboratory data, radiographic data, and treatment outcome data were collected. We used the International Myeloma Working Group consensus criteria for response and minimal residual disease assessment for response evaluation.38 Deep response was defined as either achievement of VGPR or CR, with VGPR defined by 90% or more reduction of the monoclonal component.
To compare categorical variables, Fisher’s exact test was utilized. For the comparison of continuous variables, the Mann-Whitney test was employed. Univariate and multivariate logistic regression analyses were conducted to assess the association of factors with deep response. All statistical analyses for the clinical data analysis were performed using R software version 4.0.0 (https://www.r-project.org).
Deep targeted sequencing of clonal haematopoiesis mutations
The peripheral blood sampling was conducted after obtaining written informed consent. DNA extraction from peripheral white blood cells was performed for targeted sequencing of 24 CH-associated genes (see Supplementary Table 3). The Twist target enrichment panel (Twist Bioscience, USA) and DNBSEQ-G400 Dx (MGI Tech, China) with 2× 100 bp paired-end reads were utilized, archiving an average depth of coverage over 1000×. All reliable non-synonymous variants associated with malignancy were annotated as CH-driven mutations, including truncating mutations or any somatic variants previously reported in the COSMIC database v83. Variants with a VAF of ≥ 1.5% were considered positive. The CH mutation calling and filtering process were identical to that described in previous literature.24
Healthy cohort comparison
For the healthy control cohort, we included 5,487 healthy subjects aged 50–79 years who underwent self-referred health check-ups between February 2014 and March 2016 from the Gene-ENvironmental Interaction and phenotypE (GENIE) cohort at the Seoul National University Hospital Healthcare System Gangnam Center and had no history of cancer.39 The cohort, blood sample collection, genome analysis, and the current study design were approved by the Institutional Review Board (IRB) of Seoul National University Hospital (IRB number: 2112-136-1284, 25-1103-127-357). Informed consent for blood sample storage and use for research purposes was obtained from the participants at the time of blood sample collection.
To evaluate the prevalence of CH mutations in MM and compare it with that of non-MM healthy subjects, we utilized a multivariable logistic regression model adjusted for age and sex to assess the association between CH mutations and MM. In order to mitigate potential selection bias related to age and sex, we conducted 1:4 case-control propensity score matching using the MatchIt package in R.
UK Biobank data analysis
CH variant calling and quality control
We conducted a somatic variant calling procedure for the UK Biobank cohort using Mutect2 following the GATK best practices pipeline (https://www.oreilly.com/library/view/genomics-in-the/9781491975183/). In summary, Mutect2 was executed over 1301 intervals where mutations are known to occur in CH, with the “OrientationBiasReadCounts” annotation and F1R2 outputs enabled. We utilized the GetPileupSummaries and CalculateContamination functions in GATK to estimate cross-sample contamination, and the LearnReadOrientationBias function to learn orientation bias priors. Subsequently, we employed the FilterMutectCalls function to annotate variants that passed or failed quality control filters, incorporating the cross-sample contamination estimate and orientation bias priors as inputs. Variant effects were annotated using Funcotator, and the annotated VCF was exported to a tabular form with bcftools.
Prior to manual review, we applied stringent quality control filters to the raw variant calls. We excluded variants with a total depth < 20, a depth for either allele < 5, or a variant allele fraction < 0.02. Additionally, variants flagged by the FilterMutectCalls function with the following filter flags were removed: "orientation", "haplotype", "base_qual", "contamination", "strand_bias", "map_qual", "position", "germline", and "clustered_events". Variants flagged with the "slippage" filter were excluded if their variant allele fraction was < 0.10. Furthermore, individual mutations occurring more frequently than the R882H mutation were excluded, except for the c.1926_1927insG mutation in ASXL1.
Diagnostic information and cohort selection
We utilized diagnostic information based on the ICD-10 code. Individuals diagnosed with MM at any point between attendance and the last follow-up were selected for analysis. To ensure the specificity in our analysis, individuals with a prior diagnosis of MM, monoclonal gammopathy of undetermined significance (MGUS), or plasmacytoma before enrollment were excluded.
Survival analysis
The Kaplan-Meier plot was employed to visualize the cumulative incidence of MM, with group comparisons made using the log-rank test. Additionally, a Cox proportional hazard regression model was utilized to estimate the HR for risk factors in the multivariate analysis. This survival analysis encompassed both disease incidence and death.
Time to MM incidence was calculated from the “date of attending assessment center” (data-field: f53) to the “date of first inpatient diagnosis icd10” (data-field: 41280). Similarly, time to death after MM diagnosis was computed from the “date of first inpatient diagnosis icd10” (data-field: 41280) until the “date of death” (data-field: 40000). In cases of multiple diagnoses, the first inpatient diagnosis was considered the primary event time.
The covariates incorporated into the Cox regression model included the following factors: genetic principal component (Genetic PC) (1–10), genetic sex (male/female), body mass index (BMI), and age at the time of disease diagnosis, specifically for MM and MGUS. Furthermore, to mitigate any potential influence of genetic background on MM predisposition, the list of known germline predisposition mutations in MM was included as a covariate.40
For the analysis of MGUS to MM progression, only samples diagnosed with both MGUS and MM using the ICD-10 code were included. Event time was defined as the duration between MGUS diagnosis and MM diagnosis. The control group consisted of MGUS patients who did not progress to MM by the last follow-up.
Single cell DNA sequencing
Sampling and generation process of single-cell DNA sequencing data
Bulk DNA sequencing was conducted on PBMC samples extracted from two MM patients to validate CH, and CH-positive variants were detected. Following this, a custom amplicon panel targeting genes linked to clonal hematopoiesis was designed for single-cell DNA sequencing analysis. Subsequently, BM aspirates were obtained from the same two patients, and data were generated using a single-cell multi-omics (DNA + Protein) sequencing platform (Tapestri®, Mission Bio, Inc.).
Preprocessing and quality control of scDNA data
The generated FASTQ files underwent preprocessing through the Tapestri pipeline.41 This pipeline includes adapter sequence trimming, alignment of read content to the human genome (hg19), assignment of sequence reads to cell barcodes, and conducting genotype calling through GATKv3.7. The resulting data is compiled into a VCF file and exported as loom and HDF5 files for further processing. Basic filtering steps for low-quality genotypes or cells were performed in Tapestri Insights with default settings. The minimum variant quality score was set at 30, with a minimum of 10 reads per variant per cell. Cells with an alternate allele frequency below 20 were excluded. Additionally, variants occurring in less than 50% of cells and cells with less than 50% informative genotypes among potential variants were filtered out. Following these procedures, the data underwent subsequent analysis using Tapestri Mosaic (version 3.0.1, https://missionbio.github.io/mosaic) within a Python (3.8.17) environment.
Detection of CH variant in scDNA data
In the context of this multi-omics platform, the Tapestri Pipeline was utilized for DNA analysis as previously described. Confirmation and visualization of variants were performed in the Python environment using guidelines and scripts provided by Tapestri Mosaic. For protein analysis, we employed the same module to quantify the number of reads per antibody per cell. Subsequently, normalization was conducted within the same environment, and variants that were not validated in bulk sequencing were excluded.
Using this tool, we reviewed the scDNA data from the processed HDF5 files. We identified variants consistent with those validated through bulk sequencing. However, due to differences in panel design and the limitations of targeted sequencing using the amplicon method, only one of the two samples confirmed the same TET2 mutation as observed in bulk sequencing (TET2:chr4:106158215:C/A).
Single cell protein analysis and cell type profiling
To investigate the distribution of TET2 mutations within specific cell types identified through DNA analysis, we utilized the Tapestri Mosaic module in Python for the analysis of single-cell protein data. UMAP and clustering were performed using the run_umap and cluster functions of the respective tool. Using the same tool, we examined the normalized counts of marker proteins within each cluster and conducted cell type profiling.
Visualization of single-cell multi-omics analysis results
The expression of these markers was visualized as a heatmap using the same tool as mentioned above. We used the UMAP projection to display cell type profiling results. The distribution of cells harboring the previously identified TET2 mutation from the scDNA analysis was illustrated by overlaying it on the UMAP.
Single cell RNA sequencing
scRNA-seq Library Preparation
To prepare cells for single-cell sequencing, the LUNA-FL™ Automated Fluorescence Cell Counter is used alongside the 10x Genomics protocols and guidelines to ensure optimal sample preparation. Libraries are then prepared using the Chromium controller following the 10x Single Cell 3’ v3 protocol. This involves diluting cell suspensions to a target count (10,000 cells), mixing them with the master mix, and loading them into a chip with Gel Beads and Partitioning Oil. Within droplets, RNA transcripts are barcoded and reverse transcribed. The resulting cDNA undergoes end repair, 'A' base addition, adapter ligation, purification, and PCR enrichment to form the final library. Quantification is done via qPCR (KAPA), qualification via Agilent Technologies 4200 TapeStation, and sequencing on the HiSeq platform (Illumina) as per the user guide specifications.
scRNA-seq Data Processing
Raw FASTQ files for scRNA-seq were processed by using the Cell Ranger software (v3.1.0). Reads were mapped to the human reference genome (GRCh38) using the Ensembl GRCh38.100 GTF file. For each sample, a gene-by-cell unique molecular identifier (UMI) count matrix was generated with default parameters. Empty droplets were removed using the emptyDrops function of the DropletUtils (v1.10.3) R package with FDR < 0.05.42 To filter out low-quality cells, cells with more than 60% of UMIs assigned to mitochondrial genes and less than 2.5 log10-scaled UMI counts were excluded using the calculateQCMetrics function of the scater (v1.18.6) R package.43 To normalize the count matrix for cell-specific biases, cells were clustered using the quickCluster function of the scran (v1.18.7) R package.44 Cell-level size factors were calculated using the computeSumFactors function of the same package. Raw UMI counts were normalized by cell-level size factors and then log2-transformed with a pseudocount of 1. Highly variable genes (HVGs) were identified using the modelGeneVar function of the scran package with FDR < 0.05 for biological variability. For downstream analysis, cells were clustered into 22 clusters using the FindClusters function of the Seurat (v4.3.0) R package on the top 15 principal components (PCs) of HVGs with resolution = 0.5.45 Cells were visualized on a two-dimensional UMAP plot using the RunUMAP function of the same package.
scRNA-seq Data Analysis
To identify differentially abundant cell subpopulations associated with CH mutation, we utilized a single-cell specific method based on mixed-effects modeling of associations of single cells (MASC). We obtained odds ratios and P-values indicating differences in abundance for each cell type across conditions.46 Differential gene expressions between conditions within each cell subpopulation were assessed using the FindMarkers function of the Seurat package, implementing the MAST algorithm (v1.16.0), with patients considered as latent variables. With the log-scaled fold changes for each gene between the two conditions, gene set enrichment analysis (GSEA) was performed using the fgsea function of the fgsea (v1.16.0)47 R package, utilizing the Gene Ontology gene sets used as a database with the msigdbr function of the msigdbr (v7.5.1) R package.48 Trajectory analysis for hematopoiesis was conducted using the run_palantir function of the Palantir (v1.0.1) Python package.49 To visualize cells on two-dimensional t-SNE plots, batch effects across patients were corrected using the run_harmony function of the Harmonypy (v0.0.9) Python package on the first 100 diffusion components (DCs).50 Subsequently, a k-nearest neighbor (k-NN) graph (k = 30) was constructed using the first 10 DCs. Coordinates for the t-SNE plots were computed using the run_tsne function with perplexity = 700. Using the run_palantir function, branch probabilities for each of the three differentiation fates were calculated for all cells. Transcription factor activities were inferred using the DoRothEA (1.2.2) R package based on the Dorothea regulon database with A, B, and C levels out of 5 levels representing the confidence level based on the number of supporting evidence.51,52 Cell-cell interactions between classical monocytes and plasma cells were inferred using the CellPhoneDB (v3.1.0) Python package with the normalized count matrix and default parameters.53
Exosome RNA analysis
Sampling, exosome extraction, and small RNA sequencing
Exosome extraction was performed from BM samples of 30 MM patients, followed by sequencing and FASTQ file generation for small RNAs within the extracted exosomes. (NEXTflex®, Illumina, Inc.)
The FASTQ files have been uploaded to the Genboree Workbench for the purpose of employing the exceRpt small RNA-seq pipeline (http://genboree.org/java-bin/workbench.jsp) to conduct a consistent analysis across the entire dataset.
Preprocessing, mapping, and read counting
The small RNA FASTQ files extracted from the BM of MM patients were initially processed using the exceRpt small RNA-seq Pipeline (version 4.6.2) through a batch submission tool.
The exceRpt pipeline processes samples by aligning reads and filtering to eliminate contaminants, then aligns them to the endogenous sequence database.54 The pipeline primarily utilizes default settings, including adapter trimming set to ‘auto detect’, which identifies and trims adapter sequences for various library types. Additional parameters were chosen to remove random degenerate 4N sequences of NEXTflex libraries. The default random barcode setting was employed, indicating the presence of a random 4 nucleotides sequence immediately preceding the 5′ and 3′ ends of the insert sequence. The sequences and identities of the adapters identified by the exceRpt pipeline were confirmed in output files, with no identified missing or incorrect adapters in the libraries.
Read quality assessment was conducted using FASTQC, with reads filtered out if they had PHRED scores below 30 (FASTX-Toolkit v0.0.13).54 Additionally, reads shorter than 16 nucleotides were excluded from further analysis.
To address potential contamination from laboratory or ribosomal RNA (rRNA), sequences were mapped using Bowtie2 to UniVec (a database of common laboratory contaminant sequences maintained by NCBI) and human rRNA sequences, and subsequently removed.54
After the filtering step, the reads were aligned to both the human genome and pre-miRNA sequence databases. Initially, the reads underwent mapping to miRbase version 21, gtRNAdb, piRNABank, Ensembl transcripts (hg19), and circBase databases. This process aimed to assign reads to microRNAs (miRNAs), transfer RNAs (tRNAs), PIWI-interacting RNAs (piRNAs), gencode transcripts, and circular RNAs, respectively.54
Following the pipeline execution as described, a reevaluation of internal quality metrics of the exceRpt pipeline was conducted because the output did not include any samples with TET2 mutations that passed the tool's inherent quality control criteria. The inherent quality metrics, specifically transcriptome read count and transcriptome-genome ratio, were redefined to maximize the detection of TET2 mutation signals (Transcriptome reads ≥ 2000, Transcriptome-genome ratio ≥ 0.25). Modified criteria-compliant samples were integrated and combined into a raw read count matrix by the exceRpt pipeline on Genboree Workbench.
Differential expressed miRNA analysis
After preprocessing and quality control, the raw read counts of miRNA obtained from the Genboree Workbench exceRpt small RNA-seq pipeline were further analyzed using R (version 4.3.1) for differentially expressed miRNA analysis.
The calculation of differential expression fold-changes and adjusted P-values was performed using the DESeq2 package (version 1.40.2).55
Altered down-regulated miRNAs with a baseMean greater than 20 were selected based on criteria of fold change ≤ -1.0 and adjusted P-value < 0.05. Ten differentially expressed miRNAs (DE miRNAs) were confirmed through this filtering. Subsequently, these DE miRNAs were visualized through a volcano plot using ggplot2 (version 3.4.2).
miRNA targeted gene and pathway prediction
To identify potential target genes for these DE miRNAs, we employed the miRSystem tool (https://mirsystem.cgm.ntu.edu.tw/), which integrates validated data from miRecords, TarBase, and seven miRNA target prediction databases (PITA, miRanda, DIANA-microT, mirBridge, PicTar, rna22, TargetScan), along with five pathway databases (Gene Ontology, KEGG, BioCarta, PID, Reactome).56 The goal is to predict targets and decipher their biological roles and pathway involvements.
Upon entering DE miRNAs into miRSystem, it produced a target gene summary report, functional annotation summary report, pathway ranking summary, and individual reports for target genes and pathways for each miRNA. The DE miRNAs were found to regulate genes linked to the MAPK signaling pathway, integrins in angiogenesis, and various growth factors and inflammatory pathways.
UK Biobank Olink proteomics data preprocessing
The Olink Proteomic profiling in the UK Biobank involved analyzing blood plasma samples from 54,967 participants using the Olink® Explore 3072 platform, which measures 2,923 protein analytes across panels such as Cardiometabolic, Inflammation, Neurology, and Oncology. After whole-exome sequencing-based proteogenomic analyses on 52,217 samples, 50,065 passed Olink NPX quality control (96%). The quality control method was based on the paper by Sun et al.57 A total of 2,923 unique proteins were measured across eight Olink® Explore 3072 panels (Cardiometabolic, Cardiometabolic II, Inflammation, Inflammation II, Neurology, Neurology II, Oncology, and Oncology II). However, we utilized only 1,463 protein assays that were publicly available at the time of our analysis out of 2,923 assay data. These assays belonged to the Cardiometabolic, Inflammation, Neurology, and Oncology panels.
CH variant calling was conducted on UK Biobank data using Mutect2, with the specific methodology similar to the description provided earlier. ICD-10 codes for MM were collected and organized in the same manner as described previously.
The proteomics NPX data, accessible via the UK Biobank research analysis platform (RAP) (https://ukbiobank.dnanexus.com), has been integrated and filtered with information regarding CH and MM diagnostic codes (ICD-10 code). Merged with UK Biobank showcase metadata, including assay details and quality control information, the data has been restructured into the Olink data format. This entire process was carried out in R (version 4.3.1).
Ordinal regression of proteomics data
CH is marked by somatic mutations that increase mutation rates as individuals age,58 displaying a mosaic nature. Gender differences affect CH occurrence and progression.59 To minimize the influence of the confounding factors of age and gender, an ordinal regression analysis was conducted using the OlinkAnalyze package (version 3.6.0, https://CRAN.R-project.org/package=OlinkAnalyze). The analysis incorporated these factors as covariates in the regression model (NPX ~ CH + Age + Gender).
Through this process, proteins exhibiting statistically significant expression differences based on the presence of CH were identified. A post-hoc test calculated estimate values, representing the difference in mean NPX between variables, leading to further validation of these differences.
Pathway analysis of proteomics data
Using the statistically significant protein set from the aforementioned post-hoc test results (adjusted P-value < 0.05), an over-representation analysis was conducted. Statistically significant pathways with adjusted P-value < 0.05 were confirmed. Based on these results, key pathways were selected by considering keywords where significant differences were observed in previous scRNA-seq and miRNA analyses, such as cytokines, MAPKs, and integrins. For a more detailed interpretation, the estimate values from the previous post-hoc test for each protein within the set of relevant proteins in pathways were visualized together in a heatmap. All analysis processes were conducted using the OlinkAnalyze package in R.
Cell culture
The human monocyte cell line, THP-1, human MM cell line, MOLP-8, and 293FT cells were obtained from the Korean Cell Line Bank. THP-1 cells were cultured in RPMI 1640 (Hyclone, Cat.#SH30027.01), MOLP-8 cells were cultured in MEM medium (Hyclone, Cat.#SH30024.01), and 293FT cells were cultured in DMEM medium (Hyclone, Cat.#SH30243.01). All media were supplemented with 10% FBS (Hyclone, Cat.#AH30007592) and 1% penicillin-streptomycin (Gibco, Cat.#15140-122).
Differentiation of THP-1 cells into macrophages was induced by the addition of 150 nM phorbol 12-myristate 13-acetate (PMA; Sigma, Cat.#P8139) for 72 hours, and conditioned media of the differentiated THP-1 cells were collected for 48 hours using fresh media. All cells were incubated at 37°C in a 5% CO2 atmosphere.
Generation of TET2 knockdown THP-1 cells
TET2 knockdown THP-1 cells were generated by lentivirus infection containing Cas9 and a single guide RNA (sgRNA) targeting the TET2 gene. For lentiviral sgRNA production, sgRNA was constructed with the lentiCRISPR v2-GFP vector. Lentiviral vector, VSVg, and psPAX2 were cotransfected into 293FT cells using lipofectamine 2000 for 72 hours. Viral particles were concentrated using Lenti-X™ concentrator (Takara) and used to infect cells with 8 µg/ml polybrene for 48 hours. After infection, cells were sorted by flow cytometry for GFP expression. Knockdown of TET2 was evaluated by western blotting.
Cell proliferation assay
Cell proliferation was estimated using the trypan blue exclusion assay. Cells were incubated with a 0.4% solution of trypan blue dye (Thermo Scientific), and cell numbers were counted using a hemocytometer. The numbers of blue-stained cells compared with the initial cell count were calculated for cell proliferation estimation. To investigate the effect of CCR10 inhibition on cell proliferation, BI-6901 (0.1 ng/ml) was treated with conditioned media for 120 hours.
Cell-line Bulk RNA analysis
Bulk RNA sequencing
Total RNA from differentiated THP-1 cells was extracted using the RNeasy Plus Mini Kit (Qiagen). Whole-transcriptome expression profiles were generated by RNA sequencing.
Bulk RNA-seq Data Processing and Analysis
Raw paired-end RNA-seq reads were aligned to the human genome (GRCh38) using STAR (v2.7.10b). The aligned reads were mapped with the Ensembl GRCh38.103 GTF file. The raw count matrix was normalized by scaling and size factors using the DESeq2 (v1.30.1) R package, with genes having a count of at least 10 for every sample.60 Differential gene expression was tested using the DESeq2 R package and genes with P-values < 0.05 and log2-scaled fold change > 0.1 or < -0.1 were identified as DEGs. Functional enrichment analysis of DEGs was performed using the topGO (v2.42.0)61 R package with Gene Ontology Biological Process (GOBP) terms based on the org.Hs.eg.db (v3.12.0) annotation data package.62
MMRF dataset analysis
To investigate the impact of a single gene on MM survival, we utilized the Survival Genie platform with the MMRF CoMMpass dataset.18 This dataset contains bulk RNA sequencing data from the CD138+ fraction of MM patients. Survival analysis was performed using the single gene option on primary MM samples for overall survival. Samples were divided into high and low gene expression groups using an optional cut point (cutp).