We began with an unsupervised clustering analysis using two types of data: somatic copy number alteration (CNA), and gene expression levels from RNA-seq measurements. We then identified differences among the three resulting clusters in their risk stratification, and in overall survival, using datasets with information on mutations, putative copy number alterations from GISTIC (Genomic Identification of Significant Targets in Cancer), with matched clinical data. Next, we performed pathway analyses to find differences in which molecular pathways were enriched in the three molecular subgroups we found. Further analyses focused on molecular characterization of the one cluster with the worst outcomes under the current treatment regimen.
Dataset preparation
We downloaded the TCGA adult AML datasets directly from https://www.cbioportal.org/study/clinicalData?id=laml_tcga_pub. We used the total of 166 samples with transcriptomic, copy number alteration, mutation, and clinical datasets. These samples were obtained from peripheral blood and represented the major morphologic and cytogenetic subgroups of AML [3]. We used two different CNA datasets: CNA segmentation and discrete CNA values datasets.
Calculation of CNA segment
We estimated gene level CNA as the segment mean of copy numbers of the genomic region of a gene by using TCGA-Assembler 2 [4] downloaded from https://github.com/compgenome365/TCGA-Assembler-2 (version 2.0.6). Degree of CNA was calculated as log2 (tumor values/normal values). Across samples, CNA of all genes had a standard deviation greater than the median. Therefore, in order to exclude near normal (very low) CNA values, only genes with a sum of CNA values across samples greater than zero were used for analysis, resulting in 13,019 genes total. Hg19 annotation was used to obtain gene position.
Discrete CNA values
Putative copy-number calls determined using GISTIC 2.0 were used to obtain the information for six genes known to be important for AML: FLT3, NPM1, DNMT3A, CEBPA, RUNX1, and RUNX1T1. Patients with CNA values greater than or equal to 1 were classified as copy number amplifications, while patients with values less than or equal to -1 were classified as copy number deletions. Patients with zero values were classified as unchanged.
RNA-seq expression
We used RSEM (RNA-Seq by Expectation Maximization) expected raw count expression dataset. Genes without at least 1 count-per-million reads in at least 50% of the total samples were filtered out. The resulting RNA dataset was log2 transformed and quantile normalized. A total of 12,934 genes were retained for analysis.
Mutation dataset
The AML genes RUNX1, RUNX1T1, CEBPA, FLT3, NPM1, DNMT3A, and TP53 were analyzed across 166 samples.
Clinical dataset
The clinical dataset provided information such as cytogenetic abnormalities and the risk stratification.
Pathway database
We downloaded the gmt file of MSigDB hallmark gene set collection (version 7.1) from https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp for annotation. The 50 hallmarks in this collection represent biological state or process [5].
Multiple omics data integrative clustering and gene set analysis (MOGSA)
MOGSA is an R software package for multivariate single sample gene set analysis [6]. Using this package (version 1.22.1), we integrated transcriptomic data and gene level CNA over the same set of samples. Based on the chosen PCs from multiple factorial analysis (MFA), we performed unsupervised clustering to identify subgroups and calculated Hall Mark gene set pathway scores for individual samples.
Determine the number of principle components
MFA [7] was incorporated into the moa function from MOGSA to perform principal component analysis (PCA) for CNA and RNA-seq Expression.
Figure 1. Distribution of variances explained by top 20 principle components (PCs). The first six PCs were used to identify subgroups by clustering. These contained a total of 43.6% of total variance, with CNA and RNA-seq expression contributing equally.
Identify of subgroups
We used ConsensusClusterPlus (version 1.52.0) [8] to identify subgroups based on the first six PCs (Figure 1). We used these parameter settings: maxK=6, reps=10000, pItem=0.8, clusterAlg=”hc”, finalLinkage= “ward.D2”, distance=”pearson”. We named 3 chosen clusters (Figure 2A & 2B) as follows: C1 or ‘Intermediate Risk Molecular Subgroup’; C2 or ‘Low Risk Molecular Subgroup’; and C3 or ‘High Risk Molecular Subgroup’ (Figure 2C). Descriptive names were based on our survival analysis (Figure 5).
Figure 2. (A) Delta area shows the numbers of clusters (k) (X axis) and their relative change in area under CDF curve ( Y-axis). (B) Silhouette plot of chosen clusters with k=3. (C) The separation of 3 subgroups: Low Risk Molecular Subgroup, Intermediate Risk Molecular Subgroup, and High Risk Molecular Subgroup.
Survival analysis
We used the R modules Survfit and coxph [9] to perform overall survival analysis based on the three subgroups resulting from the total of 166 TCGA adult samples (Figure 2).
Calculate gene-set pathway scores
We used the MOGSA (the Integrative Single Sample Gene-set Analysis of Multiple Omics Data) package, to identify the MSigDB hallmark pathways’ gene set scores (GSS). We used these parameter settings: nf=6, proc.row=”center_ssq1”, w.data= “lambda1”, and statis=FALSE.
Select representative molecular pathways for three subtypes
In order to choose representative molecular pathways, we firstly selected the pathways resulting from the MOGSA function with GSS FDR (false discovery rate) values smaller than 0.01 in 50% of all samples. Secondly, we used the R functions, fitting generalized linear models (GLM), and t test to calculate the difference of GSSs in each subgroup vs. that in the rest and selected the top 5 and bottom 5 representative pathways ranked by GLM T values, resulting in 16 unique representative pathways total with GLM FDR <0.01 and t test P values < 0.01. Lastly, we visualized z-score scaled GSSs in a heatmap (Figure 5).
Cytogenetics information of the total of 166 patients
The patients’ information on cytogenetic risk and their genetic abnormalities is shown in supplemental table 1.