1. CNV data and gene expression data
In this study, the TARGET and TCGA AML cohorts were used to analyze the prognostic roles of CNVs and CNV-modulated gene expression. TARGET CNV data (WGS data) and mRNA data (mRNA and miRNA sequencing) were downloaded from target-data.nci.nih.gov (https://target-data.nci.nih.gov/Public/AML/). The clinical data, including patient identification, gender, risk group, age, overall survival time, and vital status, were downloaded and pre-processed. The CNV data (Affymetrix Genome-Wide SNP Array 6.0 data), mRNA data (mRNA and miRNA sequencing), and clinical data of the TCGA AML cohort were downloaded from gdac.broadinstitute.org, and overall survival time was calculated as “patient.days_to_death” + “patient.days_to_last_followup”.
Bone marrow samples were collected from AML patients (n = 121; 53 females, 68 males; median age 39 years) at Xinqiao Hospital and Chongqing General Hospital, Chongqing, China, starting in November 2016. The AML patients were diagnosed according to the French–American–British (FAB) and WHO classifications and had not received bone marrow transplantation. The use of clinical samples was approved by the Ethics Committee of Chongqing General Hospital.
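The overall survival calculation for the TCGA cohort can be sketched as follows. This is a minimal illustration and not the authors' script. It assumes that each patient has at most one of the two fields populated (deceased patients carry “days_to_death”, living patients carry “days_to_last_followup”), so that the sum reduces to whichever value is present. The function name is hypothetical.

```python
from typing import Optional


def overall_survival_days(days_to_death: Optional[float],
                          days_to_last_followup: Optional[float]) -> Optional[float]:
    """Combine the two TCGA clinical fields into one survival time.

    Assumption: a missing field (None) contributes nothing to the sum,
    so the result equals whichever field is available.
    """
    present = [v for v in (days_to_death, days_to_last_followup) if v is not None]
    if not present:
        # Neither field is available: survival time is unknown.
        return None
    return sum(present)


# A deceased patient and a censored patient (illustrative values only).
print(overall_survival_days(365, None))
print(overall_survival_days(None, 120))
```

Patients for whom the function returns None would have to be excluded before fitting any survival model.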
2. Database preparation
To reduce the influence of gene synonyms, gene names were converted to the official “symbol” based on the file “gene_info.gz” downloaded from ftp.ncbi.nlm.nih.gov.
Cancer-related pathways in KEGG (https://www.genome.jp/kegg/) were obtained through the KEGG API using the Biopython package “Bio.KEGG”, and the KEGG cancer panel genes (483 genes) were extracted from these pathways.
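The synonym-to-symbol conversion can be sketched as below. The sketch assumes the standard tab-separated layout of the NCBI gene_info file, in which the third column holds the official symbol and the fifth column holds pipe-separated synonyms, with “-” marking an empty field. The function name and the two toy rows (GENE1, GENE2 and their aliases) are hypothetical and serve only as an illustration.

```python
def build_symbol_map(gene_info_lines):
    """Map every synonym, and every official symbol, to the official symbol."""
    mapping = {}
    for line in gene_info_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.rstrip("\n").split("\t")
        symbol, synonyms = fields[2], fields[4]
        # An official symbol always maps to itself, even if it was
        # previously seen as a synonym of another gene.
        mapping[symbol] = symbol
        if synonyms != "-":
            for alias in synonyms.split("|"):
                # Do not let a synonym override an existing entry.
                mapping.setdefault(alias, symbol)
    return mapping


# Toy rows in gene_info column order (hypothetical genes).
rows = [
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms",
    "9606\t1\tGENE1\t-\tALIAS1A|ALIAS1B",
    "9606\t2\tGENE2\t-\t-",
]
symbol_map = build_symbol_map(rows)
print(symbol_map["ALIAS1B"])
```

In practice the lines would come from the decompressed file, for example through `gzip.open(path, "rt")`, and ambiguous synonyms shared by several genes would need a documented tie-breaking rule.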
3. Statistical analysis
3.1. The CNV data from TCGA were divided into two groups (samples from primary blood-derived cancer and samples from solid normal tissue) and analyzed separately using GISTIC (version 2.0.23), downloaded from ftp.broadinstitute.org. The parameters were set as suggested by TCGA: the threshold for copy number amplifications was 0.1, the threshold for copy number deletions was 0.1, the maximum number of segments was 2000, and the significance threshold for q-values was 0.25 (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/).
The data in “all_thresholded.by_genes.txt” were used for the subsequent analyses; values greater than or equal to 1 were defined as amplification, and values less than or equal to -1 were defined as deletion.
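The thresholding rule above can be written as a small function. This is a sketch under the assumption that each entry of the thresholded gene table is an integer call, with 0 meaning no change. The function name and the label “normal” for the remaining values are illustrative choices.

```python
def cnv_status(value: int) -> str:
    """Classify one thresholded GISTIC call by the rule in the text."""
    if value >= 1:
        return "amplification"
    if value <= -1:
        return "deletion"
    return "normal"


# Classify one gene across five samples (illustrative values only).
calls = [2, 1, 0, -1, -2]
print([cnv_status(v) for v in calls])
```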
3.2 CNV survival analysis
A multivariate Cox proportional hazards regression model (Cox model) was used to identify survival-related CNVs in AML patients.
The R package “circlize” (function “circos.genomicTrack()”) was used for mapping the CNV genes to chromosomes.
In the TARGET cohort, 193 patients (96 females and 97 males, median age 9 years) who had both CNV data (WGS data) and clinical data were included in the survival analysis. In the TCGA cohort, 191 patients (87 females and 104 males, median age 58 years) who had both CNV data and clinical data were included to validate the survival analysis results from the TARGET cohort.
CNVs occurring in fewer than 4 samples were removed. For every gene, patients were divided into two groups, named “normal” and “CNV”, and analyzed with the Cox model adjusted for gender and age. The p-values were adjusted with “FDR”; p < 0.05 and adjusted p < 0.05 were considered significant. Kaplan-Meier curves were used to visualize survival. All analyses were performed in R with the “survminer” and “survival” packages.
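The frequency filter and the multiple-testing adjustment described above can be sketched as follows. The text names the adjustment only by a label, so the sketch assumes a Benjamini-Hochberg false discovery rate correction; that choice, the function names, and the toy inputs are all assumptions. The Cox model itself is not reproduced here.

```python
def frequent_cnv_genes(calls_by_gene, min_samples=4):
    """Keep genes whose CNV (any non-zero call) occurs in at least min_samples samples."""
    return [gene for gene, calls in calls_by_gene.items()
            if sum(1 for c in calls if c != 0) >= min_samples]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy data: GENE_A has a CNV in 4 samples, GENE_B in only 2.
calls = {"GENE_A": [1, 1, -1, 2, 0], "GENE_B": [1, 0, 0, -1, 0]}
print(frequent_cnv_genes(calls))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A gene would then be reported only if both its raw p-value and its adjusted p-value fall below 0.05.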
3.3 mRNA survival analysis
Gene expression data from 294 patients in the TARGET AML cohort (137 females and 157 males, median age 10 years; recurrent cases were excluded) were used to explore the relationship between gene expression and survival. The mRNA expression values (RPKM) were log2-transformed. Patients who had both mRNA sequencing data and clinical data were included and analyzed with multivariate Cox proportional hazards regression models adjusted for gender and age, and with univariate Cox proportional hazards regression models for gender or age. The Cox proportional hazards regression models were fitted with the R packages “survminer” and “survival”, and the p-values were subsequently adjusted with “FDR”. For statistical tests, p < 0.05 and adjusted p < 0.05 were considered significant, and the upper and lower bounds of the 95% confidence intervals were reported. In the TCGA AML cohort, 179 patients (84 females and 95 males, median age 58 years) who had integrated CNV data and clinical data were included to validate the survival analysis results from the TARGET cohort; multivariate Cox proportional hazards regression models adjusted for gender and age were used, and p < 0.05 was considered significant.
3.4 Integrative analysis of gene expression and CNV data
Integrative analysis of gene expression and CNV was performed on 156 patients in the TARGET cohort and 171 patients in the TCGA cohort. The correspondence between gene CNV and expression was analyzed as follows: patients were divided into groups by CNV status (“Normal”, “Duplication” and “Deletion”); the Kruskal-Wallis test was used to compare more than two groups and the Wilcoxon test to compare two groups, both in R. The p-values were adjusted with “FDR”, and p < 0.05 together with adjusted p < 0.2 was considered statistically significant.
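For the three-group comparison, the rank-based test can be illustrated with a plain implementation of the Kruskal-Wallis H statistic. The analysis described above was run in R; the code below is only a sketch of the same statistic, with function names and expression values made up for illustration. It applies the usual tie correction and, for exactly three groups, uses the chi-square approximation with 2 degrees of freedom, whose upper tail probability is exp(-H/2).

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of groups."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


# Illustrative log2 expression values grouped by CNV status.
deletion, normal, duplication = [1, 2, 3], [4, 5, 6], [7, 8, 9]
h = kruskal_wallis_h([deletion, normal, duplication])
p = math.exp(-h / 2)  # valid only for three groups (2 degrees of freedom)
print(h, p)
```

When only two CNV statuses are present for a gene, the two-group rank test would be used instead, as stated in the text.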
4. Patient samples validation analyses
The use of clinical samples was approved by the Institutional Research Ethics Committee of the Affiliated Chongqing Hospital of the University of Chinese Academy of Sciences, and Chongqing General Hospital, Chongqing, China. DNA and RNA were extracted from patient bone marrow samples using the Tiangen DNA/RNA kit (Beijing, China) according to the manufacturer’s instructions.
The extracted DNA was used for CNV analysis with the AccuCopy™ method developed by Genesky Biotechnologies Inc. (Shanghai, China), as described previously[20]. The primers were as follows: SEMA4D, 5'-GGATGAAACTTGCCACGTGAA-3', 5'-GGAAATGCCTTGCCCTAAACC-3'; DNMT1, 5'-GATCAGGCAGCTCAATAATTTGTGT-3', 5'-TGACCTCAAATATGGGCAGCA-3'; CBFB, 5'-GTCATTGCAGGCAAGAAGACAAC-3', 5'-GAGAACAGCGACAAACACCTA-3'; CHAF1B, 5'-TAAATGGCTCCTGGCCCCTAT-3', 5'-TCTTCCACGGACGGTTACTGCT-3'. For each gene, two primers were used to increase the accuracy of the CNV analysis.
The ReverTra Ace-α-First Strand cDNA Synthesis Kit (Toyobo) was used to generate cDNA from the extracted RNA, and real-time quantitative PCR with SYBR Green was performed on the PCR System 7500 (ABI) to determine gene expression. The primers for real-time PCR were as follows: CBFB, forward: 5'-ACTGGATGGTATGGGCTGTC-3', reverse: 5'-AAGGCCTGTTGTGCTAATGC-3'; CHAF1B, forward: 5'-CTGGGCAACTGATGGGAATT-3', reverse: 5'-GCAGCACCCTGTCACAGCT-3'; DNMT1, forward: 5'-GTTCTTCCTCCTGGAGAATGTCA-3', reverse: 5'-GGGCCACGCCGTACTG-3'; SAE1, forward: 5'-AGGACTGACCATGCTGGATCAC-3', reverse: 5'-CTCAGTGTCCACCTTCACATCC-3'; SEMA4D, forward: 5'-GTCTTCAAAGAAGGGCAACAGG-3', reverse: 5'-GAGCATTTCAGTTCCGCTGTG-3'; β-actin (internal control), forward: 5'-AGTTGCGTTACACCCTTTC-3', reverse: 5'-CCTTCACCGTTCCAGTTT-3'. The PCR reaction was conducted in triplicate for each sample. All gene expression was normalized to that of GAPDH using the 2^(-ΔΔCt) method.
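The 2^(-ΔΔCt) calculation can be sketched as below. The text does not state which sample served as the calibrator, so the calibrator arguments, the function name, and all Ct values are assumptions for illustration. Triplicate Ct values are averaged before the differences are taken.

```python
from statistics import mean


def relative_expression(ct_target, ct_reference,
                        ct_target_calibrator, ct_reference_calibrator):
    """Relative expression by the 2^(-ddCt) method.

    Each argument is a list of replicate Ct values. The calibrator is an
    assumed baseline sample against which the test sample is compared.
    """
    delta_sample = mean(ct_target) - mean(ct_reference)
    delta_calibrator = mean(ct_target_calibrator) - mean(ct_reference_calibrator)
    delta_delta_ct = delta_sample - delta_calibrator
    return 2 ** (-delta_delta_ct)


# Illustrative triplicates: the target gene is detected 2 cycles earlier,
# relative to the reference gene, in the sample than in the calibrator.
fold_change = relative_expression(
    ct_target=[24.0, 24.5, 23.5],
    ct_reference=[18.0, 18.0, 18.0],
    ct_target_calibrator=[26.0, 26.0, 26.0],
    ct_reference_calibrator=[18.0, 18.0, 18.0],
)
print(fold_change)
```

With these values ΔΔCt is -2, which corresponds to a four-fold higher relative expression in the sample.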
The patient samples were divided into groups based on their CNV status (“Normal”, “Duplication” and “Deletion”). The Kruskal-Wallis test was used to compare more than two groups and the Wilcoxon test to compare two groups, both in R. p < 0.05 was considered statistically significant.
The gene expression was normalized, and multivariate Cox proportional hazards regression models adjusted for gender and age were used for survival analysis using the R “survminer” and “survival” packages.
5. Venn graphic presentations
Venn diagrams were plotted with the R package “VennDiagram”.