2.1. Data Collection and Analysis:
2.1.1 TCGA, CPTAC, SCAN-B, METABRIC, and CCLE:
The Cancer Genome Atlas (TCGA) provides data on genomics and transcriptomics for various cancers. We downloaded RNA-sequencing data from Genomics Data Commons (https://portal.gdc.cancer.gov/) related to TCGA Breast cancer (BRCA). As TCGA provided level-3 data, we did not perform data processing. In addition, we downloaded methylation data from TCGA BRCA using the DownloadMethylationData() function from TCGA-assembler (https://ccte.uchicago.edu/TCGA-Assembler/index.php). The unwanted column information in the data was removed by using ProcessMethylation450Data(). When CpG sites corresponded to more than one gene, average methylation values were calculated using CalculateSingleValueMethylationData().
We also obtained processed transcriptomic data from studies such as SCAN-B, ABiM_405, ABiM_100, OSLO2-EMIT0 12-15, Creighton Breast Tumor Compendium16,17, Van de Vijver et al.18, Neo-adjuvant Chemotherapy Response Compendium dataset 19, and METABRIC dataset20 through literature search. These studies included gene expression values along with the patient clinical features.
From the Human Cancer Cell Line Encyclopedia (CCLE) and DepMap portal (https://depmap.org/portal/download/all/), CRISPR knockout screens of BCa cell lines were obtained as gene-effect scores from Achilles and Sanger’s SCORE project. In this study, the scores were normalized so that nonessential genes had a median score of 0, while independently identified common essential genes have a median score of -1. Gene Effect scores were inferred using Chronos21. The integration of the Broad and Sanger datasets followed the methodology outlined by Pacini et al., with the exception that quantile normalization was omitted22.
In addition, we downloaded the BCa proteomics data from Clinical Proteomic Tumor Analysis Consortium (CPTAC) from Proteomics Data Commons (https://proteomics.cancer.gov/programs/cptac). The integration and analysis of these data have been previously reported 23,24. In summary, protein expression values downloaded from the CPTAC data portal were log2 normalized for each sample. Z-values for each protein in each sample were then calculated as the number of standard deviations from the median across samples.
2.1.2. RNA-seq Data Analysis:
We procured raw data from NCBI GEO for GSE58135 25, GSE14273126, GSE183947 27, GSE100925 28, GSE47462 29, GSE184196 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE184196), GSE122630 30, GSE163882 31, GSE130660 32, GSE99063 33, GSE68359 34, and GSE13127635.The raw data from NCBI GEO were downloaded using fastq-dump function from SRA Toolkit (https://github.com/ncbi/sra-tools). The adapter sequences in the downloaded fastq files were trimmed and quality checked by Trim Galore (https://github.com/FelixKrueger/TrimGalore). The trimmed files were mapped to hg38 genome by using the HISAT2 (https://daehwankimlab.github.io/hisat2/) alignment tool, followed by bam conversion and sorting by SAMTools 36. The gene counts from the bam files were obtained by using HTseq-counts function 37. The gene counts were converted either to FPKM or to RPKM by using R or the Python package, respectively (https://github.com/AAlhendi1707/countToFPKM). When raw data were not available for studies such as GSE209998 38, GSE173661 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE173661), and GSE96058 15,we procured the processed data and performed downstream analysis. The statistical analysis was conducted with an unpaired welch t-test.
2.1.3. Gene Expression Array Data Analysis
For the “Creighton Breast Tumor Compendium” dataset16,17 of nine separate breast tumor expression profiling datasets for survival analysis, gene transcription profiling datasets (all on Affymetrix U133 array, A set, and all with DMFS as an outcome measure) were obtained from previous studies (Loi, GEO:GSE6532; Wang, GEO:GSE2034; Desmedt, GEO:GSE7390; Miller, GEO:GSE3494; Schmidt, GEO:GSE11121; Zhang, GEO:GSE12093; Minn, GEO:GSE2603 and GEO:GSE5327, Chin, http://cancer.lbl.gov/breastcancer/data.php. Genes within each dataset were first normalized to standard deviations from the median; samples from the Loi dataset that were also represented in Desmedt were excluded from Loi. When multiple gene array probe sets referenced the same gene, the probe set with the highest average variation across samples for the nine datasets was selected to represent the gene.
For the chemotherapy response expression compendium dataset19, we previously assembled a compendium of eight different public breast cancer expression datasets39-45, involving gene expression profiling of pre-treatment breast tumor biopsies from patients treated with neoadjuvant chemotherapy, with patient response recorded at the end of treatment. The compendium, representing 1240 tumor expression profiles, involved all datasets being generated using the same Affymetrix gene array platform. We normalized the expression values within each dataset in the same manner as described above for the Creighton dataset.
2.1.4. Proteomics Data Analysis Pipeline:
The output files from PRIDE were converted to raw format using msConvert 46. We obtained raw files for studies such as PXD01243147 and PXD018830 48 from PRIDE. MaxQuant and Andromeda search engines were used to process the downloaded MS/MS data, with reference to Homo sapiens UniProt proteome (UP000005640) 49. The MaxQuant parameters were set based on the proteolytic enzyme used, fixed and variable modifications, quantification approach, and data acquisition method. To perform downstream statistical analysis, the output files from MaxQuant analysis were used as input files for Perseus 50. NA values were eliminated from the resulting file, considering the condition that the row should have only three or fewer values. Additionally, the values were log-normalized for further analysis. In addition, we also downloaded processed gene level proteomics data from Anurag M et al., research article 51.
2.1.5. ChIP-seq Data Analysis:
The data associated with GSE85158 52, GSE117941 53, and GSE178373 54 studies were downloaded from NCBI GEO using the fastq-dump from SRAToolkit (https://github.com/ncbi/sra-tools). The quality of the raw data was assessed by FastQC (https://github.com/s-andrews/FastQC), followed by removing the adapter sequences using Trim Galore (https://github.com/FelixKrueger/TrimGalore). The human reference (hg38) was used for alignment with trimmed reads, using BWA mem 55. Duplicate reads were identified using Picard (https://github.com/broadinstitute/picard), followed by merging the technical replicates using SAMtools 36. The obtained bam files were converted to bed and bigwig files using BamToBed and bamCoverage tools 56. Peak calling was performed (NarrowPeaks for transcription factors and Broad Peaks for histone modification) with input DNA or IgG as controls, using MACS257.
2.1.6. scRNA-seq Data Analysis:
The processed data for BCa single-cell sequencing were downloaded from the Curated Cancer Cell Atlas (https://www.weizmann.ac.il/sites/3CA/) 58. We procured associated data and meta files for studies byQian et al 59, Gao et al 60, Azizi et al 61, Wu et al 62, and Griffiths et al 63. Using the Seurat R package, we filtered the cells to have at least 1000 genes in each barcode 64. These filtered cell counts were normalized, batch-corrected using Harmony, and annotated based on the available clinical features65.
2.2. Data formatting and visualization:
We integrated genomic, proteomic, and epigenetic studies into a user-friendly web resource built using PERL CGI. The data analysis results were depicted via interactive visualizations using public and in-house Java script libraries, and Python Flask applications.
Using R and PERL scripts, gene expression matrix files from RNA-seq and scRNA-seq studies and protein expression matrix files from proteomic studies were categorized based on tumor grade, tumor stage, patient’s age, patient’s race, nodal metastasis status, molecular subtype, treatment, and other associated categories.
Categorized and formatted data files were utilized to generate various graphical outputs such as heatmaps, box plots, jitter plots, Kaplan-Meier curves, UMAP plots, and violin plots as representations that address heterogeneity by comparing gene/protein expression along with various clinical features in each dataset.
ChIP-seq results highlighting epigenetic modifications near the gene region are displayed as IGV plots.
2.2.1. Visualization of differentially expressed genes: Heatmapvisualization was employed to visualize the most differentially expressed mRNAs, miRNAs, lncRNAs, and proteins in various BCa datasets. To compile a list of the top 250 genes that exhibited either over-expression or under-expression in each sub-type, we initially identified genes with FPKM values that displayed significant differences (p-values < 0.05). From this initial selection, we considered only genes with a median FPKM value of 1 or higher. Finally, the genes were ranked based on the ratio of the mean FPKM values in tumor samples to the mean FPKM values in normal samples. To generate an interactive heatmap illustrating the top over- and under-expressed genes in a dataset, we utilized the Highcharts library from JavaScript (http://www.highcharts.com/).
2.2.2. Visualization of individual gene expression patterns: Box and Jitter plots were employed to depict the expression levels of the genes in normal samples, primary breast tumors, metastatic breast tumors, and various treatment groups, along with the associated clinical characteristics. The Highcharts library from JavaScript was used to generate the visualizations representing the interquartile range (IQR), including minimum, 25th percentile, median, 75th percentile, and maximum values, utilizing the data obtained from data formatting.
2.2.3. Visualization of scRNA-seq based gene expression: The techniques utilized for visualizing single-cell RNA-seq data included UMAP, violin plots, and ridge plots. These visualizations were generated using Python, with pandas (https://pandas.pydata.org/) for data manipulation and Plotly (https://plotly.com/python/) for creating the plots. This approach allowed the display of gene expression patterns across various cell types and the representation of clustering outcomes. The resulting images were stored and presented through HTML embedding, allowing for interactive exploration and analysis of the single-cell RNA sequencing data.
2.2.4. Survival analysis using Kaplan-Meier curves: Patient survival data and gene or protein expression data from each dataset were utilized to create Kaplan-Meier survival plots. A Perl script developed in-house was employed to generate input files for survival analysis, which included details such as patient id, survival time (days/months), patient vital status (alive or deceased), and sample categories such as high-expression and low/medium-expression groups. Patient categorization for survival analysis was performed as previously described in Chandrashekar et al 66. To conduct multivariate analyses, clinical features such as race, sex, subtype, and grade, among others, were considered in relation to the expression and survival information. The "survival" and "survminer" packages in R were utilized for univariate and multivariate survival analyses, and statistical significance was assessed using log-rank tests (https://cran.r-project.org/web/packages/survminer/index.html). Finally, in-house JavaScript Kaplan-Meier plots were created for genes in the dataset for which survival information was available.
2.2.5. Visualization of ChIP-seq data: To facilitate the interactive visualization of data from ChIP-seq analysis, the MammOnc-DB platform incorporated the "igv.js" JavaScript developed by the IGV team (https://github.com/igvteam/igv.js/) for peak calling. Bigwig files and broadpeak/narrowpeak files from ChIP-seq data analysis were loaded to igv.js to generate IGV plots.
2.3. Web server Configuration:
MammOnc-DB operates on a CentOS server that has 72 cores (Intel® Xeon® CPU E2–2699 v3 @ 2.30GHz), 98 GB of RAM, and 22 TB HDD. To provide users with a seamless experience, the user interface of MammOnc-DB was created using PERL-CGI hosted on the Apaches webserver (https://httpd.apache.org/).