MammOnc-DB, an integrative breast cancer data analysis platform for target discovery

doi:10.21203/rs.3.rs-4926362/v1

This work is licensed under a CC BY 4.0 License

Breast cancer (BCa) is one of the most common malignancies among women worldwide. It is a complex disease that is characterized by morphological and molecular heterogeneity. In the early stages of the disease, most BCa cases are treatable, particularly hormone receptor-positive and HER2-positive tumors. Unfortunately, triple-negative BCa and metastases to distant organs are largely untreatable with current medical interventions. Recent advances in sequencing and proteomic technologies have improved our understanding of the molecular changes that occur during breast cancer initiation and progression. In this era of precision medicine, researchers and clinicians aim to identify subclass-specific BCa biomarkers and develop new targets and drugs to guide treatment. Although vast amounts of omics data, including single-cell sequencing data, can be accessed through public repositories, there is a lack of user-friendly platforms that integrate information from multiple studies. Thus, to meet the need for a simple yet effective and integrative BCa tool for multi-omics data analysis and visualization, we developed a comprehensive BCa data analysis platform called MammOnc-DB (http://resource.path.uab.edu/MammOnc-Home.html), comprising data from more than 20,000 BCa samples. MammOnc-DB was developed to provide a unique resource for hypothesis generation and testing, as well as for the discovery of biomarkers and therapeutic targets. The platform also provides pre- and post-treatment data, which can help users identify treatment resistance markers and patient groups that may benefit from combination therapy.

Biological sciences/Cancer/Breast cancer

Biological sciences/Cancer/Tumour biomarkers

Breast cancer (BCa) is one of the most common cancers in women worldwide. Since the mid-2000s, the incidence of BCa has increased by approximately 0.5% annually ¹. The etiology of BCa involves factors such as genetic predisposition, lifestyle changes, and aging ². Genetic mutations, familial history, demographic variables, medical background, and modifiable risk factors such as obesity, alcohol intake, and smoking are involved in its development ^3–5. BCa tumors are classified into distinct subtypes (Luminal A, Luminal B, HER2+, and TNBC), characterized by the expression levels of estrogen and progesterone receptors and of HER2 in tumor cells. Hormone receptor-expressing and HER2-positive tumors have viable treatment options ^4,6. Early-stage BCa is considered curable; however, despite significant progress in diagnosis and treatment, advanced or metastatic disease is associated with high mortality. Although BCa initially responds to treatment, it can eventually recur and develop therapy resistance ^7,8. Moreover, the heterogeneity of BCa poses a substantial challenge in diagnosis and treatment, requiring precision medicine to address the diverse molecular subtypes involved ⁹.

With the availability of high-throughput technologies from advanced molecular profiling, such as next-generation sequencing and mass spectrometry, researchers can evaluate specific biomarkers and molecular signatures associated with tumor subtypes and identify potential therapeutic targets ¹⁰. Although data from next-generation sequencing have shed light on the molecular evolution of BCa, it is necessary to understand and process these molecular data with clinical information to enhance the capability of precision medicine and precision targeting approaches ¹¹. Although large amounts of data are available in public repositories, there are opportunities to develop user-friendly resources that allow cancer researchers to leverage the data effectively.

Large-scale cancer “Omics” data, generated using various techniques such as microarray, bulk RNA-seq, scRNA-seq, ChIP-seq, ATAC-seq, and MS/MS for genetic, epigenetic, and proteomic profiling, are archived in numerous public repositories. For a researcher with limited bioinformatics support, performing an in-depth analysis of the volume of genomic and proteomic data available for BCa is challenging. A focused and comprehensive web resource that provides integrative analysis, including data for metastatic BCa and response to BCa treatments, would be useful. Recognizing this unmet need and the opportunity to build a comprehensive resource that facilitates BCa data analysis and visualization, we developed MammOnc-DB, a user-friendly portal for integrative analysis and visualization of BCa data.

MammOnc-DB incorporates data that were collected, curated, and integrated from the NCBI Gene Expression Omnibus (GEO). In addition, we used the Proteomics Identifications Database (PRIDE) and ProteomeXchange to obtain proteomic data. MammOnc-DB also contains multi-omics data from The Cancer Genome Atlas (TCGA), the Clinical Proteomic Tumor Analysis Consortium (CPTAC), METABRIC, the Cancer Cell Line Encyclopedia (CCLE), and the Sweden Cancerome Analysis Network – Breast (SCAN-B) Consortium. Our data procurement and processing covered multi-omics studies with data for normal breast tissue, primary BCa tissue, and metastatic BCa samples, together with associated clinical information. In addition, we included data on BCa patients treated with various therapies.

Using MammOnc-DB, researchers can access multi-omic and multiple publicly available BCa datasets. It provides information and enables users to analyze the expression of genes (mRNAs, miRNAs, and lncRNAs) and proteins in primary and metastatic BCa along with available normal samples and across tumor subgroups based on tumor stage, tumor grade, race, molecular subtype, histological subtype, or other available clinicopathologic features. By utilizing MammOnc-DB to identify differentially expressed genes, one can identify the top differentially expressed genes associated with specific clinical features. Additional options include Kaplan-Meier survival analysis and evaluation of epigenetic changes. Users can download high-resolution graphics depicting expression profiles and patient survival information in various forms.

MammOnc-DB enables researchers to utilize high-throughput BCa omics data to identify potential biomarkers and therapeutic targets for BCa. Furthermore, selected genes can be validated in silico using the independent studies integrated into this platform. With subgroup-specific data analysis, one can identify gene alterations in subsets of BCa, allowing the development of hypotheses and testing of the biology underlying this dysregulation. In the future, our goal is to populate the MammOnc-DB platform with additional data as they become available.

2.1. Data Collection and Analysis:

2.1.1 TCGA, CPTAC, SCAN-B, METABRIC, and CCLE:

The Cancer Genome Atlas (TCGA) provides data on genomics and transcriptomics for various cancers. We downloaded RNA-sequencing data for TCGA breast cancer (BRCA) from the Genomic Data Commons (https://portal.gdc.cancer.gov/). As TCGA provided level-3 data, we did not perform additional data processing. In addition, we downloaded methylation data from TCGA BRCA using the DownloadMethylationData() function from TCGA-assembler (https://ccte.uchicago.edu/TCGA-Assembler/index.php). Unwanted column information in the data was removed using ProcessMethylation450Data(). When CpG sites corresponded to more than one gene, average methylation values were calculated using CalculateSingleValueMethylationData().

We also obtained processed transcriptomic data from studies such as SCAN-B, ABiM_405, ABiM_100, OSLO2-EMIT0 ^12-15, Creighton Breast Tumor Compendium^16,17, Van de Vijver et al.¹⁸, Neo-adjuvant Chemotherapy Response Compendium dataset ¹⁹, and METABRIC dataset²⁰ through literature search. These studies included gene expression values along with the patient clinical features.

From the Human Cancer Cell Line Encyclopedia (CCLE) and DepMap portal (https://depmap.org/portal/download/all/), CRISPR knockout screens of BCa cell lines were obtained as gene-effect scores from Achilles and Sanger’s SCORE project. The scores were normalized so that nonessential genes had a median score of 0, while independently identified common essential genes had a median score of -1. Gene-effect scores were inferred using Chronos²¹. The integration of the Broad and Sanger datasets followed the methodology outlined by Pacini et al., with the exception that quantile normalization was omitted²².
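
The scaling described above (nonessential median at 0, common-essential median at -1) can be illustrated with a small linear rescaling. This is only a sketch under stated assumptions: the gene names, the raw scores, and the function name `scale_gene_effects` are hypothetical, and the actual DepMap/Chronos processing involves additional steps that are not reproduced here.

```python
from statistics import median

def scale_gene_effects(scores, nonessential, essential):
    """Linearly rescale scores so that the nonessential median maps to 0
    and the common-essential median maps to -1 (illustrative only)."""
    med_non = median(scores[g] for g in nonessential)
    med_ess = median(scores[g] for g in essential)
    span = med_non - med_ess
    if span == 0:
        raise ValueError("control gene sets have identical medians")
    return {gene: (value - med_non) / span for gene, value in scores.items()}

# Hypothetical raw scores for one screen (not real DepMap values).
raw = {"NONESS_1": 0.2, "NONESS_2": 0.4, "ESS_1": -1.8, "ESS_2": -2.2, "QUERY": -0.5}
nonessential = ["NONESS_1", "NONESS_2"]
essential = ["ESS_1", "ESS_2"]

scaled = scale_gene_effects(raw, nonessential, essential)
```

After rescaling, a score near 0 suggests little effect of the knockout, while a score near -1 is comparable to losing a common essential gene.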

In addition, we downloaded the BCa proteomics data from Clinical Proteomic Tumor Analysis Consortium (CPTAC) from Proteomics Data Commons (https://proteomics.cancer.gov/programs/cptac). The integration and analysis of these data have been previously reported ^23,24. In summary, protein expression values downloaded from the CPTAC data portal were log2 normalized for each sample. Z-values for each protein in each sample were then calculated as the number of standard deviations from the median across samples.
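
As a rough illustration of the two steps summarized above, the sketch below log2-transforms intensities and then expresses each value as the number of standard deviations from the per-protein median across samples. It assumes the standard deviation is the ordinary sample standard deviation, which the text does not specify; the intensity values and the function name `median_centered_z` are hypothetical.

```python
import math
from statistics import median, stdev

def median_centered_z(values):
    """Return (value - median) / standard deviation for each value.
    Assumes the sample standard deviation; constant rows map to 0."""
    center = median(values)
    spread = stdev(values)
    if spread == 0:
        return [0.0 for _ in values]
    return [(v - center) / spread for v in values]

# Hypothetical raw intensities for one protein across four samples.
intensities = {"TK1": [1024.0, 2048.0, 4096.0, 512.0]}

log2_values = {p: [math.log2(v) for v in vals] for p, vals in intensities.items()}
z_values = {p: median_centered_z(vals) for p, vals in log2_values.items()}
```

Centering on the median rather than the mean makes the reference point less sensitive to a few samples with extreme expression.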

2.1.2. RNA-seq Data Analysis:

We procured raw data from NCBI GEO for GSE58135 ²⁵, GSE142731 ²⁶, GSE183947 ²⁷, GSE100925 ²⁸, GSE47462 ²⁹, GSE184196 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE184196), GSE122630 ³⁰, GSE163882 ³¹, GSE130660 ³², GSE99063 ³³, GSE68359 ³⁴, and GSE131276 ³⁵. The raw data were downloaded using the fastq-dump function from the SRA Toolkit (https://github.com/ncbi/sra-tools). Adapter sequences in the downloaded fastq files were trimmed, and read quality was checked, using Trim Galore (https://github.com/FelixKrueger/TrimGalore). The trimmed reads were mapped to the hg38 genome using the HISAT2 (https://daehwankimlab.github.io/hisat2/) alignment tool, followed by BAM conversion and sorting with SAMtools ³⁶. Gene counts were obtained from the BAM files using the htseq-count function of HTSeq ³⁷. The gene counts were converted either to FPKM or to RPKM using an R or a Python package, respectively (https://github.com/AAlhendi1707/countToFPKM). When raw data were not available, as for GSE209998 ³⁸, GSE173661 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE173661), and GSE96058 ¹⁵, we procured the processed data and performed downstream analysis. Statistical analysis was conducted with an unpaired Welch t-test.
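
For readers unfamiliar with the unit, the count-to-FPKM conversion mentioned above can be sketched as follows. This is a simplified illustration, not the packages used in the study: it assumes the plain gene length in base pairs and uses the sum of counted fragments as the library size, whereas dedicated tools may use an effective length that accounts for fragment size. Gene names and numbers are hypothetical.

```python
def counts_to_fpkm(counts, lengths_bp):
    """FPKM = fragments * 1e9 / (gene length in bp * total counted fragments)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library has no counted fragments")
    return {gene: n * 1e9 / (lengths_bp[gene] * total) for gene, n in counts.items()}

# Hypothetical counts and gene lengths for a two-gene library.
counts = {"GENE_A": 500, "GENE_B": 1500}
lengths_bp = {"GENE_A": 1000, "GENE_B": 3000}

fpkm = counts_to_fpkm(counts, lengths_bp)
```

In this toy example GENE_B has three times the counts of GENE_A but is also three times as long, so both genes receive the same FPKM value.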

2.1.3. Gene Expression Array Data Analysis

For the “Creighton Breast Tumor Compendium” dataset ^16,17 of nine separate breast tumor expression profiling datasets for survival analysis, gene transcription profiling datasets (all on the Affymetrix U133 array, A set, and all with DMFS as an outcome measure) were obtained from previous studies (Loi, GEO:GSE6532; Wang, GEO:GSE2034; Desmedt, GEO:GSE7390; Miller, GEO:GSE3494; Schmidt, GEO:GSE11121; Zhang, GEO:GSE12093; Minn, GEO:GSE2603 and GEO:GSE5327; Chin, http://cancer.lbl.gov/breastcancer/data.php). Genes within each dataset were first normalized to standard deviations from the median; samples from the Loi dataset that were also represented in Desmedt were excluded from Loi. When multiple gene array probe sets referenced the same gene, the probe set with the highest average variation across samples for the nine datasets was selected to represent the gene.
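
The probe-set selection rule can be sketched as below. The sketch assumes that "variation" is measured as the sample standard deviation within each dataset and then averaged across datasets; the probe identifiers, values, and the function name `pick_probe_per_gene` are hypothetical.

```python
from statistics import mean, stdev

def pick_probe_per_gene(probe_to_gene, datasets):
    """For each gene, keep the probe set whose standard deviation,
    averaged over all datasets, is largest.
    `datasets` is a list of dicts mapping probe id -> expression values."""
    avg_sd = {p: mean(stdev(d[p]) for d in datasets) for p in probe_to_gene}
    best = {}
    for probe, gene in probe_to_gene.items():
        if gene not in best or avg_sd[probe] > avg_sd[best[gene]]:
            best[gene] = probe
    return best

# Two hypothetical probe sets that map to the same gene, in two datasets.
probe_to_gene = {"probe_a": "PSAT1", "probe_b": "PSAT1"}
datasets = [
    {"probe_a": [1.0, 2.0, 3.0], "probe_b": [1.0, 1.0, 1.2]},
    {"probe_a": [2.0, 4.0, 6.0], "probe_b": [5.0, 5.0, 5.0]},
]

selected = pick_probe_per_gene(probe_to_gene, datasets)
```

Here `probe_a` varies more in both datasets, so it is the one retained to represent the gene.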

For the chemotherapy response expression compendium dataset¹⁹, we previously assembled a compendium of eight different public breast cancer expression datasets^39-45, involving gene expression profiling of pre-treatment breast tumor biopsies from patients treated with neoadjuvant chemotherapy, with patient response recorded at the end of treatment. The compendium, representing 1240 tumor expression profiles, involved all datasets being generated using the same Affymetrix gene array platform. We normalized the expression values within each dataset in the same manner as described above for the Creighton dataset.

2.1.4. Proteomics Data Analysis Pipeline:

We obtained raw files for studies such as PXD012431 ⁴⁷ and PXD018830 ⁴⁸ from PRIDE. The output files from PRIDE were converted to raw format using msConvert ⁴⁶. MaxQuant and the Andromeda search engine were used to process the downloaded MS/MS data, with reference to the Homo sapiens UniProt proteome (UP000005640) ⁴⁹. The MaxQuant parameters were set based on the proteolytic enzyme used, fixed and variable modifications, quantification approach, and data acquisition method. To perform downstream statistical analysis, the output files from the MaxQuant analysis were used as input files for Perseus ⁵⁰. NA values were eliminated from the resulting file, considering the condition that the row should have only three or fewer values. The values were then log-normalized for further analysis. We also downloaded processed gene-level proteomics data from the research article by Anurag M et al. ⁵¹.

2.1.5. ChIP-seq Data Analysis:

The data associated with the GSE85158 ⁵², GSE117941 ⁵³, and GSE178373 ⁵⁴ studies were downloaded from NCBI GEO using fastq-dump from the SRA Toolkit (https://github.com/ncbi/sra-tools). The quality of the raw data was assessed with FastQC (https://github.com/s-andrews/FastQC), and adapter sequences were then removed using Trim Galore (https://github.com/FelixKrueger/TrimGalore). The trimmed reads were aligned to the human reference genome (hg38) using BWA mem ⁵⁵. Duplicate reads were identified using Picard (https://github.com/broadinstitute/picard), and technical replicates were then merged using SAMtools ³⁶. The resulting BAM files were converted to BED and bigWig files using the BamToBed and bamCoverage tools ⁵⁶. Peak calling was performed with MACS2 ⁵⁷ (narrow peaks for transcription factors and broad peaks for histone modifications), with input DNA or IgG as controls.
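
The narrow-versus-broad distinction in the peak-calling step can be illustrated by assembling a `macs2 callpeak` command line. This is a hedged sketch: the exact parameters used for MammOnc-DB are not reported, the file names are hypothetical, and the helper `macs2_callpeak_args` only builds the argument list without running anything. The flags shown (`-t`, `-c`, `-f`, `-g`, `-n`, `--broad`) are standard MACS2 options.

```python
def macs2_callpeak_args(treatment_bam, control_bam, name, target_type):
    """Build an argument list for `macs2 callpeak`.
    target_type: "tf" (narrow peaks, the MACS2 default) or
                 "histone" (broad peaks via --broad)."""
    if target_type not in ("tf", "histone"):
        raise ValueError("target_type must be 'tf' or 'histone'")
    args = [
        "macs2", "callpeak",
        "-t", treatment_bam,   # ChIP sample
        "-c", control_bam,     # input DNA or IgG control
        "-f", "BAM",
        "-g", "hs",            # human effective genome size preset
        "-n", name,            # prefix for output files
    ]
    if target_type == "histone":
        args.append("--broad")
    return args

# Hypothetical file names, for illustration only.
tf_cmd = macs2_callpeak_args("er_chip.bam", "input.bam", "ER_MCF7", "tf")
histone_cmd = macs2_callpeak_args("h3k27me3_chip.bam", "input.bam", "H3K27me3", "histone")
```

The resulting lists could be passed to `subprocess.run` on a system where MACS2 is installed.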

2.1.6. scRNA-seq Data Analysis:

The processed data for BCa single-cell sequencing were downloaded from the Curated Cancer Cell Atlas (https://www.weizmann.ac.il/sites/3CA/) ⁵⁸. We procured the associated data and metadata files for the studies by Qian et al ⁵⁹, Gao et al ⁶⁰, Azizi et al ⁶¹, Wu et al ⁶², and Griffiths et al ⁶³. Using the Seurat R package, we retained only cells (barcodes) with at least 1000 detected genes ⁶⁴. The filtered cell counts were normalized, batch-corrected using Harmony, and annotated based on the available clinical features ⁶⁵.
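
The cell-filtering criterion can be expressed in a few lines. The sketch below is a plain-Python analogue of the rule, not the Seurat call used in the study; it assumes a gene counts as "detected" when its count is greater than zero, and the barcodes, genes, and function name are hypothetical. The toy example lowers the threshold so that the effect is visible on three genes.

```python
def filter_barcodes(counts_by_barcode, min_genes=1000):
    """Keep barcodes in which at least `min_genes` genes have a nonzero count."""
    kept = {}
    for barcode, gene_counts in counts_by_barcode.items():
        n_detected = sum(1 for c in gene_counts.values() if c > 0)
        if n_detected >= min_genes:
            kept[barcode] = gene_counts
    return kept

# Hypothetical toy matrix: barcode -> {gene: count}.
toy_counts = {
    "AAAC": {"CD3E": 4, "ARID5B": 0, "PTPRC": 2},
    "TTTG": {"CD3E": 0, "ARID5B": 1, "PTPRC": 0},
}

kept_cells = filter_barcodes(toy_counts, min_genes=2)
```

With the default `min_genes=1000`, both toy barcodes would be discarded, which is the intended behavior for low-complexity cells.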

2.2. Data formatting and visualization:

We integrated genomic, proteomic, and epigenetic studies into a user-friendly web resource built using PERL CGI. The data analysis results are presented as interactive visualizations built with public and in-house JavaScript libraries and Python Flask applications.

Using R and PERL scripts, gene expression matrix files from RNA-seq and scRNA-seq studies and protein expression matrix files from proteomic studies were categorized based on tumor grade, tumor stage, patient’s age, patient’s race, nodal metastasis status, molecular subtype, treatment, and other associated categories.

Categorized and formatted data files were utilized to generate various graphical outputs such as heatmaps, box plots, jitter plots, Kaplan-Meier curves, UMAP plots, and violin plots as representations that address heterogeneity by comparing gene/protein expression along with various clinical features in each dataset.

ChIP-seq results highlighting epigenetic modifications near the gene region are displayed as IGV plots.

2.2.1. Visualization of differentially expressed genes: Heatmap visualization was employed to display the most differentially expressed mRNAs, miRNAs, lncRNAs, and proteins in various BCa datasets. To compile a list of the top 250 genes that exhibited either over-expression or under-expression in each subtype, we initially identified genes with FPKM values that displayed significant differences (p-values < 0.05). From this initial selection, we considered only genes with a median FPKM value of 1 or higher. Finally, the genes were ranked based on the ratio of the mean FPKM values in tumor samples to the mean FPKM values in normal samples. To generate an interactive heatmap illustrating the top over- and under-expressed genes in a dataset, we utilized the Highcharts JavaScript library (http://www.highcharts.com/).
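
The three-step selection described in 2.2.1 (significance filter, expression filter, ranking by mean ratio) might look like the following in outline. This is a sketch, not the portal's code: p-values are taken as precomputed inputs, the median is assumed to be computed over tumor and normal samples combined (the text does not say which samples are used), genes with a zero normal mean are skipped, and all gene names and values are hypothetical.

```python
from statistics import mean, median

def rank_genes(tumor, normal, p_values, top_n=250, alpha=0.05, min_median=1.0):
    """Return (over_expressed, under_expressed) gene lists, each ordered
    from the most extreme tumor/normal mean ratio."""
    candidates = []
    for gene in tumor:
        if p_values[gene] >= alpha:
            continue  # not significant
        if median(tumor[gene] + normal[gene]) < min_median:
            continue  # too lowly expressed
        normal_mean = mean(normal[gene])
        if normal_mean == 0:
            continue  # ratio undefined
        candidates.append((gene, mean(tumor[gene]) / normal_mean))
    candidates.sort(key=lambda item: item[1], reverse=True)
    over = [g for g, ratio in candidates if ratio > 1][:top_n]
    under = [g for g, ratio in reversed(candidates) if ratio < 1][:top_n]
    return over, under

# Hypothetical FPKM values and p-values.
tumor = {"GENE_A": [10, 12, 14], "GENE_B": [1, 1, 2], "GENE_C": [0.1, 0.2, 0.1], "GENE_D": [5, 6, 7]}
normal = {"GENE_A": [2, 3, 4], "GENE_B": [8, 9, 10], "GENE_C": [0.5, 0.4, 0.6], "GENE_D": [5, 6, 6]}
p_values = {"GENE_A": 0.001, "GENE_B": 0.002, "GENE_C": 0.01, "GENE_D": 0.6}

over, under = rank_genes(tumor, normal, p_values)
```

In the toy data, GENE_C is dropped by the expression filter and GENE_D by the significance filter, leaving one over-expressed and one under-expressed gene.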

2.2.2. Visualization of individual gene expression patterns: Box and Jitter plots were employed to depict the expression levels of the genes in normal samples, primary breast tumors, metastatic breast tumors, and various treatment groups, along with the associated clinical characteristics. The Highcharts library from JavaScript was used to generate the visualizations representing the interquartile range (IQR), including minimum, 25th percentile, median, 75th percentile, and maximum values, utilizing the data obtained from data formatting.
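
The five values behind each box can be computed as sketched below. Quartile conventions differ between tools; this sketch assumes the "inclusive" method of Python's `statistics.quantiles`, which may not match the convention used on the portal, and it reports the true minimum and maximum without any outlier handling. The expression values are hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return [minimum, 25th percentile, median, 75th percentile, maximum]."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return [data[0], q1, med, q3, data[-1]]

# Hypothetical expression values for one sample group.
summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
```

A list in this order (low, lower quartile, median, upper quartile, high) is a common input shape for box-plot series in charting libraries.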

2.2.3. Visualization of scRNA-seq based gene expression: The techniques utilized for visualizing single-cell RNA-seq data included UMAP, violin plots, and ridge plots. These visualizations were generated using Python, with pandas (https://pandas.pydata.org/) for data manipulation and Plotly (https://plotly.com/python/) for creating the plots. This approach allowed the display of gene expression patterns across various cell types and the representation of clustering outcomes. The resulting images were stored and presented through HTML embedding, allowing for interactive exploration and analysis of the single-cell RNA sequencing data.

2.2.4. Survival analysis using Kaplan-Meier curves: Patient survival data and gene or protein expression data from each dataset were utilized to create Kaplan-Meier survival plots. A Perl script developed in-house was employed to generate input files for survival analysis, which included details such as patient id, survival time (days/months), patient vital status (alive or deceased), and sample categories such as high-expression and low/medium-expression groups. Patient categorization for survival analysis was performed as previously described in Chandrashekar et al ⁶⁶. To conduct multivariate analyses, clinical features such as race, sex, subtype, and grade, among others, were considered in relation to the expression and survival information. The "survival" and "survminer" packages in R were utilized for univariate and multivariate survival analyses, and statistical significance was assessed using log-rank tests (https://cran.r-project.org/web/packages/survminer/index.html). Finally, in-house JavaScript Kaplan-Meier plots were created for genes in the dataset for which survival information was available.
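
To make the Kaplan-Meier step concrete, the following sketch computes the product-limit survival estimate for a single patient group from survival times and vital status. It is a teaching example only: the study used the R "survival" and "survminer" packages, the high versus low/medium grouping follows Chandrashekar et al. and is not reproduced here, and the times and events below are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimator.
    times:  follow-up time per patient
    events: 1 if the patient died at that time, 0 if censored
    Returns [(time, survival_probability)] at each time with a death."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather all patients sharing this time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored patients leave the risk set
    return curve

# Hypothetical follow-up (months) and vital status for five patients.
curve = kaplan_meier([5, 8, 12, 12, 20], [1, 0, 1, 1, 0])
```

Running the same function separately on the high-expression and low/medium-expression groups gives the two curves that a log-rank test would then compare.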

2.2.5. Visualization of ChIP-seq data: To facilitate the interactive visualization of data from ChIP-seq analysis, the MammOnc-DB platform incorporated the "igv.js" JavaScript library developed by the IGV team (https://github.com/igvteam/igv.js/). BigWig files and broadPeak/narrowPeak files from ChIP-seq peak calling were loaded into igv.js to generate IGV plots.

2.3. Web server Configuration:

MammOnc-DB operates on a CentOS server that has 72 cores (Intel® Xeon® CPU E5-2699 v3 @ 2.30GHz), 98 GB of RAM, and a 22 TB HDD. To provide users with a seamless experience, the user interface of MammOnc-DB was created using PERL-CGI hosted on the Apache web server (https://httpd.apache.org/).

3.1. Overview:

Figure 1 provides an overview of MammOnc-DB, and Supplementary Table 1 lists the currently available studies within the MammOnc-DB.

The MammOnc-DB homepage allows users to select the type of omics they are interested in, such as gene expression, protein expression, or gene regulation, through the menu bar. The platform also contains a tutorial page to assist users in using the portal effectively.

The functionality of MammOnc-DB extends to various types of analysis, which are described in the following sections.

3.2. Heatmap facilitating identification of top differentially expressed genes.

The gene expression page of MammOnc-DB features a left panel that allows users to identify genes that are either over- or under-expressed in a dataset (Fig. 2A). For instance, if a user selects “TNBC” under “SCAN-B” in Panel 1, they will be directed to a dedicated page that displays the over-expressed and under-expressed genes in the form of a heatmap. Figure 2B shows a heatmap representing the top 25 genes that are over- or under-expressed, comparing non-TNBC tumors (n = 8332) and TNBC tumors (n = 874) in the SCAN-B dataset. This page allows users to identify up to the top 250 over- or under-expressed genes in the dataset. Moreover, by clicking on a gene name in the chosen study, users can access expression information about that gene in that study. Our portal also offers the option of identifying over- and under-expressed lncRNAs and miRNAs using heatmaps (Supplementary Fig. 1).

3.3. Identifying the expression pattern of a queried gene across different datasets with subgroup classifications:

3.3.1. Overview of gene expression and survival analysis using bulk RNA-seq and microarray datasets:

Using Panel 2 on the gene expression page, users can search for a gene of interest, specify whether it is a protein-coding gene, miRNA, or lncRNA, and analyze its expression pattern across a range of datasets in relation to various clinicopathologic features (Fig. 2A). On the gene expression page, users can select either "bulk RNA-sequencing" or "scRNA-seq" data, enter their gene of interest, and choose a study from the available choices (Fig. 3A). MammOnc-DB currently offers 20 studies for bulk RNA-seq (TCGA-BRCA, SCAN-B, ABiM_405, ABiM_100, OSLO2-EMIT0, GSE58135, GSE142731, GSE183947, GSE100925, GSE47462, GSE184196, GSE122630, GSE163882, GSE130660, GSE99630, GSE68359, GSE131276, GSE209998, GSE173661, and GSE96058), two microarray studies (METABRIC and Van de Vijver et al.), two microarray compendium datasets (Creighton breast tumor compendium and Neo-adjuvant chemotherapy compendium), and five scRNA-seq studies (Qian et al., Gao et al., Wu et al., Azizi et al., and Griffiths et al.), which are categorized into primary, metastatic, and treatment-related studies of BCa.

For example, a user can type PSAT1 in the text box, choose “protein-coding” as the gene type, and select the “METABRIC” study. Clicking the “Submit” button leads to an intermediate page displaying the gene name, analysis types, and external links to additional resources (Fig. 3A). Clicking the "Expression" button directs users to the expression page, where box and jitter plots with corresponding p-values for various categories are presented; the statistical analysis is an unpaired Welch t-test. Figure 3B shows a boxplot that illustrates the expression pattern of PSAT1 in the METABRIC study. It compares ER-negative (n = 429) and ER-positive (n = 1445) patients and shows a statistically significant difference, with a p-value less than 0.001. Users can also view the results as jitter plots by clicking the corresponding button. Examples of PSAT1 expression in METABRIC, based on PR status and on PAM50 and Claudin subtype, are shown as jitter plots in Fig. 3B. Additional studies and classifications for different genes are represented in Supplementary Fig. 2.
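
For reference, the unpaired Welch t-test compares two groups without assuming equal variances. In the standard formulation, with group means \bar{x}_1 and \bar{x}_2, sample variances s_1^2 and s_2^2, and group sizes n_1 and n_2 (notation introduced here for illustration, not taken from the portal), the test statistic and the approximate degrees of freedom are:

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
{\dfrac{\left(s_1^2 / n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2 / n_2\right)^2}{n_2 - 1}}
```

The p-value is then obtained from a t distribution with \nu degrees of freedom. This form is well suited to comparisons such as ER-negative versus ER-positive tumors, where group sizes and variances can differ considerably.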

The DepMap button at the bottom allows users to access a dataset consisting of 40 BCa cell lines and their corresponding gene effect scores. These scores are derived from CRISPR knockout screens, as described by Dempster et al.²¹ This feature allows users to assess the impact of gene knockout in each cell line. An example of PSAT1 gene knockout and the associated gene effect scores in various breast cancer cell lines is depicted as a bar plot in Fig. 3C.

In addition to analyzing gene expression, users can utilize the “Survival” button to perform Kaplan-Meier analysis for their genes of interest. The survival profile of PSAT1 in the METABRIC dataset shows that higher expression of PSAT1 was significantly associated with poor survival (p < 0.001), as illustrated in Fig. 3D. Supplementary Figs. 3A and 3B present additional multivariate Kaplan-Meier plots of lncRNA (PCAT1) and miRNA (hsa-mir-7706) from TCGA dataset.

3.3.2. Single-cell RNA-seq data analysis:

Furthermore, users can retrieve scRNA-seq data through the gene expression section, allowing them to discern expression patterns within various clusters visualized as UMAP, violin, and ridge plots (see Fig. 4). As an example, the expression pattern of ARID5B in the Azizi et al. study is shown as UMAP, violin, and ridge plots comparing its expression in different subclasses of T cells. Additional studies and classifications are presented in Supplementary Fig. 4.

3.4. Analyzing the expression patterns of target proteins across various datasets and patient subgroups

Users can determine the expression pattern of a specific protein by utilizing the protein expression page in MammOnc-DB. This page was designed similarly to the gene expression page. Users can input the name of the gene of interest for the available studies (CPTAC, Tommaso De Marchi et al. (PXD012431), Goming et al. (PXD018830), and Anurag M et al.), and the protein expression results are displayed as box and jitter plots (Fig. 5A). An illustrative example of TK1 expression is shown in Fig. 5B, which displays the total and phosphoprotein expression of TK1 in relation to various clinical features. Additional studies and classifications are presented in Supplementary Fig. 5.

3.5. Transcription Factor Binding Site Analysis: ChIP-seq Data Exploration:

Processed ChIP-seq datasets are incorporated into MammOnc-DB to evaluate histone modifications and ER ligand treatment in different BCa cell lines (GSE85158, GSE117941, and GSE178373). To facilitate interpretation, ChIP-seq results are presented in an interactive genome visualization format. Users can enter a specific gene and observe the binding of markers in either the promoter or gene body regions (Fig. 6A). Figure 6B shows a graphical representation of ChIP-seq results in MammOnc-DB. The figure displays the binding patterns of ER bound to different ligands (Tamoxifen, E2, GD 0927, and GNE 274) at the STK11 genomic location in the MCF7 cell line, visualized in IGV.

Case studies have also been included and are available in Supplementary Document 1.

Large-scale cancer omics data have been generated owing to advancements in high-throughput technologies, including sequencing techniques, and a reduction in the cost of sequencing. Omics data are critical for understanding the molecular changes and mechanisms underlying breast cancer development and progression, which can help to identify biomarkers and therapeutic targets. To maximize the utility of publicly available multi-omics data, there is a need for an easy-to-use web portal that enables researchers and clinicians to perform comprehensive analyses of these data and visualize them. Data collection, processing, and analysis require dedicated effort from experts in various fields, including pathology, computational biology, and statistics.

We created the MammOnc-DB platform to focus explicitly on BCa-related omics data analysis and visualization. While our previous effort, UALCAN, provides pan-cancer data analysis ^66,67, MammOnc-DB incorporates transcriptomics and proteomics data from various consortia and public repositories. This platform utilizes bulk RNA-seq, single-cell RNA-seq (scRNA-seq), ChIP-seq, and mass spectrometry (MS) data. Bulk RNA-seq provides a comprehensive view of gene expression patterns across tumor tissues, offering a broad understanding of the transcriptional landscape. Conversely, scRNA-seq explores the heterogeneity of cells, uncovering distinctive cell populations within tumors. This level of analysis is essential for identifying rare cell types, elucidating tumor progression, and mapping cellular lineage connections. Additionally, scRNA-seq data can unveil specific transcriptional profiles of individual cell types, which may be obscured in bulk RNA-seq data, facilitating a more accurate identification of potential therapeutic targets and biomarkers. ChIP-seq enables the discovery of DNA-protein interactions and epigenetic changes, shedding light on the regulatory processes governing gene expression. This approach is necessary for understanding the impact of transcription factors and other regulatory proteins on the advancement of BCa. In addition, MS investigations unveil the proteomic profile, outlining protein levels, post-translational modifications, and interactions between proteins. By combining these sets of data, a holistic understanding of the molecular changes in BCa can be achieved.

Integrating multi-omics data in MammOnc-DB allows users to conduct in-silico analysis and validation of target genes that are specific to various tumor subgroups. This functionality facilitates hypothesis generation based on available data. Moreover, the platform serves as a tool for discovering new biomarkers crucial for early detection, prognosis, and prediction of responses to treatment. By analyzing pre- and post-treatment data, researchers and clinicians can identify markers that indicate therapy response, which could guide clinical decision-making. Incorporating gene expression, gene regulation, and protein data enhances the reliability of the identified biomarkers. Despite the advancements and the potential of MammOnc-DB, limitations should be acknowledged. Due to the lack of access to raw data, different normalization methods were present in the processed data, which could introduce variability and affect the comparability and interpretation of the results. Since MammOnc-DB relies on publicly available datasets, there is a potential for bias introduced by the selection and representation of these datasets.

We will maintain platform dynamics by integrating into MammOnc-DB additional molecular datasets, such as DNA copy number alterations, DNA methylation data from Illumina arrays, and information on transcription factor binding using ChIP-Seq data. Further, we will include additional datasets as they become available. Furthermore, we intend to analyze and include spatial transcriptomics data from public repositories. We expect to be responsive to user needs and suggestions when possible and will upgrade MammOnc-DB as appropriate. In summary, MammOnc-DB will serve as a valuable resource for BCa researchers and clinicians, enabling them to explore the diverse multi-omics data related to BCa and facilitating discoveries of BCa biomarkers and targets.

5. Data and Code Availability:

The pre-processed data in this portal are available in the designated references. The underlying code for this portal is not publicly available but may be made available by the corresponding author to researchers on reasonable request.

6. Authorship contribution statement

Santhosh Kumar Karthikeyan: Conceptualization, Methodology, Formal analysis, Software, Validation, Project administration, Writing – original draft. Darshan S. Chandrashekar: Conceptualization, Methodology, Formal analysis, Software, Validation, Project administration, Writing – original draft. Snigdha Sahai: Validation. Sadeep Shresta: Writing – review & editing. Ritu Aneja: Writing – review & editing. Rajesh Singh: Writing – review & editing. Celina Kleer: Writing, discussions – review & editing. Harikrishna Nakshatri: Writing, discussions – review & editing. Sidharth Kumar: Writing, discussions – review & editing. Steve Qin: Writing, discussions – review & editing. Upender Manne: Writing, discussions – review & editing. Chad J. Creighton: Resources, Formal analysis, Writing – review & editing. Sooryanarayana Varambally: Conceptualization, Methodology, Validation, Writing – original draft, Writing – review & editing, Supervision.

7. Competing interests: All authors declare no financial or non-financial competing interests.

Acknowledgments

This study was supported by the UAB Department of Pathology, the UAB O'Neal Comprehensive Cancer Center, and the UAB Heersink School of Medicine. S.V. and U.M. were supported by funding from U54 CA118948. C.J.C. was supported by grant CA125123. S.V. also received support from the Breast Cancer Research Foundation of Alabama (BCRFA). We thank Harshith Kadaiah for coding support, Dr. Donald Hill of the UAB O'Neal Comprehensive Cancer Center for help in editing this manuscript, and Israel Ponce-Rodriguez for support in maintaining the MammOnc-DB server. We thank Highcharts (https://www.highcharts.com/) for the graphic scripts.

Siegel, R. L., Giaquinto, A. N. & Jemal, A. Cancer statistics, 2024. CA Cancer J Clin 74, 12-49 (2024). https://doi.org:10.3322/caac.21820
Łukasiewicz, S. et al. Breast Cancer-Epidemiology, Risk Factors, Classification, Prognostic Markers, and Current Treatment Strategies-An Updated Review. Cancers (Basel) 13 (2021). https://doi.org:10.3390/cancers13174287
Harbeck, N. et al. Breast cancer. Nat Rev Dis Primers 5, 66 (2019). https://doi.org:10.1038/s41572-019-0111-2
Loibl, S., Poortmans, P., Morrow, M., Denkert, C. & Curigliano, G. Breast cancer. Lancet 397, 1750-1769 (2021). https://doi.org:10.1016/s0140-6736(20)32381-3
Sun, Y. S. et al. Risk Factors and Preventions of Breast Cancer. Int J Biol Sci 13, 1387-1397 (2017). https://doi.org:10.7150/ijbs.21635
Provenzano, E., Ulaner, G. A. & Chin, S. F. Molecular Classification of Breast Cancer. PET Clin 13, 325-338 (2018). https://doi.org:10.1016/j.cpet.2018.02.004
Agostinetto, E., Gligorov, J. & Piccart, M. Systemic therapy for early-stage breast cancer: learning from the past to build the future. Nat Rev Clin Oncol 19, 763-774 (2022). https://doi.org:10.1038/s41571-022-00687-1
Burguin, A., Diorio, C. & Durocher, F. Breast Cancer Treatments: Updates and New Challenges. J Pers Med 11 (2021). https://doi.org:10.3390/jpm11080808
Tsimberidou, A. M., Fountzilas, E., Nikanjam, M. & Kurzrock, R. Review of precision cancer medicine: Evolution of the treatment paradigm. Cancer Treat Rev 86, 102019 (2020). https://doi.org:10.1016/j.ctrv.2020.102019
Ahmed, Z. Practicing precision medicine with intelligently integrative clinical and multi-omics data analysis. Hum Genomics 14, 35 (2020). https://doi.org:10.1186/s40246-020-00287-z
Naithani, N., Sinha, S., Misra, P., Vasudevan, B. & Sahu, R. Precision medicine: Concept and tools. Med J Armed Forces India 77, 249-257 (2021). https://doi.org:10.1016/j.mjafi.2021.06.021
Staaf, J. et al. RNA sequencing-based single sample predictors of molecular subtype and risk of recurrence for clinical assessment of early-stage breast cancer. NPJ Breast Cancer 8, 94 (2022). https://doi.org:10.1038/s41523-022-00465-3
Saal, L. H. et al. The Sweden Cancerome Analysis Network - Breast (SCAN-B) Initiative: a large-scale multicenter infrastructure towards implementation of breast cancer genomic analyses in the clinical routine. Genome Med 7, 20 (2015). https://doi.org:10.1186/s13073-015-0131-9
Aure, M. R. et al. Integrative clustering reveals a novel split in the luminal A subtype of breast cancer with impact on outcome. Breast Cancer Res 19, 44 (2017). https://doi.org:10.1186/s13058-017-0812-y
Brueffer, C. et al. Clinical Value of RNA Sequencing-Based Classifiers for Prediction of the Five Conventional Breast Cancer Biomarkers: A Report From the Population-Based Multicenter Sweden Cancerome Analysis Network-Breast Initiative. JCO Precis Oncol 2 (2018). https://doi.org:10.1200/po.17.00135
Creighton, C. J. The molecular profile of luminal B breast cancer. Biologics 6, 289-297 (2012). https://doi.org:10.2147/btt.S29923
Kessler, J. D. et al. A SUMOylation-dependent transcriptional subprogram is required for Myc-driven tumorigenesis. Science 335, 348-353 (2012). https://doi.org:10.1126/science.1212728
van de Vijver, M. J. et al. A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 347, 1999-2009 (2002). https://doi.org:10.1056/NEJMoa021967
Creighton, C. J. Gene Expression Profiles in Cancers and Their Therapeutic Implications. Cancer J 29, 9-14 (2023). https://doi.org:10.1097/ppo.0000000000000638
Pereira, B. et al. The somatic mutation profiles of 2,433 breast cancers refines their genomic and transcriptomic landscapes. Nat Commun 7, 11479 (2016). https://doi.org:10.1038/ncomms11479
Dempster, J. M. et al. Chronos: a cell population dynamics model of CRISPR experiments that improves inference of gene fitness effects. Genome Biol 22, 343 (2021). https://doi.org:10.1186/s13059-021-02540-7
Pacini, C. et al. Integrated cross-study datasets of genetic dependencies in cancer. Nat Commun 12, 1661 (2021). https://doi.org:10.1038/s41467-021-21898-7
Chen, F., Chandrashekar, D. S., Varambally, S. & Creighton, C. J. Pan-cancer molecular subtypes revealed by mass-spectrometry-based proteomic characterization of more than 500 human cancers. Nat Commun 10, 5679 (2019). https://doi.org:10.1038/s41467-019-13528-0
Monsivais, D. et al. Mass-spectrometry-based proteomic correlates of grade and stage reveal pathways and kinases associated with aggressive human cancers. Oncogene 40, 2081-2095 (2021). https://doi.org:10.1038/s41388-021-01681-0
Varley, K. E. et al. Recurrent read-through fusion transcripts in breast cancer. Breast Cancer Res Treat 146, 287-297 (2014). https://doi.org:10.1007/s10549-014-3019-2
Saleh, M. et al. Comparative analysis of triple-negative breast cancer transcriptomics of Kenyan, African American and Caucasian Women. Transl Oncol 14, 101086 (2021). https://doi.org:10.1016/j.tranon.2021.101086
Zhang, Y. et al. Identification of Five Cytotoxicity-Related Genes Involved in the Progression of Triple-Negative Breast Cancer. Front Genet 12, 723477 (2021). https://doi.org:10.3389/fgene.2021.723477
Cassetta, L. et al. Human Tumor-Associated Macrophage and Monocyte Transcriptional Landscapes Reveal Cancer-Specific Reprogramming, Biomarkers, and Therapeutic Targets. Cancer Cell 35, 588-602.e510 (2019). https://doi.org:10.1016/j.ccell.2019.02.009
Brunner, A. L. et al. A shared transcriptional program in early breast neoplasias despite genetic and clinical distinctions. Genome Biol 15, R71 (2014). https://doi.org:10.1186/gb-2014-15-5-r71
Bownes, R. J. et al. On-treatment biomarkers can improve prediction of response to neoadjuvant chemotherapy in breast cancer. Breast Cancer Res 21, 73 (2019). https://doi.org:10.1186/s13058-019-1159-3
Chen, J. et al. Machine learning models based on immunological genes to predict the response to neoadjuvant therapy in breast cancer patients. Front Immunol 13, 948601 (2022). https://doi.org:10.3389/fimmu.2022.948601
Turnbull, A. K. et al. Unlocking the transcriptomic potential of formalin-fixed paraffin embedded clinical tissues: comparison of gene expression profiling approaches. BMC Bioinformatics 21, 30 (2020). https://doi.org:10.1186/s12859-020-3365-5
Barakat, T. S. et al. Functional Dissection of the Enhancer Repertoire in Human Embryonic Stem Cells. Cell Stem Cell 23, 276-288.e278 (2018). https://doi.org:10.1016/j.stem.2018.06.014
Mohammed, H. et al. Progesterone receptor modulates ERα action in breast cancer. Nature 523, 313-317 (2015). https://doi.org:10.1038/nature14583
Wahdan-Alaswad, R. S. et al. Thyroid hormone enhances estrogen-mediated proliferation and cell cycle regulatory pathways in steroid receptor-positive breast Cancer. Cell Cycle, 1-20 (2023). https://doi.org:10.1080/15384101.2023.2249702
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009). https://doi.org:10.1093/bioinformatics/btp352
Anders, S., Pyl, P. T. & Huber, W. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166-169 (2015). https://doi.org:10.1093/bioinformatics/btu638
Garcia-Recio, S. et al. Multiomics in primary and metastatic breast tumors from the AURORA US network finds microenvironment and epigenetic drivers of metastasis. Nat Cancer 4, 128-147 (2023). https://doi.org:10.1038/s43018-022-00491-x
Horak, C. E. et al. Biomarker analysis of neoadjuvant doxorubicin/cyclophosphamide followed by ixabepilone or Paclitaxel in early-stage breast cancer. Clin Cancer Res 19, 1587-1595 (2013). https://doi.org:10.1158/1078-0432.Ccr-12-1359
Iwamoto, T. et al. Gene pathways associated with prognosis and chemotherapy sensitivity in molecular subtypes of breast cancer. J Natl Cancer Inst 103, 264-272 (2011). https://doi.org:10.1093/jnci/djq524
Hatzis, C. et al. A genomic predictor of response and survival following taxane-anthracycline chemotherapy for invasive breast cancer. JAMA 305, 1873-1881 (2011). https://doi.org:10.1001/jama.2011.593
Shen, K. et al. Cell line derived multi-gene predictor of pathologic response to neoadjuvant chemotherapy in breast cancer: a validation study on US Oncology 02-103 clinical trial. BMC Med Genomics 5, 51 (2012). https://doi.org:10.1186/1755-8794-5-51
Korde, L. A. et al. Gene expression pathway analysis to predict response to neoadjuvant docetaxel and capecitabine for breast cancer. Breast Cancer Res Treat 119, 685-699 (2010). https://doi.org:10.1007/s10549-009-0651-3
Prat, A. et al. Research-based PAM50 subtype predictor identifies higher responses and improved survival outcomes in HER2-positive breast cancer in the NOAH study. Clin Cancer Res 20, 511-521 (2014). https://doi.org:10.1158/1078-0432.Ccr-13-0239
Miyake, T. et al. GSTP1 expression predicts poor pathological complete response to neoadjuvant chemotherapy in ER-negative breast cancer. Cancer Sci 103, 913-920 (2012). https://doi.org:10.1111/j.1349-7006.2012.02231.x
Adusumilli, R. & Mallick, P. Data Conversion with ProteoWizard msConvert. Methods Mol Biol 1550, 339-368 (2017). https://doi.org:10.1007/978-1-4939-6747-6_23
Gomig, T. H. B. et al. High-throughput mass spectrometry and bioinformatics analysis of breast cancer proteomic data. Data Brief 25, 104125 (2019). https://doi.org:10.1016/j.dib.2019.104125
De Marchi, T. et al. Proteogenomic Workflow Reveals Molecular Phenotypes Related to Breast Cancer Mammographic Appearance. J Proteome Res 20, 2983-3001 (2021). https://doi.org:10.1021/acs.jproteome.1c00243
Tyanova, S., Temu, T. & Cox, J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat Protoc 11, 2301-2319 (2016). https://doi.org:10.1038/nprot.2016.136
Tyanova, S. et al. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat Methods 13, 731-740 (2016). https://doi.org:10.1038/nmeth.3901
Anurag, M. et al. Proteogenomic Markers of Chemotherapy Resistance and Response in Triple-Negative Breast Cancer. Cancer Discov 12, 2586-2605 (2022). https://doi.org:10.1158/2159-8290.Cd-22-0200
Franco, H. L. et al. Enhancer transcription reveals subtype-specific gene expression programs controlling breast cancer pathogenesis. Genome Res 28, 159-170 (2018). https://doi.org:10.1101/gr.226019.117
Guan, J. et al. Therapeutic Ligands Antagonize Estrogen Receptor Function by Impairing Its Mobility. Cell 178, 949-963.e918 (2019). https://doi.org:10.1016/j.cell.2019.06.026
Furman, C. et al. Covalent ERα Antagonist H3B-6545 Demonstrates Encouraging Preclinical Activity in Therapy-Resistant Breast Cancer. Mol Cancer Ther 21, 890-902 (2022). https://doi.org:10.1158/1535-7163.Mct-21-0378
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589-595 (2010). https://doi.org:10.1093/bioinformatics/btp698
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-842 (2010). https://doi.org:10.1093/bioinformatics/btq033
Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol 9, R137 (2008). https://doi.org:10.1186/gb-2008-9-9-r137
Gavish, A. et al. Hallmarks of transcriptional intratumour heterogeneity across a thousand tumours. Nature 618, 598-606 (2023). https://doi.org:10.1038/s41586-023-06130-4
Qian, J. et al. A pan-cancer blueprint of the heterogeneous tumor microenvironment revealed by single-cell profiling. Cell Res 30, 745-762 (2020). https://doi.org:10.1038/s41422-020-0355-0
Gao, R. et al. Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes. Nat Biotechnol 39, 599-608 (2021). https://doi.org:10.1038/s41587-020-00795-2
Azizi, E. et al. Single-Cell Map of Diverse Immune Phenotypes in the Breast Tumor Microenvironment. Cell 174, 1293-1308.e1236 (2018). https://doi.org:10.1016/j.cell.2018.05.060
Wu, S. Z. et al. A single-cell and spatially resolved atlas of human breast cancers. Nat Genet 53, 1334-1347 (2021). https://doi.org:10.1038/s41588-021-00911-1
Griffiths, J. I. et al. Serial single-cell genomics reveals convergent subclonal evolution of resistance as early-stage breast cancer patients progress on endocrine plus CDK4/6 therapy. Nat Cancer 2, 658-671 (2021). https://doi.org:10.1038/s43018-021-00215-7
Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol 33, 495-502 (2015). https://doi.org:10.1038/nbt.3192
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods 16, 1289-1296 (2019). https://doi.org:10.1038/s41592-019-0619-0
Chandrashekar, D. S. et al. UALCAN: A Portal for Facilitating Tumor Subgroup Gene Expression and Survival Analyses. Neoplasia 19, 649-658 (2017). https://doi.org:10.1016/j.neo.2017.05.002
Chandrashekar, D. S. et al. UALCAN: An update to the integrated cancer data analysis platform. Neoplasia 25, 18-27 (2022). https://doi.org:10.1016/j.neo.2022.01.001

SupplTable1MammoncDatasets.pdf
SupFig1.pdf
Supplementary Figure 1: Heatmaps showing the top over-expressed and under-expressed genes in the TCGA dataset. (A) Top 25 over- and under-expressed lncRNAs comparing TNBC breast tumors and normal tissue. (B) Top 25 over- and under-expressed miRNAs comparing luminal breast tumors and normal tissue.
SupFig2.pdf
Supplementary Figure 2: Box-whisker plots showing expression of genes in various studies. Expression of IDO1 by (A) PAM50 and (B) endocrine treatment sub-classification in Brueffer C et al., GSE96058. (C) Expression of CCDC74B based on treatment response in Chen J et al., GSE163882. (D) CHEK2 expression based on metastatic site in Garcia-Recio S et al., GSE209998. GATA3 expression in (E) ER-positive and (F) TNBC tumors compared to normal adjacent tissues in Varley K E et al., GSE58135. (G) HIPK2 expression pattern in MDA-MB-361 cells based on control and abemaciclib treatment in Goel et al., GSE99063. (H) Expression of NEK2 in a PDX model comparing E2, E2+TH, E2+Tam, and E2+TH+Tam from Wahdan-Alaswad R et al., GSE131276.
SupFig3.pdf
Supplementary Figure 3: Kaplan-Meier plots showing the association of genes and other clinical features with patient survival in TCGA. (A) Kaplan-Meier plot illustrating the association of PCAT1 expression and race with patient survival. (B) Kaplan-Meier plot illustrating the association of hsa-mir-7706 expression and race with patient survival.
SupFig4.pdf
Supplementary Figure 4: Visualization of scRNA-seq data. (A) Expression of the MUC1 gene based on cell type in Qian et al. (2020), depicted as a UMAP plot, feature plot, and violin plot. MUC1 is expressed at higher levels in malignant cells than in normal cells. (B) Expression of the MLPH gene based on sample type in Gao et al. (2020), depicted as a UMAP plot, feature plot, and violin plot.
SupFig5.pdf
Supplementary Figure 5: Box-whisker plots showing expression of proteins in various studies. (A) Expression of GSTK1 (C9JNT3) based on sample type in Song et al., PXD012431. (B) Expression of DECR1 based on chemotherapy treatment response in Anurag M et al., PMID: 36001024.

Editorial decision: revise (29 Oct, 2024)
Review #1 received at journal (20 Oct, 2024)
Review #2 received at journal (15 Oct, 2024)
Reviewer #2 agreed at journal (07 Oct, 2024)
Reviewer #1 agreed at journal (25 Sep, 2024)
Reviewers invited by journal (11 Sep, 2024)
Editor assigned by journal (19 Aug, 2024)
Submission checks completed at journal (19 Aug, 2024)
First submitted to journal (16 Aug, 2024)