Prostate tissues cohort, ethics approval and consent to participate
Our cohort includes chemo-naïve patients followed in the Urology Division of Fondazione IRCCS Ca' Granda - Ospedale Maggiore Policlinico (Milan) who underwent the transrectal ultrasound-guided systematic sampling of prostate tissue (TRUS biopsy). This diagnostic procedure was performed as part of routine clinical management following the detection of abnormal digital rectal examination or an elevated PSA blood level. According to the current standard for the detection of PCa, 14 ultrasound-guided biopsy cores (diagnostic biopsies), seven from each side of the prostate gland, were collected from each patient for the clinical diagnosis. During the same procedure, one additional biopsy core to be used for our research project (research biopsy), was taken from a site directly adjacent to one of the diagnostic biopsies. The institutional ethics committee board of IRCCS Foundation Ca’ Granda Ospedale Maggiore Policlinico authorized this study (authorization n 1063). All specimens were obtained after the patients had provided written informed consent following the ethical principles of biomedical research on biospecimens. To ensure optimal recovery of living cells, fresh surgical prostate biopsy specimens were placed in ice-cold saline buffer and directly transported to the laboratory within 1 hour from sample collection. Tissue samples were then immediately processed to preserve the intranuclear genomic and protein architectures of cell nuclei. Samples selected for epigenome studies consist of 17 fresh biopsies divided on the bases of the histology and the spatial distribution of the positive cores in two different groups: 10 PCa biopsies from patients with histologically confirmed prostate cancer and 7 from patients who had no cancer in any biopsy core. All the clinical data were provided by the Urology division preserving the confidentiality of patient personal data.
Tissues processing
The biopsy specimen was stored at 4-8°C for maximum 3 hours before dissociation, avoiding its freezing. Given the little size of needle biopsy cores (typically 20-30 mg) we empirically determined the digestion condition to achieve the optimal recovery of living cells. Briefly, the biopsy tissue was transferred in a 2 ml microcentrifuge tube and rinse twice with 1 ml of ice-cold sterile PBS 1X. Then, the tissue was cut into small pieces (~1 mm) with autoclaved surgical scissors directly in the microcentrifuge tube. The resulting minced tissue was enzymatically digested by adding 1 ml of prewarm HBSS (Gibco, 14025) containing 200 units of collagenase type I (Life Technologies, 17018-029) per ~10 mg of tissue plus 67 µg DNase I (Sigma-Aldrich, 10104159001). Tissue dissociation was carried out in a water bath at 37 °C shaking vigorously for 10 seconds every 5 minutes. To prevent tissue over-digestion, the best incubation time for each experiment was established by monitoring cell viability, cell debris and aggregates every 10 minutes (usually ranges from 1 to 1.5 hour). After completed digestion cells were washed once by topping up to 2 ml with RPMI + 10% FBS and spun down at 300 g. The cell pellet was resuspended in RPMI + 10% FBS and dispersed by passing through a 75µm cells strainer, followed by an additional wash of the filter with RPMI + 10% FBS. Finally, the cells were spun down at 300 g and resuspended in 1 ml of ice-cold PBS for counting with a hemocytometer.
Histological and immunofluorescence evaluation
One third portion of each research biopsy tissue specimen was embedded in Killik (Bio-Optica, 05-9801), immediately frozen in precooled isopentane (MilliporeSigma, 277258) and stored at -80°. OCT- embedded biopsy cores (one per patient) were serially sectioned with 10 μm thickness in a cryostat at -20°C. Ten slides per patient, containing multiple sections representing distinct regions of the same tissue were prepared in parallel and stored at -80°C. Hematoxylin and Eosin staining (H&E) was performed using H&E Staining Kit (Abcam, ab245880). The H&E-stained slides were reviewed by an expert genitourinary pathologist to assign the Gleason Score according to the International Society of Urinary Pathology grading system. The same pathologist evaluated all specimens (diagnostic and research biopsies) presented in this work.
For immunofluorescence, frozen tissue sections were briefly thawed at room temperature (RT), placed in precooled acetone for 20 minutes at -20°C and washed 3 times in PBS 1X for 2 minutes at RT. To permeabilize tissues the sections were incubated with 0.5% Triton X-100 in PBS 1X for 10 minutes at RT with mild agitation. After three washes in PBS 1X for 2 minutes at RT, coverslips were blocked by incubating with 5% Donkey serum, 3% BSA in PBS 1X for 1 hour at RT. After three washes in PBS 1X for 2 minutes at RT, coverslips were incubated with 1:500 Lamin A/C primary antibody (Santa Cruz sc-6215) 1:500 in blocking solution overnight at 4°C in a humidified chamber, isolating the tissue section by a hydrophobic pen. The following day, sections were washed 3 times in PBS 1X for 2 minutes at RT and incubated with fluorescence-conjugated secondary antibody (Alexa FluorTM 488-conjugated Donkey Anti-goat IgG Invitrogen, A-31571), 1:500 3% BSA in PBS 1X for 1 hour at RT in a dark humidified chamber. After three washes in PBS 1X for 2 minutes at RT, the coverslips were incubated with Hoechst solution (Thermo Fisher Scientific H3570, 1:500) for 10 minutes at RT in the dark, rinsed in PBS 1X and mounted on microscope slides with ProLong antifade mounting media (Thermo Fisher Scientific, P36930).
Images were acquired using a Leica TCS SP5 confocal microscope with a HCX Plan Apo ×63/1.40 objective, with a 5-zooming factor. The NIS-Elements v.5.30 software (Nikon-Lim) was employed to analyze nuclei physical parameters including area and circularity. Hoechst staining facilitated nucleus identification and delineation of the region of interest. Over 5 field of views (FOVs) were acquired and analyzed per sample, totaling the examination of more than 70 nuclei within each sample of the cohort (seven non-neoplastic controls, CTR and ten primary prostate cancer patients, PCa). The mean nuclear area or circularity was calculated for each sample and plotted alongside other samples within the same group on the charts. The analysis of Hoechst and Lamin A/C distribution across nuclei was conducted using Fiji software. Leveraging the plot profile plugin, a 10 µm fixed line was implemented to trace and capture the fluorescence intensity distribution across each nucleus. To standardize comparisons between diverse samples, a normalization step was performed. This involved calibrating the fluorescence intensity of individual pixels along the line to the mean intensity derived from the entire set of pixel lines. For each sample the mean distribution values for both Hoechst and Lamin A/C were calculated and graphically represented alongside other samples within the same group.
Flow cytometry Analysis
To quantify the relative amounts of cell populations in each biopsy, 10,000 cells from the digestion step were stained, acquired on a BD FACSCantoII Flow Cytometer and analyzed with FlowJo software in the INGM FACS facility. To avoid unspecific binding, antibodies were incubated with PBS-BSA 1% for 30 minutes at 4°C. TO-PRO®-3 stain (Thermofisher, T3605) was used to assess cell viability. Tissue resident leukocytes were identified as CD326-/CD45+ (EpCAM-CD326 FITC, B347197; CD45 Pacific Blu Biolegend, 982306); epithelia cells as CD326+/CD45-, and stromal cells as CD326-/CD45-.
4f-SAMMY-seq isolation of chromatin fractions
Chromatin fractionation on tissue-derived single cell suspension was performed with minor adaptations to the protocol described in 16. Cells were counted, washed in cold PBS and resuspended in 600ul cold cytoskeleton triton buffer, CSK-Tr: 10 mM PIPES pH 6,8; 100 mM NaCl; 1 mM EGTA; 300 mM Sucrose; 3 mM MgCl2; 1X Protease Inhibitor Cocktail (Roche, 04693116001); 1 mM PMSF (Sigma-Aldrich, 93482); 1 mM DTT; 0,5% Triton X-100. After 10 minutes on a rotator at 4°C, samples were centrifugated for 3 minutes at 900g at 4°C and supernatant was collected as S1 fraction. Pellets were washed for 10 minutes on the rotator at 4°C with an additional volume of the same buffer followed by centrifugation for 3 minutes at 900g at 4°C. Chromatin was then digested by using 25 units of DNase I (Invitrogen, AM2222) in 100ul of cytoskeleton CSK buffer: 10 mM PIPES pH 6.8; 100 mM NaCl; 1 mM EGTA; 300 mM Sucrose; 3 mM MgCl2; 1mM PMSF; 1X Protease Inhibitor Cocktail for 60 minutes at 37°C. To stop digestion, ammonium sulphate was added to samples to a final concentration of 250 mM and, after 5 minutes on ice, samples were pelleted at 900g for 3 minutes at 4°C and the supernatant was collected as S2 fraction. After a wash of 10 minutes on a rotator at 4°C with 200ul of CSK buffer followed by centrifugation for 3 minutes at 3000g at 4°C, the pellet was further extracted 10 minutes on a rotator at 4°C in 100ul of CSK-NaCl buffer: 10 mM PIPES pH 6.8; 2 M NaCl; 1 mM EGTA; 300 mM Sucrose; 3 mM MgCl2; 1mM PMSF; 1X Protease Inhibitor Cocktail, centrifuged at 2300 g 3 minutes at 4°C and the supernatant was collected as S3 fraction. Pellets were washed twice for 10 minutes on the rotator at 4°C followed by centrifugation at 3000 g 3 minutes at 4°C with 200ul of CSK-NaCl buffer. Finally, pellets were solubilized in 100ul of 8M urea for 10 minutes at room temperature and labelled as S4 fraction. Fractions were stored at –80°C until DNA extraction.
4f-SAMMY-seq DNA extraction, library preparation and sequencing
Chromatin fractions (S2, S3 and S4) were diluted 1:2 in 1X TE buffer (10mM TrisHCl pH 8.0, 1 mM EDTA) and incubated with 3 U of RNAse cocktail (Ambion, AM2286) at 37° for 90 minutes, followed by 40μg of Proteinase K (Invitrogen, AM2548) at 55° for 150 minutes. Genomic DNA was then isolated using phenol/chloroform (Sigma-Aldrich, 77617) extraction, followed by a back extraction of phenol/chloroform with additional volume of TE1X. DNA was precipitated in 2 volumes of cold ethanol, 0.3M sodium acetate and 20ug glycogen (Ambion AM9510) for 1 hour on dry ice or overnight at -20°C. Dry pellets were resuspended in 50 ul (S2) or 15 ul (S3 and S4) of nuclease-free water and incubated at 4°C overnight. On the next day, S2 was further purified using PCR DNA Purification Kit (Qiagen, 28106) and separated using AMPure XP paramagnetic beads (Beckman Coulter, A63880) with the ratio 0.90/0.95 to obtain smaller fragments conserved as S2S (< 300 bp) and larger fragments labelled as S2L (> 300bp) fractions. Both were resuspended in 20 ul of nuclease-free water and then reduced to 15ul using a centrifugal vacuum concentrator. S2L, S3 and S4 fractions were sonicated in a Covaris M220 focused-ultrasonicator using screw cap microTUBEs (Covaris, 004078) to obtain a smear of DNA fragments peaking at 150-200 bp (water bath 20°C, peak power 30.0, duty factor 20.0, cycles/burst 50, 150 seconds for S2L and 175 seconds for S3 and S4). Fractions were quantified using Qubit 4 fluorometer with Qubit dsDNA HS Assay Kit (Invitrogen, Q32854). Libraries were generated from each sample using NEBNext Ultra II DNA Library Prep Kit for Illumina (NEB, E7645L) and Unique Dual Index NEBNextMultiplex Oligos for Illumina (NEB, E6440S). Libraries were then qualitatively and quantitatively checked on and run on an Agilent 2100 Bioanalyzer using High Sensitivity DNA Kit (Agilent, 5067-4626). Libraries with distinct adapter indexes were then multiplexed and, after cluster generation on FlowCell, sequenced for 50 bases in single or paired-end reads mode on an IlluminaNovaSeq 6000 instrument at the IEO Genomic Unit in Milan.
RNA extraction, library preparation and sequencing
Ten thousand cells retrieved from the digestion step were stabilized in 200µl of 1Thioglycerol/Homogenization Solution of the Maxwell® RSC miRNA Tissue Kit (Promega, AS1460) and stored frozen at -80°C for total RNA automated purification using Maxwell® RSC 48 Instrument (Promega, AS8500). Total RNA was quantified by Qubit 4 fluorometer with Qubit RNA HS Assay Kit (Invitrogen, Q32852) and assessed by Agilent 2100 Bioanalyzer using Agilent RNA 6000 Pico Kit (Agilent, 5067-1513) to inspect RNA integrity. For each sample, 1 ng of total RNA was used to construct strand specific RNA-seq libraries with SMARTer Stranded Total RNA-Seq Kit - Pico Input (Takara, 634487). The yield and quality of the libraries were evaluated on Agilent 2100 Bioanalyzer using High Sensitivity DNA Kit (Agilent, 5067-4626). RNA-seq libraries were sequenced on the Illumina NextSeq™ 550 system at the sequencing facilities of Humanitas or Fondazione IRCCS Ca' Granda-Ospedale Maggiore Policlinico in Milan in paired-ends mode.
ChIP-seq public datasets
The ChIP-seq data of histone post translational modifications (histone marks) for non-neoplastic prostate gland tissue were downloaded from the following datasets available on ENCODE data portal 20: ENCSR763IDK (H3K27ac), ENCSR768PFZ (H3K36me3) and ENCSR133QBG (H3K9me3). These data were downloaded as raw high-throughput sequencing reads (FASTQ) and analyzed as described in the next sections.
ChIP-seq and 4f-SAMMY-seq high-throughput sequencing reads data analyses
High-throughput sequencing reads were trimmed using Trimmomatic (v0.39) 51 with the following parameters for 4f-SAMMY-seq and ChIP-seq: 2 for seed_mismatch, 30 for palindrome_threshold, 10 for simple_threshold, 3 for leading, 3 for trailing and 4:15 for sliding window. The sequence minimum length threshold of 35 was applied to all datasets. We used the Trimmomatic “TruSeq3-SE.fa” (for single-end reads) and “TruSeq3-PE-2.fa” (for paired-end reads) as clip files. After trimming, reads were aligned using BWA (v0.7.17-r1188) 52 setting –k parameter as 2 and using as reference genome the UCSC hg38 genome (only canonical chromosomes were used) and the output saved in BAM file format. We uniformly used and aligned only a single read per DNA fragment for both single-end and paired-end sequencing samples. The PCR duplicates were marked with Picard (v2.22; https://github.com/broadinstitute/picard) MarkDuplicates option, then filtered using Samtools (v1.9) 53. In addition, we filtered all the reads with mapping quality lower than 1. Each sequencing lane was analyzed separately and then merged at the end of the process.
Reads distribution profiles analysis
To compute reads distribution profiles (genomic tracks) for individual fractions of 4f-SAMMY-seq, we used Deeptools (v3.4.3) 54 bamCoverage function. For these analyses the genome was binned at 50bp, the reads extended up to 250 bp and Deeptools RPKM normalization method was used. We considered a genome size of 2,701,495,761 bp (value suggested in the Deeptools manual https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html) and we excluded regions known to be problematic in terms of sequencing reads coverage using the blacklist from the ENCODE portal (https://www.encodeproject.org/files/ENCFF356LFX/).
To compute the genomic tracks for ChIP-seq IP over INPUT enrichment profiles (log2 normalized ratio) we used the SPP R package (v1.16.0) 55 and R statistical environment (v3.5.2). The reads were imported from the BAM files using the “read.bam.tags” function, then filtered using “remove.local.tag.anomalies” and finally the comparisons were performed using the function “get.smoothed.enrichment.mle” setting “tag.shift = 0” and “background.density.scaling = TRUE”. The resulting enrichment signal corresponds to a log2 normalized ratio between the pair of sequencing samples.
For computing the relative comparisons between two 4f-SAMMY-seq fractions (relative enrichment, i.e. log2 normalized ratio) we used the same procedure described above for ChIP-seq IP over INPUT enrichment profiles. We defined the "solubility profile" as the relative enrichment ratio of 4f-SAMMY-seq sequencing reads distribution along the genome for S2L vs S3 fractions. We used S3 as baseline in this ratio as it is the second most consistent fraction among controls (Extended Data Fig. 3d).
To compute correlations between genomic tracks, we used R (v3.5.2) base function "cor" with “method = Spearman”. The genomic tracks were imported in R using the rtracklayer (v1.42.2) 56 library. Then the original genomic track files with 50bp resolution were re-binned by averaging data at 150kb resolution using the function “tileGenome” and the correlation was computed per chromosome. The correlation values obtained for each chromosome were then summarized in one value describing the genome-wide sample correlations through a weighted mean, where the weight of each chromosome corresponds to its length.
Open histone marks (H3K4me1, H3K36me3) peaks were called with MACS (v2.2.9.1) using a broad-cutoff of 0.1 57.
H3K9me3 peaks were defined using the EDD (v1.1.19) software 58 with parameters (binsize = 150 Kb and gap penalty = 25) processing the filtered bam files obtained as described above. The “required_fraction_of_informative_bins” parameter was set to 0.98. The blacklisted regions were defined as for the reads distribution profile analyses described above.
The enrichment boxplots shown in Fig. 2e was produced retrieving for each histone mark peak the genomic bins that fall in that genomic region. For each group of patients (CTR, LDD and HDD groups) and for each genomic bin we plot the mean 4f-SAMMY-seq solubility profile across the group. The values used were the quantile normalized tracks (across all samples) S2LvsS3 produced using SPP (see “High-throughput sequencing reads analyses”). Genomic tracks obtained by SPP or Deeptools were saved in bigWig file format.
Genomic tracks visualization
The visualization of genomic tracks was performed with Gviz R library (v1.26.5) 59. The track profile was calculated using the function “DataTrack” (the input bigWig file was imported using the function “import” of the rtracklayer library) and plot using the function “plotTracks”. Extended Data Fig. 3e has been created plotting each track as type "polygon" and all the tracks have been set using a window of 500 except for ChIP-seq of H3K27ac where the window has been set to 5000. It is worth remarking that the "window" parameter in the "DataTrack" function is referring to the number of intervals in which the displayed region is divided to plot the profile. As such, a larger number indicate a finer grain definition of the profile. As H3K27ac is known to have sharp peaks, we aimed to plot a finer grain profile. For Fig. 1c the window was set to 1000 and for all the tracks except for ChIP-seq of H3K27ac where we used 5000, i.e. the same value used for the Extended Data Fig. 3e. CTR and PCa samples track overlayed with confidence interval were drawn using at the same the time the types “a” and “confint”. The same settings of histone mark tracks of Fig. 1c were used for Fig. 2b.
Chromatin compartment analysis
Chromatin compartments were calculated using a revised version of the CALDER algorithm (version 1.0 60), as implemented in the original 4f-SAMMY-seq pipeline for the sub-compartment identification 16. Namely, for each chromosome: i) we calculated the four 4f-SAMMY-seq fractions (S2S, S2L, S3, S4) reads distribution profiles, binned at 150kb and normalized with RPKM (see "Reads distribution profiles analysis" section); ii) for each genomic bin, defined as a vector containing the four RPKM values (one for each fraction), we calculated the Euclidean distance (dist, R stats package, method="euclidean") with all the other bins of the same chromosome. These steps produced an NxN matrix (hereinafter referred to as biochemical similarity matrix), where N is the number of bins for the considered chromosome. Starting from the biochemical similarity matrix, we performed the CALDER procedure as described in 16, to derive the eigenvector and to reconstruct chromatin compartmentalization. Specifically, we called sub-compartments but limiting their segmentation to the highest level, thus obtaining only 2 compartments corresponding to "A" and "B" compartments (Fig. 2a, b).
The consensus of compartment calls across healthy controls has been produced labeling each genomic bin (150kb) according to its more frequent compartment across samples: e.g. if a bin is labeled as A in at least 4 out of 7 CTR samples, that bin will be defined as A in the consensus of controls, and vice versa for bins labelled as B in at least 4 controls (Fig. 2c). The definition of compartment shifts was based on the concordant or discordant compartment classification for each 150kb genomic bin (Extended Data Fig. 4a). The compartment domains shared across patients as shown in Extended Data Fig. 4a were reported using the library upsetR.
The clustering of patients in Fig. 2d were based on Jaccard Indexes (JI) of compartments concordance for each pair of patients. The resulting matrix of pairwise JI was grouped by hierarchical clustering based on the Euclidean distance (among rows or columns) and complete linkage using the pheatmap function in the pheatmap R library.
Cell type deconvolution
For cell type deconvolution analysis, we fist normalized for each sample raw count to TPM (Transcript per Million) by using R 3.6.1 environment and applying the following steps. First, read counts for each gene were normalized (divided) by the length of the gene in kilobases (RPK, Reads per kilobase). Then RPK values in a sample are summed and divided by 1.000.000 to obtain the scaling factor. Finally, RPK values for each gene are divided by the scaling factor.
For the deconvolution of the 3 macro populations (immune, epithelia, stroma cells) we defined a custom signature matrix based on single cell expression data of prostate tissue from the Human Proten Atlas (https://www.proteinatlas.org/; V. 21.1) 61–63. Following the cluster classification of Human Protein Atlas we considered as epithelia the subpopulations annotated as prostatic glandular cells, basal prostatic cells or urothelial cells. We considered as stroma the populations annotated as muscle cells, fibroblast or endothelial cells. We considered as immune cells infiltrate the populations annotated as t-cells or macrophages. Finally, we selected as representative genes for each population only those with a TPM > 10 for the specific population and a TPM < 2 in all the two remaining populations. We finally got a macropopulation signature matrix of 702 genes. We run the deconvolution analysis exploiting this signature matrix on our bulk data by using the relative mode of CIBERSORTX (V.1.0) 64 (Extended Data Fig. 4f).
To compute the immunity score on our biopsies we used the software CIBERSORTX (V.1.0) 64 applying the previously computed TPM matrix of our bulk RNA (mixture file) on the LM22.txt provided in the CIBERSORTX repository. The LM22 leukocyte signature is based on a matrix of 547 genes discriminating 22 human hematopoietic cell types isolated from PBMC. We enabled batch correction by using the B-mode correction option. We next run the analysis in absolute mode by applying 500 permutations. Each sample in the mixture file gets a score that reflects the absolute proportion of each cell types in LM22 mixture file. We finally computed the absolute leukocyte score: i.e. the sum for each sample of all the 22 scores of leukocytes populations. We compared the absolute leukocyte score for each biopsy across the different groups (CTR, LDD, HDD) (Extended Data Fig. 4g).
Copy Number Analysis (CNA)
For copy number analysis we used paired-end 4f-SAMMY-seq reads by treating fractions from the same sample as independent replicates. Data was processed with the pipeline nf-core/sarek v3.4.0 65,66 of the nf-core collection of workflows 67, using whole genome sequencing settings, and segmentation was called with CNVkit v0.9.10 68.
Transcriptomic analyses
For RNA-seq data, the overall quality of the sequenced reads was assessed using FastQC tool (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) (V. 0.11.8), then reads were trimmed with Trimmomatic software 51 (version 0.39) removing the adapters (ILLUMINACLIP:Picov2smart-PE.fa), primer dimers, and low quality bases at the beginning and at the end of the reads (trimmomatic PE phred33 LEADING:3 TRAILING:3 SLIDINGWINDOWD:4:15 MINLEN:36). STAR 69 (V. 2.7.0f_0328) was used to index (STAR --runMode genomeGenerate) the Human Genome (GENCODE Release 39, GRCh38 primary assembly genome 70) and to align sequenced reads in paired-end mode (--readFiles R1.FASTQ R2.FASTQ) on the indexed reference genome. Multimapping reads and PCR duplicate were marked in the final output (--bamRemoveDuplicatesType Unique) and unaligned reads stored in a different file (--outReadsUnmapped Fastx).
The read counts on genes were calculated using as a reference a GTF file with RefSeq annotation downloaded from UCSC (http://genome.ucsc.edu) stored in the following directory:(http://hgdownload.soe.ucsc.edu/goldenPath/archive/hg38/ncbiRefSeq/109.20190905/hg38.109.20190905.ncbiRefSeq.gtf.gz). This file was further processed to remove non canonical and mitochondrial chromosomes, selected only curated genes (NM, NR) and finally split in protein coding (NM) and noncoding (NR) files. Reads count was performed with HTSeq-count (V. 0.13.5) on bam files (previously generated by STAR) using as a feature the union of all exons in a gene. The type of library was specified with “-s reverse” parameter. The reads that align to more than one position in the reference genome were discarded (htseq-count --non-unique none). The full matrix with raw read counts for each sample were loaded in R 3.6.1 and normalized using DESeq2 71 median of ratios. Differential expression analysis was performed with DESeq2V. 1.26 72) using Wald test and the Benjamini and Hochberg correction for multiple tests, to compute p-values and adjusted p-values, respectively.
Enrichment analyses
Up-regulated genes (with adjusted p-value < 0.05 and Log2 Fold Change > 1) and Down-regulated genes (with adjusted p-value < 0.05, and Log2 Fold Change < -1) were used separately to query the gene-list enrichment tool Enrichr (https://maayanlab.cloud/Enrichr/) 73–75 over the main Gene Ontology classes (Biological Process - BP, Molecular Function - MF, Cellular Component - CC) updated to 2023 taking as final enriched terms only those with Benjamini-Hochberg adjusted p-value < 0.05. Genes within the recurrent compartment switch “A to B” and “B to A” were also queried over the main Gene Ontology classes taking also in this case the enriched terms with Benjamini-Hochberg adjusted p-value < 0.05. GSEA analysis 31 was performed in Preranked mode. Starting from raw p-values of all genes we generated a ranked matrix sorted by the score resulting from the following formula: -Log10(p-value) multiplied by the sign of the Log2 Fold Change. For the gene set references we downloaded H (Hallmark) and C5 (Ontology) GMT files from MSigDB. Finally, we performed the analysis using the parameters "Number of permutations": 1000; "Collapse": No; "Enrichment Statistic": classic; "Max size": 500; "Min size":15. Resulting geneset with a family-wise error rate (FWER) < 0.05 are finally sorted using the Normalized Enrichment Score (NES).
We then inferred pathway activity for the comparison of our tumor subtypes by using the R package decoupleR 76 (V. 2.4). We used the get_progeny (organism=human,top=100) function to get pathways from PROGENy (V.1.20) 32 a curated collection of signaling pathways derived from perturbation experiments, where each gene has a specific weight describing its positive or negative response to a given pathway stimulation. The top 100 responsive genes in the compendium ranked by p-value were used for each pathway. Starting from Wald statistic values previously computed for each gene with DESeq2 (LDDvsCTR, HDDvsCTR and HDDvsLDD comparisons) we run the function decouple with parameters statistics=c(“mlm”,”ulm”,”wsum”) and consensus_score=TRUE, to estimate a consensus score across the top performer methods (according to benchmark done by decoupleR developers) in pathway activity prediction. Multivariate linear model (mlm), Univariate linear model (ulm), Weighted Sum (wsum) are recommended by developers of decoupleR since these methods better estimate pathway activity considering the weights associated with pathway related genes.
Patient specific DEGs and Compartment switches
Significant Down-regulated genes have been identified using DESeq2 with the same workflow described above (adjusted p-value < 0.05 and absolute Fold Change value > 1), comparing each individual LDD and HDD patients against the group of 7 CTRs. Patients-specific compartment switches ("A to B" and "B to A") were defined with respect to the consensus of compartment calls across CTRs.
To define number of DEGs included in switch regions for each patient, patient-specific down-regulated DEGs were intersected to patient-specific A to B regions while up-regulated DEGs were intersected to patient-specific B to A regions. Then the gene lists resulting from all the patient-specific intersections of DEGs and compartment switches were merged and the reultin union gene list was used as input for functional classes enrichment analysis with Enrichr (https://maayanlab.cloud/Enrichr/).
TCGA Data
TCGA-PRAD RNA-Seq expression data were downloaded by using gdc-client tool on gdc_manifest.2022-08-08.txt metafile input downloaded from the GDC portal repository (https://portal.gdc.cancer.gov/; V.34.0; Release July, 27, 2022) by selecting from Web interface the TCGA project, prostate and RNA-seq count. The main clinical parameters (Gleason Score, T stage, N stage and age) and info about the samples (primary tumor, normal biopsy and metastatic) were selected from the same Web section of GDC portal repository by downloading clinical and sample-sheet files. ABSOLUTE tumor purity score 77 was downloaded from https://gdc.cancer.gov/about-data/publications/pancanatlas. The file is TCGA_mastercalls.abs_tables_JSedit.fixed.txt (Downloaded October ,13, 2022) 78. Excluding primary normal biopsies, we obtained from the TCGA repository described above the 500 FPKMUpperQuartile normalized expression counts from prostate primary tumors. The Biochemical Recurrence Free (BCR) status for each patient was obtained from PCaDB (http://bioinfo.jialab-ucr.org/PCaDB/; Release May, 10 2021; eSet V.1.3), downloading the TCGA-PRAD_eSet.RDS object and applying the pData function from Biobase package. We then excluded the only sample with a Gleason Score pattern of 2+4 never described in literature. Despite we performed expression analyses assigning a PCI score on the entire cohort of 499 samples, the BCR data available are 466.
PCI score (Prostate Compartmentalization Index)
Starting from our 101 DEGs of the HDD vs LDD comparison we defined two gene lists with 24 HDD Up-regulated and 77 HDD Down-regulated genes. We subset the TCGA matrix extracting only those genes matching with our full list of 101 genes. We performed a log transformation of TCGA gene expression matrix: Log2(matrix +1). The log transformed expression of each gene across the 499 samples was standardized into a z-score. For each sample we computed 2 median values on the z-score transformed matrix: the median of the 24 HDD Up-regulated genes (HDD_UP) and the median of the 77 HDD Down-regulated (HDD_DOWN) genes. Finally, we computed the PCI score defined as the difference of median values: PCI=median(HDD_UP) - median(HDD_DOWN). The resulting PCI score with a positive value will indicate that a sample has HDD-like transcriptomic features, whereas a PCI negative value will classify a sample as LDD-like.
Survival Analyses
Univariate and multivariate regression analysis were performed in R using survival and survminer packages. First, we performed a log-rank test to compare survival Kaplan-Meier curves (in terms of BCR STATUS AND BCR TIME) between HDD-like and LDD-like samples. Then we performed a multivariate Cox proportional-hazard model including the following covariates: HDD-LDD status (previously assigned according to the PCI score) of the samples, pathological stage, age at diagnosis, Gleason Score and Tumor purity score. For the multivariate Cox regression analysis, the Gleason Score was included as numeric feature by summing the primary and secondary numeric grades, as reported for TCGA samples.
Signature Refinement
To refine the 101 gene signature, we applied univariate cox regression analysis on each single gene expression values across the TCGA samples, including in the model the BCR STATUS and BCR TIME. The p-values obtained for each gene are then adjusted with Benjamini-Hochberg. Then we selected only those genes whose expression is associated with Hazar Ratio (HR) < 1 or HR > 1 with a adj.p-value < 0.05. We obtained 21 candidate prognostic genes. Unsupervised clustering of these genes can cluster 2 main groups of samples with HDD-like and LDD-like features with high significance in the prognostic status of the two groups (Extended Data Fig. 6a). We further refined the signature by removing 3 genes (NPIPB3, CCNE2, FMO5) that were noisier in the unsupervised clustering heatmap: these are the only genes that cluster in a discordant manner compared to their original association to HDD or LDD phenotype in our dataset (Extended Data Fig. S6a). Finally, we used the resulting refined panel of 18 genes to recompute and validate our PCI score on other independent cohorts of transcriptomic profiles of prostate cancer patients.
Validation
Through the PCaDB (http://bioinfo.jialab-ucr.org/PCaDB/ Release May, 10 2021; eSet V.1.3) we downloaded normalized expression matrices and BCR data of three independent datasets of prostate cancer: Emory (GSE54460) 79, Stockholm (GSE70769) 80, Belfast (GSE116918) 81. For each of them we classified samples computing the PCI score based on the refined 18 genes signature.