Clinical trials search strategy
To compile a comprehensive target atlas from clinical trials, ClinicalTrials.gov and PubMed were searched on 31 December 2022 using the search terms ‘antibody-drug conjugate’, ‘cancer’, ‘tumour’, and ‘oncology’ in various combinations. Only interventional studies were included, and early phase I trials were grouped with phase I trials. In addition, abstracts and posters from the ASCO 2022, ECCO 2022 and ESMO 2022 congresses were searched for ADCs with the terms ‘antibody-drug conjugate’ or ‘ADC’.
Data acquisition of the normal tissue transcriptome and proteome
Expression profiles for human tissue proteins based on IHC were retrieved from the Human Protein Atlas (HPA) under the entry ‘Normal tissue data’ (https://www.proteinatlas.org/about/download, normal_tissue.tsv.zip). Proteomic data based on LC-MS/MS were retrieved from the Human Proteome Map (HPM) (http://www.humanproteomemap.org/, HPM_normal protein_level_expression_matrix_Kim_et_al_052914 - LC-MSMS.xlsx). Consensus transcript expression profiles integrated from the HPA, GTEx and FANTOM5 were retrieved from the HPA under the entry ‘RNA consensus tissue gene data’ (https://www.proteinatlas.org/about/download, rna_tissue_consensus.tsv.zip).
Retrieval and compilation of protein subcellular location data
Subcellular localization of proteins was retrieved from the HPA under the entry ‘Subcellular location data’ (https://www.proteinatlas.org/about/download, subcellular_location.tsv.zip) and from the knowledge channel of COMPARTMENTS (https://compartments.jensenlab.org/Downloads, human_compartment_knowledge_full.tsv). The membrane protein annotation dataset (6176 entries) was compiled in R (version 4.0.3) by first extracting all cell surface membrane protein information from these two sources and then merging the results.
Rearrangement and mapping of the human tissues
Since the human tissue nomenclature differs among source repositories, each data set was mapped to a set of consensus tissue labels. When multiple tissues in one repository mapped to a single tissue label, the maximum expression value among them was selected. For example, the caudate, cerebellum, cerebral cortex, choroid plexus, dorsal raphe, hippocampus, hypothalamus, pituitary gland, and substantia nigra were collapsed into a single tissue category, ‘brain’. The cervix, uterine, endometrium, ovary, fallopian tube, vagina, epididymis, seminal vesicle, testis, and prostate were collapsed into ‘internal genitalia’. In addition, the adult adrenal, adult colon, adult esophagus, adult frontal cortex, adult gallbladder, adult heart, adult kidney, adult liver, adult lung, adult ovary, adult pancreas, adult prostate, adult rectum, adult retina, adult testis, and adult urinary bladder in the HPM were mapped to the adrenal gland, colon, esophagus, brain, gallbladder, heart muscle, kidney, liver, lung, internal genitalia, pancreas, internal genitalia, rectum, eye, internal genitalia, and urinary bladder, respectively. To maintain consistency, fetal tissues were discarded, resulting in 32 unique tissue categories.
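The collapsing rule described above can be illustrated with a short Python sketch. This is not the code used in the study (which was run in R); the mapping table shows only a handful of the tissues named in the text, and the function name, the rule for recognising fetal tissues by a name prefix, and the example values are assumptions made for illustration.

```python
# Illustrative subset of the source-tissue -> consensus-label mapping.
# Only a few entries named in the text are listed here.
TISSUE_MAP = {
    "caudate": "brain",
    "cerebellum": "brain",
    "hippocampus": "brain",
    "ovary": "internal genitalia",
    "testis": "internal genitalia",
    "adult frontal cortex": "brain",
    "adult heart": "heart muscle",
}


def collapse_tissues(expression, tissue_map):
    """Collapse source tissues to consensus labels, keeping the maximum value.

    expression: {source tissue name: expression value} for one gene.
    Tissues whose name starts with "fetal" are dropped (assumed naming rule);
    tissues absent from the map keep their own name.
    """
    collapsed = {}
    for tissue, value in expression.items():
        if tissue.lower().startswith("fetal"):
            continue
        label = tissue_map.get(tissue, tissue)
        if label not in collapsed or value > collapsed[label]:
            collapsed[label] = value
    return collapsed


# Made-up values for a single gene.
example = {
    "caudate": 2.0,
    "cerebellum": 7.5,
    "hippocampus": 4.1,
    "fetal brain": 9.9,
    "liver": 3.0,
}
print(collapse_tissues(example, TISSUE_MAP))
```

In this example the three brain regions collapse to one ‘brain’ value equal to their maximum, and the fetal tissue is ignored.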
Normal tissue expression and binning
To facilitate target screening, the expression values were classified into five categories: ‘High’, ‘Medium’, ‘Low’, ‘Not Detected’, and ‘Not Available’. To accomplish the binning, we first applied a log10 transformation to the HPM dataset and then temporarily corrected it for the purpose of estimating the abundance distribution. To best fit normal curves to the observed distributions, we applied the Broyden-Fletcher-Goldfarb-Shanno algorithm [32] and obtained the peak maximum and the standard deviation. Expression values within one standard deviation above the peak maximum were recorded as ‘Medium’, and expression values above this range were recorded as ‘High’. Similarly, expression values within one standard deviation below the peak maximum were recorded as ‘Low’, and those more than one standard deviation below the peak maximum were recorded as ‘Not Detected’. Proteins without expression values were recorded as ‘Not Available’. The IHC-based expression profiles of human tissue proteins are natively reported in the five aforementioned categories, so no adjustment was made. For the RNA consensus expression profiles integrated from the HPA, GTEx and FANTOM5, consensus normalized expression (NX) values between 20 and 40 were recorded as ‘Medium’, and NX values above 40 were recorded as ‘High’. Similarly, NX values in the range of 1-20 were recorded as ‘Low’, and those below 1 were recorded as ‘Not Detected’. NX values in all other cases were recorded as ‘Not Available’.
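The two binning rules can be sketched as follows. This Python sketch is illustrative only: it takes the fitted peak maximum and standard deviation as given inputs rather than reproducing the curve fit, the function names are hypothetical, and the handling of values that fall exactly on a boundary (for example NX = 20) is an assumption, since the text does not specify it.

```python
def bin_by_peak(value, peak, sd):
    """Bin a log10 protein abundance relative to the fitted peak and SD.

    value is None when no expression value is available.
    Boundary handling (>= versus >) is an assumption.
    """
    if value is None:
        return "Not Available"
    if value > peak + sd:
        return "High"
    if value >= peak:
        return "Medium"
    if value >= peak - sd:
        return "Low"
    return "Not Detected"


def bin_nx(nx):
    """Bin a consensus normalized expression (NX) value with fixed cut-offs.

    Values exactly at 20 are assigned to 'Medium' here (assumption).
    """
    if nx is None:
        return "Not Available"
    if nx > 40:
        return "High"
    if nx >= 20:
        return "Medium"
    if nx >= 1:
        return "Low"
    return "Not Detected"


# Made-up values: a fitted peak at 2.0 with SD 1.0 on the log10 scale.
for v in (3.5, 2.5, 1.5, 0.5, None):
    print(v, bin_by_peak(v, peak=2.0, sd=1.0))
for nx in (45.0, 30.0, 5.0, 0.3, None):
    print(nx, bin_nx(nx))
```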
Differential expression analysis of genes between tumour and its adjacent normal tissue
We gathered the uniformly processed TCGA and GTEx RNA-sequencing data from the RNAseqDB (https://github.com/mskcc/RNAseqDB). Altogether there were 9,109 high-quality samples covering 14 normal tissues and 19 types of solid cancer. The DESeq2 package [33] was used to identify HUGO genes that are differentially expressed between tumours and their adjacent normal tissues. Using thresholds of 0.01 for the Benjamini-Hochberg adjusted p-value and 1.0 for log2FoldChange, the HUGO genes that were significantly upregulated in the tumours were retained.
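The filtering step applied to the differential expression results can be sketched in Python as below. The sketch assumes a results table with the columns `padj` and `log2FoldChange`, as produced by DESeq2, exported to a list of records; the `gene` key, the function name, the use of strict inequalities, and the example rows are assumptions for illustration.

```python
def upregulated_genes(results, padj_cutoff=0.01, lfc_cutoff=1.0):
    """Return genes significantly upregulated in tumour versus normal.

    results: list of dicts with keys 'gene', 'padj', 'log2FoldChange'.
    Rows whose adjusted p-value is missing (None) are skipped.
    """
    kept = []
    for row in results:
        padj = row["padj"]
        if padj is None:
            continue
        if padj < padj_cutoff and row["log2FoldChange"] > lfc_cutoff:
            kept.append(row["gene"])
    return kept


# Made-up rows with placeholder gene names.
results = [
    {"gene": "GENE_A", "padj": 0.0001, "log2FoldChange": 2.3},   # retained
    {"gene": "GENE_B", "padj": 0.0001, "log2FoldChange": -2.0},  # downregulated
    {"gene": "GENE_C", "padj": 0.2, "log2FoldChange": 3.0},      # not significant
    {"gene": "GENE_D", "padj": None, "log2FoldChange": 1.5},     # no adjusted p-value
]
print(upregulated_genes(results))
```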
Differential expression analysis of genes between tumour and other normal tissues
The RNA-sequencing read count data gathered from the RNAseqDB were first transformed to TPM (transcripts per million), a format that can be used directly to compare gene expression. Transforming read counts into TPM requires normalizing for gene length and then for sequencing depth, in that order. For the gene length normalization step, we first calculated gene length from GTF files (GDC.h38 GENCODE v22 GTF (used in RNA-Seq alignment and by HTSeq)) by counting the longest transcript of each gene (sum of exons) or the sum of all exons, and then divided each count by the length of its respective gene. For the sequencing depth normalization step, we proceeded as follows: 1) sum all length-normalized counts within each sample column; 2) divide each column sum by the desired depth (1,000,000) to yield a scaling factor per sample; 3) divide each length-normalized count within a column by its respective scaling factor.
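The two normalization steps can be sketched in Python as follows. This is a minimal illustration rather than the code used in the study; the data layout (nested dictionaries keyed by sample and gene), the function name, and the example counts and lengths are assumptions.

```python
def counts_to_tpm(counts, gene_lengths):
    """Convert raw read counts to TPM.

    counts: {sample: {gene: read count}}
    gene_lengths: {gene: gene length in bases}
    """
    tpm = {}
    for sample, gene_counts in counts.items():
        # Step 1: normalize each count by the length of its gene.
        rates = {g: c / gene_lengths[g] for g, c in gene_counts.items()}
        # Step 2: per-sample scaling factor = column sum / 1,000,000.
        scaling = sum(rates.values()) / 1_000_000
        # Step 3: divide each length-normalized count by the scaling factor.
        tpm[sample] = {
            g: (r / scaling if scaling > 0 else 0.0) for g, r in rates.items()
        }
    return tpm


# Made-up example: two genes with the same read density per base.
counts = {"sample_1": {"GENE_A": 100, "GENE_B": 300}}
gene_lengths = {"GENE_A": 1000, "GENE_B": 3000}
tpm = counts_to_tpm(counts, gene_lengths)
print(tpm)
```

Because both genes have the same number of reads per base, they receive equal TPM values, and the values within each sample sum to one million.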
The TPM values of each gene in its paired indication and normal tissues were used as input data for differential expression analysis. The non-parametric Mann–Whitney U test was applied to calculate the differential expression ratios. The differential expression patterns of target genes between their paired indication and normal tissues were visualized with the OmicCircos package [34].
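For readers unfamiliar with the test, the Mann–Whitney U statistic can be computed as in the Python sketch below. It is an illustration only: it uses average ranks for ties and a two-sided p-value from the normal approximation with tie correction and no continuity correction, which is an assumption and may differ from the settings used in the study (R's `wilcox.test`, for instance, computes an exact p-value for small samples without ties). The function names and example values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    # Tie correction for the variance of U.
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u1, 1.0  # all values identical: no evidence of a difference
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u1, p_value


# Made-up TPM values for one gene in normal and tumour samples.
normal_tpm = [1.0, 2.0, 3.0]
tumour_tpm = [4.0, 5.0, 6.0]
u_stat, p_val = mann_whitney_u(normal_tpm, tumour_tpm)
print(u_stat, p_val)
```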
Tumour tissue transcriptome, genome, and phenotype compilation
To integrate the transcriptome, genome, and phenotype information of the gene set of interest, we downloaded the phenotype data of TCGA patients with various solid cancers from UCSC Xena (https://xenabrowser.net/datapages/), and then extracted and organized the information about gender, neoplasm histologic grade, pathologic stage, and TNM staging. The non-silent mutations (SNP and INDEL) for each gene in each cancer type were determined by mining the MC3 (“Multi-Center Mutation Calling in Multiple Cancers”) TCGA MAF (mutation annotation format) file. The gene-level transcription estimates (in log2(x+1) transformed RSEM normalized count format) were transformed to TPM, a format that can be used directly to compare gene expression. Thereafter, we compiled a comprehensive dataset by integrating the aforementioned tumour tissue transcriptome and genome information with the organized phenotype data. The heterogeneous expression pattern analysis was performed using the ggstatsplot package in batch mode (https://github.com/IndrajeetPatil/ggstatsplot).
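As an illustration of the mutation-mining step, the Python sketch below computes, for one cancer type, the fraction of samples carrying at least one non-silent mutation in each gene. The column names `Hugo_Symbol`, `Variant_Classification` and `Tumor_Sample_Barcode` are standard MAF fields; the particular set of classifications treated as non-silent, the function name, and the example rows are assumptions and may not match the definition used in the study.

```python
# Variant classes counted as non-silent in this sketch (assumption).
NON_SILENT = {
    "Missense_Mutation",
    "Nonsense_Mutation",
    "Nonstop_Mutation",
    "Frame_Shift_Del",
    "Frame_Shift_Ins",
    "In_Frame_Del",
    "In_Frame_Ins",
    "Splice_Site",
    "Translation_Start_Site",
}


def non_silent_frequency(maf_rows, n_samples):
    """Fraction of samples with >= 1 non-silent mutation, per gene.

    maf_rows: MAF records (dicts) for a single cancer type.
    n_samples: total number of sequenced samples of that cancer type.
    """
    mutated = {}
    for row in maf_rows:
        if row["Variant_Classification"] in NON_SILENT:
            samples = mutated.setdefault(row["Hugo_Symbol"], set())
            samples.add(row["Tumor_Sample_Barcode"])
    return {gene: len(samples) / n_samples for gene, samples in mutated.items()}


# Made-up rows with placeholder gene and sample names.
maf_rows = [
    {"Hugo_Symbol": "GENE_A", "Variant_Classification": "Missense_Mutation",
     "Tumor_Sample_Barcode": "S1"},
    {"Hugo_Symbol": "GENE_A", "Variant_Classification": "Frame_Shift_Del",
     "Tumor_Sample_Barcode": "S1"},  # same sample, counted once
    {"Hugo_Symbol": "GENE_A", "Variant_Classification": "Silent",
     "Tumor_Sample_Barcode": "S2"},  # silent, ignored
    {"Hugo_Symbol": "GENE_B", "Variant_Classification": "Nonsense_Mutation",
     "Tumor_Sample_Barcode": "S3"},
]
freq = non_silent_frequency(maf_rows, n_samples=4)
print(freq)
```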
Annotation of functionally relevant mutation
OncoKB (https://www.oncokb.org/) contains annotation information about the impact and therapeutic significance of 5616 specific alterations in 682 cancer genes. It combines multiple resources, including FDA, NCCN (National Comprehensive Cancer Network) and other guidelines, ClinicalTrials.gov and the scientific literature. We used the OncoKB annotation of oncogenic and clinically actionable alterations and discarded somatic mutations that were labeled as likely oncogenic or predicted oncogenic, yielding a set of driver mutations that was not contaminated by passenger mutations. The collected information was used to analyze the therapeutic significance of potential target genes that are altered in a large proportion of patients.
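The filtering rule can be sketched in Python as below. The record layout and the field name `oncogenic` are hypothetical, as are the example alterations; the sketch only illustrates keeping alterations labeled as oncogenic while discarding those labeled as likely or predicted oncogenic.

```python
def driver_mutations(annotated):
    """Keep only alterations whose oncogenicity label is exactly 'Oncogenic'.

    annotated: list of dicts with a (hypothetical) 'oncogenic' field holding
    the annotation label for each alteration.
    """
    return [m for m in annotated if m["oncogenic"] == "Oncogenic"]


# Made-up annotated alterations with placeholder names.
annotated = [
    {"gene": "GENE_A", "alteration": "ALT_1", "oncogenic": "Oncogenic"},
    {"gene": "GENE_A", "alteration": "ALT_2", "oncogenic": "Likely Oncogenic"},
    {"gene": "GENE_B", "alteration": "ALT_3", "oncogenic": "Predicted Oncogenic"},
    {"gene": "GENE_C", "alteration": "ALT_4", "oncogenic": "Unknown"},
]
drivers = driver_mutations(annotated)
print(drivers)
```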
Predicting overexpression rates of target antigens
We applied a method called functional genomic mRNA profiling to predict the overexpression rates of target antigens [35]. Briefly, principal component analysis (PCA) was used to analyze the mRNA transcriptome, yielding n eigenvalues and n corresponding eigenvectors (transcriptional components). We identified the subset of transcriptional components that describe non-genetic regulatory factors (physiological and experimental factors) and used them as covariates in multiple linear regression to correct the original gene expression data. The corrected data are the so-called functional genomic mRNA profiles, which capture the effects of genomic alterations on gene expression levels. The overexpression percentage of each target antigen in the samples of each cancer type was determined using a threshold defined in the functional genomic mRNA profiles of normal tissues as the 97.5th percentile of the functional genomic mRNA signal.
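The final thresholding step can be sketched in Python as below. The sketch covers only the percentile threshold and the overexpression rate, not the PCA-based correction; the percentile is computed by linear interpolation between order statistics, which is one common convention and an assumption here, and the function names and example values are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0-100) by linear interpolation between sorted values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty list is undefined")
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def overexpression_rate(tumour_signal, normal_signal, q=97.5):
    """Fraction of tumour samples whose signal exceeds the normal-tissue threshold."""
    threshold = percentile(normal_signal, q)
    above = sum(1 for value in tumour_signal if value > threshold)
    return above / len(tumour_signal)


# Made-up signals for one antigen in one cancer type.
normal_signal = [float(v) for v in range(41)]  # 0.0, 1.0, ..., 40.0
tumour_signal = [10.0, 38.0, 39.5, 50.0]
print(percentile(normal_signal, 97.5))
print(overexpression_rate(tumour_signal, normal_signal))
```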
Gene set enrichment score analysis
We first downloaded the pan-cancer phenotype data and expression matrix from XENA (https://xenabrowser.net/datapages/?dataset=GDC-PANCAN.basic_phenotype.tsv&host=https%3A%2F%2Fgdc.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443, GDC-PANCAN.basic_phenotype.tsv) and the TCGA PanCanAtlas (https://gdc.cancer.gov/about-data/publications/pancanatlas, EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv), respectively. We then extracted the paired samples and their corresponding expression matrix according to the rules of the TCGA sample barcode (https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/). The enrichment score of the gene set of interest was then calculated with ssGSEA [36] and z-score transformed to evaluate the similarities and differences in expression of the gene set across the TCGA pan-cancer cohort.
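Two of the steps above, pairing samples by barcode and z-score transformation, can be sketched in Python as follows. The sketch relies on the TCGA barcode convention that the first 12 characters identify the participant and that the two-digit sample type code is 01-09 for tumour and 10-19 for normal samples; the function names, the use of the sample standard deviation, and the example barcodes are assumptions, and the ssGSEA calculation itself is not reproduced.

```python
import statistics


def split_barcode(barcode):
    """Return (participant ID, sample kind) from a TCGA sample barcode."""
    participant = barcode[:12]
    sample_code = int(barcode[13:15])
    if 1 <= sample_code <= 9:
        kind = "tumour"
    elif 10 <= sample_code <= 19:
        kind = "normal"
    else:
        kind = "control"
    return participant, kind


def paired_participants(barcodes):
    """Participants that have both a tumour and a normal sample."""
    kinds_by_participant = {}
    for barcode in barcodes:
        participant, kind = split_barcode(barcode)
        kinds_by_participant.setdefault(participant, set()).add(kind)
    return sorted(
        p for p, kinds in kinds_by_participant.items()
        if {"tumour", "normal"} <= kinds
    )


def z_scores(scores):
    """Centre and scale enrichment scores using the sample standard deviation."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    return [(s - mean) / sd for s in scores]


# Made-up barcodes following the TCGA layout.
barcodes = ["TCGA-AA-0001-01A", "TCGA-AA-0001-11A", "TCGA-BB-0002-01A"]
print(paired_participants(barcodes))
print(z_scores([1.0, 2.0, 3.0]))
```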
Statistical analysis
All statistical analyses described in the individual subsections of the Methods were carried out in the R statistical environment. The non-parametric Mann–Whitney U test and the non-parametric Kruskal–Wallis one-way ANOVA were used for comparisons of two groups and of more than two groups, respectively.