Study cohorts
The CPI1000+ cohort utilises raw whole exome and RNA sequencing data from the following studies:
- Snyder et al. (29), an advanced melanoma anti-CTLA-4 treated cohort.
- Van Allen et al. (30), an advanced melanoma anti-CTLA-4 treated cohort.
- Hugo et al. (31), an advanced melanoma anti-PD-1 treated cohort.
- Riaz et al. (32), an advanced melanoma anti-PD-1 treated cohort.
- Cristescu et al. (2) an advanced melanoma anti-PD-1 treated cohort.
- Cristescu et al. (2) an advanced head and neck cancer anti-PD-1 treated cohort.
- Cristescu et al. (2) “all other tumor types” cohort (from KEYNOTE-028 and KEYNOTE-012 studies), treated with anti-PD-1.
- Snyder et al. (33), a metastatic urothelial cancer anti-PD-L1 treated cohort.
- Mariathasan et al. (1), a metastatic urothelial cancer anti-PD-L1 treated cohort.
- Mcdermot et al. (15), a metastatic renal cell carcinoma anti-PD-L1 treated cohort.
- Rizvi et al. (34), a non-small cell lung cancer anti-PD-1 treated cohort.
- Hellman et al., an unpublished cohort of non-small cell lung cancer samples treated with anti-PD-1.
- Le et al. (35), a colorectal cancer cohort treated with anti-PD-1 therapy.
In order to allow studies to be grouped by histology, patients from the “all other tumor types” KEYNOTE-028 and KEYNOTE-12 cohort from Cristescu et al. were broken out to create two additional cohorts, cohort 14: Cristescu et al - urothelial cancer and cohort 15: Cristescu et al – breast cancer (N.B. an additional n=67 breast cancer samples from Voorwerk et al. (36) are in the process of being analysed for the final submission). Thus in total, the CPI1000+ cohort is comprised of data from 15 distinct sub-studies. A breakdown of sample numbers for each study/histology is contained in Table S1. Validation data for copy number analysis was taken from Samstein et al. (3), a cohort of 1662 patients treated with CPI and profiled using the MSK-IMPACT gene panel (referred to as the MSK1600 cohort). Segment copy number data for these samples was downloaded from the GENIE Synapse portal (syn7222066), https://www.synapse.org/, and clinical data was utilised from the Samstein et al. paper. In addition, a cohort of MSK-IMPACT sequenced, but non-CPI treated patients from were utilised for negative control analyses, to distinguish CPI predictive from generally prognostic biomarkers. Copy number segment data for this non-CPI cohort was similarly obtained from the GENIE Synapse portal (syn7222066), https://www.synapse.org/, and clinical response data was taken from Bielski et al. (37), and patients overlapping with the Samstein et al. were removed. We note the clinical survival data from the supplementary files of Bielski et al. and Samstein et al MSK-IMPACT publications did not match for overlapping patients (presumably due to differing duration of follow-up) and hence these two datasets could only be analysed separately, and not combined together for interaction test analysis. Lastly, single cell RNA sequencing was conducted on CD8 TILs from patient L011, a patient diagnosed with non-small cell lung cancer who underwent definitive surgical resection prior to receiving any adjuvant therapy. Informed consent was obtained under study UCLHRTB 10/H1306/42.
Clinical end points
In the CPI1000+ cohort, a uniform clinical end-point of response was defined across all the 15 studies based on radiological response as per the RECIST criteria, with “CR/PR” being classified as a responder and “SD/PD” being a non-responder. We note this is a conservative definition of response, and patients with SD and extended survival have in some previous studies been considered as experiencing clinical benefit from treatment, however the “CR/PR” vs “SD/PD” definition used here allows for uniform consistency across cohorts, clearest interpretation and is consistent with the most recent literature (1, 2). For RECIST response evaluations we utilized the clinical data provided by the original authors, which in >90% of cases was best response time point. In a minority of cases the time point of RECIST evaluation was not directly specified. For the (2) cohort response labels were not available as a supplementary file, however they could be inferred from cross-reference of Table S2 and Fig. S3 of that paper, and validated by re-computing p-values from the paper to ensure exact match (e.g. Fig. 2 multivariate model p-values stated in the paper, we were able to match to the 4 decimal places accuracy provided in the paper). In addition, the inferred labels were further validated when we checked the numbers of responders per detailed histology in Table S3 of (2) and found the inferred data matched exactly the reported results. RECIST response data was not available for the MSK1600 cohort, so instead overall survival was used as the clinical end-point, combined with negative control analysis in MSK-IMPACT profiled samples not treated with CPI, to distinguish predictive from prognostic biomarkers.
Multimer sorting of neoantigen reactive T cells
We have previously identified CD8+ neoantigen reactive T cells (NARTs) targeted against a clonal neoantigen (arising from the mutated MTFR2 gene) in NSCLC tumor regions derived from patient L011 (24). Briefly, neoantigen-specific CD8 T cells were identified using high throughput MHC multimer screening of candidate mutant peptides generated from patient-specific neoantigens of predicted <500nM affinity for cognate HLA as previously described (24). 288 candidate mutant peptides (with predicted HLA binding affinity <500nM, including multiple potential peptide variations from the same missense mutation) were synthesized and used to screen expanded L011 TILs. In patient L011, TILs were found to recognize the HLA-B*3501 restricted, MTFR2D326Y-derived mutated sequence FAFQEYDSF (netMHC binding score: 22nM), but not the wild type sequence FAFQEDDSF (netMHC binding score: 10nM). No responses were found against overlapping peptides AFQEYDSFEK and KFAFQEYDSF. Neoantigen-specific CD8+ T cells were tracked with peptide-MHC multimers conjugated with either streptavidin PE (Biolegend, cat#405203), APC (Biolegend, cat#405207) BV650 (Biolegend, cat#405231) or PE-Cy-7 (Biolegend, cat#405206) and gated as double positive cells among live, single CD8+ cells. Phenotypic characterization of neoantigen-specific CD8 T cells in L011 was performed as previously described (24).
Single-Cell RNA sequencing of Neoantigen Reactive T cells
Multimer-positive and negative single CD8+ T cells from NSCLC specimens were sorted directly into the C1 Integrated Fluidic Circuit (IFC; Fluidigm). Cell lysing, reverse transcription, and cDNA amplification were performed as specified by the manufacturer. Briefly, 1000 single, multimer positive or negative CD8 T cells were flow sorted directly into a 10- to 17-μm-diameter C1 Integrated Fluidic Circuit (IFC; Fluidigm). Ahead of sorting, the cell inlet well was preloaded with 3.5ul of PBS 0.5% BSA. Post-sorting the total well volume was measured and brought to 5ul with PBS 0.5% BSA. 1ul of C1 Cell Suspension Reagent (Fluidigm) was added and the final solution was mixed by pipetting. Each C1 IFC capture site was carefully examined under an EVOS FL Auto Imaging System (Thermo Fisher Scientific) in bright field, for empty wells and cell doublets. An automated scan of all capture sites was also obtained for reference. Cell lysing, reverse transcription, and cDNA amplification were performed on the C1 Single-Cell Auto Prep IFC, as specified by the manufacturer. The SMARTer v4 Ultra Low RNA Kit (Takara Clontech) was used for cDNA synthesis from the single cells. cDNA was quantified with Qubit dsDNA HS (Molecular Probes) and checked on an Agilent Bioanalyser high sensitivity DNA chip. Illumina NGS libraries were constructed with Nextera XT DNA Sample Preparation kit (Illumina), according to the Fluidigm Single-Cell cDNA Libraries for mRNA sequencing protocol. Sequencing was performed on Illumina® NextSeq 500 using 150bp paired end kits.
Sample quality control
Several steps were taken to prevent contamination and sample quality issues. First, samples were clustered using a panel of common germline SNPs, to ensure no duplicate participants were included (Fig. S4). Secondly, the DeTiN tumor-in-normal contamination tool (7) was run, and excluded any samples with TiN>0.02. Thirdly, we assessed for any technical correlations between mutation counts and purity or sequencing coverage, and did not observe any significant relationships (Fig. S2). Fourthly, to remove any potential batch issues in the RNA sequencing data we used the removeBatchEffect() function from the limma R package. Finally, we assessed for any evidence of different exome capture kits across the cohorts impacting results, and found no significant difference in TMB scores based on exome capture kits utilised (Fig. S3). We note however that Agilent SureSelect kits were used in nearly all studies, except for one cohort, Snyder et al. (33), which used IDT xGen WES capture, and in addition we found no specification of the capture kit used in the Hugo et al. manuscript (31).
Whole exome sequencing (DNA) pipeline – variant calling
For all studies we obtained germline/tumor whole exome sequencing data in either BAM, SRA or fastq format, from the relevant sequencing repository or directly original authors and reverted these files back to FASTQ format using Picard tools (version 1.107) SamToFastq. Raw paired-end reads in FASTQ format were aligned to hg19 obtained from the GATK bundle (v2.8) using bwa mem (bwa v0.7.15) (McKenna et al., 2010, Li and Durbin, 2009). Picard tools (picard v1.107) was used to remove duplicates (http://broadinstitute.github.io/picard), and GATK was additionally used for local indel realignment. Quality control metrics were produced with picard tools (v1.107), FastQC (v0.11.5 - http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and GATK(v3.9) (McKenna et al., 2010). Platypus v0.8.1 was used to call homozygous and heterozygous germline SNPs (Rimmer et al., 2014). The default parameters were used, but the genIndels flag was set to FALSE. Only SNPs with a minimum depth of coverage of 20x are taken forward. Somatic variants were detected using two tools (VarScan2 v2.4.1 & MuTect v1.1.7) (Koboldt et al., 2012, Cibulskis et al., 2013), using the following method: SAMtools mpileup (version 0.1.19) was used to locate non-reference positions in tumor and germline samples. Bases with a Phred score of less than 20 or reads with a mapping quality less than 20 were omitted. The Base alignment quality (BAQ) calculation option was deactivated and a threshold of 50 was set for the coefficient of downgrading mapping quality. VarScan2 somatic (version 2.3.6) used output from SAMtools mpileup to identify somatic variants between tumour and matched germline samples. VarScan2 processSomatic was used to extract the somatic variants. Single nucleotide variant (SNV) calls were filtered for false positives with the associated fpfilter.pl script in Varscan2, initially with default settings then repeated with min-var-frac=0·02, having first run the data through bam-readcount (version 0.5.1). MuTect (version 1.1.4) was also used to detect SNVs, and results were filtered according to the filter parameter PASS. Default parameters were used in both tools with the exception of: i) minimum coverage for the germline sample was set to 10, ii) minimum coverage for the tumor sample was set to 30, iii) minimum somatic variant allele frequency (VAF) was set to 0·01 and minimum alternative read coverage set to 5, iv) alternative reads in the germline had to be <=5 and germline VAF <=1%, v) variant had to be not present in EXAC03 database at 5% or higher frequency. In final QC filtering, an SNV was considered a true positive if the variant allele frequency (VAF) was greater than 1% and the mutation was called by both VarScan2, with a somatic p-value <=0.01, and MuTect. Alternatively, a frequency of 5% was required if only called in VarScan2, again with a somatic p-value <=0.01. For small scale insertion/deletions (INDELs), only calls classed as high confidence by VarScan2 processSomatic were kept for further analysis, with somatic_p_value scores less than 5 × 10−4. Variant annotation was performed using ANNOVAR (version 2016Feb01).
Whole exome sequencing (DNA) pipeline – copy number calling
VarScan2(v2.4.1) was used to generate logR depth ratios from paired tumour region/germline samples. These values were subsequently GC corrected (Cheng et al., 2011). Default parameters were used to generate this data with the exception of: min-coverage=8 and min-segment-size=50. B-Allele Frequencies (BAFs) – the proportion of reads with a SNP variant relative to the total read depth – were calculated using the SNPs called in the germline by platypus. The GC-corrected logR values and BAF values are then used by ASCAT (v2.3) (Van Loo et al., 2010) to generate segmented allele-specific copy number data, including estimates of tumour ploidy and cellularity. Sequenza (Favero et al., 2015) was additionally run on all samples in parallel. To ensure accuracy, default ASCAT copy number solutions were quality control checked, and where a sample failed any of the following quality flags it then underwent manual review: i) unexpectedly high purity, defined as tumor cellularity > 80%, ii) unexpectedly low levels of loss of heterozygosity, defined as fraction of the genome LOH of < 0.1, iii) unexpectedly high level of the genome with both alleles at even copy number, defined as the fraction of the genome with alleles A and B both even as > 0.7, iv) unexpectedly high level of the genome with copy number = 0, defined as >=4Mb with copy number = 0. In addition, an orthogonal measure of tumor purity was derived based on mutation variant allele fraction, as previously described [NEJM], and samples with a mismatch in purity between ASCAT and orthogonal measurements of greater than 1 standard deviation were additionally flagged for manual review. Samples that had been flagged for manual review underwent dual analyst inspection, which involved review of the default and alternative copy number solutions from ASCAT and Sequenza tools. Where a better fitting solution was available (based on the rules above, as well as obtaining consistency in solutions between ASCAT and Sequenza) this was utilised rather than the ASCAT default.
RNA sequencing pipeline
RNAseq data was obtained in BAM/SRA/FASTQ format for all studies, and reverted back to FASTQ format using bam2fastq (v1.1.0). FASTQ data underwent quality control and were aligned to the hg19 genome using STAR (38). Transcript quantification was performed using RSEM with default parameters (39).
Mutation clonality analysis
PyClone (Roth et al., 2014) was used to determine the clonal status of mutations. For each sample variant calls were integrated with local allele specific copy number (obtained from ASCAT), tumor purity (also obtained from ASCAT), and variant allele frequency data. All mutations were then clustered using the PyClone Dirichlet process clustering. We ran PyClone with 10,000 iterations and a burn-in of 1000, and using parameters as previously described (40).
HLA and neoantigen analysis
Neoantigen predictions were derived by first determining the 4-digit HLA type for each patient, along with mutations in class I HLA genes, using POLYSOLVER (41). Next, all possible 9, 10 and 11-mer mutant peptides were computed, based on the detected somatic non-synonymous SNV and INDEL mutations in each sample. Binding affinities of mutant and corresponding wildtype peptides, relevant to the corresponding POLYSOLVER-inferred HLA alleles, were predicted using NetMHCpan (v3.0) and NetMHC (v4.0) (42). Neoantigen binders were defined as IC500<500 nM or rank < 2.0. Germline HLA-I evolutionary divergence scores were derived by calculating the Grantham distances between HLA gene allele pairs using the same procedure as described in Pierini et al. (43), which utilises the Grantham distance metric originally designed for investigating protein evolution from physiochemical differences in amino acid sequences (44). Aligned protein sequences for HLA alleles were obtained from the IMGT database (45) for the different HLA alleles as called by Polysolver from the raw data files for the HLA-A, B and C genes. A custom R script was created to calculate the Grantham distance at each position on exons 2 and 3 of two aligned HLA alleles (exon 2 and 3 being the the peptide binding region of the HLA protein). The final Grantham distance score between two HLA alleles was calculated as the sum of the scores at each position divided by length of the amino acid sequence. The average Grantham score for an individual patient was then calculated by taking the mean of the separate Grantham scores for HLA-A, B and C. HLA loss of heterozygosity analysis was performed using the LOHHLA too as previously described (10).
Derivation of published biomarkers
The following previously published biomarkers were tested for association with response to CPI therapy: tumor mutation burden (TMB) (29, 30, 34) (also split out into Clonal (24) and Subclonal TMB), frameshift insertion/deletion (indel) mutation burden (46), burden of indels escaping nonsense mediated decay (47), shannon diversity index for intratumor heterogeneity (SDI-ITH) (13), burden of somatic copy number alterations (48), HLA-I evolutionary divergence (9), loss of heterozygosity at the HLA locus (10), Gender (12), B2M mutations (49), JAK1/JAK2 mutations (50), DNA damage response pathway mutations (51), Receptor tyrosine kinases pathway mutations (52), SERPINB3/SERPINB4 mutations (53), PTEN mutations (54), CD8A (55), CD274 (PD-L1) (56) and CXCL9 expression (57), as well as the CD8 T cell effector (15) and T cell inflamed gene expression signatures (14). Histology specific CPI biomarkers, e.g. STK11 and PBRM1, were excluded from the analysis on account of lack of power to compressively assess biomarkers in single histology types. TMB was defined as the count of missense variants, SCNA load was defined using the weighted genome instability index (wGII) (8), expression of individual genes was measured using either TPM (for datasets with RNAseq) or normalised nanostring expression values for the Cristescu et al. cohort. For inactivating pathway mutations (i.e. B2M, PTEN, JAK1/JAK2, DNA damage response) loss of function mutations (i.e. those causing a premature stop codon) and homozygyous deletions were included. DNA damage response pathway genes were defined as: BRCA1, BRCA2, ATM, POLE, ERCC2, FANCA, MSH2, MLH1, POLD1 and MSH6 based on (51). Receptor tyrosine kinases pathway mutations were defined as per (52). All other biomarkers were defined as per the method outlined in the original underlying publication as referenced above. Associations with response were tested using logistic regression. To allow biomarkers with varying measurement scales (e.g. mutation counts vs gene expression values) to be compared equivalently based on effect size rather than p-value (5), all biomarker values were converted to standard z-scores (i.e. mean normalised to equal zero, and standard deviation normalised to one). To avoid data pooling, each biomarker was tested individually in each sub-study, and then the effect sizes and standard errors were combined through meta-analysis to derive a final p-value per biomaker. Meta-analysis was conducted using R package ‘meta’. Proportion of variance explained analysis. The total proportion of variance explained by all biomarkers was calculated by logistic regression pseudo-R2, using R function ‘PseudoR2’.
Fitting a multivariate model of CPI response
All biomarkers attaining significance in the Fig. 1A meta-analysis were utilised, comprising 12 measures in total: TMB, Clonal TMB, Indel TMB, NMD-escape TMB, T cell inflamed and CD8 Effector expression signatures, Gender, SERPINB3 mutation status, LOHHLA, and gene expression values for CD274 (PD-L1), CD8A and CXCL9. All 12 biomarkers were inputted into the gradient boosted tree algorithm XGBoost, a widely used machine learning algorithm effective for classification tasks. R package ‘xgboost’ was utilized, with maximum tree depth set at 2, nrounds set as 20 and all other parameters set to default values. In the four training/validation cohorts, individual models were trained on a randomly selected 75% subset of each cohort, with AUC values then calculated in 25% of held-back (unseen) samples. To understand the distribution of AUC values the random subsampling procedure was repeated via 1000 rounds of Monte Carlo simulation. For each Monte Carlo iteration the AUC value was measured in the 25% set of unseen samples. An identical process was performed for TMB as a single biomarker. R package ‘ROCR’ was used for the ROC curve analysis.
Pan-cancer analysis of copy number losses and gains
Copy number segment data from ASCAT for all responders (n=261) and non-responders (n=746) were inputted to the R package ‘copynumber’ to derive the gain and loss frequency across the genome for each group (i.e. for responders and non-responders separately). Region level cytoband coordinates were obtained from the UCSC Table Browser, with 303 autosomal chromosomes cytobands defined. For gains and losses (separately) the frequency per cytoband was converted back to absolute patient counts and the difference between responders and non-responders was compared using a 2x2 Fisher’s exact test. Results were corrected for multiple testing using the p.adjust function in R, with the FDR method. The frequency of whole chromosal losses was analysed using genome-wide SNP6 segmented data per sample from the TCGA GDAC Firehose repository (http://firebrowse.org/), for histology types overlapping with the CPI1000+ cohort, i.e. TCGA cohorts: BLCA, BRCA, COADREAD, HNSC, KIRC, LUAD, LUSC and SKCM. In MSK-IMPACT dataset, the overall survival analysis to validate 9q34.3 loss was conducted only in patients with 9p loss, given the strong co-correlation between 9p and 9q loss, and the known generally prognostic impact of loss of 9p in driving reduced overall survival (e.g. CDK2NA). The immune evasion alteration analysis was conducted as per previously published method by Rosenthal et al. 2019 (11), which defines antigen-presentation-pathway genes as components of the HLA enhanceosome, peptide generation, chaperones or the MHC complex itself. In the analysis we included disruptive events (non-synonymous mutations or copy-number loss defined relative to ploidy) of the following genes: CIITA, IRF1, PSME1, PSME2, PSME3, ERAP1, ERAP2, HSPA, HSPC, TAP1, TAP2, TAPBP, CALR, CNX, PDIA3 and B2M. The analysis was also repeated for non-synonymous mutations only (i.e. no copy number loss events). Significance was determined using a one-sided 2x2 Fisher’s exact test. In addition, a multivariate logistic regression test was also performed, adjusting for wGII and cancer type, which also confirmed a significant association between 9q34 loss and a higher rate of immune evasion.
Pan-cancer analysis of focal amplifications and deep deletions
Copy number segment data from ASCAT for all responders (n=261) and non-responders (n=746) were utilised to identify tumors with either focal amplification (copy number =>5 and segment length < 3Mb) or homozygous deletions (copy number = 0 and segment length < 3Mb), in known oncogenes (for amplifications) or tumor suppressor genes (for deep deletions). Oncogenes and tumor suppressor genes were defined according to the Cancer Gene Census (https://cancer.sanger.ac.uk/census), accessed 23rd October 2019, and events with greater than 5% frequency in the CPI1000+ cohort were analysed. The difference in Oncogene/TSG amplification/deletion frequency was compared between responders and non-responders using a one-sided 2x2 Fisher’s exact test (events were hypothesised to associate with resistance only, as they are not collateral passenger events that may cause sensitization).
Analysis of single cell RNA sequencing data
All sequencing data was assessed to detect sequencing failures using FASTQC and lower quality reads were filtered or trimmed using TrimGalore. Outlier samples containing low sequencing coverage or high duplication rates were discarded. Analyses using the RNAseq data were performed in the R statistical computing framework, version 3.5 using packages from BioConductor version 3.7. The single cell RNAseq samples were mapped to the GRCh38 reference human genome, as included in Ensembl version 84, using the STAR algorithm and transcript and gene abundance were estimated using the RSEM algorithm. After quantification, the scater package was used to set filtering thresholds, based on using spike ins and mitochondrial genes to filter out bad quality cells, filtering by total number of genes and filtering by total number of sequenced reads. The remaining cells were used after normalizing using size-factors estimated by the SCRAN package. Downstream analyses used log2 transformed normalized count data. All count data, metadata and intermediate results were kept within a SummarisedExperiment/SingleCellExperiment R object. The data was processed using the edgeR BioConductor package that was used for outlier detection and differential gene expression analyses. Differentially expressed genes were assessed based on their protein coding status.
Statistical methods
Unless otherwise stated (e.g. the section above “Derivation of published biomarkers”), odds ratios were calculated using Fisher's Exact Test for count data, Kruskal-Wallis test was used to test for a difference in distribution between three or more independent groups, and Mann Whitney U test was used to assess for a difference in distributions between two population groups. Logistic regression was used to assess multiple variables jointly for independent association with binary outcomes. Overall survival analysis was conducted using a Cox proportional hazards model. Statistical analysis were carried out using R3.4.4 (http://www.r-project.org/) or greater. We considered a P value of 0.05 as being statistically significant. Any analysis with 25 or more comparisons was subject to multiple testing correction using the R p.adjust function, with FDR method.