Literature Search and List of Cancer Types
Based on the report from the 2013–2017 United States Cancer Statistics (USCS) database, we identified the top ten incident cancer types for females and males, after excluding non-melanoma skin cancer25. First, we surveyed the NHGRI-EBI Catalog of Published Genome-Wide Association Studies (GWAS Catalog)33 and the Polygenic Risk Score (PGS) Catalog34 to select the largest European ancestry-based GWAS as of May 2020 for each cancer type. We additionally searched PubMed35 for large cancer-specific GWASs that were not included in the GWAS Catalog or PGS Catalog. For breast and colorectal cancer, we searched for prior large-scale polygenic risk score (PRS) studies based on European samples as of July 2020 and selected the studies reporting the best-performing PRS (Supplementary Table S5). We did not consider pleiotropic GWASs. We filtered to cancer types with at least ten independent SNPs remaining after LD clumping at the genome-wide significance (GWS) p-value threshold of 5E-8. Ultimately, eleven cancer types (bladder, breast, colorectum, endometrium, kidney, lung, melanoma, Non-Hodgkin’s lymphoma, ovary, pancreas, and prostate) were included in our analysis. For the full list of source literature and GWAS summary statistics included in our analysis, see Supplementary Table S5.
UK Biobank Study Population
UK Biobank (UKBB) is a prospective epidemiological cohort study with over 500,000 participants36–38. Individuals aged 40–69 at baseline were recruited from 2006 to 2010 across the United Kingdom (UK)36–38. A wide range of genotypic and phenotypic information, including personal medical and family history and lifestyle data, was collected at enrollment36–38. UKBB data are regularly updated through follow-up questionnaires and linkage to national cancer and mortality registries and hospital inpatient electronic medical record systems36–38. Through linkage to the national cancer registry, cancer diagnosis date and type (coded according to the International Classification of Diseases, 9th Revision (ICD-9) or 10th Revision (ICD-10)) were available for participants diagnosed with cancer36–38. For our analysis, we used ICD-9 and ICD-10 codes for cancer classification (see Supplementary Table S4).
We then filtered to unrelated UKBB participants of White British ancestry with imputed genotype data. We excluded individuals who were lost to follow-up, whose genetic sex and self-reported sex did not match, or who had any cancer diagnosis prior to baseline assessment (prevalent cancers). These quality control procedures resulted in a study population of 160,586 females and 144,817 males.
SNP Selection Criteria
After determining the source literature (Supplementary Table S5) for each cancer type, we reviewed the manuscript and any relevant additional resources. We extracted all autosomal SNPs from each cancer GWAS along with their summary statistics, including RSIDs, observed effect size estimates (OR or beta), effective (or risk) allele, risk allele frequency (RAF), and p-value. We excluded variants with minor allele frequency (MAF) < 0.01 and strand-ambiguous SNPs (A/T or G/C alleles) with MAF > 0.40. We retained variants whose MAF differed by less than 0.10 from that in the UK Biobank data. We removed variants with allele mismatches that could not be resolved by strand or dosage flips, as well as SNPs with a complete information mismatch (RSID, chromosome number, and position) relative to the European 1000 Genomes reference panel39 or the UK Biobank data. We filtered to variants with an information score \(\ge\) 0.90 based on the UK Biobank imputed genotype data. Finally, using PLINK40, we performed LD clumping at a p-value threshold of 5E-8 and an \(r^2\) of 0.1, with the European 1000 Genomes panel39 as the LD reference, to remove SNPs in linkage disequilibrium within each cancer type.
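The per-variant filters described above can be sketched as a single predicate. This is a minimal illustration rather than the pipeline used in the study: the `Variant` record and its field names are hypothetical, the MAF < 0.01 filter is assumed to apply to the GWAS-reported frequency, and allele-mismatch resolution and LD clumping (performed in PLINK) are not shown.

```python
from dataclasses import dataclass

# Hypothetical record; field names are illustrative, not taken from the study files.
@dataclass
class Variant:
    rsid: str
    a1: str          # effective (risk) allele
    a2: str          # other allele
    gwas_maf: float  # minor allele frequency reported by the GWAS (assumption)
    ukb_maf: float   # minor allele frequency in UK Biobank
    ukb_info: float  # UK Biobank imputation information score

# A/T and G/C pairs look identical on both strands.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}

def is_strand_ambiguous(v: Variant) -> bool:
    return frozenset((v.a1.upper(), v.a2.upper())) in AMBIGUOUS_PAIRS

def passes_qc(v: Variant) -> bool:
    """Apply the thresholds stated in the text to one variant."""
    if v.gwas_maf < 0.01:                             # rare variant
        return False
    if is_strand_ambiguous(v) and v.gwas_maf > 0.40:  # strand cannot be inferred from frequency
        return False
    if abs(v.gwas_maf - v.ukb_maf) >= 0.10:           # frequency disagrees with UK Biobank
        return False
    if v.ukb_info < 0.90:                             # poorly imputed
        return False
    return True
```

Strand-ambiguous SNPs with MAF at or below 0.40 are kept because the allele frequency can still indicate which strand the effect allele was reported on.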
Polygenic Risk Score
We then computed the PRS for UK Biobank participants using the PRSice-2 software41.
The formula used for PRS calculation in PRSice-2 is:
\({PRS}_{j}=\sum_{i}{\beta }_{i}\,{SNP}_{ij}\), where \({PRS}_{j}\) is the PRS for the jth individual, \({\beta }_{i}\) is the observed effect size estimate for the ith SNP, and \({SNP}_{ij}\) is the dosage information for the effective allele of the ith SNP for the jth individual. We standardized each PRS to have unit variance and zero mean.
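As a small worked sketch of the weighted sum and the standardization step, the functions below compute a raw score per individual and then rescale the scores to zero mean and unit variance. This does not reproduce PRSice-2 itself; the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

def raw_prs(betas, dosages):
    """PRS_j = sum_i beta_i * SNP_ij for one individual j."""
    if len(betas) != len(dosages):
        raise ValueError("betas and dosages must have the same length")
    return sum(b * d for b, d in zip(betas, dosages))

def standardize(scores):
    """Rescale scores to zero mean and unit variance (population SD assumed)."""
    mu = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        raise ValueError("scores have zero variance")
    return [(s - mu) / sd for s in scores]

# Illustrative values only: three SNPs, three individuals.
example_betas = [0.10, -0.05, 0.20]
example_dosages = [[0, 1, 2], [1, 1, 1], [2, 0, 0]]
example_scores = [raw_prs(example_betas, d) for d in example_dosages]
example_z = standardize(example_scores)
```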
Statistical Analysis
We developed a sex-specific pan-cancer risk prediction model to estimate the risk of developing at least one cancer over the course of follow-up. The multicancer model included eleven cancer types (bladder, breast [Female-only], colorectum, endometrium [Female-only], kidney, lung, melanoma, Non-Hodgkin’s lymphoma, ovary [Female-only], pancreas, and prostate [Male-only]). Data were split into a training set (two-thirds) and a test set (one-third); the test set served as an independent validation dataset for model performance evaluation and subsequent analysis.
A Cox proportional hazards regression (Cox) model42 was fitted to the training set, with the outcome defined as the first incidence of any cancer included in the analysis. The model specified a baseline hazard as a function of age and assumed multiplicative effects of the risk factors42:
$$\lambda \left(t\mid z\right)={\lambda }_{0}\left(t\right)\exp\left({\beta }_{1}{z}_{1}+{\beta }_{2}{z}_{2}+\dots +{\beta }_{n}{z}_{n}\right)$$
\(t\) : time-to-event, measured as age at first cancer incidence, censoring, or death
\({\lambda }_{0}\left(t\right)\) : baseline hazard function
\({z}=\left({z}_{1}{, z}_{2},\dots ,{z}_{n}\right):\) set of covariates (risk factors) included in the Cox model
\({\beta }=({\beta }_{1},{\beta }_{2},\dots , {\beta }_{n})\) : set of coefficients (log hazard ratios) for the predictors
Polygenic risk scores for each cancer, family history of cancer (breast, colorectum, lung, and prostate) in any first-degree relatives (nonadopted), body mass index, and pack-years of smoking were included as predictors in the model (see Supplementary Figures S1 and S2). We also adjusted for the first ten principal components. Because UKBB is a left-truncated and right-censored cohort, we used age as the timescale for the Cox model: participants enter the model at recruitment age and exit at cancer incidence age, censoring age, or death age, whichever occurs first. We used the censoring date for the cancer registry data provided by UKBB43. Because complete data are required for all predictors included in the analysis, missing risk factor values were imputed using non-parametric random forest-based multiple imputation44.
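The age-as-timescale setup can be illustrated by deriving, for each participant, the entry age, exit age, and event indicator that a left-truncated Cox model takes as input. This is a sketch under stated assumptions: the function and argument names are hypothetical, ages are approximated as days divided by 365.25, and a cancer diagnosis that falls on the same date as death or censoring is counted as an event.

```python
from datetime import date
from typing import Optional, Tuple

DAYS_PER_YEAR = 365.25  # approximation used to convert days to years

def age_at(birth: date, when: date) -> float:
    return (when - birth).days / DAYS_PER_YEAR

def age_interval(
    birth: date,
    recruitment: date,
    registry_censor: date,
    cancer_dx: Optional[date] = None,
    death: Optional[date] = None,
) -> Tuple[float, float, bool]:
    """Return (entry_age, exit_age, event) for one participant."""
    # Each candidate exit is (date, is_cancer_event).
    candidates = [(registry_censor, False)]
    if death is not None:
        candidates.append((death, False))
    if cancer_dx is not None:
        candidates.append((cancer_dx, True))
    # Earliest date wins; on ties the cancer event is preferred (assumption).
    exit_date, event = min(candidates, key=lambda c: (c[0], not c[1]))
    if exit_date <= recruitment:
        raise ValueError("exit must occur after recruitment (prevalent cases are excluded)")
    return age_at(birth, recruitment), age_at(birth, exit_date), event
```

A participant therefore contributes to the risk set only between entry age and exit age, which is how left truncation at recruitment is handled.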
We computed pan-cancer risk scores (PCRS) or cancer-specific risk scores for all UKBB participants as the weighted sum of the predictors, with the weight for each predictor being the estimated log hazard ratio (HR) from the fitted Cox model. Then, in the test set, we assessed the discriminatory accuracy of the PCRS or the cancer-specific risk score (for individual cancer models) using Harrell's concordance index (C-statistic) and the area under the curve (AUC) at five years of follow-up.
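For readers unfamiliar with the C-statistic, a minimal pairwise version is sketched below. It is a simplified illustration for right-censored data only: it does not account for left truncation, it skips pairs with tied event times, and it is not the implementation used in the study. The function name and arguments are hypothetical.

```python
def harrell_c(times, events, scores):
    """Fraction of comparable pairs in which the higher score fails earlier.

    times  : observed time (event or censoring) per individual
    events : True/1 if the event was observed, False/0 if censored
    scores : risk score per individual (higher means higher predicted risk)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on an individual with an observed event
        for j in range(n):
            if times[i] < times[j]:  # j was still event-free when i failed
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to a score with no discriminatory ability, and 1.0 to a score that ranks every comparable pair correctly.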
Absolute Risk Estimation, Positive Predictive Value, and Negative Predictive Value Calculations
Absolute Risk Estimation using iCARE (Individualized Coherent Absolute Risk Estimation)45:
We used iCARE to build our absolute risk model. The detailed methodology for absolute risk model building is described in Choudhury et al. (2020)45. Briefly, risk estimates for each individual in the test set were obtained by supplying the model with age-specific cancer incidence rates in 1-year strata, the log HR parameters from the Cox model, and the reference dataset. We used 2016 cancer incidence rates in white individuals from the SEER*Stat database46. The cancer incidence rate for a given age and sex was taken from the rate for the following one-year age stratum. For instance, in our study, cancer incidence rates for females aged 50–51 correspond to SEER*Stat's cancer incidence rates for females aged 51–52. This accounts for the fact that the DETECT-A test was performed at study enrollment and the female participants were followed up over the course of 12 months. DETECT-A and Galleri will both be used to detect cancers early, prior to conventional diagnosis. The reference dataset was obtained by simulating 10,000 samples, representative of the underlying UKBB population, from a normal distribution with the mean and standard deviation of the PCRS or cancer-specific risk score.
Time window Selection:
The DETECT-A study reported an empirical positive predictive value (PPV) of 19.4% (95% CI: 13.1%–27.1%)20. We sought a time window for absolute risk estimation such that the PPV for females aged 65–75 equals the point estimate of 19.4% reported in the DETECT-A study20. We varied the time window in one-month steps around one year and calculated the weighted average PPV for females aged 65–75 based on the UKBB PCRS distribution and the age distribution reported by the US Census Bureau47. We found that a time window of 11 months provides the best match between the overall PPV for the 65–75 group and the value of 19.4%. We therefore calculated PPV and negative predictive value (NPV) for different age and PCRS risk groups based on the underlying 11-month absolute risk. For Galleri, we used a time window of one year17,22. For DETECT-A, we omit the calculation of projected PPVs and NPVs for males because it does not include prostate cancer (the highest-incidence cancer for males) as one of its detectable cancer types46.
Given the absolute risk estimate \(p\left(x\right)\) for an individual with age and risk factor profile \(x\), the positive predictive value and negative predictive value of the multicancer liquid biopsy test can be calculated using the formulas below:
$$Se=\text{sensitivity};\quad Sp=\text{specificity}$$
$$PPV\left(x\right)=\frac{Se\times p\left(x\right)}{Se\times p\left(x\right)+\left(1-Sp\right)\times \left(1-p\left(x\right)\right)}$$
$$NPV\left(x\right)=\frac{Sp\times \left(1-p\left(x\right)\right)}{\left(1-Se\right)\times p\left(x\right)+Sp\times \left(1-p\left(x\right)\right)}$$
The absolute risk estimate can be written as a function of age and risk factors. We assumed that the sensitivity and specificity of the multicancer liquid biopsy test do not depend on the underlying risk factors, and we used the values reported by the DETECT-A and Galleri studies (Supplementary Table S1)20,22.
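The PPV and NPV formulas translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and the sensitivity, specificity, and absolute risk in the example are placeholder values, not the DETECT-A or Galleri estimates from Supplementary Table S1.

```python
def _check_probability(name: str, value: float) -> None:
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must lie in [0, 1], got {value}")

def ppv(risk: float, se: float, sp: float) -> float:
    """PPV = Se*p / (Se*p + (1 - Sp)*(1 - p)), with p the absolute risk."""
    for name, value in (("risk", risk), ("se", se), ("sp", sp)):
        _check_probability(name, value)
    return se * risk / (se * risk + (1.0 - sp) * (1.0 - risk))

def npv(risk: float, se: float, sp: float) -> float:
    """NPV = Sp*(1 - p) / ((1 - Se)*p + Sp*(1 - p)), with p the absolute risk."""
    for name, value in (("risk", risk), ("se", se), ("sp", sp)):
        _check_probability(name, value)
    return sp * (1.0 - risk) / ((1.0 - se) * risk + sp * (1.0 - risk))

# Placeholder inputs for illustration (not values from the cited studies).
example_ppv = ppv(risk=0.01, se=0.50, sp=0.99)
example_npv = npv(risk=0.01, se=0.50, sp=0.99)
```

Because sensitivity and specificity are held fixed, PPV rises and NPV falls as the absolute risk increases, which is why projected values differ across age and risk-score groups.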