The pan-cancer study cohort
Plasma samples from cancer patients were obtained from the U-CAN biobank which collects samples from consenting patients diagnosed at the Akademiska hospital in Uppsala as part of the clinical routine and with a high degree of standardization (1). Plasma samples were obtained from treatment-naïve patients taken around the time of their diagnosis. Plasma was prepared from whole blood by centrifugation at 2.400 g for seven minutes at room temperature, after which the plasma was aliquoted into several 220 µl vials and immediately frozen for long-term storage at -80°C. Exclusion criteria included any concurrent or previous cancer within the last five years, and arm-to-freezer time exceeding 360 minutes. Diagnosis, stage, age, gender and other variables were obtained from the U-CAN database and the patient’s clinical records. The study was approved by the Swedish Ethical Review Authority (EPM dnr 2019-00222). The research was in line with donor consents in U-CAN (28631533, EPN Uppsala 2010-198 with amendments).
The Wellness healthy cohort
Plasma samples from healthy individuals (39 males and 35 females) were selected from a Swedish SciLifeLab SCAPIS Wellness Profiling (S3WP) study as described previously (2, 3). The S3WP program includes longitudinal samples from 101 healthy individuals aged 50-64, recruited from the prospective observational Swedish CArdioPulmonary bioImage Study (SCAPIS). The study was approved by the Ethical Review Board of Goteborg, Sweden (registration number 407-15), and all participants provided written informed consent. The study protocol conforms to the ethical guidelines of the 1975 Declaration of Helsinki.
Measurement of protein levels
The protein levels were measured in plasma using the Olink Explore PEA technology (4), which uses antibody-binding capabilities to detect the levels of 1,463 targets in plasma coupled with next-generation sequencing (NGS) readout. A total of 1,472 proteins were targeted using specific antibodies, including 1,463 unique proteins related on inflammation, oncology, cardiometabolic and neurology, as well as controls. Each antibody was conjugated separately with two complementary probes, and distributed in four separate 384-plex panels. Each panel contained three control assays (interleukin-6 (IL6), interleukin-8 (CXCL8), and tumor necrosis factor (TNF) used for quality control (QC). In brief, the PEA workflow started with an overnight incubation to allow the conjugated antibodies to bind to the corresponding proteins in the samples. The incubation was followed with an extension and pre-amplification step when the hybridization and extension of complementary probes takes pace. The extended DNA was then amplified by PCR and further indexed to allow the preparation of libraries, which were then sequenced using Illumina’s NovaSeq platform. The counts obtained from the sequencing run were subjected to a quality control and normalization procedure. Here, internal controls introduced at different steps were used to reduce intra-assay variability. These include an incubation control consisting of a non-human antigen measured with the same technology, an extension control consisting of an antibody conjugated to a unique pair of probes which are in proximity and is expected to produce a positive signal, and a control in the amplification step consisting of a double-stranded DNA sequence which is expected to produce a positive signal independent of the amplification step. Additionally, external controls such as negative control (buffer sample) and plate controls (pool of plasma) were used to establish a limit of detection (LOD) and adjust levels between plates, respectively. Finally, two known samples acted as sample controls to calculate the precision of the measurements. After quality control and normalization, the data was provided in an arbitrary Normalized Protein eXpression (NPX) unit, which is on a log2 scale and where a high NPX value can be interpreted as a high protein level.
Disease prediction
The caret R package (v 6.0.90) (5) was used to build multivariate classification models for each of the cancer types. First, the cancer data was split in 70% for training purposes and 30% for testing purposes using the createDataPartition() function in caret. The data was imputed with the preProcess() function in caret using the “knnImpute” method.
Multivariate prediction models were built for each the different cancers in two settings: 1) based on all measured proteins (n= 1,463) and having a control group composed of a subset of patients from all other cancers; and 2) based on a selected set of proteins and having healthy patients as a control. In both cases, the cancer prediction model was built on the training set using a 5-fold cross-validation and built-in parameter tuning. The contribution of each protein to the model was retrieved using the varImp() function in the caret package. A multiclass classification model was built using the caret train() function to achieve a simultaneous classification of the 12-cancer types based on all cancer samples in the training set and selected panel of proteins, with 5-fold cross validation strategy and parameter tunning.
ROC analyses
The performance of the prediction models was evaluated using the samples in the testing set, which were not part of the training of any of the models. ROC analyses were performed to assess the sensitivity and specificity of the classification, summarized as AUC scores. The pROC R package (v 1.18.0) was used for binary classifications and multiROC (v 1.1.1) was used for multiclass classification. Statistical significance for differences in AUC were calculated using the DeLong test (6) implementation in the pROC package, using paired tests for correlated ROC curves and unpaired test when for independent ROC curves, using p-value of 0.05 as threshold for significance.
Differential expression analysis
The differential protein expression was assessed using a two-sided t-test coupled with Benjamini-Hochberg multiple hypothesis correction (7), with a significance threshold of 0.05 for adjusted p-values. The adjusted p-values and difference in average expression per group were summarized in volcano plots for each of the analyzed cancers. Enrichment analysis of up-regulated protein sets were performed using the clusterProfiler package (version 3.18.1) (8). The enricher() function in clusterProfiler was used to perform overrepresentation analysis against the biological annotations from Gene Ontology (GO) biological processes (BP) (9), with subsequent p-value adjustment using the Benjamini-Hochberg method (7) and using adjusted p-value < 0.05 as threshold for significance.
Tissue transcriptomic analysis
Tissue transcriptomes were downloaded from the Human Protein Atlas v21 (10) and each protein marker was mapped to its corresponding gene. The normalized transcript per million (nTPM) was used as a quantitative measure of expression level, and an nTPM ≥ 1 was used as a cutoff to call a gene expressed in a given tissue.
Data visualization
Data visualization was performed in R (version 4.0.3) (11), using the ggplot2 (version 3.3.5) (12), ggbeeswarm (version 0.6.0) (13), ggraph (version 2.0.5) (14), ggrepel (version 0.9.1) (15), ggridges (version 0.5.3) (16), ggplotify (version 0.1.0) (17), igraph (version 1.2.6) (18), pheatmap (version 1.0.12) (19), patchwork (version 1.1.1) (20), pcaMethods (version 1.82.0) (21), tidygraph (version 1.2.0) (22), UpSetR (version 1.4.0) (23) and uwot (version 0.1.10) (24) packages. For the heatmap visualization, data was rescaled to a 0-1 scale and hierarchical clustering was performed using the “ward.D2” method. The limma R package (version 3.46.0) (25) was used to correct for batch differences for the comparison between the UCAN and Wellness cohorts, and to correct for sex effects for UMAP visualization of cancer samples. The first 30 components resulting from principal component analysis (PCA) were used as input data for UMAP visualization. The figures were assembled in Affinity designer (v 1.10.0.1127).
Method reference list
1. B. Glimelius et al., U-CAN: a prospective longitudinal collection of biomaterials and clinical information from adult cancer patients in Sweden. Acta Oncol 57, 187-194 (2018).
2. G. Bergstrom et al., The Swedish CArdioPulmonary BioImage Study: objectives and design. J Intern Med 278, 645-659 (2015).
3. A. Tebani et al., Integration of molecular profiles in a longitudinal wellness profiling cohort. Nat Commun 11, 4487 (2020).
4. L. Wik et al., Proximity Extension Assay in Combination with Next-Generation Sequencing for High-throughput Proteome-wide Analysis. Mol Cell Proteomics 20, 100168 (2021).
5. M. Kuhn, Building Predictive Models in R Using the caret Package. Journal of Statistical Software 28, 1-26 (2008).
6. E. R. DeLong, D. M. DeLong, D. L. Clarke-Pearson, Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837-845 (1988).
7. Y. Benjamini, Y. Hochberg, Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society 57, 289-300 (1995).
8. T. Wu et al., clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. Innovation (Camb) 2, 100141 (2021).
9. M. Ashburner et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-29 (2000).
10. M. Uhlen et al., Tissue-based map of the human proteome. Science 347, 1260419 (2015).
11. R Core Team, R: A Language and Environment for Statistical Computing. MSOR connections 1, (2014).
12. H. Wickham, ggplot2: Elegant Graphics for Data Analysis. (Springer-Verlag New York, 2016).
13. E. Clarke , S. Sherrill-Mix, ggbeeswarm: Categorical Scatter (Violin Point) Plots. (2017).
14. T. L. Pedersen, ggraph: An Implementation of Grammar of Graphics for Graphs and Networks. (2021).
15. K. Slowikowski, ggrepel: Automatically Position Non-Overlapping Text Labels with 'ggplot2'. (2021).
16. C. O. Wilke, ggridges: Ridgeline Plots in 'ggplot2'. (2021).
17. G. Yu, ggplotify: Convert Plot to 'grob' or 'ggplot' Object. (2021).
18. G. Csardi, T. Nepusz, The igraph software package for complex network research. (2006).
19. R. Kolde, pheatmap: Pretty Heatmaps. (2019).
20. T. L. Pedersen, patchwork: The Composer of Plots. (2020).
21. W. Stacklies, H. Redestig, M. Scholz, D. Walther, J. Selbig, pcaMethods--a bioconductor package providing PCA methods for incomplete data. Bioinformatics 23, 1164-1167 (2007).
22. T. L. Pedersen, tidygraph: A Tidy API for Graph Manipulation. (2020).
23. N. Gehlenborg, UpSetR: A More Scalable Alternative to Venn and Euler Diagrams for Visualizing Intersecting Sets. (2019).
24. J. Melville, uwot: The Uniform Manifold Approximation and Projection (UMAP) Method for Dimensionality Reduction. (2020).
25. M. E. Ritchie et al., limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43, e47 (2015).