Sample enrollment of the PREDICT project
Based on the design of our multicenter cooperative PRospective Early Detection In a population at high risk for Common malignant Tumor (PREDICT) study (registered at ClinicalTrials.gov, clinical trial number NCT04405557), a total of 1158 plasma samples were enrolled from 31 medical institutions in China between December 2017 to March 2021. Fifty-five samples were later discarded because of the lack of substantial clinical information, the failure in sequencing library construction, and the decline in sequencing quality. More specifically, the remaining 1103 samples (1084 participants) constituted the retrospective and prospective cohorts, forming the training and prospective validation datasets for the PREDICT model.
As for the enrollment criteria, our PREDICT project mainly focused on four categories of samples: HCC, LC, healthy, and individuals with abnormalities in physical examinations, i.e., possessing high risk of developing types of cancer. The retrospective cohort was considered as the liver cancer early detection model training dataset so we included HCC, LC, high-risk, and healthy samples. The HCC and LC samples were mainly collected from the Eastern Hepatobiliary Surgery Hospital, Xiangya Hospital of Central South University, and Zhuhai People’s Hospital in China. The diagnosis of HCC or LC was based on the evidence of magnetic resonance imaging (MRI), ultrasound (US), or computed tomography (CT) scan. HCC patients with any treatment history (including ablation, chemotherapy, surgery, etc.) were primarily excluded. An additional 6 to 12-month follow-up visit was performed for the LC patients to exclude samples possessing possible pre-carcinogenic genomic characteristics. Clinicopathological characteristics and key metrics in liver function blood tests were recorded for the enrolled retrospective HCC or LC patients and 114 LC as well as 164 HCC samples corresponding to 271 patients were finally enrolled in the retrospective cohort. As for high-risk individuals, we mainly included samples with primary diseases as well as samples with direct abnormalities in cancer-related tests, i.e., possessing high risk of developing liver, breast, gastric, colorectal, ovarian, pancreas, or lung cancer. Only individuals above the age of 45 were initially selected. In regards to samples with primary diseases, those possessing diseases related to the liver, breast, esophagus, lung, stomach, throat, thyroid, colorectum, and other organs were included. Addedly, cancer-related tests recommended by consensus guidelines of cancer screening were selected to prioritize other high-risk individuals. As for liver cancer high-risk individuals, those meeting any of the criteria below were enrolled: 1) Alpha-fetoprotein (AFP) > 20ng/mL in two successive tests within a month; 2) positive hepatitis B surface antigen (HBsAg) or hepatitis C virus core antigen (HCVAg) testing results with damaged liver function; 3) existing US-detected liver nodules and excluding the possibility of angioma; 4) presence of compensated liver cirrhosis. Criteria for breast cancer high-risk individuals include 1) the breast imaging-reporting and data system (BI-RADS) score > 3 in mammography or US examinations; 2) Cancer antigen 125 (CA125) > 35U/mL and BI-RADS score > 2; 3) Cancer antigen 15 − 3 (CA-153) > 25U/mL and BI-RADS score > 2; 4) Family history of breast and ovarian cancer, with BI-RADS score > 2. Patients with more than two abnormalities in the following serum gastric function tests: 1) human pepsinogens I (PGI) ≤ 70ug/L; 2) progesterone receptors (PgR) ≤ 7.0; 3) gastrin-17 (G17) ≤ 1pmol/L or G17 ≥ 15pmol/L were selected as gastric cancer high-risk individuals. As for the colorectum high-risk samples, we defined the following enrollment criteria: 1) Carcinoembryonic antigen (CEA) > 7ng/mL in two successive tests within a month; 2) Positive in fecal occult blood test (FOBT) and excluding the possibility of hemorrhoid; 3) CEA > 7ng/mL and positive in FOBT. Additionally, the criteria of ovarian cancer high-risk individuals include 1) CA125 > 70U/mL in two successive tests within a month; 2) CA125 > 35U/mL and abnormal human epididymis protein 4 (HE4) in two successive tests within a month; 3) existing US-detected ovarian masses (> 5cm for patients before menopause, > 3.5cm for patients after menopause). Moreover, individuals with carbohydrate antigen 19 − 9 (CA199) > 25U/mL in two successive tests within a month or existing US-detected pancreatic lesions were categorized into the pancreas cancer high-risk group. As a final point, individuals possessing pulmonary nodules in low-dose CT (LDCT) tests were collected as lung cancer high-risk patients. A total of 8 liver cancer and 32 other cancer type high-risk samples were collected with at least 12-month follow-up to exclude patients with diagnosed cancer. Healthy controls were collected as those who did not meet the enrollment criteria of HCC/LC/high-risk and lacked the history of cancer. A total of 53 healthy individuals were integrated into the retrospective model training cohort. All participants with HCC provided 10mL peripheral blood as well as matched tumor samples, including fresh frozen or formalin-fixed, paraffin-embedded (FFPE) tumor tissue specimens. Participants from the other three categories only provided 10mL peripheral blood samples.
We further designed a prospective cohort of 732 samples (720 participants) for comprehensive model performance validation. Using identical enrollment criteria described above, 310 liver high-risk, 340 other cancer type high-risk and 82 healthy samples were recruited from physical examination participants in 24 medical centers in China. A minimum of 12-month follow-up through clinical examinations or phone calls was performed to confirm the diagnostic status of liver cancer. Similarly, 10mL peripheral blood samples were collected for each participant at physical examination.
Sample pre-processing and library preparation
Peripheral blood samples were collected in 10mL Streck tubes and separated by centrifugation at 1600×g for 10 minutes within three days from collection. The supernatant was further transferred to microcentrifuge tubes, centrifuged again at 16000×g for 10 minutes to remove cell debris, and stored at − 80°C. Circulating cfDNA was extracted from 2.4-8mL (median 7.7mL) plasma using the QIAamp Circulating Nucleic Acid Kit (Qiagen, Hilden, Germany). Germline genomic DNA was isolated from peripheral blood lymphocytes (PBLs) using the QIAamp DNA Blood Mini Kit (Qiagen). Matched tumor DNA was extracted from fresh frozen or FFPE tumor tissue specimens using the QIAamp DNA Mini Kit (Qiagen) and ReliaPrep™ FFPE gDNA Miniprep System (Promega, Madison, WI), respectively. The concentration and fragment length of extracted cfDNA was determined using an Agilent 2100 Bioanalyzer (Agilent Technologies, Inc., Santa Clara, CA).
Later, the germline genomic DNA and tumor DNA (median amount 800 ng) were sheared into fragments at a 200–250bp peak with a Covaris S2 Ultrasonicator (Covaris, Inc., Woburn, MA). The indexed Next-Generation Sequencing (NGS) libraries were further constructed using NEBNext® Ultra™ DNA Library Prep Kit for Illumina® (NEB, Ipswich, MA). A median amount of 50ng cfDNA was used for NGS library construction and unique identifiers (UIDs) were tagged on each double-stranded DNA to distinguish authentic somatic mutations from artifacts, improving the ability to precisely track individual plasma molecules.
Target region design and next-generation sequencing
We designed a 293-gene panel covering a 196Kbp genome especially for HCC early detection. Genes harboring the most common driver mutations, actionable sensitive and resistant mutations in liver cancer were integrated. Frequently mutated regions were addedly included based on datasets from in-house cancer sequencing database and public databases including COSMIC (http://cancer.sanger.ac.uk/cosmic) and TCGA (https://cancergenome.nih.gov/). The final panel covered whole coding regions of 13 genes and specific regions of 280 genes. Additionally, considering the HBV integration event demonstrated an inseparability on HCC carcinogenesis, eight genotypes of the HBV genome were simultaneously included in the sequencing panel.
For the constructed DNA libraries, we first used the above custom-designed panel (Integrated DNA Technologies, Inc., Coralville, IA) for hybridization enrichment. The indexed libraries were further sequenced using a 100bp paired-end configuration on a DNBSEQ-T7RS sequencer (MGI Tech, Shenzhen, China) or Gene+Seq-2000 sequencing system (Geneplus-Suzhou, Suzhou, China), respectively producing 2Gb, 10Gb, and 3Gb sequenced data for PBLs, plasma, and fresh specimen/FFPE libraries. The average coverage at target regions for plasma samples was > 30000x.
Sequencing data processing procedure
The sequenced reads from three types of libraries were mapped to the reference human genome (GRCh37) using the default parameters in BWA software (v0.6.2) after removing adaptors and low-quality reads. Duplicate reads were marked and removed using MarkDuplicates tool in Picard (v4.0.4.0, Broad Institute) for tumor and germline genomic DNA. Duplicate reads in cfDNA were identified by the concatenated UIDs. The position of template fragments was utilized by realSeq software (v3.1.0, in-house) for the elimination of errors introduced by the PCR or NGS process. Additional local realignment around single nucleotide variants (SNVs) and small insertions and deletions (InDels) as well as alignment quality assessment were conducted by GATK software (v3.4.46, Broad Institute). The tumor fraction estimation using off-target reads was conducted by ichorCNA_offtarget software [32] (https://github.com/GavinHaLab/ichorCNA_offtarget).
Somatic variant detection and primary filtration
Tumor somatic SNVs and InDels were primarily identified by realDcaller software (v1.7.1, in-house) and TNscope (Sentieon Inc., San Jose, CA) software. For cfDNA, SNV calling was performed using realDcaller specifically optimized for ultra-low frequency mutation calling and TNscope was used as an auxiliary tool to improve the detection of longer InDels.
Upon annotation completion, variants met the following criteria were initially filtered out: 1) the variants present in matched germline genomic DNA; 2) the single-nucleotide polymorphisms (SNPs) with > 1% population allele frequency in Exome Aggregation Consortium (ExAc) or 1000 Genomes Project; 3) the sequencing depth at variant position < 300x. The filtered SNVs were further used in our mutational HCC malignancy model construction. Gene-level and fragment-level copy number variations (CNVs) were called by CONTRA (v2.0.8) [33] and ichorCNA.
The construction of the SNVScore model
After the initial mutation filtering above, we next aim to prioritize key mutations that demonstrated higher significance along the HCC evolutional trajectory. To begin with, we defined a set of hotspot (recurrent) HCC mutations by collecting tissue and plasma samples with various pathological stages in our in-house database. More specifically, 99 early-stage (Barcelona clinic liver cancer (BCLC) Stage 0/A/B) plasma, PBL and tissue samples, 272 late-stage tissue and 453 late-stage PBL and plasma samples were selected. The top 5 mutated genes in early-stage samples were selected as candidates harboring hotspot mutations. Later, we inspected mutations from these five genes and only those met one of the following criteria were included in the hotspot mutation set: 1) mutation frequency > 2 in early-stage plasma samples; 2) mutation frequency > 2 in early-stage tissue samples and detected in early-stage plasma sample; 3) mutation found simultaneously in early and late-stage tissue and plasma samples. Based on the proposed criteria, the hotspot mutations tend to occur early in the carcinogenesis and demonstrate potential governmental role in tumor development.
For each mutation obtained in the retrospective training cohort, we applied different filtering criteria regarding their overlapping situation with the hotspot set. For mutations falling into hotspot mutation set, those meet the following criteria were kept: 1) the supporting duplex reads ≥ 2 or the supporting high-quality reads ≥ 4; 2) not a background mutation in our in-house background mutational database. As for other non-hotspot mutations detected in retrospective samples, they were filtered following the below criteria: 1) not a background mutation in the in-house database; 2) not a clonal hematopoietic (CH) mutation in realDcaller results; 3) ≥ 0.1% variant allele frequency; 4) the supporting duplex reads ≥ 2. These stepwise mutational filtering guaranteed the further construction of the mutational HCC malignancy prediction model.
According to the machine learning-based Lung Cancer Likelihood in Plasma (Lung-CLiP) method [14], we constructed a machine learning-based (ML) SNV model to quantify the tumor relevance of individual SNVs. Regarding our experimental design and data processing procedure, 14 mutational features from Lung-CLiP paper were selected in our SNVScore model generation. For each filtered mutation used for model training and validation, these 14 SNV features were initially computed. Later, the retrospective HCC group-derived mutations simultaneously detected in plasma and matched tissue samples were used as the positive set while non-HCC-derived mutations were selected as the negative set in the training process. Multiple models and multiple ML performance metrics were compared for model prioritization through a 5-fold cross-validation ML training process. Finally, the unknown set defined as the HCC-derived mutations only detected in plasma was subjected to SNVScore model prediction, producing the quantitative contribution of a mutation in hepatocarcinogenesis. The SNVScore values were used as part of the input features in the subsequent PREDICT model.
The HBV integration event identification procedure
As previously mentioned, we integrated 8 HBV genotypes (A-H) in our sequencing panel design. The HBV integration event and breakpoint determination were conducted by in-house software NCsv (v1.0.0) on HCC and LC samples in the retrospective cohort. More detailedly, HBV genomes including AF090842, AB602818, AB014381, M32138, AB032431, AB036910, AB064310 and AY090454 were integrated in our sequencing panel. After adaptor trimming, quality control and duplication removal, we conducted read alignment relatively on human and HBV genomes using BWA software and the mean (µ) and standard deviation (σ) of insert size were calculated. Later three types of read pairs including soft clipped-read pair (read containing simultaneous human and HBV alignments), single-unmapped read pair (one of the paired read failed to align to all genomes), and discordant pair (fragments with insert size larger than µ + 3.96*σ) were collected from the alignment results in NCsv tool. All three types of reads were later re-assembled and re-aligned for precise integration point determination and the breakpoint supporting reads were counted. An integration event was present when the breakpoint supporting read number ≥ 2 and the breakpoint on the human genome was located on the integration hotspot genes, including TERT and MLL4. The presence of events was used in the subsequent PREDICT model construction. Additionally, to gain a more comprehensive understanding of the pathological relevance of integration events, we computed the breakpoint read frequency as the number of three read pair categories defined above dividing the sequencing depth at the breakpoint.
cfDNA fragment size analysis and ML-based model construction
Considering our targeted sequencing strategy, we mainly focused on the reads from off-target regions for the fragmentation feature generation, which could be regarded as reads generated by a low-coverage WGS experiment. More detailedly, off-target reads were defined as reads not falling into ± 250bp of probe regions and with mapping quality ≥ 20. These reads were firstly extracted, sorted, and indexed as a bam file. Later the microscopic and macroscopic characteristics of fragments were generated based on these reads. The microscopic feature focused on the 5Mbp bin-level fragment distribution on the 22 chromosomes. The ratio between short (100-150bp) and long (151-220bp) fragments in each bin was calculated using the sorted and indexed bam file and the ratios were further normalized by the Locally Weighted Scatterplot Smoothing (LOWESS) method. Later Z-score normalization was additionally applied to LOWESS-smoothed ratio values. Finally, the sum of Z-score on each chromosome arm (including 1p, 1q, 2p, 2q, 3p, 3q, 4p, 4q, 5p, 5q, 6p, 6q, 7p, 7q, 8p, 8q, 9p, 9q, 10p, 10q, 11p, 11q, 12p, 12q, 13q, 14q, 15q, 16p, 16q, 17p, 17q, 18p, 18q, 19p, 19q, 20p, 20q, 21q, 22q) was used as the microscopic fragment features in FRAGScore model training. As for the macroscopic signatures, we focused on the global fragment size distribution. The off-target bam files were primarily conducted down-sampling to guarantee the precision of sample-wise comparisons. Later the fragment distribution between 90 to 600bp was extracted from down-sampled files and the number of fragments at each length with 1bp step was quantified, resulting in a 511-dimension array for each plasma sample. The stochastic gradient descent (SGD) classifier with L1 regularizer was subsequently used for the importance evaluation of each value and the quantifications with importance > 0.07 were extracted while the adjacent fragment lengths were merged for feature complexity reduction. A total of 35 intervals (including (90,92), (94,98), (131,136), (139,147), (151,161), (166,166), (168,187), (189,189), (191,195), (200,249), (251,259), (273,275), (278,311), (323,323), (325,325), (327,329), (331,374), (376,377), (379,379), (393,394), (396,396), (398,420), (422,426), (428,431), (433,433), (435,435), (449,450), (452,462), (464,527), (529,533), (540,540), (543,543), (548,548), (565,565), (584,584)) were obtained and the fragment length quantification was repeatedly conducted on these intervals for the final ML model feature generation. Finally, the microscopic and macroscopic genomic features from HCC and non-HCC (the integration of LC, high-risk, and healthy group) retrospective plasma samples were subjected to 5-fold cross-validation using multiple ML models. The prediction values of the selected FRAGScore model were further used in the construction of the final PREDICT model.
HCC-specific nucleosome footprint identification and feature generation procedures
Distribution of nucleosome footprints (NFs) can reflect cell-type specific biological activities. We identified NFs at transcription start sites (TSSs) by calculating the integrated fragmentation score (IFS) around TSS and comparing the IFS profiles in retrospective HCC and LC samples. More specifically, IFS was calculated using the formula:
$${V}_{i}=n+\sum _{j=1}^{n}{S}_{coverage}\frac{{len}_{j}}{{S}_{len}}$$
where Vi was the IFS for genomic position i, n was the read coverage (fragment number) at i, and lenj was the length of fragment j. Slen was the summed fragment length on the selected chromosome while Scoverage was the number of fragments on selected chromosome. IFS intrinsically possesses positive correlations with read coverage and fragment length. For each sequenced plasma sample, the off-target reads with mapping quality ≥ 30 and fragment length between 30-1000bp were retained. Later, IFS at the middle point of each fragment was calculated. TSS information was also collected from a publication [34] and those with abnormal mapping status as well as sequencing coverage were removed [35], resulting in a total of 207992 TSSs. We further focused on the flanking 2.5Kbp of these remaining TSSs and calculated the IFS flank and IFS Z-score metrics. The IFS flank value was defined as the summed IFS in the TSS flanking window, while the IFS Z-score was calculated on the current window and IFS flank values in 1000 random windows on the selected chromosome. The IFS Z-scores corresponding to 207992 TSSs were subsequently compared between HCC and LC samples by two-sided Wilcoxon rank sum test and 171 TSSs with group-wise P-value < 1e-6 were finally prioritized. The coverage values at selected TSSs were visualized by fitting Gaussian distributions on the smoothed and Z-score normalized coverage at the TSS site and 2.5Kbp flanking regions with 50bp step length. The IFS Z-score distributions from the 171 selected TSSs were further used for IFScore calculation through ML methods, which was utilized in the construction of final PREDICT model. Consequently, we used Reactome, Gene Ontology, and PheGenI databases [36–38] to uncover the biological associations of the selected TSSs.
Feature selection and derivation of the final PREDICT model
Apart from the SNVScore, HBV integration event quantification, FRAGScore, and IFScore values, four additional features were generated for the final ML model. Read number supporting the two hotspot mutations (TP53c.747G > Tp.R249S and TERTc.-58-u66C > T) that possessed the highest mutational frequency and consistency in early-stage HCC tissue and plasma samples were calculated for each retrospective sample. The number of hotspot mutations among all identified SNVs in each plasma sample was also recorded. As for the deleteriousness feature, we constructed reference distributions on three mutational metrics and borrowed the concept of P-value to derive the probability of the observed mutation being deleterious. More detailedly, the PaPI score [39], SNVScore, as well as mutation frequency of all the detected SNVs in retrospective HCC tissue samples were computed and sorted, forming three reference distributions. For each SNV in every retrospective plasma sample, the three metrics mentioned above were compared with reference distributions by counting the proportions of tissue-derived mutations possessing metrics smaller than the plasma SNV. The largest proportion value from SNVs in one sample was recorded as the final deleteriousness feature. Conclusively, nine features covering aspects of hepatocarcinogenesis were used as inputs and the retrospective samples were categorized into HCC and non-HCC groups for final PREDICT model construction. The cutoff on the PREDICT model outputs were determined by 95% specificity. The sensitivity as well as specificity values were used in model performance evaluation.
Statistical and performance evaluation methods for ML model selection
Methods including the Wilcoxon rank sum test, Student's t-test and Fisher’s exact test were used accordingly to compare values in two populations while the Kruskal-Wallis test quantified the difference among multiple distributions. Regarding the differences intrinsic to input features, various ML models including Logistic regression (LR), stochastic gradient descent (SGD) classifier, random forest (RF), support vector classifier (SVC), Gaussian naïve bayes (NB), and Bernoulli NB were used for cross-validation-based multi-model selection. Python package scikit-learn (https://scikit-learn.org/) was used for ML model implementation. Twelve ML performance metrics including specificity, sensitivity, precision, negative predictive value (NPV), accuracy, F1 score, false omission rate (FOR), positive likelihood ratio (PLR), negative likelihood ratio (NLR), Matthews correlation coefficient (MCC), Fowlkes-Mallows index (FMI) as well as the area under receiver operating characteristic curve (AUC) were computed using in-house scripts and used for model prioritization.