Machine learning-enabled early detection of hepatocellular carcinoma utilizing cell-free DNA mutation and fragmentation multiplicity: a prospective study

doi:10.21203/rs.3.rs-3848622/v1


This work is licensed under a CC BY 4.0 License

Version 1


Effective methods for the early diagnosis of hepatocellular carcinoma (HCC) could greatly benefit disease control. Many approaches to early liver cancer detection exploit the genetic aberrations embedded in cell-free DNA, and integrating multiple feature types could improve model performance and interpretability. Cohort design and prospective performance validation also strongly affect model generalizability. To address these limitations, we conducted the PRospective Early Detection In a population at high-risk for Common malignant Tumor (PREDICT) study (clinical trial number NCT04405557), which integrated mainly single nucleotide variants (SNVs) and fragmentation information in a model built on 371 retrospective participants for efficient early HCC detection. The PREDICT model reached 88.41% sensitivity and 95.65% specificity and performed well across different clinicopathological populations. We further integrated the PREDICT model into physical examination packages and prospectively recruited 720 participants from 24 medical institutions, among whom the model reached 100% sensitivity and 86.7% specificity. Our model balances cost, performance, and interpretability and offers an alternative for regular screening of individuals at high risk of HCC and for preventive screening of the healthy population.

Health sciences/Biomarkers/Diagnostic markers

Health sciences/Diseases/Cancer/Cancer screening

As the major histological subtype of liver cancer, hepatocellular carcinoma (HCC) commonly develops in patients with chronic liver disease who carry multiple disease-related risk factors, including hepatitis B virus (HBV) or hepatitis C virus (HCV) infection, non-alcoholic fatty liver disease (NAFLD), alcoholic liver disease, and liver cirrhosis (LC). These risk factors drive the accumulation of crucial hepatocarcinogenic genetic or epigenetic changes, resulting in advanced fibrosis or cirrhosis, dysplastic nodules, and eventually HCC. In addition, owing to the scarcity of visceral nociceptors, perceptible pathognomonic symptoms often appear only at later pathological stages of HCC. Effective methods for early HCC diagnosis could therefore provide valuable guidance for personalized medical intervention.

Numerous HCC diagnostic guidelines have been established recently, and the methods they cover can be stratified by the invasiveness of sample collection. Percutaneous biopsy carries a risk of cancer cell dissemination and performs markedly worse on small nodules and early-stage cases. Multiple biomarkers obtained from blood tests and imaging examinations have been introduced for non-invasive HCC diagnosis, e.g. the combination of ultrasound imaging (US) and alpha-fetoprotein (AFP) measurements [1]. However, this scheme showed limited sensitivity and specificity [2]. Imaging methods such as computed tomography (CT) or magnetic resonance imaging (MRI) discriminate early-stage cases better but remain labor intensive, equipment dependent, and expensive. There is an urgent and unmet clinical need for novel non-invasive HCC screening strategies with higher performance, stronger participant adherence, lower cost, and reduced infrastructure dependence.

Its non-invasiveness and rich information content have prompted the extensive application of cell-free DNA (cfDNA) in cancer screening, progression surveillance, and treatment selection [3]. For early liver cancer detection, many methods have been proposed that exploit the genetic aberrations embedded in cfDNA. Single nucleotide variants (SNVs) have long been regarded as the most common trigger of tumorigenesis; they therefore offer high biological interpretability and have demonstrated versatility in early-stage cancer detection and localization [4]. However, the concentration of tumor-derived cfDNA in early-stage or small tumors is relatively low, which calls for highly sensitive techniques. Mutations in cfDNA are also highly heterogeneous among individuals. Deriving a set of genuinely deleterious mutations and suitable quantitative metrics is therefore a prerequisite for using SNVs in early detection tasks. Recently, fragmentomic features have become a research hotspot in the early diagnosis of liver cancer [5]. In practice, an ultra-low-coverage whole genome sequencing (WGS) strategy is often adopted to control cost. However, GC content bias worsens at lower coverage, requiring additional effort in the design of bias reduction methods [6]. The DNA methylation landscape has also been exploited recently, because methylation changes arise early during tumorigenesis and progression [7, 8]. However, confounders including lifestyle, behavior, and age strongly affect the baseline level of DNA methylation [9]. Inflammation and tissue damage also increase cfDNA release [10], emphasizing the importance of deriving unbiased, cancer-specific methylation markers. In summary, distinct tumor-derived cfDNA aberrations offer complementary information. By carefully designing a multidimensional feature set, and even incorporating liver-specific events such as HBV integration [11], an early cancer diagnosis model can be built for high prediction performance, interpretability, and generalizability.

Cohort design is another linchpin of accurate and robust early cancer detection. Primary high-risk populations such as LC and hepatitis B surface antigen-seropositive (HBsAg+) individuals have often been included in liver cancer detection studies. However, the at-risk population should cover as much of the spectrum of etiologies and pathological states as possible. Participants with non-liver disorders should also be considered, given the intertwined multi-organ regulatory networks involved in hepatocarcinogenesis. In addition, the capacity of a cancer detection model should be assessed in a prospective cohort, i.e., in real-world screening tests. Prospective liver cancer detection studies are currently scarce, possibly because of the difficulty of recruiting participants and collecting complete follow-up information. Physical examinations should also be conducted in prospective studies, as they provide a relatively detailed picture of each participant's physical condition.

Considering the possible improvements stated above, we conducted the PRospective Early Detection In a population at high risk for Common malignant Tumor (PREDICT) study (clinical trial number NCT04405557), which integrated SNV, fragmentation, and HBV integration information in a model built on 371 retrospective participants for early HCC detection. Through a panel designed from unbiased prior knowledge and ultra-deep targeted sequencing, we quantified the contribution of SNVs during hepatocarcinogenesis at low cost. Further multi-dimensional feature abstraction markedly enhanced model interpretability. Including individuals at high risk of HCC or of other cancer types introduced additional information on multi-organ interrelationships. The PREDICT model reached 88.41% sensitivity and 95.65% specificity and performed well across different clinicopathological populations. The model was also integrated into physical examination packages and further validated in a prospective cohort of 720 participants from 24 medical institutions in China, in which it reached 100% sensitivity and 86.7% specificity. To our knowledge, PREDICT is the first early HCC detection method to use non-HCC high-risk populations in both model construction and performance validation. Integrating the model into physical examination packages ensured abundant clinicopathological information and reflected the complexity of realistic screening applications. Our model balances cost, performance, and interpretability.

Multifarious clinicopathological characteristics were collected for sample categories in the retrospective cohort

As stated in Materials and Methods, 164 HCC, 114 LC, 8 liver high-risk, 32 other cancer type high-risk, and 53 healthy control samples were included in the retrospective cohort, together with a wide range of clinicopathological characteristics (Suppl. Table 1). Notably, 40 samples from individuals at high risk of developing liver or other types of cancer were included, offering genomic characteristics that are latent in early liver carcinogenesis and prevalent among malignant cancer types (details in Materials and Methods). In more detail, the gender distribution differed little between the HCC and LC groups (P-value = 0.503, Fisher’s exact test, Suppl. Table 2), while the HCC group had a significantly higher age at diagnosis (Student's t-test, Suppl. Table 2). Earlier Child-Pugh grades dominated in our LC samples, while BCLC stage 0 and A samples constituted more than 57% of our retrospective HCC group (Suppl. Table 2). This proportion of early-stage samples exceeded that of several liver cancer early detection studies, including CR-2021 [12], JHO-2023 [13], and CD-2023 [2] (Suppl. Table 3). Patients in both groups had a high HBV infection rate, and key liver-related biomarkers including AFP and Des-Gamma-Carboxy Prothrombin (DCP) were significantly increased in HCC patients (Suppl. Table 2). We also collected extensive liver nodule information by MRI or US, given the close involvement of nodules in hepatocarcinogenesis. All retrospective samples achieved high quality in our targeted sequencing strategy (Extended Data Fig. 2A-E), and both HCC and LC samples showed significantly elevated cfDNA concentration levels (Extended Data Fig. 2F).

Computation of SNVScore demystifies the position of key SNVs in the HCC carcinogenesis evolutional trajectory

Hepatocarcinogenesis is a dynamic process with distinct pathological hallmarks, and quantifying the contribution of genetic factors could help elucidate HCC pathogenesis. Borrowing the concept of the Lung-CLiP tool, which computes the likelihood that a blood sample contains tumor-derived cfDNA [14], we defined the SNVScore as the probability that an SNV is tumor-related. In other words, the SNVScore quantifies the contribution of cfDNA-derived SNVs to hepatocarcinogenesis. Given our experimental design and mutation calling method, we selected 14 features for SNVScore model construction (Suppl. Table 4, details in Materials and Methods).

To begin with, a total of 923 tissue or plasma samples from distinct HCC stages were selected from the in-house database (Fig. 1A) to derive the hotspot mutation set, which was used as a reference when preparing the SNVScore model training data. Specifically, we first identified the genes with the highest mutational frequency in early-stage HCC tissue and plasma (Fig. 1B). Genes including TP53, TERT, CTNNB1, ARID1A, and AXIN1 showed highly consistent mutational frequencies as well as relatively higher blood release (Fig. 1B). Interestingly, these genes not only displayed similar mutational frequency rankings in late-stage HCC tissue and plasma samples (Extended Data Fig. 3A), but also preserved high cross-stage mutational frequency in public cohorts including NG-2015 [15] (Extended Data Fig. 3B), CCR-2018 [16] (Extended Data Fig. 3C) and TCGA-HCC (Extended Data Fig. 3D).

Based on the above observations in in-house and public cohorts, we next scrutinized mutations in the 5 top genes across the 923 in-house samples and defined the hotspot mutation set using three criteria. Mutations exhibiting early-stage specificity, promising blood release, and cross-stage consistency were retained (Fig. 1C, details in Materials and Methods), and a total of 28 SNVs constituted the final hotspot mutation set (Extended Data Fig. 3E). Filtered mutations in the retrospective cohort were then classified as hotspot or non-hotspot, and different additional filters were applied to each class (Fig. 1D, details in Materials and Methods). A total of 2215 mutations were retained and categorized into positive, negative, and unknown mutational sets (Fig. 1D, details in Materials and Methods). More explicitly, the positive training set contained 245 HCC mutations with high detectability in blood, while the negative set contained 657 non-HCC mutations; together they form the two endpoints of tumorigenesis relevance and the data for SNVScore model training (Fig. 1D). Because SNVs and CNVs constitute the main tumor-related genetic aberrations, the fragment-level and gene-level CNV landscapes in our discovery cohort were also investigated. Few high-level copy number amplifications (HLAMPs) were observed, and the fragment-level CNV showed a heterogeneous group-wise distribution (Extended Data Fig. 4A). Similarly, the gene-level CNV in all samples (Extended Data Fig. 4B) and in samples from the positive and negative training sets defined above (Extended Data Fig. 4C) demonstrated high heterogeneity in unsupervised clustering analyses, supporting the use of the 14 SNV features in our ML model construction.

After SNV feature extraction and multi-collinearity checking (Extended Data Fig. 4D), we conducted 5-fold cross-validation and multi-model selection during training using the positive and negative mutation sets. Models including a stochastic gradient descent (SGD) classifier, random forest (RF), and Gaussian and Bernoulli naïve Bayes (NB) were selected for performance comparison, and 12 ML metrics were employed for model prioritization (details in Materials and Methods). As shown in Fig. 1E, the SGD classifier demonstrated stable high performance and outperformed the other models (Fig. 1F-G). Different features contributed to the model performance (Extended Data Fig. 4E), and, as expected, the SNVScore of mutations from the positive and negative sets was markedly higher for HCC-derived SNVs (Extended Data Fig. 4F). The SNVScore also correlated positively with Mutation Allele Frequency (MAF), cfDNA concentration, and tumor fraction (Fig. 1H), suggesting a connection between our model and intratumor heterogeneity (ITH). The 1313 mutations detected only in HCC plasma (Fig. 1D) may originate from unknown tumors or from organ damage. Their SNVScore showed only a slight association with MAF (Fig. 1H).
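The cross-validation setup described above can be illustrated with a minimal sketch. It only shows how the 245 positive and 657 negative mutations could be partitioned into 5 class-balanced folds; the stratification, the seed, and the function name `stratified_kfold` are assumptions for illustration and are not taken from the study's Materials and Methods.

```python
import random

def stratified_kfold(labels, k=5, seed=0):
    """Return k lists of test indices with near-equal class proportions.

    Hypothetical helper: the study does not state whether folds were stratified.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

# Set sizes reported in the text: 245 positive, 657 negative, 1313 unknown.
n_pos, n_neg, n_unknown = 245, 657, 1313
assert n_pos + n_neg + n_unknown == 2215  # matches the 2215 retained mutations

labels = [1] * n_pos + [0] * n_neg
folds = stratified_kfold(labels)
for fold_id, test_idx in enumerate(folds):
    positives = sum(labels[i] for i in test_idx)
    print(fold_id, len(test_idx), positives)
```

Each fold then serves once as the held-out set while a candidate model is fitted on the remaining four, and the 12 metrics would be averaged over folds for model comparison.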

Analyses on HBV integration events in plasma samples unearthed their intricate associations with HCC progression

HBV integration into the human genome is a recurrent event in HBV-related HCC. Analogous to mutational changes, HBV integration occurs with markedly higher probability at hotspot regions, including the TERT and KMT2B genes in HCC [17]. Such hotspot alterations are highly conserved between tissue and cfDNA samples [18]. We included all 8 HBV genotypes in our targeted sequencing panel and used a novel algorithm that considers multiple alignment states to identify HBV integrations in key oncogenes (Fig. 2A, details in Materials and Methods). In brief, reads aligning partially to the human and HBV genomes, read pairs in which one mate is unaligned, and fragments with deviating insert size were selected as candidates for HBV integration event detection. Read pair re-assembly and re-alignment to the human and HBV reference genomes were subsequently conducted on these candidates for breakpoint identification.
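The candidate selection step described above can be sketched as a simple filter over read pairs. This is an illustrative sketch only: the `ReadPair` record, the `candidate_reason` function, and the thresholds `min_clip` and `max_insert` are hypothetical, and the study's actual algorithm (including re-assembly and re-alignment) is described in its Materials and Methods.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReadPair:
    """Hypothetical, simplified summary of one aligned read pair."""
    name: str
    mate1_ref: Optional[str]  # "human", "hbv", or None if the mate is unaligned
    mate2_ref: Optional[str]
    soft_clip: int            # longest soft-clipped stretch on either mate (bp)
    insert_size: int          # inferred fragment size; 0 if undefined

def candidate_reason(pair, min_clip=20, max_insert=600):
    """Return why a pair is an integration candidate, or None if it is not.

    The three categories mirror the text; the thresholds are assumptions.
    """
    aligned = [r for r in (pair.mate1_ref, pair.mate2_ref) if r is not None]
    if len(aligned) == 1:
        return "half_unaligned"      # one mate aligned, the other not
    if len(aligned) == 2:
        if pair.soft_clip >= min_clip:
            return "partial_alignment"  # read only partly matches its reference
        if abs(pair.insert_size) > max_insert:
            return "deviated_insert"    # fragment size outside expected range
    return None

pairs = [
    ReadPair("split", "human", "human", soft_clip=35, insert_size=180),
    ReadPair("half", "human", None, soft_clip=0, insert_size=0),
    ReadPair("far", "human", "human", soft_clip=0, insert_size=2500),
    ReadPair("ok", "human", "human", soft_clip=0, insert_size=170),
]
candidates = {p.name: candidate_reason(p) for p in pairs}
print(candidates)
```

Only pairs with a non-`None` reason would be passed on to re-assembly and re-alignment against both reference genomes.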

Through the combined procedure described above, a total of 63 HBV integration events were prioritized (Fig. 2B); they were located at the TERT promoter and the KMT2B intron region of the human genome. A total of 24 HCC samples harbored HBV integration breakpoints, while an integration event was found in only 1 LC sample (Fig. 2C). When inspecting the genome-wide distribution of integrations, the TERT promoter region of the human genome (Fig. 2D) and the HBV-C genotype (Fig. 2E) showed elevated integration frequency. Among the genes encoded by the HBV genome, the PreC/C, S, and X genes showed similar integration frequencies (Fig. 2F).

We next combined clinicopathological information with HBV integrations to look for possible clinical relevance. Because integration events occurred mainly in HCC samples, the single LC sample was excluded from the following analyses. Among continuous clinical factors, cfDNA concentration and log2(DCP) levels were significantly higher in HCC samples harboring integration events (Fig. 2G). Unsurprisingly, cfDNA concentration was also positively associated with the number of integrations per sample (Extended Data Fig. 5A) and with breakpoint read frequency (Extended Data Fig. 5B), defined as the proportion of supporting reads at HBV breakpoints (details in Materials and Methods). When inspecting the correlations between integration events and several key clinical factors, we identified associations among DCP levels, tumor size, tumor fraction, and HBV integration (Extended Data Fig. 5C). Breakpoints in samples with positive tumor fraction also showed significantly higher read frequency (Extended Data Fig. 5D), implying the involvement of HBV integration events in carcinogenesis. Among genes on the HBV genome, the read frequency of breakpoints in the P and PreC/C genes showed a substantial upward trend and decreased heterogeneity (Extended Data Fig. 5E). With regard to categorical factors, samples from different BCLC stages harbored a range of HBV integrations (Fig. 2H), and the site-wise read frequency correlated positively with malignancy (Extended Data Fig. 5F). Interestingly, HBV integration events decreased with increasing tumor number (Fig. 2I), and samples with more than 3 tumors exhibited the lowest integration site heterogeneity (Extended Data Fig. 5G). In addition, the occurrence of genomic integration (Fig. 2J) increased with microvascular invasion (MVI) grade. In conclusion, we identified critical HBV integration events in plasma samples from HCC patients and revealed intricate connections between integrations and tumor progression.

Scrutiny of the plasma cfDNA fragmentation patterns variegated our cognition on HCC development

Aberrant fragment size is one of the key characteristics of circulating tumor DNA (ctDNA). As shown in Fig. 3A, the fractions of fragments with short (90-150 bp), long (151-220 bp), and extra (221-600 bp) lengths, calculated from the BAM files derived from off-target regions (details in Materials and Methods), were specific to the sample origin. More specifically, HCC plasma samples were significantly enriched in short fragments (Fig. 3B, Extended Data Fig. 6A), in agreement with previous reports [19]. Interestingly, the fraction of long fragments was drastically diminished in LC samples, while HCC samples exhibited the highest relative concentration of long fragments among the four sample categories (Fig. 3C, Extended Data Fig. 6B). This trend reversed for extra length fragments (Fig. 3D, Extended Data Fig. 6C). Disease-specific gene regulatory maps gave rise to these fluctuations in fragment size (Extended Data Fig. 6D).
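The three size classes defined above can be computed from a list of fragment lengths as in the following minimal sketch. The bin boundaries come from the text; the choice of denominator (all fragments within 90-600 bp) and the function name `size_fractions` are assumptions for illustration.

```python
# Size classes in bp, inclusive on both ends, as given in the text.
SIZE_BINS = {"short": (90, 150), "long": (151, 220), "extra": (221, 600)}

def size_fractions(lengths):
    """Fraction of fragments per size class.

    Assumption: fragments outside 90-600 bp are ignored, so the three
    fractions sum to 1 whenever at least one fragment is in range.
    """
    counts = {name: 0 for name in SIZE_BINS}
    for n in lengths:
        for name, (lo, hi) in SIZE_BINS.items():
            if lo <= n <= hi:
                counts[name] += 1
                break
    total = sum(counts.values())
    return {name: (c / total if total else 0.0) for name, c in counts.items()}

# Toy fragment lengths; 80 and 700 fall outside every class.
example = [100, 140, 166, 170, 200, 250, 80, 700]
fractions = size_fractions(example)
print(fractions)
```

In practice the lengths would be taken from the insert sizes of properly paired off-target reads rather than from a hand-written list.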

We next inspected the correlation between fragment length and key clinicopathological factors. For more intuitive group-wise comparisons, we normalized the fragment size distributions by the averaged distribution of healthy samples. As expected, for short and long fragments the HCC samples differed most from the healthy group (Extended Data Fig. 6E). In addition, several HCC samples showed enrichment of “shorter” extra length fragments, while “longer” extra length fragments prevailed in cirrhosis samples. Interestingly, we observed similar fragment size distributions in the 40 high-risk samples and the 53 healthy controls (Extended Data Fig. 6E), possibly due to limited cfDNA release into the blood. We next combined key clinicopathological characteristics with the normalized fragment size distributions to explore possible associations. Fragmentation patterns showed only a weak connection to key categorical clinical factors in both HCC (Extended Data Fig. 6F) and LC (Extended Data Fig. 6G) samples. Surprisingly, strong associations of short fragment fraction with tumor size, tumor fraction, and DCP levels were observed specifically in HCC samples (Extended Data Fig. 6H-I).
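One way to read the normalization step described above is sketched below: each sample's fragment size distribution is expressed as a density over lengths and the mean healthy density is subtracted. Whether the study used a difference, a ratio, or another transformation is not stated here, so the subtraction, the length range, and the function names are assumptions.

```python
def density(lengths, lo=90, hi=600):
    """Per-length frequency of fragments within [lo, hi] bp (sums to 1)."""
    counts = [0] * (hi - lo + 1)
    for n in lengths:
        if lo <= n <= hi:
            counts[n - lo] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def normalize_by_healthy(sample_density, healthy_densities):
    """Subtract the mean healthy density from a sample density.

    Assumption: normalization is a per-length difference; positive values
    mark lengths enriched relative to the healthy average.
    """
    n = len(healthy_densities)
    mean_healthy = [sum(col) / n for col in zip(*healthy_densities)]
    return [s - h for s, h in zip(sample_density, mean_healthy)]

# Toy example on a 3-length range so the numbers are easy to follow.
healthy = [density([1, 2, 3, 3], lo=1, hi=3), density([1, 1, 2, 3], lo=1, hi=3)]
sample = density([1, 1, 1, 2], lo=1, hi=3)
delta = normalize_by_healthy(sample, healthy)
print(delta)
```

Because every density sums to 1, the normalized profile sums to 0, which makes enrichment and depletion directly comparable across groups.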

We then turned to deriving ML features that summarize the aberrant fragment size distributions. Two types of fragmentomic features were computed (Fig. 3E). Briefly, the arm-level feature is based on the normalized and adjusted ratio between short and long fragments in relatively small genomic windows, while the bin-level feature considers the coarser fragment size intervals that show higher prominence (details in Materials and Methods). The window-level adjusted ratio showed more drastic signal fluctuations in the HCC subgroup (Extended Data Fig. 7A). Several key chromosome arms carrying HCC-specific signal elevation or degradation were also discovered, including chromosomes 4q, 16q, 19p, and 20q (Extended Data Fig. 7B). Our arm-level features correlated strongly with known recurrent CNV events in tissue and showed high specificity for HCC [20, 21]. Additionally, our bin-level feature selection step filtered out fragment size intervals with lower fragment numbers (Extended Data Fig. 7C), thereby reducing data complexity. Unsurprisingly, both the arm-level (Extended Data Fig. 7D) and bin-level (Extended Data Fig. 7E) features discriminated well between HCC and non-HCC samples. After checking multi-collinearity (Extended Data Fig. 7F), cross-validation-based multi-model selection was conducted to derive the final FRAGScore model (Fig. 3E). The linear regression (LR) model exhibited highly stable performance (Fig. 3F) and surpassed the other models on key ML metrics (Fig. 3G and Extended Data Fig. 7G). Interestingly, arms harboring HCC-specific feature amplification (Extended Data Fig. 7B) contributed positively to the FRAGScore, while arms with tumor-specific feature degradation showed the opposite trend (Extended Data Fig. 7H). The feature contributions also agreed closely with the level of discrepancy between HCC and non-HCC groups (Fig. 3H).

Integration of the IFScore supplemented our liver cancer early detection model with additional interpretability and performance

Transcription of a specific gene can only be initiated once nucleosome positions at the TSS region have been reprogrammed, i.e., once nucleosome-free regions (NFRs) have formed. NFRs are hypersensitive to nuclease treatment [22], producing region-specific coverage depletion, or nucleosome footprints (NFs), in cfDNA profiles. We calculated the integrated fragmentation score (IFS) to quantify NFs at TSS and compared the IFS profiles of HCC and LC samples to prioritize key genes (Fig. 4A, details in Materials and Methods). As expected, the IFS value at fragment midpoints correlated strongly and positively with the number of fragments in the HCC, LC, healthy, and high-risk subpopulations (Extended Data Fig. 8A-D, left). Higher IFS values were also associated with elevated average fragment size (Extended Data Fig. 8A-D, right). The normalized IFS profiles at promoter regions of the HCC and LC subgroups were then compared by two-sided Wilcoxon rank sum tests. A total of 171 TSS passed our P-value filtering threshold, 168 (98.25%) of which had a higher normalized IFS value in the HCC subgroup, indicating attenuated expression of these genes in HCC (Fig. 4B). Notably, the standard deviations of the Gaussian distributions fitted to the normalized read coverage at the 171 selected TSS were clearly elevated in the LC group (Fig. 4C, details in Materials and Methods).
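The per-TSS comparison described above relies on a two-sided Wilcoxon rank sum test. The sketch below implements the test with a normal approximation and tie correction, without continuity correction, using only the standard library; statistical packages may report slightly different P-values for small groups. The function name and the toy values are illustrative, and the study's P-value threshold is not restated here.

```python
import math
from statistics import NormalDist

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum (Mann-Whitney U) test.

    Returns (U statistic for x, P-value) using the normal approximation
    with tie correction and no continuity correction.
    """
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2           # mean of ranks i+1 .. j
        t = j - i                            # size of this tie group
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    return u, 2 * (1 - NormalDist().cdf(abs(z)))

# Toy normalized IFS values at one TSS (hypothetical numbers).
lc_ifs = [1.0, 2.0, 3.0, 4.0, 5.0]
hcc_ifs = [6.0, 7.0, 8.0, 9.0, 10.0]
u_stat, p_value = rank_sum_test(lc_ifs, hcc_ifs)
print(u_stat, p_value)
```

Applied to every TSS, a filter on `p_value` followed by a comparison of group medians would yield the kind of selected set and direction count reported above.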

We next examined the biological associations of the genes corresponding to the selected TSS. Analysis against the Reactome pathway database revealed enrichment of terms related to cell-cell junctions and interleukin-7 (IL-7) signaling (Extended Data Fig. 8E). Similar terms associated with deregulated interleukins and cell junction organization were found in the Gene Ontology database (Extended Data Fig. 8F). In addition, key genes that occurred more than twice in the 171 TSS set (Extended Data Fig. 8G) showed the strongest association with the LC phenotype in the Phenotype-Genotype Integrator (PheGenI) database (Extended Data Fig. 8H).

Normalized IFS values from the 171 selected TSS in HCC and non-HCC populations were then gathered for ML model construction. Again, the LR model performed consistently well across the metrics (Fig. 4D-E).

Delicate design of the PREDICT model reached a balance between high performance and expenditure in HCC early detection

Given the vital importance of mutagenesis in cancer initiation, we calculated additional mutational features. First, the number of reads supporting the two hotspot mutations with high prevalence and consistency in early-stage tissue and plasma samples (Extended Data Fig. 3E) was calculated in each retrospective plasma sample (Fig. 5A, details in Materials and Methods). For the deleteriousness feature, mutations detected in the retrospective HCC tissue samples were used: their PaPI score, SNVScore, and mutation frequency were quantified and sorted to form three reference distributions. The positions of plasma mutations on these reference distributions were compared, and the maximum ranking was recorded for each sample (Fig. 5A, details in Materials and Methods). In total, nine features were used for the final detection model (Fig. 5A). As expected, the curated features differed substantially between HCC and non-HCC samples (Extended Data Fig. 9A) and showed varied enrichment across sample groups (Extended Data Fig. 9B). As before, the LR model showed consistently robust high performance (Extended Data Fig. 9C) and was used for the final PREDICT model construction, i.e., the calculation of PREDICTScore (Fig. 5A). All nine features contributed non-negligible power to HCC sample stratification (Extended Data Fig. 9D).
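The deleteriousness feature described above ranks each plasma mutation against a sorted reference distribution and keeps the maximum per sample. A minimal sketch of that idea follows; expressing the position as a percentile rank, handling of ties via `bisect_right`, and the value 0.0 for samples without mutations are assumptions, as are the function names and toy scores.

```python
from bisect import bisect_right

def percentile_rank(reference_sorted, value):
    """Fraction of reference values that are <= value (between 0 and 1)."""
    return bisect_right(reference_sorted, value) / len(reference_sorted)

def max_ranking(reference, sample_values):
    """Highest percentile rank reached by any mutation in one sample.

    Assumption: a sample without mutations gets 0.0.
    """
    if not sample_values:
        return 0.0
    ref = sorted(reference)
    return max(percentile_rank(ref, v) for v in sample_values)

# Hypothetical reference scores from tissue mutations (e.g. one of the
# three reference distributions) and scores of one plasma sample.
tissue_reference = [0.1, 0.2, 0.4, 0.8]
plasma_scores = [0.3, 0.5]
feature = max_ranking(tissue_reference, plasma_scores)
print(feature)
```

Repeating this for each of the three reference distributions gives one feature per distribution and per sample.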

We next examined the performance of PREDICTScore in retrospective subpopulations with various clinicopathological statuses. The PREDICTScore in HCC samples was significantly higher than in the other three groups (Extended Data Fig. 9E). The cirrhosis group also had a markedly higher PREDICTScore than the other non-HCC groups, while the score differed negligibly between the healthy and high-risk groups (Extended Data Fig. 9E). Overall, our PREDICT model reached a sensitivity of 88.41% and a specificity of 95.65% (Fig. 5B). Among the clinicopathological characteristics (Fig. 5C), continuous factors including cfDNA concentration, tumor fraction, and tumor size correlated strongly and positively with PREDICTScore (Fig. 5D, left). Among categorized factors, HCC samples with higher DCP levels (≥ 100 mAU/mL) showed significantly higher PREDICTScore (Fig. 5D, right). The predictions also rose with increasing tumor number (Extended Data Fig. 10A), more advanced BCLC stage (Extended Data Fig. 10B), and higher MVI grade (Extended Data Fig. 10C). The model identified 85.3% of BCLC stage 0-A, 93.3% of stage B, and 97.2% of stage C patients (Fig. 5E), outperforming other key measurements in liver function blood tests as expected (Fig. 5E). Similarly, the PREDICT model detected 80.3% of HCC patients with clinical stage I and 95.7% and 96.2% of those with stage II and III, respectively (Extended Data Fig. 10D). Our model also identified 82% of HCC patients with tumor size < 50 mm and more than 98% of HCC patients with larger tumors (Extended Data Fig. 10E). Up to 86.6% of HCC patients with one tumor were detected by our model, and performance approached 100% among patients with more tumors (Extended Data Fig. 10F). When HCC patients were categorized by AFP levels, the PREDICT model maintained high performance (Fig. 5F). Additionally, the performance of the PREDICT model was not strongly affected by the presence of LC in HCC patients (Extended Data Fig. 10G) or by HBV infection status (Extended Data Fig. 10H). Lastly, HCC samples misclassified by DCP or AFP blood tests were re-classified with the PREDICT model. The PREDICT model recovered 66.67%, 82.98%, and 78.13% of the false negative samples from DCP or AFP blood tests (Extended Data Fig. 10I), whereas the DCP or AFP methods performed unsatisfactorily.
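As a worked check of the headline retrospective figures, the sketch below computes sensitivity and specificity from a confusion matrix. The group sizes (164 HCC and 114 + 8 + 32 + 53 non-HCC samples) are from the text, but the individual counts (145 true positives, 9 false positives) are not reported in this section; they are inferred here as the integers that reproduce the stated percentages and should be treated as an assumption.

```python
def sens_spec(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

n_hcc = 164                      # HCC samples in the retrospective cohort
n_non_hcc = 114 + 8 + 32 + 53    # LC + liver high-risk + other high-risk + healthy

# Inferred counts (assumption): chosen to match 88.41% and 95.65%.
tp, fp = 145, 9
fn, tn = n_hcc - tp, n_non_hcc - fp

sensitivity, specificity = sens_spec(tp, fn, tn, fp)
print(f"{sensitivity:.2%} {specificity:.2%}")
```

Under these assumed counts, 19 of the 164 HCC samples would be missed and 9 of the 207 non-HCC samples would be called positive.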

We further examined model performance in non-HCC samples, i.e., samples from the LC, healthy, and high-risk retrospective subgroups. Among the various clinical factors in retrospective LC samples (Extended Data Fig. 11A), cfDNA concentration, patient age, AFP, total bilirubin, direct bilirubin, and aspartate transaminase (AST) levels correlated positively with the PREDICTScore (Extended Data Fig. 11B). As might be expected, the albumin level was anti-correlated with our prediction results, consistent with the reduced albumin levels seen in patients with advanced cirrhosis [23]. When samples were categorized into two groups, the positive association of the PREDICTScore with total and direct bilirubin levels remained significant (Extended Data Fig. 11C). The PREDICTScore also rose with advancing cirrhosis stage (Extended Data Fig. 11D). As a key hallmark of hepatocarcinogenesis [24], cirrhotic nodules often impair the discriminability of HCC early detection models. For LC samples stratified by the presence of liver nodules, our model correctly classified 87.2% of the samples with nodules (Fig. 5G). The PREDICTScore was also significantly elevated in samples harboring nodules (Extended Data Fig. 11E) or regenerative nodules (Extended Data Fig. 11F). This elevation rarely reached the HCC decision threshold of the PREDICT model, again demonstrating the robustness of our tool. For samples from the healthy and high-risk subgroups, our model achieved 100% specificity, and the 9 HCC-related features showed the expected low intensity (Extended Data Fig. 11G-H).

PREDICT model demonstrated distinguished performance in the large-scale physical examination-based prospective validation cohort

We integrated our PREDICT model into physical examination packages provided by multiple medical institutions in China and followed participants up to track their cancer diagnostic status. We recruited 720 physical examination package participants from 24 medical institutions (Fig. 6A) and collected their plasma samples (n = 732) and detailed clinical information (Suppl. Table 5). Samples were categorized into healthy, liver cancer high-risk, and other cancer type high-risk populations according to the examination results and the stratification criteria (details in Materials and Methods). Each participant was followed up for at least 12 months to confirm the diagnostic outcome of HCC. All 732 samples showed high mapping quality and minimal contamination (Extended Data Fig. 12A-E). Interestingly, the liver cancer high-risk samples had significantly higher cfDNA concentrations (Extended Data Fig. 12F) than the other two populations, while the cfDNA amount in other cancer high-risk participants differed negligibly from that of the healthy subgroup.

Notably, the presence of liver nodules was the dominant liver cancer risk factor (Fig. 6B, Suppl. Table 6). Participants in the other cancer high-risk group could be stratified by the presence of direct abnormalities in cancer-related tests. Forty-five high-risk individuals had various types of primary diseases (Fig. 6C), while 265 participants had abnormalities in breast, gastric, and colorectal cancer-related tests (Fig. 6D, Suppl. Table 6). As expected, PREDICT model outputs were elevated in the liver cancer high-risk population (Extended Data Fig. 13A), and the difference between high-risk and healthy individuals was highly significant (Extended Data Fig. 13B). Using the threshold defined in the retrospective cohort (details in Materials and Methods), 101 participants were predicted to be at high risk of developing HCC, and positive predictions were more frequent in the liver cancer high-risk group (Extended Data Fig. 13C). Combined with the clinical outcomes, the PREDICT model captured all 4 patients who were ultimately diagnosed with HCC, reaching 100% sensitivity and 86.7% specificity in the whole prospective cohort (Fig. 6E) and 83.7% specificity in the liver cancer high-risk subgroup (Extended Data Fig. 13D). All diagnosed cases were from the liver cancer high-risk group.
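The reported prospective sensitivity and specificity can be checked from the counts given in this paragraph. The short calculation below is only a sketch: it assumes that predictions are counted per plasma sample (n = 732) and that the 101 positive predictions include the 4 confirmed HCC cases, neither of which is stated explicitly in the text.

```python
# Counts are taken from the text above. Per-sample counting (n = 732) is an
# assumption of this sketch, not something the source states explicitly.
n_samples = 732        # prospective plasma samples
n_hcc = 4              # HCC cases confirmed during follow-up
n_pred_positive = 101  # predictions above the retrospective threshold

tp = 4                        # all confirmed HCC cases were predicted positive
fn = n_hcc - tp               # missed HCC cases
fp = n_pred_positive - tp     # positive predictions without an HCC diagnosis
tn = n_samples - n_hcc - fp   # remaining negatives

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"sensitivity = {sensitivity:.1%}, specificity = {specificity:.1%}")
```

Under these assumptions the calculation gives 97 false positives among 728 non-HCC samples, which matches the reported 100% sensitivity and 86.7% specificity.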

We next performed integrated analyses of the PREDICTScore and clinicopathological characteristics in the prospective cohort. Among the categorical factors in the liver cancer high-risk group, only age correlated positively with the PREDICTScore (Fig. 6F, left). The non-liver high-risk individuals showed no cancer-type specificity (Fig. 6F, right). We did not observe apparent differences in clinical factor distributions between predicted high-risk and risk-free individuals in the liver cancer high-risk (Fig. 6G), healthy (Extended Data Fig. 13E), or other cancer high-risk groups (Extended Data Fig. 13F-G). Finally, PREDICT model predictions on the 24 samples collected at two successive time points showed limited pairwise variation (Extended Data Fig. 13H-I), again supporting the robustness of the model. Together, these results show that the PREDICT model combined good performance with low cost, supporting the transition of cancer early detection methods from theoretical performance to large-scale health surveillance.

The design of an ML-based non-invasive cancer early detection study generally involves choosing the participants, the genomic features, and the ML methods. We aimed to derive a highly interpretable detection model that balances cost and performance. To achieve this goal, we made several deliberate choices in participant enrollment. First, we included a substantial proportion of BCLC stage 0-A patients in model training, more than in most studies [5]. We also enrolled comparable numbers of LC and HCC patients, avoiding the possible overfitting caused by an overwhelming number of healthy participants [25]. With the increased vaccination rate in China [26], non-viral etiologies such as alcohol or obesity also deserve special attention. By increasing the diversity of clinicopathological characteristics among LC and HCC participants, we avoided the possible performance decrease seen in several HBV-related studies that focused on specific subpopulations, e.g., HBsAg+ individuals [27]. Additionally, integrating our model into large-scale physical examinations allowed us to test its robustness in asymptomatic participants and to assess its performance in real-world applications. By including individuals at high risk of HCC or other cancer types in model training, we partially accounted for the intertwined multi-organ regulatory network.

Some previous studies prioritized model performance by using strategies such as stacked ensemble learning [25], which obscures the quantitative contributions of input features. Relatively coarse features at the chromosome-arm or megabase level were also widely adopted, making it difficult to identify the critical factors governing HCC evolution. We addressed the interpretability problem in two ways. First, we adopted gene-level features including SNVs, deleteriousness, TSS signal intensity, and HBV integration events. To exploit the advantages of SNVs, we designed the model to overcome their known shortcomings. Functional mutations with little cohort specificity were selected for our sequencing panel, and the sequencing strategy ensured the detection of low-frequency mutations. We also quantified the importance of SNVs in carcinogenesis and addressed the heterogeneity issue. Second, several measures were taken during PREDICT model training to balance performance and interpretability. Model selection was conducted using 12 ML metrics, and algorithms including the SGD classifier and LR were used for their relatively high interpretability, i.e., the retention of feature contributions during training. Cross-validation and collinearity checks were additionally conducted to limit overfitting. This design resulted in good discriminative ability.

Apart from the balance between model performance and interpretability, the balance between cost and performance is another important goal. Multiple procedures, including sample collection, transportation, experimental protocol, and sequencing strategy, affect the cost of cancer detection, among which the total amount of sequencing is the most important. Currently, there are two main strategies for cost control in cancer detection applications: low-coverage WGS and targeted sequencing with high coverage. Owing to its relative ease of implementation, the WGS strategy has attracted increasing interest [12, 25, 28]. However, the dependence of read coverage on GC content [29] distorts the genome-wide fragment distribution, introducing non-negligible bias into the WGS strategy. For the targeted sequencing strategy, several aspects of panel design require particular care. First, the categories and number of features should be carefully selected to balance the richness of multi-dimensional information against its redundancy. Second, the selected features must not be biased toward a particular population; in other words, a multi-center or multi-source strategy should be adopted in marker discovery. Through careful panel design and feature abstraction, we maximized the benefits of panel sequencing and achieved a favorable balance between cost and performance. This performance was reflected not only in the detection rate of HCC patients in retrospective populations stratified by different clinicopathological characteristics, but also in the close correlations between the cancer risk quantified by the PREDICT model and key clinical information in the HCC and LC subgroups.

Prospective cohorts are essential for validating the performance of cancer early detection models. Unlike simply recruiting participants categorized by various tumor-related examinations, prospective validation focuses on screening for tumors in large populations, i.e., real-world model application. Because participant recruitment and tracking are difficult, prospective studies remain scarce in the liver cancer detection field. Several existing studies had limitations, including relatively short follow-up time [27] and a focus on specific factors such as HBV infection [30], suggesting that study designs need more scrutiny and should better reflect real-world complexity. One of our innovations is the integration of the PREDICT model with physical examination packages, which ensured a sufficient number of participants, rich clinicopathological information, and complete follow-up. Multi-center sample collection also reduced possible regional bias and introduced heterogeneity. By further including individuals at high risk of other cancer types in prospective validation, associations between hepatocarcinogenesis and other disorders could be explored. Another critical issue in prospective studies is possible group bias among recruited participants, i.e., an extremely high proportion of healthy volunteers. The diverse physical examinations in our design enabled accurate stratification of at-risk individuals. By including a substantial number of at-risk participants, whose cancer incidence is much higher than that of healthy volunteers, we avoided the lower-than-projected rate of liver cancer detection seen in some prospective studies [31].

Our study still has some limitations and room for improvement. First, we included only Chinese patients in the current PREDICT study. Including participants from diverse geographical regions, races, and ethnicities would further test the generalizability of our model. Second, the LC patients in the retrospective cohort were significantly younger than those in the HCC subgroup. Recruiting age-matched participants in the future could help eliminate possible confounders. Third, more plasma samples should be collected at different time points to confirm our model’s robustness and to explore possible applications including disease progression surveillance, relapse monitoring, and therapeutic guidance. Lastly, the cfDNA concentrations of high-risk individuals showed imperceptible differences from those of healthy individuals. We speculate that other multi-omics signals closely connected to organ damage, such as DNA methylation [10], are enriched in the at-risk population. Future integration of these damage-related features in model construction could provide an accurate estimate of the tissue and cell damage landscape during hepatocarcinogenesis.

In summary, the PREDICT model achieved a balance between cost, performance, and interpretability in retrospective and prospective cohorts, providing an alternative, robust solution for efficient early detection of HCC.

Sample enrollment of the PREDICT project

Based on the design of our multicenter cooperative PRospective Early Detection In a population at high risk for Common malignant Tumor (PREDICT) study (registered at ClinicalTrials.gov, clinical trial number NCT04405557), a total of 1158 plasma samples were enrolled from 31 medical institutions in China between December 2017 and March 2021. Fifty-five samples were later discarded because of insufficient clinical information, failure of sequencing library construction, or low sequencing quality. The remaining 1103 samples (1084 participants) constituted the retrospective and prospective cohorts, which formed the training and prospective validation datasets for the PREDICT model.

The PREDICT project focused on four categories of samples: HCC, LC, healthy, and individuals with abnormalities in physical examinations, i.e., those at high risk of developing certain types of cancer. The retrospective cohort served as the training dataset for the liver cancer early detection model, so it included HCC, LC, high-risk, and healthy samples. The HCC and LC samples were mainly collected from the Eastern Hepatobiliary Surgery Hospital, Xiangya Hospital of Central South University, and Zhuhai People’s Hospital in China. The diagnosis of HCC or LC was based on evidence from magnetic resonance imaging (MRI), ultrasound (US), or computed tomography (CT) scans. HCC patients with any treatment history (including ablation, chemotherapy, surgery, etc.) were first excluded. LC patients were followed for an additional 6 to 12 months to exclude samples with possible pre-carcinogenic genomic characteristics. Clinicopathological characteristics and key liver function blood test metrics were recorded for the enrolled retrospective HCC and LC patients, and 114 LC and 164 HCC samples, corresponding to 271 patients, were finally enrolled in the retrospective cohort. For high-risk individuals, we mainly included samples with primary diseases as well as samples with direct abnormalities in cancer-related tests, i.e., those at high risk of developing liver, breast, gastric, colorectal, ovarian, pancreatic, or lung cancer. Only individuals above the age of 45 were initially selected. Among samples with primary diseases, those with diseases of the liver, breast, esophagus, lung, stomach, throat, thyroid, colorectum, and other organs were included. In addition, cancer-related tests recommended by consensus cancer screening guidelines were used to identify other high-risk individuals. 
For liver cancer high-risk individuals, those meeting any of the following criteria were enrolled: 1) alpha-fetoprotein (AFP) > 20ng/mL in two successive tests within a month; 2) positive hepatitis B surface antigen (HBsAg) or hepatitis C virus core antigen (HCVAg) test results with impaired liver function; 3) US-detected liver nodules, with angioma excluded; 4) presence of compensated liver cirrhosis. Criteria for breast cancer high-risk individuals were: 1) breast imaging-reporting and data system (BI-RADS) score > 3 in mammography or US examinations; 2) cancer antigen 125 (CA125) > 35U/mL and BI-RADS score > 2; 3) cancer antigen 15-3 (CA-153) > 25U/mL and BI-RADS score > 2; 4) family history of breast and ovarian cancer, with BI-RADS score > 2. Patients with more than two abnormalities in the following serum gastric function tests were selected as gastric cancer high-risk individuals: 1) human pepsinogen I (PGI) ≤ 70µg/L; 2) progesterone receptors (PgR) ≤ 7.0; 3) gastrin-17 (G17) ≤ 1pmol/L or G17 ≥ 15pmol/L. For colorectal cancer high-risk samples, we defined the following enrollment criteria: 1) carcinoembryonic antigen (CEA) > 7ng/mL in two successive tests within a month; 2) positive fecal occult blood test (FOBT), with hemorrhoids excluded; 3) CEA > 7ng/mL and positive FOBT. The criteria for ovarian cancer high-risk individuals were: 1) CA125 > 70U/mL in two successive tests within a month; 2) CA125 > 35U/mL and abnormal human epididymis protein 4 (HE4) in two successive tests within a month; 3) US-detected ovarian masses (> 5cm for premenopausal patients, > 3.5cm for postmenopausal patients). Moreover, individuals with carbohydrate antigen 19-9 (CA199) > 25U/mL in two successive tests within a month or US-detected pancreatic lesions were categorized into the pancreatic cancer high-risk group. 
As a final point, individuals possessing pulmonary nodules in low-dose CT (LDCT) tests were collected as lung cancer high-risk patients. A total of 8 liver cancer and 32 other cancer type high-risk samples were collected with at least 12-month follow-up to exclude patients with diagnosed cancer. Healthy controls were collected as those who did not meet the enrollment criteria of HCC/LC/high-risk and lacked the history of cancer. A total of 53 healthy individuals were integrated into the retrospective model training cohort. All participants with HCC provided 10mL peripheral blood as well as matched tumor samples, including fresh frozen or formalin-fixed, paraffin-embedded (FFPE) tumor tissue specimens. Participants from the other three categories only provided 10mL peripheral blood samples.

We further designed a prospective cohort of 732 samples (720 participants) for comprehensive model performance validation. Using identical enrollment criteria described above, 310 liver high-risk, 340 other cancer type high-risk and 82 healthy samples were recruited from physical examination participants in 24 medical centers in China. A minimum of 12-month follow-up through clinical examinations or phone calls was performed to confirm the diagnostic status of liver cancer. Similarly, 10mL peripheral blood samples were collected for each participant at physical examination.

Sample pre-processing and library preparation

Peripheral blood samples were collected in 10mL Streck tubes and separated by centrifugation at 1600×g for 10 minutes within three days of collection. The supernatant was transferred to microcentrifuge tubes, centrifuged again at 16000×g for 10 minutes to remove cell debris, and stored at −80°C. Circulating cfDNA was extracted from 2.4-8mL (median 7.7mL) of plasma using the QIAamp Circulating Nucleic Acid Kit (Qiagen, Hilden, Germany). Germline genomic DNA was isolated from peripheral blood lymphocytes (PBLs) using the QIAamp DNA Blood Mini Kit (Qiagen). Matched tumor DNA was extracted from fresh frozen or FFPE tumor tissue specimens using the QIAamp DNA Mini Kit (Qiagen) and ReliaPrep™ FFPE gDNA Miniprep System (Promega, Madison, WI), respectively. The concentration and fragment length of the extracted cfDNA were determined using an Agilent 2100 Bioanalyzer (Agilent Technologies, Inc., Santa Clara, CA).

Later, the germline genomic DNA and tumor DNA (median amount 800 ng) were sheared into fragments at a 200–250bp peak with a Covaris S2 Ultrasonicator (Covaris, Inc., Woburn, MA). The indexed Next-Generation Sequencing (NGS) libraries were further constructed using NEBNext® Ultra™ DNA Library Prep Kit for Illumina® (NEB, Ipswich, MA). A median amount of 50ng cfDNA was used for NGS library construction and unique identifiers (UIDs) were tagged on each double-stranded DNA to distinguish authentic somatic mutations from artifacts, improving the ability to precisely track individual plasma molecules.

Target region design and next-generation sequencing

We designed a 293-gene panel covering 196 kbp of the genome specifically for HCC early detection. Genes harboring the most common driver mutations and actionable sensitivity or resistance mutations in liver cancer were included. Frequently mutated regions were additionally included based on our in-house cancer sequencing database and public databases including COSMIC (http://cancer.sanger.ac.uk/cosmic) and TCGA (https://cancergenome.nih.gov/). The final panel covered the whole coding regions of 13 genes and specific regions of 280 genes. Additionally, because HBV integration is closely linked to hepatocarcinogenesis, eight genotypes of the HBV genome were included in the sequencing panel.

For the constructed DNA libraries, we first used the above custom-designed panel (Integrated DNA Technologies, Inc., Coralville, IA) for hybridization enrichment. The indexed libraries were further sequenced using a 100bp paired-end configuration on a DNBSEQ-T7RS sequencer (MGI Tech, Shenzhen, China) or Gene⁺Seq-2000 sequencing system (Geneplus-Suzhou, Suzhou, China), respectively producing 2Gb, 10Gb, and 3Gb sequenced data for PBLs, plasma, and fresh specimen/FFPE libraries. The average coverage at target regions for plasma samples was > 30000x.

Sequencing data processing procedure

The sequenced reads from three types of libraries were mapped to the reference human genome (GRCh37) using the default parameters in BWA software (v0.6.2) after removing adaptors and low-quality reads. Duplicate reads were marked and removed using MarkDuplicates tool in Picard (v4.0.4.0, Broad Institute) for tumor and germline genomic DNA. Duplicate reads in cfDNA were identified by the concatenated UIDs. The position of template fragments was utilized by realSeq software (v3.1.0, in-house) for the elimination of errors introduced by the PCR or NGS process. Additional local realignment around single nucleotide variants (SNVs) and small insertions and deletions (InDels) as well as alignment quality assessment were conducted by GATK software (v3.4.46, Broad Institute). The tumor fraction estimation using off-target reads was conducted by ichorCNA_offtarget software [32] (https://github.com/GavinHaLab/ichorCNA_offtarget).

Somatic variant detection and primary filtration

Tumor somatic SNVs and InDels were primarily identified by realDcaller software (v1.7.1, in-house) and TNscope (Sentieon Inc., San Jose, CA) software. For cfDNA, SNV calling was performed using realDcaller specifically optimized for ultra-low frequency mutation calling and TNscope was used as an auxiliary tool to improve the detection of longer InDels.

After annotation, variants meeting any of the following criteria were filtered out: 1) variants present in the matched germline genomic DNA; 2) single-nucleotide polymorphisms (SNPs) with > 1% population allele frequency in the Exome Aggregation Consortium (ExAC) or the 1000 Genomes Project; 3) variants with a sequencing depth < 300x at the variant position. The SNVs that passed filtering were used in the construction of our mutational HCC malignancy model. Gene-level and fragment-level copy number variations (CNVs) were called by CONTRA (v2.0.8) [33] and ichorCNA.

The construction of the SNVScore model

After the initial mutation filtering above, we next aimed to prioritize key mutations that showed higher significance along the HCC evolutionary trajectory. To begin with, we defined a set of hotspot (recurrent) HCC mutations by collecting tissue and plasma samples of various pathological stages from our in-house database. More specifically, 99 early-stage (Barcelona Clinic Liver Cancer (BCLC) Stage 0/A/B) plasma, PBL and tissue samples, 272 late-stage tissue and 453 late-stage PBL and plasma samples were selected. The top 5 mutated genes in early-stage samples were selected as candidates harboring hotspot mutations. We then inspected mutations from these five genes, and only those meeting at least one of the following criteria were included in the hotspot mutation set: 1) mutation frequency > 2 in early-stage plasma samples; 2) mutation frequency > 2 in early-stage tissue samples and detected in an early-stage plasma sample; 3) mutation found simultaneously in early- and late-stage tissue and plasma samples. Under these criteria, the hotspot mutations tend to occur early in carcinogenesis and may play a governing role in tumor development.

For each mutation obtained in the retrospective training cohort, we applied different filtering criteria depending on whether it overlapped with the hotspot set. Mutations in the hotspot mutation set were kept if they met the following criteria: 1) supporting duplex reads ≥ 2 or supporting high-quality reads ≥ 4; 2) not a background mutation in our in-house background mutational database. Non-hotspot mutations detected in retrospective samples were kept if they met the following criteria: 1) not a background mutation in the in-house database; 2) not a clonal hematopoietic (CH) mutation in the realDcaller results; 3) variant allele frequency ≥ 0.1%; 4) supporting duplex reads ≥ 2. This stepwise mutation filtering provided the input for the construction of the mutational HCC malignancy prediction model.
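The two-branch filter described above can be sketched as a single predicate. This is an illustrative reading of the stated criteria, not the in-house implementation; the record fields (`key`, `duplex_reads`, `hq_reads`, `vaf`, `is_ch`, `is_background`) and the example mutation identifiers are hypothetical.

```python
def keep_mutation(mut, hotspot_set):
    """Return True if a mutation passes the stepwise filter described above.

    `mut` is a dict with hypothetical field names; thresholds follow the text.
    """
    if mut["is_background"]:          # present in the background-mutation database
        return False
    if mut["key"] in hotspot_set:     # hotspot branch
        return mut["duplex_reads"] >= 2 or mut["hq_reads"] >= 4
    # Non-hotspot branch: not CH, VAF >= 0.1%, and >= 2 duplex reads.
    return (not mut["is_ch"]) and mut["vaf"] >= 0.001 and mut["duplex_reads"] >= 2


# Hypothetical hotspot set and calls, for illustration only.
hotspots = {"GENE_A:c.1A>T"}
calls = [
    {"key": "GENE_A:c.1A>T", "duplex_reads": 0, "hq_reads": 4, "vaf": 0.0005,
     "is_ch": False, "is_background": False},   # hotspot, kept via high-quality reads
    {"key": "GENE_B:c.2C>G", "duplex_reads": 2, "hq_reads": 6, "vaf": 0.002,
     "is_ch": False, "is_background": False},   # non-hotspot, passes all criteria
    {"key": "GENE_C:c.3G>A", "duplex_reads": 3, "hq_reads": 9, "vaf": 0.0005,
     "is_ch": False, "is_background": False},   # non-hotspot, VAF below 0.1%
    {"key": "GENE_D:c.4T>C", "duplex_reads": 5, "hq_reads": 9, "vaf": 0.01,
     "is_ch": True, "is_background": False},    # non-hotspot, flagged as CH
]
kept = [m["key"] for m in calls if keep_mutation(m, hotspots)]
print(kept)
```

The hotspot branch is deliberately more permissive (no VAF floor, high-quality reads accepted in place of duplex reads), which reflects the stronger prior placed on recurrent mutations in the text.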

According to the machine learning-based Lung Cancer Likelihood in Plasma (Lung-CLiP) method [14], we constructed a machine learning-based (ML) SNV model to quantify the tumor relevance of individual SNVs. Regarding our experimental design and data processing procedure, 14 mutational features from Lung-CLiP paper were selected in our SNVScore model generation. For each filtered mutation used for model training and validation, these 14 SNV features were initially computed. Later, the retrospective HCC group-derived mutations simultaneously detected in plasma and matched tissue samples were used as the positive set while non-HCC-derived mutations were selected as the negative set in the training process. Multiple models and multiple ML performance metrics were compared for model prioritization through a 5-fold cross-validation ML training process. Finally, the unknown set defined as the HCC-derived mutations only detected in plasma was subjected to SNVScore model prediction, producing the quantitative contribution of a mutation in hepatocarcinogenesis. The SNVScore values were used as part of the input features in the subsequent PREDICT model.
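The cross-validated model comparison described above can be sketched with scikit-learn. The example below is a minimal illustration on synthetic data: the 14 columns merely stand in for the 14 SNV features, the candidate models and the use of AUC as the single ranking metric are assumptions, and none of the numbers relate to the study data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 300 "mutations" with 14 features and a binary label
# (positive set vs. negative set). Not study data.
X, y = make_classification(n_samples=300, n_features=14, n_informative=6,
                           random_state=0)

# Candidate models (an illustrative subset of those named in the Methods).
models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SGD": make_pipeline(StandardScaler(), SGDClassifier(random_state=0)),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "GaussianNB": GaussianNB(),
}

# 5-fold stratified cross-validation; mean AUC per model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mean_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}
best = max(mean_auc, key=mean_auc.get)
print(mean_auc, "best:", best)
```

In practice several metrics would be compared rather than AUC alone, and the selected model would then score the "unknown" plasma-only mutations, for example via its predicted probability or decision function.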

The HBV integration event identification procedure

As previously mentioned, we included 8 HBV genotypes (A-H) in our sequencing panel design. HBV integration events and breakpoints were determined by the in-house software NCsv (v1.0.0) on HCC and LC samples in the retrospective cohort. In detail, the HBV genomes AF090842, AB602818, AB014381, M32138, AB032431, AB036910, AB064310 and AY090454 were included in our sequencing panel. After adaptor trimming, quality control and duplicate removal, we aligned reads to the human and HBV genomes, respectively, using BWA software, and the mean (µ) and standard deviation (σ) of the insert size were calculated. Three types of read pairs were then collected from the alignment results by the NCsv tool: soft-clipped read pairs (a read containing both human and HBV alignments), single-unmapped read pairs (one read of the pair failed to align to any genome), and discordant pairs (fragments with insert size larger than µ + 3.96*σ). All three types of reads were re-assembled and re-aligned for precise integration point determination, and the breakpoint-supporting reads were counted. An integration event was called when the number of breakpoint-supporting reads was ≥ 2 and the breakpoint on the human genome was located in an integration hotspot gene, including TERT and MLL4. The presence of events was used in the subsequent PREDICT model construction. Additionally, to gain a more comprehensive understanding of the pathological relevance of integration events, we computed the breakpoint read frequency as the number of read pairs in the three categories defined above divided by the sequencing depth at the breakpoint.
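The numerical rules in this procedure (the discordant-pair cutoff, the event call, and the breakpoint read frequency) can be written out as a small sketch. It does not reproduce NCsv, which is in-house software; the function names are hypothetical, the use of the sample standard deviation is an assumption, and the hotspot gene tuple contains only the two genes named in the text, whereas the full list may be longer.

```python
from statistics import mean, stdev


def discordant_cutoff(insert_sizes, k=3.96):
    """Insert-size cutoff mu + k * sigma above which a pair counts as discordant.

    Uses the sample standard deviation (an assumption; the text does not say).
    """
    return mean(insert_sizes) + k * stdev(insert_sizes)


def call_integration(breakpoint_reads, gene, hotspot_genes=("TERT", "MLL4"),
                     min_reads=2):
    """Call an event when >= min_reads support the breakpoint in a hotspot gene."""
    return breakpoint_reads >= min_reads and gene in hotspot_genes


def breakpoint_read_frequency(n_soft_clipped, n_single_unmapped, n_discordant,
                              depth):
    """Supporting read pairs of the three categories divided by breakpoint depth."""
    return (n_soft_clipped + n_single_unmapped + n_discordant) / depth


# Toy values for illustration only.
cutoff = discordant_cutoff([160, 170, 180])
print(cutoff, call_integration(3, "TERT"), breakpoint_read_frequency(3, 1, 2, 600))
```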

cfDNA fragment size analysis and ML-based model construction

Given our targeted sequencing strategy, we focused on reads from off-target regions for fragmentation feature generation, as these can be regarded as reads generated by a low-coverage WGS experiment. In detail, off-target reads were defined as reads falling outside ± 250bp of probe regions and with mapping quality ≥ 20. These reads were first extracted, sorted, and indexed as a bam file. The microscopic and macroscopic characteristics of fragments were then generated from these reads. The microscopic feature focused on the 5Mbp bin-level fragment distribution on the 22 chromosomes. The ratio between short (100-150bp) and long (151-220bp) fragments in each bin was calculated using the sorted and indexed bam file, and the ratios were normalized by the Locally Weighted Scatterplot Smoothing (LOWESS) method. Z-score normalization was then applied to the LOWESS-smoothed ratio values. Finally, the sum of Z-scores on each chromosome arm (including 1p, 1q, 2p, 2q, 3p, 3q, 4p, 4q, 5p, 5q, 6p, 6q, 7p, 7q, 8p, 8q, 9p, 9q, 10p, 10q, 11p, 11q, 12p, 12q, 13q, 14q, 15q, 16p, 16q, 17p, 17q, 18p, 18q, 19p, 19q, 20p, 20q, 21q, 22q) was used as the microscopic fragment feature in FRAGScore model training. For the macroscopic signatures, we focused on the global fragment size distribution. The off-target bam files were first down-sampled to ensure that samples were comparable. The fragment distribution between 90 and 600bp was then extracted from the down-sampled files, and the number of fragments at each length was counted with a 1bp step, resulting in a 511-dimension array for each plasma sample. A stochastic gradient descent (SGD) classifier with an L1 regularizer was subsequently used to evaluate the importance of each value; lengths with importance > 0.07 were retained, and adjacent fragment lengths were merged to reduce feature complexity. 
A total of 35 intervals (including (90,92), (94,98), (131,136), (139,147), (151,161), (166,166), (168,187), (189,189), (191,195), (200,249), (251,259), (273,275), (278,311), (323,323), (325,325), (327,329), (331,374), (376,377), (379,379), (393,394), (396,396), (398,420), (422,426), (428,431), (433,433), (435,435), (449,450), (452,462), (464,527), (529,533), (540,540), (543,543), (548,548), (565,565), (584,584)) were obtained and the fragment length quantification was repeatedly conducted on these intervals for the final ML model feature generation. Finally, the microscopic and macroscopic genomic features from HCC and non-HCC (the integration of LC, high-risk, and healthy group) retrospective plasma samples were subjected to 5-fold cross-validation using multiple ML models. The prediction values of the selected FRAGScore model were further used in the construction of the final PREDICT model.
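The fragmentation features described above can be illustrated with a short, dependency-free sketch: the per-bin short/long ratio, a Z-score normalization, the 511-value length histogram (90 to 600bp), and the summation of counts over merged length intervals. The LOWESS smoothing step is omitted here for brevity, the use of the population standard deviation is an assumption, and the fragment lengths and bin names are made up for illustration.

```python
from statistics import mean, pstdev


def short_long_ratio(lengths):
    """Ratio of short (100-150bp) to long (151-220bp) fragments in one bin."""
    short = sum(1 for n in lengths if 100 <= n <= 150)
    long_ = sum(1 for n in lengths if 151 <= n <= 220)
    return short / long_ if long_ else float("nan")


def zscores(values):
    """Z-score normalization (population SD; fails if all values are equal)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def length_histogram(lengths, lo=90, hi=600):
    """Fragment counts at each length from lo to hi with 1bp step (511 values)."""
    counts = [0] * (hi - lo + 1)
    for n in lengths:
        if lo <= n <= hi:
            counts[n - lo] += 1
    return counts


def interval_features(hist, intervals, lo=90):
    """Sum histogram counts over inclusive length intervals such as (151, 161)."""
    return [sum(hist[a - lo:b - lo + 1]) for a, b in intervals]


# Made-up fragment lengths for three 5Mbp bins.
bins = {
    "chr1:0-5Mb": [110, 120, 140, 160, 170, 200],
    "chr1:5-10Mb": [105, 160, 165, 180, 190, 210],
    "chr1:10-15Mb": [130, 150, 151, 175, 185, 215],
}
ratios = [short_long_ratio(v) for v in bins.values()]
z = zscores(ratios)  # would be summed per chromosome arm in the full procedure

all_lengths = [n for v in bins.values() for n in v]
hist = length_histogram(all_lengths)
features = interval_features(hist, [(131, 136), (139, 147), (151, 161)])
print(ratios, features)
```

The three intervals in the usage example are taken from the 35 intervals listed above; in the full procedure all 35 would be passed in.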

HCC-specific nucleosome footprint identification and feature generation procedures

Distribution of nucleosome footprints (NFs) can reflect cell-type specific biological activities. We identified NFs at transcription start sites (TSSs) by calculating the integrated fragmentation score (IFS) around TSS and comparing the IFS profiles in retrospective HCC and LC samples. More specifically, IFS was calculated using the formula:

$$V_{i} = n + \sum_{j=1}^{n} S_{coverage} \frac{len_{j}}{S_{len}}$$

where V_i is the IFS for genomic position i, n is the read coverage (fragment number) at i, and len_j is the length of fragment j. S_len is the summed fragment length on the selected chromosome, while S_coverage is the number of fragments on the selected chromosome. IFS is therefore positively correlated with both read coverage and fragment length. For each sequenced plasma sample, the off-target reads with mapping quality ≥ 30 and fragment length between 30 and 1000bp were retained, and the IFS at the midpoint of each fragment was calculated. TSS information was collected from a publication [34], and TSSs with abnormal mapping status or sequencing coverage were removed [35], resulting in a total of 207992 TSSs. We then focused on the flanking 2.5Kbp of these remaining TSSs and calculated the IFS flank and IFS Z-score metrics. The IFS flank value was defined as the summed IFS in the TSS flanking window, while the IFS Z-score was calculated from the current window and the IFS flank values in 1000 random windows on the selected chromosome. The IFS Z-scores corresponding to the 207992 TSSs were subsequently compared between HCC and LC samples by two-sided Wilcoxon rank sum test, and 171 TSSs with group-wise P-value < 1e-6 were finally prioritized. The coverage values at selected TSSs were visualized by fitting Gaussian distributions to the smoothed and Z-score normalized coverage at the TSS site and 2.5Kbp flanking regions with 50bp step length. The IFS Z-score distributions from the 171 selected TSSs were further used for IFScore calculation through ML methods, and the IFScore was used in the construction of the final PREDICT model. Subsequently, we used the Reactome, Gene Ontology, and PheGenI databases [36–38] to uncover the biological associations of the selected TSSs.
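The IFS formula and the window-level Z-score can be made concrete with a small sketch. The IFS function follows the formula term by term; the Z-score function is one plausible reading of the description (the TSS flank value standardized against the flank values of random windows, using the population standard deviation), which is an assumption. Function names and toy values are hypothetical.

```python
from statistics import mean, pstdev


def integrated_fragmentation_score(lengths_at_i, chrom_lengths):
    """IFS at position i: V_i = n + sum_j S_coverage * len_j / S_len.

    lengths_at_i  -- lengths of the n fragments covering position i
    chrom_lengths -- lengths of all retained fragments on the same chromosome
    """
    n = len(lengths_at_i)
    s_coverage = len(chrom_lengths)   # number of fragments on the chromosome
    s_len = sum(chrom_lengths)        # summed fragment length on the chromosome
    return n + sum(s_coverage * length / s_len for length in lengths_at_i)


def ifs_zscore(flank_ifs, random_flank_ifs):
    """Standardize a TSS flank IFS against flank IFS values of random windows."""
    mu, sd = mean(random_flank_ifs), pstdev(random_flank_ifs)
    return (flank_ifs - mu) / sd


# Toy example: S_coverage / S_len is the inverse mean fragment length (1/200),
# so each fragment contributes its length relative to the chromosome mean.
v_i = integrated_fragmentation_score([100, 300], [100, 200, 300])
z = ifs_zscore(5.0, [1.0, 2.0, 3.0])
print(v_i, z)
```

Because S_coverage / S_len equals the reciprocal of the mean fragment length on the chromosome, the score grows with both the number of fragments at a position and their relative length, which is the positive correlation noted in the text.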

Feature selection and derivation of the final PREDICT model

Apart from the SNVScore, HBV integration event quantification, FRAGScore, and IFScore values, four additional features were generated for the final ML model. The numbers of reads supporting the two hotspot mutations (TP53 c.747G>T p.R249S and TERT c.-58-u66C>T) that showed the highest mutational frequency and consistency in early-stage HCC tissue and plasma samples were calculated for each retrospective sample. The number of hotspot mutations among all identified SNVs in each plasma sample was also recorded. For the deleteriousness feature, we constructed reference distributions of three mutational metrics and borrowed the concept of the P-value to derive the probability of an observed mutation being deleterious. In detail, the PaPI score [39], SNVScore, and mutation frequency of all SNVs detected in retrospective HCC tissue samples were computed and sorted, forming three reference distributions. For each SNV in every retrospective plasma sample, the three metrics were compared with the reference distributions by computing the proportion of tissue-derived mutations with metrics smaller than those of the plasma SNV. The largest proportion among the SNVs in a sample was recorded as the final deleteriousness feature. In total, nine features covering different aspects of hepatocarcinogenesis were used as inputs, and the retrospective samples were categorized into HCC and non-HCC groups for final PREDICT model construction. The cutoff on the PREDICT model output was determined at 95% specificity. Sensitivity and specificity were used for model performance evaluation.
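Two steps in this paragraph lend themselves to a short sketch: the empirical-percentile deleteriousness feature and the choice of a score cutoff at a target specificity. Both functions below are illustrative readings of the text rather than the authors' implementation. Taking the maximum over all SNVs and all three metrics, returning 0.0 for a sample without SNVs, and calling a sample positive only when its score exceeds the cutoff are assumptions; metric names and values are made up.

```python
import math
from bisect import bisect_left


def proportion_below(reference_sorted, value):
    """Fraction of reference (tissue-derived) values strictly smaller than value."""
    return bisect_left(reference_sorted, value) / len(reference_sorted)


def deleteriousness(sample_snvs, references):
    """Largest proportion over all SNVs and metrics in one plasma sample.

    sample_snvs -- list of dicts mapping metric name -> value
    references  -- dict mapping metric name -> sorted reference values
    """
    return max(
        (proportion_below(references[m], snv[m])
         for snv in sample_snvs for m in references),
        default=0.0,  # assumption: samples without SNVs score 0
    )


def cutoff_at_specificity(non_hcc_scores, target=0.95):
    """Smallest observed score such that >= target of non-HCC scores are <= it.

    A sample is then called positive when its score is strictly above the cutoff.
    """
    s = sorted(non_hcc_scores)
    k = math.ceil(round(target * len(s), 9))  # round() guards float noise
    return s[k - 1]


# Toy reference distributions and one plasma sample (illustrative values).
refs = {"papi": [0.1, 0.2, 0.3, 0.4], "snvscore": [0.2, 0.5]}
feature = deleteriousness([{"papi": 0.35, "snvscore": 0.1}], refs)

non_hcc = [i / 100 for i in range(100)]
cutoff = cutoff_at_specificity(non_hcc)
specificity = sum(x <= cutoff for x in non_hcc) / len(non_hcc)
print(feature, cutoff, specificity)
```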

Statistical and performance evaluation methods for ML model selection

The Wilcoxon rank sum test, Student's t-test, and Fisher’s exact test were used as appropriate to compare values between two populations, while the Kruskal-Wallis test was used to quantify differences among multiple distributions. Given the intrinsic differences among input features, several ML models, including logistic regression (LR), stochastic gradient descent (SGD) classifier, random forest (RF), support vector classifier (SVC), Gaussian naïve Bayes (NB), and Bernoulli NB, were compared by cross-validation-based multi-model selection. The Python package scikit-learn (https://scikit-learn.org/) was used for ML model implementation. Twelve ML performance metrics, namely specificity, sensitivity, precision, negative predictive value (NPV), accuracy, F1 score, false omission rate (FOR), positive likelihood ratio (PLR), negative likelihood ratio (NLR), Matthews correlation coefficient (MCC), Fowlkes-Mallows index (FMI), and the area under the receiver operating characteristic curve (AUC), were computed using in-house scripts and used for model prioritization.
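For reference, the twelve metrics listed above follow from the standard textbook definitions, which the sketch below writes out using only the Python standard library. It is not the authors' in-house script. The function names, the dictionary keys, and the toy counts and scores are hypothetical, and AUC is computed here by the rank-based (pairwise comparison) formulation.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Eleven threshold-dependent metrics from confusion-matrix counts.

    Assumes tp > 0 and tn > 0, so that no denominator is zero.
    """
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "precision": prec,
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "for": fn / (fn + tn),
        # PLR is unbounded when there are no false positives.
        "plr": sens / (1 - spec) if fp else math.inf,
        "nlr": (1 - sens) / spec,
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "fmi": math.sqrt(prec * sens),
    }


def auc(positive_scores, negative_scores):
    """Rank-based AUC: P(positive score > negative score), ties count 0.5."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positive_scores) * len(negative_scores))


# Toy counts and scores, for illustration only.
m = confusion_metrics(tp=8, fp=2, tn=18, fn=2)
a = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
for name, value in m.items():
    print(f"{name}: {value:.3f}")
print(f"auc: {a:.3f}")
```

The eleven count-based metrics depend on the chosen cutoff, whereas AUC summarizes ranking quality across all cutoffs, which is why it takes the raw scores rather than the confusion matrix.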

Ethical approval and consent to participate

All samples used in this study were collected with informed consent from subjects enrolled under protocols approved by the ethics committee of the Chinese PLA General Hospital (Number S2019-137-01), in accordance with the Declaration of Helsinki.

Acknowledgements

Not applicable.

Author contributions

WZ, QZ, and XY contributed to the study design. Ledu Z, WW, Yong L, FL, ZW, QZ, Yongli L, WL, LL, Liwei Z, YG, JX, SC, CS, SY, JS, SS and BS contributed to sample collection. JY, MM, XQ, and FH analyzed the sequencing data and constructed the early detection model. JY drafted the manuscript. CZ, LJ, and XX provided advice in manuscript preparation. All authors read and approved the final manuscript.

Funding information

This study was supported by the National Natural Science Foundation of China (Number 82271770), Guangdong Provincial Key Laboratory of Tumor Interventional Diagnosis and Treatment (Number 2021B1212040004) and Shenzhen Science and Technology Program (Number GZL202309043419000025).

Availability of data and materials

Data generated in the current study are available from the corresponding author upon reasonable request.

Competing interests

JY, XQ, CZ, LJ, JS, and SS are employees of Geneplus-Shenzhen, while MM, HF, BS, XX, and XY are employees of Geneplus-Beijing. All authors declared no potential competing interests.

References

1. Heimbach JK, Kulik LM, Finn RS, Sirlin CB, Abecassis MM, Roberts LR, et al. AASLD guidelines for the treatment of hepatocellular carcinoma. Hepatology. 2018;67:358–80.
2. Foda ZH, Annapragada AV, Boyapati K, Bruhm DC, Vulpescu NA, Medina JE, et al. Detecting Liver Cancer Using Cell-Free DNA Fragmentomes. Cancer Discov. 2023;13:616–31.
3. Cisneros-Villanueva M, Hidalgo-Pérez L, Rios-Romero M, Cedro-Tanda A, Ruiz-Villavicencio CA, Page K, et al. Cell-free DNA analysis in current cancer clinical trials: a review. Br J Cancer. 2022;126:391–400.
4. Abbosh C, Birkbak NJ, Swanton C. Early stage NSCLC - challenges to implementing ctDNA-based screening and MRD detection. Nat Rev Clin Oncol. 2018;15:577–86.
5. Manea I, Iacob R, Iacob S, Cerban R, Dima S, Oniscu G, et al. Liquid biopsy for early detection of hepatocellular carcinoma. Front Med. 2023;10:1218705.
6. Cristiano S, Leal A, Phallen J, Fiksel J, Adleff V, Bruhm DC, et al. Genome-wide cell-free DNA fragmentation in patients with cancer. Nature. 2019;570:385–9.
7. Cai J, Chen L, Zhang Z, Zhang X, Lu X, Liu W, et al. Genome-wide mapping of 5-hydroxymethylcytosines in circulating cell-free DNA as a non-invasive approach for early detection of hepatocellular carcinoma. Gut. 2019;68:2195–205.
8. Kisiel JB, Dukek BA, Kanipakam RVSR, Ghoz HM, Yab TC, Berger CK, et al. Hepatocellular Carcinoma Detection by Plasma Methylated DNA: Discovery, Phase I Pilot, and Phase II Clinical Validation. Hepatology. 2019;69:1180–92.
9. Faul JD, Kim JK, Levine ME, Thyagarajan B, Weir DR, Crimmins EM. Epigenetic-based age acceleration in a representative sample of older Americans: Associations with aging-related morbidity and mortality. Proc Natl Acad Sci U S A. 2023;120:e2215840120.
10. Zhang L, Li J. Unlocking the secrets: the power of methylation-based cfDNA detection of tissue damage in organ systems. Clin Epigenetics. 2023;15:168.
11. Li C-L, Ho M-C, Lin Y-Y, Tzeng S-T, Chen Y-J, Pai H-Y, et al. Cell-Free Virus-Host Chimera DNA From Hepatitis B Virus Integration Sites as a Circulating Biomarker of Hepatocellular Cancer. Hepatology. 2020;72:2063–76.
12. Chen L, Abou-Alfa GK, Zheng B, Liu J-F, Bai J, Du L-T, et al. Genome-scale profiling of circulating cell-free DNA signatures for early detection of hepatocellular carcinoma in cirrhotic patients. Cell Res. 2021;31:589–92.
13. Wu T, Fan R, Bai J, Yang Z, Qian Y-S, Du L-T, et al. The development of a cSMART-based integrated model for hepatocellular carcinoma diagnosis. J Hematol Oncol. 2023;16:1.
14. Chabon JJ, Hamilton EG, Kurtz DM, Esfahani MS, Moding EJ, Stehr H, et al. Integrating genomic features for non-invasive early lung cancer detection. Nature. 2020;580:245–51.
15. Schulze K, Imbeaud S, Letouzé E, Alexandrov LB, Calderaro J, Rebouissou S, et al. Exome sequencing of hepatocellular carcinomas identifies new mutational signatures and potential therapeutic targets. Nat Genet. 2015;47:505–11.
16. Harding JJ, Nandakumar S, Armenia J, Khalil DN, Albano M, Ly M, et al. Prospective Genotyping of Hepatocellular Carcinoma: Clinical Implications of Next-Generation Sequencing for Matching Patients to Targeted and Immune Therapies. Clin Cancer Res. 2019;25:2116–26.
17. Zhao L-H, Liu X, Yan H-X, Li W-Y, Zeng X, Yang Y, et al. Genomic and oncogenic preference of HBV integration in hepatocellular carcinoma. Nat Commun. 2016;7:12992.
18. Zheng B, Liu X-L, Fan R, Bai J, Wen H, Du L-T, et al. The Landscape of Cell-Free HBV Integrations and Mutations in Cirrhosis and Hepatocellular Carcinoma Patients. Clin Cancer Res. 2021;27:3772–83.
19. Lo YMD, Han DSC, Jiang P, Chiu RWK. Epigenetics, fragmentomics, and topology of cell-free DNA in liquid biopsies. Science. 2021;372:eaaw3616.
20. Lin Y-W, Sheu J-C, Huang G-T, Lee H-S, Chen C-H, Wang J-T, et al. Chromosomal abnormality in hepatocellular carcinoma by comparative genomic hybridisation in Taiwan. Eur J Cancer. 1999;35:652–8.
21. Meier T, Timm M, Montani M, Wilkens L. Gene networks and transcriptional regulators associated with liver cancer development and progression. BMC Med Genomics. 2021;14:41.
22. Tsompana M, Buck MJ. Chromatin accessibility: a window into the genome. Epigenetics Chromatin. 2014;7:33.
23. Carvalho JR, Machado MV. New Insights About Albumin and Liver Disease. Ann Hepatol. 2018;17:547–60.
24. Choi J-Y, Lee J-M, Sirlin CB. CT and MR Imaging Diagnosis and Staging of Hepatocellular Carcinoma: Part I. Development, Growth, and Spread: Key Pathologic and Imaging Aspects. Radiology. 2014;272:635–54.
25. Zhang X, Wang Z, Tang W, Wang X, Liu R, Bao H, et al. Ultrasensitive and affordable assay for early detection of primary liver cancer using plasma cell-free DNA fragmentomics. Hepatology. 2022;76:317–29.
26. Liu Z, Li M, Hutton DW, Wagner AL, Yao Y, Zhu W, et al. Impact of the national hepatitis B immunization program in China: a modeling study. Infect Dis Poverty. 2022;11:106.
27. Qu C, Wang Y, Wang P, Chen K, Wang M, Zeng H, et al. Detection of early-stage hepatocellular carcinoma in asymptomatic HBsAg-seropositive individuals by liquid biopsy. Proc Natl Acad Sci U S A. 2019;116:6308–12.
28. Bao H, Wang Z, Ma X, Guo W, Zhang X, Tang W, et al. Letter to the Editor: An ultra-sensitive assay using cell-free DNA fragmentomics for multi-cancer early detection. Mol Cancer. 2022;21:129.
29. Benjamini Y, Speed TP. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 2012;40:e72.
30. Wang P, Song Q, Ren J, Zhang W, Wang Y, Zhou L, et al. Simultaneous analysis of mutations and methylations in circulating cell-free DNA for hepatocellular carcinoma detection. Sci Transl Med. 2022;14:eabp8704.
31. Schrag D, Beer TM, McDonnell CH, Nadauld L, Dilaveri CA, Reid R, et al. Blood-based tests for multicancer early detection (PATHFINDER): a prospective cohort study. Lancet. 2023;402:1251–60.
32. Adalsteinsson VA, Ha G, Freeman SS, Choudhury AD, Stover DG, Parsons HA, et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nat Commun. 2017;8:1324.
33. Li J, Lupat R, Amarasinghe KC, Thompson ER, Doyle MA, Ryland GL, et al. CONTRA: copy number analysis for targeted resequencing. Bioinformatics. 2012;28:1307–13.
34. Abugessaisa I, Noguchi S, Hasegawa A, Kondo A, Kawaji H, Carninci P, et al. refTSS: A Reference Data Set for Human and Mouse Transcription Start Sites. J Mol Biol. 2019;431:2407–22.
35. Zhu G, Guo YA, Ho D, Poon P, Poh ZW, Wong PM, et al. Tissue-specific cell-free DNA degradation quantifies circulating tumor DNA burden. Nat Commun. 2021;12:2229.
36. Fabregat A, Jupe S, Matthews L, Sidiropoulos K, Gillespie M, Garapati P, et al. The Reactome Pathway Knowledgebase. Nucleic Acids Res. 2018;46:D649–55.
37. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25:25–9.
38. Ramos EM, Hoffman D, Junkins HA, Maglott D, Phan L, Sherry ST, et al. Phenotype–Genotype Integrator (PheGenI): synthesizing genome-wide association study (GWAS) data with existing genomic resources. Eur J Hum Genet. 2014;22:144–7.
39. Limongelli I, Marini S, Bellazzi R. PaPI: pseudo amino acid composition to score human protein-coding variants. BMC Bioinformatics. 2015;16:123.

