Patient selection
This multicentre retrospective cohort study was approved by the Ethics Management Committee of the First Affiliated Hospital of the University of Science and Technology of China (2021-RE-043). We enrolled a total of 288 patients with early-stage HCC who received curative ablation at three centres: the First Affiliated Hospital of the University of Science and Technology of China (Center 1), the Second Hospital of Anhui Medical University (Center 2), and the Taizhou Hospital of Zhejiang Province (Center 3) from April 2008 to March 2022. The inclusion and exclusion criteria for patient selection are shown in Fig. 1A. At Center 1, patients were randomly assigned to a training set and an internal validation set at a 7:3 ratio. Patients from Center 2 and Center 3 were combined to form the external validation set.
Ablation was performed by the same experienced team of the respective hospitals under ultrasound guidance as described in eAppendix 1. If there were multiple lesions, ablation was performed one by one. All patients underwent CECT, and the imaging data were collected via the picture archiving and communication system (PACS) of the respective hospitals. Specific CT equipment information is shown in eAppendix 2. The inclusion criteria were as follows: (1) clinical diagnosis of HCC according to the noninvasive criteria established by the American Association for the Study of Liver Disease based on distinct imaging features[22]; (2) single tumour diameter ≤ 5 cm, multiple tumours ≤ 3, each diameter ≤ 3 cm; (3) refusal to undergo hepatectomy or liver transplantation; and (4) patient management involving curative ablation only. The exclusion criteria were as follows: (1) tumours invading blood vessels, bile ducts, adjacent organs, distant metastasis, or other malignancies; (2) a prior history of HCC treatment, such as hepatic resection, transarterial chemoembolization (TACE), targeted therapy, or radiotherapy; (3) the absence of CECT imaging data or CECT conducted more than 1 month prior to ablation; and (4) a follow-up duration of less than 2 years. The detailed workflow is illustrated in Fig. 1B.
Clinical data collection
Before initiating the data collection process, all relevant staff members at the participating centres underwent a training session on data extraction. The medical records of eligible patients were meticulously reviewed. Clinical and laboratory data were systematically collected via standardized forms. To ensure relevance and consistency, only the data obtained within one week before ablation were considered. Another reviewer randomly assessed and validated 30% of the collected data. To account for potential variations across participating centres, all laboratory measurements were standardized. Extreme outliers—values significantly higher or lower than the norm—were flagged for review. These outliers were re-evaluated by the hospital's lead researcher or the designated chief physician to confirm their validity and exclude input errors.
Image information collection
CECT images were obtained from the hospital PACS in DICOM format. Two physicians (reader 1 and reader 2) from each centre independently evaluated the CECT images and focused on the following eight semantic features: (1) tumour margin; (2) tumour capsule; (3) intratumoral vessels; (4) tumour growth; (5) intratumoral necrosis; and (6) peritumoral enhancement. When multiple lesions were present in the patient's liver, we evaluated the largest tumour. Some example images and specific imaging semantic features are explained in eAppendix 3.
Image segmentation
Physician A, who had 10 years of experience interpreting abdominal CT scans, used ITK-SNAP (version 3.6.0; www.itksnap.org) software to segment the images. The volume of interest (VOI)—comprising either the entire tumour or the residual liver excluding vessels or bile ducts—was delineated layer by layer on arterial, portal, and delayed-phase images. In cases with multiple lesions, the largest lesion was chosen for segmentation. Throughout the delineation process, Physician A was blinded to the clinical data of the patients.
To extract the histological features of the peritumoral images, the peritumoral VOIs were processed via the Python morphological erosion and expansion algorithm, which automatically constricted the boundary of each lesion inwards by 5 mm and expanded it outwards by 3, 5, and 10 mm, respectively. To ensure the reproducibility of the radiomic features, physician A and physician B, with 10 years of experience in reading abdominal CT images, performed the above procedures again after two weeks. The results of the repeat extraction were used to calculate the interclass correlation coefficient (ICC).
Feature extraction via radiomics and deep learning
Feature Extraction
Python version 3.8.4 (https://pypi.org/project/pyradiomics/) was used to extract radiomic features (eAppendix 4) for three image phases (arterial phase, A; portal phase, P; delayed phase, D) and six regions (tumour, residual liver, 5 mm-eroded and 3 mm-, 5 mm-, and 10 mm-extended peritumoral regions). To quantify the differences between different phases, we calculated the differences in the radiomic features between the A and P phases and between the D and P phases in the six regions (delta-radiomics). For each patient, 1539 imaging features were extracted. After removing features with a variance close to 0, the total number of features was reduced to 1,512 in the tumour and residual liver regions and 1,507 features in the other regions, resulting in a total of 45,290 features per patient. For more details, please refer to Table S1.
The settings for 3D ResNet-18 prior to feature extraction are described in eAppendix 5. In the deep learning feature extraction process, the extraction area was the same as that used in the radiomic process. The number of features extracted from each region was 512, and a total of 9216 deep learning features were extracted for each patient from three phases and six regions (Table S3), ensuring the consistency and comparability of the data.
Radiomic features and deep learning feature screening
To prevent overfitting, we employed a three-step feature selection process for feature screening. First, we selected features with an ICC greater than 0.8 and standardized these features via Z score normalization. Second, we used LASSO regression to select features with nonzero coefficients. Third, if the number of features exceeded one after LASSO selection, we further utilized recursive feature elimination (RFE) with a decision tree (DT) kernel to determine the optimal number of features at each stage. To obtain more representative features, each group of features underwent the above feature selection process. The specific feature selection process and results are shown in Tables S2. Finally, we calculated the Rad-score on the basis of the weighted regression coefficients of the radiomic features derived from LASSO.
In addition, we also conducted the above three-step feature selection processes for features extracted through deep learning. The specific feature selection results are shown in Table S4. We calculated the DL-score on the basis of the weighted regression coefficients of the deep learning features derived from LASSO.
Model building and comparison
We built seven models utilizing different features. The clinico-radiological model (Cli) was constructed from the clinical and imaging features resulting from univariate and multivariate analyses, the Rad-Score calculated from the radiomic features was used to construct a radiomic model (Rad), and the DL-Score calculated from the deep learning features was used to construct a deep learning radiomic model (DLR). To identify potentially better prediction models, we constructed four integrated models based on the clinico-radiological features, Rad-Score and DL-Score, including the clinico-radiological and Rad-Score integrated model (CR), the clinico-radiological and DL-Score integrated model (CDL), the Rad-Score and DL-Score integrated model (DLRR), and a comprehensive model combining all features (Combined).
We constructed seven models via 6 machine learning algorithms, including support vector machine (SVM), logistic regression (LR), random forest (RF), K-nearest neighbour (KNN), light gradient boosting machine (LightGBM) and Xtreme gradient boosting (XGBoost) algorithms. The most appropriate algorithm was selected on the basis of the characteristic data of the different groups to ensure the objectivity of the results. To improve the generalization ability of the models and better evaluate their performance with small sample sizes, we used 5-fold cross-validation for the model hyperparameter selection and model training process. Moreover, the process was performed on the training set only to avoid data leakage. In addition, we excluded models with an AUC greater than 0.1 between the training set and the internal validation set to avoid overfitting and underfitting. The best model with the highest AUC in the internal validation set was selected.
Model validation and clinical application
To demonstrate the calibration ability and clinical applicability of the best model, we generated calibration curves and performed DCA on the training, internal validation and external validation sets.
To evaluate whether the best model can effectively predict progression-free survival (PFS) and overall survival (OS) for risk stratification, we utilized the maximum Youden index from the internal validation set as the optimal cut-off value for prediction outcomes in both the training and validation cohorts. Patients were categorized into low-risk and high-risk groups, and the 2-year PFS and 5-year OS rates were analysed via Kaplan-Meier(K-M) survival curves.
Follow-up
All patients were followed up regularly after discharge. The first follow-up was one month after the ablation procedure, during which the local therapeutic effect was evaluated. The other follow-ups were every three months or six months after ablation. During these follow-ups, the serum AFP level, liver function, and abdominal ultrasound examination were conducted, and CECT, MRI, or ultrasound angiography might be used to monitor for recurrence if necessary. The starting point of this study was defined as the time at which the ablation procedure was performed, and the primary endpoint was ER. ER was defined as the emergence of a new intrahepatic lesion or metastasis within two years postablation, with the lesion displaying imaging features typical of HCC or being confirmed through histopathological analysis. Curative ablation was defined as the absence of tumour necrosis enhancement on dynamic CECT, MRI, or CEUS. The last follow-up date for this study was March 31, 2024.
Statistical analyses
Statistical analysis was performed via R software (version 4.3.0). In the training set, variables with more than 20% missing data were excluded; otherwise, multiple imputation algorithms were employed to handle the missing data. To improve model interpretability, continuous variables were converted into binary variables via threshold values from receiver operating characteristic (ROC) curves. Categorical variables are presented as frequencies and percentages and were analysed via chi-square tests or Fisher's exact tests. Variables showing a p value of less than 0.1 in the univariate analysis were included in the multivariate analysis for further variable selection. The models' predictive performance was assessed through the area under the curve (AUC), accuracy (ACC), positive predictive value (PPV), and negative predictive value (NPV). Model comparisons were made via the net reclassification improvement (NRI) and integrated discrimination improvement (IDI) indices. Calibration curves and decision curve analysis (DCA) were used to evaluate model calibration and clinical utility. Kaplan-Meier (K-M) survival analysis was conducted, and log-rank tests were used for curve comparisons. A two-tailed p value of less than 0.05 was considered statistically significant.