Study selection
After removal of duplicates, a total of 13,612 articles were identified. Of these, 13,467 records were excluded after title and abstract screening. The full texts of the remaining 145 potentially eligible studies were assessed, and 20 studies were finally included. The 125 excluded publications were removed for the following reasons: seven were conference abstracts or non-original studies; 81 were excluded for model-related reasons (not a prediction model, only a single predictor, model application only, or not for liver cancer prediction); and 40 were excluded because of the study population (see Fig. 1 for a flowchart).
Basic characteristics of prediction models
Table 1 and Supplementary Table 1 present the main characteristics of the included studies. All studies were reported between 2009 and 2020. Most studies were conducted in China (N=10, 50%) [8,11,20,21-27], the USA (N=3, 15%) [28-30], Korea (N=2, 10%) [31,32], or Japan (N=2, 10%) [33,34]. Among the observational studies, 16 were prospective cohort studies, four were retrospective cohort studies, and only one was a case-control study. Most were conducted in high-risk populations, including patients seropositive for hepatitis B surface antigen, patients with chronic hepatitis C virus infection, and patients with cirrhosis. Only two studies were based on data from the general population [34,35].
Follow up, sample size and predictors
The longest median or mean follow-up was 18.8 years [11]. The sample size for model derivation ranged from 442 to 407,206. Validation sets were smaller than the corresponding derivation sets, and the largest validation set had a sample size of 91,357 [32]. Events per variable (EPV) ranged from 3.3 [26] to 89.73 [28]. Predictors differed considerably among the eligible articles: 17 unique predictor variables were identified across the 20 prediction models (Fig. 2). The most commonly used predictors were age (n=17; 85%), sex (n=13; 65%), alanine aminotransferase (ALT) (n=7; 35%), alpha-fetoprotein (AFP) (n=5; 25%), cirrhosis (n=5; 25%), platelet count (n=5; 25%), diabetes (n=5; 25%), and HBV DNA (n=5; 25%). Additional information on the models can be found in Supplementary Table 2.
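For readers unfamiliar with the EPV, the following minimal sketch shows how it is usually computed: the number of outcome events divided by the number of candidate predictor parameters. The function name, the example counts, and the threshold of 10 are illustrative assumptions (10 is a commonly cited rule of thumb), not values or criteria taken from the included studies.

```python
def events_per_variable(n_events: int, n_candidate_predictors: int) -> float:
    """Return events per variable (EPV) for a model development dataset.

    n_events: number of participants who developed the outcome.
    n_candidate_predictors: number of candidate predictor parameters
        considered during model development (assumed > 0).
    """
    if n_candidate_predictors <= 0:
        raise ValueError("n_candidate_predictors must be positive")
    return n_events / n_candidate_predictors


def is_low_epv(epv: float, threshold: float = 10.0) -> bool:
    """Flag a low EPV. The default threshold is an assumed rule of thumb."""
    return epv < threshold


# Hypothetical example: 33 events and 10 candidate predictors.
epv = events_per_variable(33, 10)
print(f"EPV = {epv:.2f}, low = {is_low_epv(epv)}")
```

A low EPV suggests a higher risk of overfitting, which is one reason it features in risk of bias assessments of prediction models.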
Methodological assessment and missing data
Cox proportional hazards models were used in most studies. One study applied machine learning and a logistic regression model simultaneously to predict risk [22]. Most studies did not report how missing data were handled; those that did used complete-case analysis [27,29,33,34,36] or multiple imputation [31].
Summary of model performance
The model performance measures are presented in Table 2. The most commonly reported measure of discrimination was the AUC, which ranged from 0.72 [37] to 0.93 [25] for model development and from 0.65 [20] to 0.92 [32] for model validation. The C statistic was reported in nine articles [11, 23, 25, 28-32, 34, 36, 37], and five articles reported both the C statistic and the AUC [23, 25, 29, 32, 37]; the C statistics ranged from 0.64 [30] to 0.96 [31]. Discriminatory ability was considered good (C statistic over 0.7) in most studies. Eleven articles did not report calibration, and the correlation coefficient was the most frequently used index of calibration in the remaining nine. Internal validation was carried out for six models [25, 26, 28, 29, 34, 37]. Only ten studies conducted external validation.
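As an illustration of what the AUC (equivalent to the C statistic for a binary outcome) measures, the sketch below computes it as the proportion of case/non-case pairs in which the case receives the higher predicted risk, with ties counted as one half. The function name and the risk scores are hypothetical. The sketch assumes a binary outcome without censoring; the included Cox models would instead need a censoring-aware concordance index such as Harrell's C.

```python
from typing import Sequence


def auc_from_scores(case_scores: Sequence[float],
                    noncase_scores: Sequence[float]) -> float:
    """AUC as the fraction of concordant case/non-case pairs.

    A pair is concordant when the case has the higher predicted risk;
    tied pairs contribute 0.5. Assumes both groups are non-empty.
    """
    if not case_scores or not noncase_scores:
        raise ValueError("both groups must contain at least one score")
    concordant = 0.0
    for c in case_scores:
        for n in noncase_scores:
            if c > n:
                concordant += 1.0
            elif c == n:
                concordant += 0.5
    return concordant / (len(case_scores) * len(noncase_scores))


# Hypothetical predicted risks for 3 cases and 4 non-cases.
cases = [0.9, 0.8, 0.4]
noncases = [0.7, 0.3, 0.2, 0.4]
print(auc_from_scores(cases, noncases))  # 10.5 concordant of 12 pairs -> 0.875
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of cases from non-cases.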
Risk of bias for included studies
A summary of the risk of bias assessment by domain is presented in Table 3, and details on each item across domains are presented in Supplementary Table 3. Of the 20 models, 12 were rated at low risk of bias in the participants, predictors and outcome domains. However, all models were classified as at overall high risk of bias because of a low number of events per variable, lack of internal validation, and insufficient reporting of missing data and performance measures. Only five studies were assessed at high risk of bias in the participants domain [22, 23, 27, 30, 31], suggesting that the target populations of the models were well represented. Only two studies were rated at high risk in the predictors domain [11, 22]. Liver cancer was diagnosed according to published guidelines or sophisticated criteria in 16 studies, which were thus rated at low risk. For applicability, 13 studies were rated at low risk and five at high risk [11, 22, 26, 30, 33]; the latter were less likely to be applicable in real-world settings.
Meta-analysis of AUC
The results of the meta-analysis of the AUC of the prediction models are shown in Fig. 3. A total of 10 articles were finally included in the meta-analysis, of which seven provided the AUC or C statistic for model development. The pooled estimates showed that the models varied in discriminatory ability, ranging from 0.67 (0.60-0.73) to 0.94 (0.94-0.95). Discrimination was acceptable in most studies.
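The pooling method is not described in this section. As one possible illustration only, the sketch below pools AUC estimates with a DerSimonian-Laird random-effects model on the logit scale, deriving each standard error from the reported 95% confidence interval. The function name, the choice of method, and the input values are assumptions made for the example and are not taken from the included studies.

```python
import math
from typing import List, Tuple

Z_95 = 1.959964  # standard normal quantile for a 95% interval


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def pool_auc_random_effects(
    studies: List[Tuple[float, float, float]]
) -> Tuple[float, float, float, float]:
    """Pool (auc, ci_lower, ci_upper) triples; all values must lie in (0, 1).

    Returns (pooled_auc, pooled_lower, pooled_upper, tau2), where tau2 is
    the DerSimonian-Laird between-study variance on the logit scale.
    """
    y = [logit(auc) for auc, _, _ in studies]
    # Within-study variance recovered from the CI width on the logit scale.
    v = [((logit(hi) - logit(lo)) / (2.0 * Z_95)) ** 2 for _, lo, hi in studies]
    w = [1.0 / vi for vi in v]

    # Fixed-effect estimate, needed for Cochran's Q.
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))
    df = len(studies) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau2 to each within-study variance.
    w_re = [1.0 / (vi + tau2) for vi in v]
    y_re = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se_re = math.sqrt(1.0 / sum(w_re))
    return (inv_logit(y_re),
            inv_logit(y_re - Z_95 * se_re),
            inv_logit(y_re + Z_95 * se_re),
            tau2)


# Hypothetical AUCs with 95% CIs from three studies.
example = [(0.70, 0.65, 0.75), (0.80, 0.76, 0.84), (0.90, 0.88, 0.92)]
pooled, lower, upper, tau2 = pool_auc_random_effects(example)
print(f"pooled AUC = {pooled:.3f} ({lower:.3f}-{upper:.3f}), tau^2 = {tau2:.3f}")
```

Working on the logit scale keeps the pooled estimate and its interval within (0, 1); a large tau2 indicates substantial between-study heterogeneity, in line with the wide range of discriminatory ability noted above.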