In this study, we developed a machine learning model that predicts the risk of an individual health screening examinee developing HCC within the following 12 years, based on information available from the examination results and the history of medical service use. The model showed good calibration and discrimination in testing. We believe that one of the greatest strengths of this model is that it extracts information hidden in big data that would otherwise have been discarded and creates new value from it: an estimate, for each individual examinee, of the risk of developing a specific disease in the future. We hope that, after further development and validation, prediction models of this kind will be integrated into the national healthcare system and provide people with additional helpful information.
The main goal of machine learning lies in making the most accurate prediction possible, whereas traditional statistical analysis focuses mainly on generalizing from sample statistics to population parameters. Accordingly, machine learning is often referred to as a ‘black box’: data go in and predictions come out, but the processes between input and output are unclear, which is considered acceptable as long as the predictions are accurate [17]. However, rather than putting all the data into an algorithm and asking it to make a good prediction, we wanted first to identify a set of valid and stable input features, or predictors, and to examine their associations with the outcome, for the following reasons. First, even when the main goal is accurate prediction, it is still important to understand the relationship between the predictors and the outcome, so that appropriate action can be taken on the causes of the outcome. Second, complex algorithms can be so flexible that they pick up meaningless or noisy signals from the input data, producing good predictions in one dataset but failing to generalize to other datasets with different noise. Therefore, through a rigorous feature selection process, we aimed to remove noisy signals, that is, non-significant or unstable input features; indeed, in our results, many seemingly irrelevant underlying diseases, such as hemorrhoids or chronic rhinitis, were frequently selected as independent risk factors for HCC in resampled datasets (Supplementary Table 3).
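The idea of checking whether a feature is selected consistently across resampled datasets can be illustrated with a minimal sketch. It is not the selection procedure used in this study: it assumes a simple univariate correlation filter, purely synthetic data, and hypothetical names (`selection_frequency`, `signal`, `noise_a`, and the thresholds 0.2 and 0.8 are all illustrative choices).

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def selection_frequency(columns, outcome, n_resamples=100, threshold=0.2, seed=0):
    """Fraction of bootstrap resamples in which each feature passes the filter.

    Assumption: a feature counts as "selected" when the absolute correlation
    with the outcome reaches `threshold`. Real pipelines would typically use
    a multivariable model instead of this univariate stand-in.
    """
    rng = random.Random(seed)
    n = len(outcome)
    counts = {name: 0 for name in columns}
    for _ in range(n_resamples):
        # Draw n row indices with replacement (one bootstrap resample).
        idx = [rng.randrange(n) for _ in range(n)]
        y = [outcome[i] for i in idx]
        for name, values in columns.items():
            x = [values[i] for i in idx]
            if abs(pearson(x, y)) >= threshold:
                counts[name] += 1
    return {name: c / n_resamples for name, c in counts.items()}


# Synthetic data (not from the study): one informative feature, three noise features.
data_rng = random.Random(42)
n_rows = 400
signal = [data_rng.gauss(0, 1) for _ in range(n_rows)]
columns = {
    "signal": signal,
    "noise_a": [data_rng.gauss(0, 1) for _ in range(n_rows)],
    "noise_b": [data_rng.gauss(0, 1) for _ in range(n_rows)],
    "noise_c": [data_rng.gauss(0, 1) for _ in range(n_rows)],
}
# Binary outcome driven by the informative feature plus random noise.
outcome = [1 if s + data_rng.gauss(0, 1) > 1 else 0 for s in signal]

freq = selection_frequency(columns, outcome)
# Keep only features selected in at least 80% of resamples (illustrative cutoff).
stable = [name for name, f in freq.items() if f >= 0.8]
print(freq)
print(stable)
```

In this sketch, a noise feature may pass the filter in a few resamples by chance, but only a feature with a genuine association should pass in nearly all of them, which is the rationale for discarding features with low selection frequency.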
Older age, male sex, chronic liver disease, heavy alcohol consumption, diabetes, and HIV infection are well-known risk factors for HCC [18, 19]. All of these risk factors were independent predictors in our cohort as well. Although self-reported drinking habits from the questionnaire were not selected as a predictor in our model, we believe that ALT and GGT, which were strong predictors in our model, offer a more objective way of assessing the effect of alcohol consumption than the 5-point scale questionnaire used in our health screening examination, as a previous study showed [20].
In contrast to underlying diabetes, underlying dyslipidemia and higher total cholesterol were associated with a lower risk in our cohort. These opposite associations of diabetes and dyslipidemia with HCC are in line with the results of an epidemiologic study of HCC and metabolic risk factors in a nationwide Taiwan cohort [21]. This may be partly explained by the fact that, in this study, dyslipidemia was diagnosed only when both the diagnosis and the use of lipid-lowering drugs were confirmed (Supplementary Table 2), and current evidence suggests that statin use could contribute to a decline in HCC incidence [18, 22]. However, hypercholesterolemia without taking lipid-lowering drugs was also an independent risk factor [21]. More research is warranted on the effect of dyslipidemia on HCC development and prognosis and on the underlying mechanism.
Family history of liver cancer is also a known risk factor for HCC [23, 24]. In our study, a family history of chronic liver disease, rather than of cancer, was a strong predictor. This is not surprising, considering that chronic liver disease is one of the strongest risk factors for liver cancer. Our health screening questionnaire also asked about family history of cancer, but the question covered all types of cancer, which is probably why it was not selected as a significant risk factor.
Interpretation of the lower risk of HCC in patients with mental disorders due to psychoactive substance use or with schizophrenic and delusional disorders is hampered by the fact that these diagnoses were considered sensitive personal information and were grouped together under an unidentified code in our dataset. However, because mental disorders due to the use of alcohol, the most commonly used psychoactive substance, probably shifted the outcome towards an increased risk, the decreased risk of HCC is likely attributable to schizophrenic and delusional disorders. In particular, a meta-analysis reported that schizophrenia is protective against HCC development [25]. Some investigators have suggested a correlation between tumor suppressor genes and schizophrenia as a possible explanation for its potential protective effect against cancer [26].
Our prediction model has limitations. It was developed and validated using a population of a single ethnicity (i.e., Asian) from a single country. Thus, the generalizability of the model to other countries or ethnic groups is not guaranteed. However, we believe that our approach (i.e., a machine learning predictor based on claims and health screening data) can be applied similarly to various cohorts and used to produce their own, even multi-national, prediction models. In addition, as mentioned above, some diagnoses were masked and grouped together to protect sensitive personal information. We expect that more detailed information from the national health insurance database will be made available for research purposes in the future.