This study received approval from the Medical Research and Ethics Committee under the Ministry of Health Malaysia (NMRR-20–748-54587). The requirement for informed consent was waived because the data were retrospectively accessed. The study’s methods and findings were in line with the guidelines on the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) [18].
Selection of prediction models
A literature search was performed in PubMed in February 2020 to identify prediction models that had been developed for NAFLD. The search strategy used the following search string: (fatty liver [Title/Abstract]) OR (NAFLD [Title/Abstract]) OR (steatosis [Title/Abstract]) AND (predict [Title/Abstract]) OR (index [Title/Abstract]) OR (risk [Title/Abstract]) OR (score [Title/Abstract]) OR (model [Title/Abstract]) OR (algorithm [Title/Abstract]) OR (test [Title/Abstract]) OR (biomarker [Title/Abstract]) OR (machine learning [Title/Abstract]). The search was limited to full-text articles on research in adult subjects that were written in English and published between 2000 and 2019. An article was selected only if it (i) presented the development of a prediction model or an update of a previously developed model for NAFLD, (ii) used the risk of developing NAFLD in the general population as the study’s endpoint, (iii) applied multiple parameters or predictors to the model, (iv) developed the model based on weighted risk predictors, (v) provided the full model’s linear predictor or prediction algorithm, and (vi) only applied parameters that were routinely measured and tested in public healthcare centers in Malaysia. When a full-text article was not publicly available, we made up to two attempts to contact the corresponding authors by email. The reference lists of the selected articles were also used to identify additional relevant articles.
Validation cohort
The validation cohort comprised patients seeking care from Hospital Sultanah Bahiyah, a public tertiary care center, which also served as the gastroenterology referral center in northern Malaysia. They were all above 18 years of age and underwent liver elastography using the Fibroscan® device (EchoSens, Paris) between January 2017 and December 2019. Those who had a history of active alcohol consumption (more than 14 drinks per week for men or more than seven for women), viral hepatitis, autoimmune hepatitis, or other forms of chronic liver disease were excluded.
The information on the risk factors or predictors that were used in each prediction model was obtained from the patient’s electronic medical records. The predictors included individual socio-demographic and clinical information, ranging from age, ethnicity, gender, education level, marital status, occupation, and body mass index (BMI) to the presence of cardiovascular diseases (diabetes mellitus, hypertension, dyslipidaemia, and coronary artery disease) and the laboratory findings, including alanine aminotransferase (ALT), aspartate aminotransferase (AST), fasting blood glucose level, triglycerides (TG), and serum cholesterol.
The diagnosis of NAFLD was confirmed by physicians based on the findings of the liver elastography. The controlled attenuation parameter (CAP) was used to measure the level of hepatic steatosis—a reading above 248 decibels/meter (dB/m) indicated NAFLD [19]. Ten measurements were performed for each patient, and the diagnosis was only confirmed if at least six readings were valid. For the purpose of this study, only the CAP results from the first liver elastography, as well as the information and laboratory findings from the patient’s clinic visits prior to the first liver elastography, were gathered.
Statistical analysis
Generally, external validation studies of prediction models require at least 100 events (ideally more than 200) for an adequate sample size [20,21]. To handle missing data in the validation cohort, the predictive mean matching method was applied to generate five imputed datasets, and the estimates from these datasets were then pooled using Rubin’s rules [22]. The demographic and clinical characteristics of the patients were summarized as either percentages (categorical data) or means and standard deviations (numerical data).
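As an illustration of the pooling step, the following is a minimal Python sketch of Rubin’s rules for a single parameter. It is not the code used in this study (the analysis was run in R); the function name `pool_rubin` and the five estimates and variances are hypothetical values chosen only to show the arithmetic.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: point estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    """
    m = len(estimates)
    pooled = mean(estimates)              # pooled point estimate
    within = mean(variances)              # average within-imputation variance
    between = variance(estimates)         # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)            # pooled estimate and its standard error

# Hypothetical estimates of one parameter from five imputed datasets
estimates = [0.70, 0.72, 0.71, 0.69, 0.73]
variances = [0.0004] * 5
pooled_estimate, pooled_se = pool_rubin(estimates, variances)
print(pooled_estimate, pooled_se)
```

The total variance combines the average sampling variance within each imputed dataset with the extra uncertainty that arises because the imputed values differ between datasets.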
For each patient in the validation cohort, their risk of NAFLD was calculated using the algorithms provided by the selected prediction models. The predictive performance of each model was estimated using discrimination (the ability of a model to differentiate between individuals with and without NAFLD) and calibration (the agreement between the predictions and observed outcomes). The model’s calibration was assessed graphically using a calibration plot. A perfect model prediction was expected to be represented by a 45° line with an intercept (α) of zero and a slope (β) of one in the calibration plot [23]. The calibration intercept quantified the degree of agreement between the proportion of observed NAFLD cases and the mean predicted probability, which would indicate whether the predictions were systematically too low or too high [23]. On the other hand, the calibration slope referred to the degree of agreement between the predicted probability of developing NAFLD in the present study and the actual probability of having NAFLD [24]. The graph for each model was plotted based on the results of ten groups of a similar number of patients from the validation cohort who had similar predicted probabilities [24].
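The grouping behind a calibration plot and the estimation of an intercept and slope can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the study’s R code: the function names and toy data are hypothetical, and the intercept and slope are taken here from a single logistic regression of the outcome on the logit of the predicted probability, fitted by Newton-Raphson. Calibration-in-the-large is often estimated separately with the slope fixed at one, which this sketch does not do.

```python
from math import exp, log

def logit(p):
    return log(p / (1 - p))

def calibration_groups(predicted, observed, n_groups=10):
    """Split patients into equal-sized groups by predicted risk and return
    (mean predicted probability, observed event proportion) per group."""
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    points = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if chunk:
            mean_pred = sum(p for p, _ in chunk) / len(chunk)
            obs_rate = sum(y for _, y in chunk) / len(chunk)
            points.append((mean_pred, obs_rate))
    return points

def calibration_intercept_slope(predicted, observed, iterations=25):
    """Fit outcome ~ a + b * logit(predicted) by Newton-Raphson.
    Assumes predictions lie strictly between 0 and 1 and that the
    outcomes are not perfectly separated by the predictions."""
    x = [logit(p) for p in predicted]
    a, b = 0.0, 0.0
    for _ in range(iterations):
        fitted = [1 / (1 + exp(-(a + b * xi))) for xi in x]
        w = [f * (1 - f) for f in fitted]
        g_a = sum(y - f for y, f in zip(observed, fitted))
        g_b = sum((y - f) * xi for y, f, xi in zip(observed, fitted, x))
        h_aa = sum(w)
        h_ab = sum(wi * xi for wi, xi in zip(w, x))
        h_bb = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h_aa * h_bb - h_ab ** 2
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

# Toy data: observed frequencies match the predictions exactly
predicted = [0.25] * 4 + [0.5] * 4 + [0.75] * 4
observed = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]
points = calibration_groups(predicted, observed, n_groups=3)
intercept, slope = calibration_intercept_slope(predicted, observed)
print(points, intercept, slope)
```

Because the toy outcomes occur at exactly the predicted frequencies, the grouped points fall on the 45° line and the fitted intercept and slope converge to zero and one.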
Direct application of the published models to the current validation cohort could result in miscalibration, which is characterized by deviations from the ideal line (i.e. calibration-in-the-large not equal to zero or a calibration slope different from one). In the case of model miscalibration, the prediction model was updated by calculating a correction factor using the following equation [25]:
The correction factor was then added to the original model’s intercept, and the new intercept was used when the updated model was applied to the validation cohort [25]. This method improved the model’s calibration without affecting its ability to discriminate between individuals with and without NAFLD [26]. Furthermore, the model’s discrimination was assessed based on the concordance (‘c’) statistic, which was equal to the area under the receiver operating characteristic (ROC) curve, along with its corresponding 95% confidence interval. An area under the ROC curve greater than 0.5 suggested that the model could be used to predict NAFLD [27].
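The following Python sketch illustrates why shifting the intercept leaves discrimination unchanged. It rests on an assumption: the correction factor is taken to be the difference between the log-odds of the observed prevalence and the log-odds of the mean predicted risk, which is one commonly used form of intercept update and may differ from the equation used in this study. All function names and data are hypothetical.

```python
from math import exp, log

def logit(p):
    return log(p / (1 - p))

def expit(z):
    return 1 / (1 + exp(-z))

def correction_factor(predicted, observed):
    """Assumed form: log-odds of observed prevalence minus
    log-odds of the mean predicted risk."""
    prevalence = sum(observed) / len(observed)
    mean_pred = sum(predicted) / len(predicted)
    return logit(prevalence) - logit(mean_pred)

def update_intercept(predicted, cf):
    """Add the correction factor to each patient's linear predictor."""
    return [expit(logit(p) + cf) for p in predicted]

def c_statistic(predicted, observed):
    """Proportion of case/non-case pairs in which the case has the
    higher predicted risk (ties count as one half)."""
    cases = [p for p, y in zip(predicted, observed) if y == 1]
    controls = [p for p, y in zip(predicted, observed) if y == 0]
    concordant = 0.0
    for pc in cases:
        for pn in controls:
            if pc > pn:
                concordant += 1.0
            elif pc == pn:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))

# Hypothetical cohort in which the original model under-predicts risk
predicted = [0.10, 0.20, 0.30, 0.40, 0.15, 0.25]
observed = [0, 1, 0, 1, 1, 0]
cf = correction_factor(predicted, observed)
updated = update_intercept(predicted, cf)
c_before = c_statistic(predicted, observed)
c_after = c_statistic(updated, observed)
print(cf, c_before, c_after)
```

Adding a constant to every linear predictor changes the predicted probabilities but not their rank order, so the c-statistic is identical before and after the update.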
Subsequently, the diagnostic accuracy of each updated prediction model was examined using the sensitivity, specificity, positive and negative likelihood ratios, and positive and negative predictive values. These diagnostic parameters were first calculated at a cut-off chosen so that 10% of the population had predicted values above it. The procedure was then repeated using cut-offs at which 20%, 80%, and 90% of the population had values above the cut-off. All the data in this study were analyzed using the R statistical software version 3.5.2 (rms, Hmisc, pROC and rmda packages) [28].
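The percentile-based cut-offs and the resulting diagnostic parameters can be illustrated with the Python sketch below. It is not the study’s R code: the function names and the ten-patient dataset are hypothetical, and the cut-off is placed midway between adjacent ranked predictions as one simple way of putting the requested fraction of patients above it.

```python
from math import inf, nan

def cutoff_for_top_fraction(predicted, fraction):
    """Return a threshold with roughly `fraction` of patients above it,
    placed midway between two adjacent ranked predictions."""
    ranked = sorted(predicted, reverse=True)
    k = round(fraction * len(ranked))
    k = max(1, min(k, len(ranked) - 1))
    return (ranked[k - 1] + ranked[k]) / 2

def ratio(a, b):
    return a / b if b else nan

def diagnostic_accuracy(predicted, observed, cutoff):
    tp = fp = tn = fn = 0
    for p, y in zip(predicted, observed):
        test_positive = p > cutoff
        if test_positive and y == 1:
            tp += 1
        elif test_positive and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    sens = ratio(tp, tp + fn)
    spec = ratio(tn, tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "lr_pos": sens / (1 - spec) if spec < 1 else inf,
        "lr_neg": (1 - sens) / spec if spec > 0 else inf,
    }

# Hypothetical cohort of ten patients; 20% are classified as test-positive
predicted = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
observed = [0, 0, 0, 1, 0, 1, 0, 1, 0, 1]
cutoff = cutoff_for_top_fraction(predicted, 0.20)
result = diagnostic_accuracy(predicted, observed, cutoff)
print(cutoff, result)
```

Repeating the call with fractions of 0.10, 0.80, and 0.90 gives the remaining cut-offs; tied predictions at the boundary would make the fraction above the cut-off only approximate.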