HEXA (Health Examinees)
The HEXA study was initiated in 2004, and 173,357 participants aged over 40 years were recruited from 38 health examination centers and training hospitals located in eight regions of South Korea [22]. Of these, 58,700 individuals who had genotype data and passed sample quality control were retained. Samples were excluded for any of the following reasons: a history of cancer, gender inconsistencies, cryptic relatedness, low genotype call rate (< 95%), or sample contamination, as previously described [22]. All participants were genotyped with the Korean Chip (K-CHIP), which was designed by the Center for Genome Science, Korea National Institute of Health (KNIH), based on the UK Biobank Axiom® Array, and manufactured by Affymetrix. SNP imputation was carried out using IMPUTE v2 [35] with 1000 Genomes Phase 3 data as the reference panel.
KARE (Korea Association Resource)
Participants of the KARE cohort (n = 8,840) were recruited from two regions in South Korea (Ansan and Ansung) from 2009 to 2012 for the Korean Genome and Epidemiology Study [23]. All study participants aged ≥ 40 years provided written informed consent, and approval was obtained from the institutional review board. The exclusion criteria were as follows: a history of cancer, gender inconsistencies, cryptic relatedness, low genotype call rate (< 95%), and sample contamination [22, 23]. The KARE study used the Affymetrix Genome-Wide Human SNP Array GeneChip 5.0. SNP imputation was performed using IMPUTE v2 with the 1000 Genomes Project (haplotype phase 1) reference panel [35].
Ethics approval and consent to participate
This study was conducted with bioresources from the National Biobank of Korea, the Korea Disease Control and Prevention Agency, Republic of Korea (KBN‐2021‐051).
Disease selections
For hypertension, we selected cases meeting any of the following criteria: systolic blood pressure ≥ 140 mmHg, diastolic blood pressure ≥ 90 mmHg, use of antihypertensive medicines, diagnosis of hypertension, or undergoing treatment for hypertension. Controls were those with systolic blood pressure < 120 mmHg and diastolic blood pressure < 80 mmHg [36].
For type 2 diabetes, cases were selected if they satisfied any of the following criteria: fasting glucose level ≥ 126 mg/dl, 2-hour oral glucose tolerance test (2-hour OGTT) ≥ 200 mg/dl, receiving treatment for type 2 diabetes, or taking medication for the condition. Controls were identified as those with fasting glucose level < 100 mg/dl, 2-hour OGTT < 140 mg/dl, and no history of type 2 diabetes treatment or diagnosis [37].
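The hypertension and type 2 diabetes definitions above can be sketched as simple classification rules. The following Python sketch is illustrative only: the function and argument names are hypothetical, it assumes no missing measurements, and it assumes that participants who satisfy neither the case nor the control definition are left out of the analysis for that disease, which the text implies but does not state.

```python
from typing import Optional


def hypertension_status(sbp: float, dbp: float, on_medication: bool = False,
                        diagnosed: bool = False, treated: bool = False) -> Optional[str]:
    """Return 'case', 'control', or None (neither definition met)."""
    # Any single criterion is sufficient for case status.
    if sbp >= 140 or dbp >= 90 or on_medication or diagnosed or treated:
        return "case"
    # Controls must satisfy both blood pressure thresholds.
    if sbp < 120 and dbp < 80:
        return "control"
    return None  # assumption: intermediate values are neither case nor control


def t2d_status(fasting_glucose: float, ogtt_2h: float, treated: bool = False,
               on_medication: bool = False, diagnosed: bool = False) -> Optional[str]:
    """Return 'case', 'control', or None; glucose values in mg/dl."""
    if fasting_glucose >= 126 or ogtt_2h >= 200 or treated or on_medication:
        return "case"
    # Controls need normal glucose values and no prior diagnosis
    # (prior treatment has already been classified as a case above).
    if fasting_glucose < 100 and ogtt_2h < 140 and not diagnosed:
        return "control"
    return None


print(hypertension_status(150, 85), hypertension_status(118, 76))
print(t2d_status(130, 150), t2d_status(95, 120))
```

Returning None for the intermediate group keeps the control set restricted to participants with clearly normal values, in line with the thresholds stated above.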
For asthma, cataract, cholelithiasis, colon polyp, and stroke, cases were chosen if they met any of these criteria: a diagnosis of the respective disease, taking medication for it, or undergoing treatment for it. Conversely, controls were selected from those without a diagnosis of any of these diseases.
For coronary artery disease, cases were selected based on the following criteria: a diagnosis of myocardial infarction or angina pectoris, taking medication for either condition, or undergoing treatment for either condition. Controls were those without a diagnosis of either myocardial infarction or angina pectoris.
For obesity, cases were defined as those with a body mass index ≥ 25 kg/m², and controls as those with a body mass index < 25 kg/m² [38, 39].
For osteoporosis in HEXA, cases were selected based on these criteria: diagnosis of osteoporosis, taking medication for osteoporosis, or receiving treatment for osteoporosis. Controls were selected based on the criterion of not having a diagnosis of osteoporosis. For osteoporosis in KARE, we selected cases that met the following criteria: for females, a diagnosis of osteoporosis, taking medication for osteoporosis, undergoing treatment for osteoporosis, or having a distal radius T score < -2.6 or midshaft tibia T score < -3.0 [40]; for males, a diagnosis of osteoporosis, taking medication for osteoporosis, undergoing treatment for osteoporosis, or having a distal radius T score < -2.5 or midshaft tibia T score < -2.5 [41]. In contrast, controls for females were defined as having a distal radius T score greater than -1.4 and a midshaft tibia T score of -1.6 [40], and controls for males were defined as having a distal radius and midshaft tibia T score greater than -1.0 [41].
PRS-CS
PRS-CS is a Bayesian regression framework that places continuous shrinkage priors on SNP effects to infer their posterior mean effects; it is robust to varying genetic architectures, provides substantial computational advantages, and enables multivariate modeling of local LD patterns [42]. We ran PRS-CS in its auto mode, in which the global shrinkage parameter phi is learned from the discovery GWAS without post-hoc tuning, and used the default settings for the other parameters. We also used the 1000 Genomes reference panel provided by PRS-CS (https://github.com/getian107/PRScs).
PRS-CSx
We used PRS-CSx, a recently developed Bayesian polygenic modeling method, to construct the transferability PRS [21]. PRS-CSx jointly models the two GWAS summary statistics and couples genetic effects across populations using a shared continuous shrinkage prior, which enables more accurate effect size estimation by sharing information between summary statistics and leveraging LD diversity across discovery samples. The shared prior allows for correlated but varying effect size estimates across populations, retaining the flexibility of the modeling framework. In addition, PRS-CSx accounts for population-specific allele frequencies and LD patterns and inherits efficient and robust posterior inference algorithms from PRS-CS. We used pre-computed 1000 Genomes Project reference panels that matched the ancestry of each discovery GWAS, and a fully Bayesian algorithm for model fitting, which automatically learned all model parameters from the summary statistics without the need for hyper-parameter tuning. PRS-CSx estimates the PRS from 1,259,754 HapMap3 variants. We therefore used only the HapMap3 variants available in the HEXA (~1,150,090 SNPs) and KARE (~919,166 SNPs) cohorts.
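As an illustration of how posterior effect sizes restricted to shared variants translate into an individual score, the following Python sketch computes a PRS as the weighted sum of effect-allele dosages. This is the standard additive scoring rule rather than a procedure stated in the text, and the variant IDs, weights, and function names are hypothetical.

```python
from typing import Dict, List


def polygenic_score(weights: Dict[str, float], dosages: Dict[str, float]) -> float:
    """Sum of (posterior effect size x effect-allele dosage) over shared variants.

    Variants present in only one of the two inputs are ignored, mirroring the
    restriction to variants available in both the weight file and the cohort.
    """
    shared = weights.keys() & dosages.keys()
    return sum(weights[v] * dosages[v] for v in shared)


def standardize(scores: List[float]) -> List[float]:
    """Scale scores to mean 0 and sample SD 1, so effects can be read per SD."""
    mean = sum(scores) / len(scores)
    sd = (sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)) ** 0.5
    return [(s - mean) / sd for s in scores]


# Hypothetical posterior effect sizes and one individual's dosages (0 to 2).
weights = {"rs1": 0.10, "rs2": -0.05, "rs3": 0.20}
dosages = {"rs1": 2.0, "rs2": 1.0, "rs4": 0.0}  # rs3 is not genotyped here

score = polygenic_score(weights, dosages)  # uses rs1 and rs2 only
print(round(score, 4))
print(standardize([1.0, 2.0, 3.0]))
```

Standardizing the scores within a cohort is what allows an odds ratio to be reported per SD of the PRS.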
Statistical analysis
To obtain the likelihood ratio test (LRT) statistic and the odds ratio (OR) per standard deviation (SD) of the PRS, we fitted a logistic regression model in R version 4.1.0, as follows:
logit(Disease) = β0 + β1 PRS + β2 age + β3 sex
where logit(Disease) is the log odds of the binary outcome variable Disease (control or case), age ranges from 40 to 69 years, and sex is coded as 0 for female and 1 for male.
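The following Python sketch shows one way the per-SD OR and the LRT could be derived from such a model. It is a sketch on simulated data, not the R code used in the study. It assumes that the outcome is coded 1 for cases and 0 for controls, that the PRS is already in SD units, and that the LRT compares the model with the PRS against a model containing only age and sex. The Newton-Raphson fitter and all variable names are illustrative.

```python
import math
import random


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson.

    Each row of X starts with 1.0 for the intercept; y holds 0/1 outcomes.
    Returns (coefficients, maximized log-likelihood).
    """
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            mu = 1.0 / (1.0 + math.exp(-eta))
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    hess[j][k] += mu * (1.0 - mu) * xi[j] * xi[k]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    loglik = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        loglik += yi * eta - math.log1p(math.exp(eta))
    return beta, loglik


# Simulated cohort (illustrative values only, not study data).
random.seed(1)
n = 1000
prs = [random.gauss(0.0, 1.0) for _ in range(n)]  # PRS drawn in SD units
age = [random.randint(40, 69) for _ in range(n)]
sex = [random.randint(0, 1) for _ in range(n)]     # 0 = female, 1 = male
disease = []
for s, a, g in zip(prs, age, sex):
    eta = -1.0 + 0.5 * s + 0.02 * (a - 55) + 0.2 * g
    disease.append(1 if random.random() < 1.0 / (1.0 + math.exp(-eta)) else 0)

full_X = [[1.0, s, a, g] for s, a, g in zip(prs, age, sex)]
null_X = [[1.0, a, g] for a, g in zip(age, sex)]
beta_full, ll_full = fit_logistic(full_X, disease)
_, ll_null = fit_logistic(null_X, disease)

or_per_sd = math.exp(beta_full[1])               # OR per 1-SD increase in PRS
lrt_stat = 2.0 * (ll_full - ll_null)             # chi-square with 1 df
p_value = math.erfc(math.sqrt(lrt_stat / 2.0))   # upper tail of chi-square(1)
print(or_per_sd, lrt_stat, p_value)
```

Because the two models differ only by the PRS term, the LRT statistic is referred to a chi-square distribution with one degree of freedom, whose upper tail can be written with the complementary error function.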
We assessed prediction performance using the continuous net reclassification improvement (NRI), computed with the ‘PredictABEL’ package in R. The formula for calculating the continuous NRI when comparing the null model against new models 1 and 2 is as follows:
NRI_i = P(up_i | Case) - P(down_i | Case) + P(down_i | Control) - P(up_i | Control), where i = 1 or 2, up_i denotes that the predicted risk from new model i is higher than that from the null model, and down_i denotes that it is lower.
We generated NRI indices for both ‘null model vs. new model 1’ and ‘null model vs. new model 2’ and compared these indices to assess the relative predictive performances. For this analysis, we randomly divided the samples into two equal halves. In one half, we generated the model, while in the other half, we estimated the NRI values.
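The continuous NRI and the split-half design described above can be sketched in a few lines of Python. This is an illustrative implementation of the formula, not the ‘PredictABEL’ code. The function names and example risks are hypothetical, and individuals whose predicted risk is unchanged are counted as neither up nor down.

```python
import random
from typing import List, Sequence, Tuple


def split_halves(ids: Sequence[int], seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split sample IDs into a model-fitting half and an evaluation half."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def continuous_nri(risk_null: Sequence[float], risk_new: Sequence[float],
                   is_case: Sequence[int]) -> float:
    """P(up|Case) - P(down|Case) + P(down|Control) - P(up|Control)."""
    up_case = down_case = up_ctrl = down_ctrl = 0
    n_case = n_ctrl = 0
    for r0, r1, case in zip(risk_null, risk_new, is_case):
        if case:
            n_case += 1
            up_case += r1 > r0
            down_case += r1 < r0
        else:
            n_ctrl += 1
            up_ctrl += r1 > r0
            down_ctrl += r1 < r0
    return (up_case - down_case) / n_case + (down_ctrl - up_ctrl) / n_ctrl


# Toy predicted risks for three cases followed by three controls.
risk_null = [0.2, 0.2, 0.2, 0.2, 0.2, 0.2]
risk_new = [0.3, 0.4, 0.1, 0.1, 0.1, 0.3]
is_case = [1, 1, 1, 0, 0, 0]
nri = continuous_nri(risk_null, risk_new, is_case)
print(round(nri, 4))

train_ids, test_ids = split_halves(list(range(10)))
```

In this toy example two of three cases move up and two of three controls move down, so each group contributes 1/3 to the NRI.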
To analyze incidence data, which involve events occurring over time, we performed Cox regression analysis using the ‘survival’ package in R.
To investigate mean differences in quantitative variables between cases and controls, we used Student's t-test in R version 4.1.0.
We drew the bar plot using ‘ggplot2’ version 3.3.6 in R.