Study population and data collection
This study was conducted using bioresources from the National Biobank of Korea and Center for Disease Control and Prevention, Republic of Korea (NBK-2022-004). In this study, indpendent subjects were recruited from three population-based KoGES cohorts 22; HEXA, CAVAS, and the Ansan/Ansung study. Among the 58,694 HEXA, 8,105 CAVAS, and 8,840 Ansan/Ansung participants, those without genetic data (NHEXA=1; NAnsan/Ansung study=3,347) were excluded. Those without hsCRP data were excluded (NHEXA=5, NCAVAS=860, NAnsan/Ansung study=15), as well as those with hsCRP level above 10 mg/L (NHEXA=254, NCAVAS=137, NAnsan/Ansung study=1) because high CRP levels could be caused by acute inflammation 48. In the case of an outlier or missing value for hsCRP level, hsCRP data from the closest follow-up study were used if available (i.e., the 1st follow-up study for HEXA and the 1st to 8th follow-up study for the Ansan/Ansung study) to minimize exclusions. Finally, 71,019 study participants (NHEXA=58,434; NCAVAS=7,108; NAnsan/Ansung study=5,477) were selected for the analysis (Fig S8).
Demographic and epidemiological data were collected using self-reported questionnaires, and laboratory experiments for clinical data were conducted in hospitals across Korea. CRP levels were measured using a high-sensitivity hsCRP method. We collected data on age, BMI, sex, alcohol consumption, smoking, regular physical activity, family income, educational attainment, and chronic disease history simultaneously with hsCRP level collection. Phenotypic data for PheWAS were extracted from KoGES HEXA, CAVAS, and Ansan Ansung databases. Among the available epidemiological and clinical variables, those with valid values exceeding 10,000 or with inflammation-related phenotypes were selected. Among these, binary variables for which the observed frequency was less than 20 in each category were excluded. In total, 114 phenotypes (e.g., platelets, LDL, and total cholesterol) were detected for PheWAS (Table S7-8). The study protocol was approved by the Korea National Institutes of Health and the Institutional Review Board (IRB: CBNU-202306-HR-0153) of Chungbuk National University. This study was performed in accordance with relevant guidelines and regulations and all participants provided informed consent for their biospecimens 49.
Genotyping and quality control
Using the Korea Biobank Array (KCHIP) platform, approximately 800,000 (800 K) SNPs were genotyped from each sample 50. Initially, quality control (QC) procedures were performed to exclude individuals from further analyses whose samples met at least one of the following criteria: 1) call rate < 97%, 2) excessive singletons, 3) sex discrepancy, 4) cryptic second-degree relatives, or 5) withdrawal and blind replicates [20]. SNP QC was also conducted, excluding SNPs with 1) Hardy-Weinberg Equilibrium (HWE) P < 1×10− 6, 2) call rate < 0.95, or 3) low quality SNP (variants classified into another category by R package SNPolisher or off-target variants), resulting in approximately 460 K SNPs remaining 50. Imputation was performed using IMPUTE 4 software based on the Korean reference genome 50. Briefly, 8,000K SNPs were tesed after the following exclusion criteria were met: 1) information content metric < 0.8, and 2) minor allele frequency (MAF) < 0.01 50. For the current GWAS, we double-checked and excluded genotyped or imputed SNPs or samples according to the following criteria: 1) MAF < 0.01, 2) HWE P < 1 \(\:\times\:\) 10− 6, 3) missingness per individual < 0.05, and 4) missing data per marker < 0.05. Consequently, 7,981,597 genotyped or imputed SNPs for HEXA; 7,971,416 for CAVAS; and 7,972,085 for the Ansan/Ansung study were used in the current analysis.
Statistical analysis of epidemiological variables
The hsCRP levels were considered continuous; however, owing to the positively skewed distribution, the natural log-transformed hsCRP levels were used for the current analyses. Among the demographic and epidemiological variables, age and BMI were analyzed as continuous variables; sex, alcohol drinking, smoking, regular physical activity, and history of chronic disease were analyzed as binary variables; and family income and educational attainment were categorized into three or four groups, respectively. To investigate the associations between hsCRP levels and potential confounding variables, several statistical analyses were conducted according to the different types of variables. Student’s t-test and one-way analysis of variance were used to compare hsCRP levels between two groups and among three or more groups of categorical variables, respectively. Pearson’s correlation test was used for continuous variables. We initially constructed a full model using the multivariable linear regression for hsCRP (as dependent variable) with all potential confounding variables statistically significant in the three KoGES cohorts in common (P < 0.05, as independent variables), and then backward selection was performed to extract the covariates. Accordingly, age, sex, BMI, regular physical activity, smoking, and history of hypertension were controlled for in subsequent analyses. In addition, we examined the relationship between LDL and hsCRP levels, both of which are commonly used as biomarkers of CVD. Participants were categorized based on the Korean clinical guidelines for LDL levels:<70 mg/dL, 70–99 mg/dL, 100–129 mg/dL, 130–159 mg/dL, 160–189 mg/dL, and > 190 mg/dL 51. Furthermore, hsCRP levels, proportions of subjects with elevated hsCRP (≥ 3 mg/L), and hsCRP-PRS were compared according to the LDL level categories. All statistical analyses were performed using R Software (version 4.3).
GWAS on hsCRP levels
For the GWAS of hsCRP levels, a multivariable linear model with adjustments, as explained above, was used. A GWAS of hsCRP levels was performed in each of the three KoGES cohorts (HEXA, CAVAS, and Ansan/Ansung study) using PLINK (version1.9) 52. To combine the results of all three KoGES cohort studies, we conducted a fixed-effects inverse variance-weighted meta-analysis using METAL 53, as the overall variation due to heterogeneity was relatively low (I2 = 14.65%) 54. For a genomic control, the statistics of each GWAS result were corrected by the METAL based on the genomic inflation factors (λHEXA = 1.09, λCAVAS = 1.01, and, λAnsan/Ansung study = 1.01). With these corrections, the inflation factor for the meta-analysis was 1.02, indicating little systemic inflation 55.
Human leukocyte antigen (HLA) variants are relatively few, but their complicated LD structure could make GWAS results difficult to interpret 56, we excluded genetic variants in the HLA regions (chr6:25Mb-33Mb, hg19) from further analyses. Furthermore, SNPs on the sex chromosome and indels were excluded because they differed according to sex or individual. We considered hsCRP susceptibility SNPs that satisfied the nominal genome-wide significance threshold (P < 5×10− 8) in at least one of the three KoGES cohorts, with successful validation (P < 0.05) in the remaining KoGES cohorts, and with nominal genome-wide significance in the combined results generated by the meta-analysis (Pmeta<5×10− 8). Using the extracted hsCRP susceptibility SNPs, we performed LD clumping using FUMA platform 23. The GWAS-based hsCRP susceptibility SNPs were grouped into the independent loci based on the distance (± 500kb) and LD structure (r2 < 0.1). For each independent genetic locus, the SNP with the smallest Pmeta value was indexed as the lead SNP.
To calculate the proportion of hsCRP variance explained by the lead SNPs, the following formula was used 14;
$$\:\frac{2\times\:\sum\:_{i}\left[{MAF}_{i}\right(1-{MAF}_{i})\times\:{\beta\:}_{i}^{2}]}{var\left(\text{ln}hsCRP\right)}$$
In this case, ∑ is the sum, \(\:{MAF}_{i}\) is the MAF of the corresponding lead SNP\(\:i\), \(\:{\beta\:}_{i}\) is the effect estimate of the SNP\(\:i\) on natural log hsCRP level, and \(\:var\) is the variance of natural log hsCRP levels.
Validation of hsCRP susceptibility SNPs identified by previously published GWAS
Previous hsCRP GWAS summaries were obtained from the GWAS catalog on 08/31/2023 57. To consider possible ethnic differences in genetic susceptibility loci, EA and with/without EA validation analyses were conducted (number of evaluated SNPs: Nancestry (EA): 253, Nmulti−ancestry with EA: 2,485, Nmulti−ancestry without EA: 2,232). SNPs on the sex chromosome or with P > 5×10− 8, as well as indels, were excluded from validation. Variants that were not used in our GWAS were also excluded. For hsCRP susceptibility SNPs that have been reported several times, those with the smallest P-value were selected. The remaining SNPs were clustered based on distance and LD structure (ancestry: ±500kb; r2 < 0.5, multi-ancestry (with/without EA): ±10,000kb; r2 < 0.001). As a result, 123, 260, and 239 SNPs were evaluated in the ancestry and multi-ancestry (with/without EA) validation analyses, respectively. SNPs were considered significantly validated when Pmeta<0.05, and the results of the previous and current studies were concordant. The detailed process is shown in Fig S3.
Functional analysis
Based on the significant hsCRP SNPs from the meta-GWAS results (Pmeta<5 × 10− 8), we performed the ANNOVAR enrichment test using FUMA software 23, where SNPs were functionally annotated as either exonic, intronic, intergenic, splicing site, 5′/3′-UTR, or upstream/downstream of genes, presented in Fig S4 58. Subsequently, functionally annotated SNPs were mapped to genes using two strategies: positional mapping and eQTL mapping 23. Positional mapping was based on the physical distance between the SNPs and genes within a maximum distance of 10Kb. The eQTL mapping was conducted based on cis-eQTL association between SNPs and genes, limiting the distance to 1Mb. eQTL data were obtained from Genotype-Tissue Expression (GTEx) version 8 59, and the Bonferroni-corrected significance level was used based on the number of genes tested for each tissue.
Using MAGMA 24, we performed gene-based, gene-set, and gene-property analyses based on the overall summary results of the meta-GWAS on hsCRP. Through gene-based analyses, individual SNP statistics within protein-coding genes were combined, resulting in gene-based statistics. The genome-wide significance level of the gene-based analysis was considered using the Bonferroni correction (P < 2.68 × 10− 6=0.05/18,681 genes). Using gene-set analysis, we explored which genes in a specific gene set were more strongly associated with hsCRP levels than other genes. The curated gene sets and gene ontology data obtained from the molecular signature database (MsigDB version 7.0) 60 were used as input data. Gene-based p-values from the gene-based analysis were used to compute gene-set statistics. A total of 15,480 gene sets were tested, and the significance level was set according to the false discovery rate (FDR) control (FDR < 0.05). Gene-property analysis was performed to identify the tissue-specific expression of hsCRP-associated genes, and RNA-sequence data for 30 general tissue types were obtained from GTEx version 8 59.
STRING was used to predict protein-protein interactions within the input genes or proteins 25. In this study, STRING analyses were performed based on MAGMA gene-based analysis results. We investigated how prioritized genes (P < 2.68 × 10− 6, N = 58) interact with each other to influence hsCRP levels. We set the minimum confidence score at 0.4 (medium confidence) corresponding to the estimated likelihood that the association between proteins was true. The STRING gene set enrichment test was performed on all genes (N = 18,681) used in the MAGMA gene-based analysis. Various resources were used to collect gene-set input data, including gene ontology 61, Reactome 62, disease ontology 63, COMPARTMENTS 64, Monarch 65, Uniprot 66, Pfam 67, Interpro 68, KEGG 69, TISSUES 70, and simple modular architecture research tool 71.
The ClueGO plug-in in Cytoscape was used to facilitate the biological interpretation of hsCRP-associated genes 26. The functional enrichment analysis was performed using MAGMA gene-based analysis results, as well as gene ontology and KEGG databases 61,69. Pathways containing four or more hsCRP-associated genes were selected and their significances were calculated using a two-sided hypergeometric test with Bonferroni correction (PBonferroni<0.05). Based on the kappa statics, the degree of linkage between pathways in network was estimated and functional groups were generated (kappa score ≥ 0.4).
PRS and PheWAS
Using the hsCRP-PRS constructed using lead SNPs, PheWAS was conducted to investigate the relationship between cumulative genetic factors of hsCRP and a variety of phenotypes. To compute the hsCRP-PRS, 69 lead SNP statistics were extracted, and the following formula was used:
$$\:{PRS}_{i}=\frac{\sum\:_{i}({G}_{ij}\times\:{\beta\:}_{j})-Mean(PRS)}{SD\left(PRS\right)}$$
where \(\:i\) and \(\:j\) represent the individual subjects and SNP, respectively. \(\:{G}_{ij}\) is the number of hsCRP-increasing alleles of SNP\(\:j\), and \(\:{\beta\:}_{j}\) means the effect estimate of SNP\(\:j\). The \(\:Mean\left(PRS\right)\) and \(\:SD\left(PRS\right)\) refer to the mean and standard deviation of the PRS calculated from the 71,019 participants. PLINK (version1.9) was used for hsCRP-PRS calculation 52. PheWAS was conducted using the same covariates used in the hsCRP GWAS (age, sex, BMI, regular physical activity, smoking, and history of hypertension). Multivariable logistic regression models were adapted for binary variables (intestinal polyps, breast cancer, and family history of cancer), while multivariable linear regression models were adapted for continuous variables (height, LDL, and total cholesterol). PheWAS analysis was performed using the R package PheWAS 72.