Data compilation
The PRSKB integrates with the National Human Genome Research Institute-European Bioinformatics Institute (NHGRI-EBI) GWAS Catalog55 to provide the most up-to-date and comprehensive list of GWA studies. The GWAS Catalog is a publicly available database of GWA study summary statistics, which includes over 4,000 publications, 130,000 variant-trait associations, and 6,000 full summary statistic files. Study and association data from the GWAS Catalog are automatically downloaded, pruned, and reformatted using the gwasrapidd R library56. The data are filtered to include only associations that contain both an odds ratio and the respective risk allele. Additionally, only non-haplotype associations that reside on an autosomal chromosome are preserved. Finally, any allele that has been reported on the reverse strand is automatically detected and flipped to the forward strand. The strand-flipping procedure entails comparing each reported risk allele to the list of possible alleles for the specified variant from dbSNP57. If the reported risk allele does not exist in the list of possible alleles, the complement of the risk allele is checked against the list. If the complement is present, then it is used as the reported risk allele for polygenic risk score calculations, as recommended by Choi, et al. 22.
PRSKB tool structure
The PRSKB is divided into three key parts: the database, the server, and the client as shown in Figure 1. The GWA study data, linkage disequilibrium clumping data, and association data are housed in a MySQL database on the PRSKB server. Supplemental Tables S1-S3 expound on the information found in each database table. The variant associations from each study/trait combination are contained within a single associations table, which includes detailed summary statistics for each variant (see Supplemental Table S1). The study table (see Supplemental Table S2) contains detailed descriptions of each GWA study. Finally, there are four clumps tables, hg38 clumps, hg19 clumps, hg18 clumps, and hg17 clumps, that include linkage disequilibrium region identification numbers for variants in each of the five super populations from the 1000 Genomes project (see Supplemental Table S3). The associations and study tables are automatically updated monthly with new associations added to the GWAS Catalog. The scripts for loading tables into the database are publicly available at https://github.com/kauwelab/PolyRiskScore/tree/master/update_database_scripts.
The server houses the application programming interface (API) endpoints for the PRSKB, running NodeJS using PM2 (https://pm2.keymetrics.io/) and NGINX (https://www.nginx.com/). While the user does not interact directly with the API endpoints, the client calls endpoints to download requested data needed to calculate polygenic risk scores. All calculations occur client-side to reduce strain on the server.
Users have two platforms from which they can calculate polygenic risk scores. The first platform is a web interface accessible at https://prs.byu.edu via a web browser. The second is a command-line interface (CLI) tool that can be run from the Linux or Mac command-line or from a bash shell on Windows. The CLI includes a bash script and four Python scripts. We recommend using the CLI to calculate polygenic risk scores for multi-sample VCF files.
Linkage disequilibrium clumping
Linkage disequilibrium is the nonrandom association of alleles at two or more loci58. Linkage disequilibrium generally affects loci that reside in close physical proximity, resulting in the joint inheritance of alleles at different loci within families and populations. Genetic variants that are in linkage disequilibrium will be similarly associated with traits in GWA studies. If they are not adequately assessed, they can confound a polygenic risk score analysis by overrepresenting the relative risk for a disease. For example, if three loci are in high linkage disequilibrium, only one locus should be included in calculating a polygenic risk score because the same risk signal is present in any of those three loci.
To reduce inflated polygenic risk scores, the genetic variants used to calculate polygenic risk scores need to be largely independent from each other. Linkage disequilibrium was calculated by first separating the 1000 Genomes data into the five previously annotated super populations: African, American, East Asian, European, and South Asian. We then used PLINK Linkage Disequilibrium (LD) Clumping59 to calculate linkage disequilibrium regions for the variants in each population (see Supplemental File S1). We ran this analysis for the data available in both reference genomes hg38 and hg19. Although linkage disequilibrium regions are nearly identical between reference genomes60, we also converted the variant coordinates in each clump to reference genomes hg18 and hg17 so that user-supplied genotype information could be easily mapped to the correct LD clump regardless of reference genome.
The LD Clumping analysis results were subsequently used to assign each genetic variant to an LD clump identifier (clump ID) for each population. LD regions were determined using an r-squared cutoff of 0.25 and a distance threshold of 500 kb, which correspond to parameters used in previous studies61,62. From this information, we created a table of population-specific linkage disequilibrium clusters for each reference genome in our database (see Supplemental Table S3). The clump ID for each population facilitates the dynamic retrieval of LD clumps from the database so that no more than one variant per LD region is included in a polygenic risk score calculation.
Calculating Polygenic Risk Scores
Polygenic risk scores are calculated using the same protocols outlined by Choi, et al. 22. Figure 2 shows that polygenic risk score calculations require two essential datasets: 1. summary data comprised of GWA study summary statistics (e.g., odds ratios and p-values), and 2. user-supplied query data comprised of individual genotypes. Although a single GWA study is used to calculate each polygenic risk score, users can optionally select multiple studies or traits, which will each be analyzed independently. Users can also upload their own GWA summary statistics for personalized analyses. The PRSKB first ensures that the summary data and the query data are in the same format (e.g., strand flipping and same reference genome). Next, linkage disequilibrium is calculated by comparing each locus to the population-specific clumping regions housed on the server. When an individual in the target data has two or more variants within the same clumping region, the PRSKB chooses the variant with the most significant p-value from that region to represent the clump in the polygenic risk score. After adjusting for linkage disequilibrium, the remaining set of variants that overlap between the sample and the GWA study are retained and used in the polygenic risk score calculation. The equation we use in the PRSKB to calculate genetic risk scores sums the weighted effects of the overlapping alleles present in an individual, as seen in Equation 1. The value for ai is the number of risk alleles at the ith locus, ri is the odds ratio at that locus, and b and c are the coefficient and exponent for the p-value threshold, respectively. The p-value threshold is supplied by the user at runtime.
Equation 1:
We chose to implement this equation because scores calculated in this simple manner are generally highly accurate9,22,26,29,63,64. Although the additive polygenic risk score model does not account for gene-gene or gene-environment interactions, the largest meta-analysis of heritability from twin studies validates a simple additive model for a majority of the traits examined65.
UK Biobank and 1000 Genomes Polygenic Risk Score Visualization
In order to adequately interpret polygenic risk scores, individual results must be contextualized against a large dataset of similar ethnicity29. We used the 1000 Genomes Project53, which contains sequencing data for 2,504 samples spanning five different continental regions, and the UK Biobank54, a biomedical database comprised of genetic data for 500,000 individuals from the United Kingdom, to generate risk score distributions and summary statistics for each study in the PRSKB database. First, we divided the samples by population region. Next, we used the PRSKB to compute polygenic risk scores from all GWA studies in our database for each individual in each dataset. We then calculated the percentile rank of each person against all other people in the dataset with the same ethnicity. The polygenic risk score and percentile ranks were passed to Plotly JavaScript66 to create interactive graphics that allow users to visualize population-specific distributions of polygenic risk scores for any study in the PRSKB database. Dynamic plots with a table of summary statistics for each study are available for users to query online at https://prs.byu.edu/visualize.html.
Alzheimer's Disease Neuroimaging Initiative (ADNI) Case Study
We also computed Alzheimer’s disease polygenic risk scores for individuals in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu) to verify the efficacy of the PRSKB calculations. ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease. MCI is the preclinical stage of Alzheimer’s disease, and is characterized by a slight but measurable decline in cognitive abilities. Individuals with MCI are at an increased risk of developing Alzheimer’s disease or another dementia.
The ADNI case-control cohort includes clinical and genetic data for over 5,000 individuals. Results of a multiple linear regression indicated that there is not a collective significant relationship between age, sex, and polygenic risk score (F=0.395, p=0.674) in the ADNI dataset. Further examination revealed that neither age (t=0.0256, p=0.666) nor sex (t=0.614, p=0.469) has significant individual association with polygenic risk score outcomes, so we did not adjust for age or sex in our analyses using the ADNI dataset. We used the PRSKB calculator to compute the polygenic risk scores for each individual, and we chose to use the most comprehensive GWA study on Alzheimer’s disease in the PRSKB database4 as the summary data for calculations.
Following the risk score calculations, we separated the individuals by clinical dementia rating (CDR)—a summary measure developed to denote the overall severity of dementia in an individual, where CDR=0.0 is cognitively normal, CDR=0.5 is MCI, and CDR≥1.0 is Alzheimer’s disease67. A Shapiro-Wilk’s test of normality68 revealed that the risk scores in each group were not normally distributed (Alzheimer’s disease p=2.2x10-16, MCI p=1.4x10-7, cognitively normal p=4.5x10-10), so we opted to use a Mann-Whitney U test69 to compare the distributions of polygenic risk score ranks in individuals with and without Alzheimer’s disease. We first compared genetic risk scores in individuals with a CDR≥1 (Alzheimer’s disease) and individuals with a CDR≤0.5 (MCI + cognitively normal). Next, we compared individuals with a CDR=0 (cognitively normal) and individuals with a CDR≥0.5 (Alzheimer’s disease + MCI). Because the shape of the risk score distributions in each group followed a similar pattern, the Mann-Whitney U analyses were able to determine the extent to which the medians in each group significantly differed.
We performed a similar analysis using every study in the PRSKB database to identify additional traits that have distinct polygenic risk score distributions for each CDR cohort in the ADNI dataset. Using the PRSKB, we generated polygenic risk scores from 2,388 studies for each individual in the ADNI cohort. We then separated the individuals by CDR diagnosis, where individuals with a CDR≥1 (Alzheimer’s disease) were placed in one group and individuals with a CDR≤0.5 (MCI + cognitively normal) were allocated to another group. Next, we used a Shapiro-Wilks68 test to determine the extent to which scores for each study/CDR cohort were normally distributed. For normally distributed data, we performed an independent Welch’s two-sample t-test to compare the risk score means of individuals with and without Alzheimer’s disease. Finally, in studies with non-normal distributions of polygenic risk scores, we performed the non-parametric Mann-Whitney U test69, which compared the distribution of ranks in individuals with and without Alzheimer’s disease. Cohorts with a sample size less than two were excluded from the analysis. We repeated this analysis using a different clustering: CDR≥0.5 (Alzheimer’s disease + MCI) in one group and individuals with a CDR=0 (cognitively normal) in another group. We did not analyze MCI as a separate group to maintain statistical power and identify differences between 1) People with Alzheimer's disease and all other individuals and 2) People with normal cognition and all other individuals. Similar to the computations performed with the UK Biobank and 1000 Genomes datasets, we also report the percentile score distributions and summary statistics for CDR≥1, CDR=0.5, and CDR=0 online using Plotly Javascript66.