Overview of all Analyses
Across the three (i.e., VT, CMC and ACAT) multiethnic meta-analyses, the three Indo-European meta-analyses and the three EACC analyses, we identified a total of 129 unique genes that were significantly associated with the refractive error phenotype (Supplementary Tables 3–5). We found no statistically significant difference in p-value or the number of unique genome-wide significant genes when adding the PRS as covariates.
Multiethnic Meta-analyses
Forty-three genome-wide significant genes were found using EMMAX-VT (Fig. 1A), 11 genome-wide significant genes using the EMMAX-CMC (Fig. 1B), and 28 genome-wide significant genes using ACAT (Fig. 1C).
Sixty-eight unique genes were identified across the three tests (Fig. 2). Four genes were significant across all three tests - GDF15 (19p13.11), PDCD6IP (3p22.3), RRM2 (2p25.1), and ST6GALNAC5 (1p31.1). GDF15 (19p13.11) was one of the top two significant genes in all three approaches (EMMAX-VT P = 5.12x10− 9, EMMAX-CMC P = 1.12x10− 9, ACAT P = 1.95 x 10− 9). GDF15, PDCD6IP, and RRM2 all replicated in at least one cohort; ST6GALNAC5 only appeared in IECC and thus could not be replicated.
Overall, 25 genes were replicated using the EMMAX-VT approach, 11 in the ACAT approach, and 4 in the EMMAX-CMC approach. Three genes — HCAR1, CCDC9, and NINJ2 — were replicated in more than one replication cohort, all in the EMMAX-VT approach. MRPS27 in EMMAX-VT (REHS and EPIC-Norfolk) and GDF15 in ACAT (IECC and REHS) had genome-wide significant p-values in two cohorts. The list of all genome-wide significant genes for each test can be found in Supplementary Tables 6–8, while the full results of all p-values can be found in Supplementary Tables 9–11.
Indo-European Meta-analyses
As it is possible that East Asians differ in genetic risk factor profile from Indo-Europeans, we performed meta-analyses on the four Indo-European ancestry cohorts. Forty-nine genes were genome-wide significant in the EMMAX-VT approach (Fig. 3A), 13 genes in the EMMAX-CMC approach (Fig. 3B), and 29 genes in the ACAT approach (Fig. 3C). Four genes overlapped between all three tests — GDF15, PIK3CA, RRM2, and ST6GALNAC5 (Fig. 4). The signal at PIK3CA was unique to the Indo-European meta-analysis. GDF15 and RRM2 were both replicated in one cohort, while PIK3CA and ST6GALNAC5 only appeared in IECC.
Overall, 24 genes were replicated in EMMAX-VT, 8 genes in ACAT, and 4 genes in EMMAX-CMC. NINJ2 in the EMMAX-VT and STON1 and SND1 in EMMAX-CMC were replicated in multiple cohorts. The list of all genome-wide significant genes for each test can be found in Supplementary Tables 12–14, while the full results of all p-values can be found in Supplementary Tables 15–17.
EACC Analysis
We also report the standalone results of EACC analysis. Thirty-one genome-wide significant genes were found in EACC using EMMAX-VT (Fig. 5A), 5 genome-wide significant genes using EMMAX-CMC (Fig. 5B), and 22 genome-wide significant genes using ACAT (Fig. 5C). GSTM5 (1p13.3) and WEE1 (11p15.4) overlapped in all three tests (Fig. 6). SERTAD3 (chromosome 19) and ZNF25 (chromosome 10) were genome-wide significant and only appeared in EACC, i.e., rare variants in these two genes did not exist in the other cohorts. 51 unique genome-wide significant genes were identified, 39 novel to the EACC analyses. The list of all genome-wide significant genes for each test can be found in Supplementary Tables 18–20.
Cohort Unique Genes
In addition to the two genes in the EACC EMMAX-VT analysis, there were 6 significantly associated genes that only had rare variants within a single cohort; no other rare variants existed in the other cohorts for these genes. EDN3 and CHMP1B in the IECC EMMAX-VT analysis and PRLH in the IECC ACAT analysis. KLF1 appeared only in the EPIC-Norfolk cohort, in both the EMMAX-VT and EMMAX-CMC analyses. The list of cohort unique genes appears in Supplementary Table 21.
Independent Replication in UK Biobank
We extracted the variants from the 129 significant unique genes and performed replication analyses in the UK Biobank. There were 7 genes with a P < 0.05 in EMMAX-CMC and 9 genes with a P < 0.05 in EMMAX-VT (Supplementary Table 22). P4HTM, CCDC170, and CPB1 were found in both analyses. STON1 was also replicated in the UK Biobank analyses; this gene had a significant meta-analysis p-value in the EMMAX-CMC analysis. Interestingly, the p-value in all cohorts was < 0.053.
Pathway and Expression Analysis on all Significant Genes
We performed IPA pathway analysis on the 129 unique genes. While this did not result in any genome-wide significant canonical pathways, the upstream regulators analysis identified over 172 associated transcription factors. The two highest were the cytokine CSF2, which is known to regulate neuroglia after retinal injuries60, and the Transcription factor (TF) MEF2C, which is known to be expressed in the retina and controls photoreceptor gene expression61 (Supplementary Table 23). The fourth ranked p-value was the Raf kinases, which are known to be involved in retinal development62 and cell survival63; the fifth ranked p-value was TBX5, which is expressed in the retina and involved in eye morphogenesis64,65. Causal network analysis identified 288 associated pathways (Supplementary Table 24), including the TRPC5 pathway, which regulates axonal outgrowth in developing ganglion cells66.
The top overall associated physiological system functions were organ morphology, organismal development and embryonic development, while the top molecular/cellular functions were cell cycle and cellular assembly/organization. Cancer and organismal injuries/abnormalities were the top overall associated phenotypes (Supplementary Table 25). Six genes were associated with ophthalmic phenotypes: CHST6, GCNT2, P4HTHM, USH2A, GRHL2, and MAPT.
FUMA analysis found that the top enriched tissues were heart, brain, muscle, and adipose tissue (Supplementary Fig. 1A). The top functional categories were cytoskeleton organization, cell cycle processes, mitotic nuclear division, and organelle organization (Supplementary Fig. 1B.
Biological Plausibility and Prioritization of Genes
Of the 129 genome-wide significant genes from the six meta-analyses, 27.9% (36/129) have a known expression in human ocular tissue. 51.2% (66/129) of these genes showed evidence for a human ocular phenotype.
Seven genes had a biological plausibility score higher than 3 — PER3 (internally replicated, expressed in ocular tissue and associated ocular phenotype, i.e., score of 5) and PDCD6IP, MAPT, CHST6, GRHL2, USH2A, and P4HTM (all with a score of 4). An additional 11 genes had a score of 3 — GDF15, RRM2, HSPH1, TPR, KRT81, SPHK1, GSTM5, THSD7A, WEE1, and BUB1B (Fig. 7). Detailed background for the prioritization of the genes can be found in Supplementary Tables 26A-F.
The highest overall biological plausibility score belonged to the circadian rhythm gene PER3 (1p36). It was genome-wide significant in both the all cohorts ACAT and Indo-European only meta-analyses (P = 1.08 x 10− 6 and 1.15 x 10− 6, respectively); it was genome-wide significant in REHS and replicated in IECC. Circadian rhythm genes have been shown to be associated with refractive error22 and PER3 is located near the site of a known myopia locus (MYP14) at which the causal gene has not been identified67–69. PER3 was expressed in ON and OFF bipolar cells. Defects in this gene are associated with familial advanced sleep phase syndrome (OMIM 616882) and may contribute to other circadian phenotypes by altering the sensitivity to light70. In defocus experiments in chicks using − 15D lenses, PER3 expression decreased by -1.26 fold in the retina71. Further chick defocusing experiments, showed that PER3 expression in the retina varies under altered visual conditions72.
Five genes had a score just below PER3, including the apoptosis gene PDCD6IP (3p22.3). This gene was found to be genome-wide significant in all-cohorts meta-analyses using all three tests (P = 1.07 x 10− 7, 1.45 x 10− 7, and 4.88 x 10− 6, respectively). Further PDCD6IP had a P of < 0.006 in both the EACC and IECC cohorts and did not appear in the other cohorts. It is particularly interesting because it has two low single variant p-values in both IECC and EACC (0.00556 and 0.00548, respectively) and there are no rare variants in this gene in any of the other cohorts. PDCD6IP is expressed in ganglion cells of peripheral retina and plays a role in programmed cell death in uveal melanoma73 and may play a role in cornea lymphangiogenesis and vascular responses.74
MAPT (17q21.32) encodes tau proteins responsible for stabilizing microtubules; it was found to be genome-wide significant in the all cohorts EMMAX-VT analysis (P = 8.57 x 10− 7). It was genome-wide significant in REHS and replicated in EPIC-Norfolk. Abnormal MAPT was present in human glaucoma patients with uncontrolled intraocular pressure75 Cowan et al. showed that MAPT was expressed in several cell types in both the peripheral and foveal human retina: horizontal cells, rod bipolar cells, ON and OFF bipolar cells GLY and GABA amacrine cells and ganglion cells42. A knock-out mouse model showed decreased total retina thickness.
CHST6 (16q23.1) was genome-wide significant in both the all cohorts and Indo-European only EMMAX-VT meta-analyses (P = 8.99 x 10− 7 and 2.42 x 10− 7, respectively). The gene was genome-wide significant in IECC and replicated in BDES; it was also nearly replicated in EPIC-Norfolk. CHST6 plays a role in maintaining corneal transparency. Mutations in this gene may result in macular corneal dystrophy (OMIM 217800), which is characterized by bilateral, progressive corneal opacification and a reduction of corneal sensitivity.76 The mouse phenotype of a knock-out model corresponded to that of human, i.e. abnormal cornea morphology and decreased corneal (stroma) thickness. Since our reference expression database did not contain any corneal tissue, we couldn’t score this category.
The transcription factor GRHL2 (8q22.3) was genome-wide significant in the all cohorts EMMAX-VT meta-analysis (P = 1.42 x 10− 6). It was genome-wide significant in REHS and replicated in IECC. Mutations in GRHL2 may lead to posterior polymorphous corneal dystrophy77 (OMIM 618031), characterized by a variable phenotype ranging from an irregular posterior corneal surface with occasional opacities, corneal edema, reduced visual acuity, secondary glaucoma, and corectopia.
The transmembrane prolyl hydroxylase P4HTM (3p21.31) was only genome-wide significant in EACC using EMMAX-VT (P = 1.00 x 10− 7). However, this gene was replicated independently in the UKBB analysis. P4HTM has been shown to be expressed in different ocular cells (including horizontal cells and bipolar cells). It is associated with HIDEA, a severe autosomal recessive disorder that is characterized by multiple symptoms, including eye abnormalities78 (OMIM 618493) and knock-out mice models have shown abnormal eye morphology79.
The membrane gene USH2A (1q41) was genome-wide significant in the EACC ACAT analysis (P = 7.55 x 10− 9). It is well known to cause both Usher syndrome, which includes retinitis pigmentosa (RP) and mild to moderate hearing loss, as well as RP without hearing loss80. It is known to be expressed in the retina81.
Pathway and Expression Analysis on Top Prioritized Genes
We ran the IPA and FUMA analyses on the seven top prioritized genes. IPA did not identify any canonical pathways as significant; the only pathway shared across the genes was the 14-3-3-mediated signaling pathway (MAPT and PDCD6IP). The 14-3-3 proteins are a diverse group of signaling proteins.
Upstream regulator analysis found several transcription regulators of at least two genes include NKX2-1 (GRHL2 and MAPT), PSEN1 (MAPT and PER3), and SIRT1 (MAPT and PDCD6IP) (Supplementary Table 27). In the causal network analysis, the master regulator with the highest p-value covering multiple genes was the cytokine macrophage migration inhibitory factor (MIF) (Supplementary Table 28), which covered five genes. Interestingly, MIF is an essential factor in the development of zebrafish eyes82 and has been found to be a potential regulator of diabetic retinopathy83. MIF inhibitors may also be protective to photoreceptors84. The top functional analysis for disease result was hereditary eye disease (Supplementary Table 29). FUMA showed the top tissue expression occurred in the small intestinal terminal ileum, skeletal muscle, and the brain cortex; the latter being probably the best proxy for eye tissue (Supplementary Fig. 2A). A heat map of the expression of the seven genes across all GTEx tissues is given in Supplementary Fig. 2B).
Potential Causal Variants in the Prioritized Genes
We used annotation from wANNOVAR to identify potential causal variants within the top genes identified by the prioritization method (Table 1). For the two prioritized genes that were significant in the ACAT analyses, we were able to look at single variant p-values in addition to annotation to determine potential causal variants. There were three good candidate variants in PDCD6IP, which was genome-wide significant in IECC and replicated in EACC. rs199990824 appeared in the EACC only, was predicted to be damaging by SIFT and MutationTaster, and had a CADD score of 26. The minor allele of rs199990824 appeared in 37 carriers (all heterozygotes) with an average SER of -2.04 D (SD = 3.29) compared to the non-carrier average of -0.44 D (SD = 2.27) and the overall cohort average of -0.45 D (SD = 2.28); the single variant P was 0.000183. In the IECC, the best potential causal variant was rs62620697, which was predicted damaging by MutationTaster, had a CADD score of 23.8, and had a low single variant p-value of 0.002632. rs145293758 also had a low p-value (0.000311) but was not predicted damaging. Carriers (N = 9) of rs62620697 had an average SER of -2.17D (SD = 6.87) compared to that of non-carriers with an average SER of 0.20 (SD = 2.27).
Table 1
Potential Missense Causal Variants in Prioritized Genes
CHR | BP | rs ID | Gene | AA Change | SIFT | PolyPhen2 | MT | FATHMM | CADD |
1 | 7879401 | rs147327372 | PER3 | Thr519Ala | T | B | N | T | 0.01 |
1 | 7890153 | rs144178755 | PER3 | Thr1040Asn | D | B | N | T | 0.962 |
1 | 216138793 | rs554957414 | USH2A | Pro2329Leu | D | D | D | D | 29.1 |
1 | 216373416 | rs148135241 | USH2A | Ser1122Pro | D | D | D | D | 22.8 |
1 | 216419934 | rs201527662 | USH2A | Cys934Trp | D | D | D | D | 36 |
3 | 33840234 | rs200697599 | PDCD6IP | Ile5Ser | D | D | D | T | 32 |
3 | 33879764 | rs199990824 | PDCD6IP | Asp376Asn | D | B | D | T | 26 |
3 | 33905532 | rs62620697 | PDCD6IP | Ala719Thr | T | B | D | T | 23.8 |
3 | 33905587 | rs145293758 | PDCD6IP | Pro737Arg | T | B | N | T | 20.2 |
3 | 49039984 | rs140290144 | P4HTM | Ile227Val | T | B | D | T | 22.1 |
3 | 49043292 | rs144279528 | P4HTM | Asp386Asn | T | B | D | T | 27.3 |
8 | 102570910 | rs142411476 | GRHL2 | Arg183Gln | T | D | D | T | 22 |
16 | 75512734 | rs140699573 | CHST6 | Gln331His | D | D | D | D | 27.4 |
17 | 44055786 | rs139796158 | MAPT | Ala118Gly | D | D | D | T | 26.4 |
17 | 44060807 | rs76375268 | MAPT | Gly213Arg | D | D | N | T | 11.71 |
17 | 44060859 | rs63750072 | MAPT | Gln230Arg | D | D | D | T | 4.652 |
17 | 44067341 | rs143956882 | MAPT | Ser427Phe | D | D | D | T | 28.5 |
17 | 44101481 | rs63750191 | MAPT | Gln741Lys | D | D | D | T | 27.5 |
Legend: The best potential missense causal variants in our top prioritized genes. The headers represent: CHR = chromosome, BP = physical position in basepairs (hg19), Gene = gene location, AA change = amino acid change caused by mutation, SIFT = pathogenicity prediction from SIFT (where T = tolerated and D = damaging), PolyPhen2 = pathogenicity prediction from PolyPhen2 (where B = benign and D = damaging), MT = pathogenicity prediction from MutationTaster (where N = neutral and D = damaging), FATHMM = pathogenicity prediction from FATHMM (where T = tolerated and D = damaging), CADD = CADD phred score |
Potential candidate variants were also identified in PER3, which was genome-wide significant in REHS and replicated in IECC. The REHS signal was primarily driven by two variants - rs147327372 and rs144178755, which had single variant p-values of 1.72 x 10− 8 and 0.004953, respectively. However, neither variant was predicted to be damaging by the prediction algorithms nor appeared in the other European cohorts and were not significant, although rs147327372 did have a p-value of 0.046 in EPIC-Norfolk.
The signals in the other four genes, identified primarily by the two burden-style tests, were driven by a cumulative effect of several variants. In this case, we relied primarily on annotation and reported variants that were generally agreed upon by multiple prediction programs. Five good candidate variants were located in MAPT: rs139796158, rs76375268, rs63750072, rs143956882, and rs63750191. All these variants were nonsynonymous variants and predicted damaging by three of the four databases (except for rs76375268, which was predicted damaging by two). rs139796158, rs143956882, and rs63750191 all had CADD scores > 26. In CHST6, the best candidate variant was the missense variant rs140699573. It was predicted damaging by SIFT, PolyPhen2, MutationTaster, and FATHMM and has a CADD score of 27.4. In GRHL2, the best candidate variant was rs142411476. It was predicted damaging by two databases and had a CADD score of 22. In P4HTM, two variants of interest were identified: rs140290144 and rs144279528. These variants were predicted damaging by MutationTaster and had CADD scores of 22.1 and 27.3, respectively. Finally in USH2A, three variants (rs554957414, rs148135241, and rs201527662) were all predicted damaging by the five prediction algorithms and had CADD scores above 22.
Structural Analysis of Prioritized Candidate Proteins
In addition to the annotation, we also performed protein structural modeling of all coding variants within the prioritized genes (98 variants across 6 genes/proteins) and calculated free energy difference (ΔΔG) between wildtype and mutant proteins (Supplementary Table 30); positive ΔΔG indicates a shift from a more stable to a less stable isoform. More detailed information on the structural analysis can be found in the Supplemental Methods.
In PDCD6IP, both rs145293758 and rs200697599 were predicted to be highly destabilizing to protein structure (Supplementary Fig. 3A). The variant rs145293758 leads to replacement of a proline (Pro737) for an asparagine near phosphorylation sites in the protein’s self-associating domain, which could disrupt phosphorylation. rs200697599 (Ile5) and rs199990824 (Asp376) result in changes to the protein’s BRO1 domain, which is involved in endosomal targeting. The isoleucine to serine mutation at rs200697599 could introduce a phosphorylation site at the N-terminus while the asparagine to aspartic acid mutation at rs199990824 could disrupt hydrogen bonds. Recall that both rs145293758 and rs199990824 were identified as potential causal variants in IECC and EACC, respectively, based on their annotation and single variant p-values.
For PER3, several variants may affect structure, including rs140974114, which results a serine (Ser751) to aspartic acid substitution at the protein’s nuclear localization signal and could disrupt hydrogen bonds and rs200140283, which results in an alanine (Ala681) to glycine substitution in the CSNK1E binding domain. Further potential disruptions occur at rs139315125 (His416), which takes place in the nuclear export signal 3 and rs77418803 (Ser919), which occurs near the nuclear export signal 2. The model is provided in Supplementary Fig. 3B).
Of the variants in MAPT, two were predicted to be destabilizing (rs76375268 at Gly213 and rs63750191 at Gln741) (Supplementary Fig. 3C). Further, rs73314997 (Ser318) and rs143956882 (Ser427) are located near known pathogenic mutations for frontotemporal dementia and Pick disease of the brain, respectively.
Three variants on the luminal domain of CHST6 were found to have a mild effect on protein stability. Two of these variants (rs201349198 at Ala326 and rs140699573 at Gln331) are positioned near variants known to cause macular corneal dystrophy (MCD) near the C-terminus. This suggests the C-terminus is sensitive to mutations enabling interference with keratan sulfation, which could cause a loss of function that can lead to a milder disease phenotype such as refractive error. The model can be found in Supplementary Fig. 4A.
In GRHL2, variants were only predicted to have a mild effect on protein structure and were not located near known pathogenic variants (Supplementary Fig. 4B).
For P4HTM, rs140290144 is predicted to be moderately destabilizing (Supplementary Fig. 4C). It substitutes a valine for a buried isoleucine (Ile227) between two calcium binding sites; potential disruption of these calcium binding sites can result in loss of function. Similarly, rs144279528 occurs in the Fe-dependent 2-OG dioxygenase domain close to an iron binding residue. Substitution of asparagine from the wildtype aspartic acid (Asp386) could have an impact on iron binding by introducing a glycosylation (due to location on protein surface) or disruption of hydrogen bonding.
Of particular interest in the protein modeling was that of usherin (USH2A), the known retinitis pigmentosa gene. Five variants were predicted to be highly destabilizing, particularly rs554957414 with a ΔΔG value of 99.19 kcal/mol). Three of these variants, including rs554957414 (Pro2329), result in the loss of proline and the loss of that ring structure could cause an increase in conformational flexibility and account for such high destabilization predictions (Supplementary Fig. 5). Further, a mutation at rs201527662 (Cys934) results in the replacement of cysteine with tryptophan and will disrupt a disulfide bond between two cysteines.
We also compared the ΔΔG of these five candidate variants with the ΔΔG of all USH2A ClinVar (n = 63) and gnomAD (n = 1870) variants using the Wilcoxon rank sum test. A significant difference between the ClinVar variants and gnomAD variants was found (P = 0.0008) and the ΔΔG values of our candidate variants was much more similar to the known pathogenic variants than the putatively benign GnomAD variants (Supplementary Fig. 6).
Potential Causal Variants in Other Genome-wide Significant Genes
We also identified variants within the other 122 genome-wide significant genes that had a high potential to be damaging. This included 25 variants across the five cohorts; the results are found in Supplementary Table 31. Like our prioritized genes, we also performed protein modeling on these variants (Supplementary Table 32).
Notable findings from the structural analysis include a valine to phenylalanine substitution (Val105) that would disrupt a helix in ALG3, which has been implicated in congenital disorders of glycosylation that have ocular phenotypes85 (Supplementary Fig. 7A). We also identified multiple glycine substitutions in TNFRSF13B in areas associated with heparan sulfate – glycosaminoglycan biosynthesis; heparan sulfate has been shown to play a role in eye pathologies86 (Supplementary Fig. 7B).