Discovery GWAS
This study’s discovery dataset is a previously-published GWAS of ASD (iPSYCH) [29] based on a Danish nationwide population-based cohort [30] including individuals born in Denmark between 1981 and 2005 and diagnosed with ASD prior to 2014. Cases comprised subjects with ASD as the only ascertained diagnosis (classified in accordance with the International Statistical Classification of Diseases and Related Health Problems, 10th revision (ICD-10)) (N=8,605 of 12,371 possible ASD cases; ICD codes F84.0, F84.1, F84.5, F84.8 and/or F84.9), and controls (N=19,526) were a randomly-selected subset of the cohort without diagnosis (ICD F00-F99). iPSYCH was chosen as the source GWAS over the combined iPSYCH/Psychiatric Genomics Consortium (PGC) meta-analysis GWAS to avoid overlap with the AGRE target data which is included in the PGC dataset.
AGRE Cohort
This study’s target dataset was drawn from Autism Speaks’ Autism Genetic Resource Exchange (AGRE), the largest private repository of genetic and phenotype data of families with ASD. [31] The sampling bias of the collection is inclusion of multiplex families. The original AGRE genomic dataset comprised 2303 individuals with both GWAS and SRS data. As outlined below, we excluded any non-European subjects to ensure genetic homogeneity and reduce risk for population stratification. Individuals with known chromosomal or neurogenetic abnormalities (e.g., DiGeorge, Fragile X), as well as non-verbal subjects were also excluded. This final AGRE dataset comprised 1491 individuals (206 of whom were parents) in 663 families (ages ranged from 1.2 to 79 years); 1459 subjects contributed to the analysis of ASD diagnosis (we did not have data regarding diagnosis for 32 subjects).
Phenotypes
Social Responsiveness Scale (SRS)
The quantitative trait measure utilized is the Social Responsiveness Scale (SRS) [28]. The SRS is a 65-item measure of reciprocal social behavior, deficits in which are characterized as quantitative autistic traits (QAT). The SRS capitalizes on observations of individuals in naturalistic social contexts, by parent and/or teacher report. Its internal consistency is very high (alpha = ~.95), and it distinguishes ASD-affected individuals from controls with a Cohen’s d effect size of ~2.7 and from individuals with other psychiatric conditions with an effect size of d= ~1. [28,32] It characterizes variation in the two Diagnostic and Statistical Manual of Mental disorders, 5th edition (DSM-5) core domains of social communication and interaction and restrictive interests and repetitive behaviors. Reciprocal social behavior, as measured by the SRS, is continuously distributed in the general population and highly heritable throughout the range observed from unaffected to sub-clinically affected to fully ASD‐affected individuals
Individuals’ scores on the SRS were obtained from the Internet System for Assessing Autistic Children (ISAAC) database (https://www.autismtools.org/). When longitudinal scores were available, the earliest assessment was employed. Total raw teacher report scores were prioritized, with parental report scores used when teacher scores were unavailable, as has been done previously. [21] Typical norming and adjustment by sex was not done in order to stratify by sex and test interactions with sex in downstream correlation analysis and regression models.
ASD Diagnosis
Diagnoses of subjects with ASD were established with both the Autism Diagnostic Interview-Revised (ADI-R) [33]and Autism Diagnostic Observation Schedule (ADOS) [34], which are the current gold standard instruments for diagnosing individuals with ASD. Parents not in the AGRE pedigree dataset were counted as unaffected.
Genotyping, Quality Control, and Imputation
Genotyping, quality control, and imputation for the original AGRE dataset have been described elsewhere. [21] This AGRE dataset consisted of 2303 individuals for whom imputed genotypes comprising 5.8M autosomal variants were available. [35]
Genetic Ancestry Principal Components Analysis
In order to create an ancestrally homogenous sample and reduce false positives generated by population stratification, we performed an ancestral principal components analysis (PCA) using Eigensoft software. [36] We employed 1000 Genomes Phase 3 (1KG) reference panels (European (EUR), ad-mixed American (AMR), and African (AFR)) and projected the resulting eigenvectors onto AGRE subjects with both SRS and GWAS data (N=2303), resulting in factor scores for the first three principal components (PC1 – PC3 ). Using standard Eigensoft protocol, this analysis yielded a sample of 1491 subjects of European-descent ancestry. PC1 – PC3 were included in all regression models to control for bias due to subpopulations within EUR ancestry.
Quantification and Statistical Analysis
Calculation of PRS
To calculate the ASD-PRS, we used the SNP effect sizes estimated for common variants from the previously-published iPSYCH GWAS of ASD, [30] to ensure the validity of the PRS by estimating effect sizes in an independent cohort. To reduce the multiple testing burden associated with PRS at multiple thresholds, we used the p-value of 0.1 as our a priori threshold as it was the most predictive in our independent source dataset, as well as filtering by imputation information < 0.70, resulting in 745,190 SNPs. We then took the intersection of variants from the iPSYCH ASD source GWAS that overlapped with the imputed variants from our AGRE target dataset (465177 SNPs). Next, we reduced the list of intersecting variants to an independent set by performing LD-clumping (r2 < 0.15 within each 250kb window) using PRSice 2.3.1.e ( [39] and a 1000 Genomes Phase 3 EUR reference panel, [38] resulting in N=39,902 SNPs. The PRS was calculated for each individual as the number of effect alleles at each SNP in the AGRE dataset weighted by the betas from the discovery iPSYCH GWAS at the same SNP. The PRS was standardized to mean=0, standard deviation (SD)=1 in order to interpret the regression coefficient as the change in SRS score for one SD increase in PRS.
Computation of Odds Ratios
We grouped subjects into quartiles of PRS scores and modeled ASD diagnosis in a logistic regression model. We estimated odds ratios (OR) for ASD in Quartiles 2 through 4, conducting three separate tests of each quartile, as compared to the first (lowest) quartile (<25% PRS). Using generalized estimating equations (GEE) (as implemented in SAS Proc Genmod), the model was adjusted for PC1 – PC3, sex, and family variance. We also computed a Cochran-Armitage trend test, a chi-squared test for a linear trend across the four quartiles to test for dose response.
Correlation Analyses
As a preliminary step and for purposes of data visualization, unadjusted Pearson correlations between SRS and PRS were computed in the total sample and stratified by sex and affected status. Graphs were produced with R version 3.4.1 Copyright (C) 2017. [41]
Linear Mixed Model Regression
We employed a multiple linear mixed model (LMM) regression analysis because our sample comprised multi-level data in which subjects existed in groups (i.e., families), where the SRS phenotype is expected to correlate within families as well as between families. First, to account for relatedness in the AGRE family sample, we calculated a genetic relatedness matrix (GRM) using Plink 1.90 [37]. The GRM was then entered into all models as a random effect, implemented in SAS Proc Mixed [42], to adjust the residual variance for relatedness. Age was not included in our modeling as a covariate because SRS scores are known to be invariant to age effects [40]. For our primary model estimating the variance of QAT explained by PRS (Model 1), the fixed effects comprised PC1 – PC3, sex, and PRS. We computed three other models in which we tested the interactions of PRS*sex (Model 2), PRS*ASD (Model 3), and PRS*sex*ASD (Model 4). To properly control for confounders in models including 2-way interactions, we also included covariate x environment and covariate x gene interaction terms into the models [43] (see Table 4 for a breakdown of all interactions included in the modeling).
To estimate the variance explained, we fit a series of full and reduced models, as outlined in Table 4, and from these multi-level models, we then calculated a conditional pseudo-R2 using SAS mixed fit, as traditional R2 statistics are not available for LMM due to the random effect for GRM .[44] We reduced the full model by eliminating the fixed effect or interaction term of interest and measured the difference in pseudo-R2 between the full model and this reduced model. This change in R2 (ΔR2) serves as our estimate of the variance explained by the fixed effect (PRS) or interaction term.
Datasets
iPSYCH discovery GWAS dataset: https://github.com/mgandal/Shared-molecular-neuropathology-across-major-psychiatric-disorders-parallels-polygenic-overlap/tree/master/raw_data/GWAS
AGRE target dataset: obtained from author JL
1000 Genomes Phase 3 LD reference panel: https://ctg.cncr.nl/software/magma
Publicly-Available Software
IMPUTE2: https://mathgen.stats.ox.ac.uk/impute/impute_v2.html
PLINK: https://www.cog-genomics.org/plink2
SHAPEIT2: https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html
EIGENSOFT/smartPCA: https://github.com/chrchang/eigensoft/blob/master/POPGEN/smartpca.info
SAS %mixed_fit macro: http://support.sas.com/resources/papers/proceedings13/255-2013.pdf