GWAS and Meta-Analysis
We considered genotypes and phenotypes of 522 ewes from five sheep breeds of high (Wadi, n = 160; Hu, n = 117; Icelandic, n = 54; Finnsheep, n = 54; and Romanov, n = 78) and one low (Texel, n = 59) prolificacy. The GWA analysis of these breeds for prolificacy has already been published by (1) and details about sample collection, phenotyping, and genotyping can be found in that paper. All the sheep breeds were genotyped for almost 600k positions using the Ovine Infinium HD BeadChip (Illumina, San Diego, CA, United States). First, conducting six separate GWAS, we have reproduced all GWAS results for each breed. Quality control of genotyping was carried out using the GenABEL (18) R package. SNPs with call rates < 95%, minor allele frequency (MAF) < 0.05, and p-value of Fisher’s exact test for Hardy–Weinberg equilibrium < 10-5 excluded from the analysis. Also, samples with call rates ≤ 95% were removed from datasets.
To remove the parity effect, the mean litter size of each individual (total litter size/parity) was used as the response variable. Based on the distribution of phenotypes, ewes in each breed, coded as high- and low- prolific (for details see (1)). To account for the within-population substructure, we implemented a principal component analysis (PCA) using a set of LD-pruned SNPs and two first PC were incorporated in the model. Plink1.9 (19) and GenABEL (18) were used, respectively. We ran the following logistic model to evaluate the association between SNPs and litter size in each breed: see formula 1 in the supplementary files section.
where, y is the vector of case/control value (1 and 2) for each ewe; XSNP is the incidence matrix relating values in y vector to SNP genotypes, and βSNP is the regression coefficient. The analysis was performed using the GenABEL (18) package. The p-values were almost non-inflated (0.87 ≤ λ ≤ 1.14) for all traits and partial inflation was corrected using the genomic control (GC) method.
A single-trait meta-analyses were undertaken across GWAS results from six sheep breeds using the fixed effects inverse variance weighted method implemented in the METAL (20) software. This meta-analytical approach works based on the GWAS SNPs summary statistics - beta and its standard errors - and is considered the most powerful method to identify novel loci (21). For multiple testing corrections, based on the p-value cutoff of 0.05 and the total number of pruned SNPs (independent tests) we determined the genome-wide threshold (31,752/0.05=1.57e-06). Also, based on the average number of SNPs on each chromosome and the same P-value cutoff, we determined a chromosome-wise (suggestive) threshold (0.05/1176=4.25e-05). The CMplot (https://github.com/YinLiLin/R-CMplot) R package was used for drawing the Manhattan plot.
Functional Annotation
We explored the functional significance of the top SNPs from the meta-analysis using the GTF file from Oar_v.4.0 ovine reference genome assembly. The search was performed using a 15 k base pair window around the top SNPs. GeneCards (https://www.genecards.org/, last access: 20 April 2021) and National Center for Biotechnology Information Gene (http://www.ncbi.nlm.nih.gov/gene/, last access: 20 April 2021) were used to reveal gene functions. The sheep QTL database (http://www.animalgenome.org/cgi-bin/QTLdb/OA/index, last access: 20 April 2021) was used to find whether the identified SNPs located in known prolificacy-related QTLs.
Gene-set enrichment and functional clustering analysis
For enrichment analysis, we need to construct a vector contains significant gene names. As the trait has polygenic nature, several poorly associated SNPs fall below the significant line and the large majority of the genetic variants contributing to the trait remain hidden. So, to capture the cumulative effects of poorly associated variants, we used an arbitrary threshold (0.001) to filter SNPs for constructing gene vector. Using an in-home R script and Oar_v.4.0 ovine reference genome assembly, we wrapped SNPs around the genes and assigned SNPs to genes if they were located within the genes or 15 kb upstream or downstream of a gene (list of significant genes in gene vector are provided in Supplementary File 1, sheet 5).
To extract biological views from the resulted gene vector, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases were used to define the functional gene-sets. KEGG enrichment analysis was performed using the R package clusterProfiler (22), but due to the low gene id conversion rate (to ensemble id), we used g:Profiler (23), a web-based enrichment tool, for GO enrichment which led to better conversion rate and results. Terms and pathways with FDR and q-value less than 0.05 (GO and KEGG, respectively) were reported as significant sets. The ggplot2 (24) R package was used to visualize enrichment results as bar plots. For enrichment using protein databases, Cytoscape 3.8.2 (25) along with StringApp 1.6.0 plug-in (26) were used.
Significant genes (official gene symbol) were loaded in DAVID (27,28) for functional clustering analysis against the ovine genes as background. This tool classifies genes in functionally related groups based on gene ontology, protein data, and pathway data, and calculates enrichment scores for groups of genes with a related biological function. Analysis was performed using the DAVID default setting. The functional cluster with the greatest enrichment score was reported.
Protein-protein interactive (PPI) network construction and module identification
To investigate the interactions of proteins encoded by the identified significant genes, the StringApp 1.6.0 (26) plugin in Cytoscape 3.8.2 (25) was used to build the PPI network using the STRING 11.0 database (29). The MCODE 2.0.0 (30) plugin, another Cytoscape App, was used to identify the very closely associated regions within the network which are known as the core modules of the PPI. MCODE identifies the highly interacting nodes (clusters) present in the network which has high quality based on topology, function and, localization similarity of proteins within predicted complexes. Degree cut-off 4 along with the 2 k‐core were used as thresholds. Each node gets a score and clusters will be ranked using scoring by the density and size of the cluster. Clusters with a higher score have more interaction between genes. To identify key genes in the constructed network, another Cytoscape App, cytoHubba 0.1 (31) was used to rank the individual nodes and finding hub genes. The hub genes were identified based on degree parameters, which represent the total number of nodes connected to its adjacent nodes. It also provides the count of the number of interactions of a given node. The top ten hup genes have been reported.