GWAS Datasets
Primary SNP-level summary data for the three cancers were obtained from the Breast Cancer Association Consortium (BCAC), the Ovarian Cancer Association Consortium (OCAC) and the GWAS Catalog. The study included 89,677 breast cancer samples (46,785 cases and 42,892 controls), 66,450 ovarian cancer samples (25,509 cases and 40,941 controls), and 9,628 cervical cancer samples (2,866 cases and 6,762 controls) where all individuals were of European ancestry [10–12]. The summary statistics, including P-values, regression coefficients and standard errors calculated in the meta-analysis are available. Finally, overlapping SNPs of breast, ovarian and cervical cancers were selected for further analysis.
Data Preparation
A total of 294,879 common SNPs were acquired after combining summary statistics for all the three cancers. To select SNPs with small pairwise associations, linkage disequilibrium (LD) based SNP pruning method was used to select SNPs with imputation r2 ≤ 0.2 in European population. Following this primary selection of SNPs with lower LD, the process was repeated using a sliding window of five SNPs to confirm complete removal of pairs of SNPs with high LD. The whole process of SNP selection used the HapMap version 3 CEU genotypes panel (Utah residents with Northern and Western European ancestry from the CEPH collection) as a reference. After deleting the SNPs with large LD, 40,967 SNPs was remained. Then, according to the 1,000 Genome datasets, we accomplished the gene annotation for the pruned SNPs using PLINK1.9 which based on the hg19 human genome build (https://www.cog-genomics.org/static/bin/plink/glist-hg19). Finally, 40,859 SNPs were mapped 11,597 gene regions were included in the metaCCA analysis. It is worth noting that the regression coefficient beta of each annotated SNP has to be normalized prior to metaCCA analysis [8].
Multivariate MetaCCA Analysis
The purpose of metaCCA analysis was to identify the potential pleiotropic genes for multiple diseases, which required a full covariance matrix (∑), consisting of a cross-covariance matrix between all genotypic and phenotypic variables (∑XY), a genotypic correlation structure between SNPs(∑XX), and a phenotypic correlation structure between traits(∑YY). ∑XY was imputed with the normalized regression coefficient βgp(g and p were the number of genotypic and phenotypic variables, respectively).∑XX was estimated using the reference SNP dataset of the HapMap 3 CEU. Each ∑YY entry corresponded to a Pearson correlation coefficient between vectors of β estimates from p phenotypic variables across g genetic variants [13]. It has been proved that the more the number of genotypic variables there are, the less the error of the estimate there is [14]. Therefore, 294,879 overlapping SNPs were used as the estimation of ∑YY, even if only some of the SNPs were included in further analysis. The canonical correlation coefficient r was used to describe the association between genotypes and phenotypes.
In this study, we performed association analysis to identify canonical correlations in the SNP and gene levels, respectively. In the SNP level, we selected susceptibility loci significantly associated with the three cancers using univariate SNP-multivariate phenotype association analysis. In addition, the GWAS summary statistics were examined to confirm random selection of the sample of 40,859 SNPs. The results revealed that the standardized mean of β approached zero and the corresponding median P-value approached 0.5 for all the three datasets, thus confirming that the sample of SNPs was from a random sample of the whole genome. The significant level α (two-sided) was corrected with Bonferroni method (adjusted α = 0.05/40,859). Canonical correlation r of any SNP was significant when the P-value was smaller than 1.22 × 10− 6 (= 0.05/40,859). In the gene level, we conducted multivariate SNP-multivariate phenotype association analysis to obtain a canonical correlation coefficient between a gene and the three cancers. Similarly, Canonical correlation r of any gene was significant when the P-value was smaller than 4.31 × 10− 6 (= 0.05/11,597).
As a complement to the metaCCA analysis, we performed Versatile Gene-based Association Study-2 (VEGAS-2) analysis to refine the identified genes (https://vegas2.qimrberghofer.edu.au/). This method was a gene-based approach considering associations between a trait and all SNPs within a gene rather than each SNP individually [15]. All SNPs in each gene were incorporated into the VEGAS-2 analysis and then P-value was calculated for each trait in the gene level. Ultimately, we selected the pleiotropic genes associating with at least one phenotype using the adjusted threshold (= 1× 10− 6).
Functional Annotation
For identifying biological functions of the significant pleiotropic genes in our study, we explored the potential biological pathways using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. Gene enrichment analysis was performed using KOBAS3.0 website developed by Center for Bioinformatics, Peking University (http://kobas.cbi.pku.edu.cn/index.php). In the enrichment analysis, P < 0.01 indicated statistically significant differences. To evaluate functional connectivity of genes, PPI network analysis were performed through Search Tool for the Retrieval Interacting Genes (STRING11.0) database (https://string-db.org/) [16]. We considered the total score above 0.400 that correspond to the combination of the following 7 different scores: text mining, experiments, databases, co-expression, gene neighborhood, gene fusion and gene co‑occurrence.
Data availability
This data is publicly available from the BCAC, OCAC and GWAS Catalog.
BCAC: http://bcac.ccge.medschl.cam.ac.uk/
OCAC: http://ocac.ccge.medschl.cam.ac.uk/
GWAS Catalog: https://www.ebi.ac.uk/gwas/