4.1 Study design
The overall study design is depicted in Figure 5. We conducted a large-scale, genome-wide cross-trait analysis to investigate the shared genetic architecture between age at first sexual intercourse (AFS) and COVID-19 susceptibility, hospitalization, and severity. The study was divided into three parts. Firstly, we performed a genetic association analysis to evaluate the genetic correlation between AFS and COVID-19 phenotypes and to identify significant trait pairs. Next, we used cross-trait meta-analysis to identify shared loci for these trait pairs. Finally, we conducted a two-sample bidirectional Mendelian randomization (MR) analysis to assess whether a causal relationship exists between AFS and COVID-19 susceptibility, hospitalization, and severity.
4.2 GWAS summary statistics
For AFS, we obtained summary statistics from the most recent also the largest genome-wide association study (GWAS) conducted by Mills et al. in 202126. This GWAS meta-analyzed data from the UK Biobank including 397,338 individuals of European ancestry (N = 182,791 males; N = 214,547 females). AFS was treated as a continuous measure and assessed using questions such as “What was your age when you first had sexual intercourse?” This was often defined more specifically as including vaginal, oral, or anal intercourse. Individuals were considered eligible if they provided a valid answer and ages below 12 were excluded. AFS was transformed using within-sex inverse rank normalization due to its markedly non-normal distribution and was adjusted for the birth year of the respondent, its square, its cubic, and top principal components26.
Summary GWAS statistics for susceptibility, hospitalization, and severity of COVID-19 were obtained from the COVID-19 Host Genetics Initiative (HGI) (https://www.covid19hg.org, Release 5)27. The COVID-19 Host Genetics Initiative (HGI) represents the most comprehensive GWAS conducted to date on COVID-19, comprising data from over 3 million individuals across 82 large cohorts from 35 countries, predominantly in Europe. The majority of the included data was collected before widespread COVID-19 vaccination and the Omicron variant outbreak. Consequently, the study primarily includes infections caused by the Alpha and Delta variants. The COVID-19 GWAS dataset was adjusted for age, sex, age × age, age × sex, genetic principal components, and study-specific covariates. Only individuals of European ancestry were included in all MR analyses, and data from the UK Biobank were excluded to prevent potential bias from population overlap. COVID-19 infection status was confirmed through various methods, including laboratory tests (e.g., RT-qPCR or serological testing), self-reported infections, electronic health records, or physician diagnosis. For susceptibility analyses, we compared patients with COVID-19 to those without. COVID-19-related hospitalization was defined as laboratory-confirmed SARS-CoV-2 infection or hospitalization with COVID-19-related symptoms, and hospitalization cases were compared to controls without COVID-19. We also compared severe respiratory conditions caused by COVID-19 between healthy individuals and hospitalized patients who experienced severe respiratory complications or required respiratory support (e.g., intubation, continuous positive airway pressure, continuous external negative pressure, bilevel positive airway pressure, or high-flow nasal cannula)28. Detailed information regarding the summary data from all GWAS analyses is presented in Table 4.
4.3 Statistical analysis
4.3.1 Global genetic correlation analysis
To investigate the potential shared genetic basis between AFS and COVID-19 phenotypes, we conducted a global genetic correlation analysis by utilizing linkage disequilibrium (LD) score regression (LDSC) using all SNPs after merging with HapMap3 SNPs, excluding the human leukocyte antigen (HLA) region29. LDSC estimates the genetic correlation between two traits (ranging from -1 to 1, with -1 indicating a perfect negative genetic correlation and 1 indicating a perfect positive genetic correlation) based on the fact that the GWAS effect size estimate for a given SNP represents the effects of all SNPs in LD with that SNP. The SNPs were subjected to several exclusion criteria, such as being missing in Hapmap3, having missing values, an INFO score ≤ 0.3, minor allele frequency (MAF) ≤ 0.01, p-value < 0 or > 1, not being SNPs (e.g., indels), or having duplicate rs numbers. We used LD scores from 1000 Genomes European ancestry as reference. A p-value < 0.05 was used as a significant threshold.
4.3.2 Local genetic correlation analysis
To investigate whether there is a local genetic correlation between AFS and COVID-19 traits, we utilized ρ-HESS, a method to quantify correlations between pairs of traits due to genetic variation at a defined region in the genome30. A total of 1,703 default independent LD blocks with an average length of 1.5 Mb were examined to calculate local genetic heritability and genetic covariance based on 1000 Genome imputation of hg19 genome build. We corrected multiple tests by the Bonferroni method, with a significance threshold of 0.05/1,703.
4.3.3 Cross-trait meta-analysis
After examining the genetic correlations for all traits, we conducted a genome-wide cross-trait meta-analysis using Cross-Phenotype Association (CPASSOC) to further identify shared pleiotropic loci between AFS and COVID-19. CPASSOC integrates summary statistics from multiple GWAS and controls for the effects of trait heterogeneity, population structure, and recessive correlations31. This method provides two statistics: SHom and SHet. SHom uses a fixed-effects meta-analysis that works best when genetic effect sizes are homogeneous, but less effective when analyzing multiple traits. SHet represents the maximum of the weighted sum of trait-specific test statistics and maintains statistical efficacy even in the presence of heterogeneity31. Given the heterogeneity of different traits in our study, SHet was chosen as the main statistic. Variants that satisfied PSHet < 5 × 10-8 were defined as the significant pleiotropic SNPs. We then obtained independent SNPs using the PLINK clumping function with the following parameters: --clump-p1 5e-8 --clump-p2 1e-5 --clump-r2 0.2 --clump-kb 50032. The variant with the lowest p-value in each locus was defined as the index SNP, and after excluding those associated with any single trait reported in the original GWAS study (p < 5 × 10-8), the remaining index SNPs were defined as novel pleiotropic SNPs. For detailed annotation and pathway analysis of the identified index SNPs, we used a web-based tool SNPnexus based on the Ensembl/GENCODE basic transcript database (human genome version hg19)33.
4.3.4 Fine mapping credible set analysis
After identifying significant shared loci between the two traits, we performed fine-mapping credible set analysis using fast PAINTOR, which integrates functional annotation of genetic variants and LD structures, to calculate a posterior probability for each SNP to be causal at a given locus34. SNP with posterior probability >0.9 is considered the most likely causal variant in the region.
4.3.5 Colocalization analysis
We next performed a colocalization analysis to assess whether association signals of AFS/COVID-19 are colocalized at the same locus through the "coloc.abf" function in "coloc" R package35. We extracted variants within 500 kb of the index SNP for each shared loci and calculated the posterior inclusion probability (PIP) of AFS and COVID-19 sharing the same causal variants (p(H4)). SNPs with a p(H4) > 0.5 were labeled as co-located genetic variants.
4.3.6 GO and GTEx tissue enrichment analysis
To gain biological insights into pleiotropic index SNPs identified from the cross-trait meta-analysis, we conducted multiple post-GWAS functional analyses. We assessed the enrichment of shared gene sets in the Gene Ontology (GO) biological process categories and performed a tissue-specific expression analysis (TSEA) using the software functional mapping and annotation (FUMA) GENE2FUNC process to evaluate whether the shared gene set was highly enriched or specifically expressed in tissues36, based on RNA-Seq data with 54 human tissue types form GTEx (version 8)37. Benjamini-Hochberg procedure was used to account for multiple testing (false discovery rate < 0.05)38.
4.3.7 Cell-type-specific enrichment of SNP heritability
The LD score regression applied to specifically expressed genes (LD-SEG) method was applied to determine potential enrichment of cell-type-specific categories for AFS and COVID-1939. Data from GTEx together with corresponding precomputed gene sets and respective LD scores were used for analyses, detailed analysis was performed using the protocol described in the online documentation of LD-SEG (https://github.com/bulik/ldsc/wiki/Cell-type-specific-analyses).
4.3.8 Mendelian randomization
The genetic variants utilized as instrumental variables (IVs) satisfied three strict criteria: (1) strong association with the exposure, (2) independence from any modifiable confounders, and (3) independence from any pathway associated with the outcome, except for the exposure pathway40. For our MR analysis, we initially extracted IVs based on the genome-wide significance threshold (p < 5 × 10-8), followed by calculating the F statistics of each SNP to exclude weak IVs with F statistics < 10. We then ensured independence among IVs for each exposure by conducting an LD test on each SNP identified as an IV based on individuals with European ancestry from the 1000 Genomes Project with R2 < 0.001 and a window size of 10,000 kb. If the target SNPs were missing in the outcome GWAS, we used proxy SNPs with high LD (R2 > 0.8); otherwise, the SNPs were eliminated. Additionally, we searched selected SNPs using the PhenoScanner database (http://www.phenoscanner.medschl.cam.ac.uk) to identify whether they were related to any confounders and outcomes causally associated with COVID-19. Details on the characteristics of AFS- and COVID-19-associated IVs are shown in Table S13.
To analyze the causal association between AFS and each COVID-19 phenotype, we utilized multiple complementary methods, including inverse variance weighted (IVW), MR-Egger, weighted median, simple mode, weighted mode, and Wald ratio methods through TwoSampleMR package. The IVW model was used as the major primary statistical method, and the Wald ratio method was used when a genetic variant contained only one genetically related SNP. We assessed the presence of horizontal pleiotropy by using Cochran´s Q statistic to test for heterogeneity among IVs, with p < 0.05 indicating heterogeneity. If heterogeneity was present, we used the random-effects model for subsequent analyses; otherwise, we used the fixed-effects model. Leave-one-out cross-validation was performed to evaluate the stability of the MR results through sequential exclusion of each IV. In addition, we conducted MR-Egger analysis to identify directional and horizontal pleiotropy41. Funnel plot asymmetry indicated the presence of horizontal pleiotropy42.