SV Catalogues of Three Major Ancestry Groups
We analysed Illumina short-read WGS data of 9,770 samples from the SG10K_Health study23, comprising participants of Chinese (58%), Indians (24%) and Malays (18%) ethnicities. After CRAM-level quality control (QC) and removing samples failing at least 1 of 9 QC metrics (Methods), 8,392 samples were retained. This data set is subsequently referred to as SG10K Structural Variant release 1.3 (“SG10K-SV-r1.3”). Besides Chinese and Indians which other groups have studied19,20, SG10K-SV-r1.3 contains 1,620 individuals of Malay ancestry (Supplementary Table 1), a population which have to date not been included in previous large population-based SV studies10,12.
The SG10K-SV-r1.3 dataset comprises multiple sub-cohorts sequenced at heterogeneous depths and using different library construction methods, which can influence SV detection accuracy (Supplementary Table 1). To ensure robust SV analysis and to reduce technical confounding factors, we first focused on a discovery cohort of 5,487 individuals’ derived WGS libraries both constructed (PCR plus) and sequenced at an average depth of 15x in a consistent fashion. Other datasets, which included 1,523 individuals sequenced at a depth of 15x using a PCR-less WGS library construction method (referred to as “15x_validation”) and 1,922 individuals sequenced at a depth of 30x using a PCR-employing WGS library construction method (referred to as “30x_validation”) were used as validation datasets to ensure that results observed in the discovery dataset are reproducible. Even when confined to the discovery cohort alone, this study represents one of the largest Asian SV studies to date, covering over 4.21 times as many individuals of Asian ancestries as previous studies (Figure 1A).
We focused on the three most common SV types: deletions, insertions, and duplications (Figure 1B, Supplementary Figure 1, Methods). Due to their distinct genomic properties, it is challenging to accurately identify SVs using a single analytic tool25, and most previous SV cataloguing efforts have employed a combined suite of SV class-specialized algorithms10,12. In this study, we employed Manta26 to identify deletions and insertions separately in single samples, followed by SVimmer27 to obtain a putative cohort-wide consensus set. Individual-level genotype calls within this uniformly-defined discovery SV set were then refined using Graphtyper228. In addition, to address previously-reported limitations of these tools29, we also used MELT30, an algorithm specially designed for identifying mobile element insertion (MEIs) events.
As the Manta-SVimmer-Graphtyper SV pipeline relies solely on discordant read pairs and split-read alignments, it has inherent limitations to accurately detect duplication events created by the presence of tandem repeat sequences (e.g., microsatellites and minisatellites)31,32. We thus complemented the above algorithms with SurVIndel233, an in-house developed algorithm that can detect duplication events at high sensitivity (Figure 1C). To demonstrate the robustness of SurVindel2, we assessed false discovery rate (FDR) and true positive (TP) statistics for duplications relative to Manta-SVimmer-Graphtyper, against a truth set of high quality SVs obtained by haplotype-resolved long-read sequencing of a selected subset of 1000 Genomes Project analyzed samples34. We measured an average per-sample duplication identification FDR of 12% and 36% for SurVindel2 and Manta-SVimmer-Graphtyper, respectively. SurVIndel2 yielded a better sensitivity than Manta-SVimmer-Graphtyper (Figure 1D, Supplementary Table 2). Furthermore, the gains in sensitivity were more pronounced for tandem repeats (Supplementary Figure 2).
Using this pipeline, we identified 152,655 SVs comprising 35,584 insertions (including MEIs), 84,607 deletions, and 32,464 duplications. Approximately 75% of these events were novel (Figure 1E) with respect to gnomAD-SV10, reflecting the potential for new discoveries by analysing underrepresented Asian genomes (a more detailed comparison of SG10K-SV-r1.3 to gnomAD-SV is reported in later sections). We also detected 84,336 and 103,183 SG10K-SV-r1.3 SVs in the 15x PCR minus and 30x PCR plus validation dataset, respectively.
SG10K_Health SV Landscape
On average, each SG10K_Health individual harboured SVs covering 0.41% of the genome, with 1,905 insertions (0.017%), 2,486 deletions (0.367%), and 1,103 duplications (0.030%). These figures were consistent across all three ancestries (Figure 2A). Compared to gnomAD-SV, we detected fewer insertions and deletions per individual (insertions: 1,905 in SG10K_Health vs 2,612 in gnomAD-SV; Deletions: 2,486 vs 3,505), likely due to the higher sequencing depth of gnomAD samples (32x 10). Confirming this hypothesis, we detected comparable insertion/deletion counts per individual in our 30x_validation dataset (2,751/5,692; Supplementary Figure 3). However, despite lower sequencing depth in our discovery cohort, we detected comparable numbers of duplications compared to gnomAD-SV (1,103 vs 1,346), likely reflecting the improved sensitivity of the SurVIndel2 duplication-detection pipeline. Similar to previous studies10, the majority (>70%; 107,548) of deletions, insertions and duplications were rare events with allele frequencies (AF) less than or equal to 1% (Figure 2B, Supplementary Figure 4). Nevertheless, we identified 700 SVs with allele frequency greater than 0.95 in our discovery cohort; in these cases, the reference genome bears the minor allele.
While most detected SVs were small (10kbp; Figure 2C), we identified 6,444 deletions and 2,065 duplications longer than 10kbp. There was a striking abundance of SVs at 300bp, 2kb and 6kb (Figure 2C). The 300bp and 6kb insertions corresponded to Alu and LINE1 elements respectively, the two most abundant classes of transposable elements in the human genome (~11%35 and ~17%36 of the genome). The 2 kb SVs represent composite SVA (SINE, Variable Number Tandem Repeat, and Alu) transposons. These results highlight the pervasive contribution of repeat elements (Alu, LINE1, SVAs) in sculpting human genomic variation, and high-level similarities between our SV catalogue and other studies12.
SVs have been reported to cluster at specific genomic regions (“hotspots”). Several factors have been proposed to influence the location of SV hotspots, such as segmental duplications and the local presence of transposable elements37. These factors may contribute to SV formation due to their higher propensity for DNA breakage and repair, with local transposable elements increasing the likelihood of non-allelic homologous recombination (NAHR)38. To identify SV hotspots, we employed hotspotter39 (bandwidth:200,000, num.trial=10,000, pval=5 X 10-3) and identified 331 regions containing higher-than-expected SV densities (Supplementary Table 3). Together, these 331 regions affected ~303Mbp, in line with previous findings34. Notably, 28.1% (93 out of 331) of the hotspot regions are located within 5Mbp of the ends of the chromosomes as well as near the centromeric regions. Excluding these sub-telomeric and centromeric hotspots, 122 hotspots were unique to SG10K-SV compared to gnomAD-SV. For example, we identified a 725,560bp (chr12:124034930-124760490) hotspot region containing 89 SVs. This hotspot overlaps the NCOR2 (Nuclear receptor corepressor 2) gene, a corepressor that is frequently altered in prostate cancer40.
Impact of SVs on Regulatory Elements and Gene Bodies
To assess the impact of SG10K-SV-r1.3 on different categories of functional genomic regions, we overlapped the SVs with gene regulatory elements identified by ENCODE and the Epigenomics Roadmap project41. Regulatory elements surveyed included 926,535 putative regulatory elements annotated as distal enhancers (667,599), proximal enhancers (141,830), insulators (CTCF sites, 56,766), promoters (34,803), poised elements (exhibiting DNase I hypersensitivity but are likely functionally gated by additional trans-acting signals), and non-promoter K4me3 regions (25,537)41.
Common deletions (AF≥1%) were significantly depleted at putative enhancers and insulators, consistent with a model of negative selection acting on alterations affecting gene expression (Figure 3A). In contrast, rare (4592; 1%> AF >=0.1%) and ultra-rare (12,705; AF <0.1%) deletions did not exhibit similar depletion signals - it is possible that these latter SVs may have arisen later in human evolution with insufficient time for purifying selection. Promoter regions exhibited a trend (albeit not significant) for enrichment in common deletions, duplications and insertions, perhaps reflecting the higher GC content of promoters and susceptibility to errors in DNA replication42. Common duplications were also significantly depleted at distal and proximal enhancers (Figure 3A) again suggesting the action of purifying selection, and ultrarare duplications were also depleted at enhancers, though not as strongly as common duplications. Following a similar qualitative trend, common insertions (MEIs) were more strongly depleted at enhancers, insulators and K4me3 regions than ultrarare insertions.
Unexpectedly, we observed common duplications being enriched at annotated non-promoter H3K4me3 regions. To deepen this observation, we examined the intersect of 81 non-promoter H3K4me3 regions overlapping common duplications, and found that they were highly and significantly enriched for tandem repeats relative to all 25,537 H3K4me3 regions (fold enrichment: 4.6 : hypergeometric p-value: 2.45 x 10-23). We speculate that since read mapping artifacts are common at tandem repeats, it is possible that these genome duplications may have contributed, at least in part, a degree of artifactual H3K4me3 ChIP-seq peaks. These results highlight how more refined genomic annotations taking SVs into account can improve the accuracy of other orthogonal data sets such as regular ChIP-seq maps.
We then analyzed gene bodies (UTRs, CDS, exons or introns). SVs of all three categories were strongly depleted at gene bodies, including 3'UTRs, 5'UTRs, CDS, exons, and introns (Figure 3B). For example, common insertions were depleted 11-fold at coding exons, against reflecting high selection pressure on coding sequences. Similar to enhancers, rare and ultrarare SVs showed weaker depletion patterns in exons of all types. Interestingly, intronic regions showed no deviations from background, except for a modest elevation in rare and ultrarare insertions. This may reflect the increased propensity of certain MEIs families to insert into the gene bodies of actively transcribed genes or GC-rich regions43,44.
SVs deleting gene regions may cause complete or partial loss of function (LOF) effects. Conversely, duplications may lead to gene copy gain, augmenting gene dosage. Employing SVTK45 to assess the potential impact of the SG10K_Health SVs on protein-coding regions, we identified 5,438 SVs (3.5% of 152,655) with direct predicted impact on protein coding integrity (Figure 3C). Of these, 4,143 SVs resulted in likely gene LOF. LOF-associated SVs tended to occur at low allele frequencies (AF<1%). We identified 1,023 duplications predicted to cause copy number gain of one or several consecutive protein-coding genes. Copy number gain events were typically larger compared to LOF events (median size 96kb vs 4.6kb). These patterns are in line with gnomAD where the majority of protein coding affecting SVs resulted in LOF, and copy gain events exhibited larger sizes.
We assessed the potential impact of SVs on clinically actionable genes, focusing 78 actionable genes (ACMG v3.146) associated with highly penetrant and actionable genetic conditions. AnnotSV47 was used to identify SVs potentially affecting at least one ACMG v3.1 gene. We found 35 SVs affecting coding sequence integrity in 21 clinically actionable ACMG genes. For example, we identified two separate 2.3kb and 9.4kb deletions in 5 and 3 Chinese individuals, affecting TRDN (Figure 3D), encoding triadin and a key component of the calcium release complex48. We also found, in 48 individuals (27 Chinese, 12 Indian, 9 Malay) a 134bp deletion affecting DSG2 (Supplementary Figure 5), an essential component of desmosomes that provides mechanical strength and stability to heart and skin tissues49. Both TRDN and DSG2 genes have been associated with severe cardiac dysfunction (catecholaminergic polymorphic ventricular tachycardia (CPVT), and arrhythmogenic right ventricular cardiomyopathy (ARVC) respectively).
SV Patterns Between International Cohorts
Reflecting the novelty of the SG10K-SV-r1.3 catalog, 74.3% (113,446/152,655) of the catalog were not previously reported in gnomAD-SV (Figure 2A). To compare the SG10K-SV-r1.3 cohort with gnomad-SV more stringently, we then applied a 50% call rate cut-off across each population within SG10K_SV, resulting in 85,162 SVs not exhibiting any overlap with gnomAD-SV. We hereby refer to these SVs as novel “Asian-specific" SVs. This included 47,064 deletions, 20,462 duplications, and 17,636 insertions. The majority of novel Asian-specific SVs were detected at lower allele frequencies compared to SVs commonly found both in SG10K_SV and gnomAD-SV (Supplementary Figure 6). Additionally, we identified 39,209 SV events in SG10K-SV which overlaps gnomAD-SV events. We further focused on this subset to identify events with a higher prevalence in Asian populations, employing Fst50 analysis on the gnomAD-SV dataset as described in the Methods section. Using this approach, we further detected 32,085 events in gnomAD-SV displaying such differences which overlaps with 14,198 SV in SG10K-SV dataset
Notable examples of Asian-specific events include a previously reported 2,903 bp deletion in intron 2 of the BIM gene, which is associated with resistance to tyrosine kinase inhibitors51. This SV is present in gnomAD-SV at a higher AF in East-Asians compared to other ethnicities (AF EAS: 7.37 x 10-2, AF others: 1.04 x 10-4). Another example comprises a rare 19.3kbp deletion spanning the HBA1 and HBA2 genes, associated with α-thalassemia and detected more frequently in Asian populations (AF EAS: 9.93 x 10-3, AF others: 1.04 x 10-4)20. It is worth noting that without the availability of SG10K-SV as a tool to enrich for Asian events, identifying these SV AF differences solely through an internal comparison of the gnomAD-SV database would have been challenging due to the large number of events in gnomAD-SV. Specifically, conducting an equivalent analysis on the entire gnomAD-SV database would have yielded 1,647 events with significant AF differences between Asian and non-Asian populations. These 1,647 events represent 0.542% of the queried events, in contrast to the 21% (8,285/39,209) obtained when utilizing a comparison between SG10K-SV and gnomAD-SV. This approach serves two important purposes. Firstly, it identifies new events that had not been previously discovered, which constitute the majority of our findings. Secondly, it provides an enhanced framework to facilitate the detection of Asian-specific events within existing published resources.
SVs Between Asian Ancestry Groups
We then investigated SV patterns distinctive to the three major Asian ancestries. Principal components analysis (PCA) on the full set of SG10K_SV demonstrated ancestry-specific population clustering (Figure 4A), which was further replicated using either insertions, deletions, or duplication events (Supplementary Figure 7). These results support pervasive differences across the three SV classes contributing to population differentiation. 38% of SVs were seen in only one ancestry, 15% were shared across two ancestries, and half of the SVs (47%) were in all three populations (Figure 4B). However, as the numbers of events detected as unique in a population correlated with cohort size (Supplementary Table 1) and were enriched for low-frequency SVs (Supplementary Figure 8), it remains possible that some of these SVs may be present in other populations, but remain undetected due to low allele frequency.
To gain a more granular understanding of ancestry-specific SV patterns, we calculated fixation indexes (Fst)50 for each of the detected events and assigned a significance score to each observation using permutation analysis (see Methods). By examining the resulting Fst trends, we found that SVs with extreme Fst values (0.7 and above) were mostly detected in small numbers of individuals (call rate < 2%) not reaching significance thresholds (Figure 4C). Of SVs exhibiting statistically significant Fst values, we identified 18,076 SVs displaying ancestry-specific frequency patterns, comprising 8,346 deletions, 5,660 insertions, and 4,070 duplications (see Methods, refer to VCF in SG10K_Health Chorus Browser).
The set of 18,076 ancestry-specific SVs were further filtered to those annotated to harbour functional consequences (Figure 3C). This analysis yielded a subset of 143 ancestry-specific SVs, comprising 91 deletions, 47 duplications, and 5 insertions (Figure 4C, Supplementary Table 4). By plotting AF trends for the top 50 events with highest Fst (Figure 4D), we observed ancestry-specific events across a range of allele frequencies. SVs associated with Indian ancestry drove differences for the majority of the cases (N=45/50). We also identified 10 SVs for which the GRCh38 reference genome contained the minor allele (MAF>50% in at least one population), underscoring the importance of moving beyond a single human genome reference52 to establish reference genomes better reflecting the genetic diversity of global human populations.
Curation of the 143 events confirmed previously reported ancestry-specific SVs. For example, we observed a 27.6kbp deletion in the ACOT1 gene, involved in fatty acid metabolism. This ancient deletion is marked by significant AF differences between continents, almost reaching fixation in Asian populations53. The ACOT1-associated deletion exhibited a lower AF in Indians (AF SG-Chinese: 0.873, AF SG-Indian: 0.532, AF SG-Malay: 0.769). Another example was a 32kbp deletion in CYP2A6, a member of the cytochrome P450 (CYP-450) superfamily involved in drug metabolism54. This SV also exhibited significantly lower AFs in Indians (AF SG-Chinese: 0.139, AF SG-Indian: 0.045, AF SG-Malay: 0.171). A third example was a 21.5kbp duplication overlapping MPV17L2 and RAB3A, present in gnomAD-SV East Asians but rare in other ancestries55. We observed a similar frequency for Chinese, with lower AFs in Malay and Indians (AF SG-Chinese: 0.030, AF SG-Indian: 0.002, AF SG-Malay: 0.011). There was also a 204.4kbp duplication overlapping multiple genes in chromosome 4, observed only in SG-Indians (AF SG-Chinese: 0, AF SG-Indian: 0.015, AF SG-Malay: 0.002). While not identified in gnomAD-SV, we confirmed detection of this SV in 12 individuals from the 1KGP dataset, all of South-Asian ancestry.
Importantly, we also discovered previously unreported SVs. One such event was a 3kbp deletion overlapping AHNAK2, encoding a nucleoprotein involved in calcium signalling. The AF of this SV was higher in Indians compared to Chinese and Malay (AF SG-Chinese: 0.018, AF SG-Indian: 0.117, AF SG-Malay: 0.014). We also detected a 59bp deletion in TNNT3, which encodes Troponin T3, a protein involved in muscle contraction and distal Arthrogryposis56,57. This event was detected with the highest AF in Malays (AF SG-Chinese: 0.005, AF SG-Indian: 0.032, AF SG-Malay: 0.056). Other Malay-specific deletions involved OR2B2 (1.3kbp; AF SG-Chinese: 0.005, AF SG-Indian: 0.003, AF SG-Malay: 0.029) and FAM3B (1.6kbp; both with AF SG-Chinese: 0.002, AF SG-Indian: 0.002, AF SG-Malay: 0.028). OR2B2 encodes an olfactory receptor, a gene family known for population stratification, whilst FAM3B encodes a secreted cytokine-like protein involved in glucose metabolism and linked to type 2 diabetes. One more Malay specific insertion included a 209 bp event overlapping CEACAM3 (AF SG-Chinese: 0.05, AF SG-Indian: 0.07, AF SG-Malay: 0.11), a cell adhesion molecule that plays a crucial role in the innate immune response to bacterial infections. Finally, we identified a 12.2kbp duplication overlapping CLPSL1 and CLPS, which encode enzymes involved in the digestion of dietary proteins. The AF of this duplication was lower in Chinese compared to Malay and Indian individuals (AF SG-Chinese: 0.52, AF SG-Indian: 0.60, AF SG-Malay: 0.62). The observed minor AF was greater than 50% in all three populations, indicating that this duplication is a common event. Collectively, our analyses demonstrate that numerous population-specific SVs among Asians can be detected using SG10K-SV.
SVs Exhibit cis-linkage to Disease GWAS Loci
Finally, SVs are gaining prominence as potential genetic drivers of disease susceptibility, drug response and other phenotypes58. To explore potential associations between SVs and biological phenotypes, we hypothesized that certain trait-associated lead SNPs identified by GWAS (GWAS-lead SNPs) might not (and indeed often do not) represent the actual causative variant. Conventional GWAS analysis thus often requires pinpointing underlying causal variants using fine-scale genetic mapping to assess variants showing high linkage disequilibrium (LD) with lead SNPs. Since SVs are large variants in terms of genomic span, it is possible that certain SVs in strong LD with GWAS lead SNPs might also be causative59.
To explore this possibility, we performed LD analysis between SG10K-SVs and previously reported SG10K_Health SNPs/short indels inferred from WGS23. LD was computed for high-confidence (call rate > 0.8) common (MAF>1%) SVs (n=6,453) and small variants (n=9,206,351) located within a 1Mbp distance (Figure 5A). 14% of SVs were not in LD with any SG10K_Health small variants (R2<0.2), suggesting that a substantial proportion of SVs represent genetic variability that might be overlooked in conventional genetic association analyses. 4,047 of the 6,453 high-confidence common SVs were in strong LD with 172,698 SG10K_Health SNPs (R2>0.8). Of these, 748 SVs were in strong LD with 1,814 SG10K_Health SNPs matching lead SNPs from the EBI GWAS catalogue (genomic location and allelic alteration), with 75 SVs (35 deletions, 4 duplications, and 36 insertions) in strong LD with 174 lead SNPs from GWAS focused on Asian cohorts. Supplementary table 5 provides these 75 SG10K-SVs and their associated lead SNPs.
From the 75 SVs, we focused on the subset that overlapped exons, since they could most directly be assigned a functional consequence. These included a predicted LOF deletion (chr1: 89,010,225-89,012,941) of exons 7 and 8 of GBP3 that was in strong LD (R2=0.968) with a missense variant (C->R) in the same gene (rs17433780; Figure 5B). Notably, rs17433780 is associated with markers of subclinical atherosclerosis in Chinese individuals (P=2x10-6 60). While this missense SNP is certainly a candidate, our analysis suggests that the linked LOF SV should also be considered a potential causal variant for subclinical atherosclerosis in this locus.
GWAS-lead SNPs are often found in non-coding regions of the genome. Our analysis highlighted one exonic-associated SVs in high LD with these non-exonic SNPs, where the former may represent underlying causal variants. For example, a predicted LOF SV (chr11:55,264,123-55,271,064) deleting exons 2 to 6 of TRIM48 exhibited strong LD (R2=0.903) with an intergenic GWAS-lead SNP (chr11:54,697,371; rs11532186) associated with altered glomerular filtration rate. Notably, an integrative analysis of genetic association and gene expression in a cohort of patients with reduced kidney function identified TRIM48 among the top causal candidates for urine metabolite variation61. These examples support the value of including SG10K-SVs in analyses of genetic drivers of phenotypic variation in Asian cohorts. The full list of SVs in high LD with GWAS lead SNPs is reported in Supplementary Table 5.