Defining Complex Regions within the human genome
Genetic diversity within populations is a key driver of phenotypic variation, yet comprehensive large-scale genomic analysis often incurs substantial costs. Consequently, this study employs the Complex_Region as an approximate representation of the entire human genome, showing the genetic diversity of the human population. Assuming that the genome contains numerous elements, this study treats different genomic regions as distinct set elements, highlighting their varying degrees of similarity and differences among samples. The Complex_Region, defined in this study contains those elements with the most significant differences across the population, thus these regions reflect inter-individual diversity and can be used as an initial population characterization (Fig. 1a). Considering the presence of similar or divergent sequences across individuals, the cross-overlap between circles indicates the high frequency region sequences in the population (enclosed by the light purple dotted circle). As the analysis relies on a reference-based approach, it is necessary to exclude highly similar sequences (within the light blue dotted circle) to define the final Complex_Region. This region, demarcated by the red arrow, represents the final outcome of the screening process. To better define these regions, we utilized more data from the human pan-genome12, 13 (HPRC and CPC) as well as the 1000 Genomes Project at different levels, using hg38 coordinates to identify and integrate genomic regions where variant density or average sequence depth was greater than the overall mean plus two standard deviations (mean + 2*SD) and with continuous segment lengths of at least 450 kb. To further illustrate the above definition, we visualized the Complex_Region regions on chromosome 19, and these highlighted regions can be used to roughly outline the major heterogeneity of genetic information in human populations represented by any dataset of HPRC, CPC or 1000 genomes, confirming that our selected regions meet the previous definition and that the datasets used are sufficiently representative of the population (Fig. 1b).
The Complex_Region sequences may reflect the heterogeneity within the human population, prompting us to further confirm them and mine their sequence characteristics for subsequent functional exploration. Therefore, we compared them with non-Complex_Region sequences from the hg38 genome by selecting random sequences with similar length distributions, and performed PCA dimensionality reduction after vectorisation using 6-mer tokenisation. The analysis revealed that Complex_Region sequences are slightly different from the random non-Complex_Region genomic sequences due to their inherent human origin and display greater diversity (Supplementary Fig. 1a). To minimize potential noise and accurately identify Complex_Region sequences, we constructed Complex_Region dataset in which random genomic regions were specifically used as classification control sequences. Moreover, we constructed the genomic Random dataset using a similar strategy. The final binary and multi-classification results demonstrate that the fine-tuning pre-trained DNA_bert_6 and human_gpt2-v1 models can effectively identify the Complex_Region sequences, and the classification accuracies of the Random dataset range from 0.2864 to 0.5205, whereas those of the Complex_Region dataset are within the range of 0.7942 ~ 0.8700 (Supplementary Table 1). Compared with the Random datasets, these results highlight the distinctiveness of Complex_Region sequences with random genomic regions, reflecting their high internal diversity and sequence uniqueness, as indicated by the four classification results of the Complex_Region dataset, which show the differentiation of coding sequences from non-coding regions.
Sequence characterization of Complex Regions
In this study, we totally identified approximately 248.69 Mb of Complex_Region, accounting for approximately 8.1% of the human genome, containing 239 distinct sequence segments, with the largest contiguous region spanning 5.433 Mb (Supplementary Table 2). These regions are distributed both in clustered and scattered across the hg38 reference genome, varying in density and location (Fig. 1c). Notably, these regions exhibit a variable chromosomal distribution pattern, particularly within chromosomes 16, 6, 2, and 1, which contain more cumulative sequences, and similarly the clinical medical genes14 are more abundant on chromosomes 16 and 19 (Supplementary Fig. 1b, Supplementary Table 2). This observation highlights the fragmented characterization and nonuniform distribution of the Complex_Region across chromosomes, which is consistent with previously reported nonrandom genomic mutations15. This further implies that these patterns may result from chromosome-specific evolutionary processes as well as being influenced by environmental factors.
To determine whether the Complex_Region is adequately representative of population diversity, we further analyzed its population characteristics, inter-sequence similarity, and sequence-specific types within these regions to explore its function. The Complex_Region of the HPRC pan-genome exhibited high sequence diversity entropy, with some regions also showing sequence repetitiveness (Fig. 1d). Compared with the whole hg38 genome, we then assessed whether there was an enrichment of specific sequence types within these regions. The results revealed a general decrease in overall repeat sequences, LINE repeat sequences, and DNase I hypersensitive site regions (DNase_Clusters), but a significant enrichment for SV genetic variants, CpG islands, high GC content, immunogenetic regions, chromosomal topological domains and pan-cancer variants in the Complex_Region. Small variants were significantly more abundant than in non-repetitive genomic regions, and slightly exceeded the overall genomic level, which is consistent with the presence of long region fragments and increased SNVs in genomic segmental duplications16. Additionally, other features such as partial repeat sequences (LINE-Alu; SINE; LTR-ERVK), gene numbers, regulatory elements, the human virus interaction database, HERV coding sequences and the GWAS locus were also proportionally greater than those at the genome-wide level, whereas other characteristics remained generally consistent (Supplementary Fig. 1c). The high genetic variation is similar to the sequence diversity of the HPRC pan-genome, and the low percentage of repetitive sequences supports the analysis. The low percentage of DNase_Clusters also reflects the sequence complexity within the Complex_Region17, 18. The remaining features also imply the potential functional role of these regions, as evidenced by the enriched signals for somatic mutations in pan-cancer and GWAS locus, which directly indicate their potential biomedical value.
Genes functional characteristics of Complex Regions
The Complex_Region, which reflects the greatest genetic heterogeneity among human populations, studies of the functions of related genes help to elucidate the biomedical functions of sequences within this region and the influence of regional environments in shaping population phenotypes. Despite the insignificant enrichment of related genes in the Complex_Region compared with the genome-wide level (Supplementary Fig. 1c), genes within these regions still play a crucial role in regulating differential gene expression regulation across populations. Functional enrichment analysis revealed that these related genes are associated primarily with immunity and metabolism, development, growth, and pathways for specific infections and autoimmune diseases (Fig. 2a, Supplementary Fig. 2a). These genes exhibit tissue-specific expression in the spleen, bone marrow, and blood, particularly in CD34+ and BDCA4+ cells (Supplementary Fig. 2b, Supplementary Table 3). Notably, compared with other regions of the autosomes, the Complex_Region shows greater significant enrichment in molecular functions such as immune system regulation, including cytokine signaling, antigen processing and presentation, immunoregulatory interactions and other processes. Meanwhile, processes such as pathogen response and metabolism, such as herpes simplex virus 1 infection, beta-defensins and fatty acid metabolism, were significantly enriched in the Complex_Region and other regions within the autosomes. However, functions related to human development, sensory perception, and complex genetics were not significantly enriched in Complex_Region, which may be potentially due to the influence of potentially repetitive sequences and complex organismal regulatory mechanisms (Supplementary Table 4). In addition, the intragenomic disease-specific enrichment revealed that genes in these regions are significantly associated with infection-induced diseases, genetic and developmental diseases, autoimmune diseases, neuromuscular diseases, and other diseases, showing greater enrichment than those in non-Complex_Region regions (Fig. 2b, Supplementary Table 5).
Given the uneven distribution of Complex_Region sequences and genes across chromosomes, we aimed to further explore their gene enrichment and their specific functional roles at different chromosomal levels. Initially, we quantified the proportion of Complex_Region genes across different chromosomes (both all and protein-coding genes), and reported that chromosomes 1, 2, 6, 11, 16, and 19 presented a notable enrichment trend, whereas chromosomes 3, 4, 8, and 10 exhibited a significant reduction in genes (Fig. 2c). Subsequent functional enrichment analysis of overlapping genes by chromosome (chr1, chr2, chr6, chr11, chr16, chr19, and others) identified specific functional enrichments across different chromosomes (Fig. 2d, Supplementary Table 6). For instance, chromosome 1 is associated with immunity, metabolism and development, and chromosome 2 focuses on metabolism, locomotion, and cytoarchitecture, chromosome 6 is involved in antigen processing, immunomodulation and cellular stress, chromosome 11 is closely associated with olfactory perception, antigen processing and cancer, chromosome 16 is particularly related to ethanol catabolism, membrane protein insertion and inflammation. These findings of chromosome-specific enrichments in the Complex_Region, which further enhances our understanding of population heterogeneity and local sequence function within the genome.
To explore the implications of chromosome-specific immune and metabolic genes in disease mechanisms, we selected an inflammatory gene set on chromosome 19 consisting of the KIR gene cluster5 (KIR2DL3, KIR2DS4, KIR3DL1, KIR3DL2), members of the immunoglobulin superfamily19 (LILRB1, LILRB2) and genes involved in complement with inflammation regulation20, 21 (C3, NLRP2) for further analysis (Supplementary Fig. 3a). Moreover, the selected metabolic gene set included ATP energy metabolism genes22 (ATP2A1, ATP2C2, ATP6V0D1) located on chromosome 16, carbohydrate metabolism genes23 (AMY1A, AMY1B, AMY1C) on chromosome 1, as well as detoxification and drug metabolism genes24 (GSTM1, GSTM2, GSTM3) together (Supplementary Fig. 3b). We found that these immune and metabolic gene sets exhibited strong associations with autoimmune disorders, infections, cardiovascular diseases, and malignancies, highlighting the need for deeper exploration of sequence function in the Complex_Region. Further studies we will include refining variant-level phenotypic associations and exploring the factors that regulate gene function.
Disease associated signals enriched in Complex Regions
Recently, large-scale GWAS studies have identified numerous genetic phenotype signals25. This study utilized a hypergeometric distribution analysis to further investigate the enrichment of these signals within the Complex_Region, revealing a substantial enrichment mainly in immunity-related regions (CD64 on CD14+ CD16+ monocytes; etc.) and metabolism (Triacylglycerol (56:6) [M + NH4]+ levels; etc.) (Fig. 3a, Supplementary Fig. 4a, Supplementary Table 7). This raises the question of whether the Complex_Region defined in this study can approximate genome-wide level results. To explore this further, we demonstrated the chromosomal distribution of the major genetic susceptibility loci of HIV-associated variants in infectious immune diseases and RA-associated variants in chronic immune diseases, and the results suggest that the redefined Complex_Region can cover the major association loci associated with these two diseases. Consequently, using the Complex_Region for the above enriched phenotypes in large-scale cohort studies may provide a cost-effective strategy without compromising the integrity of the research results (Fig. 3b, Supplementary Fig. 4b).
Tumors have attracted much attention as a major medical challenge. This study explores the distribution of strong somatic mutation association signals within the Complex_Region and their underlying genetic implications. A hypergeometric test revealed substantial enrichment of tumor-causing mutations in these regions, particularly in gastric adenocarcinoma (TCGA-STAD) and kidney clear cell carcinoma (TCGA-KIRC) (Fig. 3c, Supplementary Table 8). Further analysis of genes within the Complex_Region associated with these malignancies revealed that the top 10 mutated genes in gastric adenocarcinoma predominantly harbored missense mutations, with CSMD1 gene expression strongly correlated with patient prognosis. Similarly, in kidney clear cell carcinoma, the top 10 mutated genes contained a high frequency of missense mutations and frameshift deletions, and the expression level of the PBRM1 gene, which presented the greatest number of mutations, was also strongly associated with patient survival (Fig. 3d-e). Therefore, for tumors that are significantly enriched in somatic mutations within the Complex_Region, researchers can directly analyze the overlapping genes to detect the pathogenic genes associated with this tumor. Furthermore, prioritizing this region in large-scale tumor cohort studies may reduce experimental and computational costs without significantly compromising effectiveness.
Characteristics of putative viral sequences in Complex Regions
The identified Complex_Region exhibits unique sequence polymorphisms in human populations and their potential biomedical significance. Non-random regions exceeding 450 kb in length suggest that these sequences may increase population adaptability, genetic complexity, and phenotypic plasticity. Which factors contribute to this phenomenon? Environmental viral microbes, as important external driving forces in human evolution, may partially explain the emergence of this phenomenon. Therefore, we investigated putative viral sequences across the entire human genome, focusing particularly on those within the Complex_Region. Initially, the reference sequence hg38 and its endogenous viral regions were aligned separately with the Virus-Host Database26, revealing that the largest cumulative alignment lengths were primarily for retroviridae, herpesviridae and poxviridae, with several chromosomal variations (Supplementary Table 9). We subsequently assessed the enrichment of potential viral types in the human endogenous retrovirus, the human-virus interaction regions (HVIDB), and the defined Complex_Region, using the potential viral sequences accumulated on the different chromosomes of hg38 as a reference (Fig. 4a). The results show that these regions also contain many potential viruses, such as retroviridae, herpesviridae, poxviridae, and baculoviridae, but clearly exhibit a clear chromosome-specific enrichment. Within the Complex_Region, the most pronounced retroviridae enrichment is on chromosomes 6 and 19, with the parvoviridae and papillomaviridae also showing chromosome-specific enrichment of potential viral sequences (Supplementary Table 10). Compared with the HERV and HVIDB regions, the Complex_Region exhibits a greater diversity of chromosome-specific viral enrichment, suggesting that external viruses contribute to the formation of these sequences, which have been integrated into human DNA through evolutionary selection, indicating that more remnants of ancient viruses may still remain in the human genome.
We found that the sequences within the Complex_Region contained an ~ 2-fold enrichment of viral sequences compared with the whole hg38 genome. The majority of viral sequence remnants were short, predominantly from families like retroviridae, herpesviridae, and poxviridae, with only the Proteus phage VB_PmiS-Isfahan and the human endogenous retrovirus K113 (NC_022518) having remnants exceeding 500 bp and over 95% similarity (Fig. 4b). Given the genome-wide similarity among many sequence fragments within the human genome, this phenomenon is likely to be related to genome sequence complexity and its dynamic regulation during evolution27. As a prominent manifestation of population heterogeneity, does the Complex_Region and its contained putative viral sequences exhibit a similar trend? Further exploration of sequence collinearity across the entire hg38 genome revealed that most of the highly similar segments between chromosomes are smaller than 50 kb, with only a few larger segments, reflecting the genomic similarity in the Complex_Region. The putative viral sequences within the Complex_Region also show inter-chromosomal sequence similarity, but they exhibit chromosomal preferences, such as the absence of this phenomenon on chromosomes 13 and 14, whereas NC_022518 (~ 9.47 kb) is widely present across multiple chromosomes (Supplementary Fig. 5a-b, Fig. 4c).
To further explore the organismic activities associated with the NC_022518 sequence, a functional investigation of human protein-encoding genes carrying potential NC_022518 sequences was conducted (Supplementary Table 11), which revealed that these 10 protein-encoding genes are crucial for fundamental biological processes and closely related to various diseases. The study revealed that the NC_022518 sequence predominantly resides in the intronic regions, with ENSG00000283809 present in exons 5 and 6, and these 10 genes can be clustered into two major groups (Supplementary Fig. 6a). To further investigate the characterization of sequence traits associated with potential NC_022518 sequences across primates, diverse populations, and individual genomes, and we assessed the similarity scores of these sequences, followed by PCA for dimensionality reduction (Fig. 4d, Supplementary Fig. 6b-c). The analysis revealed no significant differences among the NC_022518-like sequences in primates, whereas the human sequences presented greater diversity. Multiple coding gene sequences highly similar to those of NC_022518 in humans clustered with primate and viral sequences, whereas the longest non-coding sequence displayed distinct characteristics. Notably, non-coding sequences closely related to NC_022518 exhibited greater disparities than coding regions did, which was particularly evident in the differences between the two haplotype segments within East Asian populations, indicating individual-level discrepancies in potential NC_022518 sequences rather than the population, and this phenomenon may be related to the adaptability of populations. The evolutionary tree constructed by the potentially similar sequences of NC_022518 also showed the same trend, in which these sequences had no obvious evolutionary distance among non-human primates, but had obvious differentiation in the population (Supplementary Fig. 6d). Furthermore, alignment results from non-redundant nucleic acid databases revealed the presence of the NC_022518-like sequences in various primate species (Supplementary Table 12), and the results of the corresponding species tree indicated that the collected sequences clustered into three primary branches, one closely aligned with HERVK-related sequences of the NC_022518 sequence, another predominantly composed of HERVK (I) sequences, and a third represented by HERVK HML-2 sequences in primates (Supplementary Fig. 7). In addition, recent research findings have highlighted HERVK as a virus recently integrated into the genome, playing a significant role in regulating processes such as aging and neurological diseases28, 29, which further highlights the crucial role of the NC_022518 sequence in organismic regulation and disease.
The pathogenicity and regulation of putative viral sequences in Complex Regions
The immune system, a vital component of the body’s response to the environment, plays a critical role in defending against diseases and eliminating pathogens30, 31. It has been found that ancient viral sequences, integrated into the human genome via reverse transcription mechanisms, persist in regulating gene expression. These sequences trigger innate immune responses by activating viral defense pathways, which are closely associated with cancer and neurological disorders32. To further explore the functions of these putative viral sequences in the Complex_Region, this study assessed the chromosomal distribution and related gene regulatory networks of long terminal repeat retrotransposons (LTR_Repeat), putative viral regions (Virus), immunogenetic regions (Immunogenetic), and human-virus protein interaction zones (HVIDB). The findings revealed that sequences from LTR_Repeat, Virus, and HVIDB predominantly extended across chromosomes 1, 2, 6, and 16, whereas the immunogenetic sequences were located mainly on chromosomes 6, 8, 14, and 19 (Fig. 5a).
Building on the premise that sequence structure determines function, this study investigated the common characteristics among Virus, HVIDB, and Immunogenetic sequences, revealing that these regions collectively encompass approximately 275.471 kb and include 86 genes (Fig. 5b). The disease enrichment analysis for these genes revealed that they are associated primarily with immune and infection-related diseases, tumors and specific neurological disorders. Moreover, the average pathogenic prediction scores for gene sets associated with specific diseases are similar to the enrichment results, with conditions like Agnosia for Pain achieving a high pathogenicity score of 0.7, whereas those such as HIV-1 are around 0.4. (Supplementary Fig. 8a). Moreover, the results of the gene-protein interaction network analysis revealed that these overlapping genes are organized into three clusters: MODULE_1 involves graft-versus-host disease, antigen processing, and the presentation of exogenous peptide antigens; MODULE_2 is related to transmembrane receptor protein tyrosine kinase signaling pathways and enzyme-linked receptor-associated signaling pathways, etc.; MODULE_3 is linked to graft-versus-host disease, antigen processing and presentation, and natural killer cell-mediated cytotoxicity (Fig. 5c). Consequently, specific regions within the Complex_Region containing putative viral sequences continue to interact with external viruses and contribute to immune regulation, signal transduction, and graft rejection processes. Notably, given that many putative viral sequences are located in non-coding regions, we used the susceptibility prediction scores of the SNPs to assess their average pathogenicity to the corresponding regions within Complex_Region. Compared with sequences within the entire Complex_Region, putative viral sequences on various chromosomes presented differing pathogenicity scores, with Chromosomes 1, 3, 4, 6, 11, 16, and 19 showing greater pathogenicity for Virus and HVIDB sequences, whereas the immunogenetic sequences on chromosomes 4, 6, 11, and 19 presented increased pathogenicity (Supplementary Fig. 8b, Supplementary Table 11). These pathogenicity assessments will continue to guide the interpretation of their functions, facilitating ongoing research.
To assess the impact of potential putative viral variants in specific regions on the disease, this study further demonstrated their chromosomal distribution and associated phenotypes. The results of the cumulative small variants analysis revealed that chromosomes 1, 2, 6, 8, 12, 16, and 19 presented a relatively high density of Virus and HVIDB-related variants, with chromosomes 6, 14, and 19 exhibiting more immunogenetic variants. At the structural variants level, chromosomes 1, 5, 9, 15, 16 and 22 presented greater distributions of virus and HVIDB variants, with chromosomes 1 and 9 also harboring significant immunogenetic mutations (Supplementary Fig. 9a-b). Utilizing the results of GWAS analyses, we identified the complex phenotypes associated with these viral variants, mainly in regions related to immunity and its related diseases, metabolic and cardiovascular indicators, and general health markers (Fig. 5d). Additionally, these GWAS-associated potential viruses show strong signals of recent positive selection (XP-EHH, etc.) and local adaptation (Fst), highlighting their active role in recent regulatory selection of the organism (Supplementary Table 14–15). Further analysis of the pathogenicity scores, functional annotations, and population frequencies of these putative viral variants revealed that only a few had high pathogenicity scores, resulting in a limited number of mutations within different score regions (Supplementary Fig. 9c, Supplementary Table 16). High pathogenic mutations often occur at enhancers and transcription factor binding sites, displaying varying frequencies across populations, such as rs3748805, which has a high frequency in all populations, rs116399863 with a higher frequency in African populations, and rs2230209 with a higher frequency in South Asian populations (Supplementary Fig. 9d). Meanwhile, these highly pathogenic mutations associated with the 3D genome involve genes regulating the immune system (CD70), development and differentiation33 (UNCX, TBX10), signal transduction34 (G6B, DPCR1), cell apoptosis and growth35 (LGALS7, LZRS), and gene expression regulation36 (CRMP1/MIR137BD1, DOCK11/DDX11-AS1), and other functional processes. Further dynamic regulatory action analysis of single putative viral mutations revealed that the high-frequency rs3748805 variant plays a complex role in gene regulation, and is located in the exon region of the THEM4 gene (Supplementary Fig. 10, Supplementary Table 17). This mutation is closely associated with the inhibition of protein kinase B (Akt), cell apoptosis, insulin signaling, cancer, eye movement dysfunctions in schizophrenia, etc35,36,37., and is positively associated with AMR and SAS populations, and may be closely associated with the regional environmental viruses. Meanwhile, the rs10484554 variant in the PSORS1 gene has been found to significantly increase psoriasis risk, particularly in early-onset cases, with males being more susceptible to severe disease, possibly due to hormonal imbalances, immune response variations, or sex-specific genetic background38, 39.