Deciphering Complex Regions within the Human Genome and Unraveling Their Critical Biomedical Regulatory Functions

doi:10.21203/rs.3.rs-4800291/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Background: Nuclear genomic DNA plays a crucial role in individual development and phenotype determination. The genetic landscape within populations exhibits significant heterogeneity, contributing to diverse human traits. Current studies of human genome heterogeneity often focus on specific segments of high-frequency phenotype-associated sequences or structurally complex regions. Therefore, to overcome the limitations of previous studies and more directly explore population heterogeneity, it is essential to study the entire genome rather than focusing only on known phenotype-associated regions.

Results: Using set theory, we have clearly defined Complex Regions (Complex_Region) by integrating pan-genome datasets, covering about 8.1% of the human genome. These regions exhibit high sequence diversity and nonrandom long continuous fragments (≥450 kb), thus reflecting population genetic complexity. Our enrichment analysis revealed that genes within Complex_Region are primarily involved in immunity and metabolism, indicating chromosome-specific functional enrichment. Notably, immune genes are mainly located on chromosomes 6 and 19, which are closely associated with disease occurrence. Moreover, these regions are enriched for human phenotype-related signals and tumor somatic mutations, providing novel insights for large-scale cohort studies. We also detected ancient viral sequences, particularly the ~9.47 kb human endogenous retrovirus (HERV) insertion sequence NC_022518, which is diverse in humans but conserved across primates and appears to be implicated in regulating bodily functions and various diseases.

Conclusions: Our study highlights the biomedical importance of Complex_Region by revealing associations among genotypes, environment, and phenotypes. This enhances our understanding of life regulation and phenotype shaping, highlighting the role of these regions in immunity, metabolism, and disease association.

Complex Regions

Genotypes

Phenotypes

Immunity and Metabolism

Pan-genome

HERV

Nuclear genomic DNA, which is pivotal to biological processes, orchestrates complex interactions involving gene expression regulation, epigenetic modifications, microbiome interactions, and environmental influences on organismal phenotype development. A deeper understanding of the genotype-environment-phenotype relationship can enhance our comprehension of biological diversity and ultimately provide new perspectives and strategies for disease diagnosis, treatment and prevention. Genome-wide association studies (GWAS) have revealed that over 10,000 genetic variants are associated with various phenotypes and diseases¹. By integrating diverse human genotype-phenotype data, the DisGeNET database now includes more than 24,000 diseases and traits, and 17,000 genes with 117,000 variants². In addition, researchers have identified approximately 8–9 SNPs per 10 kb in the human genome, revealing a significant nonrandom mutation pattern across populations^{3, 4}. Notably, regions such as the KIR immune gene family on chromosome 19 are highly variable within the genome⁵. Moreover, the precise resolution of large-scale genome sequences remains constrained by cost and technological limitations at this stage, with only a few complete genomes such as CHM13 and HG002 available^{6, 7}. Together, these gaps highlight the importance of further exploring the complex characteristics of the human genome and population diversity.

Current studies exploring the heterogeneity of the human genome tend to focus on specific segments of high-frequency phenotype-associated sequences in populations or on structurally complex regions. On the basis of linkage disequilibrium and allele frequency, researchers have identified regions associated with complex traits, shedding light on various evolutionary patterns influencing these trait loci⁸. Furthermore, the integration of 3D structure data with GWAS signals has highlighted genes in close proximity to known phenotypic sites, facilitating the identification of genomic regions potentially associated with schizophrenia⁹. Additionally, sequences such as tandem or proximal repeats, which are intricate regions of eukaryotic genomes that are difficult to resolve molecularly because of their high similarity, play crucial roles in understanding human diseases, developmental processes, and environmental adaptability¹⁰. By analyzing the genomes of humans and eight nonhuman primates, scientists introduced the concept of structurally divergent regions (SDRs) in primate genomes and identified several SDRs closely associated with human diseases such as Joubert syndrome¹¹.

To overcome the limitations of previous research and explore population heterogeneity more directly, rather than focusing solely on known phenotype-associated regions, this study draws on set theory to explicitly define the Complex_Region by integrating graph pan-genome data, and shows that the genes in these regions function primarily in immune and metabolic processes. We then comprehensively explored the sequence characteristics, potential phenotypes, and regulatory mechanisms of these regions. From another perspective, the Complex_Region defined here reflects the "degree of sequence difference" in bioinformatic analysis and focuses on high-frequency, long, nonrandom genomic regions within populations. However, owing to the limitations of current graph pan-genome construction methods in handling repetitive sequences, our Complex_Region captures polymorphic regions of the human genome more than highly repetitive regions. Briefly, the Complex_Region is characterized by high sequence diversity entropy and long nonrandom contiguous segments of at least 450 kb; it can be used to characterize the complexity of genetic information at the population level and to support the advancement of precision medicine, with considerable potential for diverse biomedical applications.

Defining Complex Regions within the human genome

Genetic diversity within populations is a key driver of phenotypic variation, yet comprehensive large-scale genomic analysis often incurs substantial costs. Consequently, this study employs the Complex_Region as an approximate representation of the entire human genome that captures the genetic diversity of the human population. Assuming that the genome contains numerous elements, this study treats different genomic regions as distinct set elements, highlighting their varying degrees of similarity and difference among samples. The Complex_Region defined in this study contains the elements that differ most across the population; these regions therefore reflect inter-individual diversity and can be used as an initial population characterization (Fig. 1a). Because individuals carry both similar and divergent sequences, the overlap between circles indicates the sequences of high-frequency regions in the population (enclosed by the light purple dotted circle). As the analysis relies on a reference-based approach, highly similar sequences (within the light blue dotted circle) must be excluded to define the final Complex_Region. This region, marked by the red arrow, represents the final outcome of the screening process. To better define these regions, we drew on data from the human pan-genome^{12, 13} (HPRC and CPC) as well as the 1000 Genomes Project at different levels, using hg38 coordinates to identify and integrate genomic regions where variant density or average sequence depth was greater than the overall mean plus two standard deviations (mean + 2*SD) and where continuous segment lengths were at least 450 kb.
To further illustrate this definition, we visualized the Complex_Region on chromosome 19. The highlighted regions roughly outline the major heterogeneity of genetic information in human populations as represented by any of the HPRC, CPC or 1000 Genomes datasets, confirming that our selected regions meet the definition above and that the datasets used are sufficiently representative of the population (Fig. 1b).
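The screening rule described above (a signal above mean + 2*SD, sustained over at least 450 kb) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `complex_regions`, the fixed 10 kb window size, the use of the population standard deviation, and the toy density values are all assumptions introduced here.

```python
from statistics import mean, pstdev

def complex_regions(density, window_bp=10_000, min_len_bp=450_000, n_sd=2.0):
    """Return (start_bp, end_bp) runs of consecutive windows whose value
    exceeds mean + n_sd * SD and whose combined length is >= min_len_bp.

    `density` is a per-window signal (e.g. variant density) along one
    chromosome; the window size is a hypothetical choice for illustration.
    """
    cutoff = mean(density) + n_sd * pstdev(density)
    regions, run_start = [], None
    # A trailing sentinel below any cutoff closes a run that reaches the end.
    for i, value in enumerate(list(density) + [float("-inf")]):
        if value > cutoff:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            start, end = run_start * window_bp, i * window_bp
            if end - start >= min_len_bp:
                regions.append((start, end))
            run_start = None
    return regions

# Toy signal: one 500 kb elevated run (kept) and one 100 kb run (discarded).
toy = [1.0] * 1000
toy[100:150] = [50.0] * 50
toy[300:310] = [50.0] * 10
print(complex_regions(toy))
```

In practice the same rule would presumably be applied per dataset (HPRC, CPC, 1000 Genomes) and the resulting intervals integrated, as the text describes; that integration step is not shown here.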

The Complex_Region sequences may reflect the heterogeneity within the human population, prompting us to confirm them further and mine their sequence characteristics for subsequent functional exploration. We therefore compared them with non-Complex_Region sequences from the hg38 genome by selecting random sequences with similar length distributions, and performed PCA dimensionality reduction after vectorisation using 6-mer tokenisation. The analysis revealed that Complex_Region sequences differ only slightly from the random non-Complex_Region genomic sequences, as expected given their shared human origin, but display greater diversity (Supplementary Fig. 1a). To minimize potential noise and accurately identify Complex_Region sequences, we constructed a Complex_Region dataset in which random genomic regions were specifically used as classification control sequences. We also constructed a genomic Random dataset using a similar strategy. The final binary and multi-class classification results demonstrate that the fine-tuned pre-trained DNA_bert_6 and human_gpt2-v1 models can effectively identify the Complex_Region sequences: the classification accuracies on the Random dataset range from 0.2864 to 0.5205, whereas those on the Complex_Region dataset range from 0.7942 to 0.8700 (Supplementary Table 1). Compared with the Random dataset, these results highlight how distinct Complex_Region sequences are from random genomic regions and reflect their high internal diversity and sequence uniqueness, as indicated by the four classification results of the Complex_Region dataset, which also differentiate coding sequences from non-coding regions.
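As a rough illustration of the 6-mer vectorisation step that precedes PCA, the sketch below turns a DNA string into a normalised 6-mer frequency vector. It is an assumption-laden example rather than the study's code: the overlapping (stride 1) tokenisation, the skipping of k-mers containing non-ACGT characters, and the helper names `kmer_tokens` and `kmer_vector` are all introduced here for illustration.

```python
from collections import Counter
from itertools import product

BASES = frozenset("ACGT")

def kmer_tokens(seq, k=6):
    """Overlapping k-mers of seq; k-mers containing non-ACGT symbols are dropped."""
    seq = seq.upper()
    tokens = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= BASES:
            tokens.append(kmer)
    return tokens

def kmer_vector(seq, k=6):
    """Frequency vector over all 4**k k-mers, in lexicographic order."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(kmer_tokens(seq, k))
    total = sum(counts.values()) or 1  # avoid division by zero for short input
    return [counts[kmer] / total for kmer in vocab]

print(kmer_tokens("ACGTACGT"))          # three overlapping 6-mers
print(len(kmer_vector("ACGTACGT")))     # 4096 dimensions for k = 6
```

Stacking one such vector per sequence gives a matrix to which any standard PCA implementation could then be applied.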

Sequence characterization of Complex Regions

In total, we identified approximately 248.69 Mb of Complex_Region sequence, accounting for approximately 8.1% of the human genome and comprising 239 distinct sequence segments, with the largest contiguous region spanning 5.433 Mb (Supplementary Table 2). These regions are both clustered and scattered across the hg38 reference genome, varying in density and location (Fig. 1c). Notably, their chromosomal distribution is uneven: chromosomes 16, 6, 2, and 1 contain the most cumulative sequence, and clinical medical genes¹⁴ are similarly more abundant on chromosomes 16 and 19 (Supplementary Fig. 1b, Supplementary Table 2). This observation highlights the fragmented and nonuniform distribution of the Complex_Region across chromosomes, which is consistent with previously reported nonrandom genomic mutations¹⁵. It further implies that these patterns may result from chromosome-specific evolutionary processes and may also be influenced by environmental factors.

To determine whether the Complex_Region is adequately representative of population diversity, we further analyzed its population characteristics, inter-sequence similarity, and sequence-specific types within these regions to explore its function. The Complex_Region of the HPRC pan-genome exhibited high sequence diversity entropy, with some regions also showing sequence repetitiveness (Fig. 1d). We then assessed whether specific sequence types were enriched within these regions relative to the whole hg38 genome. The results revealed a general decrease in overall repeat sequences, LINE repeat sequences, and DNase I hypersensitive site regions (DNase_Clusters), but a significant enrichment for SV genetic variants, CpG islands, high GC content, immunogenetic regions, chromosomal topological domains and pan-cancer variants in the Complex_Region. Small variants were significantly more abundant than in non-repetitive genomic regions and slightly exceeded the overall genomic level, which is consistent with the presence of long region fragments and increased SNVs in genomic segmental duplications¹⁶. Additionally, other features such as partial repeat sequences (LINE-Alu; SINE; LTR-ERVK), gene numbers, regulatory elements, the human virus interaction database, HERV coding sequences and GWAS loci were also proportionally greater than at the genome-wide level, whereas the remaining characteristics were generally consistent with it (Supplementary Fig. 1c). The high genetic variation is similar to the sequence diversity of the HPRC pan-genome, and the low percentage of repetitive sequences supports the analysis. The low percentage of DNase_Clusters also reflects the sequence complexity within the Complex_Region^{17, 18}. The remaining features also imply a potential functional role for these regions, as evidenced by the enriched signals for pan-cancer somatic mutations and GWAS loci, which directly indicate their potential biomedical value.
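The text does not spell out how "sequence diversity entropy" is computed. One common reading, assumed here purely for illustration, is the Shannon entropy of the alleles (or haplotype segments) observed across samples at a given position; the function name `shannon_entropy` and the toy inputs are hypothetical.

```python
from collections import Counter
from math import log2

def shannon_entropy(alleles):
    """Shannon entropy (in bits) of the allele distribution at one site.

    `alleles` holds one observed allele or haplotype segment per sample.
    0 means every sample agrees; higher values mean more diversity.
    """
    n = len(alleles)
    counts = Counter(alleles)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(shannon_entropy(["A", "A", "A", "A"]))  # identical samples
print(shannon_entropy(["A", "A", "C", "C"]))  # two equally common alleles
print(shannon_entropy(["A", "C", "G", "T"]))  # four equally common alleles
```

Averaging such per-site values over a window would give a per-region diversity score of the kind that could be compared across the genome.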

Gene functional characteristics of Complex Regions

Because the Complex_Region reflects the greatest genetic heterogeneity among human populations, studying the functions of its genes helps to elucidate the biomedical functions of sequences within these regions and the influence of regional environments in shaping population phenotypes. Although genes are not significantly enriched in the Complex_Region compared with the genome-wide level (Supplementary Fig. 1c), genes within these regions still play a crucial role in regulating differential gene expression across populations. Functional enrichment analysis revealed that these genes are associated primarily with immunity and metabolism, development, growth, and pathways for specific infections and autoimmune diseases (Fig. 2a, Supplementary Fig. 2a). These genes exhibit tissue-specific expression in the spleen, bone marrow, and blood, particularly in CD34⁺ and BDCA4⁺ cells (Supplementary Fig. 2b, Supplementary Table 3). Notably, compared with other regions of the autosomes, the Complex_Region shows more significant enrichment in molecular functions related to immune system regulation, including cytokine signaling, antigen processing and presentation, immunoregulatory interactions and other processes. Meanwhile, pathogen response and metabolic processes, such as herpes simplex virus 1 infection, beta-defensins and fatty acid metabolism, were significantly enriched both in the Complex_Region and in other regions of the autosomes. However, functions related to human development, sensory perception, and complex genetics were not significantly enriched in the Complex_Region, possibly owing to the influence of repetitive sequences and complex organismal regulatory mechanisms (Supplementary Table 4).
In addition, the intragenomic disease-specific enrichment revealed that genes in these regions are significantly associated with infection-induced diseases, genetic and developmental diseases, autoimmune diseases, neuromuscular diseases, and other diseases, showing greater enrichment than those in non-Complex_Region regions (Fig. 2b, Supplementary Table 5).

Given the uneven distribution of Complex_Region sequences and genes across chromosomes, we aimed to further explore their gene enrichment and specific functional roles at the level of individual chromosomes. Initially, we quantified the proportion of Complex_Region genes on each chromosome (for both all genes and protein-coding genes), and found that chromosomes 1, 2, 6, 11, 16, and 19 presented a notable enrichment trend, whereas chromosomes 3, 4, 8, and 10 exhibited a significant reduction in genes (Fig. 2c). Subsequent functional enrichment analysis of overlapping genes by chromosome (chr1, chr2, chr6, chr11, chr16, chr19, and others) identified specific functional enrichments across different chromosomes (Fig. 2d, Supplementary Table 6). For instance, chromosome 1 is associated with immunity, metabolism and development; chromosome 2 with metabolism, locomotion, and cytoarchitecture; chromosome 6 with antigen processing, immunomodulation and cellular stress; chromosome 11 with olfactory perception, antigen processing and cancer; and chromosome 16 with ethanol catabolism, membrane protein insertion and inflammation. These chromosome-specific enrichments in the Complex_Region further enhance our understanding of population heterogeneity and local sequence function within the genome.

To explore the implications of chromosome-specific immune and metabolic genes in disease mechanisms, we selected an inflammatory gene set on chromosome 19 consisting of the KIR gene cluster⁵ (KIR2DL3, KIR2DS4, KIR3DL1, KIR3DL2), members of the immunoglobulin superfamily¹⁹ (LILRB1, LILRB2) and genes involved in complement and inflammation regulation^{20, 21} (C3, NLRP2) for further analysis (Supplementary Fig. 3a). The selected metabolic gene set included ATP energy metabolism genes²² (ATP2A1, ATP2C2, ATP6V0D1) located on chromosome 16, carbohydrate metabolism genes²³ (AMY1A, AMY1B, AMY1C) on chromosome 1, and detoxification and drug metabolism genes²⁴ (GSTM1, GSTM2, GSTM3) (Supplementary Fig. 3b). We found that these immune and metabolic gene sets exhibited strong associations with autoimmune disorders, infections, cardiovascular diseases, and malignancies, highlighting the need for deeper exploration of sequence function in the Complex_Region. Further studies will include refining variant-level phenotypic associations and exploring the factors that regulate gene function.

Disease associated signals enriched in Complex Regions

Recently, large-scale GWAS studies have identified numerous genetic phenotype signals²⁵. This study utilized a hypergeometric distribution analysis to further investigate the enrichment of these signals within the Complex_Region, revealing a substantial enrichment mainly in immunity-related regions (CD64 on CD14⁺ CD16⁺ monocytes; etc.) and metabolism (Triacylglycerol (56:6) [M + NH4]⁺ levels; etc.) (Fig. 3a, Supplementary Fig. 4a, Supplementary Table 7). This raises the question of whether the Complex_Region defined in this study can approximate genome-wide level results. To explore this further, we demonstrated the chromosomal distribution of the major genetic susceptibility loci of HIV-associated variants in infectious immune diseases and RA-associated variants in chronic immune diseases, and the results suggest that the redefined Complex_Region can cover the major association loci associated with these two diseases. Consequently, using the Complex_Region for the above enriched phenotypes in large-scale cohort studies may provide a cost-effective strategy without compromising the integrity of the research results (Fig. 3b, Supplementary Fig. 4b).
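The hypergeometric enrichment test mentioned above can be illustrated with a short sketch. The upper-tail formulation, the function name `hypergeom_enrichment_p`, and the toy counts are assumptions made here; the study's exact background set and multiple-testing correction are not specified in this section.

```python
from math import comb

def hypergeom_enrichment_p(k, N, K, n):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: all background items (e.g. all GWAS signals genome-wide)
    K: background items carrying the annotation (e.g. signals for one trait)
    n: items drawn (e.g. signals falling inside the Complex_Region)
    k: drawn items carrying the annotation
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Toy example: 10 signals in total, 5 for the trait, 5 inside the region,
# and all 5 regional signals belong to the trait.
print(hypergeom_enrichment_p(k=5, N=10, K=5, n=5))  # 1/252
```

A small p-value indicates that the annotation is over-represented inside the region relative to the background.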

Tumors have attracted much attention as a major medical challenge. This study explores the distribution of strong somatic mutation association signals within the Complex_Region and their underlying genetic implications. A hypergeometric test revealed substantial enrichment of tumor-causing mutations in these regions, particularly in gastric adenocarcinoma (TCGA-STAD) and kidney clear cell carcinoma (TCGA-KIRC) (Fig. 3c, Supplementary Table 8). Further analysis of genes within the Complex_Region associated with these malignancies revealed that the top 10 mutated genes in gastric adenocarcinoma predominantly harbored missense mutations, with CSMD1 gene expression strongly correlated with patient prognosis. Similarly, in kidney clear cell carcinoma, the top 10 mutated genes contained a high frequency of missense mutations and frameshift deletions, and the expression level of the PBRM1 gene, which presented the greatest number of mutations, was also strongly associated with patient survival (Fig. 3d-e). Therefore, for tumors that are significantly enriched in somatic mutations within the Complex_Region, researchers can directly analyze the overlapping genes to detect the pathogenic genes associated with this tumor. Furthermore, prioritizing this region in large-scale tumor cohort studies may reduce experimental and computational costs without significantly compromising effectiveness.

Characteristics of putative viral sequences in Complex Regions

The identified Complex_Region exhibits unique sequence polymorphisms in human populations and has potential biomedical significance. Non-random regions exceeding 450 kb in length suggest that these sequences may increase population adaptability, genetic complexity, and phenotypic plasticity. Which factors contribute to this phenomenon? Environmental viruses, as important external driving forces in human evolution, may partially explain it. Therefore, we investigated putative viral sequences across the entire human genome, focusing particularly on those within the Complex_Region. Initially, the reference sequence hg38 and its endogenous viral regions were aligned separately with the Virus-Host Database²⁶, revealing that the largest cumulative alignment lengths were primarily for retroviridae, herpesviridae and poxviridae, with some variation among chromosomes (Supplementary Table 9). We subsequently assessed the enrichment of potential viral types in the human endogenous retrovirus (HERV) regions, the human-virus interaction regions (HVIDB), and the defined Complex_Region, using the potential viral sequences accumulated on the different chromosomes of hg38 as a reference (Fig. 4a). The results show that these regions also contain many potential viruses, such as retroviridae, herpesviridae, poxviridae, and baculoviridae, but exhibit clear chromosome-specific enrichment. Within the Complex_Region, the most pronounced retroviridae enrichment is on chromosomes 6 and 19, with parvoviridae and papillomaviridae also showing chromosome-specific enrichment of potential viral sequences (Supplementary Table 10).
Compared with the HERV and HVIDB regions, the Complex_Region exhibits a greater diversity of chromosome-specific viral enrichment, suggesting that external viruses contribute to the formation of these sequences, which have been integrated into human DNA through evolutionary selection, indicating that more remnants of ancient viruses may still remain in the human genome.

We found that the sequences within the Complex_Region contained an approximately 2-fold enrichment of viral sequences compared with the whole hg38 genome. The majority of viral sequence remnants were short and came predominantly from families such as retroviridae, herpesviridae, and poxviridae, with only the Proteus phage VB_PmiS-Isfahan and the human endogenous retrovirus K113 (NC_022518) having remnants exceeding 500 bp with over 95% similarity (Fig. 4b). Given the genome-wide similarity among many sequence fragments within the human genome, this phenomenon is likely to be related to genome sequence complexity and its dynamic regulation during evolution²⁷. As a prominent manifestation of population heterogeneity, do the Complex_Region and the putative viral sequences it contains exhibit a similar trend? Further exploration of sequence collinearity across the entire hg38 genome revealed that most of the highly similar segments between chromosomes are smaller than 50 kb, with only a few larger segments, reflecting the genomic similarity in the Complex_Region. The putative viral sequences within the Complex_Region also show inter-chromosomal sequence similarity, but with chromosomal preferences: the phenomenon is absent on chromosomes 13 and 14, whereas NC_022518 (~9.47 kb) is widely present across multiple chromosomes (Supplementary Fig. 5a-b, Fig. 4c).

To further explore the organismic activities associated with the NC_022518 sequence, we conducted a functional investigation of human protein-encoding genes carrying potential NC_022518 sequences (Supplementary Table 11), which revealed that these 10 protein-encoding genes are crucial for fundamental biological processes and closely related to various diseases. The NC_022518 sequence resides predominantly in intronic regions, although in ENSG00000283809 it is present in exons 5 and 6, and these 10 genes can be clustered into two major groups (Supplementary Fig. 6a). To further characterize the sequence traits associated with potential NC_022518 sequences across primates, diverse populations, and individual genomes, we assessed the similarity scores of these sequences and then applied PCA for dimensionality reduction (Fig. 4d, Supplementary Fig. 6b-c). The analysis revealed no significant differences among the NC_022518-like sequences in primates, whereas the human sequences presented greater diversity. Multiple coding gene sequences highly similar to NC_022518 in humans clustered with primate and viral sequences, whereas the longest non-coding sequence displayed distinct characteristics. Notably, non-coding sequences closely related to NC_022518 exhibited greater disparities than coding regions did, which was particularly evident in the differences between the two haplotype segments within East Asian populations. This indicates that discrepancies in potential NC_022518 sequences arise at the individual level rather than the population level, a phenomenon that may be related to the adaptability of populations. The evolutionary tree constructed from sequences potentially similar to NC_022518 showed the same trend: these sequences had no obvious evolutionary distance among non-human primates but were clearly differentiated within the population (Supplementary Fig. 6d).
Furthermore, alignment results from non-redundant nucleic acid databases revealed the presence of the NC_022518-like sequences in various primate species (Supplementary Table 12), and the results of the corresponding species tree indicated that the collected sequences clustered into three primary branches, one closely aligned with HERVK-related sequences of the NC_022518 sequence, another predominantly composed of HERVK (I) sequences, and a third represented by HERVK HML-2 sequences in primates (Supplementary Fig. 7). In addition, recent research findings have highlighted HERVK as a virus recently integrated into the genome, playing a significant role in regulating processes such as aging and neurological diseases^{28, 29}, which further highlights the crucial role of the NC_022518 sequence in organismic regulation and disease.

The pathogenicity and regulation of putative viral sequences in Complex Regions

The immune system, a vital component of the body’s response to the environment, plays a critical role in defending against diseases and eliminating pathogens^{30, 31}. It has been found that ancient viral sequences, integrated into the human genome via reverse transcription mechanisms, persist in regulating gene expression. These sequences trigger innate immune responses by activating viral defense pathways, which are closely associated with cancer and neurological disorders³². To further explore the functions of these putative viral sequences in the Complex_Region, this study assessed the chromosomal distribution and related gene regulatory networks of long terminal repeat retrotransposons (LTR_Repeat), putative viral regions (Virus), immunogenetic regions (Immunogenetic), and human-virus protein interaction zones (HVIDB). The findings revealed that sequences from LTR_Repeat, Virus, and HVIDB predominantly extended across chromosomes 1, 2, 6, and 16, whereas the immunogenetic sequences were located mainly on chromosomes 6, 8, 14, and 19 (Fig. 5a).

Building on the premise that sequence structure determines function, this study investigated the common characteristics among Virus, HVIDB, and Immunogenetic sequences, revealing that these regions collectively encompass approximately 275.471 kb and include 86 genes (Fig. 5b). The disease enrichment analysis for these genes revealed that they are associated primarily with immune and infection-related diseases, tumors and specific neurological disorders. Moreover, the average pathogenic prediction scores for gene sets associated with specific diseases are similar to the enrichment results, with conditions like Agnosia for Pain achieving a high pathogenicity score of 0.7, whereas those such as HIV-1 are around 0.4 (Supplementary Fig. 8a). The gene-protein interaction network analysis further revealed that these overlapping genes are organized into three clusters: MODULE_1 involves graft-versus-host disease, antigen processing, and the presentation of exogenous peptide antigens; MODULE_2 is related to transmembrane receptor protein tyrosine kinase signaling pathways, enzyme-linked receptor-associated signaling pathways, and similar pathways; MODULE_3 is linked to graft-versus-host disease, antigen processing and presentation, and natural killer cell-mediated cytotoxicity (Fig. 5c). Consequently, specific regions within the Complex_Region containing putative viral sequences continue to interact with external viruses and contribute to immune regulation, signal transduction, and graft rejection processes. Notably, given that many putative viral sequences are located in non-coding regions, we used the susceptibility prediction scores of the SNPs to assess the average pathogenicity of the corresponding regions within the Complex_Region.
Compared with sequences within the entire Complex_Region, putative viral sequences on various chromosomes presented differing pathogenicity scores, with Chromosomes 1, 3, 4, 6, 11, 16, and 19 showing greater pathogenicity for Virus and HVIDB sequences, whereas the immunogenetic sequences on chromosomes 4, 6, 11, and 19 presented increased pathogenicity (Supplementary Fig. 8b, Supplementary Table 11). These pathogenicity assessments will continue to guide the interpretation of their functions, facilitating ongoing research.

To assess the impact of potential putative viral variants in specific regions on the disease, this study further demonstrated their chromosomal distribution and associated phenotypes. The results of the cumulative small variants analysis revealed that chromosomes 1, 2, 6, 8, 12, 16, and 19 presented a relatively high density of Virus and HVIDB-related variants, with chromosomes 6, 14, and 19 exhibiting more immunogenetic variants. At the structural variants level, chromosomes 1, 5, 9, 15, 16 and 22 presented greater distributions of virus and HVIDB variants, with chromosomes 1 and 9 also harboring significant immunogenetic mutations (Supplementary Fig. 9a-b). Utilizing the results of GWAS analyses, we identified the complex phenotypes associated with these viral variants, mainly in regions related to immunity and its related diseases, metabolic and cardiovascular indicators, and general health markers (Fig. 5d). Additionally, these GWAS-associated potential viruses show strong signals of recent positive selection (XP-EHH, etc.) and local adaptation (Fst), highlighting their active role in recent regulatory selection of the organism (Supplementary Table 14–15). Further analysis of the pathogenicity scores, functional annotations, and population frequencies of these putative viral variants revealed that only a few had high pathogenicity scores, resulting in a limited number of mutations within different score regions (Supplementary Fig. 9c, Supplementary Table 16). High pathogenic mutations often occur at enhancers and transcription factor binding sites, displaying varying frequencies across populations, such as rs3748805, which has a high frequency in all populations, rs116399863 with a higher frequency in African populations, and rs2230209 with a higher frequency in South Asian populations (Supplementary Fig. 9d). 
Meanwhile, these highly pathogenic mutations associated with the 3D genome involve genes regulating the immune system (CD70), development and differentiation³³ (UNCX, TBX10), signal transduction³⁴ (G6B, DPCR1), cell apoptosis and growth³⁵ (LGALS7, LZRS), gene expression regulation³⁶ (CRMP1/MIR137BD1, DOCK11/DDX11-AS1), and other functional processes. Further analysis of the dynamic regulatory action of single putative viral mutations revealed that the high-frequency rs3748805 variant, located in an exon of the THEM4 gene, plays a complex role in gene regulation (Supplementary Fig. 10, Supplementary Table 17). This mutation is closely associated with the inhibition of protein kinase B (Akt), cell apoptosis, insulin signaling, cancer, and eye movement dysfunctions in schizophrenia, among others^{35, 36, 37}. It is positively associated with AMR and SAS populations and may be closely associated with regional environmental viruses. Meanwhile, the rs10484554 variant in the PSORS1 gene has been found to significantly increase psoriasis risk, particularly in early-onset cases, with males being more susceptible to severe disease, possibly due to hormonal imbalances, immune response variations, or sex-specific genetic background^{38, 39}.

Understanding the basis for nucleotide-level differences among individuals from the perspective of nuclear genomic DNA, and how major regional sequences interact with environmental factors to shape individual phenotypes, is crucial for understanding the essence of life phenomena and will further benefit biomedical research. Starting from this idea and inspired by mathematical set theory, this study redefined the Complex_Region by collecting as much genome sequence data as possible from different sources and levels. The Complex_Region accounts for about 8.1% of the genome and comprises approximately 248.69 Mb of nonrandom continuous long regions, the longest of which is about 5.43 Mb. These regions have a significant distribution of functional sequences on chromosomes 1, 2, 6, and 16; most fragments within them exhibit high genetic diversity and low overall repetition, and signals from Pan-Cancer and GWAS studies are significantly enriched. Further exploration of the potential biomedical value of these sequences revealed that genes within this region are involved in immunological and metabolic processes, indicating chromosome-specific functional enrichment. Notably, many phenotypic genetic signals and somatic mutations in tumors are significantly enriched in these regions, which provides new perspectives for subsequent studies. The analysis of putative viral sequences in the human genome revealed that this region shows about a two-fold enrichment of such sequences, including sequences from Retroviridae, Poxviridae, and Herpesviridae viruses, with the longest being the HERV K113 sequence (NC_022518; ~9.47 kb), which is located within several protein-coding genes involved in disease. In exploring the pathogenicity and regulatory roles of these putative viral sequences, we discovered that genes within the putative viral regions are involved in immune regulation, signal transduction and rejection processes, and still interact with external viruses.
These interactions are likely to be involved in physiological characteristics and complex disease phenotype regulation, potentially reflecting recent environmental adaptation selection events.

The Complex_Region represents a comprehensive view of the "degree of sequence difference" in bioinformatic analysis, overcoming the limitations of previous studies that focused only on specific phenotypic regions. Despite our efforts to investigate the origins of these regions, shortcomings remain, such as incomplete coverage of high-frequency repetitive sequences, insufficient understanding of sequence causation, limited knowledge of sequence regulatory networks, and a lack of genotype-phenotype associations. Owing to current limitations of pan-genome data, we cannot completely cover regions related to high-frequency repetitive sequences such as centromeres, which can be directly integrated into our defined intervals in the future^{16, 37}. To deepen our understanding of sequence causality, we will continue to collaborate with evolutionary biologists on the interpretation of environmental viruses and adopt computational strategies such as pre-trained genome models, using a "human in the loop" approach to clarify the origins of these sequences. The complex regulatory networks of living organisms, such as gene regulatory networks and the influence of non-coding sequences on gene expression, cannot be studied in humans by methods like genetic perturbation, so the Complex_Region should be extended to experimental animal models in the future. Currently, public databases collecting both genotypes and complex phenotypes, particularly for non-European-American populations, are still scarce, and more data need to be collected and pre-processed. Therefore, our future work will focus on understanding this phenomenon with more methods and from more perspectives, expanding this definition to other mammals and building animal models. We will continue to collect and integrate more "genotype-complex phenotype" data for validation and exploration, and apply the Complex_Region to specific biomedical research, for example using CRISPR-Cas9 technology.

In summary, by studying the Complex_Region, we attempt to understand the associations between genotype, the external environment, and external phenotypes, exploring potential regulatory factors behind population differences in phenotypes. The results can be summarized in three aspects: 1) Definition and characteristics of the Complex_Region: it was defined from various data sources and levels and exhibits high sequence diversity entropy, nonrandom distribution, and long continuous segments (≥ 450 kb), which can roughly characterize the complexity of population genetic information. 2) Biomedical significance and functions of the Complex_Region: gene function enrichment is focused mainly on immunity and metabolism, with chromosome-specific functional enrichment. Many genetically related phenotypes and somatic mutations in tumors are significantly enriched in these regions, allowing inexpensive and comprehensive preliminary exploration in cohort studies of these diseases. 3) Origin and regulation of putative viral sequences in the Complex_Region: there is a significant potential enrichment of exogenous viral sequences, particularly the longest fragment, the NC_022518-like sequence, which is classified in the HERVK family. This sequence is present in several disease-regulatory coding genes and shows high population diversity together with conservation across primates. This study further explored the pathogenicity and regulatory roles of these putative viral sequences and their impact on life processes and phenotype shaping. In essence, the Complex_Region contributes to a deeper understanding of human genetic complexity at the population level, can help advance precision medicine, and provides a guide for subsequent sequence function studies, but it still needs to be further explored and interpreted.

Complex Regions selection by integrating multiple data types

A range of data sources and types were used to ensure that the complex regions identified were sufficiently representative. The selection process included GFA-format screening based on human graph pangenomes (HPRC and CPC), VCF file screening based on the HPRC graph pangenomes constructed by Minigraph-Cactus and PGGB, respectively^{38, 39}, and 1000 Genomes VCF file screening. For the screening of GFA files based on human pangenomes, the HPRC and CPC graph pangenomes constructed via the Minigraph-Cactus process were initially converted to the GFA format using vg convert, and then, together with the HPRC graph pangenome constructed by the PGGB method, the GFA files were further transformed using odgi build. This was followed by a preliminary screening for complex regions, in which we calculated the average mapping depth within 500 bp windows using odgi depth⁴⁰ and bedtools across the three pan-genome datasets. After sorting the depth data in ascending order, we removed the top and bottom 20% to avoid outliers, used mean + 2*SD as the filtering threshold for depths within 500 bp windows, and merged the windows within 15 kb using bedtools merge, calculating the cumulative lengths for each graph pangenome separately. When screening the VCF files from the graph pangenomes constructed by Minigraph-Cactus and PGGB, we retained loci where the alternative allele frequency was not less than 0.3, and then counted the corresponding SNPs within 500 bp windows. We then used a similar filtering approach to obtain the potential regions. For screening based on the 1000 Genomes VCF files, the same filters were used.
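The window-level threshold described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the use of the sample standard deviation, and the reading that windows above the cutoff are the ones retained are ours, not taken from the published pipeline.

```python
from statistics import mean, stdev

def trimmed_cutoff(depths, trim=0.2, n_sd=2.0):
    """Mean + n_sd * SD of the depths left after dropping both tails."""
    ordered = sorted(depths)
    k = int(len(ordered) * trim)               # windows dropped at each end
    core = ordered[k:len(ordered) - k]
    return mean(core) + n_sd * stdev(core)     # sample SD is an assumption

def flag_windows(windows, trim=0.2, n_sd=2.0):
    """windows: (chrom, start, end, mean_depth) tuples for 500 bp bins.

    Returns the coordinates of windows whose depth exceeds the cutoff
    (assumed direction of the filter).
    """
    cutoff = trimmed_cutoff([w[3] for w in windows], trim, n_sd)
    return [w[:3] for w in windows if w[3] > cutoff]

# Toy data: nine ordinary windows and one unusually deep window.
toy = [("chr19", i * 500, (i + 1) * 500, 10.0) for i in range(9)]
toy.append(("chr19", 4500, 5000, 100.0))
flagged = flag_windows(toy)
print(flagged)
```

The same helper could be applied to per-window SNP counts instead of depths, since the text states that a similar filter was used for the VCF-based screens.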

We then combined the potential complex regions obtained by filtering these varied data sources, sorted them by chromosomal position, and merged intervals within 15 kb to form preliminary complex regions. To emphasize longer segments and ensure the broad applicability of the identified complex regions in biomedical research, this study retained regions exceeding 450 kb as the final complex regions. To increase their representativeness, the longest sequences from the Y and M chromosomes were also included, creating approximately 248.468 Mb of complex regions. To illustrate the screening process described above, we visualized the average depth (HPRC and CPC) and variant count (1000 Genomes) within 500 bp windows of these complex regions on chromosome 19.
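As a minimal sketch of the merging and length filtering described above, the following Python functions pool intervals from several sources, merge neighbours separated by at most 15 kb, and keep the long merged regions. The function names are hypothetical, and treating the 450 kb limit as inclusive is an assumption.

```python
def merge_within(intervals, max_gap=15_000):
    """Merge intervals on the same chromosome separated by at most max_gap bp.

    intervals: iterable of (chrom, start, end); the input need not be sorted.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start - last[2] <= max_gap:
            last[2] = max(last[2], end)        # extend the current block
        else:
            merged.append([chrom, start, end]) # start a new block
    return [tuple(block) for block in merged]

def keep_long(intervals, min_len=450_000):
    """Keep intervals of at least min_len bp (inclusive limit is an assumption)."""
    return [iv for iv in intervals if iv[2] - iv[1] >= min_len]

# Toy candidate windows pooled from different data sources.
candidates = [
    ("chr1", 310_000, 500_000),
    ("chr1", 0, 300_000),
    ("chr1", 600_000, 700_000),
    ("chr2", 0, 100),
]
preliminary = merge_within(candidates)
final_regions = keep_long(preliminary)
print(final_regions)
```

On real data the merging step is performed by bedtools merge; the sketch only makes the gap and length logic explicit.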

Uniqueness sequence characterization within Complex Regions

To assess the sequence specificity of the identified complex regions, we first obtained potential non-complex regions within the human genome using bedtools subtract, and then used bedtools shuffle to generate non-complex regions with a distribution similar to that of the complex regions. The corresponding DNA sequences of these regions were extracted from the hg38 reference by coordinates using bedtools getfasta and transformed into numerical 6-mer feature vectors using CountVectorizer. We standardized the resulting sparse matrix using StandardScaler without subtracting the mean of each feature, to maintain sparsity, and then used SparsePCA for dimensionality reduction and visualization.
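To make the feature construction concrete, the sketch below counts overlapping 6-mers and rescales each 6-mer column by its standard deviation without centering, so that zero counts stay zero. It is a pure-Python stand-in written for illustration, not the scikit-learn code used in the study; the function names, the choice to skip k-mers containing non-ACGT characters, and the handling of constant columns (left unscaled) are assumptions.

```python
from collections import Counter
from math import sqrt

def kmer_counts(seq, k=6):
    """Count overlapping k-mers, skipping any that contain non-ACGT characters."""
    seq = seq.upper()
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(km for km in kmers if set(km) <= set("ACGT"))

def scale_without_centering(rows):
    """Divide each k-mer column by its population SD; zeros stay zero (sparse)."""
    n = len(rows)
    vocabulary = sorted({km for row in rows for km in row})
    scaled = [dict() for _ in rows]
    for km in vocabulary:
        column = [row.get(km, 0) for row in rows]
        mu = sum(column) / n
        sd = sqrt(sum((x - mu) ** 2 for x in column) / n)
        for out, x in zip(scaled, column):
            if x:
                out[km] = x / sd if sd else float(x)   # constant column: unscaled
    return scaled

rows = [kmer_counts("AAAAAAA"), kmer_counts("AAAAAAAAA")]
features = scale_without_centering(rows)
print(features)
```

Skipping the mean subtraction matters because centering would turn every zero entry into a non-zero value and destroy the sparsity of the 4^6-column matrix.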

Next, following the strategy described in a recent publication⁴¹, we created the Complex_Region dataset and the Random dataset. We then fine-tuned the DNA_bert_6 and human_gpt2-v1 models for binary and multi-class classification of DNA sequences using the Hugging Face interface. In the binary classification task, the Complex_Region dataset is used to distinguish sequences from complex regions and non-complex regions, whereas the Random dataset is used to classify sequences from random genomic regions that are distributed similarly to the Complex_Region dataset. In the four-class classification task, the Complex_Region dataset additionally distinguishes whether a sequence is located in a coding region, in combination with the above labels, whereas the Random dataset is used to classify randomly labeled genomic sequences.

Sequence characterization of identified Complex Regions

After obtaining the complex regions, we quantified the longest segment and cumulative length on each chromosome, and used the CMRG_v1.00_HG002 clinical gene set regions as a reference for visual display and statistics. To further delineate the diversity and repetitiveness of sequences within the complex regions, we extracted the potential sequences of each interval from the HPRC pan-genome using odgi extract and transformed them into FASTA sequences using odgi paths after optimization and sorting by odgi sort. Sequences exceeding 10 kb were then retained using seqkit seq, and pgr-tk's pgr-pbundle-decomp⁴² was used to obtain the primary components, which in turn were used to construct the GFA graph files of the corresponding regions. Additionally, we applied a simple diffusion model to each extracted complex region sequence using the pgrtk.compute_graph_diffusion_entropy function to obtain the sequence diversity entropy and node weights, and used the mean of the last 32 elements of the weight list as an assessment of sequence repeatability. Finally, we visualized the relationship between the sequence diversity entropy value and its repeatability for all complex regions. Furthermore, by integrating data from UCSC, GENCODE, and pertinent literature, we calculated the proportions of cumulative sequence lengths in complex regions relative to the entire genome and displayed them after calculating the enrichment rates.

Enrichment of overlapping gene functions in Complex Regions

Using the hg38 reference genome annotations, we used bedtools intersect to extract annotations for genes overlapping with complex regions (≥ 150 bp) for further analysis. We then used Metascape⁴³ to perform functional enrichment and DisGeNET-based disease-related analysis on the selected protein-coding gene set. Concurrently, we evaluated the functional enrichment and disease characteristics of genes in complex and non-complex regions on each chromosome via Metascape, to further verify the functional specificity of genes in complex regions.

Subsequently, the genes within the complex regions were categorized into protein-coding (Protein_coding) and all genes (All), with proportions calculated on the basis of chromosomal distribution against the hg38 reference. Chromosomes were labeled as (chr1, chr16, chr11, chr6, chr19, chr2, others). Functional enrichment and DisGeNET-related analyses were conducted for the protein-coding gene sets on each chromosome using Metascape, with the results displayed according to the new chromosome labels. Furthermore, selected immune and metabolic genes within the complex regions were analyzed for gene-disease associations via DisGeNET.

Genetic phenotypic signals and somatic mutation enrichment in Complex Regions

We initially downloaded the GWAS summary files from the UCSC website, identifying 10,376 phenotypes with 403,362 signals across the entire genome, of which 3,975 phenotypes comprising 38,859 signals were located within complex regions. To quantify the prevalence of trait enrichment in complex regions versus the whole genome, we first counted the cumulative Trait_num for all the signal sites associated with the 3,975 phenotypes. Subsequently, we normalized this count by the length of the corresponding region (Len), resulting in Trait_num/Len. For each of the 3,975 phenotypes, we calculated the enrichment ratio (ER) by comparing the Trait_num/Len of the complex regions to that at the genome-wide level. To determine which phenotypes were enriched in the complex regions, we performed a hypergeometric distribution test for each phenotype with the following formula: phyper(N-1, M, Total_num-M, Total_case, lower.tail = F), where N represents the number of signals for that phenotype in the complex regions (Trait_num1), M represents the number at the genome level (Trait_num2), Total_num is the total of 403,362 association signals across the genome, and Total_case is the 38,859 signals within the complex regions. We applied the Benjamini-Hochberg method for P-value correction, classifying significant phenotypes into three major categories: immune system and cellular biology, metabolism and biochemistry, and health status and other biometrics. Phenotypes with an ER greater than five and an adjusted P-value less than 0.05 were visualized using EnhancedVolcano plots. Furthermore, GWAS signal loci for diseases such as rheumatoid arthritis (RA) and human immunodeficiency virus (HIV) infection were identified at both the genome-wide and complex region levels to demonstrate that the complex regions are adequately representative.
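For readers who do not use R, the enrichment ratio, the upper-tail hypergeometric probability that phyper(N-1, M, Total_num-M, Total_case, lower.tail = F) returns, and the Benjamini-Hochberg adjustment can be sketched in Python as follows. The function names and the example counts are illustrative assumptions; only the two totals (403,362 and 38,859) come from the text.

```python
from math import exp, lgamma

def enrichment_ratio(trait_num1, len1, trait_num2, len2):
    """ER = (Trait_num1 / Len1) / (Trait_num2 / Len2)."""
    return (trait_num1 / len1) / (trait_num2 / len2)

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def hypergeom_upper_tail(n_hits, m_white, n_black, draws):
    """P(X >= n_hits) for X ~ Hypergeometric(m_white, n_black, draws)."""
    lo = max(n_hits, draws - n_black, 0)
    hi = min(m_white, draws)
    if lo > hi:
        return 0.0
    denom = log_comb(m_white + n_black, draws)
    logs = [log_comb(m_white, x) + log_comb(n_black, draws - x) - denom
            for x in range(lo, hi + 1)]
    peak = max(logs)                           # log-sum-exp for stability
    return min(1.0, exp(peak) * sum(exp(v - peak) for v in logs))

def benjamini_hochberg(pvalues):
    """Return BH-adjusted P-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted, running = [0.0] * n, 1.0
    for rank in range(n, 0, -1):               # walk from largest to smallest
        i = order[rank - 1]
        running = min(running, pvalues[i] * n / rank)
        adjusted[i] = running
    return adjusted

# Hypothetical phenotype: 30 of its 100 genome-wide signals fall in complex regions.
p = hypergeom_upper_tail(30, 100, 403_362 - 100, 38_859)
print(p, benjamini_hochberg([p, 0.2, 0.6]))
```

Working in log space avoids overflow in the binomial coefficients, which become astronomically large for totals of this size.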

Using a similar computational approach, we analyzed the downloaded pan-cancer data for 33 tumor types, calculating the Trait_num and ER of signals within genome-wide and complex regions. We applied the hypergeometric distribution test phyper(N-1, M, Total_num-M, Total_case, lower.tail = F), where N is the tumor signal count Trait_num1 in the complex regions, M is the tumor signal count Trait_num2 at the genome level, Total_num is the 3,038,556 tumor association signals at the genome-wide level, and Total_case is the 325,537 tumor association signals in the complex regions. Next, we used EnhancedVolcano for volcano plot visualization of the enrichment results with the pan-cancer type as the label, and selected gastric adenocarcinoma (STAD) and renal clear cell carcinoma (KIRC) as the most significantly enriched cancers. In addition, the genes with the most mutation types in STAD and KIRC were displayed after statistical analysis by maftools⁴⁴, and the most mutated genes in each cancer were selected for survival analysis using RNA sequencing data.

Enrichment of putative viral sequences within Complex Regions

To comprehensively analyze the viral sequences within complex regions, we adopted a dual alignment strategy at both the human endogenous retrovirus (HERV) and whole-genome levels. At the HERV level, we constructed a virus sequence database using makeblastdb, extracted the HERV sequences from the reference hg38 using bedtools getfasta, and used bedtools intersect to isolate structural variant (SV) sequences within the HERV regions from the 1000 Genomes cohort. We then aligned the HERV sequences from hg38 and from 1000 Genomes to the viral database using blastn. At the whole-genome level, SVs from the 1000 Genomes data were converted into sequences and merged with the reference hg38, a corresponding database was constructed using makeblastdb, and the Virushostdb sequences were compared against this database. From these alignment results, we retained sequences longer than 30 bp with over 95% similarity and categorized them based on the viral family information in the Virushostdb database to obtain a potential viral dataset, including both HERV and other aligned viral sequences. Using this filtered viral dataset, we calculated the non-redundant cumulative alignment lengths for specific regions (hg38, Complex_Region, HERV_Region and HVIDB_Region), and showed the distribution of potential viral alignment lengths on different chromosomes. We also assessed the enrichment ratios of potential aligned viral sequences within each specific region (Complex_Region, HERV_Region, HVIDB_Region), based on the proportion of viral sequences on different chromosomes of hg38. Finally, we displayed all potential viral alignment sequences within the Complex_Region.
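The hit filtering and the non-redundant length calculation can be illustrated with the short Python sketch below. It assumes the alignments were written in the standard 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and that lengths are accumulated on the subject coordinates. Both points, like the function names, are assumptions rather than details given in the text.

```python
def parse_hits(lines, min_len=30, min_identity=95.0):
    """Keep hits longer than min_len bp with identity above min_identity percent.

    Returns (subject, start, end) tuples in 0-based half-open coordinates.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        pident, length = float(f[2]), int(f[3])
        if length > min_len and pident > min_identity:
            s, e = sorted((int(f[8]), int(f[9])))   # minus-strand hits are reversed
            hits.append((f[1], s - 1, e))           # BLAST coordinates are 1-based
    return hits

def nonredundant_length(hits):
    """Total number of bases covered by at least one hit (union of intervals)."""
    total = 0
    cur_name, cur_start, cur_end = None, 0, 0
    for name, start, end in sorted(hits):
        if name != cur_name or start > cur_end:
            total += cur_end - cur_start            # close the previous block
            cur_name, cur_start, cur_end = name, start, end
        else:
            cur_end = max(cur_end, end)             # overlapping hit: extend
    return total + (cur_end - cur_start)

# Toy alignment records; the identifiers are made up.
toy = [
    "v1\tchr1\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-40\t180",
    "v2\tchr1\t97.0\t100\t3\t0\t1\t100\t150\t51\t1e-38\t170",
    "v3\tchr1\t90.0\t200\t20\t0\t1\t200\t500\t699\t1e-30\t150",
    "v4\tchr2\t99.0\t25\t0\t0\t1\t25\t10\t34\t1e-05\t40",
]
kept = parse_hits(toy)
print(kept, nonredundant_length(kept))
```

Taking the union of intervals, rather than summing hit lengths, prevents overlapping alignments from different viral references from being counted twice.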

Characteristics and species distribution of the NC_022518 Sequence

To comprehensively analyze the NC_022518 sequence within the complex regions, we first used nucmer to align it against each sequence within the reference hg38, and the alignment relationships of the corresponding regions were shown by show-coords and used as controls⁴⁵. Sequences subsequently extracted from both the Complex_Region and the Virus regions (including NC_022518) were aligned back to hg38 to obtain chromosomal sequence similarity. The alignment results were then grouped into four categories (10–50 kb, 50–100 kb, 100–500 kb, and ≥ 500 kb) and visualized according to this classification. To explore the characteristics of NC_022518-like sequences in humans, we identified ten protein-coding genes containing this region. We then assessed their functions, pathways, and cell or tissue specificity using the GenDoma database, and explored their corresponding population and primate sequence features using UCSC. To elucidate the relationships between similar sequences among these genes, sequences highly similar to NC_022518 were extracted with bedtools getfasta, aligned with mafft, and used to construct an evolutionary tree with iqtree^{46, 47}, which was then visualized.

To explain the divergence of NC_022518 in primates and humans, we identified its longest protein-coding TRPC6 region (chr11:101695064–101704528) and a similar fragment within the lncRNA gene ENSG00000286016 (chr12:127153658–127168021) in the hg38 genome, and analyzed the sequences within these regions in depth. First, we extracted the sequence fragments from primates, the human pan-genome, and hg38 according to the above coordinates. For sequence extraction in primates, we used minimap2 to align the fragments within ENSG00000286016 and chromosome 12 to the primate genome respectively, and then extracted the corresponding sequences from the aligned regions. For human pan-genome sequence extraction, the above two region sequences in HPRC and CPC were extracted separately by odgi extract, and the corresponding sequences were obtained after odgi sort and odgi paths conversion. For hg38, we used bedtools getfasta to retain the potential protein-coding sequence and the sequence within ENSG00000286016. We then pooled these sequences, filtered out those with more than 80% non-A/G/C/T bases, and aligned the remaining sequences with mafft to obtain a multiple sequence alignment. To further characterize the degree of similarity between sequences, we used pgrtk.get_shmmr_pairs_from_seq to extract the key information from the multiple sequence alignment results, and combined it with the sample annotation information to perform PCA dimensionality reduction and visualization. Meanwhile, we also constructed evolutionary trees using iqtree and ggtree to show the inter-sequence relationships. Furthermore, we aligned the NC_022518 sequence to the NCBI non-redundant nucleotide database, statistically analyzed the alignment results, and constructed the corresponding species sequence trees.
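The base-composition filter applied before multiple sequence alignment is simple enough to sketch directly. The helper names and the toy records below are hypothetical, and the sketch assumes that a sequence with exactly 80% non-A/G/C/T bases is kept.

```python
def non_acgt_fraction(seq):
    """Fraction of characters that are not A, C, G or T (case-insensitive)."""
    if not seq:
        return 1.0                      # treat an empty record as uninformative
    return sum(base not in "ACGT" for base in seq.upper()) / len(seq)

def filter_records(records, max_fraction=0.8):
    """Drop records whose non-ACGT fraction exceeds max_fraction."""
    return {name: seq for name, seq in records.items()
            if non_acgt_fraction(seq) <= max_fraction}

# Toy records standing in for extracted primate and pan-genome fragments.
records = {
    "sample_A": "ACGTACGTAC",           # 0% ambiguous
    "sample_B": "NNNNNNNNNA",           # 90% ambiguous, removed
    "sample_C": "acgtNNNNNN",           # 60% ambiguous, kept
}
kept = filter_records(records)
print(sorted(kept))
```

Filtering before alignment keeps records dominated by gaps or N runs from distorting the mafft alignment and the downstream tree.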

Chromosomal distribution, regulatory networks, and pathogenicity prediction of putative viral regions

Using bedtools intersect, we extracted sequences from specified regions (LTR_Repeat, Virus, Immunogenetic, HVIDB) within the complex regions and displayed them after counting them separately by chromosome. To assess the immune characteristics of these sequences, we first used bedtools intersect to count overlapping sequences between endogenous and other aligned viral sequences (Virus), human-virus interaction region sequences (HVIDB), and human immune region sequences (Immunogenetic). We selected the genes shared among these three groups and used the Metascape website to analyze the protein interaction networks and visualize the core gene modules. For the pathogenicity assessment of specific regions, we first extracted single-base pathogenicity scores predicted by PrimateAI and AlphaMissense^{48, 49}, then used bedtools map to calculate the average pathogenicity scores of the desired regions (Complex_Region, Virus, Immunogenetic, HVIDB), and presented them at the chromosome level. Moreover, we employed a similar strategy to calculate the pathogenicity scores of overlapping genes and showed the similarity between the pathogenicity scores and the functional enrichment results of these genes.
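The region-level averaging performed with bedtools map can be mirrored by the following Python sketch, which averages all per-base scores falling inside each region. It is an illustration only: the data layout (0-based positions, half-open regions), the function name, and the toy scores are assumptions.

```python
from bisect import bisect_left
from collections import defaultdict
from itertools import accumulate

def mean_scores(regions, scores):
    """Average per-base scores inside each region.

    regions: (chrom, start, end), 0-based half-open.
    scores:  (chrom, pos, score), 0-based positions.
    Returns (chrom, start, end, mean or None when no score overlaps).
    """
    by_chrom = defaultdict(list)
    for chrom, pos, score in scores:
        by_chrom[chrom].append((pos, score))
    index = {}
    for chrom, items in by_chrom.items():
        items.sort()
        positions = [p for p, _ in items]
        prefix = [0.0] + list(accumulate(s for _, s in items))  # running sums
        index[chrom] = (positions, prefix)
    result = []
    for chrom, start, end in regions:
        positions, prefix = index.get(chrom, ([], [0.0]))
        lo, hi = bisect_left(positions, start), bisect_left(positions, end)
        n = hi - lo
        avg = (prefix[hi] - prefix[lo]) / n if n else None
        result.append((chrom, start, end, avg))
    return result

# Toy pathogenicity scores and two query regions.
scores = [("chr6", 10, 0.2), ("chr6", 20, 0.8), ("chr6", 100, 0.5)]
regions = [("chr6", 0, 50), ("chr6", 200, 300)]
summary = mean_scores(regions, scores)
print(summary)
```

Prefix sums with binary search keep each region lookup fast, which matters when the per-base score tables contain millions of positions.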

Pathogenicity and intrabody regulation of putative viral variants

To analyze putative viral mutations within specified genomic regions (Virus, HVIDB, Immunogenetic), we initially employed bedtools intersect to extract small variants and SVs, followed by a chromosomal distribution analysis of these variants. Subsequently, we extracted GWAS association results to delineate the characteristics of virus-related variants. Using the GSEL toolkit⁵⁰, we identified several evolutionary selection indicators in GWAS association signals, including signals for natural selection and adaptive differentiation (XP-EHH; iES; GERP), as well as measures for genetic diversity and population differentiation (Fst) and conservation assessments (PhastCons; phyloP100). Furthermore, we calculated the pathogenicity scores of all identified viral variants and classified them into four pathogenicity score ranges, (0,0.3], (0.3,0.5], (0.5,0.7], and (0.7,1], and then counted and displayed the number of variants in each range. To ensure the reliability of our findings, we repeatedly verified that these variants were located in the HERV region, and then selected the variants with scores above 0.7 for conversion to the reference hg19 using liftover⁵¹. We employed the 3DSNP database to perform regulatory analysis on these high-pathogenicity variants. Finally, we describe and discuss the functional regions (Enhancer, Promoter, TFBS, Motif) and intra-population frequencies of these variants, and explore in detail the in vivo regulation of some high-frequency variants and their associated diseases.
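The right-closed score ranges and the 0.7 cutoff can be expressed compactly, as in the sketch below. The variant identifiers and scores are invented for illustration, and the function names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

EDGES = [0.0, 0.3, 0.5, 0.7, 1.0]
LABELS = ["(0,0.3]", "(0.3,0.5]", "(0.5,0.7]", "(0.7,1]"]

def score_bin(score):
    """Return the right-closed range containing score, or None if outside (0, 1]."""
    if not 0.0 < score <= 1.0:
        return None
    return LABELS[bisect_left(EDGES, score) - 1]

def count_bins(scores):
    """Count how many scores fall in each range, ignoring out-of-range values."""
    counts = Counter(score_bin(s) for s in scores)
    counts.pop(None, None)
    return dict(counts)

def high_pathogenicity(variants, cutoff=0.7):
    """variants: dict of variant id -> score; keep those strictly above cutoff."""
    return sorted(v for v, s in variants.items() if s > cutoff)

# Made-up variant scores; the identifiers are placeholders.
variants = {"var1": 0.10, "var2": 0.30, "var3": 0.31,
            "var4": 0.70, "var5": 0.90, "var6": 1.00}
print(count_bins(variants.values()), high_pathogenicity(variants))
```

Using bisect_left places a score that equals an edge in the lower range, which matches the right-closed notation used in the text.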

HERV

Human endogenous retrovirus

GWAS

Genome-wide association studies

DNase_Clusters

DNase I hypersensitive site regions

SDRs

structurally variable regions

LTR_Repeat

long terminal repeat retrotransposons

HVIDB

human-virus PPI database zones

SV

structural variant

Generative AI in scientific writing

AI-assisted technologies have been employed to increase readability and improve language proficiency. Nevertheless, the final results were exclusively produced by the authors, who meticulously edited the language to conform to domain terminology. Consequently, we are wholly responsible and accountable for the content of this study.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing Interest

None declared.

Funding

This work was supported by the Peak Disciplines (Type IV) of Institutions of Higher Learning in Shanghai.

Author Contribution

Conceptualization: DD, FZ, LL; Formal Analysis: DD; Funding Acquisition: LL; Investigation: DD, ML; Methodology: DD, ML; Project Administration: FZ, LL; Resources: LL; Supervision: FZ, LL; Useful Suggestions: WZ, FZW, XC; Visualization: DD, CYZ; Writing – Original Draft: DD, XL; Writing – Review & Editing: DD, FZ, LL.

Acknowledgement

This work was supported by the Peak Disciplines (Type IV) of Institutions of Higher Learning in Shanghai. We thank all the authors for their hard work. This work was supported by the Medical Research Data Center of Fudan University.

Data Availability

All relevant raw data supporting the key findings of this study are available within the article and its Supplementary Information page. All custom analytical pipeline codes used in this work are available at https://github.com/GeorgeBGM/Complex-genome_analysis.git.

Momozawa Y, Mizukami K. Unique roles of rare variants in the genetics of complex diseases in humans. J Hum Genet. 2021;66:11–23.
Pinero J, et al. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res. 2020;48:D845–55.
Zhao Z, Fu YX, Hewett-Emmett D, Boerwinkle E. Investigating single nucleotide polymorphism (SNP) density in the human genome and its implications for molecular evolution. Gene. 2003;312:207–13.
Campbell CD, Eichler EE. Properties and rates of germline mutations in humans. Trends Genet. 2013;29:575–84.
Middleton D, Gonzelez F. The extensive polymorphism of KIR genes. Immunology. 2010;129:8–19.
Nurk S, et al. The complete sequence of a human genome. Science. 2022;376:44–53.
Guo Y, Feng X, Li H. Evaluation of haplotype-aware long-read error correction with hifieval. Bioinformatics. 2023;39.
Abraham A, LaBella AL, Capra JA, Rokas A. Mosaic patterns of selection in genomic regions associated with diverse human traits. PLoS Genet. 2022;18:e1010494.
Buxton DS, Batten DJ, Crofts JJ, Chuzhanova N. Predicting novel genomic regions linked to genetic disorders using GWAS and chromosome conformation data - a case study of schizophrenia. Sci Rep. 2019;9:17940.
Ranz J, Clifton B. Characterization and evolutionary dynamics of complex regions in eukaryotic genomes. Sci China Life Sci. 2019;62:467–88.
Mao Y, et al. Structurally divergent and recurrently mutated regions of primate genomes. Cell. 2024;187:1547–1562.e13.
Liao WW, et al. A draft human pangenome reference. Nature. 2023;617:312–24.
Gao Y, et al. A pangenome reference of 36 Chinese populations. Nature. 2023;619:112–21.
Wagner J, et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat Biotechnol. 2022;40:672–80.
Monroe JG, et al. Mutation bias reflects natural selection in Arabidopsis thaliana. Nature. 2022;602:101–5.
Vollger MR, et al. Increased mutation and gene conversion within human segmental duplications. Nature. 2023;617:325–34.
Thurman RE, et al. The accessible chromatin landscape of the human genome. Nature. 2012;489:75–82.
Chen A, Chen D, Chen Y. Advances of DNase-seq for mapping active gene regulatory elements across the genome in animals. Gene. 2018;667:83–94.
Zeller T, et al. Dual checkpoint blockade of CD47 and LILRB1 enhances CD20 antibody-dependent phagocytosis of lymphoma cells by macrophages. Front Immunol. 2022;13:929339.
Zarantonello A, Revel M, Grunenwald A, Roumenina LT. C3-dependent effector functions of complement. Immunol Rev. 2023;313:120–38.
Zhang T, et al. NLRP2 in health and disease. Immunology. 2024;171:170–80.
Fujikura Y, et al. Ketogenic diet containing medium-chain triglyceride ameliorates transcriptome disruption in skeletal muscles of rat models of duchenne muscular dystrophy. Biochem Biophys Rep. 2022;32:101378.
Hariharan R, Mousa A, de Courten B. Influence of AMY1A copy number variations on obesity and other cardiometabolic risk factors: A review of the evidence. Obes Rev. 2021;22:e13205.
Zhang J, et al. Comprehensive analysis of the glutathione S-transferase Mu (GSTM) gene family in ovarian cancer identifies prognostic and expression significance. Front Oncol. 2022;12:968547.
Chang M, He L, Cai L. An Overview of Genome-Wide Association Studies. Methods Mol Biol. 2018;1754:97–108.
Mihara T, et al. Linking Virus Genomes with Host Taxonomy. Viruses. 2016;8:66.
Guarracino A, et al. Recombination between heterologous human acrocentric chromosomes. Nature. 2023;617:335–43.
Jern P, Sperber GO, Blomberg J. Use of endogenous retroviral sequences (ERVs) and structural markers for retroviral phylogenetic inference and taxonomy. Retrovirology. 2005;2:50.
Wang J, Lu X, Zhang W, Liu GH. Endogenous retroviruses in development and health. Trends Microbiol. 2024;32:342–54.
Jo EK. Interplay between host and pathogen: immune defense and beyond. Exp Mol Med. 2019;51:1–3.
Kang SH, Sun YD, Atallah OO, Huguet-Tapia JC, Noble JD, Folimonova SY. A Long Non-Coding RNA of Citrus tristeza virus: Role in the Virus Interplay with the Host Immunity. Viruses. 2019;11.
Jakobsson J, Vincendeau M. SnapShot: Human endogenous retroviruses. Cell. 2022;185:400–400.e1.
Yonezawa Y, et al. Identification of a Functional Susceptibility Variant for Adolescent Idiopathic Scoliosis that Upregulates Early Growth Response 1 (EGR1)-Mediated UNCX Expression. J Bone Min Res. 2023;38:144–53.
Radoux-Mergault A, Oberhauser L, Aureli S, Gervasio FL, Stoeber M. Subcellular location defines GPCR signal transduction. Sci Adv. 2023;9:eadf6059.
Sewgobind NV, Albers S, Pieters RJ. Functions and Inhibition of Galectin-7, an Emerging Target in Cellular Pathophysiology. Biomolecules. 2021;11.
Chen Y, et al. The regulation of DOCK family proteins on T and B cells. J Leukoc Biol. 2021;109:383–94.
Dolzhenko E, et al. Characterization and visualization of tandem repeats at genome scale. Nat Biotechnol. 2024.
Hickey G, et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat Biotechnol. 2024;42:663–73.
Garrison E, et al. Building pangenome graphs. bioRxiv. 2023.
Guarracino A, Heumos S, Nahnsen S, Prins P, Garrison E. ODGI: understanding pangenome graphs. Bioinformatics. 2022;38:3319–26.
Du D, Zhong F, Liu L. Enhancing Recognition and Interpretation of Functional Phenotypic Sequences through Fine-Tuning Pre-Trained Genomic Models. bioRxiv. 2023:2023.12.05.570173.
Chin CS, et al. Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes. Nat Methods. 2023;20:1213–21.
Zhou Y, et al. Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nat Commun. 2019;10:1523.
Mayakonda A, Lin DC, Assenov Y, Plass C, Koeffler HP. Maftools: efficient and comprehensive analysis of somatic variants in cancer. Genome Res. 2018;28:1747–56.
Riva G, Mauri M. MuMMER: How Robotics Can Reboot Social Interaction and Customer Engagement in Shops and Malls. Cyberpsychol Behav Soc Netw. 2021;24:210–1.
Nguyen LT, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol. 2015;32:268–74.
Katoh K, Rozewicki J, Yamada KD. MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization. Brief Bioinform. 2019;20:1160–6.
Gao H, et al. The landscape of tolerated genetic variation in humans and primates. Science. 2023;380:eabn8153.
Cheng J, et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science. 2023;381:eadg7492.
Abraham A, Labella AL, Benton ML, Rokas A, Capra JA. GSEL: a fast, flexible python package for detecting signatures of diverse evolutionary forces on genomic regions. Bioinformatics. 2023;39.
Park KJ, Yoon YA, Park JH. Evaluation of Liftover Tools for the Conversion of Genome Reference Consortium Human Build 37 to Build 38 Using ClinVar Variants. Genes (Basel). 2023;14.

No competing interests reported.
