The bulk of the literature supports the hypothesis that recombination hotspots rarely, if ever, occur at the same locus in human and chimpanzee (Auton et al., 2012; Myers et al., 2010; Ptak et al., 2005; Ptak et al., 2004; Wall et al., 2003; Winckler et al., 2005). An isolated report of shared hotspot loci between these two species was at the β-globin and HLA regions on chromosome 21, based on high Bayes factors for shared hotspots at locations within both regions (Wang and Rannala, 2014).
Here, we identified thousands of colonies of two-repeat units of various CG-rich trinucleotides in human, several of which coincided with dynamic and species-specific formulas in several other primates and, in a number of instances, in species as phylogenetically distant as mouse. Our proposed models for the evolution of the colonies (exemplified in C266), the presence of pure units and of units that were overlaps of those pure units, and the fact that the common elements across the colonies were the two-repeat units together indicate that the main reason for the hotspot events in those colonies is the two-repeat units themselves, and not their flanking sequences.
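The notion of a colony of two-repeat units can be illustrated with a minimal sketch. It is not the detection pipeline used in this study: it assumes that a "CG-rich trinucleotide" is any trinucleotide made only of C and G, and the function names, the grouping distance `max_gap`, and the minimum unit count `min_units` are hypothetical parameters chosen for illustration.

```python
import re
from itertools import product


def cg_two_repeat_units():
    """Return all 6-mers made of a C/G-only trinucleotide repeated twice.

    Assumption: 'CG-rich' is read here as 'composed only of C and G',
    which gives 8 trinucleotides (including CCC and GGG).
    """
    return ["".join(tri) * 2 for tri in product("CG", repeat=3)]


def find_units(seq, units):
    """Return sorted (start, unit) hits; the lookahead allows overlapping hits."""
    hits = []
    for unit in units:
        for match in re.finditer(f"(?={unit})", seq):
            hits.append((match.start(), unit))
    return sorted(hits)


def group_into_colonies(hits, max_gap=500, min_units=2):
    """Group hits whose consecutive start positions lie within max_gap bases.

    max_gap and min_units are hypothetical thresholds, not values from the study.
    """
    colonies, current = [], []
    for pos, unit in hits:
        # Close the current group when the next hit is too far away.
        if current and pos - current[-1][0] > max_gap:
            if len(current) >= min_units:
                colonies.append(current)
            current = []
        current.append((pos, unit))
    if len(current) >= min_units:
        colonies.append(current)
    return colonies


if __name__ == "__main__":
    toy = "AT" + "CGGCGG" + "A" * 10 + "GCCGCC" + "T" * 1000 + "CGGCGG"
    hits = find_units(toy, cg_two_repeat_units())
    for colony in group_into_colonies(hits):
        print(colony)
```

In this toy sequence the first two units fall within the assumed distance and form one group, while the distant third unit is left out because it stands alone.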
Each species that we studied had its own complex formula for those colonies, which resulted from dynamic and specific recombination, conversion, and probably other rearrangement events in those colonies in that species. Not only did the absolute number and the combination of the two-repeat units differ across the selected species (Table 2), but the distribution of those units between the pure and overlapping unit compartments also differed, adding further levels of complexity to the genomic events in those colonies.
Recombination hotspots are regions in a genome that exhibit elevated recombination rates relative to a neutral expectation. The exceedingly significant Poisson probability values associated with the detected colonies argue against a neutral hypothesis. A large body of evidence suggests that recombination and gene conversion correlate with CpG density, GC%, repetitive elements, and the neutral mutation rate in many eukaryotes, including mammals such as human, with consequences for genomic landscapes, molecular evolution, and functional genomics (Duret and Galtier, 2009; Fullerton et al., 2001; Jensen-Seaman et al., 2004). In line with these premises, GC content has been shown to elevate mutation and recombination rates in several organisms, such as the yeast Saccharomyces cerevisiae (Kiktev et al., 2018; Marsolier-Kergoat and Yeramian, 2009) and Poecilia reticulata (Charlesworth et al., 2020). In our study, the extensive complexity of the identified CG-rich colonies aligns with previous reports that crossovers are associated with mutation and biased gene conversion at recombination hotspots (Arbeithuber et al., 2015).
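The reasoning behind a Poisson test against a neutral expectation can be sketched as follows. This is an illustrative calculation only, assuming a null model in which units are scattered uniformly along the genome; the function names and the example numbers are hypothetical and are not values from this study.

```python
import math


def expected_units(genome_unit_count, genome_length, window_length):
    """Expected number of units in a window under a uniform (neutral) null.

    Assumption: units occur at a constant genome-wide density.
    """
    return genome_unit_count * window_length / genome_length


def poisson_upper_tail(k, lam, max_terms=10_000):
    """P(X >= k) for X ~ Poisson(lam), summed directly over the upper tail.

    Summing the tail in log space avoids the loss of precision of
    1 - CDF when the probability is extremely small.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, k + max_terms):
        term = math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        total += term
        # Past the mode the terms shrink; stop once they no longer matter.
        if i > lam and term < total * 1e-17:
            break
    return min(total, 1.0)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    lam = expected_units(genome_unit_count=300_000,
                         genome_length=3_000_000_000,
                         window_length=1_000)
    print(lam, poisson_upper_tail(20, lam))
```

A window holding many more units than the expected count yields a very small upper-tail probability, which is the sense in which clustered units argue against the neutral hypothesis.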
Although the two-repeat units are repetitive sequences, they do not match the conventional definition of repetitive sequences (Lower et al., 2019). In terms of the density of the events, the colonies do not match segmental duplications. Of note, the shortest reported human segmental duplications, copy number variations, and other genomic rearrangements are estimated to involve ≥ 10 kb of genomic DNA (Bailey et al., 2002; Marques-Bonet and Eichler, 2009; Mehan et al., 2004; Sharp et al., 2005). In general, crossovers occur in regions that are among the most extensive available stretches of identity for a particular pair of mismatched genes (Metzenberg et al., 1991; Pratto et al., 2014), indicating that the known mechanisms involved in crossovers can only partially explain the extent of the events observed in the colonies.
Another possible mechanism that can partially explain the observed colonies is slipped strand mispairing (SSM, also known as replication slippage), a mutation process that occurs during DNA replication. It involves denaturation and displacement of the DNA strands, resulting in the mispairing of the complementary bases. SSM is one explanation for the origin and evolution of repetitive DNA sequences, such as microsatellites (and probably minisatellites), and a mechanism for sequence evolution across species. However, studies indicate that replication is not the only mechanism for the evolution of repetitive sequences, and that recombination is also involved, especially in the case of minisatellites (Buard and Vergnaud, 1994; Jeffreys et al., 1994; Richard and Paques, 2000). In fact, Jeffreys et al. (1988) proposed that small mutations involving gain or loss of 4–10 repeats could occur by a mitotic replication slippage mechanism, while large mutation events involving gain or loss of up to 200 repeats were more compatible with a meiotic recombinational process. The above applies to the identified colonies in our study, many of which consist of tens or even hundreds of repeats.
The evolutionary implications of the colonies may be versatile. For example, C266 is specific to great apes and hosts several piRNAs. piRNAs are a class of non-coding RNAs of approximately 24-35 nucleotides in length that have pivotal roles in preserving the integrity of mammalian germline genomes by silencing transposons (Wang et al., 2023). piRNAs are mainly located in genomic clusters. Recently, it was reported that satellites 150 base pairs in length contain piRNAs, which have a crucial role in the embryonic development of Aedes aegypti (Halbach et al., 2020). piRNAs need to evolve rapidly to keep up with transposable elements (TEs). Therefore, a high mutation rate is required to program the birth and death of piRNAs. We propose that the two-repeat units, and the recombination events and conversions associated with these units, are a novel mechanism for the emergence and propagation of piRNA copy numbers across the genomes through increasing mutation rates at the colonies. Furthermore, the development of these colonies can be considered a defense strategy to suppress or regulate TEs, such as LINEs and SINEs, through DNA methylation. TEs play pivotal roles in the evolution of species (Carotti et al., 2023; Schmitz, 2012) as well as in diseases (Payer and Burns, 2019). LINEs are the most significant fraction of TEs in the human genome. Through target-primed reverse transcription, LINEs can be reverse-transcribed by self-produced ribonucleoprotein machinery to generate complementary DNA (cDNA). Moreover, this machinery can form a complex with the RNA of SINEs to generate cDNA. The generated cDNAs can be randomly integrated into the host genome (Fu et al., 2023). DNA methylation is induced at these regions to inactivate TEs (Alves et al., 2023). Our exemplary models can explain the entangled mechanisms of piRNAs, LINEs, and SINEs. On the one hand, the development of units predisposes the colonies to DNA methylation. On the other hand, the high mutation rate provides a mechanism for generating piRNAs to suppress TEs, such as LINEs and SINEs.
Compared to C266, C96 was composed of more versatile pure and overlapping units and conversions, in line with the more ancient ancestry of this colony, which was shared before the separation of the primate and mouse lineages. Crossing-over and mutation in human C96 led to the emergence of an uncharacterized non-coding RNA in human, which is an example of the yet-to-be-characterized evolutionary implications inside this colony.
The colonies only fractionally overlap with G-quadruplex (G4) structures (Chen and Yang, 2012; Fotsing et al., 2019; Sawaya et al., 2013). G4 structures are DNA tetraplexes that typically form in guanine-rich regions of genomes. Four guanine bases associate with each other through Hoogsteen hydrogen bonds to form a guanine tetrad plane (G-quartet), and two or more G-quartet planes then stack on top of each other to form a G4 structure (Qin and Hurley, 2008). Organisms may have evolutionarily developed G4 into a novel and elaborate transcriptional regulatory mechanism benefiting multiple physiological activities of higher organisms (Kostadinov et al., 2006; Wu et al., 2021).
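For orientation, putative G4-forming sequences are often screened with a simple sequence motif of four guanine runs separated by short loops. The sketch below uses that widely used heuristic (runs of at least three G, loops of 1 to 7 bases) and is an assumption for illustration, not the method used to assess G4 overlap in this study; the function name is hypothetical.

```python
import re

# Heuristic motif: four runs of >= 3 G separated by loops of 1-7 bases.
# The mirrored C-rich pattern captures motifs on the opposite strand.
G4_PLUS = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")
G4_MINUS = re.compile(r"C{3,}[ACGT]{1,7}C{3,}[ACGT]{1,7}C{3,}[ACGT]{1,7}C{3,}")


def putative_g4(seq):
    """Return sorted (start, end, strand) for non-overlapping motif matches.

    A match only indicates sequence potential; it does not show that a
    G4 structure actually forms in vivo.
    """
    seq = seq.upper()
    hits = [(m.start(), m.end(), "+") for m in G4_PLUS.finditer(seq)]
    hits += [(m.start(), m.end(), "-") for m in G4_MINUS.finditer(seq)]
    return sorted(hits)


if __name__ == "__main__":
    print(putative_g4("AAGGGTTGGGTTGGGTTGGGAA"))
```

Intervals returned by such a scan could then be intersected with colony coordinates to estimate how much of a colony carries G4 potential.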
C266 and C96 are only examples of the extent of events and their evolutionary relevance. For further studies, the raw data are deposited at: https://figshare.com/articles/dataset/All_possible_CG-rich_trinucleotides/23260562.