The RT2T consortium WGs are made up of volunteers dedicated to sample acquisition and processing, sequencing, assembly, curation, cytogenetics, and annotation, with the aim of preparing the complete genome assemblies for public use. Specific efforts are planned to characterize heterochromatin regions and genomic repeats, non-coding RNA components, small RNAs, centromeres/kinetochores, and TEs. These activities will contribute to the final, annotated, complete genome assemblies. The consortium will make internal decisions on data freezes and the point(s) at which comparative analyses will commence. We recognize that interesting results could be obtained from similar analyses of partial sets of T2T genomes before the full set of species is complete, but such analyses would dilute the impact of the broader studies envisioned. We therefore request that non-consortium researchers refrain from using these assemblies for similar analyses without written permission from the consortium leadership. We welcome requests from interested researchers to participate in the consortium. The WGs and their lead contacts are listed in Table 2, and the specific analyses planned are described below.
Table 2
Working groups, their co-leaders, and co-leader contact information for those who have questions or wish to contribute.
Working Group Name | Working Group Leaders | Contact information |
3D genome architecture | Darren Hagen, Brenda Murdoch | [email protected], [email protected] |
Annotation | Christine Elsik, James Koltes | [email protected], [email protected] |
Assemblers and curation | Ben Rosen, Tim Smith | [email protected], [email protected] |
Chromosome evolution | Rachel O’Neill, Temitayo Olagunju | [email protected], [email protected] |
Comparative methylome | Stephanie McKay, Brenda Murdoch | [email protected], [email protected] |
Cytogenetics | Tamara Potapova | [email protected] |
Immunogenomics | Yana Safonova, Corey Watson | [email protected], [email protected] |
Repeat Annotation | Rachel O'Neill, Jessica Storer | [email protected], [email protected] |
Variant Discovery and Population Sequencing | Robert D. Schnabel, George Liu | [email protected], [email protected] |
The chromosome evolution and 3D genome architecture WGs will undertake comparative analysis of chromosomal and centromere structure and evolution. Eukaryotic genomes are not simply a linear compilation of coding and noncoding sequences; rather, genomes are organized in three dimensions that not only define gene regulatory domains41 and contribute to genome stability42, but also provide signatures of regulatory evolution43 over long evolutionary time frames44–46. Regulatory elements can be positioned up to millions of base pairs away from the genes they regulate47, and several levels of organization can be defined, including chromosome compartments, topologically associating domains (TADs), and loops. Compartments fall into two categories: A compartments, which contain open chromatin, and B compartments, which contain closed chromatin48. Up to 36% of the mammalian genome undergoes compartment changes during development49, and compartments are tissue-specific48. TADs display characteristic self-interactivity, with boundaries frequently marked by the presence of CTCF50. Loops are generally smaller than TADs and have been shown to be a mechanism for regulating gene expression, with disruption of loops causing dysregulation associated with disease phenotypes51.
The 3D genome WG will annotate compartments, TADs, and loops for comparative analyses using chromatin conformation contact assays (Hi-C, Pore-C and/or Micro-C). A previous 3D genome study of carnivore species52 using Hi-C reported broad conservation at the level of whole chromosomes across three families separated by 54 My since the last common ancestor. High consistency of TADs and compartments in liver samples across livestock species including chickens, pigs, and goats has also been reported53. There is little literature describing 3D conformation in and between ruminant species, and tissue and developmental stage-specific aspects of compartments, TADs, and loops complicate comparative studies. The goal of the 3D genome WG is to compare chromatin structure across tissues within and between species where tissue source and developmental stage are similar. This information will be used to annotate the T2T genome assemblies with structural information for all tissues/cell lines used. These analyses will provide additional information for predicting effects of genomic variants, including structural variants and sequence polymorphisms, on phenotype and adaptation. An important contribution of RT2T genome assemblies is to enable a comprehensive analysis of structural variation (SV) among species and within populations. SVs are known to have a significant impact on genome 3D organization and may impact genome function through this reshaping of the 3D structure54–56. The study of the relationship between SVs and genome 3D structure will therefore benefit from the different ruminant T2T genome assemblies.
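As an illustration of how TAD boundaries can be located in a contact matrix, the sketch below computes a simple insulation score, one widely used approach and not necessarily the method the WG will adopt. The function names (`insulation_scores`, `call_boundaries`), the `window` parameter, and the toy 8-bin matrix are all assumptions made for the example; real Hi-C, Pore-C, or Micro-C data would first need binning and normalization.

```python
def insulation_scores(matrix, window=2):
    """Mean contact frequency across each candidate boundary.

    The boundary at index i lies between bins i-1 and i. Its score is the
    mean of contacts between the `window` bins upstream and the `window`
    bins downstream; a low score means the two sides rarely interact.
    """
    n = len(matrix)
    scores = {}
    for i in range(window, n - window + 1):
        upstream = range(i - window, i)
        downstream = range(i, i + window)
        total = sum(matrix[a][b] for a in upstream for b in downstream)
        scores[i] = total / (window * window)
    return scores


def call_boundaries(scores):
    """Return boundary indices whose score is a strict local minimum."""
    return [
        i for i in sorted(scores)
        if i - 1 in scores and i + 1 in scores
        and scores[i] < scores[i - 1] and scores[i] < scores[i + 1]
    ]


# Hypothetical toy matrix: two 4-bin domains, 10 contacts within a domain
# and 1 contact between domains.
toy = [[10 if (a < 4) == (b < 4) else 1 for b in range(8)] for a in range(8)]
scores = insulation_scores(toy, window=2)
print(scores)
print(call_boundaries(scores))
```

In this toy case the score dips at the junction between the two domains, which is the behavior one would look for when comparing boundary positions across tissues or species.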
The chromosome evolution WG will use chromosome-wide aspects of 3D genome assays as well as sequence content to examine the evolution of chromosomes since the last common ancestor of the suborder Ruminantia and the order Artiodactyla. Recent work in model species spanning the metazoan phylogeny (human, mouse, Drosophila, yeast) has shown that TAD boundaries define evolutionarily conserved gene expression patterns and that lineage-specific rearrangements in response to selection are enriched at TAD boundaries57. In a recent study using reconstructed ancestral karyotypes of Artiodactyls, Ruminants, Pecorans, and Bovids, evolutionary breakpoints defining chromosome rearrangements among species were found to be enriched for sequences associated with active or lineage-specific TEs and genes with divergent gene expression patterns58. Thus, TADs and linear chromosome organization are implicated in defining gene expression regulatory patterns, likely by delineating the regions in which genes interact through insulator activities. Although cell-type-specific TAD boundaries within an organism may be variable59, TADs shared across all tissues are stable across cell types and appear enriched for heritability of complex multigenic traits and evolutionary constraint45. Organismal TAD boundaries are linked to chromosome rearrangements, repeat expansions60, and epigenetic signatures (e.g., DNA methylation61 and histone modification62) and are enriched for adaptive structural variants57. While TADs are considered synonymous with regions of conserved synteny and constrained gene regulation, genome organization across mammals beyond model systems is largely unexplored. Moreover, how TAD organization and boundaries impose constraint on adaptation is unknown.
The ability to derive methylation and TAD organizational information from data used in the generation of long-read based genome assemblies affords an unprecedented opportunity in the context of ruminant genome biology and chromosome evolution.
Fixed chromosomal rearrangements among species may be an important driver of species evolution by contributing to species-specific gene regulation patterns, genome organization, selfish element activity, recombination patterns, and faithful Mendelian inheritance. Ruminant species carry the broadest range of chromosome complements among mammals, ranging in chromosome number from the smallest of any mammal, 2n = 6/7 in the Indian muntjac30, to 2n = 70 found in many deer species. In this regard, many ruminant species are distinguished by extensive chromosome rearrangements (fissions, fusions, translocations, centric shifts), multiple sex chromosome complements (e.g., XX/XYY), and the presence of potentially meiotically driven selfish B chromosomes (several brocket deer species [genus Mazama], Siberian roe, and Siberian musk deer)63. The chromosome evolution WG aims to perform comparative analysis of chromosome structure, including centromere and kinetochore sequence and position, to reveal the evolutionary pathways leading from the last common ancestor in the order Artiodactyla to the existing species across all ruminant genera and closely related species with widely divergent karyotypes. Among the primary goals are deriving an ancestral karyotype, defining ruminant evolutionary breakpoints, and discriminating general mechanisms of chromosome structure evolution across mammals from clade-specific features. This will entail close interaction between the chromosome evolution WG and the cytogenetics WG to clarify ambiguous karyotypes, such as that of the dama gazelle, in species for which cell lines are available or can be established. Fluorescence in situ hybridization (FISH) on chromosomal spreads will be used to validate the locations of specific DNA sequences on chromosomes, which can be particularly useful for hard-to-assemble regions such as satellite repeats and rDNA gene arrays.
For example, gaps within rDNA gene arrays4 and tandem repeat arrays of nearly-identical copies can be resolved using high-resolution FISH.
The annotation WG will undertake comparative analysis of gene family contractions and expansions and identify targets of positive selection. Correlating transcript abundance with polymorphisms in cis-regulatory elements can identify expression quantitative trait loci (eQTL) and illuminate principles of functional biodiversity and their consequences for evolutionary development, selection, and adaptive responses. Gene retrocopies64, resulting from reverse transcription and genomic insertion of spliced mRNA by LINE-1 retrotransposition20,65, will be evaluated for evolution in non-coding pseudogenes66. The polled and fleece type traits in sheep represent examples of the impact of retrogenes16. RetroScan67 will be used to identify retrocopies and provide their preliminary classification. Ka/Ks ratios will provide their estimated age distribution.
The annotation WG has also planned specific comparative analysis of lactation-related genes to examine the evolution of genes that regulate milk synthesis and variation in milk constituents. There is a wealth of transcriptome datasets in cattle, buffalo, and sheep related to the mammary gland, and milk represents a rich, non-invasive source of RNA, including non-coding snoRNA, miRNA, and lncRNA as well as mRNA68–70. These RNAs are found in epithelial cells present in milk and within cytoplasmic droplets encapsulated in milk fat globules during apocrine secretion71–75. Planned analyses will capitalize on data from public repositories or on sequence from milk samples of non-agricultural ruminant species.
One outcome of the human T2T project was the identification of additional genes in the newly assembled and corrected portions of the genome, most of which corresponded to predicted non-coding RNAs. There is limited knowledge of their organization across species, function, and evolution. The annotation WG will use data generated in the project and public transcriptome datasets to annotate non-coding RNA genes, particularly in previously unassembled portions of genomes, to provide new understanding of the evolution and activity of these little-understood genes. Comparative study across the ruminants and correlation of gene expansion/contraction with other aspects of genome biology will provide new insights into the role of non-coding RNAs in genome function and evolution.
The immunogenomics WG focuses on expressed adaptive immune gene repertoires (antibody repertoires and T-cell receptor repertoires), some of which are in germline loci encoded through somatic genomic recombination76. These genomic regions were difficult to examine before the advent of third-generation sequencing technology and assembly methods. The WG will annotate germline genes encoding antibodies and T-cell receptors, identify their eQTL characteristics using expressed repertoire-sequencing data (AIRR-Seq), and perform comparative analyses to reveal species-specific adaptations of ruminant adaptive immune systems related to environment, pathogen exposure, and domestication. The comparative analysis will also make it possible to investigate the evolutionary origin of the ultralong antibodies. Previous studies show that such antibodies are partially encoded by unusually long immunoglobulin diversity (D) genes. During somatic recombination, one variable (V) gene, one D gene (heavy chain only), and one joining (J) gene are selected and concatenated to generate the variable region of a heavy or light chain of the antibody77. The resulting VDJ sequences are further diversified by somatic hypermutation and gene conversion. Recent studies showed that cattle cysteine-rich ultralong antibodies likely play a key role in responses to bovine respiratory disease25. Orthologs of the ultralong D gene IGHD8-2 were found in the genomes of cattle and their close relatives (e.g., zebu, American bison, gayal), suggesting that ultralong antibodies are common in some bovines78,79. A preliminary analysis of existing ruminant genomes performed by the immunogenomics WG revealed IGHD8-2-like genes in red deer and giraffe, suggesting that ultralong antibodies either emerged earlier in the ruminant lineage or resulted from convergent evolution (Table 3). The WG will explore the role of ultralong antibodies in immune responses and their therapeutic potential, and invites collaborators studying ruminant diseases and developing antibody-based drugs.
The WG will also collaborate with immunogenomics societies to deposit AIRR-Seq data in a standardized and open-to-public manner.
The comparative methylome (epigenetic) WG will make use of the latest sequencing technologies, which provide information on 5-methylcytosine (5mC) base modification in the genome, a modification associated with gene regulation. These methylation patterns are generated simultaneously with the base calls of HiFi and ONT-UL reads80–85. A complete T2T assembly will therefore yield an accompanying T2T methylome for each ruminant species and will help reveal the extent to which 5mC influences gene expression, genome regulation, and genome stability. Initially, patterns of 5mC will be characterized throughout the genomes, including previously unresolved genomic regions such as rDNA arrays and centromeric regions. Subsequently, comparative epigenomics will be employed to gain molecular insights into domestication and selection, as has been done in dogs and fish86,87. Of particular interest is the change in 5mC over evolutionary distance among species within the suborder Ruminantia. Understanding the epigenetic mechanisms altered by domestication and selection may inform the agricultural genomics community of the potential for marker-assisted selection of epigenetically induced phenotypes.
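To make the idea of characterizing 5mC patterns by region concrete, the following sketch summarizes per-CpG methylation calls into a coverage-weighted methylation fraction per annotated region. The input layout (tuples of chromosome, position, methylated read count, total read count), the function name `region_methylation`, and the region names are hypothetical choices for illustration; they do not reflect a specific file format or the WG's pipeline.

```python
def region_methylation(sites, regions):
    """Coverage-weighted 5mC fraction per region.

    sites:   iterable of (chrom, pos, n_methylated, n_total), one per CpG
    regions: list of (name, chrom, start, end), half-open [start, end)
    Returns {name: fraction}, or None for regions with no covered CpG.
    """
    totals = {name: [0, 0] for name, _chrom, _start, _end in regions}
    for chrom, pos, n_meth, n_total in sites:
        for name, r_chrom, start, end in regions:
            if chrom == r_chrom and start <= pos < end:
                totals[name][0] += n_meth
                totals[name][1] += n_total
    return {
        name: (meth / total if total else None)
        for name, (meth, total) in totals.items()
    }


# Hypothetical toy data: a heavily methylated block and a lightly
# methylated block on the same chromosome.
sites = [
    ("chr1", 100, 9, 10),
    ("chr1", 150, 8, 10),
    ("chr1", 5000, 1, 10),
    ("chr1", 5050, 2, 10),
]
regions = [
    ("region_A", "chr1", 0, 1000),
    ("region_B", "chr1", 4900, 5100),
    ("region_C", "chr1", 9000, 9500),
]
summary = region_methylation(sites, regions)
print(summary)
```

Weighting by read coverage, rather than averaging per-site fractions, keeps low-coverage CpGs from dominating the regional estimate.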
Table 3
Ultralong D genes of three ruminant species: cow, red deer, and giraffe. The corresponding amino acid sequence is shown in the “Amino acid translation” column.
Species | Gene length (nt) | Amino acid translation |
Cow (Bos taurus) | 148 | SCPDGYSYGYGCGYGYGCSGYDCYGYGGYGGYGGYGYSSYSYSYTYEY |
Red deer (Cervus elaphus) | 117 | YCYSSSSGYYDCSSGYYDCCGSSSYYGYCGSSYYSYYG |
Giraffe (Giraffa camelopardalis) | 104 | CHSSSCRSGYSSGYGCRSGYGYGYSYGYGYGCCG |
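As a quick consistency check on Table 3, the sketch below tabulates the length and cysteine content of each translation. The sequences and nucleotide lengths are copied from the table; the function name `summarize` and the choice to report a cysteine fraction are illustrative assumptions, not part of the WG's analysis.

```python
# Gene lengths (nt) and amino acid translations copied from Table 3.
ULTRALONG_D_GENES = {
    "Cow (Bos taurus)": (
        148, "SCPDGYSYGYGCGYGYGCSGYDCYGYGGYGGYGGYGYSSYSYSYTYEY"),
    "Red deer (Cervus elaphus)": (
        117, "YCYSSSSGYYDCSSGYYDCCGSSSYYGYCGSSYYSYYG"),
    "Giraffe (Giraffa camelopardalis)": (
        104, "CHSSSCRSGYSSGYGCRSGYGYGYSYGYGYGCCG"),
}


def summarize(genes):
    """Return per-species translation length and cysteine statistics."""
    rows = {}
    for species, (nt_len, aa) in genes.items():
        n_cys = aa.count("C")
        rows[species] = {
            "nt_len": nt_len,
            "aa_len": len(aa),
            "cysteines": n_cys,
            "cys_fraction": n_cys / len(aa),
        }
    return rows


rows = summarize(ULTRALONG_D_GENES)
for species, row in rows.items():
    print(species, row)
```

Each translation accounts for slightly fewer nucleotides than the listed gene length, as expected when a D gene's length is not a multiple of three.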
A subset of species used for agriculture, including cattle, sheep, goat, and American bison, presents an opportunity to obtain samples from specific developmental stages for T2T assembly and enhanced annotation. Fetal tissues from these species collected after secondary myogenesis were used for genome sequencing, transcriptome profiling with long and short reads, chromatin conformation contact analysis, and methylome characterization. Multi-tissue comparative analysis will generally be confined to these species, since obtaining similar samples from many other ruminant species, including some that are critically endangered, is neither practical nor possible. However, where cell lines can be obtained or created from fibroblast cells, a parallel comparative effort characterizing gene expression, 3D architecture, and DNA modification is planned.
A large amount of population-level SNP-chip and sequence data exists for agricultural species, as well as for some wild species. Ruminants used in agriculture have been the subject of projects modeled after the human “1000 Genomes” project, using medium- or high-density SNP chips and, more recently, whole genome shotgun (WGS) sequence data. These resources have been used to establish associations with phenotypes recorded at varying levels of detail. The Variant Discovery and Population Sequencing (variant) WG plans to evaluate the impact of T2T-level assemblies on short- and long-read-based variant identification and genotyping compared with current reference assemblies. Additionally, for species with sufficient population data available, the variant WG intends to produce standardized resources that enable researchers to transition from previous references to the newly produced T2T assemblies.
Population-level WGS of endangered ruminant species will enhance understanding of the distribution of genome-wide variation, of inbreeding (through analyses of the number and distribution of runs of homozygosity), and of the burden of masked and realized genetic load in the declining populations that often characterize such species. Such information can be incorporated into conservation management programs that seek to ensure the long-term sustainability of endangered species88.
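To illustrate how runs of homozygosity (ROH) relate to inbreeding, the sketch below scans ordered genotype calls from one individual for stretches of consecutive homozygous sites and reports the fraction of the genome they cover, often written F_ROH. The genotype coding (0/1/2 copies of the alternate allele), the thresholds `min_snps` and `min_length_bp`, and the function names are assumptions for this example; established ROH callers additionally tolerate occasional heterozygous or missing calls.

```python
def find_roh(positions, genotypes, min_snps=3, min_length_bp=1000):
    """Return (start_bp, end_bp) for each run of homozygous calls.

    positions: sorted base-pair coordinates on one chromosome
    genotypes: alternate-allele counts per site (0 or 2 = homozygous,
               1 = heterozygous)
    """
    runs = []
    run_start = None  # index of the first site in the current run
    # A trailing heterozygous sentinel closes a run that reaches the end.
    for i, g in enumerate(list(genotypes) + [1]):
        if g != 1:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            n_sites = i - run_start
            start_bp, end_bp = positions[run_start], positions[i - 1]
            if n_sites >= min_snps and end_bp - start_bp >= min_length_bp:
                runs.append((start_bp, end_bp))
            run_start = None
    return runs


def f_roh(runs, genome_length_bp):
    """Fraction of the genome covered by the supplied ROH segments."""
    return sum(end - start for start, end in runs) / genome_length_bp


# Hypothetical toy data: ten SNPs on a 10 kb segment.
positions = [100, 600, 1200, 1900, 2500, 3100, 3600, 4200, 5000, 5600]
genotypes = [0, 0, 2, 2, 1, 0, 1, 2, 2, 2]
runs = find_roh(positions, genotypes)
print(runs)
print(f_roh(runs, genome_length_bp=10_000))
```

The isolated homozygous site at 3,100 bp is discarded because it falls below the minimum run size, which shows why threshold choices matter when comparing inbreeding estimates across populations.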