Genome assembly quality control
GfaStats65 was used to produce the assembly statistics used to evaluate the contiguity of the whole genome assemblies using default parameters. The reference-free k-mer-based genome assembly completeness evaluation tool Merqury24 was run as recommended by the developers on the diploid assemblies from the F1 crosses of the cattle and sheep species to produce q-value statistics and plots to check the k-mer distribution between the raw and the final assemblies for assembly completeness assessment. Compleasm66 (formerly miniBUSCO) was also used to evaluate assembly completeness using the number of orthologous genes identified in the draft assemblies. Compleasm was run with default parameters and lineage="cetartiodactyla_odb10".
Seqtk telo module from Seqtkv1.4 (https://github.com/lh3/seqtk) was used to identify the telomere sequence at the ends of the Y-chromosome assemblies. Although the program identified the telomeric repeat sequence at the ends of the Y-chromosomes, it is difficult to ascertain if the full length of the telomeres were assembled.
Defining the PAR boundaries
We defined the pseudo autosomal region (PAR) on the Y-chromosome assemblies by mapping raw PacBio long reads (HiFi for cattle and CCS for sheep) from female haplotypes to the cattle and the sheep Y-chromosome assemblies as previously described28. Briefly, the PAR on the X-chromosomes of the cattle and sheep reference assemblies were hard-masked using the current annotation coordinates. Long reads from a female individual were mapped to the assemblies using minimap267 minimap2 -a -t $threads $inFile $inFile2 $inFile3 $inFile4 > $inputTAG".sam". The alignment file was sorted and indexed with Samtools68 before visualization with IGV29. The alignment track (Supplementary Figure S9) clearly showed the PAR boundaries with the soft clipping of the long reads at the region lacking homology with the Y-chromosome rendered as the highly colored reads segment, coupled with a drastic drop of the reads coverage to zero as shown in the coverage track (Supplementary Figure S9).
To further finely define the PAR region for the cattle and sheep sex chromosomes, we aligned the Y- and X-chromosome assemblies to each other with Mashmap69 using the parameters segment length -s 10000 and minimum identity --pi 95. For cattle, the PAR region on the Y-chromosome extended from the beginning of the chromosomes to 6,819,999bp on the Y-chromosome 6,810,542 Mb on the X-chromosome with 99.38% sequence identity between them. The sequence identity of the next alignment block between the two sex chromosomes spanning 9,999bp dropped drastically to 92.79%. The sheep PAR region spanned 7,019,999bp in two blocks from the start of the Y-chromosome and between 179bp and 6,987,958 on the X-chromosome, at an average of 98.85% sequence identity between them, while the next alignment block of 9,999bp dropped to 95.07 sequence identity.
X-d regions
The X-degenerate region of the Y-chromosomes comprise homologous regions with the X-chromosome outside the PAR and without active recombination with it. This region was defined by aligning the Y-chromosome to the X-chromosome with Mashmap. Mashmap was run with segment length “-s 5000” and percentage identity “--pi 65”.
Gene annotation
The protein-coding genes on the Y-chromosomes were initially annotated manually in two ways - using Liftoff v1.6.3 70 and by homology search with protein sequences from cattle and sheep, as well as from the well-annotated human and mouse genomes. Homology-based annotation was necessary to supplement the lack of Y chromosome-specific genes from the liftoff of the current cattle and sheep reference assemblies since the reference had no annotated Y. Refined annotation became available upon submission to NCBI through the application of their RefSeq annotation pipeline, revealing only slight disparities in the number and loci of genes from our manual annotations (Supplementary Tables S3 and S4). However, we used the NCBI annotations for the downstream analysis.
Liftoff was run with the parameters -flank 0.0 -s 0.4 -exclude_partial -copies -sc 0.98 -cds on the Y-chromosome and X-chromosome assemblies of both species using the X-chromosomes of the current assemblies of the Rambouillet ARS-UI_Ramb_v2.071 for sheep and the ARS_UCD_1.372 for cattle as reference. The liftoff result was filtered for only protein-coding genes (Supplementary Tables S3 and S4). Homology-based gene annotation on the Y-chromosomes was done using protein sequences from cattle, sheep, mouse and the human T2T genome assemblies. Miniprot 73 was used with the --outc=0.7option to output only hits with at least 70% coverage between the query and the target. The result was filtered with the alignment score (Supplementary Tables S3 and S4) before manual curation with Apollo74, a genome annotation plug-in for JBrowse75.
Repeat elements annotation
Repeats analysis was done on the Y-chromosomes with RepeatMasker version 4.1.2-pl76 using a combined repeats library produced from the Dfam library (dfam.h5 version 3.7) and an older library (RepeatMasker.lib) in order to have a comprehensive database of repeat elements. RepeatMasker was run with the options -species "bos taurus" -xsmall -no_is -pa $threads -gff.
Comparison of the T2T Y-chromosomes with publicly available assemblies
We collected all the cattle and sheep genome assemblies where male samples were reported sequenced from NCBI. Out of a total of eleven assemblies published for cattle, seven assemblies came from male samples, but only Btaurus_INIA1 was not assembled to chromosome level. All the chromosome level assemblies reported partial Y-chromosome assemblies except Bos_taurus_UMD_3.1.1 which had none (Supplementary Table 15). The current cattle reference genome ARS_UCD1.3, does not have a Y-chromosome assembly since it is from a Hereford cow. We thus selected the Y chromosome from the Btau_5.0.1 (GCA_000003205.6) assembly being the longest at 43.3Mb.
For the publicly available sheep assemblies, from 56 whole genome assemblies NWAFU_Friesian_1.0 and ASM2243283v1 were the two chromosome-level assemblies out of the 6 produced from male animals (Supplementary Table 14). Take note that assemblies ASM2243283v1, ASM2132593v1 and ASM2132590v on one hand, and assemblies ASM2270250v1 and ASM2270250v on the other hand were produced from the same biotypes and were thus not treated as unique assemblies. We selected the Hu Sheep assembly14 which covered only the MSY of the sheep Y chromosome for comparison with our T2T Y-chromosome.
We aligned the T2T Y-chromosome assemblies to the publicly available assemblies with Mashmap69 at a minimum identity of 90% and segment length of 5kb.
Defining the centromere
The centromeres of the Y-chromosomes were defined using satellite DNA repeats annotation Tandem Repeat Finder (TRF)77, CENP-A enrichment analysis from CUT&RUN data, and epigenetic information of methylated Cytosine residues at the centromeric region.
Satellite DNA annotation with Tandem Repeat Finder (TRF)
TRF was run on the two Y-chromosome assemblies with the following parameters:
trf $inFile $match $mismatch $indel $PM $PI $minscore $maxperiod -d -h -ngs > trfOutput.txt
match=2, mismatch=7, indel=7, PM=80, PI=5, minscore=200, maxperiod=2000
The output of TRF was converted into a bed file and visually inspected on JBrowse to identify regions with tandem arrays of repeat elements. The 73bp repeat sequence which was identified in this manner from the centromeric tandem array was aligned back to the cattle (Wagyu and Charolais) and sheep (Churro and Friesian) assemblies using Blastn78 to estimate their relative abundance. The blast results were filtered with a minimum sequence coverage of 80% between the query and the target, and a minimum e-value of 1E-3.
CENP-A enrichment analysis
CENP-A pull-down assay was produced from bovine satellite cells obtained from semeimbranosous muscle of a 53-day old commercial male Angus calf using the GeneTex antibody with catalog number GTX13939 and sequenced. Adapter sequences were trimmed from the raw reads using Cutadapt79. The trimmed high-quality reads were then aligned to the whole genome sequence of the Y chromosome-containing Wagyu cattle using Bowtie280 with default parameters asides "-k 100". The alignments were sorted with Samtools68 and duplicates were removed with Picard toolkit (https://broadinstitute.github.io/picard/). Bedgraph files were created from the alignment files as input into the SEACR81 peaks caller to call the CENP-A peaks. SEACR was run using the stringent and 10% filter options.
bash SEACR_1.3.sh target.bedgraph 0.1 non stringent output
Methylated Cytosines analysis (CpG Methylation)
CpG methylation data was produced with ONT and PacBio HiFi reads using the manufacturer-prescribed protocols.
Oxford Nanopore Technology (ONT)
ONT-based methylated CpG sites were called in four main steps:
-
Methylated bases were called bases with Dorado using the R9_v0.3.4_v3.3_5hmc-5mc_A100mig model
-
Fastq files were extracted from the modbams and the MM, ML tags were retained
-
The fastq files were aligned to the reference assembly with Winnowmap82. The resulting sequence alignment map (.sam) file was filtered, converted to a bam file and indexed.
-
The bam file was parsed into modbam2bed to produce the bed file containing the methylated sites filtered using the 20% threshold for unmethylated bases and 80% for methylated bases.
Pacific Biosciences (PacBio)
The prescribed PacBio pipeline for calling CpG methylated sites is as follows:
1. Primrose was used to estimate the probability of methylation from the raw reads at each of the CpG sites.
2. The resulting bam files were aligned to the reference assembly with pbmm2 using the command:
pbmm2 align \
-j 48 \
--sort \
--log-level INFO \
--preset HIFI $1 $2 $3, where $1 is the reference assembly, $2 is the file of list of modbams, and $3 is the output
3. Aligned_bam_to_cpg_scores from pb-CpG-tools version 2.3.2 was used to produce the percentage of methylated reads across the assembly:
aligned_bam_to_cpg_scores \
--bam $3 \
--output-prefix $4 \
--model pileup_calling_model.v1.tflite \
--modsites-mode reference \
--ref $1 \
The CpG methylation calls obtained from these steps were post-processed by first changing all the
NANs in the methylation values to 0.00s. The aggregate methylation levels for 5kb and 10kb bins were calculated using custom python scripts. The resulting data were plotted with an R script in RStudio to visualize the tracks of the methylation pattern. With visual inspection of the methylation tracks, the Centromeric dip region (CDR) was defined according to52 as the region with a local dip in methylated Cytosines with respect to the surrounding sequence and flanking the active satellite array.
Transcript level quantification on the Y-chromosomes
We obtained publicly available RNA-Seq data of transcripts quantification experiments from different tissues in order to measure the transcription activity of the genes on the Y-chromosomes.
Sheep: Two sets of transcriptome data were obtained for the transcription activity analysis on the sheep Y-chromosome.
-
PRJNA552574: RNA-Seq was used to investigate whether testis development and gene expression is altered by the exposure of sheep to gossypol in utero and during lactation83. We obtained RNA-Seq data from a total of 18 testes samples from 60-day old lambs with 9 in control and 9 in treated groups were used.
-
PRJNA437085: This study used RNA-Seq to investigate the effects of dietary energy levels on Hu sheep testicular development84. RNA-Seq data from three replicates each of the two energy levels were obtained for our analysis.
Cattle: Two sets of transcriptome data were also used for the cattle.
-
PRJNA565682: Six samples were retrieved from the project in an experiment to profile the transcriptome of mature and immature testes85. Three replicates each of the mature (24-month old) and immature (1-day old) testes of Chinese Red Steppes cattle were used (Supplementary Table 9).
-
The second set was RNA-Seq data obtained from the Lung, Spleen, Muscle and Liver tissues of the Wagyu_x_Charolais F1 offspring from which the cattle Y-chromosome was obtained.
Quality control steps were carried out on the raw .fastq files to remove adapter, poly-X and poly-G sequences using fastp86 with the parameters -f 4 --trim_poly_g --trim_poly_x
STAR splice-aware aligner87 was then used to align the reads from the multi-tissue RNA-Seq data to the whole genome assemblies of the Wagyu and the Churro from where the Y-chromosomes were obtained. Before aligning the reads, the index of the whole genome assemblies were built with the command
STAR --runThreadN $threads --runMode genomeGenerate --genomeDir $pathToGenome \
--genomeFastaFiles $pathToGenomeFasta --sjdbGTFfile $annotationPath \
--sjdbGTFtagExonParentTranscript Parent --sjdbOverhang 100
The reads were then aligned with STAR in single or paired end reads mode depending on the data.
For single end reads mode:
STAR --runThreadN $threads --runMode alignReads --genomeDir $pathToGenome \
--readFilesIn $pathToRead1 \
--readFilesCommand zcat --outFileNamePrefix $pathToOutputFileWithPrefix \
--outSAMtype BAM SortedByCoordinate --quantMode GeneCounts
For paired-end reads mode:
STAR --runThreadN $threads --runMode alignReads --genomeDir $pathToGenome \
--readFilesIn $pathToRead1 $pathToRead2 \
--readFilesCommand zcat --outFileNamePrefix $pathToOutputFileWithPrefix \
--outSAMtype BAM SortedByCoordinate --quantMode GeneCounts
The output is the number of reads aligned to each gene.
Phylogenetic analysis
The sequence for each member of the ampliconic gene families were extracted from the coordinates using bedtools getfasta module. Multiple sequence alignment was performed using Mafft88. The output was imported to Unipro Ugene using the PhyML maximum likelihood tree building method with branch length optimized.
Dot plots
The self-alignment dotplots of the Y-chromosome assemblies were produced for visualization with Moddotplot (https://github.com/marbl/ModDotPlot) using the program parameters --no-bed --identity 85 -s 52.