High-density genome-wide cattle SNP data sets
For this study new Illumina® BovineHD 777K BeadChip SNP data sets were generated for 39 African cattle (23 Somba, 8 N’Dama and 8 Boran). The Somba breed data were obtained using DNA samples previously published as part of a microsatellite-based survey of cattle genetic diversity (Freeman et al, 2004) and were generated by Weatherbys Scientific (Naas, Ireland) using standard procedures for Illumina® SNP array genotyping. The N’Dama and Boran data were obtained using cattle DNA samples from a trypanosome challenge time-course experiment (O'Gorman et al, 2009) and were generated by Neogen Europe (Ayr, Scotland) also using standard procedures. Additional Illumina® BovineHD 777K BeadChip data sets were obtained from published studies (Bahbahani et al, 2017; Barbato et al, 2020; Upadhyay et al, 2017; Verdugo et al, 2019; Ward et al, 2022; Wragg et al, 2022) and the Web-Interfaced next generation Database Exploration (WIDDE) repository Sempéré et al (2015).
The total data set consisted of high-density 777K SNP data for 1,030 cattle before filtering and 24 different populations were represented, including three European B. taurus populations (Holstein Friesian, Angus, and Jersey); three African B. taurus populations (Muturu, Lagune, and Guinean N’Dama); three B. indicus populations (Tharparkar, Gir, and Nelore); five European hybrid populations (Romagnola, Chianina, Marchigiana, Maremmana, and Alentejana); five trypanotolerant African hybrid populations (hybrid N’Dama, Borgou, Somba, Keteku, and Sheko) and five trypanosusceptible African hybrid populations (Ankole, Nganda, East African Shorthorn Zebu, Karamojong, and Boran). The cattle BovineHD 777K SNP data were converted to binary PLINK files with Illumina® allele coding for the FORWARD strand as required using PLINK (v. 1.90 beta 6.25) (Chang et al, 2015) and SNPchiMp (v. 3) (Nicolazzi et al, 2015). The sample data were then merged with PLINK (v. 1.90 beta 6.25).
Figure 1 illustrates the overall study workflow including the genome assembly updating, data preparation, and filtering steps, which are described in the following subsections and that were implemented prior to the population genomics analyses. Table 1 shows the taxonomic, breed, geographical, sample number (pre- and post-SNP data filtering), and sources for the BovineHD 777K BeadChip SNP data sets. There was a total of 750 individual animal BovineHD 777K BeadChip SNP data sets retained after filtering.
Updating the bovine genome assembly
The BovineHD 777K BeadChip SNP locations were updated from the UMD3.1 bovine genome assembly to the current assembly ARS-UCD1.2 (Rosen et al, 2020) using coordinates from the National Animal Genome Research Program (NAGRP) Data Repository genotyping array SNP mapping to ARS-UCD1.2 resource (Schnabel, 2018) and PLINK (v. 1.90 beta 6.25).
Data preparation and filtering
Generation of a low-density SNP array data set
To produce a comparative low-density SNP array data set, the high density BovineHD 777K SNP data set was downsampled to the subset of the 46,713 SNPs in common with the Illumina® Bovine SNP50 BeadChip using PLINK (v. 1.90 beta 6.25). A list of the Bovine SNP50 BeadChip SNPs from the NAGRP Data Repository (Schnabel, 2018) was used for this purpose and modified with dplyr (v. 1.1.2) (Wickham et al, 2023a) and readr (v. 2.1.4 (Wickham et al, 2023b) with R (v. 4.3.0) (R Core Team, 2023).
Missing SNP removal
Individual animals that had missing SNP call rates exceeding 0.95 from the low-density data set were removed using a missing genotype filter with PLINK (v. 1.90 beta 6.25). The same set of animals were also removed from the high-density data set.
Table 1. Group, code, population, origin, number of samples pre- and post-filtering and sources of SNP data used in this study.
Group
|
Code
|
Population
|
Country of origin
|
No. pre-filtering
|
No. post-filtering
|
Source*
|
European
Bos taurus
|
HOLS
|
Holstein Friesian
|
Netherlands
|
60
|
36
|
a
|
European
Bos taurus
|
ANGU
|
Angus
|
United Kingdom
|
42
|
26
|
a
|
European
Bos taurus
|
JERS
|
Jersey
|
United Kingdom
|
38
|
23
|
a
|
European hybrid
|
ROMA
|
Romagnola
|
Italy
|
51
|
51
|
b, a
|
European hybrid
|
CHIA
|
Chianina
|
Italy
|
19
|
19
|
b, c
|
European hybrid
|
MARC
|
Marchigiana
|
Italy
|
13
|
13
|
b
|
European hybrid
|
MARE
|
Maremmana
|
Italy
|
9
|
5
|
c
|
European hybrid
|
ALEN
|
Alentejana
|
Portugal
|
11
|
6
|
d, c
|
African
Bos taurus
|
MUTU
|
Muturu
|
Nigeria
|
39
|
19
|
e
|
African
Bos taurus
|
LAGU
|
Lagune
|
Benin
|
5
|
5
|
a
|
African
Bos taurus
|
NDAG
|
N’Dama Guinea
|
Guinea
|
70
|
47
|
e, a, f
|
Trypanotolerant African hybrid
|
NDAM
|
N’Dama hybrid
|
Unspecified, Togo, Kenya
|
65
|
41
|
b, d, g
|
Trypanotolerant African hybrid
|
BORG
|
Borgou
|
Benin
|
50
|
50
|
e
|
Trypanotolerant African hybrid
|
SOMB
|
Somba
|
Benin
|
33
|
23
|
g
|
Trypanotolerant African hybrid
|
KETE
|
Keteku
|
Nigeria
|
22
|
22
|
e
|
Trypanotolerant African hybrid
|
SHEK
|
Sheko
|
Ethiopia
|
16
|
16
|
a
|
Trypanosusceptible African hybrid
|
ANKO
|
Ankole
|
Uganda
|
50
|
25
|
f
|
Trypanosusceptible African hybrid
|
NGAN
|
Nganda
|
Uganda
|
53
|
27
|
f, e
|
Trypanosusceptible African hybrid
|
EASZ
|
East African Shorthorn Zebu
|
Kenya
|
111
|
111
|
f, e
|
Trypanosusceptible African hybrid
|
KARA
|
Karamojong
|
Uganda
|
16
|
16
|
f
|
Trypanosusceptible African hybrid
|
BORA
|
Boran
|
Kenya
|
90
|
90
|
h, g
|
Bos indicus
|
THAR
|
Tharparkar
|
Pakistan
|
13
|
13
|
b
|
Bos indicus
|
GIR
|
Gir
|
India
|
85
|
28
|
a, f
|
Bos indicus
|
NELO
|
Nelore
|
Brazil
|
69
|
38
|
a, c, f
|
|
|
|
Total
|
1030
|
750
|
|
* a Sempéré et al (2015), b Barbato et al (2020), c Upadhyay et al (2017), d Verdugo et al (2019), e Ward et al (2022), f Bahbahani et al (2017), g this study, h Wragg et al (2022).
Removal of duplicate samples by identity-by-state filtering
Duplicate samples present in two or more data sources were removed using PLINK (v. 1.90 beta 6.25) and a previously described identity-by-state methodology (Browett et al, 2018). The method was modified to select one from each pair of animals that had an identity-by-state value greater than or equal to 0.99 using dplyr (v. 1.1.2), readr (v. 2.1.4), and stringr (v. 1.5.0) (Wickham, 2023) with R (v. 4.3.0). The resulting list of sample duplicates were removed from the high- and low-density data sets with PLINK (v. 1.90 beta 6.25). The results were visualised using dplyr (v. 1.1.2), ggh4x (v. 0.2.4) (van den Brand, 2023), ggplot2 (v. 3.4.2) (Wickham, 2009), ggtext (v. 0.1.2) (Wilke and Wiernik, 2022), readr (v. 2.1.4), and stringr (v. 1.5.0) with R (v. 4.3.0). Colours were generated from viridis (v. 0.6.3) (Garnier et al, 2023) and khroma (v. 1.10.0).
Removal of admixed animals from the reference populations
An inbreeding analysis in PLINK (v. 1.90 beta 6.25) was used to remove animals that showed evidence for significant admixture in the reference populations (three European B. taurus populations: Holstein Friesian, Angus, and Jersey; three African B. taurus populations: Muturu, Lagune, and Guinean N’Dama; and three B. indicus populations: Tharparkar, Gir, and Nelore). To do this, outlier animals with statistically lower inbreeding values than the rest of the population were identified via boxplots using dplyr (v. 1.1.2), readr (v. 2.1.4), and tidyr (v. 1.3.0) (Wickham et al, 2023d) with R (v. 4.3.0). The resulting list of animals were removed from the high- and low-density data sets. A systematic inbreeding analysis was then performed with PLINK (v. 1.90 beta 6.25) and the output was modified using dplyr (v. 1.1.2) and readr (v. 2.1.4) with R (v. 4.3.0) to identify animals with the top 25 and bottom 25 inbreeding values across the three European B. taurus populations (Holstein Friesian, Angus, and Jersey). These samples were then removed from the high- and low-density data sets to balance the numbers of animals across the reference groups. A final inbreeding analysis of the low-density data set after the filters were applied was performed to compare the results with those of the high-density data set. Results from the inbreeding analyses were visualised using dplyr (v. 1.1.2), ggh4x (v. 0.2.4), ggplot2 (v. 3.4.2), ggtext (v. 0.1.2), and readr (v. 2.1.4), with R (v. 4.3.0). Colours were generated from viridis (v. 0.6.3) and khroma (v. 1.10.0).
Filtering of SNPs by call rate and minor allele frequency
The high- and low-density data sets were filtered to retain autosomal SNPs with a minimum call rate of 95% and minor allele frequency (MAF) of at least 5% with PLINK (v. 1.90 beta 6.25). The methodologies used for this process have been described in a previous study (McHugo et al, 2019).
Population genomics analyses
Principal component analysis
Principal component analysis (PCA) was performed for the high- and low-density data sets using smartpca after file conversion with convertf, both part of EIGENSOFT package (v. 7.1.2) (Patterson et al, 2006). The results were visualised using dplyr (v. 1.1.2), ggplot2 (v. 3.4.2), patchwork (v. 1.1.2) (Pedersen, 2023), readr (v. 2.1.4), stringr (v. 1.5.0), and tidyr (v. 1.3.0) with R v. 4.3.0). Colours were generated from viridis (v. 0.6.3) and khroma (v. 1.10.0).
Genetic structure analysis
Genetic structure analysis was performed for the high- and low-density data sets using structure_threader (v. 1.3.4) (Pina-Martins et al, 2017) with fastStructure (v. 1.0) (Raj et al, 2014). The structure analysis was carried out with the model complexity or number of populations (K) set from 2 to 25. The chooseK function was used to test the outputs to find a range of values of K that best accounted for the structure in the data (Raj et al, 2014). The results were visualised using dplyr (v. 1.1.2), ggh4x (v. 0.2.4), ggplot2 (v. 3.4.2), ggtext (v. 0.1.2), magick (v. 2.8.1) (Ooms, 2023) with ImageMagick (v. 6.9.12.96) (ImageMagick Studio LLC, 2023), magrittr (v. 2.0.3) (Bache and Wickham, 2022), patchwork (v. 1.1.2), readr (v. 2.1.4), stringr (v. 1.5.0) and tidyr (v. 1.3.0) with R (v. 4.3.0). Colours were generated from viridis (v. 0.6.3) and khroma (v. 1.10.0).
Phylogenetic analysis
An additional sample of three gaur (B. gaurus) that were genotyped using the BovineHD 777K BeadChip (Sempéré et al, 2015; Verdugo et al, 2019) were available to use as an outgroup. After the pre-processing steps described above were performed to convert the data to binary PLINK files with Illumina® allele coding for the FORWARD strand on the ARS-UCD1.2 bovine genome assembly, the B. gaurus data set was filtered with PLINK (v. 1.90 beta 6.25) to retain only SNPs present in the high-density data set. The gaur data were then merged with the high-density data set and an additional filter was applied with PLINK (v. 1.90 beta 6.25) to retain autosomal SNPs with a minimum call rate of 95% and minor allele frequency of at least 5%. A gzipped allele frequency cluster file was produced with PLINK (v. 1.90 beta 6.25) and the resulting file was converted to TreeMix format using the plink2treemix python script provided with the TreeMix software package (v. 1.13) (Pickrell and Pritchard, 2012).
Phylogenetic analysis was performed for both the high- and low-density SNP data sets using TreeMix (v. 1.13) with the number of migration edges (m) set from 1 to 15 for ten iterations using windows of SNPs (k) increasing from 100 to 1000 by increments of 100. The OptM package (v. 0.1.6) (Fitak, 2021) was used with R (v. 4.3.0) to calculate the mean and standard deviation (SD) across the 10 iterations for the composite likelihood (L(m)), proportion of variance explained and the second-order rate of change (Δm) across migration edges (m). The results were visualised using dplyr (v. 1.1.2), ggplot2 (v. 3.4.2), ggtext (v. 0.1.2) and patchwork (v. 1.1.2) with R (v. 4.3.0). The BITE R package (v. 1.2.0008) (Milanesi et al, 2017) was also used to generate a Unix shell script customised to perform 100 TreeMix bootstrap replicates for the selected numbers of migration edges (m). The results were visualised using ape (v. 5.7.1) (Paradis and Schliep, 2019), dplyr (v. 1.1.2), ggh4x (v. 0.2.4), ggnewscale (v. 0.4.9) (Campitelli, 2023), ggplot2 (v. 3.4.2), ggtext (v. 0.1.2), ggtree (v. 3.9.0.1) (Yu, 2022), patchwork (v. 1.1.2), stringr (v. 1.5.0), tidytree (v. 0.4.4) (Yu, 2022), and treeio (v. 1.25.2) (Wang et al, 2020a) with R (v. 4.3.0) and a modified version of the script provided with TreeMix. Colours were generated from viridis (v. 0.6.3) and khroma (v. 1.10.0).
Local ancestry estimation
MOSAIC analysis
The high- and low-density SNP data sets were separated by chromosome with PLINK (v. 1.90 beta 6.25) and each chromosome was phased with SHAPEIT (v. 2.r904) (O'Connell et al, 2014). The resulting segregated chromosome SNP data files were converted to MOSAIC format using the R script provided with MOSAIC (v. 1.5.0) (Salter-Townshend and Myers, 2019) and R (v. 4.3.0). Recombination rate files were prepared from a cattle recombination map (Ma et al, 2015) using an R script adapted from Ward et al (2022) with R (v. 4.3.0). For each hybrid population three-way local ancestry analysis was performed across all autosomes without FST estimation and assuming an effective population size (Ne) of 400 using MOSAIC (v. 1.5.0), dplyr (v. 1.1.2), and stringr (v. 1.5.0) with R (v. 4.3.0). The potential donor populations were the three European B. taurus populations (Holstein Friesian, Angus, and Jersey), the three African B. taurus populations (Muturu, Lagune, and Guinean N’Dama) and the three B. indicus populations (Tharparkar, Gir, and Nelore).
ELAI analysis
The high- and low-density SNP data sets were separated and converted into BIMBAM format for each population and chromosome with PLINK (v. 1.90 beta 6.25). Local ancestry analysis was carried out for each hybrid population and autosome with 30 expectation-maximization (EM) steps, 3 upper clusters, 15 lower clusters, and 200 mixing generations using ELAI (v. 1.0) (Guan, 2014). The donor populations for each hybrid population were selected based on the results of the MOSAIC analysis.
Local ancestry analysis comparison
The local ancestry results were extracted using dplyr (v. 1.1.2, MOSAIC (v. 1.5.0), parallel (v. 4.3.0) (R Core Team, 2023), readr (v. 2.1.4), scales (v. 1.2.1) (Wickham et al, 2023c), stringr (v. 1.5.0), and tibble (v. 3.2.1) (Müller and Wickham, 2023) with R (v. 4.3.0). Mean ancestry scores across the individual hybrid animals and a genome-wide z-score for each of the three ancestry components were calculated for each hybrid population. Weighted mean ancestry scores and z-scores were calculated across the hybrid populations within each group of European hybrids, trypanotolerant African hybrids, and trypanosusceptible African hybrids for a subset of the hybrid populations selected to only include populations with a minimum of 15 animals and a relatively stable level of admixture based on a visual examination of the structure results. The local ancestry results were visualised using dplyr (v. 1.1.2), ggplot2 (v. 3.4.2), magick (v. 2.8.1) with ImageMagick (v. 6.9.12.96), magrittr (v. 2.0.3), parallel (v. 4.3.0), patchwork (v. 1.1.2), readr (v. 2.1.4), scales (v. 1.2.1), stringr (v. 1.5.0), and tibble (v. 3.2.1) with R (v. 4.3.0). Colours were generated from khroma (v. 1.10.0). The correlation between the local ancestry results from MOSAIC and ELAI were visualised using dplyr (v. 1.1.2), ggplot2 (v. 3.4.2), ggtext (v. 0.1.2), parallel (v. 4.3.0), patchwork (v. 1.1.2), readr (v. 2.1.4), and stringr (v. 1.5.0) with R (v. 4.3.0). Colours were generated from viridis (v. 0.6.3).
Functional enrichment of introgressed regions
Functional enrichment was performed and visualised using dplyr (v. 1.1.2), ggplot2 (v. 3.4.2), ggrepel (v. 0.9.3) (Slowikowski, 2023), gprofiler2 (v. 0.2.2) (Kolberg et al, 2023), magick (v. 2.8.1) with ImageMagick (v. 6.9.12.96), magrittr (v. 2.0.3), parallel (v. 4.3.0), readr (v. 2.1.4), and stringr (v. 1.5.0) with R (v. 4.3.0). Colours were generated from khroma (v. 1.10.0). The background set was the set of genes within 1 Mb up- and downstream from a SNP in the high-density data set. The query sets were the genes within 1 Mb up and downstream from the SNPs with a z-score ≥ 2.0 for each of the ancestries.