General description of variants
The raw variant call dataset was filtered to have high quality SNP genotypes, and animals with missing genotypes were not included. This not only reduced the overall number of animals and sites substantially but it also reduced the number of heteroplasmic genotypes, improved the average read depth and retained higher quality sites (Table 2, Figure S1). However, when we compared the levels of heteroplasmy per individual separately in the Semen and Non-semen tissue groups, heteroplasmy was much higher in the Semen derived samples.
Table 2
Summary of the parameters of raw and filtered variant datasets before and after removing of sites with missing data (Site) and removing both sites and animals with missing data (Site & Ani).
Parameters
|
Raw dataset
(Unfiltered)
|
Dataset filtered by
|
Site
|
Site & Ani
|
No. of Animals (Ani) in dataset
|
4931
|
4931
|
1883
|
Total No. of POS in dataset
|
5903
|
3394
|
3069
|
Total No. of POS with at least one Het_GT animal
|
5201
|
3394
|
1227
|
Mean No. of Ani with Het_GT per POS (med)
|
253 (5)
|
388 (12.5)
|
2 (0)
|
No. of Ani with at least one Het_GT
|
3934
|
3717
|
712
|
Mean No. of POS with Het_GT per Ani (med)
|
302.2 (278)
|
266 (245)
|
3.5 (0)
|
No. of POS_Missing GT
|
5903
|
3394
|
0
|
Mean No. of Ani with Missing GT per POS (med)
|
420 (409)
|
232.3 (175)
|
0
|
No. of Ani with Missing GT
|
3299
|
2748
|
0
|
Mean No. of POS_Missing GT across all Ani (med)
|
251.3 (7)
|
159.9 (3)
|
0
|
Mean read depth per POS (across all Ani) (med)
|
284.5 (299.8)
|
287 (299.9)
|
699 (723)
|
Mean read depth per Ani (across all POS) (med)
|
284 (18.9)
|
287 (18.9)
|
699 (597)
|
Ani=Animal, POS=nucleotide position, GT=genotype, Het_GT Heteroplasmic genotype, med=median, Site=nucleotide position, med=median
|
We therefore imposed a strong filter based on the maximum number of heteroplasmic sites per individual, to result in a similar distribution of heteroplasmy and allelic ratios in both the Semen and Non-semen groups (Figure S2, S3). Overall, in the final set of 1,883 individuals, the per site heteroplasmy count was considerably reduced and there was a slight improvement in the average read depth coverage in the final dataset (Figure S4).
Mitochondrial haplogroups, population structure and admixture
Haplogroups using MitoToolPy
The haplogroup membership for each of the 1,883 animals in the filtered set was predicted in MitoToolPy using the ARS-UCD1.2_M reference, and the lifted over variants that MitoToolPy uses to define haplogroups (Table S1). MitoToolPy detected 11 major pre-defined haplogroups (I1, I2, T1, T2, T3, T4, T5, T6, P, Q1, Q2) based on variants from the whole genome sequences (16,340 bp). Overall, T3 was the predominant haplogroup (1502 animals) with about 15 subgroups within T3. The dominant subgroups were T3 (752) and T3r (547) (Figure S5). In most cases, the predicted haplogroup of each animal was as would be expected based on the breed and sub-species (Table S3). All the African cattle breeds (Ankole, Afrikander, Ndama, Benishangul, Goffa, Kenana, Muturu) were classified as the T1 haplogroup that is fixed in African taurine breeds [18]. Generally, the indicine cattle breeds belonged to major haplogroup I and modern taurine cattle to haplogroup T, although there were some exceptions. As expected, the composite breeds mostly sourced from Australia were unpredictable. Notably, the haplogroups of Brahman cattle (N=18) were mostly T1 (N=12), T3 (N=5) and one indicus (I). In some animals of European breed origin and their composites (N=1,302), the integration of T1 (3.5%), I1 and I2 (1.3%) haplogroups was also observed (Table 3 and 4). For example, several Holstein (N=5) and Jersey cattle (N=4) from Australia were of T1 origin. This was further confirmed by checking the original haploid genotypes for heteroplasmy across the haplogroup determining positions. For Jersey belonging to T1, the haplogroup determining positions were all homoplasmic (except 1 position in one animal) (Table S4). Altogether, the T1 haplogroup was observed in about 13 European taurine breeds and composites. Within Australian sourced cattle, T1 had considerable influence on Holstein, Jersey and composite breeds (36 animals). Similarly, the I Haplogroup was present in Holstein animals from China (N=5/12), Herefords from New Zealand (N=3/4) and, as expected, in composite taurus x indicus breeds from Australia 4/12 (Table 4).
Table 3
Prevalence of T1 (African taurine) haplogroup in non-African cattle breeds and composites.
Origin of sample
|
Breed
|
Sub-species
|
n/N
|
Sex
|
Australia
|
Angus Lowline
|
taurus
|
2/2
|
2F
|
Beefmaster
|
taurus X indicus
|
1/2
|
1M
|
Brahman
|
indicus X taurus
|
12/18
|
1F, 10M, 1U
|
Dexter
|
taurus
|
1/2
|
1F
|
Holstein
|
taurus
|
5/5
|
4F, 1M
|
Jersey
|
taurus
|
4/8
|
4M
|
Senepol
|
European taurus X African taurus
|
5/12
|
5U
|
Composite
|
|
6/13
|
6M
|
Germany
|
Holstein Red
|
taurus
|
1/3
|
1M
|
France
|
Blonde d’Aquitaine
|
taurus
|
2/16
|
1F, 1M
|
|
Brown Swiss
|
taurus
|
1/1
|
1M
|
Korea
|
Hanwoo
|
taurus
|
2/21
|
2U
|
Unknown
|
Holstein
|
taurus
|
1/67
|
1F
|
Romagnola
|
taurus
|
2/10
|
1M, 1U
|
San Martinero
|
taurus
|
1/2
|
1M
|
Limonero
|
taurus
|
1/9
|
1U
|
F= female; M=male; U=unknown; n=No. of animals showing T1 haplogroup, N=No. of animals in a breed sampled from the specified country.
|
Table 4
Prevalence of indicine haplogroup (I) in European taurine breeds and composites.
Origin of sample
|
Breeds
|
n/N
|
Sex
|
Haplogroup
|
Australia
|
Composite
|
4/12
|
M
|
I1
|
Brahman
|
1/18
|
M
|
I1
|
Belted Galloway
|
1/2
|
F
|
I1
|
China
|
Holstein
|
5/12
|
F
|
I1
|
New Zealand
|
Hereford
|
1/4
|
M
|
I2
|
Hereford
|
2/4
|
M
|
I1
|
Unknown
|
Shorthorn
|
1
|
F
|
I1
|
F=female; M=male; U unknown; n=No. of animals showing I haplogroup, N=No. of animals
in a breed sampled in a country.
|
In the past, sequences from D-loop region (910 bp long) have been extensively used in the prediction of haplogroups [1, 18]. However, using our filtered D- loop genotype data, MitoTool.py could not differentiate between the two major I and T haplogroups likely because some variants used in previous studies were filtered out of our variant set. In our dataset prior to any filtering, there were 206 D- loop variants compared to 153 D-loop variants in the pre-defined set that MitotoolPy uses for prediction of haplogroups but only 87 variants overlapped. Further, in our filtered set, only 60 D-loop variants overlapped with the 153 MitoToolPy D-loop variants, suggesting that this was the main contributing factor resulting in poor resolution of haplogroups using only the D-loop variants. On the other hand, using our filtered set of sequence variants from the non-D-loop region, MitoToolPy could distinguish the major haplogroups (I, T, P and Q) but did not resolve haplogroup sub-levels. For example, the incidence of unresolved haplogroups was more than 60% of the animals between T1 and T3 (1280), and T3 and T4 (N 15). This indicates that the D-loop variants in our set played a key role in defining the sub-haplogroups when used together with the non-D-loop. This is not unexpected because the higher mutation rate in the D-loop region is more likely to resolve the sub-haplogroup levels (i.e. more recently diverged groups).
Principal Component Analysis
The PCA of the GRM derived from all filtered mitochondrial variants (whole sequence) revealed distinct clusters that corresponded to the I, T and Q major haplogroups after annotation with MitoToolPy results (Figure 1). However, sub-clustering within the major haplogroups T and T3 was not entirely resolved, despite the tendency to marginally separate T1 and T2's (Figure S6a), as well as T3 and T3r (Figure S6b).
A GRM of only D-loop variants was also used for PCA and revealed the same two major clusters (T and I, Figure S7a). Within the I cluster, sub-clusters of I1 and I2 were separated to some extent while T1 and T3 did not separate clearly. Similarly, the variants from the non-D-loop region could segregate T and I haplogroups into separate clusters but did not resolve further into sub-clusters of haplogroups (Figure S7b).
Population structure using Admixture
The population structure based on all mitochondrial sequence variants was determined using Admixture [49], where each animal is assigned a proportional membership of a predetermined number of k population groups (e.g. sub-species, breeds). Depending on the k value used (2 to 6), the major haplogroups were progressively split (Figure 2). Admixture estimated the optimal a priori k value to define population groups (based on the changes in cross-validation errors) as four (k=4) (Figure S8). When annotated with the predicted MitoToolPy haplogroups, the population structure with k=3 showed I separating from two further subpopulations within the T haplogroup. Further sub-groups were apparent at higher k values and these corresponded to sub-haplogroups within T.
Hierarchical clustering
The nucleotide differences between each pair of whole mtDNA variant sequences was calculated using an in-house script. The mean nucleotide difference across all pair combinations was 36 but ranged from 0 to 224. Hierarchical clustering, based on the nucleotide differences matrix between individuals, again presented two broad and distinct clusters (Figure 3: Cluster 1 and 2). The individuals in Cluster 1 and 2 were from the major haplogroups T and I, respectively and Cluster 1 also included animals belonging to the P and Q haplogroups.
Private variants
Private variants are additional variants present in a query mitochondrial sequence but not in the list of haplogroup determining variants. They are of interest because they can provide insights into plausible subgroups that have not been previously catalogued within the pre-defined haplogroups. We therefore examined the distribution of these private variants (output from MitoToolPy) within a haplogroup and/or breed(s). Some of the private variants were specific to a group of animals within a haplogroup (Table S5). For the most part, private variants were transition mutations from the reference allele. Four breeds had members of a sub-haplogroup that showed a specific set of private variants (Table 5). Almost 50% of private variants (N=43) were specific to particular haplogroups, annotated either as missense (50%), upstream/downstream (30%) or synonymous (16%) gene variants (Table S5). In general, a substantial proportion of the private variants specific to a particular haplogroup included 2 SNPs in I1, 1 SNP in I2, 5 SNPs in T1, 2 SNPs in T1b1b1, 1 SNP in T1c, 1 SNP in T2 and 1 SNP in T3. Interestingly, a number of the private variants were annotated as missense, and it is therefore possible that these mutations could have downstream effects on phenotypes.
Table 5
Annotation of the private variants specific to a group of individuals within a breed showing the type of variants and affected region/gene.
Breed
|
Haplo-group
|
Source of sample
|
n/N*
|
Annotation: variant position (bp), type, gene
|
NDama
|
T1
|
Benin,
Guinea
|
7/12
|
2579, NCTE, rRNA
|
4714, Missense, ND2
|
6882, Missense, COX1
|
10435, Missense, ND4L
|
Holstein
|
T3
|
Switzerland,
Canada
|
6/111
|
7948, Missense, COX2
|
T3d1
|
United Kingdom
|
5/7
|
9807, NCTE, tRNA,
|
13277, Missense, ND5
|
Hereford miniature
|
T3
|
Australia
|
2/2
|
5603, Synonymous, ND2
|
Senepol
|
T1
|
Australia
|
3/5
|
6388, Synonymous, COX1
|
*N = total number of animals in a breed in the sub-haplogroup; n = number of animals with private variants in a breed within the haplogroup; NCTE = non-coding transcript exon
|
A maximum-likelihood tree was constructed for whole mitogenomes of only the animals belonging to I haplogroup using MEGA X. This analysis showed four distinct clusters, one cluster corresponded to the I2 haplogroup and the three other clusters were annotated to I1 haplogroup (Figure 4a). The subclusters within I1 haplogroup were labelled as I1a, I1b and I1-Orig. The cluster I1a consisted of a group of 64 animals which were characterised by two group specific (private) variants (1497 bp and 6848 bp). The cluster I1b contained a group of 10 animals with one group specific variant (5707 bp) (Table 6).
The third I1 cluster, I1-Orig, consisted of the remaining 38 animals under I1 haplogroup in which the private variants specific to I1a and I1b were not present. The cluster I1a was mainly composed of Chinese indicine breeds except for two Buryat animals (Russia), while I2, I1-Orig and I1b were mostly indicine breeds from the Indian subcontinent and Chinese indicine breeds (Table 6).
Table 6
Breed annotation and the number of animals within subclusters of the indicus (I) cluster based on alternate clustering techniques.
Cluster I2
(N=19) (p/q)1
|
Cluster I1b
(N=10) (p/q)
|
Cluster I1a
(N=64) (p/q)
|
Cluster *I1-Orig
(N=38) (p/q)
|
Achai (1/4)
Bhagnari (1/4)
Cholistani (1/5)
Dhanni (2/5)
Dianzhong (1/5)
Gabrialli (1/5)
Gir (1/1)
Kangayam (1/1)
Nari Master (1/4)
Red Sindhi (1/3)
Sahiwal (5/7)
Vechur (1/1)
Hereford (1/48)
Unknown (1)
|
Bhagnari (1/4)
Cholistani (2/5)
Dajal (2/4)
Dhanni (1/5)
Gabrialli (1/5)
Hariana (1/1)
Composite (2/13)
|
Bohai Black (2/5)
Buryat (2/21)
Chaidamu Yellow (2/5)
Dabieshan (2/3)
Dianzhong (1/5)
Guangfeng (3/4)
Jian (3/3)
Jiaxian Red (2/5)
Jinjiang (3/4)
Leiqiong (3/3)
Lingnan (6/7)
Luxi (5/5)
Nanyang (3/3)
Sichuan Indigenous (1/1)
Wandong (2/2)
Wannan (3/7)
Weining (3/4)
Wenshan (4/6)
Xuanhan (2/5)
Zaobei (4/5)
Holstein (5/330)
Unknown (3)
|
Achai (1/4)
Bhagnari (2/4)
Brahman (1/29)
Cholistani (2/5)
Dajal (2/4)
Dhanni (2/5)
Dianzhong (1/5)
Jiaxian Red (1/5)
Kazakh (2/9)
Lohani (1/1)
Mongolian (1/7)
Nari Master (2/4)
Red Sindhi (2/3)
Sahiwal (2/7)
Tharparkar (8/8)
Zebu Indian (1/1)
Composite (2)
Galloway Belted (1/3)
Shorthorn (1/1)
Hereford (2/48)
Unknown (1)
|
1 N = total number of animals in the cluster, p = No. of animals within breed in the haplogroup, q = total No. of animals within the breed, *I1-Orig =remaining animals under I1 haplogroup after assignment of other animals to I1a and I1b
|
Further, to explore the substructure of the I haplogroups revealed by the phylogenetic tree, the mitogenomes of only the animals assigned to haplogroup I (by MitoToolPy) were reanalysed using PCA of the GRM, Admixture and hierarchical clustering. The PC plot also showed sub-grouping of the I1 haplogroup into three well-separated clusters that were distinct from I2 (Figure 4b). Similarly, using Admixture with k set to 2, 3 or 4, there was distinct substructure within the I haplogroups (Figure 4c). With k =2, Admixture separated I2 and I1, and with k = 4 there was further clear separation of I2, I1a, I1b and I1-Orig, in agreement with the PC plot. The hierarchical clustering analysis (based on animals' pairwise nucleotide differences) showed two main clusters (1 and 2 in Figure 4d). Further, distinct sub-clusters were observed within both Cluster 1 and Cluster 2, with three main subclusters under I1 haplogroups that matched those identified from the other methods (Figure 4d). The I2 cluster showed one outlier that was in agreement with the PC plot outlier (i.e. the same animal). In all the above unsupervised clustering analyses, sub-clusters I1a and I1b were comprised of the same group of animals. Interestingly, all three unconventional mitochondrial clustering methods reproduced the same grouping of these animals as with maximum likelihood method.
Mitochondrial haplotype diversity
Overall, across 1,883 animals, 1,309 whole mitochondrial genome haplotypes were identified with haplotype diversity of 0.999 (SD 0.0001). Of the 1,309 haplotypes, 1,010 were singletons (i.e., one animal per haplotype) indicating considerable diversity. The remaining haplotypes (299) were shared by 2 to 23 animals (Figure S9). The shared haplotypes were approximately 60% within a breed and 25% between the breeds. The haplotype diversity within breed was generally high and ranged from 0.932-0.998 (Table 7). The shared haplotypes specific to a breed were also found across animals sampled in several different countries. Two haplotypes distinct to Angus were present in animals sourced from Canada and USA. Additionally, some haplotypes were shared among several breeds and across several countries. For example, one haplotype was identified in 23 animals from a wide range of breeds including Holsteins sourced from China and a number of other breeds mostly of Asian origin (Luxi, Lingan, Zaobei, Weining, Wannan, Jian, Jinjiang, Wenshan, Nanyang, Xuanhan, Leiqiong and Bohai Black). Similarly, another haplotype was shared by 23 animals in Angus (Canada), Brown Swiss (USA), Charolais (France), Deutsches Schwarzbuntes Niederungsrind (Germany), Gelbvieh (Canada), Hereford (Australia, Russia and USA), Holsteins (Netherlands and USA), Rodkulla (Sweden), Red Angus (USA), Hanwoo (Korea), Belgian Blue and a composite breed (Australia). Among the breeds, Holstein was the most numerous breed in our study (N=267), therefore it was of interest to examine the network of haplotypes within Holsteins from a total of 210 haplotypes (168 singletons and 42 shared) (Figure 5). Haplotypes from T3 and subgroups formed the core of the network with side branches in agreement with MitoToolPy I and T1 haplogroup allocations.
Table 7
Mitochondrial DNA sequence polymorphism and diversity (standard deviation) of selected breeds with a sample size of 20 or more.
Breed
|
No. of
Sequences
|
No. of Segregating
Sites
|
Average No. of
difference
|
No. of Haplotypes
(H)
|
Haplotype
Diversity (Hd)
|
Nucleotide
diversity (π)
|
Holstein
|
267
|
697
|
16.96
|
210
|
0.998 (0.001)
|
0.0055 (0.0010)
|
Jersey
|
27
|
57
|
9.62
|
16
|
0.937 (0.031)
|
0.0031 (0.0004)
|
Brown Swiss
|
84
|
202
|
8.99
|
64
|
0.993 (0.003)
|
0.0029 (0.0001)
|
Simmental
|
32
|
80
|
7.55
|
24
|
0.976 (0.002)
|
0.0025 (0.0002)
|
Norwegian Red
|
222
|
338
|
10.51
|
180
|
0.998 (0.001)
|
0.0034 (0.0001)
|
Holstein Friesians
|
35
|
121
|
9.94
|
31
|
0.992 (0.010)
|
0.0032 (0.0003)
|
DSN
|
47
|
154
|
10.2
|
40
|
0.992 (0.007)
|
0.0033 (0.0002)
|
Angus
|
103
|
122
|
7.65
|
45
|
0.935 (0.014)
|
0.0025 (0.0001)
|
Yakut
|
35
|
68
|
9.44
|
15
|
0.938 (0.018)
|
0.0031 (0.0003)
|
Hereford
|
44
|
312
|
33.84
|
33
|
0.979 (0.012)
|
0.011 (0.0043)
|
Charolais
|
33
|
114
|
8.856
|
31
|
0.996 (0.009)
|
0.0028 (0.0002)
|
Limousin
|
27
|
100
|
8.80
|
25
|
0.994 (0.012)
|
0.0029 (0.0002)
|
Modern Danish Red
|
23
|
73
|
11.55
|
19
|
0.980 (0.020)
|
0.0038 (0.0002)
|
Hanwoo
|
24
|
146
|
15.59
|
23
|
0.996 (0.013)
|
0.0051 (0.0013)
|
Buryat
|
20
|
250
|
48.16
|
12
|
0.932 (0.035)
|
0.0157 (0.0070)
|
To test the ability of a naïve hierarchical clustering approach to differentiate haplotypes across all animals, we used the height of the cluster (h) corresponding to the nucleotide difference of 0 between two haplotype pairs. The resulting cluster groups were compared with the haplotypes derived from the DnaSP, mainly focussing on the non-singleton haplotypes and hierarchical clusters. There was approximately the same number of singleton clusters as singleton haplotypes (1032). At least 132 clusters and haplotypes had substantial memberships in common (Table S7). For example, Cluster 328 and Haplotype-324 (with 23 each), Cluster-7 and Haplotype-7 (21 animals each) and all other cluster-haplotype combinations with more than five animals (total 30) had 100% of the same individuals except for five groups. This demonstrates a high concordance between determination of haplotypes by hierarchical cluster and the traditionally determined haplotypes.
Mitochondrial DNA polymorphism and nucleotide diversity
We investigated mitochondrial nucleotide diversity in animals from breed groups with N ≥ 20 animals. Overall, there were 1,825 segregating sites, nucleotide diversity (π) was 0.012, and the average nucleotide difference between the pair of sequences was 35.5. The nucleotide diversity was high in Buryat and Hereford and other breeds had low but comparable nucleotide diversity ranging from 0.002-0.005. The analysis of molecular variance (AMOVA) showed that the percentage of genetic variation from among and within breed components was 3.1% and 96.9%, respectively, indicating high within breed genetic diversity.
Imputation and MitoToolPy haplogroup prediction
The routine practice of discarding the sites with missing genotypes from all sequences in the mtDNA analysis results in loss of information, particularly when the proportion of missing genotypes in an individual were low. In this case, the imputation of sporadic missing genotypes could increase the number of animals and sites for analysis, but the empirical accuracy of mitochondrial imputation in cattle is unknown. To test this, we masked 10% of known genotypes (307 sites) in a random 20% of animals (333 out of 1,883). Then we imputed the masked genotypes using Beagle (version 4.0) using the gt and ref option using the remaining 1,550 animals with all genotypes present as a reference for imputation. The overall concordance of this imputation was 99.8%, although concordance for heteroplasmic sites was approximately 66% (Table 8). There was a tendency for imputation to bias heteroplasmic and homoplasmic alternate genotypes towards the homoplasmic reference genotypes (0/0). The genotype likelihood ‘gl’ option also produced a similar concordance of 99.5%.
To evaluate the effect of imputation on haplogroup prediction, we re-analysed the animal haplogroups in the imputed dataset and compared these to their haplogroup prediction from the original dataset. This was replicated 50 times with a new random set of animals chosen for masking genotypes for imputation. The predicted haplogroups matched in 99.7% of the individuals when compared to their haplogroup predicted from the full set of real genotypes. The accuracy of imputation and the predicted haplogroup for the masked dataset showed little variation across the 50 replications (Table S8). This suggests that missing genotypes can be imputed and used for prediction of haplogroups with high but not perfect accuracy. These results are provided for information only, that is, no imputed data was used elsewhere in this study.
Table 8
Empirical accuracy of imputing sporadic missing genotypes in mitogenomes. Number of correctly imputed genotypes (percentage correct in brackets) on the diagonals and number of genotypes wrongly imputed shown on the off-diagonals. Assessment was based on randomly masking of 10% of positions (307) per animal in 20% of animals (333).
|
|
Original Genotype
|
|
|
|
0|0
|
0|1
|
0|2
|
1|1
|
2|2
|
3|3
|
|
Imputed genotype
|
0|0
|
101089
|
39
|
|
73
|
|
1
|
|
|
(99.9%)
|
|
|
|
|
|
|
0|1
|
37
|
82
|
|
11
|
|
|
|
|
|
(65.6%)
|
|
|
|
|
|
0|2
|
2
|
|
3
|
|
2
|
|
|
|
|
|
(100%)
|
|
|
|
|
1|1
|
64
|
4
|
|
800
|
1
|
|
|
|
|
|
|
(90.4%)
|
|
|
|
2|2
|
4
|
|
|
1
|
16
|
|
|
|
|
|
|
|
(84.2%)
|
|
|
3|3
|
|
|
|
|
|
2
|
|
|
|
|
|
|
|
|
(67.0%)
|
|
|
Total
|
101196
|
125
|
3
|
885
|
19
|
3
|
|
|
|
|
|
|
|
|
|
101992/102231
(99.8%)
|