Material
The data set analyzed (Table 1) represented data from the Polish national genetic evaluation of Holstein cattle from December 2021 for stature which represents a highly heritable trait (h2 = 0.54) and foot angle representing a low heritable trait (h2 = 0.09). It comprised 1,098,611 cows with phenotypes for stature and 1,098,766 cows with phenotypes for foot angle born from 1992 to 2019 as well as 141,397 bulls (stature), and 117,482 bulls (foot angle) born from 1986 to 2017 with pseudo-phenotypes expressed by their de-regressed proofs (DRP) from the multiple across country evaluation (MACE) carried out by the Interbull (interbull.org). Genomic data in the form of genotypes of 46,118 SNPs were available for 134,960 individuals, including 70,134 cows with phenotypes born from 2009 to 2021 as well as 64,826 bulls born from 1985 to 2021, of which 26,471 represented young individuals without pseudo-phenotypes and 38,355 were bulls with MACE-DRP. In our study, the pedigree of animals with phenotype data was truncated after the 5th generation resulting in 1,555,995 individuals and 33 genetic groups.
Table 1
Number of animals in the analyzed data sets
Category | Sex | Number of animals |
Phenotype data (Stature) | Females with phenotypes | 1,098,611 |
Males with MACE DRP | 141,397 |
Phenotype data (Foot angle) | Females with phenotypes | 1,098,766 |
Males with MACE DRP | 117,482 |
Genotype data | Females | 70,134 |
Males (bulls and candidates) | 64,826 |
Pedigree data | Females | 1,368,487 |
Males | 187,508 |
Models
GEBV prediction models
The following single-step single-trait models were considered for the prediction of GEBV:
$$\:\mathbf{y}=\mathbf{X}\mathbf{b}+{\mathbf{W}}_{\mathbf{G}}\mathbf{a}+\mathbf{e}$$
1
,
\(\:\mathbf{y}\) is the vector of dependent variables represented by cows’ measured phenotypes and bulls' pseudo phenotypes expressed by their MACE DRPs, \(\:\mathbf{b}\) represents a vector of fixed effects including age at calving, lactation phase, and herd corresponding to cows’ phenotypes as well as corresponding phantom codes of the fixed effects for bulls’ DRPs, \(\:\mathbf{a}\) represents a vector of breeding values, and \(\:\mathbf{e}\) is the residual. The underlying covariance structure of the random effects is given by \(\:\mathbf{a}\sim\mathbf{N}\left(0,{\mathbf{H}}_{\mathbf{G}}{{\sigma\:}}_{\text{a}}^{2}\right)\) and \(\:\mathbf{e}\sim\mathbf{N}\left(0,\mathbf{R}{{\sigma\:}}_{\text{e}}^{2}\right)\). \(\:{\mathbf{H}}_{\mathbf{G}}\) is given by \(\:\left[\begin{array}{cc}{\mathbf{A}}_{11}&\:{\mathbf{A}}_{12}\\\:{\mathbf{A}}_{21}&\:{\mathbf{A}}_{22}\end{array}\right]+\left[\begin{array}{cc}0&\:0\\\:0&\:\mathbf{G}-{\mathbf{A}}_{22}\end{array}\right]\), where \(\:{\mathbf{A}}_{11}\), \(\:{\mathbf{A}}_{12}/{\mathbf{A}}_{21}\), and \(\:{\mathbf{A}}_{22}\) being the components of the numerator relationship matrix constructed based on the pedigree information corresponding to non-genotyped animals, the covariance between non-genotyped and genotyped animals, as well as to genotyped animals, while \(\:\mathbf{G}\) represents the genomic covariance matrix between genotyped animals. \(\:\mathbf{R}\) is a diagonal matrix containing 1.00 for cows with phenotypes or \(\:{\text{n}}_{\text{i}}\) for bulls with MACE DRPs, where \(\:{\text{n}}_{\text{i}}\) represents a difference in effective daughter contributions of i-th bull between the MACE and the national evaluation. \(\:\mathbf{X}\) and \(\:{\mathbf{W}}_{\mathbf{G}}\) denote the corresponding design matrices.
$$\:\mathbf{y}=\mathbf{X}\mathbf{b}+{\mathbf{W}}_{\mathbf{S}}\mathbf{a}+\mathbf{e}$$
2
,
where \(\:\mathbf{y}\), \(\:\mathbf{b}\), \(\:\mathbf{a}\), and \(\:\mathbf{e}\) are the same as defined above, but \(\:\mathbf{a}\) is parameterised as \(\:\mathbf{Z}\mathbf{g}+\:\mathbf{u}\) with \(\:\mathbf{g}\) being the vector of random SNP effects and \(\:\mathbf{u}\) - a vector of random additive residual polygenic effects. The covariance structure imposed on the residual effect (\(\:\mathbf{e}\)) is the same as defined above, while for the additive genetic effect \(\:\mathbf{a}\) the distribution is defined as \(\:\mathbf{a}\sim\mathbf{N}\left(0,{\mathbf{H}}_{\mathbf{S}}{{\sigma\:}}_{\text{a}}^{2}\right)\) where the structure of \(\:{\mathbf{H}}_{\mathbf{S}}\) is expressed by \(\:\left[\begin{array}{ccc}\mathbf{K}\mathbf{G}{\mathbf{K}}^{\mathbf{{\prime\:}}}+\mathbf{D}&\:\mathbf{K}\mathbf{G}&\:\mathbf{K}\mathbf{Z}\mathbf{B}\\\:\mathbf{G}{\mathbf{K}}^{\mathbf{{\prime\:}}}&\:\mathbf{G}&\:\mathbf{Z}\mathbf{B}\\\:\mathbf{B}{\mathbf{Z}}^{\mathbf{{\prime\:}}}{\mathbf{K}}^{\mathbf{{\prime\:}}}&\:\mathbf{B}\mathbf{Z}{\prime\:}&\:\mathbf{B}\end{array}\right]\) with \(\:\mathbf{D}={{(\mathbf{A}}^{11})}^{-1}\), \(\:\mathbf{K}={\mathbf{A}}_{12}{\mathbf{A}}_{22}^{-1}\), and \(\:\mathbf{B}\) representing a diagonal matrix of the form\(\:\:\mathbf{I}\frac{1}{\sum\:_{\text{i}=1}^{\text{N}}2{\text{p}}_{\text{i}}\left(1-{\text{p}}_{\text{i}}\right)}\) with \(\:{\text{p}}_{\text{i}}\) denoting the allele frequency of the i-th SNP and N is the number of SNPs. \(\:\mathbf{X}\:\text{a}\text{n}\text{d}\:{\mathbf{W}}_{\mathbf{S}}\:\) denote the corresponding design matrices (Liu et al. 2016). Note that for none of the models, the variance components and the proportion of residual additive polygenic variance were estimated, instead their values corresponding to the parameters used in the Polish national genetic and genomic evaluation for a given trait (Table 2) were used.
Table 2
Variance components underlying the analyzed phenotypes
Trait | Genetic variance | Residual variance | Residual polygenic variance |
Stature | 5.01 | 4.63 | 20% |
Foot angle | 0.11 | 1.06 | 20% |
Note that \(\:{{\sigma\:}}_{\text{a}}^{2}\) and \(\:{{\sigma\:}}_{\text{e}}^{2}\) used in all of the above models, were not estimated, instead values from the national conventional evaluation were used. \(\:{{\sigma\:}}_{\text{a}}^{2}=5.50058\), \(\:{{\sigma\:}}_{\text{e}}^{2}=4.63406\).
Solving the GEBV prediction equations
Solutions for the effects fitted in the above models were obtained using the MiXBLUP 3.0 software (Ten Napel et al. 2020) that optimise the following equation: \(\:{{\mathbf{D}}^{-1}\mathbf{M}}^{-1}\mathbf{C}\mathbf{x}={{\mathbf{D}}^{-1}\mathbf{M}}^{-1}\mathbf{p},\) where \(\:\mathbf{C}\) represents the coefficient matrix corresponding to the Mixed Model Equations (MME) for solving (1) or (2), \(\:\mathbf{x}\:\)is the vector of model parameters, and \(\:\mathbf{p}\) is the RHS of MME, while \(\:\mathbf{M}\) and \(\:\mathbf{D}\) respectively represent the first level and the second level preconditioning matrices. The computations were performed on a dedicated computing server running the Linux Red operating system with 260GB of RAM, 16 Intel Xeon CPUs with 2.20GHz, and 600GB hard disk space. For a large national dairy population, as the one used in our study, the numerically severe challenge is to obtain the inverse of the \(\:{\mathbf{H}}_{\mathbf{G}}\) and \(\:{\mathbf{H}}_{\mathbf{S}}\) matrix, in particular of its component related to genotyped individuals – the dense sub-matrix \(\:\mathbf{G}\). In our study, the following approaches were considered:
-
the GT approach (Mäntysaari et al. 2020) in which the \(\:{\mathbf{G}}^{-1}\) matrix is represented by \(\:\frac{1}{\text{w}}{\mathbf{A}}_{22}^{-1}-\frac{1}{\text{w}}{\mathbf{T}}^{{\prime\:}}\mathbf{T}\), with \(\:\mathbf{T}={\mathbf{L}}^{-1}{\mathbf{Z}}^{{\prime\:}}{\mathbf{A}}_{22}^{-1}\) where \(\:w\) denotes the proportion of a residual polygenic variance and \(\:\mathbf{L}\) is defined by \(\:\mathbf{L}{\mathbf{L}}^{\mathbf{{\prime\:}}}={\mathbf{Z}}^{{\prime\:}}{\mathbf{A}}_{22}^{-1}\mathbf{Z}+\text{w}\mathbf{I}\).
-
the APY approach (Misztal et al. 2014) that divides the genotyped individuals into a core and non-core sub-groups for which the inverse is handled differently in a way that the exact inverse is computed only for the core animal sub-group (\(\:{\mathbf{G}}_{\text{c}}\)) and the covariance between core and non-core individuals, while the part of the matrix corresponding to the non-core genotyped animals (\(\:{\mathbf{G}}_{\text{n}}\)) is handled as a diagonal matrix: \(\:{\mathbf{G}}^{-1}\approx\:\left[\begin{array}{cc}{\mathbf{G}}_{\text{c}}^{-1}&\:0\\\:0&\:0\end{array}\right]+\left[\begin{array}{c}-{\mathbf{G}}_{\text{c}}^{-1}{\mathbf{G}}_{\text{c}\text{n}}\\\:\mathbf{I}\end{array}\right]{\mathbf{M}}_{\text{n}}^{-1}\left[\begin{array}{cc}-{\mathbf{G}}_{\text{n}\text{c}}{\mathbf{G}}_{\text{c}}^{-1}&\:\mathbf{I}\end{array}\right]\) where the subscripts \(\:\text{n}\) and \(\:\text{c}\) represent non-core and core individuals respectively, and \(\:{\mathbf{G}}_{\text{n}}\) is a diagonal matrix. In our study, four approaches to the selection of core animals were considered: APY3000top – where 3,000 genotyped bulls with the highest effective daughter contributions (EDC) were selected as core individuals, APY3000random – where the 3,000 core individuals were selected randomly from the genotyped population and the corresponding versions termed APY15000top, APY10000random, and APY15000random implementing 10,000 and 15,000 core individuals respectively.
-
The SNP-BLUP approach (Liu et al. 2014) does not meet the numerical burden of the models based on \(\:\mathbf{G}\), since in terms of the modelling of genomic information only the inverse of the diagonal matrix \(\:\mathbf{B}\) is required: \(\:{\mathbf{H}}_{\mathbf{S}}^{-1}=\left[\begin{array}{ccc}{\mathbf{A}}^{11}&\:{\mathbf{A}}^{12}&\:0\\\:{\mathbf{A}}^{21}&\:{\mathbf{A}}^{22}+\left(\frac{1}{\text{w}}-1\right){\mathbf{A}}_{22}^{-1}&\:-\frac{1}{\text{w}}{\mathbf{A}}_{22}^{-1}\mathbf{Z}\\\:0&\:-\frac{1}{\text{w}}{\mathbf{Z}}^{{\prime\:}}{\mathbf{A}}_{22}^{-1}&\:\frac{1}{1-\text{w}}{\mathbf{B}}^{-1}+\frac{1}{\text{w}}{\mathbf{Z}}^{{\prime\:}}{\mathbf{A}}_{22}^{-1}\mathbf{Z}\end{array}\right]\) where \(\:\left[\begin{array}{cc}{\mathbf{A}}^{11}&\:{\mathbf{A}}^{12}\\\:{\mathbf{A}}^{21}&\:{\mathbf{A}}^{22}\end{array}\right]\) represent the blocks corresponding to \(\:{\left[\begin{array}{cc}{\mathbf{A}}_{11}&\:{\mathbf{A}}_{12}\\\:{\mathbf{A}}_{21}&\:{\mathbf{A}}_{22}\end{array}\right]}^{-1}\).
Validation of GEBV
For the validation of GEBV prediction, bulls born after 2013 (7,296 bulls for stature; 6,200 bulls for foot angle) and cows born after 2016 (96,772 cows for stature; 96,771 cows for foot angle) were removed from the phenotype vector (\(\:\mathbf{y}\)) of equations (1) and (2). Based on such truncated data sets, the GEBVs of bulls with EDC greater than 20 were then predicted by models (1) and (2). Prediction accuracy was assessed by a weighted linear regression: \(\:{\mathbf{G}\mathbf{E}\mathbf{B}\mathbf{V}}_{\text{f}}={\text{b}}_{0}+{\text{b}}_{1}{\mathbf{G}\mathbf{E}\mathbf{B}\mathbf{V}}_{\text{t}}+\mathbf{e}\), with \(\:{\mathbf{G}\mathbf{E}\mathbf{B}\mathbf{V}}_{\text{f}}\) representing the vector of GEBVs predicted based on the full data set with all available individuals while \(\:{\mathbf{G}\mathbf{E}\mathbf{B}\mathbf{V}}_{\text{t}}\) contains the GEBVs predicted based on the truncated data set. For i-th bull weights were defined as \(\:\:\frac{{\text{E}\text{D}\text{C}}_{\text{i}}}{{\text{E}\text{D}\text{C}}_{\text{i}}+\text{k}}\) with \(\:\text{k}=\:\frac{4-{\text{h}}^{2}}{{\text{h}}^{2}}\). The above linear regression equation was fitted using the lm function in R software (Rstudio Team 2021).