Genotype Data
This study referred to a previous study that included 354 Ashdan female yaks and 98K SNPs. The data (Jia et al. 2020) link is (https://www.animalgenome.org/repository /pub/NWAU2019.0430/). Three types of files were obtained: ‘ped’ file, ‘map’ file, and phenotype file. The ‘ped’ file contained 98688 SNPs, the ‘map’ file contained genetic map information, and the phenotype file contained four growth traits, all obtained from 354 yaks. The PLINK 1.9 software (http://www.cog-genomics.org/ plink2) was used to convert ‘ped’ files into numeric format genome information files. The genotypes were coded as 0, 1, and 2, corresponding to major alleles, heterozygous genotype, and minor alleles, respectively.
Based on the original dataset, we divided the whole data into several sub-datasets: 10K, 30K, 50K, and 70K. They were randomly selected from 98K data, with each data size generating 30 replications. Missing genotypes were randomly sampled in proportions of 5%, 10%, 15%, and 20% and replicated 30 times. The missing genotypes were represented by ‘NA’, and the data containing the missing genotypes were used as the reference dataset. The missing genotypes in the reference dataset were imputed using the Beagle method and the KBeagle method. The missed genotype data was saved in another independent document as the true genotype of the prediction dataset.
Imputation Algorithm
Beagle
Beagle 5.1 used hidden Markov models (HMM) for genotype imputation. In the Li and Stephens (2003) model, the HMM state space is a matrix of reference alleles, with each row recorded as a reference haplotype, each column recorded as a marker, and each HMM state labeled with the allele carried by the reference haplotype at the marker. For each marker, the sum of the probabilities of the states labeled by an allele was the imputed probability for that allele (Browning et al. 2018). The probability of a specific haplotype was calculated by adding the probabilities from the first marker to the last marker. The haplotype set was obtained by genotyping the target samples. The model was established according to the common gene sequences between the target sample and the reference panel. The missing alleles in the target sample were predicted and imputed using the calculated probability of allele markers in the reference panel.
KBeagle
The principle of the KBeagle method is to use the K-Means algorithm to calculate the genetic distance of samples with missing genotypes, classifying the samples with close genetic distances into one clustered group, and then use the Beagle software to estimate the missing genotype of samples in each clustered group. The K-Means algorithm divided the population into K cluster groups, each cluster group with several uncertain samples. This study assumed that the population was divided into 10 cluster groups, and the function fviz_nbclust in the factoextra package was used to find the best cluster number K less than or equal to 10.
The whole KBeagle progress consisted of 6 steps: Data Inputting, K-Means Clustering, Clustering Result, Beagle Imputing, Imputing Result, and Data Outputting (Fig 1). Numeric data formats could be accepted as input data. The whole data should include the taxa and the genotype with the missing value. The genotypes such as homozygotes of major and minor alleles, and heterozygotes should be coded as 0, 2, and 1 respectively (Fig 1A). Genome wide markers were used to estimate the genetic distance between each sample. Additionally, the K-Means algorithm was used to select K cluster centers, calculate the distance between the cluster center and the samples, and group those closest to the cluster center into one class. All samples were clustered into K classes (Fig. 1B). Next, all samples were reordered according to the clustering results, samples in the same cluster together were arranged, and K data sets were generated for subsequent imputation (Fig. 1C). The Beagle software was used to impute the samples in each data set (Figure 1D). K data sets containing all genotypes were imputed with Beagle software (Fig. 1E). The input samples were arranged in the original order and outputted according to the ID order of the input data (Fig. 1F).
Definition of imputation ability
To compare the imputation ability of the two genotype imputation methods, this study used the imputation matching rate and computing time as the evaluation criteria. The imputation matching rate was referred to as the ratio of the number of correctly imputed genotypes to the number of totally imputed genotypes. The calculated formula is
Computing time was defined as the time it took from the beginning of imputation to the end of imputation, including data inputting, data processing, genotype imputation and the calculation of imputation accuracy. The system.time function started measuring the time from the data inputting step and ended when the imputation accuracy was calculated, thus, the total time of all steps was the computing time.
Genome selection
Simulated phenotype
In this study the GAPIT (Wang and Zhang 2021)(www.zzlab.net/GAPIT) software package Gapit.Phenotype.Simulation was used to simulate the population phenotype. In order to evaluate the effectiveness of genome selection using the imputed genotype data, the phenotypic data was simulated using the genotype data with missing values. Before running the simulation, we needed to calculate the missing rate of each SNP and arrange the misssing rate from high to low. When simulating the phenotype, the heritability was set to 75%, the number of quantitative trait nucleotides (QTNs) was set to 20, and the loci of QTNs were selected from the SNPs with the highest missing rates. The simulated phenotypic and genotypic data were then subjected to genomic selection analysis.
Prediction model
The full name of rrBLUP (Medina et al. 2021) is ‘ridge regression best linear unbiased prediction’, which is the best linear unbiased prediction of ridge regression. It is one of the most commonly used models for genome selection and also represents the indirect method models. The principle of the indirect method is to estimate the marker effect in the reference population, and then combine the genotype information of the predicted population to accumulate the marker effect and to obtain the individual estimated breeding value of the predicted population.
The rrBLUP package mainly included two functions: the A.mat function and the mixed.solve function. The A.mat function was used to filter and impute genotypes, and the mixed.solve function directly solved model parameters.
The mixed.solve function was the core of the rrBLUP package, and the solution (Endelman 2011) form was
Estimated Heritability
First, the phenotype was simulated using the Gapit.Phenotype.Simulation function in the GAPIT software package. Then, the real data, the data containing the missing genotypes and the imputed data were analyzed by genome selection using the GAPIT software package, and the heritability of the three datasets was calculated, and selected the gBLUP model for calculation. Each dataset was processed 30 times, and the average of the 30 replicates was taken as the heritability of each data.
Prediction accuracy
Prediction accuracy was defined as the Pearson correlation coefficient between the predicted phenotypic value and the true phenotypic value with higher values indicating higher accuracy. First, the simulated phenotype data should be randomly divided into training and test phenotypes, and the corresponding genotype data should be divided into training and test sets, according to the classification results. The model was then built in the mixed.solve function of the rrBLUP package in R, substituting the training phenotype and training set, and selecting the maximum likelihood method for the calculation. Then, the test set was multiplied by the coefficient in the model to obtain the predicted phenotype value. Therefore, the Pearson correlation coefficient between the predicted phenotype value and the test phenotype gave the prediction accuracy. This was repeated 50 times and the average value was used to obtain the average accuracy of cross validation. Finally, all the data were repeated 30 times, and the average value was obtained to determine the average accuracy of all the data.