Data source
The data used for this study were obtained from a genome-wide study by Yuan et al. (2018) of four Chinese indigenous chicken breeds. The study used 394 birds from four indigenous breeds: two typical low-body-weight breeds (Chahua chicken and Silkie chicken) and two intermediate-to-high-body-weight breeds (Beard chicken and Langshan chicken). The birds were randomly selected from the original conserved population and bred by artificial insemination of hens with sperm pools. The fertilized eggs were incubated, and the hatched chicks comprised approximately equal numbers of cocks and hens. Hens and cocks of the different breeds were phenotyped for body weight.
Phenotyping
Live body weight (BW) was measured at hatch and every week until 15 weeks of age.
Sample collection, DNA extraction and genotyping
Blood samples were collected from 394 birds at 15 weeks of age. DNA was extracted using the phenol-chloroform method and diluted to 50 ng/ml. Genotyping was performed using Illumina 60K Chicken SNP BeadChips (Groenen et al., 2011). Quality control was conducted on all birds (after quality control of their phenotypic records) across the four breeds using customized scripts in R version 3.3.0 (R Core Team, 2017). Single-nucleotide polymorphism (SNP) filtering was done using the following criteria: samples with call rates < 0.9 were excluded, and SNPs with minor allele frequency (MAF) < 0.05 were removed. After these quality control checks, a total of 46,211 SNPs remained.
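The two quality control criteria above can be sketched as follows. This is an illustrative Python sketch, not the authors' customized R scripts; it assumes genotypes are coded as 0/1/2 allele counts with `None` marking missing calls.

```python
# Illustrative sketch of the QC step: drop samples with call rate < 0.9,
# then drop SNPs with minor allele frequency (MAF) < 0.05.
# Genotype coding assumption: 0/1/2 allele counts, None = missing call.

def call_rate(genos):
    """Fraction of non-missing genotype calls for one sample."""
    return sum(g is not None for g in genos) / len(genos)

def maf(snp_column):
    """Minor allele frequency of one SNP across non-missing samples."""
    calls = [g for g in snp_column if g is not None]
    p = sum(calls) / (2 * len(calls))   # frequency of the counted allele
    return min(p, 1 - p)

def qc_filter(matrix, min_call_rate=0.9, min_maf=0.05):
    """Rows are samples, columns are SNPs; returns the filtered matrix."""
    kept = [row for row in matrix if call_rate(row) >= min_call_rate]
    cols = list(zip(*kept))
    keep_snps = [j for j, col in enumerate(cols) if maf(col) >= min_maf]
    return [[row[j] for j in keep_snps] for row in kept]
```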
Genotype preprocessing and coding
The 46,211 markers had some missing genotypes; these were imputed using the A.mat function of the rrBLUP package in R. Markers with more than 50% missing genotypes were not imputed, leaving 26,698 markers for data analysis. Genotypes were coded as {0, 1, 2} following the R script by Eva KF Chan (http://www.evachan.org/rscripts.html).
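The per-marker imputation logic can be sketched as below. The study used rrBLUP's A.mat in R; this Python sketch substitutes simple per-marker mean imputation for illustration, with the 50%-missing drop threshold taken from the text.

```python
# Sketch of marker imputation, assuming {0, 1, 2} coding with None for
# missing calls. Mean imputation stands in for A.mat's procedure.

def impute_marker(column, max_missing=0.5):
    """Mean-impute one marker; return None if too many calls are missing
    (such markers were dropped rather than imputed in the study)."""
    missing = sum(g is None for g in column)
    if missing / len(column) > max_missing:
        return None
    observed = [g for g in column if g is not None]
    mean = sum(observed) / len(observed)
    return [mean if g is None else g for g in column]
```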
Data analysis
Conventional Neural Network
The conventional neural network model adopted for this research was a feed-forward artificial neural network (ANN). The data were split into training, validation and test sets, with 80% used for training, 10% for validation and the remaining 10% for testing. Sampling was random in order to avoid any selection bias in the dataset. Prediction (testing) was done with the remaining data that were not used during the training and validation phases. Data preprocessing and analysis were done in R, while the construction and training of neural networks were performed in MATLAB. A three-layer, fully connected, feed-forward artificial neural network was built using the MATLAB function "feedforwardnet", with an 18-32-1 architecture (18 nodes in the first hidden layer, 32 nodes in the second hidden layer and one node in the output layer) (Ma et al., 2018). Error optimization was done using the back-propagation algorithm (Rumelhart et al., 1986). Nodes in one layer were fully connected to all nodes in the next layer, forming a multilayer ANN architecture.
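The forward pass of such an 18-32-1 network can be sketched as follows. The study built the network with MATLAB's feedforwardnet; this NumPy sketch is only illustrative, and the tanh hidden activations, weight initialization and marker count are assumptions, not taken from the source.

```python
# Minimal NumPy sketch of a fully connected 18-32-1 feed-forward network:
# two tanh hidden layers (18 and 32 nodes) and one linear output node.
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Random weights and zero biases for one fully connected layer."""
    return rng.normal(0, 0.1, (n_in, n_out)), np.zeros(n_out)

def forward(x, layers):
    """Propagate x through tanh hidden layers to a linear output."""
    for W, b in layers[:-1]:
        x = np.tanh(x @ W + b)
    W, b = layers[-1]
    return x @ W + b            # linear output: predicted phenotype

n_markers = 100                 # illustrative input size (markers per bird)
layers = [init_layer(n_markers, 18), init_layer(18, 32), init_layer(32, 1)]
pred = forward(rng.normal(size=(5, n_markers)), layers)   # 5 birds
```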
Conventional statistical model: Ridge Regression-Based Linear Unbiased Prediction (RR-BLUP)
Ridge Regression-Based Linear Unbiased Prediction (RR-BLUP) was adopted as the conventional statistical model. Given the genotype matrix Z (n × p) and the corresponding phenotype vector Y (n × 1), the model follows the standard linear regression form:
Y = 1μ + Zg + e
where n = number of individuals, p = number of markers, μ = mean of the phenotype vector Y, g (p × 1) = vector of marker effects and e (n × 1) = vector of random residual effects. The model was implemented using the mixed.solve function in R.
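The RR-BLUP marker-effect solution can be sketched as below. The study used mixed.solve from the rrBLUP R package, which estimates the variance components by REML; in this illustrative NumPy sketch a fixed ridge parameter lam stands in for the variance ratio, and the simulated data are purely hypothetical.

```python
# Hedged sketch of RR-BLUP: shrunken marker effects via the ridge/BLUP
# identity g_hat = Z' (Z Z' + lam I)^(-1) (y - mu).
import numpy as np

def rrblup(Z, y, lam=1.0):
    """Return (mu, g): estimated mean and shrunken marker effects."""
    n, p = Z.shape
    mu = y.mean()
    g = Z.T @ np.linalg.solve(Z @ Z.T + lam * np.eye(n), y - mu)
    return mu, g

def predict(Z, mu, g):
    """Predicted phenotypes: mean plus summed marker effects."""
    return mu + Z @ g

# Hypothetical data: 50 individuals, 20 markers coded 0/1/2.
rng = np.random.default_rng(1)
Z = rng.integers(0, 3, size=(50, 20)).astype(float)
y = Z @ rng.normal(size=20) + rng.normal(scale=0.1, size=50)
mu, g = rrblup(Z, y)
yhat = predict(Z, mu, g)
```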
Deep convolutional neural network model (DeepGS)
The DeepGS model was constructed using the deep learning technique described by Ma et al. (2019) and available on GitHub (https://github.com/cma2015/DeepGS). An 8-32-1 architecture was constructed, with one input layer, one convolutional layer with 8 neurons, one sampling layer, 3 dropout layers, 2 fully connected layers and one output layer. The input layer receives the genotypic markers of a given individual as a 1 × p matrix, where p is the number of genotypic markers. The first convolutional layer filters the input matrix with 8 kernels (weights), each 1 × 18 in size with a stride of 1 × 1, followed by a 1 × 4 max-pooling layer with a stride of 1 × 4. The output of the max-pooling layer is passed to a dropout layer with a rate of 0.2. A first fully connected layer with 32 neurons, with a dropout rate of 0.1, is then used to combine the convolutional features. The rectified linear unit (ReLU) activation function is applied in the convolutional and first fully connected layers. The output of the first fully connected layer is then fed to the second fully connected layer, which has one neuron and a dropout rate of 0.05. Via a linear regression model, the output of the second fully connected layer is finally connected to the output layer, which presents the predicted phenotypic value of the analyzed individual.
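The layer shapes implied by the description above can be worked through directly. This pure-Python sketch applies the standard "valid" output-size formulas for the 1 × 18 convolution (stride 1) and 1 × 4 max-pooling (stride 4); taking p = 26,698 (the marker count in this study) is our assumption about how the shapes play out, not a figure from the source.

```python
# Shape walkthrough for the DeepGS front end: conv (kernel 18, stride 1),
# then max-pooling (window 4, stride 4), then flatten across 8 kernels.

def conv_out(p, kernel=18, stride=1):
    """Length of one feature map after a valid 1-D convolution."""
    return (p - kernel) // stride + 1

def pool_out(p, window=4, stride=4):
    """Length after valid max-pooling."""
    return (p - window) // stride + 1

p = 26698                       # markers per individual in this study
after_conv = conv_out(p)        # per kernel; 8 kernels run in parallel
after_pool = pool_out(after_conv)
flattened = 8 * after_pool      # input size of the 32-neuron dense layer
```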
The DeepGS model was trained on the training set and validated using ten-fold cross-validation. Parameters were optimized with the back-propagation algorithm, setting the number of epochs to 6,000, the learning rate to 0.01, the weight decay to 0.00001 and the momentum to 0.5. The loss function minimized was the mean absolute error (MAE) (Ma et al., 2019):
MAE = (1/m) Σ_{k=1}^{m} |predict_k − obs_k|
where m is the number of individuals in the training dataset, and predict_k and obs_k represent the predicted and observed phenotypic values of the kth individual, respectively. DeepGS was implemented on the central processing unit (CPU) using the deep learning framework MXNet.
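The MAE loss follows directly from its definition and can be sketched in a few lines of Python:

```python
# Mean absolute error over m individuals, as minimized during training.
def mae(pred, obs):
    """MAE = (1/m) * sum_k |predict_k - obs_k|."""
    assert len(pred) == len(obs)
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(pred)
```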
Ensemble model
An ensemble GS model (E) was constructed using the ensemble learning approach by linearly combining the predictions of DeepGS (D) and RR-BLUP (R), using the formula described by Ma et al. (2018).
predictE = (WD × predictD + WR × predictR) / (WD + WR)
where WD = weight of DeepGS, WR = weight of RR-BLUP, predictD = predictions of DeepGS and predictR = predictions of RR-BLUP.
For each fold of cross-validation, parameters (WD and WR) were optimized on the corresponding validation dataset using the particle swarm optimization (PSO) algorithm.
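The combination formula can be sketched as below. The study tuned (WD, WR) with particle swarm optimization; in this illustrative Python sketch a simple grid search over the validation set stands in for PSO, and the MAE validation criterion is our assumption.

```python
# Sketch of the ensemble: weighted average of the two models' predictions,
# with weights chosen on a validation set (grid search in place of PSO).

def ensemble(pred_d, pred_r, w_d, w_r):
    """predictE = (WD * predictD + WR * predictR) / (WD + WR)."""
    return [(w_d * d + w_r * r) / (w_d + w_r)
            for d, r in zip(pred_d, pred_r)]

def tune_weights(pred_d, pred_r, obs, grid=21):
    """Pick w_d in [0, 1] (with w_r = 1 - w_d) minimizing validation MAE."""
    def mae(pred):
        return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)
    best = min((i / (grid - 1) for i in range(grid)),
               key=lambda w: mae(ensemble(pred_d, pred_r, w, 1 - w)))
    return best, 1 - best
```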
Statistical analysis
For each BW trait of a single breed, boxplots were generated in R version 3.3.0 (R Core Team, 2017) to screen for outliers. The Pearson's correlation coefficient (PCC) and corresponding significance level (p < 0.05) were calculated in R for each model. The significance of the difference between paired samples was examined using Student's t-test in R.
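The PCC used to score each model against observed body weights follows the standard definition and can be sketched as:

```python
# Pearson's correlation coefficient between predicted and observed values.
import math

def pcc(x, y):
    """PCC = cov(x, y) / (sd(x) * sd(y)), computed from the samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```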