Plant materials
A total of 291 C. annuum accessions were used in this study, consisting of two types: inbred accessions (n = 132) and F1 accessions (n = 159). The inbred accessions included commercial cultivars and plant genetic resources from Asia and Central America, maintained either at the Agricultural Department of Shinshu University, Japan, or the Gene Bank of the National Agriculture and Food Research Organization, Japan. The F1 accessions were obtained through test crosses among 20 of the 132 inbred accessions. Detailed information about the inbred accessions and parental combinations of the F1 accessions can be found in Supplemental Dataset 1 and Fig. S1. These accessions were cultivated at the Faculty of Agriculture, Shinshu University, Minamiminowa, Nagano, Japan, over three years (2021–2023). Each year, one plant was cultivated, and five mature green fruits were used to collect phenotypic data on fruit-related traits.
Experimental design
This study was carried out in three steps: 1) Data collection (Fig. 1a), 2) Evaluation of GP model accuracy (Fig. 1b), and 3) Simulation of F1 phenotypic values (Fig. 1c). First, we collected phenotypic data on five fruit-related traits: fruit length (FL), fruit width (FD), fruit shape index (FI) (FL/FD ratio), fruit weight (FWT), and thickness of pericarp (THP) from all accessions over three years. Concurrently, we also obtained single nucleotide polymorphism (SNP) data using multiplexed inter-simple sequence repeat genotyping by sequencing (MIG-seq). Using these data, we performed genomic predictions (GP) for the fruit-related traits across three populations (all accessions, inbred, and F1) using 11 genomic models. The accuracy of each model was evaluated within each population. Finally, using the most accurate model, we conducted a phenotypic simulation of the F1 accessions based solely on the information from the inbred lines, including their parental accessions, and assessed the accuracy of the simulation. Detailed methods for each procedure are described below.
Phenotypic and genotypic data collection
For phenotypic data, FL and FD were measured with a ruler, and FI was calculated as the ratio of FL to FD. FWT and THP were measured using an electronic balance (FA-2000, A&D Co., Ltd., Tokyo, Japan) and a caliper, respectively. The mean value for each trait was calculated from five fruits, with data collected over three years (2021–2023). To minimize environmental effects across the three years, we calculated the best linear unbiased prediction (BLUP) values using a mixed linear model in the R package ‘lme4’; genotypes (accessions) and year were set as random and fixed effects, respectively, as described by Minamikawa et al. (2017) For genotypic data, genomic DNA was extracted from young leaves using the DNeasy Plant Pro Kit (Qiagen, Hilden, Germany) and subjected to MIG-seq analysis by GENODAS Co., Ltd. (Miyagi, Japan). Briefly, the first polymerase chain reaction (PCR) was performed with the primer set described by Suyama and Matsuki (2015) to amplify the inter-simple sequence repeat (ISSR) region. The second PCR added barcode and adaptor sequences. DNA fragments longer than 300 bp were selected, and pair-end sequencing was performed on a NextSeq500 (Illumina, San Diego, CA, USA). After sequencing, we trimmed adapters, conducted quality control, mapped reads to the reference genome, and called variants following the procedures described in our previous study (Kondo et al. 2023), using the C. annuum reference genome ‘Takanotsume’ (Shirasawa et al. 2023). We obtained variant call format (VCF) files of SNPs, extracted bi-allelic SNP loci with a genotyping ratio above 70%, imputed missing SNP loci using Beagle (beagle.22Jul22.46e.jar; Browning and Browning 2016), and removed loci with a minor allele frequency below 0.05. The genotypic values at each SNP locus were then converted to numeric values (-1, 0, 1).
Genetic statistical analysis
We calculated the number and density of SNP loci detected via MIG-seq for each chili pepper chromosome. Using these genotypic data, two population genetic analyses were conducted. The first was phylogenetic and admixture analysis of the inbred accessions (n = 132). We calculated genetic distances among the inbred accessions using Euclidean distance and constructed a phylogenetic tree with the neighbor-joining method using the R packages ‘ape’ and ‘ggtree’. We also estimated the proportion of ancestral groups in the inbred accessions using ADMIXTURE v1.3 (Alexander et al. 2009), with the number of ancestral populations (K) ranging from 1 to 10, and calculated cross-validation (CV) errors for each K. The second analysis was principal component (PC) analysis for all accessions, calculating PC1, PC2, and PC3 scores. We also determined the heterozygosity ratio across all SNP loci.
Phenotypic statistical analysis
We calculated the mean, median, minimum, and maximum values for each trait. Broad-sense heritability (hb²) for all accessions was calculated as follows: \(\:{h}_{b}^{\:2}=\:\frac{{\sigma\:}_{g}^{\:2}}{({\sigma\:}_{g}^{\:2}+{\sigma\:}_{\epsilon\:}^{\:2}/n)}\times\:100\) described by Schmidt et al. (2019), where σg² and σε² represent genetic and environmental variances, respectively, obtained from BLUP and residue values, with n denoting the number of cultivation replicates (3 years). Pearson’s correlation coefficients were calculated to evaluate the relationships among the five fruit-related traits using BLUP values from all accessions. In the F1 accessions, additive and dominance effects were calculated as follows: \(\:\text{A}dditive\:effect=\frac{\left|{P}_{P1}-{P}_{P2}\right|}{2}\), \(\:Dominance\:effect={P}_{F1}-\frac{{P}_{P1}+{P}_{P2}}{2}\), as described by Ukai (2002), where PP1, PP2, and PF1 denote the phenotypic values of the mother parent, father parent, and F1 progeny, respectively.
Genomic prediction for five fruit-related traits
Genomic predictions (GP) for the five fruit-related traits were performed in all accessions (n = 291), inbred accessions (n = 132), and F1 accessions (n = 159), using 11 models: 1) Ridge, 2) Least Absolute Shrinkage and Selection Operator (LASSO), 3) Elastic Net (EN), 4) Genomic Best Linear Unbiased Prediction (GBLUP)-A, 5) GBLUP-AD, 6) GBLUP-GAUSS, 7) Random Forest (RF), 8) Bayesian Ridge Regression (BRR), 9) Bayes A, 10) Bayes B, and 11) Bayes C. Ridge, LASSO, and EN models were implemented using functions in the R package ‘glmnet’. GBLUP models were implemented as described by Ishimori et al. (2020), with additive relation distance matrix (A), dominance genomic distance relationship (D), and Gaussian kernel distance matrix (K) calculated using the R package ‘RAINBOWR’, and models developed using ‘EMMREML’. For GBLUP-A and -GAUSS, A and K were used as single kernels, respectively, while A and D were used for multiple kernels in GBLUP-AD. BRR, Bayes A, Bayes B, and Bayes C models were implemented using the R package ‘BWGS’. GP accuracy was assessed by calculating Pearson’s correlation coefficient between observed (BLUP) and predicted values using 10-fold CV. The mean values of the ten folds were treated as the representative values for each run of the 10-fold CV, and this process was repeated ten times. The mean values of these repetitions were compared across the 11 models.
Phenotypic simulation in the F1 progenies
Using the GBLUP-GAUSS model, we developed a GP model based solely on the inbred accessions (n = 132), as previously described. We estimated the genotypic values of F1 accessions (n = 159) based on their 20 inbred parents using the following formula: \(\:{g}_{F1}=\frac{{g}_{P1}+{g}_{P2}}{2}\), where gF1, gP1, and gP2 represent the vectors of numeric genotypic values (-1, 0, 1) for the F1 progeny, mother parent, and father parent, respectively. In this study, we did not account for the segregation of heterozygous loci in the F1 generation. Therefore, when fractions (-0.5 and 0.5) were detected in the calculated gF1 vector due to heterozygous genotypes in either parent, they were converted to -1 and 1, respectively. Using the developed GP model and the estimated genotypic values, we simulated the phenotypic values for the F1 progenies without actual phenotypic and genotypic information of the F1 accessions. To assess accuracy, we examined Pearson’s correlation coefficient between the observed and simulated values. To explore the factors contributing to simulation error, we first calculated the root square error (RSE) between the observed and simulated values for each F1 accession. We then investigated Pearson’s correlation coefficient between the RSE and two statistical values: phenotypic values of the F1 progeny and the absolute values of dominance effects. For comparison, the same values were calculated for the GP models using all (inbred + F1) and only F1 accessions. In these cases, RSE values for each F1 accession were derived from observed and predicted values obtained through ten repetitions oF10-fold CV in the GPs. The mean RSE values of these ten replicates were used as the representative values for each F1 accession.
Additionally, we simulated the fruit length (FL) and fruit diameter (FD) in the F1 progenies derived from all possible combinations (132C2 = 8,646 combinations) of inbred accessions using the same method as described above, and presented these as scatter plots. In 2023, we cultivated eight new F1 lines (F1A - F1H) along with six of their parental lines (P1 - P6) at the Faculty of Agriculture, Shinshu University. These parental lines were different from the 20 parents of the F1 accessions. We collected the mean FL and FD values from five fruits of multiple plants (parental lines: n = 3, F1 lines: n = 5). Detailed information about the accessions is provided in Table S1 and Supplemental Dataset 1. We then compared the mean values with the simulated values.