Plant Material and Germplasm
CDC Stanley x CDC Landmark Double Haploid Population
We tested the doubled haploid (DH) population from the cross of spring wheat cultivars ‘CDC Stanley’ and ‘CDC Landmark’ developed by the Crop Development Centre at the University of Saskatchewan, and hence termed the “StanMark-DH” population. The development of DH lines was performed with the wheat-maize wide hybridization method43. Initially, F1 hybrids were developed by crossing CDC Stanley and CDC Landmark and followed by planting of F1 seeds. The spikelets from the F1 plants were emasculated and pollinated with maize pollen to induces haploid embryo development. The developing embryos were then excised and cultured in media, and developed into plantlets. The haploid plants were treated with colchicine to doubled the chromosomes and generated primary DH plants. The primary DH plants were self-pollinated to produce the DH0:1 generation, from which 48 unique DH lines were used in this study.
Wheat 5D monosomic group
A 5D monosomic line (TA3059), derived in the Chinese Spring (TA3008) background and maintained by the Wheat Genetics Resource Center (WGRC), Manhattan, KS, USA, was self-pollinated to produce progenies segregating for the dosage of the 5D chromosome. This population of 864 samples, named CS M5D, included 839 self-pollinated progeny from TA3059, 16 standard Chinese Spring (TA3008) lines as internal controls and 9 blank samples with no DNA. These genetic stocks were developed by and are available through the WGRC.
Wheat-barley introgressions
Two advanced backcross populations of wheat-barley translocation lines were made by crossing wheat-barley recombinants with group 7 translocations 44,45 to the elite breeding lines, KS090616K-1 and ‘KS Silverado’ developed by the Kansas State University winter wheat breeding program. The wheat-barley recombinants were developed and described previously by Danilova et al. (2019) 45where group 7 translocations (7AS.7HL-7AL(TA5798), 7BS.7HL-7BL(TA5797), and 7DS.7HL-7AL(TA5799)) were cytologically verified. The wheat-barley homozygous recombinant lines in the ‘Chinese Spring’ background were independently crossed with the two elite lines to generate F1 hybrids. The F1 was backcrossed with the respective recurrent parent to form BC1 progenies for each cross combination. The final population included 335 BC1 lines, in addition to the homozygous wheat-barley recombinant lines, the elite recurrent parent lines, and Chinese Spring as internal checks.
Thinopyrum intermedium—Wheat Amphiploid Mapping
A panel of 285 Thinopyrum intermedium and Th. intermedium—wheat (Triticum durum) addition lines were evaluated which included 141 Th. intermedium genets and 144 amphiploid genets derived from crossing Th. intermedium x Triticum durum lines. The amphiploids were developed by crossing winter T. durum as females to Th. intermedium as the males. Embryos were rescued and germinated on a modified MS medium, and chromosome doubling was completed with colchicine in young plants. Plants with successful doubling of chromosomes were male-fertile and produced self-progeny that had the complete set of 28 wheat-derived chromosomes and 42 chromosomes for Th. intermedium. These amphiploids were then used as male parents and crossed to Th. intermedium. Crosses were made using emasculating Th. intermedium plants as females followed by embryo rescue of the hybrid. The subsequent progeny were male sterile and were crossed again to Th. intermedium as the male parent. A small number of viable seeds were obtained from these crosses, with the resulting progeny including both male-fertile and male-sterile plants. The male-fertile plants were crossed as male parents to Th. intermedium and as female parents to Th. intermedium in the case of male-sterile plants. The resulting seed was germinated, and young leaf tissue was collected for DNA extraction, genotyping and evaluating the chromosome constitution. Previous research has shown that crosses of Th.intermedium to wheat can have variable chromosome composition46-50.
Library construction
Genomic DNA was extracted from leaf tissue collected from seedlings at the two- to three-leaf stage. Approximately 1.5-inch-long leaves were collected, lyophilized for 3 days and ground using a Retsch mixer mill MM400. Genomic DNA was extracted in 96 well plates using BioSprint DNA kit (Qiagen Inc.) following the manufacture’s protocol. In each plate, a random blank well was left as a negative control.
An optimized, low-volume high-throughput library preparation was developed using Illumina Tagment DNA TDE1 Enzyme and Buffer Kits (Illumina Tagment DNA TDE1 Enzyme and Buffer Kits, Illumina, Inc., San Diego, CA, USA), (Supplementary Text S1). This library preparation method provides a high level of multiplexing into a single library that can be sequenced in a single flow cell lane. First, the DNA samples were diluted to ~20 ng/µl and quantified using a Quant-iT™ PicoGreen™ dsDNA Assay Kit (Thermo Fisher Scientific, Waltham, MA, USA). The quantified DNA was then normalized to a target volume of 40 μl at 0.75 ng/μl. Next, a tagmentation reaction consisting of 1 μl normalized to 0.75 ng/µl of the genomic DNA, 0.9966 μl TDE1 Tagment DNA Enzyme, 0.504 μl Tagment DNA Buffer, and 3.3964 μl water was fragmented was incubated at 55°C for 15 minutes, and then cooled to room temperature.
Next, the libraries were PCR amplified to add dual indexes with a unique i5 index for each plate and a unique i7 index for each sample to the tagmented DNA (Supplementary Table S1). For each sample, 5.0 μl of tagmented DNA, 12.5 μl of Taq 2X Master Mix (New England Biolabs Inc., Ipswich, MA, USA), 2 μl of combined i7 and i5 index adapters at 2.5 μM each, and 5.5 μl water were added to make a final reaction volume of 25 μl. The PCR amplification was completed as follows:72°C (3 min), 95°C (1 min), 18 cycles consisting of 95°C (10 sec), 55°C (20 sec), 72°C (3 min), and a final cycle of 72°C (5 min).
For multiplexing, all barcoded and amplified samples were quantified using the Quant-iT™ PicoGreen™ dsDNA Assay Kit. The samples were normalized to 15 μl at 6 ng/μl and then pooled into a single tube. This library was purified using a QIAquick PCR Purification Kit (QIAGEN, Hilden, Germany) and then size-selected from 600 to 800 bp using BluePippin (Sage Science, Inc., Beverly, MA, USA). This library was then cleaned, and the fragment size distribution was verified with an Experion™ DNA 1K Reagents kit (#7007164) using Experion™ Automated Electrophoresis Station (Bio-Rad Laboratories, Inc., Hercules, CA, USA). Finally, the libraries were quantified using the Quant-iT™ PicoGreen™ dsDNA Assay Kit before paired-end sequencing. Paired-end library sequencing was performed by Psomagen (Rockville, MD, USA) with Illumina NovaSeq 6000 or HiSeq X10.
Bioinformatics Pipeline
The analysis pipeline described in this study (Figure 1) can be used for a range of different genomics applications, including variant calling, dosage estimation and identifying chromosome segments from different genomes. This analysis is readily adapted to evaluate introgressions, aneuploidy (dosage), and SNP discovery and genotyping. Efficient processing pipelines for each use case include the following step:
Demultiplexing
The first step in the skim-seq approach demultiplexes the combined sequence library into individual samples. Depending on the sequencing machine, e.g., HiSeq X and NexSeq 2000, the returned sequence files could require varying levels of processing. If sequence data includes separate fastq files for the index reads, (R1.fq, R2.fq, and separate index files I1.fq and I2.fq), a custom Perl script as used here provides easy demultiplexing (https://github.com/sandeshsth/Skim-seq_Method). Based on the sequencing machine,the i5 index could also be reverse complement, which should be identified and the barcode file read accordingly. If the i7 and i5 barcodes are present in the header of the raw fastq file, trimming raw reads to remove the Nextera adapters and primers before demultiplexing can be done using the bbduk program of BBTools (BBMap) suite (https://jgi.doe.gov/data-and-tools/bbtools/). When the i7 and i5 barcodes were provided in separate fastq files than the sequence files, we trimmed the reads after demultiplexing using fastp (https://github.com/OpenGene/fastp).
For project data integrity, we always include a random blank well in each plate to identify any potential plate mix-ups. Blank wells in each 96-well plate were used to assess data quality, as these wells should have little if any sequence data which we confirm as a negative control as less than 0.01% of the average reads per sample.
After a quality check of the sequencing data, we estimated the average sample genome coverage per individual for each population using the following equation:
\(genome coverage= \frac{\left(read count*read length*2\right)}{\left(total genome size*total number of samples\right)}\) Equation 1.
Sequence Alignment and Concordant Read Selection
We used HISAT2 v2.1.0 (Kim et al., 2019) for read alignment of the skim-seq data to relevant reference sequences. For each genome, index files were generated using HISAT2. For aneuploidy, SNP discovery, and genotyping, a single species reference was utilized. For these analyses, we utilized the Chinese Spring RefSeq v1 assembly (Apples et al., 2018)
For interspecific introgression mapping, a reference assembly was generated by concatenating the reference sequences of a donor and a recipient species. We combined the Chinese Spring reference genome V1.051 and barley pseudomolecule assembly of barley cv. Morex52 for identification of wheat-barley group 7 introgressions. An additional combined reference was generated to map Th. intermedium – wheat introgression lines using the Chinese Spring (CS) wheat reference and T. intermedium draft genome assembly (provided by Thinopyrum intermedium Genome Sequencing Consortium https://phytozome-next.jgi.doe.gov/info/Tintermedium_v2_1) developed from accession C4-5353T1. When combining reference genomes, all chromosomes or pseudomolecule names must be unique.
The HISAT2 pipeline was run with the default parameters for paired-end reads in a multithreaded environment. We disabled the spliced alignment option and suppressed the sequencing alignment map (SAM) records for reads that failed to align. The output SAM files were then filtered using command lines tools to filter for uniquely mapped concordant reads (https://github.com/sandeshsth/Skim-seq_Method).
Normalized read counts were computed using the AWK programming language. Information about chromosome and physical location written to a bed file was used as the input to calculate normalized read counts per one Mb bin. The normalized read counts were computed as:
\(\frac{normalized reads}{Mb}= \frac{sum of reads in Mb bin}{total number of reads per sample} x Normalization Factor\) Equation 2
The normalization factor can specified, where we used reads per 10 million or 100/average read count in all bins. The script (https://github.com/sandeshsth/Skim-seq_Method) also added sample names to the text file. To efficiently process hundreds of samples, we ran array jobs on a high-performance cluster. The resulting text files included read count in bins, with chromosome and physical locations.
Data Filtering and Visualization for Introgressions and Aneuploidy
Once each sample had been processed to obtain normalized read counts, unknown chromosomes were removed using the UNIX sed command (https://github.com/sandeshsth/Skim-seq_Method), and a final file for all samples was made by concatenating all sample files together. Graphical displays to visualize karyotypes of introgression and aneuploid lines, were plotted using ggplot2 (Wickham, 2009) in R (R programming language). The R scripts for data visualization (https://github.com/sandeshsth/Skim-seq_Method) also allowed us to easily generate read counts per bin and view read depth. For the Th. intermedium—wheat lines, read depth provided an efficient way to determine which chromosome additions were present. Leveraging the centromere position with this information also allowed for visualization of Robertsonian Translocations and aneuploidy.
SNP Discovery and Genotyping in StanMark-DH
The genotyping of the DH population was accomplished in two bioinformatics steps by discovering SNP between the two parents and later genotyping the discovered SNPs in the population. To discover SNPs between the two parents, the high-coverage paired-end raw reads of CDC Stanley and CDC Landmark were mapped to the CDC Landmark reference genome (available through the Sequence Read Archive PRJNA544491) using HISAT253,54. The alignment was completed with default parameters except for turning off the spliced alignment function and preventing the unaligned reads from being output in the SAM files. In preparation for variant calling, the alignment files were sorted by chromosome and position. The alignments were filtered using samtools v1.1055 to keep reads with unique and concordant alignment based on the SAM tags NH:i:1 and YT:Z:CP respectively. The filtered output BAM files were csi indexed using SAMtools to generate index files needed for variant calling. Variant discovery was performed with BCFtools commands: bcftools mpileup followed by bcftools call56. The output VCF was annotated with the --annotate AD,DP,INFO/AD option with mpileup in BCFtools. Variants were discovered on an individual sample basis instead of a population level with option -G - in bcftools call. The SNP discovery process was run in parallel for each chromosome individually with -regions. Output VCF files were filtered and merged together. Each SNP position was filtered based on read depth to keep the SNPs when the following criteria were met: minimum and maximum filtered read depths of ≥ 6 and ≤ 100 respectively and reference and alternate allele read depths of ≥ 3. High-quality SNPs discovered between the parents, CDC Stanley and CDC Landmark, were then called (genotyped) in the 48 DH lines. To genotype the StanMark-DH population, the skim-seq data was filtered using fastp to remove any reads containing adapters while maintaining the final read length of 150 bp57. The paired-end fastq files of each sample were processed to generate alignment files with the same pipeline used for the two parents. The alignment files of 48 DH lines were used in genotyping the SNP positions discovered between the two parents using the -T option in BCFtools.
Down sampling for low-coverage samples
While most target applications for genotyping in breeding programs such as genomic selection will utilize very low-coverage sequencing to reduce costs, the StanMark-DH population was sequenced at relatively higher depth with raw coverage ranging from 0.6-1.2x. As the cost for sequencing to this depth for a genome the size of wheat would be untenable within a breeding program for large populations, we mimicked low coverage empirical data by randomly sampling three different low coverage levels of 0.1x, 0.05x, and 0.01x. Sampling was completed using seqtk (https://github.com/lh3/seqtk), and the low-coverage samples were mapped and filtered as described earlier and genotyped the SNP positions identified between the parents with option -T using BCFtools.