Sugarcane genomics analysis faces major hassles due to the gain and loss of genetic material tightly associated with the highly polyploid and aneuploid nature of the crop. Genotyping sugarcane is not only complex because of the multiple number of allele copies, but also because of the aneuploidy and unknown ploidy of some hybrids (Serang et al. 2012). Furthermore, in polyploids the false-positive rate of SNPs detection rises due to ambiguous mapping of highly similar homeologous loci, becoming challenging to the precise SNPs detection (Clevenger et al. 2015). GBS has been proved to be a successful way to analyse sugarcane (Balsalobre et al. 2017; Yang et al. 2017), however, to our knowledge ddRADseq approach has never been applied to this crop before. Our simulations and comparisons showed that, regardless of the complexity of genotyping sugarcane hybrids, there is consistency in our data results obtained by the application of the ddRADseq protocol.
Current NGS platforms generate data for analysing genes and genomes, through sequencing by synthesis chemistry (López de Heredia 2016). Differences between platforms rely on the output range, reads per run, maximum read length and costs (Illumina 2021). For this reason, the comparison between sequencing platforms is crucial to decide how to invest available resources. Our results showed that the number of reads assigned to each genotype followed the same order in Ne-70 and No-150 platforms, although the final number of reads in the latter one was on average several times the number detected in the former one (Table 1). It is worth noting that NA 78–724 and NA 56 − 30, two Argentinean hybrids coming from the same sugarcane breeding program, stood for extreme numbers of reads in both sequencers, suggesting a consistency in sequencer performance among samples.
The overall alignment rate of both platforms averaged ~ 48%. Several non-mutually exclusive phenomena might be responsible for this value: First, the hybrid R570 reference genome assembly originated from synteny with sorghum, using BACs based on their global alignments with this related species and representing the gene-rich part of its genome (Garsmeur et al. 2018). This indicates that fewer conservative regions are unlikely to align with other sugarcane hybrids. Second, R570 is a cultivar from Reunion Island, located in the Indian Ocean, suggesting that it might be genetically distant from the Argentine and American genomes assayed in this investigation (Acevedo et al. 2017; Garsmeur et al. 2018); then the alignment software was developed for diploid species, representing a technical drawback for a highly polyploid and aneuploid species such as sugarcane along with computational problems. These observations increase the difficulty in locating the reads in the exact position of the reference genome even though Yang et al. (2017) reported that Bowtie 2, was more effective than other aligners in sugarcane. Interestingly, the high fibre biotype and the sugarcane cultivars showed similar alignment rates, independently of the platform used (Table 1).
Stacks software provides two types of analyses for SNP recovery: de novo and reference based (Rochette and Catchen 2017). The de novo analysis allowed a higher rate of SNPs detection than the reference-based analysis in both platforms, although the latter pipeline let us know the SNP position in the chromosome and thus further information could be acquired. In addition, more raw data SNPs were recovered using de novo strategy compared to reference-based strategy.
Simulations applied to further unravel the role of the read length, number of reads and paired-end versus single-reads impact on the number of called SNPs have demonstrated that the SNPs are not evenly distributed along the read length. This is supported by the fact that the minority of the SNPs in denovo_map.pl (767,203 out of 2,831,922), but the majority in ref_map.pl (351,864 out of 460,890) landed on the first 70 base pairs. Thus, even though more SNPs were recovered in the longest reads (No-150) than in the shortest one (No-70-10M) for both mapping strategies, the de novo strategy was responsible for the greatest difference in SNPs recover (3.7 vs. 1.3 in ref_map.pl). Additionally, the number of SNPs is not directly proportional to the number of reads per individual. In relative terms, the amount of SNPs recovered in No-70-4M exceeded half the amount recovered in No-70-10M, suggesting that assigning 4M reads/individual is more efficient than 10 M reads/individual.
The lack of information on the application of the ddRADseq protocol in sugarcane moved us to test six filtering schemes (FS) composed of a combination of different MAF and read depths in order to select the most suitable filtering parameters values for improving SNP calling. The inclusion of minimum and maximum depth filters was recommended for depth coverage (Ravinet and Meier 2020). While a minimum depth cut-off removes false positive calls and ensures higher quality calls, a maximum cut-off tends to avoid paralogous or repetitive regions (Ravinet and Meier 2020) that have been recently reported in the sugarcane genome (Trujillo-Montenegro et al. 2021). The combination of the same MAF with increasing minimum and maximum depths cut-offs diminished the SNPs recovered in both Ne-70 mapping strategies and in No-150 denovo_map.pl. Nevertheless, in No-150 ref_map.pl the combination of the same MAF with increasing read depths augmented the SNP call. This could be related to paralog regions coming from the duplication events or polyploid nature where we can find several copies of some regions (or genes) within the same genome that belong to different ancestors (subgenomes). In addition, in highly polyploid genomes, such as sugarcane which can achieve a ploidy level of twelve, 56x was the estimated sequence depth for single dose markers to be called, while in lower depth sequencing, a single dose SNP of heterozygous genotype could be called as homozygous genotype (Yang et al. 2017). A series of best practices for SNP calling in polyploids would imply the combination of the highest number of reads, the highest MAF, and the highest depth (Clevenger et al. 2015; Yang et al. 2017).
Distal regions along each sugarcane chromosome exhibited relatively high SNP density both in raw data and, to a minor extent, in FS1 and FS5 (Fig. 2) due to the differential number of SNPs retained in these FS compared to raw data (Table 2). These observations are consistent with the pattern observed in the sorghum reference genome assembly, where distal regions of chromosomes exhibited high gene density and pericentromeric regions displayed low gene density and low rates of recombination (McCormick et al. 2018).
Findings of hierarchical clustering and principal component analyses were in accordance both within and between matrices, highlighting the isolation of the high-fibre biotype from the commercial hybrids (Fig. 3), likely due to their differential genetic backgrounds (García et al. 2021; Kane et al. 2021). Interestingly, NA 78–724 and NA 56 − 30, the two Argentinean hybrids genetically developed under the same sugarcane breeding program did not share a same cluster in any of the matrices analysed. Furthermore, NA 56 − 30 and HOCP 95–665 showed the nearest genetic distance in No-150 (Fig. 3b) among the samples assayed, in agreement with genetic similarities based on coefficient of parentage estimations (Acevedo et al. 2017).
To shed more light on the effect of the SNPs retained on genes, transcripts, protein sequences, and regulatory regions on the five sugarcane genotypes assayed, the variant effect predictor (VEP) was used. Overall, similar proportions of SNPs were assigned to the same functional consequences in both matrices (Fig. 4a, c). This further shows that independently of the differential number of SNPs detected in each platform, similar variant proportions were allocated to the very same consequence terms. Among the SNPs discovered, missense variants (6.4–5.4%, Fig. 4b, d) responsible for the change of one base resulting in a different amino acid sequence where the protein length is preserved, emerged as interesting variants to consider for further analysis. Some SNPs identified in the coding region, such as splice region variants (0.5% − 0.7%), refer to a change either within 1–3 bases of the exon or 3–8 bases of the intron, indicating that they have low impact for they are unlikely to change protein behaviour, even though they still represent a variation in DNA sequences. Similar considerations may be considered for synonymous variants (6.3% − 4.7%) as they are supposed to have no change to the encoded amino acid. More than half of the variants detected landed within regulatory regions, suggesting that they might alter the biological functionality of the gene (Fig. 4a, c). Finally, in non-coding variants or variants affecting non-coding genes, impact is difficult to predict or there is no evidence of impact at all.
Recently, a new genome assembly from the sugarcane hybrid CC 01-1940 has been published using NGS technologies (PacBio, Illumina Short Reads and HI-C Reads) (Trujillo-Montenegro et al. 2021). Our preliminary analysis using this brand-new reference genome showed that the overall alignment increased the mapping rate from ~ 54% to ~ 85% in Ne-70, demonstrating that the hybrid CC 01-1940 is genetically closer to our hybrids than the cultivar R570 (Garsmeur et al. 2018). Furthermore, the largest (chromosome 1) and smallest (chromosomes 5 and 8) sizes of chromosomes coincide in Trujillo-Montenegro et al. (2021) and in this investigation. Analyses of commercial and high-fibre sugarcane hybrids using the genome assembly of the hybrid CC 01-1940 as reference, are in progress (Manuscript in preparation). This underscores the fact that the pipeline considered in this study can be used regardless of any newer genome sequence version that might be released in the future. Furthermore, it is a good tool for detecting novel SNPs and, if compared to SNP array technology, it can be applied without previous knowledge on polymorphisms. Even more, this protocol can be applied specifically to the sugarcane population of the breeding program being assessed to discover molecular markers associated with agricultural features.