Determining individual ancestry from traces of DNA recovered in forensic contexts, or from poorly preserved ancient human remains in archaeological contexts, is typically challenging because of the low number of host DNA molecules in the sample, their fragmentation, and post-mortem damage. Because of the lack of recombination in the Y chromosome, its variation accumulates over time along a simple phylogenetic tree. The main branches of this tree can be robustly inferred even from low coverage data because of the hierarchical redundancy of the accumulated variation. Furthermore, the higher regional differentiation of the Y chromosome observed among present-day populations, compared to mitochondrial and autosomal diversity (Karmin et al., 2015), makes Y chromosome haplogroup prediction attractive for genetic ancestry inference. In this study we have shown that human Y chromosome haplogroups can be predicted accurately and efficiently from ultralow coverage sequence data using methods that determine the relative abundance of male-specific k-mers. With a simplified example, illustrated on present-day Estonian Biobank chrY data generated at high sequence depth and quality (Mitt et al., 2017), we showed that the three most common basal haplogroups in Estonia, N3, R1, and I1, can be differentiated from each other on the basis of the abundance of just two chrY k-mers from the repetitive regions DYZ3 and DYZ19. The observed haplogroup-specific differences in k-mer abundance could reflect structural changes in the phylogenetic ancestry of the given haplogroups, e.g. shortening of the centromeric DYZ3 region in the ancestry of haplogroup R1 and shortening of the DYZ19 region in the I1 ancestry. Differentiation at sub-clade level and separation of the minor haplogroups J2, I2, and E2 was, however, not possible with these two k-mers alone.
Furthermore, haplogroup prediction models based only on a small number of repeats are likely to be vulnerable to parameters such as sequencing depth, which motivated us to explore models based on a larger number of k-mers.
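The notion of relative k-mer abundance used above can be illustrated with a minimal sketch. It is not the Y-mer implementation: the function names, the use of canonical (strand-independent) k-mers, and the normalisation by the total number of k-mers in the reads are assumptions made for illustration only.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Strand-independent form: the smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def relative_kmer_abundance(reads, marker_kmers, k):
    """Fraction of all k-mers in `reads` that match each marker k-mer.

    Hypothetical helper: dividing by the total k-mer count is one simple
    way to make counts comparable between samples of different depth.
    """
    markers = {canonical(m.upper()) for m in marker_kmers}
    counts = Counter()
    total = 0
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" in kmer:  # skip k-mers containing ambiguous bases
                continue
            total += 1
            key = canonical(kmer)
            if key in markers:
                counts[key] += 1
    return {m: (counts[m] / total if total else 0.0) for m in markers}
```

In such a scheme, the abundances of two marker k-mers (for example, one from DYZ3 and one from DYZ19) would give each sample a point in a two-dimensional space in which the haplogroups could be compared.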
The accuracy, resolution, and robustness of haplogroup prediction, whether using SNVs or the Y-mer approach described here, depend not only on the quality and sequence coverage of the sample but also on the size and diversity of the reference panel and the number of informative variants used. Our tests with global and European training and validation sets showed that models using just a single T2T Y chromosome reference genome as the source of k-mer extraction performed poorly, with lower than 80% accuracy across the tested coverage range of down-sampled validation data. Models using k-mers extracted from 21, 110, and 213 different Y chromosome sources performed better, with accuracy higher than 95% in the coverage range > 0.001x (Fig. 3). As we did not observe major differences in the performance of M21W versus M213W and M21E versus M213E models, we conclude that a phylogenetically diverse panel of more than 20 Y chromosomes suffices as a k-mer source for basal haplogroup determination.
We showed that haplogroup predictions with Y-mer are robust to contamination in a model (M21E) trained with individuals from phylogenetically distinct haplogroups (Fig. 5). However, models trained with regionally specific sets of haplogroups (M21NE), including sub-clades of I1 and R1b that have separated only within the last 5,000 years (Karmin et al., 2015), performed less well even in the absence of contamination (Table 1). This drop in performance is likely caused by increasingly high proportions of k-mers that are shared between the k-mer lists extracted for the given sub-clades. This result highlights the need for additional filters that remove k-mer overlaps in future developments of detailed sub-clade prediction models or, with our current approach, the need to use training sets that are both diverse and balanced. Even though we did not see improvement between the M21W and M110W models (Fig. 3), it is possible that for models that require higher haplogroup resolution the number of Y chromosome sources from which k-mers are retrieved will also need to be adapted.
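One simple form that a k-mer overlap filter of the kind mentioned above could take is sketched below. It is an illustration under stated assumptions, not a feature of the current Y-mer tool: the function name and the rule of discarding every k-mer that occurs in more than one haplogroup list are hypothetical.

```python
from collections import Counter


def remove_shared_kmers(kmers_by_haplogroup):
    """Keep only k-mers that occur in exactly one haplogroup's list.

    `kmers_by_haplogroup` maps a haplogroup label to an iterable of k-mers.
    Hypothetical filter: a stricter or softer rule (e.g. a frequency
    threshold) may be more appropriate in practice.
    """
    occurrences = Counter()
    for kmers in kmers_by_haplogroup.values():
        occurrences.update(set(kmers))  # count each k-mer once per haplogroup
    return {
        haplogroup: {kmer for kmer in kmers if occurrences[kmer] == 1}
        for haplogroup, kmers in kmers_by_haplogroup.items()
    }
```

A filter of this kind trades the number of k-mers per haplogroup against their specificity, so its effect on closely related sub-clades, where few k-mers may remain, would need to be evaluated.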
Our analyses revealed that, for robust haplogroup prediction, the number of k-mers selected to distinguish each individual haplogroup included in the model has to be sufficiently high (> 10,000). Our comparisons of models with increasing numbers of k-mers, however, showed no detectable increase in accuracy (Fig. 4) in models with more than 20,000 haplogroup-specific k-mers, suggesting that, for computational efficiency, models with 20,000–50,000 k-mers are likely to represent the optimal solution.
When applying Y-mer to validation sets that differed in haplogroup composition from the training set, we observed higher rates of mismatches, particularly for rare sub-clades. Predictions supported by models trained at different haplogroup levels appeared to be mostly correct, suggesting that applying multiple models to the same data can help to distinguish predictions that are robustly supported by multiple models from those that have lower confidence and are supported only by individual models. The need to apply multiple models to the same data was further illustrated by the ancient DNA data from the Steppe Belt, where our basal haplogroup prediction model M21W made haplogroup calls at high accuracy (~ 95%), while models with region-specific haplogroup compositions, adapted to European or, for some models, more specifically Northeast European data, showed a higher number of mismatches, particularly for haplogroups that were not included in the model. While haplogroup predictions supported by multiple models appeared mostly to be correct, this case study also highlighted the need for caution in choosing models with an appropriate haplogroup composition, and their training sets, for future use of the Y-mer tool in ancient DNA studies. Preliminary insights into the haplogroup composition of an ancient cohort, obtained through SNV analyses of higher coverage individual samples, are advisable to inform such models. Where such high coverage data are obtainable, they can substantially increase the accuracy of Y-mer in predicting haplogroups from data in the lower coverage range. A human chrY sequencing depth of 0.001x appeared to be sufficient for accurate haplogroup prediction in our tests where the validation set was down-sampled to lower coverage (Fig. 3).
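The idea of accepting only calls that are supported by several models can be sketched as follows. This is a hedged illustration, not part of Y-mer: the function name is hypothetical, and it assumes that haplogroup labels are nested by prefix (so that a sub-clade label such as R1b starts with its parent label R1), which does not hold for every naming scheme.

```python
def consensus_call(predictions):
    """Return the most basal label if all model calls are mutually consistent.

    `predictions` maps a model name to its predicted haplogroup label.
    Calls are treated as consistent when every label starts with the
    shortest label (assumed prefix-nested nomenclature); otherwise the
    sample is flagged as unresolved by returning None.
    """
    labels = sorted(set(predictions.values()), key=len)
    if not labels:
        return None
    base = labels[0]
    if all(label.startswith(base) for label in labels):
        return base
    return None
```

For example, calls of R1, R1, and R1b from three models would be reported as R1, whereas calls of R1 and I1 would be flagged as unresolved and could be set aside for SNV-based confirmation.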
Analyses of two ancient DNA data sets from Europe (Saag et al., 2017, 2019; Gretzinger et al., 2022) revealed high accuracy of base haplogroup prediction with the M213E model, which does not differentiate between recently diverged sub-clades of R1b and I1. The M222NE model, which was designed to predict these sub-clades with a training set that combines them with the global range of haplogroup diversity, performed less well (accuracy < 0.53) than the approach in which the haplogroup R1 or I1 predictions of either the M110W or the M213E model were further resolved with subclade-specific models, which showed high (> 0.95) accuracy for high confidence (p < 0.05) calls. These results suggest that a two-stage strategy in haplogroup prediction, whereby the base haplogroup is called first and the sub-clade determination is performed separately, may be preferable to models that combine different levels of haplogroup diversity.
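The two-stage strategy can be summarised in a short sketch. It is an illustration only and does not reproduce the Y-mer interface: the models are represented here as hypothetical callables that return a label and a p-value, and the default threshold mirrors the p < 0.05 criterion for high confidence calls mentioned above.

```python
def two_stage_call(features, base_model, subclade_models, alpha=0.05):
    """Call the base haplogroup first, then try to resolve its sub-clade.

    `base_model` and the values of `subclade_models` are hypothetical
    callables returning (label, p_value). `subclade_models` maps a base
    haplogroup (e.g. "R1") to the model that resolves its sub-clades.
    Returns (base_label, subclade_label or None).
    """
    base_label, _base_p = base_model(features)
    subclade_model = subclade_models.get(base_label)
    if subclade_model is None:
        # No sub-clade model exists for this base haplogroup.
        return base_label, None
    subclade_label, subclade_p = subclade_model(features)
    if subclade_p < alpha:
        return base_label, subclade_label
    # Low confidence sub-clade call: report the base haplogroup only.
    return base_label, None
```

Keeping the stages separate means that a low confidence sub-clade call falls back to the base haplogroup instead of being replaced by an unrelated label.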
Our analyses of Chinese and Estonian NIPS data further confirmed the good performance of Y-mer in the low coverage range, 0.001-0.12x, of male fetus Y chromosome data, as we obtained haplogroup frequency profiles similar to those expected from relevant reference data. While these results show that the most common haplogroups in Europe and Asia can be predicted with sufficient accuracy, the drop in accuracy we observe with models that entail haplogroup distinction at sub-clade level further emphasizes the limitation of our currently described approach for purposes that require both high confidence and high resolution, such as the determination of genetic relatedness. However, in cases where genetic relatedness has been determined independently, e.g. via identity-by-descent or identity-by-state methods (Monroy Kuhn, Jakobsson and Günther, 2018; Popli, Peyrégne and Peter, 2023), Y-mer analyses can be useful for testing (ruling out) the plausibility of patrilineal relatedness, while acknowledging that a match at a generic Y chromosome haplogroup level cannot constitute proof of a patrilineal relationship.
In conclusion, we present a new k-mer based tool, Y-mer, for predicting Y chromosome haplogroups. We show that Y-mer is able to predict basal chrY haplogroups accurately from ultralow (> 0.001x) coverage data. As such, it is an approach that can be useful in situations where basic, low resolution information about individual ancestry is required while higher coverage sequencing of the samples is either not possible or not practical, e.g. due to costs or unavailability of sufficient quantities of the sample. For ancient DNA studies or forensic case analyses this approach can potentially make more individual samples available for Y chromosomal ancestry analyses, which can be more informative when high coverage/quality data are already available for a subset of the samples. We show that Y-mer performs more accurately when its models are trained on data that match the haplogroup composition of the target group, which highlights the need for a tailored approach in cases where detailed sub-haplogroup level distinctions are required. Besides its possible uses for Y chromosome data, the k-mer based approach described here is potentially extendable to ancestry analyses of the autosomal genome. Considering the high rate of genetic variation detected in centromeric regions assembled with long read sequences and the low rate of recombination in the pericentromeric regions (Logsdon et al., 2024), the study of k-mers from (peri-)centromeric haplotypes may offer new prospects for autosomal ancestry scanning from low coverage data, similarly, though not identically, to the Y chromosome analyses described here, given the differences in inheritance of autosomal and Y chromosome DNA. Alternatively, genome-wide scans could be used to identify population-specific peaks of autosomal k-mer abundance for use in ancestry mapping of low coverage data.
Development of such tools would require larger, ancestry-diverse reference panels, such as the graph-based pangenomes that are currently being developed and are likely to become available in the near future.