In the present study, we evaluated 10 panel sets (eight AISNP, one HDSNP and one WGS) and, using a tetrahybrid admixture model, compared ancestry inferences in American admixed populations.
To verify the accuracy of the panels, we used samples from HGDP and 1KGP whose geographic origin is known and which show no evidence of recent admixture. Despite the low marker overlap observed among the AISNP panels, all panels showed high accuracy (error rate 0.1-3.0%; Supplementary Table S3). High levels of correlation were also observed in the pairwise comparisons of the panels (r² > 0.84; Supplementary Figures S4A-G). However, both accuracy and correspondence in ancestry inferences increased with the number of SNPs in the panels.
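The pairwise comparison described above can be illustrated with a minimal sketch. It assumes that the reported coefficient is the squared Pearson correlation between the per-individual ancestry proportions inferred by two panels for the same individuals; this is an assumption, not a description of the pipeline used in the study. The function names, panel labels and values below are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def pairwise_r2(panels):
    """Squared correlation for every pair of panels.

    `panels` maps a panel label to a list of ancestry proportions for one
    ancestry component, with individuals in the same order in every list.
    """
    names = sorted(panels)
    result = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            result[(first, second)] = pearson_r(panels[first], panels[second]) ** 2
    return result


# Hypothetical proportions of one ancestry component in five individuals.
toy = {
    "panel_small": [0.10, 0.35, 0.50, 0.80, 0.95],
    "panel_large": [0.12, 0.30, 0.55, 0.78, 0.97],
}
r2 = pairwise_r2(toy)
for pair, value in r2.items():
    print(pair, round(value, 3))
```

Computing the coefficient separately for each ancestry component and each population, rather than pooling all samples, keeps differences between components visible.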
These results show that, in most cases, the available AISNP panels fulfil their intended role of correctly attributing ancestry according to the continental group to which the individual belongs. Several studies have already compared the accuracy of panels and reached similar results [19, 22, 24]. Consequently, many authors currently argue that there is no need for new AIM panels to assign individuals to the six biogeographic regions (Sub-Saharan Africa, Europe, Southwest Asia, South Asia, East Asia and the Americas). Instead, efforts should be directed towards building panels for global use and with greater representation of population groups [22].
In this sense, many comparative studies focus on minimizing marker redundancy (reducing costs and enabling reproducibility by different groups), selecting only the most informative markers and filling gaps in population representation [41]. Likewise, there is a need to improve the selection of informative markers capable of revealing differences between subregions within continents or in regions with complex admixture patterns, such as populations from East, Southeast and South Asia, as well as differences between Middle Eastern and European individuals [42].
Most AIM panels use HGDP and 1KGP data as reference populations for marker selection, including some of those evaluated in the present study [17, 43, 44]. These two public databases were essential for understanding the distribution of genetic diversity and affinity among human population groups [27, 45, 46]. However, they capture only a portion of human population diversity. Therefore, the developers of many AIM panels endeavoured to include more populations for different population groups (e.g. 55 AISNP [22]; 128 AISNP [47]; 446 AISNP [48]).
Soundararajan et al [49] argue that, when reference populations are poorly represented, a greater number of markers is needed to obtain robust allele frequencies for defining the population groups of interest. Our results agree on this point, since we observed greater correspondence in individual ancestry inferences between panels with a greater number of markers, as well as between those panels and the HDSNP and WGS data.
In the present study, we focused on the American admixed populations. These populations emerged in the last half-millennium, mainly from Native American, European and African sources. More recently, they have also received contributions from other regions such as East Asia and the Middle East.
Ancestry inference in admixed populations requires closer attention, as their genomic particularities pose several challenges. Each admixed population has its own evolutionary history, differing in parental sources and in the proportion and time of admixture. Furthermore, the admixture process produces variation at different levels: (i) in ancestry between admixed populations, (ii) between individuals in the same admixed population and (iii) along the genome of the same admixed individual [50].
For this reason, a method, model or panel that captures the ancestry profile well in one admixed population or individual is unlikely to perform equally well in another. Our results showed this heterogeneity, especially in the pairwise comparisons of individual ancestry inference between panels (Supplementary Figures S5-S11), where we observed variation in correlation coefficients both between ancestry components within the same admixed population and between admixed populations.
The inconsistencies observed in the ancestry inferences between the panels were even more evident for the minority ancestry components of the individuals in our results (e.g. Supplementary Figures S5C, S8A and S11B). This probably occurs because the genome of an admixed individual is a mosaic composed of segments from different parental sources. Over generations, meiotic recombination shuffles the components of distinct ancestry between homologous chromosomes [11, 51]. Thus, the greater the number of generations since the beginning of admixture, the smaller the genomic segments of each ancestry will be. Likewise, the greater the proportion of an ancestry component, the larger its segments in the genome, and the smaller the proportion, the smaller the segments [51]. In this scenario, because of their lower density and genomic coverage, AISNP panels with few markers tend to infer the majority ancestry components of an admixed individual more accurately than the minority ones. This problem should not arise with data of higher SNP density and genomic coverage.
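The relationship between ancestry proportion, time since admixture and segment size can be made concrete with a short sketch. It relies on a commonly used approximation for a single admixture pulse, in which the mean tract length of an ancestry contributing proportion m after g generations is about 1 / ((1 - m) × g) Morgans. This model, the uniform marker spacing and the genome map length of 3,500 cM are all illustrative assumptions and were not part of the analyses in this study; the function names are hypothetical.

```python
def expected_tract_length_cm(proportion, generations):
    """Approximate mean ancestry tract length in centimorgans (cM).

    Assumes a single admixture pulse `generations` ago in which the
    ancestry of interest contributed `proportion` of the gene pool.
    """
    if not 0.0 <= proportion < 1.0:
        raise ValueError("proportion must be in [0, 1)")
    if generations < 1:
        raise ValueError("generations must be at least 1")
    # Mean length in Morgans is about 1 / ((1 - m) * g); x100 converts to cM.
    return 100.0 / ((1.0 - proportion) * generations)


def expected_markers_per_tract(n_markers, tract_cm, genome_cm=3500.0):
    """Average number of panel markers on one tract.

    Assumes markers are spread uniformly over a genome of `genome_cm`
    centimorgans (the default is an illustrative round figure).
    """
    return n_markers * tract_cm / genome_cm


# Compare a minority, an intermediate and a majority component
# for a hypothetical 55-marker panel, 15 generations after admixture.
for m in (0.05, 0.50, 0.90):
    tract = expected_tract_length_cm(m, generations=15)
    markers = expected_markers_per_tract(55, tract)
    print(f"m={m:.2f}  mean tract={tract:.1f} cM  markers per tract={markers:.2f}")
```

Under these assumptions, tracts of a minority component are shorter and are therefore covered by fewer markers of a sparse panel than tracts of a majority component, which is consistent with the pattern described above.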
We also observed that the Native American component had the lowest correspondence in the ancestry inferences between the panels (Supplementary Figures S7D, S10D, S11D). Three factors likely explain these results: (i) two of the panels we used (12AISNP and 34AISNP) were not developed to capture Native American ancestry, which explains their low performance for this component; (ii) Native American populations, due to their recent bottleneck history, are the most differentiated in the world [45] and the least represented in the reference panels, and although some panels were developed with the aim of enriching the Native American component (e.g. [22, 47, 48]), they do not always capture it well in all admixed populations; and (iii) the Native American component is the minority in most populations evaluated here (with the exception of MXL and PEL).
Here we focused on evaluating the tetrahybrid admixture model in American populations (NAM, EUR, AFR and EAS). We know that for most of the admixed populations assessed here, the East Asian contribution is less than 1%. However, there is a growing migratory flow of this population group to large urban centers in the USA and Brazil. East Asian immigration to Brazil began in 1908 with the Japanese and today, according to the Ministry of Foreign Affairs of Japan, more than 2 million descendants of Japanese people live in Brazil. São Paulo, the city where the Brazilian samples of the present study were collected, hosts one of the largest Japanese communities outside Japan. The Brazilian cohort includes 33 samples with 100% EAS ancestry from direct descendants of the first Japanese immigrants [26]. Data from the 2010 Brazilian census reveal that in 10 years there was a 173.7% increase in individuals who declared themselves to be of Asian descent (Japanese, Chinese and Korean) [37]. Peru has the second largest ethnic Japanese population in South America after Brazil, and in the 1KGP PEL samples we observed at least two individuals with EAS ancestry, one with ~50% and the other with ~12%. Thus, even when a source contributes only a small share of the average population ancestry, East Asian ancestry can be the majority component at the individual level (100%, 50%), reflecting recent admixture events. Therefore, we recommend adjusting the admixture model to the number of sources of each admixed population, even for minority sources, especially in cases of recent migration and admixture.
Although we did not run the trihybrid model, we compared our results from admixed samples in 1KGP for 55AISNP, 128AISNP and 170AISNP with those from Pereira et al [38] (Supplementary Table S8). The differences between mean ancestry values ranged from 0.13% to 5.72%. The largest differences occurred for the 128AISNP panel, in the inferences of European ancestry in ACB (2.2%), ASW (3.6%), CLM (5.1%), MXL (5.7%) and PUR (4.9%), and of Native American ancestry in PEL (3.6%). In these cases, the tetrahybrid model gave lower estimates of that ancestry than the trihybrid model. We are not suggesting that the tetrahybrid model should be universally used in the admixed populations of America. Again, it is necessary to know the history of the populations to choose the most suitable model. In this sense, we encourage studies to use the tetrahybrid model even when the fourth component is a minority in the populations.
Finally, in the analyses with admixed populations, we found correlation coefficients greater than 0.9 between 446AISNP, 672AISNP, HDSNP and WGS data. This result shows that, despite differences in genotyping methods and in the number of markers (hundreds to thousands), there are reliable panels for identifying the four components of ancestry in admixed populations. The choice between them will depend on the purpose and needs of the study. For example, in forensic genetics, samples of sufficient quantity and quality are sometimes not available, which limits the genotyping methodology [52]. On the other hand, in clinical or genetic association studies, the accuracy of genomic ancestry is essential, and it is often necessary to go a step beyond the genomic mean and make inferences about ancestry in specific genomic segments [53, 54].
Increasingly, the admixed populations of America are the focus of different kinds of studies (population history, clinical studies, forensics). It is therefore essential to discuss and understand how methodological advances, both in genotyping and in analysis, help to improve the inference of genetic ancestry in admixed populations. In the present study, we show that heterogeneity within and between admixed populations still poses methodological challenges. However, we have also shown that there are good panel sets and tools that can help address these challenges.