Sequence performance
On average, 139,681 reads were mapped per sample, and the overall mean read depth was 1,260X ± 422X (mean ± SD) per individual. The EMPOP-recommended variants, haplogroup assignments and mean sequencing depth of the 209 Daur individuals are presented in Table S5.
Haplogroup distribution
Figure 1 presents a simplified phylogenetic tree showing the distribution of the coarse haplogroups; the detailed typing results are given in Table S5. Overall, the matrilineal component of the Daur group was composed predominantly of eastern Eurasian-specific lineages (89.21%), represented by haplogroups D (28.24%), G (10.54%), B (10%), C (8.62%), R9 (7.65%), N9 (6.92%), Z (6.23%), A (4.79%), M7 (4.78%) and M9 (1.44%) [39,40]. The remaining samples belonged to haplogroups U (1.44%), T (1.92%) and H (1.44%), which are generally confined to the European region[41,42], and to a few root types (R* and M*). Among these haplogroups, C and D have distinctly Asian characteristics, and more than half of the northern Asian mtDNA pool falls into their subclades[39,43]. In the Daur population studied here, haplogroup C comprised four sister subclades, C1 (0.48%), C4 (2.39%), C5 (3.83%) and C7 (1.92%), while haplogroup D comprised three sister subclades, D2 (0.96%), D4 (19.62%) and D6 (7.66%). Notably, haplogroup D4 not only occurred at high frequency but also contained abundant downstream clades, 28 in total (Table S5). Some subbranches of haplogroup D4 have very distinctive geographical distributions and are of great significance for studying the demographic history of Asia[34,43]. For example, haplogroup D4j (2.87% in this study) has a more southern geographic distribution, and haplogroup D4e4a (0.48% in this study) is found mostly in the Subarctic and Arctic regions[44]. According to previous studies, haplogroups B (10% in this study) and G (10.54% in this study) are also frequent in Mongolic-speaking groups[39,45].
On the whole, the Daur population in this study shows distinct regional and ethnic characteristics. Compared with earlier studies of Daur mtDNA [14-16], our study found shifts in the frequencies of some haplogroups and detected haplogroups not previously reported in the Daur (U, F, H, etc.), which could be attributed to the larger sample size and the more advanced whole mtDNA sequencing methods used in this study.
Genetic diversity analysis
Based on whole mtDNA sequence data, a total of 127 different haplogroups were identified among the 209 unrelated Daur samples, of which 81 (63.78%) were unique. Although close matrilineal relatives (first to third degree) were excluded, 61.24% of the samples still shared a haplogroup with at least one other individual. Notably, the haplogroups M7b1a1+(16192), G2a1 and Z3d were each shared by six individuals. In addition, one haplogroup was shared by five individuals, seven by four individuals, seven by three individuals and twenty-eight by two individuals. The overall haplogroup diversity was 0.9933, with a discrimination capacity of 60.77%. Table S6 summarizes these results. Repeating the analysis with the CR and HVS1 data showed that the whole mtDNA sequence data decreased the number of shared haplogroups and increased the number of unique haplogroups. This is reflected in the discrimination capacity, which increased from 53.11% with the HVS1 haplogroups and 54.55% with the CR haplogroups to 60.77% with the whole mtDNA sequence for the Daur samples (Table S6). These results indicate that whole mtDNA sequence data offer high power of discrimination and can be useful for genetic investigation and maternal lineage research in the Daur minority.
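The summary statistics above can be checked from the reported sharing pattern alone. The sketch below assumes that the diversity value is Nei's unbiased estimator, h = n/(n-1) × (1 - Σp²), and that discrimination capacity is the number of distinct types divided by the number of samples; neither formula is stated in this section, and the function and variable names are illustrative. Under these assumptions, the tallies given in the text reproduce the reported 0.9933 and 60.77%.

```python
def diversity_stats(sharing):
    """Summarise a sharing pattern.

    `sharing` maps "number of individuals carrying a type" to
    "number of types carried by that many individuals".
    Returns (sample size, distinct types, diversity, discrimination capacity).
    """
    n = sum(size * count for size, count in sharing.items())
    k = sum(sharing.values())
    # Sum of squared type frequencies across all distinct types.
    sum_p2 = sum(count * (size / n) ** 2 for size, count in sharing.items())
    # Assumed estimator: Nei's unbiased diversity.
    h = n / (n - 1) * (1 - sum_p2)
    # Assumed definition: distinct types / sample size.
    dc = k / n
    return n, k, h, dc


# Tallies taken from the text: 81 unique types, 28 shared by two,
# 7 by three, 7 by four, 1 by five and 3 by six individuals.
daur_sharing = {1: 81, 2: 28, 3: 7, 4: 7, 5: 1, 6: 3}

n, k, h, dc = diversity_stats(daur_sharing)
print(n, k)                  # 209 127
print(round(h, 4))           # 0.9933
print(round(dc * 100, 2))    # 60.77
```

The same tallies also give 81/127 = 63.78% unique types and 128/209 = 61.24% of individuals sharing a type, matching the percentages in the paragraph.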
As expected, the genetic diversity observed with maternal markers was slightly lower than that observed with paternal markers, which mainly reflects the limitations of mitochondrial markers themselves. In our previous study of genetic polymorphisms at 27 Yfiler® Plus loci in the Daur group, a total of 196 different haplotypes were observed among 203 Daur individuals, and the overall haplotype diversity was 0.9997, with a discrimination capacity of 0.9655[7]. Our other two studies, based on Y-STR/Y-SNP typing and Y-chromosome sequencing, provided rich details on the paternal genetic diversity of the Daur group[8-10].
Population comparisons and phylogenetic analysis
We first performed a series of genetic relationship and structure analyses among 51 populations based on haplogroup frequencies (Table S2). In the PCA, the first three components explained 59.2% of the genetic variation (Figure 2). The African ancestry (AFR) and American ancestry (AMR) populations were clearly separated by PC1 and PC2, while the four large groups from Eurasia, East Asian ancestry (EAS), European ancestry (EUR), South Asian ancestry (SAS) and Middle East (Middle_Est), were closely related and even overlapped. In the plot of PC2 against PC3, the populations formed an east-west cline of genetic affinity running through EAS, SAS, Middle_Est and EUR. Our Daur population was located on one side of EAS, between the Yakut and CHB (Beijing Han) individuals. Whole mtDNA sequence analysis based on worldwide populations further illustrates the correlation between maternal genetic background and geography, and the position of the Daur in the PCA plot was generally consistent with their geographic origin.
To clarify the genetic relationship of the Daur group with East Asian populations, partial sequences (16024-16383) of all 55 populations (Table S3) were selected for further genetic analysis. Pairwise Fst values (Table S7) were calculated from the partial sequence variation and are displayed as a heatmap in Figure S1. The values ranged from 0.00148 (with the LK group, Lowland Kyrgyz from Artux, Xinjiang, China) to 0.09088 (with the Balochi group, from Pakistan). The Daur group showed higher similarity with LK, Yakut, JPT, LNH, UzbT, MHN, LU, Tib_LB, Gelao, SouthKorea, Turk, Hazara and Burusho (Fst < 0.01, P > 0.0009, after Bonferroni’s correction), which may indicate smaller genetic differentiation from these populations.
An MDS plot was drawn from the pairwise Fst values, as shown in Figure 3. To a certain extent, the genetic relationship patterns reconstructed here also correspond to geographical origin or linguistic affinity. In broad terms, populations speaking Indo-European languages (Iranian, Indo-Aryan and Slavic) gather on the left side, populations of the Sino-Tibetan language family (Chinese and Tibeto-Burman) cluster mainly on the right side, and Mongolic-, Tungusic- and Turkic-speaking groups (formerly known collectively as the “Altaic language” group) are located mainly at the bottom of the plot. The Daur group (Mongolic-speaking) lies at the margin of the “Altaic language” group and is closest to LNH (Han in Liaoning Province, also located in Northeast China). Notably, the Daur group is also close to Gelao (Gelao in Guizhou Province) and MHN (Miao in Hunan Province), two groups from South China. We do not yet have a satisfactory explanation for this, but in our previous studies on the Y chromosome of the Daur ethnic group, we found that the combined proportion of four haplogroups distributed mainly in Southern China and Southeast Asia (O1b1a1-M95, C2a1b-F845, O2a2a1a2-M7 and O1a-M119) was also appreciable (14.49%)[11]. Further research is needed on the presence of certain characteristically southern elements in both the paternal and maternal lineages of the Daur.
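For readers unfamiliar with how a pairwise distance matrix becomes a two-dimensional plot, the following is a minimal sketch of classical (metric) MDS. It is not the software used in this study, which is not stated in this section; it assumes the classical Torgerson variant, finds the leading eigenvectors by simple power iteration, and uses a toy distance matrix rather than the Fst values in Table S7. The function name and parameters are illustrative.

```python
import math


def classical_mds(dist, dims=2, iters=1000):
    """Embed n items in `dims` dimensions from an n x n symmetric distance matrix."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]

    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Deterministic, non-degenerate starting vector, normalised to unit length.
        v = [math.sin(i + k + 1.0) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        # Power iteration towards the dominant eigenvector of B.
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the matching eigenvalue (v has unit length).
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))
        # Coordinates along this axis; negative eigenvalues contribute nothing.
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate B so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


# Toy example: four points on a 3 x 4 rectangle (hypothetical data).
points = [(0, 0), (3, 0), (0, 4), (3, 4)]
toy_dist = [[math.dist(p, q) for q in points] for p in points]
for row in classical_mds(toy_dist):
    print([round(x, 3) for x in row])
```

For exactly Euclidean input such as the toy rectangle, the embedding reproduces the original distances up to rotation and reflection. Fst matrices are generally not exactly Euclidean, so a real plot is an approximation, and non-metric MDS implementations may give somewhat different layouts.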
A cladogram was also constructed using the neighbor-joining (N-J) method, as presented in Figure 4. The resulting phylogenetic tree contained four main branches plus the relatively independent LK population. The first branch consisted of populations speaking Indo-European languages (Iranian, Indo-Aryan and Slavic) and Turkic languages; the second and third branches, together with the relatively independent LK population, consisted mainly of Mongolic-, Tungusic- and Turkic-speaking groups; and the bottom branch was composed mainly of populations of the Sino-Tibetan language family (Chinese and Tibeto-Burman) and populations from low-latitude regions. In the bottom branch, the Daur group clustered first with LNH and JPT (Japanese in Tokyo, Japan) and then with Tibeto-Burman-speaking and low-latitude populations. Although the Daur group is representative of Mongolic-speaking populations, it is not genetically close to the others, as shown in Figure 4. This indicates that the maternal genetic composition of the Daur group has been strongly influenced by other groups, especially through genetic admixture from northern East Asia.
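The neighbor-joining procedure behind such a cladogram can be sketched as follows. This is an illustrative implementation of the standard algorithm, not the program used to produce Figure 4, which is not named in this section; the five-taxon distance matrix is a textbook-style toy example, and the internal node names (node0, node1, ...) are hypothetical.

```python
def neighbor_joining(names, dist):
    """Run neighbor joining on a symmetric distance matrix.

    Returns (joins, last_edge): `joins` lists
    (child_a, child_b, length_a, length_b, parent) in joining order, and
    `last_edge` is (node_a, node_b, length) for the final connection.
    """
    labels = list(names)
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(labels)}
         for i, a in enumerate(labels)}
    joins = []
    next_id = 0
    while len(labels) > 2:
        n = len(labels)
        # Net divergence of each node from all current nodes.
        r = {a: sum(d[a][b] for b in labels) for a in labels}
        # Pick the pair that minimises the Q criterion.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = labels[i], labels[j]
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the joined pair to their new parent node.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        parent = f"node{next_id}"  # hypothetical internal node name
        next_id += 1
        # Distances from the new parent to every remaining node.
        d[parent] = {parent: 0.0}
        for c in labels:
            if c in (a, b):
                continue
            d_pc = 0.5 * (d[a][c] + d[b][c] - d[a][b])
            d[parent][c] = d_pc
            d[c][parent] = d_pc
        labels = [c for c in labels if c not in (a, b)] + [parent]
        joins.append((a, b, len_a, len_b, parent))
    a, b = labels
    return joins, (a, b, d[a][b])


# Toy five-taxon example (hypothetical distances, not study data).
toy_names = ["a", "b", "c", "d", "e"]
toy_dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
toy_joins, toy_last = neighbor_joining(toy_names, toy_dist)
for step in toy_joins:
    print(step)
print(toy_last)
```

In the toy matrix, taxa a and b are joined first, with branch lengths 2 and 3. With real Fst data, ties and small negative branch lengths can occur, and dedicated phylogenetics packages handle these cases more carefully.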
The heatmaps, MDS plots and cladograms based only on the whole mtDNA sequence and on the HVS1 sequence extracted from the whole mtDNA sequence (Tables S8-S9) are shown in Figures S2-S4. Although fewer groups were included, the patterns of genetic relationships reconstructed here are generally similar to those from the partial sequence dataset, and populations with linguistic or regional associations clustered more closely in the MDS plot of the whole mtDNA sequence. This again shows that whole mitochondrial sequence data increase resolution and offer higher power of discrimination than previous maternal typing systems.
As mentioned above, haplogroup D4 not only has a high frequency (19.62%) but also contains abundant downstream clades in the Daur samples. According to previous studies based on partial sequences, D4 is also a high-frequency type in several ancient ethnic groups of Northeast China[46-48]. In the latest genome-wide study of northern East Asia, D4 also accounted for the majority of the samples from the ancient Heilongjiang River basin (66.67%, 16/24)[13]. We collected the relevant available full sequence data (Table S4) and constructed networks (Figure 5 and Figure S5). In Figure 5A, the Daur samples came from scattered sources, showing connections with multiple regions of Asia. When we focused on the genetic connection between the Daur samples and ancient samples, we found that most samples from the ancient Heilongjiang River basin were closely connected with Daur samples (Figure 5A and 5B) and were concentrated in haplogroups D4m, D4o, D4g and D4c. Haplogroup D4h, another high-frequency type in the ancient Heilongjiang River basin populations, was not detected in the modern Daur group. This is plausible, because D4h is a distinctive Native American type that may not have been involved in the later demographic history of northern East Asia[49]. In other words, the network analysis shows that the Daur do have certain connections with the ancient populations of the Heilongjiang River basin, but that during their development they also absorbed a large number of females from other sources. As to whether the modern Daur group has the closest matrilineal genetic connection with the ancient Heilongjiang population, we will collect more complete mitochondrial sequence data and examine this question in detail in follow-up studies.