3.1 SNP calling
A total of 319,015 X chromosome variants from 36 individuals across four yak populations successfully passed the quality filter criteria, which included metrics for missing genotypes (MG), Hardy-Weinberg equilibrium (HWE), minor allele frequency (MAF) and missing individual (MIND). LD pruning was purposefully omitted since some of the statistical analysis rely on LD signals (iHS). Furthermore, implementing LD pruning on sex-chromosomes is not advisable since sex-chromosomes comprises of long blocks of haplotype due to their lack of recombination. Consequently, all subsequent analyses were conducted on the unpruned set of SNPs.
3.2 Empirical distribution of test statistics of various methods
The empirical distributions of various test statistics based on the X chromosome are demonstrated in Figures 1, 2, and 3. The analysis of the distributions for iHS was found to be consistent among 3 Indian yak populations while in Jinchuan yak, the distribution was flatter on top distinguishing it from Indian yak population. Distribution of FST, shows skewness towards lower FST value while comparing among Indian yaks, while distribution was skewed toward higher FST on comparison with Jinchuan yak showing distinction among Indian and Chinese yak. XP-EHH test statistics similarly revealed demarcation on comparing Indian yaks with Chinese yaks revealing haplotype-based differentiation. This comparability underscores the robustness and reliability of the detected selection signatures. The detailed visual representations in the figures highlight the distinct patterns and commonalities in the genetic makeup and evolutionary pressures experienced by these yak populations.
3.3 Identification of selective sweeps on X chromosome
A range of techniques was employed to investigate selection signatures on the X chromosomes. For the within-population analysis, three distinct methods were utilized: Tajima’s D, Composite Likelihood Ratio (CLR) and integrated Haplotype Score (iHS) as these methods represent various categories, including site frequency spectrum (SFS) and haplotype-based approaches. For the between-population analysis, Cross-Population Extended Haplotype Homozygosity (XP-EHH) and Fixation Index (FST) were used to assess the degree of population differentiation.
3.3.1 Tajima’s D Analysis
In our study, we focused on identifying candidate regions under positive selection using Tajima’s D. We isolated the bottom 1% of Tajima’s D values, which revealed 461 candidate regions in each yak population. Tajima’s D values below -2 were observed for all yak populations, indicating positive selection with the fixation of rare alleles. A total of 460 candidate genes were found to overlap with the selected regions. Specifically, 123 genes were identified in Arunachali yaks, 108 in Himachali yaks, 121 in Ladakhi yaks, and 108 in Jinchuan yaks. The distribution of Tajima’s D values along the X chromosome is illustrated in Figure 4. To elucidate the roles and functions of the identified genes, a comprehensive functional annotation process was conducted.
3.3.2 Composite Likelihood ratio (CLR) Analysis
Using the SweeD tool, we calculated likelihood values and annotated candidate regions with a boundary of 50 kb upstream and downstream from each region. Retaining the top 1% likelihood values, we identified 12,103, 10,118, 12,003, and 3,837 candidate regions in Arunachali yak, Himachali yak, Ladakhi yak, and Jinchuan yak, respectively. Figure 5 depicts the distribution of Composite Likelihood Ratio (CLR) values along the X chromosome of different yak populations, providing a visual representation of the selection signatures detected. Our analysis identified a total of 2,119 candidate genes overlapping with these selected regions. Specifically, 503 genes were found in Arunachali yak, 644 in Himachali yak, 376 in Ladakhi yak, and 596 in Jinchuan yak, respectively.
3.3.3 Integrated Haplotype Score (iHS) Analysis
The Selscan tool, designed for haplotype-based selection signature scanning, was employed to analyze signatures of selection using the Integrated Haplotype Score (iHS) along the X chromosome. For each SNP, the iHS score was calculated, normalized, and a threshold of |iHS| > 2 was applied. The top 1% of normalized scores greater than 2 (|iHS| > 2) were considered to enhance the accuracy of selection signature detection. Our analysis revealed 57, 167, 165, and 196 regions under selective sweep, harboring 366 candidate genes within these regions and their flanking regions. Specifically, the number of genes found in Arunachali yak, Himachali yak, Ladakhi yak, and Jinchuan yak were 53, 117, 89, and 107, respectively. The empirical distributions of iHS values across the various yak populations, as depicted in Figure 1. Moreover, we identified a list of genes that were common to the intra-population statistics (Tajima’s D, CLR, and iHS). Figure 6 presents a composite diagram illustrating the intersection of detected genes among different yak populations.
3.3.4 Fixation Index (FST) Analysis
The fixation index (FST), first proposed by Wright in 1951, is a measure that uses allele frequency fluctuations to infer genetic differentiation between populations. It quantifies the degree of genetic differentiation between populations due to factors such as selection pressure. Higher FST values indicate greater genetic differentiation and potential selection signals. In our study, we compared allele frequency differences between various yak populations (Arunachali/Himachali, Arunachali/Ladakhi, Himachali/Ladakhi, Jinchuan/Arunachali, Jinchuan/Himachali, and Jinchuan/Ladakhi) based on the X chromosome. By selecting the top 1% of empirical FST values, we identified 68 regions in each pair of populations. We annotated 312 genes located in and around these regions with a flanking threshold of 50 kb. Population-wise analysis revealed the presence of 47, 18, 53, 51, 70, and 73 genes in the Arunachali/Himachali, Arunachali/Ladakhi, Arunachali/Himachali, Jinchuan/Arunachali, Jinchuan/Himachali, and Jinchuan/Ladakhi pairs, respectively.
3.3.5 Cross population extended haplotype homozygosity (XP-EHH) Analysis
XP-EHH is a method based on measuring the difference in haplotype frequency and length between two populations, effectively detecting recent and ongoing positive selection. In our study, the Selscan tool was utilized to measure haplotype-based differentiation between pairs of populations (Arunachali/Himachali, Himachali/Ladakhi, Arunachali/Ladakhi, Jinchuan/Arunachali, Jinchuan/Himachali, and Jinchuan/Ladakhi) based on the X chromosome. Regions with the top 1% of |XP-EHH| values greater than 2 were defined as undergoing selection between pairs of populations. Specifically, 175, 173, 172, 150, 150, and 153 candidate regions were detected between Arunachali/Himachali, Himachali/Ladakhi, Arunachali/Ladakhi, Jinchuan/Arunachali, Jinchuan/Himachali, and Jinchuan/Ladakhi pairs, respectively. A total of 64 genes were identified, with distributions of 4, 7, 13, 14, 13, and 13 in the Arunachali/Himachali, Himachali/Ladakhi, Arunachali/Ladakhi, Jinchuan/Arunachali, Jinchuan/Himachali, and Jinchuan/Ladakhi pairs, respectively. Moreover, a list of genes common to inter-population statistics, i.e., FST and XP-EHH, is depicted in Figure 7. An X chromosome map was constructed displaying genes and their positions across different yak populations identified using Tajima’s D, CLR, iHS, FST, and XP-EHH (Supplementary file F1, F2, F3, F4).
3.4 Functional analyses of the extracted candidate genes
For structural annotation of potential selected regions, we accessed the ENA database (https://www.ncbi.nlm.nih.gov) and utilized gene location data from the BosGru3.0 yak genome assembly. Our gene enrichment analysis covered various categories, including cellular components, molecular functions, biological processes, and KEGG pathways. Additionally, we performed a detailed functional annotation, examining biological processes, molecular functions, and cellular components using the DAVID v2022q2 gene ontology tool (https://david.ncifcrf.gov). This comprehensive approach allowed us to gain a deeper understanding of the functional roles and pathways associated with the selected genes (Table 1) (Supplementary Table S2, S3, S4, and S5).
Here we also found some of the unique X chromosome genes and loci that are identified by various statistics and population-specific, i.e., MSN, ENSBGRG00000011790, PABPC5, CCNQ in Arunachali yak, RPL36A, UBA1, ENSBGRG00000009234, ENSBGRG00000007912, ENSBGRG00000007899, ENSBGRG00000008814, SLC9A in Himachali yak, ENSBGRG00000024060, MTM1, ENSBGRG00000024700, KCND1, SUV39H1, GPR173 in Ladakhi yak, and HSD17B10, ENSBGRG00000022974, ENSBGRG00000020907, ENSBGRG00000013035, AFF2, SH2D1A, MIR222, LPAR4 in Jinchuan yak (Table 2) (Supplementary Table S2, S3, S4, and S5).
Table 1: Genes and Novel Loci Associated with Selection Signatures Identified by Tajima’s D, iHS, CLR, FST, and XP-EHH Analyses, with Gene Ontology (GO) Terms (Supplementary Data)
Methods
|
Genes associated with selection signature
|
Tajima’s D, iHS and CLR
|
AIFM1, APOOL, ATRX, CHST7, DACH2, DGAT2L6, DIAPH2, EIF2S3B,
ENSBGRG00000008170, ENSBGRG00000008355, ENSBGRG00000008903, ENSBGRG00000008920, ENSBGRG00000010041,
ENSBGRG00000010189, ENSBGRG00000010672, ENSBGRG00000010695, ENSBGRG00000012630, ENSBGRG00000013098,
ENSBGRG00000013114, ENSBGRG00000013130, ENSBGRG00000013264, ENSBGRG00000016546, ENSBGRG00000016563,
ENSBGRG00000016583, ENSBGRG00000019326, ENSBGRG00000019489,
ENSBGRG00000019527, ENSBGRG00000019584, ENSBGRG00000020695, ENSBGRG00000021228, ENSBGRG00000021324,
ENSBGRG00000021353, ENSBGRG00000021428, ENSBGRG00000023145, ENSBGRG00000023158, ENSBGRG00000023191,
ENSBGRG00000023217, ENSBGRG00000023464, ENSBGRG00000024623, ENSBGRG00000024765, ENSBGRG00000025313,
ENSBGRG00000025461, ENSBGRG00000025945, ENSBGRG00000026272, ENSBGRG00000026304,FAM133A,
FGF13,FMR1,FRMD7,G6PD,GPC3,GRIA3,HDX,HS6ST2,HTR2C,HUWE1,IKBKG,
IL1RAPL2,INTS6L,KIF4A,KLHL15,MAGT1,MBTPS2,Metazoa_SRP,MID2,MIR18A,MIR19B2,MIR20B,
MIR363,MIR92A2,NBDY,NKAP,NR0B1,PGK1,PGRMC1,PHEX,RAB33A,RPS6KA6,SLC9A7,SPIN4,TENM1,TRPC5,U6,ZC4H2,ZNF280C
|
FST and XP-EHH
|
ARHGEF6, COL4A5, ENSBGRG00000008907, ENSBGRG00000009057, ENSBGRG00000020148, ENSBGRG00000020495, ENSBGRG00000021828, ENSBGRG00000024024, ENSBGRG00000025650, GPR119, HS6ST2, MAGED1, MOSPD1, PQBP1, SLC25A14, SLC35A2, TIMM17B, WDR44
|
Table 2: Genes and novel loci unique to Arunachali, Himachali, Ladakhi, and Jinchuan Yak identified by various statistical methods
Yak Populations
|
Genes identified
|
Arunachali yak
|
MSN, ENSBGRG00000011790, PABPC5, CCNQ, ENSBGRG00000010049, ENSBGRG00000024354 ENSBGRG00000020934, ENSBGRG00000024406 ENSBGRG00000018988, LDOC1
|
Himachali yak
|
ENSBGRG00000007503, ENSBGRG00000013880, ENSBGRG00000018466, ENSBGRG00000020038, ENSBGRG00000026084, ENSBGRG00000018983, RPL36A, UBA1, ENSBGRG00000009234, ENSBGRG00000007912, ENSBGRG00000007899, ENSBGRG00000008814, SLC9A6, ENSBGRG00000013446, TAF7L, CDX4, ENSBGRG00000024296, ENSBGRG00000009915, ENSBGRG00000026379, ENSBGRG00000020074, ENSBGRG00000019979, ENSBGRG00000018671, CLDN2, ENSBGRG00000008968, TKTL1, ENSBGRG00000018162, CITED1, PABIR2, GLA, ENSBGRG00000008003, ENSBGRG00000026384, MED14, ENSBGRG00000020673, USP9X, BTK, ENSBGRG00000018982, ENSBGRG00000024248, ENSBGRG00000024024, ENSBGRG00000025438, RIPPLY1, SOX3, MAGEE2, ENSBGRG00000020495, TMEM35A, CETN2, ENSBGRG00000019681, CENPI, ENSBGRG00000026345, ENSBGRG00000018143, HAUS7, RLIM, HNRNPH2, ENSBGRG00000024259, DRP2, ENSBGRG00000008290, ENSBGRG00000018656, ENSBGRG00000023465, SNORA35B, ENSBGRG00000007883, ENSBGRG00000008800, TEX11, BGN, FUNDC1, NSDHL, ENSBGRG00000024121
|
Ladakhi yak
|
ENSBGRG00000024060, MTM1, ENSBGRG00000024700, KCND1, SUV39H1, GPR173, ENSBGRG00000024081, TSPYL2 ENSBGRG00000024069, SNORA35
|
Jinchuan yak
|
ENSBGRG00000017892, HSD17B10, ENSBGRG00000022974, ENSBGRG00000020907, ENSBGRG00000013035, AFF2, SH2D1A, MIR222, LPAR4, ENSBGRG00000012747, ENSBGRG00000008066, ENSBGRG00000012409, ENSBGRG00000007551, ENSBGRG00000009197, ENSBGRG00000007923, RIBC1, ENSBGRG00000007060, NKRF, ENSBGRG00000026097, ENSBGRG00000023969, PBDC1, SRPK3, ENSBGRG00000013049, ENSBGRG00000018938, TMEM164, ENSBGRG00000017398, ABCD1, ENSBGRG00000017469, PRDX4, PRAF2, OTUD6A, ENSBGRG00000015557, ENSBGRG00000020119, GATA1 IDH3G CCDC120, ENSBGRG00000012421, ERAS, ENSBGRG00000011954, TFE3, ENSBGRG00000009995, ENSBGRG00000017943, MIR221, CCDC22, ENSBGRG00000019586, PLP1, ENSBGRG00000025435, CACNA1F, ENSBGRG00000008034, ENSBGRG00000008253, ZIC3, ENSBGRG00000026351, ENSBGRG00000019967, ENSBGRG00000020770, ENSBGRG00000025261, SAT1, ENSBGRG00000022376, HDAC6, ENSBGRG00000009057
|