Genetic ancestry and status of the CRM cohort
The initial dataset comprised 919 captive CRM individuals that were sequenced to a high mean depth (~ 30.47X) (Supplementary Table 1) and 80 wild CRM samples16 with moderate genomic coverage (~ 11.71X). After applying a series of sample and variant quality controls (see Methods), we obtained a total of 84,480,388 high-quality sequence variants across 961 individuals, including 74,752,163 single-nucleotide variants (SNVs) and 9,728,225 insertions or deletions (Indels) (Fig. 1a). This corresponds to an average of one variant per 35 base-pairs (bp) genomic DNA. Nearly 59% of these variants occurred at low allele frequencies (AF < 0.01) whereas approximately 17.9% were classified as very common (AF > 0.05). The intersection with the largest IRM cohort (mGAP v2.2)14 revealed that more than 62 million of the variants (73.94%, Fig. 1a) were newly identified, despite the much smaller sample size of our cohort compared to that of the mGAP project (964 vs. 2,425). This is perhaps not surprising given that the reference genome per se is an Indian-origin subspecies, which is phylogenetically distinct from the CRM27. Nevertheless, we cannot exclude another possibility that our CRM cohort may possess higher levels of genetic diversity compared to the mGAP cohort, which is evident from the results presented below.
We traced the genetic ancestry of the CRM cohort by incorporating samples from diverse geographical regions of China alongside samples from India. The PCA results show a clear separation between the CRMs and the IRMs (Fig. 1b), thereby corroborating the marked genetic divergence of these two geographically separated subpopulations16,28. Within the Chinese samples, the captive CRMs were indistinguishable from the wild population, irrespective of whether or not the Indian-origin samples were excluded. Such pronounced admixture between captive CRM samples and the wild population was further corroborated in FRAPPE-inferred ancestral clusters (Fig. 1c), implying that the captive CRMs are likely a recent admixture of multiple wild sources, aligning with the maintenance history of the cohort. The combination of multiple genetic ancestries introduces increased nucleotide variation into the recipient population. As expected, we found that the captive CRMs showed the highest genetic diversity (mean π = 0.0016), which is comparable to that of the wild population (average π = 0.0015) and 1.7-fold higher than the mGAP cohort (average π = 0.0001) (Fig. 1d). This notwithstanding, the mutational load pattern indicated that both the captive CRMs and the wild population carried significantly fewer deleterious mutations (Fig. 1e) and homozygous loss-of-function (LoF) (Fig. 1f) than the mGAP cohort (Mann-Whitney U test, p-value < 2.2x10− 16). High genetic diversity and low genetic load are reliable indicators of a population's long-term viability29,30. These results imply that the genetic status of our captive CRMs compares favorably with the mGAP samples, with a lower risk of inbreeding, germplasm degradation and loss of genetic diversity.
Variant annotation and mutational profiling
We classified the variants into different categories based on their location and functional impact. As seen in human cohorts31,32, the majority of the CRM variants were found in intergenic and intronic regions, accounting for 45.13% and 39.75%, respectively, whereas the variants located in coding and splicing regions made up 0.89% of the total (Fig. 2a and Extended Data Fig. 1a–c). The number of synonymous variants (~ 328K) was slightly higher than the non-synonymous variants (~ 315K); they together comprised 85.22% of the variants in coding and splice regions. The allele frequency distribution indicated that the non-synonymous and frameshift mutations, start/stop gains or losses, and splice site variants are more likely to be rare or singletons (Fig. 2b), reflecting the putative purifying selection acting on them.
We further examined the mutational constraint on different genes and pathways. To mitigate the potential population genetic forces that may have influenced the amount of variations as well as the features of the local sequence (e.g., gene length, mutation rate), we utilized the number of synonymous variation as the control baseline33. Specifically, for each gene, we computed the ratio of non-synonymous to synonymous substitutions (dN/dS), a statistic that is also commonly used to measure the strength and mode of natural selection acting on protein-coding genes34. After controlling for the false discovery rate (FDR), our results showed that the most evolutionarily constrained pathways (involving genes with no observed non-synonymous mutations) were related to core biological processes, e.g., ribosome, spliceosome and proteasome components (adjusted p-value < 0.05, Fig. 2d), consistent with previous findings in human cohorts33,35. By contrast, the immune-related pathways, such as the chemokine signaling pathway, cytokine − cytokine receptor interactions, viral protein interactions with cytokine and cytokine receptors, were among the least constrained pathways (dN/dS > 4). Interestingly, several neurodegeneration pathways, such as those evident in amyotrophic lateral sclerosis (ALS), Parkinson disease (PD), Huntington disease (HD) and Alzheimer disease (AD), were also found to be markedly conserved (adjusted p-value < 0.05), implying their functional importance and strong purifying selection in rhesus macaques. It is reasonable to suppose that these categories of conserved pathways are also less tolerant to deleterious mutation.
Loss of function (LoF) variants and association with phenotypes
LoF variants, including nonsense, frameshift or canonical splice-site mutations, are of particular interest as they have the potential to severely disrupt the functionality of protein-coding genes, thereby could serve as naturally occurring gene knockouts to explore gene function36. However, LoF variants are known to have a high false-positive rate due to various factors, including incomplete and imperfect genome annotation, occurrence on non-canonical transcripts or within the last 5% of the transcript37,38. To increase the probability of a given variant being accurately annotated as a predicted loss-of-function (pLoF) mutation, we applied a set of filtering strategies to the raw LoF variants derived from the SnpEff prediction (see Methods for detail). In total, we identified 4,166 high-confidence pLoF variants across 2,746 genes (Supplementary Table 2), where at least one copy of the gene was predicted to be inactivated based on both rhesus macaque and human genome annotations. Of these, the majority (83.08%) were found to be rare (MAF < 0.01) and only 5.61% of the pLoF variants were very common (MAF > 0.05). On average, each individual macaque carried 97 pLoF variants, similar to the numbers found in human genomes36,37.
The very common pLoF alleles are likely to be LoF-tolerant because they are less constrained by purifying selection. We observed a significant enrichment in olfactory receptors among these alleles (adjusted p-value = 1.809×10− 4, Supplementary Table 3), consistent with the findings of previous studies33,35,39. It is intriguing to find that seven mouse essential genes (PPP1R15B, IFT52, CYP1A2, ETV2, NFASC, SLC2A9, PLRG1) and two human essential genes (MAK16, PLRG1) were tolerant to biallelic inactivation in CRMs (Fig. 3b and Supplementary Table 2). For example, the PLRG1 gene, which encodes a core component of the cell division cycle 5-like (CDC5L) complex, is crucial for both mouse embryonic and human cells in terms of their viability40,41. However, our observations suggest that the homozygous knockout of this gene does not result in severe consequences or a disease state in CRMs, probably the evolutionary change of gene essentiality across species42 or a compensation effect from gene family members43. By contrast, rare pLoF alleles (MAF < 0.01) are expected to be less tolerated and likely associated with a strong functional effect. We found a strong depletion of homozygosity among rare pLoF variants, with only 78 (2.29%) of the variants being homozygous. These genes were significantly enriched for metabolic pathways, such as arachidonic acid metabolism, glycerophospholipid metabolism, and glycerolipid metabolism (adjusted p-value < 0.05, Supplementary Table 4). Interestingly, we identified 338 genes as potential drug targets within the high-quality pLoF catalog (Fig. 3b and Supplementary Table 2). These genes exhibited varying degrees of gene loss, which could potentially lead to inter-individual differences in pharmacological efficacy. Consequently, the compilation of high-confidence LoF variants could serve as a key resource to guide the selection of suitable "druggable" targets, and it would be rewarding to have a primary screening for these druggable targets in CRMs for selecting the proper individuals for the pharmacological evaluations.
To further characterize the phenotypic consequences of the rare pLoF variants, we performed an association screen against 52 distinct phenotypes (Supplementary Tables 5 and 6). Association results surpassed the Bonferroni significance threshold (p-value = 2.83×10− 5, see Methods) for seven pLoF-trait pairs (Supplementary Table 7). The most significant association was a splice acceptor variant in ANO10 (c.203-2 AG > G), which was related to the full-leg length (p-value = 8.97×10− 6). Compared to the non-carriers, ANO10 (c.203-2 AG > G) heterozygotes displayed a significant reduction in full-leg length (Mann-Whitney U test, p-value = 0.0251, Fig. 3c). Notably, ANO10 (c.203-2 AG > G) heterozygous carriers also exhibited a nominally significant reduction in full-arm length (Mann-Whitney U test, p-value = 0.0139, Fig. 3d), although the association test (p-value = 4.12×10− 5) did not surpass the level of significance required by Bonferroni correction, likely because the correction approach is highly conservative and would tend to "overcorrect" the variants in the context of a mild or small effect44. ANO10 encodes a transmembrane protein that belongs to the transmembrane 16 family. Defects in this gene can cause ataxia, a neurological condition characterized by gait and balance impairment, upper limb coordination problems, as well as impairment of speech and eye movements45,46. However, to our knowledge, ANO10 has never been reported to be associated with limb length. Similarly, we could identify a heterozygous splice acceptor mutation at PRRC2B (c.6379-2A > G), which was predicted to play a role in embryonic development47, was significantly associated with a higher body weight (p-value = 9.67×10− 6, Fig. 3c). If employing a less conservative association p-value threshold (e.g., 1×10− 4), we could identify another 13 associations that was align with the gene function (Fig. 3d). For instance, the carriers of a stop gain mutation in the ATR gene possess a smaller head length (Mann-Whitney U test, p-value = 0.0072). It has been suggested defects of this gene was a cause of Seckel syndrome 1, a syndrome characterized by severe intrauterine and postnatal growth retardation, microcephaly and mental retardation48. In addition, the heterozygous knock-out of ALOX15, which encodes an enzyme that acts on various polyunsaturated fatty acid substrates49, was associated with lower high-density lipoprotein (HDL) and low-density lipoprotein (LDL) concentrations in serum of CRMs (p-value = 0.0042 and 0.0073, respectively).
Genome-wide association for 52 phenotypes in CRMs
The availability of multiple genomes coupled with phenotypic data also provides an unprecedented opportunity to investigate the genetic foundations of phenotypic variation in CRMs. To this end, we performed GWAS analyses for each quantified trait on the common variants (SNVs + Indels) with a mixed linear model by fitting relevant covariates, e.g., age, sex, genetic relationship, population structure (see Methods). The genomic control factor λ did not show any sign of inflation for all tests (λ < 1.03), suggesting that population structure has been well controlled. In total, we identified 44 variants associated with 16 phenotypic traits that passed the genome-wide significance threshold (p-value = 5.13 × 10− 8). These variants were clumped into 30 independent loci across 18 chromosomes, explaining 3.36%-5.97% of phenotypic variation (Extended Data Fig. 2 and Supplementary Table 8).
The annotation of these significant variants revealed six genes (DCDC2C, TRIB1, EDIL3, GGT1, SHISA9, WWOX) have been reported to be associated with specific human traits (Supplementary Table 8). For instance, the EDIL3 gene, which encodes an integrin ligand, has been previously suggested to be related to human body mass index (BMI)50. In this study, we discovered that a downstream variant of this gene was significantly associated with a reduction in BMI in rhesus macaques (beta=-1.0737, p-value = 6.30×10− 9, Extended Data Fig. 2c). We also observed associations of the SHISA9 locus link to hip circumference 51(beta= -0.6693, p-value = 3.69 ×10− 8), and the WWOX locus with body weight 52(beta = 0.3909, p-value = 3.87×10− 8) (Extended Data Fig. 2 and Supplementary Table 8). Apart from these known associations, we identified 11 significant associations that had not previously been reported in the human GWAS catalog53 (Supplementary Table 8). Of these, the most significant association was observed for a 5'-UTR variant at the IGLL1 locus (c.-1436C > T), which was related to the serum gamma-glutamyl transpeptidase concentration (γ-GGT) level in CRMs (p-value = 2.76x10− 11, Fig. 4a,b and Supplementary Table 8). This gene encodes an immunoglobulin lambda-like polypeptide 1 protein which plays an important role in B cell development54. In CRMs, the heterozygous and homozygous carriers exhibited a gradual increase in γ-GGT concentration as compared to non-carriers (Fig. 4d). Interrogation of human ENCODE databases55 revealed that this signal region exhibited distinct active enhancer signatures in a range of human cell types (Extended Data Fig. 3). It is noteworthy that this peak also encompassed an independent locus of GGT1 (p-value = 2.59 x10− 8), which has previously been reported to be associated with γ-GGT level in human56. However, regional association analysis indicated that these two variants were in weak linkage disequilibrium (LD) (r2 = 0.01, Fig. 4c), suggesting that they are independent associations with both being linked to the GGT level.
Reverse genetic screen identifies DISC1 (p.Arg517Trp) as a genetic risk factor for neuropsychiatric disorders
The above classical forward genetic approaches enabled the identification of multiple genetic variants associated with the phenotypic variations in CRMs. It is intriguing to verify whether a distinct genotype can predict a specific phenotype. In a reverse genetic screen, we identified 3,192 non-synonymous mutations across 2,216 genes that were predicted to be deleterious based on the intersection results of SIFT4G57 and PolyPhen-258 (Supplementary Table 9). We are particularly interested in the genes related to human neurological disorders (NDs) as these complex diseases are difficult to investigate using rodent models3,59. Non-human primates (NHPs) are not only phylogenetically close but they also share similar brain structure and function with humans, making them more suitable for the study of human NDs than other mammalian species60. Below, we highlight the case regarding the phenotypic consequences arising from a deleterious missense mutation in the DISC1 gene (p.Arg517Trp, c.1549C > T, SIFT4G score = 0.01).
DISC1 (Disrupted-In-Schizophrenia 1) is a well-established risk gene for schizophrenia and various other neuropsychiatric disorders, including affective disorder, bipolar disorder, autism spectrum disorder, and major depressive disorder61,62. This gene encodes a multi-compartmentalised protein that functions as a scaffold hub, interacting with numerous partners involved in brain development and disease processes. Defects in DISC1 have been reported to be associated with impaired working memory63. Anatomical changes mostly involve cortical abnormalities, including the prefrontal cortex as this area plays an important role in executive functions and working memory64. In this cohort, we identified eight CRMs that carried the DISC1 p.Arg517Trp mutation in the homozygous state. These macaques included three adults (aged 5 to 7 years) and five elderly individuals (aged over 19 years). Given that aging could potentially affect the results obtained (e.g., working memory), we focused on the three adults and excluded the elderly monkeys from the behavioral and brain imaging experiments.
We observed a significant reduction in neurological function in carriers of the risk allele (Trp, two-tailed t-test, p-value < 0.0001, Fig. 5d). This reduction was manifested by diminished limb reflexes, as well as a decreased response to pain and teasing. We further assessed the working memory under mild-stressful and non-stressful conditions, respectively. Our results showed that risk allele carriers consistently exhibited lower working memory performance with increasing delay lengths, and this pattern was particularly evident in the trials with 30 second delays (Fig. 5a,b). When a restraint stress was applied, the risk allele carriers displayed markedly more errors under these stressful conditions (two-tailed t-test, p-value = 0.0363, Fig. 5c). Since stress is a risk factor for psychiatric disorders associated with impaired prefrontal function65,66, these data may help to explain why the deleterious missense mutation of DISC1 increases the risk of psychiatric disorders.
Next, we carried out magnetic resonance imaging (MRI) to examine whether any cortical structure was altered in DISC1 Trp carriers. Although we did not detect a significant reduction in gray matter volume and thickness (Extended Data Fig. 4), we observed an increase in gray matter surface area in the Trp risk allele carriers, particularly in the motor cortices of the caudal frontal lobe (corrected p-value = 0.0338, Fig. 5e). Conversely, we detected a significant reduction of white matter volume in the temporal lobe (corrected p-value = 0.0064, Fig. 5f) and a significant increase in ventricular volume (corrected p-value = 0.0169, Fig. 5g). Further region-level results confirmed that the majority of changes in gray matter surface area and white matter volume were localized to the frontal lobe and temporal lobe, respectively (Extended Data Fig. 5–6). We also collected resting-state functional magnetic resonance imaging (rs-fMRI) data. The network-based statistic (NBS) was conducted across a range of primary thresholds (t = 3.0-3.4) to identify differences in functional connectivity between the Trp-bearing macaques and the Arg controls under general anesthesia. As the primary threshold increased, a stable set of differing functional connectivity persisted (Extended Data Fig. 7), with the results at the median threshold (t = 3.2) presented in Fig. 5h,i. Among these findings, the majority of increased functional connectivity measures in Trp-bearing monkeys were localized within the frontal lobe (n = 11), while a subset was observed between the frontal lobe and subcortical regions (n = 7) (Fig. 5h). Additionally, we identified 27 connections that displayed a reduction in strength in the Trp-bearing macaques compared to controls, with the majority of these reductions occurring between the frontal lobe and parietal lobe (n = 13) (Fig. 5i). It should be noted that functional connectivity under anesthesia is a measure of correlative firing, not functional efficacy, and increased values can represent correlated slow wave firing. Nevertheless, these data collectively indicated that the macaques carrying the risk allele of DISC1 p.Arg517Trp exhibited alterations in cortical architecture and functional connectivity, which may ultimately contribute to the observed impairment of working memory. As working memory impairment is a contributing symptom to most neuropsychiatric disorders linked to DISC1 mutations, these promising data provide a remarkable bridge across human and macaque species, albeit the number of macaques with DISC1 p.Arg517Trp were relatively small.