1. Research design and clinical characteristics of cohorts
The study involves two cohorts composed by Chinese people of Han ethnicity. The first group, discovery cohort, are used to identify possible risk genetic loci and genotypes for late-onset PE by exome sequencing based Genome-Wide Association Study (GWAS), while the other one, validation cohort, are recruited to verify the candidate genetic risks identified by GWAS with an expanded size of subjects (Fig. 1). The genetic predisposition is also compared between early-onset and late-onset PEs (Fig. 1).
A total of 42 subjects were collected in the discovery cohort, including 22 late-onset PE patients, and 20 healthy controls. Confounding factors that are highly correlated with PE, such as age, BMI, etc., are not differentially distributed between groups (Table 1). None of the subjects had complications or underlying diseases that may affect PE genetic research. The blood pressure and albuminuria levels of PE cases were significantly higher than those of the control group, since they are the main diagnostic standards for PE. The fetal birth weight and neonatal score (Apgar score) of PE cases were significantly lower than those of control, while the incidence of intrauterine growth retardation (IUGR) was significantly increased in the PE group, further suggesting fetal development could be influenced by PE apparently (Table 1).
In the validation cohort, 399 subjects were enrolled, including 135 PE patients and 264 controls. The PE group included 64 late-onset and 71 early-onset cases. Age, BMI, and blood pressure were not significantly different between early-onset and late-onset subgroups, but were significantly larger than those of the control group (Table 2). Fetal birth weight and the Apgar score of newborns showed a gradual increase between early-onset PE, late-onset PE, and healthy controls, which may reflect the different effect of the disease time or types of PE on the fetus, with early-onset PE showing larger impact than late-onset PE (Table 2). The possibility could not be excluded yet that the difference was actually caused by the different birth time of the fetuses from the pregnant women with early- late-onset PE. The rate of IUGR in early-onset PE was significantly higher than that in late-onset PE, further confirming that different time or types of PE have different effect on the fetuses (Table 2). The incidence of maternal GDM, HELLP, and other important underlying diseases and complicating gynecological and obstetric diseases, were generally higher in the PE group than in the control group, but there was no significant difference between the two types of PE (Table 2). Taking together, the characteristics of the validation cohort suggested a more severe impact of early-onset PE on the fetus than late-onset PE.
2. Identification of risk genetic loci and genotypes of Chinese late-onset preeclampsia cases
Exome sequencing was performed on the discovery cohort composed of patients with late-onset PEs and healthy controls. The Manhattan plot reveals a concentrated segment of chromosome 6 (6p21.32), which is clearly associated with PE (Fig. 2A). With the preset statistical significance threshold (p < 0.001), 139 risk genetic loci distributed within 22 protein-encoding genes were recalled, for which the genotypes are associated with late-onset PE in Chinese people significantly (Supplemental Dataset S1). The loci are enriched in 6p21.32 most strikingly, especially in two genes, HLA-DQB1 and HLA-DRB5 (Fig. 2B; Supplemental Dataset S1). The significant SNPs in HLA-DQB1 show high correlation between each other but low correlation with those in HLA-DBR5 or other nearby genes (Fig. 2B). Similarly, there is low linkage between the significant SNPs within and outside the HLA-DRB5 gene (Supplemental Fig. S1). The recombination rates are generally low between HLA-DQB1 or HLA-DRB5 and other loci (Fig. 2B; Supplemental Fig. S1). The results suggest that HLA-DQB1 and HLA-DRB5 could be associated with PE incidence independently at gene level, while the SNPs within individual genes could potentially contribute to the disease synergistically.
Furthermore, we combined GTEx gene expression data and transcriptome-wide disease association study (TWAS) approaches, and identified three genes significantly associated with late-onset PE, i.e., HLA-DQB1, CFAP61, and RARS2 (Fig. 2C; Supplemental Dataset S2). Among them, HLA-DQB1 is most significant and typical (Fig. 2C). HLA-DRB5 was not included in the original model, and therefore could not be analyzed for its possible association with PE at transcriptome level.
We compared the preeclampsia-associated loci with the significant GWAS results of other cohorts. However, only one common locus was identified, i.e., rs4762 in AGT gene, which was reported with significant association with preeclampsia in cohorts of multiple ethnics previously. It only shows low significance in the current Chinese GWAS cohort before multi-testing correction (p = 0.035). Despite the potential contribution of small size of the GWAS cohort to the relatively low power, it could also reflect the heterogeneity of preeclampsia genetic risk in different ethnics [30, 31].
3. Validation of three novel preeclampsia risk loci within HLA-DRB5 gene
Based on the results of GWAS and TWAS analysis, we identified a risk genetic region 6p21.32 that showed a strong association with Chinese late-onset PE and the polymorphic genetic loci were mostly enriched in genes HLA-DQB1 and HLA-DRB5. To further confirm the association of the region and genes with PE, we included more PE and control samples in a validation cohort, resolving the nucleotide and genotype composition for candidate risk polymorphic loci within the region. However, the main association loci within HLA-DQB1 and some loci in HLA-DRB5 are not suitable for design of sequencing or genotyping primers attributed to the poor conservation of flanking sequences. Eventually, only three risk loci within HLA-DRB5 were detected successfully for the base composition and genotypes in the complete validation cohort (Table 3).
For all the three loci, the results in the validation cohort are consistent with the GWAS results. The proportion of minor alleles for rs147440497 (GRCh37.p13 - Chr6: 32489888) increases significantly in the PE population, and the odd ratio reaches 2.0 (95% CI: 1.5-2.8) (Table 3; Fig. 3A). For rs141378803 (GRCh37.p13 - chr6: 32490016) and rs149025589 (GRCh37.p13 - chr6: 32490000), the minor alleles in healthy controls even shift to major ones in PE population, with composition odd ratios of 2.4 (95% CI: 1.8-3.3) and 2.3 (95% CI: 1.7-3.1) between PE and control, respectively (Table 3; Fig. 3A). The proportion of homozygous genotype consisting of minor alleles in control for each locus is significantly higher in the PE population, and the recessive heritability related odd ratios of rs147440497, rs141378803, and rs149025589 are 6.0 (95% CI: 2.9-12.5) and 3.6 (95% CI: 2.2-5.7) and 4.0 (2.5-6.4), respectively (Table 3; Fig. 3B).
The PE cases in the validation cohort could be further divided into late-onset and early-onset subgroups. In order to observe whether the PE genetic associations are specific to late-onset cases, the association analysis was repeated to the different types of PE for the three candidate loci. However, for both allele and genotype, the composition distributions for all the three loci show significant risk associations with either type of PE, and there is no significant difference between the two PE subgroups (Table 3; Fig. 3A-B).
The SNPs in HLA-DRB5, especially rs14740497, show some correlation with other SNPs in the same gene based on the GWAS data (Supplemental Fig. S1). Therefore, we further explored the possible linkage of genotypes among the three validated sites within HLA-DRB5. The three sites were genotyped with Sanger sequencing using the same set of primers for the validation cohort, and therefore they could be haplotyped. For rs147440497, rs149025589 and rs141378803, seven combinatorial forms for the nucleotide composition could be identified except for ‘TCG’, though three haplotypes, i.e., ‘ATA’, ‘TTA’ and ‘TTG’, are rarely detected (Fig. 3C). Haplotypes ‘TCA’ and ‘ACA’ are enriched while ‘ATG’ is depleted in PE cases significantly (Fig. 3C; χ2 tests, p < 0.05). Most of the haplotypes show similar proportion between early-onset and late-onset PE cases, except for ‘ACA’, which is with marginally significant higher proportion in early-onset PE cases (Fig. 3C; χ2 tests, p = 0.057).
4. Possible function of HLA-DRB5 on preeclampsia incidence
To infer the possible functional relevance on PE incidence, gene expression of HLA-DRB5 was examined in the bulky RNA-seq data obtained from placentas of 5 Chinese PE and 5 healthy pregnancies. The gene is detected with low expression in both types of tissues and with slightly higher expression in controls though not being significant statistically (Fig. 4A). The results are consistent with a microarray-based gene expression profiling experiment involving the placentas of 15 PE and 10 healthy control pregnancies (Fig. 4B). Interestingly, the transcriptional level of HLA-DRB5 is significantly higher in late-onset PE cases than in early-onset cases (Fig. 4C). The peripheral blood mononuclear cells (PBMCs) of the pregnancies express HLA-DRB5 with increased abundance, but not differentially between PE cases and healthy controls (Fig. 4D). HLA-DQB1 is expressed in a pattern similar to HLA-DRB5 in the placentas and PBMCs of PE and healthy pregnancies (Supplemental Fig. S2).
A single-cell RNA-seq dataset was further explored to observe the cell types expressing HLA-DRB5, which comprises placentas of two PE and two healthy Chinese pregnancies. The gene is expressed with low level in both PE and control placentas, but preferentially in dendritic cells (DCs), a C1Q-expressed macrophage subset and mesenchymal stem cells (MSCs), which also show the most abundant expression of HLA-DQB1 (Fig. 4E; Supplemental Fig. S3A). No difference is found between PE and control, for the expression level of either HLA-DRB5 or HLA-DQB1 (Fig. 4E; Supplemental Fig. S3A). In the peripheral blood mononuclear cells of healthy human subjects, HLA-DRB5 and HLA-DQB1 are also mainly expressed in subsets of myeloid cells and B cells (Supplemental Fig. S3B-C). Both HLA-DRB5 and HLA-DQB1 were also found with preferential expression in macrophages and DCs in human placentas and other tissues, based on online single-cell type analysis with the Human Protein Atlas (https://www.proteinatlas.org). These cell types where the genes are mainly expressed are major antigen presentation cells (APCs), and therefore HLA-DRB5 and HLA-DQB1 may contribute to PE incidence by antigen presentation processes.
Without typical difference in expression level, HLA-DRB5 could still contribute to PE pathogenicity or progress by the differential protein structure and function caused by the single nucleotide variations. Among the 3 PE risk sites of HLA-DRB5, rs141378803 and rs149025589 are located in the same intron dispersed by 16 nucleotides (Supplemental Fig. S4). The third site, rs147440497, is located at the 164th nucleotide and within the second exon of the corresponding transcript. The variation from T to A at the site causes a change of the encoding amino acid from Phe to Tyr (F55Y) (Supplemental Fig. S4). The exon where rs147440497 is located and the upstream sequences are known to encode an extracellular domain, which is important for antigen recognition and presentation (Supplemental Fig. S4). Structural modeling analysis demonstrated that the F55Y substitution causes a change in an important turn angle, which further lead to a striking conformation change of the extracellular domain for antigen recognition and presentation (Fig. 4F).
Taken together, the results indicate that HLA-DRB5 potentially functions in the pathogenicity or progress of PE by working differentially in the antigen presentation process through the local protein structure changes endued by the single nucleotide polymorphisms. HLA-DQB1 may contribute to PE by similar mechanisms.
5. A multi-gene model predicts Chinese women with high preeclampsia risk accurately
The genotypes of the three genetic loci in HLA-DRB5 were tested for the capability to predict the PE risk of Chinese women. A Support Vector Machine (SVM) model was trained only with the three loci of single nucleotide polymorphisms, and the Chinese validation cohort was predicted with the model. The 5-fold cross-validated average Area Under the Receiver Operating Characteristic Curve (ROC-AUC) reaches 0.65 while the accuracy reaches 0.69 (Fig. 5A). The model performs as well as those developed previously that are based on both the genotypes of 5 or 8 PE risk loci and clinical factors [31]. We also included the 8 previously identified loci and the age and BMI at pregnancy, and trained different regression or machine learning models. The AUC of ROCs is close to each other, averagely 0.83, 0.82, 0.85 and 0.84 for SVM, Logistic Regression (LR), Random Forest (RF) and Naïve Bayes (NB) models, respectively (Fig. 5B). A voting-based stacking model was built with the four primary models, but its performance is not better than individual ones (Fig. 5B). Other performance items such as Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), Precision (Pre), F1-score and Mathews Correlation Coefficient (MCC) were also evaluated and compared. Generally, the SVM model shows best performance, outperforming the LR, RF and NB ones (Fig. 5C). The model can reach a sensitivity of 0.63 at a high specificity of 0.92 to predict the cases with high PE risk in Chinese pregnancy population (Fig. 5C).