Sample collection and clinical description
After obtaining written informed consent, we collected blood samples from 66 patients (54 (82%) females and 12 (18%) males; all of Danish descent) diagnosed with DDH, belonging to a total of 29 families. Each of the families had at least two affected individuals. One family presented with five affected first-degree relatives, five families had three affected first-degree relatives, while the remaining 23 families had two affected first-degree relatives (the pedigrees are shown in Additional file 1).
Whole exome sequencing
To investigate the possibility of a shared genetic cause of DDH in our cohorts, we performed whole exome sequencing (WES) of the affected individuals. We reached a mean sequencing coverage of 140.5X. On average, 94.2% of the target regions had a sequencing depth of at least 30X (Additional file 2). There was only one fallout sample with a very low coverage, which we decided to keep as it belonged to a three-member family (family 28), and we could still use at least two related individuals for effective variant identification.
We performed genetic analysis of SNVs, indels, and CNVs. The diagram of the general structure of the study is shown in Fig. 1. After alignment and joint variant calling within each family, the resulting files contained on average 50243 variants (Additional file 2). We further filtered to include only coding sequence and splicing variants, with high quality signal as assessed by base call accuracy of at least 99%, a variant allele frequency (VAF) in reads of 25% or more, and by manual check of each variant. DDH is known for incomplete penetrance and we decided not to sequence healthy relatives as controls as it may have confounding effects. As control reference, we used variant frequencies in the general population as listed in the Genome Aggregation Database (gnomAD), based on data from 125,748 exomes and 15,708 whole-genomes of unrelated individuals from all over the world. We kept only the variants with a minor allele frequency (MAF) equal or under 0.01, and with less than five reported homozygotes in the overall global population according to gnomAD.
Variant distribution and annotation
We found a total of 3332 rare and shared variants in 2798 genes, carried by related individuals across the 29 families. The great majority of the variants were missense (3025), followed by a relatively small number of stop-gain variants (75), splicing mutations (72), and frameshift variants (70) (Additional file 3). Inframe indels accounted for 66 cases, and the rest of the variants involved either 5’UTR premature start codons (9), initiator codons (4), stop losts (4), or disruptive inframe deletions (3).
On average, we found 141 rare shared variants in families with only two members. As expected, families with three members shared a smaller number of variants on average (45) than two-member families (Fig. 2), suggesting a certain amount of false positives. Family 34 had five members and a relatively high number of rare variants in common between all the affected individuals: 14.
Surprisingly, there were two families, family 2 (two members), and family 35 (three members), which both had only two rare variants shared between the affected relatives. These were actually the same two missense mutations in the PPP6R2 gene: c.1421G>A and c.2345G>A (Table 1). The PPP6R2 gene encodes a regulatory subunit for the serine/threonine protein phosphatase-6 catalytic subunit (PPP6C). Phosphoprotein phosphatases (PPP) are multimeric holoenzymes that modulate mitotic entry and progression; usually their substrate specificity is defined through association of the catalytic subunit with regulatory subunits [9]. We searched for additional variants affecting genes coding for regulatory and catalytic subunits of PPPs in our data and we found that family 1 also bears the same two variants in the PPP6R2 gene as family 2 and 35. In total, the two PPP6R2 SNVs were detected in seven individuals from three unrelated families. The two SNVs were always inherited together, suggesting a cis positioning on the same allele.
Table 1 Genes encoding for phosphoprotein phosphatases (PPP) subunits and carrying SNVs
Gene
|
Protein
|
Mutation nucleotide
|
Mutation protein
|
Effect
|
Family
|
CADD score
|
Number of tools predicting damaging
|
PPP6R2
|
Regulatory subunit to the PPP6C catalytic domain
|
NM_001242898.1:c.1421G>A
|
NP_001229827.1:p.Arg474His
|
Missense
|
Fam1, Fam2, Fam35
|
23.5
|
4 out of 6
|
PPP6R2
|
Regulatory subunit to the PPP6C catalytic domain
|
NM_001242898.1:c.2345G>A
|
NP_001229827.1:p.Gly782Asp
|
Missense
|
Fam1, Fam2, Fam35
|
14.76
|
2 out of 6
|
ANKRD28
|
Regulatory subunit to the PPP6C catalytic domain
|
NM_001195098.1c.-325C>T
|
|
5’UTR premature start codon
|
Fam32
|
7.624
|
NA
|
PPP2R3C
|
Regulatory subunit to the PPP2C catalytic domain
|
NM_017917.3:c.173G>A,
|
NP_060387.2:p.Arg58Gln,
|
Missense
|
Fam13
|
22.6
|
2 out of 6
|
PPP2R3C
|
Regulatory subunit to the PPP2C catalytic domain
|
NM_017917.3:c.29G>A
|
NP_060387.2:p.Arg10His
|
Missense
|
Fam33
|
22.3
|
2 out of 6
|
PPP1R13B
|
Regulatory subunit to the PPP1C catalytic domain
|
NM_015316.2:c.1283C>T
|
NP_056131.2:p.Pro428Leu
|
Missense
|
Fam27
|
24.3
|
3 out of 6
|
PPP1R18
|
Regulatory subunit to the PPP1C catalytic domain
|
NM_133471.3:c.215C>T
|
NP_597728.1:p.Pro72Leu
|
Missense
|
Fam5
|
22.2
|
1 out of 6
|
PPP1R37
|
Regulatory subunit to the PPP1C catalytic domain
|
NM_019121.1:c.2059G>A
|
NP_061994.1:p.Gly687Arg
|
Missense
|
Fam25
|
24.1
|
3 out of 6
|
PPP1R9A
|
Regulatory subunit to the PPP1C catalytic domain
|
NM_001166160.1:c.125A>T
|
NP_001159632.1:p.Glu42Val
|
Missense
|
Fam4
|
25.6
|
5 out of 6
|
List of genes encoding for PPP subunits carrying rare variants that co-segregated with DDH within single families. All these genes encode for regulatory subunits of PPP enzymes, among which the protein phosphatases PP1, PP2, and PP6. In the PPP6R2 gene, the same two missense variants were found in seven individuals from three unrelated families. In family 2 and family 35 these two mutations were the only rare variants shared between the affected individuals of the family. The CADD score is used to evaluate the deleteriousness of a variant. A CADD score of over 20 indicates that a variant is among the 1% most deleterious substitution in the genome, and thus is considered damaging. Additionally, we used other six tools for evaluating a negative functional impact (SIFT, Polyphen, MutTaster, MutAssessor, FATHMM Pred, and FATHMM MKL). The results of a voting system from these six algorithms is shown. A higher number of tools predicting a variant damaging gives a higher score in the voting system. Abbreviations: PPP: phosphoprotein phosphatases; CADD: combined annotation dependent depletion.
Among other genes encoding for PPP subunits, we found a 5’UTR premature start codon variant in the ANKRD28 gene, which is another regulatory subunit of PPP6C (Table 1); two families bearing two missense PPP2R3C SNVs, and four families with four different missense variants in genes encoding PPP1C regulatory subunits: PPP1R13B, PPP1R18, PPP1R37, and PPP1R9A. Most of these variants were predicted to be somewhat damaging as their CADD score for deleteriousness was over 20.
We searched the literature for known genes associated with DDH (Additional file 4) to see if we found any known variants or known genes in our study group. We found novel variants in previously reported DDH candidate genes in humans CDH7, CX3CR1, DACH1, ESR1, MYH10, NOTCH2, PCNT, POLE, TBX4, and TENM3; while the following genes previously described in dogs had variants in our study group: COL6A3, EVC2, INPP5D, IQCA1, KLH17, LAMA2, NWD1, and OTOG.
CNV analysis
Analysis of CNVs did not detect any common structural variants in regions involving previously reported DDH candidate genes (Additional file 5). Only one CNV variant was detected in multiple families (family 1 and family 22), co-segregating with the disease, and it involved the duplication of exons 19 to 22 of the ANKRD24 gene. ANKRD24, similarly to some PPP6 regulatory subunits, has multiple ankyrin repeats domains; however, its function is currently unknown.
Functional analysis of the variants and candidate genes
To identify the most deleterious variants that may cause DDH in our cohort, we used in silico predictions to filter and select the most damaging, rare, and relevant variants. We included shared variants within families, and used a more stringent filtering for variant frequency in the total population. We selected only variants with a MAF equal or under 0.01, less than 5 homozygotes, and 590 alternate allele counts in the general population, which would be the expected number assuming an autosomal dominant mode of inheritance and an average DDH incidence of 1:3400.
Among the identified 3332 rare and shared variants from our study group, 825 of them were predicted to be highly damaging that either had a loss-of-function (LoF) mutation, or were assigned a very high prediction score (5 or 6 algorithms predicting deleteriousness, or a CADD score of over 25) (Additional file 6). Four of these variants occurred in known DDH candidate genes in humans: DACH1: c.2050C>T, MYH10: c.423G>A, NOTCH2: c.6094C>A, and TBX4: c.805G>A [10-13], while the EVC2: c.3061C>T, the OTOG: c.6943C>T, and the SHC3: c.875A>T gene variants occur in candidate genes previously described in dogs [14, 15].
As most of the families had many very damaging variants, we used the PhoRank algorithm to filter further the variants in an attempt to select for the most relevant ones. In Fig. 3 is shown the distribution of very damaging variants across the families and according to PhoRank. The PhoRank algorithm assigns a score, from 0 to 1, to each gene according to closeness to a user-specified phenotype (hip dysplasia in this case). For most of the variants the median value of PhoRank was low (under 0.325). However, few variants had a very high score: 35 genes carrying 36 very damaging variants had a PhoRank score over 0.75, indicating closeness to the DDH phenotype (Fig. 3, while Additional file 7 shows the whole list).
Among these, we found very damaging mutations in three collagen genes: COL12A1, COL2A1, and COL6A2. Mutations in COL2A1 are known to cause spondyloepiphyseal dysplasia and COL2A1 has previously been related to osteoarthritis secondary to DDH [16]. We also identified very damaging variants in four other genes involved in chondrodysplasias: BMPR1B, KIF22, MMP9, and NANS. Furthermore, we found three known pathogenic mutations in the genes MAN2B1, IDUA, and ABCA3, of which the IDUA: c.1829-1G>A is a LoF mutation causes mucopolysaccharidosis of type 1 [17], an autosomal recessive disorder that commonly involves the hips, with similar clinical and radiological presentation as DDH [18]. The MAN2B1 mutation is involved in mannosidosis alpha, while the ABCA3 mutation in pulmonary disease.
Importantly, most of the families presented with more than one good candidate variant that co-segregated in the affected members, leading to the assumption of a multiple genetic contribution to the phenotype. On the other hand, families with three and five affected members (family 5, family 8, family 20, family 28, and family 34), even though they had very damaging variants, none of these had a PhoRank score of over 0.75, indicating that genes currently not connected to DDH may be involved in the pathology in these families.
Mutational burden analysis
Since the number of identified DDH candidate genes was high in our analysis, we decided to undertake an alternative approach by investigating which genes were most frequently affected across the families. We filtered as previously for shared variants between the affected members within the families and low variant frequency in the general population, and then we looked for genes affected by these variants in more than one family. We found 455 genes to be affected in at least two families, carrying a total of 958 variants (Table 2 and Additional file 8 showing the whole list of genes).
Table 2 Number of genes bearing rare variants co-segregating with disease in more than one family.
Genes common to
|
Number of genes
|
Number of variants
|
9 families
|
1
|
11
|
6 families
|
1
|
6
|
5 families
|
7
|
31
|
4 families
|
16
|
52
|
3 families
|
62
|
163
|
2 families
|
368
|
673
|
Total
|
455
|
958
|
Titin (TTN) was the most affected gene in our study group, as we found nine families and eleven missense variants in the TTN gene, with two of the families carrying two variants. Next in line was cadherin 23 (CDH23), with six families affected and carrying six different variants, among which a LoF initiator codon variant. Seven genes were affected in five families, 16 genes among four families, and the remaining 62 and 368 genes were found carrying variants among three and two families, respectively. Figure 4 shows a list of the 87 most commonly hit genes as they were found carrying variants in three or more families.
Since our most affected gene (TTN) is also the longest in the genome, we ran a correlation test between gene length and the number of variants, and gene length and the number of families with variants for each gene. A correlation coefficient of r = 0.36 and r=0.33 respectively, suggested that gene length by itself was not a relevant factor in determining how many variants we would find, or how many families would carry the variants.
Next, we looked in the literature to see if the list of our most affected genes contains known DDH candidate genes. We found NOTCH2, TBX4, and TENM3 known candidate genes each to have variants in two families; while DACH1, CHD7, CX3CR1, ESR1, MYH10, PCNT, and POLE candidate genes had each variants only in a single family (Additional file 4).
Although our top hit genes have not been previously reported in relation to DDH, many of them belonged to the same protein family or had a common function with many known candidate genes. For example, CDH23, an essential protein for inner-ear mechanotransduction and hearing [19], is the second most affected gene in our study and it belongs to the cadherin superfamily of calcium-dependent cell-cell adhesion proteins, as well as the DDH candidate gene protocadherin9 (PCDH9) [11]. Similarly, we found two families with missense variants in the plasma membrane Ca2+ transporting 2 (ATP2B2) gene, which is closely related to the ATP2B4 gene, whose variants were previously described to co-segregate with heparan sulfate proteoglycan 2 (HSPG2) variants in four related DDH individuals [8]. The semaphorin gene SEMA4D is a candidate gene in dogs and in humans, and SEMA5D was previously reported in dogs [20, 21]; although we did not find variants in SEMA4D or SEMA5D, we did find a SEMA6C variant in three families.
Biological function and candidate pathways
At this point, we reasoned that the DDH phenotype might be a consequence of a dysfunctional pathway, rather than of a dysfunctional single gene. To investigate more systematically which groups of genes and pathways were more frequently affected by variants in our cohort, we performed GO term and pathway analysis taking into consideration only genes affected in two or more families.
The number one hit in the analysis of the enrichment of GO biological process terms was detection of mechanical stimulus, followed by limb morphogenesis (Fig. 5 and Additional file 9). Among other significantly enriched GO terms there were many belonging to the extracellular matrix, microtubules, or the cytoskeleton topology and function. GO cellular component they were: dynein and collagen complexes, the costamere, the basement membrane, collagen-containing extracellular matrix, cell-substrate junctions, myofibrils, and the cilium. While enriched molecular function terms were ATP-dependent microtubule motor activity, dynein binding, microtubule and motor activity, calcium ion transporting and binding activity, and actin binding.
Pathway analysis revealed among significantly enriched KEGG pathways: ABC transporters and the Extracellular matrix (ECM)-receptor interactions pathway (Fig. 6 and Additional file 9). When we used the Reactome database, significant enrichment of the Notch pathway was highlighted as we found variants in the following five genes: NOTCH2, CREBBP, EP300, MAML1, and NOTCH4. Other enriched pathways were the Laminin interactions (FDR=0.0012), and the Non-integrin membrane-ECM interactions pathway (FDR=0.0260) due to many variants found in the laminin (LAMA1, LAMA2, LAMA5, LAMC3, AGRN) and collagen (COL4A5, COL7A1, COL4A1, COL4A3) genes.
Significantly enriched keywords associated with our group of genes were Usher syndrome as we curiously found variants in four genes known to be causative of the disease including USH2A, CDH23, GPR98, and HARS. Other enriched keywords included: laminin EGF-like domain, basement membrane, dynein, motor protein, actin-binding, cilium, and extracellular matrix in confirm to what already found with GO term and pathway analysis.
Protein interaction analysis
To investigate which protein networks the identified genes belonged to, we used STRING (https://string-db.org/) and performed protein interaction network analysis using as input the list of 455 genes affected in at least two families. The resulting protein network was rather large, shown in Fig. 7 (Additional file 10 for high resolution), with several clearly observable smaller clusters, most of which have already been identified by GO term and pathway analysis such as laminins and collagens, which together form a bigger network of extracellular matrix proteins, adhesion and cytoskelal proteins, and the Notch signalling pathway. On the figure, there were also visible two clusters with members of the Wnt signalling pathway and the E3-ubiquitin protein ligase complex.
In the overall, the resulting protein interaction network had significantly more interconnections than expected, as the protein interaction network enrichment p value was 4.6e-14 (string-db.org). The value was calculated for 453 number of nodes (proteins) in the network, and reaching to 613 edges, as compared to 446 expected edges for a network of a random set of proteins of similar size (the average node degree was 2.71 with an average node clustering coefficient of 0.34). This means that the genes from our list form a larger group of biologically connected proteins than what a random set of genes of the same dimensions would make.