Throughout this study, we analyzed two datasets. The first one is the SickKids one, consisting of 79 rare disease pediatric cases, some of which were subsequently described in Deshwar et al., 2023. RNA-seq data extracted from whole blood and clinical reports were available for all cases, while WGS was available for all except for 1 sample. All DNA-RNA pairs were verified to belong to the same individual using DROP’s sampleQC module (Yépez et al., 2021). On the sample without WGS, we called variants from RNA-seq data using DROP’s RNA variant calling module. To have training data for the model, we gathered a second rare disease dataset with solved cases described in Yépez et al., 2022 and Stenton et al., 2021. The dataset has matching WES and RNA-seq data from skin-derived fibroblasts from 303 individuals predominantly affected by mitochondrial disorders. We refer to it as the mitochondrial dataset. We considered only the publicly available disease-causal variants in the diagnosed individuals with available HPO terms and variants revealed in standard WES analysis (i.e., we discarded large deletions and intronic variants found by WGS or RNA-seq), resulting in a total of 93 samples.
On both datasets, we proceeded to extract genetic, transcriptomic, and phenotypic insights from the different raw data. Our model assumed a monogenic recessive mode of inheritance, which is the most common mode of inheritance in our training dataset. Therefore, we considered only rare (minor allele frequency MAF < 0.01) variants. Genes with no rare variants were discarded. On median, 9,800 genes per sample harbored at least one rare variant in the SickKids and 1,544 in the mitochondrial dataset (Fig. 2A, Fig. S1A). This large difference is attributed to the SickKids cohort providing WGS data, while the mitochondrial dataset is based on WES data. Then we extracted the two most severe variants per gene for every sample. Variant severity was defined by ranking according to known pathogenicity annotated in ClinVar, the impact and consequence annotated by the Variant Effect Predictor, VEP (McLaren et al., 2016), and scores from predictive algorithms such as CADD (Rentzsch et al., 2019) and EVE (Frazer et al., 2021) (Methods).
We then annotated for every RNA-seq sample which genes were expression or splicing outliers. In the SickKids dataset, a total of 13,170 genes were considered to be expressed using a cutoff of a Fragments Per Kilobase Million (FPKM) value greater than 1 in at least 5% of the samples. Regarding genes that cause a Mendelian disorder (OMIM), 2557 out of 4356 (59%) are expressed. These numbers align with the expressed genes in the mitochondrial dataset described in the original publication (14,100 genes in total and 66% OMIM). We then ran OUTRIDER (Brechtmann et al., 2018) and FRASER (Mertes et al., 2021) and obtained, on median, 1 and 25 expression and splicing candidates per sample on the SickKids dataset, also comparable to the median of 4 and 22 expression and splicing outliers per sample in the mitochondrial dataset (Fig. 2B, C and Fig. S1B, C).
Next, we manually annotated each individual with HPO terms (Köhler et al., 2020) based on the provided clinical reports. A total of 532 different terms were collected in the SickKids dataset (median 14 per sample), out of which the most frequent were “Global developmental delay” (HP:0001263, N=28) and “Decreased body weight” (HP:0004325, N=26). On the mitochondrial disease dataset, the most frequent terms were “Increased serum lactate” (HP:0002151, N=48), “Decreased activity of mitochondrial complex I” (HP:0011923, N=40), and “Global developmental delay” (N=39). We then computed the semantic similarity score using the Phenotype Consensus Analysis Package (Godard and Page, 2016). For most of the gene-sample combinations, the score is close to 0, indicating no relation between the individual’s phenotypes and the ones associated with the tested gene. The rest of the values follow a seemingly Gaussian distribution (Fig. S2). Assuming that genes with scores greater than 4 are phenotypically related to the patient, we obtained a median of 207 and 103 phenotypically similar genes in the SickKids (Fig. 2D) and the mitochondrial disease (Fig. S1D) dataset, respectively.
Predictive Pathogenicity Model
Next, we developed a model to predict which gene is causing the disease in an individual based on all the collected features from the different omics data. The features per gene were the semantic similarity score, a variety of variant scores for the two most severe variants per gene, and whether the gene is expressed, aberrantly underexpressed, aberrantly overexpressed, aberrantly spliced, and a known disease gene (Methods). For hyperparameter optimization and training, we combined the SickKids dataset with the 93 publicly available disease-causing genes from the mitochondrial dataset. The predictive class was set to TRUE for all the disease-causing genes in each individual of the mitochondrial dataset. The remaining gene-individual combinations (all coming from SickKids) were set to FALSE. To alleviate the problem that only the mitochondrial dataset is providing cases for the positive class, we did not use the HPO terms explicitly but the gene-HPO similarity scores, so that the model has the potential to generalize to other diseases. The machine learning algorithm we used is XGBoost (Chen and Guestrin, 2016), an algorithm that is adequate for strong class imbalanced classification problems such as this one (93 positive against >250,000 negative labels). The model estimates the probability for each gene-individual pair to be of the positive class. We selected the hyperparameters that yielded the highest area under the ROC curve using a 5-fold cross-validation scheme. The final model was then trained on the combined dataset with those hyperparameters. Finally, uncertainties of the predictions were estimated as the standard error across prediction scores generated by training the model on 10 bootstraps with replacement of the full dataset. The standard errors were around 10% of the scores (Fig. S3).
Benchmarking of the model on the mitochondrial disease dataset
To evaluate the model, we used the 93 mitochondrial disease individuals, where the disease-causing genes served as the truth set and all the other genes, the false set. We split the data into two-thirds training and one-third test data and to cover all individuals in the evaluation, we repeated the process three times. We then evaluated the model by comparing the rank of the disease-causal genes for each individual (Fig. 3). In case of a ‘tie’, i.e., genes with the same score, we assigned all the genes to the last rank, reasoning that the geneticist will have to examine all the genes on a same rank manually. We further ran other models with different subsets of the input features allowing us to evaluate the importance of the different omics layers. As expected, the fully integrative model across all the features and omic layers performed best. In more than 50% of the cases, the disease-causing gene was ranked first and was in the top 5 in 75%. Removing phenotypic or transcriptomic data results in a slight loss of performance. The RNA-seq-only model performed well only for the 40 solved individuals harboring either expression or splicing outliers in the disease causal gene. This benchmark showcases the complementarity of the different omic layers and the importance of integrating them into one full model.
Application to the SickKids dataset
The next step was to apply the trained model to the SickKids cohort. As expected from the high class imbalance due to only 1 out of more than 10,000 genes being disease-causal, the predicted score was very low for most combinations (<10-3 for more than 99.7% of combinations, Fig. S3). For the challenge, we submitted two solutions. The first one contained the automatically generated scores of the XGBoost model for the top 100 genes for each individual. The second one contained the same information as the first but with 22 manually curated scores overriding the model predictions where we followed the ACMG guidelines. For both submissions, we reported i) at most two potentially disease-causing variants per gene-individual combination, ii) the status and type (i.e. over or underexpression) of the aberrant expression and its fold-change and false discovery rate (FDR), iii) the status and nature of aberrant splicing (e.g. exon skipping) and the genomic coordinates of all the aberrantly spliced junctions, iv) the status of monoallelic expression (MAE) and the alternative allele ratio, and v) the model’s score and its standard error.
The challenge was evaluated using three previously diagnosed individuals in which RNA-seq data proved helpful for diagnostics together with twelve individuals diagnosed solely through DNA analysis. Some of those cases were reported after the challenge by Deshwar et al. Our model was able to prioritize 2 out of the 3 RNA-seq supported cases on the top 3 ranks (Table 1), while reaching a recall of over 50% under the top 100 genes across all 15 cases (Fig. 4).
Table 1: Summary of three previously solved cases via RNA. For each of the three solved cases, we present the disease-causal variants, RNA summary, semantic similarity score, and the rank of our model’s score. XLR: X-linked recessive, XLD: X-linked dominant.
ID
|
Gene
|
Model rank
|
Curated rank
|
Causal Variant
(GRCh37)
|
Genotype and inheritance mode
|
RNA Summary
|
Semantic Similarity
|
P1
|
SMS
|
2
|
1
|
X:21990620:T>A
NM_004595.5: c.265-5T>A
|
Hemizygous splice region variant (XLR)
|
Splicing outlier
|
4.08
|
P2
|
HDAC8
|
3
|
1
|
X:711933:TTCAA>T
NM_018486.3: c.134_137del
|
Hemizygous
frameshift (XLD),
4 bp deletion
|
Gene did not pass FPKM cutoff (0.93)
|
4.43
|
P3
|
CASK
|
1233
|
1233
|
X:41419085:T>TC
NM_001367721.1: c.1683dup
|
Heterozygous
frameshift (XLD),
1 bp duplication
|
Gene did not pass FPKM cutoff (0.91)
|
4.63
|
Sample 1 suffered from Snyder Robinson Syndrome (MIM: 309583), a syndrome caused by SMS (MIM: 300105, (Cason et al., 2003)) and which was ranked second by our model (score: 0.077). Following our manual curation algorithm, we assigned SMS a score of 1 as it harbored a rare variant near the splice site of an exon skipped due to the variant, on the gene known to cause the disease.
The other two samples harbored rare frameshift variants that should cause aberrant expression in the respective disease-causing genes HDAC8 (MIM: 300269) and CASK (MIM: 300172). However, the FPKM values of the respective genes were lower than the stringent default cutoff we used to define whether a gene is expressed and therefore were not tested for aberrant expression. Nevertheless, based on only the genomic and phenotypic features, our model ranked HDAC8 third. We manually assigned a score of 1 to the HDAC8 case as the detected hemizygous deletion had been reported as pathogenic by two independent groups in ClinVar (accession RCV000194427.9).
For the CASK case, however, the heterozygous frameshift variant was not prioritized, likely due to two key factors. First, the annotation with VEP only considers the precomputed scores for CADD, resulting in not scoring this deletion not present in gnomAD, which otherwise has a deleterious score of 34. Second, this variant is heterozygous in an X-linked dominant gene, potentially diminishing its priority due to the high imbalance in the mode of inheritance of the training dataset. The mode of inheritance of the vast majority of genes (61 out of 77 unique genes, 79%) of the mitochondrial disease cohort is uniquely recessive, causing genes with only one impactful allele to face penalties in the prioritization within the model. We did not manually curate the CASK case, as its score was beyond the top 100 genes prioritized by the model.
For the 12 purely genomic-based diagnosed cases without any impact on the transcriptomic level, our model ranked three genes under the top 10 and 50% under the top 100 (Table S1). Two reasons affected the performance of five cases beyond the top 100. First, three contained frameshifts that were not annotated with CADD because they were not part of the precomputed CADD scores and in genes not expressed in blood (HMGA2, FOXG1, NKX6-2). Second, two were in genes that were proven to cause a Mendelian disorder only in 2022 (FBXW7 first described in Stephenson et al., 2022, and H3F3B in Bryant et al., 2020), therefore their semantic similarity score with respect to any sample was zero and the OMIM feature was false at the time we ran the model.