Predicting molecular events underlying rare diseases using variant annotation, aberrant gene expression events, and human phenotype ontology

doi:10.21203/rs.3.rs-3405211/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Rare genetic diseases often pose significant challenges for diagnosis. Over the past years, RNA sequencing and other omics modalities have emerged as complementary strategies to DNA sequencing to enhance diagnostic success. In the 6th round of the Critical Assessment of Genome Interpretation (CAGI), the SickKids clinical genomes and transcriptomes challenge aimed to evaluate the diagnostic potential of multi-omics approaches in identifying and resolving undiagnosed genetic disorders. Here, we present our participation in that challenge, where we leveraged genomic, transcriptomic, and clinical data from 79 children with diverse suspected Mendelian disorders to develop a model predicting the causal gene. We employed a machine learning model trained on a cohort of 93 solved mitochondrial disease samples to prioritize candidate genes. In our analysis of the SickKids cohort, we successfully prioritized the causal genes in 2 out of the 3 diagnosed individuals exhibiting abnormalities at the RNA-seq level and 6 cases out of the 12 where no effect on RNA was seen making our solution one of the winning ones. The challenge and our approach highlight the invaluable contributions of an integrative analysis of genetic, transcriptomic, and clinical data to pinpoint the disease-causing gene.

The challenge was evaluated using three previously diagnosed individuals for whom RNA-seq data proved helpful for diagnostics, together with twelve individuals diagnosed solely through DNA analysis. Some of those cases were reported after the challenge by Deshwar et al., 2023. Our model ranked the causal gene within the top 3 for 2 out of the 3 RNA-seq-supported cases (Table 1), while reaching a recall of over 50% within the top 100 genes across all 15 cases (Fig. 4).

rare diseases

genetics

transcriptomics

pathogenicity score

Between 6,000 and 8,000 known rare diseases, each affecting less than 1 in 2,000 individuals, collectively impact approximately 3.5%-6% of the population (EURORDIS, 2005; Wakap, 2020; National Center for Advancing Translational Sciences). For those individuals suffering from a rare disease, finding the correct molecular diagnosis is key, as it can i) improve patient care, ii) lead to obtaining treatment, and iii) inform genetic counselors and care providers (Koch et al., 2017; Repp et al., 2018; Wright et al., 2018). The current standard line of care is the interpretation of variants obtained through whole-exome (WES) or whole-genome sequencing (WGS) (Turro et al., 2020). But that strategy still leaves more than 50% of individuals suspected to be affected with a Mendelian disorder undiagnosed (Clark et al., 2018; Burdick et al., 2020; The 100,000 Genomes Project Pilot Investigators, 2021). Even though it is possible to detect almost all genetic variants through WGS, we still lack the ability to functionally interpret and prove the pathogenicity of the majority of variants inside and especially outside the coding regions (Posey, 2019; Rehm, 2022; Cheng et al., 2023). This is one of the reasons diagnostic guidelines request additional functional evidence to establish a new variant-disease association (Richards et al., 2015; Marwaha et al., 2022).

RNA sequencing (RNA-seq) directly probes the effect of variants on expression and splicing and is therefore a valuable complementary assay to DNA sequencing. Integrating RNA-seq into the diagnostic pipeline has increased the diagnostic rate by 8–35%, depending on the disease and study (Cummings et al., 2017; Kremer et al., 2017; Frésard et al., 2019; Gonorazky et al., 2019; Murdock et al., 2021; Lee et al., 2022; Yépez et al., 2022; Dekker et al., 2023; Lunke et al., 2023). The strategy in those studies is similar: find genes with impaired expression or splicing and analyze them in combination with rare variants and phenotypes to provide a diagnosis on a case-by-case basis. However, these approaches analyze the different data modalities separately and then use the findings to filter down candidate variants; a truly integrative model across genomic, transcriptomic, and phenotypic data is lacking. Currently, the only algorithms that automatically integrate data modalities combine genotype with phenotype, such as the widely used Exomiser (Robinson et al., 2014) and its successor Genomiser (Smedley et al., 2016), eVAI from enGenome (Nicora et al., 2022), or Moon™ by Invitae, which were evaluated as part of the Rare Genomes Project CAGI6 challenge (Stenton et al., 2023). None of those approaches makes use of transcriptomic data. Multiple studies have shown that transcriptomic information is important to narrow down the disease-causing variant in specific cases such as UTR variants that cause aberrant expression (UFM1 in Yépez et al., 2022), synonymous variants that create a new splice site (KCTD7 in Frésard et al., 2019), or intronic variants that create a new cryptic exon (TIMMDC1 in Kremer et al., 2017).

To evaluate the benefits in diagnostics of a fully integrative analysis, the SickKids clinical genomes and transcriptomes challenge was announced within the sixth round of the Critical Assessment of Genome Interpretation (CAGI) challenges organized by the Hospital for Sick Children (SickKids). The goal of the SickKids challenge was the identification of the underlying molecular cause in 79 pediatric rare-disease individuals through the integrative analysis of genomic (WGS), transcriptomic (RNA-seq from whole blood), and clinical data encoded in Human Phenotype Ontology (HPO) terms (Köhler et al., 2019). The challenge requested from the participating teams for each individual and gene an estimated probability that the gene is impacted aggregating all the omics information per gene-individual combination. Some individuals were previously diagnosed, but as this information was going to be used to evaluate the competitors, it was not revealed.

Here we present an end-to-end model that integrates genomic, transcriptomic, and phenotypic data, which we submitted to the CAGI SickKids challenge. The workflow and model include data preprocessing, feature extraction, modeling of the data, and the final prediction of a pathogenicity score for each gene-individual combination. In detail, we preprocessed the raw data by i) annotating rare variants in the WGS data with a wide range of complementary tools, ii) calling aberrant expression and splicing in the RNA-seq data, and iii) converting the clinical reports into standard HPO terms and then into semantic similarity scores. We then developed a machine learning model (Fig. 1), which we trained on a similar dataset of 93 solved individuals predominantly affected by a mitochondrial disease described by Yépez et al., 2022. Finally, we applied the model to the SickKids dataset and automatically ranked the genes by the obtained score. We submitted a second, manually curated solution in which we prioritized known pathogenic or likely pathogenic variants described in ClinVar (Landrum et al., 2018) following the ACMG guidelines (Richards et al., 2015).

Throughout this study, we analyzed two datasets. The first one is the SickKids one, consisting of 79 rare disease pediatric cases, some of which were subsequently described in Deshwar et al., 2023. RNA-seq data extracted from whole blood and clinical reports were available for all cases, while WGS was available for all except for 1 sample. All DNA-RNA pairs were verified to belong to the same individual using DROP’s sampleQC module (Yépez et al., 2021). On the sample without WGS, we called variants from RNA-seq data using DROP’s RNA variant calling module. To have training data for the model, we gathered a second rare disease dataset with solved cases described in Yépez et al., 2022 and Stenton et al., 2021. The dataset has matching WES and RNA-seq data from skin-derived fibroblasts from 303 individuals predominantly affected by mitochondrial disorders. We refer to it as the mitochondrial dataset. We considered only the publicly available disease-causal variants in the diagnosed individuals with available HPO terms and variants revealed in standard WES analysis (i.e., we discarded large deletions and intronic variants found by WGS or RNA-seq), resulting in a total of 93 samples.

On both datasets, we proceeded to extract genetic, transcriptomic, and phenotypic insights from the different raw data. Our model assumed a monogenic recessive mode of inheritance, which is the most common mode of inheritance in our training dataset. Therefore, we considered only rare (minor allele frequency MAF < 0.01) variants. Genes with no rare variants were discarded. On median, 9,800 genes per sample harbored at least one rare variant in the SickKids and 1,544 in the mitochondrial dataset (Fig. 2A, Fig. S1A). This large difference is attributed to the SickKids cohort providing WGS data, while the mitochondrial dataset is based on WES data. Then we extracted the two most severe variants per gene for every sample. Variant severity was defined by ranking according to known pathogenicity annotated in ClinVar, the impact and consequence annotated by the Variant Effect Predictor, VEP (McLaren et al., 2016), and scores from predictive algorithms such as CADD (Rentzsch et al., 2019) and EVE (Frazer et al., 2021) (Methods).

We then annotated for every RNA-seq sample which genes were expression or splicing outliers. In the SickKids dataset, a total of 13,170 genes were considered to be expressed, using a cutoff of more than 1 fragment per kilobase of transcript per million mapped reads (FPKM) in at least 5% of the samples. Of the genes known to cause a Mendelian disorder (OMIM), 2,557 out of 4,356 (59%) were expressed. These numbers align with the expressed genes in the mitochondrial dataset described in the original publication (14,100 genes in total and 66% of OMIM genes). We then ran OUTRIDER (Brechtmann et al., 2018) and FRASER (Mertes et al., 2021) and obtained medians of 1 expression and 25 splicing outlier candidates per sample in the SickKids dataset, comparable to the medians of 4 expression and 22 splicing outliers per sample in the mitochondrial dataset (Fig. 2B, C and Fig. S1B, C).
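The expression cutoff described above can be illustrated with a minimal sketch. It is not the DROP implementation: the function name, the input layout (a dictionary of per-gene FPKM lists), and the toy values are assumptions made for illustration only.

```python
def expressed_genes(fpkm, min_fpkm=1.0, min_fraction=0.05):
    """Return genes with FPKM > min_fpkm in at least min_fraction of samples.

    fpkm: dict mapping gene name -> list of FPKM values, one per sample
    (hypothetical layout, chosen for this sketch).
    """
    kept = []
    for gene, values in fpkm.items():
        if not values:
            continue
        n_above = sum(1 for v in values if v > min_fpkm)
        if n_above / len(values) >= min_fraction:
            kept.append(gene)
    return kept


# Toy matrix with 20 samples per gene (made-up values).
toy_fpkm = {
    "GENE_A": [5.0] * 20,          # above the cutoff in every sample
    "GENE_B": [0.2] * 19 + [3.0],  # above the cutoff in exactly 1 of 20 samples (5%)
    "GENE_C": [0.9] * 20,          # never above the cutoff
}
toy_expressed = expressed_genes(toy_fpkm)
print(toy_expressed)
```

With this rule, a gene such as HDAC8 or CASK whose FPKM stays just below 1 in nearly all samples would be dropped before outlier testing, which is the situation described for two of the solved cases later in the text.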

Next, we manually annotated each individual with HPO terms (Köhler et al., 2020) based on the provided clinical reports. A total of 532 different terms were collected in the SickKids dataset (median 14 per sample), out of which the most frequent were “Global developmental delay” (HP:0001263, N=28) and “Decreased body weight” (HP:0004325, N=26). On the mitochondrial disease dataset, the most frequent terms were “Increased serum lactate” (HP:0002151, N=48), “Decreased activity of mitochondrial complex I” (HP:0011923, N=40), and “Global developmental delay” (N=39). We then computed the semantic similarity score using the Phenotype Consensus Analysis Package (Godard and Page, 2016). For most of the gene-sample combinations, the score is close to 0, indicating no relation between the individual’s phenotypes and the ones associated with the tested gene. The rest of the values follow a seemingly Gaussian distribution (Fig. S2). Assuming that genes with scores greater than 4 are phenotypically related to the patient, we obtained a median of 207 and 103 phenotypically similar genes in the SickKids (Fig. 2D) and the mitochondrial disease (Fig. S1D) dataset, respectively.

Predictive Pathogenicity Model

Next, we developed a model to predict which gene is causing the disease in an individual based on all the collected features from the different omics data. The features per gene were the semantic similarity score, a variety of variant scores for the two most severe variants per gene, and whether the gene is expressed, aberrantly underexpressed, aberrantly overexpressed, aberrantly spliced, and a known disease gene (Methods). For hyperparameter optimization and training, we combined the SickKids dataset with the 93 publicly available disease-causing genes from the mitochondrial dataset. The predictive class was set to TRUE for all the disease-causing genes in each individual of the mitochondrial dataset. The remaining gene-individual combinations (all coming from SickKids) were set to FALSE. To alleviate the problem that only the mitochondrial dataset is providing cases for the positive class, we did not use the HPO terms explicitly but the gene-HPO similarity scores, so that the model has the potential to generalize to other diseases. The machine learning algorithm we used is XGBoost (Chen and Guestrin, 2016), an algorithm that is adequate for strong class imbalanced classification problems such as this one (93 positive against >250,000 negative labels). The model estimates the probability for each gene-individual pair to be of the positive class. We selected the hyperparameters that yielded the highest area under the ROC curve using a 5-fold cross-validation scheme. The final model was then trained on the combined dataset with those hyperparameters. Finally, uncertainties of the predictions were estimated as the standard error across prediction scores generated by training the model on 10 bootstraps with replacement of the full dataset. The standard errors were around 10% of the scores (Fig. S3).

Benchmarking of the model on the mitochondrial disease dataset

To evaluate the model, we used the 93 mitochondrial disease individuals, where the disease-causing genes served as the truth set and all the other genes as the false set. We split the data into two-thirds training and one-third test data and, to cover all individuals in the evaluation, repeated the process three times. We then evaluated the model by comparing the rank of the disease-causal genes for each individual (Fig. 3). In case of a ‘tie’, i.e., genes with the same score, we assigned all the genes to the last rank, reasoning that the geneticist will have to examine all the genes on the same rank manually. We further ran other models with different subsets of the input features, allowing us to evaluate the importance of the different omics layers. As expected, the fully integrative model across all the features and omic layers performed best. In more than 50% of the cases, the disease-causing gene was ranked first, and in 75% it was within the top 5. Removing phenotypic or transcriptomic data resulted in a slight loss of performance. The RNA-seq-only model performed well only for the 40 solved individuals harboring either expression or splicing outliers in the disease-causal gene. This benchmark showcases the complementarity of the different omic layers and the importance of integrating them into one full model.
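The tie-handling rule used in this benchmark (all genes with the same score receive the last rank of their group) can be sketched as follows. The function name and the toy scores are illustrative assumptions, not part of the published code.

```python
from collections import Counter


def worst_case_ranks(scores):
    """Rank genes by decreasing score; tied genes all get the last rank of their tie group.

    scores: dict mapping gene name -> predicted score for one individual.
    """
    counts = Counter(scores.values())
    rank_of_score = {}
    seen = 0
    for score in sorted(counts, reverse=True):
        seen += counts[score]        # genes scoring at least this high
        rank_of_score[score] = seen  # last position occupied by this tie group
    return {gene: rank_of_score[score] for gene, score in scores.items()}


# Toy example: G2 and G3 are tied and both receive rank 3.
toy_scores = {"G1": 0.9, "G2": 0.4, "G3": 0.4, "G4": 0.1}
toy_ranks = worst_case_ranks(toy_scores)
print(toy_ranks)
```

This is a conservative choice: a causal gene sharing its score with many other genes is reported at the bottom of that group, which matches the number of genes a geneticist would have to inspect in the worst case.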

Application to the SickKids dataset

The next step was to apply the trained model to the SickKids cohort. As expected from the high class imbalance due to only 1 out of more than 10,000 genes being disease-causal, the predicted score was very low for most combinations (<10^-3 for more than 99.7% of combinations, Fig. S3). For the challenge, we submitted two solutions. The first one contained the automatically generated scores of the XGBoost model for the top 100 genes for each individual. The second one contained the same information as the first but with 22 manually curated scores overriding the model predictions where we followed the ACMG guidelines. For both submissions, we reported i) at most two potentially disease-causing variants per gene-individual combination, ii) the status and type (i.e. over or underexpression) of the aberrant expression and its fold-change and false discovery rate (FDR), iii) the status and nature of aberrant splicing (e.g. exon skipping) and the genomic coordinates of all the aberrantly spliced junctions, iv) the status of monoallelic expression (MAE) and the alternative allele ratio, and v) the model’s score and its standard error.

Table 1: Summary of three previously solved cases via RNA. For each of the three solved cases, we present the disease-causal variants, RNA summary, semantic similarity score, and the rank of our model’s score. XLR: X-linked recessive, XLD: X-linked dominant.

ID	Gene	Model rank	Curated rank	Causal Variant (GRCh37)	Genotype and inheritance mode	RNA Summary	Semantic Similarity
P1	SMS	2	1	X:21990620:T>A NM_004595.5: c.265-5T>A	Hemizygous splice region variant (XLR)	Splicing outlier	4.08
P2	HDAC8	3	1	X:711933:TTCAA>T NM_018486.3: c.134_137del	Hemizygous frameshift (XLD), 4 bp deletion	Gene did not pass FPKM cutoff (0.93)	4.43
P3	CASK	1233	1233	X:41419085:T>TC NM_001367721.1: c.1683dup	Heterozygous frameshift (XLD), 1 bp duplication	Gene did not pass FPKM cutoff (0.91)	4.63

Sample 1 suffered from Snyder-Robinson syndrome (MIM: 309583), which is caused by variants in SMS (MIM: 300105; Cason et al., 2003). Our model ranked SMS second (score: 0.077). Following our manual curation scheme, we assigned SMS a score of 1, as it is known to cause the disease and harbored a rare variant near the splice site of an exon that was skipped due to the variant.

The other two samples harbored rare frameshift variants that should cause aberrant expression in the respective disease-causing genes HDAC8 (MIM: 300269) and CASK (MIM: 300172). However, the FPKM values of the respective genes were lower than the stringent default cutoff we used to define whether a gene is expressed and therefore were not tested for aberrant expression. Nevertheless, based on only the genomic and phenotypic features, our model ranked HDAC8 third. We manually assigned a score of 1 to the HDAC8 case as the detected hemizygous deletion had been reported as pathogenic by two independent groups in ClinVar (accession RCV000194427.9).

For the CASK case, however, the heterozygous frameshift variant was not prioritized, likely due to two key factors. First, our VEP annotation only considered the precomputed CADD scores, so this duplication, which is not present in gnomAD, was left unscored even though it would otherwise receive a deleterious CADD score of 34. Second, this variant is heterozygous in an X-linked dominant gene, potentially diminishing its priority due to the high imbalance in the mode of inheritance of the training dataset. The mode of inheritance of the vast majority of genes (61 out of 77 unique genes, 79%) of the mitochondrial disease cohort is uniquely recessive, so genes with only one impactful allele are penalized in the model's prioritization. We did not manually curate the CASK case, as its score was beyond the top 100 genes prioritized by the model.

For the 12 cases diagnosed purely from genomic data, without any impact at the transcriptomic level, our model ranked three causal genes within the top 10 and 50% within the top 100 (Table S1). Two reasons explain the performance for five of the cases ranked beyond the top 100. First, three contained frameshifts that were not annotated with CADD because they were not part of the precomputed CADD scores and that lay in genes not expressed in blood (HMGA2, FOXG1, NKX6-2). Second, two were in genes that were only recently shown to cause a Mendelian disorder (FBXW7, first described in Stephenson et al., 2022, and H3F3B in Bryant et al., 2020); therefore, their semantic similarity score with respect to any sample was zero and the OMIM feature was false at the time we ran the model.

Data acquisition and preprocessing

In this study, we used the VCF files containing SNVs and short indels (DNA), the CRAM files (RNA), and the clinical reports containing the patients’ phenotypes as provided by the CAGI 6 challenge through the SickKids Hospital. The genomic data (VCF files) were obtained by sequencing DNA purified from blood by Complete Genomics (Stavropoulos et al., 2016) or by SickKids (Lionel et al., 2018). The data were aligned against the genome build GRCh37 and variant calling was performed with GATK (Van der Auwera and O’Connor, 2020). Only single nucleotide variants and small insertions and deletions were provided for the challenge.

The RNA sequencing was stranded, polyA enriched, and extracted from whole blood. Data was aligned with STAR v2.6.1c (Dobin et al., 2013). We downloaded the CRAM files and converted them to BAM format using Samtools v1.12 (Danecek et al., 2021) with the provided FASTA file.

The summary files of the patients’ phenotypes were manually converted from clinical reports into the numeric identifiers of the HPO terms (Köhler et al., 2020).

Variant annotation

The provided VCF files were normalized using BCFtools v1.12 (Danecek et al., 2021) and the provided FASTA file. Variants that created a faulty entry through the normalization step were discarded. The normalized variants were then annotated using VEP (McLaren et al., 2016) with the --everything flag, which includes gnomAD allele frequencies (Karczewski et al., 2020), protein domains, and HGVS annotation, based on the ENSEMBL release 99 (Zerbino et al., 2018). In addition, we used the following VEP plugins: SpliceAI (Jaganathan et al., 2019), CADD (Rentzsch et al., 2019), and UTRannotator (Whiffin et al., 2020), with their default configurations and required input files. Additionally, scores from EVE, a recent evolutionary model to predict the pathogenicity of variants, were added to the variants (Frazer et al., 2021).

Variant calling in RNA-seq data

No genetic data was provided for one individual. Therefore, variants were called on RNA-seq data for this individual using GATK best practices for RNA-seq short variant discovery as described in Zhao et al., (2019) and Yépez et al., (2022). In short, variants with a ratio of quality to coverage < 2, that were strand biased (Phred-scaled fisher exact score >30), or belonging to an SNP cluster (3 or more SNPs within a 35 bp window) were filtered out, as suggested by GATK.
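The three hard filters listed above can be sketched in a few lines. This is not the GATK implementation: the field names (`QD` for quality by depth, `FS` for the Phred-scaled Fisher strand score), the exact boundary of the 35 bp window, and the restriction to SNPs on a single chromosome are assumptions made for this illustration.

```python
def clustered_snps(positions, cluster_size=3, window=35):
    """Return positions that are part of >= cluster_size SNPs fitting in a `window` bp span.

    Assumes all positions lie on the same chromosome. The window boundary
    (span strictly smaller than `window`) is an assumption of this sketch.
    """
    pos = sorted(positions)
    flagged = set()
    for i in range(len(pos) - cluster_size + 1):
        j = i + cluster_size - 1
        if pos[j] - pos[i] < window:
            flagged.update(pos[i:j + 1])
    return flagged


def passes_rna_filters(variant, clustered):
    """Keep variants with QD >= 2, FS <= 30, and outside any SNP cluster."""
    return (
        variant["QD"] >= 2.0
        and variant["FS"] <= 30.0
        and variant["pos"] not in clustered
    )


# Toy variants (made-up values).
toy_variants = [
    {"id": "v1", "pos": 100, "QD": 12.0, "FS": 1.0},    # in a cluster
    {"id": "v2", "pos": 110, "QD": 15.0, "FS": 0.5},    # in a cluster
    {"id": "v3", "pos": 120, "QD": 9.0, "FS": 2.0},     # in a cluster
    {"id": "v4", "pos": 500, "QD": 1.5, "FS": 0.0},     # low quality relative to coverage
    {"id": "v5", "pos": 900, "QD": 20.0, "FS": 45.0},   # strand biased
    {"id": "v6", "pos": 1500, "QD": 18.0, "FS": 3.0},   # passes all filters
]
toy_clustered = clustered_snps([v["pos"] for v in toy_variants])
toy_kept = [v["id"] for v in toy_variants if passes_rna_filters(v, toy_clustered)]
print(toy_kept)
```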

Filtering and sorting variants

Based on the VEP annotation, we applied multiple filters to extract the two most impactful variants per gene and individual. We first discarded variants lying outside protein-coding genes or with a minor allele frequency (MAF) > 0.01 in any population within gnomAD (Karczewski et al., 2020). We then categorized the variants into three pathogenicity categories similar to the ACMG Guidelines (Richards et al., 2015): high impact, medium impact, and rare. The high impact is similar to the ACMG categories strong, very strong, and partially moderate. The medium category resembles the ACMG categories supporting and moderate. The detailed filtering criteria and cutoffs are given below with logical OR except for the MAF cutoff:

Criterion	High impact	Medium impact	Rare
MAF	<0.01	<0.01	<0.01
ClinVar annotation	Likely pathogenic or pathogenic	Likely pathogenic or pathogenic	---
VEP Impact	HIGH	HIGH or MODERATE	---
VEP Consequence	---	Splice region	---
CADD Phred score	>20	>10	---
SpliceAI score	>0.5	>0.2	---
EVE score	>0.64	>0.5	---
UTRannotator	Any annotation	Any annotation	---

Finally, we extracted the two most impactful variants per gene-individual pair after ranking the variants by i) is canonical, ii) our impact categories, iii) the VEP impact, iv) the EVE score, and v) the CADD score.
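As a rough illustration of the categorization and ranking just described, the sketch below assigns each variant to one of the three categories using the cutoffs from the table and then keeps the two top-ranked variants of a gene. The dictionary keys, the handling of missing scores (treated as lowest), and the treatment of the MAF boundary are assumptions of this sketch, not the authors' code.

```python
PATHOGENIC = {"pathogenic", "likely_pathogenic"}
CATEGORY_ORDER = {"high": 0, "medium": 1, "rare": 2}
VEP_IMPACT_ORDER = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}


def above(value, cutoff):
    """True if a score is present and exceeds the cutoff."""
    return value is not None and value > cutoff


def categorize(v):
    """Return 'high', 'medium', 'rare', or None if the variant is not rare."""
    if v["maf"] >= 0.01:
        return None
    shared = v.get("clinvar") in PATHOGENIC or bool(v.get("utr_annotation"))
    if (shared or v.get("vep_impact") == "HIGH" or above(v.get("cadd"), 20)
            or above(v.get("spliceai"), 0.5) or above(v.get("eve"), 0.64)):
        return "high"
    if (v.get("vep_impact") == "MODERATE"
            or "splice_region_variant" in v.get("consequence", "")
            or above(v.get("cadd"), 10) or above(v.get("spliceai"), 0.2)
            or above(v.get("eve"), 0.5)):
        return "medium"
    return "rare"


def top_two(variants):
    """Rank by canonical transcript, category, VEP impact, EVE, CADD; keep the best two."""
    scored = []
    for v in variants:
        category = categorize(v)
        if category is None:
            continue
        key = (
            0 if v.get("canonical") else 1,
            CATEGORY_ORDER[category],
            VEP_IMPACT_ORDER.get(v.get("vep_impact"), len(VEP_IMPACT_ORDER)),
            -(v.get("eve") or 0.0),   # missing scores rank last (assumption)
            -(v.get("cadd") or 0.0),
        )
        scored.append((key, v))
    scored.sort(key=lambda item: item[0])
    return [v for _, v in scored[:2]]


# Toy variants of a single gene in one individual (made-up values).
toy_gene_variants = [
    {"id": "a", "canonical": True, "maf": 0.001, "vep_impact": "MODERATE", "cadd": 25.0},
    {"id": "b", "canonical": True, "maf": 0.0005, "vep_impact": "HIGH"},
    {"id": "c", "canonical": True, "maf": 0.005, "vep_impact": "LOW", "cadd": 5.0},
    {"id": "d", "canonical": True, "maf": 0.05, "vep_impact": "HIGH"},  # too common
]
toy_top = [v["id"] for v in top_two(toy_gene_variants)]
print(toy_top)
```

Note that under these rules a frameshift without a precomputed CADD score can still reach the high-impact category through its VEP impact, but it contributes no CADD value to the ranking or to downstream features.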

Aberrant expression

Aberrant expression was called with the aberrant expression module of DROP (Yépez et al., 2021), using the default parameters. First, read counts were computed based on the provided gene annotation file. Genes with an FPKM < 1 in more than 95% of the samples were considered not expressed and discarded. The read counts were then modeled using OUTRIDER (Brechtmann et al., 2018). The obtained P-values were corrected for multiple testing per sample across all genes using the false discovery rate (FDR) approach by Benjamini-Yekutieli (Benjamini and Yekutieli, 2001). Gene-individual combinations with an FDR < 0.1 were considered significant.
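For readers unfamiliar with the Benjamini-Yekutieli correction, a minimal pure-Python sketch of the adjustment for the P-values of one sample is given below. It is an illustration of the general procedure, not the code run inside OUTRIDER or DROP, and the function name and toy P-values are assumptions.

```python
def benjamini_yekutieli(pvalues):
    """Return Benjamini-Yekutieli adjusted P-values in the input order."""
    m = len(pvalues)
    if m == 0:
        return []
    # Harmonic correction factor c(m) = 1 + 1/2 + ... + 1/m
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy P-values for four genes in one sample (made-up values).
toy_pvalues = [0.001, 0.5, 0.04, 0.9]
toy_fdr = benjamini_yekutieli(toy_pvalues)
toy_significant = [q < 0.1 for q in toy_fdr]
print(toy_fdr, toy_significant)
```

In this toy example only the first gene stays below the FDR cutoff of 0.1; the gene with a nominal P-value of 0.04 does not, which shows how conservative the correction is when thousands of genes are tested per sample.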

Aberrant splicing

Aberrant splicing was also obtained using the aberrant splicing module of DROP with the default parameters. Split reads and non-split reads spanning exon-intron boundaries were counted and converted into the intron-centric splicing ratio-based metrics percent-spliced in 𝜓₃ and 𝜓₅ and splicing efficiency 𝜃₃ and 𝜃₅ (Pervouchine et al., 2013). These metrics were then modeled independently using FRASER (Mertes et al., 2021). The obtained P-values were corrected for multiple testing in two steps: first across all events within every gene and sample pair and then per sample across all genes. Junctions with an FDR < 0.1 and an absolute differential splicing effect greater than 0.3 were considered significant. We further subsetted the results to junctions with strong effects on both the donor and acceptor sites using a metric similar to the Jaccard index and discarded junctions lying on genomic blacklist regions obtained from Amemiya et al., 2019.

Mono-allelic expression

Mono-allelic expression (MAE) was also computed following the MAE module of DROP. For each heterozygous single-nucleotide variant based on the WGS data, RNA-seq reads aligning to each allele were counted. These reads were modeled using a negative binomial distribution. The obtained P-values were corrected for multiple testing per sample across all variants. Variants with an alternative allele ratio > 0.8 or < 0.2 and with an FDR < 0.05 were considered to be mono-allelically expressed.

Semantic similarity

First, the semantic similarity score was computed between all available HPO terms (Köhler et al., 2020) and the HPO terms from each sample using the compareHPSets function from the R package Phenotype Consensus Analysis (PCAN) (Godard and Page, 2016). Then, these scores were grouped by gene. Finally, a single aggregated semantic similarity score was computed per gene-individual combination using the hpSetCompSummary function from the same package.

Learning a disease-impacting score for each gene per individual

To score the impact on disease for every gene per individual, we considered a classification problem as follows. Non-coding genes and genes without a rare variant per individual were discarded upfront. We defined a training set by combining all remaining gene-individual combinations from the SickKids dataset, which defaults to the negative class. To obtain events for the positive class, we added all 93 disease-causing genes from the mitochondrial dataset (Yépez et al., 2022). We reasoned that the mitochondrial dataset can function as a good proxy to model the probability of a gene being disease-causing in the given individual because i) the features are based on semantic similarity rather than on HPO terms directly, and ii) the number of positive pairs is approximately equal to the number of individuals in the SickKids dataset. The combined training dataset was prepared as follows:

an element is a pair (gene, individual)
all genes with at least one rare variant (MAF<0.01) were considered, whether they are expressed or not in the RNA-seq sample
We use the following gene-level annotations:
- is the gene expressed in the tissue
- is the gene aberrantly down-regulated
- is the gene aberrantly up-regulated
- is the gene aberrantly spliced
- semantic similarity score
- is the gene reported to cause a Mendelian disorder in OMIM (Amberger et al., 2019)
The following variant-level annotations for the 2 most impactful variants:
- Alternative allele ratio from RNA-seq
- MAF
- CADD score
- SpliceAI score
- EVE score
In case the top variant was homozygous, these scores were repeated. If there was no second variant, we added placeholder values to avoid missing values (alternative allele ratio: 0.5, MAF: 0.2, CADD: 9, SpliceAI: 0.1, EVE: 0.4).
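The feature layout listed above, including the repetition of a homozygous top variant and the placeholder values for a missing second variant, can be sketched as follows. The feature names, the dictionary-based input format, and the `homozygous` flag are hypothetical choices made for this illustration; only the placeholder values are taken from the text.

```python
# Placeholder values for a missing second variant, as stated in the text.
PLACEHOLDER = {"alt_ratio": 0.5, "maf": 0.2, "cadd": 9.0, "spliceai": 0.1, "eve": 0.4}
VARIANT_KEYS = ("alt_ratio", "maf", "cadd", "spliceai", "eve")
GENE_KEYS = ("expressed", "down_regulated", "up_regulated",
             "aberrantly_spliced", "semantic_similarity", "omim")


def build_features(gene_info, variants):
    """Build one feature row for a (gene, individual) pair.

    gene_info: dict with the GENE_KEYS (hypothetical names).
    variants: list of one or two dicts, most impactful first, each holding
    the VARIANT_KEYS and an optional 'homozygous' flag.
    """
    first = variants[0]
    if first.get("homozygous"):
        second = first              # homozygous top variant: scores are repeated
    elif len(variants) > 1:
        second = variants[1]
    else:
        second = PLACEHOLDER        # no second variant: fill with placeholders
    features = {key: gene_info[key] for key in GENE_KEYS}
    for prefix, variant in (("v1", first), ("v2", second)):
        for key in VARIANT_KEYS:
            features[f"{prefix}_{key}"] = variant[key]
    return features


# Toy gene with a single heterozygous variant (made-up values).
toy_gene = {"expressed": 1, "down_regulated": 0, "up_regulated": 0,
            "aberrantly_spliced": 1, "semantic_similarity": 4.1, "omim": 1}
toy_variant = {"alt_ratio": 0.45, "maf": 0.0001, "cadd": 27.0,
               "spliceai": 0.8, "eve": 0.7, "homozygous": False}
toy_row = build_features(toy_gene, [toy_variant])
print(toy_row)
```

The sketch does not address how scores missing for an existing variant (for example, an indel without a precomputed CADD score) are encoded, since the text does not specify this.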

Next, we trained a gradient-boosted tree model to discriminate between the 93 positive gene-individual pairs and the rest of the dataset. To this end, we used XGBoost (Chen and Guestrin, 2016), an adequate algorithm for strong class-imbalanced classification problems. We selected the hyperparameters that yielded the highest area under the ROC curve using a 5-fold cross-validation scheme. The obtained hyperparameters are: eta = 0.1, max_depth = 2, gamma = 4, subsample = 1, colsample_bytree = 1, eval_metric = 'auc'. This model was then trained on the full dataset with the obtained hyperparameters. The trained model estimates the probability for each gene-individual pair to be of the positive class, meaning that a gene is disease-causing. That is the score we report in our first submission. Uncertainties of the predictions were estimated as the standard error across prediction scores generated by training the model on 10 bootstraps with replacement of the full dataset.
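The bootstrap estimate of the prediction uncertainty can be sketched independently of the classifier. In the sketch below, `fit` is a pluggable training function; the toy `fit_rate_model` is a hypothetical stand-in and not XGBoost, and taking the standard deviation across the bootstrap predictions as the standard error is our reading of the description above.

```python
import random
import statistics


def bootstrap_standard_errors(train_rows, query_rows, fit, n_boot=10, seed=0):
    """Estimate per-query standard errors from models trained on bootstrap resamples.

    fit: function taking a list of training rows and returning a predictor
    (a callable mapping one query row to a score).
    """
    rng = random.Random(seed)
    n = len(train_rows)
    per_query = [[] for _ in query_rows]
    for _ in range(n_boot):
        # Resample the full training set with replacement.
        sample = [train_rows[rng.randrange(n)] for _ in range(n)]
        predict = fit(sample)
        for i, row in enumerate(query_rows):
            per_query[i].append(predict(row))
    return [statistics.stdev(scores) for scores in per_query]


def fit_rate_model(rows):
    """Hypothetical stand-in classifier: positive rate per feature value."""
    totals, positives = {}, {}
    for feature, label in rows:
        totals[feature] = totals.get(feature, 0) + 1
        positives[feature] = positives.get(feature, 0) + label

    def predict(feature):
        if feature not in totals:
            return 0.0
        return positives[feature] / totals[feature]

    return predict


# Toy, strongly imbalanced training set of (feature, label) rows (made-up data).
toy_train = [("outlier", 1)] * 3 + [("outlier", 0)] * 7 + [("normal", 0)] * 90
toy_se = bootstrap_standard_errors(toy_train, ["outlier", "normal"], fit_rate_model)
print(toy_se)
```

To mirror the setup described in the text, `fit` would train a gradient-boosted tree model with the reported hyperparameters on each resample and return its probability predictions for the positive class.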

Manual Curation

We also prioritized genes based on manual inspection of the variants and associated features. For this, we followed the ACMG guidelines and focused on literature support in population, disease-specific, and sequence databases. We call compound heterozygous variants ‘potentially biallelic’ when we could not determine whether they lie on the same allele. For the manual curation, we settled on a discrete set of scores that overrode the scores predicted by the model:

1: biallelic variants, all classified as very strong or strong evidence of pathogenicity.
0.8: biallelic variants, all classified as moderate evidence of pathogenicity or higher.
0.5: biallelic or potentially biallelic variants, all classified as supporting evidence of pathogenicity or higher.
0.2: monoallelic variant classified as very strong or strong evidence of pathogenicity, or inconclusive variants in a gene with aberrant expression or splicing, or that matches the phenotype.

These values were assigned after a case-by-case inspection. Using this overriding scoring, we modified 22 scores across 19 samples. As the challenge requested a standard error for the score, we provided a rather arbitrary one for all the manually changed values. We set it to 0.1, the order of magnitude of the standard errors estimated by the bootstrap for the XGBoost model.

The focal point of this CAGI SickKids challenge was the full integration of genomic, transcriptomic, and clinical data, to identify the molecular cause of the underlying rare disease in 79 pediatric individuals. Here we developed a predictive pathogenicity model that can integrate those three layers after extracting meaningful data from each using state-of-the-art algorithms and databases. As no training data was provided in this challenge, we used an existing dataset composed of diagnosed individuals to design, construct, and train a predictive pathogenicity model. Our integrative XGBoost model prioritized the causal gene in 2 out of the 3 diagnosed cases exhibiting abnormalities at the RNA-seq level and 6 out of the 12 purely genomic-based diagnosed cases of the SickKids dataset and was one of the best-performing models of the challenge. In addition, we showed that each omic layer added useful information and increased the model performance in the mitochondrial dataset.

Our training dataset had several limitations. First, it mostly comprised individuals affected by mitochondrial disorders. Although we converted the HPO terms into similarity scores, these scores, as well as other features that we extracted from the data, could still be biased towards this disease group. Second, the vast majority (79%) of the training cases exhibited a recessive mode of inheritance, whereas two thirds of the diagnosed cases in the SickKids dataset were dominant. A training set imbalanced towards recessive cases makes the model prioritize genes in which both alleles are affected by a deleterious variant, thus devaluing dominant cases. Whether two independently optimized models for recessive and dominant cases are needed remains to be investigated. This also concerns the way we encoded variants in the feature set, as we collected the two most severe variants per gene and individual. Third, the dataset was biased towards solved cases in which RNA-seq was found useful: 40 of the 93 solved mitochondrial individuals (43%) had defects visible at the RNA level. In the challenge, however, RNA-seq contributed to the diagnosis of only 3 of the 15 diagnosed individuals. One could argue for broadening the training dataset to also capture cases diagnosable purely through DNA sequencing and phenotype, but we believe it is more useful to have an optimized model such as ours as a second-tier diagnostic tool after WES or WGS has proven inconclusive. This also better reflects the way RNA-seq is currently used in diagnostics. Fourth, the variants were extracted solely from WES, meaning that intronic, upstream, and downstream variants were not captured. None of those variant types turned out to be causal in the SickKids dataset; however, for future applications, it is important to compile a dataset that includes them. Many of these limitations can be tackled by gathering a broader training dataset covering a variety of rare disorders, modes of inheritance, and types of disease-causal variants. Extending the training dataset would also help the model generalize to the vast spectrum of rare diseases. Likewise, similar challenges spanning a broad spectrum of disease entities and variant types should be conducted to assess pathogenicity prediction models.
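To make the variant encoding discussed above concrete, the following is a minimal sketch of keeping the two most severe variants per gene and individual. It is not the challenge code: the field names (`individual`, `gene`, `severity`), the single numeric severity score, and the zero-padding of a missing second variant are assumptions made for illustration.

```python
from collections import defaultdict


def top_two_variants(variants):
    """Keep the two most severe variants per (individual, gene) pair.

    `variants` is an iterable of dicts with the hypothetical keys
    'individual', 'gene', and 'severity' (higher means more severe).
    """
    grouped = defaultdict(list)
    for v in variants:
        grouped[(v["individual"], v["gene"])].append(v["severity"])

    features = {}
    for key, scores in grouped.items():
        ranked = sorted(scores, reverse=True)[:2]
        # Assumption: pad with 0.0 when a gene carries only one variant,
        # so one-hit and two-hit genes differ in the second feature.
        ranked += [0.0] * (2 - len(ranked))
        features[key] = {"variant_1": ranked[0], "variant_2": ranked[1]}
    return features


# Toy input with made-up identifiers and scores.
example = [
    {"individual": "P1", "gene": "GENE_A", "severity": 0.9},
    {"individual": "P1", "gene": "GENE_A", "severity": 0.4},
    {"individual": "P1", "gene": "GENE_A", "severity": 0.7},
    {"individual": "P1", "gene": "GENE_B", "severity": 0.8},
]
features = top_two_variants(example)
```

Under such an encoding, a dominant case appears as one high-scoring slot with an empty second slot, which a model trained mostly on recessive cases may learn to down-weight.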

Throughout this study, only SNVs and short indels were considered, as only those were available for both datasets. However, structural variation is a major cause of genetic disorders (Wagner et al., 2019; Riquin et al., 2023). For such variants, our model might have to be fine-tuned, as pathogenicity scores are often unavailable or inaccurate (Kumar et al., 2020). In this respect, RNA-seq can be particularly useful, as large deletions are prone to cause expression outliers (Chiang et al., 2017; Li et al., 2017).

Further, we want to highlight several improvements to the feature extraction and the model that we were unable to implement because our team joined the SickKids challenge late. The most obvious one is to verify that all variants are correctly annotated with all scores, including those we missed, such as the CADD scores. Another is to add further scores that predict variant impact, such as REVEL (Ioannidis et al., 2016) and the more recent AlphaMissense (Cheng et al., 2023) for missense variants, as well as AbSplice (Wagner et al., 2023) for splice-disrupting ones. Using continuous scores (e.g. p-value, z score, or effect size) instead of a binary outlier status could further improve the model’s performance. We have not added these to the current feature extraction step and model because we report exactly what we developed for the challenge. Finally, adding results from other omics, such as proteomics, could substantially improve the performance.
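The difference between a binary outlier status and continuous scores can be sketched as follows. This is an illustrative sketch only: the function names, the 0.05 cutoff on the adjusted p-value, and the choice of -log10(p) and the absolute z score as continuous features are assumptions, not the feature definitions used in the challenge.

```python
import math


def binary_outlier(padj, cutoff=0.05):
    """Binary status: 1 if the adjusted p-value passes the assumed cutoff."""
    return int(padj < cutoff)


def continuous_outlier(p_value, z_score, floor=1e-300):
    """Continuous alternatives: -log10(p) and the absolute z score.

    `floor` guards against log10(0) for extremely small p-values.
    """
    return {
        "neg_log10_p": -math.log10(max(p_value, floor)),
        "abs_z": abs(z_score),
    }


# Two hypothetical genes on either side of the cutoff.
just_significant = binary_outlier(0.04)
just_missed = binary_outlier(0.06)
strong = continuous_outlier(p_value=1e-6, z_score=-5.2)
```

With the binary encoding, the two hypothetical genes at adjusted p-values of 0.04 and 0.06 receive opposite labels despite similar evidence, whereas the continuous features retain the strength of the signal for the model to weigh.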

As transcriptomics is increasingly integrated into the routine diagnostic workflow, predictive pathogenicity models are needed. Impartial and objective assessments, such as the CAGI SickKids challenge, are important for evaluating the performance of such models and for developing them further. We also anticipate that more omics modalities, such as proteomics or metabolomics, will be used in diagnostics. We therefore believe that our publicly available code can serve as a basis for other researchers to add novel features according to their available datasets and objectives.

Funding

This study was funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) via the project NFDI 1/1 "GHGA - German Human Genome-Phenome Archive" (#441914366 to CM, NHS, and JG) and by the German Bundesministerium für Bildung und Forschung (BMBF) through the ERA PerMed project PerMiM (01KU2016B to VY and JG). Funding for the RNA-seq profiling and support for the CAGI6 evaluation team (Huayun Hou and Kyoko Yuki) was provided by a Genome Canada Grant (OGI-158) to Adam Shlien, Jim Dowling, Michael Wilson, and Michael Brudno.

Competing Interests

The authors have no financial or competing interests to disclose.

Author Contribution

VAY, JG and CM conceived and designed the study. VAY, NHS, IS and CM implemented the methods and analyzed the data. VAY, NHS, JG and CM wrote the manuscript. All authors read and approved the manuscript.

Data and code availability

The code for this study is publicly available on GitHub: www.github.com/gagneurlab/cagi6_sickkids. It includes the code to (i) preprocess the raw data, (ii) find the expression outliers, and (iii) train and execute the XGBoost model. Data from the SickKids dataset can be requested from the corresponding author of Deshwar et al., 2023. Gene count matrices for the mitochondrial dataset can be found in Zenodo: 4646823 and 4646827. The disease-causal variants are available in Table S3 of Stenton et al., 2021.

Ethics approval and consent

Written consent was provided by each proband’s parents and/or guardians as well as the proband where appropriate. This study was performed in line with the principles of the Declaration of Helsinki.

Acknowledgments

We thank the organisers of CAGI6 and the data providers from the Hospital for Sick Children, who provided the data on behalf of the SickKids Genome Clinic (Stephen Meyn, Christian Marshall, Gregory Costain, Michael D. Wilson, Lianna G. Kyriakopoulou, Kyoko Yuki and Huayun Hou). RNA-seq data were generated by Kyoko Yuki, Huayun Hou, Adam Shlien, Jim Dowling, Lianna G. Kyriakopoulou and Michael D. Wilson. We would also like to thank the SickKids challenge assessors, Lianna G. Kyriakopoulou, Kyoko Yuki and Huayun Hou. Finally, we thank all the patients and their families whose participation in research made this challenge possible, as well as the many healthcare providers involved in the diagnosis and care of these children. Some of the icons in Figure 1 were taken from BioRender.

References

Amberger, J. S., Bocchini, C. A., Scott, A. F., and Hamosh, A. (2019). OMIM.org: leveraging knowledge across phenotype–gene relationships. Nucleic Acids Res. 47, D1038–D1043. doi: 10.1093/nar/gky1151.
Amemiya, H. M., Kundaje, A., and Boyle, A. P. (2019). The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci. Rep. 9, 9354. doi: 10.1038/s41598-019-45839-z.
Benjamini, Y., and Yekutieli, D. (2001). The Control of the False Discovery Rate in Multiple Testing Under Dependency. Ann. Stat. 29, 24. doi: 10.1214/aos/1013699998.
Brechtmann, F., Mertes, C., Matusevičiūtė, A., Yépez, V. A., Avsec, Ž., Herzog, M., et al. (2018). OUTRIDER: A Statistical Method for Detecting Aberrantly Expressed Genes in RNA Sequencing Data. Am. J. Hum. Genet. 103, 907–917. doi: 10.1016/j.ajhg.2018.10.025.
Bryant, L., Li, D., Cox, S. G., Marchione, D., Joiner, E. F., Wilson, K., et al. (2020). Histone H3.3 beyond cancer: Germline mutations in Histone 3 Family 3A and 3B cause a previously unidentified neurodegenerative disorder in 46 patients. Sci. Adv. 6, eabc9207. doi: 10.1126/sciadv.abc9207.
Burdick, K. J., Cogan, J. D., Rives, L. C., Robertson, A. K., Koziura, M. E., Brokamp, E., et al. (2020). Limitations of exome sequencing in detecting rare and undiagnosed diseases. Am. J. Med. Genet. A. 182, 1400–1406. doi: 10.1002/ajmg.a.61558.
Cason, A. L., Ikeguchi, Y., Skinner, C., Wood, T. C., Holden, K. R., Lubs, H. A., et al. (2003). X-linked spermine synthase gene (SMS) defect: the first polyamine deficiency syndrome. Eur. J. Hum. Genet. EJHG 11, 937–944. doi: 10.1038/sj.ejhg.5201072.
Chen, T., and Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 785–794. doi: 10.1145/2939672.2939785.
Cheng, J., Novati, G., Pan, J., Bycroft, C., Žemgulytė, A., Applebaum, T., et al. (2023). Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492. doi: 10.1126/science.adg7492.
Chiang, C., Scott, A. J., Davis, J. R., Tsang, E. K., Li, X., Kim, Y., et al. (2017). The impact of structural variation on human gene expression. Nat. Genet. 49, 692–699. doi: 10.1038/ng.3834.
Clark, M. M., Stark, Z., Farnaes, L., Tan, T. Y., White, S. M., Dimmock, D., et al. (2018). Meta-analysis of the diagnostic and clinical utility of genome and exome sequencing and chromosomal microarray in children with suspected genetic diseases. Genomic Med. Res. 3. doi: 10.1038/s41525-018-0053-8.
Cummings, B. B., Marshall, J. L., Tukiainen, T., Lek, M., Donkervoort, S., Foley, A. R., et al. (2017). Improving genetic diagnosis in Mendelian disease with transcriptome sequencing. Sci. Transl. Med. 9, 12. doi: 10.1126/scitranslmed.aal5209.
Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., et al. (2021). Twelve years of SAMtools and BCFtools. GigaScience 10, giab008. doi: 10.1093/gigascience/giab008.
Dekker, J., Schot, R., Bongaerts, M., de Valk, W. G., van Veghel-Plandsoen, M. M., Monfils, K., et al. (2023). Web-accessible application for identifying pathogenic transcripts with RNA-seq: Increased sensitivity in diagnosis of neurodevelopmental disorders. Am. J. Hum. Genet. doi: 10.1016/j.ajhg.2022.12.015.
Deshwar, A. R., Yuki, K. E., Hou, H., Liang, Y., Khan, T., Celik, A., et al. (2023). Trio RNA sequencing in a cohort of medically complex children. Am. J. Hum. Genet. doi: 10.1016/j.ajhg.2023.03.006.
Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., et al. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21. doi: 10.1093/bioinformatics/bts635.
EURORDIS (2005). Rare Diseases: Understanding this Public Health Priority. Rare Dis., 1–14.
Frazer, J., Notin, P., Dias, M., Gomez, A., Min, J. K., Brock, K., et al. (2021). Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95. doi: 10.1038/s41586-021-04043-8.
Frésard, L., Smail, C., Ferraro, N. M., Teran, N. A., Li, X., Smith, K. S., et al. (2019). Identification of rare-disease genes using blood transcriptome sequencing and large control cohorts. Nat. Med. 25, 911–919. doi: 10.1038/s41591-019-0457-8.
Godard, P., and Page, M. (2016). PCAN: phenotype consensus analysis to support disease-gene association. BMC Bioinformatics 17, 518. doi: 10.1186/s12859-016-1401-2.
Gonorazky, H. D., Naumenko, S., Ramani, A. K., Nelakuditi, V., Mashouri, P., Wang, P., et al. (2019). Expanding the Boundaries of RNA Sequencing as a Diagnostic Tool for Rare Mendelian Disease. Am. J. Hum. Genet. 104, 466–483. doi: 10.1016/j.ajhg.2019.01.012.
Ioannidis, N. M., Rothstein, J. H., Pejaver, V., Middha, S., McDonnell, S. K., Baheti, S., et al. (2016). REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am. J. Hum. Genet. 99, 877–885. doi: 10.1016/j.ajhg.2016.08.016.
Jaganathan, K., Kyriazopoulou Panagiotopoulou, S., McRae, J. F., Darbandi, S. F., Knowles, D., Li, Y. I., et al. (2019). Predicting Splicing from Primary Sequence with Deep Learning. Cell 176, 535-548.e24. doi: 10.1016/j.cell.2018.12.015.
Karczewski, K. J., Francioli, L. C., Tiao, G., Cummings, B. B., Alföldi, J., Wang, Q., et al. (2020). The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443. doi: 10.1038/s41586-020-2308-7.
Koch, J., Mayr, J. A., Alhaddad, B., Rauscher, C., Bierau, J., Kovacs-Nagy, R., et al. (2017). CAD mutations and uridine-responsive epileptic encephalopathy. Brain 140, 279–286. doi: 10.1093/brain/aww300.
Köhler, S., Carmody, L., Vasilevsky, N., Jacobsen, J. O. B., Danis, D., Gourdine, J.-P., et al. (2019). Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Res. 47, D1018–D1027. doi: 10.1093/nar/gky1105.
Köhler, S., Gargano, M., Matentzoglu, N., Carmody, L. C., Lewis-Smith, D., Vasilevsky, N. A., et al. (2020). The Human Phenotype Ontology in 2021. Nucleic Acids Res. 49, D1207–D1217. doi: 10.1093/nar/gkaa1043.
Kremer, L. S., Bader, D. M., Mertes, C., Kopajtich, R., Pichler, G., Iuso, A., et al. (2017). Genetic diagnosis of Mendelian disorders via RNA sequencing. Nat. Commun. 8, 15824. doi: 10.1038/ncomms15824.
Kumar, S., Harmanci, A., Vytheeswaran, J., and Gerstein, M. B. (2020). SVFX: a machine learning framework to quantify the pathogenicity of structural variants. Genome Biol. 21, 274. doi: 10.1186/s13059-020-02178-x.
Landrum, M. J., Lee, J. M., Benson, M., Brown, G. R., Chao, C., Chitipiralla, S., et al. (2018). ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067. doi: 10.1093/nar/gkx1153.
Lee, M., Kwong, A. K. Y., Chui, M. M. C., Chau, J. F. T., Mak, C. C. Y., Au, S. L. K., et al. (2022). Diagnostic potential of the amniotic fluid cells transcriptome in deciphering mendelian disease: a proof-of-concept. Npj Genomic Med. 7, 1–10. doi: 10.1038/s41525-022-00347-4.
Li, X., Kim, Y., Tsang, E. K., Davis, J. R., Damani, F. N., Chiang, C., et al. (2017). The impact of rare variation on gene expression across tissues. Nature 550, 239–243. doi: 10.1038/nature24267.
Lionel, A. C., Costain, G., Monfared, N., Walker, S., Reuter, M. S., Hosseini, S. M., et al. (2018). Improved diagnostic yield compared with targeted gene sequencing panels suggests a role for whole-genome sequencing as a first-tier genetic test. Genet. Med. 20, 435–443. doi: 10.1038/gim.2017.119.
Lunke, S., Bouffler, S. E., Patel, C. V., Sandaradura, S. A., Wilson, M., Pinner, J., et al. (2023). Integrated multi-omics for rapid rare disease diagnosis on a national scale. Nat. Med., 1–11. doi: 10.1038/s41591-023-02401-9.
Marwaha, S., Knowles, J. W., and Ashley, E. A. (2022). A guide for the diagnosis of rare and undiagnosed disease: beyond the exome. Genome Med. 14, 23. doi: 10.1186/s13073-022-01026-w.
McLaren, W., Gil, L., Hunt, S. E., Riat, H. S., Ritchie, G. R. S., Thormann, A., et al. (2016). The Ensembl Variant Effect Predictor. Genome Biol. 17, 122. doi: 10.1186/s13059-016-0974-4.
Mertes, C., Scheller, I. F., Yépez, V. A., Çelik, M. H., Liang, Y., Kremer, L. S., et al. (2021). Detection of aberrant splicing events in RNA-seq data using FRASER. Nat. Commun. 12, 529. doi: 10.1038/s41467-020-20573-7.
Murdock, D. R., Dai, H., Burrage, L. C., Rosenfeld, J. A., Ketkar, S., Müller, M. F., et al. (2021). Transcriptome-directed analysis for Mendelian disease diagnosis overcomes limitations of conventional genomic testing. J. Clin. Invest. 131, e141500. doi: 10.1172/JCI141500.
National Center for Advancing Translational Sciences (n.d.). About - Genetic and Rare Diseases Information Center. Available at: https://rarediseases.info.nih.gov/about [Accessed July 27, 2023].
Nicora, G., Zucca, S., Limongelli, I., Bellazzi, R., and Magni, P. (2022). A machine learning approach based on ACMG/AMP guidelines for genomic variant classification and prioritization. Sci. Rep. 12, 2517. doi: 10.1038/s41598-022-06547-3.
Pervouchine, D. D., Knowles, D. G., and Guigo, R. (2013). Intron-centric estimation of alternative splicing from RNA-seq data. Bioinformatics 29, 273–274. doi: 10.1093/bioinformatics/bts678.
Posey, J. E. (2019). Genome sequencing and implications for rare disorders. Orphanet J. Rare Dis. 14, 153. doi: 10.1186/s13023-019-1127-0.
Rehm, H. L. (2022). Time to make rare disease diagnosis accessible to all. Nat. Med. 28, 241–242. doi: 10.1038/s41591-021-01657-3.
Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J., and Kircher, M. (2019). CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 47, D886–D894. doi: 10.1093/nar/gky1016.
Repp, B. M., Mastantuono, E., Alston, C. L., Schiff, M., Haack, T. B., Rötig, A., et al. (2018). Clinical, biochemical and genetic spectrum of 70 patients with ACAD9 deficiency: is riboflavin supplementation effective? Orphanet J. Rare Dis. 13, 120. doi: 10.1186/s13023-018-0784-8.
Richards, S., Aziz, N., Bale, S., Bick, D., Das, S., Gastier-Foster, J., et al. (2015). Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–423. doi: 10.1038/gim.2015.30.
Riquin, K., Isidor, B., Mercier, S., Nizon, M., Colin, E., Bonneau, D., et al. (2023). Integrating RNA-Seq into genome sequencing workflow enhances the analysis of structural variants causing neurodevelopmental disorders. J. Med. Genet. doi: 10.1136/jmg-2023-109263.
Robinson, P. N., Köhler, S., Oellrich, A., Project, S. M. G., Wang, K., Mungall, C. J., et al. (2014). Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res. 24, 340–348. doi: 10.1101/gr.160325.113.
Smedley, D., Schubach, M., Jacobsen, J. O. B., Köhler, S., Zemojtel, T., Spielmann, M., et al. (2016). A Whole-Genome Analysis Framework for Effective Identification of Pathogenic Regulatory Variants in Mendelian Disease. Am. J. Hum. Genet. 99, 595–606. doi: 10.1016/j.ajhg.2016.07.005.
Stavropoulos, D. J., Merico, D., Jobling, R., Bowdin, S., Monfared, N., Thiruvahindrapuram, B., et al. (2016). Whole-genome sequencing expands diagnostic utility and improves clinical management in paediatric medicine. Npj Genomic Med. 1, 1–9. doi: 10.1038/npjgenmed.2015.12.
Stenton, S. L., O’Leary, M., Lemire, G., VanNoy, G. E., DiTroia, S., Ganesh, V. S., et al. (2023). Critical assessment of variant prioritization methods for rare disease diagnosis within the Rare Genomes Project. 2023.08.02.23293212. doi: 10.1101/2023.08.02.23293212.
Stenton, S. L., Shimura, M., Piekutowska-Abramczuk, D., Freisinger, P., Distelmaier, F., Mayr, J. A., et al. (2021). Diagnosing pediatric mitochondrial disease: lessons from 2,000 exomes. doi: 10.1101/2021.06.21.21259171.
Stephenson, S. E. M., Costain, G., Blok, L. E. R., Silk, M. A., Nguyen, T. B., Dong, X., et al. (2022). Germline variants in tumor suppressor FBXW7 lead to impaired ubiquitination and a neurodevelopmental syndrome. Am. J. Hum. Genet. 109, 601–617. doi: 10.1016/j.ajhg.2022.03.002.
The 100,000 Genomes Project Pilot Investigators (2021). 100,000 Genomes Pilot on Rare-Disease Diagnosis in Health Care — Preliminary Report. N. Engl. J. Med. 385, 1868–1880. doi: 10.1056/NEJMoa2035790.
Turro, E., Astle, W. J., Megy, K., Gräf, S., Greene, D., Shamardina, O., et al. (2020). Whole-genome sequencing of patients with rare diseases in a national health system. Nature 583, 96–102. doi: 10.1038/s41586-020-2434-2.
Van der Auwera, G. A., and O’Connor, B. D. (2020). Genomics in the Cloud: Using Docker, GATK, and WDL in Terra. O’Reilly Media, Inc. Available at: https://www.oreilly.com/library/view/genomics-in-the/9781491975183/ [Accessed December 29, 2021].
Wagner, M., Osborn, D. P. S., Gehweiler, I., Nagel, M., Ulmer, U., Bakhtiari, S., et al. (2019). Bi-allelic variants in RNF170 are associated with hereditary spastic paraplegia. Nat. Commun. 10, 4790. doi: 10.1038/s41467-019-12620-9.
Wagner, N., Çelik, M. H., Hölzlwimmer, F. R., Mertes, C., Prokisch, H., Yépez, V. A., et al. (2023). Aberrant splicing prediction across human tissues. Nat. Genet., 1–10. doi: 10.1038/s41588-023-01373-3.
Wakap, S. N. (2020). Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database. Eur. J. Hum. Genet. 28, 165–173. doi: 10.1038/s41431-019-0508-0.
Whiffin, N., Karczewski, K. J., Zhang, X., Chothani, S., Smith, M. J., Evans, D. G., et al. (2020). Characterising the loss-of-function impact of 5’ untranslated region variants in 15,708 individuals. Nat. Commun. 11, 2523. doi: 10.1038/s41467-019-10717-9.
Wright, C. F., FitzPatrick, D. R., and Firth, H. V. (2018). Paediatric genomics: diagnosing rare disease in children. Nat. Rev. Genet. 19, 253–268. doi: 10.1038/nrg.2017.116.
Yépez, V. A., Gusic, M., Kopajtich, R., Mertes, C., Smith, N. H., Alston, C. L., et al. (2022). Clinical implementation of RNA sequencing for Mendelian disease diagnostics. Genome Med. 14, 38. doi: 10.1186/s13073-022-01019-9.
Yépez, V. A., Mertes, C., Müller, M. F., Klaproth-Andrade, D., Wachutka, L., Frésard, L., et al. (2021). Detection of aberrant gene expression events in RNA sequencing data. Nat. Protoc. 16, 1276–1296. doi: 10.1038/s41596-020-00462-5.
Zerbino, D. R., Achuthan, P., Akanni, W., Amode, M. R., Barrell, D., Bhai, J., et al. (2018). Ensembl 2018. Nucleic Acids Res. 46, D754–D761. doi: 10.1093/nar/gkx1098.
Zhao, Y., Wang, K., Wang, W., Yin, T., Dong, W., and Xu, C. (2019). A high-throughput SNP discovery strategy for RNA-seq data. BMC Genomics 20, 160. doi: 10.1186/s12864-019-5533-4.
