Around 7,000 rare diseases have been identified, collectively imposing a significant health and socio-economic burden1. The majority of these diseases have a genetic origin, with causal variants ranging from single nucleotide variants (SNVs) and insertions/deletions of a few nucleotides (INDELs) to large genomic changes such as copy number variants (CNVs), translocations, inversions, transposable element (TE) insertions, and complex rearrangements. Some are also associated with specific epigenomic profiles2. This diverse spectrum of disease-causing changes, often detected by different technologies, has challenged current genetic diagnostic strategies, contributing to long diagnostic odysseys that average 6 years3 and delaying management or treatment plans for patients with rare diseases.
Although short-read sequencing technologies have brought a remarkable leap in the diagnosis of rare genetic diseases4,5, more than half of patients remain undiagnosed. This is partly due to the inherent limitations of this technology in detecting complex variation such as structural variants, methylation profiles, repeat expansions, and variants embedded in poorly accessible regions of the genome, specifically high-homology and GC-rich regions6. Recent advances in third-generation sequencing technologies have demonstrated the application of targeted long-read sequencing (LRS) for identifying pathogenic variants in known or novel disease-causing genes7–9. However, the clinical implementation of LRS for detecting genome-wide variation and methylation changes in rare diseases has been limited by the challenge of annotating and filtering a large number of variants, and has yet to be explored. Here we optimize a whole-genome LRS workflow and a computational strategy in a cohort of undiagnosed patients with suspected rare diseases, leading to additional diagnoses and uncovering a novel methylation signature associated with spinal muscular atrophy (SMA).
We optimized our analysis workflow on a selected cohort of 14 patients with confirmed genetic diagnoses, encompassing a diverse array of genomic and epigenomic pathogenic variants (Fig. 1a and Supplementary Fig. 1a). The study design incorporated a wet-bench protocol optimized for long-read Oxford Nanopore sequencing on a PromethION system, targeting a minimum of 30X coverage with an average N50 of 12 kb (Fig. 1a and Supplementary Fig. 1b). Our computational analysis workflow consists of “genome” and “epigenome” modules (Fig. 1a and Extended Methods). The genome module covers the detection, annotation, and selection of genome-wide copy number variants (CNVs), short variants (SNVs and INDELs) and structural variants (SVs). Raw variants were retained if calls were supported by ≥ 5 reads with an allele fraction ≥ 0.3 and affected the coding region of genes associated with disease as defined in OMIM or GenCC (Extended Methods). This reduced the number of variants by 40% for CNVs and 99% for SVs. Further restricting to variants unique to each patient in the cohort reduced CNVs by 98% (average n = 2) and SVs by 99.9% (average n = 12) (Fig. 1b and Supplementary Fig. 1c); the remaining variants were then manually inspected for clinical correlation. This led to the detection of all associated pathogenic variants in this group (Supplementary Fig. 1d-f). The epigenome module comprises two methods: one scans for episignatures specific to 42 known diseases2, and the other diagnoses SMA based on a novel methylation signature that we characterize in this study (Extended Methods). SMA is a common, life-threatening autosomal recessive neuromuscular disease caused by biallelic loss, mostly through deletions of exon 7, of the survival of motor neuron 1 (SMN1) gene10.
We observed a specific methylation profile across introns 6 to 8 (chr5:70239954–70249165) of the SMN1 gene, in which 0%, 50–70% (moderate) and 98–100% (high) of bases carried a methylation modification in SMA patients, carriers and non-carriers, respectively, revealing a unique episignature for SMA (Fig. 1c and Supplementary Fig. 1e). We also confirmed the expected methylation profile in a control sample (OXN-18) with Angelman syndrome (Supplementary Fig. 1f). Overall, our pipeline correctly identified all pathogenic variants in the optimization cohort, including complex rearrangements and aberrant methylation.
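The three methylation ranges reported above lend themselves to a simple threshold rule. The following Python sketch is illustrative only and is not the method described in the Extended Methods: the function names, the input format (one boolean per base call in the region), and the labelling of values that fall between the reported ranges as "indeterminate" are all assumptions.

```python
from typing import Sequence


def percent_methylated(calls: Sequence[bool]) -> float:
    """Percentage of base calls in the region flagged as methylated.

    `calls` is a hypothetical input: one boolean per assessed base.
    """
    if not calls:
        raise ValueError("no methylation calls in region")
    return 100.0 * sum(calls) / len(calls)


def classify_smn1_methylation(pct: float) -> str:
    """Map a methylation percentage to the groups reported in the text.

    Thresholds follow the reported ranges (0%, 50-70%, 98-100%).
    Anything outside those ranges is labelled "indeterminate",
    which is an assumption of this sketch.
    """
    if not 0.0 <= pct <= 100.0:
        raise ValueError("percentage must be between 0 and 100")
    if pct == 0.0:
        return "SMA patient"
    if 50.0 <= pct <= 70.0:
        return "carrier"
    if pct >= 98.0:
        return "non-carrier"
    return "indeterminate"


if __name__ == "__main__":
    for pct in (0.0, 60.0, 99.0, 85.0):
        print(pct, classify_smn1_methylation(pct))
```

In practice a small tolerance around 0% may be needed to absorb base-calling noise; the strict comparison here simply mirrors the ranges as stated.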
We applied this workflow to a set of undiagnosed patients (N = 39) who previously had inconclusive short-read exome sequencing, 39% of whom had also undergone microarray testing (Fig. 1a and Supplementary Table 1). Patients were mostly of Arab descent (90%), had roughly balanced gender representation (~40% female) and primarily presented with neurological disorders (44%) (Fig. 1d-e and Supplementary Table 1). Whole-genome LRS in this cohort obtained an average of 53X coverage and an N50 of 12.2 kb. Approximately 35,000 SVs and 83 CNVs were detected per sample (Supplementary Table 2); these were reduced by 99.98% and 98.49%, respectively, after applying our filtering and selection criteria (Fig. 2b and Supplementary Fig. 2a). Since all patients previously had inconclusive exome testing, we focused our analysis on SNVs with predicted splicing impact, which could have been filtered out previously. We applied our splicing SNV filtration criteria (see Methods), which retained on average ~54 SNVs in disease-causing genes per sample, substantially reducing the total number of SNVs (~1.6M) (Supplementary Fig. 2b and Supplementary Table 2). We evaluated variants within the genes matching the patient phenotype and identified a single variant in DNMT1 (NM_001130823: c.891+8C>T) in OXN-044, though evaluation of its impact on DNMT1 RNA splicing (Supplementary Fig. 2c) and its relatively high allele frequency in the general population led to its classification as clinically benign. No other putative clinically relevant sequence variants were identified.
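The filtering and selection criteria referred to above (≥ 5 supporting reads, allele fraction ≥ 0.3, coding-region overlap with a disease gene, and uniqueness to a single patient) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workflow's actual implementation: the `Variant` record, its field names, the `key` used to match variants across samples, and the choice to assess uniqueness among already-retained calls are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set

MIN_READS = 5              # read-support threshold stated in the text
MIN_ALLELE_FRACTION = 0.3  # allele-fraction threshold stated in the text


@dataclass(frozen=True)
class Variant:
    sample: str                 # patient identifier
    key: str                    # hypothetical ID used to match calls across samples
    supporting_reads: int
    allele_fraction: float
    coding_gene: Optional[str]  # gene whose coding region is affected, if any


def passes_support_and_gene(v: Variant, disease_genes: Set[str]) -> bool:
    """Apply the read-support, allele-fraction and disease-gene criteria."""
    return (
        v.supporting_reads >= MIN_READS
        and v.allele_fraction >= MIN_ALLELE_FRACTION
        and v.coding_gene in disease_genes
    )


def select_candidates(variants: Iterable[Variant],
                      disease_genes: Set[str]) -> List[Variant]:
    """Keep supported disease-gene variants seen in exactly one patient."""
    kept = [v for v in variants if passes_support_and_gene(v, disease_genes)]
    # Assumption: uniqueness is assessed among the retained calls only.
    samples_per_key: Dict[str, Set[str]] = {}
    for v in kept:
        samples_per_key.setdefault(v.key, set()).add(v.sample)
    return [v for v in kept if len(samples_per_key[v.key]) == 1]


def percent_reduction(before: int, after: int) -> float:
    """Percentage of calls removed by filtering."""
    if before <= 0:
        raise ValueError("'before' must be positive")
    return 100.0 * (before - after) / before
```

The gene list passed as `disease_genes` would come from a curated source such as those named in the text; how that list is built is outside the scope of this sketch.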
We next focused on large CNV events and identified pathogenic variants in two patients. For patient OXN-033, three deletions from a total of 59 CNVs were prioritized, of which a heterozygous deletion (1.4 Mb) at 2q11.1-q11.2 was classified as pathogenic after manual inspection and was validated by chromosomal microarray (CMA) (Fig. 2b and Supplementary Table 3). Individuals with 2q11.2 deletions have developmental delay, intellectual disability, dysmorphic features and variable skeletal anomalies along with obesity11,12, consistent with this patient’s phenotype. In another patient (OXN-048), who had an unconfirmed diagnosis of anterior segment dysgenesis and a heterozygous pathogenic variant in SLC38A8 identified by exome sequencing, LRS detected a single heterozygous deletion at 16q23.3 (Fig. 2c and Supplementary Table 3) partially encompassing SLC38A8 (exon 8 through the 3’ UTR). SLC38A8 is associated with autosomal recessive foveal hypoplasia and/or anterior segment dysgenesis, matching the phenotype of the patient13. Taking advantage of the long reads, we phased the two variants and observed that each lies on a distinct haplotype, confirming the compound heterozygous state in this individual and biallelic impairment of SLC38A8 (Fig. 2c).
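The phasing logic used to confirm compound heterozygosity can be illustrated with a short Python sketch. It assumes that an upstream phasing step has already labelled each read with a haplotype (1 or 2) within a single phase block, and that the reads supporting each variant are known; the function names and input format are hypothetical and do not correspond to any specific tool.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_haplotype(read_haplotypes: Sequence[int]) -> Optional[int]:
    """Return the haplotype label carried by most supporting reads.

    Returns None when there are no reads or when the top labels tie,
    since no confident assignment is possible in either case.
    """
    counts = Counter(read_haplotypes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def in_trans(reads_variant_a: Sequence[int],
             reads_variant_b: Sequence[int]) -> bool:
    """True if the two variants are assigned to different haplotypes.

    Assumes both read sets come from the same phase block, so the
    labels 1 and 2 are comparable between the two variants.
    """
    hap_a = majority_haplotype(reads_variant_a)
    hap_b = majority_haplotype(reads_variant_b)
    return hap_a is not None and hap_b is not None and hap_a != hap_b
```

Two heterozygous pathogenic variants in the same recessive gene that are in trans are consistent with a compound heterozygous state; a False result covers both the cis case and the case where phase could not be determined, so it should not be read as evidence of cis on its own.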
We then examined the landscape of structural variants. We identified a homozygous 3.6 kb deletion partially including the 3’ untranslated region (UTR) of the M-Phase Specific PLK1 Interacting Protein gene (MPLKIP) in patient OXN-027 (Fig. 2d and Supplementary Table 3). This patient showed signs of learning disabilities with distinctive brittle hair, a hallmark of Trichothiodystrophy nonphotosensitive 1, which is associated with non-functional MPLKIP protein. The 3’ UTR is known to regulate mRNA-based processes14; we therefore hypothesized that the homozygous 3’ UTR deletion in MPLKIP could alter its expression levels. Indeed, transcriptomic analysis showed that this gene is significantly overexpressed in this patient (Fig. 2d), suggesting that its dysregulation might underlie the observed phenotype. Further investigation is required to understand the functional role of this 3’ UTR deletion.
We next scanned the methylation patterns of all 39 patients and compared them to the episignature profiles associated with 42 known diseases2. One patient (OXN-062) had a methylation profile consistent with Hunter-McAlpine syndrome (HMA) (Fig. 2e). Independently, we also identified a duplication at 5q35.2-q35.3 containing the NSD1 gene, which was confirmed by chromosomal microarray (Fig. 2e). HMA is characterized by craniosynostosis, intellectual deficit, short stature and facial dysmorphism, matching the clinical indication of the patient. While deletions of NSD1 and hypomethylation at this locus are associated with Sotos syndrome, HMA has been associated with micro-duplications involving NSD1 and a hypermethylation profile2, confirming the diagnosis for this patient. We then examined the SMA-specific methylation pattern, described above, across all patients. Interestingly, we observed one patient (OXN-063) with the characteristic SMA episignature. The biallelic loss of SMN1 in this patient was confirmed by droplet digital PCR (Fig. 2e).
Protocols for analyzing LRS data are still in their nascent stages, and no global standard methods have been established specifically for the clinical annotation, filtration and interpretation of the large genomic and epigenomic landscape in patients with rare diseases. In this study, we propose a simplified workflow that substantially reduces the number of putative disease-causing changes detected by whole-genome LRS while still detecting a wide spectrum of genomic and epigenomic pathogenic variation, leading to 13% (5 out of 39) additional diagnoses in patients with rare diseases who had inconclusive testing using traditional methods. We developed an LRS-based “Epimarker” method that uses the known episignatures of 42 diseases to empirically profile patients in a clinical setting. We also uncover, for the first time, an SMA-specific methylation profile, which was incorporated into our clinical “Epimarker” profiling. Taken together, our results demonstrate the potential of long-read sequencing as a single unified assay for routine clinical genetic testing and the discovery of novel rare disease variation.