Despite efforts to standardize infrastructure and integrate large-scale data in clinical research, challenges persist in personalized medicine (52, 53), especially in the documentation of genotype, phenotype, and clinical data for rare diseases (14, 52). Comprehensive documentation can aid patient recruitment, standard care monitoring, natural history assessment, genotype-phenotype correlation analyses, and disease burden evaluation, ultimately enhancing our understanding of rare diseases (30). However, the planning, design, maintenance, and sustainability of such large-scale medical studies need to be improved and streamlined (14).
In this study, we illuminated the impactful role of customized CDMs, particularly in the context of forming a collaborative foundation between medical experts and data scientists within domain-specific projects. By modeling and utilizing these tailored CDMs, we achieved two critical objectives: (a) establishing a shared knowledge base that enhances communication and understanding between medical experts and data scientists, and (b) streamlining the harmonization of source data. This harmonization, in turn, facilitates a seamless transfer to the internationally recognized research database, OMOP CDM, underscoring the versatility and effectiveness of our approach in advancing multi-center data-driven studies.
In particular, we developed a customized RD-CDM based on the OMOP information model and utilizing the FHIR communication standard. The combination of both enabled us to efficiently utilize the already existing ETL processes (40) for the mappings to OMOP. The final CDM consists of several modules, including “Person”, “Diagnosis”, “Procedure”, “Laboratory findings”, and “Medications”. These modules are also part of the FHIR MII CDS. Moreover, two additional modules, “Genotype” and “Phenotypes”, were added to the customizable CDM structure to better capture the unique characteristics of rare diseases.
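To make the modular structure concrete, the following is a minimal sketch of how a single patient record spanning the seven modules could be represented. The module names come from the text above; the class names, field names, and example values are illustrative assumptions and do not reproduce the published RD-CDM schema.

```python
from dataclasses import dataclass, field
from typing import List

# Module names as listed in the text; everything else here is hypothetical.
CORE_MODULES = ("Person", "Diagnosis", "Procedure", "Laboratory findings", "Medications")
RD_MODULES = ("Genotype", "Phenotypes")


@dataclass
class Genotype:
    gene: str   # gene symbol
    hgvs: str   # variant description in HGVS nomenclature


@dataclass
class Phenotype:
    hpo_id: str  # HPO term identifier
    label: str   # human-readable term name


@dataclass
class PatientRecord:
    person_id: str
    diagnoses: List[str] = field(default_factory=list)
    procedures: List[str] = field(default_factory=list)
    lab_findings: List[str] = field(default_factory=list)
    medications: List[str] = field(default_factory=list)
    genotypes: List[Genotype] = field(default_factory=list)
    phenotypes: List[Phenotype] = field(default_factory=list)

    def populated_modules(self) -> List[str]:
        """Return the names of the modules that hold at least one entry."""
        contents = {
            "Person": [self.person_id],
            "Diagnosis": self.diagnoses,
            "Procedure": self.procedures,
            "Laboratory findings": self.lab_findings,
            "Medications": self.medications,
            "Genotype": self.genotypes,
            "Phenotypes": self.phenotypes,
        }
        return [name for name in CORE_MODULES + RD_MODULES if contents[name]]


# Illustrative record with only the two rare-disease modules filled in.
record = PatientRecord(
    person_id="patient-001",
    genotypes=[Genotype(gene="HFE", hgvs="c.845G>A")],
    phenotypes=[Phenotype(hpo_id="HP:0002240", label="Hepatomegaly")],
)
print(record.populated_modules())  # ['Person', 'Genotype', 'Phenotypes']
```

Keeping the rare-disease modules as separate, optional lists mirrors the idea that they extend the core modules rather than replace them.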
Use case-specific application examples
Some diseases within a disease group, such as hyperthyroidism and its distinct etiologies, can present very similarly and are therefore difficult to differentiate based on clinical data alone. Our customizable RD-CDM can support this differentiation; this will be further evaluated during the SATURN project, which seeks to assist the diagnostic process. Additionally, the RD-CDM-based data will be used in analytical studies to answer clinical questions. Here, the inclusion of the Genotype and Phenotypes modules into the CDM structure allows for the capture of more detailed and specific information related to rare diseases that may not be available in other standard CDMs.
The CDM has additional advantages applicable to all rare disease groups. These encompass the formulation of specific clinical questions for international and multi-center studies, including the prospective integration of genotype data, particularly concerning the potential incorporation of Human Genome Variation Society (HGVS) nomenclature (54) with a focus on exploring genotype-phenotype correlations.
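As an illustration of what a prospective HGVS-based genotype field would have to support, the sketch below splits a simple coding-DNA substitution into its components. It is deliberately narrow: the regular expression covers only single-nucleotide substitutions on a versioned reference sequence, whereas the full HGVS nomenclature is far broader. The function name and the example variant string are assumptions chosen for illustration, not part of the RD-CDM.

```python
import re
from typing import Dict, Optional

# Matches only simple substitutions such as "NM_000410.4:c.845G>A".
# Deletions, duplications, intronic positions, etc. are out of scope here.
_SUBSTITUTION = re.compile(
    r"^(?P<reference>[A-Z]{2}_\d+\.\d+):c\."
    r"(?P<position>\d+)(?P<ref_base>[ACGT])>(?P<alt_base>[ACGT])$"
)


def parse_simple_substitution(hgvs: str) -> Optional[Dict[str, str]]:
    """Return the parts of a simple c. substitution, or None if it does not match."""
    match = _SUBSTITUTION.match(hgvs.strip())
    return match.groupdict() if match else None


parsed = parse_simple_substitution("NM_000410.4:c.845G>A")
unsupported = parse_simple_substitution("c.845G>A")  # no reference sequence -> None
```

In a production setting, a dedicated HGVS parsing and validation library would be preferable to a hand-written pattern; the sketch only shows which components a structured genotype entry needs to keep apart.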
Moreover, the CDM incorporates information about certain procedures by using OPS codes. This enhances research by facilitating the establishment of connections between imaging results and genotype information. Use case-specific applications would include the matching of genotype with liver imaging data in hemochromatosis or with sonography and scintigraphy data in diseases that affect the thyroid gland.
Furthermore, the CDM could be helpful for the clustering of subgroups for certain rare diseases based on phenotype or genotype. Within the field of endocrinology, it allows for differentiation based on laboratory parameters, enabling the distinction between latent and manifest forms of hypothyroidism or autoimmune forms, depending on the presence of specific antibodies. Tuberculosis can affect various organs, leading to different phenotypic expressions. Using a suitable CDM, patients with diverse manifestations could be categorized based on their phenotypic profiles. In the context of hemochromatosis, it allows subgroup formation based on genetic factors. When genetic information is unknown or inconspicuous, the CDM can facilitate the creation of subgroups based on symptoms only.
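The laboratory-based distinction between latent and manifest hypothyroidism can be sketched as a simple rule: an elevated TSH with a normal free T4 is treated as latent, and an elevated TSH with a low free T4 as manifest. The following is a minimal sketch under that assumption; the function names are hypothetical, and the reference limits are passed in by the caller because they are laboratory-specific. The numeric values in the example are placeholders, not clinical recommendations.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def classify_hypothyroidism(tsh: float, ft4: float,
                            tsh_upper: float, ft4_lower: float) -> str:
    """Assign a subgroup label from TSH and free T4 using caller-supplied limits."""
    if tsh <= tsh_upper:
        return "no hypothyroid pattern"
    return "manifest" if ft4 < ft4_lower else "latent"


def group_patients(patients: Iterable[Tuple[str, float, float]],
                   tsh_upper: float, ft4_lower: float) -> Dict[str, List[str]]:
    """Group patient IDs by subgroup label; each patient is (id, TSH, fT4)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for patient_id, tsh, ft4 in patients:
        label = classify_hypothyroidism(tsh, ft4, tsh_upper, ft4_lower)
        groups[label].append(patient_id)
    return dict(groups)


# Placeholder values and limits, for illustration only.
cohort = [("p1", 2.0, 1.2), ("p2", 7.5, 1.1), ("p3", 12.0, 0.5)]
subgroups = group_patients(cohort, tsh_upper=4.0, ft4_lower=0.8)
```

The same pattern extends to antibody-based or phenotype-based subgroups by swapping the classification function while keeping the grouping step unchanged.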
These benefits also encompass the collection of patients with similar Variants of Uncertain Significance (VUS) and the comparative analysis of symptoms and phenotypes, facilitating their categorization into subgroups for research purposes or potential reclassification. Additionally, the potential exists for investigating the co-occurrence of specific rare diseases and assessing whether certain mutations may render individuals more vulnerable to other conditions, such as heightened infection risk for hepatitis.
In cases of incomplete penetrance of certain mutations, for example in hemochromatosis, the CDM is valuable for aggregating asymptomatic or minimally symptomatic patients based on their genotype for long-term risk assessment. It also supports the categorization of asymptomatic family members who have been subsequently examined as a distinct group.
Moreover, the flexibility of tailoring therapy studies according to the specific genotype for gene therapy is a valuable prospect. A coherent mapping of clinical information and underlying disease biology as genotype-phenotype maps may not only aid in identifying disease categories with different clinical presentations but also tailor personalized treatment approaches to patient biology. Especially in high-stakes environments such as AML treatment where fast and accurate diagnosis, as well as rapid treatment initiation according to molecular subtypes, is crucial (55), a better understanding of genotype-phenotype associations in multi-center data-driven studies holds the promise of improved treatment outcomes with targeted therapies while avoiding resistance and relapse.
The benefits of the RD-CDM are therefore evident for all use cases. Overall, the CDM significantly enhances our ability to comprehensively study and understand the complexities of RDs regardless of the focus domain.
General benefits of using CDMs
Integration of heterogeneous data is a ubiquitous topic in modern medicine. The growing variety of data has the potential to yield insights into different aspects of care and to lead to improvements in health care. Yet challenges such as identifying and accessing relevant data, linking different data sources, and ensuring data quality given the structural variations among data sources pose a barrier (56, 57). As a result, usable data remain sparse, especially patient-specific data such as genotypes and phenotypes, which are particularly important for RDs. Therefore, the development of a comprehensive CDM tailored to the unique domains of rare diseases is important. Our RD-CDM, built on the foundation of the Observational Medical Outcomes Partnership (OMOP) CDM, serves as a framework for standardizing additional data components across multiple domains. It is suitable for use in analytic processes involving machine learning and statistical models. In addition, because OMOP is well established as a research data model, our CDM facilitates collaboration with different research groups at different sites on an international level, effectively addressing the challenge of data scarcity, which is particularly critical in the field of rare diseases.
Limitations of our model
The RD-CDM is a prototype model developed using the data elements of four domains of RDs: endocrinology, gastroenterology, pneumonology, and hematology. Therefore, the RD-CDM tables are limited to the focus domains. Although the RD-CDM modules cover most of the medical data, a customization step might be necessary before the model can be used for other domains. Additionally, the genomic terminologies used in the RD-CDM are limited to the mutations and clinical elements that are part of the data entities in our included cohorts. Although the genomic elements were mapped with a 100% success rate, we often faced many-to-one mappings, in which two or more source elements mapped to a single concept (31). Moreover, we used HPO terms for mapping the symptoms to OMOP, but the HPO terminology is still not integrated into the Athena terminology management tool for OMOP. Our implemented ETL only provides a quick and temporary solution. Further work is necessary to integrate HPO into the OMOP terminology and introduce specific concept IDs for its terms.
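Mappings in which two or more source elements share one target concept can be flagged automatically before the data are loaded, so that the resulting loss of granularity is at least documented. The following is a minimal sketch; the function name, source codes, and concept IDs are hypothetical and do not come from our mapping tables.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_many_to_one(mappings: Iterable[Tuple[str, int]]) -> Dict[int, List[str]]:
    """Return target concept IDs that receive two or more distinct source codes."""
    sources_by_target = defaultdict(set)
    for source_code, target_concept_id in mappings:
        sources_by_target[target_concept_id].add(source_code)
    return {
        target: sorted(sources)
        for target, sources in sources_by_target.items()
        if len(sources) > 1
    }


# Hypothetical (source code, target concept ID) pairs.
example_mappings = [
    ("variant_A", 1001),
    ("variant_B", 1001),
    ("variant_C", 1002),
]
collisions = find_many_to_one(example_mappings)
```

Here `collisions` lists each shared target together with the source codes that collapse onto it, which can be reviewed with the medical experts.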
Lessons learned for the six individual customization steps
Bridging the gap between clinical experts and technical implementation is important for the design of such a model, which is why we consider the inclusion of experts from both domains and interdisciplinary collaboration as essential. Regular communication with the stakeholders helps to keep everyone aligned and informed about the progress and possible feedback. An iterative design process is essential to incorporate evolving requirements and insights.
- A clear definition of the use case(s) must be provided. This is of particular significance for interdisciplinary use cases, in which multiple domains are included (e.g., clinical, computational, organizational).
- An interdisciplinary team of stakeholders should be defined based on the use case(s) as early as possible.
- A list of diagnostic entities should be created together with the stakeholders for the targeted use case(s). A large group of medical experts is necessary for the definition and evaluation of data elements to ensure that the list of included elements in the final model is comprehensive. A consensus method for final models should be defined beforehand to objectively quantify the time period.
- The use case-specific entities should be mapped to the modules of the customizable RD-CDM.
- For the Person, Diagnosis, Laboratory Findings, Procedure, and Medication modules, the FHIR-to-OMOP ETL process should be used to transfer the data into OMOP. Testing the ETL processes with smaller synthetic data that have the same attributes as the real-world data is recommended to become accustomed to the logs and outcomes of the ETL process, especially if the real-world data are not directly available. For Genotypes, the direct ETL processes from CSV to OMOP CDM should be used to transfer the data into OMOP. The standard genomic vocabulary in OMOP was used to map the mutations to OMOP. By writing the study-specific HPO concepts to the SOURCE_TO_CONCEPT_MAP table, a temporary solution was provided for the integration of phenotype information into OMOP.
- A robust ETL process is essential for the accurate transfer of data into the OMOP framework. This requires careful planning, thorough testing, and validation to process multiple data sources and maintain data integrity. Familiarity with the data and ETL tools is also key to effective implementation and problem solving.
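The temporary HPO solution mentioned in the ETL step can be sketched as follows: study-specific HPO terms are turned into rows for the OMOP SOURCE_TO_CONCEPT_MAP table. The column names follow the OMOP CDM v5 table definition; the function names, vocabulary identifiers, and target concept IDs are placeholder assumptions and do not reflect the values used in our ETL.

```python
import csv
import io
from datetime import date
from typing import Dict, List

# Columns of the OMOP CDM v5 SOURCE_TO_CONCEPT_MAP table.
COLUMNS = [
    "source_code", "source_concept_id", "source_vocabulary_id",
    "source_code_description", "target_concept_id", "target_vocabulary_id",
    "valid_start_date", "valid_end_date", "invalid_reason",
]


def build_rows(hpo_terms: Dict[str, str],
               target_concept_ids: Dict[str, int],
               source_vocabulary_id: str = "HPO_study",   # hypothetical
               target_vocabulary_id: str = "HPO_local",   # hypothetical
               valid_from: date = date(2024, 1, 1)) -> List[Dict[str, object]]:
    """Create one mapping row per HPO term; unmapped terms get target concept 0."""
    rows = []
    for hpo_id, label in hpo_terms.items():
        rows.append({
            "source_code": hpo_id,
            "source_concept_id": 0,
            "source_vocabulary_id": source_vocabulary_id,
            "source_code_description": label,
            "target_concept_id": target_concept_ids.get(hpo_id, 0),
            "target_vocabulary_id": target_vocabulary_id,
            "valid_start_date": valid_from.isoformat(),
            "valid_end_date": "2099-12-31",
            "invalid_reason": "",
        })
    return rows


def to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Example terms; the target concept ID is a placeholder, not a real OMOP ID.
terms = {"HP:0002240": "Hepatomegaly", "HP:0001744": "Splenomegaly"}
rows = build_rows(terms, target_concept_ids={"HP:0002240": 2000000001})
csv_text = to_csv(rows)
```

Running the same function on a small synthetic term list first, as recommended above for the other modules, makes it easier to check the output before real cohort data are involved.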
Outlook
This RD-CDM is the basis for the development of a decision support system, namely the SATURN platform, to be used at the point of care by family practitioners. The platform will be equipped with different rule-based, case-based reasoning, and machine learning algorithms that aim to combine the available medical knowledge and clinical guidelines in the field with retrospective patient outcomes to support the diagnostic process for future patients. General practitioners are often confronted with patients whose symptoms they have little experience with. Therefore, this platform could support them in reaching a diagnosis in a shorter time.
A forward-looking usage for the platform could be integrating patient engagement features, such as a mobile app for tracking symptoms and facilitating communication with healthcare providers. This enhancement has the potential to empower patients significantly and enhance the management of their conditions. These additions could substantially boost the platform's effectiveness and reinforce its patient-centric orientation.
Impact on RD
The complexity attached to RDs is due to their heterogeneity and geographical dispersion, which limit the available knowledge (2). Patients are scattered across different countries and continents, and comprehensive data assembly is complicated not only by organizational, logistical, and communicative factors but also by a lack of data collection standards and common frameworks. We provide a framework to easily integrate genetic information in large-scale, multi-center studies, which in turn could reduce the amount of time spent on the characterization of phenotypes.
The customized RD-CDM based on OMOP can facilitate collaborations and investigations on an international level and in the long run improve patients’ quality of life through a faster diagnostic process.