This scoping review investigates evidence regarding approaches and criteria for provenance tracking and identifies knowledge gaps in the biomedical domain, with a focus on modeling and metadata frameworks for (sensitive) scientific biomedical data. Following the previously published scoping review protocol, we included 54 full-text papers out of the 564 papers initially retrieved from the PubMed and Web of Science (WoS) databases. Using a structured and pre-tested data extraction sheet, we extracted results that were contextual yet sufficiently detailed to answer the five research questions outlined in the protocol.
Following the data extraction and analysis, the findings led us to define the elements of a Provenance-SLF roadmap. We essentially distinguished between framework types and model characteristics, validation status, and requirement and provenance characteristics (see Fig. 6).
The provenance challenges, centered on the need for provenance standardization, emerged in 2006 and gave rise to tailor-made models and metadata frameworks for the representation of provenance. These were later superseded by general-purpose standardized provenance models, which have more recently been combined with domain- and application-specific models or extensions such as the Provenance, Authoring and Versioning (PAV) ontology [38] or the ProvCaRe model [53]. The predominantly used models reported in this review were the W3C PROV and OPM standards. As shown in Fig. 2, an increased number of papers related to implementation frameworks appeared between 2016 and 2020. One reason for this increase might be the drive to extend W3C PROV and OPM [11].
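To make the reported model landscape concrete, the following minimal sketch expresses a single data processing step in the shape of the W3C PROV-JSON serialization (an entity generated by an activity that is associated with an agent). The `ex:` identifiers, the activity name, and the timestamp are hypothetical placeholders, not drawn from the reviewed papers.

```python
# Minimal provenance record sketched in the shape of W3C PROV-JSON:
# a cleaning activity uses a raw dataset, generates a cleaned dataset,
# and is associated with a responsible agent. All "ex:" names are
# hypothetical placeholders for illustration only.
prov_record = {
    "prefix": {"ex": "http://example.org/"},
    "agent": {"ex:data-manager": {}},
    "activity": {"ex:clean-step": {"prov:startTime": "2022-01-01T10:00:00"}},
    "entity": {
        "ex:raw-data": {},
        "ex:cleaned-data": {},
    },
    # Relations reference the identifiers defined above.
    "used": {"_:u1": {"prov:activity": "ex:clean-step",
                      "prov:entity": "ex:raw-data"}},
    "wasGeneratedBy": {"_:g1": {"prov:entity": "ex:cleaned-data",
                                "prov:activity": "ex:clean-step"}},
    "wasAssociatedWith": {"_:a1": {"prov:activity": "ex:clean-step",
                                   "prov:agent": "ex:data-manager"}},
}
```

Because the record is plain JSON-shaped data, it can be exchanged between tools that understand the PROV data model without committing to a particular implementation framework.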
As of now, heterogeneous data sources, dynamic infrastructures, data exchange across boundaries, and a lack of standards for quality measures characterize the state of electronic health record data sets [57]. Additionally, the varying interpretations of the term provenance [27], [37], [46], [53], [65] continue to hamper a shared understanding and thus the harmonization and engineering efforts for modeling, implementation, and validation interventions.
A provenance framework for today’s demands must acknowledge the (semantic) complexity of the domain and its relevant facets and requirements [11] (see also Fig. 2). In addition to requirements analysis, a thorough strategy is necessary to plan the typical data management steps such as collecting, managing, and analyzing data (Pimentel et al. [67]). According to Curcin et al. [4], validation readiness can be achieved by separating modeling and verification of provenance data from the software implementation.
We agree that precise requirements analysis as part of the software life cycle, together with the subsequent life-cycle steps such as testing and maintenance procedures, supports the continual temporal evolution and hence improves the quality of provenance frameworks and applications.
When incorporated into an official inspection, provenance information must be sufficient for a content-related validation against applicable and accepted standards [4]. Therefore, precise validation methods for provenance services regarding usability and performance, scalability, fault tolerance, and functionality are needed [36]. We saw that validation approaches are linked to the evolution of provenance modeling and subsequent implementation attempts. Curcin et al. [1] argue that more formal software engineering techniques must be adopted to foster provenance implementation across a broad range of software tools in the biomedical domain and beyond. In that sense, formal validation as part of the software engineering process contributes to increased software and data quality. Formal validation requires testing efforts and testing evidence. However, accurate alignment of testing procedures against predefined requirements in the software life cycle could not be identified in the included papers.
Provenance information is of high value for the scientific and biomedical community (e.g., researchers), support staff (e.g., developers), patients, and other third parties (e.g., data privacy officers, authorities) (see Fig. 4). It is interesting to see that, despite the high impact of provenance [see Additional File 3], only some stakeholders provide sufficient provenance information. Rather, it appears that responsibility for overall provenance management is being shifted to the support staff [Gierend et al. (unpublished observations)]. We argue that available technology, IT knowledge, and data management skills need to be paired with domain-specific knowledge and combined with legal constraints or guidance [4], [44]. This complexity makes provenance management very time-consuming; however, automation and metadata collection can support this process [4], [73]. Good provenance information strengthens the credibility of the data and proves that data have not been intentionally or unintentionally changed throughout the data life cycle [74].
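As an illustration of how provenance records can demonstrate that data have not been changed throughout the data life cycle, the following sketch chains each record to its predecessor via cryptographic hashes, so that any later alteration invalidates the chain. This is a minimal illustrative example, not a method described in the reviewed papers; the actor and action names are hypothetical.

```python
import hashlib
import json

def record_step(chain, actor, action, data):
    """Append a provenance record whose hash covers the previous record,
    so later tampering with any record breaks the chain (illustrative)."""
    prev_hash = chain[-1]["hash"] if chain else ""
    payload = {"actor": actor, "action": action,
               "data_digest": hashlib.sha256(data).hexdigest(),
               "prev": prev_hash}
    payload["hash"] = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()).hexdigest()
    chain.append(payload)
    return chain

def verify(chain):
    """Re-derive every hash; True only if no record was altered."""
    prev = ""
    for rec in chain:
        body = {k: v for k, v in rec.items() if k != "hash"}
        if body["prev"] != prev:
            return False
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if expected != rec["hash"]:
            return False
        prev = rec["hash"]
    return True

# Hypothetical two-step data life cycle: collection, then cleaning.
chain = []
record_step(chain, "ex:collector", "collect", b"raw measurements")
record_step(chain, "ex:pipeline", "clean", b"cleaned measurements")
```

In practice such integrity evidence would be combined with the automation and metadata collection mentioned above, rather than maintained by hand.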
Our review collects and summarizes the existing challenges encountered in accomplishing provenance (Fig. 5). Challenges, expressed in terms of missing, lacking, or hindered organizational and technical capabilities, were triangulated into more specific subcategories, namely organizational (e.g., investment and training, administration) and technical (e.g., granularity, performance, modeling and metadata annotation, delimitation of reproducibility and replicability) challenges.
First, we observed that increasing legal and scientific demands require research projects to be implemented more transparently. However, the question of provenance granularity [48], [61], [63] has not yet been resolved, and so-called knowledge bottlenecks [44], [62] persist.
In parallel, appropriate provenance modeling [58] and provenance management techniques [61] are required to protect sensitive provenance data, such as data originating from patient consent. Curcin et al. [4] stipulated overcoming the gap between the provenance metadata collected and the reporting requirements.
Second, it remains unclear how to scale provenance systems for large amounts of data [2], [11], e.g., how to store and represent provenance information in an aggregated and efficient manner or how to assist users with sophisticated provenance queries [10]. Without doubt, automated and scalable solutions are becoming imperative due to new challenges arising from the availability and usage of ever-increasing computing power [60]. A growing focus is on the usability of the interface, particularly when provenance systems are implemented in the broad medical community, including patients, doctors, and researchers [35].
Third, this scoping review extracted data about the (in)completeness of provenance information during data management processes. Surprisingly, only one implementation paper [9] demonstrated complete traceability from data collection to the analysis datasets.
The lack of mandatory specifications or guidelines for provenance capture might be the reason why other papers report only partial completeness. We strongly recommend more research on completeness checks as part of provenance tracing. The level of completeness and accuracy of provenance information (of core data elements), especially in real-world data, could reveal data integrity issues and thus affect the overall validity of study results. Furthermore, reproducibility significantly depends on the accuracy of provenance information. For example, Mondelli et al. [55] delivered a tool for better scientific and longitudinal data management, which supports users with reproducibility through provenance and reproduction through Docker containers.
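A completeness check of the kind recommended above could, in its simplest form, walk derivation links backwards from an analysis dataset and flag any dataset whose origin is not recorded. The sketch below assumes a derivation graph and a set of registered collection sources; the `ex:` dataset names are hypothetical, and this is an illustrative check, not a method from the reviewed papers.

```python
def fully_traced(node, derived_from, sources, _memo=None):
    """A dataset is fully traced if it is a registered collection source,
    or if every input it was derived from is itself fully traced.
    Datasets without a recorded origin count as provenance gaps."""
    if _memo is None:
        _memo = {}
    if node in sources:
        return True
    if node in _memo:
        return _memo[node]          # reuse result for shared ancestors
    if node not in derived_from:
        return False                # no recorded origin: a gap
    _memo[node] = False             # provisional; a cycle counts as a gap
    _memo[node] = all(fully_traced(p, derived_from, sources, _memo)
                      for p in derived_from[node])
    return _memo[node]

# Hypothetical derivation graph: the analysis dataset depends on a
# cleaned dataset and a codebook, both ultimately derived from raw data.
derived_from = {
    "ex:analysis": ["ex:cleaned", "ex:codebook"],
    "ex:cleaned": ["ex:raw"],
    "ex:codebook": ["ex:raw"],
}
sources = {"ex:raw"}
```

Running such a check over all analysis datasets would surface exactly the partial completeness that the included papers could only describe qualitatively.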
Interestingly, the concept of “quality of provenance” is not clearly defined in any of the papers included in this review. We believe that data quality issues need to be addressed to reach completeness, accuracy, and timeliness of the data, and to create trust in it.
ISO 8000-2:2022 [75] defines the term data quality and recommends defining degrees of requirements. This definition should be considered for use in provenance systems.
Finally, upcoming trends can be observed regarding the scalability of software. As capacity and functionality increase with users’ demands, scalable software needs to remain stable while adapting to change. Another trend reveals the importance of good and systematic data management practices [37] and coordination with relevant stakeholders throughout the data life cycle.
Strengths
The present work applied a rigorous scoping review methodology using Arksey and O'Malley’s framework [15]. All screening stages were carried out by at least two independent reviewers from a team of four members. A previously published protocol [16] guided our review. The fact that this scoping review includes comprehensive results for the five related research questions and a roadmap for a tailor-made Provenance-SLF framework, with many additional results as supplements, can be considered a strength of this review.