This scoping review investigates evidence regarding approaches and criteria for provenance tracking and identifies knowledge gaps in the biomedical domain, with a focus on modeling and metadata frameworks for (sensitive) scientific biomedical data. Following the previously published scoping review protocol, we included 54 full-text papers out of the 564 papers initially retrieved from the PubMed and Web of Science (WoS) databases. Using a structured and pre-tested data extraction sheet, we extracted results that were contextual yet sufficiently detailed to answer the five research questions outlined in the protocol.
Following the data extraction and analysis, the findings led us to define the elements of a Provenance-SLF roadmap. We essentially distinguished between framework types and model characteristics, validation status, and requirement and provenance characteristics (see Fig. 6).
The provenance challenges, centered on the need for provenance standardization, emerged in 2006 and gave rise to tailor-made models and metadata frameworks for the representation of provenance. These were later superseded by general-purpose standardized provenance models, which have more recently been combined with domain- and application-specific models or extensions such as the Provenance, Authoring and Versioning (PAV) ontology [38] or the ProvCaRe model [53]. The predominantly used models reported in this review were the W3C PROV and OPM standards. As shown in Fig. 2, an increased number of papers related to implementation frameworks appeared between 2016 and 2020. One reason for this increase might be the drive to extend W3C PROV and OPM [11].
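To make the reported model landscape concrete, the following minimal sketch expresses a single data processing step in the shape of the W3C PROV-JSON serialization (an entity generated by an activity that is associated with an agent). The `ex:` identifiers, the activity name, and the timestamp are hypothetical placeholders, not drawn from the reviewed papers.

```python
# Minimal provenance record sketched in the shape of W3C PROV-JSON:
# a cleaning activity uses a raw dataset, generates a cleaned dataset,
# and is associated with a responsible agent. All "ex:" names are
# hypothetical placeholders for illustration only.
prov_record = {
    "prefix": {"ex": "http://example.org/"},
    "agent": {"ex:data-manager": {}},
    "activity": {"ex:clean-step": {"prov:startTime": "2022-01-01T10:00:00"}},
    "entity": {
        "ex:raw-data": {},
        "ex:cleaned-data": {},
    },
    # Relations reference the identifiers defined above.
    "used": {"_:u1": {"prov:activity": "ex:clean-step",
                      "prov:entity": "ex:raw-data"}},
    "wasGeneratedBy": {"_:g1": {"prov:entity": "ex:cleaned-data",
                                "prov:activity": "ex:clean-step"}},
    "wasAssociatedWith": {"_:a1": {"prov:activity": "ex:clean-step",
                                   "prov:agent": "ex:data-manager"}},
}
```

Because the record is plain JSON-shaped data, it can be exchanged between tools that understand the PROV data model without committing to a particular implementation framework.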
As of now, heterogeneous data sources, dynamic infrastructures, data exchange across boundaries, and a lack of standards for quality measures characterize the state of electronic health record data sets [57]. Additionally, the varying interpretations of the term provenance [27], [37], [46], [53], [65] continue to hamper a shared understanding and thus the harmonization and engineering efforts for modeling, implementation, and validation interventions.
A provenance framework for today’s demands must acknowledge the (semantic) complexity of the domain and its relevant facets and requirements [11] (see also Fig. 2). In addition to requirements analysis, a thorough strategy is necessary to plan the typical data management steps such as collecting, managing, and analyzing data (Pimentel et al. [67]). According to Curcin et al. [4], validation readiness can be achieved by separating modeling and verification of provenance data from the software implementation.
We agree that precise requirements analysis as part of the software life cycle, together with the subsequent life-cycle steps such as testing and maintenance procedures, supports the continual temporal evolution and hence improves the quality of provenance frameworks and applications.
When incorporated into an official inspection, provenance information must be sufficient for a content-related validation against applicable and accepted standards [4]. Therefore, precise validation methods for provenance services regarding usability and performance, scalability, fault tolerance, and functionality are needed [36]. We saw that validation approaches are linked to the evolution of provenance modeling and subsequent implementation attempts. Curcin et al. [1] argue that more formal software engineering techniques must be adopted to foster provenance implementation across a broad range of software tools in the biomedical domain and beyond. In that sense, formal validation as part of the software engineering process contributes to increased software and data quality. Formal validation requires testing efforts and testing evidence. However, accurate alignment of testing procedures against predefined requirements in the software life cycle could not be identified in the included papers.
Provenance information is of high value for the scientific and biomedical community (e.g., researchers), support staff (e.g., developers), patients, and other third parties (e.g., data privacy officers, authorities) (see Fig. 4). It is interesting to see that, despite the high impact of provenance [see Additional File 3], only some stakeholders provide sufficient provenance information. Rather, it appears that responsibility for overall provenance management is being shifted to the support staff [Gierend et al. (unpublished observations)]. We argue that available technology, IT knowledge, and data management skills need to be paired with domain-specific knowledge and combined with legal constraints or guidance [4], [44]. This complexity makes provenance management very time-consuming; however, automation and metadata collection can support this process [4], [73]. Good provenance information strengthens the credibility of the data and proves that data have not been intentionally or unintentionally changed throughout the data life cycle [74].
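As an illustration of how provenance records can demonstrate that data have not been changed throughout the data life cycle, the following sketch chains each record to its predecessor via cryptographic hashes, so that any later alteration invalidates the chain. This is a minimal illustrative example, not a method described in the reviewed papers; the actor and action names are hypothetical.

```python
import hashlib
import json

def record_step(chain, actor, action, data):
    """Append a provenance record whose hash covers the previous record,
    so later tampering with any record breaks the chain (illustrative)."""
    prev_hash = chain[-1]["hash"] if chain else ""
    payload = {"actor": actor, "action": action,
               "data_digest": hashlib.sha256(data).hexdigest(),
               "prev": prev_hash}
    payload["hash"] = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()).hexdigest()
    chain.append(payload)
    return chain

def verify(chain):
    """Re-derive every hash; True only if no record was altered."""
    prev = ""
    for rec in chain:
        body = {k: v for k, v in rec.items() if k != "hash"}
        if body["prev"] != prev:
            return False
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if expected != rec["hash"]:
            return False
        prev = rec["hash"]
    return True

# Hypothetical two-step data life cycle: collection, then cleaning.
chain = []
record_step(chain, "ex:collector", "collect", b"raw measurements")
record_step(chain, "ex:pipeline", "clean", b"cleaned measurements")
```

In practice such integrity evidence would be combined with the automation and metadata collection mentioned above, rather than maintained by hand.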
Our review collects and summarizes the existing challenges encountered in accomplishing provenance (Fig. 5). Challenges, expressed in terms of missing, lacking, or hindered organizational and technical capabilities, were triangulated into more specific subcategories, namely organizational (e.g., investment and training, administration) and technical (e.g., granularity, performance, modeling and metadata annotation, delimitation of reproducibility and replicability) challenges.
First, we observed that increasing legal and scientific demands require research projects to be implemented more transparently. However, the question of provenance granularity [48], [61], [63] has not yet been resolved, and so-called knowledge bottlenecks [44], [62] persist.
In parallel, appropriate provenance modeling [58] and provenance management techniques [61] are required to protect sensitive provenance data, such as data originating from patient consent. Curcin et al. [4] stipulated overcoming the gap between the provenance metadata collected and the reporting requirements.
Second, it remains unclear how to scale provenance systems for large amounts of data [2], [11], e.g., how to store and represent provenance information in an aggregated and efficient manner or how to assist users with sophisticated provenance queries [10]. Without doubt, automated and scalable solutions are becoming imperative due to new challenges arising from the availability and usage of ever-increasing computing power [60]. A growing focus is on the usability of the interface, particularly when provenance systems are implemented in the broad medical community, including patients, doctors, and researchers [35].
Third, this scoping review extracted data about the (in)completeness of provenance information during data management processes. Surprisingly, only one implementation paper [9] demonstrated complete traceability from data collection to the analysis datasets.
The lack of mandatory specifications or guidelines for provenance capture might be the reason why other papers report only partial completeness. We strongly recommend more research on completeness checks as part of provenance tracing. The level of completeness and accuracy of provenance information (of core data elements), especially in real-world data, could reveal data integrity issues and thus affect the overall validity of study results. Furthermore, reproducibility significantly depends on the accuracy of provenance information. For example, Mondelli et al. [55] delivered a tool for better scientific and longitudinal data management, which supports users with reproducibility through provenance and reproduction through Docker containers.
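A completeness check of the kind recommended above could, in its simplest form, walk derivation links backwards from an analysis dataset and flag any dataset whose origin is not recorded. The sketch below assumes a derivation graph and a set of registered collection sources; the `ex:` dataset names are hypothetical, and this is an illustrative check, not a method from the reviewed papers.

```python
def fully_traced(node, derived_from, sources, _memo=None):
    """A dataset is fully traced if it is a registered collection source,
    or if every input it was derived from is itself fully traced.
    Datasets without a recorded origin count as provenance gaps."""
    if _memo is None:
        _memo = {}
    if node in sources:
        return True
    if node in _memo:
        return _memo[node]          # reuse result for shared ancestors
    if node not in derived_from:
        return False                # no recorded origin: a gap
    _memo[node] = False             # provisional; a cycle counts as a gap
    _memo[node] = all(fully_traced(p, derived_from, sources, _memo)
                      for p in derived_from[node])
    return _memo[node]

# Hypothetical derivation graph: the analysis dataset depends on a
# cleaned dataset and a codebook, both ultimately derived from raw data.
derived_from = {
    "ex:analysis": ["ex:cleaned", "ex:codebook"],
    "ex:cleaned": ["ex:raw"],
    "ex:codebook": ["ex:raw"],
}
sources = {"ex:raw"}
```

Running such a check over all analysis datasets would surface exactly the partial completeness that the included papers could only describe qualitatively.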
Interestingly, the concept of “quality of provenance” is not clearly defined in any of the papers included in this review. We believe that data quality issues need to be addressed to reach completeness, accuracy, and timeliness of the data, and to create trust in it.
ISO 8000-2:2022 [75] defines the term data quality and recommends defining degrees of requirements. This definition should be considered for use in provenance systems.
Finally, upcoming trends can be observed regarding the scalability of software. As capacity and functionality increase with users’ demands, scalable software needs to remain stable while adapting to change. Another trend reveals the importance of good and systematic data management practices [37] and coordination with relevant stakeholders throughout the data life cycle.
Strengths
The present work applied a rigorous scoping review methodology using Arksey and O'Malley’s framework [15]. All screening stages were carried out by at least two independent reviewers from a team of four members. A previously published protocol [16] guided our review. The fact that this scoping review includes comprehensive results for the five related research questions and a roadmap for a tailor-made Provenance-SLF framework, with many additional results as supplements, can be considered a strength of this review.