On legal interoperability
In the real life implementation of a federated approach, where many more nodes with different responsibility on the data curation and management, many more data sources could be used and data requirements could be larger in a context where research questions and queries grow exponentially. In this case, assuring the compliance with GDPR principles, in particular minimization and confidentiality, gains relevance and imposes new actions. In this context, Data Protection Officers (DPOs) will play a major role.
Recommendation 1: In the real life expansion of the JA Infact, data access will require to document how GDPR principles will be assured, mainly throughout: a) a protocol of the study behind the query (including the purpose and methodology) and a data management plan including the data schema (entities, variables, operational description with categories and values, and encoding systems), what are the measures to assure confidentiality and minimisation; who will be the actors and what role they will pay in data management and for how long.
Recommendation 2
In the context of the nodes, there will be a need for DPOs to understand how data accessing and data management procedures will work in the context of a federated approach. Specific training programs for DPOs could be recommendable. The other way round, the continuous exchange with DPOs will make each node be aware of the local and specific requirements and anticipate the data accessing needs.
Recommendation 3
In a scaled context, there will be a need for technological solutions that ensures security- and privacy-by-design. Robust authentication and authorization systems will be needed to manage data access to only authorized users and to provide traceability information for forensic analyses in the follow up of a given user.
On organizational interoperability
In the JA InfAct case study the number of actors interacting has been confined to a few: in the Coordination Hub a technological and a domain expert, in the different nodes one or two contact persons with mixed profiles. In this very controlled case study, bilateral interaction between the coordination node and the four participan nodes, close tutoring of the process, even online remote intervention can be used to solve queries on the deployment of the technological solutions.
As mentioned when describing the federated research infrastructure, a good wealth of tasks are developed inhouse by each of nodes within the federation - discussing of the research question, agreeing on a common data model, accessing and collecting the data in the way required by such a data model, deploying the technologies developed elsewhere in their technological infrastructures, running the scripts, interpreting the error logs and the outputs.
In a federation with much more nodes, or in a hybrid federation with one node serving as an orchestrator of other nodes, or in a peer to peer federation where any node can orchestrate or any node can interrogate the federation the needs for organizational interoperability will sky rocketed.
Recommendation 4: In the context of a scaled up infrastructure, nodes in the federation will require a number of profiles: domain experts (depending on the research question), data scientists, data managers and system administrators. The Coordination Hub requires, in addition, an IT profile expert in distributed computing.
Recommendation 5
Orchestrating the whole distribution in more complex federations will require a stepwise approach (see details in Appendix) that smooths down the exchange between the Coordination Hub and the nodes, while deploying an analytical pipeline that is transparent and reproducible at any step.
Recommendation 6
In the institutions composing the federation, rating data curation institutions according to their procedures to get data up-to-date and high quality; agreeing on a common data quality framework (see for example in [13]); cataloguing their data sources in a way that is standard (e.g., DCAT [14]); providing information on interoperability standards and reusability; publishing clear procedures to get access to their data.
On semantic interoperability
Data requirements within InfAct JA case studies have been intendedly limited and, consequently, the number of data sources and the data types. Thus, getting a common data model has then been rather uncomplicated. A pan European federated infrastructure that is expected to interact with unlimited research questions will require consider multiple data sources and many more data types. Some of them may come from routine collections; for example, administrative or claims data as the ones used in the stroke case study, disease-specific registries, population-based registries, socioeconomic repertoires, electronic and medical health records, data from lab tests, data from imaging tests, etc. Some of them from ad hoc data collections, e.g., DNA sequences, tissues, data from wearables, samples of texts or data from social media.
In addition to the variety of data sources is also the heterogeneity of data in their very nature (at one end, administrative data; at the other end, natural language) but also heterogeneous in the encoding systems. For example the inclusion of new countries in the stroke use case may have required other disease encoding systems, such as NOMESCO [15], OPCS [16], Leistsungkatalog [17] or ACHI [18]. In case of using data from lab tests LOINC [19] should have been mapped out.
Recommendation 7. In the context of a scaled up infrastructure, it would be recommendable to map out and catalogue the most prevalent semantic interoperability standards. In that sense, future initiatives should link to standards developers and curators, for example, using SNOMED [20], the ontology of reference terms for medical conditions.
Recommendation 8. A pan european federated infrastructure should link with the existing research infrastructures on health data. On the one hand, to learn how they have catalogued the standards of semantic (and syntactic) interoperability. On the other hand, to provide access to their standards to the population health research community that could be interested in data models including that variety of data sources. As a matter of example, standards on biosamples [21] or omics [22].
Recommendation 9 It is expected that most of the studies in an expanded version of a federated research infrastructure, the vast majority of research will be observational studies. A major multiparty initiative pursuing a common data model for observational research is OMOP. [23] A close follow up of this initiative is recommendable, even proactively advocating improvements to get the specificities of population health research well represented in the OMOP CDM.
On technological interoperability
Although setting up the foundations for the deployment of more complex pipelines the aforementioned technological elements in the JA InfAct stroke case study had a very modest scope. The eventual expansion of a federated research infrastructure such as the one tested in InfAct would require a technological upgrade considering three elements: the reduction of human interaction in the steps proposed in 4.2.2; the possibility of heavier computational workloads, and designing the architecture to allow full distribution of complex methodologies whose results are comparable to the ones using data pooling in a single repository.
The JA-InfAct federated analysis infrastructure can be considered a step towards more sophisticated solutions. It is a reliable solution for a problem specific scenario, but the foundations may be easily extended to include more analysis pipelines. For example, a generalized version of the infrastructure may support fully distributed statistical algorithms [24], [25] and, in the final term, state-of-the-art federated learning algorithms [26][27][28], the current cutting-edge analysis approach when leading with huge data sets distributed across multiple locations, without having the possibility of merging them. In addition, the current client-server architecture, which relies on a Coordination Hub that agglutinates a high level of responsibility can be moved to a peer-to-peer architecture, where all partners/Data Hubs can act as peers, having the capacity to coordinate analyses through the infrastructure.
Recommendation 10 When it comes to reducing human interaction, a way forward will be developing and implementing a robust deployment protocol to automatically orchestrate the federation set up activities between the Coordination Hub and the different, detailed in the stepwise process represented in Appendix.
Recommendation 11. One of the tasks of the coordination hub in an eventual scaled up federated infrastructure should be the assessment of the computational needs of the different research queries. To optimize the infrastructure resources and reduce the management overheads, it is sensible to outsource the computing or storage capabilities instead of having and maintaining high capacities inhouse. A solid European service providers such as, EGI, for computation capacities https://www.egi.eu, or EUDAT, for storage capacities (https://www.eudat.eu/) are primary choices for this purpose
Recommendation 12. In federated infrastructures, when research questions and research methodologies become more demanding the design of distributed analyses becomes paramount. These new and growing requirements should be faced in a future federated infrastructure by providing state-of-the-art federated machine learning algorithms and methods, as well as to provide the elements to easily develop or adapt new analysis techniques.
To conclude, it is important to note that all the know-how gathered during the development of the JA-InfAct federated analysis infrastructure and some of the recommendations provided are currently being implemented in PHIRI, population health information research infrastructure [https://www.phiri.eu], a practical roll out of DIPOH, distributed infrastructure on population health research, current candidate to get into the ESFRI roadmap. In addition, all this insight is playing a fundamental part of the European Health Research and Innovation Cloud, a cloud for health data exchange between European health research infrastructures and health services, to be designed under the HealthyCloud project [https://healthycloud.eu]. Finally, this knowledge is currently helping to give shape to the future European Health Data Space, the project to regulate the secondary use of health data across Europe, under the framework of the TEHDaS Joint Action [https://tehdas.eu].