Evidence synthesis and machine learning
Systematic reviews aim to identify and summarize all available evidence to draw inferences of causality, prognosis, diagnosis, prevalence, and so on, to inform policy and practice. Reviewers should adhere closely to principles of transparency, reproducibility, and methodological rigor to accurately synthesize the available evidence. These principles are pursued through adhering to explicit and pre-specified processes (Antman, Lau et al. 1992, Oxman and Guyatt 1993).
As noted above, ML can reduce the need for humans to perform repetitive and complex tasks. “Repetitive and complex” characterizes several systematic review steps, such as assessing the eligibility of thousands of studies according to a set of inclusion criteria, extracting data, and even assessing risk of bias domains using signaling questions. Not only are most tasks repeated many times for each study, but they are often conducted by two trained researchers.
Unsurprisingly, conducting a systematic review is a resource-intensive process. Although the time taken to complete health reviews varies greatly (Nussbaumer-Streit, Ellen et al. 2021), both a systematic review (Borah, Brown et al. 2017) and a simulation study (Pham, Jovanovic et al. 2021) arrived at an estimate of fifteen months. Cochrane suggests reviewers should prepare to spend one to two years (Cochrane Community), yet only half of reviews are completed within two years of protocol publication (Andersen, Gulen et al. 2020). Andersen et al. also report that median time-to-publication has been increasing. A worrying estimate from 2007 is that twenty-five percent of reviews are outdated within two years of publication due to the availability of new findings (Shojania, Sampson et al. 2007). Furthermore, resource use does not necessarily end with the publication of a review: many reviews, notably those published by Cochrane and health technology assessments in rapidly advancing fields such as cancer treatment, must be updated (Elliott, Synnot et al. 2017).
ML offers the potential to reduce resource use, produce evidence syntheses in less time, and maintain or perhaps exceed current expectations of transparency, reproducibility, and methodological rigor. One example is the training of binary classifiers to predict the relevance of unread studies without human assessment: Aum and Choe recently used a classifier to predict systematic review study designs (Aum and Choe 2021), Stansfield and colleagues to update living reviews (Stansfield, Stokes et al. 2022), and Verdugo-Paiva and colleagues to update an entire COVID-19 database (Verdugo-Paiva, Vergara et al. 2022).
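To make the idea of a relevance classifier concrete, the following is a minimal sketch, not the method used in any of the studies cited above. It assumes a small set of records already screened by humans and uses a simple naive Bayes model with add-one smoothing as an illustrative stand-in; the function names (`tokenize`, `train`, `relevance_score`) and the example records are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def train(labelled):
    """Fit word counts per class from (text, is_relevant) pairs.

    Assumes at least one relevant and one irrelevant record are present.
    """
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labelled:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def relevance_score(model, text):
    """Return the log-odds of relevance; values above 0 favor 'relevant'."""
    word_counts, doc_counts, vocab = model
    score = math.log(doc_counts[True] / doc_counts[False])
    totals = {c: sum(word_counts[c].values()) for c in (True, False)}
    for w in tokenize(text):
        if w not in vocab:
            continue  # ignore words never seen during training
        p_rel = (word_counts[True][w] + 1) / (totals[True] + len(vocab))
        p_irr = (word_counts[False][w] + 1) / (totals[False] + len(vocab))
        score += math.log(p_rel / p_irr)
    return score


# Hypothetical records already screened by humans.
screened = [
    ("randomized controlled trial of exercise for depression", True),
    ("exercise intervention trial in adults with depression", True),
    ("editorial on hospital parking policy", False),
    ("commentary about conference catering", False),
]
model = train(screened)

# Rank unread records so the most likely relevant ones come first.
unread = ["editorial commentary on catering", "randomized trial of exercise"]
ranked = sorted(unread, key=lambda t: relevance_score(model, t), reverse=True)
```

In practice a classifier of this kind would be trained on far more records and validated against human decisions before any record is excluded without human assessment.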
ML tools have been available to systematic reviewers for at least ten years, yet uptake has been slow. In 2013, Thomas asked why automation tools were not more widely used in evidence synthesis (Thomas 2013). Since then, a growing number of review software packages with ML functionality have become available (van der Mierden, Tsaioun et al. 2019, Harrison, Griffin et al. 2020), including functionality that maps to the most time-intensive phases (Clark, McFarlane et al. 2021, Nussbaumer-Streit, Ellen et al. 2021). The evidence for time savings in specific review phases has also grown. O’Mara-Eves and colleagues’ review in 2015 found time savings of 40–70% in the screening phase when using various text mining software (O'Mara-Eves, Thomas et al. 2015); we reported similar or perhaps greater time savings (60–90%) in 2021 (Muller, Ames et al. 2021). Automatic classification and exclusion of non-randomized designs with a study design classifier saved Cochrane Crowd from manually screening more than 40% of identified references in 2018 (Thomas, McDonald et al. 2021). We have also reported that categorizing studies using automated clustering took 33% of the time required for manual categorization (Muller, Ames et al. 2021).
While the available estimates of time saved within distinct review phases are impressive, two additional outcomes are more important to quantify: total resource use and time-to-completion. Studying resource use is important because producing evidence syntheses is expensive. Studying time-to-completion is important because answers that arrive late are not useful. We are unaware of any studies that have compared ML-assisted and human-based review methods with respect to these outcomes. Knowing how ML may affect total resource use would help review producers to budget and price their products and services. Knowing how ML may affect time-to-completion would help review producers decide whether to adopt ML in general or for specific projects and, if they do, how project timelines may be affected. Clark et al. report a review conducted in two weeks, which they attribute to full integration of software (with and without ML) and to project management changes. They conclude by predicting that adoption of ML will increase if “the increase in efficiency associated with their use becomes more apparent” (Clark, Glasziou et al. 2020) (page 89).
Context
The Cluster for Reviews and Health Technology Assessments in the Norwegian Institute of Public Health is staffed by about 60 employees and, before the COVID-19 pandemic, produced up to about 50 evidence synthesis products per year. This number has roughly doubled during the pandemic. Cluster management funded the ML team in late 2020 to coordinate implementation, including building the capacity of reviewers to independently use, interpret, and explain relevant ML concepts and tools. This team is tasked with the continuous identification, process evaluation, and implementation of ML tools that can aid the production of evidence synthesis products, and with tailoring them to institutional procedures and processes; see Fig. 1 for a schematic.
Recommended versus non-recommended use of ML
Fifteen months after the ML team was formed, we noticed that ML is sometimes used in addition to, rather than instead of, fully manual processes. One example of this is screening titles and abstracts with a ranking algorithm, reaching the “plateau” that indicates all relevant studies have been identified, but then continuing to use two blinded human reviewers to screen thousands of remaining and likely irrelevant studies.
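The "plateau" can be illustrated with a short sketch. The source does not define a specific stopping rule, so the heuristic below (stop once a fixed number of consecutive ranked records have been excluded) is an assumption chosen for illustration; the function names and the `window` value are hypothetical.

```python
def cumulative_inclusions(decisions):
    """Running count of included records, in the order they were screened."""
    total = 0
    curve = []
    for included in decisions:
        total += 1 if included else 0
        curve.append(total)
    return curve


def plateau_index(decisions, window):
    """Number of records screened when `window` consecutive exclusions
    are first observed, or None if that never happens.

    Assumed heuristic for illustration only; real projects may use
    other stopping criteria.
    """
    run = 0
    for i, included in enumerate(decisions, start=1):
        run = 0 if included else run + 1
        if run >= window:
            return i
    return None


# Hypothetical screening decisions on records ranked by predicted relevance.
decisions = [True, True, False, True, False, True,
             False, False, False, False, False, False]

curve = cumulative_inclusions(decisions)
stop_at = plateau_index(decisions, window=4)
```

Here the cumulative curve stops rising after the sixth record, and the assumed rule would suggest stopping after the tenth. Continuing dual human screening of every remaining record past that point is the non-recommended pattern described above.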
It seems self-evident that introducing a new tool (e.g., ML) while continuing to perform the tasks the tool seeks to replace will not reduce resource use or time-to-completion. If ML tools can deliver the savings they promise, and are to be adopted, then it is necessary to convince reviewers to adopt these new tools and use them as recommended. This protocol therefore distinguishes between “non-recommended” ML that merely adds tasks to normal, manual procedures, and “recommended” ML that corresponds to some level of automation that replaces manual procedures.
We do not mean to say that every project should use ML, or use it in the same way, but that if ML is adopted to reduce resource use or time-to-completion, as is the overarching aim in our institution, it should replace some human activities. There may be cases in which the use of ML alongside human activity is expected to be beneficial, for example if important studies are expected to be easy for even humans to miss, or to help new and inexperienced reviewers learn (Jardim, Rose et al. in press). Importantly, we do not mean to say that people have no role in evidence synthesis, but that it seems likely that people can make valuable higher-level contributions that machines cannot.