Conducting a systematic review or meta-analysis requires a significant amount of time. However, automation can be used to accelerate several steps in the process, particularly the screening phase (Adam et al., 2022; Cierco Jimenez et al., 2022; Cowie et al., 2022; Khalil et al., 2022; Nieto González, 2021; Pellegrini & Marsili, 2021; Qin et al., 2021; Robledo et al., 2021; Scott et al., 2021; Tsou et al., 2020; van de Schoot et al., 2021; Wagner et al., 2022; L. L. Wang & Lo, 2021). Artificial intelligence can assist reviewers with screening prioritization through active learning, a specific implementation of machine learning; for a detailed introduction, we refer to Settles (2009). Active learning is an iterative process in which the machine continually reassesses the unscreened records for relevance and the human screener labels the records deemed most likely to be relevant. As the machine receives more labeled data, it uses this new information to improve its predictions on the remaining unlabeled records, with the goal of identifying all relevant records as early as possible. Priority screening via active learning has been successfully implemented in various software tools such as Abstrackr (Wallace et al., 2012), ASReview (van de Schoot et al., 2021), Colandr (Cheng et al., 2018), EPPI-Reviewer (Thomas et al., 2020), FASTREAD (Yu et al., 2018), Rayyan (Ouzzani et al., 2016), RobotAnalyst (Przybyła et al., 2018), Research Screener (Chai et al., 2021), DistillerSR (Hamel et al., 2020), and RobotReviewer (Marshall et al., 2017). However, among these tools, only ASReview offers the flexibility to implement the model-switching approach proposed in this paper. For a curated comparison of these software tools, see van de Schoot (2023).
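To make this loop concrete, the minimal Python sketch below illustrates screening prioritization with certainty-based active learning. It is our illustration, not code from any of the tools listed above: the TF-IDF representation and logistic regression classifier are placeholder choices, and the `ask_screener` callback stands in for the human screener.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def screen_with_active_learning(abstracts, prior_labels, ask_screener):
    """abstracts: list of strings; prior_labels: {index: 0/1} for a small
    training set containing at least one relevant and one irrelevant record;
    ask_screener(index) returns the human's 0/1 label for that record."""
    X = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
    labels = dict(prior_labels)
    while len(labels) < len(abstracts):
        # Retrain on everything labeled so far.
        idx = list(labels)
        model = LogisticRegression().fit(X[idx], [labels[i] for i in idx])
        # Rank the unlabeled records and query the one most likely to be
        # relevant (certainty-based sampling, used for screening prioritization).
        unlabeled = [i for i in range(len(abstracts)) if i not in labels]
        scores = model.predict_proba(X[unlabeled])[:, 1]
        top = unlabeled[int(scores.argmax())]
        labels[top] = ask_screener(top)
        # In practice, the loop ends earlier, once a stopping rule fires
        # (see the heuristics discussed below).
    return labels
```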
Priority screening via active learning allows for a more efficient and effective screening process than manual screening methods. Because early screening effort is focused on the records most likely to be relevant, screening fatigue is less likely to cause relevant records to be missed, as can happen with traditional approaches such as screening by year, title, or author, or in random order. Moreover, with active learning, the relevant records are found early in the screening process, allowing the review team to start on subsequent steps of the review while the less relevant records are still being screened. Another advantage of active learning is that it allows for a more sensitive and reproducible search with less filtering. With manual screening, search strategies are often designed to yield a manageable number of records, which may require applying filters or limiting the number of search terms. However, these filters can reduce the sensitivity of the search and may introduce bias, while also limiting the reproducibility of the search over time. Active learning can sort through large amounts of data more efficiently than manual screening and thus requires less filtering, enabling a more sensitive and reproducible search. Overall, active learning is a promising method for systematic reviews and meta-analyses due to its more efficient, effective, and transparent screening process.
However, determining the optimal point at which to stop screening is a critical and challenging task when using active learning. The main goal of active learning is to screen fewer records than random screening, so it is important to find an efficient stopping point in the active learning process (Yu & Menzies, 2019). However, defining a stopping rule is difficult because the cost of labeling additional records must be balanced against the cost of errors made by the current model (Cormack & Grossman, 2016). Active learning models continually improve their predictions as they receive more labeled data, but the process of collecting labeled data can be time-consuming and resource-intensive. While finding all relevant records is nearly impossible, even for human screeners (Z. Wang et al., 2020), it is important to consider that, in the absence of labeled data, the number of remaining relevant records is unknown. Therefore, researchers may either stop too early and risk missing important records or continue for too long and incur unnecessary additional reading (Yu et al., 2018). At some point in the active learning process, most, if not all, relevant records have been presented to the screener, and only irrelevant records remain. Thus, finding an optimal stopping point is crucial to conserve resources and ensure the accuracy of the review.
Several statistical stopping rules have been proposed in the literature (Cormack & Grossman, 2016; Howard et al., 2020; Kastner et al., 2009; Ros et al., 2017; Stelfox et al., 2013; Wallace et al., 2010, 2012; Webster & Kemp, 2013; Yu & Menzies, 2019). However, these rules can be difficult for non-specialists to interpret and apply, and they have not been widely implemented in software tools.
Alternatively, heuristics have been proposed as a practical and effective way to define stopping rules for active learning-based screening in systematic reviews and meta-analyses. Several heuristics have been proposed, each focusing on a single aspect, including time-based, data-driven, and number-based strategies, such as those proposed by Bloodgood & Vijay-Shanker (2014), Olsson & Tomanek (2009), Ros et al. (2017), and Vlachos (2008). In the time-based approach, the screener stops after a pre-determined amount of time, which can be useful when screening time is limited or when the hourly costs of the screener are high. In the data-driven approach, the screener stops after labeling a pre-determined number of consecutive irrelevant records, for example, after labeling 50 records in a row as irrelevant. In the number-based approach, the screener stops after having evaluated a fixed number of records. This number can be based on an estimate of the total number of relevant records in the starting set (Cormack & Grossman, 2016). A variation of the number-based approach is to screen a predefined set of records randomly and use the observed fraction of relevant records to extrapolate an estimate of the number of relevant records in the complete set (van Haastrecht et al., 2021). Lastly, we propose the key paper heuristic to validate recall, which builds on descriptions in sources such as Tran et al. (2022) and Bramer et al. (2018). Key papers are typically used for validating the search strategy by ensuring that the search process adequately identifies relevant primary studies. When the key paper heuristic is used to validate the active learning phase, a set of important papers is determined beforehand, for example by expert consensus, and the screener stops once all of these papers have been found with active learning. In sum, these single-aspect heuristics offer practical and simple approaches to defining stopping rules for active learning-based screening and can help non-specialists more easily interpret the results. At the same time, relying on a single heuristic has limitations and may result in missing potentially relevant records. Therefore, we suggest implementing a combination of heuristics, as sketched below, to avoid ending screening prematurely and to increase the recall rate.
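To make these single-aspect rules concrete, the sketch below expresses each heuristic as a separate check. It is our illustrative rendering rather than code from any of the cited works, and all thresholds (eight hours, 50 consecutive irrelevant labels, 1000 records, ten key papers) are hypothetical examples, not recommendations.

```python
import time

def heuristic_status(start_time, label_history, n_key_papers_found,
                     max_hours=8.0,        # time-based budget (hypothetical)
                     n_irrelevant_run=50,  # data-driven run length (hypothetical)
                     max_records=1000,     # number-based cap (hypothetical)
                     n_key_papers=10):     # size of the key paper set (hypothetical)
    """Report which single-aspect stopping heuristics currently hold.

    label_history: list of 0/1 labels in screening order (1 = relevant).
    """
    tail = label_history[-n_irrelevant_run:]
    return {
        "time_based": (time.time() - start_time) / 3600 >= max_hours,
        "data_driven": len(tail) == n_irrelevant_run and not any(tail),
        "number_based": len(label_history) >= max_records,
        "key_papers": n_key_papers_found >= n_key_papers,
    }

def estimate_total_relevant(n_relevant_in_sample, sample_size, total_records):
    # The van Haastrecht et al. (2021) variation of the number-based approach:
    # extrapolate the expected number of relevant records from a random sample.
    return n_relevant_in_sample / sample_size * total_records
```

Stopping as soon as any single check fires risks ending prematurely; a conservative combination, in line with the suggestion above, would instead require several checks, for example the data-driven and key paper checks, to hold simultaneously.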
The goal of the current paper is to present a practical and conservative stopping heuristic that combines different heuristics to avoid stopping too early and missing relevant records during screening, and that can be applied in screening software such as ASReview. The proposed stopping heuristic is easy to implement and can be applied effectively in various scenarios. The SAFE procedure consists of four phases: Screen a random set for training data; Apply active learning; Find more relevant records with a different model; Evaluate quality. We first present the results of an expert meeting in which we piloted and discussed the stopping heuristic. Next, we provide a detailed explanation of the heuristic, including its implementation and effectiveness in different scenarios. The proposed stopping heuristic balances the costs of continued screening against the risk of missing relevant records, providing a practical solution for reviewers to make informed decisions on when to stop screening. We hope that this practical and effective stopping heuristic will be widely adopted and implemented in systematic reviews and meta-analyses that use active learning.
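As a preview of how the four phases fit together, the schematic sketch below chains them in code. Every function here is a toy stand-in of our own devising; the concrete models, heuristics, and evaluation criteria are specified later in the paper.

```python
import random

def human_label(record):
    # Stand-in for the human screener's relevance judgment.
    return random.randint(0, 1)

def prioritized_screening(records, labels, model):
    # Stand-in for an active learning run with the named model that continues
    # until a stopping heuristic fires; here it simply labels ten more records.
    unlabeled = [i for i in range(len(records)) if i not in labels]
    for i in unlabeled[:10]:
        labels[i] = human_label(records[i])
    return labels

def safe_procedure(records, key_papers):
    """records: list of records; key_papers: set of record indices agreed on
    beforehand, e.g. by expert consensus."""
    # S - Screen a random set to obtain training data.
    labels = {i: human_label(records[i])
              for i in random.sample(range(len(records)), k=5)}
    # A - Apply active learning with a first model.
    labels = prioritized_screening(records, labels, model="model_A")
    # F - Find more relevant records by switching to a different model.
    labels = prioritized_screening(records, labels, model="model_B")
    # E - Evaluate quality, e.g. verify that all key papers were found.
    if not key_papers <= {i for i, y in labels.items() if y == 1}:
        print("Key papers missed; continue screening before stopping.")
    return labels
```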