To examine the potential of AI to facilitate evidence syntheses, we explore the AI team's performance at each stage. First, we examine its capability to develop search strings. Then, we turn to its performance in screening articles, covering both the pilot screening and the full title-and-abstract screening.
Developing the search strategy
The first stage of this study was to evaluate the abilities of AI to craft a comprehensive, reliable and relevant search strategy on the topic of CBFM. Based on the prompts given by an independent team of reviewers (Table 1), the AI generated the following search string and screening criteria:
((community-based OR participatory OR co-management OR collaborative OR natural resource management) AND (fisheries OR marine resource OR fishery) AND (management OR conservation OR governance)) AND ((Pacific Island countries) AND (benefits OR advantages OR opportunities OR strengths OR positive OR opportunities OR potential) AND (barriers OR challenges OR obstacles OR limitations OR constraints OR difficulties))
Screening Criteria:
- Studies published in English
- Studies focused on Pacific Island countries
- Studies that address the benefits and barriers of community-based fisheries management
- Studies that were published between 2000 and 2023
- Studies that were peer-reviewed
Whilst this search strategy superficially reflects the research topic, the human team of reviewers identified several shortcomings that undermine its usefulness for this review. Indeed, when used in the Scopus database, the search string above returns only two articles. The primary shortcoming is that the AI over-constrained the search string by joining too many already-narrow keyword phrases with ‘AND’ operators. For example, including ‘Pacific Island countries’ as part of the search string, without enumerating specific countries or other classifications, will not capture articles that mention only their specific locale. Further, separating the key phrases for benefits and barriers with an ‘AND’ operator will identify only studies that examine both, missing relevant studies that examine one or the other. Relatedly, while the scope of the study is to explore the barriers and benefits of CBFM, we can imagine studies that do not explicitly use these words in the title or abstract, or that use a variety of terminology to describe them, so including key phrases for these concepts at all is overly restrictive. Instead, the human team decided that the literature on CBFM would be small enough that they could afford to capture all the potentially relevant literature on this topic without specifically searching or screening for benefits and barriers at this stage.
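To make the structural issue concrete, the hypothetical fragment below contrasts the ‘AND’-joined benefits and barriers blocks with a looser ‘OR’-joined alternative. The strings and the ‘(...)’ placeholder are illustrative only; they are not the search string ultimately used in this review.

```python
# Illustrative only: these strings are not the search strategy ultimately used in this
# review; they simply contrast the two Boolean structures discussed above.

benefits = "(benefits OR advantages OR opportunities OR strengths OR potential)"
barriers = "(barriers OR challenges OR obstacles OR limitations OR constraints)"

# AI-style structure: every record must contain BOTH a benefits term AND a barriers
# term, so studies that discuss only one of the two concepts are silently dropped.
and_joined = f"(...) AND {benefits} AND {barriers}"

# Looser alternative: a record matches if it mentions EITHER concept.
or_joined = f"(...) AND ({benefits} OR {barriers})"

print(and_joined)
print(or_joined)
```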
The screening criteria proposed by the AI were, with the exception of the ‘2000–2023’ criterion, well aligned with those designed by the human team (Supplement D). However, the human team agreed that greater specificity was required to ensure consistency in what is meant by the concepts ‘Pacific Island countries’ and ‘community-based fisheries management’. They also pointed out that screening based on ‘benefits’ and ‘barriers’ at this stage was premature, given that papers with this kind of information may or may not mention it at the abstract level. Ultimately, the text of the screening criteria that produced the best outcomes when prompting the AI was a compromise between the simplicity of the AI-generated criteria and the complexity of the human team’s (see Section 3.2.1 for quantitative outcomes for each prompt; see Supplement E for screening criteria used in AI prompts).
Despite these shortcomings, the human team found several strengths in this AI-generated approach invaluable. For example, while the AI utilised broad and sometimes tangential keywords (e.g., collaborative), it prompted the human team to reflect on the use of the search term ‘co-management’, which was initially excluded because it was perceived as a concept distinct from CBFM, but which, upon reflection, could be included in the search string because parts of the existing literature use it interchangeably with CBFM. Whilst the AI did not necessarily use the best Boolean operators between each key phrase, in this instance it did have the insight to divide the concept of CBFM into three distinct key phrases: synonyms of (community-based) AND (fisheries) AND (management), which was viewed as an improvement over the human team’s original decision to treat CBFM as a single, distinct concept. Furthermore, literature searches are often heavily influenced and limited by the positionality of the researchers engaging with the topic. This can be particularly problematic for topics that extend across specific and distinct local and cultural contexts, such as CBFM, with the result that important literature may not be included. In this case, we found that AI was able to support the creation of an inclusive search strategy by rapidly providing a list of non-English terms used to describe CBFM-related concepts in the Pacific (e.g., Ra'ui (or Rahui); Supplement C), expanding our original search beyond a narrow, western, academic lens. An initial search of the Scopus database for these terms, coupled with the ‘fisheries’ key phrase above, yielded 82 potential studies not identified in our original search. This AI capability could be used to produce more inclusive search strategies generally and to help identify those with specific relevant local and cultural expertise to advise further. AI does not replace local and contextual knowledge; however, it may provide a draft search string that can be taken to experts and/or actors with a stake in the topic in question.
Screening the articles
The second stage of this study was to evaluate the ability of AI to screen articles against predetermined screening criteria (i.e., inclusion/exclusion). To do so, a list of articles was generated from Scopus using a simplified version of the AI search string, revised by the human team (as described in Methods). Here, we present the results of this two-part screening process: (i) a pilot screen of 100 randomly selected articles from the previous stage and, after discussions to refine the screening criteria and select the best-performing AI implementation, (ii) a screen of all 1098 articles from the previous stage. We used quantitative measures to evaluate the abilities of the LLM in this process, including Kappa statistics to assess inter-rater reliability among the human team and the AI, and counts of disagreements between the human team and the AI (see Methods). When tallying disagreements, we define false positives as articles that the human team rejected and the AI accepted, and false negatives as articles that the human team accepted and the AI rejected.
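As a minimal sketch of how these measures can be computed for a human-versus-AI comparison, consider the function below. It is illustrative rather than the exact analysis code used here (which is available in the GPT-Screening repository linked below); in particular, expressing the rates as a share of all screened articles is an assumption about the denominator.

```python
def screening_agreement(human, ai):
    """Compare two equal-length lists of include/exclude decisions (True = include).

    A false positive is an article the human team rejected but the AI accepted;
    a false negative is one the human team accepted but the AI rejected, following
    the definitions above. Rates are expressed as a share of all screened articles
    (an assumption about the denominator).
    """
    n = len(human)
    tp = sum(h and a for h, a in zip(human, ai))           # both include
    tn = sum(not h and not a for h, a in zip(human, ai))   # both exclude
    fp = sum(not h and a for h, a in zip(human, ai))       # AI-only include
    fn = sum(h and not a for h, a in zip(human, ai))       # human-only include

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_expected = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (p_observed - p_expected) / (1 - p_expected)

    return {"kappa": kappa,
            "false_positive_rate": fp / n,
            "false_negative_rate": fn / n}
```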
Pilot screening
First, we followed common review practice of checking consistency among reviewers by undertaking a ‘pilot screen’ of 100 articles. All five human reviewers and the AI independently screened the same 100 articles at the title-abstract level, determining whether each article should be included or excluded based on the screening criteria developed in the previous stage. Initial agreement within the human team in the pilot screen was ‘near perfect’ (Fleiss’ Kappa = 0.85; Landis & Koch, 1977), with disagreements over 13 articles (https://github.com/s-spillias/GPT-Screening/blob/main/CBFM/Paper-Results/Pilot_Result.xlsx). These were resolved via discussions within the human team, yielding a ‘consensus’ list of 18 included and 82 excluded articles.
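For completeness, a minimal sketch of the Fleiss’ kappa calculation for a panel of raters (such as the five reviewers, or the five reviewers plus the AI) is shown below. The function is illustrative, assumes a simple articles-by-raters layout of binary include/exclude votes, and is not the exact code used for our analysis (available in the repository above).

```python
def fleiss_kappa(votes):
    """Fleiss' kappa for binary include/exclude screening decisions.

    `votes` is a list with one entry per article; each entry is a list of
    per-rater decisions (True = include), e.g. five human reviewers plus the AI.
    """
    n_raters = len(votes[0])
    n_items = len(votes)

    # Per-article category counts: (number of 'include' votes, number of 'exclude' votes).
    counts = [(sum(v), n_raters - sum(v)) for v in votes]

    # Per-article agreement P_i and overall observed agreement P-bar.
    p_i = [(inc * (inc - 1) + exc * (exc - 1)) / (n_raters * (n_raters - 1))
           for inc, exc in counts]
    p_bar = sum(p_i) / n_items

    # Chance agreement from the marginal proportion of 'include' votes.
    p_include = sum(inc for inc, _ in counts) / (n_items * n_raters)
    p_expected = p_include ** 2 + (1 - p_include) ** 2

    return (p_bar - p_expected) / (1 - p_expected)
```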
When the AI was prompted to screen articles using the exact text of the screening criteria used by the human team (Supplement D), there was only fair to moderate agreement with the human team’s consensus list (Cohen’s Kappa = 0.21–0.44), and false negative rates ranged from 7–13% across AI screening strategies. We therefore revised the screening criteria fed to the AI (Supplement E) by simplifying the language, adopting a uniform way of expressing each criterion, and iteratively testing prompts on the studies where the initial human and AI pilot screens conflicted, until inter-rater reliability scores were similar to those found between the individual human raters. This process substantially improved the agreement with the human pilot screen (Cohen’s Kappa = 0.72–0.86) and decreased the rate of false negatives to 1–4% across AI screening strategies. When the AI was included alongside the human reviewers as a sixth collaborator, the inter-rater reliability (Fleiss’ Kappa) for all implementations was greater than 0.8, which represents ‘near perfect’ agreement (Landis & Koch, 1977). We decided the ‘best’ performing AI strategy was the one with the lowest number of false negatives, the ‘Committee with Any Acceptance and Reflection’ (Methods). This strategy omitted only one article that the human team included.
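To illustrate how a committee-style screening strategy with an any-acceptance rule and a reflection step can be wired together, the sketch below uses the OpenAI chat completions client. The prompt wording, model choice, number of committee members, and decision parsing are illustrative placeholders rather than the exact implementation used here, which is available in the GPT-Screening repository.

```python
from openai import OpenAI  # assumes the OpenAI Python client; other LLM APIs work similarly

client = OpenAI()  # expects an API key in the environment


def screen_once(abstract, criteria, model="gpt-4"):
    """One committee member screens a title/abstract, then reflects on its own answer."""
    first_pass = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": (
            f"Screening criteria:\n{criteria}\n\nTitle/abstract:\n{abstract}\n\n"
            "Should this article be included? Answer Include, Exclude, or Maybe, "
            "with a one-sentence reason.")}],
    ).choices[0].message.content

    # Reflection step: ask the same member to reconsider its initial answer.
    final = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": (
            f"You previously answered:\n{first_pass}\n\n"
            f"Re-read the screening criteria:\n{criteria}\n\nand the abstract:\n{abstract}\n\n"
            "On reflection, give a final answer: Include, Exclude, or Maybe.")}],
    ).choices[0].message.content

    # Treat 'Include' and 'Maybe' as acceptance; only an explicit 'Exclude' rejects.
    return "exclude" not in final.lower()


def committee_screen(abstract, criteria, members=3):
    """Any-acceptance rule: the article passes if any committee member accepts it."""
    return any(screen_once(abstract, criteria) for _ in range(members))
```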
Screening all articles
Out of 1098 possible articles, the human team initially identified 100 that met the screening criteria (9%), whereas the best AI implementation identified 157 articles to include. We present the results for this implementation below and refer the reader to the associated repository for the performance of all AI implementations (https://github.com/s-spillias/GPT-Screening/blob/main/CBFM/Paper-Results/All_Result.xlsx). The inter-rater reliability between the best AI screen and the human list was high, albeit not as high as in the pilot screen (Cohen’s kappa = 0.63). As in the pilot screen, the false negative rate was low (1.2%). However, the false positive rate (6.5%) was higher than in the pilot screen.
Cross-referencing this best-performing AI strategy against the human list identified 85 disputed papers (14 false negatives and 71 false positives). Upon further evaluation of these 85 disputed papers, eighteen were judged to be miscategorisations by the human team, with eight articles removed from the human list and ten added, resulting in a collaboratively generated list of 102 titles (Fig. 3). The remaining 67 articles were classified as AI miscategorisations. Of those removed from the initial human list after prompting by the AI, all were deemed unlikely to contain a relevant case study in one of the PICs. Of those added to the initial human list, most contained vague mentions of possibly relevant case studies that the AI highlighted. For example, one abstract states, ‘Drawing on case studies of the Community Conservation Research Network…’, which, along with other mentions of community-based management concepts, was judged sufficient for inclusion. This re-categorisation suggests that the AI can improve reliability at the literature screening stage by highlighting possibly relevant studies that might be missed by an initial human screen.
This re-assignment improved the inter-rater reliability of the best-performing AI screen (Cohen’s kappa = 0.71) and decreased the false negative rate to 0.5% and the false positive rate to 5.6%. When the original human screen was compared with this collaborative list, its false negative rate was 1%, whilst its false positive rate was 0.7%. This suggests that the best-performing AI implementation’s false negative rate was comparable to that of the human team, although the AI’s false positive rate was still much higher. Comparing across the AI implementations, we found a trade-off whereby decreasing the false negative rate came at the cost of increasing the false positive rate, whilst inter-rater reliability was fairly constant across implementations (Fig. 4).
The remaining six AI false negatives were all kept in the collaborative list because of language in their abstracts that hinted at a possibly relevant case study. However, the AI consistently rejected them because they did not explicitly mention at least one of the 14 PICs. For example, one of these disputed abstracts concludes with the phrase ‘as illustrated in this paper with examples of marine commons’. The human team agreed that this was enough to meet the geographic screening criterion, whilst the AI consistently argued that the article should be excluded due to the lack of mention of one of the 14 PICs.
The reasons for disagreement for the remaining AI false positives (n = 61) can be explored in the associated repository (https://github.com/s-spillias/GPT-Screening/blob/main/CBFM/Paper-Results/All_Result.xlsx). Of these, 17 were accepted by every AI implementation. In most cases, these do not appear to represent incorrect assertions by the AI so much as over-generous interpretations. For example, for a paper about aquaculture research and development, when asked if a community-based approach is explored, the AI argues ‘Maybe; The abstract mentions the need to gain active participation from resource users, but it is unclear if this will be a community-based approach’. In general, the most common error was the AI being over-inclusive with respect to the geographic criterion at the screening stage, seeking to include studies that were not based in one of the 14 PICs but were tangentially related because a specific nearby region was mentioned. For example, for a paper based in Cambodia, one AI agent decided upon reflection to ‘maybe’ include the paper because ‘... it is relevant to the broader Southeast Asian region and could provide insights for similar contexts’.