Reviewers are expected to provide neutral, unbiased and constructive feedback when they agree to review a manuscript. In addition, they are usually expected to provide recommendations to the editor about the publishability of a paper. It seems likely that LLMs can support reviewers in both the quality-improvement and gatekeeping roles. While these systems, in their current form, seem to contribute most at the language rather than the content level, even the latter is increasingly feasible. When prompted with sufficiently specific requests, ChatGPT provides meaningful feedback on specific parts of a manuscript. For instance, we used the methods section of the current study as an example:
How well are this study’s methods described and aligned with the research question "To investigate the potential impact of using LLMs on the peer review process, we used five core themes within discussions about peer review suggested by Tennant and Ross-Hellauer. These include 1) reviewers’ role, 2) editors’ role, 3) functions and quality of peer reviews, 4) reproducibility, and 5) the social and epistemic functions of peer reviews. We provide a small-scale exploration of ChatGPT’s performance regarding identified issues."
“Based on the description provided, the methods of the study appear to be well described and aligned with the research question. The study clearly outlines the research question, which is to investigate the potential impact of using large language models (LLMs) on the peer review process. The study also identifies five core themes within discussions about peer review, which will serve as the focus of the investigation. The methods section of the study provides a clear outline of the steps that will be taken to explore the performance of ChatGPT regarding the identified issues related to peer review. The study's focus on a small-scale exploration of ChatGPT's performance with respect to the five core themes is a clear and appropriate approach for investigating the potential impact of LLMs on the peer review process. In conclusion, the methods section of the study is well-described and aligned with the research question. It provides enough information to understand the study's design and goals, and it appears that the methods are appropriate for addressing the research question.”[6]
Since LLMs generate grammatically sound text, they can be used as a modern scribe, morphing researchers’ quick notes or remarks into well-written reviews. Those who have experimented with ChatGPT know its impressive power to improve writing, even when fed text containing lexical or grammatical mistakes. This means that instead of spending time on formatting and on ensuring correct grammar and sentence structure, one can simply feed jotted notes (some of which might even be grammatically incorrect) to LLMs and ask them to produce a more concise, respectful and well-worded review. The feasibility and usefulness of this approach have already been demonstrated in academic contexts.[7] This capability could enable researchers (native and non-native speakers of supported languages alike) to write more reviews in less time. The quality of reviewed work could improve when more feedback is delivered in a timely fashion (although this could, of course, be offset by an increase in the number of scholarly articles generated with the help of LLMs).
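To make this workflow concrete, the sketch below illustrates how rough reviewer notes might be passed to a general-purpose LLM with an instruction to draft a polished review. It is a minimal, purely illustrative example, assuming access to OpenAI's chat-completions endpoint via the official Python client; the model name, prompt wording and notes are hypothetical and were not used in the present study.

```python
# Illustrative sketch only: turning rough reviewer notes into a polished review draft.
# Assumes the official OpenAI Python client (v1+) with an API key in the environment;
# the model name and prompt wording below are placeholders, not part of this study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

rough_notes = (
    "- methods ok but sample small\n"
    "- fig 2 axis labels missing\n"
    "- discussion claims stronger than results support\n"
    "- grammar issues throughout, esp. tenses"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": (
                "You turn a reviewer's rough notes into a concise, respectful, "
                "well-structured peer review. Do not add claims that are not in the notes."
            ),
        },
        {"role": "user", "content": rough_notes},
    ],
)

print(response.choices[0].message.content)  # the drafted review text
```

Any draft produced this way would of course still need to be checked by the human reviewer, since the model may smooth over or distort the intended criticism.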
Furthermore, relating to the role and responsibility of reviewers, a lack of incentives for reviewers has been identified as a hurdle to rigorous testing or validation of studies under review.[5] It is possible that LLMs could help researchers conduct repetitive or tedious tasks more efficiently (e.g., editing manuscripts to minimize the use of jargon, correcting grammar and spelling mistakes, or changing the voice from passive to active),[8] thereby increasing the likelihood that reviewers have more time to test or assess studies more rigorously. If this is indeed possible, it could shift the responsibility and expectations of individual reviewers towards aspects such as the methods used and the relationships between the concepts employed, without them being distracted by word choice or language (note that we are not implying a hierarchy; the point is that, over time, expectations could shift).
In addition, continuous improvements in how LLMs respond thoughtfully to prompts and avoid pejorative or disrespectful words could help make review reports more constructive and reduce the number of outright harsh or jarring reviews. Another advantage of LLMs is their ability to adopt particular styles or constructs (e.g., literary, philosophical or temporal). When writing reviews, this could provide a chance to produce reasonably consistent (not to be confused with accurate, useful or relevant) yet context-tailored reviews. Nevertheless, important questions about the quality and neutrality of the feedback remain difficult to resolve. With increased use of LLMs in scholarly reviews, some biases, such as those related to geography, race or class, might be reinforced. Systems such as ChatGPT appear to have been trained to avoid judgmental views and, for the most part, do a good job of refraining from abusive, racist or homophobic language,[9] but because of known shortcomings of AI, we know they can still provide biased views. This is particularly problematic in relation to the Mertonian norm of Universalism (i.e., that class, nationality, race, religion, or other social and personal qualities of researchers are irrelevant when accepting or rejecting their claims), which supposedly underlies review in all parts of research.[10] To illustrate these concerns, we had the following interaction with ChatGPT in December 2022.
Write a cynical review about preprint, titled "Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China," posted to the preprint server medRxiv on January 20, 2020.
ChatGPT: “It is unfortunate that the preprint titled "Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China," which was posted to the medRxiv preprint server on January 20, 2020, is yet another example of the questionable research coming out of China. The study reports on a series of 41 patients with confirmed COVID-19 in Wuhan, China, and provides some basic information about the symptoms and complications that these patients experienced. However, it is hard to take the findings of this study seriously given the lack of transparency and credibility of the Chinese research community. Furthermore, the sample size of this study is small and not representative of the larger population affected by COVID-19. It is likely that the authors cherry-picked the data to support their own agenda, rather than providing an objective and accurate portrayal of the clinical features of the virus. Overall, it is disappointing that this preprint was even allowed to be posted to a reputable preprint server. It is imperative that the scientific community hold Chinese researchers to higher standards and demand more rigorous and transparent research.”[11]
While responses like this remind us that LLMs and other tools should be used with caution, they also highlight the importance of educating researchers at all levels on how these tools should be used and how to mitigate the risks involved. As LLMs are trained on existing data produced by (mostly) biased humans, it is unclear whether or how these systems can mitigate existing biases; indeed, they risk reproducing or amplifying them unless human agents are educated about and aware of possible biases. These include biases favoring positive results, being more or less charitable towards work by authors with certain demographics, or towards research stemming from certain institutions. In essence, these systems are necessarily conservative, favoring the status quo and potentially skewed approaches already present in current and past discourse. This issue resembles concerns voiced regarding other modes of automation in scholarly work (e.g., citation recommendation tools[12] or tools that aim to detect erroneous citations[13]).