As evaluated by practicing IR physicians, ChatGPT-4 generated patient consent information that was accurate, written in a conversational tone, and understandable to a typical patient. The majority of physicians agreed on each of these points, all of which are critical to the informed consent process. Prior studies have evaluated the utility of both ChatGPT-3 and ChatGPT-4 in patient consent, finding that both models accurately answered patient inquiries regarding IR procedures [8]. However, the Flesch-Kincaid Grade Level of the ChatGPT-4 outputs (11.65) was well above the recommended 8th-grade level, in concordance with previous studies examining the readability of the earlier ChatGPT-3 model [4]. The similarity between the two models highlights that updates to ChatGPT do not guarantee improvement in all domains. Given the shifting scope of LLMs like ChatGPT, ongoing reassessment is needed to evaluate the strengths and pitfalls of this clinically applicable tool.
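The readability claim above rests on the Flesch-Kincaid Grade Level, which is computed from word, sentence, and syllable counts. A minimal sketch of the formula follows; the counts in the example are illustrative only, not taken from the study outputs:

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid Grade Level: approximates the US school grade
    needed to comprehend the text (recommended patient-facing target: ~8)."""
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)

# Illustrative counts: 100 words, 5 sentences, 150 syllables
grade = flesch_kincaid_grade(100, 5, 150)
```

Longer sentences and more syllables per word both raise the grade level, which is why procedure descriptions rich in polysyllabic medical terminology tend to score well above the 8th-grade target.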
Furthermore, periodic assessment of how physicians view this tool is critical as LLMs evolve and both the capabilities of, and opinions on, the technology change. The surveyed physicians highlighted a potential pitfall of ChatGPT-4 as a clinical tool: a limited ability to provide a comprehensive explanation of an IR procedure. Nearly one-third of physicians reported that the ChatGPT-4 output did not comprehensively explain the procedure's risks, benefits, and alternatives. While ChatGPT-4's outputs were rated as accurate, a significant portion of physicians found the information insufficient to support a patient's informed consent, a finding similarly observed with the previous iteration of ChatGPT [9]. This limitation may explain why one-third of all surveyed physicians (33%) reported that they were not comfortable providing the outputs to their patients. These data highlight the need for physician supervision and verification of medical outputs from the current version of the model.
While physician verification of ChatGPT outputs may seem like the optimal path forward, a lack of agreement across physicians may limit this approach. The intraclass correlation coefficient (0.39), measured across physician ratings of all five procedures, demonstrated poor interrater agreement. The surveyed physicians were unable to agree in their evaluations of ChatGPT outputs, highlighting the subjective manner in which physicians view medical ChatGPT outputs. This poor interrater reliability exemplifies an obstacle that must be addressed in the implementation and future deployment of LLM technology.
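Interrater agreement of this kind is commonly quantified with a two-way intraclass correlation. A minimal sketch of ICC(2,1) (two-way random effects, absolute agreement, single rater) computed from the two-way ANOVA mean squares is shown below; the rating matrices are illustrative, not the study data:

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1) from the two-way ANOVA mean squares.
    ratings: n subjects (rows) x k raters (columns)."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means
    ms_rows = k * ((row_means - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((col_means - grand) ** 2).sum() / (k - 1)
    resid = ratings - row_means[:, None] - col_means[None, :] + grand
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Illustrative matrices (4 subjects x 2 raters), not the study data:
perfect = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
shifted = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]])
```

Because ICC(2,1) measures absolute agreement, even a consistent between-rater offset (as in `shifted`) lowers the coefficient below 1, which is why systematically stricter or more lenient raters drive values toward the poor-agreement range reported here.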
In analyzing the physician demographics, the linear regression revealed an inverse relationship between years in practice and average output score: the fewer years a physician had been in practice, the higher they rated the ChatGPT output, and conversely, the more years in practice, the lower the rating. This relationship has been observed in other studies surveying the general public, which found that people under 50 years old are more likely than those over 50 to find ChatGPT highly useful [10]. With technology becoming a more integral part of medicine, it is important to convey the current state of LLM technology to the IR community. IR is a specialty built on innovation. From its beginning, IR has been a field of rapid adopters who continuously iterate and develop more advanced ways to improve patient care by constantly pushing the envelope of what is possible. As a specialty that resides at the forefront of innovation and at the intersection of medicine and technology, IR is well positioned to capitalize on this rapidly evolving technology.
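The regression described above amounts to an ordinary least-squares fit of average rating against years in practice, with an inverse relationship appearing as a negative slope. A minimal sketch with hypothetical numbers (illustrative only, not the study's measurements):

```python
import numpy as np

# Hypothetical data: years in practice vs. average rating of ChatGPT-4 outputs
years = np.array([2.0, 5.0, 8.0, 12.0, 18.0, 25.0])
ratings = 4.5 - 0.05 * years  # illustrative inverse trend

# Fit rating = slope * years + intercept by ordinary least squares
slope, intercept = np.polyfit(years, ratings, 1)
# slope < 0 indicates ratings decline as years in practice increase
```

In practice the fitted slope would also be reported with a confidence interval or p-value to establish that the inverse trend is statistically significant rather than noise.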
This study was limited in its design in that the physicians surveyed were not blinded to the author of the outputs (ChatGPT). Raters may therefore have been biased, positively or negatively, by their preconceived views of ChatGPT and artificial intelligence. Furthermore, the twenty-one physicians, all from academic institutions, are unlikely to fully represent all practicing interventional radiologists. A larger study is needed to capture a more representative assessment of the model through the lens of the general IR physician. Still, the present study provides a strong evaluation of the current state of ChatGPT-4 against which future iterations of the model can be assessed. Finally, our input prompt to ChatGPT was not validated as comprehensive, so patients may use other wording when interfacing with LLMs that produces information not captured by our query.