Principal findings
The results of this study suggest that prompt engineering may enhance the accuracy of GPT-4 in answering medical questions. In addition, GPT-4 does not always provide the same answer to the same medical question. The novel ROT prompt outperformed the other prompts in providing professional OA knowledge consistent with clinical guidelines.
Consistency with the guideline
The performance of the different prompts varied. According to the statistical results, ROT performed the most evenly and prominently: for recommendations of 'strong' intensity, ROT was superior to IO, and it was not significantly inferior to the other prompts at the other levels. In contrast, although P-COT's answers at the 'strong' intensity were better than IO's, its performance at the 'limited' intensity level was significantly worse than that of the other prompts.
Development of ROT for medical-related questions
We developed the ROT prompt to mimic a medical consultation involving multiple doctors in a problem-based setting. It may utilize GPT-4's computational power to a greater extent and enhance the robustness of the answers. This is reflected directly in the total word counts of Supplementary Files 4–7, which increased markedly across prompts (45,547 words for IO, 76,592 for 0-COT, 94,853 for P-COT, and 162,329 for ROT). Additionally, the ROT prompt asked GPT-4 to return to its previous thoughts and examine whether they were appropriate, which may improve the reliability and robustness of the answer and reduce AI hallucinations.
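The exact ROT template used in this study is provided in Fig. 5 and the Supplementary Files; the sketch below is only an illustrative approximation of the multi-expert, self-checking structure described above, and the expert roles and step wording are hypothetical rather than the wording used in the study.

# Illustrative sketch only: the actual ROT wording is given in Fig. 5 and the
# Supplementary Files; the roles and step wording below are hypothetical.
ROT_TEMPLATE = """Imagine three orthopaedic experts (A, B and C) are evaluating the
following recommendation for knee osteoarthritis.
Step 1: Each expert independently rates the strength of the recommendation
        (strong / moderate / limited / consensus) and gives a brief rationale.
Step 2: The experts compare their ratings and point out any disagreement.
Step 3: Each expert returns to his or her earlier reasoning and revises it if it
        no longer seems appropriate.
Step 4: The experts discuss until they agree on a single rating.
Return the final agreed rating and the supporting reasoning.

Recommendation to evaluate: {recommendation}"""

prompt = ROT_TEMPLATE.format(
    recommendation="Lateral wedge insoles are not recommended for patients "
                   "with knee osteoarthritis."
)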
Furthermore, the ROT design can minimize the occurrence of egregiously incorrect answers from GPT-4. For instance, consider the 'strong' level recommendation "Lateral wedge insoles are not recommended for patients with knee osteoarthritis." ROT provided four 'strong' answers and one 'moderate' answer across five responses. In the 'moderate' response (Supplementary File 8), two “experts” initially answered "limited" and one “expert” answered "moderate". After the “discussion”, all “experts” agreed on a 'moderate' recommendation, the final reason being that even though there was high-quality evidence to support the advice, there might still be a slight potential benefit for some individuals. Notably, the reasons given by the two “experts” who answered "limited" seem more in line with the statement "Lateral wedge insoles are recommended for patients with knee osteoarthritis." This implies that these two “experts” did not correctly understand the medical advice, as “Expert C” noted in step five: "Observes that the results are somewhat mixed, but there's a general agreement that the benefits, if any, from lateral wedge insoles are limited." However, after the “discussion”, the final revised recommendation and reason were deemed acceptable. Referring to the application of TOT to the 24-point game11, prompts designed in the style of TOT, including the ROT used in this study, can offer more possibilities at every step of the task, aiming to induce GPT-4 to generate more accurate answers.
The statistical results for recommendations of other intensities were mixed. One reason might be that these medical recommendations are themselves controversial, and GPT-4's answers can present various perspectives. For instance, a recommendation from the AAOS rated as "moderate" states: “Arthroscopic partial meniscectomy can be used for the treatment of meniscal tears in patients with concomitant mild to moderate osteoarthritis who have failed physical therapy or other nonsurgical treatments.” An editorial25 has suggested that the supporting evidence for this recommendation is limited and that the phrasing of the recommendation itself is problematic. In its answers across the different prompts, GPT-4 tended to hover between 'limited' and 'moderate', with reasons related to efficacy, the quality of the evidence, and other factors.
Reliability
If we only consider whether the answers generated by GPT-4 are identical each time, the Fleiss' kappa values for each prompt show fair to moderate reliability. This could be related to GPT-4's parameter settings: the ‘temperature’ setting allows GPT-4 to produce flexible text outputs. Moreover, medical questions are not like mathematical problems; the process of logical reasoning is versatile and mutable. Unlike the 24-point game, which can be definitively solved in just three steps, there is no completely determined process for the evidence-based evaluation of a piece of medical advice, as reflected in the variability of the steps given by 0-COT. Therefore, the answers from GPT-4 tend to be more flexible and diverse.
However, when the ordering of the data is considered, Kendall's coefficient of concordance for each prompt shows that GPT-4's answers have a high degree of internal consistency. Combined with the preceding analysis of consistency with the guideline, this implies that for recommendations rated as "strong" by the guideline, GPT-4 might rate them as "moderate", but it is unlikely to rate them as "consensus". In conclusion, even though GPT-4 might not provide the same answer every time, relatively accurate responses can be obtained by asking with an appropriate prompt.
Two previous studies6,7 described reliability only briefly: Yoshiyasu et al.7 only reproduced inaccurate responses, and Walker et al.6 reported that the internal concordance of the provided information was complete (100%). In this study, reliability was investigated by asking GPT-4 the same question five times and analysing the results, which suggest that GPT-4 does not always provide consistent answers to the same medical question.
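To illustrate how these two reliability measures apply to repeated answers, the following sketch computes Fleiss' kappa and Kendall's W for five repeated ratings per recommendation; the data layout, coding, and example values are assumptions for illustration only, not the study data.

# Minimal sketch (assumed data layout): rows are recommendations, columns are the
# five repeated queries, coded as 1 = limited, 2 = moderate, 3 = strong, 4 = consensus.
# The values are invented for illustration and are not the study data.
import numpy as np
from scipy.stats import rankdata
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [3, 3, 3, 2, 3],
    [2, 2, 1, 2, 2],
    [1, 1, 2, 1, 1],
])

# Fleiss' kappa: agreement on the exact category across the five repetitions
counts, _ = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(counts))

# Kendall's W: concordance of the ordering of the recommendations across repetitions
# (tie correction omitted for brevity)
n, m = ratings.shape                                # n items, m repeated queries
ranks = np.apply_along_axis(rankdata, 0, ratings)   # rank items within each repetition
S = ((ranks.sum(axis=1) - m * (n + 1) / 2) ** 2).sum()
W = 12 * S / (m ** 2 * (n ** 3 - n))
print("Kendall's W:", W)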
How to better use GPT-4?
The results of this study suggest that the prompt technique and the number of enquiries significantly influence the quality and reliability of GPT-4's answers to professional medical questions. ROT may be recommended for medical-related questions, especially in-depth ones. It is also recommended to ask GPT-4 the same question several times to obtain more comprehensive answers; one can keep asking GPT-4 the same question until it no longer provides any new information. A template of the ROT prompt is presented and can be modified to answer medical-related questions in various scenarios (Fig. 5).
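Although all experiments in this study were performed in the web interface, the same repeated-query strategy can be scripted; the sketch below assumes access to the OpenAI API, and the model name and question text are placeholders rather than the prompts used in this study.

# Sketch only: this study used the GPT-4 web interface; the model name and question
# below are placeholders, and an API key is assumed in the OPENAI_API_KEY variable.
from openai import OpenAI

client = OpenAI()
question = "How strong is the recommendation for lateral wedge insoles in knee osteoarthritis?"

answers = []
for _ in range(5):  # ask the identical question several times, as recommended above
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
        # the 'temperature' parameter can be fixed here via the API, unlike the web interface
    )
    answers.append(response.choices[0].message.content)

# Stop once a new answer adds no information beyond the previous answers, e.g. by
# comparing the recommendations and reasons extracted from each response.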
Advantages and limitations
This is a pilot study exploring the application of prompt engineering to medical questions and the reliability of GPT-4's answers to such questions. Data collection did not rely on subjective human scoring, so the results objectively reflect the consistency and reliability of GPT-4's responses. However, the study was designed around the expected answers from the guidelines and lacks prospective validation. The study was performed with GPT-4's web interface, so it was not feasible to modify the temperature of GPT-4 to produce more robust answers, which can be done by professional programmers who have acquired an Application Programming Interface (API) key for the LLM. However, this study was designed to mimic GPT-4's performance in routine medical practice and to provide users such as doctors and patients with information on how to improve the quality of the answers they receive. Therefore, all experiments were performed in the web interface.
Implications for future research
Currently, prompt engineering is progressing rapidly in the computer science field, but specific medical research on it is still lacking. Prompt engineering can help doctors learn how to ask better questions so that GPT-4 can give better answers. Additionally, the reliability of GPT-4 in answering medical questions is worth exploring. In future studies, our team will aim to identify appropriate prompts for medical questions and to call the API to change parameters such as the ‘temperature’ to improve reliability. We will also further explore the application of prompt engineering and the reliability of answers in other areas such as diagnosis and medical examination. Meanwhile, it is suggested that future studies examine prompt engineering and question repetition and formulate relevant guidelines.