Through a multicenter study involving multiple reviewers and repeated model outputs, we found that the GPT model can generate coherent and understandable content while preserving the integrity of clinical data. The CDS provided by GPT also reached a level that was generally satisfactory to physicians, and its reliability was not compromised by the randomness inherent in generating multiple outputs. These findings indicate that the GPT model can assist urologists with a substantial share of repetitive tasks such as reading, organizing, and summarizing medical records, thereby notably improving clinical work efficiency.
In comparing the clinical recommendations of the GPT model with those of the NOS urologist, we found that GPT's recommendations may better align with individual patient needs and with prostate cancer diagnostic and treatment guidelines, without significantly increasing the risk of patient harm. This result suggests that the GPT model may empower non-specialist urologists in clinical decision-making for prostate diseases. Furthermore, the CDS provided by GPT includes not only recommendations but also analysis and summarization of clinical data, which substantially reduces the difficulty patients face in understanding their disease status and personal condition, thus enhancing patient engagement and compliance during diagnosis and treatment. Meanwhile, the NOS urologist had a significantly higher proportion of cases rated 3 points than the GPT model in both the medical harm and factual consistency assessments (GPT vs NOS urologist, medical harm: 2.8% vs 10.5%; factual consistency: 0% vs 2.9%, p < 0.05). Feedback from the three reviewers indicated that the NOS urologist tended to provide more ambiguous clinical recommendations, which required further communication with patients before definitive clinical advice could be reached; this more closely reflects the realities of medical practice. Reviewers noted that assessing the risk of medical harm and the factual consistency of such cases was difficult, whereas the suggestions provided by GPT were clearer and more precise. Consequently, the proportion of cases rated 3 points differed significantly between GPT and the NOS urologist.
In a few cases, reviewers found that the clinical recommendations provided by GPT lacked comprehensiveness (rated below 3 points, 5.7%), offering only the single most reasonable clinical recommendation without considering other options; even so, the final recommendations made by clinicians may still align with those suggested by the GPT model. Reviewers also reported occasional issues with the factual consistency of GPT's clinical recommendations, in which the preferred and alternative clinical advice suggested by GPT were sometimes reversed. While medical experts can easily identify such errors, they may confuse patients and affect their judgment of their condition. Although these discrepancies are likely to be resolved through subsequent communication between clinicians and patients, they may require clinicians to spend more time addressing patient concerns.
Moreover, some reviewers expressed concerns about potential medical harm (rated above 3 points, 12.4%) associated with the clinical recommendations provided by GPT. This was mainly observed in cases where the clinical advice was overly conservative and could overlook a patient's need for further investigation or treatment. We speculate that this may be related to regional differences in the epidemiology of prostate cancer. Statistical data show that, unlike in Western countries, where many patients are diagnosed with localized or even clinically insignificant prostate cancer at the initial consultation, China has a relatively low incidence of prostate cancer but a high mortality rate, with a high proportion of patients already at an advanced stage at the initial consultation [12]. Therefore, Chinese urologists may be more inclined to perform biopsies on patients with PSA levels in the gray zone and inconclusive MRI findings. Furthermore, the marked disparity between studies published in European and American countries and those from Asian regions may have contributed to the deviations in GPT's clinical recommendations observed during review, as the GPT models were trained on these data during pre-training and fine-tuning. These potential risks of medical harm can be largely mitigated by having urology specialists make the final clinical decisions.
Of course, the clinical application of GPT models still faces many challenges and risks. The first is data privacy. In this study, we accessed only ChatGPT and the GPT-3.5 model for conversations; these interactions did not retain conversation content and did not involve tasks such as fine-tuning that require uploading data. Furthermore, all conversation content was anonymized, protecting patient privacy. The second is model hallucination, that is, the generation of inaccurate or unfounded information during text generation, which may degrade the quality of the model's output; however, no such cases were observed in our study. Finally, there are ethical issues. As repeatedly emphasized, the clinical recommendations provided by GPT models must remain transparent and optional, and the final decision should be made by clinicians after thorough communication with patients.
Lastly, this study has several limitations. (1) We did not use the GPT-4 model, mainly because GPT-4 is not a fully open and free platform, which could make it difficult for patients to obtain clinical recommendations through the GPT model. Although GPT-4 has demonstrated powerful content-generation capabilities, we chose to base the study on the GPT-3.5 model for reasons of usability and accessibility; GPT-4 will be included in future studies. (2) All evaluation indicators were assessed subjectively by reviewers. Although the use of multiple reviewers and multiple outputs reduces the impact of subjective evaluation, the assessment of the clinical practicality of GPT models may still be incomplete, and more objective indicators and larger sample sizes are needed. (3) Only one NOS urologist participated in this study, so whether GPT models truly empower non-specialist clinicians requires confirmation in more extensive research.
In summary, the GPT model demonstrates a clear supportive role and great potential in CDS for patients with prostate diseases. Its application in clinical settings may reduce clinicians' workload, although its empowering effect on urologists remains to be confirmed. This study also identified potential issues and limitations of the GPT model, indicating that further research is needed before widespread clinical application. Regardless, human medical experts must continue to serve as the ultimate line of defense for patient health.