Study Design
We conducted this cross-sectional study at the Melbourne Sexual Health Centre (MSHC), Australia's largest public sexual health clinic. We compared the performance of three AI chatbots against experienced sexual health nurses in responding to sexual health inquiries. We used anonymised real-world questions from callers to the MSHC, collected during routine telephone inquiries. In compliance with the Victorian Department of Health guidelines on AI use in healthcare (23), we did not test the chatbots directly with clients. Instead, we designed a method that maintained standard sexual health service delivery (i.e., a telephone conversation with a nurse) while reflecting real-life queries.
AI Chatbots and Prompt Tuning
We configured and evaluated three AI chatbots for this study: "Alice", a custom GPT-3.5-Turbo chatbot built on chatbotbuilder.io (17); "Azure", a custom GPT-3.5 chatbot implemented on Microsoft Azure; and "ChatGPT", the unmodified OpenAI GPT-3.5 model. We named the chatbots after their development platforms or origins and use these names consistently throughout the manuscript.
Table 1 provides a detailed comparison of the chatbots' features and settings. For Alice and Azure, we employed prompt engineering to create customised chatbots (24). This involved developing a custom set of instructions (a "prompt") and incorporating a specialised database of information (a "knowledge base") drawn from publicly available information on the MSHC website (16) and the Australian guidelines for the management of sexually transmitted infections (STIs) (15). This process, which we refer to as "prompt-tuning" in this study, involves designing and refining text-based prompts to optimise and tailor the chatbots' responses to sexual health queries (an illustrative sketch follows Table 1). While we did not modify the underlying AI models, we used these custom prompts and the knowledge base to guide the chatbots in providing specialised sexual health information.
Table 1
Comparison of Artificial Intelligence (AI) Chatbots' Features and Settings
Feature | Alice (Custom GPT-3.5-Turbo) | Azure Chatbot (Custom GPT-3.5) | Base ChatGPT (OpenAI GPT-3.5) |
Platform | chatbotbuilder.io | Microsoft Azure | OpenAI |
Model | GPT-3.5-Turbo 16K | GPT-3.5 | GPT-3.5 |
Prompt Tuning | Yes | Yes | No |
Specialisation | Publicly available information (MSHC website and Australian guidelines for the management of sexually transmitted infections) | Publicly available information (MSHC website and Australian guidelines for the management of sexually transmitted infections) | Default OpenAI training data |
Temperature* | 0.5 | 0.5 | Default |
Maximum Tokens** | 200 | 200 | Default |
Data Privacy | No patient data access or storage | No patient data access or storage | No patient data access or storage |
Response Limitation | None | Unable to answer questions beyond the provided information | None |
*Temperature: A parameter that controls the randomness of the AI's outputs. A lower value (closer to 0) makes the output more focused and deterministic, while a higher value (closer to 1) makes it more diverse and creative. |
**Maximum Tokens: The maximum number of words or word pieces (tokens) the AI is allowed to generate in a single response. This limits the length of the AI's answers. |
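To illustrate the prompt-tuning approach described above, the sketch below shows how role instructions might be combined with knowledge-base text. The wording, variable names, and file name are hypothetical illustrations; the study's actual prompt is not reproduced here.

```python
# Hypothetical illustration of the prompt structure (not the study's exact prompt).
# The knowledge base would hold text from the MSHC website and the Australian
# STI management guidelines; "mshc_knowledge_base.txt" is an assumed file name.
with open("mshc_knowledge_base.txt", encoding="utf-8") as f:
    KNOWLEDGE_BASE = f.read()

SYSTEM_PROMPT = (
    "You are a sexual health information assistant for the Melbourne Sexual "
    "Health Centre (MSHC). Answer questions about sexual health and MSHC "
    "services using only the reference material below. If a question falls "
    "outside that material, say you cannot answer and suggest contacting the "
    "clinic.\n\nReference material:\n" + KNOWLEDGE_BASE
)
```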
The development and refinement of Alice, including iterative testing and adjustments, took approximately four weeks of dedicated effort from one author (PL). We initially implemented Alice on chatbotbuilder.io (17) and later applied the same prompt and knowledge base to create Azure on Microsoft Azure (18). Both Alice and Azure used a temperature setting of 0.5 (which controls the randomness of the AI's responses) and a maximum token limit of 200 per response (limiting the length of answers). For the Azure chatbot, we configured the settings to restrict responses to information derived solely from the provided prompt and knowledge base; as a result, the chatbot acknowledged its inability to answer questions beyond the scope of that information. Such a restrictive setting was not available on the chatbotbuilder.io platform used for Alice. ChatGPT functioned as a control, using default OpenAI settings without a customised prompt, representing a standard AI chatbot without specific sexual health training. All chatbots operated without access to or storage of patient data, ensuring privacy compliance.
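For readers unfamiliar with these settings, the following is a minimal sketch of how such a configuration could be expressed with the OpenAI Python SDK. The study used the chatbotbuilder.io and Microsoft Azure interfaces rather than this code; the function name and structure are our assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_chatbot(question: str) -> str:
    """Answer one question in a fresh session (no memory of earlier questions)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        temperature=0.5,  # study setting: more focused, less random output
        max_tokens=200,   # study setting: caps the length of each answer
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # from the sketch above
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```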
Collecting the Sexual Health Queries
Between January and April 2024, we gathered anonymised questions and responses from calls to the MSHC phone line. To reduce recall bias, sexual health nurses at MSHC documented summaries of clients' questions and responses immediately after each routine telephone consultation. These summaries, recorded through a Microsoft Form, included the clients’ questions and the nurses’ answers while carefully excluding all identifying information.
We gathered a total of 200 question-answer pairs over the four-month period, a sample size chosen to give a margin of error of approximately ±7% at a 95% confidence level, balancing statistical precision with feasibility of data collection. The collected questions covered a range of sexual health topics, including STI symptoms and testing, contraception methods, clinic services and hours, and general sexual health advice.
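For reference, the reported figure follows from the standard large-sample margin-of-error formula under the most conservative assumption p = 0.5 and z = 1.96 (our assumptions; the study reports only the ±7% figure):

$$\text{MOE} = z\sqrt{\frac{p(1-p)}{n}} = 1.96\sqrt{\frac{0.5 \times 0.5}{200}} \approx 0.069 \approx \pm 7\%$$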
Preparing and Processing the Data
Prior to analysis, two researchers (PL and NS) verified that all summaries were free of identifiable information. We then input the summarised questions into each of the three AI chatbots: Alice, Azure, and ChatGPT. To prevent context bias, we entered each question in a new session for each chatbot. This process yielded 600 chatbot responses (200 per chatbot), in addition to the 200 answers provided by nurses.
Expert Evaluation and Consensus Process
After data collection, we assembled a panel of three experts, all sexual health physicians and researchers, to evaluate the quality and accuracy of the responses between June and July 2024. The reviewers (EA, KH, CKF) had between 7 and 25 years of experience in sexual health medicine, with extensive expertise in caring for patients seen at the study clinic. Prior to the evaluation, the research team and reviewers collaboratively defined and clarified the meaning of five key outcome measures: guidance, accuracy, safety, ease of understanding, and provision of necessary information only. This ensured a consistent interpretation of the criteria throughout the evaluation process (see Table 2).
Table 2
Outcome Indicator Definitions
Outcome Measure | Definition |
Overall Correctness | Assesses the overall accuracy and appropriateness of the response, considering both factual correctness and suitability for the given question. |
Guidance | Assesses whether the patient will take the appropriate action or make the right decision after reading the response. |
Accuracy | Assesses the correctness of the information provided in the response. |
Safety | Assesses the potential risk of harm to the patient if they follow the advice given in the response. This includes the potential for incorrect advice to conflict with guidance from health care providers. |
Ease of Understanding | Assesses the clarity and readability of the response for a general audience. |
Provision of Necessary Information Only | Assesses whether the response provides concise, relevant information without including unnecessary details that could deter the patient from using the chatbot. |
Note: The outcome measures are considered independent of each other. For example, a response can have excellent ease of understanding while being inaccurate.
PL developed a Qualtrics survey and conducted a pilot test with five questions and their corresponding answers. This pilot allowed reviewers to familiarise themselves with the rating process, apply the agreed-upon definitions, and estimate the time required for the full review. In the survey, we labelled nurses' summaries as 'Nurses' and assigned anonymous identifiers to the three chatbot responses. To minimise bias, we designed a partially blinded review process: because the nurse's summary had a distinctive appearance (in note format rather than verbatim), we consistently presented it first for each question set, while the three chatbot responses followed in randomised order, enabling blinded comparisons between the AI chatbots.
Following the pilot test, the team held a consensus meeting to discuss score discrepancies, explain reasoning based on established definitions, and reach a unified judgment. Each reviewer then independently evaluated the remaining 195 questions and answers using the agreed-upon outcome measures and rating scale.
Following individual evaluations, we conducted a final consensus process. We categorised responses into binary classifications for correctness and the five outcome measures. We identified cases where two reviewers agreed, but the third differed. In a consensus meeting, reviewers discussed these discrepancies and worked towards a unified assessment. This process ensured evaluation consistency while preserving the integrity of initial judgments.
Statistical Analysis
We used Stata (version 17; StataCorp) for all data analyses in this study. To describe response lengths, we calculated the median and interquartile range (IQR) of word counts for the chatbots' and the nurses' responses.
For all responses, we categorised overall correctness as "correct" (combining "correct" and "mostly correct" ratings) or "incorrect" (combining "partially correct" and "incorrect" ratings). For the five outcome measures (guidance, accuracy, safety, ease of understanding, and provision of necessary information only), we classified responses as "acceptable or better" (including "acceptable", "good", and "excellent" ratings) or "unacceptable" (including "poor" and "very poor" ratings).
We then calculated the proportion of correct or acceptable responses for each chatbot and for the nurses, using the questions with correct nurse responses as the benchmark. We compared these proportions using chi-square tests, considering p-values less than 0.05 statistically significant, and calculated 95% confidence intervals for all proportions.
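The analysis was performed in Stata; purely as an illustration of the same comparison in open tooling, a Python sketch might look like the following. The counts are invented for the example, and the Wilson interval is our choice; neither reflects the study's data or Stata's exact defaults.

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportion_confint

n = 200  # benchmark questions (illustrative; the study's benchmark was
         # the subset of questions with correct nurse responses)
correct = {"Nurses": 190, "Alice": 170, "Azure": 160, "ChatGPT": 150}  # invented

# Proportion of correct responses with a 95% confidence interval
for name, k in correct.items():
    lo, hi = proportion_confint(k, n, alpha=0.05, method="wilson")
    print(f"{name}: {k / n:.1%} correct (95% CI {lo:.1%}-{hi:.1%})")

# Chi-square test comparing one chatbot's proportion against the nurses'
table = np.array([[correct["Alice"], n - correct["Alice"]],
                  [correct["Nurses"], n - correct["Nurses"]]])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # p < 0.05 treated as significant
```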
We conducted subgroup analyses by stratifying questions into "general sexual health questions" and "clinic-specific questions" and repeating our performance comparisons for each subgroup. To account for differences in chatbot configurations, we also performed a sensitivity analysis to assess the impact of Azure's restricted configuration on our overall findings and to ensure a fair comparison across all chatbots: we reran our main analyses after excluding the questions that Azure could not answer because of its restriction to the provided prompt and knowledge base.
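In practice, the sensitivity analysis amounts to dropping Azure's knowledge-base refusals and repeating the comparisons above; a minimal pandas sketch, with a hypothetical file and column name:

```python
import pandas as pd

df = pd.read_csv("rated_responses.csv")  # hypothetical export of the ratings
subset = df[~df["azure_declined"]]       # drop questions Azure could not answer
# ...repeat the proportion comparisons above on `subset`
```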
Ethical Considerations
This study received ethical approval from the Alfred Hospital Ethics Committee in Melbourne, Australia (project number: 555/23). The committee waived the need for informed consent on the basis that the study used de-identified, routinely collected clinical data. All research procedures adhered to the committee's advice and Australian ethical standards for clinical research. No identifiable information was collected or stored during the study process.