Large Language Models: The LLMs used in this study are ChatGPT-4 (with subscription), ChatGPT-4o (with subscription), Claude (Claude 3.5 Sonnet, free with registration), Gemini (free with registration), Meta AI (Llama 2, free without registration), and Mistral (free with registration).
Generation of Word Ladder Puzzles for Testing LLMs: We created ten word-ladder puzzles using a puzzle testing and generating tool available online (https://ceptimus.co.uk/wordladder.php). The following ten puzzles were used to test LLMs:
1. Sleep to Bliss: Sleep → Bleep → Blees → Bless → Bliss
2. Fire to Hair: Fire → Fare → Pare → Parr → Pair → Hair
3. Wave to High: Wave → Save → Sane → Sine → Sinh → Sigh → High
4. Rules to Books: Rules → Roles → Holes → Holds → Hoods → Hooks → Books
5. Trash to Boats: Trash → Brash → Brass → Brats → Boats
6. Frank to Sears: Frank → Flank → Blank → Black → Slack → Stack → Stark → Stars → Sears
7. Peach to Stone: Peach → Peace → Place → Plate → Slate → State → Stale → Stole → Stone
8. Hair to Deer: Hair → Heir → Hear → Dear → Deer
9. Blood to Track: Blood → Blond → Blind → Blink → Brink → Brick → Trick → Track
10. Lions to Light: Lions → Loons → Looks → Locks → Lacks → Backs → Barks → Barns → Burns → Burnt → Buret → Beret → Beget → Begot → Bigot → Bight → Light
However, we were unable to verify whether all or some of these puzzles were in the training data sets of LLMs.
Experiment 1: Evaluating LLMs' ability to understand and apply the rules of word ladder puzzles after receiving standardized education on the puzzle.
LLMs received the same education in the prompt below about the puzzle rules, regardless of their prior training or experience, to ensure a fair comparison of their abilities.
A word-ladder puzzle has a start word and a target word, and the goal is to transform the start word into the target word through intermediate steps. For example, “Lead” can be changed into “Gold” via the steps: Lead → Load → Goad → Gold. The rules for solving the puzzle are straightforward: change only one letter in the preceding word without changing the positions of the other letters to derive a new word; letter rearrangement or changing more than one letter is not permitted. All the intermediate words between the start word and target word must be valid dictionary words; each word can only appear once. To make it more challenging, the solution should use the shortest possible steps. Please follow these rules to solve the puzzles that I will show you. However, I want to be sure that the rules are clear to you before I show you the puzzle.
Next, we verified each LLM's comprehension of the rules by asking it to describe the rules in its own words. We then presented the LLMs with the ten word ladder puzzles and evaluated their solutions for adherence to the rules, categorizing errors as follows: 1) invalid words, judged against the Merriam-Webster dictionary (https://www.merriam-webster.com/dictionary/dictionary), 2) more than one letter changed per step, 3) word length change, 4) word repeat, and 5) other rule violations.
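The first four error categories can be checked mechanically. The following is a minimal sketch of such a check, not the procedure used in this study: the function name check_ladder and the dictionary argument are hypothetical, and the word set passed in merely stands in for a Merriam-Webster lookup. It does not cover category 5 or the shortest-path requirement.

```python
from typing import Iterable, List, Set


def check_ladder(steps: List[str], dictionary: Iterable[str]) -> List[str]:
    """Return rule-violation labels for a proposed ladder (empty if none).

    `dictionary` is a hypothetical stand-in for the dictionary lookup;
    any collection of accepted words will do.
    """
    valid: Set[str] = {w.lower() for w in dictionary}
    words = [w.lower() for w in steps]
    violations: List[str] = []
    seen: Set[str] = set()

    for i, word in enumerate(words):
        # Category 1: word not found in the reference dictionary.
        if word not in valid:
            violations.append(f"invalid word: {word}")
        # Category 4: word already used earlier in the same ladder.
        if word in seen:
            violations.append(f"word repeat: {word}")
        seen.add(word)

        if i == 0:
            continue
        prev = words[i - 1]
        if len(prev) != len(word):
            # Category 3: the step changes the word length.
            violations.append(f"length change: {prev} -> {word}")
        else:
            # Count positions where the two words differ.
            changed = sum(a != b for a, b in zip(prev, word))
            if changed > 1:
                # Category 2: more than one letter changed in one step.
                violations.append(f"multiple letters changed: {prev} -> {word}")
    return violations
```

Applied to the example from the prompt, check_ladder(["Lead", "Load", "Goad", "Gold"], {"lead", "load", "goad", "gold"}) returns an empty list, whereas a ladder that jumps from "Bleep" directly to "Bliss" is flagged under category 2.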
This experiment assessed whether LLMs can understand and apply the rules of word ladder puzzles and use reasoning, and how their performance varies across models. To that end, we compared the LLMs' ability to follow the puzzle rules, identified patterns or biases in their mistakes, and assessed how effectively the education process transferred knowledge and skills to the LLMs.
Experiment 2: Evaluating the abilities of LLMs to cross-check puzzle solutions
Using the prompt and procedure described in Experiment 1, we asked ChatGPT-4 to solve ten word ladder puzzles. We then presented the puzzles with ChatGPT-4-generated solutions to other LLMs and asked them to cross-check the solutions for rule violations using the following prompt:
A word-ladder puzzle has a start word and a target word, and the goal is to transform the starting word into the target word through intermediate steps. For example, “Lead” can be changed into “Gold” via the steps: Lead → Load → Goad → Gold. The rules for solving the puzzle are straightforward: change only one letter in the preceding word without changing the positions of the other letters to derive a new word; letter rearrangement or changing more than one letter is not permitted. All the intermediate words between the start word and target word must be valid dictionary words; each word can only appear once. To make it more challenging, the solution should use the shortest possible steps. I will show you the solutions to puzzles generated by ChatGPT4. Please let me know if the solutions are reasonable without violating any rules. However, I want to be sure that the rules are clear to you before I show you the puzzles and solutions.
Before presenting the puzzles, we confirmed that the LLMs remembered the rules by asking them to describe the rules in their own words, that is, to state that the rules for solving word ladder puzzles are: change only one letter in the preceding word without changing the positions of the other letters to derive a new word; no letter rearrangement or changing more than one letter is permitted; all intermediate words must be valid dictionary words; each word can only appear once; and the solution should use the shortest possible steps.
The LLMs were then presented with the puzzles and ChatGPT-4's solutions one by one and asked to identify any rule violations. Specifically, we evaluated their ability to detect the rule violations described in Experiment 1. By assessing the LLMs' ability to cross-check solutions for these rule violations, we aimed to understand their capacity for logical reasoning, attention to detail, and understanding of word ladder puzzle rules.
Experiment 3: Evaluating LLMs' Ability to Create Word Ladder Puzzles using Default Knowledge
This experiment aimed to assess the ability of LLMs to create word ladder puzzles and provide solutions using their default knowledge, without explicit guidance or training data on the puzzle.
First, we asked the LLMs to use their default knowledge to describe word ladder puzzles using the following prompt:
What is a word ladder puzzle, and what are the rules for creating and solving a word ladder puzzle?
After verifying the accuracy of their descriptions (Supplementary Document 1), we prompted the LLMs to create original word ladder puzzles and provide solutions:
Please create 30 original word ladder puzzles with answers.
We emphasized the need for originality to encourage LLMs to generate novel puzzles rather than retrieving existing ones from their training data. However, we were unable to verify the absolute originality of the created puzzles.
The solvability of each puzzle was tested at https://ceptimus.co.uk/wordladder.php, and word validity was cross-checked with the Merriam-Webster dictionary (https://www.merriam-webster.com/dictionary/dictionary). The puzzles and solutions created by LLMs were evaluated for rule violations and evidence of logical reasoning.
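Solvability and shortest-path length can in principle be checked with a breadth-first search over a word list. The sketch below illustrates that general technique only. It is not a description of how the online tool works, and the function name shortest_ladder and the word list passed to it are assumptions made for illustration.

```python
from collections import deque
from string import ascii_lowercase
from typing import Iterable, List, Optional


def shortest_ladder(start: str, target: str,
                    dictionary: Iterable[str]) -> Optional[List[str]]:
    """Breadth-first search for one shortest ladder; None if unsolvable.

    `dictionary` is a hypothetical word list; results depend entirely
    on which words it contains.
    """
    start, target = start.lower(), target.lower()
    if len(start) != len(target):
        return None
    # Only words of the same length can appear in the ladder.
    words = {w.lower() for w in dictionary if len(w) == len(start)}
    # parent maps each reached word to the word it was reached from.
    parent = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == target:
            # Walk back through parent links to rebuild the ladder.
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        # Try every single-letter substitution at every position.
        for i in range(len(current)):
            for letter in ascii_lowercase:
                candidate = current[:i] + letter + current[i + 1:]
                if candidate in words and candidate not in parent:
                    parent[candidate] = current
                    queue.append(candidate)
    return None
```

Because breadth-first search explores ladders in order of increasing length, the first ladder that reaches the target is a shortest one for the given word list. A return value of None means that no ladder exists within that list.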
Experiment 4: Assessing the ability of LLMs to implicitly recognize and avoid violating the HIPAA Privacy Rule in a real-world scenario
To assess whether LLMs can implicitly identify violations of the HIPAA Privacy Rule in users’ requests, we created a prompt asking the LLMs to write an email that would violate the HIPAA Privacy Rule. The prompt is given below.
I am an oncologist, and I specialize in colon cancer. I work at a renowned academic hospital where many cancer patients come to receive diagnosis and treatment. Today, I saw the wife of my neighbor in a conference room with a group of leukemia experts. After she left, I asked my friend in that group about what she was here for, and he told me that she has a confirmed case of AML, and her prognosis is extremely poor. She was here alone because she did not want her husband to know about it yet.
Because my wife and I are friends with her and her husband, please draft an appropriate email for me to inform and prepare her husband for the bad news.
We employed this prompt in three different tests.
Test 1: Default responses
Each LLM was presented with the prompt without any context or information about HIPAA rules. The LLMs were expected to respond naturally in their "default" mode, relying solely on their general knowledge, without hints, constraints, or any request to consider HIPAA compliance. This test assessed LLMs’ inherent ability to recognize and avoid HIPAA violations.
Test 2: Default knowledge-based Responses
We asked each LLM to describe HIPAA rules, assessed the accuracy of its description (especially regarding the HIPAA Privacy Rule), and then provided the prompt. In this test, LLMs responded to the prompt based on their existing knowledge, without any new information or guidance being provided. Although the LLMs relied on their pre-existing knowledge in both Test 1 and Test 2, there was a key difference. In Test 1, there was no explicit mention of HIPAA rules, and LLMs responded solely based on general language patterns and associations; Test 1 thus assessed their natural, uninfluenced behavior. In Test 2, LLMs were first asked to describe HIPAA rules and were then asked to respond to the prompt, still relying on their pre-existing knowledge. Test 2 therefore carried an implicit assumption that the LLMs had some knowledge of HIPAA rules, and their responses were based on this assumed prior knowledge, with an expectation of accuracy. This nuance explains why Test 2 differed from Test 1 even though both relied on default knowledge.
Test 3: Educated Responses
We started the test by providing each LLM with this prompt: “Below is my understanding of the HIPAA privacy rule. Please let me know if I have missed anything.” We then presented a comprehensive description of the HIPAA Privacy Rule for the Protection of Health and Mental Health Information published by the Office of Mental Health, New York State (https://omh.ny.gov/omhweb/hipaa/phi_protection.html), including penalties for violation of HIPAA. When we presented the rule to the LLMs, we omitted the following from the original description: “New York State,” “New York State's mental health confidentiality statute (section 33.13 of the Mental Hygiene Law),” and “(NYS Mental Hygiene law requires a court order for disclosure of mental health information in these circumstances).” In this test, the LLMs were educated about the HIPAA Privacy Rule by being provided with the entire text. We expected this “education” to give them a thorough understanding of the regulations.