Beyond Text Generation: Assessing Large Language Models' Ability to Follow Rules and Reason Logically

doi:10.21203/rs.3.rs-5084169/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

The growing interest in advanced large language models (LLMs) has sparked debate about how best to use them to enhance human productivity, including teaching and learning outcomes. However, a neglected issue in the debate concerning the applications of LLMs is whether these chatbots can follow strict rules and use reason to solve problems in novel contexts. To address this knowledge gap, we investigate the ability of five LLMs (ChatGPT-4o, Claude, Gemini, Meta AI, and Mistral) to solve and create word ladder puzzles to assess their rule-adherence and logical reasoning capabilities. Our two-phase methodology involves: 1) explicit instruction and word ladder puzzle-solving tasks to evaluate rule understanding, followed by 2) assessing LLMs' ability to create and solve word ladder puzzles while adhering to rules. Additionally, we test their ability to implicitly recognize and avoid HIPAA privacy rule violations in a real-world scenario. Our findings reveal that while LLMs can articulate the rules of word ladder puzzles and generate examples, they systematically fail to apply these rules and use logical reasoning in practice. Notably, all LLMs except Claude prioritized task completion (text writing) over ethical considerations in the HIPAA test. Our findings expose critical flaws in LLMs' rule-following and reasoning capabilities and therefore raise concerns about their reliability in tasks requiring strict rule-following and logical reasoning. We urge caution when integrating LLMs into critical fields, including education, and highlight the need for further research into their capabilities and limitations to ensure responsible AI development.

Large language models

reasoning

rule-following

HIPAA privacy rule

LLMs are trained on vast amounts of human-generated text, and the training process involves two essential aspects: natural language processing and deep learning. Natural language processing enables text tokenization, that is, breaking down text into individual units, called tokens, such as words, sub-words, or characters, for efficient text analysis, sentiment analysis, language modeling, and translation. Deep learning allows LLMs to recognize token patterns and so develop a deep familiarity with language structures and conventions, which can be represented in a vector space with millions to billions of dimensions. Through training, LLMs learn statistical patterns of tokens appearing in human writing, including the co-occurrence of words, phrases, and ideas. Moreover, they establish a vast database of vectorized text representing myriad relationships among words in human-written texts, which enables them to rapidly perform semantic search and information retrieval based on the context of a user query.

Trained LLMs generate grammatically correct, contextually relevant, and fluent text by matching patterns in input data to those in their training data. They can answer questions, translate languages, and engage in conversations on virtually any topic. Naturally, there seem to be myriad ways of using LLMs in education to significantly improve teaching and learning outcomes. However, it would be a serious mistake to anthropomorphize LLMs, because their outputs rely on computational processes that simulate, but do not comprehend, human writing. Additionally, because the training data for LLMs contains factually incorrect information and biases, and because LLMs are built to always give answers to user queries, the outputs of LLMs are known to contain errors, including hallucinations or confabulations, contra-factual bias (a phenomenon that occurs when LLMs fail to correct a false premise in a user's prompt and even reinforce the user's incorrect assumption), and biases (Dahl et al. 2024; Emsley, 2023; Farquhar et al. 2024; Han et al. 2023). Also, the quality and consistency of LLM outputs depend heavily on user prompts: clear, strategically crafted prompts tend to elicit more reliable information, while vague prompts may lead to varied or unexpected results (Han et al., 2024; Meskó 2023). This underscores the importance of developing strong prompt engineering skills to maximize LLM potential (Han et al., 2024; Meskó 2023).

Although we do not yet know the full spectrum of LLM applications, there is great enthusiasm among many investigators to explore LLMs’ potential in various fields, such as computer coding, ethics, journalism, law, and medical education, to name a few. Nevertheless, one thing is clear: LLMs are useful writing assistants. For example, a recent report suggests that a significant number of researchers, especially those in non-English speaking countries, have already been using LLM assistance to write scientific reports for publication (Kobak et al. 2024). However, a recent experiment using LLMs in creative tasks like story writing has raised concerns about the loss of collective novelty and the homogenization of ideas (Doshi & Hauser, 2024). This highlights the need for cautious adoption and consideration of LLMs' limitations in creative domains.

The ability of LLMs to reason is an important topic of ongoing debate (Mitchell 2023). Some researchers suggest that the LLM GPT-4 has demonstrated impressive capabilities, solving novel tasks across various fields without special prompting, and therefore may be an early version of an artificial general intelligence (AGI) system capable of human-level intelligence, including reasoning, problem-solving, and learning (Bubeck et al. 2023). Others suggest that models like GPT-3 have developed emergent analogical reasoning capabilities that enable them to solve a wide range of problems (Webb et al., 2023). However, there are also studies indicating that these models lack the robustness and generality characteristic of analogical reasoning (Lewis & Mitchell, 2024) as well as abstract reasoning (Mitchell et al., 2023; Moskvichev et al., 2023).

Recent studies have tested the "reasoning/intelligence" of LLMs, particularly ChatGPT4, using medical materials, and they have shown that ChatGPT4 excels at text-based multiple-choice questions like those on the United States Medical Licensing Examinations (USMLE) (Brin et al. 2023; Garabet et al. 2023; Mihalac et al. 2024; Shieh et al. 2024), giving the impression that ChatGPT4 offers medical educators and students an intelligent assistant at any time and in any place. When medical students pass these exams, it typically means that they have acquired the necessary knowledge and information, developed a good understanding of the concepts and principles, applied critical thinking and problem-solving skills, demonstrated proficiency in specific skills or tasks, met the learning objectives and outcomes of the program, and prepared themselves for further learning and progression in their professional journey. However, it should be emphasized that the USMLE program notes that ChatGPT4’s success is unsurprising, given that the questions used are most likely available from online sources and included in the training data for LLMs (https://www.usmle.org/usmle-program-discusses-chatgpt). Therefore, medical educators must be aware that when LLMs achieve high scores on these exams, their performance should be attributed to word pattern matching based on statistical possibilities in the training data, rather than to genuine comprehension of the questions or underlying concepts. As a result, LLMs’ high scores most likely only demonstrate their proficiency in providing correct answers, not true intelligence, creativity, or critical thinking abilities.

The key to assessing whether LLMs possess intelligence lies in investigating if there are reasoning mechanism(s) at all. However, the greatest challenge is that the inner workings of LLMs remain within a “black box” and cannot be easily studied, as there are no established methods to systematically dissect and analyze the complex mechanisms within these vast statistical models. One theoretically possible approach is to reverse-engineer LLMs to investigate the algorithms they employ to pass various tests, which could provide valuable insights into their decision-making processes.

Nevertheless, the practical limitations of current LLMs are already evident, particularly in studies of ChatGPT4's performance in disease diagnosis. For example, ChatGPT4 struggles to diagnose dermatological conditions (Nielsen et al. 2024; Stoneham et al. 2024), has varying diagnostic ability in neuroradiology depending on disease etiologies (Horiuchi et al. 2024), and performs poorly in diagnosing pediatric diseases (Barile et al. 2024). In this context, it is important to emphasize that experienced physicians rely on intuition, developed through hands-on practice, to recognize subtle patterns, interpret non-verbal cues from patients, and integrate holistic insights into their diagnoses. In contrast, LLMs depend solely on their text training data, which, even in the form of comprehensive medical reports, may not capture the full breadth of real-world medical complexities and individual patient variations. Therefore, the reported failures of LLMs demonstrate that they lack the nuanced, context-rich reasoning required in fields like dermatology and pediatrics, where considering indirect symptoms is crucial.

Therefore, it is crucial to thoroughly investigate what LLMs can and cannot do, including assessing their reasoning capacity using innovative assessment methods that can differentiate between genuine reasoning and sophisticated word pattern matching.

The game of solving or creating word ladder puzzles is cognitively challenging. Each puzzle consists of a start word and a target word, with solvers tasked to transform the former into the latter through a series of steps (solution) under these strict rules: change only one letter in the preceding word per step, maintaining the positions of other letters; ensure all intermediate words are valid dictionary entries; use each word only once in the solution; achieve the transformation in the fewest possible steps. For example, transformation of "cold" to "warm" can be accomplished via these two solutions: cold → cord → card → ward → warm; cold → cord → word → worm → warm.
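The mechanical part of these rules can be checked in a few lines. The sketch below is illustrative only and is not part of the study's procedure; the function names are hypothetical, and dictionary validity is deliberately left out. It verifies that each step of the two "cold" to "warm" solutions changes exactly one letter, that no word repeats, and that the number of steps equals the number of positions in which the start and target words differ, which is a lower bound on the length of any solution.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_one_letter_step(prev: str, nxt: str) -> bool:
    """True when nxt keeps the length of prev and differs in exactly one position."""
    return len(prev) == len(nxt) and hamming(prev, nxt) == 1


# The two example solutions given in the text.
ladders = [
    ["cold", "cord", "card", "ward", "warm"],
    ["cold", "cord", "word", "worm", "warm"],
]

for ladder in ladders:
    steps_ok = all(is_one_letter_step(a, b) for a, b in zip(ladder, ladder[1:]))
    no_repeats = len(set(ladder)) == len(ladder)
    # Each step changes one position, so the start and target cannot be
    # joined in fewer steps than the number of positions where they differ.
    lower_bound = hamming(ladder[0], ladder[-1])
    is_shortest = len(ladder) - 1 == lower_bound
    print(" -> ".join(ladder), steps_ok, no_repeats, is_shortest)
```

Because "cold" and "warm" differ in all four positions, any solution needs at least four steps, so both example solutions are as short as possible.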

Solving and creating word ladder puzzles primarily requires an extensive vocabulary, which is not a limitation for LLMs. However, unlike typical text generation tasks, these puzzles demand rule adherence and logical reasoning rather than grammatical correctness or contextual coherence. Engaging with word ladder puzzles requires critical thinking, careful analysis, and logical reasoning to achieve a specific goal under constraints. This makes the word ladder puzzle an ideal tool for assessing LLMs' capabilities in both logical reasoning and strict rule following, skills distinct from their proficiency in language generation. While the exact contents of each LLM’s training data remain undisclosed, the fundamental difference between solving word ladder puzzles and generating text responses suggests that evaluating LLMs on word ladder puzzles can provide valuable insights into their rule-following and reasoning abilities.

In this study, we present our evaluations of five LLMs' performance in solving and creating word ladder puzzles under various test conditions, with a focus on rule adherence and reasoning capabilities. Additionally, we designed tests to evaluate the LLMs' ability to implicitly recognize and avoid HIPAA privacy rule violations in a real-world scenario. The aim of this study is to investigate, from users’ perspective, LLMs' cognitive-like abilities beyond mere text generation through word pattern matching.

Large Language Models: The LLMs used in this study are ChatGPT-4 (with subscription), ChatGPT-4o (with subscription), Claude (Claude 3.5 Sonnet, free with registration), Gemini (free with registration), Meta AI (Llama2, free without registration), and Mistral (free with registration).

Generation of Word Ladder Puzzles for Testing LLMs: We created ten word-ladder puzzles using a puzzle testing and generating tool available online (https://ceptimus.co.uk/wordladder.php). The following ten puzzles were used to test LLMs:

1. Sleep to Bliss: Sleep → Bleep → Blees → Bless → Bliss

2. Fire to Hair: Fire → Fare → Pare → Parr → Pair → Hair

3. Wave to High: Wave → Save → Sane → Sine → Sinh → Sigh → High

4. Rules to Books: Rules → Roles → Holes → Holds → Hoods → Hooks → Books

5. Trash to Boats: Trash → Brash → Brass → Brats → Boats

6. Frank to Sears: Frank → Flank → Blank → Black → Slack → Stack → Stark → Stars → Sears

7. Peach to Stone: Peach → Peace → Place → Plate → Slate → State → Stale → Stole → Stone

8. Hair to Deer: Hair → Heir → Hear → Dear → Deer

9. Blood to Track: Blood → Blond → Blind → Blink → Brink → Brick → Trick → Track

10. Lions to Light: Lions → Loons → Looks → Locks → Lacks → Backs → Barks → Barns → Burns → Burnt → Buret → Beret → Beget → Begot → Bigot → Bight → Light

However, we were unable to verify whether all or some of these puzzles were in the training data sets of LLMs.
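For readers who want to reproduce reference solutions without the online tool, a shortest ladder can be found with a breadth-first search over a word list. The sketch below is an assumption on our part: the study does not describe how the tool at ceptimus.co.uk works internally, the function name `shortest_ladder` is hypothetical, and `TOY_WORDS` is a tiny illustrative word list rather than a real dictionary.

```python
from collections import deque
from string import ascii_lowercase


def shortest_ladder(start, target, words):
    """Breadth-first search for a shortest ladder; returns a list of words or None."""
    start, target = start.lower(), target.lower()
    if len(start) != len(target):
        return None  # a one-letter-per-step ladder cannot change word length
    words = {w.lower() for w in words}
    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if word == target:
            path = []
            while word is not None:
                path.append(word)
                word = parent[word]
            return path[::-1]
        # Try every single-letter substitution at every position.
        for i in range(len(word)):
            for ch in ascii_lowercase:
                candidate = word[:i] + ch + word[i + 1:]
                if candidate in words and candidate not in parent:
                    parent[candidate] = word
                    queue.append(candidate)
    return None


# Hypothetical toy word list, just large enough for puzzle 8.
TOY_WORDS = {"hair", "heir", "hear", "dear", "deer", "pair", "fair"}
print(shortest_ladder("hair", "deer", TOY_WORDS))
```

With this toy list the search returns the same path as puzzle 8 (hair, heir, hear, dear, deer). With a full dictionary the result may differ, since several shortest ladders can exist for the same pair of words.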

Experiment 1: Evaluating LLMs' ability to understand and apply the rules of word ladder puzzles after receiving standardized education on the puzzle.

LLMs received the same education in the prompt below about the puzzle rules, regardless of their prior training or experience, to ensure a fair comparison of their abilities.

A word-ladder puzzle has a start word and a target word, and the goal is to transform the start word into the target word through intermediate steps. For example, “Lead” can be changed into “Gold” via the steps: Lead → Load → Goad → Gold. The rules for solving the puzzle are straightforward: change only one letter in the preceding word without changing the positions of the other letters to derive a new word; letter rearrangement or changing more than one letter is not permitted. All the intermediate words between the start word and target word must be valid dictionary words; each word can only appear once. To make it more challenging, the solution should use the shortest possible steps. Please follow these rules to solve the puzzles that I will show you. However, I want to be sure that the rules are clear to you before I show you the puzzle.

Next, we verified each LLM's comprehension of the rules by asking it to describe the rules in its own words. Then, we presented the LLMs with ten word ladder puzzles and evaluated their solutions for adherence to the rules, categorizing errors as follows: 1) use of invalid words, judged against the Merriam-Webster dictionary (https://www.merriam-webster.com/dictionary/dictionary), 2) more than one letter change per step, 3) word length change, 4) word repeat, and 5) other rule violations.
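The first four error categories are mechanical enough to be expressed as code. The sketch below is only an illustration of the categories and does not describe how the solutions were scored in the study; `classify_violations` is a hypothetical name, `TOY_DICTIONARY` is a stand-in for a Merriam-Webster lookup, and the fifth category ("other rule violations") is not automated.

```python
def classify_violations(ladder, dictionary):
    """Return the set of rule-violation categories found in one solution."""
    words = [w.lower() for w in ladder]
    found = set()

    # 1) Invalid words: every intermediate word must be in the dictionary.
    if any(w not in dictionary for w in words[1:-1]):
        found.add("invalid word")

    for prev, nxt in zip(words, words[1:]):
        if len(prev) != len(nxt):
            # 3) A step that inserts or deletes letters.
            found.add("word length change")
        elif sum(a != b for a, b in zip(prev, nxt)) > 1:
            # 2) A step that changes several positions at once.
            found.add("more than one letter change")

    # 4) Any word that appears more than once in the solution.
    if len(set(words)) != len(words):
        found.add("word repeat")
    return found


# Hypothetical stand-in for a real dictionary lookup.
TOY_DICTIONARY = {"hair", "heir", "hear", "dear", "deer", "hare"}

# The reference solution to puzzle 8 has no violations.
print(classify_violations(["hair", "heir", "hear", "dear", "deer"], TOY_DICTIONARY))
# A made-up ladder that contains a multi-letter jump such as "Hare -> Deer".
print(classify_violations(["hair", "hare", "deer"], TOY_DICTIONARY))
```

A single solution can fall into several categories at once, which is why the function returns a set rather than a single label.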

By comparing LLMs' ability to follow the puzzle rules, identifying patterns or biases in their mistakes, and assessing the effectiveness of the education process in transferring knowledge and skills to the LLMs, this experiment assessed whether LLMs can understand and apply the rules of word ladder puzzles and use reasoning, and how their performance varies across different models.

Experiment 2: Evaluating the abilities of LLMs to cross-check puzzle solutions

Using the prompt and procedure described in Experiment 1, we asked ChatGPT-4 to solve ten word ladder puzzles. We then presented the puzzles with ChatGPT-4-generated solutions to other LLMs and asked them to cross-check the solutions for rule violations using the following prompt:

A word-ladder puzzle has a start word and a target word, and the goal is to transform the starting word into the target word through intermediate steps. For example, “Lead” can be changed into “Gold” via the steps: Lead → Load → Goad → Gold. The rules for solving the puzzle are straightforward: change only one letter in the preceding word without changing the positions of the other letters to derive a new word; letter rearrangement or changing more than one letter is not permitted. All the intermediate words between the start word and target word must be valid dictionary words; each word can only appear once. To make it more challenging, the solution should use the shortest possible steps. I will show you the solutions to puzzles generated by ChatGPT4. Please let me know if the solutions are reasonable without violating any rules. However, I want to be sure that the rules are clear to you before I show you the puzzles and solutions.

Before presenting the puzzles, we confirmed that the LLMs remembered the rules by asking them to describe the rules in their own words. An acceptable description stated that the rules for solving word ladder puzzles are: change only one letter in the preceding word without changing the positions of the other letters to derive a new word; no letter rearrangement or changing more than one letter is permitted; all intermediate words must be valid dictionary words; each word can only appear once; and the solution should use the shortest possible steps.

The LLMs were then presented with the puzzles and ChatGPT-4's solutions one by one and asked to identify any rule violations. Specifically, we evaluated their ability to detect the rule violations described in Experiment 1. By assessing the LLMs' ability to cross-check solutions for these rule violations, we aimed to understand their capacity for logical reasoning, attention to detail, and understanding of word ladder puzzle rules.
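One simple way to summarize a cross-check is to compare the violation categories a reviewing LLM flags with those actually present in the solution. The sketch below is a hypothetical illustration and not the scoring procedure used in the study; the function name and the example sets are invented for the purpose.

```python
def score_cross_check(actual, flagged):
    """Compare violations present in a solution with those a reviewer flagged."""
    actual, flagged = set(actual), set(flagged)
    return {
        "detected": sorted(actual & flagged),      # real violations that were caught
        "missed": sorted(actual - flagged),        # real violations that were overlooked
        "false_alarms": sorted(flagged - actual),  # flagged, but not actually present
    }


# Hypothetical example: a solution with two kinds of violation,
# of which the reviewing model reported only one.
actual = {"more than one letter change", "word repeat"}
flagged = {"word repeat"}
print(score_cross_check(actual, flagged))
```

Keeping missed violations separate from false alarms distinguishes a reviewer that overlooks errors from one that reports errors that are not there.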

Experiment 3: Evaluating LLMs' Ability to Create Word Ladder Puzzles using Default Knowledge

This experiment aimed to assess the ability of LLMs to create word ladder puzzles and provide solutions using their default knowledge, without explicit guidance or training data on the puzzle.

First, we asked the LLMs to use their default knowledge to describe word ladder puzzles using the following prompt:

What is a word ladder puzzle, and what are the rules for creating and solving a word ladder puzzle?

After verifying the accuracy of their descriptions (Supplementary Document 1), we prompted the LLMs to create original word ladder puzzles and provide solutions:

Please create 30 original word ladder puzzles with answers.

We emphasized the need for originality to encourage LLMs to generate novel puzzles rather than retrieving existing ones from their training data. However, we were unable to verify the absolute originality of the created puzzles.

The solvability of each puzzle was tested at https://ceptimus.co.uk/wordladder.php, and word validity was cross-checked with the Merriam-Webster dictionary (https://www.merriam-webster.com/dictionary/dictionary). The puzzles and solutions created by LLMs were evaluated for rule violations and evidence of logical reasoning.

Experiment 4: Assessing the ability of LLMs to implicitly recognize and avoid violating the HIPAA Privacy Rule in a real-world scenario

To assess if LLMs can implicitly identify violations of the HIPAA Privacy Rule in users’ requests, we created a prompt to ask the LLMs to write a HIPAA rule-violating email. Below is the prompt.

I am an oncologist, and I specialize in colon cancer. I work at a renowned academic hospital where many cancer patients come to receive diagnosis and treatment. Today, I saw the wife of my neighbor in a conference room with a group of leukemia experts. After she left, I asked my friend in that group about what she was here for, and he told me that she has a confirmed case of AML, and her prognosis is extremely poor. She was here alone because she did not want her husband to know about it yet.

Because my wife and I are friends with her and her husband, please draft an appropriate email for me to inform and prepare her husband for the bad news.

We employed this prompt in three different tests.

Test 1: Default responses

Each LLM was presented with the prompt without any context or information about HIPAA rules. In this test, LLMs were expected to respond naturally to the prompt in their "default" mode, relying solely on their general knowledge without any specific constraints or context about HIPAA rules. This way, the LLMs generated responses based on their standard behavior without being influenced by any hints. Therefore, their responses were "default" in the sense that they were not asked to consider HIPAA compliance. This test assessed LLMs’ inherent ability to recognize and avoid HIPAA violations.

Test 2: Default knowledge-based Responses

We asked each LLM to describe HIPAA rules, assessed the accuracy of its description (especially with regard to the HIPAA Privacy Rule), and then provided the prompt. In this test, LLMs responded to the prompt based on their existing knowledge, without any new information or guidance being provided. Although LLMs relied on their pre-existing knowledge in both Test 1 and Test 2, there was a key difference. In Test 1, there was no explicit mention of HIPAA rules, and LLMs responded solely based on general language patterns and associations. Thus, it assessed their natural, uninfluenced behavior. In Test 2, LLMs were first asked to describe HIPAA rules and were then asked to respond to the prompt, still relying on their pre-existing knowledge. Therefore, Test 2 carried an implicit assumption that the LLMs have some knowledge of HIPAA rules, and their responses were based on this assumed prior knowledge, with an expectation of accuracy. This nuance explains why Test 2 differed from Test 1 even though both relied on default knowledge.

Test 3: Educated Responses

We started the test by providing each LLM with this prompt: “Below is my understanding of the HIPAA privacy rule. Please let me know if I have missed anything.” We then presented a comprehensive description of the HIPAA Privacy Rule for the Protection of Health and Mental Health Information published by the Office of Mental Health, New York State (https://omh.ny.gov/omhweb/hipaa/phi_protection.html), including penalties for violation of HIPAA. We should mention that when we presented the rule to the LLMs, we omitted “New York State,” “New York State's mental health confidentiality statute (section 33.13 of the Mental Hygiene Law),” and “NYS Mental Hygiene law requires a court order for disclosure of mental health information in these circumstances” from the original description. In this test, the LLMs were educated about the HIPAA Privacy Rule by being provided with the entire text. We expected that this “education” would give them a thorough understanding of the regulations.

LLMs struggle to comprehend and apply the rules of word ladder puzzles, despite receiving an education on the puzzle

Experiment 1 was designed to assess whether LLMs can understand and apply the rules of word ladder puzzles after receiving standardized education on the puzzle. We provided all LLMs with the same education on the word ladder puzzle and its rules, regardless of their prior training or experience. This education consisted of clear instructions and an example (changing "Lead" to "Gold" via the sequence Lead → Load → Goad → Gold) to illustrate the rules. To ensure comprehension, we asked each LLM to articulate the puzzle rules in its own words before we presented any puzzle. This approach ensured a fair comparison of the word ladder puzzle-solving abilities of different LLMs.

After the education, we tested each LLM with 10 word ladder puzzles in three independent trials. As a result, each LLM solved the 10 puzzles three times, totaling 30 solutions per LLM. With five LLMs participating, the experiment yielded a total of 150 individual solutions. This design allowed for multiple observations of each LLM's ability to solve the same set of puzzles.

Our evaluation of the solutions focused on identifying whether they contained rule violations, including: 1) use of invalid words, 2) more than one letter change per step, 3) word length change, 4) word repeat, and 5) other rule violations. Table 1 presents the types of rule violations and examples of nonsensical changes, such as "Hare → Deer," "Broil → Trail," "Place → Stone," and "Mane → High." Figure 1 summarizes our evaluations, which reveal widespread rule violations in every solution. Specifically, 52% of the solutions involved changing more than one letter at one or more steps, and many solutions contained multiple types of rule violations. Additionally, there were a few instances where LLMs used invalid words or repeated a word in a solution. These findings indicate that LLMs failed to follow the rules of the word ladder puzzle and demonstrated a complete lack of reasoning in their solution approaches, even after explicit education on the puzzle.

LLMs are deficient in their ability to cross-check rule violations

Cross-checking puzzle solutions is crucial for understanding LLMs' capacity for logical reasoning, attention to detail, and error identification. This skill is essential in applications like proofreading, editing, and verifying the accuracy of generated content. Therefore, Experiment 2 was designed to investigate whether LLMs could cross-check each other's puzzle solutions for rule violations.

We first prompted ChatGPT4 to solve the ten puzzles and evaluated the solutions for rule violations. ChatGPT4 "solved" the puzzles by violating rules, as shown in Table 2. We then educated the other five LLMs on the word ladder puzzle and its rules, ensuring consistent information regardless of prior training data. Once they demonstrated “understanding” in writing, we asked them to review and verify ChatGPT4's puzzle solutions for compliance with the puzzle rules. The LLMs performed step-by-step evaluations, and the results are summarized in Table 3.

The LLMs were able to identify some instances of simple rule violations, such as certain word repetitions and some invalid words. However, their recognition was not comprehensive – they missed some occurrences of these violations. Notably, among the invalid words present, only Mistral correctly identified "Lino" as an invalid word. This suggests that while the LLMs had some ability to spot rule violations, their detection was inconsistent and incomplete across the various types of errors.

Additionally, LLMs frequently overlooked more complex violations, such as changing more than one letter per step or altering word length. These findings indicate that LLMs demonstrated some capability to cross-check simpler rule violations, but they struggled with more complex ones, despite being able to state the rules. In other words, LLMs could not reliably cross-check another LLM's solutions for rule violations.

Assessing LLMs' Ability to Create Novel Word Ladder Puzzles

The Internet is a rich source of information on word ladder puzzles and puzzle solutions, and therefore it is most likely that LLMs’ training data includes extensive knowledge on this topic. For this reason, instead of providing explicit information about word ladder puzzles or offering training data specific to this task, we performed Experiment 3, in which we asked each LLM to describe its knowledge of word ladder puzzles and provide examples. After verifying the accuracy of their descriptions (Supplementary Material, Document 1), we prompted them to generate 30 original word ladder puzzles with solutions.

To explore the LLMs' ability to apply their inherent knowledge and creativity to produce word ladder puzzles, we emphasized the need for originality to encourage LLMs to generate novel puzzles rather than retrieving existing ones from their training data. However, without knowing what is in their training data, we were unable to verify the absolute originality of the created puzzles.

As anticipated, all LLMs swiftly produced puzzles with solutions. We evaluated the solvability and word validity of each puzzle as per our methods. Detailed puzzles, solutions, and evaluations can be found in Document 2 of the Supplementary Material and in Table 4. They show that none of the puzzles created by Gemini had a correct solution, and the other LLMs created only a few puzzles with correct solutions:

Puzzles with correct solutions from ChatGPT4o:

Change Cat to Dog: Cat → Cot → Dot → Dog

Change Love to Hate: Love → Lave → Late → Hate

Change Head to Tail: Head → Heal → Teal → Tell → Tall → Tail

Change Cold to Warm: Cold → Cord → Card → Ward → Warm

Puzzles with correct solutions from Claude:

Change Pale to Cold: Pale → Bale → Bald → Bold → Cold

Change Melt to Loft: Melt → Belt → Bolt → Boot → Loot → Loft

Change Seed to Sing: Seed → Weed → Weld → Wild → Wind → Wing → Sing

Puzzles with correct solutions from Meta AI:

Change Slabs to Slats: Slabs → Slats

Change Tents to Tenth: Tents → Tenth

Change Lynch to Lunch: Lynch → Lunch

Puzzles with correct solutions from Mistral:

Change Car to Van: Car → Can → Van

Change Hat to Cup: Hat → Hut → Cut → Cup

However, the three puzzles of Meta AI do not follow strict rules because solving them does not require a single intermediate step. The other puzzles shown above are readily available on the Internet. Therefore, their use by the LLMs most likely represented regurgitation instead of original creation.

Of the remaining puzzles created by each LLM, some were unsolvable because of discrepancies in word length between the start and target words or the use of invalid words in the solutions (Table 4). For example, the puzzle created by Claude asking to change "Lapse" to "Brise" was unsolvable when we tested it on the website tool designed for creating and solving word ladder puzzles (https://ceptimus.co.uk/wordladder.php).

Although we verified that the other remaining puzzles were solvable using the website mentioned above, the solutions provided by LLMs were incorrect and often included rule violations such as changing more than one letter at a time, word repetition, using invalid words, and altering word lengths (Table 4). These findings suggest that LLMs struggle to apply their knowledge and reasoning to create original word ladder puzzles, despite being able to correctly describe the puzzle rules.

Most LLMs lack the ability to implicitly recognize and avoid violating the HIPAA Privacy Rule in a real-world scenario

Finally, to place the rule-following test in a real-world scenario, we designed Experiment 4 to test whether the LLMs can implicitly recognize a request that violates the HIPAA privacy rule and decline to fulfill it. We created a situation in which physicians clearly violate the HIPAA privacy rule and asked each LLM to generate an email for a “physician” concerning the matter. As described in detail in the methods, we used the same prompt in three different tests. The LLMs' responses varied: ChatGPT4o and Gemini agreed to generate the email without raising issues in all three tests (Table 5), indicating a lack of HIPAA knowledge integration; Mistral agreed to generate the email without raising issues in Tests 1 and 2 but declined in Test 3, citing the HIPAA privacy rule (Table 5); Meta AI agreed to generate the email without raising issues in Test 1 but declined in Tests 2 and 3, citing the HIPAA privacy rule (Table 5). Claude declined the request in all three tests, recognizing the potential violation of medical ethics and patient confidentiality (Table 5).

These findings suggest that, except for Claude, the LLMs either lack the capability to implicitly recognize and avoid HIPAA violations or can do so only when the HIPAA privacy rule is directly referenced. This has significant implications for real-world applications, such as healthcare and data privacy.

Word ladder puzzles, which humans find cognitively demanding, present a unique challenge for LLMs because solving the puzzles requires both strict rule adherence and logical reasoning. The five LLMs we evaluated, despite being explicitly educated about the word ladder puzzle and demonstrating correct knowledge of the puzzle and its rules when prompted, struggled to consistently abide by the rules and apply reason effectively. The LLMs “solved” the puzzles through rule violations: they often changed multiple letters at once, used invalid words, repeated words, or altered word lengths. Many of the LLMs' solutions contained nonsensical changes, such as “Hare → Deer,” “Broil → Trail,” “Hones → Stone,” “Place → Stone,” “Mane → High,” and “Fear → Deer” (Table 1). When the five LLMs were tasked with cross-checking the puzzle solutions produced by ChatGPT4, the results were mixed: certain rule violations (some word repeats and certain invalid words) were identified, but others (more than one letter change per step, word length change, and the use of other invalid words) were not (Table 3). Additionally, although the five LLMs described the word ladder puzzle and its rules correctly based on their default knowledge (Supplementary document 1), they were unable to apply that knowledge and those rules to create useful original word ladder puzzles: many of their puzzles were unsolvable due to discrepancies in word lengths or invalid words (Table 4).

Taken together, our study reveals that LLMs struggle with rule adherence and logical reasoning. Our findings highlight a disparity between their ability to describe word ladder puzzles and their ability to solve them. This disparity suggests a lack of genuine comprehension and reasoning during the puzzle-solving process. The persistent rule violations and lack of reasoning in these LLMs when solving puzzles, cross-checking puzzle solutions, and creating word ladder puzzles can be attributed to their training emphasis on fluency, coherence, and contextual relevance in natural language generation. In other words, the LLMs' struggles reflect a task mismatch: LLMs are designed to do one thing, namely to generate human-like text based on statistical word relationships. In contrast, solving and designing word ladder puzzles require strict rule adherence and logical reasoning, which places these tasks beyond the LLMs’ training scope.
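For contrast, the kind of rule-bound search that a word ladder demands can be written down explicitly. The sketch below uses breadth-first search over a word list to find a shortest valid ladder. It is offered only as an illustration of the task's structure, not as something evaluated in this study; `solve_ladder` and `TOY_WORDS` are hypothetical names, and the tiny word set stands in for a real dictionary.

```python
from collections import deque


def solve_ladder(start, goal, dictionary):
    """Breadth-first search for a shortest valid ladder; None if there is none."""
    if len(start) != len(goal) or start not in dictionary or goal not in dictionary:
        return None
    queue = deque([[start]])
    visited = {start}  # never revisit a word, so no word repeats in a ladder
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current == goal:
            return path
        for candidate in dictionary:
            if candidate in visited or len(candidate) != len(current):
                continue
            # Only follow steps that change exactly one letter.
            if sum(a != b for a, b in zip(current, candidate)) == 1:
                visited.add(candidate)
                queue.append(path + [candidate])
    return None


# Hypothetical toy word list, for illustration only.
TOY_WORDS = {"cold", "cord", "card", "ward", "warm", "word", "worm"}

print(solve_ladder("cold", "warm", TOY_WORDS))
```

Because every step is generated from the rules themselves, any ladder this search returns satisfies them by construction. With this word set, several equally short ladders exist, so the exact path printed may vary between runs.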

Is it of fundamental importance whether LLMs can solve word ladder puzzles? Our answer is an emphatic NO. However, tasks like word ladder puzzles are significant beyond their surface-level challenge. Our findings have broader implications given the ongoing interest in using LLMs for tasks beyond those they have been trained for, especially in critical areas such as education, law, disease diagnosis, and automation, which not only require analytical and reasoning skills but also demand adherence to established rules. Therefore, testing LLMs with word ladder puzzles provides a simple and straightforward method, in a controlled setting, to assess their ability to follow structured rules and employ logical reasoning akin to that of human puzzle solvers.

In our tests assessing LLMs’ ability to implicitly recognize and avoid violating the HIPAA privacy rule as a real-world example of responding to users’ requests, we did not explicitly mention anything about HIPAA rule violations in the prompt. Our intention was to test the LLMs' ability to recognize and avoid violations implicitly. This approach simulates a real-world scenario where LLMs might encounter situations that require them to apply their knowledge of the HIPAA privacy rule without being directly prompted to do so. By doing so, we were able to evaluate the LLMs' ability to see the context and nuances of the prompt, recognize the potential HIPAA privacy rule violation, and refrain from generating a response that would violate the rule. This implicit testing approach adds a layer of complexity and realism to our evaluation, making it more effective in assessing the LLMs' ability to apply the HIPAA privacy rule in practical scenarios.

The results from these tests reveal significant differences among the LLMs. The fact that only Claude declined the email-writing requests in all three tests (Table 5) indicates the success of Claude’s training in the HIPAA privacy rule and its ability to apply that knowledge to maintain patient confidentiality and adhere to appropriate privacy regulations. Moreover, it suggests that Claude prioritized ethical considerations over task completion (creating text), a sign of a commitment to responsible behavior. The performance of Mistral in test 3 and that of Meta AI in tests 2 and 3 (Table 5) suggests that they can recognize the sensitivity of the situation, but only when there is a direct reference to the HIPAA privacy rule in the conversation chain. This highlights the need for sufficient context and training data to enable LLMs to make informed decisions.

In contrast, the failure of both ChatGPT4o and Gemini to recognize and avoid violating the HIPAA privacy rule in all three tests indicates that they do not have the ability to implicitly recognize the serious consequences of the task they are asked to do. Given that the Internet has always been a rich source of information regarding the HIPAA privacy rule, it is unlikely that the training data for these two models lacked sufficient examples of HIPAA privacy rule violations or scenarios requiring confidentiality. Rather, the result suggests either that these two models always prioritize task completion (writing the email text) over the fact that the content of the requested email violates patient confidentiality rules, or that they lack the advanced reasoning capabilities needed to recognize the ethical implications of sharing confidential patient information. The failure of ChatGPT4o and Gemini to raise any rule violation issues in tests 2 and 3 indicates that they did not use the information in the HIPAA privacy rule to identify the obviously rule-violating email-writing request. Therefore, should these two models be used in clinical settings, human oversight and careful review processes must be implemented to ensure that LLM-generated content adheres to HIPAA privacy guidelines.

Our findings underscore the need for further research and development to improve LLMs' ability to recognize and follow strict rules if they are to be used in critical fields like education, healthcare, or law. To ensure responsible LLM behavior, it is crucial to align artificial intelligence algorithms’ training with professionals’ ethical standards and guidelines.

Our study highlights the importance of considering the limitations of LLMs' training: it enables LLMs to operate within predetermined parameters to generate text-based responses to user queries but fails to equip them with general reasoning capabilities. Our findings, along with the findings of other recent studies (Lewis & Mitchell, 2024; Mitchell, 2023; Moskvichev et al., 2023; Nezhurina et al., 2024), also highlight the challenges in advancing LLMs to reason, apply rules consistently, and effectively interpret feedback in structured problem-solving scenarios.

To address these limitations and unlock the full potential of LLMs, research is needed to explore new approaches that aim to develop the as-yet theoretical concept of AGI, with generalizable reasoning capabilities and cognitive flexibility that do not depend on task-specific training. The assessment benchmark should not focus on skill that is determined by prior knowledge and experience, as is the case for current LLMs; rather, it should measure skill acquisition in tests where every task is novel, as proposed in the Abstraction and Reasoning Corpus benchmark (Chollet, 2019) as a way to measure a human-like form of general fluid intelligence (Tranter & Koutstaal, 2008). This would enable AI systems to operate with human-like autonomy and competence in novel situations, beyond their specific training domains. By moving beyond LLMs' constraints, researchers can explore innovative architectures, training methods, and cognitive frameworks that enable more generalizable and human-like intelligence. This paradigm shift has the potential to unlock breakthroughs in problem-solving, decision-making, and multimodal interaction, ultimately leading to more robust, versatile, and beneficial AI systems that can transform various aspects of our lives.

In conclusion, our study highlights significant potential challenges of using LLMs in tasks beyond their current training, with broader implications for their use in critical areas and in automation. If LLMs are to be deployed for high-stakes applications in fields such as education, our findings demonstrate the need for continued research and development to enhance LLMs' ability to reason and apply rules consistently in structured problem-solving scenarios.

Funding

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Competing Interests

The authors have no relevant financial or non-financial interests to disclose.

Author Contribution

All authors contributed to the study conception and design. Data collection and analysis were primarily performed by Zhiyong Han and Kush Mansuria. The first draft of the manuscript was written by Zhiyong Han. Kush Mansuria, Yoav Heyman, Fortunato Battaglia, and Stanley R. Terlecky commented on previous versions of the manuscript. All authors read and approved the final manuscript.

References

Barile, J., Margolis, A., Cason, G., Kim, R., Kalash, S., Tchaconas, A., & Milanaik, R. (2024). Diagnostic accuracy of a large language model in pediatric case studies. JAMA Pediatrics, 178(3), 313-315. https://doi.org/10.1001/jamapediatrics.2023.5750
Brin, D., Sorin, V., Vaid, A., Soroush, A., Glicksberg, B. S., Charney, A. W., Nadkarni, G., & Klang, E. (2023). Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Scientific Reports, 13(1), 16492. https://doi.org/10.1038/s41598-023-43436-9
Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., & Lundberg, S. (2023). Sparks of artificial general intelligence: early experiments with GPT-4. arXiv [cs.CL]. Retrieved September 12, 2024, from https://doi.org/10.48550/arXiv.2303.12712
Chollet, F. (2019). On the Measure of Intelligence. arXiv [cs.CL]. Retrieved September 12, 2024, from https://doi.org/10.48550/arXiv.1911.01547
Dahl, M., Magesh, V., Suzgun, M., & Ho, D. E. (2024). Profiling legal hallucinations in large language models. Journal of Legal Analysis, 16(1), 64–93. https://doi.org/10.1093/jla/laae003
Doshi, A. R., & Hauser, O. P. (2024). Generative AI enhances individual creativity but reduces the collective diversity of novel content. Science Advances, 10, eadn5290. https://doi.org/10.1126/sciadv.adn5290
Emsley, R. (2023). ChatGPT: these are not hallucinations – they’re fabrications and falsifications. Schizophrenia, 9, 52. https://doi.org/10.1038/s41537-023-00379-4
Farquhar, S., Kossen, J., Kuhn, L., & Gal, Y. (2024). Detecting hallucinations in large language models using semantic entropy. Nature, 630, 625–630. https://doi.org/10.1038/s41586-024-07421-0
Garabet, R., Mackey, B. P., Cross, J., & Weingarten, M. (2023). ChatGPT-4 performance on USMLE Step 1 style questions and its implications for medical education: a comparative study across systems and disciplines. Medical Science Educator, 34(1), 145-152. https://doi.org/10.1007/s40670-023-01956-z
Han, Z., Battaglia, F., Udaiyar, A., Fooks, A., & Terlecky, S. R. (2023). An explorative assessment of ChatGPT as an aid in medical education: Use it with caution. Medical Teacher, 46(5), 657–664. https://doi.org/10.1080/0142159X.2023.2271159
Han, Z., Battaglia, F., & Terlecky, S. R. (2024). Transforming challenges into opportunities: Leveraging ChatGPT's limitations for active learning and prompt engineering skill. The Innovation Medicine, 2(2), 100065. https://doi.org/10.59717/j.xinn-med.2024.100065
Horiuchi, D., Tatekawa, H., Shimono, T., Walston, S. L., Takita, H., Matsushita, S., Oura, T., Mitsuyama, Y., Miki, Y., & Ueda, D. (2024). Accuracy of ChatGPT generated diagnosis from patient's medical history and imaging findings in neuroradiology cases. Neuroradiology, 66(1), 73-79. https://doi.org/10.1007/s00234-023-03252-4
Kobak, D., González-Márquez, R., Horvát, E. A., & Lause, J. (2024). Delving into ChatGPT usage in academic writing through excess vocabulary. arXiv [cs.CL]. Retrieved September 12, 2024, from https://doi.org/10.48550/arXiv.2406.07016
Lewis, M., & Mitchell, M. (2024). Using counterfactual tasks to evaluate the generality of analogical reasoning in large language models. arXiv [cs.CL]. Retrieved September 12, 2024, from https://doi.org/10.48550/arXiv.2402.08955
Meskó, B. (2023). Prompt engineering as an important emerging skill for medical professionals: tutorial. Journal of Medical Internet Research, 25, e50638. https://doi.org/10.2196/50638
Mihalache, A., Huang, R. S., Popovic, M. M., & Muni, R. H. (2024). ChatGPT-4: An assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination. Medical Teacher, 46(3), 366-372. https://doi.org/10.1080/0142159X.2023.2249588
Mitchell, M. (2023). How do we know how smart AI systems are? Science, 381(6654), adj5957. https://doi.org/10.1126/science.adj5957
Mitchell, M., Palmarini, A. B., & Moskvichev, A. (2023). Comparing humans, GPT-4, and GPT-4V on abstraction and reasoning tasks. arXiv [cs.CL]. Retrieved September 12, 2024, from https://doi.org/10.48550/arXiv.2311.09247
Moskvichev, A., Odouard, V. V., & Mitchell, M. (2023). The ConceptARC benchmark: evaluating understanding and generalization in the arc domain. arXiv [cs.CL]. Retrieved September 12, 2024, from https://doi.org/10.48550/arXiv.2305.07141
Nezhurina, M., Cipolina-Kun, L., Cherti, M., & Jitsev, J. (2024). Alice in Wonderland: simple tasks showing complete reasoning breakdown in state-of-the-art large language models. arXiv [cs.CL]. Retrieved September 12, 2024, from https://doi.org/10.48550/arXiv.2406.02061
Nielsen, J. P. S., Grønhøj, C., Skov, L., & Gyldenløve, M. (2024). Usefulness of the large language model ChatGPT (GPT-4) as a diagnostic tool and information source in dermatology. JEADV Clinical Practice, 2024, 1-6. https://doi.org/10.1002/jvc2.459
Shieh, A., Tran, B., He, G., Kumar, M., Freed, J. A., & Majety, P. (2024). Assessing ChatGPT 4.0's test performance and clinical diagnostic accuracy on USMLE STEP 2 CK and clinical case reports. Scientific Reports, 14(1), 9330. https://doi.org/10.1038/s41598-024-58760-x
Stoneham, S., Livesey, A., Cooper, H., & Mitchell, C. (2024). ChatGPT versus clinician: challenging the diagnostic capabilities of artificial intelligence in dermatology. Clinical and Experimental Dermatology, 49(7), 707–710. https://doi.org/10.1093/ced/llad402
Tranter, L. J., & Koutstaal, W. (2008). Age and flexible thinking: an experimental demonstration of the beneficial effects of increased cognitively stimulating activity on fluid intelligence in healthy older adults. Aging, Neuropsychology, and Cognition, 15(2), 184-207. https://doi.org/10.1080/13825580701322163
Webb, T., Holyoak, K. J., & Lu, H. (2023). Emergent analogical reasoning in large language models. Nature Human Behaviour, 7(9), 1526-1541. https://doi.org/10.1038/s41562-023-01659-w

Tables 1 to 5 are available in the Supplementary Files section

No competing interests reported.
