In this study, we found that AI tools have performance-based differential effects across individuals, significantly helping lower performers and significantly hurting higher performers. This effect was found in particular for the AI-generated passage summary and Socratic discussion chatbot tools, and was consistent across all analysis methods employed. The AI-generated passage outline tool displayed this differential effect in two of three analysis methods, while the Q&A tutor chatbot displayed it in only one. Together, these findings support our two key pre-registered hypotheses on the effects of AI tools. They also corroborate previous literature showing that Intelligent Tutoring Systems (ITS) disproportionately help lower performers over higher performers24,25.
To rule out regression to the mean (a low risk given our randomized design), we conducted complementary analyses using different low–high performer splits, splitting on SAT/ACT scores (taken 1–5 years prior to the current study) and on overall performance across all tests (including AI conditions).
The differential effect across performance-based groups was strong for the Socratic chatbot, whose goal was to reinforce comprehension through Socratic questioning, as a tutor might. Lower performers were significantly helped by this tool. Higher performers, however, were hurt by the Socratic chatbot in some of our analyses; although significance was inconsistent across analyses, the direction of the negative effect remained consistent. We speculate that, upon finishing the passage, higher performers already grasp it well enough that the chatbot cannot improve their comprehension and may instead serve as a distraction that impedes it. These findings demonstrate the potential of similar tools (which constitute the popular vision of a personalized tutor) to help those who need it most, but also caution against blanket use of such tools with all students, as it may cause unintended harm.
To our knowledge, this study is the first to report data on the effect that reading an AI-generated summary has on comprehension, despite the prevalence of such summaries in educational contexts. According to one recent study, 39.3% of AI use by German university students involves text processing, text analysis, and text creation8; summarization of long texts plausibly makes up a major portion of this use. As might be expected, reading an AI-generated summary instead of the full passage significantly worsened comprehension in higher performers, likely because much of the passage's detail and nuance was lost in the summary. The summary's effect on lower performers was inconsistent: in the control split analyses, reading a summary significantly improved comprehension, whereas in the SAT score and overall performance splits it had no significant effect. This stands in contrast to the strong, consistent negative effect that reading a summary had on higher performers across all analysis methods. Overall, this differential effect may be driven by lower performers' difficulty extracting a passage's themes and meaning from a distractingly long text, a difficulty that reading a distilled summary may mitigate.
We expected that sorting the text into an AI-generated outline with topic headings would improve comprehension16. We observed this effect in lower performers: the control split yielded a strong positive effect, while the other splits yielded directionally consistent but non-significant effects. Perhaps unexpectedly, but in line with our pre-registered hypotheses, we observed a negative effect in higher performers that was significant in both the control and overall performance splits but not in the SAT score split. This could again be due to distraction from the text or to the quality of the AI-generated outline, and it further emphasizes the need for caution and testing when developing and implementing AI tools.
The effect of the Q&A tutor tool was less readily interpretable. The differential effect was significant for both groups in the control split analyses but not significant for either group in the other two splits. This could reflect our implementation/prompt or a lack of quality engagement (usage was not required, as it was for the Socratic chatbot). The Q&A tutor was entirely self-directed, and past research suggests that students may lack the metacognitive skills to take full advantage of such on-demand help systems26. Future studies should teach students how best to use the tutor in order to amplify its effects.
The findings of this study are strengthened by several aspects of its design, execution, and analysis. First, we pre-registered our hypotheses and methods, and our core hypotheses (AI tools helping lower performers and hurting higher performers) were borne out. Second, testing the AI tools in college-aged participants ensured our findings generalize to a population that is already heavily, and increasingly, using AI tools. Third, the underlying approaches of the AI tools and the assessment method (ACT Reading tests) are well validated. Finally, performance on the control passage (which we used to split high and low performers) was highly correlated with SAT score, to a degree similar to the correlation between SAT score and college GPA, suggesting our high–low performer split is well validated.
Across all tools, we repeatedly found heterogeneity in the effects of AI tools on reading comprehension: they helped lower performers and hurt higher performers. This underscores the need for caution and extensive testing before such tools are implemented in the educational system en masse.
Limitations
As the data used in this study were sourced through the online research platform Prolific, the participant sample reflects individuals who actively use Prolific and were interested in a reading comprehension study, which may skew the sample relative to the broader population. For example, our sample had more female participants (64%) than the population average. Additionally, the proportion of our sample who were students exceeded the national average for a similar age bracket27, and participants' SAT/ACT scores were higher than the national averages28; even our low-performer groups tended to have average SAT scores above the national average. It will therefore be important for future work to more closely mirror the broader population. Additionally, as a major incentive for participation was monetary, effort levels may have varied, though we designed our quality control process to identify and exclude low-effort participants. Given AI's relatively novel and controversial position in society, participants may also have had varying levels of confidence and trust in AI, affecting their usage of the tools in this study. Our results are also subject to our implementation of the tools (i.e., the prompts used to create them and the underlying LLM); negative findings may therefore reflect insufficiently robust AI tools, which might be improved in the future. Lastly, participants completed each condition only once, potentially limiting detection power or increasing variance in our results.
Areas of Future Research
Building on the findings of this study, we identify several areas where further investigation is needed to enhance our understanding of AI's impact on education. It is crucial to better understand how to develop tools that benefit all learners, not just lower performers. To do this, the effects of other AI tools, beyond those used in this study, and on other aspects of learning beyond reading comprehension, must be analyzed, and the consistency of this differential effect must be determined. Next, in-classroom testing is necessary to provide a more realistic environment with higher levels of student effort and motivation. Additionally, the effect of AI tools should be studied in other samples. For example, it should be assessed in K-12 students, who make up the majority of the educational system and may be less equipped to use LLM-based tutoring tools effectively. Additional samples could include participants from different countries, with different languages, or with varying learning abilities. As mentioned above, even the low performers in this study scored above the US national SAT average, potentially indicating an additional group of lower performers below those in our study. The effect should be studied in this group, as they may have the most potential to benefit from AI tools. Finally, longitudinal research should be conducted to test the effect of AI-based tools on learning over the long term.