In this study, we found that AI tools have performance-based differential effects across individuals, significantly helping lower performers and significantly hurting higher performers. This effect was found in particular for the AI-generated passage summary and Socratic discussion chatbot tools, and was consistent across all analysis methods employed. The AI-generated passage outline tool displayed this differential effect in two of three analysis methods, while the Q&A tutor chatbot displayed it in only one. Together, these findings support our two key pre-registered hypotheses on the effects of AI tools. They also corroborate previous literature showing that Intelligent Tutoring Systems (ITS) disproportionately help lower performers over higher performers24,25.
To rule out regression to the mean (a low risk given our randomized design), we conducted complementary analyses using different low–high performer splits, splitting on SAT/ACT scores (taken 1–5 years prior to the current study) and on overall performance across all tests (including AI conditions).
The differential effect across performance-based groups was strong for the Socratic chatbot, whose goal was to reinforce comprehension through Socratic questioning, as a tutor might. Lower performers were significantly helped by this tool. Higher performers, however, were hurt by the Socratic chatbot in some of our analyses; although significance was inconsistent across analyses, the direction of the negative effect remained consistent. We speculate that, upon finishing the passage, higher performers already grasp it well enough that the chatbot cannot improve their comprehension and may instead serve as a distraction that impedes it. These findings demonstrate the potential of similar tools (which constitute the popular vision of a personalized tutor) to help those who need it most, but also caution against blanket use of such tools with all students, as it may cause unintended harm.
To our knowledge, this study is the first to report data on the effect that reading an AI-generated summary has on comprehension, despite the prevalence of such summaries in educational contexts. According to one recent study, 39.3% of AI use by German university students involves text processing, text analysis, and text creation8; summarization of long texts plausibly makes up a major portion of this use. As might be expected, reading an AI-generated summary instead of the full passage significantly worsened comprehension in higher performers, likely because much of the passage's detail and nuance was lost in the summary. The summary's effect on lower performers was inconsistent: in the control split analyses, reading a summary significantly improved comprehension, whereas in the SAT score and overall performance splits it had no significant effect. This stands in contrast to the strong, consistent negative effect that reading a summary had on higher performers across all analysis methods. Overall, this differential effect may be driven by lower performers' difficulty extracting a passage's themes and meaning from a distractingly long text, a difficulty that reading a distilled summary may mitigate.
We expected that sorting the text into an AI-generated outline with topic headings would improve comprehension16. We observed this effect in lower performers: the control split yielded a strong positive effect, while the other splits yielded directionally consistent but non-significant effects. Perhaps unexpectedly, but in line with our pre-registered hypotheses, we observed a negative effect in higher performers that was significant in both the control and overall performance splits but not in the SAT score split. This could again be due to distraction from the text or to the quality of the AI-generated outline, and it further emphasizes the need for caution and testing when developing and implementing AI tools.
The effect of the Q&A tutor tool was less readily interpretable. The differential effect was significant for both groups in the control split analyses but not significant for either group in the other two splits. This could reflect our implementation/prompt or a lack of quality engagement (usage was not required, as it was for the Socratic chatbot). The Q&A tutor was entirely self-directed, and past research suggests that students may lack the metacognitive skills to take full advantage of such on-demand help systems26. Future studies should teach students how best to use the tutor in order to amplify its effects.
The findings of this study are strengthened by several aspects of its design, execution, and analysis. First, we pre-registered our hypotheses and methods, and our core hypotheses (AI tools helping lower performers and hurting higher performers) were borne out. Second, testing the AI tools in college-aged participants ensured our findings generalize to a population that is already heavily, and increasingly, using AI tools. Third, the underlying approaches of the AI tools and the assessment method (ACT Reading tests) are well validated. Finally, performance on the control passage (which we used to split high and low performers) was highly correlated with SAT score, to a degree similar to the correlation between SAT score and college GPA, suggesting our high–low performer split is well validated.
Across all tools, we repeatedly found heterogeneity in the effects of AI tools on reading comprehension: they helped lower performers and hurt higher performers. This underscores the need for caution and extensive testing before such tools are implemented in the educational system en masse.
Limitations
As the data used in this study were sourced through the online research platform Prolific, the participant sample reflects individuals who actively use Prolific and were interested in a reading comprehension study, which may skew the sample relative to the broader population. For example, our sample had more female participants (64%) than the population average. Additionally, the proportion of our sample who were students exceeded the national average for a similar age bracket27, and participants' SAT/ACT scores were higher than the national averages28; even our low-performer groups tended to have average SAT scores above the national average. It will therefore be important for future work to more closely mirror the broader population. Additionally, as a major incentive for participation was monetary, effort levels may have varied, though we designed our quality control process to identify and exclude low-effort participants. Given AI's relatively novel and controversial position in society, participants may also have had varying levels of confidence and trust in AI, affecting their usage of the tools in this study. Our results are also subject to our implementation of the tools (i.e., the prompts used to create them and the underlying LLM); negative findings may therefore reflect insufficiently robust AI tools, which might be improved in the future. Lastly, participants completed each condition only once, potentially limiting detection power or increasing variance in our results.
Areas of Future Research
Building on the findings of this study, we identify several areas where further investigation is needed to enhance our understanding of AI's impact on education. It is crucial to better understand how to develop tools that benefit all learners, not just lower performers. To do this, the effects of other AI tools, beyond those used in this study, and on other aspects of learning beyond reading comprehension, must be analyzed, and the consistency of this differential effect must be determined. Next, in-classroom testing is necessary to provide a more realistic environment with higher levels of student effort and motivation. Additionally, the effect of AI tools should be studied in other samples. For example, it should be assessed in K-12 students, who make up the majority of the educational system and may be less equipped to use LLM-based tutoring tools effectively. Additional samples could include participants from different countries, with different languages, or with varying learning abilities. As mentioned above, even the low performers in this study scored above the US national SAT average, potentially indicating an additional group of lower performers below those in our study. The effect should be studied in this group, as they may have the most potential to benefit from AI tools. Finally, longitudinal research should be conducted to test the effect of AI-based tools on learning over the long term.