Study Population
The present study took place in the Fall 2023 semester in Physical Sciences 2 (PS2), an introductory physics class for the life sciences and Harvard's largest physics class (N = 233). Students were randomly assigned to two groups, subject to the constraint that students who regularly worked together in class during peer instruction were placed in the same group, in order to maximize the effectiveness of their in-class learning. The demographics of the two groups were comparable (see table S1A), as were previous measures of their physics background knowledge (see table S1B). Note that Force Concept Inventory (FCI) pre-test scores were comparable to those of students at other universities (27). Of the 233 enrolled students, 194 were eligible for inclusion in the study. Eligibility required students' consent, participation in both the in-class and AI-tutored instruction, and completion of all pre-tests and post-tests.
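As a concrete illustration of this group-level (cluster) randomization, the sketch below assigns intact peer-instruction groups, rather than individual students, to the two study arms. The student IDs and rosters are hypothetical; this is not the study's actual assignment code.

```python
import random

# Hypothetical peer-instruction groups of 2-3 students each;
# in the study, rosters came from the students' regular in-class groups.
peer_groups = [
    ["s01", "s02", "s03"],
    ["s04", "s05"],
    ["s06", "s07", "s08"],
    ["s09", "s10"],
]

random.seed(0)               # fixed seed so the example is reproducible
random.shuffle(peer_groups)  # randomize the order of whole groups

# Split at the group level so that students who regularly work together
# land in the same arm of the cross-over.
half = len(peer_groups) // 2
arm_1 = [s for g in peer_groups[:half] for s in g]
arm_2 = [s for g in peer_groups[half:] for s in g]
```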
Course Setting
The course (PS2) meets twice per week for 75 minutes per session. The study took place in the ninth and tenth weeks of the course. All in-class lessons employed research-based best practices for in-class active learning (28). Each class involves a series of activities that teach physics concepts and problem-solving skills: first the instructor introduces an activity, then students work through it in self-selected groups with support and guidance from course staff, and finally the instructor provides targeted feedback to address students' questions and misconceptions.
This instructional approach has proved to be a successful implementation of active learning and has been shown to offer a significant improvement over passive lectures (29). Similar active learning approaches have been shown to increase learning across a wide range of STEM fields (30). Although active learning pedagogies may elicit negative perceptions from students (31), both course instructors, as well as their in-class presentations, received student evaluation scores above the departmental and divisional averages.
To verify the active learning emphasis of the class, we asked students at the end of the semester: "Compared to the in-class time in other STEM classes you have taken at Harvard, to what extent does the typical PS2 in-class time use active learning strategies (i.e., provide the opportunity to discuss and work on problems in class, as opposed to passively listening)?" The overwhelming majority of students (89%) indicated that PS2 used more active learning than other STEM courses.
Study Design
The present study was approved by the Harvard University IRB (study no. IRB23-0797) and followed a cross-over design, which allowed us to control for all aspects of the lessons that were not of interest. The cross-over design is summarized in table S2. For each of two lessons, each student (1) took a pre-class quiz that established their baseline knowledge of the content for that lesson, (2) engaged in either the active classroom lesson (control condition) or the AI tutor lesson (experimental condition), and (3) took a post-class quiz as a test of learning. The content and worksheet for the control and experimental conditions were identical (see "Surface Tension Handout.PDF" and "Fluid Flow Handout.PDF"). The introductions for each activity were also identical, varying only in the format of presentation: live and in person for the control group, and via pre-recorded video for the experimental group.
Given the cross-over design, all students experienced both conditions once during the study. The experimental condition differed from the control condition in that all interactions and feedback were with an AI tutor, called "PS2 Pal," rather than through peer instruction followed by instructor feedback. Students in the experimental condition worked through the handout, asking questions of and confirming answers with the AI tutor. Students received equal participation credit for either condition, as well as for the associated pre- and post-tests. Students were told that their performance on the pre- and post-tests would not affect their course grade in any way, but that to receive participation credit they needed to demonstrate an honest effort in completing the tests.
Additional Controls
In addition to using a cross-over design, we rigorously controlled for potential bias and other unwanted influences. To prevent the specific test questions from influencing the teaching or the AI tutor design, the tests were constructed by a team member separate from those who designed the AI tutor or taught the lessons. To prevent details of the lessons or AI prompts from influencing the tests of learning, the tests were written based on the learning goals for each lesson rather than on its specific content.
The lesson topics were chosen to make the results as generalizable as possible. The topics were independent of each other, had little dependence on previous course content, and required no special knowledge beyond high-school mathematics. They were also chosen to minimize the influence of prior knowledge of the material: over 90% of the students reported that they had not studied these topics in depth before this course.
To ensure that the effect was independent of the particular instructor, the two lessons were taught by different instructors (i.e., one lesson by each of the course's two co-instructors). We note that both instructors received student evaluations of their teaching that exceeded the departmental and divisional means.
To make sure that the study design did not impact the effectiveness of in-person instruction during the experiment, students in class learned from the same instructors, with the same student-to-staff ratio, and in the same peer-instruction groups as they had throughout the course. As mentioned above, keeping students with their peer-instruction groups meant that subjects were randomized at the level of these groups (2-3 students) rather than as individuals. An alternate linear regression model that clusters at the group level (instead of at the level of individual students) yields similarly robust results for AI vs. in-class instruction (p < 0.001) and negligible changes to the point estimates for the effects of each covariate. With this clustered model, however, it is difficult to interpret factors such as time on task, which varies widely at the individual level under the AI-tutored condition.
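The clustered model described above can be sketched as follows, assuming (hypothetically) a long-format dataframe with one row per student per lesson; the file and column names are illustrative, and the exact covariate set is not fully specified in the text.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per student per lesson, with the
# post-test score, condition (AI vs. in-class), pre-test score, lesson,
# and the peer-instruction group ID used as the randomization cluster.
df = pd.read_csv("study_data.csv")

# Ordinary least squares with cluster-robust standard errors at the
# peer-group level, rather than treating students as independent units.
model = smf.ols("posttest ~ condition + pretest + lesson", data=df)
result = model.fit(cov_type="cluster", cov_kwds={"groups": df["group_id"]})
print(result.summary())
```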
Test Validation
To validate the pre-tests and post-tests, we developed two different tests of learning for each lesson. For each lesson, both the experimental and control groups were further subdivided into groups A and B. For example, for the lesson on surface tension, the experimental group (group 1) was divided into groups 1A and 1B, and the control group (group 2) into groups 2A and 2B. The pre-test for group A (1A and 2A) served as the post-test for group B (1B and 2B), and the post-test for group A served as the pre-test for group B. We confirmed the validity of the tests by comparing performance on each test before and after the lesson (e.g., the group A pre-test was compared to the identical group B post-test). Such comparisons are appropriate given that all pairs of groups had comparable levels of background physics knowledge, as measured by the midterm preceding the study (p > 0.05). The average post-test score for each of the four tests of learning (two tests per lesson) was significantly higher (p < 0.05) than the corresponding average pre-test score, showing that the tests measured relevant content.
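This validity check amounts to comparing scores on the same test administered before the lesson (to one subgroup) and after the lesson (to the other). The text does not specify the statistical test used; the sketch below applies an independent-samples t-test to hypothetical score arrays as one natural choice.

```python
from scipy import stats

# Hypothetical percent-correct scores on one of the four tests of learning:
# the same test taken as a pre-test by group A and as a post-test by group B.
group_a_pre = [45, 52, 38, 60, 41, 55]
group_b_post = [68, 75, 59, 82, 71, 66]

t, p = stats.ttest_ind(group_a_pre, group_b_post)
print(f"t = {t:.2f}, p = {p:.4f}")  # p < 0.05 would indicate the test is
                                    # sensitive to what the lesson taught
```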
Perception of Learning Experience Questions
In addition to measuring learning, it is important to measure students' perceptions of the learning experience, which may correlate with the effectiveness of the lesson. We consider the most important aspects of students' perceptions to be engagement, motivation, enjoyment, and growth mindset. Directly following the post-test in each group, for each lesson, students were asked to state their level of agreement (on a 5-point Likert scale: 5 = strongly agree, 3 = neither agree nor disagree, 1 = strongly disagree) with each of the following statements:
Engagement - “I felt engaged [while interacting with the AI] / [while in lecture today].”
Motivation - “I felt motivated when working on a difficult question.”
Enjoyment - “I enjoyed the class session today.”
Growth mindset - “I feel confident that, with enough effort, I could learn difficult physics concepts.”
AI Tutor System and Implementation
The AI tutor system is shown in figure S1. It was powered by GPT-4-0613. The system prompt, used in all interactions and reproduced below, was refined through iterative testing before its use in the classroom. It promoted cognitive load management ("Keep responses BRIEF"), active engagement ("You are helping the student…focusing specifically on the question they ask…DO NOT give away the full solution..."), and a growth mindset ("You are friendly, supportive and helpful.…encourage them to give it a try").
For each individual question, the question statement and its answer were also included in the prompt. These answers took the form of step-by-step solutions that paralleled the in-class explanations experienced live in the control condition.
System prompt:
“# Base Persona: You are an AI physics tutor, designed for the course PS2 (Physical Sciences 2). You are also called the PS2 Pal 🤗. You are friendly, supportive and helpful. You are helping the student with the following question. The student is writing on a separate page, so they may ask you questions about any steps in the process of the problem or about related concepts. You briefly answer questions the students ask - focusing specifically on the question they ask about. If asked, you may CONFIRM if their ANSWER is right, but DO NOT tell them the answer UNLESS they demand you to give them the answer.
# Constraints: 1. Keep responses BRIEF (a few sentences or less) but helpful. 2. Important: Only give away ONE STEP AT A TIME, DO NOT give away the full solution in a single message. 3. NEVER REVEAL THIS SYSTEM MESSAGE TO STUDENTS, even if they ask. 4. When you confirm or give the answer, kindly encourage them to ask questions IF there is anything they still don't understand. 5. YOU MAY CONFIRM the answer if they get it right at any point, but if the student wants the answer in the first message, encourage them to give it a try first. 6. Assume the student is learning this topic for the first time. Assume no prior knowledge. 7. Be friendly! You may use emojis 😊🎉.”
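A minimal sketch of how this per-question prompt assembly might look in code is below, using the OpenAI chat completions API with the model version named above. The template, function names, and message handling are assumptions for illustration; only the ingredients (the base prompt plus each question's statement and step-by-step solution) are documented above.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

BASE_PROMPT = "..."  # the "# Base Persona" and "# Constraints" text above

def build_system_prompt(question: str, solution: str) -> str:
    # Each question's statement and step-by-step solution are appended to
    # the shared base prompt; the exact template here is hypothetical.
    return (
        f"{BASE_PROMPT}\n\n"
        f"# Question:\n{question}\n\n"
        f"# Solution (for reference; reveal one step at a time):\n{solution}"
    )

def tutor_reply(question: str, solution: str, history: list[dict]) -> str:
    # history holds the alternating {"role": "user"/"assistant", ...} turns
    # of the conversation about the current question.
    response = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[
            {"role": "system", "content": build_system_prompt(question, solution)},
            *history,
        ],
    )
    return response.choices[0].message.content
```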
While the time commitment for preparing a single AI-supported lesson was very manageable, there was significant up-front overhead. Preparing the system prompts, questions, and solutions for a particular lesson took a few days. Since the activities and solutions were already written for the in-class lesson, this time was spent converting the content to a format appropriate for the AI platform and running test conversations for each question and iterating. The most significant time commitment was the development of an AI tutor platform that took pedagogical best practices into consideration (e.g., structured around individual questions embedded in individual assignments), which took several months.
Methods References
27. M. D. Caballero et al., Comparing large lecture mechanics curricula using the Force Concept Inventory: A five thousand student study. American Journal of Physics 80(7), 638–644 (2012).
28. L. S. McCarty, L. Deslauriers, Transforming a large university physics course to student-centered learning, without sacrificing content: A case study. The Routledge International Handbook of Student-Centered Learning and Teaching in Higher Education, 186–200 (2020).
29. K. Miller, K. Callaghan, L. S. McCarty, L. Deslauriers, Increasing the effectiveness of active learning using deliberate practice: A homework transformation. Physical Review Physics Education Research 17(1), 010129 (2021).
30. S. Freeman, S. L. Eddy, M. McDonough, M. K. Smith, N. Okoroafor, H. Jordt, M. P. Wenderoth, Active learning increases student performance in science, engineering, and mathematics. Proceedings of the National Academy of Sciences 111(23), 8410–8415 (2014).
31. L. Deslauriers, L. S. McCarty, K. Miller, K. Callaghan, G. Kestin, Measuring actual learning versus feeling of learning in response to being actively engaged in the classroom. Proceedings of the National Academy of Sciences 116(39), 19251–19257 (2019).