The increased rate of ASD diagnosis in recent years (CDC, 2020) has fueled machine learning research aimed at improving the learning experience of those affected. Research has focused mainly on developing academic or social skills learning applications (Foster et al., 2010; Roman et al., 2018), improving diagnosis efficiency (Kosmicki et al., 2015), and modelling social and behavioral aspects of ASD (Stevens et al., 2017). However, we are not aware of any research that has applied reinforcement learning to solve the MSP. The following sections review related work on digitized behavior management and reinforcement learning in order to identify the gap between available technology tools and the need for therapy-recommendation solutions in special education.
Digitized Behavior Intervention
There are many available applications that allow therapists, teachers, and parents to monitor the behavior of children with special needs (Marcu et al., 2013; Vannest et al., 2011). These applications allow the people involved in the intervention of children with ASD to track, store, and share important information, which is then used to plan interventions, monitor progress towards IEP objectives, and generate reports. While these applications are very helpful and a good replacement for paper-based data collection, the data collected in special needs settings is usually complex, unstandardized, and incomplete (Marcu et al., 2013). Many studies have suggested using data mining techniques to support intervention decisions (Thabtah, 2019). For instance, Burns et al. (2015) developed a mobile app in which parents collect Antecedent, Behavior, and Consequence (ABC) data; association rule mining is then applied to reveal patterns in behavior causes and effects and inform therapists' decisions. Linstead et al. (2016) introduced the Autism Management Platform (AMP), an integrated health care information system for managing data related to the diagnosis and treatment of children with ASD. The authors developed a mobile application to facilitate information and multimedia sharing between parents and clinicians. The system also includes a web interface and an analytics platform that allows specialists to mine patient data in real time and uses machine learning techniques to provide users with personalized data search preferences. Bhuyan et al. (2017) studied temporal data to identify factors that help caregivers create an effective intervention plan and to predict suitable treatments based on data from other contexts.
Previous studies have also focused on using mobile technology to help children with ASD and their caregivers regulate challenging behaviors. For instance, Crutchfield et al. (2015) evaluated the impact of the I-Connect app on stereotypy in adolescents with ASD in a school setting. Préfontaine et al. (2019) developed the iSTIM app to support parents of younger children with ASD in reducing stereotypy behavior. The app was evaluated and found successful in regulating stereotypy when used by trained researchers as well as by parents without formal ABA training (Trudel et al., 2020). In another related study, Begoli et al. (2013) aimed to develop a computational representation of ABA to serve as a reasoning foundation for intelligent-agent-mediated therapies by formulating ABA concepts as a process ontology. Concepts relevant to the agents' reasoning and operational functions (e.g., rewarding and prompting) were represented in the ontology and then formalized within a Belief-Desire-Intention (BDI) reasoning framework. Such formalization is feasible because of the procedural, repetitive, and prescriptive nature of ABA (Begoli, 2014).
Reinforcement Learning (RL)
As a subfield of machine learning, RL has been widely implemented, and its applicability to real-life problems and decision-support systems continues to grow (Yu et al., 2020). For instance, RL has been used to improve the delivery of personalized care by optimizing medication choices, medicine doses, and intervention timings (S. Liu et al., 2020). In the healthcare and therapy domain, data is characterized by high dimensionality and complex interdependencies (Gräßer et al., 2017). RL has the potential to automatically explore various treatment options by analyzing patient data to derive a policy and personalize therapy without the need for pre-established rules (S. Liu et al., 2020).
Recommender systems have also leveraged RL. RL-based recommender systems have the advantage of updating their policies during online interaction, which enables the system to generate recommendations that best suit users' evolving preferences (Zhao et al., 2019). Examples include news recommendation (Zheng et al., 2018), music recommendation (Hong et al., 2020), and personalized learning systems (Shawky & Badawi, 2019).
RL has proven to be an appropriate framework for interaction modeling and for optimizing problems that can be formulated as MDPs. The advantage of such methods is the ability to model the stochastic variation of outcomes as transition probabilities between states and actions (Tsiakas et al., 2016). RL and MDPs have been successfully applied to personalized learning systems (Sayed et al., 2020; Shawky & Badawi, 2019), intelligent tutoring systems (Barnes & Stamper, 2008; Stamper et al., 2013), adaptive serious games for ASD (Khabbaz et al., 2017), and robot-assisted therapy (Tsiakas et al., 2016). For instance, Bennane (2013) automated the selection of a tutoring system's content and pedagogical approach to provide differentiated instruction. Similarly, Shawky and Badawi (2019) used RL to build an intelligent environment that provides learners with suitable content and adapts to the learner's evolving states. Khabbaz et al. (2017) proposed an adaptive serious game for rating social ability in children with ASD using RL; the game adapts itself to the child's level by adjusting the difficulty of the activities. In the field of robot-assisted therapy, Tsiakas et al. (2016) proposed an interactive RL framework that adapts to the user's preferences and refines its learned policy when coping with new users.
In this work, we aim to develop an app that can be used by any of the child's caregivers in any setting. Moreover, we aim to provide teachers and therapists with a tool that facilitates intervention planning once a problematic behavior is detected, by recommending motivators using RL. Unlike previous studies, we rely on online learning rather than on previously collected data. While online learning does not benefit from an offline, repetitive training period, it allows the model to adjust its policies to match the non-stationary environment and the individuality of each child with SEND (P. Liu & Chen, 2017).
Solving the MSP
The aim of this work is to leverage the power of RL to solve the problem of selecting the best motivator for each intervention session. We first model the MSP as a Markov Decision Process (MDP). By using MDPs, the proposed model can explicitly account for future rewards, which benefits motivator recommendation accuracy in the long run, and can address many of the challenges faced in therapy decision-making. We then apply RL, using Q-learning, to solve the modeled problem.
Markov Decision Processes (MDP)
The MSP can be formulated as an MDP. An MDP is a standard formalization of sequential decision making, widely used for applications where an autonomous agent interacts with its surrounding environment through actions. An MDP can be defined as a four-tuple (δ, A, P, R), where δ is a set of states called the state space, A is a set of actions called the action space, P is the state transition function, which gives the probability of transitioning between every pair of states given an action, and R is the reward function that assigns an immediate reward after transitioning to a new state due to an action (Sutton & Barto, 2018).
The agent, which is situated in the therapist's or teacher's mobile application, interacts with the environment at discrete time steps. In our setting, a time step occurs each time a therapist records a behavior in the mobile application. At each time step, the agent receives a state St from the environment, drawn from the set of possible states δ. Based on this state, the agent selects an action At from the set of actions A that are valid in state St. Actions in our setting are motivators the therapist can use to motivate the student. Based in part on the agent's action, the agent finds itself in a new state St+1 one time step later. The environment also provides the agent a scalar reward Rt+1 from a set of possible rewards R. The reward in our setting depends on whether the student becomes motivated and to what degree, among other factors explained in the next sections. The transition (st, at, rt+1, st+1) is stored in memory M. The ultimate goal of this system is to enhance the learning and therapy experience of the child by recommending the right motivator (Sutton & Barto, 2018).
The agent-environment interaction produces a trajectory of experience consisting of state-action-reward tuples. Actions influence immediate rewards as well as future states and, therefore, future rewards. When the agent takes an action in a state, the transition dynamics function p(s′, r | s, a) formalizes the state transition probability: it gives the probability of transitioning to state s′ with reward r, from state s, when taking action a.
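To make this interaction loop concrete, the minimal sketch below shows one discrete time step in which a transition (st, at, rt+1, st+1) is appended to memory M. It is an illustration only; the function and variable names (interaction_step, select_action, environment_step, memory) are hypothetical and do not come from the deployed application.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

# (s_t, a_t, r_{t+1}, s_{t+1})
Transition = Tuple[Hashable, Hashable, float, Hashable]

memory: List[Transition] = []  # the memory M described above


def interaction_step(
    state: Hashable,
    valid_actions: Sequence[Hashable],
    select_action: Callable[[Hashable, Sequence[Hashable]], Hashable],
    environment_step: Callable[[Hashable, Hashable], Tuple[Hashable, float]],
) -> Hashable:
    """One discrete time step of the agent-environment loop.

    select_action encapsulates the agent's policy; environment_step returns
    the next state and scalar reward observed after the caregiver applies
    the chosen motivator and rates the student's response.
    """
    action = select_action(state, valid_actions)
    next_state, reward = environment_step(state, action)
    memory.append((state, action, reward, next_state))  # store the transition
    return next_state
```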
Modelling MSP as an MDP
Research suggests that, in various clinical settings, modeling treatment decisions through MDPs is effective and can yield better results than therapists' intuition alone (Bennett & Hauser, 2013). However, there are no previous attempts to model the MSP as an MDP. Careful formulation of the problem and the state/action space is essential to obtain satisfactory results and to satisfy the Markov assumption that the current time point (t) depends only on the previous time point (t-1) (Sutton & Barto, 2018). Fig 1 shows how the MSP, represented by the ABA intervention, is mapped to an MDP in this work. The following sub-sections describe how each component of the MDP is used to model the MSP.
1. State
One of the most challenging and critical issues in designing the MDP model is to properly identify the factors that influence the effectiveness of a motivator, especially when these factors may differ from one child to another. The personalization of intervention can be achieved by carefully determining the features that represent the state space (Shawky & Badawi, 2019). Through a careful review of research on motivation stimuli for students with ASD, the features outlined in Table 1 were considered:
Table 1: Features representing the state space
| Feature | Description | Number of values | Reference |
| --- | --- | --- | --- |
| Contextual features | | | |
| Antecedent event (trigger) | Event or activity that immediately preceded a problem behavior (alone, given a direction or demand, transitioned to new activity, denied access to an item) | 4 | (Bhuyan et al., 2017; Stichter et al., 2009) |
| Time of Day | Time of day the problem behavior occurred (morning, noon, evening) | 3 | (Burns et al., 2015) |
| Subject | Accounts for the place and person the problem behavior occurred with (academic subjects, therapy sessions, home) | 8 | (Burns et al., 2015) |
| Behavior | | | |
| Behavior | The problem behavior that requires intervention, grouped into seven categories (aggression, self-injury, disruption, elopement, stereotypy, tantrums, non-compliance) | 7 | (Stevens et al., 2017) |
| Behavior Function | The reason the behavior is occurring (sensory stimulation, escape, access to attention, access to tangibles) | 4 | (Alstot & Alstot, 2015) |
| History | | | |
| Last unsuccessful motivator | The ID of the last motivator used that was not successful in motivating the student within an episode, including an option for "none" | 7 | |
| Motivator past usage | The number of times each motivator was used within a week, grouped into categories of <5, 5-10, >11. This factor is composed of six features according to the number of motivators (actions) available (edibles, sensory, activities, tokens, social, choice) | 36 | (Çetin, 2021) |
Therapists and teachers aim to identify appropriate interventions for multiple settings. However, these interventions may fail if no attention is given to contextual differences (Stichter et al., 2009). Contextual features such as antecedent events, time of day, and location (where and with whom) all impact the child's response to a proposed intervention and therefore inform the optimal motivator. Moreover, while interventionists aim to track and remediate problem behaviors, understanding the reason behind the occurrence of a behavior is as essential as the behavior itself for creating appropriate behavior plans (Schaeffer, 2018).
Problem behaviors in special education are numerous and diverse. In this study, challenging behaviors are grouped into eight widely observed behaviors (Stevens et al., 2017): aggression (e.g., hitting, biting), self-injury (e.g., head-banging, hitting walls), disruption (e.g., yelling, knocking things over), elopement (e.g., wandering, escaping), stereotypy (e.g., rocking, hand-flapping), tantrums (e.g., crying, screaming), non-compliance (e.g., whining, defying orders), and obsession (e.g., constantly talking about the same topic).
Keeping track of the last ineffective motivator used is essential in our problem definition to maintain the Markov property, whereby the future state and reward depend only on the current state and action (Sutton & Barto, 2018). We include this feature in the state to prevent suggesting the same motivator repeatedly. Moreover, we keep track of the number of times a motivator group was used to prevent satiation (Matheson & Douglas, 2017; Rincover & Newsom, 1985). While studies have shown that extrinsic reward does not directly harm a child's intrinsic motivation (Cameron & Pierce, 1994), we consider repeated long-term use of tangible rewards, such as edibles or tokens, to have a negative impact when not carefully administered, and therefore limit their use (Witzel & Mercer, 2003).
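As a minimal sketch of how the Table 1 features could be combined into a single discrete state, the example below represents a state as a hashable tuple suitable for indexing a Q-table. The feature labels, the function name (encode_state), and the usage buckets are illustrative assumptions; the deployed app may encode states differently.

```python
from typing import Dict, Tuple

# Illustrative value lists drawn from Table 1; labels are our own shorthand.
MOTIVATORS = ["edible", "sensory", "activity", "token", "social", "choice"]
USAGE_BUCKETS = ["<5", "5-10", ">11"]  # weekly usage categories per motivator

State = Tuple  # a hashable tuple of feature values


def encode_state(
    antecedent: str,               # e.g. "denied_access" (4 values)
    time_of_day: str,              # "morning" / "noon" / "evening" (3 values)
    subject: str,                  # academic subject, therapy session, or home (8 values)
    behavior: str,                 # e.g. "aggression" (7 values)
    behavior_function: str,        # e.g. "escape" (4 values)
    last_unsuccessful: str,        # one of MOTIVATORS or "none" (7 values)
    weekly_usage: Dict[str, str],  # motivator -> usage bucket (history feature)
) -> State:
    """Combine the Table 1 features into one hashable state."""
    usage = tuple(weekly_usage.get(m, "<5") for m in MOTIVATORS)  # fixed order
    return (antecedent, time_of_day, subject, behavior,
            behavior_function, last_unsuccessful, usage)
```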
2. Actions
There has been controversy regarding what type of reward best motivates children with SEND to follow routines and complete academic tasks without negatively impacting their future behavior. Nevertheless, there is strong evidence that rewarded children report higher intrinsic motivation than non-rewarded children (Cameron & Pierce, 1994).
However, the dilemma of which motivator is best suited for each intervention remains. Many factors impact the choice of the right contingent reward (motivator) during a therapy or academic session. According to ABA techniques, there is a need to address what happens before the behavior, what the behavior itself is, and what is done immediately after the behavior. In this study, the goal is to recommend an action (contingent motivator) that can be given to the student after completing a certain task or complying with a certain command. The teacher or therapist needs to decide which motivator to use from a list of six motivator categories (see Table 2): edibles, sensory, activities, tokens, social, and choice (Çetin, 2021). For example, if a student is yelling to get the teacher's attention, the teacher may promise the student a favorite food item (edible) if the student stops yelling and completes her task. Alternatively, the teacher may assign a leadership role (social) as a motivator once she is done with the activity. If another student is wandering to escape a task, the teacher may promise extra computer time (activity) once the student completes the task at hand. Therapists also consider the long-term effect of the motivator. For example, edible items, especially unhealthy choices, should be avoided. Repetitive use of the same motivator should also be avoided to prevent satiation. Experienced interventionists sometimes use the same motivator for a specific period of time to establish a routine but change it later to prevent the student's dependency on that particular reward to complete tasks.
Table 2: Motivator Categories

| Motivator | Description |
| --- | --- |
| Edible | Food items, such as fruits, snacks, and juice. |
| Sensory | Items or activities that provide pleasure to the senses of the child, such as listening to music, sitting in a rocking chair, or playing with sand. |
| Activity | Activities may include drawing, playing with the computer, or jumping on a trampoline. |
| Token | Tangible items that the child values, such as stickers, money, or stars on an honor chart. |
| Social | Attention or interaction with another person, such as high-fives, smiles, and praise. |
| Choice | Giving the child the chance to choose between two different items or methods, such as asking whether she prefers to use a pencil or crayons to write. |
3. Rewards
The reward in our problem definition is the measure of student motivation after introducing the motivator. In this study, we adopt the subjective measure of responsiveness proposed by Koegel and Egel (1979), shown in Table 3. The teacher or therapist rates the student's responsiveness after introducing a motivator and carrying out an activity.
Table 3: Scale of child’s responsiveness (adapted from Koegel and Egel (1979))
| Output | Description | Reward |
| --- | --- | --- |
| Negative | Child continues problem behavior (tantrums, kicking, screaming) or does not comply with instructions and engages in behavior unrelated to the activity (rocking, yawning, tapping). | -1 |
| Neutral | Complies with instructions but tends to get restless or loses attention. | +2 |
| Positive | Performs task readily. Attends to task quickly, smiles while doing the task, and presents appropriate behavior. | +4 |
| Rejected recommendation | The user rejects the motivator recommendation and does not introduce it to the child. | -0.25 |
| Edible item | The motivator selected was an edible item. | -1 |
| Token item | The motivator selected was a token item. | -0.5 |
Each student responsiveness category results in the agent receiving a reward, as shown in Table 3. The agent receives a reward of -1 if the motivator did not work or the student's response was negative, +2 if the response was neutral, and +4 if the response was positive. If the caregiver chooses not to follow the recommendation, the reward is -0.25. In formulating the problem, we also aim to balance two competing objectives: receiving positive responsiveness from the student and limiting long-term exposure to unhealthy items. The notion of "safe reinforcement learning" has been proposed in the literature, especially for recommender systems that aim to balance user satisfaction with the avoidance of recommending harmful items such as violent movies (Heger, 1994). Therefore, the agent receives a penalty of -1 when recommending edibles and -0.5 when recommending tokens.
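A minimal sketch of this reward scheme is shown below. The label strings and the assumption that the motivator-type penalty is simply added to the responsiveness reward are ours; the text lists both components but does not state the exact combination rule.

```python
# Responsiveness rewards from Table 3.
RESPONSE_REWARD = {
    "negative": -1.0,
    "neutral": 2.0,
    "positive": 4.0,
    "rejected": -0.25,  # caregiver declined the recommended motivator
}

# Safety penalties for motivator types we want to limit (Table 3).
MOTIVATOR_PENALTY = {
    "edible": -1.0,
    "token": -0.5,
}


def compute_reward(response: str, motivator: str) -> float:
    """Combine the responsiveness rating with the motivator-type penalty.

    Assumes (our reading) that the penalty is added on top of the
    responsiveness reward whenever an edible or token is recommended.
    """
    return RESPONSE_REWARD[response] + MOTIVATOR_PENALTY.get(motivator, 0.0)
```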
Q-Learning
To solve the proposed MDP, a Q-learning algorithm with an epsilon-greedy (ε-greedy) policy and a decaying exploration rate was used. Q-learning is an off-policy, value-based RL algorithm that aims to find the best action to take given the current state, learning a policy that maximizes the total reward. Q-learning is considered off-policy because it learns from actions chosen according to a behavior policy that differs from the policy being updated. A policy here is equivalent to an ABA-based intervention protocol, with the advantage of capturing more individualized details of students. In our case, the agent chooses actions according to an ε-greedy policy while learning the optimal policy. ε-greedy is a method used to balance exploration and exploitation, where epsilon (ε) is the probability of exploring (i.e., choosing a random action) rather than exploiting (i.e., choosing the currently optimal action). The policy is represented by a table that maps all possible states to actions. While following the ε-greedy policy, the agent exploits with probability (1-ε) and explores with probability ε. This probability decays over time at some rate as the agent learns more about the environment, so the agent becomes "greedier" about exploiting and explores less. Once the agent is well trained, it can select the best action for a given state, a process described as acting according to an optimal policy (Sutton & Barto, 2018).
Q(s, a) denotes the estimated value of taking action a in state s; it is updated according to Equation 1 (the Q-learning update), which is based on Bellman's optimality equation (Bellman, 1966).
Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]   (1)
where α is the learning rate, r is the observed reward, s′ is the new state, γ < 1 is the discount factor for future rewards, and max_{a′} Q(s′, a′) is the estimate of the maximum reward that can be obtained by taking the best action in state s′. The learning process can continue for any number of episodes; in our case, an episode ends when the student becomes motivated. The Q-learning algorithm can be found in Appendix A.
While it may seem straightforward to apply standard learning algorithms to learn the agent's optimal policy and then use it to recommend motivators to the user, this approach cannot be applied in practice to our problem. Unlike traditional reinforcement learning tasks such as Atari games (Mnih et al., 2015), therapy recommendation tasks cannot rely on interacting with the user repeatedly to obtain arbitrary amounts of experience for updating the policy towards an optimal one (Lei & Li, 2019). Moreover, there is no previously collected data with which to train the algorithm offline before the online interaction. Therefore, we do not vary the experimental parameters in this study.
Additionally, this study is considered a "cold start", as all values in the Q-table were set to zero before the deployment phase. A cold start can be problematic because it burdens users with many interactions before enough experience has been collected for learning (C. Zhang et al., 2021). On the other hand, online learning is beneficial for therapy recommendations due to the highly dynamic nature of children's preferences and responses to intervention. Moreover, online learning allows us to obtain user feedback by tracking whether the suggested motivator was used or not (Arzate Cruz & Igarashi, 2020; P. Liu & Chen, 2017).
Each episode starts when a caregiver records a behavior in the mobile app and terminates upon reaching the final state in which the student becomes motivated. We use a learning rate α of 0.1 and a discount factor γ of 0.95. We apply an ε-greedy policy that starts with a high ε of 0.9 to encourage state exploration; ε then decays exponentially at a rate of 0.99 until it reaches 0.05.
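The sketch below illustrates the update of Equation 1 together with the ε-greedy selection and decay schedule described above, using the stated parameters (α = 0.1, γ = 0.95, ε from 0.9 to 0.05 with decay 0.99). It is a simplified illustration, not the deployed implementation or the full algorithm listed in Appendix A; the function names are our own.

```python
import random
from collections import defaultdict

ACTIONS = ["edible", "sensory", "activity", "token", "social", "choice"]

ALPHA = 0.1           # learning rate α
GAMMA = 0.95          # discount factor γ
EPSILON_MIN = 0.05
EPSILON_DECAY = 0.99

Q = defaultdict(float)  # cold start: Q(s, a) = 0 for all state-action pairs
epsilon = 0.9           # initial exploration rate


def select_action(state, valid_actions=ACTIONS):
    """ε-greedy selection over the motivators that are currently valid."""
    if random.random() < epsilon:
        return random.choice(list(valid_actions))            # explore
    return max(valid_actions, key=lambda a: Q[(state, a)])   # exploit


def q_update(state, action, reward, next_state, terminal):
    """Q-learning update (Equation 1) followed by exploration decay."""
    global epsilon
    best_next = 0.0 if terminal else max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    epsilon = max(EPSILON_MIN, epsilon * EPSILON_DECAY)
```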
As shown in Fig 3, at each time step t, the therapist or teacher records a behavior instance and requests a motivator recommendation. The agent takes the feature representation of the current state and recommends a motivator using the ε-greedy policy. The caregiver then administers the intervention and provides feedback by rating the student's response. Alternatively, the caregiver can choose not to use the recommended motivator if it is deemed inappropriate, or skip the recommendation if the item is not available (e.g., edible items) or cannot be applied to the current activity (e.g., choice). When the agent chooses to exploit, it selects the action with the highest Q(s, a) for the observed state from the Q-table. Otherwise, the agent "explores" by selecting a random action.
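Putting the pieces together, the sketch below shows how one recommendation-and-feedback cycle from Fig 3 might be wired up, reusing the illustrative helpers defined earlier (encode_state, select_action, compute_reward, q_update). The handler names and the treatment of rejected recommendations are assumptions for illustration only.

```python
def recommend(state, available_motivators):
    """Handle a recommendation request for the current state (one time step)."""
    return select_action(state, valid_actions=available_motivators)


def record_feedback(state, action, response, next_state, student_motivated):
    """Handle the caregiver's rating (or rejection) of the recommended motivator.

    response is one of "negative", "neutral", "positive", or "rejected";
    the episode terminates when the student becomes motivated.
    """
    reward = compute_reward(response, action)
    q_update(state, action, reward, next_state, terminal=student_motivated)
```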