This section discusses the design of the RL agent, starting with the design of the state space and action space in Section A. The design and specification of the reward function is described in Section B.
A. State Space & Action Space
The state space is the agent's only interpretation of its environment. Therefore, it must be composed of factors consistent enough for the agent to make sense of that environment in terms of the problem at hand.
Therefore, to approximate the desired policy, a continuous state space is designed that provides localization and mobility towards the goal along with obstacle detection. The state space consists of the following 7 components:
i. robot’s base position
ii. robot’s orientation
iii. distance to goal
iv. robot’s linear velocity
v. robot’s angular velocity
vi. active LIDAR rays
vii. LIDAR hit points
Components i-iii add localization, components iv and v ensure mobility in the direction of the goal, and components vi and vii are utilized to add obstacle detection.
The dimension of the entire state space is (1 × 414). As shown in Fig. 1, the robot’s base position is represented in the global Cartesian coordinate frame as [x, y, z], with dimensions (1 × 3). The robot’s orientation is a quaternion [x, y, z, w], with dimensions (1 × 4). The distance to the goal is the perpendicular distance between the robot and the goal and adds 1 component to the state space. Linear velocity along the [x, y, z] directions of the Cartesian coordinate frame requires (1 × 3), and angular velocity [wx, wy, wz] in the Cartesian world coordinate frame requires (1 × 3). Representing a LIDAR of 100 rays requires a list of (1 × 100). Finally, representing each LIDAR hit point in the global coordinate frame by its [x, y, z] coordinates requires a list of (1 × 300) for 100 rays.
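For illustration only, the following minimal sketch assembles such a 414-dimensional observation vector with NumPy; the variable names (base_pos, orientation, lidar_ranges, etc.) are hypothetical placeholders for quantities the simulator would provide.

```python
import numpy as np

def build_state(base_pos, orientation, goal_dist, lin_vel, ang_vel,
                lidar_ranges, lidar_hits):
    """Concatenate the 7 state components into one (414,) vector.

    base_pos     : (3,)      robot base position [x, y, z] in the world frame
    orientation  : (4,)      quaternion [x, y, z, w]
    goal_dist    : float     perpendicular distance to the goal
    lin_vel      : (3,)      linear velocity [vx, vy, vz]
    ang_vel      : (3,)      angular velocity [wx, wy, wz]
    lidar_ranges : (100,)    one reading per LIDAR ray
    lidar_hits   : (100, 3)  hit point [x, y, z] of each ray
    """
    state = np.concatenate([
        np.asarray(base_pos, dtype=np.float32),            # 3
        np.asarray(orientation, dtype=np.float32),          # 4
        np.array([goal_dist], dtype=np.float32),            # 1
        np.asarray(lin_vel, dtype=np.float32),              # 3
        np.asarray(ang_vel, dtype=np.float32),               # 3
        np.asarray(lidar_ranges, dtype=np.float32),          # 100
        np.asarray(lidar_hits, dtype=np.float32).ravel(),    # 300
    ])
    assert state.shape == (414,)
    return state
```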
The action space of the robot consists of 2 components: the velocity of the robot and the steering angle. Both components are continuous and bounded by [-1, 1]. The dimension of the action space is (1 × 2), the resultant velocity and steering angle respectively.
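A minimal sketch of this bounded, continuous action space, assuming a Gym-style interface (the library choice is an assumption, not stated in the text):

```python
import numpy as np
from gym import spaces

# Hypothetical Gym-style definition: action[0] = resultant velocity,
# action[1] = steering angle, both continuous in [-1, 1].
action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

velocity_cmd, steering_cmd = action_space.sample()  # e.g., during exploration
```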
B. Reward Function Design
The reward function sets the foundation of the solution formulation in RL, since the performance of the RL agent is entirely dependent upon the received reward signal. It is therefore a vital component of RL agent training [16].
The reward function binds the environment’s state space with the problem under consideration in order to acquire an effective solution. Therefore, stability in learning is closely associated with the correlation between the state space and the reward function. However, sparsity in the reward function can lead the training system to divergence [17].
The performance of the learning system depends upon a stable reward function which can effectively map the state space onto the action space in the direction of the goal. To attain compatibility, a continuous reward function is designed for the continuous state space. The designed reward function comprises 4 components, each of which adds a distinct feature that contributes a step toward the goal: a collision avoidance reward, a closeness to goal reward, a linear velocity reward, and an angular velocity reward.
Collision avoidance is ensured by the collision reward (CR), which generates a sharp −1 in case of collision with a wall and 1 otherwise.
$$CR=\begin{cases}-1, & \text{distance to wall} < 0\\ 1, & \text{otherwise}\end{cases}$$
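A minimal sketch of this component in Python; the wall-distance signal (here distance_to_wall) is a hypothetical name for the collision query the simulator would expose:

```python
def collision_reward(distance_to_wall: float) -> float:
    """CR: sharp -1 on collision with a wall, 1 otherwise."""
    return -1.0 if distance_to_wall < 0.0 else 1.0
```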
To add perception of the goal and encourage displacement in the direction of the goal, a component called the closeness to goal reward (CGR) is incorporated.
$$CGR=1-\frac{\text{current distance}}{\text{total distance}}, \quad \left[-1, 1\right]$$
Here, current distance is the perpendicular distance between the car base and the goal, whereas total distance is the perpendicular distance between the car base and the goal at the initial position of the episode. CGR is bounded between −1 and 1; it approaches −1 if the car displaces away from the goal and approaches 1 if the displacement is in the direction of the goal.
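For illustration, a sketch of CGR under the same convention (current_distance and total_distance are hypothetical names for the two distances defined above):

```python
def closeness_to_goal_reward(current_distance: float,
                             total_distance: float) -> float:
    """CGR: 1 - current/total; approaches 1 near the goal and -1 when the
    car has moved away to roughly twice the initial distance."""
    return 1.0 - current_distance / total_distance
```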
To encourage rectilinear motion in the direction of the goal, the linear velocity reward (LVR) suppresses the velocity in the lateral direction (LVX) and encourages high velocity in the vertical direction (LVY). The resultant velocity of the system is bounded between −1 and 1; therefore, the maximum attainable velocity in either direction (lateral or vertical) is 1 and the minimum is −1, in case the other component is suppressed to 0. So, the difference between the lateral and vertical components of the linear velocity will be bounded between −1 and 1, representing movement in the positive and negative direction of the axis respectively. However, to avoid local maxima, the weightage of the linear velocity is multiplied by a factor of 2.
$$LVR=2\left({LV}_{Y}- \left|{LV}_{X}\right|\right), \quad \left[-2, 2\right]$$
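A corresponding sketch (lv_x and lv_y denote the lateral and vertical components of the linear velocity; the names are illustrative):

```python
def linear_velocity_reward(lv_x: float, lv_y: float) -> float:
    """LVR: reward vertical (toward-goal) speed, penalize lateral speed.
    The factor of 2 weights this term against the other components."""
    return 2.0 * (lv_y - abs(lv_x))
```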
To avoid deflection from the straight path, the angular velocity of the car is suppressed. The angular velocity reward component (AVR) is introduced to give a −1 reward if the angular velocity exceeds an empirically set threshold value. The reward approaches 1 as the angular velocities in all directions are suppressed to zero.
$$AVR= \begin{cases}-1, & \left|{X}_{AV}\right|+\left|{Y}_{AV}\right|+ \left|{Z}_{AV}\right|>0.09\\ 1-\left(\left|{X}_{AV}\right|+\left|{Y}_{AV}\right|+ \left|{Z}_{AV}\right|\right), & \text{otherwise}\end{cases}$$
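A sketch of AVR with the empirically set threshold of 0.09 (the per-axis angular velocity variable names are hypothetical):

```python
def angular_velocity_reward(av_x: float, av_y: float, av_z: float,
                            threshold: float = 0.09) -> float:
    """AVR: -1 if the summed angular speed exceeds the threshold,
    otherwise 1 minus the summed angular speed."""
    total = abs(av_x) + abs(av_y) + abs(av_z)
    return -1.0 if total > threshold else 1.0 - total
```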
The sum of all the components of the reward function is normalized to ensure stability in learning [7] and to obtain a continuous, composite reward equation (Reward) bounded between −1 and 1, where −1 represents the worst and 1 represents the best strategy as the car approaches the goal.
$$Reward= \frac{CR+CGR+LVR+AVR}{5}, \quad \left[-1, 1\right]$$
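Putting the pieces together, a minimal sketch of the composite reward that reuses the component functions sketched above; the division by 5 matches the LVR bound of [−2, 2] plus three components bounded by [−1, 1]:

```python
def composite_reward(distance_to_wall, current_distance, total_distance,
                     lv_x, lv_y, av_x, av_y, av_z):
    """Normalized sum of CR, CGR, LVR, and AVR, bounded in [-1, 1]."""
    reward = (collision_reward(distance_to_wall)
              + closeness_to_goal_reward(current_distance, total_distance)
              + linear_velocity_reward(lv_x, lv_y)
              + angular_velocity_reward(av_x, av_y, av_z))
    return reward / 5.0
```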