This section discusses the design of the RL agent, starting with the design of the state space and action space in Section A. The design and specification of the reward function is described in Section B.
A. State Space & Action Space
The state space is the agent's only interpretation of its environment. Therefore, it must be composed of factors consistent enough for the agent to make sense of that environment in terms of the problem at hand.
Therefore, to approximate the desired policy, a continuous state space is designed that provides localization and mobility towards the goal along with obstacle detection. The state space consists of the following 7 components:
i. robot’s base position
ii. robot’s orientation
iii. distance to goal
iv. robot’s linear velocity
v. robot’s angular velocity
vi. active LIDAR rays
vii. LIDAR hit points
Components i-iii add localization, components iv and v ensure mobility in the direction of the goal, and components vi and vii are utilized to add obstacle detection.
The dimension of the entire state space is (1 × 414). As shown in Fig. 1, the robot’s base position is represented in the global Cartesian coordinate frame as [x, y, z], with dimensions (1 × 3). The robot’s orientation is a quaternion [x, y, z, w], with dimensions (1 × 4). The distance to the goal is the perpendicular distance between the robot and the goal and adds 1 component to the state space. Linear velocity along the [x, y, z] directions of the Cartesian coordinate frame requires (1 × 3), and angular velocity [wx, wy, wz] in the Cartesian world coordinate frame requires (1 × 3). Representing a LIDAR of 100 rays requires a list of (1 × 100). Finally, representing each LIDAR hit point in the global coordinate frame by its [x, y, z] coordinates requires a list of (1 × 300) for 100 rays.
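For illustration only, the following minimal sketch assembles such a 414-dimensional observation vector with NumPy; the variable names (base_pos, orientation, lidar_ranges, etc.) are hypothetical placeholders for quantities the simulator would provide.

```python
import numpy as np

def build_state(base_pos, orientation, goal_dist, lin_vel, ang_vel,
                lidar_ranges, lidar_hits):
    """Concatenate the 7 state components into one (414,) vector.

    base_pos     : (3,)      robot base position [x, y, z] in the world frame
    orientation  : (4,)      quaternion [x, y, z, w]
    goal_dist    : float     perpendicular distance to the goal
    lin_vel      : (3,)      linear velocity [vx, vy, vz]
    ang_vel      : (3,)      angular velocity [wx, wy, wz]
    lidar_ranges : (100,)    one reading per LIDAR ray
    lidar_hits   : (100, 3)  hit point [x, y, z] of each ray
    """
    state = np.concatenate([
        np.asarray(base_pos, dtype=np.float32),            # 3
        np.asarray(orientation, dtype=np.float32),          # 4
        np.array([goal_dist], dtype=np.float32),            # 1
        np.asarray(lin_vel, dtype=np.float32),              # 3
        np.asarray(ang_vel, dtype=np.float32),               # 3
        np.asarray(lidar_ranges, dtype=np.float32),          # 100
        np.asarray(lidar_hits, dtype=np.float32).ravel(),    # 300
    ])
    assert state.shape == (414,)
    return state
```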
The action space of the robot consists of 2 components: the velocity of the robot and the steering angle. Both components are continuous and bounded by [-1, 1]. The dimension of the action space is (1 × 2), the resultant velocity and steering angle respectively.
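A minimal sketch of this bounded, continuous action space, assuming a Gym-style interface (the library choice is an assumption, not stated in the text):

```python
import numpy as np
from gym import spaces

# Hypothetical Gym-style definition: action[0] = resultant velocity,
# action[1] = steering angle, both continuous in [-1, 1].
action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

velocity_cmd, steering_cmd = action_space.sample()  # e.g., during exploration
```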
B. Reward Function Design
The reward function sets the foundation of the solution formulation in RL, since the performance of the RL agent is entirely dependent upon the received reward signal. It is therefore a vital component of RL agent training [16].
The reward function binds the environment’s state space with the problem under consideration in order to acquire an effective solution. Therefore, stability in learning is closely associated with the correlation between the state space and the reward function. However, sparsity in the reward function can lead the training system to divergence [17].
The performance of the learning system depends upon a stable reward function which can effectively map the state space onto the action space in the direction of the goal. To attain compatibility, a continuous reward function is designed for the continuous state space. The designed reward function comprises 4 components, each of which adds a distinct feature that contributes a step toward the goal: a collision avoidance reward, a closeness to goal reward, a linear velocity reward, and an angular velocity reward.
Collision avoidance is ensured by the collision reward (CR), which generates a sharp −1 in case of collision with a wall and 1 otherwise.
$$CR=\begin{cases}-1, & \text{distance to wall} < 0\\ 1, & \text{otherwise}\end{cases}$$
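A minimal sketch of this component in Python; the wall-distance signal (here distance_to_wall) is a hypothetical name for the collision query the simulator would expose:

```python
def collision_reward(distance_to_wall: float) -> float:
    """CR: sharp -1 on collision with a wall, 1 otherwise."""
    return -1.0 if distance_to_wall < 0.0 else 1.0
```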
To add perception of the goal and encourage displacement in the direction of the goal, a component called the closeness to goal reward (CGR) is incorporated.
$$CGR=1-\frac{\text{current distance}}{\text{total distance}}, \quad \left[-1, 1\right]$$
Here, current distance is the perpendicular distance between the car base and the goal, whereas total distance is the perpendicular distance between the car base and the goal at the initial position of the episode. CGR is bounded between −1 and 1; it approaches −1 if the car displaces away from the goal and approaches 1 if the displacement is in the direction of the goal.
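For illustration, a sketch of CGR under the same convention (current_distance and total_distance are hypothetical names for the two distances defined above):

```python
def closeness_to_goal_reward(current_distance: float,
                             total_distance: float) -> float:
    """CGR: 1 - current/total; approaches 1 near the goal and -1 when the
    car has moved away to roughly twice the initial distance."""
    return 1.0 - current_distance / total_distance
```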
To encourage rectilinear motion in the direction of the goal, the linear velocity reward (LVR) suppresses the velocity in the lateral direction (LVX) and encourages high velocity in the vertical direction (LVY). The resultant velocity of the system is bounded between −1 and 1; therefore, the maximum attainable velocity in either direction (lateral or vertical) is 1 and the minimum is −1, in case the other component is suppressed to 0. So, the difference between the lateral and vertical components of the linear velocity will be bounded between −1 and 1, representing movement in the positive and negative direction of the axis respectively. However, to avoid local maxima, the weightage of the linear velocity is multiplied by a factor of 2.
$$LVR=2\left({LV}_{Y}- \left|{LV}_{X}\right|\right), \quad \left[-2, 2\right]$$
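A corresponding sketch (lv_x and lv_y denote the lateral and vertical components of the linear velocity; the names are illustrative):

```python
def linear_velocity_reward(lv_x: float, lv_y: float) -> float:
    """LVR: reward vertical (toward-goal) speed, penalize lateral speed.
    The factor of 2 weights this term against the other components."""
    return 2.0 * (lv_y - abs(lv_x))
```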
To avoid deflection from the straight path, the angular velocity of the car is suppressed. The angular velocity reward component (AVR) is introduced to give a −1 reward if the angular velocity exceeds an empirically set threshold value. The reward approaches 1 as the angular velocities in all directions are suppressed to zero.
$$AVR= \begin{cases}-1, & \left|{X}_{AV}\right|+\left|{Y}_{AV}\right|+ \left|{Z}_{AV}\right|>0.09\\ 1-\left(\left|{X}_{AV}\right|+\left|{Y}_{AV}\right|+ \left|{Z}_{AV}\right|\right), & \text{otherwise}\end{cases}$$
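A sketch of AVR with the empirically set threshold of 0.09 (the per-axis angular velocity variable names are hypothetical):

```python
def angular_velocity_reward(av_x: float, av_y: float, av_z: float,
                            threshold: float = 0.09) -> float:
    """AVR: -1 if the summed angular speed exceeds the threshold,
    otherwise 1 minus the summed angular speed."""
    total = abs(av_x) + abs(av_y) + abs(av_z)
    return -1.0 if total > threshold else 1.0 - total
```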
The sum of all the components of the reward function is normalized to ensure stability in learning [7] and to obtain a continuous, composite reward equation (Reward) bounded between −1 and 1, where −1 represents the worst and 1 represents the best strategy as the car approaches the goal.
$$Reward= \frac{CR+CGR+LVR+AVR}{5}, \quad \left[-1, 1\right]$$
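Putting the pieces together, a minimal sketch of the composite reward that reuses the component functions sketched above; the division by 5 matches the LVR bound of [−2, 2] plus three components bounded by [−1, 1]:

```python
def composite_reward(distance_to_wall, current_distance, total_distance,
                     lv_x, lv_y, av_x, av_y, av_z):
    """Normalized sum of CR, CGR, LVR, and AVR, bounded in [-1, 1]."""
    reward = (collision_reward(distance_to_wall)
              + closeness_to_goal_reward(current_distance, total_distance)
              + linear_velocity_reward(lv_x, lv_y)
              + angular_velocity_reward(av_x, av_y, av_z))
    return reward / 5.0
```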