To plan routes for robotic arms, forward kinematics is generally used first to obtain the position of the end of the robotic arm, the position of the target, and the distance between them. Subsequently, stochastic actions, rather than inverse kinematics, are adopted to complete the route planning. To confine the system to producing movements relevant to reaching the target rather than generating aimless random rotational movements, this study adopted the deep deterministic policy gradient (DDPG) [7], a reinforcement learning algorithm that uses rewards and penalties to impose constraints, resulting in clear and traceable steps. In reinforcement learning, an intelligent agent observes the current state, selects an action, and uses the reward received as a reference for deciding the next action. Accordingly, the most prominent feature of reinforcement learning is its ongoing trial-and-error process, which facilitates the exploration of an unknown environment and the identification of the policy that yields the greatest reward. Figure 1 shows the flowchart of reinforcement learning.
Reinforcement learning is divided into two types according to the action selection method: one type selects actions on the basis of probability, and the other does so on the basis of value. Probability-based reinforcement learning makes judgments according to current environmental factors, outputs the probability of each possible action, and makes a random selection accordingly. Value-based reinforcement learning executes the action with the highest value. The main strength of probability-based reinforcement learning is that it allows continuous action selection; its major weakness is that it cannot update on the basis of the reward for the current action alone. Therefore, Witten integrated the two types of reinforcement learning and proposed actor–critic learning [8], which the present study adopted in conjunction with deep learning to produce the DDPG.
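The two selection rules can be illustrated in a few lines of code. The sketch below is purely illustrative, and the probabilities and values are assumed for the example: a probability-based method samples an action from its output distribution, whereas a value-based method greedily executes the action with the highest estimated value.

```python
import numpy as np

rng = np.random.default_rng(0)
action_probs = np.array([0.2, 0.3, 0.5])   # output of a probability-based policy (assumed values)
action_values = np.array([1.2, 0.4, 2.1])  # estimates from a value-based method (assumed values)

policy_action = rng.choice(len(action_probs), p=action_probs)  # random selection by probability
value_action = int(np.argmax(action_values))                   # greedy selection of the highest value
print(policy_action, value_action)
```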
2.1 Policy-Based Reinforcement Learning
Probability-based reinforcement learning achieves high performance in high-dimensional and continuous action spaces and has favorable convergence properties. Its most notable difference from value-based reinforcement learning is its ability to learn a stochastic policy, which enables it to outperform deterministic policy learning in particular contexts. Nevertheless, stochastic policy learning exhibits problems such as low efficiency and a tendency to converge to local optimal solutions. A stochastic policy reinforcement learning method related to actor–critic learning is the policy gradient method, which uses an artificial neural network (the policy) that, given the current environment as input, outputs the probability of each action to guide the agent's selection. Because actions are selected according to probability and rewards are provided stochastically in certain environments, the policy gradient method proceeds as follows (see Fig. 2): calculate the expected rewards, use gradient ascent to identify the direction that yields the highest reward, and progressively update the probability and weight of each action to reach the optimal state.
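As a concrete illustration of these steps, the following REINFORCE-style sketch (not the implementation used in this study) updates a softmax policy over three discrete actions by gradient ascent in a toy one-step environment whose rewards are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)                      # policy weights, one per action
rewards = np.array([0.1, 0.5, 1.0])      # assumed toy reward for each action
alpha = 0.1                              # learning rate for gradient ascent

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(500):
    probs = softmax(theta)               # the policy outputs a probability for each action
    a = rng.choice(3, p=probs)           # stochastic action selection
    R = rewards[a]                       # reward observed for this round
    grad_log = -probs                    # gradient of log pi(a) for a softmax policy
    grad_log[a] += 1.0
    theta += alpha * R * grad_log        # gradient ascent toward higher expected reward

print("learned action probabilities:", softmax(theta))
```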
2.1.1 Value-Based Reinforcement Learning
The most frequently adopted value-based reinforcement learning method is Q-learning, which, in contrast to the policy gradient method, requires the collection and storage of data across rounds, with the Q-function updated according to each new batch of data collected. In Q-learning, a matrix is consulted before each action to determine the expected reward for executing that action in the given environment, providing increasingly clear instructions regarding the next step with each step taken and revealing the relationship between the current step and the next. Accordingly, Q-learning is able to update at each step, which increases learning efficiency and prevents the low learning efficiency caused by an excessively large number of steps taken in a round. However, it is unable to process continuous actions when a large amount of information is involved. The detailed flowchart of the Q-learning update is shown in Fig. 3.
To update the matrix Q(s,a), a certain action (at) is first executed in a given environment (st) to obtain the pre-update value Q(st,at). After the execution of at, the reward values (R) in the next environment (st+1) are examined, and the highest expected reward in that environment, max Q(st+1,at+1), is identified. This value is multiplied by the discount rate (γ) and summed with the reward R obtained in st+1; the difference between this sum and Q(st,at) is then computed. Finally, this difference is multiplied by the learning rate (η) and added to the old value to update the matrix entry Q(st,at).
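Written as a single expression in the notation above, this is the standard tabular Q-learning update rule:

Q(st, at) ← Q(st, at) + η [ R + γ max Q(st+1, at+1) − Q(st, at) ]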
2.1.2 Actor–Critic
The actor–critic framework is a concept proposed by Witten (1977) [8]. This framework integrates the strengths of value-based and policy-based reinforcement learning, thereby enabling the processing of continuous, high-dimensional values while still allowing single-step updates. The actor incorporates the policy gradient selection feature of policy-based learning to enable stochastic action selection and interaction between an intelligent agent (π) and the environment (S). The critic uses value-based reinforcement learning, with Q-learning being the most frequently adopted method, to judge the benefit of executing each action.
Because continuous actions admit many possibilities, the policy gradient method typically updates on the basis of a randomly sampled value, an approach referred to as the sampling method. However, when an extreme value is sampled, the established learning process may be disrupted, resulting in failure to converge. The principal difference between actor–critic learning and the policy gradient method is that, when updating action probabilities, the former does not directly multiply the policy gradient for the current state by the rewards sampled in the current round. Instead, it uses the expected total reward of the selected action as estimated by Q-learning, thereby avoiding the stochasticity of the rewards sampled in the current round and effectively eliminating the possibility of the learning process being disrupted by extreme values. Figure 4 shows the flowchart of the actor–critic updates.
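The following one-step sketch, again illustrative rather than the implementation used in this study, shows this difference: the critic maintains Q-learning-style action-value estimates, and the actor's softmax policy is updated with the critic's estimate rather than with the reward sampled in the current round.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)                      # actor: softmax policy weights
q = np.zeros(3)                          # critic: action-value estimates
rewards = np.array([0.1, 0.5, 1.0])      # assumed toy reward for each action
alpha_actor, alpha_critic = 0.05, 0.2    # separate learning rates for actor and critic

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(1000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)               # the actor selects an action stochastically
    R = rewards[a]
    q[a] += alpha_critic * (R - q[a])        # the critic moves its estimate toward the observed reward
    grad_log = -probs
    grad_log[a] += 1.0
    theta += alpha_actor * q[a] * grad_log   # the actor update uses the critic's value, not R itself

print("critic values:", q)
print("actor probabilities:", softmax(theta))
```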
2.1.3 Deep Deterministic Policy Gradient (DDPG)
In actor–critic learning, the actor acts and performs updates on the basis of values defined by the critic; because the data change whenever the actor changes, updates are made continually. When the actor and critic functions update concurrently, they usually fail to converge. In response to this problem, the Google DeepMind team proposed an updated version of actor–critic learning, namely the DDPG, which integrates the concept of deep learning to solve these convergence challenges.
The DDPG was proposed in 2016, and its underlying framework is based on three learning methods: actor–critic, deep Q-network (DQN) [9], and deterministic policy gradient (DPG) [10] learning. DPG learning is an extension of policy gradient learning; its key difference is that its artificial neural network applies the deterministic function μ to output a single final action instead of a probability distribution over actions, thus reducing unnecessary calculations and accelerating convergence for high-dimensional or continuous actions. DQN learning is based on the concept of Q-learning; its main difference is that it calculates the Q value by using an artificial neural network instead of matrix storage. Specifically, it estimates the expected reward of the continuous actions of an intelligent agent (π) in the current state (s). The integrated artificial neural networks of the DQN and DPG are combined with the experience replay and fixed Q-target mechanisms, which have been successfully applied in the DQN, to remove the correlation between samples and increase the likelihood of convergence.
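The structural elements named above (a deterministic actor μ, a critic that approximates the Q value with a neural network, experience replay, and slowly updated target networks that provide a fixed Q-target) can be sketched as follows. This is a minimal illustrative sketch on an assumed toy one-dimensional task; the networks, task, and hyperparameters are not those of the cited work.

```python
# Minimal DDPG-style sketch: toy task where the best action for state s is -s.
import random
from collections import deque
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, out_dim))

actor, critic = mlp(1, 1), mlp(2, 1)                 # mu(s) and Q(s, a)
target_actor, target_critic = mlp(1, 1), mlp(2, 1)   # target networks for the fixed Q-target
target_actor.load_state_dict(actor.state_dict())
target_critic.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
buffer = deque(maxlen=10000)                         # experience replay removes sample correlation
gamma, tau = 0.99, 0.005                             # discount rate and soft target-update rate

for step in range(2000):
    s = torch.randn(1)                               # toy state
    a = actor(s).detach() + 0.1 * torch.randn(1)     # deterministic action plus exploration noise
    r = -(a + s).pow(2)                              # toy reward, maximal when a = -s
    s2 = torch.randn(1)                              # toy next state
    buffer.append((s, a, r, s2))
    if len(buffer) < 64:
        continue
    batch = random.sample(list(buffer), 64)          # sample a decorrelated minibatch
    S, A, R, S2 = (torch.stack([b[i] for b in batch]) for i in range(4))
    with torch.no_grad():                            # fixed Q-target from the slowly moving target networks
        y = R + gamma * target_critic(torch.cat([S2, target_actor(S2)], dim=1))
    critic_loss = (critic(torch.cat([S, A], dim=1)) - y).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    actor_loss = -critic(torch.cat([S, actor(S)], dim=1)).mean()  # ascend the critic's value estimate
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    for net, tgt in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), tgt.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)  # soft update of the target networks
```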