Deep Q learning is an efficient RL algorithm for learning an environment quickly. Its main advantage is that it learns from batches of experience; as a result, it can predict Q values more quickly and can be applied to larger networks. Here, the authors have designed Deep Q learning for underwater networks using the ns3-ai framework. The details of the implementation are discussed in this section.
A. Deep Q Learning
Deep Q learning uses a neural network to approximate Q values. The sensor's current state is provided as input, and the Q value for every possible action is produced as output. Fig. 2 portrays the difference between the Q values generated by Q learning and Deep Q learning.
Deep Q learning uses the temporal-difference update equation [16] to predict the Q value of every state–action pair. Eq. 1 is used in the design of Deep Q learning for underwater networks.
$$Q\left(s_{t}, a_{t}\right) \leftarrow Q\left(s_{t}, a_{t}\right) + \alpha \left[R_{t+1} + \gamma \max_{a} Q\left(s_{t+1}, a\right) - Q\left(s_{t}, a_{t}\right)\right] \tag{1}$$
Q(st, at) – Q value of the current state and action
α – learning rate
γ – discount factor
Rt+1 – reward earned
Q(st+1, a) – Q value of the next state and action
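As a concrete illustration, the short sketch below applies one tabular update of Eq. 1; the state names and numeric values are hypothetical, chosen only to make the arithmetic visible.

```python
# One tabular update of Eq. 1 (all values are illustrative).
alpha = 0.1   # learning rate
gamma = 0.9   # discount factor

Q = {("s0", "a0"): 0.5, ("s1", "a0"): 1.0, ("s1", "a1"): 2.0}
reward = 1.0  # R_{t+1} earned after taking a0 in s0

# max_a Q(s_{t+1}, a): best Q value reachable from the next state s1
max_next = max(Q[("s1", a)] for a in ("a0", "a1"))  # = 2.0

# Q(s0, a0) <- Q(s0, a0) + alpha * [R_{t+1} + gamma * max_next - Q(s0, a0)]
Q[("s0", "a0")] += alpha * (reward + gamma * max_next - Q[("s0", "a0")])
print(Q[("s0", "a0")])  # 0.5 + 0.1 * (1.0 + 0.9 * 2.0 - 0.5) = 0.73
```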
Steps in the Deep Q network:
1. Preprocess the state of the sensor node and feed it to the deep neural network, which returns the Q values of all possible actions in that state.
2. Select the action using an epsilon-greedy policy or geo-location, as described in [3].
3. Perform the action selected in step 2 in state s and move to state s'. This becomes the preprocessed state of the next sensor node. Store this transition in the buffer as <s, a, r, s'>.
4. Use this replay (experience) buffer to train the model; a uniformly distributed sample is drawn from the buffer for learning.
5. Calculate the loss after training the model, as in Eq. 2.
$$loss = \left(r + \gamma \max_{a^{\prime}} Q\left(s^{\prime}, a^{\prime}; \theta^{\prime}\right) - Q\left(s, a; \theta\right)\right)^{2} \tag{2}$$
This represents the squared difference between the target Q value and the predicted Q value, where θ' denotes the target network weights and θ the current network weights.
6. After every n iterations, update the network weights as calculated from the loss function (Eq. 2).
7. Repeat the above steps for M episodes.
Using this procedure, Deep Q learning predicts Q values for every state.
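The sketch below is one minimal way to realize steps 1, 5, and 6 in PyTorch: a small network approximates Q(s, a; θ), a delayed copy holds the target parameters θ', and the loss is the squared TD error of Eq. 2. The layer sizes and state encoding are assumptions for illustration, not values taken from the paper.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a sensor-state vector to one Q value per action (step 1)."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def dqn_loss(policy: QNetwork, target: QNetwork,
             s: torch.Tensor, a: torch.Tensor, r: torch.Tensor,
             s_next: torch.Tensor, gamma: float = 0.9) -> torch.Tensor:
    """Squared TD error of Eq. 2: (r + gamma * max_a' Q(s',a'; theta') - Q(s,a; theta))^2."""
    q_sa = policy(s).gather(1, a.unsqueeze(1)).squeeze(1)         # Q(s, a; theta)
    with torch.no_grad():                                         # theta' held fixed
        target_q = r + gamma * target(s_next).max(dim=1).values  # target of Eq. 2
    return ((target_q - q_sa) ** 2).mean()
```

For step 6, the target parameters θ' can be refreshed every n iterations with `target.load_state_dict(policy.state_dict())`.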
B. Deep Q Based Routing
Deep Q based routing is implemented using ns3-ai. The underwater environment is simulated using the ns3 simulator. The parameters, sensor states, and environment created by the ns3 simulator are given as input to the Deep Q learning algorithm, which analyzes these parameters to approximate Q values for the corresponding state. Algorithm 1 describes Deep Q based routing for finding an optimal path between source and destination.
Algorithm 1: Deep Q Based Routing
Step 1: Simulate a 1500 m × 1500 m underwater environment in ns3
Step 2: Initialize environment parameters such as bandwidth, delay, duration, and number of source nodes
Step 3: Pass the environment and its parameters to the RL algorithm (Python side) through the message interface
Step 4: Run the DQN agent:
  Initialize s, a, r, s'
  Store the initialized values as a transition
  If the number of stored transitions reaches the memory capacity:
    Train the neural network
  Else:
    Initialize s as the source state
    Use congestion window, segments acknowledged, and bytes in flight as observation parameters
    Calculate the reward: reward = segments acknowledged − bytes in flight
    Select the action using Eq. 1 in the neural network
    Find the next state s'
    Store the transition (s, a, r, s') in the memory buffer
Step 5: Repeat the above steps for M episodes
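The inner branch of Algorithm 1 (reward calculation and action selection) can be sketched as follows; `q_network` is assumed to be a model such as the QNetwork above, and the epsilon value is illustrative.

```python
import random
import torch

def select_action(q_network, state: torch.Tensor,
                  epsilon: float, n_actions: int) -> int:
    """Epsilon-greedy choice over the Q values predicted for the current state."""
    if random.random() < epsilon:
        return random.randrange(n_actions)            # explore: random next hop
    with torch.no_grad():
        return int(q_network(state).argmax().item())  # exploit: argmax_a Q(s, a)

def compute_reward(segments_acked: int, bytes_in_flight: int) -> int:
    """Reward from Algorithm 1: segments acknowledged minus bytes in flight."""
    return segments_acked - bytes_in_flight
```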
The ns3-ai framework is used to simulate the algorithm described above. It contains two interfaces: ns3 and Python (i.e., the RL algorithm). A 1500 m × 1500 m underwater environment is created using ns3, and acoustic sensor nodes are created using Aqua-Sim NG. Node IDs and socket IDs are created as part of initialization, and environment parameters are set according to Table 1. Data and events are generated at the source end. To route the data to the destination (sink) node, Deep Q learning is used to find the next optimal node. As part of the ns3-ai framework, the state of the environment and its parameters are passed to the Python interface using a message-passing mechanism. On the RL side, the Deep Q learning agent analyzes the parameters received from ns3 and finds the optimal path.
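The ns3-ai examples typically declare the data exchanged over shared memory as a packed ctypes structure on the Python side, mirroring the C++ struct in ns3. The sketch below follows that convention; the field names are illustrative placeholders derived from the observation parameters named above, not the authors' exact struct.

```python
from ctypes import Structure, c_int32, c_uint32

class UnderwaterEnv(Structure):
    """Observation written by ns3 into shared memory (illustrative fields)."""
    _pack_ = 1
    _fields_ = [
        ("nodeId", c_uint32),          # id of the sensor node holding the packet
        ("socketId", c_uint32),
        ("cWnd", c_uint32),            # congestion window
        ("segmentsAcked", c_uint32),
        ("bytesInFlight", c_uint32),
    ]

class UnderwaterAct(Structure):
    """Action written back by the Python (ai) side (illustrative fields)."""
    _pack_ = 1
    _fields_ = [
        ("nextNode", c_int32),         # node chosen as the next hop
    ]
```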
In Deep Q learning, every transition is stored in the buffer as a quadruple (s, a, r, s'): s represents the state, here the source node that sends the data; a represents the action, here the node to which the data is to be transmitted; r represents the reward earned by the node for choosing the appropriate action; and s' represents the next state, here the next node to which the data is forwarded. Let us discuss how Deep Q learning finds the appropriate action. Initially, the parameters of the RL environment (state, action, reward, and next state) are initialized, and every transition is stored in the buffer. If the number of transitions exceeds the defined capacity, the neural network takes a uniformly distributed sample from the stored transitions and uses it to train the model with Eq. 2. Otherwise, using observation parameters such as segments acknowledged, segment size, and bytes in flight, the Q values and action are calculated using Eq. 1. The appropriate reward is then calculated as discussed in the algorithm, and the next state is identified based on the selected action. This transition is stored in the buffer for further use.
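A minimal replay buffer matching this description is sketched below: transitions are appended as (s, a, r, s') quadruples, and training is triggered once the number of stored transitions reaches the defined capacity, with samples drawn uniformly.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s') transitions and serves uniform minibatches."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def ready(self) -> bool:
        # Learning starts only when the number of stored transitions
        # reaches the defined capacity, as described above.
        return len(self.buffer) >= self.capacity

    def sample(self, batch_size: int):
        # Uniformly distributed sample of stored transitions.
        return random.sample(self.buffer, batch_size)
```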
The Q value, along with the action, is stored in the shared memory pool shared by the ns3 and ai interfaces. The ns3 simulator then reads the value stored in shared memory and uses it to perform the action on the source node. The source node forwards the data to the node suggested by the Deep Q learning algorithm (ai interface). The updated parameters are again sent to the ai interface through the shared memory pool, and the process is repeated for M episodes.
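Putting the pieces together, the exchange over the shared memory pool can be sketched as the episode loop below. `read_env`, `write_action`, and `to_state_tensor` are hypothetical placeholders for the ns3-ai shared-memory accessors and a state encoder; the other helpers reuse the sketches above.

```python
# Hypothetical episode loop on the Python (ai) side. read_env(), write_action(),
# and to_state_tensor() are placeholders for the ns3-ai shared-memory accessors
# and a state encoder; q_network, select_action, compute_reward, and buffer
# reuse the earlier sketches.
M = 100          # number of episodes (illustrative)
N_ACTIONS = 4    # number of candidate next-hop nodes (illustrative)

for episode in range(M):
    env = read_env()                          # state written by ns3 into shared memory
    s = to_state_tensor(env)
    a = select_action(q_network, s, epsilon=0.1, n_actions=N_ACTIONS)
    write_action(a)                           # ns3 forwards the data to the chosen node
    env_next = read_env()                     # updated parameters sent back by ns3
    r = compute_reward(env_next.segmentsAcked, env_next.bytesInFlight)
    buffer.store(s, a, r, to_state_tensor(env_next))
    if buffer.ready():                        # learn once capacity is reached
        batch = buffer.sample(batch_size=32)
        # ...compute the Eq. 2 loss on the batch and update theta...
```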