In this section, we present the system model and the two proposed cell association schemes for H2H and M2M devices, and explain how load balancing between the macro and pico cells is achieved using the Q-learning algorithm.
3.1 System Model
In LTE systems, UE devices need to associate with an eNB to transmit data or to communicate with other devices. When a user device moves within the service area, it performs cell selection/re-selection and handover, measuring the signal strength of the neighbouring cells before associating with any of them [30–33].
In our system model shown in Fig. 1, there is one macro-cell surrounded by multiple pico-cells and multiple H2H and M2M devices. A device typically chooses the eNB with the largest RSRP among all eNBs. Thus, the devices very close to a PeNB connect to it, while the majority of devices connect to the MeNB due to its larger RSRP. This could overload the MeNB and lead to performance degradation due to the increase in device blocking probabilities. Hence, load balancing is needed to distribute the load among the serving eNBs. A possible method of load balancing is cell range expansion, where an effective RSRP is formed by adding a biasing value to the real RSRP of the PeNB. This entices devices to associate with the PeNBs and reduces the load on the MeNB [34–38].
In LTE networks, a UE measures several parameters on the reference signal that assist the cell selection process: the Received Signal Strength Indicator (RSSI), the Reference Signal Received Power (RSRP), and the Reference Signal Received Quality (RSRQ). These parameters are defined below. Table 1 lists the symbols for these parameters and the other symbols used in the rest of the paper.
RSSI
The carrier RSSI measures the average total received power over N resource blocks (in Watts or dBm). It includes the power from co-channel serving and non-serving cells, adjacent-channel interference, thermal noise, etc.
RSRP
It is defined as the linear average of the power contributions (in Watts or dBm) of the resource elements that carry cell-specific reference signals within the considered measurement frequency bandwidth.
RSRQ
A metric that combines the RSRP, the RSSI, and the number of used resource blocks \(N\): \(RSRQ = N \cdot RSRP / RSSI\), with RSRP and RSSI measured over the same bandwidth. The RSRQ is a carrier-to-interference (C/I) type of measurement and indicates the quality of the received reference signal. It provides additional information when RSRP alone is not sufficient to make a reliable handover or cell re-selection decision.
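For illustration, the following Python sketch evaluates this relation from linear-power measurements; the function name and the numeric inputs are hypothetical, chosen only to show the computation:

```python
import math

def rsrq_db(n_rb: int, rsrp_w: float, rssi_w: float) -> float:
    """RSRQ = N * RSRP / RSSI, returned in dB.

    n_rb:   number of resource blocks N over which RSSI is measured
    rsrp_w: RSRP in Watts (linear scale)
    rssi_w: carrier RSSI in Watts over the same bandwidth
    """
    return 10.0 * math.log10(n_rb * rsrp_w / rssi_w)

# Example (illustrative values): N = 50 RBs, RSRP = 1e-9 W, RSSI = 2e-7 W
print(rsrq_db(50, 1e-9, 2e-7))  # 10*log10(0.25) ~= -6.0 dB
```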
M2M communication devices are typically battery-operated; they need to transmit at the lowest possible transmission power so as to extend their battery life and avoid battery replacement over their lifetimes. To achieve this goal, we propose a Q-learning based algorithm through which an M2M device associates with an eNB that requires low uplink transmission power, increasing its battery life while satisfying its QoS requirements.
On the other hand, H2H communication devices require high downlink data rates, which can be achieved by connecting to an eNB with a large number of radio channels available to assign to its devices so as to meet their QoS requirements.
Based on the available system bandwidth, each eNB \(j\) has \(N_{j}\) radio channels to allocate among its associated devices. We assume that the total transmit power of each eNB is divided equally among its available radio channels. The SINR experienced by device \(i\) when connected to eNB \(j\) on one resource block (RB) is expressed as
$$SINR_{j}^{i}=\frac{P_{j}\,g_{j}^{i}}{\sum_{k\in B,\,k\ne j}P_{k}\,g_{k}^{i}+\sigma^{2}} \tag{2}$$
where \(P_{j}\) represents the transmit power of eNB \(j\) over one RB, \(g_{j}^{i}\) represents the average channel gain of the link between eNB \(j\) and device \(i\) (with \(P_{k}\) and \(g_{k}^{i}\) the corresponding quantities for an interfering eNB \(k\in B\)), and \(\sigma^{2}\) represents the additive noise power on each RB. Thus, the upper bound of the DL rate of device \(i\) associated with eNB \(j\) over one channel with bandwidth \(w\) can be expressed by the Shannon–Hartley capacity formula as:
$$\rho_{j}^{i}= w \log_{2}\left(1+SINR_{j}^{i}\right) \tag{3}$$
Assuming all RBs of device \(i\) connected to eNB \(j\) experience the same channel conditions, and that eNB \(j\) allocates \(\vartheta_{j}^{i}\) radio channels to device \(i\), the total DL rate available to device \(i\) from eNB \(j\) is given by:
$$\rho_{j}^{total,i}=\vartheta_{j}^{i}\,\rho_{j}^{i} \tag{4}$$
Each H2H device \(i\in D\) requires its achievable rate to exceed its downlink rate threshold when associated with an eNB \(j\). Thus, \(\rho_{j}^{total,i}\) should satisfy the following constraint:
$$\rho_{j}^{total,i}\ge \eta^{i} \tag{5}$$
where \(\eta^{i}\) is the minimum DL data rate required by the device.
Thus, the integer number of radio channels that device \(i\) needs from eNB \(j\) to achieve its QoS requirement is \(F_{j}^{i}=\lceil \eta^{i}/\rho_{j}^{i}\rceil\), where \(\lceil\cdot\rceil\) is the ceiling function. Therefore, when a device \(i\) is associated with eNB \(j\), the eNB must assign an integer number of radio channels \(\vartheta_{j}^{i}\) satisfying:
$$\vartheta_{j}^{i}\ge F_{j}^{i} \tag{6}$$
Table 1

| Symbol | Representation |
|---|---|
| \(B_{YeNB,j}\) | The instantaneous blocking factor of eNB \(j\); \(Y\in\{P,M\}\) denotes the eNB type (PeNB or MeNB) |
| \(D\) | The set of devices |
| \(g_{j}^{i}\) | The average channel gain of the link between eNB \(j\) and device \(i\) |
| \(i\) | Device index |
| \(j\) | eNB index |
| \(N_{j}\) | The number of radio channels available at eNB \(j\) |
| \(P_{j}\) | The transmit power of eNB \(j\) over one RB |
| \(P_{j}^{M2M,i}\) | The UL transmit power of M2M device \(i\) associated with eNB \(j\) |
| \(PL_{j}^{i}\) | The path loss of the link between any device \(i\) and eNB \(j\), in dB |
| \(P_{MAX}^{M2M}\) | The maximum uplink transmission power |
| \(Q(s_{t}, a_{t})\) | The action-value function at time \(t\), defined for every state \(s_{t}\in S\) and action \(a_{t}\in A(s_{t})\) |
| \(QoS_{j}^{Z2Z,i}\) | The quality of service for device \(i\) at eNB \(j\); \(Z\in\{H,M\}\) denotes the device type (H2H or M2M) |
| \(RSRP_{YeNB,j}\) | The Reference Signal Received Power from eNB \(j\); \(Y\in\{P,M\}\) denotes the eNB type (PeNB or MeNB) |
| \(R^{M2M}(t)\) | The reward of the M2M QL algorithm at time \(t\) |
| \(R_{LB}(t)\) | The reward of the LB-QL algorithm |
| \(SINR_{j}^{i}\) | The SINR received by device \(i\) when associated with eNB \(j\) |
| \(\alpha\) | The learning rate |
| \(\gamma\) | The discount factor |
| \(\varGamma\) | The target signal-to-noise ratio (SNR) over the whole frequency range for each eNB |
| \(\eta^{i}\) | The minimum DL data rate required by device \(i\) |
| \(\vartheta_{j}^{i}\) | The number of radio channels allocated to device \(i\) by eNB \(j\) |
| \(\sigma^{2}\) | The additive noise power on each RB |
| \(\ell\) | The weighting factor |
| \(\rho_{j}^{i}\) | The DL rate of device \(i\) associated with eNB \(j\) on one channel |
We use Q-learning for cell association and for load balancing between the macro and pico cells to improve HetNet system performance and reduce the device blocking probability. In the next section, we provide an overview of the basics of the Q-learning algorithm.
3.2 Basics of Q-learning
Reinforcement learning (RL) selects an action in a given situation to maximize a reward. It differs from supervised learning, where training data is labeled and the model is trained with the correct answers. In reinforcement learning, an agent decides its action based on a system of rewards and penalties resulting from the decisions it takes in each state. The agent learns from its experience: good decisions are rewarded and bad decisions are penalized. RL is used in practical applications such as robotics for industrial automation and training systems that provide customized instruction. There are different types of RL algorithms such as State-Action-Reward-State-Action (SARSA), Deep Q-Network (DQN) [39], Deep Deterministic Policy Gradient (DDPG) [40], and Q-Learning (QL) [41].
Q-learning was introduced by Watkins in 1989 [42]. An agent interacts with the environment in one of two ways: exploring or exploiting. Exploring means taking a random action, while exploiting means using the information gained from previous learning to make a decision. At each step, an agent in a certain state chooses an action based on its previous learning. These actions have positive or negative outcomes called rewards. A reward is the value received after completing a certain action in a given state. The overall goal of Q-learning is to learn a policy that maximizes the total reward [42–44]. Figure 2 depicts the Q-learning interaction with the environment.
An action-value function \(Q(s_{t}, a_{t})\) at time \(t\) is defined for every state \(s_{t}\in S\) in a finite Markov decision process, where \(S\) is the set of all possible states, and every action \(a_{t}\in A(s_{t})\), where \(A(s_{t})\) is the set of all possible actions for state \(s_{t}\), chosen according to a given policy [42]. If action \(a_{t}\) is chosen in state \(s_{t}\), the system moves to the next state \(s_{t+1}=s'\in S\) and receives a reward \(R_{t+1}\).
A policy, denoted \(\pi_{t}(s_{t}, a_{t})\), is the rule the agent uses to select its actions in different states. We define the action-value function under a policy \(\pi\) as \(Q^{\pi}(s_{t}, a_{t})\):
$$Q^{\pi}\left(s_{t}, a_{t}\right)=E_{\pi}\left\{R_{t}\,\middle|\,s_{t}=s, a_{t}=a\right\}= E_{\pi}\left\{\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}\,\middle|\,s_{t}=s, a_{t}=a\right\} \tag{7}$$
where \(\gamma\) is the discount factor (we use \(0.8 \le \gamma \le 1\)). It balances immediate and future rewards: a value of 0 would make the agent consider only current rewards, while values approaching 1 make the agent seek high long-term rewards.
The Q-learning algorithm iteratively updates a table of all possible states and all possible actions in those states, called the Q-table. The Q-table is a look-up table that contains every state \(s_{u}\), every action \(a_{v}\) that can be taken in state \(s_{u}\), the resulting reward \(R_{uv}(s_{u},a_{v})\) of taking action \(a_{v}\) in state \(s_{u}\), and the Q-function \(Q(s_{u},a_{v})\). At each time \(t\), the agent selects an action \(a_{t}\), receives a computed reward \(R_{t+1}\in R\), and moves to the new state \(s_{t+1}\). On entering the new state \(s_{t+1}\), the Q-function and Q-table are updated. The update formula for the Q-function is:
$$Q(s_{t}, a_{t}) \leftarrow Q(s_{t}, a_{t}) + \alpha(s_{t}, a_{t})\left[R_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_{t}, a_{t})\right] \tag{8}$$
where \(\alpha(s_{t}, a_{t})\) is the learning rate \((0<\alpha<1)\) and controls how much the newly acquired information is taken into account. The term \(\max_{a} Q(s_{t+1}, a)\) is the estimate of the optimal future value of the action-value function [36]. \(Q(s_{t}, a_{t})\) is updated by the algorithm until it converges. An example of the Q-table is shown in Table 2.
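A minimal Python sketch of the update in Eq. (8), assuming a dictionary-based Q-table keyed by (state, action) pairs; the default \(\alpha\) and \(\gamma\) are illustrative only:

```python
def q_update(q_table, s, a, reward, s_next, alpha=0.5, gamma=0.9):
    """One Q-learning step, Eq. (8):
    Q(s,a) <- Q(s,a) + alpha * [R + gamma * max_a' Q(s',a') - Q(s,a)].
    q_table: dict mapping (state, action) -> Q-value."""
    # Estimate of the optimal future value: max over actions in s_next.
    next_qs = [q for (state, _), q in q_table.items() if state == s_next]
    best_next = max(next_qs) if next_qs else 0.0
    old = q_table.get((s, a), 0.0)
    q_table[(s, a)] = old + alpha * (reward + gamma * best_next - old)
```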
Table 2
An example of the Q-table

| States \(s_{u}\) | Action \(a_{1}\) | Action \(a_{2}\) | ... | Action \(a_{v}\) |
|---|---|---|---|---|
| \(s_{1}\) | \(\{Q(s_{1},a_{1}), R(s_{1},a_{1})\}\) | \(\{Q(s_{1},a_{2}), R(s_{1},a_{2})\}\) | ... | \(\{Q(s_{1},a_{v}), R(s_{1},a_{v})\}\) |
| \(s_{2}\) | \(\{Q(s_{2},a_{1}), R(s_{2},a_{1})\}\) | \(\{Q(s_{2},a_{2}), R(s_{2},a_{2})\}\) | ... | \(\{Q(s_{2},a_{v}), R(s_{2},a_{v})\}\) |
| \(s_{3}\) | \(\{Q(s_{3},a_{1}), R(s_{3},a_{1})\}\) | \(\{Q(s_{3},a_{2}), R(s_{3},a_{2})\}\) | ... | \(\{Q(s_{3},a_{v}), R(s_{3},a_{v})\}\) |
| ... | ... | ... | ... | ... |
| \(s_{u}\) | \(\{Q(s_{u},a_{1}), R(s_{u},a_{1})\}\) | \(\{Q(s_{u},a_{2}), R(s_{u},a_{2})\}\) | ... | \(\{Q(s_{u},a_{v}), R(s_{u},a_{v})\}\) |
Different policies can be used to choose an action; our system model uses the ɛ-greedy strategy. The value of ɛ lies in the range \((0<\epsilon<1)\) and determines whether the action taken is a random exploration or an exploitation of previous states and knowledge. With probability ɛ the agent chooses a random action, resulting in exploration; otherwise it exploits its previous learning by choosing the action with the largest Q-value. Thus larger values of ɛ yield more exploration, and ɛ can be decreased over time to allow high exploration early on followed by higher exploitation.
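A sketch of the ɛ-greedy choice described above, with ɛ as the exploration probability (function and argument names are illustrative):

```python
import random

def epsilon_greedy(q_table, state, actions, epsilon):
    """Pick an action: with probability epsilon explore (random action),
    otherwise exploit the action with the largest Q(state, action)."""
    if random.random() < epsilon:
        return random.choice(actions)  # exploration
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))  # exploitation
```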
3.3 The Proposed Q-Learning based Cell Association Scheme
When a device wants to establish a connection or transmit connectionless data, it must associate with an eNB, specifically with one that satisfies its QoS requirements. With the co-existence of different device types such as H2H and M2M communication devices, these QoS requirements differ.
The target of the proposed scheme is to find, for each type of communication device, the best eNB to associate with while satisfying its QoS requirements. This is done using two different methods: one for M2M devices and one for H2H devices.
The target for an H2H communication device (agent) is to associate with an eNB with a high number of available radio channels, so that the H2H device achieves a high downlink rate satisfying Eqs. (3) and (4). On the other hand, the target for an M2M communication device (agent) is to associate with an eNB that enables it to transmit with low transmission power, increasing its battery life. The UL transmit power of a device \(i\) associated with eNB \(j\), ignoring the interference power, is approximated as follows [10]:
$$P_{j}^{M2M,i}=10\log_{10}\left\{\min\left\{\varGamma\,\frac{\sigma^{2}}{10^{-PL_{j}^{i}/10}},\; 10^{P_{max}^{M2M,i}/10}\right\}\right\} \tag{9}$$
where \(PL_{j}^{i}\) denotes the path loss of the link between device \(i\) and eNB \(j\) in dB, \(P_{max}^{M2M,i}\) is the maximum UL transmit power of device \(i\) in dBm, and \(\varGamma\) is the target signal-to-noise ratio (SNR) over the whole frequency range for each eNB. Note that we consider only the noise and ignore interference; thus, we use this value as an estimated upper bound on the M2M UL transmit power.
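The following sketch evaluates Eq. (9); the function name is hypothetical, and we assume the linear quantities are expressed in consistent units:

```python
import math

def m2m_ul_power_dbm(gamma_target, noise_power, path_loss_db, p_max_dbm):
    """Estimated M2M UL transmit power, following the form of Eq. (9).

    gamma_target: target SNR Gamma (linear); noise_power: sigma^2 (linear);
    path_loss_db: path loss PL of the device-eNB link in dB;
    p_max_dbm: the device's maximum UL power. Interference is ignored,
    so this is an estimated upper bound on the transmit power.
    """
    # Power needed to reach the target SNR after the path loss.
    needed = gamma_target * noise_power / (10.0 ** (-path_loss_db / 10.0))
    # Cap at the device's maximum transmit power (converted to linear).
    p_max_linear = 10.0 ** (p_max_dbm / 10.0)
    return 10.0 * math.log10(min(needed, p_max_linear))
```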
The scheme uses several inputs to manage the device-eNB association decision: the reference signal received powers of the MeNB and each PeNBj, defined as \(RSRP_{MeNB}\) and \(RSRP_{PeNB,j}\); the \(SINR_{j}^{i}\) received by device \(i\in D\) from eNB \(j\) as in Eq. (2); and the number of available channels that eNB \(j\) may assign to user device \(i\).
Each device uses these inputs to calculate the DL rate for each eNB \(j\) using Eqs. (3) and (4), and the UL transmit power it would use when associated with eNB \(j\) using Eq. (9). With these values, the device performs a QoS-based selection that guarantees its requirements. The QoS-based selection differs according to the device type, H2H or M2M.
3.3.1 Cell Association for H2H UEs
The H2H device starts from the traditional cell selection method based on the RSRP signal. It is more interested in the DL rate than in the UL transmission power, so its QoS metric is expressed as follows:
$$QoS_{j}^{H2H,i}= \rho_{j}^{total,i} \tag{10}$$
First, the device calculates the DL rate \(\rho_{j}^{total,i}\) for all surrounding eNBs as in Eqs. (3) and (4), and \(QoS_{j}^{H2H,i}\) as in Eq. (10). Second, it selects the eNB that gives the maximum \(QoS_{j}^{H2H,i}\), even if that eNB has a lower RSRP signal, and sends an association request to connect to it.
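A sketch of this two-step selection; the eNB objects and their `total_dl_rate`/`request_association` helpers are hypothetical names standing in for Eqs. (3), (4), and (10):

```python
def h2h_select_enb(device, enbs):
    """H2H cell association: compute QoS (Eq. (10)) = total DL rate
    (Eqs. (3)-(4)) for every surrounding eNB, then pick the maximum,
    even if that eNB does not have the strongest RSRP."""
    qos = {enb: enb.total_dl_rate(device) for enb in enbs}  # Eq. (10)
    best = max(qos, key=qos.get)
    best.request_association(device)  # hypothetical association request
    return best
```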
3.3.2 Cell Association for M2M Devices Using Q-Learning (CAM-QL)
An M2M device is more interested in a low UL transmission power than in the DL rate, but it still needs an acceptable DL rate for receiving acknowledgments. Thus, its QoS metric is expressed as follows:
$$QoS_{j}^{M2M,i}= P_{j}^{M2M,i} \tag{11}$$
The device uses the Q-learning algorithm to learn how to choose the best eNB that meets its QoS requirements. The QL algorithm takes several inputs that help an M2M device choose the best eNB: the action-value function \(Q(S,a)\), \(RSRP_{MeNB}\), \(RSRP_{PeNB,j}\), \(B_{MeNB}\), and \(B_{PeNB,j}\). \(B_{MeNB}(t)\) is the instantaneous blocking factor of the MeNB, used to update the moving average \(\tilde{B}_{MeNB}(t)\) employed in the eNB decision; it takes the value 0 if the device is not blocked by the MeNB and 1 if the UE is blocked by the MeNB. Similarly, \(B_{PeNB,j}(t)\) is the instantaneous blocking factor of PeNBj, used to update the moving average \(\tilde{B}_{PeNB,j}(t)\); it is 0 if the device is not blocked by PeNBj and 1 if the UE is blocked by PeNBj. \(RSRP_{MeNB}\) and \(RSRP_{PeNB,j}\) are the RSRP signals from the MeNB and PeNBj, respectively. The moving averages are computed as:
$$\tilde{B}_{MeNB}\left(t+1\right)=\ell \times B_{MeNB}\left(t\right)+\left(1-\ell\right)\times \tilde{B}_{MeNB}\left(t-1\right) \tag{12}$$

$$\tilde{B}_{PeNB,j}\left(t+1\right)=\ell \times B_{PeNB,j}\left(t\right)+\left(1-\ell\right)\times \tilde{B}_{PeNB,j}\left(t-1\right) \tag{13}$$
where \(\ell\) is the weighting factor that defines how much the average depends on the current blocking factor versus the history of the selected eNB. We take \(\ell = 0.6\) in our model.
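A one-line sketch of the moving-average update of Eqs. (12) and (13):

```python
def update_blocking_average(b_avg_prev, b_instant, weight=0.6):
    """Eqs. (12)-(13): B~(t+1) = l*B(t) + (1-l)*B~(t-1), with the paper's
    weighting factor l = 0.6; b_instant is 1 if the device was blocked."""
    return weight * b_instant + (1.0 - weight) * b_avg_prev
```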
The system uses the vector \(\{RSRP_{MeNB}, RSRP_{PeNB,1}, RSRP_{PeNB,2}, \dots, RSRP_{PeNB,K-1}\}\) of the RSRPs of the MeNB and the PeNBs as its input, where \(K-1\) is the number of PeNBs. The system state is the eNB selected by the M2M device (the agent). The Q-learning algorithm starts from an initial state \(S_{init}\), the eNB with the maximum received power among the MeNB and the PeNBj, \(1\le j\le K-1\). The action \(a(t)\) is defined as the choice of a specific eNB: the M2M device transitions between states to choose the best eNB to connect to according to the action value, using the ɛ-greedy algorithm. When choosing an eNB, it checks the moving average of the blocking factor of the chosen eNB:
- If the moving average of the chosen eNB (\(\tilde{B}_{MeNB}\) or \(\tilde{B}_{PeNB,j}\)) lies between 0 and 0.5, continue with the chosen eNB.
- Otherwise, if it lies between 0.5 and 1, explore another action to choose a different eNB, either at random or according to the Q-function, using the ɛ-greedy policy.
The device then calculates \(QoS_{j}^{M2M,i}\) for the chosen eNB as in Eq. (11) in order to compute the reward \(R^{M2M}(t)\). The reward \(R^{M2M}(t)\) compares \(P_{j}^{M2M,i}\) with \(P_{MAX}^{M2M}\), the maximum allowable transmission power for an M2M device: if \(P_{j}^{M2M,i}<P_{MAX}^{M2M}\), the device is rewarded for saving transmission power. Simply put, the less transmission power used, the larger the reward. The reward is expressed as follows:
$$R^{M2M}\left(t\right)=P_{MAX}^{M2M}-P_{j}^{M2M,i}\left(t\right) \tag{14}$$
where \(P_{j}^{M2M,i}\) is the calculated uplink transmit power for the chosen eNB and \(P_{MAX}^{M2M}\) is the maximum uplink transmit power.
The reward can be negative, \(R^{M2M}(t)<0\), meaning the chosen eNB is far away and the device would need a transmission power above the threshold to send its data; in this case, the device needs to choose another eNB. If \(R^{M2M}(t)>0\), the chosen eNB is close and the device needs only low transmission power to transmit its data, so the device can associate with this eNB.
For both positive and negative values of the reward \(R^{M2M}(t)\), an action \(a(t)\) is taken and the next state \(s'(t)\) is visited. Then the function \(Q(s'(t),a(t))\) is updated as in Eq. (8). Figure 3 depicts the underlying Markov decision process of the Q-learning algorithm. The agent can transition to any state according to the action taken, which depends on the value of \(\max Q(s'(t),a(t))\), which in turn depends on the reward calculated in the previous state.
The algorithm is executed for a period called the learning time, during which it explores all states (all eNBs) and builds the Q-table, containing the Q-function values of all state-action pairs \((s_{u}, a_{v})\) calculated using Eq. (8). After the learning period, the algorithm converges when the difference between two subsequent Q-matrices approaches 0, and it then uses the Q-table for the association process. Finally, the Q-learning algorithm returns the chosen eNB that gives the best QoS for the M2M device. The M2M device sends an association request to the chosen eNB and updates the blocking factor such that:
- If the UE is blocked, set \(B_{PeNB,j}(t)\) or \(B_{MeNB}(t)\) of the eNB at which the device is blocked to 1, and the UE is considered blocked.
- If the UE is not blocked, set \(B_{PeNB,j}(t)\) or \(B_{MeNB}(t)\) of the eNB to which the device is connected to 0.
After the learning period finishes, the M2M device has learned the best eNB to connect to in each state, using the Q-table shown in Table 2. However, the environment around the M2M device may change, e.g., the number of eNBs available to the M2M device changes if it moves, or the available RBs at an eNB change. Such a change shows up in the calculated reward, which will differ from the reward previously recorded in the Q-table. The M2M device then calls the Q-learning algorithm again to determine the best eNB to associate with. The scheme is shown in Fig. 4.
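Putting the pieces together, the following is a compact, self-contained sketch of the CAM-QL learning phase. The eNB attribute `rsrp`, the `ul_power_dbm` callback (standing in for Eq. (9)), and all parameter values are illustrative assumptions, not the paper's exact implementation:

```python
import random

def cam_ql_learn(enbs, ul_power_dbm, p_max_dbm,
                 epsilon=0.2, alpha=0.5, gamma=0.9,
                 tol=1e-3, max_iters=1000):
    """Sketch of CAM-QL for one M2M device. States and actions are both
    eNB choices; the reward of Eq. (14) is P_max - P_j, so lower UL power
    yields a larger reward. Convergence is declared when the Q-table
    change between iterations is near zero."""
    q = {(s, a): 0.0 for s in enbs for a in enbs}
    state = max(enbs, key=lambda e: e.rsrp)        # S_init: strongest RSRP
    for _ in range(max_iters):
        if random.random() < epsilon:              # explore
            action = random.choice(enbs)
        else:                                      # exploit
            action = max(enbs, key=lambda a: q[(state, a)])
        reward = p_max_dbm - ul_power_dbm(action)  # Eq. (14)
        best_next = max(q[(action, a)] for a in enbs)
        delta = alpha * (reward + gamma * best_next - q[(state, action)])
        q[(state, action)] += delta                # Eq. (8) update
        state = action                             # next state = chosen eNB
        if abs(delta) < tol:                       # Q-table has converged
            break
    return max(enbs, key=lambda a: q[(state, a)])  # best eNB to associate
```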
3.4 The Q-Learning Based Load Balancing Scheme at the Pico-eNB (LB-QL)
This scheme is responsible for balancing the load between the PeNBs and the MeNB. An eNB accepts a device association if it has free RBs to assign to the device for transmitting and receiving; otherwise, the device's request is rejected and the device is called an outage device.
A PeNB uses Cell Range Expansion (CRE) to balance the load between the PeNB and the MeNB. CRE is considered a key design feature for enhancing HetNet efficiency. In the normal case without range expansion (RE), the UE chooses the serving cell with the highest DL received power; this technique is referred to as maximum reference signal received power (Max RSRP). With RE, a positive offset is added to the RSRP of the pico-cell to increase the range it serves. This offset is announced by a participating eNB on the broadcast channel (PBCH). With RE, the UE selects the eNB \(j\) according to the rule:
$$Serving\ cell=\arg\max_{j}\left(RSRP_{j}+Bias_{j}\right) \tag{15}$$
where \(RSRP_{j}\) and \(Bias_{j}\) are expressed in dBm and dB, respectively. This rule implies that a UE does not necessarily connect to the eNB with the strongest DL received power: more UEs are attracted to the pico-cells than in the case without RE, which increases HetNet efficiency.
3.4.1 The Q-Learning Based Cell Range Expansion (CRE) Scheme (LB-QL)
When a device requests access to an eNB, it is either assigned RBs and admitted, or blocked and becomes an outage device. The PeNB uses the proposed Q-learning algorithm to perform load balancing between the MeNB and the set of PeNBs so as to minimize the total number of outage devices. PeNBj broadcasts the bias value, which the devices add to their received signal strength according to Eq. (15); it can thus attract more devices to connect to it and balance the load.
The states are defined as the different biasing values \(S(t)=\{0, 2,\dots, 20\ \text{dB}\}\). The action \(a(t)\) is defined as the transition between states. The reward \(R_{LB}(t)\) is defined as a function of the number of available physical resource blocks (RBs) and the number of requested resource blocks:
$$R_{LB}\left(t\right)=RB_{Avail}\left(t\right)-RB_{Need}\left(t\right) \tag{16}$$
where \(RB_{Avail}(t)\) is the number of available physical resource blocks and \(RB_{Need}(t)\) is the number of resource blocks requested by the devices that want to access PeNBj.
Each PeNBj checks whether it has free RBs, in which case the QL algorithm is called. The algorithm starts from an initial state \(S_{init}\) with biasing value 0. Then, using the ɛ-greedy algorithm, it chooses an action to transition to another state (i.e., another bias value). After executing the action, PeNBj either increases its biasing value, extending its range and accepting more devices into its cell, or reduces the biasing value to discourage new devices from connecting. If PeNBj reaches or nears its maximum capacity, it reduces its biasing value and offloads devices to other eNBs. The reward \(R_{LB}(t)\) is then calculated as in Eq. (16), the system moves to the next state \(s'(t)\), and the function \(Q(s'(t),a(t))\) is updated using Eq. (8).
If the reward \(R_{LB}(t)\) is positive, \(R_{LB}(t)>0\), there are free RBs for outage devices, so PeNBj increases its biasing value by 2 dB and moves from \(s_{n}(t)\to s_{n+1}(t+1)\). If it is negative, there are not enough RBs for outage devices and no new devices can be accepted, so PeNBj reduces its biasing value by 2 dB and moves from \(s_{n}(t)\to s_{n-1}(t+1)\). Figure 5 shows the Markov decision process of the Q-learning algorithm. In CRE, a fixed bias value may vary from 2 dB to 20 dB, where increasing values make more UEs select pico-cells over the macro-cell as their eNB. As the bias value increases, more devices are attracted to the pico-cells [43–45].
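A minimal sketch of one LB-QL transition under the fixed ±2 dB step described above (names are illustrative):

```python
def lb_ql_step(bias_db, rb_available, rb_needed, step_db=2, bias_max=20):
    """One LB-QL transition: reward R_LB = RB_avail - RB_need, Eq. (16).

    A positive reward (spare RBs) raises the CRE bias by 2 dB to attract
    more devices; a negative reward lowers it by 2 dB to offload devices.
    The bias stays within the state set {0, 2, ..., 20} dB.
    """
    reward = rb_available - rb_needed
    if reward > 0:
        bias_db = min(bias_db + step_db, bias_max)  # s_n -> s_{n+1}
    else:
        bias_db = max(bias_db - step_db, 0)         # s_n -> s_{n-1}
    return bias_db, reward
```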
The Q-learning algorithm returns the final bias value used to increase or decrease the bias of PeNBj to balance the load. The algorithm runs for a period during which it explores all states and updates the Q-function; this is called the learning phase, as explained in Fig. 6a. When the number of free RBs at PeNBj changes, the algorithm captures this through the change in the reward values stored in the Q-table. Such a change in RBs can come from a device associating with or de-associating from PeNBj. When PeNBj senses this change, it calls the Q-learning algorithm again in relearning mode to update the Q-table for the new environment. Figure 6b explains how the scheme works and how the load is balanced between the MeNB and the PeNBs to minimize the total number of outage devices.
3.4.2 The Greedy Scheme for Dynamic Adjustment of Bias Value in the Load Balancing Scheme
In this section, we study the effect of changing the method of setting the biasing value at the PeNB. The reward is divided into three ranges, small, medium, and large, as defined in Table 3. The magnitude of the reward indicates how near to or far from full capacity the PeNB is.
Small Difference
For a positive reward, a small difference between the available RBs and the requested RBs means that the PeNB is near full capacity. Thus, the PeNB increases the biasing value by a small step (2 dB) and moves from \(s_{n}(t)\to s_{n+1}(t+1)\).
For a negative reward \(R_{LB}(t)\), there are not enough RBs to accept any new devices, so the PeNB reduces its biasing value by 2 dB and moves from \(s_{n}(t)\to s_{n-1}(t+1)\), reducing the potential increase in blocking.
Medium Difference
For a positive reward, a medium difference between the available RBs and the requested RBs means that the PeNB has enough RBs to accept new devices. Thus, the PeNB increases the biasing value by 6 dB and moves from \(s_{n}(t)\to s_{n+3}(t+1)\).
For a negative reward \(R_{LB}(t)\), there are not enough RBs to accept any new devices, so the PeNB reduces its biasing value by 6 dB and moves from \(s_{n}(t)\to s_{n-3}(t+1)\).
Large Difference
For a positive reward, a large difference between the available RBs and the requested RBs means that the PeNB has many unused RBs. Thus, the PeNB becomes greedier and increases the biasing value by a large step (10 dB), moving from \(s_{n}(t)\to s_{n+5}(t+1)\).
For a negative reward \(R_{LB}(t)\), the PeNB is near full capacity, and keeping this biasing value would increase the blocking probability. Thus, it cannot accept any new user devices and reduces its biasing value by a large step (10 dB), moving from \(s_{n}(t)\to s_{n-5}(t+1)\) to reduce the number of devices that request access to it and hence its blocking probability.
All transitions are bounded within the range \((Bias_{min}, Bias_{max})\), where \(Bias_{max}\) is the maximum biasing value of 20 dB and \(Bias_{min}\) is the minimum biasing value of 0 dB. For example, if the present state is \(s_{n}\) and the next state is \(s_{n+5}\), the bias of state \(s_{n+5}\) must not exceed \(Bias_{max}\); if the next state is \(s_{n-5}\), its bias must not fall below \(Bias_{min}\). The new Q-learning Markov decision process is depicted in Fig. 7.
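The tiered adjustment of Table 3 can be sketched as follows, mapping the reward magnitude to a step size and clamping the result to the bias limits (a sketch, assuming the Table 3 ranges):

```python
def greedy_bias_step(bias_db, reward, bias_min=0, bias_max=20):
    """Variable-step bias adjustment from Table 3.

    |reward| in [1, 10] -> small step (2 dB); in (10, 20] -> medium (6 dB);
    above 20 -> large (10 dB). The sign of the reward sets the direction,
    and the result is clamped to [Bias_min, Bias_max].
    """
    magnitude = abs(reward)
    if magnitude <= 10:
        step = 2    # small difference: s_n -> s_{n+/-1}
    elif magnitude <= 20:
        step = 6    # medium difference: s_n -> s_{n+/-3}
    else:
        step = 10   # large difference: s_n -> s_{n+/-5}
    bias_db += step if reward > 0 else -step
    return max(bias_min, min(bias_db, bias_max))
```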
3.5 The Overall Scheme Description
In our system model, we have two types of communication devices: H2H and M2M. Each device starts by estimating the RSRP signals of all eNBs.
Each H2H device executes the H2H algorithm (Section 3.3.1) to choose the eNB that provides its required QoS. Each M2M device executes the CAM-QL algorithm to choose the eNB that potentially provides its required QoS. The M2M device first goes through a learning phase until it learns its surrounding environment, as shown in the flowchart of Fig. 4a. The algorithm begins by checking all RSRP signals. It starts in either initializing or relearning mode. In initializing mode, the action-value function \(Q(s,a)\) is initialized, as in step 2, and the Q-learning algorithm starts by setting the state \(S_{init}\), as in step 3a. In relearning mode, it reuses the last values of both \(Q(s,a)\) and \(S\). The algorithm generates a random number to choose an action in step 4: it either goes to step 5a, where it exploits by choosing the action with maximum \(Q(s,a)\), or to step 5b, where it chooses a random action to explore different states. After step 5, the action is executed, the reward and the new \(Q(s,a)\) are calculated, and the state is updated, as in steps 6 to 9. The algorithm then checks the convergence of the Q-function in step 10. If it has not converged, the algorithm continues. If convergence occurs, it checks whether there are any changes in the observed RSRP signals. If there is no change, the learning phase is finished and the Q-table is used for the eNB association. If any change occurs, the Q-learning is called again in relearning mode to adapt to the new environment.
After the learning phase finishes, the M2M device uses its learning to choose the best eNB that provides its QoS requirements. The algorithm described in Fig. 4b starts with step 1: the device chooses the eNB with the maximum \(Q(s,a)\) in the Q-table and checks whether it can satisfy the device's connection request. If the eNB has available RBs, the device checks the moving average of the blocking factor \(\tilde{B}_{eNB}(t)\) of the chosen eNB. If \(\tilde{B}_{eNB}(t)<0.5\), it sends the access request, as in step 6; otherwise it chooses another eNB, as in step 1. If the chosen eNB does not have available RBs, the device updates the blocking factor \(B_{eNB}(t)\) of this eNB to 1, and the next eNB in the list is attempted. If all eNBs are attempted without success, the device is blocked. On a successful association, the device starts to transmit its data and checks whether the newly calculated reward is identical to the reward in its Q-table, as in step 8. If they are equal, it continues using this Q-table when it performs an eNB association. If the two rewards differ, the environment has changed, and the device calls the Q-learning algorithm in relearning mode, as in step 9.
As for PeNBj, each PeNBj checks whether it has free RBs. If so, it calls the load-balancing Q-learning algorithm to calculate the CRE bias for balancing the load, as shown in Fig. 6a. The Q-learning algorithm starts in either initializing or relearning mode. In the initializing case, it initializes the action-value function \(Q(s,a)\), as in step 2, and then sets the state \(S_{init}\), as in step 3a. In relearning mode, it reuses the last values of both \(Q(s,a)\) and \(S\). The algorithm proceeds as explained for the M2M learning phase until PeNBj learns its surrounding environment and settles on a bias value.
The association/de-association process at PeNBj is explained in Fig. 6b. When a device sends a request to PeNBj, the PeNBj checks whether it is an association or a de-association request, as in step 1. If it is a de-association request, the number of available RBs increases by 1, as in step 2b. If it is an association request, PeNBj checks whether there are free RBs to assign to the device, as in step 2a. If there is a free RB, it is assigned to the device and the number of RBs is updated in step 3; otherwise, the device is blocked, as in step 5. After the number of available RBs is updated, PeNBj calls the Q-learning again in relearning mode to update its Q-table according to the new environment.
3.5.1 The Interplay between the LB-QL at the PeNB and the QL Scheme at the M2M Devices
The M2M decisions strongly affect the load-balancing scheme at the PeNBs. If more M2M devices choose to connect to the MeNB, free RBs remain at the PeNBs, prompting them to call the Q-learning scheme to increase their bias and balance the load between the MeNB and the PeNBs. If more M2M devices choose to connect to the PeNBs, two scenarios result: (1) the load becomes balanced between the MeNB and the PeNBs such that all eNBs provide good network performance, and the Q-learning is not called; or (2) some PeNBs carry a high load and some a low load. In this case, the PeNBs with a low load start to increase their bias to attract more devices, while those with a high load start to reduce their bias to discourage devices from sending association requests to them. The overall effect is akin to balancing the load across all eNBs in the network in an attempt to provide good network performance to all devices. This interaction is demonstrated further in the results section.
Table 3
The reward ranges and corresponding actions

| Reward range | Reward value \(R_{LB}(t-1)\) | Action \(a_{n}(t)\) |
|---|---|---|
| Small difference, \(Ʀ_{1}=[1, 10]\) | \(\lvert R_{LB}(t-1)\rvert \in Ʀ_{1}\) | Positive \(R_{LB}(t-1)\): move from \(s_{n}(t)\to s_{n+1}(t+1)\); negative \(R_{LB}(t-1)\): move from \(s_{n}(t)\to s_{n-1}(t+1)\) |
| Medium difference, \(Ʀ_{2}=(10, 20]\) | \(\lvert R_{LB}(t-1)\rvert \in Ʀ_{2}\) | Positive \(R_{LB}(t-1)\): move from \(s_{n}(t)\to s_{n+3}(t+1)\); negative \(R_{LB}(t-1)\): move from \(s_{n}(t)\to s_{n-3}(t+1)\) |
| Large difference, \(Ʀ_{3}=(20,\infty)\) | \(\lvert R_{LB}(t-1)\rvert \in Ʀ_{3}\) | Positive \(R_{LB}(t-1)\): move from \(s_{n}(t)\to s_{n+5}(t+1)\); negative \(R_{LB}(t-1)\): move from \(s_{n}(t)\to s_{n-5}(t+1)\) |