In this section, we present the system model and the two proposed cell association schemes for H2H and M2M devices, and explain how load balancing between the macro and pico cells is achieved using the Q-learning algorithm.
3.1 System Model
In LTE systems, UE devices need to associate with an eNB to transmit data or to communicate with other devices. When a user device moves within the service area, it performs cell selection/re-selection and handover, measuring the signal strength of the neighbouring cells before associating with any of them [30–33].
In our system model shown in Fig. 1, there is one macro-cell surrounded by multiple pico-cells and multiple H2H and M2M devices. A device typically chooses the eNB with the largest RSRP among all eNBs. Thus, the devices very close to a PeNB connect to it, while the majority of devices connect to the MeNB due to its larger RSRP. This could overload the MeNB and lead to performance degradation due to the increase in device blocking probabilities. Hence, load balancing is needed to distribute the load among the serving eNBs. A possible method of load balancing is cell range expansion, where an effective RSRP is formed by adding a biasing value to the real RSRP of the PeNB. This entices devices to associate with the PeNBs and reduces the load on the MeNB [34–38].
In LTE networks, a UE measures several parameters on the reference signal that assist the cell selection process: the Received Signal Strength Indicator (RSSI), the Reference Signal Received Power (RSRP), and the Reference Signal Received Quality (RSRQ). These parameters are defined below. Table 1 lists the symbols for these parameters and the other symbols used in the rest of the paper.
RSSI
The carrier RSSI measures the average total received power over N resource blocks (in Watts or dBm). It includes the power from co-channel serving and non-serving cells, adjacent-channel interference, thermal noise, etc.
RSRP
It is defined as the linear average of the power contributions (in Watts or dBm) of the resource elements that carry cell-specific reference signals within the considered measurement frequency bandwidth.
RSRQ
A metric that combines the RSRP, the RSSI, and the number of used resource blocks \(N\): \(RSRQ = N \cdot RSRP / RSSI\), with RSRP and RSSI measured over the same bandwidth. The RSRQ is a carrier-to-interference (C/I) type of measurement and indicates the quality of the received reference signal. It provides additional information when RSRP alone is not sufficient to make a reliable handover or cell re-selection decision.
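For illustration, the following Python sketch evaluates this relation from linear-power measurements; the function name and the numeric inputs are hypothetical, chosen only to show the computation:

```python
import math

def rsrq_db(n_rb: int, rsrp_w: float, rssi_w: float) -> float:
    """RSRQ = N * RSRP / RSSI, returned in dB.

    n_rb:   number of resource blocks N over which RSSI is measured
    rsrp_w: RSRP in Watts (linear scale)
    rssi_w: carrier RSSI in Watts over the same bandwidth
    """
    return 10.0 * math.log10(n_rb * rsrp_w / rssi_w)

# Example (illustrative values): N = 50 RBs, RSRP = 1e-9 W, RSSI = 2e-7 W
print(rsrq_db(50, 1e-9, 2e-7))  # 10*log10(0.25) ~= -6.0 dB
```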
M2M communication devices are typically battery-operated; they need to transmit at the lowest possible transmission power so as to extend their battery life and avoid battery replacement over their lifetimes. To achieve this goal, we propose a Q-learning based algorithm through which an M2M device associates with an eNB that requires low uplink transmission power, increasing its battery life while satisfying its QoS requirements.
On the other hand, H2H communication devices require high downlink data rates, which can be achieved by connecting to an eNB with a large number of radio channels available to assign to its devices so as to meet their QoS requirements.
Based on the available system bandwidth, each eNB \(j\) has \(N_{j}\) radio channels to allocate among its associated devices. We assume that the total transmit power of each eNB is divided equally among its available radio channels. The SINR experienced by device \(i\) when connected to eNB \(j\) on one resource block (RB) is expressed as
$$SINR_{j}^{i}=\frac{P_{j}\,g_{j}^{i}}{\sum_{k\in B,\,k\ne j}P_{k}\,g_{k}^{i}+\sigma^{2}} \tag{2}$$
where \(P_{j}\) represents the transmit power of eNB \(j\) over one RB, \(g_{j}^{i}\) represents the average channel gain of the link between eNB \(j\) and device \(i\) (with \(P_{k}\) and \(g_{k}^{i}\) the corresponding quantities for an interfering eNB \(k\in B\)), and \(\sigma^{2}\) represents the additive noise power on each RB. Thus, the upper bound of the DL rate of device \(i\) associated with eNB \(j\) over one channel with bandwidth \(w\) can be expressed by the Shannon–Hartley capacity formula as:
$$\rho_{j}^{i}= w \log_{2}\left(1+SINR_{j}^{i}\right) \tag{3}$$
Assuming all RBs of device \(i\) connected to eNB \(j\) experience the same channel conditions, and that eNB \(j\) allocates \(\vartheta_{j}^{i}\) radio channels to device \(i\), the total DL rate available to device \(i\) from eNB \(j\) is given by:
$$\rho_{j}^{total,i}=\vartheta_{j}^{i}\,\rho_{j}^{i} \tag{4}$$
Each H2H device \(i\in D\) requires its achievable rate to exceed its downlink rate threshold when associated with an eNB \(j\). Thus, \(\rho_{j}^{total,i}\) should satisfy the following constraint:
$$\rho_{j}^{total,i}\ge \eta^{i} \tag{5}$$
where \(\eta^{i}\) is the minimum DL data rate required by the device.
Thus, the integer number of radio channels that device \(i\) needs from eNB \(j\) to achieve its QoS requirement is \(F_{j}^{i}=\lceil \eta^{i}/\rho_{j}^{i}\rceil\), where \(\lceil\cdot\rceil\) is the ceiling function. Therefore, when a device \(i\) is associated with eNB \(j\), the eNB must assign an integer number of radio channels \(\vartheta_{j}^{i}\) satisfying:
$$\vartheta_{j}^{i}\ge F_{j}^{i} \tag{6}$$
Table 1

| Symbol | Representation |
|---|---|
| \(B_{YeNB,j}\) | The instantaneous blocking factor of eNB \(j\); \(Y\in\{P,M\}\) denotes the eNB type (PeNB or MeNB) |
| \(D\) | The set of devices |
| \(g_{j}^{i}\) | The average channel gain of the link between eNB \(j\) and device \(i\) |
| \(i\) | Device index |
| \(j\) | eNB index |
| \(N_{j}\) | The number of radio channels available at eNB \(j\) |
| \(P_{j}\) | The transmit power of eNB \(j\) over one RB |
| \(P_{j}^{M2M,i}\) | The UL transmit power of M2M device \(i\) associated with eNB \(j\) |
| \(PL_{j}^{i}\) | The path loss of the link between any device \(i\) and eNB \(j\), in dB |
| \(P_{MAX}^{M2M}\) | The maximum uplink transmission power |
| \(Q(s_{t}, a_{t})\) | The action-value function at time \(t\), defined for every state \(s_{t}\in S\) and action \(a_{t}\in A(s_{t})\) |
| \(QoS_{j}^{Z2Z,i}\) | The quality of service for device \(i\) at eNB \(j\); \(Z\in\{H,M\}\) denotes the device type (H2H or M2M) |
| \(RSRP_{YeNB,j}\) | The Reference Signal Received Power from eNB \(j\); \(Y\in\{P,M\}\) denotes the eNB type (PeNB or MeNB) |
| \(R^{M2M}(t)\) | The reward of the M2M QL algorithm at time \(t\) |
| \(R_{LB}(t)\) | The reward of the LB-QL algorithm |
| \(SINR_{j}^{i}\) | The SINR received by device \(i\) when associated with eNB \(j\) |
| \(\alpha\) | The learning rate |
| \(\gamma\) | The discount factor |
| \(\varGamma\) | The target signal-to-noise ratio (SNR) over the whole frequency range for each eNB |
| \(\eta^{i}\) | The minimum DL data rate required by device \(i\) |
| \(\vartheta_{j}^{i}\) | The number of radio channels allocated to device \(i\) by eNB \(j\) |
| \(\sigma^{2}\) | The additive noise power on each RB |
| \(\ell\) | The weighting factor |
| \(\rho_{j}^{i}\) | The DL rate of device \(i\) associated with eNB \(j\) on one channel |
We use Q-learning for cell association and for load balancing between the macro and pico cells to improve HetNet system performance and reduce the device blocking probability. In the next section, we provide an overview of the basics of the Q-learning algorithm.
3.2 Basics of Q-learning
Reinforcement learning (RL) selects an action in a given situation to maximize a reward. It differs from supervised learning, where training data is labeled and the model is trained with the correct answers. In reinforcement learning, an agent decides its action based on a system of rewards and penalties resulting from the decisions it takes in each state. The agent learns from its experience: good decisions are rewarded and bad decisions are penalized. RL is used in practical applications such as robotics for industrial automation and training systems that provide customized instruction. There are different types of RL algorithms such as State-Action-Reward-State-Action (SARSA), Deep Q-Network (DQN) [39], Deep Deterministic Policy Gradient (DDPG) [40], and Q-Learning (QL) [41].
Q-learning was introduced by Watkins in 1989 [42]. An agent interacts with the environment in one of two ways: exploring or exploiting. Exploring means taking a random action, while exploiting means using the information gained from previous learning to make a decision. At each step, an agent in a certain state chooses an action based on its previous learning. These actions have positive or negative outcomes called rewards. A reward is the value received after completing a certain action in a given state. The overall goal of Q-learning is to learn a policy that maximizes the total reward [42–44]. Figure 2 depicts the Q-learning interaction with the environment.
An action-value function \(Q(s_{t}, a_{t})\) at time \(t\) is defined for every state \(s_{t}\in S\) in a finite Markov decision process, where \(S\) is the set of all possible states, and every action \(a_{t}\in A(s_{t})\), where \(A(s_{t})\) is the set of all possible actions for state \(s_{t}\), chosen according to a given policy [42]. If action \(a_{t}\) is chosen in state \(s_{t}\), the system moves to the next state \(s_{t+1}=s'\in S\) and receives a reward \(R_{t+1}\).
A policy, denoted \(\pi_{t}(s_{t}, a_{t})\), is the rule the agent uses to select its actions in different states. We define the action-value function under a policy \(\pi\) as \(Q^{\pi}(s_{t}, a_{t})\):
$$Q^{\pi}\left(s_{t}, a_{t}\right)=E_{\pi}\left\{R_{t}\,\middle|\,s_{t}=s, a_{t}=a\right\}= E_{\pi}\left\{\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}\,\middle|\,s_{t}=s, a_{t}=a\right\} \tag{7}$$
where \(\gamma\) is the discount factor (we use \(0.8 \le \gamma \le 1\)). It balances immediate and future rewards: a value of 0 would make the agent consider only current rewards, while values approaching 1 make the agent seek high long-term rewards.
The Q-learning algorithm iteratively updates a table of all possible states and all possible actions in those states, called the Q-table. The Q-table is a look-up table that contains every state \(s_{u}\), every action \(a_{v}\) that can be taken in state \(s_{u}\), the resulting reward \(R_{uv}(s_{u},a_{v})\) of taking action \(a_{v}\) in state \(s_{u}\), and the Q-function \(Q(s_{u},a_{v})\). At each time \(t\), the agent selects an action \(a_{t}\), receives a computed reward \(R_{t+1}\in R\), and moves to the new state \(s_{t+1}\). On entering the new state \(s_{t+1}\), the Q-function and Q-table are updated. The update formula for the Q-function is:
$$Q(s_{t}, a_{t}) \leftarrow Q(s_{t}, a_{t}) + \alpha(s_{t}, a_{t})\left[R_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_{t}, a_{t})\right] \tag{8}$$
where \(\alpha(s_{t}, a_{t})\) is the learning rate \((0<\alpha<1)\) and controls how much the newly acquired information is taken into account. The term \(\max_{a} Q(s_{t+1}, a)\) is the estimate of the optimal future value of the action-value function [36]. \(Q(s_{t}, a_{t})\) is updated by the algorithm until it converges. An example of the Q-table is shown in Table 2.
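A minimal Python sketch of the update in Eq. (8), assuming a dictionary-based Q-table keyed by (state, action) pairs; the default \(\alpha\) and \(\gamma\) are illustrative only:

```python
def q_update(q_table, s, a, reward, s_next, alpha=0.5, gamma=0.9):
    """One Q-learning step, Eq. (8):
    Q(s,a) <- Q(s,a) + alpha * [R + gamma * max_a' Q(s',a') - Q(s,a)].
    q_table: dict mapping (state, action) -> Q-value."""
    # Estimate of the optimal future value: max over actions in s_next.
    next_qs = [q for (state, _), q in q_table.items() if state == s_next]
    best_next = max(next_qs) if next_qs else 0.0
    old = q_table.get((s, a), 0.0)
    q_table[(s, a)] = old + alpha * (reward + gamma * best_next - old)
```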
Table 2
An example of the Q-table

| States \(s_{u}\) | Action \(a_{1}\) | Action \(a_{2}\) | ... | Action \(a_{v}\) |
|---|---|---|---|---|
| \(s_{1}\) | \(\{Q(s_{1},a_{1}), R(s_{1},a_{1})\}\) | \(\{Q(s_{1},a_{2}), R(s_{1},a_{2})\}\) | ... | \(\{Q(s_{1},a_{v}), R(s_{1},a_{v})\}\) |
| \(s_{2}\) | \(\{Q(s_{2},a_{1}), R(s_{2},a_{1})\}\) | \(\{Q(s_{2},a_{2}), R(s_{2},a_{2})\}\) | ... | \(\{Q(s_{2},a_{v}), R(s_{2},a_{v})\}\) |
| \(s_{3}\) | \(\{Q(s_{3},a_{1}), R(s_{3},a_{1})\}\) | \(\{Q(s_{3},a_{2}), R(s_{3},a_{2})\}\) | ... | \(\{Q(s_{3},a_{v}), R(s_{3},a_{v})\}\) |
| ... | ... | ... | ... | ... |
| \(s_{u}\) | \(\{Q(s_{u},a_{1}), R(s_{u},a_{1})\}\) | \(\{Q(s_{u},a_{2}), R(s_{u},a_{2})\}\) | ... | \(\{Q(s_{u},a_{v}), R(s_{u},a_{v})\}\) |
Different policies can be used to choose an action; our system model uses the ɛ-greedy strategy. The value of ɛ lies in the range \((0<\epsilon<1)\) and determines whether the action taken is a random exploration or an exploitation of previous states and knowledge. With probability ɛ the agent chooses a random action, resulting in exploration; otherwise it exploits its previous learning by choosing the action with the largest Q-value. Thus larger values of ɛ yield more exploration, and ɛ can be decreased over time to allow high exploration early on followed by higher exploitation.
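A sketch of the ɛ-greedy choice described above, with ɛ as the exploration probability (function and argument names are illustrative):

```python
import random

def epsilon_greedy(q_table, state, actions, epsilon):
    """Pick an action: with probability epsilon explore (random action),
    otherwise exploit the action with the largest Q(state, action)."""
    if random.random() < epsilon:
        return random.choice(actions)  # exploration
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))  # exploitation
```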
3.3 The Proposed Q-Learning based Cell Association Scheme
When a device wants to establish a connection or transmit connectionless data, it must associate with an eNB, specifically with one that satisfies its QoS requirements. With the co-existence of different device types such as H2H and M2M communication devices, these QoS requirements differ.
The target of the proposed scheme is to find, for each type of communication device, the best eNB to associate with while satisfying its QoS requirements. This is done using two different methods: one for M2M devices and one for H2H devices.
The target for an H2H communication device (agent) is to associate with an eNB with a high number of available radio channels, so that the H2H device achieves a high downlink rate satisfying Eqs. (3) and (4). On the other hand, the target for an M2M communication device (agent) is to associate with an eNB that enables it to transmit with low transmission power, increasing its battery life. The UL transmit power of a device \(i\) associated with eNB \(j\), ignoring the interference power, is approximated as follows [10]:
$$P_{j}^{M2M,i}=10\log_{10}\left\{\min\left\{\varGamma\,\frac{\sigma^{2}}{10^{-PL_{j}^{i}/10}},\; 10^{P_{max}^{M2M,i}/10}\right\}\right\} \tag{9}$$
where \(PL_{j}^{i}\) denotes the path loss of the link between device \(i\) and eNB \(j\) in dB, \(P_{max}^{M2M,i}\) is the maximum UL transmit power of device \(i\) in dBm, and \(\varGamma\) is the target signal-to-noise ratio (SNR) over the whole frequency range for each eNB. Note that we consider only the noise and ignore interference; thus, we use this value as an estimated upper bound on the M2M UL transmit power.
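The following sketch evaluates Eq. (9); the function name is hypothetical, and we assume the linear quantities are expressed in consistent units:

```python
import math

def m2m_ul_power_dbm(gamma_target, noise_power, path_loss_db, p_max_dbm):
    """Estimated M2M UL transmit power, following the form of Eq. (9).

    gamma_target: target SNR Gamma (linear); noise_power: sigma^2 (linear);
    path_loss_db: path loss PL of the device-eNB link in dB;
    p_max_dbm: the device's maximum UL power. Interference is ignored,
    so this is an estimated upper bound on the transmit power.
    """
    # Power needed to reach the target SNR after the path loss.
    needed = gamma_target * noise_power / (10.0 ** (-path_loss_db / 10.0))
    # Cap at the device's maximum transmit power (converted to linear).
    p_max_linear = 10.0 ** (p_max_dbm / 10.0)
    return 10.0 * math.log10(min(needed, p_max_linear))
```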
The scheme uses several inputs to manage the device-eNB association decision: the reference signal received powers of the MeNB and each PeNBj, defined as \(RSRP_{MeNB}\) and \(RSRP_{PeNB,j}\); the \(SINR_{j}^{i}\) received by device \(i\in D\) from eNB \(j\) as in Eq. (2); and the number of available channels that eNB \(j\) may assign to user device \(i\).
Each device uses these inputs to calculate the DL rate for each eNB \(j\) using Eqs. (3) and (4), and the UL transmit power it would use when associated with eNB \(j\) using Eq. (9). With these values, the device performs a QoS-based selection that guarantees its requirements. The QoS-based selection differs according to the device type, H2H or M2M.
3.3.1 Cell Association for H2H UEs
The H2H device starts from the traditional cell selection method based on the RSRP signal. It is more interested in the DL rate than in the UL transmission power, so its QoS metric is expressed as follows:
$$QoS_{j}^{H2H,i}= \rho_{j}^{total,i} \tag{10}$$
First, the device calculates the DL rate \(\rho_{j}^{total,i}\) for all surrounding eNBs as in Eqs. (3) and (4), and \(QoS_{j}^{H2H,i}\) as in Eq. (10). Second, it selects the eNB that gives the maximum \(QoS_{j}^{H2H,i}\), even if that eNB has a lower RSRP signal, and sends an association request to connect to it.
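A sketch of this two-step selection; the eNB objects and their `total_dl_rate`/`request_association` helpers are hypothetical names standing in for Eqs. (3), (4), and (10):

```python
def h2h_select_enb(device, enbs):
    """H2H cell association: compute QoS (Eq. (10)) = total DL rate
    (Eqs. (3)-(4)) for every surrounding eNB, then pick the maximum,
    even if that eNB does not have the strongest RSRP."""
    qos = {enb: enb.total_dl_rate(device) for enb in enbs}  # Eq. (10)
    best = max(qos, key=qos.get)
    best.request_association(device)  # hypothetical association request
    return best
```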
3.3.2 Cell Association for M2M Devices Using Q-Learning (CAM-QL)
An M2M device is more interested in a low UL transmission power than in the DL rate, but it still needs an acceptable DL rate for receiving acknowledgments. Thus, its QoS metric is expressed as follows:
$$QoS_{j}^{M2M,i}= P_{j}^{M2M,i} \tag{11}$$
The device uses the Q-learning algorithm to learn how to choose the best eNB that meets its QoS requirements. The QL algorithm takes several inputs that help an M2M device choose the best eNB: the action-value function \(Q(S,a)\), \(RSRP_{MeNB}\), \(RSRP_{PeNB,j}\), \(B_{MeNB}\), and \(B_{PeNB,j}\). \(B_{MeNB}(t)\) is the instantaneous blocking factor of the MeNB, used to update the moving average \(\tilde{B}_{MeNB}(t)\) employed in the eNB decision; it takes the value 0 if the device is not blocked by the MeNB and 1 if the UE is blocked by the MeNB. Similarly, \(B_{PeNB,j}(t)\) is the instantaneous blocking factor of PeNBj, used to update the moving average \(\tilde{B}_{PeNB,j}(t)\); it is 0 if the device is not blocked by PeNBj and 1 if the UE is blocked by PeNBj. \(RSRP_{MeNB}\) and \(RSRP_{PeNB,j}\) are the RSRP signals from the MeNB and PeNBj, respectively. The moving averages are computed as:
$$\tilde{B}_{MeNB}\left(t+1\right)=\ell \times B_{MeNB}\left(t\right)+\left(1-\ell\right)\times \tilde{B}_{MeNB}\left(t-1\right) \tag{12}$$

$$\tilde{B}_{PeNB,j}\left(t+1\right)=\ell \times B_{PeNB,j}\left(t\right)+\left(1-\ell\right)\times \tilde{B}_{PeNB,j}\left(t-1\right) \tag{13}$$
where \(\ell\) is the weighting factor that defines how much the average depends on the current blocking factor versus the history of the selected eNB. We take \(\ell = 0.6\) in our model.
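A one-line sketch of the moving-average update of Eqs. (12) and (13):

```python
def update_blocking_average(b_avg_prev, b_instant, weight=0.6):
    """Eqs. (12)-(13): B~(t+1) = l*B(t) + (1-l)*B~(t-1), with the paper's
    weighting factor l = 0.6; b_instant is 1 if the device was blocked."""
    return weight * b_instant + (1.0 - weight) * b_avg_prev
```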
The system uses the vector \(\{RSRP_{MeNB}, RSRP_{PeNB,1}, RSRP_{PeNB,2}, \dots, RSRP_{PeNB,K-1}\}\) of the RSRPs of the MeNB and the PeNBs as its input, where \(K-1\) is the number of PeNBs. The system state is the eNB selected by the M2M device (the agent). The Q-learning algorithm starts from an initial state \(S_{init}\), the eNB with the maximum received power among the MeNB and the PeNBj, \(1\le j\le K-1\). The action \(a(t)\) is defined as the choice of a specific eNB: the M2M device transitions between states to choose the best eNB to connect to according to the action value, using the ɛ-greedy algorithm. When choosing an eNB, it checks the moving average of the blocking factor of the chosen eNB:
- If the moving average of the chosen eNB (\(\tilde{B}_{MeNB}\) or \(\tilde{B}_{PeNB,j}\)) lies between 0 and 0.5, continue with the chosen eNB.
- Otherwise, if it lies between 0.5 and 1, explore another action to choose a different eNB, either at random or according to the Q-function, using the ɛ-greedy policy.
The device then calculates \(QoS_{j}^{M2M,i}\) for the chosen eNB as in Eq. (11) in order to compute the reward \(R^{M2M}(t)\). The reward \(R^{M2M}(t)\) compares \(P_{j}^{M2M,i}\) with \(P_{MAX}^{M2M}\), the maximum allowable transmission power for an M2M device: if \(P_{j}^{M2M,i}<P_{MAX}^{M2M}\), the device is rewarded for saving transmission power. Simply put, the less transmission power used, the larger the reward. The reward is expressed as follows:
$$R^{M2M}\left(t\right)=P_{MAX}^{M2M}-P_{j}^{M2M,i}\left(t\right) \tag{14}$$
where \(P_{j}^{M2M,i}\) is the calculated uplink transmit power for the chosen eNB and \(P_{MAX}^{M2M}\) is the maximum uplink transmit power.
The reward can be negative, \(R^{M2M}(t)<0\), meaning the chosen eNB is far away and the device would need a transmission power above the threshold to send its data; in this case, the device needs to choose another eNB. If \(R^{M2M}(t)>0\), the chosen eNB is close and the device needs only low transmission power to transmit its data, so the device can associate with this eNB.
For both positive and negative values of the reward \(R^{M2M}(t)\), an action \(a(t)\) is taken and the next state \(s'(t)\) is visited. Then the function \(Q(s'(t),a(t))\) is updated as in Eq. (8). Figure 3 depicts the underlying Markov decision process of the Q-learning algorithm. The agent can transition to any state according to the action taken, which depends on the value of \(\max Q(s'(t),a(t))\), which in turn depends on the reward calculated in the previous state.
The algorithm is executed for a period called the learning time, during which it explores all states (all eNBs) and builds the Q-table, containing the Q-function values of all state-action pairs \((s_{u}, a_{v})\) calculated using Eq. (8). After the learning period, the algorithm converges when the difference between two subsequent Q-matrices approaches 0, and it then uses the Q-table for the association process. Finally, the Q-learning algorithm returns the chosen eNB that gives the best QoS for the M2M device. The M2M device sends an association request to the chosen eNB and updates the blocking factor such that:
- If the UE is blocked, set \(B_{PeNB,j}(t)\) or \(B_{MeNB}(t)\) of the eNB at which the device is blocked to 1, and the UE is considered blocked.
- If the UE is not blocked, set \(B_{PeNB,j}(t)\) or \(B_{MeNB}(t)\) of the eNB to which the device is connected to 0.
After the learning period finishes, the M2M device has learned the best eNB to connect to in each state, using the Q-table shown in Table 2. However, the environment around the M2M device may change, e.g., the number of eNBs available to the M2M device changes if it moves, or the available RBs at an eNB change. Such a change shows up in the calculated reward, which will differ from the reward previously recorded in the Q-table. The M2M device then calls the Q-learning algorithm again to determine the best eNB to associate with. The scheme is shown in Fig. 4.
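Putting the pieces together, the following is a compact, self-contained sketch of the CAM-QL learning phase. The eNB attribute `rsrp`, the `ul_power_dbm` callback (standing in for Eq. (9)), and all parameter values are illustrative assumptions, not the paper's exact implementation:

```python
import random

def cam_ql_learn(enbs, ul_power_dbm, p_max_dbm,
                 epsilon=0.2, alpha=0.5, gamma=0.9,
                 tol=1e-3, max_iters=1000):
    """Sketch of CAM-QL for one M2M device. States and actions are both
    eNB choices; the reward of Eq. (14) is P_max - P_j, so lower UL power
    yields a larger reward. Convergence is declared when the Q-table
    change between iterations is near zero."""
    q = {(s, a): 0.0 for s in enbs for a in enbs}
    state = max(enbs, key=lambda e: e.rsrp)        # S_init: strongest RSRP
    for _ in range(max_iters):
        if random.random() < epsilon:              # explore
            action = random.choice(enbs)
        else:                                      # exploit
            action = max(enbs, key=lambda a: q[(state, a)])
        reward = p_max_dbm - ul_power_dbm(action)  # Eq. (14)
        best_next = max(q[(action, a)] for a in enbs)
        delta = alpha * (reward + gamma * best_next - q[(state, action)])
        q[(state, action)] += delta                # Eq. (8) update
        state = action                             # next state = chosen eNB
        if abs(delta) < tol:                       # Q-table has converged
            break
    return max(enbs, key=lambda a: q[(state, a)])  # best eNB to associate
```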
3.4 The Q-Learning Based Load Balancing Scheme at the Pico-eNB (LB-QL)
This scheme is responsible for balancing the load between the PeNBs and the MeNB. An eNB accepts a device association if it has free RBs to assign to the device for transmitting and receiving; otherwise, the device's request is rejected and the device is called an outage device.
A PeNB uses Cell Range Expansion (CRE) to balance the load between the PeNB and the MeNB. CRE is considered a key design feature for enhancing HetNet efficiency. In the normal case without range expansion (RE), the UE chooses the serving cell with the highest DL received power; this technique is referred to as maximum reference signal received power (Max RSRP). With RE, a positive offset is added to the RSRP of the pico-cell to increase the range it serves. This offset is announced by a participating eNB on the broadcast channel (PBCH). With RE, the UE selects the eNB \(j\) according to the rule:
$$Serving\ cell=\arg\max_{j}\left(RSRP_{j}+Bias_{j}\right) \tag{15}$$
where \(RSRP_{j}\) and \(Bias_{j}\) are expressed in dBm and dB, respectively. This rule implies that a UE does not necessarily connect to the eNB with the strongest DL received power: more UEs are attracted to the pico-cells than in the case without RE, which increases HetNet efficiency.
3.4.1 The Q-Learning Based Cell Range Expansion (CRE) Scheme (LB-QL)
When a device requests access to an eNB, it is either assigned RBs and admitted, or blocked and becomes an outage device. The PeNB uses the proposed Q-learning algorithm to perform load balancing between the MeNB and the set of PeNBs so as to minimize the total number of outage devices. PeNBj broadcasts the bias value, which the devices add to their received signal strength according to Eq. (15); it can thus attract more devices to connect to it and balance the load.
The states are defined as the different biasing values \(S(t)=\{0, 2,\dots, 20\ \text{dB}\}\). The action \(a(t)\) is defined as the transition between states. The reward \(R_{LB}(t)\) is defined as a function of the number of available physical resource blocks (RBs) and the number of requested resource blocks:
$$R_{LB}\left(t\right)=RB_{Avail}\left(t\right)-RB_{Need}\left(t\right) \tag{16}$$
where \(RB_{Avail}(t)\) is the number of available physical resource blocks and \(RB_{Need}(t)\) is the number of resource blocks requested by the devices that want to access PeNBj.
Each PeNBj checks whether it has free RBs, in which case the QL algorithm is called. The algorithm starts from an initial state \(S_{init}\) with biasing value 0. Then, using the ɛ-greedy algorithm, it chooses an action to transition to another state (i.e., another bias value). After executing the action, PeNBj either increases its biasing value, extending its range and accepting more devices into its cell, or reduces the biasing value to discourage new devices from connecting. If PeNBj reaches or nears its maximum capacity, it reduces its biasing value and offloads devices to other eNBs. The reward \(R_{LB}(t)\) is then calculated as in Eq. (16), the system moves to the next state \(s'(t)\), and the function \(Q(s'(t),a(t))\) is updated using Eq. (8).
If the reward \(R_{LB}(t)\) is positive, \(R_{LB}(t)>0\), there are free RBs for outage devices, so PeNBj increases its biasing value by 2 dB and moves from \(s_{n}(t)\to s_{n+1}(t+1)\). If it is negative, there are not enough RBs for outage devices and no new devices can be accepted, so PeNBj reduces its biasing value by 2 dB and moves from \(s_{n}(t)\to s_{n-1}(t+1)\). Figure 5 shows the Markov decision process of the Q-learning algorithm. In CRE, a fixed bias value may vary from 2 dB to 20 dB, where increasing values make more UEs select pico-cells over the macro-cell as their eNB. As the bias value increases, more devices are attracted to the pico-cells [43–45].
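A minimal sketch of one LB-QL transition under the fixed ±2 dB step described above (names are illustrative):

```python
def lb_ql_step(bias_db, rb_available, rb_needed, step_db=2, bias_max=20):
    """One LB-QL transition: reward R_LB = RB_avail - RB_need, Eq. (16).

    A positive reward (spare RBs) raises the CRE bias by 2 dB to attract
    more devices; a negative reward lowers it by 2 dB to offload devices.
    The bias stays within the state set {0, 2, ..., 20} dB.
    """
    reward = rb_available - rb_needed
    if reward > 0:
        bias_db = min(bias_db + step_db, bias_max)  # s_n -> s_{n+1}
    else:
        bias_db = max(bias_db - step_db, 0)         # s_n -> s_{n-1}
    return bias_db, reward
```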
The Q-learning algorithm returns the final bias value used to increase or decrease the bias of PeNBj to balance the load. The algorithm runs for a period during which it explores all states and updates the Q-function; this is called the learning phase, as explained in Fig. 6a. When the number of free RBs at PeNBj changes, the algorithm captures this through the change in the reward values stored in the Q-table. Such a change in RBs can come from a device associating with or de-associating from PeNBj. When PeNBj senses this change, it calls the Q-learning algorithm again in relearning mode to update the Q-table for the new environment. Figure 6b explains how the scheme works and how the load is balanced between the MeNB and the PeNBs to minimize the total number of outage devices.
3.4.2 The Greedy Scheme for Dynamic Adjustment of Bias Value in the Load Balancing Scheme
In this section, we study the effect of changing the method of setting the biasing value at the PeNB. The reward is divided into three ranges, small, medium, and large, as defined in Table 3. The magnitude of the reward indicates how near to or far from full capacity the PeNB is.
Small Difference
For a positive reward, a small difference between the available RBs and the requested RBs means that the PeNB is near full capacity. Thus, the PeNB increases the biasing value by a small step (2 dB) and moves from \(s_{n}(t)\to s_{n+1}(t+1)\).
For a negative reward \(R_{LB}(t)\), there are not enough RBs to accept any new devices, so the PeNB reduces its biasing value by 2 dB and moves from \(s_{n}(t)\to s_{n-1}(t+1)\), reducing the potential increase in blocking.
Medium Difference
For a positive reward, a medium difference between the available RBs and the requested RBs means that the PeNB has enough RBs to accept new devices. Thus, the PeNB increases the biasing value by 6 dB and moves from \(s_{n}(t)\to s_{n+3}(t+1)\).
For a negative reward \(R_{LB}(t)\), there are not enough RBs to accept any new devices, so the PeNB reduces its biasing value by 6 dB and moves from \(s_{n}(t)\to s_{n-3}(t+1)\).
Large Difference
For a positive reward, a large difference between the available RBs and the requested RBs means that the PeNB has many unused RBs. Thus, the PeNB becomes greedier and increases the biasing value by a large step (10 dB), moving from \(s_{n}(t)\to s_{n+5}(t+1)\).
For a negative reward \(R_{LB}(t)\), the PeNB is near full capacity, and keeping this biasing value would increase the blocking probability. Thus, it cannot accept any new user devices and reduces its biasing value by a large step (10 dB), moving from \(s_{n}(t)\to s_{n-5}(t+1)\) to reduce the number of devices that request access to it and hence its blocking probability.
All transitions are bounded within the range \((Bias_{min}, Bias_{max})\), where \(Bias_{max}\) is the maximum biasing value of 20 dB and \(Bias_{min}\) is the minimum biasing value of 0 dB. For example, if the present state is \(s_{n}\) and the next state is \(s_{n+5}\), the bias of state \(s_{n+5}\) must not exceed \(Bias_{max}\); if the next state is \(s_{n-5}\), its bias must not fall below \(Bias_{min}\). The new Q-learning Markov decision process is depicted in Fig. 7.
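The tiered adjustment of Table 3 can be sketched as follows, mapping the reward magnitude to a step size and clamping the result to the bias limits (a sketch, assuming the Table 3 ranges):

```python
def greedy_bias_step(bias_db, reward, bias_min=0, bias_max=20):
    """Variable-step bias adjustment from Table 3.

    |reward| in [1, 10] -> small step (2 dB); in (10, 20] -> medium (6 dB);
    above 20 -> large (10 dB). The sign of the reward sets the direction,
    and the result is clamped to [Bias_min, Bias_max].
    """
    magnitude = abs(reward)
    if magnitude <= 10:
        step = 2    # small difference: s_n -> s_{n+/-1}
    elif magnitude <= 20:
        step = 6    # medium difference: s_n -> s_{n+/-3}
    else:
        step = 10   # large difference: s_n -> s_{n+/-5}
    bias_db += step if reward > 0 else -step
    return max(bias_min, min(bias_db, bias_max))
```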
3.5 The Overall Scheme Description
In our system model, we have two types of communication devices: H2H and M2M. Each device starts by estimating the RSRP signals of all eNBs.
Each H2H device executes the H2H algorithm (Section 3.3.1) to choose the eNB that provides its required QoS. Each M2M device executes the CAM-QL algorithm to choose the eNB that potentially provides its required QoS. The M2M device first goes through a learning phase until it learns its surrounding environment, as shown in the flowchart of Fig. 4a. The algorithm begins by checking all RSRP signals. It starts in either initializing or relearning mode. In initializing mode, the action-value function \(Q(s,a)\) is initialized, as in step 2, and the Q-learning algorithm starts by setting the state \(S_{init}\), as in step 3a. In relearning mode, it reuses the last values of both \(Q(s,a)\) and \(S\). The algorithm generates a random number to choose an action in step 4: it either goes to step 5a, where it exploits by choosing the action with maximum \(Q(s,a)\), or to step 5b, where it chooses a random action to explore different states. After step 5, the action is executed, the reward and the new \(Q(s,a)\) are calculated, and the state is updated, as in steps 6 to 9. The algorithm then checks the convergence of the Q-function in step 10. If it has not converged, the algorithm continues. If convergence occurs, it checks whether there are any changes in the observed RSRP signals. If there is no change, the learning phase is finished and the Q-table is used for the eNB association. If any change occurs, the Q-learning is called again in relearning mode to adapt to the new environment.
After the learning phase finishes, the M2M device uses its learning to choose the best eNB that provides its QoS requirements. The algorithm described in Fig. 4b starts with step 1: the device chooses the eNB with the maximum \(Q(s,a)\) in the Q-table and checks whether it can satisfy the device's connection request. If the eNB has available RBs, the device checks the moving average of the blocking factor \(\tilde{B}_{eNB}(t)\) of the chosen eNB. If \(\tilde{B}_{eNB}(t)<0.5\), it sends the access request, as in step 6; otherwise it chooses another eNB, as in step 1. If the chosen eNB does not have available RBs, the device updates the blocking factor \(B_{eNB}(t)\) of this eNB to 1, and the next eNB in the list is attempted. If all eNBs are attempted without success, the device is blocked. On a successful association, the device starts to transmit its data and checks whether the newly calculated reward is identical to the reward in its Q-table, as in step 8. If they are equal, it continues using this Q-table when it performs an eNB association. If the two rewards differ, the environment has changed, and the device calls the Q-learning algorithm in relearning mode, as in step 9.
As for PeNBj, each PeNBj checks whether it has free RBs. If so, it calls the load-balancing Q-learning algorithm to calculate the CRE bias for balancing the load, as shown in Fig. 6a. The Q-learning algorithm starts in either initializing or relearning mode. In the initializing case, it initializes the action-value function \(Q(s,a)\), as in step 2, and then sets the state \(S_{init}\), as in step 3a. In relearning mode, it reuses the last values of both \(Q(s,a)\) and \(S\). The algorithm proceeds as explained for the M2M learning phase until PeNBj learns its surrounding environment and settles on a bias value.
The association/de-association process at PeNBj is explained in Fig. 6b. When a device sends a request to PeNBj, the PeNBj checks whether it is an association or a de-association request, as in step 1. If it is a de-association request, the number of available RBs increases by 1, as in step 2b. If it is an association request, PeNBj checks whether there are free RBs to assign to the device, as in step 2a. If there is a free RB, it is assigned to the device and the number of RBs is updated in step 3; otherwise, the device is blocked, as in step 5. After the number of available RBs is updated, PeNBj calls the Q-learning again in relearning mode to update its Q-table according to the new environment.
3.5.1 The Interplay between the LB-QL at the PeNB and the QL Scheme at the M2M Devices
The M2M decisions strongly affect the load-balancing scheme at the PeNBs. If more M2M devices choose to connect to the MeNB, free RBs remain at the PeNBs, prompting them to call the Q-learning scheme to increase their bias and balance the load between the MeNB and the PeNBs. If more M2M devices choose to connect to the PeNBs, two scenarios result: (1) the load becomes balanced between the MeNB and the PeNBs such that all eNBs provide good network performance, and the Q-learning is not called; or (2) some PeNBs carry a high load and some a low load. In this case, the PeNBs with a low load start to increase their bias to attract more devices, while those with a high load start to reduce their bias to discourage devices from sending association requests to them. The overall effect is akin to balancing the load across all eNBs in the network in an attempt to provide good network performance to all devices. This interaction is demonstrated further in the results section.
Table 3
The reward ranges and corresponding actions

| Reward range | Reward value \(R_{LB}(t-1)\) | Action \(a_{n}(t)\) |
|---|---|---|
| Small difference, \(Ʀ_{1}=[1, 10]\) | \(\lvert R_{LB}(t-1)\rvert \in Ʀ_{1}\) | Positive \(R_{LB}(t-1)\): move from \(s_{n}(t)\to s_{n+1}(t+1)\); negative \(R_{LB}(t-1)\): move from \(s_{n}(t)\to s_{n-1}(t+1)\) |
| Medium difference, \(Ʀ_{2}=(10, 20]\) | \(\lvert R_{LB}(t-1)\rvert \in Ʀ_{2}\) | Positive \(R_{LB}(t-1)\): move from \(s_{n}(t)\to s_{n+3}(t+1)\); negative \(R_{LB}(t-1)\): move from \(s_{n}(t)\to s_{n-3}(t+1)\) |
| Large difference, \(Ʀ_{3}=(20,\infty)\) | \(\lvert R_{LB}(t-1)\rvert \in Ʀ_{3}\) | Positive \(R_{LB}(t-1)\): move from \(s_{n}(t)\to s_{n+5}(t+1)\); negative \(R_{LB}(t-1)\): move from \(s_{n}(t)\to s_{n-5}(t+1)\) |