In this section, the screening process and the results of the data extraction are described. A brief analysis then leads to the discussion, open issues, and conclusion sections.
4.4 Machine Learning strategies implemented for solving the RA problem.
The works analyzed in this review apply ML strategies to solve the RA problem in ultra-dense networks. These techniques are divided into ANN-based, RL-based, and DRL-based models. In the following subsections, we describe the learning process of each model applied in each work. Figure 4 shows the overall process these algorithms follow to allocate network resources (RBs and transmission power) among UEs. The ML model decides its strategy by observing the network's environmental information, such as the UEs' QoS requirements, Channel State Information (CSI), number of UEs and Resource Blocks (RB), interference, or current resource usage. The ML models generally learn to map the observed information into resource allocation strategies. The network entities, such as the SBS, execute the resulting resource control strategies, which impact the environment and generate new observations. This interaction cycle is repeated for continuous control until an optimal decision-making policy is achieved. During this process, some ML models, such as the RL-based and DRL-based ones, may adjust their RA strategy online. In contrast, the ANN-based models require their RA strategy to be adjusted before deployment.
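The interaction cycle of Figure 4 can be sketched as the following loop, where `UdnEnv` and its observation, action, and reward signals are hypothetical placeholders for the network metrics listed above, not an implementation from any surveyed work:

```python
# Sketch of the observe-decide-act cycle of Figure 4 (hypothetical names).
class UdnEnv:
    """Placeholder for a network simulator exposing CSI, QoS, and interference."""
    def reset(self):
        return [0.0, 0.0, 0.0]             # initial observation vector

    def step(self, action):
        next_obs = [0.1, 0.2, 0.3]         # environment reacts to the RA decision
        reward = 1.0                       # e.g., achieved throughput or EE
        return next_obs, reward

def run(env, policy, steps=100):
    obs = env.reset()
    for _ in range(steps):
        action = policy(obs)               # map observation -> RA strategy (RB, power)
        obs, reward = env.step(action)     # SBS applies the action; new observation
```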
1) Artificial Neural Network-based models (ANN)
ANN algorithms learn to perform tasks without being programmed with task-specific rules. An NN consists of three layers: an input layer, a hidden layer, and an output layer. Training an NN from a given set of examples optimizes its weighted parameters to enhance the accuracy of the predicted (target output) values. The parameters are adjusted according to the error between the predicted and target values (i.e., reference values) (Chen et al. 2019). A Deep Neural Network (DNN) is an NN with multiple hidden layers between the input and output layers. The data flow from the input to the output layer creates a map of virtual neurons, and each connection is assigned a weight. The weights are adjusted when the neural network does not recognize a particular pattern.
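As an illustration of this training loop, the following sketch adjusts the weights of a small fully connected NN from the mean-squared error between predicted and target values; the layer sizes and synthetic data are assumptions for illustration only (PyTorch):

```python
# Minimal supervised NN training sketch: weights are adjusted according to the
# error between predicted and target values (synthetic data, illustrative sizes).
import torch
from torch import nn

model = nn.Sequential(                  # input layer -> one hidden layer -> output layer
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                  # mean-squared error criterion

x = torch.randn(256, 4)                 # synthetic inputs (e.g., interference features)
y = torch.randn(256, 1)                 # synthetic targets (e.g., SINR labels)

for epoch in range(100):
    pred = model(x)                     # forward pass: predicted values
    loss = loss_fn(pred, y)             # error between predicted and target values
    optimizer.zero_grad()
    loss.backward()                     # backpropagate the error
    optimizer.step()                    # adjust the weighted parameters
```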
Furthermore, different NN architectures may be designed for various purposes. For example, recurrent NNs such as the LSTM allow neuron connections from previous layers to solve sequence prediction problems. Meanwhile, architectures such as Convolutional Neural Networks (CNN) and Graph Neural Networks (GNN) are implemented to process data represented as images or graphs.
ANN-based models have been used in UDN to resolve issues related to Energy Efficiency (EE) (Zhang et al. 2019a), throughput (Cao et al. 2019; Zhang et al. 2022), and traffic control (Hossain and Muhammad 2020; Zhou et al. 2018). Table 4 shows the design features used by the ANN-based models, such as the optimizer, activation layer, and training strategy.
Cao et al. (2019) use an NN to solve the UE clustering and subchannel allocation problem. The authors implement modified Min k-Cut and conflict graph algorithms, while the NN extracts the inter-user interference relationship of each UE. The dataset consists of the potential interfering UEs of each RB (i.e., other UEs using the same Resource Block (RB) in the same transmission time interval), with the uplink Signal-to-Interference-plus-Noise Ratio (SINR) as the label. The NN parameters are modified until a minimum mean-squared error is achieved. The dataset used for training was collected from LTE-Sim, a simulation platform for LTE systems. Zhang et al. (2022) propose a GNN to extract node information and reduce the workload and data requirements of the UEA and power control problem.
Then, a training scheme is proposed by combining supervised and unsupervised learning. The model exploits the generalization ability gained during offline training, whereas its performance is enhanced during online training to suit real-time scenarios. Results show that the GNN technique achieves higher performance and faster convergence than fully connected NNs and CNNs. Moreover, their proposal outperforms traditional techniques, such as maximal achievable rate association with maximum power and maximal sum-utility association with maximum power. Zhang et al. (2019a) propose a centralized DNN to allocate transmission power to each UE's wireless link in UDN. The NN is trained with data obtained from an iterative gradient method; the data is normalized to a zero-mean normal distribution and regularized with L2 regularization. Besides, they introduce a distributed DNN to face the challenge of model size in UDN. The distributed DNN divides the fully connected DNN into several DNN models that are trained in parallel. All weights are collected by a parameter server (i.e., a central controller) that updates and redistributes them. This procedure reduces the training time and makes the system robust due to the different small-scale networks trained. Results show 97.0%-98.4% accuracy, nearly ten times less operation time, and a slight EE difference from the iterative gradient algorithm used for training. Hossain and Muhammad (2020) and Zhou et al. (2018) use the LSTM model to address traffic control issues; the LSTM uses currently available and past data to obtain output values. These works consider time division duplexing on their transmissions and aim to change the uplink/downlink ratio before congestion occurs. In addition, Hossain and Muhammad (2020) implement a tree-based deep model before the LSTM technique to reduce the parameters of the data gathered in the spatial domain from many UEs. Both works show a performance enhancement over methods that change the network RA policies only after congestion occurs.
2) Reinforcement Learning-based models (RL)
RL-based models can allocate resources dynamically according to the requirements. They allocate resources with the knowledge extracted from big data without the need for explicit mathematical models, learning from the interaction with their environment. An RL task consists of training an agent: by performing actions, the agent arrives at different scenarios known as states, and actions lead to rewards. The agent's sole purpose is to maximize its total reward. The output depends on the state given by the current input, and the next input depends on the previous output. Without a training dataset, the agent is bound to learn from its experience, selecting actions based on past experience in a trial-and-error fashion (i.e., exploration and exploitation). According to the formulated reward function, these actions are rewarded (positively or negatively), and their values are stored in a Q-table to influence the selection of future actions. The learning process consists of taking actions from the Q-table according to an exploration strategy, calculating the reward, and updating the Q-table with the Q-values corresponding to the action-state tuple. This learning process is repeated until an end criterion is met (e.g., a number of episodes). Table 5 shows the design features of the works that apply RL-based models to solve the RA problem in UDN. These research works were grouped according to the reward function into delay minimization (Elsayed et al. 2019), EE maximization (Zhang et al. 2019b; AlQerm and Shihada 2017; Kim et al. 2022a; Sharma and Kumar 2022), interference mitigation (including interference control, interference management, and Inter-Cell Interference Coordination (ICIC)) (Feki and Capdevielle 2011; AlSobhi and Aghvami 2019; Jiang et al. 2016; Zhang et al. 2018a; Li et al. 2018), throughput maximization (Elsayed and Erol-Kantarci 2018; Lu et al. 2016; Amiri et al. 2019; Chen et al. 2016; Amiri and Mehrpouyan 2018a; Lin et al. 2017; Amiri et al. 2018b; AlQerm and Shihada 2016; Li et al. 2021a; Iqbal et al. 2021; Iqbal et al. 2022), and utility maximization (Li et al. 2019).
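The tabular learning process described above can be sketched as follows, where `env` is a hypothetical environment exposing the states, actions, and rewards of the RA problem, and the hyperparameters are illustrative assumptions:

```python
# Minimal tabular Q-learning sketch: epsilon-greedy action selection, reward
# observation, and Bellman update of the Q-table, repeated over episodes.
import random

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    q = [[0.0] * env.n_actions for _ in range(env.n_states)]   # Q-table
    for _ in range(episodes):                                  # end criterion: episodes
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:                      # exploration
                a = random.randrange(env.n_actions)
            else:                                              # exploitation
                a = max(range(env.n_actions), key=lambda i: q[s][i])
            s_next, r, done = env.step(a)                      # act and observe reward
            # Update the Q-value of the (state, action) tuple
            q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])
            s = s_next
    return q
```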
Elsayed et al. (2019) address delay minimization in an LTE system. The evaluation scenario consists of an LTE network with mobile UEs and MicroGrid Devices (MGD). The goal is to reduce latency and increase fairness between UEs and MGDs. Each eNB/SBS allocates the RBs according to the reward function, which is adapted with a scalar weight that controls the priorities between traffic types on each device type, achieving a trade-off between UEs and MGDs. An analysis of the action space indicates that better RA actions can be learned at the cost of more delay.
Zhang et al. (2019b) propose an event-triggered QL approach to save computational resources and maximize EE. The Small-cell User Equipment (SUE) acts only when the difference between the agent's reward and the average reward of the SUEs using the same channel on the current and previous steps exceeds a threshold. The proposed event-triggered approach achieves better EE than classical QL. AlQerm and Shihada (2017) use an intuition learning scheme that lets Secondary Transmitters (ST) (i.e., Pico Base Station (PBS), Femto Base Station (FBS), and Device-to-Device (D2D)) infer the behavior of other STs with local information, making use of their interactions with the environment and their past experiences. A Q-value approximation is proposed to reduce the state space, which results in better performance than conventional algorithms and ensures the QoS of primary and secondary UEs. Kim et al. (2022a) propose a QL-based power transmission control to maximize the EE while minimizing the number of UE outages. In this scheme, the UEs consider only their past actions for decision-making, while the reward function considers the global network performance. Results show that the computational complexity is significantly reduced compared to centralized QL, where the Q-table increases exponentially with the number of agents and actions/states. Also, a higher convergence rate than a distributed scheme with independent rewards is achieved, tested in uniform and non-uniform UE spatial traffic distribution scenarios. Sharma and Kumar (2022) maximize EE considering the QoS of FUEs in a small-cell network. The proposed strategy consists of grouping FBSs through a K-means clustering algorithm that considers the FBS geographical locations and the traffic load at each cluster. Then, a QL algorithm allocates the RBs for each cluster head. The QL algorithm is trained cooperatively using the historical information of other agents stored in a cloud server. Results show higher EE, throughput, and convergence speed than independent QL and Stackelberg methods.
To address the interference mitigation problem, Feki and Capdevielle (2011) developed a MAB algorithm based on RL theory. Each cell aims to select the best band portion for ICIC. The MAB algorithm starts by sequentially accessing all spectrum bands from the spectrum pool and choosing the group of available channels that gives the highest reward. Then, a decisional function is proposed to select the next spectrum band to transmit. This function considers the mean reward of the spectrum band and the number of times the same band has been chosen for transmission.
Table 4 Design features of ANN-based resource allocation in UDN.

Objective | Ref | Model | Training | Optimizer | Activation layer
Energy Efficiency Maximization | Zhang et al. (2019a) | DNN | Dataset was generated by solving an iterative gradient algorithm; 15000 samples were generated for training and 1000 for validation | Adam | ReLU
Throughput Maximization | Cao et al. (2019) | NN | Dataset contains billions of samples generated from the LTE-Sim platform; each sample consists of the potential interfering UEs and the uplink SINR | - | -
Throughput Maximization | Zhang et al. (2022) | GNN | The model was first trained with several optimization solutions to achieve generalization ability, then fine-tuned with the current scenario data | GA | PA: Sigmoid; UEA: Softmax
Traffic Control | Hossain and Muhammad (2020) | DLSTM | Training was performed with different fixed uplink/downlink ratios; a deep model with a tree structure was used to enhance regularization and reduce the complexity of the LSTM model | SGD | DTM: ReLU; LSTM: Sigmoid and tanh
Traffic Control | Zhou et al. (2018) | LSTM | Training involves a data sequence of packets in the sending buffer at fixed uplink/downlink ratios; the network retrains the model at each ratio change | - | Sigmoid and tanh

ReLU: Rectified Linear Unit. GA: Gradient Ascent. PA: Power Allocation. UEA: User Equipment Association. SGD: Stochastic Gradient Descent. DTM: Deep Tree Model.
Also, the exploration parameters are tuned to allow choosing transmission bands with low reward values. The results show higher throughput, using the spectrum portions more efficiently than fixed reuse schemes. AlSobhi and Aghvami (2019) propose three variants of QL algorithms to perform power allocation: distributed, formulated, and cooperative. The distributed algorithm aims to enhance the Femtocell-User (FUE) capacity while the Macrocell-User (MUE) QoS is maintained. The formulated algorithm modifies its state by considering the MBS, MUE, and Femtocell Access Point (FAP) locations. Meanwhile, the cooperative QL reduces the computational complexity by letting experienced agents (i.e., agents whose algorithm has converged) share information with new agents in a similar state to accelerate their training convergence. Results show that the location of the MUE is a decisive factor in achieving the network QoS requirements.
Jiang et al. (2016) consider a dual-hop architecture where the access and self-backhaul networks share the same spectrum. The Hub Base Station (HBS) controls the file transmission to the SBS and from the SBS to the UE. To reduce the computational complexity of conventional QL, the authors consider a single-state QL, simplifying the action-state pairs to a stateless format. At initialization, the agents remove the high-interference channels from the spectrum pool. Then, the agents assign a Q-value to each action, which guides their decision-making. As a result, the link capacity of both networks (macro and micro) is increased, and the convergence time is reduced compared with the conventional cognitive radio approach. Zhang et al. (2018a) propose a QL algorithm with a Transfer Learning (TL) method to accelerate the learning speed in a Small Cell Network (SCN). New agents' Q-tables are updated with the Q-values of experienced agents with similar environments. Their reward is based on the UE density, SINR, and transmit power, focusing on reducing inter-cell interference and saving the energy consumption of BSs. Li et al. (2018) apply a conflict graph strategy for clustering and the QL technique for interference management. Agents allocate the transmission power over different RBs according to other agents' interference and the overall network SINR. Results show an enhancement in network throughput; however, peak throughput decreases at the expense of better edge throughput.
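As a hedged sketch of this Q-value transfer idea, a new agent's Q-table could be initialized as a weighted blend of an experienced agent's table; the blending weight `w` is an assumption for illustration, not a value from the cited works:

```python
# Sketch of Q-table transfer between agents in cooperative/transfer QL schemes.
def transfer_q_table(expert_q, own_q, w=0.8):
    """Blend an experienced agent's Q-values into a new agent's Q-table.

    expert_q, own_q: nested lists indexed as [state][action].
    w: blending weight toward the experienced agent (illustrative assumption).
    """
    return [
        [w * eq + (1.0 - w) * oq for eq, oq in zip(expert_row, own_row)]
        for expert_row, own_row in zip(expert_q, own_q)
    ]
```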
Elsayed and Erol-Kantarci (2018) consider the coexistence of Data-Intensive Devices (DID) and traditional UEs for the throughput maximization topic. DIDs correspond to emerging applications expected to be frequent in future networks, such as augmented reality, virtual reality, and tactile applications. The system consists of a multi-agent scenario, where the Evolved Node B (eNB) and the SBSs perform RB allocation to their attached SBSs and UEs, respectively. Furthermore, Resource Block Groups (RBG) consisting of contiguous RBs were considered to avoid the curse of dimensionality of considering all RB combinations. Results outperform the Proportional Fair (PF) algorithm concerning throughput, delay, and fairness for different DID densities. The work by Lu et al. (2016) avoids interference among SBSs by implementing an ICIC scheme consisting of adaptive Almost Blank Subframes (ABS) and QL for power control. The proposed ICIC scheme adapts the ABS ratio according to the load of the most interfering cell. However, instead of blanking the subframes, these cells, denominated aggressor cells, are allowed to transmit at low power, controlled by the QL algorithm. Results show that the dynamic power control of low-power ABS outperforms blanking ABS as the UE density increases. Amiri et al. (2019) consider a distributed QL for power control in a Self-Organizing Network (SON). As the system densifies, the authors consider two training schemes for new agents, i.e., Independent Learning (IL) and Cooperative Learning (CL). To solve the optimization problem, they design a reward function to fulfill the FUE and MUE QoS requirements. Results show that IL achieves a higher FBS sum rate and power consumption, while CL achieves a higher MBS sum rate and lower power consumption at the cost of signaling overhead. Moreover, Chen et al. (2016) consider distributed and centralized RA approaches, where the cluster heads allocate the RBs within the cluster and the MBS allocates the RBs of each cluster, respectively. Results show a better performance of the centralized approach, although its larger action space needs more time to converge than the distributed approach. Amiri and Mehrpouyan (2018a) propose a joint clustering method and transmission power allocation based on the QL algorithm. The BS chooses its transmission power according to its state, defined by zones separated by concentric circles from the cluster head. Besides, a reward function is proposed to satisfy QoS and provide fairness, which outperforms the benchmark reward function. In (Lin et al. 2017), the MBS coverage area is divided into two zones, the cell-edge and cell-center regions, defined as the low-interference and high-interference zones, respectively. The QL technique is implemented in the SBSs of the cell-edge region to allocate the transmission power to maximize the throughput while ensuring the MUE QoS. Amiri et al. (2018b) implement the QL algorithm to provide fairness across the whole network. The agents present two operation modes: individual learning and CL. With individual learning, the agents learn through interaction with the environment, while with CL, the agents learn from experienced agents. The algorithm's complexity can be reduced using the cooperative approach since agents share their experiences instead of discovering all the environment information themselves (i.e., making exploration actions).
In (AlQerm and Shihada 2016), agents of the same type (i.e., FBS or D2D) work cooperatively to share their state information. The agent can allocate RBs and power and adapt the transmission modulation. Also, two mechanisms were implemented for the QL algorithm. First, the exploration rate is progressively decreased so that exploration is high at the beginning. Second, the learning rate is modified to learn faster when losing and more slowly when winning (i.e., a Q-value comparison between consecutive actions). These modifications prevent the learning mechanism from depending only on the performance metrics. Results show better performance in throughput, SE, and fairness. Li et al. (2021a) implement a QL algorithm to control the transmission power. First, they introduce an analysis of network interference through graph theory. Then, the gathered information is used as part of the state. Results show that their proposal better describes the interference of the whole network, achieving higher throughput performance than baseline algorithms. Iqbal et al. (2021) and Iqbal et al. (2022) implement QL to maximize the throughput of MUEs and SUEs in a dense network. They propose an adaptive power control on the SBS, assuming SON functionalities for self-optimization. For validation, they consider different scenarios to mitigate cross-tier and co-tier interference. The reward function is designed to consider the minimum UE QoS requirements. Further, Iqbal et al. (2022) propose a cooperative QL, which consists of sharing the Q-table information of nearby agents during the learning process. The results show that the cooperation scheme achieves a better SUE data rate than independent learning in denser scenarios. Meanwhile, both QL schemes outperform state-of-the-art solutions tested in International Mobile Telecommunications (IMT) scenarios.
Li et al. (2019) address utility maximization. The reward function considers the EE and load balancing, with power allocation and UEA as actions. Load balancing improves the number of UEs associated with SBSs; therefore, better EE performance than conventional algorithms is obtained, showing continuous improvement as the network densifies.
Table 5 Design features of RL-based resource allocation in UDN.

Objective | Ref | Model | Agent | Action | State | Reward | Exploration
Delay Minimization | Elsayed et al. (2019) | QL | eNB/SBS | Set of RBs to their UEs | SBS/UE channel state information | UE delay considering a trade-off between MGDs and UEs | ε-greedy
Energy Efficiency (EE) Maximization | Zhang et al. (2019b) | QL | SUE | Subchannel and power allocation | Channel occupied and allocated power of the SUE | EE considering the UE SINR threshold | Boltzmann probability distribution
EE Maximization | AlQerm and Shihada (2017) | QL | ST | Transmission power | ST ID and transmission power | EE | Boltzmann probability distribution
EE Maximization | Kim et al. (2022a) | QL | SBS | Transmit power steps {up, down, keep} | Maximum and minimum transmission power and step power size of the SBS | EE and the number of UE outages | ε-greedy
EE Maximization | Sharma and Kumar (2022) | QL | Cluster head FBS | RB allocation and transmission power | User association relationship, SINR, required data rate, and total delay | EE considering the UE's QoS | Boltzmann probability distribution
Interference Mitigation | Feki and Capdevielle (2011) | MAB | PBS | Bandwidth portion | Cell and sub-band indexes | Mean instantaneous throughput | Decisional function
Interference Mitigation | AlSobhi and Aghvami (2019) | QL | 1-3: FAP | 1: Transmission power; 2-3: Same as 1 | 1: Capacity of the MUE; 2: Location of MBS and MUE to the FAP; 3: Same as 2 | 1: Favors the FUE based on the location and QoS of the MUE; 2: Prioritizes the MUE QoS and sum capacity of the FUE; 3: Same as 2 | 1-3: ε-greedy
Interference Mitigation | Jiang et al. (2016) | QL | SBS | Channel selection | - | Link capacity | ε-greedy
Interference Mitigation | Zhang et al. (2018a) | QL | SBS | Transmission power | UEs' density and UEs' previous SINR | UE density, SINR, and transmission power | ε-greedy
Interference Mitigation | Li et al. (2018) | QL | SBS | Transmission power allocated to each RB | Maximum interference and agent's SINR | Throughput and interference | -
Throughput Maximization | Elsayed and Erol-Kantarci (2018) | QL | eNB and SBS | RBG allocation | CQI and recent packet rate sent | Throughput of DIDs and regular UEs | ε-greedy
Throughput Maximization | Lu et al. (2016) | QL | SBS | Transmission power | UEs' sum rate in the aggressor cell | SBS throughput | ε-greedy
Throughput Maximization | Amiri et al. (2019) | QL | FBS | Transmission power level | Performance of FUE and MUE, MUE's interference from FBS, and FBS's distance to the MBS | Transmission rate considering FUE's and MUE's minimum rate requirements | ε-greedy
Throughput Maximization | Chen et al. (2016) | QL | 1: Cluster head; 2: MBS | 1: RB allocation within the cluster; 2: RB allocation of each cluster | 1: QoS of the macro cell; 2: Same as 1 | 1: Capacity of the RB for a cluster; 2: Average capacity of all clusters | ε-greedy
Throughput Maximization | Amiri and Mehrpouyan (2018a) | QL | SBS | Transmission power | Concentric circles measuring the UE's distance from the cluster head | Throughput considering UE QoS | ε-greedy
Throughput Maximization | Lin et al. (2017) | QL | SBS | RB power levels | Victim MUE's target SINR in the subchannel and neighbor transmission power | Throughput prioritizing MUE QoS | Boltzmann probability distribution
Throughput Maximization | Amiri et al. (2018b) | QL | FBS | Transmission power | Neighborhood states based on the distance of the FBS to the MBS and MUE | Capacity for FUE while satisfying both FUE and MUE QoS | ε-greedy
Throughput Maximization | AlQerm and Shihada (2016) | QL | FBS or D2D transmitter | RB, transmission power of the underlay transmitters in the center and edge band, and modulation level | Underlay tier transmitter, available RBs, and SINR measured in the central band, in the edge band, and from neighboring agents | MUE and FUE data rate considering the spectral efficiency achieved at the underlay receiver | ε-greedy
Throughput Maximization | Li et al. (2021a) | QL | SBS | Power allocation for each RB | Agent's maximum interference and cluster interference | SBS throughput and unserved UEs | ε-greedy
Throughput Maximization | Iqbal et al. (2021) | QL | SBS | Transmission power | Distance between the SBS and MBS, and between the SBS and MUE | UE capacity considering the minimum SINR of MUE and SUE | ε-greedy
Throughput Maximization | Iqbal et al. (2022) | QL | SBS | Transmission power | Radial distance between the SBS and MBS, and between the SBS and MUE | UE capacity considering the minimum SINR of MUE and SUE | ε-greedy
Utility Maximization | Li et al. (2019) | QL | UE | Power allocation and UE association | Received SINR, association state, and agent's power level | Function considering UEA, energy efficiency, and QoS | Boltzmann probability distribution

ST: Secondary Transmitters (i.e., PBS, FBS, and D2D). RBG: Resource Block Group. CQI: Channel Quality Indicator. D2D: Device-to-Device.
3) Deep Reinforcement Learning-based models (DRL)
DRL-based algorithms have the same structure as RL-based models. Deep RL incorporates deep learning into the solution, allowing agents to make decisions from unstructured input data without manual engineering of the state space. In DRL-based models, the Q-value is approximated through an NN (Luong et al. 2019). The NN is necessary because the performance of RL-based algorithms degrades when the action/state space increases, making it challenging to find optimal policies (i.e., policies whose actions obtain the maximum long-term benefit). In addition, since consecutive data are correlated in time, DRL algorithms require a replay memory mechanism and a secondary network (i.e., target network) for stability purposes. The replay memory stores experiences, from which random samples (i.e., mini-batches) are extracted to train the NN; the new experiences obtained are stored back in the replay memory. The target network updates its parameters by copying the parameters of the NN (called the value network). Meanwhile, the value network trains itself using the target network's outputs as labels and the mini-batches as training sets. Table 6 shows the design features of the works that use DRL-based models to solve the resource allocation problems in UDN. We grouped them according to the following objectives: Computational Cost (CC) minimization (Li et al. 2020a), EE and SE maximization (Liu et al. 2019c; Ye 2022), Energy Consumption (EC) minimization (Huang et al. 2021), EE maximization (Li et al. 2018b; Ding et al. 2020; Liu et al. 2019a; Liu et al. 2019b; Chen et al. 2021; Anzaldo and Andrade 2022; Zhao et al. 2021), interference mitigation (Xiao et al. 2019), joint SE, EE, and fairness maximization (Liao et al. 2019), SE maximization (Nasir and Guo 2019), throughput and EE maximization (Li et al. 2021b), throughput maximization (Su et al. 2020; Khoshkbari et al. 2020; Vishnoi et al. 2021; Li et al. 2020b; Sande et al. 2021; Anzaldo and Andrade 2021; Liu and Zhang 2022; Chen et al. 2022; Suh et al. 2022), UE satisfaction maximization (Do and Koo 2019; Wang et al. 2019), and utility maximization (Zhao et al. 2019; Cheng et al. 2020; Kim et al. 2022b).
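A minimal sketch of this machinery, assuming small illustrative network sizes and experiences stored as `(state, action, reward, next_state)` tensors, could look as follows (PyTorch); it is not the implementation of any specific surveyed work:

```python
# Minimal DQN stabilization sketch: replay memory, random mini-batches, and a
# target network periodically copied from the value network.
import random
from collections import deque
import torch
from torch import nn

value_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
target_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
target_net.load_state_dict(value_net.state_dict())    # target starts as a copy
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                          # replay memory of experiences

def train_step(gamma=0.99, batch_size=32):
    # Random sampling breaks the temporal correlation of consecutive data.
    batch = random.sample(replay, batch_size)
    s, a, r, s2 = map(torch.stack, zip(*batch))        # actions stored as int64 tensors
    with torch.no_grad():                              # target network provides the labels
        y = r + gamma * target_net(s2).max(dim=1).values
    q = value_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # Periodic copy of the value-network parameters into the target network.
    target_net.load_state_dict(value_net.state_dict())
```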
Li et al. (2020a) address CC minimization for a Non-Orthogonal Multiple Access (NOMA) Mobile Edge Computing (MEC) system. To address the problem, the authors propose a jointly iterative optimization algorithm of User Cluster Matching (UCM) and Mean-Field DDPG (MF-DDPG). First, UEs are clustered with the UCM optimization algorithm based on their channel gains. Then, the resource allocation (i.e., transmission power and task offloading) problem is modeled as a mean-field game and solved by the MF-DDPG algorithm. Finally, the solution is obtained by iterating between these algorithms. The results were compared against Orthogonal Multiple Access (OMA) and DQN, demonstrating the system's reduction in energy consumption and task delay. Also, the convergence speed is improved over conventional DQN.
The SE and EE maximization is addressed in (Liu et al. 2019c; Ye 2022). Liu et al. (2019c) implement a DuQN at the MBS for RB allocation. The DuQN modifies the DQN architecture, separating the final layer into two streams that estimate how good it is for an agent to be in the current state and the advantage of selecting each action in that state (Wang et al. 2016). Results show that the DuQN achieves higher EE and SE performance and faster training convergence than the QL and DQN algorithms. Ye (2022) implements a double DQN (DDQN) for RB allocation. Unlike traditional DQN, which uses the maximum Q-values both to select and to evaluate the action, the DDQN introduced by Van Hasselt et al. (2016) decouples the selection and the evaluation into two Q-value functions, preventing the overestimation that results from using the maximum values. Once the model is trained, a pruning algorithm reduces the size of the NN by removing redundant connections between the fully connected layers. Pruning reduces the model's complexity, decreasing inference time with negligible performance drops.
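The difference between the two targets can be sketched in a few lines; the tensors and discount factor are illustrative assumptions:

```python
# Vanilla DQN target vs. the decoupled DDQN target (Van Hasselt et al. 2016).
import torch

def dqn_target(r, s2, target_net, gamma=0.99):
    # Vanilla DQN: the same max both selects and evaluates the next action,
    # which tends to overestimate Q-values.
    return r + gamma * target_net(s2).max(dim=1).values

def ddqn_target(r, s2, value_net, target_net, gamma=0.99):
    # DDQN: the value network selects the action, the target network evaluates it.
    best_a = value_net(s2).argmax(dim=1, keepdim=True)              # selection
    return r + gamma * target_net(s2).gather(1, best_a).squeeze(1)  # evaluation
```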
Huang et al. (2021) implement two DRL training schemes for EC minimization while satisfying task latency requirements through computation offloading and resource allocation (channel allocation, uplink power control, and computation RA for each UE served by the SBS). Both schemes consider multi-agent DRL. In federated learning, each agent trains its model with local information without exchanging it; a new model is then aggregated from all the agents' parameters and sent back to all the agents, avoiding sending local information to other agents. Meanwhile, in the centralized training scheme, the agents send their experiences (i.e., local information) to a centralized controller that trains each agent's updated model. Results show lower energy consumption and guaranteed latency requirements compared with non-cooperative DRL (i.e., independent learning).
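The federated aggregation step can be sketched as a simple parameter average; equal weighting across agents is an assumption here, not a detail from the cited work:

```python
# Sketch of federated averaging: only model parameters are aggregated,
# never the agents' local data.
import torch

def federated_average(state_dicts):
    """Average the parameters of several locally trained models (state_dicts)."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

# Usage: each agent loads the averaged model and continues local training, e.g.,
# new_global = federated_average([agent.model.state_dict() for agent in agents])
```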
In (Li et al. 2018b), a DDPG for power control in an energy-harvesting UDN is reported. Two NNs are implemented as the critic and actor networks. The critic network approximates the action-value function, updating its weights using the Temporal-Difference (TD) error. The actor network updates the power allocation policy by updating its weights using a sampled policy gradient. A target network is implemented for both the critic and actor networks to ensure stability. The DDPG can handle a continuous action space (i.e., continuous power control rates), allowing more exploration. Consequently, it shows high instability at the beginning (due to the larger action space) but better EE when converging, compared to QL and DQN. Ding et al. (2020) use the DQN algorithm to perform UEA and control the uplink transmission power to maximize the EE. The system consists of a two-tier network; the UEs are regarded as agents for the decision-making process. The reward considers the EE of all the agents sharing the same RB, which makes each agent aware of the impact of its decisions. The results show less convergence time and better performance than QL. Moreover, the proposed algorithm shows consistent results and higher performance over different UE and BS densities for solving the joint optimization problem. Uplink power control is also addressed for a NOMA UDN by Liu et al. (2019a). To reduce the interference at the BS due to multiple UEs transmitting on the same subcarrier, a centralized controller applies UE pairing based on their channel gain difference and implements a DQN algorithm for power control. Results show that the proposed DQN outperforms QL concerning convergence across different UE and BS densities. Liu et al. (2019b) implement a DQN for UEA and power allocation. They implement a water-filling algorithm for power allocation at the beginning of the training to avoid performance drops due to random allocation. The results show that the DQN performs better than QL after a few iterations. Also, the DQN shows consistency in terms of EE when density increases, while the performance gap between DQN and QL widens. Chen et al. (2021) address the RA problem in a UAV-assisted UDN, where the UAV acts as an auxiliary BS. The RA problem is divided into UE link selection and power control and solved by a DQN model designed to maximize the EE. The results outperform QL and heuristic schemes (i.e., random and maximum power allocation) regarding EE, throughput, and power consumption. Anzaldo and Andrade (2022) propose a conventional DQN for power control with a knowledge transfer scheme that reuses the experiences of other trained models (i.e., models trained with fewer agents) together with the current model's experiences during the learning process. Results show a higher convergence rate using a diverse set of experiences from lower-complexity models than conventional DQN training. Zhao et al. (2021) propose a DDQN model for uplink power control to maximize EE in an ultra-dense femtocell network. The model is trained with the centralized-training, distributed-execution scheme. Furthermore, an interference identification algorithm models the user-level relationship to obtain accurate environmental information used in the model's state. Results show higher EE and lower complexity than the Fractional Programming (FP) algorithm reported by Shen and Yu (2018) and the Successive Pseudo-Convex Optimization (SPCO) algorithm in (Zhang and Mao 2020).
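A compact sketch of the two DDPG updates (critic from the TD error, actor from the sampled policy gradient), under the assumption that `critic` is a module taking a state-action pair and that target copies of both networks exist:

```python
# Sketch of one DDPG update step (PyTorch); names and shapes are illustrative.
import torch
from torch import nn

def ddpg_update(actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, batch, gamma=0.99):
    s, a, r, s2 = batch                                    # mini-batch tensors
    with torch.no_grad():                                  # TD target from target networks
        y = r + gamma * critic_t(s2, actor_t(s2)).squeeze(1)
    td_loss = nn.functional.mse_loss(critic(s, a).squeeze(1), y)
    critic_opt.zero_grad()
    td_loss.backward()                                     # critic minimizes the TD error
    critic_opt.step()

    policy_loss = -critic(s, actor(s)).mean()              # sampled policy gradient
    actor_opt.zero_grad()
    policy_loss.backward()                                 # actor improves the policy
    actor_opt.step()
```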
Xiao et al. (2019) implement a strategy to control the interference in an ultra-dense small cell network. It allocates the transmission power to UEs considering the SINR and the UE density. The agents' initialization uses experienced agents' information to accelerate the learning stage. Also, it applies a convolutional NN to estimate the Q-values and compress the agents' state space. This procedure improves power consumption and increases the throughput compared to RL and data-driven algorithms.
The joint SE, EE, and fairness maximization is addressed in (Liao et al. 2019). The authors solve the multi-objective problem in two stages. First, SE maximization is used to build the DNN. Then, the EE and fairness are considered in the DRL framework's reward function. The proposed algorithm obtains the RB and power allocation decisions based on limited CSI. Random and model-based initialization were evaluated, resulting in fewer iterations for convergence when the initialization is based on network knowledge. The proposed algorithm outperforms a benchmark algorithm that requires full CSI knowledge under different conditions of EE, SE, and fairness, and different numbers of UEs and channels.
On the other hand, Nasir and Guo (2019) maximize the SE through the DQN algorithm. The authors train the DQN weight parameters in a centralized manner, gathering experiences from different local agents. The experiences consist of several neighbors' information in the agents' states, which comprise local information, interfering neighbors, and interfered neighbors. In the considered framework, all network transmitters send their experiences to a central controller for training. After training is finished, the weight parameters are sent back to the transmitters, which execute the DQN with their local states as inputs. Therefore, the memory and computational resources required at the transmitter are reduced. The algorithm shows a faster power allocation process than centralized algorithms like the Weighted Minimum Mean Square Error (WMMSE) algorithm (Shi et al. 2011) and the iterative algorithm based on FP, which require instant and accurate measurements of all channel gains.
Li et al. (2021b) solve the User Association (UEA) and Power Allocation (PA) problems by applying a DDPG model to jointly maximize the throughput and EE. Further, they implement an additional layer that post-processes the output of the DDPG model, so the joint action is composed of discrete and continuous values for UEA and PA, respectively. Their proposal was evaluated in OMA- and NOMA-based HetNets.
For maximizing the UE throughput in UDN, Su et al. (2020) implement two DQNs, one at the FBS and one at the MBS. In addition, the distance between the agents (FBS or MBS) and the UEs is considered in the DQN states to perform the transmission power allocation. The reward function considers the QoS of the FUEs and MUEs in each DQN. Results show that multi-type agents perform better than single-type agents as the BS density increases. Khoshkbari et al. (2020) use the BBQN technique to improve the exploration process of the DQN model compared to two conventional RL exploration strategies: ε-greedy and noisy DQN. Also, the authors consider only the BS's own CSI, removing the overhead of obtaining the CSI of the neighbor cells. Their strategy implements a Bayesian NN (BNN) to obtain a posterior distribution over the action-value function, which leads the agent to increase the probability of the actions with higher uncertainty in their Q-functions, thus preventing random exploration and outperforming the conventional exploration strategies. Vishnoi et al. (2021) propose a centralized DRL to solve the power allocation problem to maximize the throughput of the system's D2D cellular pairs and Cellular Mobile Users (CMU). Specifically, they implement Proximal Policy Optimization (PPO), reported in (Schulman et al. 2017), to obtain faster convergence than the DDPG model. PPO aims to ensure low policy deviation during training by comparing the current and updated policies. Results show faster convergence than conventional centralized DRL methods, and higher throughput performance is achieved for different CMU and D2D densities. Li et al. (2020b) propose centralized and distributed DDPG schemes for power control, considering energy harvesting between SBSs. The centralized DDPG achieves higher performance than the distributed DDPG. However, the complexity of the centralized approach increases exponentially with the action-state space, which grows with the SBS density. Meanwhile, the distributed DDPG performs the decision-making with each SBS's own information. Results show that both DDPG schemes achieve higher throughput than DQN, greedy, and conservative approaches. Sande et al. (2021) address the QoS maximization problem for an integrated access and backhaul network. The authors propose a DQN to solve the power control considering Independent (IL) and Cooperative (CL) learning strategies. CL uses neighbors' experiences to help improve the learning process, where the nearest SBSs are identified by the Euclidean distance. The results show improvements compared to a baseline DQN in terms of congestion, average bit rate, and degree of satisfaction for different UE densities.
Anzaldo and Andrade (2021) implement a DQN for solving the power allocation problem. The proposal enhances the trained DQN performance with additional training in synthetic scenarios with fewer SBSs than the original training scenario, resulting in higher robustness in denser scenarios. Results show that the additional training enhances the DQN model performance and reduces the information required for decision-making compared with DQNs with larger input sizes. Liu and Zhang (2022) propose a power allocation method based on DDPG with a CNN as the function approximator to maximize the network throughput. Their proposal results in 39.7% faster convergence and up to 14.6% performance gain over DDPG with DNN and DQN with DNN/CNN. Furthermore, DDPG with CNN requires 200 times less CPU time than the WMMSE algorithm at the cost of slightly lower sum-rate performance in denser scenarios. The work in (Chen et al. 2022) addresses the resource allocation and band-switching problem, implementing a DRL to maximize the total UE data rate in an ultra-dense low-earth-orbit satellite network. First, they implement a DDPG algorithm for channel and power allocation considering the UE locations, satellite locations, and rain conditions. Then, a hierarchical algorithm for band switching is implemented, resulting in higher performance than the DDPG-only and random allocation methods. Suh et al. (2022) maximize the network throughput by allocating resources to different slices with a DQN-based network slicing technique. Each slice considers a different QoS, focused on enhanced Mobile Broadband (eMBB), Ultra-Reliable Low-Latency Communications (URLLC), and massive Machine-Type Communications (mMTC) services. To accelerate the learning of the DRL model, they implement an action elimination technique via parallel processing that removes the actions leading to decisions that do not meet the service requirements, resulting in higher-quality policies during training. At each interval, the action space is filtered by URLLC, followed by eMBB; then, the DQN selects the action based on the filtered action space. Results show improvements of up to 15% and 10% over a regression-tree-based allocation method and vanilla DQN-based models, respectively.
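The action elimination idea can be sketched as masking infeasible actions before the greedy selection; the feasibility mask is a hypothetical placeholder for the URLLC/eMBB service checks, not the filter from the cited work:

```python
# Sketch of action elimination: infeasible actions are masked out before argmax.
import torch

def masked_greedy_action(q_values, feasible):
    """q_values: (n_actions,) tensor; feasible: boolean mask of the same shape."""
    masked = q_values.masked_fill(~feasible, float("-inf"))  # eliminate actions
    return int(masked.argmax())                              # greedy over survivors
```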
The UEA and RB allocation in UDN with Energy-Harvesting (EH) BSs is considered by Do and Koo (2019). An Actor-Critic Deep Learning (ACDL) model is implemented to maximize the UE satisfaction regarding bandwidth requirements. The model consists of actor and critic networks. The critic network estimates the environment state values to calculate the Temporal-Difference (TD) error, and the actor network predicts the allocation policy. After each action, the BS reports the number of satisfied UEs and the UEs' battery energy levels to the central controller to update the environment states. Then, the two networks update their functions (i.e., the policy and value functions) from the TD error. Shorter convergence time is achieved compared to other learning and non-learning approaches. However, since the battery capacity is finite and the bandwidth is restricted, results show that as the number of UEs increases, the network performance decays and the performance gaps between the evaluated approaches vanish. On the other hand, Wang et al. (2019) apply the DQN technique to allocate RBs considering QoS indicators in UDN. Besides experience replay and a target network, the authors implement a prioritized sweeping scheme, reported in (Moore and Atkeson 1993), and a heuristic mechanism in the DQN architecture to accelerate the algorithm's convergence. The prioritized sweeping scheme assigns higher sampling priority to the states in the experience pool that have a higher probability of changing the Q-values of the network. The heuristic mechanism is an indicator function on the action space, added for optimal strategy selection, that helps identify whether the generated action is based on a traditional scheduling algorithm. Results show that the heuristic mechanism offers high performance in light and heavy traffic conditions compared to traditional scheduling algorithms like round-robin, proportional fair, and max C/I.
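A hedged sketch of such priority-based sampling, using the magnitude of past TD errors as priorities (an assumption for illustration, not the exact criterion of the cited works):

```python
# Sketch of prioritized sampling: experiences more likely to change the
# Q-values (larger priorities) are drawn more often from the experience pool.
import random

def sample_prioritized(buffer, priorities, batch_size):
    """buffer: list of experiences; priorities: matching list of floats > 0."""
    total = sum(priorities)
    weights = [p / total for p in priorities]        # sampling probabilities
    return random.choices(buffer, weights=weights, k=batch_size)
```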
Zhao et al. (2019) maximize the network utility applying the D3QN technique. The proposal combines a DDQN and a dueling architecture to tackle the over-optimistic Q-value estimation and achieve better policy evaluation. Furthermore, to reduce the complexity of the ample action space, a multi-agent DRL with cooperation through message passing between UEs is implemented to collect the global state information and the joint policies of all UEs. Results indicate that D3QN outperforms regular DQN at higher UE densities. Also, other algorithms like the genetic algorithm and maximum received signal power fail to find suitable strategies for meeting the UE requirements when the number of UEs and QoS levels increases, in contrast to the DRL approaches. Cheng et al. (2020) address the joint User Equipment Association and Resource Allocation by implementing a DQN model to maximize the network utility. The agent, or UE, can choose to associate among multiple Access Points (AP) but is limited to using one subcarrier per AP. In addition, the AP divides the transmit power equally among its attached UEs. The results show that the network throughput benefits from more AP connections and more subcarrier usage compared with the maximum reference signal received power method. Kim et al. (2022b) propose a centralized DDPG scheme for power control and base station on/off switching, whose state considers the ratio of UEs served and whose reward penalizes power consumption. However, the complexity of centralized approaches increases exponentially with the action-state space, which grows with the UE density. Results show that the DDPG scheme achieves higher utility than DQN and greedy approaches.
Table 6 Design features of DRL-based resource allocation in UDN.

Objective | Ref | Model | Agent | Action | State | Reward | Exploration
CC Minimization | Li et al. (2020a) | DDPG | SBS | UE's transmit power and weight coefficient of RA and task offloading | SINR and channel gain of each UE | Computing cost | Exploration noise
EE and SE Maximization | Liu et al. (2019c) | DuQN | MgNB | RB allocation to all SCs | Number and throughput of all SCs and the allocation of all RBs in the system | SE and EE | ε-greedy
EE and SE Maximization | Ye (2022) | DDQN | MBS | RB allocation to all UEs | RB allocation and throughput | SE and EE | ε-greedy
EC Minimization | Huang et al. (2021) | DDPG | SBS | Computation offload decision, RB allocation, uplink power, and computation RA for each UE served by the SBS | Channel gains, interference power, and task profiles of the UEs in the SBS's coverage | Function considering the overall energy consumption of local and edge computing, and the tasks' latency requirements | -
EE Maximization | Li et al. (2018b) | DDPG | Central controller | Power control policies of all SBSs | Energy harvested, battery power, traffic load, and throughput of all BSs | EE | Exploration noise
EE Maximization | Ding et al. (2020) | DQN | UE | UE association and uplink power control | UEs' association and power control on the same RB | UEs' energy efficiency on the same RB | ε-greedy
EE Maximization | Liu et al. (2019a) | DQN | Central AP | UE's UL transmission power | Data rate and transmission power of the system | EE | ε-greedy
EE Maximization | Liu et al. (2019b) | DQN | Central controller | UE's association and power allocation | Traffic volume, channel state, and transmission power | EE | ε-greedy
EE Maximization | Chen et al. (2021) | DQN | Central AP | UEs' link selection and UEs' transmit power | UEs' rate and total transmission power | Total data rate | ε-greedy
EE Maximization | Anzaldo and Andrade (2022) | DQN | SBS | Discrete power levels | Channel gain, transmission power, and EE of neighbor UEs | Function considering the SBS's and neighbors' EE | ε-greedy
EE Maximization | Zhao et al. (2021) | DDQN | SBS | Discrete power levels | UEs' interference ratio and SBSs' rewards | EE and throughput of the whole system | ε-greedy
Interference Mitigation | Xiao et al. (2019) | DQN | SBS | UEs' transmit power | SINR and estimated channel state of the former UE, and the estimated UE density | Function considering throughput, energy consumption, and inter-cell interference | ε-greedy
Joint SE, EE, and Fairness Maximization | Liao et al. (2019) | DQN | SBS | Subcarrier allocation and corresponding transmission power | UEA information and interference power | Maximizing the EE and minimizing the variance of throughput between UEs | Generated by the DNN-based optimization framework
SE Maximization | Nasir and Guo (2019) | DQN | SBS | Discrete power levels | Transmission power, throughput, and SINR of the agents and neighbors | Spectral efficiency of each link | ε-greedy
Throughput and EE Maximization | Li et al. (2021b) | DDPG | Central controller | UEA, UE power allocation, and BS transmit power | UEs' data rate, UEs' data packet transmission time, and UEs' transmission power | EE and sum rate | Exploration noise
Throughput Maximization | Su et al. (2020) | DQN | 1: FBS; 2: MBS | 1: FUEs' transmission power; 2: MUEs' transmission power | 1: Proximity of the FBS to the MBS and proximity of the FBS to the MUE; 2: Distance between the distant MUE and the MBS and distance from the closest FBS to the MBS | 1: Throughput considering FUE QoS; 2: Throughput considering MUE QoS | 1-2: ε-greedy
Throughput Maximization | Khoshkbari et al. (2020) | BBQN | SBS | Discrete power levels | CSI, transmission power, and data rate | Total capacity | BNN
Throughput Maximization | Vishnoi et al. (2021) | DDPG | SBS | Transmission power | Local channel gains and interferences | Throughput | Exploration noise
Throughput Maximization | Li et al. (2020b) | DDPG | 1: Central controller; 2: SBS | 1: Power allocation and energy transfer of each SBS; 2: Power allocation and energy transfer of its SBS | 1: UEs' SINR, battery level, and SBSs' harvested energy; 2: SINR of UEs belonging to the SBS, battery level, and harvested energy of the SBS | 1: Network sum-rate; 2: SBS sum-rate | Exploration noise
Throughput Maximization | Sande et al. (2021) | DQN | SBS | Power allocation, required throughput | Time-average number of packets, SBS load status | Function considering throughput and minimum required power | ε-greedy
Throughput Maximization | Anzaldo and Andrade (2021) | DQN | SBS | SBS transmission power | Channel gain, power transmission, reward | Function considering each SBS's throughput and the SBS neighbors' contribution | ε-greedy
Throughput Maximization | Liu and Zhang (2022) | DDPG | MBS | Power allocation matrix | CSI, cooperation, and power allocation matrices | Network sum-rate | Exploration noise
Throughput Maximization | Chen et al. (2022) | DDPG | Central controller | Subchannel and power allocation | UEs' locations, satellites' locations, and rain intensities | Total data rate of all UEs | Entropy
Throughput Maximization | Suh et al. (2022) | DQN | gNB | RB slice allocation | Minimum rate, slice allocation, and CSI of all UEs | Overall system throughput | ε-greedy
UE Satisfaction Maximization | Do and Koo (2019) | ACDL | Central controller | UEA variable and bandwidth allocation | Number of energy packets in the battery of the BS and system bandwidth | Ratio of the total allocated bandwidth to the total required bandwidth | ε-greedy
UE Satisfaction Maximization | Wang et al. (2019) | DQN | SBS | RBs allocated to each UE | QoS demand, QoS provision, and environmental parameters | UE satisfaction | ε-greedy with a heuristic mechanism
Utility Maximization | Zhao et al. (2019) | D3QN | UE | BS association and channel allocation | All UEs' QoS demand status | Function considering the difference between the achieved profit and the transmission cost | ε-greedy
Utility Maximization | Cheng et al. (2020) | DQN | UE | AP association and subcarrier allocation | UE's data rate requirement status | UE's sum rate | ε-greedy
Utility Maximization | Kim et al. (2022b) | DDPG | Central controller | Power control and on/off switching | Ratio of UEs served | Function considering throughput penalized by power consumption | Exploration noise