In this section, the screening process and the results of the data extraction are described. A brief analysis then leads to the discussion, open issues, and conclusion sections.
4.4 Machine Learning strategies implemented for solving the RA problem.
The works analyzed in this review apply ML strategies to solve the RA problem in ultra-dense networks. These techniques are divided into ANN-based, RL-based, and DRL-based models. In the following subsections, we describe the learning process of each model applied in each work. Figure 4 shows the overall process these algorithms follow to allocate network resources (RBs and transmission power) among UEs. The ML model decides its strategy by observing the network's environmental information, such as the UEs' QoS requirements, Channel State Information (CSI), number of UEs and Resource Blocks (RB), interference, or current resource usage. The ML models generally learn to map the observed information into resource allocation strategies. The network entities, such as the SBS, execute the resulting resource control strategies, which impact the environment and generate new observations. This interaction cycle is repeated for continuous control until an optimal decision-making policy is achieved. During this process, some ML models, such as the RL-based and DRL-based ones, may adjust their RA strategy online. In contrast, the ANN-based models require their RA strategy to be adjusted before deployment.
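The interaction cycle of Figure 4 can be sketched as the following loop, where `UdnEnv` and its observation, action, and reward signals are hypothetical placeholders for the network metrics listed above, not an implementation from any surveyed work:

```python
# Sketch of the observe-decide-act cycle of Figure 4 (hypothetical names).
class UdnEnv:
    """Placeholder for a network simulator exposing CSI, QoS, and interference."""
    def reset(self):
        return [0.0, 0.0, 0.0]             # initial observation vector

    def step(self, action):
        next_obs = [0.1, 0.2, 0.3]         # environment reacts to the RA decision
        reward = 1.0                       # e.g., achieved throughput or EE
        return next_obs, reward

def run(env, policy, steps=100):
    obs = env.reset()
    for _ in range(steps):
        action = policy(obs)               # map observation -> RA strategy (RB, power)
        obs, reward = env.step(action)     # SBS applies the action; new observation
```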
1) Artificial Neural Network-based models (ANN)
ANN algorithms learn to perform tasks without being programmed with task-specific rules. An NN consists of three layers: an input layer, a hidden layer, and an output layer. Training an NN from a given set of examples optimizes its weighted parameters to enhance the accuracy of the predicted (target output) values. The parameters are adjusted according to the error between the predicted and target values (i.e., reference values) (Chen et al. 2019). A Deep Neural Network (DNN) is an NN with multiple hidden layers between the input and output layers. The data flow from the input to the output layer creates a map of virtual neurons, and each connection is assigned a weight. The weights are adjusted when the neural network does not recognize a particular pattern.
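As an illustration of this training loop, the following sketch adjusts the weights of a small fully connected NN from the mean-squared error between predicted and target values; the layer sizes and synthetic data are assumptions for illustration only (PyTorch):

```python
# Minimal supervised NN training sketch: weights are adjusted according to the
# error between predicted and target values (synthetic data, illustrative sizes).
import torch
from torch import nn

model = nn.Sequential(                  # input layer -> one hidden layer -> output layer
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                  # mean-squared error criterion

x = torch.randn(256, 4)                 # synthetic inputs (e.g., interference features)
y = torch.randn(256, 1)                 # synthetic targets (e.g., SINR labels)

for epoch in range(100):
    pred = model(x)                     # forward pass: predicted values
    loss = loss_fn(pred, y)             # error between predicted and target values
    optimizer.zero_grad()
    loss.backward()                     # backpropagate the error
    optimizer.step()                    # adjust the weighted parameters
```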
Furthermore, different NN architectures may be designed for various purposes. For example, recurrent NNs such as the LSTM allow neuron connections from previous layers to solve sequence prediction problems. Meanwhile, architectures such as Convolutional Neural Networks (CNN) and Graph Neural Networks (GNN) are implemented to process data represented as images or graphs.
ANN-based models have been used in UDN to resolve issues related to Energy Efficiency (EE) (Zhang et al. 2019a), throughput (Cao et al. 2019; Zhang et al. 2022), and traffic control (Hossain and Muhammad 2020; Zhou et al. 2018). Table 4 shows the design features used by the ANN-based models, such as the optimizer, activation layer, and training strategy.
Cao et al. (2019) use an NN to solve the UE clustering and subchannel allocation problem. The authors implement modified Min k-Cut and conflict graph algorithms, while the NN extracts the inter-user interference relationship of each UE. The dataset consists of the potential interfering UEs of each RB (i.e., other UEs using the same Resource Block (RB) in the same transmission time interval), with the uplink Signal-to-Interference-plus-Noise Ratio (SINR) as the label. The NN parameters are modified until a minimum mean-squared error is achieved. The dataset used for training was collected from LTE-Sim, a simulation platform for LTE systems. Zhang et al. (2022) propose a GNN to extract node information and reduce the workload and data requirements of the UEA and power control problem.
Then, a training scheme is proposed by combining supervised and unsupervised learning. The model exploits the generalization ability gained during offline training, whereas its performance is enhanced during online training to suit real-time scenarios. Results show that the GNN technique achieves higher performance and faster convergence than fully connected NNs and CNNs. Moreover, their proposal outperforms traditional techniques, such as maximal achievable rate association with maximum power and maximal sum-utility association with maximum power. Zhang et al. (2019a) propose a centralized DNN to allocate transmission power to each UE's wireless link in UDN. The NN is trained with data obtained from an iterative gradient method; the data is normalized to a zero-mean normal distribution and regularized with L2 regularization. Besides, they introduce a distributed DNN to face the challenge of model size in UDN. The distributed DNN divides the fully connected DNN into several DNN models that are trained in parallel. All weights are collected by a parameter server (i.e., a central controller) that updates and redistributes them. This procedure reduces the training time and makes the system robust due to the different small-scale networks trained. Results show 97.0%-98.4% accuracy, nearly ten times less operation time, and a slight EE difference from the iterative gradient algorithm used for training. Hossain and Muhammad (2020) and Zhou et al. (2018) use the LSTM model to address traffic control issues; the LSTM uses currently available and past data to obtain output values. These works consider time division duplexing on their transmissions and aim to change the uplink/downlink ratio before congestion occurs. In addition, Hossain and Muhammad (2020) implement a tree-based deep model before the LSTM technique to reduce the parameters of the data gathered in the spatial domain from many UEs. Both works show a performance enhancement over methods that change the network RA policies only after congestion occurs.
2) Reinforcement Learning-based models (RL)
RL-based models can allocate resources dynamically according to the requirements. They allocate resources with the knowledge extracted from big data without the need for explicit mathematical models, learning from the interaction with their environment. An RL task consists of training an agent: by performing actions, the agent arrives at different scenarios known as states, and actions lead to rewards. The agent's sole purpose is to maximize its total reward. The output depends on the state given by the current input, and the next input depends on the previous output. Without a training dataset, the agent is bound to learn from its experience, selecting actions based on past experience in a trial-and-error fashion (i.e., exploration and exploitation). According to the formulated reward function, these actions are rewarded (positively or negatively), and their values are stored in a Q-table to influence the selection of future actions. The learning process consists of taking actions from the Q-table according to an exploration strategy, calculating the reward, and updating the Q-table with the Q-values corresponding to the action-state tuple. This learning process is repeated until an end criterion is met (e.g., a number of episodes). Table 5 shows the design features of the works that apply RL-based models to solve the RA problem in UDN. These research works were grouped according to the reward function into delay minimization (Elsayed et al. 2019), EE maximization (Zhang et al. 2019b; AlQerm and Shihada 2017; Kim et al. 2022a; Sharma and Kumar 2022), interference mitigation (including interference control, interference management, and Inter-Cell Interference Coordination (ICIC)) (Feki and Capdevielle 2011; AlSobhi and Aghvami 2019; Jiang et al. 2016; Zhang et al. 2018a; Li et al. 2018), throughput maximization (Elsayed and Erol-Kantarci 2018; Lu et al. 2016; Amiri et al. 2019; Chen et al. 2016; Amiri and Mehrpouyan 2018a; Lin et al. 2017; Amiri et al. 2018b; AlQerm and Shihada 2016; Li et al. 2021a; Iqbal et al. 2021; Iqbal et al. 2022), and utility maximization (Li et al. 2019).
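The tabular learning process described above can be sketched as follows, where `env` is a hypothetical environment exposing the states, actions, and rewards of the RA problem, and the hyperparameters are illustrative assumptions:

```python
# Minimal tabular Q-learning sketch: epsilon-greedy action selection, reward
# observation, and Bellman update of the Q-table, repeated over episodes.
import random

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    q = [[0.0] * env.n_actions for _ in range(env.n_states)]   # Q-table
    for _ in range(episodes):                                  # end criterion: episodes
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:                      # exploration
                a = random.randrange(env.n_actions)
            else:                                              # exploitation
                a = max(range(env.n_actions), key=lambda i: q[s][i])
            s_next, r, done = env.step(a)                      # act and observe reward
            # Update the Q-value of the (state, action) tuple
            q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])
            s = s_next
    return q
```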
Elsayed et al. (2019) address delay minimization in an LTE system. The evaluation scenario consists of an LTE network with mobile UEs and MicroGrid Devices (MGD). The goal is to reduce latency and increase fairness between UEs and MGDs. Each eNB/SBS allocates the RBs according to the reward function, which is adapted with a scalar weight that controls the priorities between traffic types on each device type, achieving a trade-off between UEs and MGDs. An analysis of the action space indicates that better RA actions can be learned at the cost of more delay.
Zhang et al. (2019b) propose an event-triggered QL approach to save computational resources and maximize EE. The Small-cell User Equipment (SUE) acts only when the difference between the agent's reward and the average reward of the SUEs using the same channel on the current and previous steps exceeds a threshold. The proposed event-triggered approach achieves better EE than classical QL. AlQerm and Shihada (2017) use an intuition learning scheme that lets Secondary Transmitters (ST) (i.e., Pico Base Station (PBS), Femto Base Station (FBS), and Device-to-Device (D2D)) infer the behavior of other STs with local information, making use of their interactions with the environment and their past experiences. A Q-value approximation is proposed to reduce the state space, which results in better performance than conventional algorithms and ensures the QoS of primary and secondary UEs. Kim et al. (2022a) propose a QL-based power transmission control to maximize the EE while minimizing the number of UE outages. In this scheme, the UEs consider only their past actions for decision-making, while the reward function considers the global network performance. Results show that the computational complexity is significantly reduced compared to centralized QL, where the Q-table increases exponentially with the number of agents and actions/states. Also, a higher convergence rate than a distributed scheme with independent rewards is achieved, tested in uniform and non-uniform UE spatial traffic distribution scenarios. Sharma and Kumar (2022) maximize EE considering the QoS of FUEs in a small-cell network. The proposed strategy consists of grouping FBSs through a K-means clustering algorithm that considers the FBS geographical locations and the traffic load at each cluster. Then, a QL algorithm allocates the RBs for each cluster head. The QL algorithm is trained cooperatively using the historical information of other agents stored in a cloud server. Results show higher EE, throughput, and convergence speed than independent QL and Stackelberg methods.
To address the interference mitigation problem, Feki and Capdevielle (2011) developed a MAB algorithm based on RL theory. Each cell aims to select the best band portion for ICIC. The MAB algorithm starts by sequentially accessing all spectrum bands from the spectrum pool and choosing the group of available channels that gives the highest reward. Then, a decisional function is proposed to select the next spectrum band to transmit. This function considers the mean reward of the spectrum band and the number of times the same band has been chosen for transmission.
Table 4 Design features of ANN-based resource allocation in UDN.

Objective | Ref | Model | Training | Optimizer | Activation layer
Energy Efficiency Maximization | Zhang et al. (2019a) | DNN | Dataset was generated by solving an iterative gradient algorithm; 15000 samples were generated for training and 1000 for validation | Adam | ReLU
Throughput Maximization | Cao et al. (2019) | NN | Dataset contains billions of samples generated from the LTE-Sim platform; each sample consists of the potential interfering UEs and the uplink SINR | - | -
Throughput Maximization | Zhang et al. (2022) | GNN | The model was first trained with several optimization solutions to achieve generalization ability, then fine-tuned with the current scenario data | GA | PA: Sigmoid; UEA: Softmax
Traffic Control | Hossain and Muhammad (2020) | DLSTM | Training was performed with different fixed uplink/downlink ratios; a deep model with a tree structure was used to enhance regularization and reduce the complexity of the LSTM model | SGD | DTM: ReLU; LSTM: Sigmoid and tanh
Traffic Control | Zhou et al. (2018) | LSTM | Training involves a data sequence of packets in the sending buffer at fixed uplink/downlink ratios; the network retrains the model at each ratio change | - | Sigmoid and tanh

ReLU: Rectified Linear Unit. GA: Gradient Ascent. PA: Power Allocation. UEA: User Equipment Association. SGD: Stochastic Gradient Descent. DTM: Deep Tree Model.
Also, the exploration parameters are tuned to allow choosing transmission bands with low reward values. The results show higher throughput, using the spectrum portions more efficiently than fixed reuse schemes. AlSobhi and Aghvami (2019) propose three variants of QL algorithms to perform power allocation: distributed, formulated, and cooperative. The distributed algorithm aims to enhance the Femtocell-User (FUE) capacity while the Macrocell-User (MUE) QoS is maintained. The formulated algorithm modifies its state by considering the MBS, MUE, and Femtocell Access Point (FAP) locations. Meanwhile, the cooperative QL reduces the computational complexity by letting experienced agents (i.e., agents whose algorithm has converged) share information with new agents in a similar state to accelerate their training convergence. Results show that the location of the MUE is a decisive factor in achieving the network QoS requirements.
Jiang et al. (2016) consider a dual-hop architecture where the access and self-backhaul networks share the same spectrum. The Hub Base Station (HBS) controls the file transmission to the SBS and from the SBS to the UE. To reduce the computational complexity of conventional QL, the authors consider a single-state QL, simplifying the action-state pairs to a stateless format. At initialization, the agents remove the high-interference channels from the spectrum pool. Then, the agents assign a Q-value to each action, which guides their decision-making. As a result, the link capacity of both networks (macro and micro) is increased, and the convergence time is reduced compared with the conventional cognitive radio approach. Zhang et al. (2018a) propose a QL algorithm with a Transfer Learning (TL) method to accelerate the learning speed in a Small Cell Network (SCN). New agents' Q-tables are updated with the Q-values of experienced agents with similar environments. Their reward is based on the UE density, SINR, and transmit power, focusing on reducing inter-cell interference and saving the energy consumption of BSs. Li et al. (2018) apply a conflict graph strategy for clustering and the QL technique for interference management. Agents allocate the transmission power over different RBs according to other agents' interference and the overall network SINR. Results show an enhancement in network throughput; however, peak throughput decreases at the expense of better edge throughput.
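As a hedged sketch of this Q-value transfer idea, a new agent's Q-table could be initialized as a weighted blend of an experienced agent's table; the blending weight `w` is an assumption for illustration, not a value from the cited works:

```python
# Sketch of Q-table transfer between agents in cooperative/transfer QL schemes.
def transfer_q_table(expert_q, own_q, w=0.8):
    """Blend an experienced agent's Q-values into a new agent's Q-table.

    expert_q, own_q: nested lists indexed as [state][action].
    w: blending weight toward the experienced agent (illustrative assumption).
    """
    return [
        [w * eq + (1.0 - w) * oq for eq, oq in zip(expert_row, own_row)]
        for expert_row, own_row in zip(expert_q, own_q)
    ]
```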
Elsayed and Erol-Kantarci (2018) consider the coexistence of Data-Intensive Devices (DID) and traditional UEs for the throughput maximization topic. DIDs correspond to emerging applications expected to be frequent in future networks, such as augmented reality, virtual reality, and tactile applications. The system consists of a multi-agent scenario, where the Evolved Node B (eNB) and the SBSs perform RB allocation to their attached SBSs and UEs, respectively. Furthermore, Resource Block Groups (RBG) consisting of contiguous RBs were considered to avoid the curse of dimensionality of considering all RB combinations. Results outperform the Proportional Fair (PF) algorithm concerning throughput, delay, and fairness for different DID densities. The work by Lu et al. (2016) avoids interference among SBSs by implementing an ICIC scheme consisting of adaptive Almost Blank Subframes (ABS) and QL for power control. The proposed ICIC scheme adapts the ABS ratio according to the load of the most interfering cell. However, instead of blanking the subframes, these cells, denominated aggressor cells, are allowed to transmit at low power, controlled by the QL algorithm. Results show that the dynamic power control of low-power ABS outperforms blanking ABS as the UE density increases. Amiri et al. (2019) consider a distributed QL for power control in a Self-Organizing Network (SON). As the system densifies, the authors consider two training schemes for new agents, i.e., Independent Learning (IL) and Cooperative Learning (CL). To solve the optimization problem, they design a reward function to fulfill the FUE and MUE QoS requirements. Results show that IL achieves a higher FBS sum rate and power consumption, while CL achieves a higher MBS sum rate and lower power consumption at the cost of signaling overhead. Moreover, Chen et al. (2016) consider distributed and centralized RA approaches, where the cluster heads allocate the RBs within the cluster and the MBS allocates the RBs of each cluster, respectively. Results show a better performance of the centralized approach, although its larger action space needs more time to converge than the distributed approach. Amiri and Mehrpouyan (2018a) propose a joint clustering method and transmission power allocation based on the QL algorithm. The BS chooses its transmission power according to its state, defined by zones separated by concentric circles from the cluster head. Besides, a reward function is proposed to satisfy QoS and provide fairness, which outperforms the benchmark reward function. In (Lin et al. 2017), the MBS coverage area is divided into two zones, the cell-edge and cell-center regions, defined as the low-interference and high-interference zones, respectively. The QL technique is implemented in the SBSs of the cell-edge region to allocate the transmission power to maximize the throughput while ensuring the MUE QoS. Amiri et al. (2018b) implement the QL algorithm to provide fairness across the whole network. The agents present two operation modes: individual learning and CL. With individual learning, the agents learn through interaction with the environment, while with CL, the agents learn from experienced agents. The algorithm's complexity can be reduced using the cooperative approach since agents share their experiences instead of discovering all the environment information themselves (i.e., making exploration actions).
In (AlQerm and Shihada 2016), agents of the same type (i.e., FBS or D2D) work cooperatively to share their state information. The agent can allocate RBs and power and adapt the transmission modulation. Also, two mechanisms were implemented for the QL algorithm. First, the exploration rate is progressively decreased so that exploration is high at the beginning. Second, the learning rate is modified to learn faster when losing and more slowly when winning (i.e., a Q-value comparison between consecutive actions). These modifications prevent the learning mechanism from depending only on the performance metrics. Results show better performance in throughput, SE, and fairness. Li et al. (2021a) implement a QL algorithm to control the transmission power. First, they introduce an analysis of network interference through graph theory. Then, the gathered information is used as part of the state. Results show that their proposal better describes the interference of the whole network, achieving higher throughput performance than baseline algorithms. Iqbal et al. (2021) and Iqbal et al. (2022) implement QL to maximize the throughput of MUEs and SUEs in a dense network. They propose an adaptive power control on the SBS, assuming SON functionalities for self-optimization. For validation, they consider different scenarios to mitigate cross-tier and co-tier interference. The reward function is designed to consider the minimum UE QoS requirements. Further, Iqbal et al. (2022) propose a cooperative QL, which consists of sharing the Q-table information of nearby agents during the learning process. The results show that the cooperation scheme achieves a better SUE data rate than independent learning in denser scenarios. Meanwhile, both QL schemes outperform state-of-the-art solutions tested in International Mobile Telecommunications (IMT) scenarios.
Li et al. (2019) address utility maximization. The reward function considers the EE and load balancing, with power allocation and UEA as actions. Load balancing improves the number of UEs associated with SBSs; therefore, better EE performance than conventional algorithms is obtained, showing continuous improvement as the network densifies.
Table 5 Design features of RL-based resource allocation in UDN.

Objective | Ref | Model | Agent | Action | State | Reward | Exploration
Delay Minimization | Elsayed et al. (2019) | QL | eNB/SBS | Set of RBs to their UEs | SBS/UE channel state information | UE delay considering a trade-off between MGDs and UEs | ε-greedy
Energy Efficiency (EE) Maximization | Zhang et al. (2019b) | QL | SUE | Subchannel and power allocation | Channel occupied and allocated power of the SUE | EE considering the UE SINR threshold | Boltzmann probability distribution
EE Maximization | AlQerm and Shihada (2017) | QL | ST | Transmission power | ST ID and transmission power | EE | Boltzmann probability distribution
EE Maximization | Kim et al. (2022a) | QL | SBS | Transmit power steps {up, down, keep} | Maximum and minimum transmission power and step power size of the SBS | EE and the number of UE outages | ε-greedy
EE Maximization | Sharma and Kumar (2022) | QL | Cluster head FBS | RB allocation and transmission power | User association relationship, SINR, required data rate, and total delay | EE considering the UE's QoS | Boltzmann probability distribution
Interference Mitigation | Feki and Capdevielle (2011) | MAB | PBS | Bandwidth portion | Cell and sub-band indexes | Mean instantaneous throughput | Decisional function
Interference Mitigation | AlSobhi and Aghvami (2019) | QL | 1-3: FAP | 1: Transmission power; 2-3: Same as 1 | 1: Capacity of the MUE; 2: Location of MBS and MUE to the FAP; 3: Same as 2 | 1: Favors the FUE based on the location and QoS of the MUE; 2: Prioritizes the MUE QoS and sum capacity of the FUE; 3: Same as 2 | 1-3: ε-greedy
Interference Mitigation | Jiang et al. (2016) | QL | SBS | Channel selection | - | Link capacity | ε-greedy
Interference Mitigation | Zhang et al. (2018a) | QL | SBS | Transmission power | UEs' density and UEs' previous SINR | UE density, SINR, and transmission power | ε-greedy
Interference Mitigation | Li et al. (2018) | QL | SBS | Transmission power allocated to each RB | Maximum interference and agent's SINR | Throughput and interference | -
Throughput Maximization | Elsayed and Erol-Kantarci (2018) | QL | eNB and SBS | RBG allocation | CQI and recent packet rate sent | Throughput of DIDs and regular UEs | ε-greedy
Throughput Maximization | Lu et al. (2016) | QL | SBS | Transmission power | UEs' sum rate in the aggressor cell | SBS throughput | ε-greedy
Throughput Maximization | Amiri et al. (2019) | QL | FBS | Transmission power level | Performance of FUE and MUE, MUE's interference from FBS, and FBS's distance to the MBS | Transmission rate considering FUE's and MUE's minimum rate requirements | ε-greedy
Throughput Maximization | Chen et al. (2016) | QL | 1: Cluster head; 2: MBS | 1: RB allocation within the cluster; 2: RB allocation of each cluster | 1: QoS of the macro cell; 2: Same as 1 | 1: Capacity of the RB for a cluster; 2: Average capacity of all clusters | ε-greedy
Throughput Maximization | Amiri and Mehrpouyan (2018a) | QL | SBS | Transmission power | Concentric circles measuring the UE's distance from the cluster head | Throughput considering UE QoS | ε-greedy
Throughput Maximization | Lin et al. (2017) | QL | SBS | RB power levels | Victim MUE's target SINR in the subchannel and neighbor transmission power | Throughput prioritizing MUE QoS | Boltzmann probability distribution
Throughput Maximization | Amiri et al. (2018b) | QL | FBS | Transmission power | Neighborhood states based on the distance of the FBS to the MBS and MUE | Capacity for FUE while satisfying both FUE and MUE QoS | ε-greedy
Throughput Maximization | AlQerm and Shihada (2016) | QL | FBS or D2D transmitter | RB, transmission power of the underlay transmitters in the center and edge band, and modulation level | Underlay tier transmitter, available RBs, and SINR measured in the central band, in the edge band, and from neighboring agents | MUE and FUE data rate considering the spectral efficiency achieved at the underlay receiver | ε-greedy
Throughput Maximization | Li et al. (2021a) | QL | SBS | Power allocation for each RB | Agent's maximum interference and cluster interference | SBS throughput and unserved UEs | ε-greedy
Throughput Maximization | Iqbal et al. (2021) | QL | SBS | Transmission power | Distance between the SBS and MBS, and between the SBS and MUE | UE capacity considering the minimum SINR of MUE and SUE | ε-greedy
Throughput Maximization | Iqbal et al. (2022) | QL | SBS | Transmission power | Radial distance between the SBS and MBS, and between the SBS and MUE | UE capacity considering the minimum SINR of MUE and SUE | ε-greedy
Utility Maximization | Li et al. (2019) | QL | UE | Power allocation and UE association | Received SINR, association state, and agent's power level | Function considering UEA, energy efficiency, and QoS | Boltzmann probability distribution

ST: Secondary Transmitters (i.e., PBS, FBS, and D2D). RBG: Resource Block Group. CQI: Channel Quality Indicator. D2D: Device-to-Device.
3) Deep Reinforcement Learning-based models (DRL)
DRL-based algorithms have the same structure as RL-based models. Deep RL incorporates deep learning into the solution, allowing agents to make decisions from unstructured input data without manual engineering of the state space. In DRL-based models, the Q-value is approximated through an NN (Luong et al. 2019). The NN is necessary because the performance of RL-based algorithms degrades when the action/state space increases, making it challenging to find optimal policies (i.e., policies whose actions obtain the maximum long-term benefit). In addition, since consecutive data are correlated in time, DRL algorithms require a replay memory mechanism and a secondary network (i.e., target network) for stability purposes. The replay memory stores experiences, from which random samples (i.e., mini-batches) are extracted to train the NN; the new experiences obtained are stored back in the replay memory. The target network updates its parameters by copying the parameters of the NN (called the value network). Meanwhile, the value network trains itself using the target network's outputs as labels and the mini-batches as training sets. Table 6 shows the design features of the works that use DRL-based models to solve the resource allocation problems in UDN. We grouped them according to the following objectives: Computational Cost (CC) minimization (Li et al. 2020a), EE and SE maximization (Liu et al. 2019c; Ye 2022), Energy Consumption (EC) minimization (Huang et al. 2021), EE maximization (Li et al. 2018b; Ding et al. 2020; Liu et al. 2019a; Liu et al. 2019b; Chen et al. 2021; Anzaldo and Andrade 2022; Zhao et al. 2021), interference mitigation (Xiao et al. 2019), joint SE, EE, and fairness maximization (Liao et al. 2019), SE maximization (Nasir and Guo 2019), throughput and EE maximization (Li et al. 2021b), throughput maximization (Su et al. 2020; Khoshkbari et al. 2020; Vishnoi et al. 2021; Li et al. 2020b; Sande et al. 2021; Anzaldo and Andrade 2021; Liu and Zhang 2022; Chen et al. 2022; Suh et al. 2022), UE satisfaction maximization (Do and Koo 2019; Wang et al. 2019), and utility maximization (Zhao et al. 2019; Cheng et al. 2020; Kim et al. 2022b).
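A minimal sketch of this machinery, assuming small illustrative network sizes and experiences stored as `(state, action, reward, next_state)` tensors, could look as follows (PyTorch); it is not the implementation of any specific surveyed work:

```python
# Minimal DQN stabilization sketch: replay memory, random mini-batches, and a
# target network periodically copied from the value network.
import random
from collections import deque
import torch
from torch import nn

value_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
target_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
target_net.load_state_dict(value_net.state_dict())    # target starts as a copy
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                          # replay memory of experiences

def train_step(gamma=0.99, batch_size=32):
    # Random sampling breaks the temporal correlation of consecutive data.
    batch = random.sample(replay, batch_size)
    s, a, r, s2 = map(torch.stack, zip(*batch))        # actions stored as int64 tensors
    with torch.no_grad():                              # target network provides the labels
        y = r + gamma * target_net(s2).max(dim=1).values
    q = value_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # Periodic copy of the value-network parameters into the target network.
    target_net.load_state_dict(value_net.state_dict())
```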
Li et al. (2020a) address CC minimization for a Non-Orthogonal Multiple Access (NOMA) Mobile Edge Computing (MEC) system. To address the problem, the authors propose a jointly iterative optimization algorithm of User Cluster Matching (UCM) and Mean-Field DDPG (MF-DDPG). First, UEs are clustered with the UCM optimization algorithm based on their channel gains. Then, the resource allocation (i.e., transmission power and task offloading) problem is modeled as a mean-field game and solved by the MF-DDPG algorithm. Finally, the solution is obtained by iterating between these algorithms. The results were compared against Orthogonal Multiple Access (OMA) and DQN, demonstrating the system's reduction in energy consumption and task delay. Also, the convergence speed is improved over conventional DQN.
The SE and EE maximization is addressed in (Liu et al. 2019c; Ye 2022). Liu et al. (2019c) implement a DuQN at the MBS for RB allocation. The DuQN modifies the DQN architecture, separating the final layer into two streams that estimate how good it is for an agent to be in the current state and the advantage of selecting each action in that state (Wang et al. 2016). Results show that the DuQN achieves higher EE and SE performance and faster training convergence than the QL and DQN algorithms. Ye (2022) implements a double DQN (DDQN) for RB allocation. Unlike traditional DQN, which uses the maximum Q-values both to select and to evaluate the action, the DDQN introduced by Van Hasselt et al. (2016) decouples the selection and the evaluation into two Q-value functions, preventing the overestimation that results from using the maximum values. Once the model is trained, a pruning algorithm reduces the size of the NN by removing redundant connections between the fully connected layers. Pruning reduces the model's complexity, decreasing inference time with negligible performance drops.
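The difference between the two targets can be sketched in a few lines; the tensors and discount factor are illustrative assumptions:

```python
# Vanilla DQN target vs. the decoupled DDQN target (Van Hasselt et al. 2016).
import torch

def dqn_target(r, s2, target_net, gamma=0.99):
    # Vanilla DQN: the same max both selects and evaluates the next action,
    # which tends to overestimate Q-values.
    return r + gamma * target_net(s2).max(dim=1).values

def ddqn_target(r, s2, value_net, target_net, gamma=0.99):
    # DDQN: the value network selects the action, the target network evaluates it.
    best_a = value_net(s2).argmax(dim=1, keepdim=True)              # selection
    return r + gamma * target_net(s2).gather(1, best_a).squeeze(1)  # evaluation
```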
Huang et al. (2021) implement two DRL training schemes for EC minimization while satisfying task latency requirements through computation offloading and resource allocation (channel allocation, uplink power control, and computation RA for each UE served by the SBS). Both schemes consider multi-agent DRL. In federated learning, each agent trains its model with local information without exchanging it; a new model is then aggregated from all the agents' parameters and sent back to all the agents, avoiding sending local information to other agents. Meanwhile, in the centralized training scheme, the agents send their experiences (i.e., local information) to a centralized controller that trains each agent's updated model. Results show lower energy consumption and guaranteed latency requirements compared with non-cooperative DRL (i.e., independent learning).
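The federated aggregation step can be sketched as a simple parameter average; equal weighting across agents is an assumption here, not a detail from the cited work:

```python
# Sketch of federated averaging: only model parameters are aggregated,
# never the agents' local data.
import torch

def federated_average(state_dicts):
    """Average the parameters of several locally trained models (state_dicts)."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

# Usage: each agent loads the averaged model and continues local training, e.g.,
# new_global = federated_average([agent.model.state_dict() for agent in agents])
```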
In (Li et al. 2018b), a DDPG for power control in an energy-harvesting UDN is reported. Two NNs are implemented as the critic and actor networks. The critic network approximates the action-value function, updating its weights using the Temporal-Difference (TD) error. The actor network updates the power allocation policy by updating its weights using a sampled policy gradient. A target network is implemented for both the critic and actor networks to ensure stability. The DDPG can handle a continuous action space (i.e., continuous power control rates), allowing more exploration. Consequently, it shows high instability at the beginning (due to the larger action space) but better EE when converging, compared to QL and DQN. Ding et al. (2020) use the DQN algorithm to perform UEA and control the uplink transmission power to maximize the EE. The system consists of a two-tier network; the UEs are regarded as agents for the decision-making process. The reward considers the EE of all the agents sharing the same RB, which makes each agent aware of the impact of its decisions. The results show less convergence time and better performance than QL. Moreover, the proposed algorithm shows consistent results and higher performance over different UE and BS densities for solving the joint optimization problem. Uplink power control is also addressed for a NOMA UDN by Liu et al. (2019a). To reduce the interference at the BS due to multiple UEs transmitting on the same subcarrier, a centralized controller applies UE pairing based on their channel gain difference and implements a DQN algorithm for power control. Results show that the proposed DQN outperforms QL concerning convergence across different UE and BS densities. Liu et al. (2019b) implement a DQN for UEA and power allocation. They implement a water-filling algorithm for power allocation at the beginning of the training to avoid performance drops due to random allocation. The results show that the DQN performs better than QL after a few iterations. Also, the DQN shows consistency in terms of EE when density increases, while the performance gap between DQN and QL widens. Chen et al. (2021) address the RA problem in a UAV-assisted UDN, where the UAV acts as an auxiliary BS. The RA problem is divided into UE link selection and power control and solved by a DQN model designed to maximize the EE. The results outperform QL and heuristic schemes (i.e., random and maximum power allocation) regarding EE, throughput, and power consumption. Anzaldo and Andrade (2022) propose a conventional DQN for power control with a knowledge transfer scheme that reuses the experiences of other trained models (i.e., models trained with fewer agents) together with the current model's experiences during the learning process. Results show a higher convergence rate using a diverse set of experiences from lower-complexity models than conventional DQN training. Zhao et al. (2021) propose a DDQN model for uplink power control to maximize EE in an ultra-dense femtocell network. The model is trained with the centralized-training, distributed-execution scheme. Furthermore, an interference identification algorithm models the user-level relationship to obtain accurate environmental information used in the model's state. Results show higher EE and lower complexity than the Fractional Programming (FP) algorithm reported by Shen and Yu (2018) and the Successive Pseudo-Convex Optimization (SPCO) algorithm in (Zhang and Mao 2020).
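A compact sketch of the two DDPG updates (critic from the TD error, actor from the sampled policy gradient), under the assumption that `critic` is a module taking a state-action pair and that target copies of both networks exist:

```python
# Sketch of one DDPG update step (PyTorch); names and shapes are illustrative.
import torch
from torch import nn

def ddpg_update(actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, batch, gamma=0.99):
    s, a, r, s2 = batch                                    # mini-batch tensors
    with torch.no_grad():                                  # TD target from target networks
        y = r + gamma * critic_t(s2, actor_t(s2)).squeeze(1)
    td_loss = nn.functional.mse_loss(critic(s, a).squeeze(1), y)
    critic_opt.zero_grad()
    td_loss.backward()                                     # critic minimizes the TD error
    critic_opt.step()

    policy_loss = -critic(s, actor(s)).mean()              # sampled policy gradient
    actor_opt.zero_grad()
    policy_loss.backward()                                 # actor improves the policy
    actor_opt.step()
```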
Xiao et al. (2019) implement a strategy to control the interference in an ultra-dense small cell network. It allocates the transmission power to UEs considering the SINR and the UE density. The agents' initialization uses experienced agents' information to accelerate the learning stage. Also, it applies a convolutional NN to estimate the Q-values and compress the agents' state space. This procedure improves power consumption and increases the throughput compared to RL and data-driven algorithms.
The joint SE, EE, and fairness maximization is addressed in (Liao et al. 2019). The authors solve the multi-objective problem in two stages. First, SE maximization is used to build the DNN. Then, the EE and fairness are considered in the DRL framework's reward function. The proposed algorithm obtains the RB and power allocation decisions based on limited CSI. Random and model-based initialization were evaluated, resulting in fewer iterations for convergence when the initialization is based on network knowledge. The proposed algorithm outperforms a benchmark algorithm that requires full CSI knowledge under different conditions of EE, SE, and fairness, and different numbers of UEs and channels.
On the other hand, Nasir and Guo (2019) maximize the SE through the DQN algorithm. The authors train the DQN weight parameters in a centralized manner, gathering experiences from different local agents. The experiences consist of several neighbors' information in the agents' states, which comprise local information, interfering neighbors, and interfered neighbors. In the considered framework, all network transmitters send their experiences to a central controller for training. After training is finished, the weight parameters are sent back to the transmitters, which execute the DQN with their local states as inputs. Therefore, the memory and computational resources required at the transmitter are reduced. The algorithm shows a faster power allocation process than centralized algorithms like the Weighted Minimum Mean Square Error (WMMSE) algorithm (Shi et al. 2011) and the iterative algorithm based on FP, which require instant and accurate measurements of all channel gains.
Li et al. (2021b) solve the User Association (UEA) and Power Allocation (PA) problems by applying a DDPG model to jointly maximize the throughput and EE. Further, they implement an additional layer that post-processes the output of the DDPG model, so the joint action is composed of discrete and continuous values for UEA and PA, respectively. Their proposal was evaluated in OMA- and NOMA-based HetNets.
For maximizing the UE throughput in UDN, Su et al. (2020) implement two DQNs, one at the FBS and one at the MBS. In addition, the distance between the agents (FBS or MBS) and the UEs is considered in the DQN states to perform the transmission power allocation. The reward function considers the QoS of the FUEs and MUEs in each DQN. Results show that multi-type agents perform better than single-type agents as the BS density increases. Khoshkbari et al. (2020) use the BBQN technique to improve the exploration process of the DQN model compared to two conventional RL exploration strategies: ε-greedy and noisy DQN. Also, the authors consider only the BS's own CSI, removing the overhead of obtaining the CSI of the neighbor cells. Their strategy implements a Bayesian NN (BNN) to obtain a posterior distribution over the action-value function, which leads the agent to increase the probability of the actions with higher uncertainty in their Q-functions, thus preventing random exploration and outperforming the conventional exploration strategies. Vishnoi et al. (2021) propose a centralized DRL to solve the power allocation problem to maximize the throughput of the system's D2D cellular pairs and Cellular Mobile Users (CMU). Specifically, they implement Proximal Policy Optimization (PPO), reported in (Schulman et al. 2017), to obtain faster convergence than the DDPG model. PPO aims to ensure low policy deviation during training by comparing the current and updated policies. Results show faster convergence than conventional centralized DRL methods, and higher throughput performance is achieved for different CMU and D2D densities. Li et al. (2020b) propose centralized and distributed DDPG schemes for power control, considering energy harvesting between SBSs. The centralized DDPG achieves higher performance than the distributed DDPG. However, the complexity of the centralized approach increases exponentially with the action-state space, which grows with the SBS density. Meanwhile, the distributed DDPG performs the decision-making with each SBS's own information. Results show that both DDPG schemes achieve higher throughput than DQN, greedy, and conservative approaches. Sande et al. (2021) address the QoS maximization problem for an integrated access and backhaul network. The authors propose a DQN to solve the power control considering Independent (IL) and Cooperative (CL) learning strategies. CL uses neighbors' experiences to help improve the learning process, where the nearest SBSs are identified by the Euclidean distance. The results show improvements compared to a baseline DQN in terms of congestion, average bit rate, and degree of satisfaction for different UE densities.
Anzaldo and Andrade (2021) implement a DQN for solving the power allocation problem. The proposal enhances the trained DQN performance with additional training in synthetic scenarios with fewer SBSs than the original training scenario, resulting in higher robustness in denser scenarios. Results show that the additional training enhances the DQN model performance and reduces the information required for decision-making compared with DQNs with larger input sizes. Liu and Zhang (2022) propose a power allocation method based on DDPG with a CNN as the function approximator to maximize the network throughput. Their proposal results in 39.7% faster convergence and up to 14.6% performance gain over DDPG with DNN and DQN with DNN/CNN. Furthermore, DDPG with CNN requires 200 times less CPU time than the WMMSE algorithm at the cost of slightly lower sum-rate performance in denser scenarios. The work in (Chen et al. 2022) addresses the resource allocation and band-switching problem, implementing a DRL to maximize the total UE data rate in an ultra-dense low-earth-orbit satellite network. First, they implement a DDPG algorithm for channel and power allocation considering the UE locations, satellite locations, and rain conditions. Then, a hierarchical algorithm for band switching is implemented, resulting in higher performance than the DDPG-only and random allocation methods. Suh et al. (2022) maximize the network throughput by allocating resources to different slices with a DQN-based network slicing technique. Each slice considers a different QoS, focused on enhanced Mobile Broadband (eMBB), Ultra-Reliable Low-Latency Communications (URLLC), and massive Machine-Type Communications (mMTC) services. To accelerate the learning of the DRL model, they implement an action elimination technique via parallel processing that removes the actions leading to decisions that do not meet the service requirements, resulting in higher-quality policies during training. At each interval, the action space is filtered by URLLC, followed by eMBB; then, the DQN selects the action based on the filtered action space. Results show improvements of up to 15% and 10% over a regression-tree-based allocation method and vanilla DQN-based models, respectively.
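The action elimination idea can be sketched as masking infeasible actions before the greedy selection; the feasibility mask is a hypothetical placeholder for the URLLC/eMBB service checks, not the filter from the cited work:

```python
# Sketch of action elimination: infeasible actions are masked out before argmax.
import torch

def masked_greedy_action(q_values, feasible):
    """q_values: (n_actions,) tensor; feasible: boolean mask of the same shape."""
    masked = q_values.masked_fill(~feasible, float("-inf"))  # eliminate actions
    return int(masked.argmax())                              # greedy over survivors
```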
The UEA and RB allocation in UDN with Energy-Harvesting (EH) BSs is considered by Do and Koo (2019). An Actor-Critic Deep Learning (ACDL) model is implemented to maximize the UE satisfaction regarding bandwidth requirements. The model consists of actor and critic networks. The critic network estimates the environment state values to calculate the Temporal-Difference (TD) error, and the actor network predicts the allocation policy. After each action, the BS reports the number of satisfied UEs and the UEs' battery energy levels to the central controller to update the environment states. Then, the two networks update their functions (i.e., the policy and value functions) from the TD error. Shorter convergence time is achieved compared to other learning and non-learning approaches. However, since the battery capacity is finite and the bandwidth is restricted, results show that as the number of UEs increases, the network performance decays and the performance gaps between the evaluated approaches vanish. On the other hand, Wang et al. (2019) apply the DQN technique to allocate RBs considering QoS indicators in UDN. Besides experience replay and a target network, the authors implement a prioritized sweeping scheme, reported in (Moore and Atkeson 1993), and a heuristic mechanism in the DQN architecture to accelerate the algorithm's convergence. The prioritized sweeping scheme assigns higher sampling priority to the states in the experience pool that have a higher probability of changing the Q-values of the network. The heuristic mechanism is an indicator function on the action space, added for optimal strategy selection, that helps identify whether the generated action is based on a traditional scheduling algorithm. Results show that the heuristic mechanism offers high performance in light and heavy traffic conditions compared to traditional scheduling algorithms like round-robin, proportional fair, and max C/I.
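A hedged sketch of such priority-based sampling, using the magnitude of past TD errors as priorities (an assumption for illustration, not the exact criterion of the cited works):

```python
# Sketch of prioritized sampling: experiences more likely to change the
# Q-values (larger priorities) are drawn more often from the experience pool.
import random

def sample_prioritized(buffer, priorities, batch_size):
    """buffer: list of experiences; priorities: matching list of floats > 0."""
    total = sum(priorities)
    weights = [p / total for p in priorities]        # sampling probabilities
    return random.choices(buffer, weights=weights, k=batch_size)
```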
Zhao et al. (2019) maximize the network utility applying the D3QN technique. The proposal combines a DDQN and a dueling architecture to tackle the over-optimistic Q-value estimation and achieve better policy evaluation. Furthermore, to reduce the complexity of the ample action space, a multi-agent DRL with cooperation through message passing between UEs is implemented to collect the global state information and the joint policies of all UEs. Results indicate that D3QN outperforms regular DQN at higher UE densities. Also, other algorithms like the genetic algorithm and maximum received signal power fail to find suitable strategies for meeting the UE requirements when the number of UEs and QoS levels increases, in contrast to the DRL approaches. Cheng et al. (2020) address the joint User Equipment Association and Resource Allocation by implementing a DQN model to maximize the network utility. The agent, or UE, can choose to associate among multiple Access Points (AP) but is limited to using one subcarrier per AP. In addition, the AP divides the transmit power equally among its attached UEs. The results show that the network throughput benefits from more AP connections and more subcarrier usage compared with the maximum reference signal received power method. Kim et al. (2022b) propose a centralized DDPG scheme for power control and base station on/off switching, whose state considers the ratio of UEs served and whose reward penalizes power consumption. However, the complexity of centralized approaches increases exponentially with the action-state space, which grows with the UE density. Results show that the DDPG scheme achieves higher utility than DQN and greedy approaches.
Table 6 Design features of DRL-based resource allocation in UDN.

Objective | Ref | Model | Agent | Action | State | Reward | Exploration
CC Minimization | Li et al. (2020a) | DDPG | SBS | UE's transmit power and weight coefficient of RA and task offloading | SINR and channel gain of each UE | Computing cost | Exploration noise
EE and SE Maximization | Liu et al. (2019c) | DuQN | MgNB | RB allocation to all SCs | Number and throughput of all SCs and the allocation of all RBs in the system | SE and EE | ε-greedy
EE and SE Maximization | Ye (2022) | DDQN | MBS | RB allocation to all UEs | RB allocation and throughput | SE and EE | ε-greedy
EC Minimization | Huang et al. (2021) | DDPG | SBS | Computation offload decision, RB allocation, uplink power, and computation RA for each UE served by the SBS | Channel gains, interference power, and task profiles of the UEs in the SBS's coverage | Function considering the overall energy consumption of local and edge computing, and the tasks' latency requirements | -
EE Maximization | Li et al. (2018b) | DDPG | Central controller | Power control policies of all SBSs | Energy harvested, battery power, traffic load, and throughput of all BSs | EE | Exploration noise
EE Maximization | Ding et al. (2020) | DQN | UE | UE association and uplink power control | UEs' association and power control on the same RB | UEs' energy efficiency on the same RB | ε-greedy
EE Maximization | Liu et al. (2019a) | DQN | Central AP | UE's UL transmission power | Data rate and transmission power of the system | EE | ε-greedy
EE Maximization | Liu et al. (2019b) | DQN | Central controller | UE's association and power allocation | Traffic volume, channel state, and transmission power | EE | ε-greedy
EE Maximization | Chen et al. (2021) | DQN | Central AP | UEs' link selection and UEs' transmit power | UEs' rate and total transmission power | Total data rate | ε-greedy
EE Maximization | Anzaldo and Andrade (2022) | DQN | SBS | Discrete power levels | Channel gain, transmission power, and EE of neighbor UEs | Function considering the SBS's and neighbors' EE | ε-greedy
EE Maximization | Zhao et al. (2021) | DDQN | SBS | Discrete power levels | UEs' interference ratio and SBSs' rewards | EE and throughput of the whole system | ε-greedy
Interference Mitigation | Xiao et al. (2019) | DQN | SBS | UEs' transmit power | SINR and estimated channel state of the former UE, and the estimated UE density | Function considering throughput, energy consumption, and inter-cell interference | ε-greedy
Joint SE, EE, and Fairness Maximization | Liao et al. (2019) | DQN | SBS | Subcarrier allocation and corresponding transmission power | UEA information and interference power | Maximizing the EE and minimizing the variance of throughput between UEs | Generated by the DNN-based optimization framework
SE Maximization | Nasir and Guo (2019) | DQN | SBS | Discrete power levels | Transmission power, throughput, and SINR of the agents and neighbors | Spectral efficiency of each link | ε-greedy
Throughput and EE Maximization | Li et al. (2021b) | DDPG | Central controller | UEA, UE power allocation, and BS transmit power | UEs' data rate, UEs' data packet transmission time, and UEs' transmission power | EE and sum rate | Exploration noise
Throughput Maximization | Su et al. (2020) | DQN | 1: FBS; 2: MBS | 1: FUEs' transmission power; 2: MUEs' transmission power | 1: Proximity of the FBS to the MBS and proximity of the FBS to the MUE; 2: Distance between the distant MUE and the MBS and distance from the closest FBS to the MBS | 1: Throughput considering FUE QoS; 2: Throughput considering MUE QoS | 1-2: ε-greedy
Throughput Maximization | Khoshkbari et al. (2020) | BBQN | SBS | Discrete power levels | CSI, transmission power, and data rate | Total capacity | BNN
Throughput Maximization | Vishnoi et al. (2021) | DDPG | SBS | Transmission power | Local channel gains and interferences | Throughput | Exploration noise
Throughput Maximization | Li et al. (2020b) | DDPG | 1: Central controller; 2: SBS | 1: Power allocation and energy transfer of each SBS; 2: Power allocation and energy transfer of its SBS | 1: UEs' SINR, battery level, and SBSs' harvested energy; 2: SINR of UEs belonging to the SBS, battery level, and harvested energy of the SBS | 1: Network sum-rate; 2: SBS sum-rate | Exploration noise
Throughput Maximization | Sande et al. (2021) | DQN | SBS | Power allocation, required throughput | Time-average number of packets, SBS load status | Function considering throughput and minimum required power | ε-greedy
Throughput Maximization | Anzaldo and Andrade (2021) | DQN | SBS | SBS transmission power | Channel gain, power transmission, reward | Function considering each SBS's throughput and the SBS neighbors' contribution | ε-greedy
Throughput Maximization | Liu and Zhang (2022) | DDPG | MBS | Power allocation matrix | CSI, cooperation, and power allocation matrices | Network sum-rate | Exploration noise
Throughput Maximization | Chen et al. (2022) | DDPG | Central controller | Subchannel and power allocation | UEs' locations, satellites' locations, and rain intensities | Total data rate of all UEs | Entropy
Throughput Maximization | Suh et al. (2022) | DQN | gNB | RB slice allocation | Minimum rate, slice allocation, and CSI of all UEs | Overall system throughput | ε-greedy
UE Satisfaction Maximization | Do and Koo (2019) | ACDL | Central controller | UEA variable and bandwidth allocation | Number of energy packets in the battery of the BS and system bandwidth | Ratio of the total allocated bandwidth to the total required bandwidth | ε-greedy
UE Satisfaction Maximization | Wang et al. (2019) | DQN | SBS | RBs allocated to each UE | QoS demand, QoS provision, and environmental parameters | UE satisfaction | ε-greedy with a heuristic mechanism
Utility Maximization | Zhao et al. (2019) | D3QN | UE | BS association and channel allocation | All UEs' QoS demand status | Function considering the difference between the achieved profit and the transmission cost | ε-greedy
Utility Maximization | Cheng et al. (2020) | DQN | UE | AP association and subcarrier allocation | UE's data rate requirement status | UE's sum rate | ε-greedy
Utility Maximization | Kim et al. (2022b) | DDPG | Central controller | Power control and on/off switching | Ratio of UEs served | Function considering throughput penalized by power consumption | Exploration noise