In this section, we first provide an overview of cache replacement algorithms that have been proposed for improving Hadoop performance. We then discuss different intelligent caching mechanisms that use machine learning methods.
3.1 Cache replacement policies in Hadoop
In this subsection, we investigate different cache replacement strategies in Hadoop, along with their advantages and disadvantages.
LIFE and LFU-F are two replacement strategies used in PacMan [8], an in-memory coordinated caching mechanism for data-intensive parallel jobs. In PacMan, parallel jobs run numerous tasks concurrently in waves with the all-or-nothing property. The LIFE algorithm evicts the data blocks of files with the largest wave-width, which reduces the average job completion time, whereas LFU-F aims to maximize cluster efficiency by evicting the data blocks with the least frequent access. Both strategies prioritize incomplete files over completed files for eviction and use a window-based aging mechanism to avoid cache pollution: they first check whether any files have not been accessed within a given time window and, among these files, evict the one with the fewest accesses.
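To make the eviction logic concrete, the sketch below gives a simplified, illustrative version of this decision in Python; the per-file metadata (wave-width, access count, last access time, completeness) and the exact tie-breaking are assumptions for illustration, not PacMan's actual implementation.

```python
import time

# Simplified, illustrative eviction choice in the spirit of LIFE / LFU-F.
# The FileInfo fields are assumed metadata; PacMan's real bookkeeping is
# more involved.
class FileInfo:
    def __init__(self, name, wave_width, accesses, last_access, complete):
        self.name = name
        self.wave_width = wave_width      # number of parallel tasks reading the file
        self.accesses = accesses          # access frequency
        self.last_access = last_access    # timestamp of the last access
        self.complete = complete          # True if all blocks of the file are cached

def pick_victim(files, policy, window):
    """Pick the file whose blocks should be evicted next."""
    now = time.time()
    # Window-based aging: prefer files not accessed within the window,
    # evicting the least-accessed of them first (avoids cache pollution).
    aged = [f for f in files if now - f.last_access > window]
    if aged:
        return min(aged, key=lambda f: f.accesses)
    # Prefer incomplete files over completed ones (all-or-nothing property).
    incomplete = [f for f in files if not f.complete]
    candidates = incomplete if incomplete else files
    if policy == "LIFE":     # evict the file with the largest wave-width
        return max(candidates, key=lambda f: f.wave_width)
    else:                    # LFU-F: evict the least frequently accessed file
        return min(candidates, key=lambda f: f.accesses)
```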
Enhanced Data-Aware Cache (EDACHE) [9] was introduced for caching intermediate results to accelerate MapReduce job execution. In this strategy, WSClock is used as the cache replacement algorithm: cached items are maintained in a circular list and a clock hand advances around this ring. The algorithm replaces cached items based on their reference bit and last-used time. It first checks the reference bit; if the bit is set, the item has been used recently, so the bit is cleared, the item's last-used time is updated, and the clock hand advances. Otherwise, an item whose age exceeds a threshold value is evicted. The bottleneck of this mechanism is that large blocks lead to long search times for requested content.
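A minimal sketch of such a WSClock-style sweep is shown below, assuming each cached entry carries a reference bit and a last-used timestamp; the age threshold and the fallback used when no entry is old enough are simplifications rather than EDACHE's actual code.

```python
import time

class Entry:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.referenced = True        # reference bit, set whenever the item is accessed
        self.last_used = time.time()  # last time the item was used

def wsclock_evict(ring, hand, age_threshold):
    """Advance the clock hand around the circular list until a victim is found.

    `ring` is a list of Entry objects treated as a circular buffer and `hand`
    is the current position of the clock hand.
    """
    if not ring:
        return None, 0
    now = time.time()
    for _ in range(2 * len(ring)):             # at most two sweeps of the ring
        entry = ring[hand]
        if entry.referenced:
            entry.referenced = False           # recently used: clear the bit,
            entry.last_used = now              # refresh its last-used time,
        elif now - entry.last_used > age_threshold:
            victim = ring.pop(hand)            # old and unreferenced: evict it
            return victim, hand % max(len(ring), 1)
        hand = (hand + 1) % len(ring)          # and advance the clock hand
    # Nothing exceeded the age threshold: fall back to the oldest entry.
    idx = min(range(len(ring)), key=lambda i: ring[i].last_used)
    return ring.pop(idx), idx % max(len(ring), 1)
```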
In collaborative caching, a Modified ARC replacement algorithm [10] was proposed to increase the cache hit ratio and improve efficiency. In this strategy, the cache is divided into four parts: a recent cache, a recent history, a frequent cache, and a frequent history, where the cache sections hold data blocks and the history sections hold references to evicted blocks. On a request for a block, the references in the history lists are checked first, which speeds up locating the block and caching it. If a reference is found in the recent history, the block is placed in the recent cache; if the block is found in the recent cache, it is moved to the frequent cache. A hit in either history list therefore removes the reference and places the corresponding block in one of the caches (recent or frequent). Caching a block also involves caching its metadata. When either cache is fully utilized, a block is evicted from the recent or frequent cache and its reference is placed into the corresponding history list. When either history list is fully utilized, the oldest references simply drop out.
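The sketch below illustrates this four-part organization with a simplified Python class; the fixed sizes, the promotion rules, and the omission of metadata handling are assumptions for illustration, not the exact algorithm of [10].

```python
from collections import OrderedDict

# Simplified sketch of the Modified ARC idea: two block caches (recent,
# frequent) plus two history lists that hold only references to evicted
# blocks.  Sizes and promotion rules are placeholders.
class ModifiedARC:
    def __init__(self, cache_size, history_size):
        self.recent = OrderedDict()           # recently used blocks
        self.frequent = OrderedDict()         # frequently used blocks
        self.recent_hist = OrderedDict()      # references to blocks evicted from `recent`
        self.frequent_hist = OrderedDict()    # references to blocks evicted from `frequent`
        self.cache_size = cache_size
        self.history_size = history_size

    def _evict(self, cache, history):
        if len(cache) >= self.cache_size:
            key, _ = cache.popitem(last=False)     # evict the oldest block
            history[key] = True                    # keep only its reference
            if len(history) > self.history_size:
                history.popitem(last=False)        # references simply drop out

    def access(self, key, load_block):
        if key in self.frequent:                   # hit in the frequent cache
            self.frequent.move_to_end(key)
            return self.frequent[key]
        if key in self.recent:                     # hit in the recent cache:
            block = self.recent.pop(key)           # promote to the frequent cache
            self._evict(self.frequent, self.frequent_hist)
            self.frequent[key] = block
            return block
        # Miss: check the history lists first.
        block = load_block(key)
        if key in self.recent_hist or key in self.frequent_hist:
            target, hist = ((self.frequent, self.frequent_hist)
                            if key in self.frequent_hist
                            else (self.recent, self.recent_hist))
            hist.pop(key)                          # a history hit removes the reference
            self._evict(target, hist)
            target[key] = block                    # and caches the block itself
        else:
            self._evict(self.recent, self.recent_hist)
            self.recent[key] = block
        return block
```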
An adaptive cache algorithm [11] was designed to cache table partitions in the HDFS cache for Big SQL. Selective LRU-K (SLRU-K) and Exponential-Decay (EXD) are used as online caching algorithms, together with selective cache insertion to reduce the overhead of inserting items into the HDFS cache. SLRU-K takes the variable size of partitions into account and uses a weight heuristic to place partitions into the cache selectively, keeping a list of the last K access times for each partition. EXD, in contrast, maintains only the time of the last access and computes a score for each partition, with a parameter that determines the weight of access frequency versus recency. In Adaptive SLRU-K and EXD, an adaptor adjusts to the access patterns of different workloads by automatically tuning the values of these parameters; its primary aim is to maximize the byte hit ratio and minimize the byte insertion ratio.
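The scoring ideas can be sketched as follows; the decay parameter a and the SLRU-K weight heuristic shown here are assumptions for illustration and differ from Big SQL's exact formulas.

```python
import math

# Illustrative partition scoring in the spirit of SLRU-K and EXD [11].
def exd_score(prev_score, prev_time, now, a):
    """EXD keeps only the previous score and the last access time.
    A larger decay parameter `a` favours recency; a smaller one favours
    frequency.  (Assumed form of the decayed-score update.)"""
    return prev_score * math.exp(-a * (now - prev_time)) + 1.0

def slruk_weight(access_times, now, k, size):
    """SLRU-K keeps the last K access times of each partition and weights
    partitions by how densely they were accessed, normalised by size.
    (Assumed heuristic, not Big SQL's exact formula.)"""
    recent = sorted(access_times)[-k:]
    span = max(now - recent[0], 1e-6)
    return (len(recent) / span) / size
```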
The block goodness aware cache replacement strategy [12], presented in 2017, uses two metrics for cache management: cache affinity (CA), which depends on the resources used by the application, and block goodness (BG), which measures how much a cached data block is worth. For each cached item, the strategy first calculates the BG value from the data block's access count and the cache affinity of the MapReduce applications accessing it, and then selects the data block with the lowest BG value for eviction. If more than one data block shares the lowest BG value, the block with the oldest access time is discarded.
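A simplified stand-in for this eviction decision is sketched below; the affinity-weighted access count used as the BG value is only an assumption, since the exact formula in [12] differs.

```python
from dataclasses import dataclass, field

# Simplified sketch of block-goodness-based eviction.  BG is computed here
# as per-application access counts weighted by that application's cache
# affinity (assumed stand-in for the paper's definition).
@dataclass
class CachedBlock:
    block_id: str
    last_access: float                                   # timestamp of the last access
    access_counts: dict = field(default_factory=dict)    # application name -> access count

def block_goodness(block, cache_affinity):
    """cache_affinity maps an application name to its CA value."""
    return sum(cache_affinity.get(app, 0.0) * count
               for app, count in block.access_counts.items())

def choose_victim(blocks, cache_affinity):
    # Lowest BG value first; ties are broken by the oldest access time.
    return min(blocks,
               key=lambda b: (block_goodness(b, cache_affinity), b.last_access))
```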
The Cache Affinity Aware Cache Replacement Algorithm [13], designed in 2018, categorizes MapReduce applications based on their cache affinity and prioritizes caching the input data of applications with high cache affinity. It combines the cache affinity of a MapReduce application with the data access frequency to calculate the benefit of caching its input data and evicts the data block with the lowest caching benefit. If several data blocks share the lowest benefit, it evicts one of them according to the LRU policy.
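For illustration, the sketch below estimates the caching benefit as the product of an application's cache affinity and the block's access frequency and breaks ties with LRU; this product is an assumed benefit estimate, not the paper's exact formula.

```python
import itertools

_access_clock = itertools.count()      # global counter giving an LRU ordering

class BlockStats:
    def __init__(self, cache_affinity):
        self.cache_affinity = cache_affinity   # affinity of the owning application
        self.frequency = 0
        self.lru_stamp = next(_access_clock)

    def touch(self):
        self.frequency += 1
        self.lru_stamp = next(_access_clock)

def caching_benefit(stats):
    # Assumed benefit estimate: affinity weighted by access frequency.
    return stats.cache_affinity * stats.frequency

def choose_victim(block_stats):
    """block_stats: block id -> BlockStats.  Returns the id to evict:
    lowest benefit first, ties broken by least recently used."""
    return min(block_stats,
               key=lambda b: (caching_benefit(block_stats[b]),
                              block_stats[b].lru_stamp))
```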
AutoCache [14], developed in 2019, employs a lightweight gradient-boosted tree model (XGBoost) to predict file access patterns for the HDFS cache. In this mechanism, the probability of a file being accessed is captured by a probability score, which the cache replacement policy uses as its metric to avoid cache pollution. When the free space of the cache falls below 10%, eviction starts and continues until cache utilization drops below 85%. This cache replacement algorithm keeps its overhead low by limiting computation to a fixed number of files.
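The eviction loop can be sketched as follows, with predict_score standing in for the trained XGBoost model and the 10%/85% thresholds taken from the description above; the file-size attribute and scoring details are assumptions.

```python
# Sketch of an AutoCache-style eviction pass: files with the lowest
# predicted access probability are evicted first, once free space drops
# below 10%, until utilisation falls below 85%.
def evict_if_needed(cached_files, cache_capacity, predict_score,
                    low_free=0.10, target_util=0.85):
    used = sum(f.size for f in cached_files)
    if cache_capacity - used > low_free * cache_capacity:
        return []                                  # enough free space, nothing to do
    victims = []
    for f in sorted(cached_files, key=predict_score):   # least likely to be accessed first
        if used <= target_util * cache_capacity:
            break
        victims.append(f)
        used -= f.size
    return victims
```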
In Table 1, we compare these cache replacement strategies in terms of their eviction criteria and how they mitigate cache pollution, and we summarize their advantages and disadvantages.
Table 1
Hadoop cache replacement comparisons
Replacement strategy | Metrics for eviction | Cache pollution mitigation | Advantages | Disadvantages |
LIFE | Largest wave-width and incomplete files | Window-based aging | Reduces average completion time | Effective only for short jobs |
LFU-F | Access frequency and incomplete files | Window-based aging | Maximizes cluster efficiency | Effective only for short jobs |
WSClock | Last time used | No | Decreases execution time | Long search times for large data blocks |
Modified ARC | Recency and frequency of access | No | Increases cache hit ratio | Needs space for storing history |
Adaptive cache | Score of each partition | No | Adapts to various workload characteristics | Significant overhead |
Block goodness aware | Block goodness value and access time | No | Effective for multiple concurrent applications | Needs to calculate block goodness |
Cache affinity aware | Cache affinity of application and recency | No | Effective use of cache space | Needs to know the cache affinity of applications |
AutoCache | Probability of file access | Probability score | Reduces average completion time and improves cluster efficiency | File-oriented caching only |
3.2 Intelligent caching mechanisms
In a related application domain, several intelligent caching strategies have been presented that use different machine learning techniques to enhance the performance of web proxy caches. Ali et al. proposed SVM-LRU, SVM-GDSF, and C4.5-GDS [15], which combine a support vector machine (SVM) and a decision tree (C4.5) with the Least Recently Used (LRU), Greedy-Dual-Size (GDS), and Greedy-Dual-Size-Frequency (GDSF) replacement strategies. In these techniques, web objects are classified into two groups: those that will be revisited later and those that will not. The methods use a web proxy log file as the training dataset, and features of web objects such as recency, frequency, size, access latency, and object type are considered for classification. Experimental results show that SVM-LRU achieves the best hit ratio.
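A rough sketch of how such a classifier can be combined with LRU is given below, using scikit-learn's SVC as the SVM; the feature encoding and the eviction rule (prefer the least recently used object predicted not to be revisited) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from collections import OrderedDict
from sklearn.svm import SVC

def train_revisit_classifier(features, labels):
    """features: one row of proxy-log features per request (recency,
    frequency, size, retrieval time, type); labels: 1 = revisited later."""
    clf = SVC(kernel="rbf")
    clf.fit(np.asarray(features), np.asarray(labels))
    return clf

def svm_lru_victim(cache, clf):
    """cache: OrderedDict mapping object key -> feature vector, with the
    least recently used entry first.  Returns the key to evict."""
    # Evict the least recently used object predicted NOT to be revisited;
    # if every object looks valuable, fall back to plain LRU.
    for key, feats in cache.items():
        if clf.predict(np.asarray(feats).reshape(1, -1))[0] == 0:
            return key
    return next(iter(cache))
```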
Employing a Bayesian network, BN-GDS and BN-LRU [16] were introduced to improve the performance of cache replacement strategies such as Greedy-Dual-Size (GDS) and Least Recently Used (LRU). In these strategies, the probability that a web object belongs to the revisited class (and so should be cached) is calculated from features such as retrieval time, frequency, size, and type. Experimental results suggest that BN-GDS achieves the best hit ratio, while BN-LRU has the best byte hit ratio.
Hybrid ELM-LFU [17], a two-level caching scheme for web proxies, was presented in 2018. In the first level, LFU is used for fast cache replacement because of its low complexity, which makes it suitable for real-time communication. An Extreme Learning Machine (ELM) is used in the second level; it applies a single-hidden-layer feed-forward network whose hidden-layer weights do not need to be adjusted. Web objects chosen for eviction at the first level are placed in the second-level cache. This method features low training times.
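The sketch below shows a minimal ELM of this kind: the hidden-layer weights are drawn at random and kept fixed, and only the output weights are obtained by a single least-squares solve, which is what keeps training fast. Layer sizes, features, and the activation function are assumptions for illustration.

```python
import numpy as np

class ELM:
    """Minimal Extreme Learning Machine: random, fixed hidden weights and a
    closed-form solution for the output weights (no iterative training)."""

    def __init__(self, n_features, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(n_features, n_hidden))   # random input weights
        self.b = rng.normal(size=n_hidden)                  # random biases
        self.beta = None                                     # output weights

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        H = self._hidden(np.asarray(X, dtype=float))
        # Single least-squares solve instead of iterative weight updates.
        self.beta = np.linalg.pinv(H) @ np.asarray(y, dtype=float)
        return self

    def predict(self, X):
        return self._hidden(np.asarray(X, dtype=float)) @ self.beta
```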
In [18], Herodotou et al. designed a framework for moving data automatically through tiered storage in a distributed file system via a set of pluggable policies. For this purpose, they employ incremental learning to decide which data should be downgraded or upgraded, allowing adaptation to workload changes over time. For downgrading, this method uses cache replacement strategies such as LRU, LFU, Least Recently & Frequently Used (LRFU), LIFE, LFU-F, Exponential Decay (EXD), and XGBoost-based modeling (XGB); On Single Access (OSA), LRFU, EXD, and XGB are used for the upgrade policy. Experimental results show that XGB is the most suitable because it requires minimal storage, has low training overhead, makes useful predictions, and can learn incrementally over time.
PACS-oriented SVM-LRU [19] was proposed for picture archiving and communication systems (PACS) in 2021. This algorithm calculates the probability of future access to cached items. In this setting, SVM-LRU offers benefits such as low training time, low computational cost, high prediction accuracy, and a high hit ratio.
Even though caching mechanisms have yielded some benefits in the Hadoop environment, several challenges remain. For instance, cache management imposes a heavy load on the NameNode, both in terms of required storage and computational load, potentially degrading performance. Moreover, existing cache replacement policies in Hadoop take neither cache pollution nor effective use of cache space into account, and they do not apply intelligent caching mechanisms. In this paper, we design a cache replacement mechanism that addresses these problems, using an SVM to classify data and thereby improve performance. We choose SVM because its generalization ability can be maximized when training data are scarce and it can control the misclassification error.