In hydrology, irrigation, and drainage engineering, the soil infiltration process plays a fundamental role. In general, infiltration refers to the entry of water at the soil surface and its vertical and lateral movement down through the soil layers. When water is first applied, whether through rainfall or irrigation, it penetrates the soil at a high rate, termed the potential infiltration rate or maximum infiltration capacity. Over time, this rate declines and stabilizes at a constant value known as the saturated infiltration rate or steady-state infiltration capacity. Understanding this transition from rapid to stable infiltration rates is crucial for analysing water movement within soil profiles. Describing the infiltration process is challenging because of its complex nature, particularly under anisotropic and heterogeneous soil conditions (He et al. 2024). Infiltration is strongly influenced by various factors, including soil depth, geomorphological features, hydraulic properties, and climatic conditions. Among these factors, the arrangement of soil particles and the moisture content within the soil layers are crucial determinants of the soil's ability to absorb and retain water during rainfall or irrigation events (Arya et al. 1999; Dexter and Richard 2009). Understanding these factors is essential for developing strategies to mitigate soil erosion, manage groundwater recharge, and optimize the design and management of irrigation and hydrological systems (Mattar et al. 2015).
Over the years, researchers have struggled to assess infiltration rates accurately because of the spatial and temporal variability inherent in field measurements (Mahapatra et al. 2020). This variability stems from the heterogeneity of soil properties and is compounded by equipment limitations, logistical constraints, and environmental interferences. Limited access in remote locations adds another layer of difficulty.
Despite these challenges, many authors have developed equations to predict infiltration, ranging from physically based to semi-empirical and empirical formulations. Examples include Green and Ampt (1911), Philip (1969), Smith (1972), Smith and Parlange (1978), Horton (1941), Holtan (1961), Overton (1964), Richards (1931), Kostiakov (1932), modified Green-Ampt (Wang et al. 1999), and modified Kostiakov (Haverkamp et al. 1988), among others. While these equations offer valuable predictive tools, they have a key drawback: their reliance on simplifying assumptions such as (i) homogeneity of soil and (ii) constant soil moisture content, along with their limited applicability to specific soil and environmental conditions, can introduce inaccuracies. In addition, numerical and physically based models such as SWAT (Soil and Water Assessment Tool) and MIKE SHE (Système Hydrologique Européen), among others, are known for accurately predicting infiltration processes. However, acquiring data at high spatial resolution, especially the soil data required to run these models, is challenging, particularly for large heterogeneous catchments (Christiaens and Feyen 2001). To overcome these limitations, our study integrates the established empirical models with machine learning (ML) techniques, specifically Artificial Neural Networks (ANN) and the MissForest (MF) algorithm. This hybrid approach leverages the strengths of both empirical and ML models, addressing the simplifying assumptions and enhancing prediction accuracy while maintaining interpretability. By combining empirical knowledge with data-driven techniques, we mitigate the challenges of data acquisition and model applicability, providing a more robust and practical solution for predicting infiltration in water-scarce and data-scarce regions.
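To make the form of these empirical equations concrete, the following minimal Python sketch evaluates two of the models named above in their standard textbook forms: the Horton rate equation, f(t) = fc + (f0 - fc) exp(-k t), and the Kostiakov cumulative infiltration equation, F(t) = a t^b. The function names and all parameter values (f0, fc, k, a, b) are illustrative assumptions and are not taken from this study.

```python
import math


def horton_rate(t_h, f0, fc, k):
    """Horton infiltration rate (mm/h) at elapsed time t_h (h).

    f0: initial infiltration rate (mm/h), fc: steady-state rate (mm/h),
    k: decay constant (1/h). The rate decays exponentially from f0 to fc.
    """
    return fc + (f0 - fc) * math.exp(-k * t_h)


def kostiakov_cumulative(t_h, a, b):
    """Kostiakov cumulative infiltration (mm) at time t_h (h), with 0 < b < 1."""
    return a * t_h ** b


def kostiakov_rate(t_h, a, b):
    """Kostiakov infiltration rate (mm/h): time derivative of a * t**b, valid for t_h > 0."""
    return a * b * t_h ** (b - 1)


# Illustrative (assumed) parameters, not fitted to any dataset.
F0, FC, K = 60.0, 10.0, 2.0
A, B = 20.0, 0.5

for t in (0.25, 0.5, 1.0, 2.0, 4.0):
    print(
        f"t = {t:4.2f} h  Horton f = {horton_rate(t, F0, FC, K):6.2f} mm/h  "
        f"Kostiakov f = {kostiakov_rate(t, A, B):6.2f} mm/h"
    )
```

In this sketch the Horton rate starts at f0 and approaches the steady-state value fc, which mirrors the transition from rapid to stable infiltration. The Kostiakov rate, by contrast, keeps decreasing toward zero at long times, one reason modified forms of that equation have been proposed.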
Ongoing advancements in measurement techniques and data analysis methodologies offer a way forward. Soft computing and data-driven methods, including Artificial Neural Networks (ANN), Random Forest (RF), Multiple Linear Regression (MLR), and Support Vector Regression (SVR), among others, have emerged as powerful tools in hydrology and irrigation engineering, addressing various complex challenges (Sayari et al. 2021; Ahmed et al. 2024; Chen et al. 2024; Teshome et al. 2024). In hydrology, they are used for predictive analytics, flood and drought prediction, and water quality assessment. Within irrigation engineering, they are used to optimize schedules, estimate crop water needs, assess system performance, and recognize patterns (Sidhu et al. 2020). Notably, the literature on machine learning (ML) methods for soil water infiltration remains limited, even though ML algorithms offer several benefits: making accurate predictions, improving performance as more data become available, revealing hidden patterns within complex datasets, and automating tasks.
Sy (2006) applied an ANN to model infiltration using data from plot-scale rainfall simulator experiments. The research highlighted the effectiveness of ANN in capturing infiltration dynamics, with soil moisture and hydraulic conductivity identified as critical factors. Furthermore, compared to traditional methods such as Philip and Green-Ampt, ANN exhibited superior accuracy in predicting cumulative infiltration.
Sayari et al. (2021) compared five artificial intelligence (AI) models and their integrative versions with the Firefly Algorithm (FA) to forecast infiltrated water in furrow irrigation systems. Using data from both literature sources and field experiments conducted in Iran, the study incorporated key input parameters including furrow length, inflow rate, advance time, cross-sectional area of inflow, and infiltration opportunity time. Evaluation metrics showed that integrating FA significantly improved accuracy. These findings underscore the potential of AI models for modelling complex hydrological processes.
Sihag et al. (2019) evaluated the performance of Adaptive Neuro-Fuzzy Inference System (ANFIS), Support Vector Machine (SVM), and RF models in estimating cumulative infiltration and infiltration rate in arid areas of Iran, concluding that SVM, particularly with a radial basis kernel function, outperformed ANFIS and RF. In a subsequent study, Sihag et al. (2020) compared ANN, Gaussian Process (GP), Gene Expression Programming (GEP), and Generalized Regression Neural Network (GRNN) models to estimate soil infiltration rates, finding that ANN with specific parameters achieved higher correlation coefficients than the other algorithms. Singh et al. (2017) evaluated the performance of RF, ANN, and M5P model tree techniques in predicting the infiltration rate and reported that RF gave closer estimates than ANN and the M5P model tree.
According to the literature, no single algorithm is universally suited to all site-specific scenarios. Many predictive algorithms, including ANN, can have a time-consuming training process because of their sensitivity to hyperparameter selection. Nevertheless, ANN continue to demonstrate remarkable predictive accuracy in estimating infiltration rates, and they remain adaptable and robust even when confronted with small datasets (Jia and Culver 2006). Another ML technique, MissForest (MF), is built on the RF algorithm. It performs well with small datasets because it can handle missing and heterogeneous data and trains efficiently compared to other algorithms (Ispirova et al. 2020; Naranjo-Fernández et al. 2020; Bikše et al. 2023). Given the success of Random Forest in predicting infiltration in numerous studies, algorithms derived from it are promising successors. However, MF, as applied through the 'fit_transform()' function from the 'missingpy' library in Python, remains largely unexplored in soil infiltration prediction, which represents a notable research gap in the field. According to Parchami-Araghi et al. (2013) and Sy (2006), coupling ML techniques with physically based and empirical models results in reliable infiltration prediction. Therefore, the aim of this study is to develop hybrid ML and hydrological models for predicting infiltration rates. This will be achieved by using in-situ observations of gravel (%), sand (%), clay and silt (%), and soil moisture content (%) as predictor variables, measured from soil samples collected at the surface and at depths of 50 cm and 100 cm. The study will also assess the robustness of the hybrid models under scenarios with an increasing number of predictor variables, using data collected from various regions of Southern India.
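As a rough illustration of how a MissForest-style imputer could be used for prediction, the sketch below treats the infiltration rate of unmeasured samples as missing values and lets a random-forest-based imputer fill them in. Several points are assumptions rather than the method of this study: the framing of prediction as imputation of the target column, the use of scikit-learn's `IterativeImputer` with a `RandomForestRegressor` in place of the 'missingpy' implementation (a related but not identical procedure), the variable names, and the data, which are entirely synthetic.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Synthetic, illustrative data only (not measurements from the study).
rng = np.random.default_rng(0)
n_samples, n_unmeasured = 40, 5
gravel = rng.uniform(0, 30, n_samples)       # gravel (%)
sand = rng.uniform(20, 60, n_samples)        # sand (%)
clay_silt = 100 - gravel - sand              # clay and silt (%)
moisture = rng.uniform(5, 35, n_samples)     # soil moisture content (%)
# Hypothetical relationship used only to generate a target column.
rate = 5 + 0.4 * sand + 0.2 * gravel - 0.3 * moisture + rng.normal(0, 1, n_samples)

data = np.column_stack([gravel, sand, clay_silt, moisture, rate])
data[-n_unmeasured:, -1] = np.nan            # rates to be predicted are marked missing

# Random-forest-based iterative imputation (MissForest-like, by assumption).
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    skip_complete=True,   # only model columns that contain missing values
    random_state=0,
)
filled = imputer.fit_transform(data)

predicted_rate = filled[-n_unmeasured:, -1]
print("Imputed infiltration rates for unmeasured samples:", np.round(predicted_rate, 2))
```

The same call pattern, a single `fit_transform()` on a table containing both predictors and a partly missing target, is what makes this family of methods convenient for small datasets with gaps. Any real application would need held-out observations to evaluate the imputed values.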