Enhancing Infiltration Rate Predictions with Hybrid Machine Learning and Empirical Models: Addressing Challenges in Southern India

doi:10.21203/rs.3.rs-4869876/v1


Research Article


This work is licensed under a CC BY 4.0 License


Despite the success of machine learning (ML) in many disciplines, its application in hydrology, especially in water-scarce regions, faces challenges due to the lack of interpretability and physical consistency. This study addresses these challenges by integrating established empirical hydrological models with ML techniques to predict infiltration rates in water-scarce regions of southern India. Data from 199 observations across 11 sites, including soil characteristics and infiltration measurements, were used to parameterize traditional models like Philip's, Horton's, and Kostiakov's, which were then combined with Artificial Neural Networks (ANN) and the MissForest (MF) algorithm to form hybrid models. The results demonstrate that hybrid models, particularly those based on Philip's model, significantly improve prediction accuracy (R²: 0.76–0.92, RMSE: 0.08–0.2 cm/min, and LCE: 0.11–0.71 with more predictors) across all target sites while retaining interpretability. This approach leverages the strengths of both empirical models and machine learning, addressing the limitations of each. The study highlights that while empirical models are data-driven and may introduce uncertainties, combining them with ML techniques can enhance predictive power and provide a more robust understanding of infiltration dynamics. This is particularly valuable in regions where direct measurement is challenging. The hybrid models facilitate accurate predictions using minimal data from readily accessible locations, offering a practical solution for effective water resource management and soil conservation in semi-arid and data-scarce regions. By blending empirical knowledge with machine learning algorithms, this approach not only improves accuracy but also enhances the physical meaningfulness of hydrological models, providing a balanced and innovative solution to hydrological modeling challenges.

Soil Infiltration prediction

Infiltration models

Artificial Neural Network (ANN)

MissForest

Hybrid hydrological model

In the fields of hydrology, irrigation, and drainage engineering, the soil infiltration process plays a fundamental role. In general, infiltration refers to the vertical and lateral movement of water from the soil surface down through the soil layers. Initially, when water is applied, whether through rainfall or irrigation, it penetrates the soil at a high rate, termed the potential infiltration rate or maximum infiltration capacity. Over time, this rate declines and stabilizes at a constant value known as the saturated infiltration rate or steady-state infiltration capacity. Understanding this transition from rapid to stable infiltration rates is crucial for analysing water movement dynamics within soil profiles. Describing the infiltration process is challenging because of its complex nature, particularly under anisotropic and heterogeneous soil conditions (He et al. 2024). The process is strongly influenced by soil depth, geomorphological features, hydraulic properties, and climatic conditions. Among these factors, the arrangement of soil particles and the moisture content within the soil layers stand out as crucial determinants of the soil's ability to absorb and retain water during rainfall or irrigation events (Arya et al. 1999; Dexter and Richard 2009). Understanding these factors is indispensable for crafting strategies to mitigate soil erosion, manage groundwater recharge, and optimize the design and management of irrigation and hydrological systems (Mattar et al. 2015).

Over the years, researchers have grappled with the accurate assessment of infiltration rates due to the spatial and temporal variability inherent in field measurements (Mahapatra et al. 2020). This variability stems from the heterogeneity of soil properties, which are further complicated by equipment limitations, logistical constraints, and environmental interferences. Limited access in remote locations adds another layer of difficulty.

Despite these challenges, many authors have developed equations to predict infiltration, encompassing physical, semi-empirical, and empirical formulations. Examples include Green and Ampt (1911), Philip (1969), Smith (1972), Smith and Parlange (1978), Horton (1941), Holtan (1961), Overton (1964), Richards (1931), Kostiakov (1932), modified Green-Ampt (Wang et al. 1999), and modified Kostiakov (Haverkamp et al. 1988), among others. While these equations offer valuable predictive tools, their reliance on simplifying assumptions such as (i) homogeneity of soil and (ii) constant soil moisture content, along with their limited applicability to specific soil and environmental conditions, can introduce inaccuracies. This is a key drawback in their use. In addition, numerical and physically based models such as SWAT (Soil and Water Assessment Tool) and MIKE SHE (Système Hydrologique Européen), among others, are renowned for accurately predicting infiltration processes. However, acquiring data at high spatial resolution, especially the soil data required to run these models, proves challenging, particularly for large heterogeneous catchments (Christiaens and Feyen 2001). To overcome these limitations, our study integrates the established empirical models with machine learning (ML) techniques, specifically Artificial Neural Networks (ANN) and the MissForest (MF) algorithm. This hybrid approach leverages the strengths of both empirical and ML models, addressing the simplifying assumptions and enhancing prediction accuracy while maintaining interpretability. By combining empirical knowledge with data-driven techniques, we mitigate the challenges of data acquisition and model applicability, providing a more robust and practical solution for predicting infiltration in water-scarce and data-scarce regions.

Ongoing advancements in measurement techniques and data analysis methodologies offer a glimmer of hope. Soft computing and data driven methods, including Artificial Neural Networks (ANN), Random Forest (RF), Multi-Linear Regression (MLR), Support Vector Regressor (SVR), among others have emerged as powerful tools in hydrology and irrigation engineering, addressing various complex challenges (Sayari et al. 2021; Ahmed et al. 2024; Chen et al. 2024; Teshome et al. 2024). In hydrology, they excel in predictive analytics, flood and drought prediction, and water quality assessment. In parallel, within irrigation engineering, they optimize schedules, estimate crop water needs, assess system performance, and recognize patterns (Sidhu et al. 2020). Notably, the literature on machine learning (ML) methods for soil water infiltration remains limited. However, ML algorithms shine in several areas, making accurate predictions, enhancing performance over time, unveiling concealed patterns within intricate datasets, and automating tasks, thus offering multifaceted benefits.

Sy (2006) applied the ANN to model infiltration using data from plot-scale rainfall simulator experiments. The research highlighted the efficiency of ANN in capturing infiltration dynamics, with soil moisture and hydraulic conductivity identified as critical factors. Furthermore, compared to traditional methods such as Philip and Green-Ampt, ANN exhibited superior accuracy in predicting cumulative infiltration.

Sayari et al. (2021) compared five artificial intelligence (AI) models and their integrative versions with the Firefly Algorithm (FA) to forecast infiltrated water in furrow irrigation system. Utilizing data from both literature sources and field experiments conducted in Iran, the study incorporated key input parameters including furrow length, inflow rate, advance time, cross-sectional area of inflow, and infiltration opportunity time. Evaluation metrics highlighted the significant enhancement in accuracy achieved by integrating FA. These findings underscore the potential of AI models in refining complex hydrological processes.

Sihag et al. (2019) evaluated the performance of Adaptive Neuro-Fuzzy Inference System (ANFIS), SVM, and RF models in estimating cumulative infiltration and infiltration rate in arid areas of Iran, concluding that SVM, particularly with a radial basis kernel function, outperformed ANFIS and RF. In a subsequent study, Sihag et al. (2020) compared ANN, Gaussian Process (GP), Gene Expression Programming (GEP), and Generalized Regression Neural Network (GRNN) models for estimating soil infiltration rates, finding that ANN with specific parameters achieved higher correlation coefficients than the other algorithms. Singh et al. (2017) evaluated the performance of RF, ANN, and M5P model tree techniques in predicting the infiltration rate and reported that RF provided closer estimates than ANN and the M5P model tree.

According to the literature, there is no universally accepted algorithm that fits all site-specific scenarios. Many predictive algorithms, including ANN, can have a time-consuming training process due to their sensitivity to hyperparameter selection. Nevertheless, ANNs continue to demonstrate remarkable predictive accuracy in estimating infiltration rates, and they remain adaptable and robust even when confronted with small datasets (Jia and Culver 2006). Another ML technique, which builds on the RF algorithm, is MissForest (MF). It performs well with small datasets because it handles missing and heterogeneous data and trains efficiently compared to other algorithms (Ispirova et al. 2020; Naranjo-Fernández et al. 2020; Bikše et al. 2023). Given the success of Random Forest in predicting infiltration in numerous studies, algorithms derived from it are promising successors. However, the 'fit_transform()' function from the 'missingpy' library in Python remains largely unexplored in soil infiltration prediction, signifying a notable research gap in the field. According to Parchami-Araghi et al. (2013) and Sy (2006), coupling ML techniques with physical-empirical models results in reliable infiltration prediction. Therefore, the aim of this study is to develop hybrid ML and hydrological models for predicting infiltration rates. This is achieved by using in-situ observations of gravel (%), sand (%), clay and silt (%), and soil moisture content (%) as predictor variables, measured from soil samples collected at the surface and at 50 cm and 100 cm depth. The study also assesses the robustness of the hybrid models, particularly in scenarios involving an increasing number of predictor variables collected from various regions of Southern India.

A total of eleven infiltration points from various regions of Southern India (Fig. 1) were deliberately selected for this study to encompass a range of soil and climatic conditions. These points are situated within the premises of the Indian Institute of Astrophysics (IIA), a government institution dedicated to interplanetary observations. Specifically, four of the eleven points are located within the Hoskote Bengaluru Campus (site 1), another four are situated at the Kavalur Tamil Nadu campus (site 2), and the remaining three are found at the Gauribidanur Karnataka Campus (site 3). According to the Food and Agriculture Organization (FAO) soil database, site 1 is predominantly characterized by sandy clay loam texture, while site 2 and site 3 exhibit clay loam texture. Additionally, the upper layer of the soil (SOL_Z) at these sites is less than 300 mm in depth.

Hoskote experiences an average annual rainfall of 843 mm, with temperatures ranging from a minimum of 15 ºC to a maximum of 33.6 ºC, while Gauribidanur receives 694 mm of average annual rainfall, with temperatures ranging from 10 ºC to 40 ºC. Kavalur has an average annual rainfall of 917 mm, and temperatures range from 18.5 ºC to 40.4 ºC, as per the Census of India (CIA) handbook 2011. The climates at these sites are characterized as seasonally dry tropical savanna and semi-arid, providing a diverse range of soil and climatic conditions for the infiltration work. The infiltration sites within the IIA premises were thus selected to capture a variety of environmental conditions.

3.1 Field measurement and data collection

A study spanning various climate regions in southern India involved data collection from 199 observations. These observations included measurements of infiltration rates (\(\:f\left(t\right)\)) and cumulative infiltration (\(\:F\left(t\right)\)) across 11 distinct test points. Detailed descriptions of all 11 infiltration test points are provided in Table 1, and the infiltration observations \(\:f\left(t\right)\) and \(\:F\left(t\right)\) recorded at each test point are illustrated in Fig. 2. Furthermore, soil characteristics were extensively examined, with close to 33 observations gathered at each site, encompassing percentages of gravel, sand, and combined silt and clay content, as well as moisture content. This dataset, collected at surface level and at 50 cm and 100 cm depths, provides a robust foundation for investigating infiltration processes.

Table 1

Detail description of the infiltration test points
Location	Test point ID	Latitude	Longitude	Elevation above MSL (m)	Vegetation cover	Land use
Hoskote Bengaluru Campus	H1	13° 6'47.00"N	77°48'45.00"E	937	Sparse Shrubs	Mixed Use
	H2	13° 6'43.34"N	77°48'53.86"E	931	Lawn	Mixed Use
	H3	13° 6'50.51"N	77°48'38.71"E	941	Mixed Forest	Mixed Use
	H4	13° 6'51.55"N	77°48'43.89"E	938	Sparse Shrubs	Mixed Use
Kavalur Tamil Nadu campus	K1	12°34'33.62"N	78°49'17.59"E	718	Sparse Shrubs	Mixed Use
	K2	12°34'27.76"N	78°49'13.50"E	724	Mixed Forest	Forest
	K3	12°34'38.30"N	78°49'20.85"E	723	Sparse Shrubs	Mixed Use
	K4	12°34'42.66"N	78°49'30.19"E	716	Meadow Grass	Mixed Use
Gauribidanur Karnataka Campus	G1	13°36'12.85"N	77°25'43.17"E	725	Meadow Grass	Mixed Use
	G2	13°36'9.97"N	77°25'37.39"E	724	Bare Soil	Mixed Use
	G3	13°36'8.69"N	77°25'45.40"E	723	Sparse Shrubs	Mixed Use

The field process involved using a double ring infiltrometer made of a metal plate that is 10 mm thick with a depth of 42 cm, and inner and outer ring diameters of 25 cm and 48 cm, respectively. The upper existing soil layer is removed to eliminate surface irregularities and organic matter, ensuring accurate infiltration measurements. Both rings of the infiltrometer are driven simultaneously into the ground to a depth of 5 cm using a wooden plank and hammer. The observations are carried out in the inner ring to ensure that the infiltration measurements reflect the vertical downward movement of water into the soil strata, providing a more precise observation of soil water conductivity. The outer ring is used to control the lateral movement of water from the inner ring, which is one of the error sources of this type of infiltrometer. Observations were taken at time intervals of 2, 3, 5, 10, 15, 30, 45, 60, 90, and 120 minutes, and continued until the infiltration reached a steady rate.
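The conversion from inner-ring readings to infiltration rates can be sketched as follows. This is a minimal illustration, assuming the listed times are elapsed times since the start of the test and that cumulative infiltration depth is read at each time. The depth values, the function names `interval_rates` and `reached_steady`, and the 5% steadiness tolerance are hypothetical and are not taken from the study.

```python
# Elapsed times (min) follow the schedule in the text; the cumulative depths
# (cm) are made-up illustrative values, not measurements from this study.
times_min = [2, 3, 5, 10, 15, 30, 45, 60, 90, 120]
cum_depth_cm = [1.0, 1.4, 2.1, 3.5, 4.6, 7.3, 9.7, 12.0, 16.5, 21.0]


def interval_rates(times, cum_depths):
    """Average infiltration rate f (cm/min) over each reading interval."""
    rates = []
    prev_t, prev_depth = 0.0, 0.0
    for t, depth in zip(times, cum_depths):
        rates.append((depth - prev_depth) / (t - prev_t))
        prev_t, prev_depth = t, depth
    return rates


def reached_steady(rates, n_last=3, tol=0.05):
    """True if the last n_last rates differ by at most tol (relative)."""
    last = rates[-n_last:]
    return max(last) - min(last) <= tol * max(last)


rates = interval_rates(times_min, cum_depth_cm)
steady = reached_steady(rates)
```

In practice the steadiness criterion would be chosen by the field team; the relative tolerance here only shows one way to decide when readings can stop.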

To determine the percentages of gravel, sand, and combined silt and clay content, as well as the moisture content, various sample types were collected from different depths at each test point. Three messy samples, each weighing approximately 100 g, and six samples, each weighing approximately 1000 g, were collected from each test point (collectively 99 samples). The messy samples were used for moisture content analysis, while the other samples were used for determining the percentages of gravel, sand, and combined silt and clay content. Each 100 g messy sample was divided into two equal parts of 50 g, and the moisture content of each part was determined using the ASTM Standard Test Method for Laboratory Determination of Water Content of Soil Sample by Mass (ASTM Standard D2216–19 2019). To determine the particle size distribution, the 1000 g soil samples were oven-dried for 24 hours at 105℃. Subsequently, the dried samples were sieved using 4.75 mm and 0.0075 mm sieves to determine the percentage of gravel (particles greater than 4.75 mm), sand (particles ranging from 4.75 to 0.0075 mm), and combined silt and clay (particles less than 0.0075 mm). The results of the two samples collected from the same depth were averaged and used for further study.
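The laboratory arithmetic described above can be sketched as follows. The water content formula (mass of water divided by mass of oven-dry soil) is the standard gravimetric definition; the masses used and the function names are hypothetical and chosen only for illustration.

```python
def water_content_percent(wet_mass_g, dry_mass_g):
    """Gravimetric water content: mass of water over mass of dry soil, in %."""
    return 100.0 * (wet_mass_g - dry_mass_g) / dry_mass_g


def size_fractions_percent(retained_coarse_g, retained_fine_g, passing_fine_g):
    """Gravel, sand, and silt-plus-clay percentages from two-sieve masses.

    retained_coarse_g: mass retained on the coarse sieve (gravel)
    retained_fine_g:   mass passing the coarse but retained on the fine sieve (sand)
    passing_fine_g:    mass passing the fine sieve (silt and clay)
    """
    total = retained_coarse_g + retained_fine_g + passing_fine_g
    return tuple(100.0 * m / total
                 for m in (retained_coarse_g, retained_fine_g, passing_fine_g))


# Hypothetical masses: two 50 g sub-samples for moisture, averaged as in the text.
w1 = water_content_percent(50.0, 45.5)
w2 = water_content_percent(50.0, 45.3)
moisture = (w1 + w2) / 2.0

gravel, sand, fines = size_fractions_percent(232.0, 761.0, 7.0)
```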

The infiltration rate measured with the double ring infiltrometer serves as the foundation for the formulation of infiltration models. These models, including Philip’s (Philip 1969), Horton’s (Horton 1941), and Kostiakov’s (Kostiakov 1932), can be parameterized by fitting them to the observed infiltration data. The principal equation employed in each model is given in Table 2, along with a summary of each associated parameter. By analysing these models, the governing parameters that describe the infiltration process for a specific soil and land-use condition can be derived; these parameters are then used for training and testing the ML techniques. Figure 3 shows the detailed method of research and evaluation of the hybrid ML and traditional infiltration models.

Table 2

Approximate equations for infiltration rate derived from both theoretical principles and empirical observations
Model	Equation	Model Parameters	Application Context
Philip (1969)	\(\:f\left(t\right)=s{t}^{-1/2}+2A\)	s is sorptivity (\(\:L{T}^{-0.5}\)), which is the capacity of the soil to absorb water due to capillarity. A is transmissivity factor (\(\:L{T}^{-1}\)), indicating the rate at which water moves through the soil under a unit gradient, \(\:t\) is the time of infiltration (T).	Suitable for short-term infiltration studies and homogeneous soils.
Horton (1941)	\(\:f\left(t\right)={i}_{c}+m\left({e}^{-{K}_{h}t}\right)\)	\(\:{i}_{c}\) (\(\:L{T}^{-1}\)), is the steady rate or ultimate infiltration capacity, \(\:m=\left({i}_{o}-{i}_{c}\right),\:\)where \(\:{i}_{o}\)(\(\:L{T}^{-1}\)), is the infiltration capacity at \(\:t=0\), \(\:{K}_{h}\) is an empirical soil constant.	Ideal for predicting infiltration capacity over time, especially in varied soil conditions.
Kostiakov (1932)	\(\:f\left(t\right)=\left(ab\right){t}^{(b-1)}\)	\(\:a>0\:and\:0<b<1\) are empirical dimensionless constants	Commonly used for irrigation studies and quick field estimations.
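The fitting step can be sketched for two of the models in Table 2. This is a minimal illustration under stated assumptions, not the authors' procedure: Philip's equation is linear in \(t^{-1/2}\) and Kostiakov's is linear in log-log space, so ordinary least squares via `numpy.polyfit` suffices. Horton's equation is not linear in its parameters and would need a nonlinear fit (for example `scipy.optimize.curve_fit`), which is omitted here. The data are synthetic, and the function names are hypothetical.

```python
import numpy as np


def fit_philip(t, f):
    """Least-squares fit of f = s*t**-0.5 + 2A (Table 2); returns (s, A)."""
    x = np.asarray(t, dtype=float) ** -0.5
    slope, intercept = np.polyfit(x, np.asarray(f, dtype=float), 1)
    return slope, intercept / 2.0


def fit_kostiakov(t, f):
    """Fit f = (a*b)*t**(b-1) as a straight line in log-log space; returns (a, b)."""
    slope, intercept = np.polyfit(np.log(t), np.log(f), 1)
    b = slope + 1.0               # slope of log f against log t is (b - 1)
    a = np.exp(intercept) / b     # intercept is log(a*b)
    return a, b


# Synthetic, noise-free rates generated from known parameters (not study data).
t = np.array([2, 3, 5, 10, 15, 30, 45, 60, 90, 120], dtype=float)
f_philip = 1.5 * t ** -0.5 + 2 * 0.05
f_kostiakov = (0.8 * 0.6) * t ** (0.6 - 1.0)

s_hat, A_hat = fit_philip(t, f_philip)
a_hat, b_hat = fit_kostiakov(t, f_kostiakov)
```

With measured data the recovered parameters would carry fitting error, so goodness-of-fit should be checked before the parameters are passed on as ML targets.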

3.2 Machine learning algorithms

3.2.1 Artificial Neural Network (ANN)

Inspired by the intricate connections of human neurons, ANNs serve as computational tools in hydrology, mimicking this natural complexity to model and predict water-related phenomena (ASCE Task Committee on Application of Artificial Neural Networks in Hydrology 2000a, b). An ANN consists of one input layer, one or more hidden layers, and one output layer. These layers process the information fed as input variables in interconnected processing components called nodes or neurons. Neurons in adjacent layers are linked through weighted connections, which function as communication channels. The weights (W) determine the strength of influence one neuron has on neurons in the subsequent layer. Additionally, biases (B) are constant values added to the weighted sum of inputs for each neuron. These weights and biases give the ANN model the flexibility to fit the input variables. Inputs to a neuron are multiplied by their corresponding weights, summed, and then processed through a transfer function, which controls the signal strength relayed through the neuron's output.

Hidden and output layers in neural networks use "activation functions" to introduce non-linearity. This enables the network to learn complex patterns in the data that simple linear models cannot capture. Popular choices include the Rectified Linear Unit (ReLU, known for its simplicity and efficiency) and the Hyperbolic Tangent function (tanh, which has the same S-shape as the sigmoid function but is zero-centered, with outputs between -1 and 1, giving it different learning behavior). These functions are essential for the network's ability to capture intricate relationships within the data (Dubey et al. 2022).

In our approach, a custom Python script utilizing the scikit-learn library was used to develop the multilayer feed-forward ANN model with a back-propagation training algorithm, specifically the 'MLPRegressor'. The data, loaded from an Excel sheet, consisted of various features and target variables. The available data was split such that the last row was reserved for testing, while the rest was used for training. Hyperparameter tuning was conducted using 'GridSearchCV' with a parameter grid that included variations in hidden layer sizes, activation functions, and solvers. The best estimator was selected based on mean squared error scoring. This approach was intended to give the best prediction accuracy for the target variables within the searched grid.
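The workflow described above can be sketched as follows. This is a minimal illustration, not the authors' script: the data are random stand-ins for the Excel sheet, the grid values are arbitrary, and the input scaling step (`StandardScaler` in a pipeline) is an added assumption that the text does not mention. Only the split (last row held out), the use of 'MLPRegressor' and 'GridSearchCV', the tuned parameter families, and the mean squared error scoring come from the description.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Excel data: 12 rows, 22 inputs, 2 targets.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 100.0, size=(12, 22))
y = rng.uniform(0.0, 10.0, size=(12, 2))

# Last row reserved for testing, all earlier rows used for training.
X_train, y_train = X[:-1], y[:-1]
X_test = X[-1:]

# Scaling is an assumption added here; it is not stated in the text.
pipe = make_pipeline(StandardScaler(),
                     MLPRegressor(max_iter=2000, random_state=0))

# Illustrative grid over hidden layer sizes, activations, and solvers.
param_grid = {
    "mlpregressor__hidden_layer_sizes": [(5,), (10,)],
    "mlpregressor__activation": ["relu", "tanh"],
    "mlpregressor__solver": ["lbfgs", "adam"],
}

search = GridSearchCV(pipe, param_grid,
                      scoring="neg_mean_squared_error", cv=3)
search.fit(X_train, y_train)

pred = search.predict(X_test)   # shape (1, 2): one value per target
```

Because scikit-learn maximizes scores, mean squared error enters as 'neg_mean_squared_error'. With so few training rows the cross-validated scores are noisy, so the selected configuration should be read with caution.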

We also explored various scenarios to test the performance of the ANN model by incorporating data from multiple neighbouring test points and a target test point. We refer to this approach as a hybrid model because it integrates infiltration models such as Philip's, Horton's, and Kostiakov's with the ANN. As an example, the 22 input parameters used were the percentages of gravel (\(\:G\)), sand (\(\:S\)), silt and clay (\(\:CS\)), and moisture content (\(\:M\)) for all test points (1, 2, 3, and 4), giving 16 inputs, as well as the sorptivity (\(\:s\)) and transmissivity factor (\(\:A\)) for the nearby points (1, 2, and 3), giving a further 6 inputs. The output parameters were the sorptivity (\(\:{s}_{t}\)) and transmissivity factor (\(\:{A}_{t}\)) for the target test point (t). Therefore, there were 22 nodes in the input layer and two in the output layer. The suggested structure was 22-j-2, where j is the number of nodes in the hidden layer (Fig. 4).

In a similar manner, we tested the ANN model using Horton's and Kostiakov's methods as hybrid models, integrating their respective parameters to evaluate and enhance the network's prediction capabilities across different scenarios.

3.2.2 MissForest (MF)

Stekhoven and Bühlmann (2012) introduced the MF algorithm, an iterative imputation method that builds on the RF algorithm. The RF algorithm (Breiman 2001) grows many decision trees and averages their results. However, averaging can mask underlying variability and interactions in the data, leading to biased imputations, which MF addresses by iteratively capturing these complex relationships for more accurate imputation.

To illustrate the concept, consider an example of a hybrid MF algorithm combined with Philip's model. In each iteration of the adapted MF algorithm, aimed at imputing missing values of sorptivity (\(\:{s}_{t}\)) and transmissivity factor (\(\:{A}_{t}\)) at the target test site 't', the process begins by preparing the data from an Excel file. Features \(\:G,\:S,\:CS\) and \(\:M\) and targets \(\:{s}_{t}\) and \(\:{A}_{t}\) are separated, with the last row reserved for testing. Specifically, 'X_train = df.iloc[:-1, :-2]' includes all rows except the last one, excluding the last two columns, and 'y_train = df.iloc[:-1, -2:]' includes the last two columns for all rows except the last one. The MF model fits an RF model with \(\:{P}_{k}\sim\{G,\:S,\:CS,\:M\}\), where \(\:{P}_{k}\) is either \(\:{s}_{t}\) or \(\:{A}_{t}\), using data from rows at three nearby test points that do not have missing values. These test points include data like \(\:{G}_{1},\:{\:S}_{1},\:{CS}_{1},\) \(\:{M}_{1}\), \(\:{s}_{1}\) and \(\:{A}_{1}\) for test point 1, and similarly for test points 2 and 3. A grid search optimizes hyperparameters such as 'n_estimators', 'max_features', 'max_depth' and 'random_state'. Once the best RF model is identified, it is used to impute missing values in the last row of the test data. This involves initializing the MF model, using 'GridSearchCV' to fit the model and find the best parameters, imputing missing values in the test data, and predicting \(\:{s}_{t}\) and \(\:{A}_{t}\) for the last row. The best parameter values are selected using the mean squared error. Initial mean imputation provides a baseline, and iterative refinement improves the predictions based on the optimized model.
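The imputation idea can be sketched as follows. This is a stand-in rather than the authors' implementation: instead of the 'missingpy' MissForest class, it uses scikit-learn's `IterativeImputer` with a `RandomForestRegressor`, which follows the same pattern of mean initialization followed by iterative random-forest refinement. The data frame is synthetic, the column names are assumptions, and the grid search step is omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

# Synthetic table: three nearby test points plus the target point (last row).
rng = np.random.default_rng(0)
cols = ["G", "S", "CS", "M", "s", "A"]
df = pd.DataFrame(rng.uniform(0.0, 10.0, size=(4, 6)), columns=cols)
df.loc[df.index[-1], ["s", "A"]] = np.nan   # s_t and A_t are unknown

# Slicing as described in the text: last row held out, last two columns are targets.
X_train = df.iloc[:-1, :-2]
y_train = df.iloc[:-1, -2:]

# Mean imputation as the starting point, then iterative random-forest refinement.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    initial_strategy="mean",
    max_iter=5,
    random_state=0,
)
filled = imputer.fit_transform(df)

s_t, A_t = filled[-1, -2:]   # imputed sorptivity and transmissivity factor
```

Observed values are left unchanged; only the missing entries in the last row are filled. Because a random forest averages training targets, the imputed values cannot fall outside the range seen at the nearby points.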

Similarly, models were built using Horton's and Kostiakov's methods, combining the classical infiltration models with the MF algorithm to impute and predict the missing values under different scenarios with an increasing amount of data from nearby stations.

3.3 Performance evaluation criteria

The infiltration rates obtained from the hybrid models for varying target test points were evaluated by computing three standard statistical performance indicators and one graphical indicator. These indicators were the coefficient of determination (\(\:{R}^{2}\)), Root Mean Square Error (\(\:RMSE\)), Legates’s Coefficient of Efficiency (LCE), and the Taylor diagram.

The three statistical indicators were expressed as:

\(\:{R}^{2}=\frac{{\left[\sum_{i=1}^{N}\left({p}_{i}-{P}_{m}\right)\left({o}_{i}-{O}_{m}\right)\right]}^{2}}{\sum_{i=1}^{N}{\left({p}_{i}-{P}_{m}\right)}^{2}\:\sum_{i=1}^{N}{\left({o}_{i}-{O}_{m}\right)}^{2}}\)	(1)
\(\:RMSE={\left[\frac{\sum_{i=1}^{N}{\left({p}_{i}-{o}_{i}\right)}^{2}}{N}\right]}^{1/2}\)	(2)
\(\:LCE=1-\frac{\sum_{i=1}^{N}\left|{p}_{i}-{o}_{i}\right|}{\sum_{i=1}^{N}\left|{o}_{i}-{O}_{m}\right|}\)	(3)

where \(\:{o}_{i}\) is the i-th observed infiltration rate, \(\:{p}_{i}\) is the i-th predicted infiltration rate. The mean values of the observed and predicted rates, each consisting of N values, are represented as \(\:{O}_{m}\) and \(\:{P}_{m}\) respectively.

R² is a commonly used metric to measure the degree of correlation between predicted and observed values, where an ideal R² value close to 1 indicates a strong match. To quantify the error between these values, RMSE is frequently employed, expressed in the same units as the observed values; lower RMSE values suggest better predictive accuracy. Additionally, LCE provides a dimensionless measure of model prediction accuracy relative to observed values, with values near 1 indicating near-perfect agreement (Legates and McCabe 1999).
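Equations (1) to (3) translate directly into a few lines of NumPy. The sketch below is illustrative only; the function names and the sample values are hypothetical.

```python
import numpy as np


def r_squared(obs, pred):
    """Eq. (1): squared correlation between predicted and observed values."""
    o, p = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    do, dp = o - o.mean(), p - p.mean()
    return np.sum(dp * do) ** 2 / (np.sum(dp ** 2) * np.sum(do ** 2))


def rmse(obs, pred):
    """Eq. (2): root mean square error, in the units of the observations."""
    o, p = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    return np.sqrt(np.mean((p - o) ** 2))


def lce(obs, pred):
    """Eq. (3): Legates's coefficient of efficiency (absolute-error based)."""
    o, p = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    return 1.0 - np.sum(np.abs(p - o)) / np.sum(np.abs(o - o.mean()))


# Made-up infiltration rates (cm/min) for demonstration.
obs = np.array([0.50, 0.40, 0.35, 0.28, 0.22])
pred = np.array([0.48, 0.43, 0.33, 0.30, 0.20])

scores = {"R2": r_squared(obs, pred),
          "RMSE": rmse(obs, pred),
          "LCE": lce(obs, pred)}
```

A prediction equal to the observed mean at every point gives LCE = 0, so positive LCE values indicate a model that does better than that baseline.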

A Taylor diagram offers a comprehensive visual representation of model performance by combining RMSE, R², and standard deviation (SD) into a single polar plot (Taylor 2001). This tool's significant advantage is its ability to compare multiple model predictions on a single plot, providing a more holistic view than individual summary statistics. In this study, a Python script was developed to generate Taylor diagrams, facilitating rapid assessment of model predictions relative to observed values and enabling efficient comparison of multiple models.
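The statistics behind a Taylor diagram can be sketched without any plotting. Following Taylor (2001), the diagram places each model at a radius equal to its standard deviation and at an angle whose cosine is the correlation coefficient; the distance to the observation point is then the centred root mean square difference. The function name and the sample values below are hypothetical, and this is not the script used in the study.

```python
import numpy as np


def taylor_statistics(obs, pred):
    """Return (sd_obs, sd_pred, correlation, centred RMS difference)."""
    o, p = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    sd_o, sd_p = o.std(), p.std()           # population standard deviations
    r = np.corrcoef(o, p)[0, 1]
    crmsd = np.sqrt(np.mean(((p - p.mean()) - (o - o.mean())) ** 2))
    return sd_o, sd_p, r, crmsd


# Made-up infiltration rates for demonstration.
obs = np.array([0.50, 0.40, 0.35, 0.28, 0.22])
pred = np.array([0.48, 0.43, 0.33, 0.30, 0.20])

sd_o, sd_p, r, crmsd = taylor_statistics(obs, pred)

# Polar coordinates of the model point on the diagram.
theta = np.arccos(r)   # angle from the horizontal axis
radius = sd_p          # distance from the origin

# Law-of-cosines identity that makes the single-plot representation possible.
identity_rhs = sd_o ** 2 + sd_p ** 2 - 2.0 * sd_o * sd_p * r
```

The identity `crmsd**2 == identity_rhs` holds up to rounding, which is why the three statistics can share one polar plot.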

4.1. Predictor variables from lab-measured soil parameters

The soil parameters for different test points, including gravel (G), sand (S), silt and clay (SC), and moisture content (M), were measured at three different depths: at the surface, 50 cm, and 100 cm below the surface. The data obtained from the laboratory analysis revealed notable variability across different test points and depths (Table 3). The Gravel content (G) varied significantly, with surface levels ranging from 14% at H3 to 47.8% at K4, generally decreasing with soil depth. Sand content (S) showed notable variability, with the highest surface content at H3 (85.2%) and increasing with depth, particularly at K4 (92.5% at 100 cm). Silt and clay (SC) content was relatively low across all test points and depths, with surface levels ranging from 0.0% at K1, K2 and K3 to 4.1% at G3. Moisture content (M) varied widely, with surface levels from 2.1% at G1 to 14.2% at K3, generally increasing with depth, reaching up to 15.0% at H2 at 100 cm.

Site-specific observations showed that Hoskote Station (H), with elevations from 941 to 931 m and mixed-use land, exhibited a decrease in gravel content and an increase in moisture content with depth. Kavalur Station (K), in a hilly area with elevations from 724 to 716 m, showed high surface gravel content and low clay and silt, indicative of well-drained conditions. Gauribidanur Station (G), with elevations from 725 to 723 m and known for water scarcity, exhibited high sand content at depth and low moisture, highlighting its water scarcity issues. These variations in soil parameters across different sites and depths reflect the heterogeneity of soil properties in the study area, which is crucial for understanding soil behaviour and management in semi-arid regions.

Table 3

Soil parameters from laboratory analysis for training hybrid models
Test point	At surface				At 50 cm below surface				At 100 cm below surface			
	\(\:{G}^{a}\) (%)	\(\:{S}^{b}\) (%)	\(\:{SC}^{c}\) (%)	\(\:{M}^{d}\) (%)	\(\:G\) (%)	\(\:S\) (%)	\(\:SC\) (%)	\(\:M\) (%)	\(\:G\) (%)	\(\:S\) (%)	\(\:SC\) (%)	\(\:M\) (%)
\(\:H1\)	23.2	76.1	0.7	10.9	17	81.1	1.9	8.8	11.9	86.2	1.9	9.8
\(\:H2\)	17.7	81.2	1.1	7.8	12.9	84.9	2.2	12.1	6.2	90.2	3.6	15
\(\:H3\)	14	85.2	0.8	9.4	14.6	83.6	1.8	11.8	22.5	74.7	2.8	8.7
\(\:H4\)	42.9	55.9	1.2	9.2	23.3	76.3	0.4	11.5	14.7	82.5	2.8	9.6
\(\:H5\)	26.6	71.8	1.6	9.5	13.1	86.2	0.7	12.4	22.5	76.7	0.8	12.2
\(\:K1\)	39.8	60.2	0	5.8	15	84	1	10.6	11.8	82.2	6	4.2
\(\:K2\)	44.8	55.2	0	13.3	19.8	80	0.2	10.6	19.1	80.2	0.7	7.2
\(\:K3\)	32.4	67.6	0	14.2	26.7	73.1	0.2	16.4	25.8	74.2	0	8.6
\(\:K4\)	47.8	51.3	0.9	12	32	68	0	11.1	7.1	92.5	0.4	9
\(\:G1\)	25.5	72.9	1.6	2.1	48.5	51.1	0.4	4.5	12.8	85.7	1.5	4.6
\(\:G2\)	33.5	65.1	1.4	2.5	33.2	65.8	1	5.4	7.6	91.4	1	5.2
\(\:G3\)	25.4	70.5	4.1	8.2	14.6	83	2.4	7.8	5.9	89.6	4.5	3.6

These results align with other studies on soil variability in semi-arid regions. Yuan et al. (2024) emphasized the influence of land use and topography on soil properties, noting that urban and disturbed areas tend to exhibit higher variability in surface soil composition due to anthropogenic interference. Similarly, Bonanomi et al. (2024) and Qiu et al. (2001) highlighted the impact of elevation and land use on soil moisture content, with lower moisture levels typically observed in hilly and well-drained areas, and higher moisture retention in flatter, less disturbed regions. These studies support the observed patterns in the current analysis, where moisture content increases with depth and is higher in flatter areas like Hoskote and Gauribidanur compared to the hilly Kavalur. Moreover, the high sand content observed at deeper levels in Gauribidanur is consistent with findings by Ceballos et al. (2002), who reported similar trends in semi-arid regions, where deeper soil layers often exhibit higher sand fractions due to historical deposition processes. This high sand content correlates with the low moisture retention capacity, exacerbating water scarcity issues in these areas. In contrast, the mixed-use and forest land use at Kavalur contribute to lower moisture levels and higher gravel content, typical of well-drained, hilly terrains as described by Lado et al. (2004). The soil parameters obtained from this study are used as predictor variables to train and test hybrid ML and infiltration models to predict infiltration rates. Recent studies have demonstrated the efficacy of ML techniques in hydrological modeling, significantly enhancing prediction accuracy (Mosavi et al. 2018). Integrating these parameters into hybrid ML and hydrological models is therefore expected to improve the accuracy of infiltration rate predictions, which is crucial for effective water resource management and soil conservation in semi-arid regions.

The heterogeneity in soil properties observed in this study underscores the need for tailored approaches in model training and validation that account for site-specific characteristics. This approach aligns with Salvadore et al. (2015), who emphasized the importance of developing site-specific models in heterogeneous environments to improve predictive accuracy and management effectiveness.

4.2. Derived infiltration parameters for hybrid model development

4.2.1 Sorptivity and transmissivity from Philip's model

Fitting Philip's model to the observed infiltration rates \(\:f\left(t\right)\) against \(\:{t}^{-0.5}\) for the infiltration test sites provided two key parameters: sorptivity (\(\:s\)) and the transmissivity factor (\(\:A\)) (Fig. 5). The analysis showed that H1 had the highest sorptivity (13.378), indicating rapid initial infiltration due to its high sand content (76.1%) and low clay content (0.7%) (Table 3). K3 and K2 also exhibited high sorptivity values (12.720 and 4.962, respectively), corresponding to their relatively high sand content (67.6% and 55.2%, respectively). In contrast, G3 had the lowest sorptivity (1.497), aligning with its relatively high silt and clay content (4.1%) and high moisture content (8.2%). The transmissivity factors further supported these findings, with H1 and H3 displaying high values (0.529 and 1.124, respectively), indicating efficient water movement through the sandy soil. Conversely, G3's very low transmissivity factor (0.001) suggested that its soil structure hinders water movement, consistent with its higher silt and clay content.
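The regression described above can be sketched in a few lines. This is a minimal illustration, assuming the standard two-term Philip form \(\:f\left(t\right)=\frac{s}{2}{t}^{-0.5}+A\), so that the slope of \(\:f\left(t\right)\) against \(\:{t}^{-0.5}\) equals \(\:s/2\) and the intercept equals \(\:A\). The function names and the synthetic data are hypothetical and are not taken from the study.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_philip(times, rates):
    """Estimate sorptivity s and transmissivity factor A.

    Assumes f(t) = (s / 2) * t**-0.5 + A, so the slope of f(t)
    against t**-0.5 is s / 2 and the intercept is A.
    """
    x = [t ** -0.5 for t in times]
    slope, intercept = fit_line(x, rates)
    return 2.0 * slope, intercept


# Synthetic, noise-free data generated with s = 4.0 and A = 0.5
# (illustrative values only, not measurements from the test sites).
times = [1.0, 4.0, 9.0, 16.0, 25.0]
rates = [0.5 * 4.0 * t ** -0.5 + 0.5 for t in times]
s, A = fit_philip(times, rates)
```

With field data, scatter around the fitted line would indicate how well the two-term form describes a given site.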

The sandy soils observed at H1 and K3 generally exhibit higher infiltration rates (Fig. 2) due to larger pore spaces that facilitate rapid water movement. These findings align with observations by Allaire et al. (2009) and Manns et al. (2024), which indicate that coarse-textured soils like sands promote greater hydraulic conductivity due to their interconnected macropores. Conversely, soils with higher silt and clay content, along with higher moisture content (e.g., G3), show reduced infiltration rates. The smaller particle sizes and larger specific surface area in these soils create a more tortuous path for water, hindering hydraulic conductivity (Mantoglou and Gelhar 1987). Additionally, subsurface moisture content significantly affects a soil's ability to absorb and transmit water after initial infiltration. Dry soils, like those observed at H1 with a low moisture content (8.8% at 50 cm and 9.8% at 100 cm), exhibit higher sorptivity, which translates to a greater capacity for prolonged water uptake. Conversely, soils with higher moisture content (e.g., H2 with 12.1% at 50 cm and 15% at 100 cm) demonstrate lower sorptivity. This can be attributed to reduced capillary forces driving water absorption, as reported by Rosenbom et al. (2009).

4.2.2 Initial infiltration capacity and soil empirical constant from Horton’s model

The scatter plots shown in Fig. 6 illustrate the fit of Horton's infiltration model to the observed infiltration data from the various sites. Each plot shows the logarithmic difference in infiltration capacity (\(\:ln\left({i}_{o}-{i}_{c}\right)\)) against time, along with the derived parameters \(\:{K}_{h}\) (empirical soil constant or Horton's decay coefficient) and \(\:{i}_{o}-{i}_{c}\) (the initial infiltration potential). The parameters were estimated from linear fits to the logarithmic infiltration data: the slope gives the rate of decrease in infiltration capacity over time, and the intercept gives the initial infiltration potential.
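The linear fit to the logarithmic data can be sketched as follows. This is a minimal illustration, assuming the standard Horton form \(\:i\left(t\right)={i}_{c}+\left({i}_{o}-{i}_{c}\right){e}^{-{K}_{h}t}\), which linearizes to \(\:ln\left(i\left(t\right)-{i}_{c}\right)=ln\left({i}_{o}-{i}_{c}\right)-{K}_{h}t\). How the steady-state rate \(\:{i}_{c}\) was obtained is not stated in this section, so here it is simply supplied by the caller. The function names and the synthetic data are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_horton(times, rates, i_c):
    """Estimate the decay coefficient K_h and the potential (i_o - i_c).

    Only observations with a rate above i_c are used, because
    ln(i - i_c) is undefined otherwise.
    """
    points = [(t, math.log(r - i_c)) for t, r in zip(times, rates) if r > i_c]
    slope, intercept = fit_line([p[0] for p in points], [p[1] for p in points])
    return -slope, math.exp(intercept)


# Synthetic, noise-free data with K_h = 0.03, i_o - i_c = 1.5, i_c = 0.2
# (illustrative values only, not measurements from the test sites).
times = [0.0, 5.0, 10.0, 20.0, 40.0, 60.0]
rates = [0.2 + 1.5 * math.exp(-0.03 * t) for t in times]
k_h, initial_potential = fit_horton(times, rates, i_c=0.2)
```

A larger fitted \(\:{K}_{h}\) corresponds to a steeper line and therefore a faster approach to the steady-state rate.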

For the Hoskote sites, the \(\:{K}_{h}\) values ranged from 0.015 to 0.031, with \(\:{i}_{o}-{i}_{c}\) values ranging from 0.253 to 4.253. The Kavalur sites exhibited \(\:{K}_{h}\) values between 0.019 and 0.048, and \(\:{i}_{o}-{i}_{c}\) values from 0.418 to 2.094. The Gauribidanur sites demonstrated \(\:{K}_{h}\) values ranging from 0.025 to 0.039, with \(\:{i}_{o}-{i}_{c}\) values from 0.215 to 0.527. Here, the decay constants and initial infiltration potentials serve as primary indicators of soil infiltration performance. The varied decay constants across different sites underscore the influence of site-specific factors. For example, the higher decay constants observed at H4, K4, G1, and G3 suggest quicker saturation and reduced infiltration rates that decrease asymptotically to reach the basic or steady-state over time (Fig. 2), which could be indicative of higher silt and clay content in the upper profile of the soil (Table 3). These observations align with the findings of Adhikary et al. (2008), who reported a significant inverse relationship between silt and clay content and infiltration rates, demonstrating that increased silt and clay concentrations substantially impede soil permeability and water infiltration. Additionally, the initial infiltration potential offers valuable insights into the soil's initial response to water application, a key parameter for designing efficient irrigation and drainage management (Gjettermann et al. 1997). The high sand content observed in Hoskote corresponds to its high initial infiltration potential. This reinforces the notion that larger pore spaces characteristic of sandy soils promote faster infiltration. Sandy soils, dominated by large, interconnected macropores, typically exhibit higher initial infiltration potential due to their enhanced hydraulic conductivity (Zhang and Schaap 2019). 
This rapid infiltration minimizes surface runoff and promotes deep percolation, which in turn is crucial for groundwater recharge in the vadose zone (Shanafield and Cook 2014; He et al. 2024).

4.2.3 Empirical dimensionless constants from Kostiakov’s model

Fig. 7 uses scatter plots to show the application of Kostiakov's infiltration model to field observations of cumulative infiltration (\(\:F\left(t\right)\)) across the test sites. The x-axis depicts the natural logarithm of time (\(\:\text{l}\text{n}t\)), while the y-axis represents the natural logarithm of cumulative infiltration (\(\:\text{ln}F\left(t\right)\)). This logarithmic transformation linearizes the relationship between infiltration and time, which makes the infiltration pattern at each location easier to compare. Fitting the model to the data separates the initial phase of rapid infiltration, characterized by the parameter '\(\:a\)', from the subsequent decline in infiltration rate, represented by '\(\:b\)'. These parameters offer valuable information regarding the hydraulic properties of the soil at each site, reflecting the impact of soil texture and structure on water infiltration.
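The log-log fit described above can be sketched as follows. This is a minimal illustration, assuming the common Kostiakov form \(\:F\left(t\right)=a{t}^{b}\), so that \(\:\text{ln}F\left(t\right)=\text{ln}a+b\,\text{ln}t\) and a straight-line fit yields \(\:b\) as the slope and \(\:\text{ln}a\) as the intercept. The function names and the synthetic data are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_kostiakov(times, cumulative):
    """Estimate a and b in F(t) = a * t**b from a log-log regression."""
    x = [math.log(t) for t in times]
    y = [math.log(f) for f in cumulative]
    slope, intercept = fit_line(x, y)
    return math.exp(intercept), slope


# Synthetic, noise-free data generated with a = 2.0 and b = 0.7
# (illustrative values only, not measurements from the test sites).
times = [1.0, 2.0, 5.0, 10.0, 30.0, 60.0]
cumulative = [2.0 * t ** 0.7 for t in times]
a, b = fit_kostiakov(times, cumulative)
```

All times and cumulative depths must be strictly positive for the logarithms to be defined.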

H1 and K3 exhibited the highest \(\:a\) values (7.600 and 7.200, respectively), indicating rapid initial infiltration consistent with their high sand content (76.1% for H1 and 67.6% for K3), which promotes higher hydraulic conductivity. Conversely, G1 and G3 showed the lowest \(\:a\) values (0.353 and 0.800, respectively), reflecting slower initial infiltration rates due to higher silt and clay content, which hinder infiltration. The \(\:b\) values ranged between 0.579 (G3) and 0.843 (H2), indicating a moderate decrease in infiltration rate over time across all sites. The consistent \(\:b\) values suggest a uniform decrease in infiltration rate over time, reflecting similar soil behaviour in terms of infiltration rate reduction. This could be attributed to factors like pore clogging by fine particles, as reported by Mantoglou and Gelhar (1987) in their modeling of water flow in stratified soils. Additionally, the gradual saturation of the upper soil horizons, as infiltration progresses, can contribute to the observed decrease in the infiltration rate.
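The link between \(\:b\) and the decline in infiltration rate can be made explicit. Assuming the fitted relation has the common Kostiakov form \(\:F\left(t\right)=a{t}^{b}\) (the exact form is not restated in this section), differentiating with respect to time gives the infiltration rate:

\[ f(t) = \frac{dF}{dt} = a\,b\,t^{\,b-1} \]

For \(\:0<b<1\) the exponent \(\:b-1\) is negative, so \(\:f\left(t\right)\) decreases monotonically with time, and values of \(\:b\) closer to 1 correspond to a slower decline. Under this assumption, the fitted values for G3 (\(\:a=0.800\), \(\:b=0.579\)) give \(\:f\left(t\right)\approx\:0.463\,{t}^{-0.421}\).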

4.3. Evaluation of hybrid machine learning and hydrological models for infiltration rate prediction under varying scenarios

For a comprehensive and unbiased evaluation, two target test sites (H4 and K4) were randomly chosen. This approach minimizes potential site-specific biases and provides a more generalizable assessment of the models' performance. A diverse set of predictor variables was used, including both direct soil data and parameters derived from the established infiltration models. This broadens the information available to the models about the infiltration process and makes the evaluation less sensitive to the characteristics of any single site.

The following subsections detail each hybrid model's performance and its effectiveness for predicting infiltration rates.

4.3.1 Hybrid ANN and hydrological models keeping H4 as a target site

The evaluation of hybrid ANN and hydrological models for predicting infiltration rates with H4 as the target site shows that increasing the number of predictor sites enhances the models' ability to predict the observed infiltration rates. This trend is visible in Fig. 8a, where prediction curves for models with 7 and 10 predictor sites align more closely with the observed infiltration rates than those with only 3 predictor sites. Quantitatively, the ANN + Horton model demonstrates the most consistent improvement across the error metrics with increasing predictor sites, followed by the ANN + Philip and ANN + Kostiakov models.

For the ANN + Horton model, the \(\:{\text{R}}^{2}\) increases from 0.85 with 3 predictor sites to 0.94 with 10 predictor sites, indicating a marked enhancement in model performance (Fig. 8b). The \(\:\text{R}\text{M}\text{S}\text{E}\) decreases from 0.36 to 0.08 cm/min, highlighting reduced prediction errors. Additionally, the \(\:\text{L}\text{C}\text{E}\) improves from −0.42 to 0.76, reflecting enhanced model efficiency and predictive accuracy. The Taylor diagram corroborates these findings, showing that with more predictor sites, the model points move closer to the reference point, indicating better agreement with observed values (Fig. 8c). The ANN + Philip model maintains a high \(\:{\text{R}}^{2}\) (0.87 to 0.91) across the scenarios, while its \(\:\text{R}\text{M}\text{S}\text{E}\) decreases from 1.60 to 0.08 cm/min and its \(\:\text{L}\text{C}\text{E}\) shifts from −5.90 to 0.71, indicating improvements comparable to those of the ANN + Horton model. The ANN + Kostiakov model also shows a high \(\:{\text{R}}^{2}\), with \(\:\text{R}\text{M}\text{S}\text{E}\) decreasing from 1.34 to 0.09 cm/min and \(\:\text{L}\text{C}\text{E}\) improving from −4.41 to 0.67; its position on the Taylor diagram improves as well, though less consistently than that of the ANN + Horton model.
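For readers unfamiliar with the Taylor diagram (Taylor 2001), the sketch below computes the three statistics it summarizes: the standard deviations of the observed and predicted series, their correlation, and the centred root-mean-square difference. These are linked by \(\:{E}^{{\prime\:}2}={\sigma\:}_{o}^{2}+{\sigma\:}_{p}^{2}-2{\sigma\:}_{o}{\sigma\:}_{p}R\), which is why a model point nearer the reference point indicates better agreement. The function name and sample values are hypothetical and are not data from this study.

```python
import math


def taylor_statistics(observed, predicted):
    """Return the statistics plotted on a Taylor diagram.

    Uses population (divide-by-n) standard deviations so that the
    law-of-cosines identity between the statistics holds exactly.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    dev_o = [o - mean_o for o in observed]
    dev_p = [p - mean_p for p in predicted]
    std_o = math.sqrt(sum(d * d for d in dev_o) / n)
    std_p = math.sqrt(sum(d * d for d in dev_p) / n)
    corr = sum(a * b for a, b in zip(dev_o, dev_p)) / (n * std_o * std_p)
    # Centred RMS difference: RMSE after removing each series' mean.
    crmsd = math.sqrt(sum((b - a) ** 2 for a, b in zip(dev_o, dev_p)) / n)
    return {"std_obs": std_o, "std_pred": std_p, "corr": corr, "crmsd": crmsd}


# Illustrative infiltration rates in cm/min (not measurements from the study).
observed = [1.2, 0.8, 0.5, 0.4, 0.3]
predicted = [1.0, 0.9, 0.6, 0.4, 0.35]
stats = taylor_statistics(observed, predicted)
```

Because the means are removed, the diagram does not show a constant bias, which is one reason it is read alongside RMSE and LCE.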

4.3.2 Hybrid MF and hydrological models keeping H4 as a target station

The assessment of hybrid MF and hydrological models for predicting infiltration rates at the target station H4 is shown in Fig. 9. As with the hybrid ANN models, incorporating more predictor sites enhances the models' ability to predict the infiltration rates (Fig. 9a). The MF + Philip model excels, with \(\:{\text{R}}^{2}\) values of 0.89, 0.91, and 0.92 for the scenarios with 3, 7, and 10 predictor sites, respectively. As illustrated in Fig. 9b and c, increasing the number of predictor sites benefits this model, reducing errors (\(\:\text{R}\text{M}\text{S}\text{E}\) from 1.57 to 0.1 cm/min) and improving efficiency (\(\:\text{L}\text{C}\text{E}\) from −5.87 to 0.62), as supported by the Taylor diagram's convergence towards the reference point. The MF + Horton model exhibits a more nuanced behaviour. While its \(\:{\text{R}}^{2}\) starts well (0.87 with 3 sites), it dips slightly (0.84 with 10 sites) with more predictors, suggesting potential overfitting. However, this is outweighed by a substantial decrease in errors (\(\:\text{R}\text{M}\text{S}\text{E}\) from 1.0 to 0.19 cm/min) and a clear improvement in efficiency (\(\:\text{L}\text{C}\text{E}\) from −3.71 to 0.47) as the number of sites increases. The Taylor diagram aligns with this, showing the model better matching observations with more predictors. Finally, the MF + Kostiakov model also benefits from additional predictor sites. The \(\:{\text{R}}^{2}\) increases from 0.88 to 0.9, indicating that more of the data's variability is captured. Similarly, the \(\:\text{R}\text{M}\text{S}\text{E}\) decreases (from 1.59 to 0.12 cm/min), highlighting a considerable error reduction. The \(\:\text{L}\text{C}\text{E}\) improves markedly as well (from −6.31 to 0.69), demonstrating enhanced model efficiency and accuracy. As with the other models, the Taylor diagram confirms this positive trend.

4.3.3 Hybrid ANN and hydrological models keeping K4 as a target station

Analysing predicted infiltration rates at K4 reveals that incorporating more predictor sites (7 and 10) improves the accuracy of the hybrid ANN and hydrological models, as evident in Fig. 10a, where the prediction curves align more closely with the observed data (particularly for ANN + Philip, whose curves progressively approach the observed rates). This is further supported by the bar plot of error measures (Fig. 10b): ANN + Philip exhibits an increase in \(\:{\text{R}}^{2}\) (0.72 to 0.80), a drastic reduction in \(\:\text{R}\text{M}\text{S}\text{E}\) (1.05 to 0.11 cm/min), and an improved \(\:\text{L}\text{C}\text{E}\) (−5.38 to 0.57), indicating enhanced efficiency and accuracy. Improvements in the ANN + Horton and ANN + Kostiakov models are less pronounced: some metrics improve with more predictor sites, while others deteriorate. For instance, ANN + Horton's \(\:{\text{R}}^{2}\) decreases from 0.59 (3 sites) to 0.54 (10 sites), its \(\:\text{R}\text{M}\text{S}\text{E}\) increases from 0.16 to 0.33 cm/min, and its \(\:\text{L}\text{C}\text{E}\) worsens from −0.05 to −0.92. Similarly, ANN + Kostiakov's \(\:{\text{R}}^{2}\) declines from 0.72 to 0.67, although its \(\:\text{R}\text{M}\text{S}\text{E}\) decreases from 0.34 to 0.11 cm/min and its \(\:\text{L}\text{C}\text{E}\) improves from −0.49 to 0.63. The Taylor diagram (Fig. 10c) reinforces this, showing that model points generally move closer to the reference point with more data, signifying better agreement with observed values. Overall, for Kavalur site 4, increasing predictor sites generally enhances the model performance, with ANN + Philip exhibiting the most significant improvements.

4.3.4 Hybrid MF and hydrological models keeping K4 as a target station

Fig. 11 illustrates the prediction accuracy of hybrid MF and hydrological models for infiltration rates at K4. The results reveal a clear benefit for the MF + Philip model, which exhibits a substantial decrease in \(\:\text{R}\text{M}\text{S}\text{E}\) from 0.87 to 0.20 cm/min, indicating a marked reduction in prediction error. Additionally, the \(\:\text{L}\text{C}\text{E}\) values improve from −3.58 to 0.13, reflecting enhanced model efficiency and accuracy. While \(\:{\text{R}}^{2}\) values show some variability, the overall trend suggests improvement for MF + Philip with more predictor sites.

The \(\:{\text{R}}^{2}\) values exhibit inconsistent changes in both the MF + Horton and MF + Kostiakov models. \(\:\text{R}\text{M}\text{S}\text{E}\) generally improves for MF + Horton (0.31 to 0.22 cm/min), but with a slight increase at 10 sites. \(\:\text{L}\text{C}\text{E}\) shows mixed improvements, suggesting some performance gains but inconsistencies. Similarly, MF + Kostiakov's \(\:{\text{R}}^{2}\) shows a slight decrease, while \(\:\text{R}\text{M}\text{S}\text{E}\) improves (0.51 to 0.47 cm/min). \(\:\text{L}\text{C}\text{E}\) initially improves but worsens slightly with more data. The Taylor diagram (Fig. 11c) reinforces these findings. MF + Philip displays the most consistent movement towards the reference point, indicating improved agreement with observed values. Conversely, the MF + Horton and MF + Kostiakov models show less consistent trends.

The study aims to predict infiltration rates at sites using minimal data, such as soil parameters (% gravel, % sand, % silt + clay, % moisture), by leveraging datasets from accessible locations. This approach is particularly valuable for locations where direct measurement is impractical. However, some models (e.g., ANN + Horton and MF + Horton) exhibit mixed results with increasing predictor sites. This could be due to overfitting and to the inherent complexity of soil-water interactions not being fully captured by the models. To evaluate model performance comprehensively, we employed several complementary error measures (R², RMSE, and LCE).
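The three error measures can be sketched as below. Two definitions are assumptions, because neither is restated in this section. R² is taken as the squared Pearson correlation, which is consistent with the high R² values reported alongside negative LCE values. LCE is taken as the Legates and McCabe (1999) coefficient of efficiency, \(\:1-\sum\:\left|O-P\right|/\sum\:\left|O-\stackrel{-}{O}\right|\). The function names and sample values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error, in the units of the data."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Squared Pearson correlation (assumed definition of R^2)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


def lce(observed, predicted):
    """Legates-McCabe coefficient of efficiency (assumed definition of LCE).

    Equals 1 for a perfect fit and is negative when predictions are
    worse, in absolute error, than the observed mean.
    """
    mean_o = sum(observed) / len(observed)
    abs_err = sum(abs(o - p) for o, p in zip(observed, predicted))
    abs_dev = sum(abs(o - mean_o) for o in observed)
    return 1.0 - abs_err / abs_dev


# Illustrative values only: predictions carry a constant bias of 0.5.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
scores = (r_squared(observed, predicted), rmse(observed, predicted),
          lce(observed, predicted))  # (1.0, 0.5, 0.5)
```

The example shows why the measures are read together: a constant bias leaves the correlation-based R² at 1.0 while RMSE and LCE both register the error, mirroring scenarios in which R² stays high but LCE is strongly negative.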

Previous research has demonstrated significant improvements in predictive accuracy by integrating ML techniques with hydrological models. This study aligns with the trend in data-driven hydrological modeling, emphasizing the importance of combining physical process-based empirical models with ML approaches for enhanced prediction. As shown in Fig. 12, standalone infiltration models (Horton's, Philip's, Kostiakov's) exhibited limited success in predicting infiltration rates compared to the superior performance achieved by the hybrid models (particularly ANN + Philip) at both target sites. Althoff et al. (2021), Xu et al. (2024) and Young et al. (2017) also highlighted that hybrid models can effectively capture complex hydrological processes that traditional models cannot. Similarly, Zubelzu et al. (2024) underscored the benefits of ML algorithms in improving the accuracy of hydrological predictions, especially in data-scarce regions. Furthermore, the adaptability of ML-based hybrid models to the inherent variability and uncertainties within hydrological data is particularly valuable for predicting infiltration rates across diverse soil and climatic conditions, where conventional models often struggle (Parchami-Araghi et al. 2013).

Among the evaluated models, hybrids incorporating Philip's equation (ANN + Philip and MF + Philip) demonstrated superior performance in predicting infiltration rates at remote sites. Philip's model, a well-established hydrological approach, describes infiltration as a function of time and initial soil moisture. Its robust mathematical foundation provides a solid representation of infiltration processes, making it a strong foundation for integration with ML models. Furthermore, the superior performance of these Philip's model-based hybrids can be attributed to their ability to leverage the strengths of both approaches. The combination of an ANN or MF with Philip's model enhances predictive accuracy. While Philip's model offers a strong foundation, ANN and MF can capture complex, non-linear relationships in the data that the physical model alone might miss (Sy 2006). This synergy between physical process understanding and data-driven techniques ultimately leads to improved prediction capabilities.

In contrast, the Horton and Kostiakov models, while useful, have limitations that might explain their relatively poorer performance when combined with ML models. Both Horton and Kostiakov models are primarily empirical and might not generalize well across different environments, which can limit the hybrid models' ability to capture the full complexity of infiltration processes (Parchami-Araghi et al. 2013). To overcome these limitations, the integration of ML techniques can be tailored to enhance the empirical models' flexibility and generalizability. By using ANN and MF, these hybrid models can identify and adjust for site-specific factors and non-linear relationships that the empirical models alone may miss. Additionally, incorporating techniques such as cross-validation and sensitivity analysis can help quantify and reduce uncertainties, ensuring more robust and reliable predictions across diverse environments. This combined approach not only leverages the empirical models' simplicity and ease of use but also enriches them with the adaptability and precision of machine learning, leading to better performance in various hydrological contexts.

This study successfully integrated traditional hydrological models with ML techniques to predict infiltration rates in semi-arid regions of southern India. Hybrid models, particularly those based on Philip's equation, outperformed standalone traditional methods by leveraging both physical understanding and ML's predictive power. For instance, with 10 predictor sites the ANN + Philip model achieved R², RMSE, and LCE values of 0.91, 0.08 cm/min, and 0.71 at target site H4 and 0.80, 0.11 cm/min, and 0.57 at K4. A robust dataset with detailed soil characteristics from various depths and locations was crucial for model training. The study further demonstrates the value of spatially diverse data, as models trained with data from more sites exhibited higher performance. Accounting for the inherent variability in semi-arid soils was essential for robust predictions. Importantly, the hybrid models can predict infiltration rates at remote sites with minimal data, making them a valuable tool for areas with limited measurement capabilities. By integrating theory-guided data science with physics-informed ML, these hybrid models offer interpretable and accurate predictions, overcoming a major limitation of traditional approaches in hydrology. This novel approach has the potential to significantly improve water resource management and soil conservation in semi-arid and data-scarce regions. However, the study observed that hybrid models struggled to predict chaotic patterns, such as sudden dips and peaks in infiltration rates, especially at the start of the process. These fluctuations, likely due to initial soil conditions, water repellency, or micro-variations in soil texture, were not well captured. 
To overcome these limitations and address the concerns regarding the integration of empirical models with ML, future research should aim to improve the models' ability to handle these chaotic changes by incorporating more detailed data, utilizing advanced ML techniques, and enhancing training algorithms. Specifically, integrating cross-validation and sensitivity analysis can help quantify and reduce uncertainties, ensuring more robust and reliable predictions. Furthermore, exploring advanced ML techniques such as recurrent neural networks (RNNs) or long short-term memory (LSTM) networks could better capture temporal dynamics and sudden variations in infiltration rates. Additionally, using high-resolution temporal and spatial data can help in understanding and modeling initial soil conditions and micro-variations more accurately. This combined approach not only leverages the empirical models' simplicity but also enriches them with the adaptability and precision of machine learning, leading to better performance in various hydrological contexts. By addressing these challenges, the prediction accuracy and reliability of hybrid models can be further enhanced across diverse conditions.

Acknowledgements

The authors would like to express their sincere gratitude to the Director of the Indian Institute of Astrophysics, Government of India, for granting access to the campus to perform the infiltration tests. This support was instrumental in the successful completion of our study.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability statement

The data and code used in this study will be made available upon request.

Adhikary PP, Chakraborty D, Kalra N, et al (2008) Pedotransfer functions for predicting the hydraulic properties of Indian soils. Aust J Soil Res 46:476–484. https://doi.org/10.1071/SR07042
Ahmed AA, Sayed S, Abdoulhalik A, et al (2024) Applications of machine learning to water resources management: A review of present status and future opportunities. J Clean Prod 441:140715. https://doi.org/10.1016/j.jclepro.2024.140715
Allaire SE, Roulier S, Cessna AJ (2009) Quantifying preferential flow in soils: A review of different techniques. J Hydrol 378:179–204. https://doi.org/10.1016/j.jhydrol.2009.08.013
Althoff D, Bazame HC, Nascimento JG (2021) Untangling hybrid hydrological models with explainable artificial intelligence. H2Open J 4:13–28. https://doi.org/10.2166/H2OJ.2021.066
Arya LM, Leij FJ, Shouse PJ, van Genuchten MT (1999) Relationship between the Hydraulic Conductivity Function and the Particle‐Size Distribution. Soil Sci Soc Am J 63:1063–1070. https://doi.org/10.2136/sssaj1999.6351063x
ASCE Task Committee on Application of Artificial Neural Networks in Hydrology (2000a) Artificial Neural Networks in Hydrology. I: Preliminary Concepts. J Hydrol Eng 5:115–123. https://doi.org/10.1061/(ASCE)1084-0699(2000)5:2(115)
ASCE Task Committee on Application of Artificial Neural Networks in Hydrology (2000b) Artificial Neural Networks in Hydrology. II: Hydrologic Applications. J Hydrol Eng 5:124–137. https://doi.org/10.1061/(ASCE)1084-0699(2000)5:2(124)
ASTM Standard D2216–19 (2019) Standard Test Methods for Laboratory Determination of Water (Moisture) Content of Soil and Rock by Mass. West Conshohocken, PA
Bikše J, Retike I, Haaf E, Kalvāns A (2023) Assessing automated gap imputation of regional scale groundwater level data sets with typical gap patterns. J Hydrol 620:129424. https://doi.org/10.1016/j.jhydrol.2023.129424
Bonanomi G, Motti R, Abd-ElGawad AM, Idbella M (2024) Soil water repellency along elevation gradients: The role of climate, land use and soil chemistry. Geoderma 443:116847. https://doi.org/10.1016/j.geoderma.2024.116847
Breiman L (2001) Random forests. Mach Learn 45:5–32. https://doi.org/10.1023/A:1010933404324
Ceballos A, Martínez-Fernández J, Santos F, Alonso P (2002) Soil-water behaviour of sandy soils under semi-arid conditions in the Duero Basin (Spain). J Arid Environ 51:501–519. https://doi.org/10.1006/jare.2002.0973
Chen G, Hou J, Liu Y, et al (2024) Urban inundation rapid prediction method based on multi-machine learning algorithm and rain pattern analysis. J Hydrol 633:131059. https://doi.org/10.1016/j.jhydrol.2024.131059
Christiaens K, Feyen J (2001) Analysis of uncertainties associated with different methods to determine soil hydraulic properties and their propagation in the distributed hydrological MIKE SHE model. J Hydrol 246:63–81. https://doi.org/10.1016/S0022-1694(01)00345-6
Dexter AR, Richard G (2009) The saturated hydraulic conductivity of soils with n-modal pore size distributions. Geoderma 154:76–85. https://doi.org/10.1016/j.geoderma.2009.09.015
Dubey SR, Singh SK, Chaudhuri BB (2022) Activation functions in deep learning: A comprehensive survey and benchmark. Neurocomputing 503:92–108. https://doi.org/10.1016/j.neucom.2022.06.111
Gjettermann B, Nielsen KL, Petersen CT, et al (1997) Preferential flow in sandy loam soils as affected by irrigation intensity. Soil Technol 11:139–152. https://doi.org/10.1016/S0933-3630(97)00001-9
Haverkamp R, Kutilek M, Parlange J-Y, et al (1988) Infiltration under ponded conditions: 2. Infiltration equations tested for parameter time-dependence and predictive use. Soil Sci 145:317–329
He Y, Wang Y, Liu Y, et al (2024) Focus on the nonlinear infiltration process in deep vadose zone. Earth-Science Rev 252:104719. https://doi.org/10.1016/j.earscirev.2024.104719
Heber Green W, Ampt GA (1911) Studies on Soil Physics. J Agric Sci 4:1–24. https://doi.org/10.1017/S0021859600001441
Holtan HN (1961) A concept for infiltration estimates in watershed engineering, 41st edn. Agricultural Research Service, US Department of Agriculture
Horton RE (1941) An Approach Toward a Physical Interpretation of Infiltration‐Capacity. Soil Sci Soc Am J 5:399–417. https://doi.org/10.2136/sssaj1941.036159950005000C0075x
Ispirova G, Eftimov T, Seljak BK (2020) Evaluating missing value imputation methods for food composition databases. Food Chem Toxicol 141:111368. https://doi.org/10.1016/j.fct.2020.111368
Jia Y, Culver TB (2006) Bootstrapped artificial neural networks for synthetic flow generation with a small data sample. J Hydrol 331:580–590. https://doi.org/10.1016/j.jhydrol.2006.06.005
Kostiakov AN (1932) On the dynamics of the coefficient of water-percolation in soils and on the necessity of studying it from a dynamic point of view for purposes of amelioration. Trans 6th Cong Int Soil Sci Russ Part A 17–21
Lado M, Paz A, Ben-Hur M (2004) Organic Matter and Aggregate‐Size Interactions in Saturated Hydraulic Conductivity. Soil Sci Soc Am J 68:234–242. https://doi.org/10.2136/sssaj2004.2340
Legates DR, McCabe GJ (1999) Evaluating the use of “goodness-of-fit” measures in hydrologic and hydroclimatic model validation. Water Resour Res 35:233–241. https://doi.org/10.1029/1998WR900018
Mahapatra S, Jha MK, Biswal S, Senapati D (2020) Assessing Variability of Infiltration Characteristics and Reliability of Infiltration Models in a Tropical Sub-humid Region of India. Sci Rep 10:1–18. https://doi.org/10.1038/s41598-020-58333-8
Manns HR, Jiang Y, Parkin G (2024) Soil pores in preferential flow terminology and permeability equations. Vadose Zo J 1–12. https://doi.org/10.1002/vzj2.20365
Mantoglou A, Gelhar LW (1987) Effective hydraulic conductivities of transient unsaturated flow in stratified soils. Water Resour Res 23:57–67. https://doi.org/10.1029/WR023i001p00057
Mattar MA, Alazba AA, Zin El-Abedin TK (2015) Forecasting furrow irrigation infiltration using artificial neural networks. Agric Water Manag 148:63–71. https://doi.org/10.1016/j.agwat.2014.09.015
Mosavi A, Ozturk P, Chau KW (2018) Flood prediction using machine learning models: Literature review. Water (Switzerland) 10:1–40. https://doi.org/10.3390/w10111536
Naranjo-Fernández N, Guardiola-Albert C, Aguilera H, et al (2020) Clustering groundwater level time series of the exploited Almonte-Marismas aquifer in southwest Spain. Water (Switzerland) 12:1–20. https://doi.org/10.3390/W12041063
Overton D (1964) Mathematical refinement of an infiltration equation for watershed engineering. Agricultural Research Service, US Department of Agriculture
Parchami-Araghi F, Mirlatifi SM, Ghorbani Dashtaki S, Mahdian MH (2013) Point estimation of soil water infiltration process using Artificial Neural Networks for some calcareous soils. J Hydrol 481:35–47. https://doi.org/10.1016/j.jhydrol.2012.12.007
Philip JR (1969) Theory of Infiltration. In: Advances in Hydroscience. Academic Press, pp 215–296
Qiu Y, Fu B, Wang J, Chen L (2001) Soil moisture variation in relation to topography and land use in a hillslope catchment of the Loess Plateau, China. J Hydrol 240:243–263. https://doi.org/10.1016/S0022-1694(00)00362-0
Richards LA (1931) Capillary conduction of liquids through porous mediums. J Appl Phys 1:318–333. https://doi.org/10.1063/1.1745010
Rosenbom AE, Therrien R, Refsgaard JC, et al (2009) Numerical analysis of water and solute transport in variably-saturated fractured clayey till. J Contam Hydrol 104:137–152. https://doi.org/10.1016/j.jconhyd.2008.09.001
Salvadore E, Bronders J, Batelaan O (2015) Hydrological modelling of urbanized catchments: A review and future directions. J Hydrol 529:62–81. https://doi.org/10.1016/j.jhydrol.2015.06.028
Sayari S, Mahdavi-Meymand A, Zounemat-Kermani M (2021) Irrigation water infiltration modeling using machine learning. Comput Electron Agric 180:105921. https://doi.org/10.1016/j.compag.2020.105921
Shanafield M, Cook PG (2014) Transmission losses, infiltration and groundwater recharge through ephemeral and intermittent streambeds: A review of applied methods. J Hydrol 511:518–529. https://doi.org/10.1016/j.jhydrol.2014.01.068
Sidhu RK, Kumar R, Rana PS (2020) Machine learning based crop water demand forecasting using minimum climatological data. Multimed Tools Appl 79:13109–13124. https://doi.org/10.1007/s11042-019-08533-w
Sihag P, Singh B, Sepah Vand A, Mehdipour V (2020) Modeling the infiltration process with soft computing techniques. ISH J Hydraul Eng 26:138–152. https://doi.org/10.1080/09715010.2018.1464408
Sihag P, Singh VP, Angelaki A, et al (2019) Modelling of infiltration using artificial intelligence techniques in semi-arid Iran. Hydrol Sci J 64:1647–1658. https://doi.org/10.1080/02626667.2019.1659965
Singh B, Sihag P, Singh K (2017) Modelling of impact of water quality on infiltration rate of soil by random forest regression. Model Earth Syst Environ 3:999–1004. https://doi.org/10.1007/s40808-017-0347-3
Smith RE (1972) The infiltration envelope: Results from a theoretical infiltrometer. J Hydrol 17:1–22. https://doi.org/10.1016/0022-1694(72)90063-7
Smith RE, Parlange J-Y (1978) A parameter-efficient hydrologic infiltration model. Water Resour Res 14:533–538. https://doi.org/10.1029/WR014i003p00533
Stekhoven DJ, Bühlmann P (2012) MissForest: non-parametric missing value imputation for mixed-type data. Bioinformatics 28:112–118. https://doi.org/10.1093/bioinformatics/btr597
Sy NL (2006) Modelling the infiltration process with a multi-layer perceptron artificial neural network. Hydrol Sci J 51:3–20. https://doi.org/10.1623/hysj.51.1.3
Taylor KE (2001) Summarizing multiple aspects of model performance in a single diagram. J Geophys Res Atmos 106:7183–7192. https://doi.org/10.1029/2000JD900719
Teshome FT, Bayabil HK, Schaffer B, et al (2024) Simulating soil hydrologic dynamics using crop growth and machine learning models. Comput Electron Agric 224:109186. https://doi.org/10.1016/j.compag.2024.109186
Wang Q, Shao M, Horton R (1999) Modified Green and Ampt models for layered soil infiltration and muddy water infiltration. Soil Sci 164:445–453
Xu W, Chen J, Corzo G, et al (2024) Coupling Deep Learning and Physically Based Hydrological Models for Monthly Streamflow Predictions. Water Resour Res 60:1–25. https://doi.org/10.1029/2023WR035618
Young CC, Liu WC, Wu MC (2017) A physically based and machine learning hybrid approach for accurate rainfall-runoff modeling during extreme typhoon events. Appl Soft Comput J 53:205–216. https://doi.org/10.1016/j.asoc.2016.12.052
Yuan J, Yao Y, Guan Y, et al (2024) Effects of land use patterns on soil properties and nitrous oxide flux on a semi-arid environmental conditions of Loess Plateau China. Glob Ecol Conserv 51:e02899. https://doi.org/10.1016/j.gecco.2024.e02899
Zhang Y, Schaap MG (2019) Estimation of saturated hydraulic conductivity with pedotransfer functions: A review. J Hydrol 575:1011–1030. https://doi.org/10.1016/j.jhydrol.2019.05.058
Zubelzu S, Ghalkha A, Ben Issaid C, et al (2024) Coupling machine learning and physical modelling for predicting runoff at catchment scale. J Environ Manage 354:120404. https://doi.org/10.1016/j.jenvman.2024.120404

Supplementarymaterials.docx

Editorial decision: Major revisions
16 Oct, 2024
Reviewers agreed at journal
16 Aug, 2024
Reviewers invited by journal
16 Aug, 2024
Editor invited by journal
15 Aug, 2024
Editor assigned by journal
09 Aug, 2024
First submitted to journal
06 Aug, 2024
