Database
With the help of various methods and techniques, groundwater potential mapping with high reliability and accuracy could be built (Moghaddam et al. 2015). In the current study, spatial data and materials were prepared, including geology map and hydrogeology map. The digital elevation model (DEM) with a spatial resolution of 30×30 m was used to extract a set of influence factors in the study area. All impact factors were processed in the ArcGIS 10.2 software. In order to evaluate the potential of regional groundwater, the whole area was divided into 400501 grids with a size of 100×100 m based on the prediction accuracy of models and the geological conditions of this area.
In terms of groundwater data, a total of 245 wells were identified based on field survey by using a handheld GPS and historical hydrogeological materials. For this analysis, these wells were randomly divided into two groups, of which 172 wells (70%) were used for training datasets and 73 wells (30%) for validation (Fig. 1).
Selection And Analysis Of Influence Factors
The presence of groundwater is closely related to various environmental geological factors (Cantonati et al. 2016). Through quantifying environmental geological parameters to realize the analysis of groundwater potential in this area, and it is concluded that there is a functional relationship between factors and groundwater, to evaluate the presence or absence of groundwater by using the machine learning method. There are no fixed guidelines in selecting of the groundwater potential influence factors (Oh et al. 2011).
Based on the results of the field geological survey, this study analyzed the relevant geological materials and previous literature. Thirteen influence factors were selected to predict the spatial distribution of groundwater, including elevation, slope, aspect, plan curvature, profile curvature, topographic wetness index (TWI), drainage density, distance to rivers, distance to faults, lithology, soil type, land use, normalized difference vegetation index (NDVI) (Fig. 3a-m). The elevation was often used as an important factor in finding the presence of groundwater (Wang et al. 2015). It was extracted from the DEM to show the undulations of the terrain. This study divides elevation into 8 categories according to an equal-interval classification scheme, including: <20 m, 20-40, 40-60, 60-80, 80-100, 100-120, 120-140, and >140 m.
Based on the side slope unit of terrain segmentation, the slope was adopted, which can control the flow of groundwater directly. The slope unit was extracted with 30 m resolution from DEM as the basic data, and the hydrological analysis module in ArcGIS 10.2 was used to extract the regional slope. The value of the slope was divided into 7 categories according to the natural breakpoint method, including: <0.5°, 0.5-1°, 1-1.5°, 1.5-2°, 2-5°, 5-10°, and >10°. To a certain extent, the slope can indicate the direction of groundwater flow (Naghibi et al. 2016). The slope aspect was the inclination direction of slope, which controls the flow of precipitation, wind direction and plant photosynthesis (Zabihi et al. 2016). Compared with shady slopes, sunny slopes had longer sunshine time, and their surface had stronger weathering and evaporation. The aspect was extracted from the DEM and was divided into 9 categories according to the different directions, including: Flat, North, Northeast, East, Southeast, South, Southwest, West, and Northwest.
Plan curvature was the change rate of slope at any point on the surface, which was formed by the intersection of a horizontal plane and the surface (Arabameri et al. 2020). This morphological feature will affect the convergence and divergence of surface runoff, and can reflect the degree of contour curvature. After completing the aspect extraction in ArcGIS 10.2, the slope was extracted from this aspect and was shown as a plan curvature map. The plan curvature was divided into 7 categories using an equal-interval classification scheme, including: <20, 20-30, 30-40, 40-50, 50-60, 60-70, and >70. Profile curvature represents the change rate of the surface slope at any given point. Like the plan curvature, profile curvature map was generated by calculating the slope of DEM twice in succession, and the values were reclassified into 7 categories, including: <0.4, 0.4-0.8, 0.8-1.2, 1.2-2, 2-5, 5-10, and >10.
The topographic wetness index (TWI) is a physical indicator of the influence of regional topography on groundwater flow direction and convergence (Moore et al. 1991). This index is a function of the slope and upstream contribution area. It is defined as Eq. (1) follows:
$$TWI=\text{l}\text{n}\left(\frac{\alpha }{tan\beta }\right)$$ 1
where α is the upstream area, and β is the slope o0f each point. According to the different TWI values, five classes were created: <6, 6-8, 8-10, 10-12, and >12.
Drainage density is the total length of rivers per unit regional area. It is closely related to the precipitation, difference elevation, and moisture retention capacity of soil. The drainage density in this area is binned into four classes: <0.15, 0.15-0.3, 0.3-0.45, and >0.45 km/km2. Distance to rivers is a key factor affecting the potential of groundwater. Rivers are an important source of groundwater recharge. This area is mainly covered with Chihe River systems. The distance values between wells and rivers have an important influence to this research. Based on the hydrological conditions of this area, the buffer zones on the borderlands of the river system were divided into 5 classes: <100, 100-200, 200-300, 300-400, and >400 m.
Faults can control the flow and storage of regional groundwater. Regional faults were extracted from the geological map and were reclassified into five groups based on distance: <500, 500-1000, 1000-1500, 1500-2000, and >2000 m. Lithology of aquifer is the basis of groundwater flow and storage, and it determines the porosity and permeability of the aquifer (Ayazi et al. 2010). Lithology categories were extracted from the regional geological map. Fourteen types of lithology were divided, as shown in Table 1.
Table 1
Regional lithology information of study area
Number
|
Unit
|
Lithology
|
Period
|
1
|
K2x
|
Sandstone, argillaceous, siltstone
|
Cretaceous
|
2
|
K2z
|
Sandstone, glutenite
|
3
|
K1g
|
Sandstone, glutenite, mixed shale
|
4
|
E3s
|
Mudstone, siltstone, glutenite
|
Paleogene
|
5
|
β6
|
Basalt
|
Cenozoic
|
6
|
E2t
|
Sandstone, glutenite, mudstone, mixed basalt
|
Paleogene
|
7
|
Ar2xz
|
Gneiss, feldspar quartzite, schist
|
Archaeozoic Eon
|
8
|
Є 2
|
Limestone, dolomites
|
Cambrian
|
9
|
O1
|
Medium-thick limestone, dolomites
|
Ordovician
|
10
|
Zz1z
|
Banded siliceous dolomite, marl
|
Ediacaran
|
11
|
δO5
|
Quartz diorite
|
Late Jurassic-Cretaceous
|
12
|
Qh
|
Clay rock, siltstone, glutenite, mudstone
|
Holocene
|
13
|
Xγ2
|
Moyite
|
Early Jurassic-Cretaceous
|
14
|
Qp
|
Clay rock, sandstone, mudstone
|
Pleistocene
|
Soil type has a vital influence on the infiltration of surface water and recharge of groundwater. It determines the distribution of groundwater potential to a certain extent (Razandi et al. 2015). This factor was roughly divided into three groups from the regional geological map according to different types, including rock outcrops, sandy soil, and cohesive soil. Different land-use types affect the quantity of groundwater resources and quality of groundwater. The relationship between human activities and natural systems was revealed and the distribution of groundwater potential was reflected (Chen et al. 2018). The land use map was classified into water bodies, residential areas, forests, and agriculture. For the normalized difference vegetation index (NDVI), it refers to the percentage of the vertical projection area of vegetation on the ground to the total statistical area (Pourghasemi et al. 2013). The NDVI map was calculated and mapped by the ArcGIS software. Eight groups were reclassified based on a natural break method: <0.125, 0.125-0.25, 0.25-0.375, 0.375-0.5, 0.5-0.625, 0.625-0.75, 0.75-0.875, and >0.875.
Multicollinearity Analysis Of Factors
A multicollinearity analysis of factors was performed to select factors, which have a significant relationship with groundwater distribution. Multicollinearity refers to a certain extent of linear correlation between independent factors, which will affect the contribution of factors to the model (Pourtaghi and Pourghasemi 2014). If there was collinearity between two factors, it was difficult to distinguish the effect of each factor on the results, and the regression model lacks stability. Then, two statistical parameters to determine the multicollinearity problem between each factor were proposed, namely tolerance (TOL) and variance inflation factor (VIF). The values of TOL>0.1 or VIF<10 suggest independence between each factor.
Description Of Models
Logistic regression (LR)
A nonlinear dynamic response relationship between a dependent variable and several corresponding independent variables was established by training and testing the known samples in the logistic regression (LR) model, and then predicts or evaluates the probability of an event in unknown samples (Lombardo et al. 2018). When evaluating groundwater potential, each influencing factor was taken as an independent variable, and the presence or absence of groundwater was taken as a dependent variable. In this study, P is the probability of the presence of groundwater with a range [0, 1], 1-P is the probability of the absence of groundwater, P/(1-P) is the ratio of probability, which is often taken as its natural logarithm. The LR model equation is expressed as follow:
$$\text{l}\text{n}\frac{P}{1-P}={\beta }_{0}+{\beta }_{1}{X}_{1}+{\beta }_{2}{X}_{2}+\cdots +{\beta }_{n}{X}_{n}$$ 2
where, X1, X2, …, Xn represents the independent factors;β0, β1, …, βn represents the regression coefficients. The probability P of groundwater potential can be obtained from Eq. 3:
$$P=\frac{{e}^{{\beta }_{0}+{\beta }_{1}{X}_{1}+{\beta }_{2}{X}_{2}+\cdots +{\beta }_{n}{X}_{n}}}{1+{e}^{{\beta }_{0}+{\beta }_{1}{X}_{1}+{\beta }_{2}{X}_{2}+\cdots +{\beta }_{n}{X}_{n}}}$$ 3
In order to calculate the groundwater potential index of the Chihe River basin by using the LR model, 245 known well locations (with potential index 1) and 245 randomly selected non-well location units (with potential index 0) were randomly divided into two parts, of which 70% was used for model training and the others was used for model validation. Groundwater potential indexes of each grid unit calculated by the LR model ranges from 0.001 to 0.999, the high value with greater the groundwater potential. The data processing of the LR model was completed by SPSS 22.0 statistical software, and the P values of groundwater were imported into ArcGIS 10.2 to generate the groundwater potential map.
Deep neural networks (DNN)
In deep neural networks, sample data was processed through multiple layers, the initial low-level feature layer was gradually converted into a high-level feature layer with more abstract. The distributed characteristics of regional data were mined, which was more conducive to the visualization of classification or characteristics. Multiple layers of deep structure with nonlinear factors in the processing of machine learning of deep neural network (DNN) can accomplish complex function approximation. Through the summary and analysis of DNN model, a multi-layer perceptron model was established with the MATLAB software to conduct nonlinear prediction research on groundwater potential in this area (Jiang et al. 2019).
The function of the activation layer in the DNN model is to use the function to activate output result of the entire connected layer. The activation function is a non-linear transformation function that can imitate threshold activation characteristics of brain neurons. It introduces non-linear features into DNN and strengthens the expressive ability of this model. Two activation functions Relu and Sigmoid are used in this model. Relu function is a piecewise function widely used. It can improve the speed of convergence and increase the sparsity of the network. The sigmoid function is actually the expression of logistic regression function, which maps the input of this layer to [0, 1]. It is suitable for second-class discrimination in the last layer of this network. Two functions can be represented as Eq. 4 and Eq. 5.
Relu function: (4)
Sigmoid function: \(\text{f}\left(\text{x}\right)=\frac{1}{1+{e}^{-x}}\) (5)
In model training, the loss function adopts two kinds of cross entropy, and the Adam method is used to find the optimum value. The initial learning rate is set to 0.005, and the iteration of model is 2500.
DNN algorithm was continuously optimized by using the training samples (70%) and the validation samples (30%). Then, predicted data was substituted into the model for training to obtain the predicted value, and input the values with location information into ArcGIS to generate the groundwater potential map.
Random forest (RF)
Random forest (RF) is the most representative algorithm based on bagging integrated learning (Norouzi et al. 2019). It uses random sampling to perform integrated learning on multiple decision trees, and finally makes predictions through a majority voting mechanism. By extracting several samples from the original datasets with samples returned, using the decision tree algorithm to train these extracted samples, and then combining these decision trees together (Naghibi et al. 2016). After voting, the final classification result was the one with the most votes.
These main steps of the RF model are as follows: first, sampling many times from the original sample set and return the sample every time, each time to form a training dataset. Then, a decision tree is generated. If the sample has X groups of features in total, n features are randomly selected from the X groups as the split feature set of each internal node of a decision tree. Subsequently, the node is split using an optimal split method of classification feature set. Classification and regression tree (CART) algorithms are used to generate decision trees. Finally, integrate all decision trees to form a random forest model to classify and predict unknown data. Voting on the results of each tree produce the result and the most vote is the classification result.
This study uses the RF package in R 4.1.0 software to fit the model. Takes each groundwater potential influence factor as an auxiliary dataset, the final model variables are screened out by error in the outside of the RF package, variables are eliminated one by one, and then the change of error in the outside of the RF package is observed. If the error increases, the variable is retained, otherwise eliminate the variable.
Model Comparison
The accuracy of model evaluation is controlled by many factors. This study evaluates the prediction accuracy of three models by calculating the mean absolute error (MAE), root mean square error (RMSE) and correlation coefficient (R) between measured values and predicted values of the validation data (Singha et al. 2020). RMSE can evaluate the variation degree of data, and its equation is shown in Eq. 6. The smaller value of the RMSE is, the higher prediction accurate of the model is. MAE is the average of absolute errors, which can better reflect the actual error of the predicted value. The smaller value of the MAE is, the more accurate prediction of the model is, as shown in Eq. 7. R value is a statistical indicator that can reflect the degree of correlation between variables, which can be calculated as Eq. 8. The value of R is closer to 1, the stronger of correlation between two variables.
$$\text{R}\text{M}\text{S}\text{E}=\sqrt{1/\text{n}\underset{\text{i}}{\overset{\text{n}}{?}}{({\text{A}}_{\text{i}}-{\text{B}}_{\text{i}})}^{2}}$$ 6
$$\text{M}\text{A}\text{E}=1/\text{n}\underset{\text{i}}{\overset{\text{n}}{?}}\left|{\text{A}}_{\text{i}}-{\text{B}}_{\text{i}}\right|$$ 7
$$\text{R}=\frac{\sum _{\text{i}}^{\text{n}}({\text{A}}_{\text{i}}-\stackrel{-}{\text{A}})({\text{B}}_{\text{i}}-\stackrel{-}{\text{B}})}{\sqrt{\sum _{\text{i}}^{\text{n}}{({\text{A}}_{\text{i}}-\stackrel{-}{\text{A}})}^{2}}\sqrt{\sum _{\text{i}}^{\text{n}}{({\text{B}}_{\text{i}}-\stackrel{-}{\text{B}})}^{2}}}$$ 8
where Ai is measured values; Bi is predicted values; \(\stackrel{-}{A}\) and \(\stackrel{-}{B}\) represent the average value of measured values and predicted values.