2.1 Supervised learning (SL) algorithm and validation sets
Datasets processed by the IR strategies were used to train the SL models (Fig. 2).
A decision tree (DT) 28 is a tree-structured classifier for classification problems. A DT learns decision rules inferred from the data features to predict the value of a target variable. Specifically, an internal node represents a test on a feature, a branch represents the outcome of the test, and a leaf node represents a class label. In this study, the CART algorithm is implemented to construct a binary DT by calculating the Gini index as follows:
$$Gini \left(D\right)=1-\sum _{i=1}^{m}{p}_{i}^{2}$$
3
where pi denotes the probability that a training sample in D belongs to class Ci and is calculated as |Ci,D|/|D|; m is the total number of classes.
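For illustration, a minimal Python sketch of Eq. (3) follows; the function name `gini` and the example class counts are hypothetical and not taken from the study data, and scikit-learn's DecisionTreeClassifier is used here as an optimized CART implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini(labels):
    """Gini index of a node: 1 - sum_i p_i^2 (Eq. 3)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Hypothetical node with 8 majority-class (0) and 2 minority-class (1) samples
print(gini([0] * 8 + [1] * 2))  # 0.32

# CART-style binary tree split on the Gini criterion
dt = DecisionTreeClassifier(criterion="gini", random_state=0)
```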
RF 29 is a meta-estimator that ensembles a number of DT classifiers. RF decreases the variance and the tendency to overfit by introducing two sources of randomness, building each tree on a bootstrap sample of the training set and considering a random subset of features at each split, and by averaging the predictions of all decision tree classifiers.
ET 30 is an ensemble algorithm similar to RF: it keeps the strategies of RF but further increases the randomness by drawing random thresholds for the candidate features when splitting. ET normally decreases the variance but increases the bias compared with RF.
ADB 31 ensembles a sequence of weak classifiers (DTs in this study) to improve model performance by iteratively modifying the dataset. At each boosting iteration, ADB reweights every sample of the training set: the weights of samples that were incorrectly predicted by the model are increased, whereas the weights of correctly predicted samples are decreased.
Gradient tree boosting (GB) 32 is a boosting algorithm that generalizes the ensembling of a sequence of weak classifiers (DTs in this study) to an arbitrary differentiable loss function, such as:
$${L}_{MSE}=\frac{1}{n}\sum _{i=1}^{n}{({y}_{i}-F({x}_{i}\left)\right)}^{2}$$
4
where F(xi) is the prediction of classifier F for sample xi, and the training goal is to minimize the mean squared error (MSE).
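The four tree-based ensembles above are available in scikit-learn; a minimal sketch (the hyperparameter values are illustrative, not those used in the study, and scikit-learn's GB classifier optimizes the log-loss rather than the regression MSE of Eq. (4)) is:

```python
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)

models = {
    # Bagging-style ensembles: bootstrap samples + random feature subsets
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    # ET adds random split thresholds on top of RF's randomness
    "ET": ExtraTreesClassifier(n_estimators=100, random_state=0),
    # Boosting: ADB reweights samples, GB fits the gradient of a loss
    "ADB": AdaBoostClassifier(n_estimators=100, random_state=0),
    "GB": GradientBoostingClassifier(n_estimators=100, random_state=0),
}
# for name, clf in models.items():
#     clf.fit(X_train, y_train)
```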
Assume that Xi is an n-dimensional attribute vector and that yi takes values in the set {0, 1}. Logistic regression (LR) predicts the probability of the positive class P(yi = 1 | Xi) as:
$$P\left({X}_{i}\right)=\frac{1}{1+\text{e}\text{x}\text{p}(-{X}_{i}\omega -{\omega }_{0})}$$
5
where ω and ω0 are parameters that can be estimated by minimizing the following cost function with regularization term r(ω):
$$\underset{{\omega }}{\text{min}}C\sum _{i=1}^{n}\left(-{y}_{i}\text{log}\left(P\left({X}_{i}\right)\right)-(1-{y}_{i})\text{log}(1-P({X}_{i}\left)\right)\right)+r\left({\omega }\right)$$
6
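A minimal scikit-learn sketch of this formulation follows; the hyperparameter values are illustrative, the `C` argument corresponds to C in Eq. (6), and `penalty` selects the regularization term r(ω).

```python
from sklearn.linear_model import LogisticRegression

# C is the inverse regularization strength of Eq. (6); penalty chooses r(w)
lr = LogisticRegression(C=1.0, penalty="l2", max_iter=1000)
# lr.fit(X_train, y_train); lr.predict_proba(X_valid)[:, 1] gives P(y=1|X)
```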
SVM 33 applies a hyperplane ωTx+b=0 to separate samples into two classes for a binary problem:
$$\left\{\begin{array}{c}{\omega }^{T}{x}_{i}+b\ge +1,\ {y}_{i}=1\\ {\omega }^{T}{x}_{i}+b\le -1,\ {y}_{i}=0\end{array}\right.$$
7
where xi is the attribute vector of the samples and yi is the label of each one.
The support vectors are defined as the samples that lie closest to the hyperplane. The margin is defined as the distance between the support vectors of the two classes:
$$\gamma =\frac{2}{\left|\right|\omega \left|\right|}$$
8
The optimal hyperplane is the one that maximizes \(\gamma\).
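A minimal scikit-learn sketch of a linear SVM (the kernel choice and C value are illustrative assumptions, not necessarily those used in the study):

```python
from sklearn.svm import SVC

# Linear SVM; C controls the trade-off between margin width and violations
svm = SVC(kernel="linear", C=1.0, probability=True)
# svm.fit(X_train, y_train)
```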
KNN 34 is an instance-based algorithm. For a multifeature dataset, each sample is treated as a point in a d-dimensional feature space Rd, X={x1,…,xN}, xn ∈ Rd. The difference between two samples x and y is calculated with the Euclidean distance:
$$d\left(x,y\right)=\sqrt{\sum _{i=1}^{d}{({x}_{i}-{y}_{i})}^{2}}$$
9
and the k samples closest to an instance are called its k nearest neighbours (KNN).
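A minimal scikit-learn sketch (the value k = 5 is illustrative):

```python
from sklearn.neighbors import KNeighborsClassifier

# k = 5 neighbours, compared with the Euclidean distance of Eq. (9)
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
# knn.fit(X_train, y_train)
```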
ANN 35 implements the back-propagation algorithm, a gradient-descent-based algorithm with one input layer, one output layer and one or more hidden layers, each composed of one or more neuron nodes. The training procedure contains three steps: the weights are randomly initialized; each input unit feeds forward and broadcasts its signal to the neurons of the hidden layers; and the error is propagated backwards from the output layer, updating the weights between the neurons of the input and hidden layers. The sigmoid function is used in this study to determine the output state:
$${a}_{i}=\frac{1}{1+{e}^{-ne{t}_{i}}}$$
10
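A minimal scikit-learn sketch of such a network follows; the hidden-layer size is an assumption, and the "logistic" activation corresponds to the sigmoid of Eq. (10).

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer trained with stochastic gradient descent (back-propagation);
# "logistic" is the sigmoid activation of Eq. (10)
ann = MLPClassifier(hidden_layer_sizes=(16,), activation="logistic",
                    solver="sgd", max_iter=500)
```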
The convolutional neural network (CNN) 36 architecture comprises convolutional layers, pooling layers and fully connected layers. Specifically, the convolutional layer generates the output of its neurons by computing the scalar product between the neuron weights and the local region of the input volume to which each neuron is connected; the pooling layer downsamples the input along its spatial dimensions; and the fully connected layer operates in the same way as in an ANN, computing the class scores from the activations.
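A minimal Keras sketch of such an architecture follows. This is an assumption for illustration only: the paper does not specify the framework or layer sizes, and the tabular feature vector is assumed to be reshaped to a (n_features, 1) sequence so that 1-D convolutions can be applied.

```python
import tensorflow as tf

n_features = 26  # assumed input length (number of selected features, Table 3)
cnn = tf.keras.Sequential([
    tf.keras.layers.Conv1D(16, kernel_size=3, activation="relu",
                           input_shape=(n_features, 1)),   # convolutional layer
    tf.keras.layers.MaxPooling1D(pool_size=2),              # spatial downsampling
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),          # fully connected output
])
cnn.compile(optimizer="adam", loss="binary_crossentropy",
            metrics=[tf.keras.metrics.AUC()])
```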
The gated recurrent unit (GRU) 37 addresses the vanishing gradient problem of a simple recurrent neural network (RNN) and, in the encoder-decoder formulation, consists of two RNNs. Assuming a hidden state h and an optional output y operating on a variable-length sequence x={x1,…,xT}, one RNN encodes the sequence of symbols into a fixed-length vector representation:
$${h}_{t}=f({h}_{t-1},{x}_{t})$$
11
where f is a nonlinear activation function.
The other decodes the representation into another sequence of symbols:
$${h}_{t}=f({h}_{t-1},{y}_{t-1},c)$$
12
where c is a summary vector of the whole input sequence produced by the first RNN after it reads the end of the sequence. The two RNNs are jointly trained to maximize:
$$\underset{\theta }{\text{max}}\frac{1}{N}\sum _{n=1}^{N}\text{log}\,{p}_{\theta }\left({y}_{n}|{x}_{n}\right)$$
13
where θ is the set of model parameters and each (xn, yn) is a pair of input and output sequences.
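The encoder-decoder form above targets sequence-to-sequence learning; for the binary classification task in this study, a GRU layer can be used directly. The following Keras sketch is an assumption for illustration only (framework, layer size and the reshaping of the feature vector into a length-n_features sequence are not specified in the paper):

```python
import tensorflow as tf

n_features = 26  # assumed input length, as in the CNN sketch above
gru = tf.keras.Sequential([
    tf.keras.layers.GRU(32, input_shape=(n_features, 1)),   # gated recurrent layer
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
gru.compile(optimizer="adam", loss="binary_crossentropy",
            metrics=[tf.keras.metrics.AUC()])
```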
The validation sets of all models are subsets of the original data because the IR strategies are prone to introduce estimation error. A reasonable and straightforward approach is to randomly select 30% of the original data as a validation set that is excluded from the imbalanced-learning (IBL) training sets; this split is repeated five times, and the final result is the average over the 5 validation sets.
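A minimal sketch of this hold-out scheme follows, assuming a feature matrix X and labels y are already loaded; SMOTE and RF are used here only as placeholder IR strategy and classifier, and AUC as a placeholder metric.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE

aucs = []
for seed in range(5):                      # five random 70/30 splits
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    # Resampling is applied to the training portion only
    X_res, y_res = SMOTE(random_state=seed).fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=seed).fit(X_res, y_res)
    aucs.append(roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))
print(np.mean(aucs))                       # final result: average of 5 validation sets
```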
2.3 IR
2.3.1 Undersampling
The cluster centroid (CC) 43 method is an extension of k-means clustering. Assume that the dataset has been clustered into M disjoint subsets C1,…,CM, each with a centroid mk. The most widely used criterion for k-means clustering is the sum of squared Euclidean distances between the samples and their cluster centroids; this criterion is named the clustering error and depends on the cluster centres m1,…,mM:
$$E\left({m}_{1},\dots ,{m}_{M}\right)=\sum _{i=1}^{N}\sum _{k=1}^{M}I\left({x}_{i}\in {C}_{k}\right){‖{x}_{i}-{m}_{k}‖}^{2}$$
1
where I(X) = 1 if X is true and I(X) = 0 otherwise. The k-means algorithm finds locally optimal solutions to the clustering error. CC finds the centroids of the majority class by k-means and keeps only the centroid data, or uses the kNN rule to keep the majority samples that lie within the k nearest neighbours of the centroids 44.
Near Miss (NM) 20 (Fig. 3) applies three different strategies, all based on kNN rules, to balance the dataset.
NM strategy 1 (NM1) samples from the majority class using the kNN algorithm: it retains the majority samples with the smallest average distance to the minority group.
NM strategy 2 (NM2) samples from the majority class using the kNN algorithm: it first selects the minority samples that are farthest from the majority group as a subset and then retains the majority samples with the smallest average distance to this subset.
NM strategy 3 (NM3) preselects a group of minority samples as a subset and then retains the majority samples with the largest average distance to this group.
Tomek’s link 45 (Fig. 3) applies the 1NN rule to remove noise and borderline samples: a Tomek’s link exists if two samples of different classes are each other’s 1NN, and such samples are regarded as noise or borderline samples that can distort the decision boundary of the model.
One-Sided Selection (OSS) 41 runs the following procedure after Tomek’s link: add all minority samples to a set C, add one sample from the majority class to C, and place all other samples in a set S; traverse S sample by sample, classifying each sample with the 1-nearest-neighbour rule; if a sample is misclassified, add it to C; repeat the above steps until no sample is added 46.
ENN 42 is a technique to remove noise and borderline samples. ENN traverses every sample and uses kNN to decide whether it is noise, the criterion being the class proportion among its neighbours: a sample is not considered noise if at least 2/3 of its k nearest neighbours belong to the majority class; otherwise, it is removed as noise.
The neighbourhood cleaning rule (NCL) 46 is similar to OSS but applies ENN rather than Tomek’s link. A threshold parameter T is added to avoid excessive data cleaning: a class is cleaned only if Ci > C·T, where Ci is the number of samples of that class and C is the number of samples in the dataset. The subsequent process is the same as that of OSS 46.
The instance hardness threshold (IHT) 47 formulates the learning problem as maximizing a probability value through Bayes’ theorem. Instance hardness (IH) means that an instance is prone to be classified incorrectly under the hypothesis h. A representative set of learning algorithms and their associated parameters ι are weighted a priori with nonzero probability, and all other learning algorithms are assigned zero probability, to approximate the unknown distribution p(h|t), or equivalently p(g(t, α)):
$$I{H}_{\iota }\left(⟨{x}_{i},{y}_{i}⟩\right)=1-\frac{1}{\left|\iota \right|}\sum _{j=1}^{\left|\iota \right|}p(\left.{y}_{i}\right|{x}_{i},{g}_{j}(t,{\alpha }\left)\right)$$
2
Specifically, a pretrained classifier determines the probability value IHι of each majority class sample on the dataset, and those with low probability are considered indistinguishable samples that need to be removed.
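All of the undersampling strategies described above are available in the imbalanced-learn package; the following sketch lists them side by side (parameter values are illustrative, not those used in the study).

```python
from imblearn.under_sampling import (
    ClusterCentroids, NearMiss, TomekLinks, OneSidedSelection,
    EditedNearestNeighbours, NeighbourhoodCleaningRule,
    InstanceHardnessThreshold)

undersamplers = {
    "CC":  ClusterCentroids(random_state=0),
    "NM1": NearMiss(version=1),               # NM2/NM3 via version=2, version=3
    "TL":  TomekLinks(),
    "OSS": OneSidedSelection(random_state=0),
    "ENN": EditedNearestNeighbours(),
    "NCL": NeighbourhoodCleaningRule(),
    "IHT": InstanceHardnessThreshold(random_state=0),
}
# X_res, y_res = undersamplers["NM1"].fit_resample(X_train, y_train)
```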
2.3.2 Oversampling
SMOTE 26 is a technique that artificially synthesizes minority class samples by interpolating in the feature space of the minority class. Compared with random oversampling (ROS), the advantage of SMOTE is that it effectively makes the decision region of the minority class more general and smoother 48.
There are two steps in SMOTE (Fig. 3). First, SMOTE randomly selects, for each minority sample, one of its k nearest minority neighbours according to the required synthesis ratio; for example, 1 of the 5 nearest neighbours of each minority sample is randomly selected if 100% of the minority class needs to be synthesized. Second, a new sample is randomly generated on the line segment joining the two samples by taking the difference between their feature vectors, multiplying this difference by a random number between 0 and 1 and adding it to the original sample. SMOTE cannot be applied to categorical data.
The Synthetic Minority Oversampling Technique for Nominal and Continuous features (SMOTENC) 43 is designed only for datasets with mixed variables. In SMOTENC, continuous features are synthesized in the same way as in SMOTE, while the categorical value of a new sample is set to the most common category among the nearest-neighbour samples.
The Synthetic Minority Oversampling Technique for Nominal features (SMOTEN) does not apply the Euclidean distance rule but the value difference metric (VDM) 49.
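The three oversamplers are likewise available in imbalanced-learn; a minimal sketch (the categorical feature indices passed to SMOTENC are hypothetical):

```python
from imblearn.over_sampling import SMOTE, SMOTENC, SMOTEN

smote    = SMOTE(k_neighbors=5, random_state=0)                   # continuous features
smote_nc = SMOTENC(categorical_features=[3, 4], random_state=0)   # mixed features
smote_n  = SMOTEN(random_state=0)                                  # nominal features
# X_res, y_res = smote_nc.fit_resample(X_train, y_train)
```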
2.3.3 Hybrid sampling
Although SMOTE reduces overfitting compared with ROS, it can still generate noisy samples (Menardi and Torelli, 2014). Hybrid strategies that combine oversampling and undersampling have therefore been proposed.
SMOTE & Tomek’s link (ST) 50 implements SMOTE for the minority class and Tomek’s link for both classes.
SMOTE & ENN (SE) 51 implements SMOTE for the minority class and ENN for both classes.
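Both hybrid strategies are available in imbalanced-learn; a minimal sketch:

```python
from imblearn.combine import SMOTETomek, SMOTEENN

st = SMOTETomek(random_state=0)   # SMOTE oversampling + Tomek's link cleaning
se = SMOTEENN(random_state=0)     # SMOTE oversampling + ENN cleaning
# X_res, y_res = st.fit_resample(X_train, y_train)
```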
2.4 Performance measure
For a binary classification problem, the confusion matrix defines the basis for the performance metrics, and the ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR). To compute the AUC, the model's predicted probability P of belonging to the positive class is obtained for each entity, these values are sorted, and each value is used in turn as a threshold ϴ above which an entity is assigned to the positive category; the area under the resulting ROC curve is calculated with the trapezoidal rule 52.
$$TPR=\frac{TP}{TP+FN}$$
3
$$FPR=\frac{FP}{FP+TN}$$
4
Recall and precision are derived from the confusion matrix (Buckland and Gey, 1994) and are defined as:
$$Recall=TPR=\frac{TP}{TP+FN}$$
5
$$Precision=\frac{TP}{TP+FP}$$
6
Normally, a model requires both recall and precision to be high. The F-measure 54 is an equation that balances recall and precision. When β = 1, the F1 score weights recall and precision equally; if β > 1, recall is considered more important than precision, and if β < 1, precision is considered more important.
$$F-measure=\frac{({\beta }^{2}+1)PR}{{\beta }^{2}*P+R}$$
7
The G-mean 52 aggregates the per-class accuracies by taking their geometric mean to offset the dominance of the majority class.
$$Gmean\left(Ts\right)=\sqrt[m]{\prod _{i=1}^{m}\frac{corr\left({Ts}_{i}\right)}{\left|{Ts}_{i}\right|}}$$
8
Cohen’s kappa 55 measures the agreement between predicted and true labels while accounting for the possibility that the agreement occurs by chance. P0 denotes the empirical probability of agreement, which equals the accuracy, and Pe is the expected probability of agreement when labels are assigned randomly.
$$Kappa=\frac{{P}_{0}-{P}_{e}}{1-{P}_{e}}$$
9
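A minimal sketch that collects these metrics with scikit-learn and imbalanced-learn (the helper name `evaluate` is hypothetical; y_true and y_pred are hard labels and y_prob the positive-class probabilities):

```python
from sklearn.metrics import (recall_score, precision_score, f1_score,
                             roc_auc_score, cohen_kappa_score)
from imblearn.metrics import geometric_mean_score

def evaluate(y_true, y_pred, y_prob):
    return {
        "Recall":    recall_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "F1":        f1_score(y_true, y_pred),           # beta = 1
        "AUC":       roc_auc_score(y_true, y_prob),      # trapezoidal rule
        "G-mean":    geometric_mean_score(y_true, y_pred),
        "Kappa":     cohen_kappa_score(y_true, y_pred),
    }
```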
2.5 Post hoc test
The Friedman test 56 is a widely applied approach for comparing the performance of multiple algorithms on multiple datasets on the basis of their performance metrics.
Let \({{R}_{i}}^{j}\) be the rank of the j-th algorithm on the i-th of the N datasets (for example, the model with the largest AUC ranks 1 and the second largest ranks 2). The Friedman test compares the algorithms through their ranking values Rj under the null hypothesis that all algorithms are equivalent, in which case their rankings should be equal. When N and k are large enough, the statistic \({{\widehat{{\chi }}}^{2}}_{F}\) follows a chi-square distribution with k − 1 degrees of freedom (a derived variant follows an F distribution with k − 1 and (k − 1)(N − 1) degrees of freedom) 57.
$${{\widehat{{\chi }}}^{2}}_{F}=\left[\frac{12}{nk(k+1)}\right.\left.\sum _{j}^{k}{{R}_{j}}^{2}\right]-3n(k+1)$$
10
Because it is nonparametric, the Friedman test does not require commensurability of measures across different datasets, does not assume normality of the sample means, and is robust to outliers.
A further assessment is necessary when the Friedman test rejects the null hypothesis, i.e., when the differences are significant. Nemenyi 58 proposed a method to calculate a threshold, the critical difference (CD), for the average ranking values (ARVs) of the algorithms: for the ranking averages Ri and Rj of two algorithms across multiple datasets, if their difference exceeds the CD, the hypothesis Ri = Rj is rejected:
$$\left|{\overline{R}}_{i}-{\overline{R}}_{j}\right|>{q}_{\alpha }\sqrt{\frac{k(k+1)}{6n}}=CD$$
11
where qα is determined by the number of classifiers and the confidence level (Table 1):
Table 1
Critical values for the two-tailed Nemenyi test 56
Classifiers | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
q0.05 | 1.960 | 2.343 | 2.569 | 2.728 | 2.850 | 2.949 | 3.031 | 3.102 | 3.164 |
q0.10 | 1.645 | 2.052 | 2.291 | 2.459 | 2.589 | 2.693 | 2.780 | 2.855 | 2.920 |
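A minimal sketch of the Friedman test and the Nemenyi critical difference follows, assuming a score matrix with one row per dataset and one column per algorithm; the score values are hypothetical and the q value is taken from Table 1 for k = 3 classifiers at α = 0.05.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# scores: shape (N datasets, k algorithms), e.g. AUC values (hypothetical)
scores = np.array([[0.81, 0.78, 0.84],
                   [0.75, 0.74, 0.80],
                   [0.88, 0.85, 0.90]])
stat, p = friedmanchisquare(*scores.T)        # one sample per algorithm

N, k = scores.shape
avg_ranks = rankdata(-scores, axis=1).mean(axis=0)   # rank 1 = largest AUC
q_alpha = 2.343                                       # Table 1, k = 3, alpha = 0.05
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))         # Eq. (11)
print(p, avg_ranks, cd)
```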
2.6 Case Study and available data
Global landslide susceptibility (LS) was used as the case study.
Table 2: Description of Explanatory Variables to Develop the Global Susceptibility Map
Data type | Dataset | Resolution | Explanatory variable | Extent | Source and details |
Elevation | GMTED2010: Global Multi-resolution Terrain Elevation Data 2010 | 1 km | Elevation, General Curvature, Slope Aspect | Global | https://developers.google.com/earth-engine/datasets/catalog/USGS_GMTED2010?hl=en |
Slope | GMTED2010: Global Multi-resolution Terrain Elevation Data 2010 | 1 km | Slope | Global | https://developers.google.com/earth-engine/datasets/catalog/USGS_GMTED2010?hl=en |
Rainfall | Global Precipitation Measurement | 10 km | Global precipitation measurement (GPM) | Global | https://gpm.nasa.gov/data/directory |
Landcover | FROM-GLC 2017v1 | 300 m | Landcover type | Global | http://data.ess.tsinghua.edu.cn/ |
Soil type | Global Soil Regions Map | 1:5,000,000 | Soil classification | Global | https://www.nrcs.usda.gov/wps/portal/nrcs/detail/soils/use/?cid=nrcs142p2_054013 |
NDVI | MOD13A2.006 Terra Vegetation Indices 16-Day Global 1km | 1 km | Forest cover | Global | https://developers.google.com/earth-engine/datasets/catalog/MODIS_006_MOD13A2 |
Lithology | Global Lithological Map Database v1.0 (gridded to 0.5° spatial resolution), PANGAEA | 0.5° | Lithologic classification | Global | https://doi.pangaea.de/10.1594/PANGAEA.788537 |
Climate Classes | World Map of the Köppen-Geiger Climate Classification Updated | 10 km | Climate classes | Global | http://koeppen-geiger.vu-wien.ac.at/present.htm |
River network | Major River Basins of the World, 2nd ed. (GRDC, 2020) | Variable | Euclidean distance to rivers | Global | https://www.bafg.de/GRDC/EN/02_srvcs/22_gslrs/221_MRB/riverbasins_node.html;jsessionid=63179A36F6128A65D1D6355B9035421D.live21323#doc2731742bodyText3 |
Landslide Catalog | Global Landslide Catalog (GLC) & Landslide Reporter Catalog (LRC) | Variable | Landslide report | Global | https://gpm.nasa.gov/landslides/coolrdata.html |
The landslide location data for this study are from NASA GLC and LRC 59–61. The digital elevation model (DEM), precipitation, slope, NDVI, land cover, lithology, soil types and climate classes are from the Google Earth Engine (GEE) platform (Table 2).
Slope was calculated with the terrain function of the GEE platform. Slope aspect and general curvature (GC) were derived from the DEM with the third-order partial derivative method in QGIS 62 (Fig. 4). The Euclidean distance to rivers (EDTR) of the landslide points was calculated with the Euclidean Distance tool in ArcMap and the Global River network variable (Table 2).
This study applied RF, AdaBoost and L1 regularization 63 for feature selection, and 26 features were selected (Table 3); a sketch of this step is given after Table 3.
Table 3
Numerical Variables | Climate Classes | Landcover | Lithology | Soil |
GPM | Aw | Cropland | Unconsolidated Sediments | Gelisols |
slope | Csb | Forest | Pyroclastics | Ultisols |
GC | Cwa | Grassland | Carbonate Sedimentary Rock | Inceptisols |
DEM | Dfc | | Evaporites | Alfisols |
NDVI | ET | | Basic Volcanic Rocks | |
EDTR | | I | Basic Plutonic Rocks | |
Slope aspect | | | Acid Plutonic Rocks | |
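A minimal scikit-learn sketch of the feature-selection step referred to above follows. It only illustrates the three components (RF importances, AdaBoost importances and L1-penalized logistic regression coefficients); how their selections were combined to reach the 26 features is not specified here, and the thresholds and hyperparameters are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# A feature is kept when its importance/coefficient exceeds each selector's
# default threshold; combining the three selections is a separate design choice.
selectors = [
    SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0)),
    SelectFromModel(AdaBoostClassifier(random_state=0)),
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=1.0)),
]
# selected_masks = [s.fit(X, y).get_support() for s in selectors]
```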