Landslides commonly cause numerous casualties and considerable property damage in hilly and mountainous areas, and landslide occurrence is increasing because of significant population growth in landslide-prone areas. Practical tools for landslide disaster risk reduction are therefore important for stakeholders in these areas. A landslide susceptibility map, generated through landslide susceptibility modeling, is considered one of the tools that municipalities can use to design risk-reduction-based development (Fell et al., 2008). The most appropriate methods for landslide susceptibility modeling over large areas at medium to small scales are statistical quantitative approaches (Cascini, 2008; Soeters and Westen, 1996; van Westen et al., 2008), which require a reliable landslide inventory (Blahut et al., 2010).
A landslide inventory is the simplest form of landslide map (Guzzetti et al., 2012; Ngadisih et al., 2017). It generally represents single or multiple events and describes the locations and outlines of landslides (Chacón et al., 2006). The landslide inventory should include information about the location, type, volume, activity, date of occurrence, and other characteristics of landslides in the area (Fell et al., 2008). Furthermore, the landslide inventory must identify the triggering factors (Godt et al., 2008) and represent each landslide as a single point or a polygon (Parise, 2001). For example, in Indonesia, landslide data are often represented by point or location coordinates, which are available at https://dibi.bnpb.go.id/. Several researchers have created landslide polygon inventories for some areas (Aditian et al., 2018; Ngadisih et al., 2014; Samodra et al., 2020, 2018) for specific purposes. Both landslide inventory maps and landslide controlling factor maps are required for landslide susceptibility modeling.
A susceptibility map shows the likelihood of landslide occurrence in a given location (Corominas and Moya, 2008). Numerous landslide susceptibility models have recently been developed, and the methods are increasingly sophisticated because of advances in computer processing and geographic information system (GIS) technologies. The common procedure for landslide susceptibility modeling is to extract the value of each landslide controlling factor at landslide and non-landslide point/area samples (Lovelace et al., 2020). The type of landslide inventory, i.e., point or polygon, determines how landslide data are handled in the model. Landslide inventory data handling, i.e., the placement of landslide samples within a landslide polygon and the number of samples, can affect the appearance and performance of a landslide susceptibility map (Abraham et al., 2021; Hussin et al., 2016; Steger et al., 2016a).
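The extraction step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each controlling factor is stored as a small north-up raster (a list of rows) sharing one origin and cell size; the names `sample_raster` and `build_training_table` and the factor values are hypothetical and are not taken from the cited works.

```python
from typing import Dict, List, Tuple

Raster = List[List[float]]   # row 0 is the northernmost row (assumption)
Point = Tuple[float, float]  # (x, y) in map units


def sample_raster(raster: Raster, x: float, y: float,
                  x_min: float, y_max: float, cell: float) -> float:
    """Return the raster value of the cell that contains (x, y)."""
    col = int((x - x_min) // cell)
    row = int((y_max - y) // cell)
    if not (0 <= row < len(raster) and 0 <= col < len(raster[0])):
        raise ValueError(f"point ({x}, {y}) lies outside the raster")
    return raster[row][col]


def build_training_table(factors: Dict[str, Raster],
                         landslide_pts: List[Point],
                         non_landslide_pts: List[Point],
                         x_min: float, y_max: float,
                         cell: float) -> List[dict]:
    """Extract every factor at every sample and attach a 1/0 label."""
    table = []
    for label, pts in ((1, landslide_pts), (0, non_landslide_pts)):
        for x, y in pts:
            record = {name: sample_raster(r, x, y, x_min, y_max, cell)
                      for name, r in factors.items()}
            record["landslide"] = label
            table.append(record)
    return table


# Hypothetical 2 x 2 factor rasters with 10 m cells.
factors = {
    "slope": [[10.0, 20.0], [30.0, 40.0]],
    "elevation": [[100.0, 200.0], [300.0, 400.0]],
}
table = build_training_table(
    factors,
    landslide_pts=[(5.0, 15.0)],
    non_landslide_pts=[(15.0, 5.0)],
    x_min=0.0, y_max=20.0, cell=10.0,
)
```

Each record in `table` holds one sample's factor values plus its label, which is the tabular form that statistical and ML classifiers typically expect. Where the landslide samples are placed therefore directly changes the factor values the model sees.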
Remote sensing and GIS technologies are becoming increasingly sophisticated and integrated, contributing to the advancement of landslide susceptibility modeling. Statistical learning and machine learning (ML) techniques, such as logistic regression, artificial neural networks, decision trees, random forests, support vector machines, and their variations, have recently been the most commonly adopted methods for landslide susceptibility modeling (Merghadi et al., 2020; Reichenbach et al., 2018). Among these methods, random forest is ranked as the most promising ML algorithm for spatial prediction (Hengl et al., 2015; Nussbaum et al., 2018; Park and Kim, 2019; Sun et al., 2021; Vaysse and Lagacherie, 2015).
Despite their high predictive performance, ML algorithms for spatial prediction are often prone to overfitting (Jaafari et al., 2019; Just et al., 2020; Meyer et al., 2019, 2018). Overfitting occurs when a model performs well on the training data but worse on independent data (Probst et al., 2019). Researchers have rarely considered the effect of overfitting in their ML models (Kim et al., 2017; Lai et al., 2019; Park and Kim, 2019; Taalab et al., 2018). Evaluation strategies that account for the spatial structure of the training data, particularly spatial cross validation (CV) and spatial hyperparameter tuning, are recommended to avoid overfitting in ML algorithms (Brenning, 2005; Meyer et al., 2019, 2018; Schratz et al., 2019).
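The difference between nonspatial and spatial CV lies in how samples are assigned to folds. The following minimal sketch shows one simple block-based variant of spatial partitioning, in which all samples falling in the same square block share a fold so that test samples are spatially separated from training samples. The function names and the `block_size` parameter are illustrative assumptions; the cited studies may partition space differently, for example by clustering coordinates.

```python
import math
import random
from typing import Iterator, List, Tuple

Point = Tuple[float, float]  # (x, y) in map units


def block_id(p: Point, block_size: float) -> Tuple[int, int]:
    """Index of the square block that contains point p."""
    return (math.floor(p[0] / block_size), math.floor(p[1] / block_size))


def spatial_block_folds(points: List[Point], block_size: float,
                        k: int, seed: int = 0) -> List[int]:
    """Assign each point a fold so that a block never spans two folds."""
    blocks = sorted({block_id(p, block_size) for p in points})
    random.Random(seed).shuffle(blocks)
    fold_of_block = {b: i % k for i, b in enumerate(blocks)}
    return [fold_of_block[block_id(p, block_size)] for p in points]


def train_test_splits(folds: List[int],
                      k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds."""
    for f in range(k):
        test = [i for i, fold in enumerate(folds) if fold == f]
        train = [i for i, fold in enumerate(folds) if fold != f]
        yield train, test


# Hypothetical sample coordinates spread over four 100 m blocks.
points = [(5.0, 5.0), (8.0, 2.0), (105.0, 5.0),
          (5.0, 105.0), (105.0, 105.0), (110.0, 101.0)]
folds = spatial_block_folds(points, block_size=100.0, k=2)
splits = list(train_test_splits(folds, k=2))
```

In a nonspatial CV, by contrast, neighboring samples (such as the first two points here) can end up on opposite sides of the train/test split, which tends to make performance estimates optimistic when the data are spatially autocorrelated.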
Instead of comparing algorithms, our research compares strategies for handling landslide inventory data in landslide susceptibility modeling in order to identify the best ones. This study focuses on inventory handling scenarios that often occur in Indonesia. For point-based data handling, the landslide point is placed randomly within the landslide polygon, on the scarp (maximum height), on the toe (minimum height), or at the center (centroid). For area-based/polygon data handling, landslide points are often placed across the entire landslide area based on grid sampling. In this research, a 15 m grid sample was created in each landslide polygon. We also created strategies that randomly generate 3x, 5x, and 10x points in each landslide polygon to compensate for the number of points produced by grid sampling. The eight inventory data handling scenarios will be simulated in landslide susceptibility maps using nonspatial CV for performance estimation combined with nonspatial hyperparameter tuning (CV–nonspatial tuning), spatial CV estimation combined with nonspatial hyperparameter tuning (spatial CV–nonspatial tuning), and spatial CV combined with forward feature selection (FFS) estimation without tuning (spatial CV–FFS–no tuning). The comparison of the combinations of inventory data handling, CV, and hyperparameter tuning strategies is expected to provide general strategies for landslide susceptibility modeling using ML techniques.
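The sample placement scenarios listed above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in this study: the polygon is a plain list of vertices, `elevation` is a hypothetical callable standing in for a digital elevation model, the scarp and toe are approximated by the highest and lowest grid cells inside the polygon, and the number of random points `n` is left as a free parameter.

```python
import random
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def point_in_polygon(p: Point, poly: List[Point]) -> bool:
    """Ray-casting test: True if p lies inside the polygon."""
    x, y = p
    inside = False
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def grid_points(poly: List[Point], spacing: float = 15.0) -> List[Point]:
    """Centers of a regular grid (default 15 m) that fall inside poly."""
    xs, ys = [x for x, _ in poly], [y for _, y in poly]
    pts = []
    y = min(ys) + spacing / 2
    while y < max(ys):
        x = min(xs) + spacing / 2
        while x < max(xs):
            if point_in_polygon((x, y), poly):
                pts.append((x, y))
            x += spacing
        y += spacing
    return pts


def centroid(poly: List[Point]) -> Point:
    """Area centroid of a simple polygon (shoelace formula)."""
    area2 = cx = cy = 0.0
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        cross = x1 * y2 - x2 * y1
        area2 += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    return (cx / (3 * area2), cy / (3 * area2))


def random_points(poly: List[Point], n: int, seed: int = 0) -> List[Point]:
    """Draw n random points inside poly by rejection sampling."""
    rng = random.Random(seed)
    xs, ys = [x for x, _ in poly], [y for _, y in poly]
    pts: List[Point] = []
    while len(pts) < n:
        p = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        if point_in_polygon(p, poly):
            pts.append(p)
    return pts


def scarp_and_toe(poly: List[Point], elevation: Callable[[Point], float],
                  spacing: float = 15.0) -> Tuple[Point, Point]:
    """Highest (scarp) and lowest (toe) grid cell inside the polygon."""
    cells = grid_points(poly, spacing)
    return max(cells, key=elevation), min(cells, key=elevation)


# Hypothetical 60 m x 30 m landslide on a slope rising toward the north.
poly = [(0.0, 0.0), (60.0, 0.0), (60.0, 30.0), (0.0, 30.0)]
grid = grid_points(poly)                       # area-based grid samples
center = centroid(poly)                        # centroid scenario
scarp, toe = scarp_and_toe(poly, lambda p: 100.0 + p[1])
rand = random_points(poly, n=3)                # random placement
```

The centroid of a strongly concave polygon can fall outside the polygon, so a production workflow would likely need an additional check or a GIS library routine that guarantees an interior point.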