Machine learning (ML), a subfield of artificial intelligence, has become popular among researchers in recent times because of its ability to identify patterns between a set of inputs and an output with high precision and speed (Ali et al., 2022; Zhan and Kitchin, 2022). Although ML was primarily a tool of computer scientists a few decades ago, it is now used by researchers in many STEM fields (Lu et al., 2021; Sun et al., 2021; Tao et al., 2021; Walsh et al., 2021). Most importantly, researchers are trying to integrate ML to enhance the sustainability of certain industries. For instance, ML is being applied in the upstream oil and gas industry to reduce manpower and time (Tariq et al., 2021; Temizel et al., 2021). The first ML algorithm, known as "the perceptron", was invented by Rosenblatt (1958). Since then, ML has developed rapidly, and along the way several key ML algorithms were invented, such as the multilayer perceptron, support vector machines, k-nearest neighbor, decision trees, hybrid learning, ensemble learning and deep learning (Fix & Hodges, 1951; Rosenblatt, 1958; Schölkopf et al., 2013; Cover & Hart, 1967; Morgan & Sonquist, 1963; Dasarathy & Sheela, 1979; Psichogios & Ungar, 1992; Dechter, 1986). Algorithms such as the multilayer perceptron, support vector machines and decision trees can be categorised as traditional (classical) ML models: not only were they invented in the early stages of the ML development timeline, they also serve as building blocks of several modern approaches such as hybrid learning, ensemble learning and deep learning. The basic concept of an ML model is to take in a set of input data and develop a relationship with an output (El Naqa & Murphy, 2015; Mahesh, 2020). Interpreting how an ML model generates these outputs can be quite complex, and such a model is often referred to as a black box (Handelman et al., 2019; Hsu & Elmore, 2019). However, with the development of ML-related techniques and concepts, it is possible to describe an ML model more confidently by utilising various data produced during and after training.
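To make the input-output concept concrete, the snippet below is a minimal sketch in Python with scikit-learn of a model learning a relationship between a set of inputs and an output; the synthetic data, feature construction, and model choice are illustrative assumptions rather than anything used in this study.

```python
# Minimal sketch of the basic ML concept: a model that learns a
# relationship between a set of inputs (X) and an output (y).
# The data here are synthetic and purely illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(seed=0)
X = rng.uniform(0.0, 10.0, size=(200, 2))                       # two input features
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, 200)   # output

model = DecisionTreeRegressor(max_depth=4)
model.fit(X, y)                        # learn the input-output mapping
print(model.predict([[5.0, 1.0]]))     # query the fitted "black box"
```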
1.1 Ensemble machine learning algorithms
Ensemble learning is a branch of ML in which weaker ML algorithms are amalgamated to produce a high-performing model (Rincy & Gupta, 2020; Feng et al., 2021; Ganaie et al., 2022). Bagging (bootstrap aggregating) is a subset of ensemble learning that has been widely used in recent studies (Hong et al., 2020; Xu et al., 2020; Ngo, Beard & Chandra, 2022). The concept was first put forward by Breiman (1996) and has since been strengthened by the invention of multiple high-performing algorithms. Bagging ensembles are capable of addressing the issue of overfitting in traditional ML models (Ghojogh & Crowley, 2019; Mosavi et al., 2021). Overfitting occurs when the model performs well on the training set but gives unusually poor results when the test set is introduced; a model that overfits exhibits low bias and high variance. Figure 1 shows the architecture of a bagging-type ensemble. In bagging ensemble algorithms, the initial dataset is divided into several samples, and each sample is introduced to a base model, as illustrated in the sketch below.
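The following hedged sketch (Python/scikit-learn, synthetic data; `BaggingRegressor` stands in for the generic bagging architecture of Figure 1) contrasts a single unconstrained decision tree, which overfits, with a bagging ensemble of such trees, which reduces the variance:

```python
# An unconstrained decision tree fits the training set almost perfectly
# but generalises poorly (low bias, high variance); bagging many such
# trees on bootstrap samples narrows the train/test gap.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 300)      # noisy target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor().fit(X_tr, y_tr)
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                       random_state=0).fit(X_tr, y_tr)

print(f"tree: train R2={tree.score(X_tr, y_tr):.2f}, test R2={tree.score(X_te, y_te):.2f}")
print(f"bag : train R2={bag.score(X_tr, y_tr):.2f}, test R2={bag.score(X_te, y_te):.2f}")
```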
Generally, the base models in bagging ensembles are tree-based (i.e. they comprise decision trees). Predictions from each base model are averaged at the end to obtain the final prediction (Ganaie et al., 2022). Random forest and extra tree (extremely randomised trees) are two bagging-type ensemble learning algorithms capable of solving regression problems. The two algorithms share similar characteristics; the notable differences are that random forest uses bootstrap replicas, i.e. its samples are selected from the dataset with replacement, while extra tree draws samples without replacement, and that extra tree splits nodes at random rather than at the best split. Table 1 shows the main differences among the decision tree, random forest, and extra tree algorithms; a code sketch illustrating these differences follows the table.
Table 1
Comparison of the characteristics of a decision tree and bagging ensemble algorithms
Algorithm | Number of trees | Number of features considered at each split | Bootstrapping | Splitting procedure
Decision tree | One | All features | Not applicable | Best split
Random forest | Multiple | Random subset of features | Yes | Best split
Extra tree | Multiple | Random subset of features | No | Random split
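The configuration differences in Table 1 map almost directly onto scikit-learn parameters. The sketch below is illustrative only (synthetic data, arbitrary hyperparameters) and shows how the bootstrap flag and split rule distinguish the two algorithms:

```python
# Random forest bootstraps samples (with replacement) and searches for the
# best split within a random feature subset; extra trees uses the whole
# sample (bootstrap=False) and splits at random thresholds.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(0, 0.2, 500)

rfr = RandomForestRegressor(n_estimators=200, max_features="sqrt",
                            bootstrap=True, random_state=0)   # bootstrap replicas, best split
etr = ExtraTreesRegressor(n_estimators=200, max_features="sqrt",
                          bootstrap=False, random_state=0)    # no bootstrapping, random split
for name, model in [("random forest", rfr), ("extra trees", etr)]:
    print(name, f"mean CV R2 = {cross_val_score(model, X, y, cv=5).mean():.3f}")
```

In both cases the final prediction is the average over the individual trees' predictions, which is the variance-reducing step described above.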
1.2 Carbon capture and storage
The topic of sustainable development has become prominent in the decision making of several industries, such as oil and gas. Along with sustainability, achieving carbon net-zero has also come into the spotlight in recent times (Bergero et al., 2023; Dafnomilis et al., 2023; Xu et al., 2023). Carbon dioxide capture and storage (CCS) is a practice used worldwide to mitigate carbon dioxide emissions (Boot-Handford et al., 2014; Bahman et al., 2023). The oil and gas industry employs this method to capture carbon dioxide emitted from production plants, liquefy it, and inject it into the subsurface for long-term storage in suitable geological layers (Shirmohammadi et al., 2020; Wilberforce et al., 2021).
To capture CO2, chemical absorption methods are commonly used, typically with aqueous amine solutions. The CO2 is separated from the amine solution, dried, and pressurised into liquid form. This liquefied CO2 is then transported via pipelines to onshore and offshore storage sites (Gibbins & Chalmers, 2008; Bui et al., 2018). The CO2 is injected into subsurface layers with interconnected pores, which allow the liquid to move within the layer.
During the initial assessment of a CCS project, a parameter called carbon storage capacity is estimated (Ringrose, 2020; AlNajdi & Worden, 2023). This parameter indicates how much CO2 can be stored in the geological layer under consideration, and porosity plays a crucial role in its estimation. Traditional methods of estimating porosity, such as core analysis, are expensive, time-consuming, and can be affected by damaged core samples at certain depths (Erofeev et al., 2019; Agbadze et al., 2022). Given the vast amount of data generated during the initial assessments of CCS projects, integrating ML techniques to predict porosity is feasible, and the predicted porosity can then be used to estimate the carbon storage capacity.
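For context, one commonly used volumetric form of the storage capacity estimate is sketched below; the symbols are standard conventions assumed here for illustration, not notation taken from the cited works:

```latex
% Volumetric CO2 storage capacity estimate (illustrative form):
%   A        areal extent of the storage formation
%   h        net thickness of the formation
%   \phi     porosity (e.g. as predicted by the ML models in this study)
%   \rho     CO2 density at reservoir conditions
%   E        storage efficiency factor (fraction of pore volume usable)
\[
  M_{\mathrm{CO_2}} = A \, h \, \phi \, \rho_{\mathrm{CO_2}} \, E
\]
```

Because the estimate scales linearly with porosity, errors in predicted porosity propagate directly into the capacity estimate.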
Currently, there is no research focusing on the usability of bagging ensembles to predict porosity in CCS projects, or on how feasible these models are for characterising the CO2-storing layer. Furthermore, a comparison of bagging ensembles with traditional ML models for porosity prediction in CCS assessment can accelerate ML advancement and support future studies.
In this study, two regression-capable bagging ensemble ML models, random forest regression (RFR) and extra tree regression (ETR), were developed to predict the porosity of sandstone-dominated layers in the Darling Basin, Australia. The data were collected as part of a CCS assessment program. Predictions from these two bagging ensemble models are compared with those from four traditional models: multilayer perceptron (MLP), support vector regressor (SVR), k-nearest neighbors regressor (KNN), and decision tree regressor (DTR).
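A minimal sketch of such a comparison is given below, assuming scikit-learn implementations of all six models; the synthetic dataset stands in for the Darling Basin well-log data, and the default hyperparameters are placeholders rather than the settings tuned in this study:

```python
# Compare the two bagging ensembles (RFR, ETR) against the four
# traditional models (MLP, SVR, KNN, DTR) using cross-validated R2.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the well-log features and porosity target.
X, y = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=0)

models = {
    "RFR": RandomForestRegressor(n_estimators=200, random_state=0),
    "ETR": ExtraTreesRegressor(n_estimators=200, random_state=0),
    # Scale-sensitive models are wrapped with a standard scaler.
    "MLP": make_pipeline(StandardScaler(), MLPRegressor(max_iter=2000, random_state=0)),
    "SVR": make_pipeline(StandardScaler(), SVR()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor()),
    "DTR": DecisionTreeRegressor(random_state=0),
}
for name, model in models.items():
    print(name, f"mean CV R2 = {cross_val_score(model, X, y, cv=5).mean():.3f}")
```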