3.1 Satellite Observations and Map
This study utilized the SMAP satellite SSS time series dataset, retrieved in network Common Data Form version 4 (netCDF-4) format from NASA's SMAP online repository managed by the Jet Propulsion Laboratory (JPL, 2020). Tables 3.1 (a) and (b) provide further details on the data. The base map of the study area was sourced from Ajibola-James (2023) and modified as appropriate (Figure 1).
Table 3.1 (a): Satellite dataset retrieved for the study and the sources

| Data Name | Data Variable | Observation Period; Temporal and Spatial Resolutions | Source and Metadata URL |
|---|---|---|---|
| SMAP | SSS; SSS Uncertainty | Jan. 2016 to Dec. 2021; Monthly; 0.25° (Lat.) × 0.25° (Lon.) | JPL (2020), https://doi.org/10.5067/SMP50-3TMCS |
Table 3.1 (b): Quantity, quality and epochs of the dataset analysed for the study

| Data Name | Data Variable | Observation (Obs.) Period | Obs./Time | Total Obs. | RMSD |
|---|---|---|---|---|---|
| SMAP | SSS | Jan. 2016 to Dec. 2020 | 278 | 16680 | 0.1279 psu |
| SMAP | SSS | Jan. to Dec. 2021 | 278 | 3336 | 0.1162 psu |
3.2 Data Preparation
Prior to the modelling and prediction tasks of the study, the appropriate data preparation tasks (data extraction, cleaning and selection) were implemented using automatic (scripted) procedures. The dataset was automatically extracted from the netCDF (.nc and .nc4) files into comma-separated values (.csv) files by executing a Python 3.10.2 script with the glob, netCDF4, pandas, numpy and xarray libraries in the Spyder IDE (Integrated Development Environment) 5.2.2 software. The data cleaning, which involved rigorous supervised-automatic deletion of observation records with null values and with outliers induced by radio frequency interference (RFI) and land contamination, was achieved through three consecutive tasks: (a) automatic deletion of null values by executing a Python script with the pandas, numpy, csv and xarray libraries in the IDE; (b) visual identification and verification of outliers by overlaying each of the monthly SSS observations in the .csv files on Google Earth Pro to ascertain their proximity to land and, hence, their tendency for land contamination; and (c) automatic deletion of the predetermined outliers by using their concatenated location coordinates as deletion criteria in a Python script with the same libraries and IDE used in (a) above. The selection of the 278 appropriate satellite observation points analysed in this study, which constitute the study area (Figure 1), was likewise achieved by executing a Python script with the pandas, numpy, csv and xarray libraries in the IDE. The points were imported and merged with the base map using the overlay function in ArcMap 10.4.1 (Ajibola-James et al., 2023; Ajibola-James, 2023).
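Although the extraction was scripted in Python as described above, the equivalent netCDF-to-CSV step can be sketched in R (the environment used for the analyses later in this study) with the ncdf4 package; the file name and variable names (smap_sss, lat, lon) below are hypothetical placeholders, not those of the actual repository files.

# Minimal sketch of the netCDF-to-CSV extraction step (R, ncdf4 package).
# File and variable names are hypothetical placeholders.
library(ncdf4)
nc <- nc_open("SMAP_SSS_monthly.nc4")    # open one monthly granule
sss <- ncvar_get(nc, "smap_sss")         # SSS grid (lon x lat)
lat <- ncvar_get(nc, "lat")              # latitude vector
lon <- ncvar_get(nc, "lon")              # longitude vector
nc_close(nc)
df <- expand.grid(lon = lon, lat = lat)  # one row per grid point
df$sss <- as.vector(sss)                 # flatten the SSS grid
df <- df[!is.na(df$sss), ]               # drop null values (cleaning task (a))
write.csv(df, "smap_sss_monthly.csv", row.names = FALSE)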
3.3 Data Accuracy and Variability
The accuracy of the satellite SSS data was computed in Microsoft Excel using the SSS uncertainty data (the difference between in situ SSS and satellite SSS), downloaded alongside the SSS data, as the only input (see Table 3.1 (a)). To compute the accuracy of the modelling data, the SSS uncertainty values of the 16680 observation points were loaded into column A (cells A2:A16681). The sum of squares was computed in cell C2 with the formula =SUMSQ(A2:A16681), the mean squared difference (MSD) was computed in cell D2 with =C2/16680, and the RMSD was finally computed with =SQRT(D2). The same procedure was replicated for computing the accuracy of the forecasting data using the 3336 observation points (see Table 3.1 (b) for details of the input datasets).
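In compact form, the spreadsheet procedure above computes

$$\mathrm{RMSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} u_i^{2}},$$

where $u_i$ is the SSS uncertainty of the $i$-th observation point and $n$ is the number of points (16680 for the modelling data; 3336 for the forecasting data).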
Table 3.3: Dataframe for computing interannual variability in SSS

| Year | Mean Annual SSS (psu) |
|---|---|
| 2016 | 33.15872 |
| 2017 | 33.12886 |
| 2018 | 32.79823 |
| 2019 | 32.55897 |
| 2020 | 33.02366 |
The interannual variability of the SSS data was determined by computing the SD, a universal measure of variability, with base R's sd() function in R 4.1.3/RStudio 2022.02.3-492 software. After the mean annual SSS values for 2016 to 2020 were loaded into the software by running data_obs_sss <- read.csv(file.choose(), header = TRUE, stringsAsFactors = FALSE), the dataframe produced (Table 3.3) by running data_sss <- data_obs_sss[, c("year", "sss")] was vectorized by running sss_2016_2020 <- data_sss$sss. The SD was finally computed by running sd(sss_2016_2020).
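Collected into one runnable snippet (the input CSV is chosen interactively via file.choose(); the column names year and sss follow the commands above):

# Interannual variability of SSS (2016-2020) as the SD of the mean annual values
data_obs_sss <- read.csv(file.choose(), header = TRUE, stringsAsFactors = FALSE)
data_sss <- data_obs_sss[, c("year", "sss")]  # keep the year and SSS columns
sss_2016_2020 <- data_sss$sss                 # vectorize the SSS column
sd(sss_2016_2020)                             # sample SD; ~0.253 for the Table 3.3 values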
3.4 Autoregressive Integrated Moving Average Model and Algorithm
In the application of ML methods for modelling and forecasting variations in SSS, the ESP and ARIMA models and algorithms were built primarily with the forecast library 8.17.0 in R 4.1.3/RStudio 2022.02.3-492 software. Other complementary libraries, such as tseries and MLmetrics, were also used in this process. Model fitting and selection were achieved with the auto.arima() function, which determines the best model for the given input data based on relevant model evaluation criteria. The function employs a variant of the Hyndman-Khandakar method, which combines unit root testing, Akaike information criterion (AIC) minimization and maximum likelihood estimation (MLE) to generate ARIMA models (Hyndman & Khandakar, 2008; Hyndman & Athanasopoulos, 2018). The most widely used selection criteria are the AIC and the Bayesian information criterion (BIC) (Rahman & Hasan, 2017; Suleiman & Sani, 2020). The function automates parameter estimation and reports the parameters of the best ARIMA model.
At the inception of the ML modelling task, the dataframe, df, containing 60 monthly epochs (Jan. 2016 to Dec. 2020) of the SSS data was converted to a "time series" (ts) object to satisfy one of the basic assumptions of the ARIMA model. The time series data were assessed for stationarity using both visual and metric approaches: the former involved inspection of the autocorrelation function (ACF) and partial autocorrelation function (PACF) plot patterns, while the latter involved hypothesis testing with augmented Dickey-Fuller (ADF) test metrics. The following hypotheses and decision rule were adopted for the ADF test (a minimal R sketch of the test follows the decision rule):
H0: The series has a unit root (nonstationary)
H1: The series is stationary
where H0 is the null hypothesis and H1 is the alternative hypothesis.
If the p value is ≤ 0.05, H0 is rejected in favour of H1.
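For reproducibility, a minimal R sketch of this stationarity assessment is given below; it assumes the monthly SSS values are in the column df$sss of the dataframe described above, and the name diff_SSS for the differenced series is illustrative.

# Stationarity assessment: visual (ACF/PACF) and metric (ADF) approaches
library(tseries)
Outcome_SSS <- ts(df$sss, start = c(2016, 1), frequency = 12)  # ts conversion
acf(Outcome_SSS)       # visual check: autocorrelation pattern
pacf(Outcome_SSS)      # visual check: partial autocorrelation pattern
adf.test(Outcome_SSS)  # ADF test on the raw series (p = 0.1769 here)
diff_SSS <- diff(Outcome_SSS, differences = 1)  # first-order differencing
adf.test(diff_SSS)     # reassess stationarity (p = 0.01 here)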
Given that the computed p value (0.1769) was > 0.05, H0 could not be rejected, and the series was adjudged nonstationary. To achieve "stationarity", another basic assumption of the ARIMA model, first-order differencing was applied to the data. The differenced output was then reassessed with the ADF test metrics. Given that the computed p value (0.01) was < 0.05, H0 was rejected in favour of H1, confirming stationarity. The best ARIMA model, together with its most appropriate parameters, was identified by applying the auto.arima function to the training data, Outcome_SSS, to build the model mymodel_train, given by running
mymodel_train <- auto.arima(Outcome_SSS, ic='aic', trace=TRUE, approximation=FALSE) (1)
The Ljung-Box (Portmanteau) test was performed to assess whether the residuals of the auto.arima model behave as white noise, based on the following hypotheses and decision rule (a minimal R sketch of the test follows):
H0: The residuals are independently distributed (white noise)
H1: The residuals exhibit autocorrelation (no white noise)
If the p value is < 0.05, H0 is rejected (Hyndman & Khandakar, 2008).
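In R, checkresiduals() from the forecast package reports the Ljung-Box statistic alongside residual plots, while Box.test() from base R is the direct call; the lag value below is illustrative.

# Ljung-Box (Portmanteau) test on the residuals of the fitted model (1)
library(forecast)
checkresiduals(mymodel_train)  # residual plots plus Ljung-Box summary
Box.test(residuals(mymodel_train), lag = 24, type = "Ljung-Box")  # direct test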
Given that the computed p value (0.4522) was > 0.05, H0 could not be rejected; the residuals of (1) were therefore consistent with white noise, indicating an adequate, stationary model. Having confirmed this, (1) was used as input for building the user-defined forecasting model, myforecast_train, given by running
myforecast_train <- forecast(mymodel_train, level=c(95), h=1*12) (2)
where level is the confidence level and h is the forecast horizon in months. The SSS values were thus predicted 12 months ahead by running (2) with the parameter h=1*12. The graph of the SSS values predicted by the model was generated by running
autoplot(myforecast_train) (3)
after running (2) successfully. The modelling accuracy was computed in terms of R2, rsq, by running
sss_obs1 <- myforecast_accuracy_Outcome_SSS$x        # observed (actual) SSS
sss_pred1 <- myforecast_accuracy_Outcome_SSS$fitted  # model-fitted SSS
rss <- sum((sss_pred1 - sss_obs1) ^ 2)               # residual sum of squares
tss <- sum((sss_obs1 - mean(sss_obs1)) ^ 2)          # total sum of squares
rsq <- 1 - rss/tss                                   # coefficient of determination
rsq (4)
while the MAPE outcome of running
myforecast_accuracy_Outcome_SSS <- Arima(Outcome_SSS, model=mymodel_train)
accuracy(myforecast_accuracy_Outcome_SSS) (5)
where Outcome_SSS is the input time series SSS data and mymodel_train is the ML ARIMA model trained on the input time series SSS data, was utilized for validating the outcome of the above modelling accuracy computation.
The forecasting accuracy in terms of the RMSE was computed, and validated by computing the MAPE, with the MLmetrics library for the best ARIMA ML model by running
RMSE(sss_pred1, sss_obs1) (6)
and
MAPE(sss_pred1, sss_obs1)*100 (7)
immediately after running (4) successfully, where sss_pred1 is the predicted SSS values and sss_obs1 is the actual satellite SSS values for January to December 2021.
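Because (4) defines sss_pred1 and sss_obs1 from the fitted model object, a hedged sketch of the hold-out comparison described here, with sss_obs_2021 a hypothetical vector holding the actual January to December 2021 monthly SSS values, is:

# Hold-out evaluation sketch: 12 forecast means from (2) vs. actual 2021 values
library(MLmetrics)
sss_pred_2021 <- as.numeric(myforecast_train$mean)  # 12 forecast values
RMSE(sss_pred_2021, sss_obs_2021)        # forecasting RMSE
MAPE(sss_pred_2021, sss_obs_2021) * 100  # forecasting MAPE (%)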
3.5 Determination and Validation of ARIMA Model Accuracy for Modelling and Forecasting SSS
In subsection 3.4, the accuracy of the built ML ARIMA model for modelling variations in SSS was computed with the R2 performance metric, which represents the amount of variation explained by the ML model. The forecasting accuracy was determined with the RMSE, a measure of accuracy that reveals the magnitude of the difference between the predicted and observed (actual) values. The modelling and forecasting accuracy of the best ML model was further validated in terms of error estimation (residual variation) with the MAPE, a good measure of the absolute percentage difference between predicted and observed values. In general, the greater the R2 value, the greater the amount of variation explained by the ML model, while lower MAPE and RMSE values indicate relatively good forecast accuracy. In terms of real-world interpretation of the error metrics, the MAPE is arguably the most versatile because it is computed in percentage (%) units, and acceptable accuracy levels are well documented for it: a MAPE of less than 10% is considered to indicate "high prediction accuracy" (Lewis, 1982; Ağbulut et al., 2021b; Ajibola-James, 2023). It should be underscored that the true test of an ML time series model's performance is accurate forecasting of new (future) values, which is determined by its performance metrics on target values that were not included in the model's training dataset.
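For reference, the three metrics discussed above, consistent with (4), (6) and (7), have the standard definitions

$$R^{2} = 1-\frac{\sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^{2}}, \qquad \mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_{i}-\hat{y}_{i}}{y_{i}}\right|,$$

where $y_{i}$ is the observed SSS, $\hat{y}_{i}$ is the predicted SSS, $\bar{y}$ is the mean of the observed values and $n$ is the number of observations.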