3.1 Satellite Observations and Map
This study utilized the SMAP satellite SSS time series dataset, retrieved in network Common Data Form version 4 (netCDF-4) format from NASA's SMAP online repository managed by the Jet Propulsion Laboratory (JPL, 2020). Tables 3.1 (a) and (b) provide further details on the data. The base map of the study area was sourced from Ajibola-James (2023) and modified as appropriate (Figure 1).
Table 3.1 (a): Satellite dataset retrieved for the study and the sources

| Data Name | Data Variable | Observation Period; Temporal and Spatial Resolutions | Source and Metadata URL |
|---|---|---|---|
| SMAP | SSS; SSS Uncertainty | Jan. 2016 to Dec. 2021; Monthly; 0.25° (Lat.) × 0.25° (Lon.) | JPL (2020), https://doi.org/10.5067/SMP50-3TMCS |
Table 3.1 (b): Quantity, quality and epochs of the dataset analysed for the study

| Data Name | Data Variable | Observation (Obs.) Period | Obs./Time | Total Obs. | RMSD |
|---|---|---|---|---|---|
| SMAP | SSS | Jan. 2016 to Dec. 2020 | 278 | 16680 | 0.1279 psu |
| SMAP | SSS | Jan. to Dec. 2021 | 278 | 3336 | 0.1162 psu |
3.2 Data Preparation
Prior to the modelling and prediction tasks of the study, the appropriate data preparation tasks (data extraction, cleaning and selection) were implemented using automatic (scripted) procedures. The dataset was automatically extracted from the netCDF (.nc and .nc4) files into comma-separated values (.csv) files by executing a Python 3.10.2 script with the glob, netCDF4, pandas, numpy and xarray libraries in the Spyder IDE (Integrated Development Environment) 5.2.2 software. The data cleaning, which involved rigorous supervised-automatic deletion of observation records with null values and with outliers induced by radio frequency interference (RFI) and land contamination, was achieved through three consecutive tasks: (a) automatic deletion of null values by executing a Python script with the pandas, numpy, csv and xarray libraries in the IDE; (b) visual identification and verification of outliers by overlaying each of the monthly SSS observations in the .csv files on Google Earth Pro to ascertain their proximity to land and, hence, their tendency for land contamination; and (c) automatic deletion of the predetermined outliers by using their concatenated location coordinates as deletion criteria in a Python script with the same libraries and IDE used in (a) above. The selection of the 278 appropriate satellite observation points analysed in this study, which constitute the study area (Figure 1), was likewise achieved by executing a Python script with the pandas, numpy, csv and xarray libraries in the IDE. The points were imported and merged with the base map using the overlay function in ArcMap 10.4.1 (Ajibola-James et al., 2023; Ajibola-James, 2023).
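Although the extraction was scripted in Python as described above, the equivalent netCDF-to-CSV step can be sketched in R (the environment used for the analyses later in this study) with the ncdf4 package; the file name and variable names (smap_sss, lat, lon) below are hypothetical placeholders, not those of the actual repository files.

# Minimal sketch of the netCDF-to-CSV extraction step (R, ncdf4 package).
# File and variable names are hypothetical placeholders.
library(ncdf4)
nc <- nc_open("SMAP_SSS_monthly.nc4")    # open one monthly granule
sss <- ncvar_get(nc, "smap_sss")         # SSS grid (lon x lat)
lat <- ncvar_get(nc, "lat")              # latitude vector
lon <- ncvar_get(nc, "lon")              # longitude vector
nc_close(nc)
df <- expand.grid(lon = lon, lat = lat)  # one row per grid point
df$sss <- as.vector(sss)                 # flatten the SSS grid
df <- df[!is.na(df$sss), ]               # drop null values (cleaning task (a))
write.csv(df, "smap_sss_monthly.csv", row.names = FALSE)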
3.3 Data Accuracy and Variability
The accuracy of the satellite SSS data was computed in Microsoft Excel using the SSS uncertainty data (the difference between in situ SSS and satellite SSS), downloaded alongside the SSS data, as the only input (see Table 3.1 (a)). To compute the accuracy of the modelling data, the SSS uncertainty values of the 16680 observation points were loaded into column A (cells A2:A16681). The sum of squares was computed in cell C2 with the formula =SUMSQ(A2:A16681), the mean squared difference (MSD) was computed in cell D2 with =C2/16680, and the RMSD was finally computed with =SQRT(D2). The same procedure was replicated for computing the accuracy of the forecasting data using the 3336 observation points (see Table 3.1 (b) for details of the input datasets).
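In compact form, the spreadsheet procedure above computes

$$\mathrm{RMSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} u_i^{2}},$$

where $u_i$ is the SSS uncertainty of the $i$-th observation point and $n$ is the number of points (16680 for the modelling data; 3336 for the forecasting data).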
Table 3.3: Dataframe for computing interannual variability in SSS

| Year | Mean Annual SSS (psu) |
|---|---|
| 2016 | 33.15872 |
| 2017 | 33.12886 |
| 2018 | 32.79823 |
| 2019 | 32.55897 |
| 2020 | 33.02366 |
The interannual variability of the SSS data was determined by computing the SD, a universal measure of variability, with base R's sd() function in R 4.1.3/RStudio 2022.02.3-492 software. After the mean annual SSS values for 2016 to 2020 were loaded into the software by running data_obs_sss <- read.csv(file.choose(), header = TRUE, stringsAsFactors = FALSE), the dataframe produced (Table 3.3) by running data_sss <- data_obs_sss[, c("year", "sss")] was vectorized by running sss_2016_2020 <- data_sss$sss. The SD was finally computed by running sd(sss_2016_2020).
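Collected into one runnable snippet (the input CSV is chosen interactively via file.choose(); the column names year and sss follow the commands above):

# Interannual variability of SSS (2016-2020) as the SD of the mean annual values
data_obs_sss <- read.csv(file.choose(), header = TRUE, stringsAsFactors = FALSE)
data_sss <- data_obs_sss[, c("year", "sss")]  # keep the year and SSS columns
sss_2016_2020 <- data_sss$sss                 # vectorize the SSS column
sd(sss_2016_2020)                             # sample SD; ~0.253 for the Table 3.3 values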
3.4 Autoregressive Integrated Moving Average Model and Algorithm
In the application of ML methods for modelling and forecasting variations in SSS, the ESP and ARIMA models and algorithms were built primarily with the forecast library 8.17.0 in R 4.1.3/RStudio 2022.02.3-492 software. Other complementary libraries, such as tseries and MLmetrics, were also used in this process. Model fitting and selection were achieved with the auto.arima() function, which determines the best model for the given input data based on relevant model evaluation criteria. The function employs a variant of the Hyndman-Khandakar method, which combines unit root testing, Akaike information criterion (AIC) minimization and maximum likelihood estimation (MLE) to generate ARIMA models (Hyndman & Khandakar, 2008; Hyndman & Athanasopoulos, 2018). The most widely used selection criteria are the AIC and the Bayesian information criterion (BIC) (Rahman & Hasan, 2017; Suleiman & Sani, 2020). The function automates parameter estimation and reports the parameters of the best ARIMA model.
At the inception of the ML modelling task, the dataframe, df, containing 60 monthly epochs (Jan. 2016 to Dec. 2020) of the SSS data was converted to a "time series" (ts) object to satisfy one of the basic assumptions of the ARIMA model. The time series data were assessed for stationarity using both visual and metric approaches: the former involved inspection of the autocorrelation function (ACF) and partial autocorrelation function (PACF) plot patterns, while the latter involved hypothesis testing with augmented Dickey-Fuller (ADF) test metrics. The following hypotheses and decision rule were adopted for the ADF test (a minimal R sketch of the test follows the decision rule):
H0: The series has a unit root (nonstationary)
H1: The series is stationary
where H0 is the null hypothesis and H1 is the alternative hypothesis.
If the p value is ≤ 0.05, H0 is rejected in favour of H1.
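For reproducibility, a minimal R sketch of this stationarity assessment is given below; it assumes the monthly SSS values are in the column df$sss of the dataframe described above, and the name diff_SSS for the differenced series is illustrative.

# Stationarity assessment: visual (ACF/PACF) and metric (ADF) approaches
library(tseries)
Outcome_SSS <- ts(df$sss, start = c(2016, 1), frequency = 12)  # ts conversion
acf(Outcome_SSS)       # visual check: autocorrelation pattern
pacf(Outcome_SSS)      # visual check: partial autocorrelation pattern
adf.test(Outcome_SSS)  # ADF test on the raw series (p = 0.1769 here)
diff_SSS <- diff(Outcome_SSS, differences = 1)  # first-order differencing
adf.test(diff_SSS)     # reassess stationarity (p = 0.01 here)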
Given that the computed p value (0.1769) was > 0.05, H0 could not be rejected, and the series was adjudged nonstationary. To achieve "stationarity", another basic assumption of the ARIMA model, first-order differencing was applied to the data. The differenced output was then reassessed with the ADF test metrics. Given that the computed p value (0.01) was < 0.05, H0 was rejected in favour of H1, confirming stationarity. The best ARIMA model, together with its most appropriate parameters, was identified by applying the auto.arima function to the training data, Outcome_SSS, to build the model mymodel_train, given by running
mymodel_train <- auto.arima(Outcome_SSS, ic='aic', trace=TRUE, approximation=FALSE) (1)
The Ljung-Box (Portmanteau) test was performed to assess whether the residuals of the auto.arima model behave as white noise, based on the following hypotheses and decision rule (a minimal R sketch of the test follows):
H0: The residuals are independently distributed (white noise)
H1: The residuals exhibit autocorrelation (no white noise)
If the p value is < 0.05, H0 is rejected (Hyndman & Khandakar, 2008).
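In R, checkresiduals() from the forecast package reports the Ljung-Box statistic alongside residual plots, while Box.test() from base R is the direct call; the lag value below is illustrative.

# Ljung-Box (Portmanteau) test on the residuals of the fitted model (1)
library(forecast)
checkresiduals(mymodel_train)  # residual plots plus Ljung-Box summary
Box.test(residuals(mymodel_train), lag = 24, type = "Ljung-Box")  # direct test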
Given that the computed p value (0.4522) was > 0.05, H0 could not be rejected; the residuals of (1) were therefore consistent with white noise, indicating an adequate, stationary model. Having confirmed this, (1) was used as input for building the user-defined forecasting model, myforecast_train, given by running
myforecast_train <- forecast(mymodel_train, level=c(95), h=1*12) (2)
where level is the confidence level and h is the forecast horizon in months. The SSS values were thus predicted 12 months ahead by running (2) with the parameter h=1*12. The graph of the SSS values predicted by the model was generated by running
autoplot(myforecast_train) (3)
after running (2) successfully. The modelling accuracy was computed in terms of R2, rsq, by running
sss_obs1 <- myforecast_accuracy_Outcome_SSS$x        # observed (actual) SSS
sss_pred1 <- myforecast_accuracy_Outcome_SSS$fitted  # model-fitted SSS
rss <- sum((sss_pred1 - sss_obs1) ^ 2)               # residual sum of squares
tss <- sum((sss_obs1 - mean(sss_obs1)) ^ 2)          # total sum of squares
rsq <- 1 - rss/tss                                   # coefficient of determination
rsq (4)
while the MAPE outcome of running
myforecast_accuracy_Outcome_SSS <- Arima(Outcome_SSS, model=mymodel_train)
accuracy(myforecast_accuracy_Outcome_SSS) (5)
where Outcome_SSS is the input time series SSS data and mymodel_train is the ML ARIMA model trained on the input time series SSS data, was utilized for validating the outcome of the above modelling accuracy computation.
The forecasting accuracy in terms of the RMSE was computed, and validated by computing the MAPE, with the MLmetrics library for the best ARIMA ML model by running
RMSE(sss_pred1, sss_obs1) (6)
and
MAPE(sss_pred1, sss_obs1)*100 (7)
immediately after running (4) successfully, where sss_pred1 is the predicted SSS values and sss_obs1 is the actual satellite SSS values for January to December 2021.
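Because (4) defines sss_pred1 and sss_obs1 from the fitted model object, a hedged sketch of the hold-out comparison described here, with sss_obs_2021 a hypothetical vector holding the actual January to December 2021 monthly SSS values, is:

# Hold-out evaluation sketch: 12 forecast means from (2) vs. actual 2021 values
library(MLmetrics)
sss_pred_2021 <- as.numeric(myforecast_train$mean)  # 12 forecast values
RMSE(sss_pred_2021, sss_obs_2021)        # forecasting RMSE
MAPE(sss_pred_2021, sss_obs_2021) * 100  # forecasting MAPE (%)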
3.5 Determination and Validation of ARIMA Model Accuracy for Modelling and Forecasting SSS
In subsection 3.4, the accuracy of the built ML ARIMA model for modelling variations in SSS was computed with the R2 performance metric, which represents the amount of variation explained by the ML model. The forecasting accuracy was determined with the RMSE, a measure of accuracy that reveals the magnitude of the difference between the predicted and observed (actual) values. The modelling and forecasting accuracy of the best ML model was further validated in terms of error estimation (residual variation) with the MAPE, a good measure of the absolute percentage difference between predicted and observed values. In general, the greater the R2 value, the greater the amount of variation explained by the ML model, while lower MAPE and RMSE values indicate relatively good forecast accuracy. In terms of real-world interpretation of the error metrics, the MAPE is arguably the most versatile because it is computed in percentage (%) units, and acceptable accuracy levels are well documented for it: a MAPE of less than 10% is considered to indicate "high prediction accuracy" (Lewis, 1982; Ağbulut et al., 2021b; Ajibola-James, 2023). It should be underscored that the true test of an ML time series model's performance is accurate forecasting of new (future) values, which is determined by its performance metrics on target values that were not included in the model's training dataset.
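For reference, the three metrics discussed above, consistent with (4), (6) and (7), have the standard definitions

$$R^{2} = 1-\frac{\sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^{2}}, \qquad \mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_{i}-\hat{y}_{i}}{y_{i}}\right|,$$

where $y_{i}$ is the observed SSS, $\hat{y}_{i}$ is the predicted SSS, $\bar{y}$ is the mean of the observed values and $n$ is the number of observations.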