The spectro::lyser™ (s::can company) submersible UV-Vis sensors measure approximately 65 cm in length, with a diameter of 44 mm. They are designed to record the attenuation of light (absorbance) almost continuously (one signal per minute). As a light source, each sensor has a xenon lamp that generates wavelengths from 200 nm to 750 nm at 2.5 nm intervals (Langergraber et al. 2004; s::can 2006). Three UV-Vis time series, each with 5705 absorbance spectra, were recorded at the following locations: (i) El-Salitre WWTP tributary from June 29, 2011, at 9:03 a.m. to July 3, 2011, at 5 p.m.: 33 h (readings every minute) in Bogotá; (ii) Gibraltar Pumping Station (GPS) from October 18, 2011, at 4:17 p.m. to October 22, 2011, at 3:21 p.m. (readings every minute) in Bogotá; and (iii) San Fernando WWTP tributary from September 24, 2011, at 6:04 a.m. to October 2, 2011, at 9:16 a.m. (readings every 2 min) in Itagüí, part of the Medellín metropolitan area, as shown in Fig. 1. For each UV-Vis absorbance time series, 4320 values were used for calibration, and 1385 values were used for testing.
Each UV-Vis absorbance time series comprises 219 wavelengths and therefore requires a dimensionality reduction procedure. High dimensionality is a serious problem for machine learning, data mining, and pattern recognition tasks (Zhang et al. 2016; Ayesha et al. 2020). Reducing dimensionality is an important strategy to address this problem, and various methods have been introduced in recent decades (Krawczak and Szkatula 2014). After dimensionality reduction, the new representation of the data is much smaller in volume than the original data set (Zhang et al. 2016; Zhu et al. 2016; Sengodan 2021). Dimensionality reduction algorithms use mathematical transformations to convert the original high-dimensional data space into a lower-dimensional feature space (Zhu et al. 2016; Ayesha et al. 2020; Sengodan 2021).
The color scale in Fig. 1 serves as a visual indicator of absorbance amplitude and represents the presence of determinants commonly monitored in wastewater (Plazas-Nossa et al. 2017). This color scale is based on van den Broeke (2007): (i) in the UV range, dark purple represents determinants such as nitrites (NO2), nitrates (NO3), and COD; (ii) in the visible (Vis) range, violet, blue, green, yellow, orange, and red represent determinants such as turbidity and total suspended solids. The “Time” axis indexes each spectrum captured by the sensor (Plazas-Nossa et al. 2017).
Reducing dimensionality requires less processing time than handling a separate time series for each wavelength. In the present work, Principal Component Analysis (PCA) was used to reduce dimensionality, combined with each of the proposed forecasting methodologies. In addition, previous experience has shown that the forecast can improve if PCA is applied before the forecasting methodology: Plazas-Nossa and Torres (2014) showed that the PCA/DFT forecasting methodology systematically presented lower forecast errors and variability than the DFT procedure alone, without PCA. PCA, proposed by Pearson (1901), performs a linear transformation of the original data set and finds a new coordinate system. In this new coordinate system, the first axis, called the first principal component (PC), captures the highest variance of the data set; the second axis captures the second highest variance, and so on. The covariance matrix must be obtained to construct this linear transformation. The objective is to transform a given data set X, with dimensions n × m, into another data set of lower dimension n × l, with a minimum loss of useful information (Juhos et al. 2008; Shlens 2009; Chowdhury and Husain 2020). For more information, see Plazas-Nossa and Torres (2014). The number of principal components is determined from the variance captured by each PC (its eigenvalue), keeping only those PCs whose eigenvalue is greater than or equal to one (eigenvalue ≥ 1). This criterion is the Kaiser cutoff rule (Kaiser 1960; Jolliffe 2002; Chowdhury and Al-Zahrami 2014). The above procedure was applied to the range of UV-Vis absorbance spectra (200 nm to 745 nm) for the three UV-Vis absorbance time series, all of the same length (5705 records).
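The PCA reduction with the Kaiser cutoff described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `pca_kaiser` and the synthetic data are hypothetical, and the sketch assumes that each wavelength is standardized before the covariance matrix is computed, so that eigenvalues can be compared against one.

```python
import numpy as np

def pca_kaiser(X):
    """Project X (n records x m wavelengths) onto the PCs with eigenvalue >= 1."""
    # Assumption: standardize each wavelength so the Kaiser threshold of 1
    # is meaningful (each standardized variable contributes unit variance).
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0                      # guard against constant columns
    Z = (X - mean) / std

    cov = np.cov(Z, rowvar=False)            # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    keep = eigvals >= 1.0                    # Kaiser cutoff rule
    W = eigvecs[:, keep]                     # m x l loading matrix
    scores = Z @ W                           # n x l reduced data set
    return scores, W, eigvals[keep], mean, std

# Synthetic stand-in for an absorbance matrix (not real spectro::lyser data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
loadings = rng.normal(size=(2, 20))
X_demo = latent @ loadings + 0.01 * rng.normal(size=(500, 20))

scores_demo, W_demo, kept_eigvals, _, _ = pca_kaiser(X_demo)
print("reduced shape:", scores_demo.shape)
print("kept eigenvalues:", kept_eigvals)
```

Each column of the returned scores is a one-dimensional time series, so a forecasting method would be run on l series instead of m; forecasts can be mapped back to wavelengths with `scores @ W.T * std + mean`.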
Various machine learning techniques were tested and were able to capture the behavior of the time series in the calibration stage: Artificial Neural Networks (ANN) (Solomatine 2002; Russell and Norving 2010; Zhu et al. 2022), Support Vector Machines (SVM) (Vapnik et al. 1997; Kandananond 2013; Priyadarshini et al. 2022), and a clustering process (k-means) (Saha and Manickavasagan 2021) combined with Markov chains (Vrugt et al. 2013; Ginting et al. 2014; Okwuashi and Ndehedehe 2021), called kmMC. Machine learning, a subfield of artificial intelligence within computer science, is based on the analysis and creation of algorithms that can be trained on and constructed from time series values (data). In recent years, ANNs have been used successfully for forecasting purposes to obtain forecast values with a one-step horizon. They are accepted across different disciplines and are suitable because of their information processing characteristics, for example non-linearity, parallelism, noise tolerance, and learning and generalization abilities (Yang et al. 2008; Young et al. 2015), and especially their ability to discover non-linear relationships (Faruk 2010; Ohana-Levi et al. 2022). Some studies have shown that ANNs can be a promising technique for water quality forecasting (West and Dellana 2011; Martin et al. 2011; Riesco et al. 2014; Elbisy et al. 2014; Ouma et al. 2020; Uddin et al. 2022), due in particular to their ability to cope with a large number of inputs (multivariate data or training time steps) and to account for non-linearities in noisy data sets, characteristics exhibited by water quality time series captured online (Solomatine 2002).
Another forecasting methodology is SVM, originally developed by Vapnik et al. in 1995. SVMs are supervised learning models with associated learning algorithms that analyze data and recognize patterns. They are learning machines that apply the inductive principle of structural risk minimization to achieve good generalization from a limited number of learning patterns. The method is based on the construction of hyperplanes in a multidimensional space and is used for classification and regression tasks, handling multiple continuous and categorical variables (Kandananond 2013; Imani et al. 2014; Uddin et al. 2022). SVMs are used for many machine learning tasks, such as pattern recognition, object classification, and, in time series forecasting, regression analysis (Sapankevych and Sankar 2009; López-Kleine and Torres 2014; Dilmi and Ladjal 2021). Different methodologies such as Fuzzy Logic, SVM, and Data Assimilation (DA) have been used and reported by Kim et al. (2014b), Tan et al. (2012), and Kim et al. (2014a). Finally, a Markov chain is a mathematical system that undergoes transitions from one state to another. It is a random process generally characterized as memoryless: the next state depends only on the current state and not on the sequence of events that preceded it. This specific type of "forgetfulness" is called the Markov property (Vrugt et al. 2013; Ginting et al. 2014; Okwuashi and Ndehedehe 2021). Many researchers have applied the k-means clustering technique as a complementary tool for forecasting purposes (Zhang and Zhu 2012; Venkatesh et al. 2014; Cheng et al. 2015; Dong et al. 2020). Cluster analysis therefore represents another viable option, as it addresses the underlying structure of multivariate data, natural classification, and compression (Jain 2010; Martin et al. 2011; Farrou et al. 2012; Riesco et al. 2014; Saha and Manickavasagan 2021).
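The Markov property described above can be illustrated with a short sketch that estimates a transition matrix from a sequence of discrete states and forecasts the most probable next state. The helper names and the toy state sequence are hypothetical; the sketch assumes the state labels come from a prior clustering step such as k-means, and it does not reproduce the kmMC procedure of Plazas-Nossa et al. (2015).

```python
import numpy as np

def transition_matrix(states, n_states):
    """Estimate P[i, j] = Pr(next state = j | current state = i) from counts."""
    counts = np.zeros((n_states, n_states))
    for current, following in zip(states[:-1], states[1:]):
        counts[current, following] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums == 0, 1.0, row_sums)
    # Assumption: states with no observed outgoing transition get a uniform row.
    return np.where(row_sums > 0, counts / safe_sums, 1.0 / n_states)

def most_likely_next(P, current):
    """One-step forecast: depends only on the current state (Markov property)."""
    return int(np.argmax(P[current]))

# Toy sequence of cluster labels (hypothetical, e.g. the output of k-means).
labels = [0, 0, 1, 2, 1, 0, 0, 1, 2, 2]
P_demo = transition_matrix(labels, n_states=3)
print(P_demo)
print("most likely state after state 1:", most_likely_next(P_demo, 1))
```

Each row of the matrix is a probability distribution over the next state, so a multi-step forecast can be obtained by applying the one-step rule repeatedly or by using powers of the matrix.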
In summary, the forecasting techniques used in this work are: (i) Principal Component Analysis (PCA) combined with the Discrete Fourier Transform (DFT) (PCA/DFT), proposed by Plazas-Nossa and Torres (2014); (ii) PCA combined with Chebyshev polynomials (Kopriva 2009) (PCA/Ch-Poly); (iii) PCA combined with Legendre polynomials (Kopriva 2009) (PCA/L-Poly); (iv) PCA combined with feed-forward Artificial Neural Networks (PCA/ANN), proposed by Plazas-Nossa et al. (2017); (v) PCA combined with polynomial regression (Barca et al. 2015; Han et al. 2016) (PCA/PolyReg); (vi) PCA combined with SVM (Vapnik et al. 1997; Kandananond 2013) (PCA/SVM); and (vii) a clustering process combined with Markov chains (kmMC), proposed by Plazas-Nossa et al. (2015).
This work proposes a methodology for forecasting UV-Vis absorbance time series that automatically analyzes and selects the best water quality prediction method among those previously described. For each absorbance time series, the proposed procedure takes 4320 values for calibration and 1385 values for testing, and the forecast is made for 360 values. Given the sampling intervals, these 360 forecast values correspond to (i) 6 hours for El-Salitre WWTP and GPS; and (ii) 12 hours for San Fernando WWTP. Subsequently, the Absolute Percentage Error (APE) (Bowerman et al. 2005; Kim et al. 2022; Said et al. 2022), used as a performance indicator to evaluate and compare the seven approaches, is calculated for each wavelength and every 30 time-steps. Based on the MAPE values (mean APE), the best-performing forecast methodology (lowest MAPE value) is identified. The same process is repeated every 30 time-steps and is performed over seven iterations and a 6-hour forecast time horizon. Figure 2 shows a summary of the proposed hybrid forecasting system.
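The selection step can be sketched as follows: for every block of 30 time-steps, compute the APE of each candidate forecast and keep the method with the lowest MAPE. The function names, the dictionary of forecasts, and the toy data are hypothetical. The sketch also assumes that observed absorbance values are non-zero and that the MAPE of a block is the mean APE over all its time-steps and wavelengths, which the text does not specify.

```python
import numpy as np

def ape(observed, forecast):
    """Element-wise absolute percentage error, in percent."""
    observed = np.asarray(observed, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    # Assumption: observed values are non-zero (no division guard here).
    return 100.0 * np.abs((observed - forecast) / observed)

def best_method_per_block(observed, forecasts, block=30):
    """Return, for each block of time-steps, the method with the lowest MAPE.

    observed:  array of shape (T, m), T time-steps by m wavelengths
    forecasts: dict mapping a method name to an array of shape (T, m)
    """
    winners = []
    for start in range(0, observed.shape[0], block):
        window = slice(start, start + block)
        mape = {name: ape(observed[window], values[window]).mean()
                for name, values in forecasts.items()}
        winners.append(min(mape, key=mape.get))
    return winners

# Toy example: method "A" is better in the first block, "B" in the second.
obs = np.full((60, 3), 2.0)
fc_a = np.vstack([obs[:30] * 1.1, obs[30:] * 1.5])
fc_b = np.vstack([obs[:30] * 1.5, obs[30:] * 1.1])
winners_demo = best_method_per_block(obs, {"A": fc_a, "B": fc_b})
print(winners_demo)
```

With a 360-value forecast and blocks of 30 time-steps, this loop would produce 12 selections per time series.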