Study area and dataset
The Klang River basin is located at the central Peninsular Malaysia and encompasses the states of Selangor and the Federal Territory of Kuala Lumpur, Malaysia. The Klang River basin drains from Ulu Gombak to the river mouth in Port Klang and covers about 1288 square kilometres (km2) of catchment area. The basin consists of the main Klang River and 11 tributaries, including Sg. Gombak, Sg. Kerayong, Sg. Penchala, and Sg. Damansara. The upper catchment of the Klang River basin is still preserved by tropical forest, while the middle and lower catchment areas are urban areas. The continuous urban development, industrialization, and growing population, especially in the middle catchment area, has caused the degradation of river water quality. The construction of buildings and highways has caused soil erosion and sediments discharge into waterways. According to the Department of Irrigation and Drainage, Malaysia, an estimated 50 to 60 tons of waste end up in the river system daily in the Klang Valley. This includes very poor solid waste management in squatter areas along the Klang River basin reserve area, which suffer more from river pollution as these squatter areas are not provided with proper sewerage and rubbish disposal facilities and the rubbish is generally disposed directly into rivers.
Figure 1 shows the water quality monitoring stations located within the Klang River basin. Most of the monitoring stations are in urban residential areas, which are in the middle of the river basin. There are 16 monitoring stations in total, which are Stations 1K05, 1K06, 1K07, 1K08, 1K25, 1K45, and 1K46 located along Sg. Klang. Stations 1K17, 1K18 and 1K24 are along Sg. Gombak while Station 1K23 and 1K36 are along Sg. Ampang. Station 1K41 and Station 1K50 are located along Sg. Penchala, while Stations 1K47 and 1K51 are located along Sg. Kerayong and Sg. Damansara respectively. The data considered are the monthly recorded water quality status along the Klang River basin from January 2013 to December 2016.
The water quality parameters measured at each station include dissolve oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), ammoniacal nitrogen (NH3NL), temperature, and pH. Similar variables are also considered by Mohamed et al. (2015). These parameters are important to assess the quality status of river water and used in the development of the water quality index (WQI). The index is used to indicate the level of pollution and the corresponding suitability in terms of water use. The determination of WQI for each location also permits the categorical class based on the National Water Quality Standard (NWQS). The scores WQI ranges from 0 to 100, with 0 to 59 scores representing polluted rivers, 60 to 80 scores representing slightly polluted rivers and 81 to 100 scores representing clean rivers. The WQI is further classified into five major classes which are class I (> 92.7), class II (76.5–92.75), class III (51.9–76.5), class IV (31.0–51.9) and class V (< 31.0). Practically, no treatments are needed for class I as the water is very clean and safe for direct drinking. While water in class II only needs a conventional treatment and class III can still be used for livestock drinking. However, water in class IV is only used for irrigation and class V is considered polluted water and cannot be used for purposes listed in other classes.
Constructing functional data
The data recorded at each monitoring station along the river consists of multiple attributes or parameters. Therefore, the study of water quality data usually deals with multivariate data analysis (Mohamed et al. 2015, Vadde et al. 2018). However, water quality data is also a collection of finite discrete values, which means data taken across time and recorded as discrete time points. In this study, we will use the functional data analysis on the water quality index in order to determine the water quality status in the Klang River basin. It is done by highlighting the outlying points and outlying patterns of the water quality index over time. To achieve that, the first step is to transform the discrete-time series water quality index into a function or a curve. The function can then be considered as a random stochastic process observed at discrete points (Ramsay & Silverman, 2005). The analysis involves monthly water quality data recorded at 16 monitoring stations located along the Klang River from January 2013 to December 2016. We denote the collection of water quality index as \({x}_{i}{(t}_{j})\) where \({t}_{j, }=j,\) for \(j=1,\dots ,48\) and \(i=1,\dots , 16\). For a station \(i\), a set of observation \({x}_{i}{(t}_{j})\) is transformed to a function or curve denoted by \({y}_{i}\left(t\right)\), where \(t\) represents an interval of time, \(t\in \left[\text{1,48}\right].\) The functional form of the water quality data is constructed by smoothing techniques (Ramsay & Silverman, 2005). The discrete observation \({y}_{i}{(t}_{j})\) is fitted using regression model
$${x}_{i}\left({t}_{j}\right)={y}_{i}\left(t\right)+ {\epsilon }_{ij},$$
1
where \({\epsilon }_{ij}\) are the errors and the functions \({y}_{i}\left(\cdot \right)\) are linear combinations of K independent basis functions \({\varphi }_{k}(\cdot )\) and \({c}_{k}^{i}\) are the coefficient of the smoothing model given by
$${y}_{i}\left(t\right)={\sum }_{k=1}^{K}{c}_{ik}{\varphi }_{k}\left(t\right), t\in \left[\text{1,48}\right].$$
2
The functional data sets \({y}_{i}\left(t\right)\) are then given by
$${{y}_{i}\left(t\right)=\widehat{y}}_{i}\left(t\right)={\sum }_{k=1}^{K}{\widehat{c}}_{ik}{\varphi }_{k}\left(t\right), t\in \left[\text{1,48}\right]$$
3
where the estimated coefficients \({\widehat{c}}_{ik}\) are obtained by minimizing the following sum of squares errors
$${\sum }_{j=1}^{48}{\left({y}_{i}{(t}_{j}\right)-{x}_{i}{(t}_{j}\left)\right)}^{2}, i=1,\dots ,16.$$
4
where \({x}_{i}{(t}_{j})\) is the smoothing function \({y}_{i}\left(t\right)\) at time \(t={t}_{j}\).
Various basis functions are available to choose from when performing a functional data analysis. However, depending on the nature of the data, the most practical ones are the two well-known types of basis, which are the B-spline basis and Fourier basis functions. The two basis functions are different in that the B-spline basis is more suitable for non-periodic data while the Fourier basis is preferred over the other basis functions when dealing with periodic data (Ramsay & Silverman, 2005; Ullah & Finch, 2013). In this study, we use the Fourier basis for smoothing the water quality index data because the data is periodic. The FDA is carried out using fda function in R package.
Functional depth of water quality index
A data depth is first introduced in multivariate analysis to measure the “depth” or “outlyingness” of a given multivariate sample with respect to its underlying distribution (Liu et al., 1999). The depth concept provides a way of ordering multivariate sample points in the Euclidean space from the center to outward, where the observation that is closer to the center will have greater depth (Raquel, 2008). This concept has been extended later to the functional data form (Fraiman & Muniz 2001; Cuevas et al., 2006, 2007; Torres et al., 2011). The depth of the functional data measures the centrality of a curve with respect to a set of curves. It provides a center to the outward ordering of sample curves in Hilbert Space (Febrero et al., 2008). López-Pintado and Romo (2009) proposed the band and modified band depths for measuring the functional depth.
The band depth (BD) and modified band depth (MBD) were developed based on the idea of measuring the centrality or “outlyingness” of functions using the graphs of functions and the bands these graphs determine in the plane. A function \(y\left(t\right)\) is the subset of the plane \(G\left(y\right)=\left\{\left(t,y\left(t\right)\right):t\in I\right\}\) and the band in \({\mathbb{R}}^{2}\) determined by the curves \({y}_{i1},\dots {y}_{ik}\) is \(B\left({y}_{i1},\dots ,{y}_{jk}\right)=\left\{\left(,\right)\right\}.\)In general, BD computes the fraction of the bands containing the curve \({y}_{i}\) while MBD is a modified version of BD. To avoid many depth ties that occur using BD, the MBD measures the proportion of time that a curve \({y}_{i}\) is in the band. The population version of the band depth for a given curve \(y\left(t\right)\) with respect to the probability measure \(P\) is defined as
\({BD}_{J}\left(y,P\right)={\sum }_{j=2}^{J}{BD}^{j}\left(y\right)={\sum }_{j=2}^{J}P\left\{G\right(y) \subset B({Y}_{1},\dots ,{Y}_{j}\left)\right\},\) | (5) |
where \(J\) is the number of curves determining a band which has a fixed value with\(2 \le J \le n\) and \(B({Y}_{1},\dots {Y}_{j})\) is a band delimited by j random curves. The sample version of BD is given as
$${\text{B}\text{D}}_{n}^{\left(j\right)}\left(y\right)={\left(\genfrac{}{}{0pt}{}{n}{j}\right)}^{-1}{\sum }_{1\le {i}_{1}\le {i}_{2}\le {\dots \le i}_{j}\le n}I\left\{G\right(y)\subseteq B\left({y}_{i1},\dots ,{y}_{ij}\right)\},$$
where \(I\{\bullet \}\) denotes the indicator function. For the MBD, the indicator function is substituted with a more flexible definition by measuring the proportion of time that a curve \(y\left(t\right)\) is in the band. The equation is given by
\({\text{M}\text{B}\text{D}}_{n}^{\left(j\right)}\left(y,P\right)={\left(\genfrac{}{}{0pt}{}{n}{j}\right)}^{-1}{\sum }_{1\le {i}_{1}\le {i}_{2}\le {\dots \le i}_{j}\le n}{\lambda }_{r}\left\{A\right(y;{y}_{i1},\dots ,{y}_{ij})\},\) | (6) |
where \({A}_{j}\left(y\right)\equiv A\left({y;y}_{i1},\dots ,{y}_{ij}\right)\) and \({\lambda }_{r}\left(y\right)= {\lambda (A}_{j}\left(y\right))/{\lambda }\left(I\right)\), if \(\lambda\) is the Lebesgue measure on \(I.\) For MBD, it is more convenient to obtain the magnitude and shape of the curve as it takes into account the proportion of times that a curve is in the band. The depth measurements belong to each curve are then ordered by ascending order and the greatest depth is with the function in the center of the graph while the function with the lowest depth is outward from the center. Next section discussed the used of functional depths for the building of functional boxplot.
Functional boxplot and functional outlier detection
A classical tool for functional outlier detection is known as the functional boxplot proposed by Sun and Genton (2011). Similarly, to the case of real, univariate data, the functional boxplot is a visualization tool used to display the distribution of observation and to identify functional outliers. The functional boxplot is based on the center outward ordering induced by band depth for functional data. Given a sample of \(n\) functional data, \({y}_{1}\left(t\right),\dots {y}_{n}\left(t\right)\), we order the functions according to results of BD or MBD in decreasing order \({Y}_{1},\dots {Y}_{n}\). Then. the sample \(\alpha\) central region is denoted by
$${C}_{\alpha }\left(Y\right)=\left\{\left(t, y\left(t\right)\right):{Y}_{l}\left(t\right)\le y\left(t\right)\le {Y}_{r}\left(t\right) \right\}$$
7
.
Thus, the descriptive statistics of a functional boxplot are the envelope of the 50% central region which is the most central curve of the sample, the median curve and the maximum non-outlying envelope. In addition, outliers can be detected in a functional boxplot by the 1.5 times the 50% central region empirical rule.