Assessment of Water Quality Data Using Functional Data Analysis for Klang River Basin, Malaysia

doi:10.21203/rs.3.rs-2778529/v1

Research Article

https://doi.org/10.21203/rs.3.rs-2778529/v1

This work is licensed under a CC BY 4.0 License

Version 1

Rivers are subject to different sources of pollution. Continuous monitoring of river water quality provides an important basis for the authorities to take appropriate action. Water quality monitoring stations located within a river basin provide the data needed to establish any changes in river water quality. It is important to highlight poor water quality status at specific monitoring stations so that immediate action can be taken. Similarly, it is of utmost importance to ensure that water quality at monitoring stations close to water catchment areas always remains at an acceptable level. This study aims to identify such monitoring stations using descriptive and functional data analysis. The approaches were applied to water quality data collected by the Department of Environment Malaysia at 16 stations in the Klang River basin from January 2013 to December 2016. Specifically, the functional boxplot was applied to identify monitoring stations with outlying properties. We identified many occasions when water quality deteriorated or improved, largely due to increases or decreases in COD, BOD and TSS. In addition, three stations close to two main catchment areas and a forest reserve showed consistently good water quality. This indicates that the areas surrounding the upstream stations are still protected from uncontrolled pollution sources. The study is critical for the authorities to understand the overall pattern of water quality data at each station so that action can be planned locally to preserve good river water quality.

River

water quality

outlier

functional data analysis

functional boxplot

Water quality monitoring is crucial to sustain or improve the health of the river ecosystem for present and future generations. The collection of water quality data is necessary to monitor any changes over time that may affect the present and future uses of water (Cosgrove & Loucks, 2015). In 2019, river water quality in Malaysia was assessed based on a total of 8,118 samples taken from a total of 1,353 manual monitoring stations covering 672 rivers (Malaysia environmental quality report, 2019). The report shows that 61% of the rivers had clean water quality, 30% were slightly polluted, and 9% were polluted. It is highlighted that the river water quality in the Klang River basin is the lowest amongst the rivers in Malaysia. As reported in the State of River Report 2015, the Department of Irrigation and Drainage Malaysia also identified the Klang River as the most polluted river in the country, with an estimated 77,000 tons of garbage being dumped each year. Azman et al. (2018) stated that, for more than 10 years, the Klang River has been under severe threat from a variety of pollution sources, including food and beverage, chemical, semiconductor, and electronics sectors. To make matters worse, the river also passes through heavily populated areas, making it difficult to measure pollutant loading in the river. Thus, proper planning should be in place to maintain the supply of clean water for current and future generations.

Continuous river monitoring is very useful to identify areas where water quality degradation exists and to investigate its causes and sources (Vega-Rodríguez et al., 2021). The monitoring activity usually collects and records the water quality data over time and helps to identify specific pollutants, their sources, and occurrences. The occasional outlying values in water quality samples behave as outliers in an environmental database (Rangeti et al., 2015). The causes of water quality degradation may be due to various circumstances. Pollutants from various sources, such as untreated wastewater from industrial activities as well as squatter populations located along the river, contribute to worsening river water quality degradation problems (Asiah, 2017). Besides, sediments flushed into water bodies by rainfall during the rainy season may also degrade the quality of the water (Sidek et al., 2016). Thus, water quality monitoring or sensors are used increasingly in practice to provide continuous monitoring of water health for different places. Shelutko and Makarova (2020) studied the water data collected in the Velyka River and found the detected outliers in the data are observed during bad or extreme weather conditions. In addition, Talagala et al. (2019) associated the outlier detected in their study with a sudden high flow event in the river. With the same motivation, we attempt to detect outliers in the Klang River basin and investigate the causes of such events.

To achieve better decisions in river water management, it is of utmost importance to employ appropriate statistical methods to trace any suspicious values or rare patterns in the water quality data. Among the methods available, outlier detection techniques are crucial for evaluating data and detecting anomalies that may affect the accuracy of the results. A number of methods are available to detect outliers in river water data, for example, the interquartile range (Amirabadizadeh et al., 2015), the boxplot (Bresciani et al., 2019), and principal component analysis (Marinović Ruždjak & Domagoj Ruždjak, 2015; Zavareh et al., 2021). The interquartile range is a classical way to flag outliers, as the method reduces biases in the dataset caused by the outliers while retaining information on extreme events (Amirabadizadeh et al., 2015). In exploratory data analysis, a boxplot is a graphical method that displays the distribution of data based on summary statistics, namely the minimum, maximum, median, and quartiles. Individual observations lying beyond the whiskers of the boxplot are identified as outliers. For multivariate data, principal component analysis has often been used, whereby a point that deviates in the principal component subspace is identified as an outlier. Recently, other detection methods based on modelling and prediction have been introduced, such as machine learning methods (Mokua et al., 2021; Almuhtaram et al., 2021) and the Bayesian autoregressive method (Liu et al., 2020).

So far, the methods used for detecting rare patterns in Malaysian river data ignore the information about the smooth functional behaviour of the data over time. This functional approach, in fact, allows for the treatment of the data set as continuous measurements over time, instead of the original discrete values (Horváth & Kokoszka 2012). The benefits of considering functional data, according to Ramsay and Silverman (2005), include the ability to produce functional representations of a finite set of observations through smoothing, the ability to think of modelling issues more naturally in a functional form, and the ability to estimate the entire function to extract additional information contained in the function and its derivative, which is normally not available from the application of classical methods. In addition, this approach allows us to analyze the trend of water quality data and detect the abnormal trend without requiring the data to be normally distributed as in classical data analysis (Blasi et al., 2013).

The change in shape and magnitude of a function that does not follow the same pattern as the rest of the curves is associated with a functional outlier (Febrero et al., 2008). The functional outlier may be due to errors in measurements and recording or may indicate abnormal behaviour in the system and can lead to useful information or significant discoveries of polluted areas. The functional outliers of water quality data can be detected using some functional depths such as Fraiman–Muniz depth (FMD) (Fraiman & Muniz, 2001) and H-modal depth (HMD) (Cuevas et al., 2006). Recently, Hussain (2019) demonstrated the detection of outliers in hydrology data using two graphical methods, namely, the functional bagplot and the functional boxplot. Other studies include Muñiz et al. (2012), Blasi et al. (2013), Sancho et al. (2015), Millán-Roures et al. (2018).

The functional framework could be considered as an appropriate way to allow a comprehensive examination of water quality data by treating water quality data at a spatial location as one observation. This study will use the smoothing procedure to yield functional representations of a finite set of observations for the purpose of detecting abnormal water quality levels and patterns in one of the river basins in Malaysia. Specifically, we want to understand water quality degradation in the Klang River basin, the most developed area within the state of Selangor and Kuala Lumpur. The degradation of river water quality is the issue that needs to be resolved in order to support the Malaysian government's effort to sustain a healthy river ecosystem. The findings reported in this research may be beneficial to water resource management planning for water treatment in targeted spatial areas. This paper is structured as follows: Section 2 describes the materials and methods employed in this study. Section 3 presents the results obtained and the discussion of the results. Finally, the conclusion is given in the last section.

Study area and dataset

The Klang River basin is located in central Peninsular Malaysia and encompasses the state of Selangor and the Federal Territory of Kuala Lumpur. The basin drains from Ulu Gombak to the river mouth in Port Klang and covers about 1288 square kilometres (km²) of catchment area. It consists of the main Klang River and 11 tributaries, including Sg. Gombak, Sg. Kerayong, Sg. Penchala, and Sg. Damansara. The upper catchment of the Klang River basin is still preserved by tropical forest, while the middle and lower catchment areas are urban. Continuous urban development, industrialization, and a growing population, especially in the middle catchment area, have caused the degradation of river water quality. The construction of buildings and highways has caused soil erosion and sediment discharge into waterways. According to the Department of Irrigation and Drainage, Malaysia, an estimated 50 to 60 tons of waste end up in the river system daily in the Klang Valley. This includes waste from squatter areas along the Klang River basin reserve, where solid waste management is very poor. These areas are not provided with proper sewerage and rubbish disposal facilities, so rubbish is generally disposed of directly into the rivers.

Figure 1 shows the water quality monitoring stations located within the Klang River basin. Most of the monitoring stations are in urban residential areas in the middle of the river basin. There are 16 monitoring stations in total. Stations 1K05, 1K06, 1K07, 1K08, 1K25, 1K45, and 1K46 are located along Sg. Klang. Stations 1K17, 1K18 and 1K24 are along Sg. Gombak, while Stations 1K23 and 1K36 are along Sg. Ampang. Stations 1K41 and 1K50 are located along Sg. Penchala, while Stations 1K47 and 1K51 are located along Sg. Kerayong and Sg. Damansara respectively. The data considered are the monthly recorded water quality status along the Klang River basin from January 2013 to December 2016.

The water quality parameters measured at each station include dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), ammoniacal nitrogen (NH₃NL), temperature, and pH. Similar variables are also considered by Mohamed et al. (2015). These parameters are important for assessing the quality status of river water and are used in the development of the water quality index (WQI). The index indicates the level of pollution and the corresponding suitability of the water for use. Determining the WQI for each location also permits classification based on the National Water Quality Standard (NWQS). WQI scores range from 0 to 100, with scores of 0 to 59 representing polluted rivers, 60 to 80 slightly polluted rivers, and 81 to 100 clean rivers. The WQI is further classified into five major classes: class I (> 92.7), class II (76.5–92.7), class III (51.9–76.5), class IV (31.0–51.9) and class V (< 31.0). In practice, no treatment is needed for class I, as the water is very clean and safe for direct drinking. Water in class II needs only conventional treatment, and class III water can still be used for livestock drinking. However, water in class IV is only used for irrigation, and class V is considered polluted water that cannot be used for any of the purposes listed for the other classes.
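The class and status thresholds quoted above can be written as a small lookup. This is a minimal sketch in Python, not part of the original analysis. The function names `wqi_class` and `wqi_status` are hypothetical, and the treatment of scores that fall exactly on a class boundary (assigned here to the better class) is an assumption, since the text does not specify it.

```python
def wqi_class(wqi):
    """Return the NWQS class (I to V) for a WQI score.

    Assumption: a score exactly on a boundary goes to the better class,
    except for class I, which the text defines strictly as > 92.7.
    """
    if wqi > 92.7:
        return "I"
    if wqi >= 76.5:
        return "II"
    if wqi >= 51.9:
        return "III"
    if wqi >= 31.0:
        return "IV"
    return "V"


def wqi_status(wqi):
    """Return the pollution status for a WQI score.

    Assumption: the cut-offs are taken at 60 and 81, so non-integer
    scores between the quoted ranges fall into the lower status.
    """
    if wqi >= 81:
        return "clean"
    if wqi >= 60:
        return "slightly polluted"
    return "polluted"


# Scores taken from Table 1 of this paper
print(wqi_class(30.7), wqi_status(30.7))   # lowest score, station 1K41
print(wqi_class(73.1), wqi_status(73.1))   # highest score, station 1K24
```

Under these assumptions the lowest recorded score (30.7) falls in class V and the highest (73.1) in class III.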

Constructing functional data

The data recorded at each monitoring station along the river consist of multiple attributes or parameters. Therefore, the study of water quality data usually involves multivariate data analysis (Mohamed et al. 2015, Vadde et al. 2018). However, water quality data are also a collection of finite discrete values, that is, measurements taken over time and recorded at discrete time points. In this study, we apply functional data analysis to the water quality index in order to determine the water quality status in the Klang River basin. This is done by highlighting the outlying points and outlying patterns of the water quality index over time. To achieve that, the first step is to transform the discrete time series of the water quality index into a function or curve. The function can then be considered as a realization of a stochastic process observed at discrete points (Ramsay & Silverman, 2005). The analysis involves monthly water quality data recorded at 16 monitoring stations located along the Klang River from January 2013 to December 2016. We denote the collection of water quality index values by $x_i(t_j)$, where $t_j = j$ for $j=1,\dots,48$ and $i=1,\dots,16$. For station $i$, the set of observations $x_i(t_j)$ is transformed into a function or curve denoted by $y_i(t)$, where $t$ ranges over the time interval $[1,48]$. The functional form of the water quality data is constructed by smoothing techniques (Ramsay & Silverman, 2005). The discrete observations $x_i(t_j)$ are fitted using the regression model

$$x_i(t_j) = y_i(t_j) + \epsilon_{ij},$$

where $\epsilon_{ij}$ are the errors and the functions $y_i(\cdot)$ are linear combinations of $K$ independent basis functions $\varphi_k(\cdot)$ with coefficients $c_{ik}$, given by

$${y}_{i}\left(t\right)={\sum }_{k=1}^{K}{c}_{ik}{\varphi }_{k}\left(t\right), t\in \left[\text{1,48}\right].$$

The functional data sets ${y}_{i}\left(t\right)$ are then given by

$$y_i(t) = \widehat{y}_i(t) = \sum_{k=1}^{K} \widehat{c}_{ik}\,\varphi_k(t), \quad t\in[1,48],$$

where the estimated coefficients ${\widehat{c}}_{ik}$ are obtained by minimizing the following sum of squares errors

$$\sum_{j=1}^{48}\left(y_i(t_j) - x_i(t_j)\right)^2, \quad i=1,\dots,16,$$

where $y_i(t_j)$ is the smoothing function $y_i(t)$ evaluated at time $t=t_j$ and $x_i(t_j)$ is the corresponding observed value.

Various basis functions are available when performing a functional data analysis. Depending on the nature of the data, the most practical are two well-known types, the B-spline basis and the Fourier basis. The B-spline basis is more suitable for non-periodic data, while the Fourier basis is preferred when dealing with periodic data (Ramsay & Silverman, 2005; Ullah & Finch, 2013). In this study, we use the Fourier basis for smoothing the water quality index data because the data are periodic. The FDA is carried out using the fda package in R.
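The least-squares smoothing step can be illustrated with a short numerical sketch. It is written in Python with NumPy rather than the R fda package used in the study, and it is not the authors' code. The helper names `fourier_basis` and `fit_coefficients`, the choice $K=9$, the period of 48 months, the unnormalized basis, and the synthetic series are all assumptions made for illustration only.

```python
import numpy as np


def fourier_basis(t, K, period):
    """Evaluate K unnormalized Fourier basis functions at the times t.

    Column order: 1, sin(wt), cos(wt), sin(2wt), cos(2wt), ...
    with w = 2*pi/period.
    """
    t = np.asarray(t, dtype=float)
    omega = 2.0 * np.pi / period
    cols = [np.ones_like(t)]              # constant term
    m = 1
    while len(cols) < K:
        cols.append(np.sin(m * omega * t))
        if len(cols) < K:
            cols.append(np.cos(m * omega * t))
        m += 1
    return np.column_stack(cols)          # shape (len(t), K)


def fit_coefficients(x, t, K, period):
    """Least-squares estimate of the basis coefficients for one station."""
    Phi = fourier_basis(t, K, period)
    c_hat, *_ = np.linalg.lstsq(Phi, np.asarray(x, dtype=float), rcond=None)
    return c_hat


# Synthetic monthly series with an annual cycle (NOT data from the study)
t = np.arange(1, 49)
x = 50.0 + 5.0 * np.sin(2.0 * np.pi * t / 12.0)

c_hat = fit_coefficients(x, t, K=9, period=48.0)
y_hat = fourier_basis(t, 9, 48.0) @ c_hat     # smoothed curve at t_1..t_48
```

Because the synthetic series lies exactly in the span of the nine basis functions, the fitted curve reproduces it. With real WQI data, the number of basis functions $K$ controls how much of the month-to-month variation is smoothed away.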

Functional depth of water quality index

Data depth was first introduced in multivariate analysis to measure the “depth” or “outlyingness” of a given multivariate sample with respect to its underlying distribution (Liu et al., 1999). The depth concept provides a way of ordering multivariate sample points in Euclidean space from the center outward, where an observation closer to the center has greater depth (Raquel, 2008). This concept was later extended to functional data (Fraiman & Muniz 2001; Cuevas et al., 2006, 2007; Torres et al., 2011). The depth of functional data measures the centrality of a curve with respect to a set of curves. It provides a center-outward ordering of sample curves in Hilbert space (Febrero et al., 2008). López-Pintado and Romo (2009) proposed the band and modified band depths for measuring functional depth.

The band depth (BD) and modified band depth (MBD) were developed based on the idea of measuring the centrality or “outlyingness” of functions using the graphs of the functions and the bands these graphs determine in the plane. The graph of a function $y(t)$ is the subset of the plane $G(y)=\{(t,y(t)):t\in I\}$, and the band in $\mathbb{R}^{2}$ determined by the curves $y_{i_1},\dots,y_{i_k}$ is $B(y_{i_1},\dots,y_{i_k})=\{(t,x(t)): t\in I,\ \min_{r=1,\dots,k} y_{i_r}(t)\le x(t)\le \max_{r=1,\dots,k} y_{i_r}(t)\}$. In general, BD computes the fraction of the bands containing the curve $y_i$, while MBD is a modified version of BD. To avoid the many depth ties that occur with BD, the MBD measures the proportion of time that a curve $y_i$ is in the band. The population version of the band depth for a given curve $y(t)$ with respect to the probability measure $P$ is defined as

$$BD_J(y,P)=\sum_{j=2}^{J}BD^{(j)}(y)=\sum_{j=2}^{J}P\left\{G(y)\subset B(Y_1,\dots,Y_j)\right\},$$

(5)

where $J$ is a fixed number of curves determining a band, with $2 \le J \le n$, and $B(Y_1,\dots,Y_j)$ is a band delimited by $j$ random curves. The sample version of BD is given as

$$BD_n^{(j)}(y)=\binom{n}{j}^{-1}\sum_{1\le i_1<i_2<\dots<i_j\le n} I\left\{G(y)\subseteq B(y_{i_1},\dots,y_{i_j})\right\},$$

where $I\{\bullet \}$ denotes the indicator function. For the MBD, the indicator function is substituted with a more flexible definition by measuring the proportion of time that a curve $y\left(t\right)$ is in the band. The equation is given by

$$MBD_n^{(j)}(y)=\binom{n}{j}^{-1}\sum_{1\le i_1<i_2<\dots<i_j\le n}\lambda_r\left\{A(y;y_{i_1},\dots,y_{i_j})\right\},$$

(6)

where $A_j(y)\equiv A(y;y_{i_1},\dots,y_{i_j})=\{t\in I:\ \min_{r} y_{i_r}(t)\le y(t)\le \max_{r} y_{i_r}(t)\}$ is the set of time points at which $y$ lies inside the band, and $\lambda_r(A_j(y))=\lambda(A_j(y))/\lambda(I)$, with $\lambda$ the Lebesgue measure on $I$. The MBD is more convenient for capturing both the magnitude and the shape of a curve, as it takes into account the proportion of time that the curve is in the band. The depth values of the curves are then ordered: the function with the greatest depth lies at the center of the sample, while the function with the lowest depth lies furthest from the center. The next section discusses the use of functional depths for building the functional boxplot.
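For curves evaluated on a common time grid, the sample MBD with $j=2$ can be computed directly from the definition. The following is a minimal Python sketch under stated assumptions: the function name `modified_band_depth` is hypothetical, the Lebesgue measure is approximated by the fraction of grid points inside the band, and only bands formed by pairs of curves are used. It is not the implementation used in the study.

```python
import numpy as np
from itertools import combinations


def modified_band_depth(curves):
    """Sample MBD with j = 2 for curves on a common grid.

    curves: array of shape (n, T), one row per curve.
    Returns an array of n depth values in [0, 1].
    Assumption: the proportion of time inside a band is approximated by
    the fraction of the T grid points at which the curve lies in the band.
    """
    curves = np.asarray(curves, dtype=float)
    n = curves.shape[0]
    pairs = list(combinations(range(n), 2))   # all i1 < i2
    depth = np.zeros(n)
    for a, b in pairs:
        lower = np.minimum(curves[a], curves[b])
        upper = np.maximum(curves[a], curves[b])
        inside = (curves >= lower) & (curves <= upper)   # shape (n, T)
        depth += inside.mean(axis=1)
    return depth / len(pairs)


# Three flat toy curves: the middle one should be the deepest
toy = np.array([[0.0] * 5, [1.0] * 5, [2.0] * 5])
toy_depth = modified_band_depth(toy)
```

In the toy example the middle curve lies inside every band and so has depth 1, while each outer curve lies only inside the two bands it helps to define, giving depth 2/3.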

Functional boxplot and functional outlier detection

A classical tool for functional outlier detection is the functional boxplot proposed by Sun and Genton (2011). Similar to the boxplot for univariate data, the functional boxplot is a visualization tool used to display the distribution of the observations and to identify functional outliers. The functional boxplot is based on the center-outward ordering induced by band depth for functional data. Given a sample of $n$ functional data, $y_1(t),\dots,y_n(t)$, we order the functions in decreasing order of their BD or MBD values and denote the ordered curves by $Y_1,\dots,Y_n$, so that $Y_1$ is the deepest curve. Then the sample $\alpha$ central region is given by

$$C_{\alpha}(Y)=\left\{(t,y(t)):\ \min_{r=1,\dots,\lceil \alpha n\rceil} Y_r(t)\le y(t)\le \max_{r=1,\dots,\lceil \alpha n\rceil} Y_r(t)\right\}.$$

Thus, the descriptive statistics of a functional boxplot are the envelope of the 50% central region, the median curve (the most central curve of the sample), and the maximum non-outlying envelope. In addition, outliers are detected by the empirical rule of 1.5 times the 50% central region: the envelope of the 50% central region is inflated by 1.5 times its range, and any curve that falls outside the inflated envelope at some time point is flagged as an outlier.
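The outlier rule of the functional boxplot can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure run in the study: the function names are hypothetical, depth is computed with a pairwise ($j=2$) MBD on a common grid, and the toy curves are synthetic.

```python
import numpy as np
from itertools import combinations


def mbd(curves):
    """Pairwise (j = 2) modified band depth on a common grid."""
    n = curves.shape[0]
    pairs = list(combinations(range(n), 2))
    depth = np.zeros(n)
    for a, b in pairs:
        lo = np.minimum(curves[a], curves[b])
        hi = np.maximum(curves[a], curves[b])
        depth += ((curves >= lo) & (curves <= hi)).mean(axis=1)
    return depth / len(pairs)


def functional_boxplot_outliers(curves, factor=1.5):
    """Return (indices of flagged curves, index of the median curve)."""
    curves = np.asarray(curves, dtype=float)
    n = curves.shape[0]
    order = np.argsort(-mbd(curves), kind="stable")   # deepest first
    central = curves[order[: int(np.ceil(n / 2))]]    # 50% central region
    lower, upper = central.min(axis=0), central.max(axis=0)
    width = upper - lower
    fence_lo = lower - factor * width                 # inflated envelope
    fence_hi = upper + factor * width
    outside = (curves < fence_lo) | (curves > fence_hi)
    flagged = np.where(outside.any(axis=1))[0]
    return flagged, int(order[0])


# Synthetic toy sample: five flat curves at 0..4 and one far above at 20
levels = [0.0, 1.0, 2.0, 3.0, 4.0, 20.0]
toy_curves = np.array([[v] * 10 for v in levels])
flagged, median_idx = functional_boxplot_outliers(toy_curves)
```

In this toy sample only the curve at level 20 leaves the inflated envelope, so it is the single flagged curve. A curve is flagged as soon as it crosses the fence at any one time point, which is why a station can be detected as a functional outlier even when it stays inside the envelope for most of the period.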

Descriptive statistics

The summary statistics of WQI for the 16 monitoring stations in the Klang River basin for the years 2013 to 2016 are presented in Table 1. We can see that the minimum average WQI score is 30.7 (Sg. Penchala, 1K41), while the maximum average is 73.1 (Sg. Gombak, 1K24).

Table 1

Summary statistics of average WQI by stations
Station	Min.	Max.	1st Qu.	3rd Qu.	Median	Mean	Std. dev
1K05	35.9	67.1	44.3	53.4	49.9	49.4	7.5
1K06	36.1	58.6	40.8	48.3	44.3	45.2	5.4
1K07	38.1	57.9	45.0	49.1	46.8	47.1	4.1
1K08	38.7	56.5	44.9	48.4	47.4	46.8	3.6
1K17	35.2	63.8	45.4	50.2	47.2	47.9	5.0
1K18	44.5	64.7	48.1	54.6	51.0	51.7	4.8
1K23	34.2	68.2	43.3	50.6	46.6	47.5	6.3
1K24	57.2	73.1	66.7	69.7	68.0	67.7	2.9
1K25	39.5	58.0	44.9	49.9	48.2	47.7	4.0
1K36	39.0	66.0	47.9	55.8	52.1	51.6	5.8
1K41	30.7	56.1	40.7	48.3	44.8	44.2	5.3
1K45	53.7	70.4	63.6	67.7	65.8	65.3	3.1
1K46	45.5	65.7	52.1	57.4	55.6	55.1	4.7
1K47	32.2	56.3	37.7	43.7	39.9	40.9	5.1
1K50	59.9	70.3	63.8	67.7	66.2	65.7	2.7
1K51	37.6	72.5	60.2	69.7	67.4	63.6	8.3

Based on Malaysia’s National Water Quality Standard (NWQS), the minimum score is in the lowest WQI class V, which is not suitable for any significant purposes, while the maximum score is close to class II, which is suitable for recreational activities. Overall, the mean average WQI scores of the monitoring stations range from about 41 to 68, which places them in classes III and IV. Two monitoring stations, namely 1K51 and 1K05, have large standard deviations, indicating inconsistency in the WQI scores recorded at these stations. In addition, the median and mean are almost the same for all stations except 1K51. It is suspected that there are outliers at station 1K51, as the mean is somewhat lower than the median. The large interquartile range and standard deviation for 1K51 also indicate a possible occurrence of outliers at the station, though the station belongs to the good water quality category. In contrast, the quartiles of average WQI scores for stations 1K24, 1K45, and 1K50 are larger, with small interquartile ranges and standard deviations. These statistics indicate that the average WQI values are consistently good throughout the period considered. These stations are not exposed to severe sources of pollution, as they are located in the upper stream of the rivers and surrounded by green spaces.

The data can be further described by the multiple boxplots illustrated in Fig. 2. Two important results are observed. Firstly, the water quality monitoring stations can be divided into two groups. The first group comprises stations with high mean average WQI scores of around 60–70 (1K24, 1K45, 1K50 and 1K51). These values are in the upper range of class III of the NWQS index. For simplicity, we denote this group of stations as Category 1. The other stations, with mean average WQI scores of less than 60, belong to Category 2. These stations have mean average WQI in the lower range of class III (1K18, 1K36, 1K46) and in class IV (1K06–1K08, 1K17, 1K23, 1K25, 1K41, 1K47). Secondly, the occurrence of outliers can also be observed in the boxplots. For example, in Category 1, the WQI for 1K51 drops more than 20 units from the median value, and the lower whisker of the plot is also rather long. In fact, almost 25% of the observations from this station fall below 60. These lower-tailed outliers indicate that the water at this station was frequently polluted during the period. Among the Category 1 stations, lower-tailed outliers are also observed at stations 1K24 and 1K45. As for stations in Category 2, there are instances when the WQI becomes better (upper-tailed outlier) or worse (lower-tailed outlier) at various time points.
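As a rough check on the boxplot reading above, the conventional 1.5 × IQR whisker rule can be applied to the quartiles reported in Table 1. The paper does not state which whisker rule was used to draw Fig. 2, so the rule, the function name `tukey_fences`, and the factor 1.5 are assumptions in this Python sketch.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles for station 1K51 taken from Table 1
lo_51, hi_51 = tukey_fences(60.2, 69.7)

# Minimum WQI recorded at 1K51 (Table 1); it lies below the lower fence
is_low_outlier = 37.6 < lo_51
```

With these quartiles the lower fence for 1K51 is about 46, so the station's minimum score of 37.6 would be flagged as a lower-tailed outlier under this rule.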

Table 2

Outlier points and their water quality parameters
Outliers Type	No.	Station No.	DO	BOD	COD	TSS	pH	AN	Temp	WQI
Lower end outlier (Category 1)	387	1K24	9.3	16.0	49.3	25.0	7.5	0.1	24.4	57.2
		1K24	8.5	11.0	31.0	7.5	7.7	0.1	26.4	61.8
		1K24	9.3	7.5	30.0	64.0	8.0	0.1	24.9	61.1
	636	1K45	5.1	13.0	32.0	22.5	7.4	0.7	27.9	53.7
	838	1K51	7.0	31.5	78.0	2175.0	7.6	0.0	32.6	37.6
	841	1K51	7.2	31.5	120.0	557.0	7.7	0.1	31.9	42.2
Upper end outlier (Category 2)	301	1K18	4.6	5.0	16.0	24.5	7.4	1.0	29.5	64.7
	356	1K23	6.1	4.0	10.0	16.0	7.2	0.2	27.9	68.2
	359	1K23	5.7	3.0	11.2	11.0	7.5	1.1	28.6	63.7
	464	1K25	5.6	3.5	11.0	50.0	7.4	1.8	27.0	58.0
	729	1K47	4.1	3.5	12.0	57.5	7.1	3.6	28.4	56.3
	730	1K47	2.9	8.0	32.5	18.5	7.5	4.8	28.5	53.7
	144	1K07	4.0	7.5	23.0	22.0	7.5	3.8	28.9	57.3
	146	1K07	5.5	3.5	12.0	35.5	7.3	2.1	27.6	57.9
	195	1K08	4.6	6.0	26.0	20.0	7.5	2.9	28.5	54.4
	197	1K08	4.0	8.5	29.0	25.0	7.4	4.1	29.0	54.7
	199	1K08	5.3	4.5	14.0	30.5	7.2	2.4	27.1	56.5
	250	1K17	4.7	4.0	14.0	14.0	7.7	0.9	28.9	63.8
Lower end outlier (Category 2)	126	1K07	3.4	19.5	58.0	25.0	7.2	7.5	28.6	38.1
	179	1K08	3.3	19.5	59.0	16.3	7.1	8.1	28.9	38.7
	263	1K17	5.7	23.5	51.0	89.0	7.3	6.7	28.3	35.2

Table 2 provides the water quality parameter values of the outlying observations at the monitoring stations. The Category 1 stations with lower-tailed outliers are shown in the first 6 rows. The values of BOD, COD, and TSS are considerably high for observations 838 and 841 at 1K51, resulting in low average WQIs that fall under class IV of the NWQS index. In fact, a few other observations from this station also indicate bad water quality and warrant further investigation. For the other observations in this category, an increase in at least one of the COD, BOD, or TSS values caused the drop in water quality. TSS is a crucial factor in determining the purity of river water, as the presence of algae and soil in the river water increases the TSS value. During rainy days, soil from the surrounding area is washed into the water body and increases the quantity of suspended solids, resulting in cloudier water (Abdul Aziz et al. 2020). As for Category 2 stations, there are also times when the average WQI improves, as listed in rows 7–18. Generally, the DO readings are low for stations in this category, but improvements in the BOD, COD, and TSS readings lead to average WQI values that are better than most values observed at the stations. For instance, the TSS readings at 1K47 are still large but much lower than the average reading for the station. In addition, further deterioration of average WQI was also observed at stations in Category 2, as listed in the last three rows of Table 2. This is largely due to the increase in BOD and COD. Stations 1K07, 1K08, and 1K17 are located downstream of the Klang River basin and have high COD due to industrial sewage and urban development. COD is a measure of the oxygen required to oxidize the dissolved materials and organic matter in water and provides an index to assess whether the discharged sewage will have an impact on the environment (Abdul Aziz et al., 2020).

In conclusion, the outliers observed in the study occur due to a drop or rise in important water quality parameter values at a specific location and time. The temporal-spatial changes of the WQI values are next investigated using functional data analysis, treating the WQI as a function of time, in order to better understand the water quality condition in the river basin.

Functional data analysis

Functional data analysis was applied to the spatio-temporal data consisting of 48 time points for 16 stations, as described in the section on constructing functional data. After applying the functional smoothing process, the functional data for WQI were obtained, where each ${y}_{i}$ is now a function of time $t$. Figure 3 gives the plot of the functional data of WQI for each station. The functions are in the range of 30 to 75. Most functions are clustered around the functional mean (dashed black curve), which fluctuates between about 45 and 55. This result indicates that, on average, the quality of water in the Klang River basin is moderate. Several WQI functions lie well above the mean function. These functions belong to the stations in Category 1, namely 1K24 (Sg. Gombak), 1K45 (Sg. Klang), 1K50 (Sg. Penchala), and 1K51 (Sg. Damansara), with good river water quality. However, the function for 1K51 (red curve) dropped abruptly between October 2015 and March 2016. Thus, the difference in the shape of the function for station 1K51 compared to the other functions in this cluster should be investigated. It is also observed that the majority of the stations have better water quality scores at the end of 2015 but lower water quality scores in October 2016 due to increasing BOD, NH₃NL, and TSS values. An analysis of the functional depths is carried out next.

Functional depth and the functional outlier detection

The detection of functional outliers is based on the measurement of functional depth. The smaller the depth of a WQI function, the greater the potential for the function to be detected as an outlier. Figure 4 depicts the bar graph of functional depth measured by the MBD method for every station. It can be seen that 1K06, 1K24, 1K41, 1K45, 1K47, 1K50, and 1K51 have lower depth values, while the rest of the stations have depth values greater than 0.47. The results agree with the functional plot in Fig. 3. The functions for 1K24, 1K45, and 1K50 are consistently well separated from the other functions, while the function for 1K51 only drops for a short time range. In contrast, the function of 1K47 (green curve) is the lowest function at many time points, with its highest value reaching 50 at the 35th month, which is in the range of moderate water quality. In addition, the function of 1K41 also fluctuates a lot, with some points having the lowest WQI scores, such as in August 2014, July to August 2015, and July to September 2016. Such fluctuation is also observed for station 1K06. Thus, these stations are candidates for functional outliers.

Figure 5 shows the constructed functional boxplot for the WQI functions of the stations. In the functional boxplot, the median function is denoted by a dashed line in the grey region, and the flagged functional outliers are denoted by the dashed lines outside the grey region. The grey area denotes the 50% central region, and the border of the grey region, which is represented by the solid lines, is defined as the envelope. Meanwhile, the two dotted lines are the minimum and maximum of the non-outlying envelope. Five stations have been identified as functional outliers by the functional boxplot, namely 1K24 (Sg. Gombak), 1K41 and 1K50 (Sg. Penchala), 1K45 (Sg. Klang), and 1K51 (Sg. Damansara). The majority of the functional outliers lie above the mean function and outside the upper envelope at all times, which signals consistently good water quality in the area. In addition, although the curve for 1K41 is within the range of the non-outlying envelope most of the time, the functional boxplot also detected 1K41 (Sg. Penchala) as a functional outlier, since the corresponding function crosses the minimum curve three times, in months 20, 31, and 45. That is, the water quality at 1K41 drops below the minimum of the non-outlying envelope, indicating bad water quality at this station.

Studies of river water quality should contribute significantly towards the design of a comprehensive action plan by the authorities for better management of any river basin. Identifying the water quality status at each river location and understanding the contributing factors can be carried out using both descriptive and functional data analysis.

We first explore the river water quality using standard descriptive statistics. Stations 1K24, 1K45, and 1K50 are shown to have consistently good water quality, with small interquartile ranges and standard deviations. These results indicate that the areas surrounding the stations upstream of the rivers are still protected from uncontrolled pollution sources. Station 1K24 (Sg. Gombak) and Station 1K45 (Sg. Klang) are located near the Batu Dam and the Klang Gates Dam, respectively, which supply clean water to the Klang Valley area (Main Report IRBM Sg Klang 2016–2021, 2016), as shown in Fig. 6(a) and (b). Meanwhile, Station 1K50 (Sg. Penchala) is at the edge of the Bukit Kiara Forest Reserve, bordering elite housing areas, as shown in Fig. 6(c). The present river management in these three areas should remain in place and be continuously monitored to ensure that the present good water quality status is conserved.

As for Station 1K51 (Sg. Damansara), it is identified in the same good Category I as the three stations discussed in the previous paragraph. However, the corresponding boxplot for the station in Fig. 2 is skewed to the left, indicating a considerable number of low water quality instances in the period. In fact, we found that almost 20% of the WQI scores at this station are in Class III and below. The main contributor to the drop is TSS. The high level of TSS detected in the upstream area could be attributed to stormwater discharges caused by heavy rainfall, which wash over the surfaces of construction and land-clearing sites (Chang et al., 2021). The station is close to new industrial and residential areas where land-clearing activities are observed, as shown in Fig. 6(d). According to Wan Mohtar et al. (2019), pollution from suspended sediment in the Damansara River catchment may be considerable on rare occasions, with TSS values indicating excessive fluctuation of sediment pollution within the Klang River watershed. Suspended sediment, which is contaminated by heavy metals, is a severe problem for water quality because it affects water temperature, dissolved oxygen levels, and clarity (Moeini et al., 2021).
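
The descriptive summaries discussed above (interquartile range, standard deviation, skewness, and the share of scores falling into the lower classes) can be computed along the following lines. This is only a sketch: the function name `describe_station` and the dictionary keys are hypothetical, and the class boundary `class_cutoff` is left as a parameter that should be taken from the WQI classification used in the study rather than from this example.

```python
import numpy as np


def describe_station(wqi, class_cutoff):
    """Summarize the WQI scores of one station.

    wqi: 1-D sequence of WQI scores for the station.
    class_cutoff: WQI value below which a score counts as belonging to
                  the lower classes (to be supplied by the analyst).
    """
    wqi = np.asarray(wqi, dtype=float)
    q1, median, q3 = np.percentile(wqi, [25, 50, 75])
    centered = wqi - wqi.mean()
    sd_pop = np.sqrt(np.mean(centered ** 2))
    # Moment-based skewness; negative values indicate a left-skewed sample.
    skewness = np.mean(centered ** 3) / sd_pop ** 3 if sd_pop > 0 else 0.0
    return {
        "median": float(median),
        "iqr": float(q3 - q1),
        "std": float(wqi.std(ddof=1)),
        "skewness": float(skewness),
        "share_below_cutoff": float(np.mean(wqi < class_cutoff)),
    }
```

A station with mostly high scores and a few sharp drops yields a small interquartile range together with a negative skewness and a non-negligible share of scores below the cutoff, which is the pattern described for Station 1K51.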

Other water quality monitoring stations in the Klang River basin have average WQI scores of around 50, with the worst water quality observed at Station 1K47. These stations are located downstream and are surrounded by high population density as well as a variety of human and industrial activities. Station 1K47 is next to a large industrial area in Cheras. In fact, the rivers here are known to have a higher gross pollutant wet load from the residential area (Shah et al., 2016). Squatter settlements along the riverbanks, particularly along Sg. Klang and Sg. Kerayong, followed by Sg. Gombak and Sg. Ampang, may contribute to lower WQI values. Weak waste disposal and garbage management in the area bring more pollutants into the river. Moreover, urbanization has affected the ecology of the river, where microbial activity, enhanced by high tropical temperatures, is beneficial through the rapid breakdown of organic pollutants such as sewage (Yule et al., 2015). More rigorous inspection and law enforcement should be carried out, especially around the industrial areas near Station 1K47, to control the pollution sources from factories and eventually improve the overall water quality in the downstream part of the river basin.

We then consider the FDA to capture the trends of WQI for every station over time. From Fig. 3, the functions of stations 1K24, 1K45, and 1K50 are well above the rest, with WQI values always greater than 60. However, the function corresponding to station 1K51 rose from around 50 to above 64 in June 2013, then dropped drastically for a number of months starting in December 2015 due to the high level of TSS, before improving in the subsequent months. We also observed that the function for 1K47 is consistently far from the functional mean and is generally positioned at lower WQI readings than the other functions. This is largely due to low levels of DO and high levels of BOD, COD and ammoniacal nitrogen. The increase in oxygen demand from organic matter indicates the presence of wastewater loads originating from household, industrial and agricultural activities.

In addition, when the overall information on the WQI of each station is considered as a function of time, the FDA identifies two other stations, namely 1K41 (Sg. Penchala) and 1K06 (Sg. Klang), with low WQI scores. Although most stations have WQI fluctuations in Class III, the trend shows that the functions for these two stations have dropped several times due to low water quality scores. The above results can be captured using the functional outlier detection method described in Sect. 3.3, with stations 1K41 and 1K06 having significantly low depths and being identified as functional outliers. Both are within the envelope of functions close to the functional mean. The fluctuation of the functions, however, does not follow the trend of the other functions. In contrast, none of the 1K41 and 1K06 readings were detected as outliers using descriptive statistics. Thus, the FDA has provided additional information on the stations whose trends differ from the rest of the data in both the location and the shape of the functions.

The degradation of water quality at Station 1K41 (Sg. Penchala) and Station 1K06 (Sg. Klang) occurs because the rivers are contaminated by urban runoff from nearby residential and industrial areas. Even though the upstream reaches of Sg. Penchala and Sg. Klang flow from pristine rivers where the water is clean and natural, the rivers at these stations, which are further downstream, receive pollutants from various point and non-point sources. As a result, the relevant government agencies should make extra efforts to preserve good water quality in these places because, as the streams flow into the urban areas, they gradually become artificial canals, such as concrete channels, which deprives the streams of their natural filtering function.

In addition, based on the FDA of each monitoring station in the Klang River basin, we conclude that there is no significant change in trend throughout the four-year study period except during the last few months of 2015, when all areas experienced a marginal improvement from lower to higher water quality status. Unfortunately, the trends fell again in early 2016 and have increased only slightly since the end of 2016. This scenario should be a concern to the authorities. More comprehensive work in terms of water quality data collection, enforcement and urban planning is needed to balance human needs with the needs of other habitats, as well as to preserve the natural resources of the basin.

This study utilized functional data analysis to provide a deeper understanding of the water quality data for the Klang River basin. Outlying observations and patterns have been identified and investigated. The water quality at critical stations near the dams is consistently good and should be conserved. The main contributing factors to surprisingly low WQI values at stations with good water quality have been identified and should be carefully monitored, with more controlled and sustainable development implemented. Better pollution-control initiatives should be considered in the lower reaches of the river in order to improve the bad or moderate water quality observed there. More thorough and in-depth research on future data should be carried out to ensure that good river water quality in the Klang River basin is sustained.

Statements and Declarations

Data availability statements

The data that support the findings of this study are available from the authors, but restrictions apply to the availability of these data, which were used under license from the Department of Environment Malaysia for the current study and so are not publicly available. Data are, however, available from the authors upon reasonable request and with permission from the Department of Environment Malaysia.

Authors’ contributions

Nur Fatihah Mohd Ali – Carrying out the literature review and analyzing the data; preparing the paper from the initial draft to completion.

Ibrahim Mohamed – Identification of methodology and outliers in classical/functional data.

Rossita Mohamad Yunus – Identification of methodology and functional data analysis.

Faridah Othman – Interpretation/discussion of methodology/results in relation to water quality.

*First author is a PhD student co-supervised by the other three authors.

Funding

The study is partially funded by the Ministry of Higher Education fundamental research grant (FRGS/1/2018/STG06/UM/02/12) and University Malaya Research Grant (RF015B-2018).

Competing Interests

No competing financial interests exist.

References

Abdul Aziz, N. I. H., Mohd Hanafiah, M., Halim, N. H., & Fidri, P. A. S. (2020). Phytoremediation of TSS, NH3-N and COD from Sewage Wastewater by Lemna minor L., Salvinia minima, Ipomea aquatica and Centella asiatica. Applied Sciences, 10(16), 5397.
Almuhtaram, H., Zamyadi, A., & Hofmann, R. (2021). Machine learning for anomaly detection in cyanobacterial fluorescence signals. Water Research, 197, 117073.
Amirabadizadeh, M., Huang, Y. F., & Lee, T. S. (2015). Recent trends in temperature and precipitation in the Langat River Basin, Malaysia. Advances in Meteorology, 2015.
Azman, A., Fathurrahman, L., Saiful Iskandar, K., & Hafizan, J. (2018). Pollution Sources Identification of Water Quality Using Chemometrics: A Case Study in Klang River Basin, Malaysia. International Journal of Engineering & Technology, 7(4), 83–89.
Bresciani, M., Giardino, C., Stroppiana, D., Dessena, M. A., Buscarinu, P., Cabras, L., … Tzimas, A. (2019). Monitoring water quality in two dammed reservoirs from multispectral satellite data. European Journal of Remote Sensing, 52(sup4), 113–122.
Camara, M., Jamil, N. R. B., & Abdullah, F. B. (2020). Variations of water quality in the monitoring network of a tropical river. Global Journal of Environmental Science and Management, 6(1), 85–96.
Chang, H., Makido, Y., & Foster, E. (2021). Effects of land use change, wetland fragmentation, and best management practices on total suspended solids concentrations in an urbanizing Oregon watershed, USA. Journal of Environmental Management, 282, 111962.
Cosgrove, W. J., & Loucks, D. P. (2015). Water management: Current and future challenges and research directions. Water Resources Research, 51(6), 4823–4839.
Cuevas, A., Febrero, M., & Fraiman, R. (2006). On the use of the bootstrap for estimating functions with functional data. Computational Statistics & Data Analysis, 51(2), 1063–1074.
Cuevas, A., Febrero, M., & Fraiman, R. (2007). Robust estimation and classification for functional data via projection-based depth notions. Computational Statistics, 22(3), 481–496.
Department of Environment (DOE). (2019). Malaysia Environmental Quality Report. Ministry of Natural Resources and Environment, Malaysia.
Di Blasi, J. P., Torres, J. M., Nieto, P. G., Fernández, J. A., Muñiz, C. D., & Taboada, J. (2013). Analysis and detection of outliers in water quality parameters from different automated monitoring stations in the Miño river basin (NW Spain). Ecological Engineering, 60, 60–66.
Febrero, M., Galeano, P., & González-Manteiga, W. (2008). Outlier detection in functional data by depth measures, with application to identify abnormal NOx levels. Environmetrics, 19(4), 331–345.
Fraiman, R., & Muniz, G. (2001). Trimmed means for functional data. Test, 10(2), 419–440.
Horváth, L., & Kokoszka, P. (2012). Inference for functional data with applications (Vol. 200). Springer Science & Business Media.
Hussain, I. (2019). Outlier Detection Using Graphical and Nongraphical Functional Methods in Hydrology. International Journal of Advanced Computer Science and Applications, 10, 438.
Kailasam, K. (2011). Community water quality monitoring programme in Malaysia. A Water Environment Partnership in Asia (WEPA) report, 6pp. Retrieved October 21st.
Liu, J., Wang, P., Jiang, D., Nan, J., & Zhu, W. (2020). An integrated data-driven framework for surface water quality anomaly detection and early warning. Journal of Cleaner Production, 251, 119145.
Liu, R. Y., Parelius, J. M., & Singh, K. (1999). Multivariate analysis by data depth: descriptive statistics, graphics and inference (with discussion and a rejoinder by Liu and Singh). The Annals of Statistics, 27(3), 783–858.
López-Pintado, S., & Romo, J. (2007). Depth-based inference for functional data. Computational Statistics & Data Analysis, 51(10), 4957–4968.
López-Pintado, S., & Romo, J. (2009). On the concept of depth for functional data. Journal of the American Statistical Association, 104(486), 718–734.
Marinović Ruždjak, A., & Ruždjak, D. (2015). Evaluation of river water quality variations using multivariate statistical techniques: Sava River (Croatia): A case study. Environmental Monitoring and Assessment, 187, 1–14.
Mentzafou, A., Varlas, G., Papadopoulos, A., Poulis, G., & Dimitriou, E. (2021). Assessment of Automatically Monitored Water Levels and Water Quality Indicators in Rivers with Different Hydromorphological Conditions and Pollution Levels in Greece. Hydrology, 8(2), 86.
Millán-Roures, L., Epifanio, I., & Martínez, V. (2018). Detection of anomalies in water networks by functional data analysis. Mathematical Problems in Engineering, 2018.
Moeini, M., Shojaeizadeh, A., & Geza, M. (2021). Supervised Machine Learning for Estimation of Total Suspended Solids in Urban Watersheds. Water, 13(2), 147.
Mohamed, I., Othman, F., Ibrahim, A. I., Alaa-Eldin, M. E., & Yunus, R. M. (2015). Assessment of water quality parameters using multivariate analysis for Klang River basin, Malaysia. Environmental Monitoring and Assessment, 187(1), 1–12.
Mokua, N., Ciira, W. M., & Kiragu, H. (2021). A Raw Water Quality Monitoring System using Wireless Sensor Networks.
Muñiz, C. D., Nieto, P. G., Fernández, J. A., Torres, J. M., & Taboada, J. (2012). Detection of outliers in water quality monitoring samples using functional data analysis in San Esteban estuary (Northern Spain). Science of the Total Environment, 439, 54–61.
Ng, C. K. C., Goh, C. H., Lin, J. C., Tan, M. S., Bong, W., Yong, C. S., … Khoo, G. (2018). Water quality variation during a strong El Niño event in 2016: A case study in Kampar River, Malaysia. Environmental Monitoring and Assessment, 190(7), 1–14.
Ramsay, J. O., & Silverman, B. W. (2005). Principal components analysis for functional data. Functional data analysis, 147–172.
Rangeti, I., Dzwairo, B., Barratt, G. J., & Otieno, F. A. (2015). Validity and errors in water quality data-a review. Research and Practices in Water Quality. Durban University of Technology, Durban, South Africa, 95–112.
Raquel, M. (2008). The Concept of Depth In Statistics.
Sancho, J., Iglesias, C., Piñeiro, J., Martínez, J., Pastor, J. J., Araújo, M., & Taboada, J. (2016). Study of water quality in a Spanish river based on statistical process control and functional data analysis. Mathematical Geosciences, 48(2), 163–186.
Selangor Water Management Authority (LUAS). (2016). Main Report IRBM Sg Klang 2016–2021. Retrieved from https://www.luas.gov.my/en/luas/management-plan/irbm-plan
Shah, M. M., Zahari, N. M., Said, N. M., Sidek, L. M., Basri, H., Noor, M. M., … Dom, N. M. (2016, March). Gross Pollutant Traps: Wet Load Assessment at Sungai Kerayong, Malaysia. In IOP Conference Series: Earth and Environmental Science (Vol. 32, No. 1, p. 012065). IOP Publishing.
Sharif, S. M., Kusin, F. M., Asha’ari, Z. H., & Aris, A. Z. (2015). Characterization of water quality conditions in the Klang River Basin, Malaysia using self organizing map and K-means algorithm. Procedia Environmental Sciences, 30, 73–78.
Shelutko, V., & Makarova, M. (2020). Issues of accounting for outliers in assessing of the nutrients runoff. In E3S Web of Conferences (Vol. 163, p. 03014). EDP Sciences.
Sidek, L., Basri, H., Lee, L. K., & Foo, K. Y. (2016). The performance of gross pollutant trap for water quality preservation: a real practical application at the Klang Valley, Malaysia. Desalination and Water Treatment, 57(52), 24733–24741.
Siti Asiah, M. (2017). Seasonal impact on water quality and model development of a tropical urban river (Doctoral dissertation, University of Malaya).
Sun, Y., & Genton, M. G. (2011). Functional boxplots. Journal of Computational and Graphical Statistics, 20(2), 316–334.
Talagala, P. D., Hyndman, R. J., Leigh, C., Mengersen, K., & Smith-Miles, K. (2019). A Feature‐Based Procedure for Detecting Technical Outliers in Water‐Quality Data From In Situ Sensors. Water Resources Research, 55(11), 8547–8568.
Torres, J. M., Nieto, P. G., Alejano, L., & Reyes, A. N. (2011). Detection of outliers in gas emissions from urban areas using functional data analysis. Journal of Hazardous Materials, 186(1), 144–149.
Ullah, S., & Finch, C. F. (2013). Applications of functional data analysis: A systematic review. BMC Medical Research Methodology, 13(1), 1–12.
Vadde, K. K., Wang, J., Cao, L., Yuan, T., McCarthy, A. J., & Sekar, R. (2018). Assessment of water quality and identification of pollution risk locations in Tiaoxi River (Taihu Watershed), China. Water, 10(2), 183.
Vega-Rodriguez, M. A., Perez, C. J., Reder, K., & Floerke, M. (2021). A stage-based approach to allocating water quality monitoring stations based on the WorldQual model: The Jubba River as a case study. Science of The Total Environment, 762, 144162.
Wan Mohtar, W. H. M., Abdul Maulud, K. N., Muhammad, N. S., Sharil, S., & Yaseen, Z. M. (2019). Spatial and temporal risk quotient based river assessment for water resources management. Environmental Pollution, 248, 133–144.
Yang, J., Holbach, A., Wilhelms, A., Qin, Y., Zheng, B., Zou, H., … Norra, S. (2019). Highly time-resolved analysis of seasonal water dynamics and algal kinetics based on in-situ multi-sensor-system monitoring data in Lake Taihu, China. Science of the Total Environment, 660, 329–339.
Yule, C. M., Gan, J. Y., Jinggut, T., & Lee, K. V. (2015). Urbanization affects food webs and leaf-litter decomposition in a tropical stream in Malaysia. Freshwater Science, 34(2), 702–715.
Zavareh, M., Maggioni, V., & Sokolov, V. (2021). Investigating water quality data using principal component analysis and granger causality. Water, 13(3), 343.
