Deterioration of surface water quality due to harmful algal blooms is a global concern 1. Harmful algal blooms adversely affect aquatic life, with repercussions for the fishery industry. Surface water quality maintenance costs are rising owing to the increasing frequency of algal blooms and their widespread effects 2. This increase in algal blooms calls for timely and reliable predictions to enable effective management interventions 1. Chlorophyll-a (Chl-a) is considered a proxy for algal concentration in surface waters 3 and is used to estimate the total phytoplankton biomass as a parametric variable to quantify the trophic state of the water body 4. Thus, accurately estimating Chl-a by simulating its concentration can provide useful information regarding surface water quality.
The concentration of Chl-a in surface waters has been modeled using theory-based numerical models, such as the Enhanced Stream Water Quality Model (QUAL2E) 5, the Soil and Water Assessment Tool (SWAT) 6, and CE-QUAL-W2 7. These models estimate Chl-a concentrations using equations that reflect the underlying Chl-a dynamics. Zhang et al. 8 simulated Chl-a concentrations to understand the response of a lake to changes in climate and hydrological conditions. Kim et al. 9 modeled the eutrophication of a reservoir to evaluate the importance of point sources. However, process-based models are limited by their complexity, low predictive performance, long calibration time, and inability to incorporate the dynamic input features that affect the target variable 10.
Physical attributes such as stream length, slope, width, the surface area of the sub-basin, and flow within sub-basins can significantly influence stream water quality and chemistry 11. These attributes vary from one sub-basin to another within the catchment. The concentrations of phosphorus and nitrogen, which determine water quality, can also vary from one site to another within a catchment 12. These site-dependent factors can be divided into time-variant data, such as flow rate, and time-invariant data, such as stream morphology. Both water quantity and water quality variables contribute to Chl-a concentrations in the stream. However, water quality and quantity time-series data have not been observed concurrently and continuously. Sampling and water quality experiments are time-consuming and expensive and, therefore, cannot be conducted at a frequency as high as that of flow rate observations. Park et al. 3 showed that, for high-performance Chl-a prediction models, it is pertinent to incorporate hydrological, geochemical, and ecological variables that affect algal growth in stream waters, thus requiring a modeling approach that considers data with different sampling frequencies.
With the rise in computational power and data availability in recent years, data-driven machine-learning (ML) and deep-learning (DL) techniques are gaining attention. ML offers an alternative approach for developing high-performance prediction models for Chl-a simulation. Furthermore, DL-based models can effectively incorporate input data with different time steps 13. Several studies have been conducted to develop Chl-a prediction models using ML-based algorithms 14,15. A literature review shows that the majority of the data-driven models developed for Chl-a prediction in surface waters have used classical ML algorithms, such as support vector machines, random forest, and XGBoost 15,16. Recently, more studies have used advanced algorithms such as long short-term memory (LSTM), convolutional neural network (CNN), and CNN combined with LSTM (CNN-LSTM) 16,17. However, most ML-based studies have been conducted at monthly or weekly time steps 3,18. Studies with daily or sub-daily time steps have focused on a limited simulation period, from one week to three years 15,19. Moreover, these studies focused only on the development of site-specific DL models, which were evaluated using data from the same area where they were trained. With this approach, developing a regional model requires a separate neural network (NN) for each site, which is computationally expensive 20. Thus, a site-invariant model that can be generalized to different regions for long-term Chl-a simulations is lacking.
In this study, we propose a DL-based approach that considers different types of input data, such as static sub-basin characteristics, continuously measured climate observations, and discontinuously measured water quality observations (Fig. 1; see Methods for details). This was accomplished using separate blocks of NNs to process continuous, discontinuous, and static data (Fig. 1a). Continuous and static data were processed using a single NN block. In this block, we compared the performance of six DL algorithms: LSTM, CNN, temporal convolutional network (TCN), CNN-LSTM, LSTM-autoencoder, and input attention-based LSTM (IA-LSTM) (Fig. 1b). Discontinuous water quality data were processed using a separate NN block (Fig. 1c). The outputs from these two NN blocks were concatenated and processed to predict the daily Chl-a concentration (Fig. 1d). We used data spanning 16 years (2004-2020) from four sub-basins of the Nakdong River in South Korea. The sub-basins Hwangji (HG), Bonghwa (BH), and Dosan (DS) were used for model training, and the Andong (AD) site was used for model validation (Fig. S1). To ascertain the impacts of static physical features, streamflow, and the attention mechanism on water quality, the performance of the LSTM model was further analyzed in different scenarios: (1) the use of static physical features to initialize hidden and cell states in LSTM (LSTM Cond), (2) removal of static physical features from input data (LSTM NoCond), (3) removal of streamflow data (LSTM NoFlow), and (4) use of the self-attention mechanism (LSTM SA).
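As an illustration of the two-block layout described above, the following minimal Python sketch runs a forward pass in which static features initialize the hidden and cell states of an LSTM over the continuous inputs (in the spirit of the LSTM Cond scenario), a separate dense block encodes the discontinuous water quality inputs, and the two outputs are concatenated before a final linear layer. It is not the implementation used in this study: the class name `TwoBranchSketch`, the layer sizes, the tanh activations, the example input values, and the random untrained weights are all assumptions made for illustration, and only the standard library is used.

```python
import math
import random

def dense(x, W, b):
    """Affine map W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init(rng, n_out, n_in):
    """Small random weights and zero biases (untrained, illustration only)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def lstm_step(x, h, c, W, b):
    """One LSTM step; W maps [x, h] to the stacked gates (i, f, g, o)."""
    n = len(h)
    z = dense(x + h, W, b)
    i = [sigmoid(v) for v in z[:n]]               # input gate
    f = [sigmoid(v) for v in z[n:2 * n]]          # forget gate
    g = [math.tanh(v) for v in z[2 * n:3 * n]]    # candidate cell state
    o = [sigmoid(v) for v in z[3 * n:]]           # output gate
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

class TwoBranchSketch:
    """Hypothetical two-block model: LSTM block plus water quality block."""

    def __init__(self, n_cont, n_static, n_disc, hidden=8, seed=0):
        rng = random.Random(seed)
        self.h0 = init(rng, hidden, n_static)     # static -> initial hidden state
        self.c0 = init(rng, hidden, n_static)     # static -> initial cell state
        self.lstm = init(rng, 4 * hidden, n_cont + hidden)
        self.disc = init(rng, hidden, n_disc)     # discontinuous-data block
        self.head = init(rng, 1, 2 * hidden)      # acts on concatenated outputs

    def predict(self, cont_seq, static, disc):
        # Assumption: static features set the LSTM initial states.
        h = [math.tanh(v) for v in dense(static, *self.h0)]
        c = [math.tanh(v) for v in dense(static, *self.c0)]
        for x in cont_seq:                        # one step per daily record
            h, c = lstm_step(x, h, c, *self.lstm)
        d = [math.tanh(v) for v in dense(disc, *self.disc)]
        return dense(h + d, *self.head)[0]        # concatenate, then regress

# Made-up example inputs: 5 daily steps, 3 continuous features,
# 4 static features, 2 water quality features.
model = TwoBranchSketch(n_cont=3, n_static=4, n_disc=2)
cont_seq = [[0.1, 0.2, 0.3]] * 5
static = [1.0, 0.5, 0.2, 0.8]
disc = [0.4, 0.6]
y = model.predict(cont_seq, static, disc)
print(y)
```

Because the static features enter only through the initial states in this sketch, replacing them with zeros changes the prediction without altering the per-step inputs, which is one simple way to probe how much the static sub-basin attributes contribute.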