2.1 Surveillance Data
We used two primary datasets to investigate the relationship between wastewater surveillance data and COVID-19 hospitalizations in Germany.
The first dataset contains daily incident hospital admissions recorded by the Robert Koch Institute (RKI). In this dataset, hospital admissions are represented as the 7-day hospitalization incidence per 100,000 people (rate) (12); in the following we refer to these data as hospitalization rates. One key metric frequently used during the pandemic was the number of reported incident cases. However, this metric is heavily influenced by the availability of testing, testing strategies, and underreporting (often referred to as the 'dark figure'). We therefore considered hospitalization the more reliable metric and excluded incident cases from the subsequent analysis.
The second dataset consists of wastewater viral load measurements collected from 136 sewage treatment plants distributed across Germany that participate in the AMELAG (Abwassermonitoring für die epidemiologische Lagebewertung) project (13); the name translates to "wastewater monitoring for epidemiological situation assessment". Processing these wastewater data involves several key steps. Wastewater samples are collected regularly from treatment plants, typically several times a week. These samples are then analyzed in laboratories using techniques such as quantitative polymerase chain reaction (qPCR) to detect and quantify viral RNA. To ensure accurate concentration estimates, the viral load measurements are normalized by the flow rate of the treatment plant. Next, data from multiple treatment plants are aggregated to provide a regional overview, using geometric means to smooth out fluctuations. Finally, daily estimates are computed from the weekly data by applying locally estimated scatterplot smoothing (LOESS) (14).
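To make these steps concrete, the sketch below mirrors the pipeline in Python. The synthetic data, column names, and the LOESS smoothing fraction are our own assumptions and not part of the AMELAG specification.

```python
import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

# Synthetic per-sample measurements: one row per plant and sampling date.
rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-02", periods=12, freq="3D")
samples = pd.DataFrame({
    "plant_id": np.repeat(["A", "B"], len(dates)),
    "date": np.tile(dates, 2),
    "viral_load": rng.lognormal(11, 0.5, 2 * len(dates)),  # gene copies per litre
    "flow_rate": rng.uniform(0.8, 1.2, 2 * len(dates)),    # relative daily inflow
})

# Step 1: normalize raw concentrations by the plant's flow rate.
samples["norm_load"] = samples["viral_load"] / samples["flow_rate"]

# Step 2: aggregate plants per date with a geometric mean to smooth fluctuations.
regional = (samples.groupby("date")["norm_load"]
            .apply(lambda x: np.exp(np.log(x).mean())))

# Step 3: LOESS smoothing to derive daily estimates from sparse sampling days.
t = regional.index.map(pd.Timestamp.toordinal).to_numpy(dtype=float)
daily_estimate = lowess(regional.to_numpy(), t, frac=0.3, return_sorted=False)
```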
Additionally, we wanted to assess the relationship between wastewater viral load and recorded diagnoses of respiratory diseases. For this we used data gathered by docmetric (https://docmetric.de/), a subsidiary of CompuGroup Medical (CGM), which collected anonymized health data from approximately 3,000 general practitioners. These data include ICD-10-coded diagnoses. We selected diagnoses related to respiratory diseases and used ICD-10 parent categories to aggregate them into four categories: "U07.1 COVID-19", "J00-J06 Acute upper respiratory infections", "J10-J18 Influenza and pneumonia", and "J20-J22 Other acute respiratory infections" (15).
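As an illustration of this aggregation, a minimal sketch follows. The record layout and the helper function are hypothetical, since the docmetric data itself is not public.

```python
import pandas as pd

# Hypothetical diagnosis records; real docmetric data is not publicly available.
diagnoses = pd.DataFrame({
    "icd10": ["U07.1", "J02.9", "J11.1", "J20.4", "J45.0"],
    "count": [120, 340, 55, 80, 12],
})

def parent_category(code):
    """Map a raw ICD-10 code to one of the four aggregate categories."""
    if code.startswith("U07.1"):
        return "U07.1 COVID-19"
    prefix = code[:3]
    if "J00" <= prefix <= "J06":
        return "J00-J06 Acute upper respiratory infections"
    if "J10" <= prefix <= "J18":
        return "J10-J18 Influenza and pneumonia"
    if "J20" <= prefix <= "J22":
        return "J20-J22 Other acute respiratory infections"
    return None  # not one of the respiratory categories of interest

diagnoses["category"] = diagnoses["icd10"].map(parent_category)
totals = diagnoses.dropna(subset=["category"]).groupby("category")["count"].sum()
```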
2.2 Correlation Analysis
Our first step was to investigate the correlation between wastewater viral load and country-level hospitalization rates. For an initial visual inspection, we normalized both the hospitalization rates and the wastewater data using min-max normalization. After visually inspecting the time series, we employed correlation analyses to evaluate the relationship between wastewater viral load and hospitalization rates, calculating Spearman correlation coefficients (16). Spearman coefficients evaluate monotonic relationships, which is particularly useful when the marginal distributions of individual variables do not meet the assumptions of normality. Our hypothesis is that there exists a consistent time lag between the two variables, with trends in wastewater viral load preceding similar trends in hospitalization rates. This is based on the premise that viral load in wastewater increases or decreases several days before individuals exhibit symptoms severe enough to require hospitalization. Consequently, we performed a lagged correlation analysis to explore this temporal relationship: we shifted the wastewater and hospitalization time series relative to each other to identify the highest positive correlation and, consequently, the time lag at which the wastewater data could most effectively predict hospitalization.
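A minimal sketch of such a lagged analysis is shown below, assuming two aligned daily NumPy arrays; the function names and the maximum lag of 21 days are illustrative choices.

```python
import numpy as np
from scipy.stats import spearmanr

def minmax(x):
    """Min-max normalization used for the visual comparison."""
    return (x - x.min()) / (x.max() - x.min())

def lagged_spearman(wastewater, hosp, max_lag=21):
    """Shift the wastewater series forward by 0..max_lag days and return the
    lag with the highest positive Spearman correlation to hospitalization."""
    rhos = {}
    for lag in range(max_lag + 1):
        # Pair wastewater at day t with hospitalization at day t + lag,
        # i.e. wastewater leads hospitalization by `lag` days.
        w = wastewater[:-lag] if lag else wastewater
        rho, _ = spearmanr(w, hosp[lag:])
        rhos[lag] = rho
    return max(rhos, key=rhos.get), rhos
```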
2.3 Forecasting Models
Next, we assessed whether including the wastewater viral load in otherwise autoregressive models significantly improves forecasting performance. Given the expectation that both the hospitalization rates and the wastewater viral load exhibit exponential growth and decay, we applied a log-transformation to both time series; the underlying idea is that this transformation yields piecewise linear behavior.
We employed three classical models: linear regression, ridge regression, and ARIMA, as well as two decision-tree-based models: XGBoost and Random Forest. These methods were chosen because they proved useful in our previous study (17). In the following, we briefly provide more details.
Linear Regression and Ridge Regression
For our linear regression model, we employed the LinearRegression class from scikit-learn (18), specifically version 1.3.2. Since the optimal solution is derived in closed form using the Ordinary Least Squares (OLS) method, there was no need or possibility for hyperparameter optimization. In this model, the dependent variable was the hospitalization rate, while the independent variable comprised integers ranging from 1 to 14. Across the 72 windows, we trained the model using data from days 1 to 7 and made predictions for days 8 to 14. In the non-autoregressive scenario, daily wastewater data was included as a second independent variable. To ensure accurate forecasting, all independent variables had to be known, leading us to compensate for the time lag between wastewater and hospitalization data by shifting the wastewater data forward by seven days. This adjustment allowed the fitting window to utilize wastewater data from the previous week and the forecasting window to use the current week's data. To prevent overfitting, particularly when incorporating the wastewater viral load, we applied ridge regression. Also known as L2 regularization, ridge regression introduces an additional penalty term to the loss function of linear regression, scaled by a tunable parameter λ (19). We utilized the corresponding class in scikit-learn for this purpose. The parameter was tuned on the left-out training data (see section 2.4), within a range of (0, 10].
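The following sketch illustrates one such fitting window with both variants. The toy numbers and the fixed ridge penalty are placeholders; the actual λ was tuned as described in section 2.4.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# One window: days 1-14; fit on days 1-7, forecast days 8-14.
days = np.arange(1, 15).reshape(-1, 1)
hosp_train = np.log([3.1, 3.4, 3.9, 4.2, 4.8, 5.3, 5.9])  # illustrative rates

# Autoregressive variant: the time index is the only feature.
lr = LinearRegression().fit(days[:7], hosp_train)
forecast = lr.predict(days[7:])

# Wastewater variant: add the viral load, shifted forward by 7 days, so the
# fit uses last week's measurements and the forecast uses this week's.
ww_shifted = np.log(np.linspace(1e5, 3e5, 14)).reshape(-1, 1)
X = np.hstack([days, ww_shifted])
ridge = Ridge(alpha=1.0).fit(X[:7], hosp_train)  # alpha: placeholder for tuned λ
forecast_ww = ridge.predict(X[7:])
```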
ARIMA
Autoregressive Integrated Moving Average (ARIMA) models leverage the statistical properties of stationary data and are widely used for time series forecasting (20). A stationary time series is characterized by the absence of trends and consistent variation around its mean, allowing for the extraction of short-term random time patterns for forecasting purposes. In this study, we utilized a non-seasonal ARIMA model tailored for short-term periods that are not anticipated to exhibit seasonal effects. The ARIMA models in this context rely on three key parameters:
- p: the number of autoregressive terms
- d: the degree of differencing applied to achieve stationarity
- q: the number of lagged forecast errors
The general forecasting equation for ARIMA is defined as follows:
$$\hat{y}_t = \mu + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} - \theta_1 e_{t-1} - \dots - \theta_q e_{t-q}$$
In this equation, ŷ represents the forecast, calculated as the deviation from the mean µ of a stationary time series, with φ denoting the slope parameters for the p previous values y, and θ representing the q moving-average parameters associated with the autocorrelated errors e. This model learns to predict future values based on the mean of a stationary time series, adjusted for autocorrelation errors and lagged periods. To ensure stationarity, the differencing technique is applied, which involves calculating the differences between consecutive values in the time series (21). This transformation often leads to stationarity, particularly at first or second order. To identify the optimal parameters (p, d, q), we utilized the auto-ARIMA functionality from the pmdarima library, version 1.8.5 (22). While ARIMA models are autoregressive by nature, it is possible to include exogenous features. In so-called ARIMAX models, a term βX is added to the ARIMA equation, where X is the exogenous feature, in our case the wastewater viral load, and β the corresponding coefficient estimated by the model (23).
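A sketch of this ARIMAX setup with pmdarima follows. The synthetic series stand in for the log-transformed data, and the 70-day context with a 7-day horizon anticipates section 2.4.

```python
import numpy as np
import pmdarima as pm

rng = np.random.default_rng(1)
y_context = np.cumsum(rng.normal(0.05, 0.1, 70))    # 70-day log-hospitalization context
ww_context = y_context + rng.normal(0, 0.05, 70)    # exogenous viral load (shifted)
ww_future = y_context[-1] + np.cumsum(rng.normal(0.05, 0.1, 7))

# auto-ARIMA selects (p, d, q); passing X turns the model into an ARIMAX.
model = pm.auto_arima(y_context, X=ww_context.reshape(-1, 1),
                      seasonal=False, suppress_warnings=True)
forecast = model.predict(n_periods=7, X=ww_future.reshape(-1, 1))
```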
XGBoost and Random Forest
Both Random Forest and eXtreme Gradient Boosting (XGBoost) are decision-tree-based approaches, but they differ significantly in their training algorithms. Random Forest creates an unweighted ensemble of decision trees trained in parallel on different subsets of the data using bagging, and averages their predictions (24). In contrast, XGBoost constructs its decision trees sequentially, correcting the residual errors of the previously trained weighted ensemble using gradient descent (25). Both models are widely used for tabular data and have also proven effective in time series forecasting (26–31). Since these models rely on decision trees, they cannot extrapolate beyond the previously observed training data: when predicting values outside the range of the training data, they tend to predict the average or maximum of the observed values. To address this, we applied the differencing technique (21). That is, from each measured data point to the next we calculated the slope of a local tangent; the models are thus tasked with learning and predicting the relative change from one data point (here, day) to the next. Before evaluation, the forecast was back-transformed using the cumulative sum. For Random Forest, we used the RandomForestRegressor class from scikit-learn version 1.3.2 (18), and for XGBoost, we employed the XGBoost library version 1.7.3 (25). For hyperparameter tuning, we employed a blocked time series cross-validation (32) using optuna version 3.2.0 (33) and the same hyperparameter search space (see Table S1) as in our previous study (17).
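The differencing workflow can be sketched as follows. The window layout (seven past changes predicting the next seven) is a simplified stand-in for our actual feature construction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
series = np.cumsum(rng.normal(0.1, 0.2, 78))  # illustrative log-hospitalization curve

# Differencing: the model learns day-to-day changes (the slope of a local
# tangent) instead of absolute levels, so its forecasts can leave the range
# of values seen during training.
diffs = np.diff(series)

# Build (context, target) pairs: 7 past daily changes predict the next 7.
X = np.array([diffs[i:i + 7] for i in range(len(diffs) - 14)])
y = np.array([diffs[i + 7:i + 14] for i in range(len(diffs) - 14)])
model = RandomForestRegressor(random_state=0).fit(X, y)

# Forecast the next week and back-transform via the cumulative sum.
pred_diffs = model.predict(diffs[-7:].reshape(1, -1))[0]
forecast = series[-1] + np.cumsum(pred_diffs)
```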
For both models we further tested two things. First, we assessed whether including federal-state-level data for training would improve the models' performance in forecasting hospitalization rates at the country level. Since wastewater data was only available at the country level and by community, we aggregated daily wastewater data for each federal state by averaging measurements from the sewage treatment plants within the state, normalized by the number of connected citizens specified in the dataset for each participating city. This approach aimed to improve prediction accuracy by increasing the dataset's granularity. However, only five federal states had complete and usable datasets for the period under examination; for these five federal states we used both the regional wastewater viral load and the regional hospitalization rates. Second, we tested how well these models performed when given only the wastewater viral load for forecasting the hospitalization rates, later referred to as cross-modal models.
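For the federal-state aggregation, a plausible sketch is given below; the column names and the population-weighted mean are our reading of the normalization by connected citizens, not a confirmed specification.

```python
import pandas as pd

# Hypothetical plant-level table; column names are assumptions.
plants = pd.DataFrame({
    "state": ["Bayern", "Bayern", "Hessen"],
    "date": pd.to_datetime(["2023-01-02"] * 3),
    "viral_load": [2.0e5, 1.0e5, 1.5e5],
    "connected_citizens": [500_000, 250_000, 400_000],
})

# Daily state-level viral load, weighting each plant by its connected citizens.
plants["weighted"] = plants["viral_load"] * plants["connected_citizens"]
state_daily = (plants.groupby(["state", "date"])
               .apply(lambda g: g["weighted"].sum() / g["connected_citizens"].sum()))
```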
2.4 Sliding Window Approach for Model Training and Testing
To account for potential changes in conditions during the study periods - such as new safety regulations, vaccines, or virus variants - and to provide more test windows for a more reliable evaluation of the model’s performance, we employed a sliding window approach for both training and testing data (Fig. 1).
In this approach, we predict the hospitalization rate for a seven-day period (prediction window) based on a preceding set of data points (context window). The size of the prediction window is set to seven days because of the time lag between wastewater viral load and hospitalization rate observed via correlation analysis (see Results section). After each prediction, both windows are shifted forward by seven days, ensuring no overlap between successive prediction windows. Applying this method throughout the entire period for which both wastewater and hospitalization data are available results in 72 testing windows.
This approach, which is similar to one used in our previous study (17), simulates a scenario where the models are trained on past data, as if they were being used at the time the data was originally collected.
Given that the models operate in fundamentally different ways, we assigned them context windows of varying sizes (Fig. 3). For tree-based models like XGBoost and Random Forest, which benefit from larger context windows, we used a window size of 70 days to remain consistent with a previously published study (17). Additionally, for model training we shifted the time series by 7 days to obtain the corresponding target vector. Similarly, ARIMA, which identifies patterns in the data, also requires a larger sample size, so it was likewise assigned a 70-day context window; here, shifting the data is not necessary, as ARIMA employs a different learning strategy (see above). While linear regression generally improves with more data, a large context window can smooth out smaller trends. In fact, using a 70-day window for linear regression could be counterproductive here, as it may obscure short-term trends relevant to the seven-day forecast. Instead, we fitted the linear regression using only the last seven days of training data to better capture recent changes. For a valid comparison between the models, it is essential that they forecast the same days and have the same number of forecasting windows. To achieve this, we introduced offsets at the beginning of the time series, ensuring that the initial test forecasting windows are aligned across all models.
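The window logic can be summarized in a short generator. The offsets shown reproduce the alignment described above, and the series length of 574 days follows from 70 + 72 · 7, assuming a gap-free series.

```python
def sliding_windows(n_days, context, horizon=7, offset=0):
    """Yield (train_idx, test_idx) pairs; successive prediction windows
    do not overlap because both windows advance by `horizon` days."""
    start = offset
    while start + context + horizon <= n_days:
        yield (range(start, start + context),
               range(start + context, start + context + horizon))
        start += horizon

# Offsets align the first test window across models with different context
# sizes: the 70-day contexts start at offset 0, the 7-day linear-regression
# context at offset 63, so both first forecast days 70-76.
windows_tree = list(sliding_windows(n_days=574, context=70))            # 72 windows
windows_lr = list(sliding_windows(n_days=574, context=7, offset=63))    # 72 windows
```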
2.5 Model Evaluation and Comparison
To assess the performance of the models on the testing windows, we used the mean absolute error (MAE) (34) and the mean absolute percentage error (MAPE) (35) as metrics:
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right|\;;\quad MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right| \times 100$$
where Y represents the observed values, Ŷ the predicted values, and n the number of data points, which in our case corresponds to the length of the testing window (7 days). The MAPE expresses the deviation of the prediction from the observed data as a percentage, making it a more intuitive measure than the MAE. However, due to this normalization, the MAPE becomes large for deviations at small scales, e.g. 50% if the predicted value is 1 but the observed value is 2. Therefore, we recorded both the MAE and the MAPE. All models were evaluated on the original (non-log-transformed) scale.
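Both metrics reduce to a few lines of NumPy; the toy arrays reproduce the 50% example from the text.

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat):
    return np.mean(np.abs((y - y_hat) / y)) * 100

# The small-scale effect from the text: predicting 1 where 2 was observed
# is an absolute error of only 1, but a percentage error of 50%.
y, y_hat = np.array([2.0]), np.array([1.0])
print(mae(y, y_hat), mape(y, y_hat))  # 1.0 50.0
```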
To test for statistical differences between two models, we declared them statistically different if the 95% confidence intervals of their mean MAPE did not overlap.
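The section does not specify how these confidence intervals were obtained; the sketch below assumes a t-distribution-based interval over the 72 window-level MAPE values, which is one common choice.

```python
import numpy as np
from scipy import stats

def mean_ci(mapes, level=0.95):
    """Confidence interval of the mean MAPE across the 72 testing windows."""
    return stats.t.interval(level, df=len(mapes) - 1,
                            loc=np.mean(mapes), scale=stats.sem(mapes))

rng = np.random.default_rng(3)
ci_a = mean_ci(rng.normal(12, 3, 72))  # illustrative MAPE samples, model A
ci_b = mean_ci(rng.normal(15, 3, 72))  # illustrative MAPE samples, model B

# The models differ statistically if their intervals do not overlap.
different = ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0]
```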