Data Preprocessing
Before the data analysis can begin, all data needs to be converted to the same dimensions so that the datasets are comparable. To achieve this, all climate datasets are first converted to the same filetype, for which we chose the Network Common Data Form (NetCDF). This filetype is a convenient choice for geographic climate analysis that varies over time. Both the ERA5 Reanalysis (Copernicus Climate Institute, 2019) and the GEFS reforecast (NOAA, 2019) data can be converted to this filetype without any data loss. To display the reformatted climate data, the NetCDF files are loaded into the Python programming language and plotted on a global map with continental outlines, using the World Geodetic System 1984 (WGS84) coordinate system, as shown in Figs. 1 and 2. All Python code used for the analysis and the creation of these maps can be found in the appendix of this paper.
As seen in Figs. 1 and 2, the reanalysis and reforecast datasets show some differences. While this can partially be attributed to modeling inaccuracies, the datasets also use different spatial resolutions and temporal units. To make the datasets comparable, both need to be brought to the same spatial and temporal resolution. For the spatial resolution, this is relatively straightforward: since the reanalysis grid is exactly four times finer than the reforecast grid in each dimension, it is sufficient to subsample the reanalysis by selecting every fourth cell along both axes. This yields a grid of 360 longitude by 180 latitude cells, as opposed to the reanalysis’ initial 1440 x 720 resolution.
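The spatial subsampling described above can be sketched as follows; the array shapes mirror the grids described in the text, but the field itself is synthetic:

```python
import numpy as np

def coarsen_to_reforecast_grid(reanalysis_field):
    """Keep every fourth cell along latitude and longitude.

    The reanalysis grid (720 x 1440 cells) is four times finer than
    the reforecast grid (180 x 360 cells) in each dimension, so
    simple subsampling aligns the two grids.
    """
    return reanalysis_field[::4, ::4]

# Synthetic stand-in for one reanalysis temperature map.
field = np.random.rand(720, 1440)
coarse = coarsen_to_reforecast_grid(field)
print(coarse.shape)  # (180, 360)
```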
The GEFS reforecast dataset starts on the 1st of June 1985 and ends on the 16th of September 2020, while the ERA5 reanalysis starts on the 1st of January 1980 and ends on the 1st of August 2020. To resolve these mismatching timeframes, the temporal extent of both datasets was clipped to their common overlap, from the 1st of June 1985 until the 1st of August 2020. Turning to the temporal resolution, the reforecast contains data for every day of the month, whereas the reanalysis only contains data for the first day of every month. The reforecast data is therefore reduced to match the timesteps of the reanalysis, meaning the final data for our analysis refers to the first day of each month between June 1985 and August 2020.
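Assuming the timestamps are held as a pandas DatetimeIndex, clipping to the overlap and keeping only the first day of each month might look like this sketch:

```python
import pandas as pd

# Daily reforecast timestamps spanning the full GEFS period.
reforecast_times = pd.date_range("1985-06-01", "2020-09-16", freq="D")

# Keep only first-of-month dates within the common overlap,
# which ends with the reanalysis on 2020-08-01.
monthly = reforecast_times[
    (reforecast_times.day == 1) & (reforecast_times <= "2020-08-01")
]
print(monthly[0], monthly[-1])
```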
Reducing the temporal and spatial resolutions of the datasets discards a significant amount of data. However, the initial climate data was abundant enough that this poses no problem: the retained data consists of 180 x 360 cells per map across 422 moments in time, resulting in more than 27 million datapoints for our analysis. This wealth of datapoints has proven to be more than enough to train and validate any of our machine learning models. The data is now ready to be compared, spanning the same spatial and temporal dimensions, while still holding a vast amount of information.
Creating a training and test set
When training our model, it is important that the data on which it is trained and the data on which it is tested are kept separate, so that we do not assess the model on the same data we train it on. This prevents overfitting to the training data and yields a more realistic accuracy assessment, since the final version of the model is validated on data it has never seen before. Research has been conducted into the ideal Train/Test ratio (Pawluszek-Filipiak & Borkowski, 2020; Rácz et al., 2021), but the outcome has proven to be rather dependent on the type of data used (Pawluszek-Filipiak & Borkowski, 2020). Regardless, Train/Test ratios of either 70%/30% or 80%/20% are generally accepted and have been shown to perform well, specifically on large datasets (Rácz et al., 2021) like ours. For our split, we allocated 75% of the data to the training set and the remaining 25% to the test set used for validating the model after training. Accordingly, we shuffle all the data and randomly separate it into a training and a test set.
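A minimal sketch of this shuffle-and-split step, using a synthetic feature matrix in place of the real dataset:

```python
import numpy as np

def shuffle_split(data, test_fraction=0.25, seed=42):
    """Shuffle all rows, then split into training and test sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))
    n_test = int(len(data) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return data[train_idx], data[test_idx]

# Synthetic dataset of 1000 datapoints with 8 feature columns.
data = np.random.rand(1000, 8)
train, test = shuffle_split(data)
print(train.shape, test.shape)  # (750, 8) (250, 8)
```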
Data Validation
Before beginning to tune the reforecast model output to match the reanalysis more closely, it is important to ascertain that the ERA5 reanalysis data is in fact more accurate than the GEFS climate model output. It is known that reanalysis datasets still contain some errors (Moalafhi et al., 2017), so it is crucial to confirm that the reanalysis dataset is an accurate representation of real-world climate variables before using it as test data. To validate the accuracy and reliability of both datasets, we use a validation dataset containing station measurements directly sourced from the European National Meteorological and Hydrological Services (NMHSs): the E-OBS meteorological data for Europe derived from in situ observations dataset (Copernicus Climate Change Service, 2021). While possibly still prone to some observational and interpolation bias, the temperature data in this dataset is derived directly from in situ climate observations made by NMHSs stations and can therefore be considered very close to the real climate variables (Augustine et al., 2005; Sinisalo et al., 2013). We test the accuracy of both the reanalysis and the reforecast datasets against this ground-truth baseline.
As seen in Fig. 3, the validation set does not exactly match the extent of the other datasets, as the observations on which it is based only span land areas within Europe. Therefore, we clip the extent of the reforecast and reanalysis datasets to match this extent over Europe as well. Additionally, the 25,933 different timestamps present in the validation set are reduced to exactly the 422 moments chosen for our analysis. The validation set still contains some missing values in water areas across this European extent, but these regions were ignored in the error estimation. After bringing the observational, reanalysis, and reforecast datasets into the same shape, the mean temperature for each map was calculated over this extent, as shown in Table 1.
Table 1
Mean European Land Temperature Comparison

Model | Mean Temperature (K)
E-OBS Observational data | 281.84
ERA5 Reanalysis data | 282.09
GEFS Reforecast data | 279.22
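The per-map means in Table 1 were taken over the clipped European extent while ignoring the missing water cells; a minimal numpy sketch with illustrative values:

```python
import numpy as np

def mean_land_temperature(field):
    """Spatial mean over one map, ignoring missing (NaN) water cells."""
    return np.nanmean(field)

# Illustrative 2 x 2 clipped field with one missing water cell.
field = np.array([[281.0, np.nan],
                  [282.0, 283.0]])
print(mean_land_temperature(field))  # 282.0
```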
While all three mean temperatures are relatively close, the observational mean is substantially closer to the reanalysis mean than to the reforecast mean. While this is an indication that the reanalysis reflects real climate variables more accurately, it is not yet a robust measurement of model error. To quantify these discrepancies more precisely, we use the Root Mean Square Error (RMSE) to represent model inaccuracy. The RMSE yields a single error score representing the difference between two climate maps, while weighing large outliers more heavily. The resulting error measure is expressed in the units of the variable itself (here Kelvin) and has previously been used for similar climate model tuning applications (Chang & Guillas, 2019; Grönquist et al., 2019). If we have a set of N model-predicted climate variables Ŷ (in this case temperature), and a set of N reference climate variables Y, the RMSE formula is defined as:
Formula 1
RMSE
$$RMSE = \sqrt{\sum_{i=1}^{N}\frac{(\hat{Y}_{i} - Y_{i})^{2}}{N}}$$
Using the observational data as reference temperature, the RMSE of the reforecast data is 3.22. When testing the reanalysis’ accuracy on the observations, the reanalysis RMSE is 0.38, showing that the reanalysis data is substantially closer to the observed data than the reforecast data. This confirms that the reanalysis dataset is a relatively accurate and authoritative dataset suitable for climate model tuning. While it does not perform perfectly on the validation data and still has an error score of 0.38, this is a large improvement over the reforecast’s error score of 3.22. This indicates that if we can tune our climate model to reflect the reanalysis more accurately, it will be a more realistic estimation of real-world climate variables as well.
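Formula 1 translates directly into numpy; the values below are toy numbers, not the actual maps:

```python
import numpy as np

def rmse(predicted, reference):
    """Root Mean Square Error between a predicted and a reference map."""
    return np.sqrt(np.mean((predicted - reference) ** 2))

# Toy temperature values in Kelvin.
y_hat = np.array([280.0, 282.0, 284.0])
y = np.array([281.0, 282.0, 283.0])
print(round(rmse(y_hat, y), 3))  # 0.816
```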
To define a baseline RMSE against which to compare our correction models, we measure how accurately the reforecast data reflects the reanalysis. Comparing the reforecast maps against the reanalysis yields an RMSE of 2.90 with the reanalysis data as reference. Any of our correction models needs to score below this baseline RMSE to constitute an accuracy improvement. This RMSE of 2.90 is lower than the RMSE of 3.22 which the reforecast attained against the observational data, showing that the reforecast is closer to the reanalysis than it is to the observations. This is not entirely surprising, as the reanalysis dataset also employs short-range weather forecasting models (Hersbach et al., 2020), which may contain incorrect parameterisations similar to the ones present in the GEFS. This phenomenon, and how our correction model could potentially help reduce this bias, is further examined in the discussion section of this paper.
Despite the reanalysis’ slight inaccuracy, it is still deemed a valid test set, since it is much more accurate than the reforecasted climate data. Therefore, we quantify climate model error by subtracting the reforecast temperature from the reanalysis temperature. This yields global maps of climate model inaccuracy, such as those seen in Fig. 4. Under this definition, a positive model error indicates that the prediction was too low and that the model temperature (reforecast) should have been higher, while a negative error value indicates that the prediction should have been lower. This model error is calculated globally across all 422 timestamps, creating 422 unique inaccuracy maps. The inaccuracy patterns found in these maps differ considerably across years, seasons, and regions. To give an indication of what these varying patterns look like, a sample of six inaccuracy maps, equally distributed across the chosen timeframe, is displayed in Fig. 4.
While some common patterns can be recognised in these maps, it is difficult to estimate the prevalence of a specific pattern across the years by simply inspecting the individual maps. To quantify the climate inaccuracy in a more generalized way, we take the temperature data from all 422 maps and average their error. This yields a single, all-encompassing climate inaccuracy map, which shows the average inaccuracy at each grid cell of the GEFS from 1985 until 2020 (Fig. 5). The global patterns of inaccuracy displayed in Fig. 5 are further analyzed in the results section of this paper.
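Computing the per-timestamp error maps and their temporal average can be sketched as below; the stacks are synthetic and smaller than the real 422-map dataset:

```python
import numpy as np

# Synthetic (time, lat, lon) stacks; the real analysis uses 422 maps.
reanalysis = 281 + np.random.rand(6, 180, 360)
reforecast = 280 + np.random.rand(6, 180, 360)

# Positive error: the reforecast prediction was too low.
model_error = reanalysis - reforecast       # one inaccuracy map per timestamp
mean_error_map = model_error.mean(axis=0)   # single averaged inaccuracy map
print(mean_error_map.shape)  # (180, 360)
```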
Adding more training data
The data currently in our dataset is still relatively meagre for making the more complex inaccuracy predictions needed to achieve a stronger RMSE reduction. Currently, one datapoint consists of its longitude, latitude, and time values, the reforecasted (model) temperature, the reanalysis (considered accurate) temperature, and the difference between these two temperatures (the error we are trying to predict). While this already offers a large amount of information about climate model inaccuracy, any machine learning model we train is likely to perform better if given more variables to train on. A condition for any data added to the training and test datasets is that it needs to be consistently available over time, since we want our error correction model to be applicable to future climate model predictions as well. Therefore, we cannot include variables such as measured humidity or solar radiation, since this data is not available for future model predictions. What we can add, however, are other climate variables forecasted by the GEFS, as these predictions can also be made for the future. The GEFS contains predictions of many different physical processes and climate variables, which may prove helpful in our analysis and prediction of temperature inaccuracy. As such, we add the GEFS reforecasted Precipitation and Cloud Cover variables to the training and test sets and convert them to the spatial and temporal resolution previously described.
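Appending the extra forecast variables to a flattened table of datapoints might look like the following; all column names and values are illustrative, not the actual dataset schema:

```python
import numpy as np
import pandas as pd

# Illustrative flattened datapoints: one row per grid cell per timestamp.
n = 5
df = pd.DataFrame({
    "lon": np.linspace(-180, 179, n),
    "lat": np.linspace(-90, 90, n),
    "reforecast_temp": 280.0 + np.arange(n),
    "reanalysis_temp": 281.0 + np.arange(n),
})
df["model_error"] = df["reanalysis_temp"] - df["reforecast_temp"]

# Extra GEFS-forecast predictors, regridded to the same resolution.
df["precipitation"] = np.random.rand(n)
df["cloud_cover"] = np.random.rand(n)
print(list(df.columns))
```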
Land use classification
A less straightforward addition to our dataset relates to the idea that land use may influence temperature inaccuracy, which is explored further in the results section. To test and quantify this hypothesis, it would be ideal to add a global land use type dataset to the analysis. However, most publicly available land use data does not suit our purpose, either due to impractical filetypes and/or shapes, or because of an unnecessarily high number of distinct land use classes. To obtain the most suitable land use data for our specific purpose, we therefore developed our own custom land use dataset on a global scale.
To start this land use classification, we georeference a high-resolution file of global satellite images (NASA, 2008) to the WGS84 coordinate system. After appending the appropriate coordinates to the global image, we use supervised image classification in the GIS software ArcMap to begin the land use classification process. For this classification, we define five classes of global land use: water, ice, land, shallow water, and desert. For each of these classes, several training samples are created on NASA’s georeferenced satellite map, as shown in Fig. 6.
After defining the training sets for the land use types, the classification model is trained to recognize the visual aspects of each different class, based on the training sample squares shown in Fig. 6. Maximum likelihood image classification is then used to assign each global coordinate to one of the five land use classes, resulting in a global land use map as shown in Fig. 7:
A confusion matrix accuracy assessment (Yi & Zhang, 2012) was conducted, yielding a kappa accuracy coefficient of 0.91 for this classification, which is above the generally accepted 0.81 threshold of “almost perfect agreement” (McHugh, 2012). Therefore, our custom land use classification dataset is deemed sufficiently accurate for use in our analysis.
To implement the different land use classes in our training and test sets, five dummy variables are defined to represent the five land use classes. Each dummy variable evaluates to 1 if the datapoint matches its corresponding land use class, and to 0 otherwise. For example, if a coordinate’s land use belongs to the ‘Ice’ class, the ‘Ice’ dummy variable equals 1 and all other dummy variables equal 0. These variables, together with the newly created Precipitation and Cloud Cover variables, are appended to the existing datasets. All this data is then written to a large Comma Separated Values (CSV) file used to train our models, which is once again split into separate training and test sets. We use a CSV file specifically because it is a convenient filetype for modifying and exporting data between the ArcMap image classification and the Python analysis.
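The dummy-variable encoding and CSV export can be sketched with pandas; the labels, column names, and filename are placeholders:

```python
import pandas as pd

# Illustrative land use label per datapoint.
df = pd.DataFrame(
    {"LandUse": ["Ice", "Land", "Water", "Desert", "ShallowWater"]}
)

# One dummy column per class: 1 where the class matches, 0 otherwise.
dummies = pd.get_dummies(df["LandUse"]).astype(int)
df = pd.concat([df, dummies], axis=1)

# Export for training; the filename is a placeholder.
df.to_csv("training_data.csv", index=False)
print(df.loc[0, "Ice"], df.loc[0, "Land"])  # 1 0
```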
Table 2: An Example of the Training Set CSV. ‘ModelError’ is the Variable we are Training to Predict.