The tropical Indian Ocean exhibits two annual blooms of phytoplankton with the highest peak occurring during summer (June–September) and a secondary peak during winter (December–February) 1,2. These blooms are driven by the changes in the physical forcing primarily associated with the southwest (summer) and northeast (winter) monsoon3–6. The phytoplankton blooms regulate the food availability for higher trophic levels, making primary production central to both the aquatic food web and the Indian Ocean rim population that is dependent on marine fisheries for their livelihood7–11. Recent decades have observed warming of the earth’s climate unequivocally, with the oceans accounting for approximately 93% of this increased energy uptake12,13. Amongst the tropical oceans, the Indian Ocean has undergone the largest warming (0.15°C/decade) in ocean surface11,14,15, with projections of a stronger warming (> 1.5°C) by 2070 and (> 2.5°C) by 2100 across the CMIP5 models16,17. In the low-latitude regions, a warmer ocean surface enhances the ocean stratification thereby reducing the vertical mixing and inhibiting the nutrients (required for photosynthesis) into the sunlit zone of the ocean18,19. This limits the marine primary production subsequently impacting the biodiversity of the ocean11,20,21. Any further warming is therefore expected to affect both the mean biomass and the timings of seasonal phytoplankton blooms in tropical ecosystems22,23. The timing of phytoplankton blooms, known as phenology, directly affects the larval spawning and survival. For example, the onset of local phytoplankton bloom marks the hatching of pink shrimps in North Atlantic24,25. Any change in the bloom initiation timings are therefore likely to proliferate across the higher trophic levels—popularly known as the match-mismatch hypothesis26. Phenology has been argued to be one of the most sensitive biological indicators to detecting changes in the marine ecosystem by the Intergovernmental Panel on Climate Change27. With the increasing diversity and resolution of our observations, it is imperative to examine the response of ecological indicators of the marine ecosystem to the rapidly warming tropical oceans, and ocean color is currently the only window to understand the impact of the changing climate on ocean biology; therefore classified as an essential climate variable28.
To date, cost-effective, routine spatio-temporal observations of phytoplankton biomass have been remarkably possible only through the satellite ocean color sensors making it vital in operational forecasting, oceanographic research and numerical modelling of climate29,30. Remote-sensed ocean color provides measurements of chlorophyll—the phytoplankton pigment that undergoes photosynthesis and is a proxy to marine phytoplankton31. The ocean color fields as measured by SeaWiFS (1997–2010), MODIS(2002–), MERIS(2002–2012), VIIRS(2011–) sensors span different time periods with very limited overlapping between the two missions. In view of this, a recent Ocean-Color Climate Change Initiative (OC-CCI) by the European Space Agency (ESA) provides a high-quality, long-term chlorophyll dataset at a very high resolution (~ 4 km), achieved by blending ocean color observations from multiple satellite missions and applying quality corrections32,33—thus bringing the current global records of chlorophyll to more than two decades (22 years). The availability of OC-CCI chlorophyll data has resulted in a consistent rise in the assessment of trends in the marine ecosystem; therefore expanding our ability to detect the signature of human-induced climate change on the marine ecosystem33,34. Previous studies suggest a decline in the marine primary production throughout the tropical oceans, particularly the open oceans and have warned of detrimental impacts on the marine food web in response to future ocean warming35–38. The western Indian Ocean, which is the most biologically productive region of the Indian Ocean, has already undergone a significant decline of 20–30% in the surface phytoplankton distribution during 1998–2013, as shown by both satellite observations and the CMIP5 Earth System Models38. However, the phenology of the seasonal phytoplankton blooms is still overlooked in the Indian Ocean, majorly due to the lack of gap-free observations at a higher temporal frequency.
Satellites have been instrumental in advancing our knowledge of the changing biophysical interactions under the recent climate change scenario. Besides global coverage, data obtained from satellites are high in both sampling frequency and spatial resolution required to assess trends and interannual variability of phytoplankton phenology39. However, satellite observations are subject to gaps—both frequent and persistent—caused by several factors such as sun-glint, atmospheric aerosols, cloud cover, sensor saturation, hardware issues40–42. Missing data tends to reduce the statistical power of a dataset along with affecting the accuracy of the estimating parameters43. The merging of chlorophyll by OC-CCI has led to a reduction in missing values40 as a result of the overlap in the spatial coverage of satellite sensors. However, this does not solve the critical issue of gaps in satellite data occurring due to the presence of clouds 44,45.
Over the tropical Indian Ocean during the summer monsoon, the gaps in ocean color data can be as high as 40% (Fig. 1a).. Enhanced convective activity owing to moisture-laden monsoonal cross-equatorial flow during June–September leads to formation of persistent cloud cover over the Indian Ocean north of 10S46,47. This results in a considerable reduction in outgoing longwave radiation (OLR) by up to 100 W/m2 from the annual mean (Fig. 1b), hence preventing the satellites from observing the ocean48. This poses a major challenge in the usage of satellite data as it is during the summer monsoon season that the highest productivity, driven by enhanced vertical mixing and offshore wind-driven upwelling, is experienced in the north Indian Ocean3,6. Furthermore, it is important to note that contrary to the Indian Ocean, satellites have good data coverage over the Pacific and Atlantic (Supplementary Fig. S1). This may be one of the factors leading to the fact that the Indian Ocean is least understood of all the tropical basins38. Some of the highly productive regions in the tropical oceans include eastern Pacific and eastern Atlantic. However, due to low cloudiness (reflected in the OLR, Fig. 1b), both the regions have a good satellite coverage49. Hence, in studies that consider the global domain, this difference of cloud cover in the different basins can bias the regional estimates of both the phytoplankton distribution and phenology44,50 and needs to be considered. The potential biases introduced by intermittent data in assessment of ecological trends and variability has already been demonstrated in the scientific literature. Errors of typically 15 and 30 days in the bloom peak and initiation timings respectively and high uncertainty (> 2 weeks) in the duration of the seasonal phytoplankton bloom have been estimated when dealing with incomplete time series44,51; making gap-filling a prerequisite for detecting the phenological (or, ecological) indices of the ocean ecosystem using the existing observations.
Most common gap-filling techniques of ocean color data involve spatial or temporal interpolation51–53; filtering54,55; and substitution by mean, median, minimum56. Conventionally, interpolation has been the most widely used tool in scientific literature to deal with data-gaps. It involves extending the aerial coverage of the data by utilizing the information of the neighboring observations. However, if not performed cautiously, excessive smoothing (or, interpolation) can disrupt data quality by blending the sub-grid to grid scale features, leading to under- or over- estimation of the chlorophyll concentrations; thereby making the data unreliable for extracting information of the important local biophysical processes. This is because interpolation is based on the assumption that the missing pixel has a linear relationship with the surrounding pixels which is far from reality, as phytoplankton blooms are known to occur in patches and their concentration varies dramatically from the coastal to the open ocean waters. Hence, beyond a certain neighboring grid, interpolation of ocean color data becomes invalid57. Additionally, interpolating the data to a coarser temporal resolution of the order of monthly scale might reduce the frequency of gaps but the reduced time scale will leave the data inadequate for estimation of phenological indicators. Similarly, substituting the missing value with the sample average seems convenient as a method of gap-filling, but the phenological algorithms to estimate the interannual variability and trends, when applied to the reconstructed data, leads to flawed outcomes. Moreover, since the ocean color data follows a skewed distribution, using the data-mean to fill the missing values would be inappropriate for a skewed distribution (skewness value = 1.51, seen in the histogram in Fig. S2).
There are a few advanced methods presented in the literature for gap-filling such as neural networks, empirical orthogonal functions (DINEOF); however they are limited to single value imputation overlooking the uncertainty of estimates around the true value—an important aspect of gap-filling40,52,58. Most of the studies have utilized a small sample size (< 10 years) of satellite data and has been limited to reconstructing a gap-free climatology, which does not allow for examining the interannual variability and trends. An attempt to prepare a gap-free climatology of chlorophyll for the tropical Indian Ocean employing 7 years of SeaWiFS data has been carried out by Levy et al. (2006)54,55. It was the first attempt using remote-sensed ocean chlorophyll to produce a gap-filled dataset over the Indian Ocean. However, their methodology is based on filtering techniques, essentially relying on the information of surrounding spatio-temporal values for estimating the missing value and limited to single value imputation. Even with the recent technological advancement and computational resources available, the current set of CMIP6 model outputs of chlorophyll are limited to a monthly scale. Hence, this problem of intermittent datasets needs addressal with the existing satellite ocean color data, which drives the purpose of this study.
The choice of the method becomes increasingly important as the amount of missing data increases such as in the case of the tropical Indian Ocean. More importantly, to draw useful inferences in the phenological indices estimated from these gap-filled datasets, it becomes crucial to address the uncertainty of the estimates. With the single imputation methods used to fill the missing data, no information of the uncertainty associated with the analyzed parameters can be determined. We believe that this issue can be addressed by applying computational statistical tools of moderate complexity. Hence, in this study, we propose a methodology of gap-filling which enables us to quantify the phenological indices along with uncertainty measures. We achieve this by performing a large number of simulations using the classical Monte-Carlo method in addition to a strict optimal local averaging to fill the gaps in ocean color data. The Monte-Carlo procedure allows us to impute the missing pixel with a set of pseudo-random values in place of a single value as done with the other techniques of gap-filling. This approach will generate ensembles of reconstructed datasets rather than just producing a single gap-filled data. This provides the additional advantage of using Monte-Carlo approach, as we will have a range of possible values in the derived phenological parameters. The proposed methodology is explained in the Methods section and schematically illustrated in Fig. 3.
The primary emphasis of this study is to acknowledge the importance of determining uncertainty in the estimated parameters derived using these gap-filled datasets which is a significant advantage of using this method, otherwise not possible with the conventional gap-filling approaches. Though focused on marine ecosystems, this methodology can be extended to using other variables of the climate system. We hope that this work contributes towards improving the use of existing datasets to extract reliable information of biophysical processes of the marine ecosystem.