The COVID-19 pandemic highlighted the need to better understand the community attributes that govern a population’s susceptibility to epidemic transmission and spread [17, 21, 61]. A focus on the population factors that govern early epidemic transmission is particularly important, as determining what makes some regions more vulnerable than others at this stage could enhance the development of institutional and societal preparedness measures to prevent or significantly mitigate the spread of future outbreaks [14]. In this research, we combined an RF model with a model-agnostic ML explanatory framework, components of the CDC/ATSDR SVI data, and the number of COVID-19 cases that occurred during the early phase of the outbreak (defined as the period from the report of the first case to the peak of the first wave) to determine the social vulnerability components that were most predictive of the initial COVID-19 growth rate in Florida at the ZIP code level. This work extends previous studies on this topic in several important ways. First, in the context of socioeconomic vulnerability and initial COVID-19 outbreaks, past studies have used cumulative cases [17, 22, 23] and incidence [21, 24, 25] as outcomes. While these studies have provided information on how SVI factors may affect the rise in early cases, they furnish less information on how these factors may affect the speed of epidemic growth [27]. Understanding the impact of vulnerability factors on the growth rate of an epidemic during its early phase, on the other hand, can convey information about the time scale and impact of disease spread, including how quickly the epidemic will grow, when it might peak, its overall magnitude, and how long it will last, all of which constitute important knowledge for guiding epidemic management [26, 28, 62].
Second, the impact of SVI variables on COVID-19 growth rates has been analyzed previously, but at larger spatial scales and using a limited number of socioeconomic variables as predictors [32, 63]. In contrast, this study estimated and analyzed growth rates of the pandemic as a function of more than 20 socioeconomic variables at the fine spatial scale of ZIP codes (Table 1). Understanding the impact of these factors on the growth of an epidemic at this local spatial scale will allow a better appreciation of variations in the nature, scale, timing and speed of disease propagation across a region, which will be useful for developing and implementing better defined and more spatially focused preparatory, control and prevention efforts. Given the variability in the way outbreak preparedness is operationalized across the US, the development of such analytical frameworks for assessing local-level vulnerability to epidemic outbreaks and spread has been highlighted as a critical need in preparing for future epidemics [14]. Lastly, the use of ML-based covariate assessment methods together with interpretative tools, such as SHAP, adds the ability to carry out and explain complex multidimensional data analyses [64, 65].
Our study first addressed the epidemic growth rates associated with the first wave of the COVID-19 pandemic observed in the ZIP codes of Florida. The results, obtained via fits of a simple log-linear regression model [27], indicate that growth rates during this initial phase varied significantly between the ZIP code jurisdictions in Florida, with distinct hotspots of heightened transmission found along the central, southern and northern coastal and northwest regions of the state (Fig. 2). This immediately underscores the need to better understand the local vulnerability/resilience factors that underpinned such spatial variation, so that gaps in current preparatory measures can be addressed to reduce the impact of future outbreaks on Florida communities. The RF model we trained to predict these growth rates using 22 unique ZIP code-level features, comprising socioeconomic, demographic, transportation and building type, and health-related factors, performed accurately and provided significant initial insights into this topic. The first finding of importance was that our SHAP analysis identified six factors, from among the 22 investigated, that were most associated with the observed early growth rate of the pandemic at the ZIP code level in Florida (Fig. 3, Table S1). The SHAP plots revealed that five of these ZIP code features, viz. percentage of single-parent households, population density, percentage of the population speaking English less than fluently, percentage living in group quarters and percentage of the population burdened by housing costs, generally had a positive impact on epidemic growth rate predictions, thus making the ZIP codes with higher levels of these features vulnerable to heightened early transmission of the pandemic.
By contrast, and less intuitively, higher levels of CHD had an overall negative impact on epidemic growth predictions, making ZIP codes with higher CHD levels less vulnerable to the early spread of the pandemic (see Fig. 4). The positive association between the first five factors and the incidence of COVID-19 cases has been reported previously at various spatial levels, from census tract to county and state levels [17, 18, 20, 66–69], although this is the first time that this positive relationship has also been documented for the initial growth of the pandemic. Various explanations have been proposed for this association, including the financial struggles faced by single parents [70–72]; the impact of population density in enhancing contacts between susceptible and infected sub-groups in a community [68, 73–75]; the overrepresentation of individuals less fluent in English in occupations deemed to be “essential” [10, 76, 77], their inability to fully understand safety guidelines, and their propensity to live in crowded multi-generational homes [78, 79]; and the unique exposure of individuals living in group quarters, including nursing homes [80, 81] and correctional facilities [82, 83]. These mechanisms may similarly underlie the impact of these factors in making the ZIP codes with high levels of these features vulnerable to heightened early propagation of COVID-19.
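As an illustration of the growth-rate outcome discussed above, the following is a minimal sketch of a log-linear fit, in which the growth rate is taken as the least-squares slope of the natural logarithm of case counts against time. The function name `estimate_growth_rate`, the use of cumulative counts indexed by day, and the synthetic series are assumptions made for illustration; they are not the study's code or data.

```python
import math

def estimate_growth_rate(cumulative_cases):
    """Least-squares slope of ln(cases) against day index (hypothetical helper)."""
    # Keep only days with positive counts, since ln(0) is undefined.
    points = [(day, math.log(c)) for day, c in enumerate(cumulative_cases) if c > 0]
    if len(points) < 2:
        raise ValueError("need at least two days with positive case counts")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    return sxy / sxx  # growth rate per day on the log scale

# Synthetic (assumed) series that grows exponentially at 0.10 per day.
cases = [5 * math.exp(0.10 * day) for day in range(30)]
rate = estimate_growth_rate(cases)
doubling_time = math.log(2) / rate  # days needed for cases to double
```

Under these assumptions the slope summarizes the speed of spread in a single number per ZIP code, and the implied doubling time gives a direct sense of the time scale involved.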
The SHAP dependence plots shown in Fig. 4 provide additional insights into the relationships between the six most important features and the RF predictions of early epidemic growth rates at the ZIP code level. The results indicate a generally sharp increase in growth rate predictions as the percentage of single-parent households and the percentages of the population that face a language barrier or live in group quarters in a ZIP code increase. More complex functional associations were, however, identified between early epidemic growth predictions and covariates such as population density, percentage of the population burdened by housing costs and percentage of the population in a ZIP code with CHD (Fig. 4). Indeed, the results show that at the highest ZIP code-level values of these variables, their impact on growth rate predictions turned negative, with the most negative association at high covariate values found for CHD. Further, the non-linear form of these relationships indicates that the highest vulnerability to rapid epidemic growth during the early phase of the pandemic was exhibited by ZIP codes with intermediate values of these covariates. While these unexpected findings suggest the need for further research, it is possible to speculate, at least for population density and CHD, that the negative impacts at the highest levels of these variables may be linked to public services, and possibly population compliance with protective measures (mask wearing, observing safe distances), being better in densely populated centers, such as major cities. Higher compliance with protective measures among CHD patients may similarly underlie the negative association between high CHD levels and initial COVID-19 growth rates.
These findings highlight how applying SHAP analysis in combination with ML modeling can facilitate the unearthing of novel insights into the impact of factors that drive epidemic transmission whilst also serving as a foundation for guiding future investigations.
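To make concrete what a SHAP value expresses, the following is a minimal sketch that computes exact Shapley attributions for a toy predictor by enumerating all feature coalitions. It is a simplification under stated assumptions: `toy_growth_model`, its coefficients, the feature values and the single all-zero baseline are hypothetical, whereas SHAP as applied to an RF model averages over a background dataset and uses tree-specific algorithms rather than brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline)."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take their observed value; the rest stay at baseline.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_growth_model(features):
    """Hypothetical predictor: hump-shaped density effect, negative CHD effect."""
    single_parent, density, chd = features
    return (0.02 * single_parent
            + 0.4 * density * (1.0 - density)   # peaks at intermediate density
            - 0.01 * chd
            + 0.01 * single_parent * density)   # simple interaction term

baseline = [0.0, 0.0, 0.0]       # assumed reference ZIP code
zip_features = [3.0, 0.9, 2.0]   # assumed feature values for one ZIP code
phi = shapley_values(toy_growth_model, zip_features, baseline)
```

In this sketch the attributions sum to the difference between the prediction for the ZIP code and the prediction for the baseline, the CHD attribution is negative, and the interaction term is shared equally between the two features involved.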
The SVI metric developed and evaluated in this work, using the top six ZIP code-level features that each contributed more than 7.5% to the RF model predictions, revealed the Florida ZIP codes that were prone to high levels of COVID-19 transmission during the initial phase of the pandemic in the state. The SVI distribution showed a positive correlation with initial ZIP code-level epidemic growth, indicating that the developed social vulnerability metric is able to reliably distinguish the local ZIP code jurisdictions that were most prone to rapid initial outbreak growth from those that were less susceptible to initial epidemic transmission. Moreover, such a metric can also be useful for assessing spatial variations in local vulnerability to the initial spread of an epidemic within a larger region, such as a state (Fig. 5). This finding underscores the importance of considering local-level community vulnerability to the spread of an epidemic in order to better understand the interplay between the social determinants of outbreaks and to develop more targeted mitigation efforts to curtail the emergence and spread of epidemics in a region [17, 20, 61, 67, 84]. In the latter regard, it is pertinent to note that our analysis has demonstrated that several specific community vulnerabilities, viz. percentage of single-parent households, population density, percentage of the population speaking English less than fluently, percentage living in group quarters and percentage of the population burdened by housing costs, were most salient in determining the rapid initial local spread of COVID-19 in Florida. Taken together, these results show how indexes such as the SVI can help not only to identify the communities most susceptible to epidemic flare-ups but also, by identifying the specific vulnerability factors involved, to provide information that could be used to allocate resources and interventions effectively to contain or mitigate epidemic spread across a region [11, 60].
These should include policies that protect and support “essential” workers with workplace social distancing and mask-wearing mandates, access to diagnostic tests, health insurance, paid sick leave, supplementary income, and education regarding the disease’s transmission and protective measures in the native language of each worker [85–88]. Similarly, single parents with dependent children need a financial safety net, which can be based on the provision of supplementary income, paid sick leave, affordable childcare, counseling services and health insurance [70–72]. Specific policies that address transmission in group quarters will also need to be developed going forward, based on the nature of each type of quarter [83, 89–91]. This study indicates that focused policies guided by metrics such as the ML-based SVI developed here will help to minimize and slow early community spread, which will be crucial to preventing the development of severe pandemics with major societal impacts, as observed in the case of COVID-19.
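The kind of importance-weighted index discussed above can be sketched as follows. This is an illustrative construction under stated assumptions, not the study's exact procedure: the helper names, the min-max scaling, the sign reversal for negatively associated features, the importance shares and the three-ZIP-code data set are all hypothetical; only the 7.5% selection threshold comes from the text.

```python
def select_features(importance_share, threshold=0.075):
    """Keep features whose share of total importance exceeds the threshold."""
    return [name for name, share in importance_share.items() if share > threshold]

def min_max_scale(values):
    """Rescale values to the [0, 1] range (all zeros if the feature is constant)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def composite_index(data, importance_share, direction, threshold=0.075):
    """Importance-weighted average of scaled features, one score per ZIP code."""
    selected = select_features(importance_share, threshold)
    total = sum(importance_share[name] for name in selected)
    n_zips = len(next(iter(data.values())))
    index = [0.0] * n_zips
    for name in selected:
        scaled = min_max_scale(data[name])
        if direction[name] < 0:
            # Reverse features where higher values were associated with lower growth.
            scaled = [1.0 - s for s in scaled]
        weight = importance_share[name] / total
        for k in range(n_zips):
            index[k] += weight * scaled[k]
    return index

# Hypothetical importance shares, association signs and data for three ZIP codes.
importance_share = {"single_parent": 0.30, "chd": 0.20, "other_feature": 0.05}
direction = {"single_parent": 1, "chd": -1, "other_feature": 1}
data = {
    "single_parent": [5.0, 10.0, 15.0],
    "chd": [8.0, 6.0, 4.0],
    "other_feature": [1.0, 2.0, 3.0],
}
selected = select_features(importance_share)
index = composite_index(data, importance_share, direction)
```

With these assumed inputs, the feature below the threshold is dropped, and the third ZIP code, which has the most single-parent households and the lowest CHD level, receives the highest score. Ranking ZIP codes by such a score is one way a vulnerability index could be used to prioritize where resources are directed.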
Our analysis and findings must be interpreted in light of the study’s limitations. Firstly, it is possible that the number of reported daily COVID-19 cases was understated, affecting the data on our outcome. If the number of cumulative cases was underreported differentially across the socioeconomic factors included as our predictive features, our results might be biased. Furthermore, even though our independent variables comprised 22 distinct factors inspired by the 2020 CDC/ATSDR SVI themes, it is plausible that incorporating additional confounding factors, such as the number of persons using public transportation and the movement of individuals between ZIP codes, would be beneficial. Such connectivity between ZIP codes may be particularly important, as a ZIP code’s reported COVID-19 cases might be a function of epidemic transmission in neighboring communities. Our analysis also did not explicitly consider the intensity of the initial lockdowns that were implemented between April and early May, or the actual compliance of each ZIP code’s population with such lockdowns, owing to the paucity of local-level data on these variables. ML-based methods can make better predictions with large amounts of training data, which our study did not have because its spatial extent was restricted to Florida. Therefore, our conclusions need to be supported by further investigations using more complete data and analytical methodologies other than those used here. Such research must also consider other regions of the United States if a more generalizable account of the vulnerability of local communities to epidemics across the country, and of the factors underlying it, is to be realized.