We created a new neighborhood health dataset by pooling information from three data sources in the United States: the Centers for Disease Control and Prevention (CDC), the Census Bureau, and the Environmental Protection Agency (EPA). Census tract was used as a proxy for neighborhood. Data on the prevalence of health outcomes, prevention, and health behavior measures were drawn from the CDC’s 500 Cities Project 2017 data release, which covers 28,004 census tracts.16 The project was funded by the Robert Wood Johnson Foundation in conjunction with the CDC Foundation. Socio-demographic measures for the selected census tracts came from the 2011-2015 American Community Survey 5-Year Estimates.17,18 Information on environmental exposures was obtained from the EPA’s Environmental Justice Screening (EJSCREEN) database.19 We did not obtain IRB approval because this ecological study used census tract-level data from publicly available sources.
We included four types of neighborhood risk factors: i) unhealthy behaviors (e.g., smoking, no leisure-time physical activity, insufficient sleep, and obesity), ii) prevention measures (e.g., lack of health insurance, dental visits, colonoscopy screening, and being up to date on a core set of preventive services for males and females), iii) sociodemographic indicators (e.g., age, sex, race/ethnicity, income, and education), and iv) environmental measures (e.g., ambient air pollution). Both the stroke outcome and its predictor variables were measured at the neighborhood level (no person-level data were used). Detailed descriptions of the variables, their data sources, and their distributions are shown in Table 1. We excluded 1,307 census tracts that had missing data on key variables: 975 had missing health measures, 137 had missing socio-demographic measures, and 295 had missing environmental data. Our final analytical dataset included 26,697 census tracts.
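The pooling and exclusion steps above can be sketched as a simple tract-level join followed by listwise deletion on the key variables. The sketch below uses Python/pandas purely for illustration (the paper does not specify its data-management code), and the table and column names are hypothetical; each table is assumed to be keyed by the census tract FIPS code.

```python
import pandas as pd

def build_analytic_dataset(cdc, acs, ejscreen, key="tract_fips", required=None):
    """Inner-join the three tract-level tables on the tract identifier and
    drop tracts with missing values on the required key variables.
    Illustrative sketch; column names are hypothetical."""
    merged = cdc.merge(acs, on=key).merge(ejscreen, on=key)
    if required:
        merged = merged.dropna(subset=required)
    return merged
```

An inner join keeps only tracts present in all three sources, matching the study design in which every tract must have health, socio-demographic, and environmental measures.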
We first applied a heuristic approach to remove the minimum number of highly correlated predictor variables. Redundant predictors add more complexity to the model than information, and using highly correlated predictors in regression models can lead to highly unstable estimates. The variance inflation factor (VIF) can identify predictors affected by collinearity but does not determine which should be removed to resolve the problem. We therefore followed an iterative algorithm that removes the minimum number of variables needed to bring all pairwise correlations below a chosen threshold, for which we used 0.75.20 Details of the algorithm appear in Figure 2.
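The correlation-pruning heuristic can be sketched as a greedy loop: find the most correlated remaining pair, drop the member of the pair that is on average more redundant with everything else, and repeat until every pairwise correlation is below the threshold. This is an assumed reading of the heuristic (the authors' exact algorithm is in their Figure 2), written in Python for illustration.

```python
import numpy as np
import pandas as pd

def prune_correlated(df: pd.DataFrame, threshold: float = 0.75) -> list:
    """Iteratively drop predictors until all pairwise |correlations| < threshold.

    Among the most correlated pair, the variable with the larger mean absolute
    correlation with the remaining predictors is removed, so as little
    information as possible is lost. (Sketch only; the paper's exact rule
    appears in its Figure 2.)"""
    kept = list(df.columns)
    while len(kept) > 1:
        corr = df[kept].corr().abs()
        np.fill_diagonal(corr.values, 0.0)
        # locate the most correlated remaining pair
        i, j = np.unravel_index(np.argmax(corr.values), corr.shape)
        if corr.values[i, j] < threshold:
            break
        # drop whichever member of the pair is, on average, more redundant
        drop = kept[i] if corr.values[i].mean() >= corr.values[j].mean() else kept[j]
        kept.remove(drop)
    return kept
```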
We then applied a high-performance nonparametric machine learning technique, QRFs, to the reduced data with no highly correlated variables. QRFs are a generalization of random forests (RFs). RFs are a machine learning technique that builds an ensemble of regression trees to flexibly capture the relationship between the conditional mean of the response and the predictor variables, and they have gained popularity in medical research for their high prediction accuracy and adaptability.21-23 QRFs use the infrastructure of RFs and give a non-parametric, accurate way of estimating conditional quantiles; the method has been shown to be consistent and competitive in terms of predictive power.24 Like the standard RF algorithm, QRFs grow an ensemble of regression trees with random node and split-point selection, but whereas RFs keep only the mean of the observations that fall into each node of each tree, QRFs keep the values of all observations in the node. QRFs can therefore estimate the full conditional distribution of the response given the covariates and provide a fuller picture of the exposure-outcome relationship than mean-based RFs.
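The key idea, keeping all leaf observations rather than only their mean, can be illustrated in a few lines on top of an ordinary random forest. The sketch below is not the authors' implementation (they used the R "quantregForest" package); it uses scikit-learn's `RandomForestRegressor.apply` to retrieve leaf memberships and pools the training responses from matching leaves, which approximates Meinshausen's leaf-weighted scheme.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class SimpleQRF:
    """Minimal quantile-regression-forest sketch on top of sklearn's RF.

    A standard RF stores only each leaf's mean response; a QRF keeps all
    training responses per leaf, so any conditional quantile can be read
    off the empirical distribution of the pooled leaf observations.
    Illustrative only -- equal per-observation pooling approximates the
    exact QRF weighting."""

    def __init__(self, **rf_kwargs):
        self.rf = RandomForestRegressor(**rf_kwargs)

    def fit(self, X, y):
        self.rf.fit(X, y)
        self.y_train = np.asarray(y)
        self.train_leaves = self.rf.apply(X)  # (n_train, n_trees) leaf ids
        return self

    def predict_quantile(self, X, tau):
        leaves = self.rf.apply(X)             # (n_test, n_trees) leaf ids
        preds = np.empty(len(X))
        for i, row in enumerate(leaves):
            # pool training responses from the matching leaf of every tree
            pooled = np.concatenate([
                self.y_train[self.train_leaves[:, t] == leaf]
                for t, leaf in enumerate(row)
            ])
            preds[i] = np.quantile(pooled, tau)
        return preds
```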
We developed and implemented a variable selection algorithm, based on the variable importance scores generated by QRFs, to determine the most critical predictors of the 90th percentile of the neighborhood-level stroke prevalence rate. The algorithm is described in Figure 2. A similar algorithm was suggested by Dietrich et al. for implementing RFs with survival outcomes, but without assessing the optimal balance between the prediction error and the number of selected variables.25 The importance score for each variable is computed by randomly permuting the variable's values in the out-of-bag (OOB) sample of each tree, measuring the resulting decrease in model accuracy, and averaging this decrease across the forest. The more important the variable, the larger the decrease (i.e., the importance score) produced by the permutation. We carried out an iterative variable selection process: at each step we removed the least important variable, rebuilt a QRFs model with the remaining variables, and recorded the out-of-bag (OOB) average quantile loss (AQL), until no variables were left. We used the AQL to evaluate model performance because the true conditional quantiles of the responses are unobservable. As suggested by Wang et al and Fang et al, we computed the prediction error of the τ-th conditional quantile by averaging the quantile loss function, ρ_τ(y − q̂_τ(x)), over all observations, where ρ_τ(u) = u(τ − I(u < 0)).26,27 We then plotted the OOB AQLs against the number of selected variables and set the final model to be the one corresponding to the ‘elbow’ point, which achieved the best balance between the smallest OOB AQL and the parsimony of the selected variables.
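The AQL criterion and the backward-elimination loop described above can be made concrete as follows. The quantile ("check") loss ρ_τ is standard; the `fit_oob` callable is a hypothetical stand-in for fitting a QRFs model and returning its importance scores and OOB AQL, since the authors' workflow ran in R.

```python
import numpy as np

def quantile_loss(y, q_hat, tau):
    """Check loss rho_tau(u) = u * (tau - I(u < 0)) for the tau-th quantile."""
    u = np.asarray(y, dtype=float) - np.asarray(q_hat, dtype=float)
    return u * (tau - (u < 0))

def aql(y, q_hat, tau):
    """Average quantile loss over all observations (the AQL metric)."""
    return float(np.mean(quantile_loss(y, q_hat, tau)))

def backward_select(fit_oob, variables):
    """Drop the least important variable each round, recording OOB AQL.

    fit_oob(vars) is a hypothetical callable that fits a QRFs model on
    `vars` and returns (importance_dict, oob_aql). The returned history
    of (variable set, OOB AQL) pairs is what gets plotted to locate the
    'elbow' point."""
    history = []
    vars_ = list(variables)
    while vars_:
        importance, oob_aql = fit_oob(vars_)
        history.append((list(vars_), oob_aql))
        vars_.remove(min(vars_, key=lambda v: importance[v]))
    return history
```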
To empirically evaluate whether our machine learning algorithm selected major determinants, we compared QRFs with classical linear QR including all predictors additively (termed LQR-AllVar), which is frequently used in public health. We compared the AQL and the AQL reduction per predictor, defined as (AQL_null − AQL_method)/(number of predictors_method), where AQL_null is the AQL from the null (intercept-only) model and AQL_method is the AQL of each specific method. The AQL reduction per predictor answers the question of how much gain we obtain for each predictor variable suggested by a variable selection approach; methods that yield a larger AQL reduction per predictor are therefore preferred.
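The comparison metric is a one-liner; for instance, a method that cuts the null-model AQL from 2.0 to 1.4 using 6 predictors achieves a reduction of 0.1 per predictor (values here are made up for illustration).

```python
def aql_reduction_per_predictor(aql_null, aql_method, n_predictors):
    """(AQL_null - AQL_method) / number of predictors used by the method."""
    return (aql_null - aql_method) / n_predictors
```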
Finally, to “unblackbox” the machine learning results, we included the major predictors selected by QRFs in a linear QR model to quantify the effect of each predictor on different percentiles of the response, and in a linear regression (LR) model to show how a mean-based analysis may provide an incomplete and biased summary of exposure effects. All statistical analyses were performed using R version 3.6.1. QRFs models were built using the “quantregForest” R package.
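The final "unblackboxing" step, fitting the same linear QR specification at several percentiles, can be illustrated as below. This is a Python/statsmodels sketch on synthetic data, not the authors' R code; the variable names (`pm25`, `stroke_prev`) and the simulated slope are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic tract-level data (hypothetical names and effect size)
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"pm25": rng.uniform(5, 15, n)})
df["stroke_prev"] = 1.0 + 0.3 * df["pm25"] + rng.normal(0, 0.5, n)

# Fit the same linear QR at several percentiles of the response;
# comparing slopes across taus reveals what a mean-only model would hide.
for tau in (0.1, 0.5, 0.9):
    fit = smf.quantreg("stroke_prev ~ pm25", df).fit(q=tau)
    print(f"tau={tau}: pm25 coefficient = {fit.params['pm25']:.3f}")
```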