We created a new neighborhood health dataset by pooling information from three data sources in the United States: the Centers for Disease Control and Prevention (CDC), the Census Bureau, and the Environmental Protection Agency (EPA). Census tract was used as a proxy for neighborhood. Data on the prevalence of health outcomes, prevention, and health behavior measures were drawn from the CDC’s 500 Cities Project 2017 data release, which covers 28,004 census tracts.16 The project was funded by the Robert Wood Johnson Foundation in conjunction with the CDC Foundation. Socio-demographic measures for the selected census tracts came from the 2011-2015 American Community Survey 5-Year Estimates.17,18 Information on environmental exposures was obtained from the EPA’s Environmental Justice Screening (EJSCREEN) database.19 We did not obtain IRB approval because this ecological study used census tract-level data from publicly available sources.
We included four types of neighborhood risk factors: i) unhealthy behaviors (e.g., smoking, no leisure-time physical activity, insufficient sleep, and obesity), ii) prevention measures (e.g., lack of health insurance, dental visits, colonoscopy screening, and being up to date on a core set of preventive services for males and females), iii) sociodemographic indicators (e.g., age, sex, race/ethnicity, income, and education), and iv) environmental measures (e.g., ambient air pollution). Both the stroke outcome and its predictor variables were measured at the neighborhood level (no person-level data were used). Detailed descriptions of the variables, their data sources, and their distributions are shown in Table 1. We excluded 1,307 census tracts that had missing data on key variables: 975 had missing health measures, 137 had missing socio-demographic measures, and 295 had missing environmental data. Our final analytical dataset included 26,697 census tracts.
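The pooling and exclusion steps above can be sketched as a simple tract-level join followed by listwise deletion on the key variables. The sketch below uses Python/pandas purely for illustration (the paper does not specify its data-management code), and the table and column names are hypothetical; each table is assumed to be keyed by the census tract FIPS code.

```python
import pandas as pd

def build_analytic_dataset(cdc, acs, ejscreen, key="tract_fips", required=None):
    """Inner-join the three tract-level tables on the tract identifier and
    drop tracts with missing values on the required key variables.
    Illustrative sketch; column names are hypothetical."""
    merged = cdc.merge(acs, on=key).merge(ejscreen, on=key)
    if required:
        merged = merged.dropna(subset=required)
    return merged
```

An inner join keeps only tracts present in all three sources, matching the study design in which every tract must have health, socio-demographic, and environmental measures.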
We first applied a heuristic approach to remove the minimum number of highly correlated predictor variables. Redundant predictors add more complexity to the model than information, and using highly correlated predictors in regression models can lead to highly unstable estimates. The variance inflation factor (VIF) can identify predictors affected by collinearity but does not determine which should be removed to resolve the problem. We therefore followed an iterative algorithm that removes the minimum number of variables needed to bring all pairwise correlations below a chosen threshold, for which we used 0.75.20 Details of the algorithm appear in Figure 2.
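The correlation-pruning heuristic can be sketched as a greedy loop: find the most correlated remaining pair, drop the member of the pair that is on average more redundant with everything else, and repeat until every pairwise correlation is below the threshold. This is an assumed reading of the heuristic (the authors' exact algorithm is in their Figure 2), written in Python for illustration.

```python
import numpy as np
import pandas as pd

def prune_correlated(df: pd.DataFrame, threshold: float = 0.75) -> list:
    """Iteratively drop predictors until all pairwise |correlations| < threshold.

    Among the most correlated pair, the variable with the larger mean absolute
    correlation with the remaining predictors is removed, so as little
    information as possible is lost. (Sketch only; the paper's exact rule
    appears in its Figure 2.)"""
    kept = list(df.columns)
    while len(kept) > 1:
        corr = df[kept].corr().abs()
        np.fill_diagonal(corr.values, 0.0)
        # locate the most correlated remaining pair
        i, j = np.unravel_index(np.argmax(corr.values), corr.shape)
        if corr.values[i, j] < threshold:
            break
        # drop whichever member of the pair is, on average, more redundant
        drop = kept[i] if corr.values[i].mean() >= corr.values[j].mean() else kept[j]
        kept.remove(drop)
    return kept
```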
We then applied a high-performance nonparametric machine learning technique, QRFs, to the reduced data with no highly correlated variables. QRFs are a generalization of random forests (RFs). RFs are a machine learning technique that builds an ensemble of regression trees to flexibly capture the relationship between the conditional mean of the response and the predictor variables, and they have gained popularity in medical research for their high prediction accuracy and adaptability.21-23 QRFs use the infrastructure of RFs and give a non-parametric, accurate way of estimating conditional quantiles; the method has been shown to be consistent and competitive in terms of predictive power.24 Like the standard RF algorithm, QRFs grow an ensemble of regression trees with random node and split-point selection, but whereas RFs keep only the mean of the observations that fall into each node of each tree, QRFs keep the values of all observations in the node. QRFs can therefore estimate the full conditional distribution of the response given the covariates and provide a fuller picture of the exposure-outcome relationship than mean-based RFs.
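The key idea, keeping all leaf observations rather than only their mean, can be illustrated in a few lines on top of an ordinary random forest. The sketch below is not the authors' implementation (they used the R "quantregForest" package); it uses scikit-learn's `RandomForestRegressor.apply` to retrieve leaf memberships and pools the training responses from matching leaves, which approximates Meinshausen's leaf-weighted scheme.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class SimpleQRF:
    """Minimal quantile-regression-forest sketch on top of sklearn's RF.

    A standard RF stores only each leaf's mean response; a QRF keeps all
    training responses per leaf, so any conditional quantile can be read
    off the empirical distribution of the pooled leaf observations.
    Illustrative only -- equal per-observation pooling approximates the
    exact QRF weighting."""

    def __init__(self, **rf_kwargs):
        self.rf = RandomForestRegressor(**rf_kwargs)

    def fit(self, X, y):
        self.rf.fit(X, y)
        self.y_train = np.asarray(y)
        self.train_leaves = self.rf.apply(X)  # (n_train, n_trees) leaf ids
        return self

    def predict_quantile(self, X, tau):
        leaves = self.rf.apply(X)             # (n_test, n_trees) leaf ids
        preds = np.empty(len(X))
        for i, row in enumerate(leaves):
            # pool training responses from the matching leaf of every tree
            pooled = np.concatenate([
                self.y_train[self.train_leaves[:, t] == leaf]
                for t, leaf in enumerate(row)
            ])
            preds[i] = np.quantile(pooled, tau)
        return preds
```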
We developed and implemented a variable selection algorithm, based on the variable importance scores generated by QRFs, to determine the most critical predictors of the 90th percentile of the neighborhood-level stroke prevalence rate. The algorithm is described in Figure 2. A similar algorithm was suggested by Dietrich et al. for implementing RFs with survival outcomes, but without assessing the optimal balance between the prediction error and the number of selected variables.25 The importance score for each variable is computed by randomly permuting the variable's values in the out-of-bag (OOB) sample of each tree, measuring the resulting decrease in model accuracy, and averaging this decrease across the forest. The more important the variable, the larger the decrease (i.e., the importance score) produced by the permutation. We carried out an iterative variable selection process: at each step we removed the least important variable, rebuilt a QRFs model with the remaining variables, and recorded the out-of-bag (OOB) average quantile loss (AQL), until no variables were left. We used the AQL to evaluate model performance because the true conditional quantiles of the responses are unobservable. As suggested by Wang et al and Fang et al, we computed the prediction error of the τ-th conditional quantile by averaging the quantile loss function, ρ_τ(y − q̂_τ(x)), over all observations, where ρ_τ(u) = u(τ − I(u < 0)).26,27 We then plotted the OOB AQLs against the number of selected variables and set the final model to be the one corresponding to the ‘elbow’ point, which achieved the best balance between the smallest OOB AQL and the parsimony of the selected variables.
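The AQL criterion and the backward-elimination loop described above can be made concrete as follows. The quantile ("check") loss ρ_τ is standard; the `fit_oob` callable is a hypothetical stand-in for fitting a QRFs model and returning its importance scores and OOB AQL, since the authors' workflow ran in R.

```python
import numpy as np

def quantile_loss(y, q_hat, tau):
    """Check loss rho_tau(u) = u * (tau - I(u < 0)) for the tau-th quantile."""
    u = np.asarray(y, dtype=float) - np.asarray(q_hat, dtype=float)
    return u * (tau - (u < 0))

def aql(y, q_hat, tau):
    """Average quantile loss over all observations (the AQL metric)."""
    return float(np.mean(quantile_loss(y, q_hat, tau)))

def backward_select(fit_oob, variables):
    """Drop the least important variable each round, recording OOB AQL.

    fit_oob(vars) is a hypothetical callable that fits a QRFs model on
    `vars` and returns (importance_dict, oob_aql). The returned history
    of (variable set, OOB AQL) pairs is what gets plotted to locate the
    'elbow' point."""
    history = []
    vars_ = list(variables)
    while vars_:
        importance, oob_aql = fit_oob(vars_)
        history.append((list(vars_), oob_aql))
        vars_.remove(min(vars_, key=lambda v: importance[v]))
    return history
```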
To empirically evaluate whether our machine learning algorithm selected major determinants, we compared QRFs with classical linear QR including all predictors additively (termed LQR-AllVar), which is frequently used in public health. We compared the AQL and the AQL reduction per predictor, defined as (AQL_null − AQL_method)/(number of predictors_method), where AQL_null is the AQL from the null (intercept-only) model and AQL_method is the AQL of each specific method. The AQL reduction per predictor answers the question of how much gain we obtain for each predictor variable suggested by a variable selection approach; methods that yield a larger AQL reduction per predictor are therefore preferred.
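The comparison metric is a one-liner; for instance, a method that cuts the null-model AQL from 2.0 to 1.4 using 6 predictors achieves a reduction of 0.1 per predictor (values here are made up for illustration).

```python
def aql_reduction_per_predictor(aql_null, aql_method, n_predictors):
    """(AQL_null - AQL_method) / number of predictors used by the method."""
    return (aql_null - aql_method) / n_predictors
```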
Finally, to “unblackbox” the machine learning results, we included the major predictors selected by QRFs in a linear QR model to quantify the effect of each predictor on different percentiles of the response, and in a linear regression (LR) model to show how a mean-based analysis may provide an incomplete and biased summary of exposure effects. All statistical analyses were performed using R version 3.6.1. QRFs models were built using the “quantregForest” R package.
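The final "unblackboxing" step, fitting the same linear QR specification at several percentiles, can be illustrated as below. This is a Python/statsmodels sketch on synthetic data, not the authors' R code; the variable names (`pm25`, `stroke_prev`) and the simulated slope are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic tract-level data (hypothetical names and effect size)
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"pm25": rng.uniform(5, 15, n)})
df["stroke_prev"] = 1.0 + 0.3 * df["pm25"] + rng.normal(0, 0.5, n)

# Fit the same linear QR at several percentiles of the response;
# comparing slopes across taus reveals what a mean-only model would hide.
for tau in (0.1, 0.5, 0.9):
    fit = smf.quantreg("stroke_prev ~ pm25", df).fit(q=tau)
    print(f"tau={tau}: pm25 coefficient = {fit.params['pm25']:.3f}")
```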