Background and Purpose—Preterm birth (PTB) is the leading cause of infant mortality in the U.S. and globally. This study aims to improve understanding of PTB risk factors that are present early in pregnancy by applying statistical and machine learning techniques to large-scale birth data.
Methods—The 2016 U.S. birth records were obtained and combined with two area-level datasets, the Area Health Resources File and County Health Ranking. We then applied multiple machine learning techniques to a cohort of 3.6 million singleton deliveries to identify generalizable preterm risk factors.
Results—The most important predictors of preterm birth are gestational and chronic hypertension, interval since last live birth, and history of a previous preterm birth, which explain 14.91%, 6.92%, and 6.50% of the AUC, respectively. Parental education is one of the influential variables in predicting PTB, explaining 10.5% of the AUC. The relative importance of race declines when parents are more educated or have received adequate prenatal care. Gradient boosting machines outperformed the other machine learning techniques, with an AUC of 0.75 (recall: 0.64, specificity: 0.73) on the validation dataset.
Conclusions—Applying machine learning techniques improved performance in predicting preterm birth. The results highlight socioeconomic factors, such as parental education, as among the most important indicators of preterm birth. More research is needed on the mechanisms through which socioeconomic factors affect biological responses.
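
For readers less familiar with the performance measures reported in the Results (AUC, recall, specificity), the following is a minimal sketch of how such measures can be computed from true labels and predicted risk scores. It is an illustration only, not the study's pipeline: the function names, the 0.5 threshold, and the toy labels and scores are all assumptions made for this example, and none of the values come from the birth records.

```python
from typing import Sequence, Tuple


def recall_specificity(y_true: Sequence[int], y_score: Sequence[float],
                       threshold: float = 0.5) -> Tuple[float, float]:
    """Recall (sensitivity) and specificity at a score threshold.

    The 0.5 default is an assumption for illustration only.
    """
    tp = fn = tn = fp = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as one half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = preterm, 0 = term; scores are made-up risks.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

recall, specificity = recall_specificity(y_true, y_score)
print(f"AUC={auc(y_true, y_score):.4f} "
      f"recall={recall:.2f} specificity={specificity:.2f}")
```

The pairwise AUC above is quadratic in the number of records and is meant only to make the definition concrete; for a cohort of millions of deliveries, a rank-based implementation from a standard library would be the practical choice.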