Data source
This study draws on data from the 2016 Ethiopian Demographic and Health Survey (EDHS), the most recent round of a demographic and health survey series that is conducted every five years. The EDHS is a nationally representative household survey that collects data on a wide range of population, health, and nutrition indicators with the aim of improving maternal and child health in Ethiopia [9]. The survey used a multi-stage stratified sampling technique based on the 2007 National Population and Housing Census of Ethiopia to select respondents from a total of 645 clusters (202 urban and 443 rural) [9]. The unit of analysis was the child: the sample comprised 10,641 children under age 5 whose mothers were selected from these 645 clusters across the country. Mortality status was based on retrospective information that mothers provided about children who died before age five within the five years preceding the survey (2011 to 2016).
Study variables
In this study, the outcome variable, under-five mortality, was measured as a binary outcome: each child was coded as alive (0) or dead (1) in all the models.
The predictors (features) used in this study included individual, household, community, and health services factors. The individual-level factors consisted of maternal and child characteristics. Maternal factors included mother’s age at birth (<20, >20), education (no education, primary, secondary/higher), contraceptive use (Yes/No), and mother’s body mass index (BMI) (underweight/overweight and normal). Child factors included whether the child was wanted (wanted then, wanted later, not at all), sex of the child, birth order (1-2, 3/later), births in the last 5 years, previous birth interval (<2, 2-4, >4 years), and whether the child was breastfed within 1 hour of birth. The household factors were the source of drinking water (improved/unimproved), time to water source, toilet facility (improved/unimproved), household wealth index (low, middle, high), and household size. The community factors comprised residence type (urban/rural) and geographical region (Tigray, Afar, Amhara, Oromia, Somali, Benishangul-Gumuz, Southern Nations, Nationalities, and Peoples' Region (SNNPR), Gambella, Harari, Dire Dawa, and Addis Ababa). The health services factors included antenatal visits (0, 1-4, 5+ visits), place and mode of delivery services (facility with Cesarean Section (CS) services, facility without CS, home), and postnatal visits within two months after delivery (Yes/No). These predictor variables were selected based on the existing literature on the subject [6-8].
Analytic strategy
The R programming language (version 3.6.0) and the caret package [10] were used to perform the data processing and analysis. We first developed a spatial map of crude under-five mortality rates by region to document the regional disparities in under-five mortality in Ethiopia. To do so, we estimated the rates of under-five mortality by region and then merged them with an Ethiopian regional shapefile before mapping them.
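The regional rate calculation described above can be illustrated with a minimal sketch. It is written in Python for illustration only (the analysis itself was done in R), it assumes that the crude rate is the unweighted number of deaths per 1,000 sampled children in each region, and it uses made-up toy records rather than EDHS data. The function name `crude_u5mr_by_region` is hypothetical.

```python
from collections import defaultdict


def crude_u5mr_by_region(records):
    """Return deaths per 1,000 children for each region.

    `records` is an iterable of (region, died) pairs, where died is 1 if the
    child died before age five and 0 otherwise (the coding used in this study).
    Assumption: unweighted counts; the study applied DHS weights to its results.
    """
    deaths = defaultdict(int)
    children = defaultdict(int)
    for region, died in records:
        children[region] += 1
        deaths[region] += died
    return {region: 1000.0 * deaths[region] / children[region] for region in children}


# Hypothetical toy records, not EDHS data.
toy_records = [
    ("Afar", 1), ("Afar", 0), ("Afar", 0), ("Afar", 0),
    ("Tigray", 0), ("Tigray", 0), ("Tigray", 0), ("Tigray", 0), ("Tigray", 1),
    ("Addis Ababa", 0), ("Addis Ababa", 0),
]
rates = crude_u5mr_by_region(toy_records)
```

The resulting dictionary, keyed by region name, is the kind of table that would then be joined to the regional shapefile on the region identifier before mapping.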
We also used three widely accepted machine learning algorithms, logistic regression, a random forest model (RFM), and K-nearest neighbors (KNN), to predict under-five mortality in Ethiopia. These three models were selected for the following reasons. First, logistic regression is typically used to analyze binary data and is commonly used as an inferential tool in population health research, but it can also be used as a binary classification model. Second, the KNN model was chosen for its ability to detect both linear and nonlinear boundaries between groups. The KNN method relies on finding the best value of k, so that the k closest observations are used to predict the value of a given observation. “Closeness” of observations is usually measured using a distance metric such as the Euclidean distance between observations. Third, from a predictive modeling perspective, random forest models are commonly used in machine learning because they are highly flexible and often provide strong predictive performance. Random forests repeatedly resample the training data set, each time using a random set of predictor variables to produce a decision tree. After many of these trees are formed, the forest is examined to see which variables consistently produce a better prediction. More generally, machine learning techniques draw on a learning process that extracts useful information from the data generation process of previous observations [11]. Machine learning is touted as a prominent application of artificial intelligence technology for ensuring good health and social care for an entire population through preventive strategies and protection from diseases [12].
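To make the KNN idea concrete, the following minimal sketch classifies an observation by majority vote among its k nearest training observations under Euclidean distance. It is illustrative Python, not the caret implementation used in the study. The helper names (`euclidean`, `knn_predict`), the two numeric features, and the toy data are all hypothetical, and k is fixed at 3 here, whereas the study tuned k by cross-validation.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups, labeled 0 (alive) and 1 (dead).
train_X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = [0, 0, 0, 1, 1, 1]

pred_near_origin = knn_predict(train_X, train_y, (0.5, 0.5), k=3)
pred_far = knn_predict(train_X, train_y, (5.5, 5.5), k=3)
```

With categorical predictors such as those used in this study, the features would first need a numeric encoding (for example, indicator variables) before a Euclidean distance is meaningful.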
We randomly selected 80% of the original data as a training set and used 10-fold cross-validation on it to tune the model parameters. The remaining 20% random sample was held out as test data to estimate the measures of model performance. Because the outcome is unbalanced (only a small fraction of children in the data died), the training data were down-sampled so that the training set contained equal proportions of cases who were alive after 5 years and cases who had died before 5 years. The performance of the algorithms was evaluated using several metrics, including the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC), which are useful in deciding which model best discriminates between the dead and alive cases. The positive and negative predictive accuracy of each model was also calculated to show how well the model predicts the dead and alive cases, respectively. The results from all of the above models were weighted using person weights provided by the DHS. For the logistic regression model, we inferred the importance and significance of predictors using traditional t-statistics and odds ratios derived from the model estimation; these are not available for the random forest and KNN methods. For these models, the Mean Decrease in Gini was calculated as a measure of variable importance.
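The split, down-sampling, and AUC steps described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the R/caret code used in the study: the function names (`train_test_split`, `down_sample`, `auc`) are hypothetical, the data are toy rows with a 10% death rate, survey weights and cross-validation are omitted, and the AUC is computed from its rank interpretation (the probability that a randomly chosen death receives a higher predicted score than a randomly chosen survivor, with ties counted as one half).

```python
import random


def train_test_split(rows, train_frac=0.8, seed=0):
    """Shuffle the rows and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def down_sample(rows, label_of, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row in rows:
        by_class[label_of(row)].append(row)
    n = min(len(by_class[0]), len(by_class[1]))
    balanced = rng.sample(by_class[0], n) + rng.sample(by_class[1], n)
    rng.shuffle(balanced)
    return balanced


def auc(scores, labels):
    """AUC as the share of (dead, alive) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy rows of (child_id, died); 10 of the 100 children died.
rows = [(i, 1 if i % 10 == 0 else 0) for i in range(100)]
train, test = train_test_split(rows)
balanced_train = down_sample(train, label_of=lambda row: row[1])
```

Only the training set is balanced in this sketch. The test set keeps the original class mix, so performance measures computed on it reflect the outcome distribution a model would face in practice.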