Study design and study setting
A community-based cross-sectional study design was used to predict home delivery after ANC visits among women aged 15-49 years, using the most recent DHS data collected from 2011 to 2021 in the East Africa region. East Africa is the largest region, comprising 19 countries (Burundi, Comoros, Djibouti, Ethiopia, Eritrea, Kenya, Madagascar, Malawi, Mauritius, Mozambique, Rwanda, Seychelles, Somalia, Tanzania, Uganda, Zambia, South Sudan, Zimbabwe, and Sudan) (21). Of these 19 East African countries, 14 have DHS data, whereas five do not (Djibouti, Somalia, South Sudan, Seychelles, and Mauritius). Among these 14 countries, one (Eritrea) has restricted DHS data, and one (Sudan) has only an old dataset from 1989-1990. Thus, this study included 12 countries (Burundi, Ethiopia, Comoros, Uganda, Rwanda, Tanzania, Mozambique, Madagascar, Zimbabwe, Kenya, Zambia, and Malawi) with recent standard DHS data collected between 2011 and 2021, to make the sample more representative of East Africa.
Data source, study population and sampling technique
The study was a secondary analysis of Demographic and Health Surveys (DHS) data. Each country's survey includes several datasets, such as those for men, women, children, births, individuals, and households; for this study, we used the Individual Record (IR) file. DHS used the Population and Housing Census (PHC) as the sampling frame for a two-stage stratified cluster sampling technique. In the first stage, Enumeration Areas (EAs) were selected independently within each sampling stratum, with probability proportional to EA size. In the second stage, households were selected systematically. The detailed sampling procedure is described in the full DHS reports (22, 23). We extracted records for 75,047 reproductive-age women. After data management (excluding women who had no ANC visit and records with missing values or unknown responses), a total weighted sample of 44,123 respondents was included in the analysis.
Study variables
The outcome variable for this study was home delivery after an ANC visit, defined as a delivery that takes place at home without the presence of a skilled birth attendant, even though the woman received antenatal care services at a health facility (24). Maternal age, maternal education, marital status, wealth index, media exposure, sex of household head, previous contraceptive use, timing of ANC visits, number of ANC visits, residence, birth interval, husband's education, and health facility problem were used as the independent variables for this study.
Data management and analysis
Before conducting the statistical analysis, the data were weighted using the primary sampling unit, sampling weight, and strata to restore the survey's representativeness and to account for the sampling design when producing statistical estimates. Sampling statisticians determine the number of samples required in each stratum to obtain accurate estimates; as a result, certain areas in the DHS were oversampled while others were under-sampled. Therefore, the distribution of reproductive-age women in the sample was weighted (mathematically adjusted) using the sampling weight (v005), primary sampling unit (v021), and strata (v022), so that it resembles the true distribution in East Africa and yields representative statistics.
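The weighting step can be illustrated with a minimal sketch. It assumes the standard DHS convention that v005 is stored with six implied decimal places and so must be divided by 1,000,000 before use. The toy data, the column name home_delivery, and the helper functions are hypothetical and are not taken from the study. The sketch covers only a weighted point estimate; v021 and v022 matter mainly for design-based variance estimation, which is not shown here.

```python
import pandas as pd


def add_dhs_weight(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a rescaled weight column 'wt' (v005 / 1,000,000)."""
    out = df.copy()
    out["wt"] = out["v005"] / 1_000_000
    return out


def weighted_proportion(df: pd.DataFrame, outcome: str, weight: str = "wt") -> float:
    """Weighted proportion of a 0/1 outcome."""
    return float((df[outcome] * df[weight]).sum() / df[weight].sum())


# Hypothetical toy records (not real DHS data): four women in two clusters.
toy = pd.DataFrame(
    {
        "v005": [500_000, 1_500_000, 1_000_000, 1_000_000],  # raw sampling weight
        "v021": [1, 1, 2, 2],  # primary sampling unit
        "v022": [1, 1, 2, 2],  # stratum
        "home_delivery": [1, 0, 0, 1],  # hypothetical outcome coding
    }
)
toy = add_dhs_weight(toy)

unweighted = float(toy["home_delivery"].mean())
weighted = weighted_proportion(toy, "home_delivery")
print(f"unweighted = {unweighted:.3f}, weighted = {weighted:.3f}")
```

In this toy example the woman with the smallest weight had a home delivery, so the weighted proportion is lower than the unweighted one. This is the kind of adjustment that oversampling and under-sampling make necessary.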
A total of 75,047 unweighted records with the selected variables were extracted from the DHS data using STATA software version 17 and exported to a CSV file. The data were then imported into Jupyter Notebook (Python version 3.11) for further analysis. To make the data suitable for machine learning tasks, exploratory data analysis, missing value management, data discretization, outlier detection, balancing of the target feature, and feature selection were applied as preprocessing tasks. The data were then split into training (80%) and test (20%) sets, and only the variables that passed feature selection were used to fit the selected models.
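The split and class-balancing steps might look like the following sketch. The text does not state which balancing technique was used or whether it was applied before or after the split. The version below is an assumption: it uses a stratified 80/20 split, followed by simple random oversampling of the minority class in the training set only, implemented with scikit-learn's resample utility. The synthetic data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical synthetic data: 1,000 women, 5 discretized predictors,
# and an imbalanced 0/1 outcome (1 = home delivery).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(1000, 5))
y = (rng.random(1000) < 0.2).astype(int)

# 80% training / 20% test; stratifying keeps the class ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Assumed balancing step: randomly oversample the minority class (label 1)
# in the training set until it matches the majority class size.
majority = X_train[y_train == 0]
minority = X_train[y_train == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

X_bal = np.vstack([majority, minority_up])
y_bal = np.concatenate(
    [np.zeros(len(majority), dtype=int), np.ones(len(minority_up), dtype=int)]
)
print(X_train.shape, X_test.shape, y_bal.mean())
```

Balancing only the training portion leaves the test set at its natural class ratio, so the held-out evaluation is not affected by duplicated records.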
In this study, supervised machine learning algorithms, namely Random Forest (RF), AdaBoost, Gaussian Naive Bayes (Gaussian NB), Multilayer Perceptron (MLP), Decision Tree, Logistic Regression (LR), K-Nearest Neighbors (KNN), Extreme Gradient Boosting (XGBoost), and Support Vector Machines (SVM) (25-28), were used to predict home delivery after an ANC visit among reproductive-age women in East Africa. Tenfold cross-validation was used to train the models on the training data. Model performance was measured using metrics such as the confusion matrix and the area under the receiver operating characteristic curve (AUC). Finally, home delivery was predicted after hyperparameter tuning of the best-performing model. All analyses were performed using the Python programming language, version 3.11, in Jupyter Notebook with the imblearn (29), sklearn (30), and SHAP (31) packages.
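A model comparison of this kind can be sketched with scikit-learn as follows. The sketch is illustrative only: it uses synthetic data from make_classification, includes only four of the listed algorithms to keep it short, and assumes stratified, shuffled folds and default hyperparameters, none of which are specified in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic training data standing in for the DHS predictors.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

# A subset of the algorithms named above, with assumed default settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gaussian NB": GaussianNB(),
}

# Tenfold cross-validation; stratification and shuffling are assumptions.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Mean AUC across the ten folds for each model.
mean_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}

best_model = max(mean_auc, key=mean_auc.get)
for name, auc in sorted(mean_auc.items(), key=lambda kv: -kv[1]):
    print(f"{name}: mean AUC = {auc:.3f}")
print("Best by mean AUC:", best_model)
```

The model with the highest mean AUC would then be the candidate for hyperparameter tuning, for example with a grid or randomized search over the same folds.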
The relationship between the predictors and the outcome variable was evaluated using the SHAP feature importance method, which also helped identify the independent variables that are most important for predicting home delivery after an ANC visit. Shapley Additive exPlanations (SHAP) analysis uses a game-theoretic framework to provide global or local interpretations and explanations of any machine learning model's predictions (27). Because tree-based models are typically "black-box" systems, interpretations and explanations of high-performing models are uncommon in machine learning research (27). Several researchers have used SHAP as a feature selection mechanism, and their results show that machine learning with SHAP-based feature selection performs better in classification while also providing model explainability (27, 32). Plotting the Shapley values of each feature across all samples also shows how each predictor affects the prediction of home delivery, and clarifies whether a given characteristic makes a woman more or less likely to give birth at home after an ANC visit.
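The game-theoretic idea behind SHAP can be illustrated with an exact Shapley value computation on a very small model. This is a conceptual sketch, not the SHAP package's algorithm. It assumes that an "absent" feature is replaced by a single baseline value, whereas SHAP typically averages over a background dataset. The scoring function toy_model and its three binary features are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, with absent features set to baseline."""
    n = len(x)

    def value(subset):
        # Features in the coalition take their real value, others the baseline.
        z = [x[j] if j in subset else baseline[j] for j in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = set(coalition)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    """Hypothetical risk score with one interaction between features 0 and 2."""
    return 0.1 + 0.3 * z[0] + 0.2 * z[1] + 0.1 * z[0] * z[2]


x = [1, 1, 1]          # the woman being explained
baseline = [0, 0, 0]   # reference profile
phi = shapley_values(toy_model, x, baseline)

print("Shapley values:", [round(p, 3) for p in phi])
print("Sum:", round(sum(phi), 3), "= f(x) - f(baseline) =",
      round(toy_model(x) - toy_model(baseline), 3))
```

The contributions add up to the difference between the prediction for the sample and the baseline prediction. This additive property is what allows per-feature bars to be stacked from a base value up to the final prediction. The interaction term is shared equally between the two features involved.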
Additionally, the contribution of each feature to the prediction of the positive class (home delivery) was explained using a waterfall plot (27). The y-axis of the waterfall plot shows the independent variables and their corresponding feature values for a given sample, while the x-axis shows the likelihood that the sample will be classified as belonging to the "home delivery" class. Each horizontal bar represents the contribution of one feature. Positive contributions (red bars) indicate that the feature increases the probability that the sample belongs to the positive class, whereas negative contributions (blue bars) indicate that the feature decreases that probability. Finally, the overall methodology workflow is shown below (Figure 1).
Ethical consideration
This study was a secondary data analysis. Permission to access the data was therefore requested online from the Demographic and Health Surveys (DHS) Program at http://www.dhsprogram.com, which sent us a permission letter. The study's data were publicly available and contained no personally identifiable information. Under the IRB-approved procedures for DHS public-use datasets, respondents, families, and sample communities cannot be identified in any way. The data files do not contain names of people or addresses of households. Geographic identifiers go down only to the regional level (where regions are frequently very large geographical areas covering several states or provinces).