Table 1 summarizes under-five mortality rates across sociodemographic factors in India. By religious affiliation, Hindu families have the highest rate at 3.94%, followed by Muslims (3.36%), Sikhs (3.27%), Christians (2.96%), and others (2.76%). Environmental factors are also associated with child mortality. Households using unclean drinking water sources and those with unimproved sanitation facilities both show a mortality rate of 4.71%, compared with 3.75% for clean water sources and 3.35% for improved sanitation. Unclean fuel use is associated with higher child mortality (4.38%) than clean fuel use (2.88%). Home deliveries have a higher mortality rate (5.30%) than hospital deliveries (3.48%), but deliveries in other locations have the highest rate at 7.93%. Children who were never breastfed, or who fall in the others category, have by far the highest mortality rate at 20.88%, while those breastfed for 2-4 years and for 4+ years have very low rates (0.51% and 0.01%, respectively). Mortality declines with education level, from 5.24% for no education to 2.11% for secondary and higher education. Similarly, wealthier households have lower mortality rates (2.50%) than poor households (4.63%). Rural areas show a higher mortality rate (3.98%) than urban areas (2.79%), which may reflect disparities in access to healthcare and other resources.
Table 2 presents the results of a multivariable logistic regression analysis, reporting adjusted odds ratios (AOR) for factors associated with under-five mortality. Religious affiliation is significant: Muslims have lower odds of child mortality than Hindus (AOR: 0.583, p<0.001). Environmental factors remain important, with unclean fuel use associated with higher odds of child mortality (AOR: 1.211, p=0.045). Place of delivery also matters, with hospital deliveries showing lower mortality odds than home deliveries (AOR: 0.795, p=0.023). Breastfeeding duration is a strong predictor of child survival. Compared with breastfeeding for 1-2 years, longer durations are associated with significantly lower odds of mortality (2-4 years: AOR: 0.080, p<0.001; 4+ years: AOR: 0.001, p<0.001). Conversely, never breastfeeding, or falling in the others category, is associated with much higher odds of mortality (AOR: 3.693, p<0.001). Higher birth order is associated with increased odds of mortality (birth order three and above: AOR: 3.441, p<0.001). Regional differences are evident: the Central region shows higher odds of child mortality than the Northern region (AOR: 1.493, p=0.003), while the Western and Southern regions show lower odds. Education retains a protective effect, with secondary and higher education showing the strongest association (AOR: 0.647, p=0.005). Middle (AOR: 0.793, p=0.048) and rich (AOR: 0.670, p=0.004) households show lower child mortality odds than poor households. Sex differences are also present, with female children having lower odds of mortality than male children (AOR: 0.811, p=0.004). Antenatal care (ANC) visits show a protective effect, with 4-6 visits associated with significantly lower odds of child mortality (AOR: 0.684, p=0.008).
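To make the adjusted odds ratios easier to read, the following minimal sketch converts an AOR into a percent change in odds relative to the reference category, and shows how an AOR relates to a logistic regression coefficient (AOR = exp(beta)). The function names and the dictionary labels are illustrative assumptions; the AOR values are those quoted in the text, and nothing here reproduces the fitted model itself.

```python
import math


def aor_from_coefficient(beta: float) -> float:
    """Convert a logistic regression coefficient (log-odds) to an odds ratio."""
    return math.exp(beta)


def percent_change_in_odds(aor: float) -> float:
    """Percent change in odds versus the reference category.

    Negative values mean lower odds, positive values mean higher odds.
    """
    return (aor - 1.0) * 100.0


# AORs quoted in the text (labels are illustrative, not variable names
# from the original analysis).
reported_aors = {
    "Muslim vs Hindu": 0.583,
    "Unclean vs clean fuel": 1.211,
    "Hospital vs home delivery": 0.795,
    "Female vs male child": 0.811,
}

for label, aor in reported_aors.items():
    change = percent_change_in_odds(aor)
    direction = "higher" if change > 0 else "lower"
    print(f"{label}: AOR={aor:.3f} -> {abs(change):.1f}% {direction} odds")
```

Under this reading, an AOR of 0.795 for hospital deliveries corresponds to roughly 20.5% lower odds of mortality than home deliveries, holding the other covariates fixed. Note that this is a change in odds, not in probability.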
Table 3 compares the performance of four classification models: Logistic Tree, Decision Tree (DT), K-Nearest Neighbors (K-NN), and Support Vector Machine (SVM). The models were assessed on test data using several metrics: confusion matrix, accuracy, recall, precision, F1 Score, and Area Under the Receiver Operating Characteristic curve (AUROC). The confusion matrix for each model provides a detailed breakdown of correct and incorrect predictions. For the Logistic Tree model, out of 44,844 alive cases, 41,467 were correctly predicted as alive (true negatives), while 3,377 were incorrectly predicted as dead (false positives). Of the 1,740 actual death cases, 481 were correctly predicted as dead (true positives), while 1,259 were incorrectly predicted as alive (false negatives). This distribution indicates that the Logistic Tree model overpredicts deaths, resulting in many false positives. The DT model correctly predicted 44,809 alive cases and 931 death cases, with only 35 false positives and 809 false negatives. This distribution suggests that the DT model is more balanced in its predictions, with fewer errors in both directions than the Logistic Tree model. The K-NN model correctly identified 44,489 alive cases and 112 death cases, with 355 false positives and 1,628 false negatives. This pattern indicates that the K-NN model tends to underpredict deaths, resulting in many false negatives. The SVM model predicted all cases as alive, resulting in 44,809 true negatives and 1,775 false negatives. This suggests that the SVM model failed to capture the nuances of the factors leading to child mortality in this dataset.
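As an illustration of how the listed metrics follow from a confusion matrix, the sketch below computes accuracy, recall, precision, and F1 from the four cell counts. The function name `classification_metrics` is a hypothetical helper, not part of the original analysis; the counts are those given above for the Logistic Tree model, with death treated as the positive class.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard metrics from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0, which matches the
    convention of assigning 0 precision to a model that predicts no positives.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # share of deaths found
    precision = tp / (tp + fp) if (tp + fp) else 0.0     # share of death calls that are right
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Logistic Tree counts as stated in the text (positive class = death).
logistic_tree = classification_metrics(tp=481, fp=3377, fn=1259, tn=41467)

for name, value in logistic_tree.items():
    print(f"{name}: {value:.2%}")
```

With these counts, recall comes out at about 28%, precision at about 12%, and F1 at about 17%.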
The Decision Tree model demonstrates the highest accuracy at 96.35%, followed closely by the SVM at 96.21%. However, accuracy alone can be misleading when classes are imbalanced, as is typical of mortality data, where deaths are far fewer than survivals: a model can score well simply by predicting the majority class. Recall measures the model's ability to correctly identify positive cases (deaths in this context). The DT model shows the highest recall at 30.00%, followed by the Logistic Tree at 28.00%; that is, the DT model correctly identified 30% of all death cases in the test data. The K-NN model's recall is much lower at 6.00%, while the SVM model's recall is 0% because it predicted no deaths.
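The effect of class imbalance on accuracy can be sketched with a trivial baseline that labels every child as alive. The helper `always_negative_baseline` is hypothetical, and the class counts are the alive and death totals quoted for the Logistic Tree confusion matrix; the sketch illustrates the mechanism and is not meant to reproduce any row of Table 3.

```python
def always_negative_baseline(n_negative: int, n_positive: int) -> tuple:
    """Accuracy and recall of a classifier that never predicts the positive class."""
    total = n_negative + n_positive
    accuracy = n_negative / total  # every negative is right, every positive is wrong
    recall = 0.0                   # no positive case is ever identified
    return accuracy, recall


# Test-set class counts as stated in the text (alive = negative, death = positive).
accuracy, recall = always_negative_baseline(n_negative=44844, n_positive=1740)
print(f"accuracy: {accuracy:.2%}, recall: {recall:.2%}")
```

Such a baseline reaches an accuracy above 96% while detecting no deaths at all, which is why recall, precision, and F1 carry more information than accuracy in this setting.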
The DT model also has the highest precision, the proportion of positive predictions that are correct, at 60.00%. This means that when the DT model predicts a death, it is correct 60% of the time. The K-NN model follows with 23.00% precision, and the Logistic Tree has only 12.00%. The SVM model's precision is 0% because it makes no positive predictions.
The F1 Score, the harmonic mean of precision and recall, further confirms the DT model's superior performance. The DT model achieves the highest F1 Score of 26.00%, followed by K-NN at 22.00% and Logistic Tree at 17.00%. The SVM model's F1 Score is 0% because it has no true positive predictions. The AUROC provides an aggregate performance measure across all possible classification thresholds. The DT model again leads with an AUROC of 65.00%, followed by the Logistic Tree at 60.05%. The SVM and K-NN models show lower AUROC values of 52.70% and 51.00%, respectively, close to the 50% expected from random guessing and therefore indicating poor discriminative ability.
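For readers less familiar with AUROC, one way to read it is as the probability that a randomly chosen death case receives a higher predicted score than a randomly chosen alive case, with ties counted as one half. The sketch below computes it directly from that definition; the function `auroc` and the toy labels and scores are illustrative assumptions and are unrelated to the study data.

```python
from itertools import product


def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 (1 = positive class, e.g. death)
    scores: iterable of predicted scores, higher = more likely positive
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0   # positive ranked above negative
        elif p == n:
            wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: two alive cases (0) and two deaths (1).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))
```

A model whose scores carry no information about the outcome ranks positives above negatives about half the time, giving an AUROC near 0.5. The pairwise loop is quadratic in the number of cases, so for a test set of this size a rank-based implementation from a standard library would be the practical choice.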