The multiparous women, who represented 46% of the initial retrospective dataset, had a PTB prevalence of 8%, more than double the prevalence among nulliparous women. In addition, despite some overlap, the multiparous women formed a distinct group characterized by a relatively lower social status and a higher incidence of gynecological complications, as shown by the projection onto the first two principal components (Fig. 1).
Nevertheless, this group of 922 multiparous women was still predominantly urban, somewhat older, working, and highly educated, living in good-income households (Table 1). Most had a university-level education (79%), as did their husbands (81%). About 65% reported having a job. Almost all the husbands reported having a job (96%), and 71% of the households had a high social level (high income). Most of the women were in the 25 to 35 age bracket (64%) and resided in the city (76%). About 33% of the women had a BMI in the obese range, and 47% had excessive weight gain during the pregnancy.
Table 1
Percentage of each characteristic for the multiparous women.
| Characteristic | Percentage (positives/total) |
|---|---|
| Age(25–35 years) | 64 |
| BMI(obese) | 33 |
| Education_husband(high) | 81 |
| Education_mom(high) | 79 |
| Pre_Cesarean(presence) | 35 |
| Pre_Diabetes(presence) | 5 |
| Pre_Eclampsia(presence) | 4 |
| Pre_Hemmorrhage(presence) | 29 |
| Pre_Induction(presence) | 31 |
| Residence(city) | 76 |
| preterm (presence) | 8 |
| smoking(smoker) | 13 |
| Social_status(high) | 71 |
| Weight_gain(excess) | 47 |
| Work_husband(external job) | 96 |
| Work_mom(external job) | 65 |
The percentage of mothers who smoked during pregnancy was 13%. Gynecological complications during past pregnancies included diabetes (5%) and pre-eclampsia (4%). Approximately 31% of the women had undergone induction and 29% had experienced hemorrhage.
The covariates with the largest differences in percentage between the positive and the negative class were Pre-hemorrhage, Weight gain, Age, BMI, and Social status (Fig. 2). The Chi-square test showed that most of these differences were statistically significant at the 5% level or better (Fig. 2). Smaller, non-significant differences were observed for Pre-diabetes, Work husband, and Pre-eclampsia. Pre-eclampsia and Pre-diabetes were excluded from further modeling because of their low prevalence, which reached 0 in the positive class. Women with these indicators were most likely already under surveillance for PTB, which may explain their low prevalence.
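The Chi-square comparison described above can be sketched for a single binary covariate as follows. This is a minimal illustration in Python, not the authors' code: the function name `chi_square_2x2` and the counts are hypothetical (the counts are not taken from the study), and the plain Pearson statistic without continuity correction is an assumption.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the 2x2 table
    [[a, b], [c, d]]; returns (statistic, p_value) for 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the chi-square tail probability has a closed form.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts, NOT from the study.
# Rows: PTB / no PTB. Columns: covariate present / covariate absent.
stat, p = chi_square_2x2(40, 34, 230, 618)
print(f"chi-square = {stat:.2f}, p = {p:.2g}")
```

A covariate would be flagged at the 5% level when `p < 0.05`. A class with a zero count in one cell, as reported for Pre-eclampsia and Pre-diabetes, makes the approximation unreliable, which is consistent with excluding those variables.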
Table 2
Linear coefficients of each logistic regression model (significance: * at the 5% level, ** at the 1% level, *** at the 1‰ level).
| Factors | Models (a) | | | |
|---|---|---|---|---|
| | glm | glmup | glmnetup | glmglmnetup |
| Intercept | -4.56** | -1.39** | -3.72 | -1.97*** |
| Age1 | 0.54 | 0.86*** | 0.33 | 0.68*** |
| BMI1 | 1.07** | 0.75*** | 0.35 | 0.70*** |
| Education_hus1 | -0.52 | -0.02 | . | |
| Education_mom1 | -0.01 | 0.12 | . | |
| Pre_Cesarean1 | -0.29 | -0.52** | . | |
| Pre_Hemmorrhage1 | 1.98*** | 2.11*** | 1.62 | 1.93*** |
| Pre_Induction1 | -0.12 | 0.12 | . | |
| Residence1 | 1.27* | 1.30*** | 0.47 | 1.11*** |
| smoking1 | 0.12 | 0.24 | . | |
| Social_status1 | -1.42** | -1.82*** | -1.04 | -1.79*** |
| Weight_gain1 | 1.03* | 1.06*** | 0.76 | 1.07*** |
| Work_hus1 | -0.28 | -0.64 | . | |
| Work_mom1 | -0.14 | 0.09 | . | |
(a) glm: logistic regression on the original data; glmup: logistic regression on the up-sampled data; glmnetup: LASSO regression on the up-sampled data; glmglmnetup: logistic regression with the LASSO-selected variables on the up-sampled data.
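Because the coefficients in Table 2 are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for the glmglmnetup column. It is an illustration only: the variable names are hypothetical, and it assumes that the suffix `1` marks the presence of the level listed in Table 1.

```python
import math

# Coefficients of the glmglmnetup column of Table 2 (log-odds scale).
# Assumption: the "1" suffix marks the level shown in Table 1.
coefficients = {
    "Age1": 0.68,
    "BMI1": 0.70,
    "Pre_Hemmorrhage1": 1.93,
    "Residence1": 1.11,
    "Social_status1": -1.79,
    "Weight_gain1": 1.07,
}

# exp(beta) is the multiplicative change in the odds of PTB when the
# factor is at level 1 rather than level 0, other factors held fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} odds ratio = {ratio:.2f}")
```

An odds ratio above 1 (for example Pre_Hemmorrhage1) points to higher odds of PTB, while a value below 1 (Social_status1) points to lower odds.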
The logistic regression on the original dataset (glm) identified almost the same significant variables as the Chi-square test, except that Age was not significant while Residence joined the list of significant covariates (Table 2). Despite a high AUC (Area Under the Curve) of 0.84, this model predicted PTB poorly: it identified no more than 16% of the PTB cases in the training set and 12% in the validation set. The women of the majority non-PTB class were classified correctly, which explains the high AUC and the accuracy above 92% for both the training and validation datasets.
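How a model can combine a high AUC with a low PTB detection rate can be illustrated with a small sketch. AUC measures only how well the scores rank PTB cases above non-PTB cases, whereas detection depends on the classification threshold. The scores below are invented for illustration and are not study data; the 0.5 threshold and the function names `auc` and `sensitivity` are also assumptions.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive scores above a random negative
    (ties count as one half); equivalent to the area under the ROC curve."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def sensitivity(pos_scores, threshold=0.5):
    """Share of true positives whose score reaches the threshold."""
    return sum(s >= threshold for s in pos_scores) / len(pos_scores)

# Invented predicted risks (NOT study data): the positives rank above most
# negatives, yet almost all scores stay under 0.5, as happens when the
# positive class is rare in the training data.
ptb_scores = [0.15, 0.22, 0.30, 0.62]
non_ptb_scores = [0.02, 0.04, 0.05, 0.07, 0.09, 0.11, 0.18, 0.25]

model_auc = auc(ptb_scores, non_ptb_scores)
detected = sensitivity(ptb_scores)
print(f"AUC = {model_auc:.3f}, sensitivity at 0.5 = {detected:.2f}")
```

In this toy example the ranking is good while only one of the four positives crosses the threshold, which mirrors the pattern reported for glm.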
Table 3
Values of preterm and non-preterm (false positives) prediction for the different models.
| Models* | Preterm, training set (percent of total preterm) | Preterm, validation set (percent of total preterm) | False positives (percent of total) | AUC |
|---|---|---|---|---|
| glm | 16 | 12 | 1 | 0.841 |
| glmup | 78 | 92 | 25 | 0.846 |
| glmnetup | 80 | 92 | 25 | 0.837 |
| glmglmnetup | 76 | 88 | 21 | 0.84 |
*glm: logistic regression on the original data; glmup: logistic regression on the up-sampled data; glmnetup: LASSO regression on the up-sampled data; glmglmnetup: logistic regression with the LASSO-selected variables on the up-sampled data.
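The figures in Table 3 can be related to overall accuracy with a short calculation. The sketch below rests on two assumptions that the table does not confirm: the false-positive percentage is read as a share of the non-PTB women, and the PTB prevalence in the evaluated data is the 8% reported in Table 1. The function name `overall_accuracy` is hypothetical.

```python
def overall_accuracy(prevalence, sensitivity, false_positive_rate):
    """Accuracy as the prevalence-weighted average of sensitivity
    and specificity (specificity = 1 - false-positive rate)."""
    return prevalence * sensitivity + (1.0 - prevalence) * (1.0 - false_positive_rate)

PREVALENCE = 0.08  # PTB prevalence among the multiparous women (Table 1)

# (validation-set PTB detection, false positives) from Table 3, as fractions.
# Assumption: false positives are a share of the non-PTB women.
table3 = {
    "glm": (0.12, 0.01),
    "glmup": (0.92, 0.25),
    "glmnetup": (0.92, 0.25),
    "glmglmnetup": (0.88, 0.21),
}

accuracy = {
    model: overall_accuracy(PREVALENCE, sens, fpr)
    for model, (sens, fpr) in table3.items()
}
for model, value in accuracy.items():
    print(f"{model:12s} implied accuracy = {value:.3f}")
```

Under these assumptions the implied accuracy is about 0.92 for glm and about 0.80 for glmglmnetup. With a prevalence of 8%, accuracy is dominated by the non-PTB class, so a model can reach high accuracy while missing most PTB cases.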
In contrast, after a balanced sample was created with the up-sampling algorithm and the logistic model was refitted on it (glmup), PTB prediction improved notably (Table 3). PTB prediction ranged from 78% for the training set to 92% for the validation set, although the share of misclassified non-PTB women rose markedly, from a few cases for the first model (glm) to about 25% of the total number of pregnant women for glmup. Comparable results were obtained with the LASSO-regularized model (glmnetup) and with the logistic regression on the LASSO-selected variables (glmglmnetup). Among the models fitted on up-sampled data, the latter gave the lowest proportion of false positives (lower than 21%) while maintaining high PTB prediction (Table 3); even so, the accuracy decreased to around 79%.
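The balancing step can be sketched as random over-sampling of the minority class with replacement. The source does not specify which up-sampling algorithm was used, so this is one common variant rather than the authors' method; the function name `upsample`, the `preterm` key, and the toy records are hypothetical.

```python
import random

def upsample(rows, label_key, seed=0):
    """Randomly duplicate rows of the smaller classes (with replacement)
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the missing rows at random from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy records (NOT study data) with roughly the reported 8% prevalence.
rows = [{"id": i, "preterm": 1 if i < 8 else 0} for i in range(100)]
balanced = upsample(rows, "preterm")

n_pos = sum(r["preterm"] == 1 for r in balanced)
n_neg = sum(r["preterm"] == 0 for r in balanced)
print(n_pos, n_neg)
```

Up-sampling is usually applied to the training data only, so that the validation set keeps the original class proportions.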
A comparison of the distribution of the PTB risk estimated by each model with that of the original data (Fig. 3) showed that the logistic regression before up-sampling (glm) and the LASSO model (glmnetup) generally underestimated the probabilities relative to the other models. Even the last logistic model, which used the LASSO-selected variables, slightly underestimated them. However, the logistic regressions fitted with up-sampling, whether before or after LASSO regularization, gave risk distributions closer to the original data than the other models (Fig. 3).
Along with the improvement in preterm prediction, the number of statistically significant covariates (at the 5% level or better) increased from 5 for glm to 10 for glmup, while glmnetup reduced this number to 6 (Table 2). The regression model using the LASSO-selected variables (glmglmnetup) was used to develop a nomogram (Fig. 4). Validation of this nomogram on the data of this study showed that a reasonably accurate PTB risk can be obtained for a multiparous woman from the levels of Social status, Residence, Pre-hemorrhage, Age, BMI, and Weight gain.
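A nomogram of this kind is typically built by rescaling the regression coefficients into points and mapping the total back to a probability. The sketch below applies that idea to the glmglmnetup coefficients of Table 2. The scaling (100 points for the largest coefficient), the function names, and the example profile are assumptions and need not match Fig. 4. Because the model was fitted on up-sampled data, the resulting risk is on the scale of the balanced sample rather than the 8% population prevalence.

```python
import math

INTERCEPT = -1.97
COEFFICIENTS = {  # glmglmnetup column of Table 2
    "Age1": 0.68,
    "BMI1": 0.70,
    "Pre_Hemmorrhage1": 1.93,
    "Residence1": 1.11,
    "Social_status1": -1.79,
    "Weight_gain1": 1.07,
}
MAX_ABS = max(abs(beta) for beta in COEFFICIENTS.values())

def points(name, level):
    """Points for one factor at level 0 or 1; the lower-risk level scores 0
    and the factor with the largest coefficient scores 100."""
    beta = COEFFICIENTS[name]
    raises_risk = (level == 1) if beta > 0 else (level == 0)
    return 100.0 * abs(beta) / MAX_ABS if raises_risk else 0.0

def risk_from_points(total_points):
    """Map total points back to a probability through the logistic function."""
    # Linear predictor of the lowest-risk profile (zero points).
    baseline = INTERCEPT + sum(b for b in COEFFICIENTS.values() if b < 0)
    linear_predictor = baseline + total_points * MAX_ABS / 100.0
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Hypothetical profile: levels coded 0/1 as in Table 2.
profile = {"Age1": 0, "BMI1": 1, "Pre_Hemmorrhage1": 1,
           "Residence1": 1, "Social_status1": 1, "Weight_gain1": 0}
total = sum(points(name, level) for name, level in profile.items())
risk = risk_from_points(total)
print(f"total points = {total:.0f}, estimated risk = {risk:.2f}")
```

For a factor with a negative coefficient, such as Social_status1, the points go to level 0, since that is the level associated with higher risk.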