Data sources:
The data were obtained from Colombia's Social Protection Information System and integrated via Power Query. All aggregation and harmonization were performed at the departmental level from 2018--2023 (see Table 1):
Birth data: The birth information on births was extracted from the vital statistics accessible at the Single Registry of Affiliates (RUAF), including maternal age and year of registry.
Mortality data: Information was extracted from the RUAF on deaths due to cause of mortality, age, sex, year of registry, and department of occurrence. A subset of data covering violent deaths, such as transport accidents, falls, firearm injuries, drowning, suffocation, electric shock, exposure to smoke or fire, poisoning, self-inflicted injuries, crimes, homicides, and other external causes (WHO 667), was extracted.
Insurance data: The Unified Database of Affiliates (BDUA) contains data on the number of affiliates with public insurance (with more than 97% of the Colombian people covered) and is classified by affiliation regime, age, sex, and department of residence.
Economic and educational data: Microdata from the National Administrative Department of Statistics (DANE) were used to obtain the departmental gross domestic product (GDP) per capita at current prices (mean yearly income by person), the number of students enrolled by department, the employment rate (occupation rate: the ratio between the labor force and the working-age population) and demographic information on the indigenous population by department.
Data Cleansing and Preparation
Once extracted and integrated, the data were standardized at the departmental and annual levels in Python Colab, and the integrated dataset and code used for the model are accessible as supplementary material. (13,14)
Since the data used in the study were obtained from anonymous databases, this study did not require the approval of an ethics committee for its execution, as the data used did not involve the direct participation of human subjects.
Descriptive analysis
The indicators used were as follows:
- Total Fertility Rate (TFR): This rate was determined by aggregating birth rates across various age groups and multiplying by 5 to represent a standard five-year interval
['TFR] = (
(['Maternity 15 to 19 years old']/['Affiliates 15 to 19 years old']) +
(['Maternity 20 to 24 years old']/['Affiliates 20 to 24 years old']) +
(['Maternity 25 to 29 years old']/['Affiliates 25 to 29 years old']) +
(['Maternity 30 to 34 years old']/['Affiliates 30 to 34 years old']) +
(['Maternity 35 to 39 years old']/['Affiliates 35 to 39 years old']) +
(['Maternity 40 to 44 years old']/['Affiliates 40 to 44 years old']) +
(['Maternity 45--49 years old']/['Affiliates 45--49 years old])
) * 5
- Infant Fertility Rate (IFR) (10--14 years): calculated as the number of births classified as infant births (to mothers aged 10--14) divided by the total number of females aged 10--14, again multiplied by 1,000.
[IFR] =
(['Infant maternity']/['Members 10--14 years old']) * 1000
- Adolescent Fertility Rate (AFR) (15--19 years): This rate is derived from the number of births to mothers aged 15--19 divided by the total number of females in that age group, multiplied by 1,000.
['AFR'] =
(['Maternity 15 to 19 years']/['Affiliates 15 to 19 years']) * 1000
- Early Fertility Rate (EFR) (10--19 years): estimated as the sum of births to mothers aged 10--14 and 15--19 years, divided by the total number of females aged 10--14 and 15--19 years, and then multiplied by 1,000.
['EFR'] = (
(['Infant motherhood'] + ['Motherhood 15 to 19 years old'])/
(['Members 10--14 years old'] + ['Members 15--19 years old']) * 1000
)
The results are presented in heatmaps by department and year, as well as average aggregates for the total period (2018--2023).
Data Modeling
We computed social-demographic determinants at the year and department levels:
- Percentage of the population enrolled in school relative to the total number of school-aged individuals
- Percentage of violent deaths out of total deaths
- The female mortality metrics included the following:
- Percentage of female deaths out of total deaths
- Percentage of violent female deaths out of the total number of female deaths
- Percentage of infant deaths (<18 years old deaths) out of total deaths
- % of the labor force participation rate out of the total number of people of labor age
- % percentage of the subsidized population out of total insured people
- GDP per capita
- Percentage of the indigenous population.
We subsequently constructed a new data frame incorporating these calculated variables with departmental and annual identifiers. To ensure data integrity, we employed a Z score method for outlier detection, identifying and removing observations with Z scores exceeding a threshold of 3. The resulting cleaned dataset provides a robust basis for further analysis, encapsulated in the final data frame, which includes only the relevant demographic rates and excludes outlier observations. This rigorous data processing approach enhances the reliability of subsequent analyses to understand the intricate relationships between demographic factors and health outcomes.
The aggregated and purified data are presented in scatter plots and trend lines showing the correlation between the early fertility rate (EFR) and its potential explanatory variables.
Generalized linear model (GLM) analysis
A generalized linear model (GLM) with a Gaussian family and an identity link function was used to analyze the relationships between the EFR and the selected variables.
The initial GLM included all relevant variables; second, to improve the performance of the model, all variables with high p values (indicating low statistical significance) were iteratively eliminated.
Cross-validation was performed via Sklearn's linear regression algorithm to evaluate the model's performance in predicting the dependent variable. A cross-validation scheme with 5 partitions (KFold) was used, randomly dividing each partition to minimize bias and using a fixed seed to guarantee the reproducibility of the results.
As a measure of the model predictive capacity, the value of the R² coefficient was determined, and the difference between the actual and predicted values was analyzed.