The data analysed in this study came from the 2019 National Health Survey (Pesquisa Nacional de Saúde - PNS), designed to provide a nationally representative sample of the Brazilian population. The 2019 PNS is a cross-sectional household survey with a three-stage sampling process. Using cluster sampling techniques, the selection at each stage was conducted with probability proportional to size: municipalities were selected first, followed by census tracts and, finally, households. More detailed information on the PNS sampling procedures and inclusion criteria can be found elsewhere (21). This study is based on participants who answered the individual questionnaire, comprising 88,531 individuals aged 18 and older. The PNS data are freely available on the Brazilian Institute of Geography and Statistics (IBGE) website: https://www.ibge.gov.br/estatisticas/downloads-estatisticas.html
Tooth Loss Assessment
Tooth loss was measured by asking: "Have you lost any of your permanent upper teeth?" Response options were 1) No; 2) Yes, I have lost (number) teeth; 3) Yes, I have lost all my upper teeth. The same question was asked for the lower permanent teeth. The 2019 Brazilian Health Survey considered complete dentition to be 32 teeth, 16 in the upper and 16 in the lower arch. The self-reported count of lost upper and lower teeth was classified into two levels of severity, regardless of tooth position: 1) functional dentition, defined as the loss of up to 12 permanent teeth; and 2) severe tooth loss, defined as the loss of 23 to 32 teeth (17).
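The classification described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the upper and lower counts are assumed to be summed, and the label "intermediate" for 13 to 22 lost teeth is ours, since the text names only the two severity levels.

```python
from typing import Optional

TEETH_PER_ARCH = 16  # the survey counts 16 upper and 16 lower permanent teeth


def teeth_lost_in_arch(option: int, count: Optional[int] = None) -> int:
    """Convert the answer for one arch into a number of lost teeth.

    option 1 = no loss, 2 = lost `count` teeth, 3 = lost all teeth in the arch.
    """
    if option == 1:
        return 0
    if option == 2:
        if count is None or not 1 <= count <= TEETH_PER_ARCH:
            raise ValueError("option 2 needs a count between 1 and 16")
        return count
    if option == 3:
        return TEETH_PER_ARCH
    raise ValueError(f"unknown response option: {option}")


def classify_tooth_loss(upper_lost: int, lower_lost: int) -> str:
    """Classify the combined loss across both arches (assumed to be a sum)."""
    total = upper_lost + lower_lost
    if total <= 12:
        return "functional dentition"
    if total >= 23:
        return "severe tooth loss"
    return "intermediate"  # hypothetical label; not a category named in the text


# Example: no upper teeth lost, five lower teeth lost.
example = classify_tooth_loss(teeth_lost_in_arch(1), teeth_lost_in_arch(2, 5))
```

A respondent who reports losing all teeth in both arches (option 3 twice) has 32 lost teeth and falls into the severe tooth loss group, which is how edentulous individuals are included in that category.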
Covariates
The following covariates were considered at the individual and contextual levels: gender (men or women); age; and level of education, classified into formative years (0 to 8 years) and formal education (8 or more years) for both adults and older adults. Self-described race, officially named colour/race in Brazil, was assessed using the five categories proposed by the Brazilian Institute of Geography and Statistics (IBGE): 1) white; 2) black; 3) yellow; 4) brown; and 5) indigenous. We further categorised this variable into white versus non-white. Self-reported smoking status distinguished those who never smoked from those who smoked in the past and/or are current smokers. Use of dental floss (yes/no) was considered an oral health behaviour and was included only for adults, because almost half of the older adults in the sample were edentulous and therefore had no chance to use dental floss.
Statistical analysis
Absolute and relative frequencies, with their respective 95% confidence intervals (95% CI), were calculated for all variables. Age-standardised estimates for functional dentition and severe tooth loss (including edentulism) were reported for each covariate. Structural Equation Modelling (SEM) was performed in Stata version 14.2 (StataCorp LP, College Station, United States) using the survey module, which accounts for the effects of stratification and clustering when estimating indicators and their precision measures. The SEM consisted of two sub-models: the measurement model, which establishes how the latent constructs are measured, and the structural model, which analyses the associations between the variables. The latent variable in the present study was socioeconomic status (SES), built from the per capita income and schooling covariates (Fig. 1). All the other independent variables were observed. Univariate SEM analysis was used to estimate the direct effect of the predictor (time since the last dental visit) on tooth loss, with the predictor dichotomised in three alternative ways: 1) last dental visit up to 1 year ago versus more; 2) last dental visit up to 2 years ago versus more; and 3) last dental visit more than 3 years ago versus less. After establishing the best cut-off point, taking into account the effect size, 95% confidence intervals and goodness of fit, a final SEM model was used to calculate the direct, indirect and total effects. The effects of sociodemographic variables (race, SES, gender and age), mediated by dental visits and behavioural covariates (smoking and use of dental floss), were tested using the social determinants of oral health (9) and Sisson's model (10) as theoretical frameworks. Figure 1 shows the final SEM models for adults and older adults. Standardised coefficients (SCs) were interpreted as indicating a small association (SC < 0.10), a medium association (SC between 0.30 and 0.50) or a strong association (SC > 0.50) (22).
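To make the relationship between direct, indirect and total effects concrete, the sketch below uses the product-of-coefficients rule for a linear path model with parallel mediators: each indirect effect is the product of the predictor-to-mediator path and the mediator-to-outcome path, and the total effect is the direct effect plus the sum of the indirect effects. The function name and all coefficient values are hypothetical and are not results from this study; the Stata estimation itself may differ in detail.

```python
from typing import Dict


def decompose_effects(direct: float,
                      a_paths: Dict[str, float],
                      b_paths: Dict[str, float]) -> Dict[str, float]:
    """Decompose a predictor's effect in a linear path model.

    direct  : coefficient of the path predictor -> outcome
    a_paths : coefficients of the paths predictor -> mediator, keyed by mediator
    b_paths : coefficients of the paths mediator -> outcome, keyed by mediator
    """
    if a_paths.keys() != b_paths.keys():
        raise ValueError("each mediator needs both an a-path and a b-path")
    # Indirect effect: sum over mediators of (a-path * b-path).
    indirect = sum(a_paths[m] * b_paths[m] for m in a_paths)
    return {"direct": direct, "indirect": indirect, "total": direct + indirect}


# Hypothetical standardised coefficients, for illustration only.
effects = decompose_effects(
    direct=0.20,
    a_paths={"dental_visit": 0.30, "dental_floss": -0.10},
    b_paths={"dental_visit": 0.40, "dental_floss": 0.50},
)
```

With these made-up values, the indirect effect is 0.30 × 0.40 + (−0.10) × 0.50 = 0.07 and the total effect is 0.27, which shows how mediators with opposite signs can partly cancel each other.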
The quality of fit of the model was evaluated by ordinary comparison, complemented by the root mean square error of approximation (RMSEA), where values less than or equal to 0.08 are considered adequate. The Comparative Fit Index (CFI) and the Tucker-Lewis Index (TLI) provided additional evidence of fit, and values above 0.80 were considered adequate (22). The SEM analysis included only associations previously reported in the literature or otherwise plausible.
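For readers unfamiliar with these indices, the sketch below computes RMSEA, CFI and TLI from their standard textbook definitions, using the chi-square statistic and degrees of freedom of the fitted model and of the baseline (independence) model. It is an illustration only: the input values are hypothetical, the function names are ours, and the survey-adjusted statistics reported by Stata may be computed differently. The adequacy check uses the thresholds stated above.

```python
import math
from typing import Dict


def fit_indices(chi2_model: float, df_model: int,
                chi2_base: float, df_base: int, n: int) -> Dict[str, float]:
    """RMSEA, CFI and TLI from model and baseline chi-square statistics."""
    excess_model = max(chi2_model - df_model, 0.0)
    excess_base = max(chi2_base - df_base, 0.0)
    rmsea = math.sqrt(excess_model / (df_model * (n - 1)))
    denom = max(excess_model, excess_base)
    cfi = 1.0 if denom == 0 else 1.0 - excess_model / denom
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    tli = (ratio_base - ratio_model) / (ratio_base - 1.0)
    return {"rmsea": rmsea, "cfi": cfi, "tli": tli}


def is_adequate(idx: Dict[str, float]) -> bool:
    """Apply the thresholds used in the text: RMSEA <= 0.08, CFI and TLI > 0.80."""
    return idx["rmsea"] <= 0.08 and idx["cfi"] > 0.80 and idx["tli"] > 0.80


# Hypothetical inputs, for illustration only.
indices = fit_indices(chi2_model=85.0, df_model=40,
                      chi2_base=1000.0, df_base=55, n=500)
adequate = is_adequate(indices)
```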
Machine Learning Approach
The extreme gradient boosting algorithm (XGBoost), based on sequential decision-tree models, was used to predict the outcome for adults and older adults. Previous research has shown that this algorithm achieves a higher area under the receiver operating characteristic curve (AUC) than other algorithms (15). First, the dataset was split into a training set (75%) and a testing set (25%). Then, a single recipe was applied to all variables: every categorical variable was converted into dummy variables, observations with missing values were omitted, and continuous variables (age) were normalised to avoid oversized effects due to differences in scale. Next, we applied 5-fold cross-validation on the training set to tune the hyperparameters and avoid overfitting, with one model for adults (to predict lack of functional dentition) and another for older adults (to predict severe tooth loss, including edentulism). A workflow was then constructed: for each model, the hyperparameter combination with the best AUC across the 5-fold cross-validation was selected, and the resulting model was evaluated on the test set. Appendix 1 shows the strategy for tuning all hyperparameters and the grids created. All of the results presented here are from the test set. Finally, to assess the predictive performance of the trained algorithm, the AUC, accuracy, sensitivity and specificity were calculated.
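The data partitioning and the four performance measures can be illustrated with a short, library-free Python sketch. It is not the R code used in this study: the function names, the random seed, the 0.5 classification threshold and the toy labels and scores are all assumptions made for illustration, and the model fitting and hyperparameter tuning steps are omitted.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_indices(n: int, test_fraction: float = 0.25,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle row indices and return (train, test) with a 75/25 split by default."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_folds(train_idx: Sequence[int], k: int = 5) -> List[List[int]]:
    """Partition the training indices into k disjoint validation folds."""
    return [list(train_idx[i::k]) for i in range(k)]


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def threshold_metrics(y_true: Sequence[int], scores: Sequence[float],
                      threshold: float = 0.5) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity at a given probability threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return {"accuracy": (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}


train, test = split_indices(100)
folds = k_folds(train)

# Toy test-set labels and predicted probabilities, for illustration only.
y = [1, 0, 1, 1, 0, 0, 1, 0]
p = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]
test_auc = auc(y, p)
test_metrics = threshold_metrics(y, p)
```

In the toy example, 15 of the 16 positive-negative pairs are ranked correctly, so the AUC is 0.9375, while accuracy, sensitivity and specificity at the 0.5 threshold are all 0.75.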
Furthermore, we computed the importance of each covariate in predicting our study outcomes, separately for adults and older adults. For the first sensitivity analysis, we stratified our dataset by the five main regions of Brazil (a proxy for human development levels). The Southeast and South regions have higher development levels than the Northeast and North, and different concentrations of the dental surgeon workforce in both the public and private sectors (23). The rationale was to understand how potential contextual covariates change the importance of individual variables. The second sensitivity analysis addressed class imbalance: the prevalence of lack of functional dentition in our adult sample was low (10.5%), and machine learning algorithms tend to bias decisions towards the majority class in imbalanced datasets. The majority class was therefore undersampled during cross-validation to the same proportion (50%) as the minority class. We used R (R Foundation for Statistical Computing, Vienna, Austria) for our machine learning approach. We followed the STROBE guidelines for human observational studies (von Elm et al. 2014) and the checklist for the artificial intelligence approach (24).
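The undersampling step can be illustrated as follows. This is a minimal sketch, assuming simple random undersampling of the majority class down to the size of the minority class; the function name, the seed and the toy labels are hypothetical, and the R implementation used in this study may differ. In practice the step would be applied to the training portion of each cross-validation fold only, so that the test set keeps its original class distribution.

```python
import random
from typing import Dict, List, Sequence


def undersample_majority(labels: Sequence[int], seed: int = 0) -> List[int]:
    """Return row indices in which the majority class is randomly reduced
    to the size of the minority class (a 50/50 class balance)."""
    by_class: Dict[int, List[int]] = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    if len(by_class) != 2:
        raise ValueError("expected exactly two classes")
    minority, majority = sorted(by_class.values(), key=len)
    kept = random.Random(seed).sample(majority, len(minority))
    return sorted(minority + kept)


# Toy labels with a 10.5% minority class (21 of 200), for illustration only.
labels = [1] * 21 + [0] * 179
balanced_idx = undersample_majority(labels)
balanced_labels = [labels[i] for i in balanced_idx]
```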
Ethical Aspects
The study was approved by the Brazilian Committee for Ethics in Human Research (protocol number 3.529.376).