In our approach, we adopted a multi-step process to develop an effective prediction model. We first collected relevant data, then preprocessed it to eliminate noise and inconsistencies. Subsequently, we selected the most significant features and trained several machine learning models to evaluate their performance.
3.1 Data Collection Process
A thorough search was conducted to identify relevant databases in the field of mental health and psychology. Several sources were explored, including academic references, online data repositories, and research archives. The database used in this project contains 26 parameters and 2100 samples, encompassing sociodemographic, employment, and family history data. Several selection criteria were then applied to ensure the quality and relevance of the data for our study: source reliability, sample representativeness, measurement quality, and availability of the information needed to address our research question.
3.2 Data Preprocessing
Once the database was selected, steps were taken to clean and preprocess the data to make it analysis-ready. This involved excluding low-quality, duplicate, and irrelevant records, as well as handling missing values. We then standardized the "Gender" column: a list of accepted values was created for each gender category, and a default value was assigned to entries matching none of them, to facilitate data manipulation.
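The exact cleaning rules are not detailed above, so the following is only a minimal sketch of one plausible implementation, assuming the raw data sit in a CSV file; the file name survey.csv and the accepted spellings are hypothetical:

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical file name

# Hypothetical lists of accepted spellings for each gender category.
male_terms = ["male", "m", "man", "cis male"]
female_terms = ["female", "f", "woman", "cis female"]
other_terms = ["non-binary", "genderqueer", "agender", "trans"]

def normalize_gender(value):
    """Map a free-text entry to one category, with a default for anything unrecognized."""
    value = str(value).strip().lower()
    if value in male_terms:
        return "male"
    if value in female_terms:
        return "female"
    if value in other_terms:
        return "other"
    return "other"  # default value for unmatched entries

df["Gender"] = df["Gender"].apply(normalize_gender)
```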
Subsequently, measures were taken to manage outliers in the age column. Values below 18 or above 120 years were deemed implausible or potentially erroneous. Therefore, to maintain data consistency, these values were replaced with the median age in the dataset. Replacing outliers with the median was chosen as an appropriate method to mitigate the impact of these extreme values on subsequent analysis while preserving the overall age distribution in the sample. This approach helps reduce potential distortions in analysis results while maintaining data representativeness (Figures 1 and 2).
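A minimal sketch of this replacement, assuming the data are held in a pandas DataFrame df with an "Age" column, as in the previous sketch:

```python
# Compute the median over plausible ages only, then use it to replace
# values below 18 or above 120.
valid = df["Age"].between(18, 120)
median_age = df.loc[valid, "Age"].median()
df.loc[~valid, "Age"] = median_age
```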
Next, we encoded the categorical variables into numerical values using the Label Encoding technique, making the data compatible with the algorithms used later (Figure 3).
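A sketch of this step with scikit-learn's LabelEncoder, applied to every text-typed column of the assumed DataFrame df:

```python
from sklearn.preprocessing import LabelEncoder

# Encode each categorical (object-typed) column, including the target
# "treatment", into integer codes.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))
```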
Lastly, the numerical variable age was scaled using the Min-Max Scaling method, which rescales values to a common range so that variables remain comparable and no single feature dominates the model (Figure 4).
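A sketch of the scaling step with scikit-learn's MinMaxScaler; the [0, 1] target range is the scaler's default:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale the numerical "Age" column to [0, 1] so its magnitude does not
# dominate distance- or gradient-based models.
df[["Age"]] = MinMaxScaler().fit_transform(df[["Age"]])
```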
3.3 Data Analysis
3.3.1 Correlation Analysis with Treatment
As part of exploratory data analysis, a correlation matrix was calculated to assess linear relationships between the variables and treatment (Figure 5). The correlation matrix helps identify the variables most strongly correlated with treatment, providing insights into the factors influencing the treatment decision. We selected the 10 variables most strongly correlated with treatment, computed their correlation matrix, and created a heatmap to visualize the correlations.
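A sketch of this selection and visualization, assuming the encoded DataFrame df from Section 3.2 and a binary "treatment" column (the column name is an assumption):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Rank features by absolute correlation with "treatment", keep the 10
# strongest, then draw the heatmap of their pairwise correlations (Figure 5).
corr_with_target = df.corr(numeric_only=True)["treatment"].abs().sort_values(ascending=False)
selected = corr_with_target.index[:11]  # "treatment" itself plus its 10 strongest correlates
sns.heatmap(df[selected].corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation of the 10 variables most correlated with treatment")
plt.show()
```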
The values in the heatmap represent correlation coefficients, ranging from -1 to 1, where -1 indicates a perfect negative correlation, 1 indicates a perfect positive correlation, and 0 indicates no linear correlation.
3.3.2 Age Distribution and Density
To better understand the age distribution in our dataset and visualize the density of each age group, we generated a distribution and density graph. This graph depicts the distribution of ages of individuals in our dataset, illustrating the frequency of each age group and the density of the population at each age interval. This allows us to have a visual overview of age distribution among individuals (Figure 6).
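A sketch of this plot with seaborn, assuming the ages are still expressed in years at this point (i.e., plotted from an unscaled copy of the "Age" column); the bin count is illustrative:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of ages with a kernel-density overlay (Figure 6).
sns.histplot(df["Age"], bins=20, kde=True)
plt.xlabel("Age")
plt.ylabel("Number of individuals")
plt.show()
```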
3.3.3 Distribution of Treated and Untreated Individuals by Gender
An essential component of our study is to analyze the distribution of individuals who received treatment for mental health issues among our sample. The graph below presents the total number of individuals treated and those not treated, categorized by gender. This visualization enables us to assess the proportion of participants accessing mental health care and understand treatment disparities among different demographic groups.
3.3.4 Probability of Treatment by Age and Gender
To further examine the likelihood of receiving treatment for mental health conditions in our sample based on age and gender, we created a nested bar chart. This chart, shown in Figure 7, displays the probabilities of treatment for mental health issues, divided by gender and categorized by age group. Bars represent the average probability of treatment, expressed as a percentage, for each age group, and are differentiated by color according to gender. This visualization helps us identify general treatment trends by age and gender, offering valuable insights into mental health disparities in our study population.
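A sketch of how such a chart can be produced, assuming ages in years, readable "Gender" labels at this exploratory stage, and a binary "treatment" column (1 = treated); the age-group boundaries are illustrative choices:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Bin ages, compute the mean treatment rate per (age group, gender) pair,
# and plot one colored bar per gender within each age group (Figure 7).
df["age_group"] = pd.cut(df["Age"], bins=[18, 25, 35, 45, 55, 120],
                         labels=["18-25", "26-35", "36-45", "46-55", "56+"],
                         include_lowest=True)
rates = (df.groupby(["age_group", "Gender"], observed=True)["treatment"]
           .mean().mul(100).reset_index())
sns.barplot(data=rates, x="age_group", y="treatment", hue="Gender")
plt.ylabel("Probability of treatment (%)")
plt.show()
```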
3.3.5 Probability of Treatment by Family History
In this section, we present a bar chart illustrating the probabilities of mental health conditions based on family history. The chart in Figure 8 shows the probability of treatment for different categories of family history, stratified by gender. Family history categories are represented on the x-axis, while treatment probabilities are represented on the y-axis. Bars are differentiated by gender, with shades of color corresponding to each gender category. This visualization allows us to observe variations in treatment probability based on family history, as well as gender differences.
3.3.6 Probability of Treatment Based on Mental Health Benefits
In this section, we present a bar chart in Figure 9 illustrating the probability of treatment according to the availability of mental health benefits, with bars differentiated by gender. This visualization allows us to observe variations in treatment probability based on benefits, as well as gender differences.
3.3.7 Impact of Work on Mental Health and Probability of Treatment
In this section, a bar chart in Figure 10 is presented showing how work interferes with mental health and how it affects the probability of treatment, based on gender. Bars represent different ways in which work affects mental health, and the height of the bars shows the probability of treatment. This visualization helps us see how work can influence the probability of treatment for mental health and whether it varies between men and women.
3.4 Machine Learning Algorithms Used in Our Contribution
3.4.1 Logistic Regression
Logistic regression (Wang et al., 2021) is a commonly used method for modeling binary or categorical variables, making it a suitable choice for our problem. We constructed a logistic regression model to predict the target variable from the available features in the dataset. We created and trained a logistic regression model on the training set, made class predictions for the test set, and evaluated the model's performance using various performance metrics. The results were recorded in a dictionary for later use in visualizing the performance of different models.
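A minimal sketch of this step, assuming X_train, X_test, y_train, and y_test come from the split described in Section 3.5; the max_iter value and the structure of the results dictionary are illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Train the model, predict the test classes, and record the metrics.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

results = {"Logistic Regression": {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}}
```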
3.4.2 KNeighbors Classifier
The K-Neighbors algorithm (Zhao et al., 2020) is a supervised learning method used for classification and regression, classifying a data point based on the labels of its nearest neighbors in the feature space. We created and trained a K-Neighbors model on the training set. The best parameters for the model were determined using a random search. Class predictions were made for the test set, and the model's performance was evaluated using various performance metrics.
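A sketch of the random search over the K-Neighbors model with scikit-learn's RandomizedSearchCV; the search space and iteration count are assumptions, not the values used in the study:

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Randomly sample hyperparameter combinations and keep the best estimator.
param_dist = {"n_neighbors": randint(1, 30), "weights": ["uniform", "distance"]}
search = RandomizedSearchCV(KNeighborsClassifier(), param_dist,
                            n_iter=20, cv=5, random_state=42)
search.fit(X_train, y_train)
y_pred = search.best_estimator_.predict(X_test)
```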
3.4.3 Decision Tree Classifier
Decision trees (Feng et al., 2020) are supervised learning models used for classification and regression, partitioning the feature space into homogeneous subsets based on feature values. We built a decision tree classification model to predict the target variable from the available features. The best parameters for the model were determined using a random search. Class predictions were made for the test set, and the model's performance was evaluated using various performance metrics.
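The same random-search pattern applies to the decision tree; again, the parameter ranges are illustrative:

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Search depth, split size, and impurity criterion, then predict on the test set.
param_dist = {"max_depth": randint(2, 20),
              "min_samples_split": randint(2, 20),
              "criterion": ["gini", "entropy"]}
search = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), param_dist,
                            n_iter=30, cv=5, random_state=42)
search.fit(X_train, y_train)
y_pred = search.best_estimator_.predict(X_test)
```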
3.4.4 Random Forest
Random forest (Rahman et al., 2020) is a supervised learning method that combines multiple decision trees to improve prediction accuracy and robustness. We used the random forest method to construct a mental health prediction model from the available features. First, we determined the best hyperparameters for our model using a random search over a predefined parameter grid. We then built our random forest model using these optimized parameters with the RandomForestClassifier class from the scikit-learn library. The model was trained on the training set and evaluated on the test set to assess its performance.
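A sketch of the described workflow with RandomForestClassifier; the parameter grid below stands in for the predefined grid mentioned in the text:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Optimize the forest's hyperparameters, then evaluate the best model.
param_dist = {"n_estimators": randint(100, 500),
              "max_depth": randint(3, 20),
              "max_features": ["sqrt", "log2"]}
search = RandomizedSearchCV(RandomForestClassifier(random_state=42), param_dist,
                            n_iter=20, cv=5, random_state=42)
search.fit(X_train, y_train)
print("Test accuracy:", search.best_estimator_.score(X_test, y_test))
```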
3.4.5 Bagging
Bagging (Bootstrap Aggregating) (Jemili et al., 2023) is an ensemble method that combines multiple learning models to produce more robust and accurate predictions. We used the bagging technique to construct a mental health prediction model from the available features. A base model was created using a decision tree, which was then used to form each predictor in our bagging ensemble. Using the BaggingClassifier class from the scikit-learn library, we built our bagging ensemble by specifying the base model and other parameters such as the maximum number of samples and features. The model was trained on the training set and evaluated on the test set.
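A sketch of the bagging ensemble; the sampling ratios and ensemble size are illustrative:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging ensemble of decision trees (in scikit-learn < 1.2 the first
# parameter is named base_estimator instead of estimator).
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=100,
                            max_samples=0.8,
                            max_features=0.8,
                            random_state=42)
bagging.fit(X_train, y_train)
y_pred = bagging.predict(X_test)
```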
3.4.6 Boosting
Boosting (Jemili et al., 2023) is an ensemble technique that combines multiple weak models to produce a more robust and accurate model. We started by creating a base model using a shallow decision tree. This base model was used to form each predictor in our boosting ensemble. Using the AdaBoostClassifier class from the scikit-learn library, we built our boosting ensemble by specifying the base model and parameters such as the maximum number of estimators. The model was trained on the training set and evaluated on the test set.
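A sketch of the boosting ensemble with AdaBoostClassifier; the tree depth, number of estimators, and learning rate are assumptions:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# AdaBoost over shallow trees (same estimator/base_estimator naming caveat
# as for bagging).
boosting = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                              n_estimators=100,
                              learning_rate=0.5,
                              random_state=42)
boosting.fit(X_train, y_train)
y_pred = boosting.predict(X_test)
```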
3.4.7 Stacking
Stacking (Kamel et al., 2022) involves applying a machine learning algorithm to classifiers generated by another machine learning algorithm. This approach aggregates different models to improve the final prediction quality. We used the stacking method to construct a mental health prediction model from the available features. Several base models were created using different learning algorithms, including K-Nearest Neighbors, random forest, and Gaussian Naive Bayes. Using the StackingClassifier class from the scikit-learn library, we built our stacking model by specifying the base models and a meta-classifier to aggregate their predictions. The model was trained on the training set and evaluated on the test set.
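A sketch of the stacking ensemble with the base learners named in the text; the logistic-regression meta-classifier is our assumption, since the text does not name it:

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Base learners feed their cross-validated predictions to a meta-classifier.
base_models = [("knn", KNeighborsClassifier()),
               ("rf", RandomForestClassifier(random_state=42)),
               ("gnb", GaussianNB())]
stacking = StackingClassifier(estimators=base_models,
                              final_estimator=LogisticRegression(max_iter=1000),
                              cv=5)
stacking.fit(X_train, y_train)
y_pred = stacking.predict(X_test)
```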
3.4.8 Neural Network
A neural network (Bahri et al., 2023) is a mathematical model inspired by the human brain, composed of multiple interconnected layers of neurons. It is used in artificial intelligence to learn from data and solve various problems, such as classification, regression, and pattern recognition. We started by creating an Adagrad optimizer instance using the Adagrad class from the tensorflow.keras.optimizers module. We then built our neural network using the DNNClassifier class from the tensorflow.estimator module, featuring two hidden layers of ten nodes each and using the Adagrad optimizer. The features used by the model were specified in the feature_columns list. This configuration allowed the model to learn complex relationships between the input features and the target variable while optimizing the network weights with the Adagrad algorithm. Finally, we evaluated the neural network model's performance on the test set to determine its accuracy in predicting individuals' mental health.
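Because tf.estimator.DNNClassifier is deprecated in recent TensorFlow releases, the sketch below reproduces the described architecture (two hidden layers of ten neurons, Adagrad optimizer) with the Keras API instead; the learning rate, epoch count, and batch size are assumptions:

```python
import tensorflow as tf

# Two hidden layers of ten neurons each, trained with Adagrad on the
# binary "treatment" target; X_train/X_test come from the split in Section 3.5.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1),
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
```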
3.5 Performance Evaluation Methods
In evaluating the performance of the aforementioned algorithms, we adopted a systematic approach to measure the effectiveness of each model in predicting mental health outcomes. First, we split our data into training and test sets using the train_test_split function, which reserves a portion of the data for evaluating the models on unseen data (see Figure 11).
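A sketch of the split, assuming the preprocessed DataFrame df from Section 3.2 and a binary "treatment" target; the 80/20 ratio, stratification, and random seed are illustrative choices:

```python
from sklearn.model_selection import train_test_split

# Separate features from the target and hold out 20% of the data for testing.
X = df.drop(columns=["treatment"])
y = df["treatment"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```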
Next, we used various performance metrics specific to each method to assess the predictive capabilities of the models. For classification models such as logistic regression, k-nearest neighbors, decision trees, and random forests, we used the following metrics (computed as sketched after the list):
- Accuracy: Measures the overall correctness of the model's predictions.
- Precision: Indicates the proportion of correctly identified positive cases out of all cases predicted as positive.
- Recall: Also known as sensitivity, measures the model's ability to correctly identify all actual positive cases.
- F1-Score: Combines precision and recall into a single metric to provide a balanced measure of the model's performance, particularly useful when there is an uneven class distribution.
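A compact way to obtain all four metrics at once is scikit-learn's classification_report; y_pred stands for the predictions of any classifier above, and the class names are assumptions about the label encoding:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1-score, plus overall accuracy.
print(classification_report(y_test, y_pred,
                            target_names=["no treatment", "treatment"]))
```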
For neural network models, we also used accuracy as a primary performance metric, but we additionally examined metrics such as loss and the ROC curve to evaluate the model's ability to distinguish between different classes.
Finally, we compared the performances of the different methods by analyzing the results and identifying the method that demonstrates the best performance in terms of accuracy and generalization capability. This evaluation allowed us to determine which model is most appropriate for our specific task of predicting mental health outcomes, providing valuable insights for making informed decisions in our analysis.