Part I
Prevalence of depression
Prevalence of depression is found to vary across regions with 2% of the population screening positive for MDD in some Asian countries and up to 21% in some European nations (26).
Undergraduates are known to have greater depressive symptoms, with the highest rates observed in some African regions (up to 40.1%) and in lower-middle income countries (up to 42.5%), considering studies conducted both before and after the coronavirus pandemic (27). In Sri Lanka, however, MDD was observed in 9.3% of the undergraduates in some state universities in 2015 (28). Only one study was conducted during the pandemic, in which depression rates were found to be as high as 87.85% among students in the radiology unit of the University of Peradeniya (30). This high rate may be due to the COVID-19 pandemic.
This is the only study of this nature conducted among undergraduates in Colombo University after the pandemic. As shown in Fig. 1, it was observed that 29.7% of the respondents screened positive for depression.
Gender and depression
Figure 2 presents the distribution of depression for the two different genders of the respondents. The study has found that depression seems to be more prevalent in females than in males, with 34.2% of the females screening positive for MDD and only 23.9% of the males screening positive for MDD. However, the phi coefficient indicated that there was no significant relationship between the two variables (p value = 0.062).
Even from an international perspective, females are found to be more likely to develop depression in comparison to males. A study which reviewed the prevalence of depression across 29 countries found that in all countries, females were more likely to develop depressive symptoms, indicating that this is a gender issue rather than a cultural problem (31).
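The phi coefficient used above for gender and depression can be illustrated with a short sketch. The counts below are hypothetical and are not the study's data; the function name is also an assumption. The p value is taken from the chi-squared distribution with one degree of freedom, without a continuity correction, which may differ from the exact procedure used in the study.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi coefficient, chi-squared statistic and p value for a 2x2 table.

    Table layout:        MDD positive   MDD negative
        group 1               a              b
        group 2               c              d
    """
    n = a + b + c + d
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    phi = (a * d - b * c) / denom
    chi2 = n * phi ** 2  # Pearson chi-squared, no continuity correction
    # Survival function of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return phi, chi2, p_value

# Hypothetical counts (NOT taken from the study).
phi, chi2, p = phi_coefficient(a=40, b=60, c=25, d=75)
print(f"phi = {phi:.3f}, chi2 = {chi2:.3f}, p = {p:.4f}")
```

A phi value close to 0 indicates a weak association between the two binary variables, and the p value is compared with the chosen significance level (0.05 in this study).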
Alcohol intake and depression
Figure 3 depicts how the level of depression in an undergraduate varies with the frequency of consuming alcohol. According to the figure, there does not seem to be any association between the 2 variables.
Since certain international studies which focused on the relationship between alcohol consumption of women and depression found a significant association between these two variables (32), in this study too this was considered. The percentage component bar chart corresponding to the alcohol intake of the female respondents of the study, with the level of depression, is given below in Fig. 4.
When considering the female respondents, it can be clearly seen that as the intake of alcohol increases, so does the percentage of those who have depression. As observed, only 31% of those who never consume alcohol screened positive for MDD, while 85% of those who occasionally consume alcohol screened positive for MDD. Hence, the pattern among the female respondents seems to follow that of undergraduate women in international universities, whereas no such relationship was observed when both genders were considered together.
The rank-biserial correlation coefficient was used to check for any significant associations between the two variables mentioned above. When both the genders were considered, there was no significant association to be found (p value = 0.06). However, when only the female respondents were considered, the two variables were significantly associated (p value = 0.005). Thus, it seems that undergraduate females who deal with MDD have a higher tendency of consuming alcohol more frequently.
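As an illustration of the rank-biserial correlation coefficient mentioned above, the following minimal sketch computes it by comparing every pair of observations from the two groups (the simple difference formula). The ordinal codes and the data are hypothetical; the significance test itself, which is usually based on the Mann-Whitney U statistic, is not reproduced here.

```python
def rank_biserial(group_a, group_b):
    """Rank-biserial correlation between two groups of ordinal scores.

    Returns the proportion of pairs with a > b minus the proportion of
    pairs with a < b; tied pairs count toward neither proportion.
    """
    greater = sum(1 for x in group_a for y in group_b if x > y)
    less = sum(1 for x in group_a for y in group_b if x < y)
    return (greater - less) / (len(group_a) * len(group_b))

# Hypothetical alcohol-frequency codes: 0 = never, 1 = occasionally, 2 = frequently
mdd_positive = [0, 1, 1, 2, 2, 1]
mdd_negative = [0, 0, 1, 0, 0, 1, 0, 2]

r = rank_biserial(mdd_positive, mdd_negative)
print(f"rank-biserial r = {r:.3f}")
```

A value near +1 would mean that respondents in the first group almost always report a higher frequency than those in the second group, a value near -1 the reverse, and a value near 0 no systematic difference.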
Year in university versus depression
Considering the year of study, as seen in Fig. 5, second year students seem to screen positive for MDD more often than students in the other years, indicating an association between the two variables, year of study and depression. Furthermore, the chi-squared test of independence indicates that the relationship between the two variables is significant (p value = 0.04).
The difference in depression levels observed for different academic years could be due to the level of stress experienced by the undergraduates. Studies have shown that exposure to stressful life events is a factor that is significantly associated with depression (33, 34).
In this study, second year students tend to have more stress, most likely due to the pressure of getting selected to an honours degree programme of their choice. This could be the reason for the above obtained result. To verify this, the relationship between the academic year and the level of stress associated with academic activities was investigated, and the results are shown below.
Academic year versus the stress associated with academic activities
Figure 6 confirms the reasoning given above, with 2nd year students having the highest level of stress associated with academic activities. However, the 4th year students also seem to have a high level of stress, possibly due to their final year research projects and striving to achieve a good class. As observed, more than 60% of both the 2nd year and 4th year students have stated that they experience high levels of stress when dealing with academic activities, while less than 50% of the 1st and 3rd year students have indicated the same.
Furthermore, the chi-squared test of independence indicates a significant relationship between the year of study of an undergraduate and the level of stress related to their academic work (p value = 0.045).
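The chi-squared test of independence used in the two sections above can be sketched as follows. The contingency table is hypothetical (rows are academic years, columns are MDD positive and MDD negative), and the function name is an assumption. Instead of a p value, the sketch compares the statistic with 7.815, the standard 5% critical value of the chi-squared distribution with 3 degrees of freedom.

```python
def chi_squared_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts (NOT the study's data): years 1 to 4, [positive, negative]
observed = [[20, 70], [40, 50], [22, 68], [25, 65]]
stat, dof = chi_squared_independence(observed)

CRITICAL_5PCT_3DF = 7.815  # tabulated critical value for 3 df at the 5% level
print(f"chi2 = {stat:.3f}, df = {dof}, significant = {stat > CRITICAL_5PCT_3DF}")
```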
Satisfaction with academic achievements versus depression
As observed in Fig. 7, the percentage of depressed individuals seems to increase with dissatisfaction with their academic achievements: only 17.6% of those who are satisfied with their academic achievements screened positive for MDD, whereas 46.4% of those who are not satisfied screened positive for MDD. Furthermore, Goodman and Kruskal’s lambda indicated a significant relationship between the two variables (p value = 0.001).
While the results obtained in this study are not directly related to the academic performance of individuals, in general, those who are less satisfied with their academic achievements are more likely to have obtained poorer grades than those who are more satisfied with their academic achievements, indicating that students with depression tend to do more poorly in terms of academics when compared with those without depression.
The causes for this could be those discovered in international studies that students who are dealing with depression tend to have poor memory (35), difficulty concentrating (36) and cognitive impairment (37), which could all contribute to poor academic performance.
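Goodman and Kruskal's lambda, used above for satisfaction with academic achievements, measures how much the error in predicting one categorical variable is reduced by knowing the other. The following minimal sketch uses a hypothetical three-level satisfaction variable and hypothetical counts; the function name is an assumption.

```python
def goodman_kruskal_lambda(table):
    """Lambda for predicting the column variable from the row variable.

    table[i][j] is the count for row category i and column category j.
    """
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    # Errors when always guessing the overall modal column
    errors_without = n - max(col_totals)
    # Errors when guessing the modal column within each row
    errors_with = n - sum(max(row) for row in table)
    if errors_without == 0:
        raise ValueError("lambda is undefined when only one column category occurs")
    return (errors_without - errors_with) / errors_without

# Hypothetical counts (NOT the study's data).
# Rows: satisfied, neutral, dissatisfied; columns: [MDD positive, MDD negative]
table = [[10, 50], [20, 40], [35, 15]]
lam = goodman_kruskal_lambda(table)
print(f"lambda = {lam:.3f}")
```

Lambda ranges from 0 to 1. It equals 0 whenever every row has the same modal column, even if the row percentages differ, so it is most informative when the modal outcome changes across categories.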
Income and depression
As observed in Fig. 8, those in the higher income brackets (family income of more than 250,000 rupees per month) show lower depressive symptoms in comparison to those whose family income is less than 250,000 rupees. This could be because the economic crisis is more likely to have more severely affected the poorer households.
However, a major difference in the prevalence of depression cannot be observed across the different income categories. Furthermore, the association between the two variables was not found to be statistically significant using the rank-biserial correlation coefficient (p value = 0.657).
While low levels of income are associated with certain life stresses which may lead to the development of depressive symptoms, high income earners also have their own problems to deal with such as long work hours (38), meeting deadlines and achieving goals which increase stress levels (39). Studies have found that undergraduates who come from such households have a lot of pressure put on them to overachieve. Furthermore, they tend to feel a sense of isolation from their parents (40). These factors may lead those from higher income families to report similar levels of depression as those from low-income families, and this may be the reason for not observing a significant relationship between income and the level of depression.
Effect of being harassed in university on depression
It is clear when observing Fig. 9 that there is an association between developing depression and being harassed in university, with 67.4% of those who have been harassed screening positive for MDD while only 27.9% of those who have not been harassed screened positive for MDD. The relationship between the two variables was found to be statistically significant using the phi coefficient (p value = 0.002).
Literature reveals that those who have been ragged develop psychological problems, physiological problems and behavioural problems due to the stress associated with ragging. Apart from these negative effects, harassment of any form is found to be associated with an increased risk of developing depression (41). This seems to be the case in the current study as well as there is a large difference between the percentage of students who screened positive for MDD when comparing those who have faced some form of harassment in university with those who have not. Hence, being harassed in university either by a student or a staff member seems to have a relationship with the development of depressive symptoms. However, while these two variables were significantly associated, it was also found that only 4.6% of the respondents reported being ragged or harassed in university.
Effect of a breakup on depression
The breaking up of a relationship has an association with MDD as observed in Fig. 10, with 77.1% of those who have recently broken up screening positive for MDD while only 28.1% of those who have not recently broken up screened positive for MDD. Furthermore, the association between the level of depression and whether or not a respondent recently broke up is found to be statistically significant using the phi coefficient (p value = 0.001).
The large difference in the prevalence of depression between those who had recently broken up and those who had not may be due to the fact that many relationships formed in the university are among students who are at the age of marriage (the average age of marriage in Sri Lanka is 24.1 years for females and 27.3 years for males (42)); hence, it is reasonable to assume that the relationships were formed in the hopes of getting married. These types of relationships lead to higher levels of depressive symptoms when they end (43).
The effect of breakups on developing depressive symptoms is greater for females than for males (see Figs. 11 (a) and 11 (b)). While 91.1% of the women who have recently broken up screened positive for MDD, only 64.3% of the men who have gone through a breakup did. However, among those who had a recent breakup, gender and depression are not found to be significantly associated (p value = 0.151 for the phi coefficient).
How being in a satisfying or unsatisfying relationship impacts depression
As observed in Figs. 12 (a) and 12 (b), 37.6% of those who are in a relationship which is not satisfying screened positive for MDD whereas only 21.0% of those who are in a relationship which is satisfying screened positive for MDD.
Negative relationships are seen to increase the prevalence of depression (44). This can be seen to be true in the current study as well by observing the prevalence rates of depression in those who are in satisfactory and unsatisfactory relationships. Furthermore, there is a significant association between the level of depression and whether or not someone is in a satisfying relationship (p value = 0.045 according to the phi coefficient). However, such a significance is not observed when checking whether or not a respondent is in an unsatisfactory relationship with the level of depression (p value = 0.617 for the phi coefficient).
Interestingly, 57.5% of female respondents who are not satisfied with their relationship screened positive for MDD whereas this was true for only 8.8% of the male respondents (Figs. 13 (a) and 13 (b)).
However, using the phi coefficient, the association between gender and depression for those in an unsatisfying relationship was found to be not statistically significant (p value = 0.058).
Satisfaction with physical appearance and depression
Figure 14 shows that the percentage of those who screen positive for MDD tends to decrease as satisfaction with their physical appearance increases. This is supported by the rank-biserial correlation coefficient, which indicated that the two variables were significantly associated (p value < 0.001).
Part II
Results of model fitting
Once the representative sample was derived from the full sample of 360 observations using the novel approach mentioned above, 288 observations remained. Separating these into a training set and a testing set left 230 training observations and 58 testing observations. The selected models were fitted on the training set using 5-fold cross validation, and the results of the model fitting process are given in Table 1.
Table 1
Machine learning models fitted on an imbalanced representative sample (Results) – Fitted on the training set
Machine Learning Model | Accuracy | Precision | Recall
K Nearest Neighbor Classifier | 68% | 51% | 26%
Linear Discriminant Analysis | 59% | 34% | 30%
Quadratic Discriminant Analysis | 67% | 20% | 4%
Logistic Regression | 60% | 35% | 25%
Decision Tree Classifier | 60% | 38% | 37%
Support Vector Classifier | 60% | 34% | 27%
Random Forest | 68% | 57% | 19%
Gradient Boosting | 61% | 39% | 32%
Light Gradient Boosting | 67% | 48% | 30%
Extreme Gradient Boosting | 66% | 45% | 23%
Gaussian Naive Bayes | 41% | 34% | 86%
CatBoost | 68% | 50% | 16%
Artificial Neural Networks | 60% | 35% | 28%
Fixing the problem of class imbalance
It was observed that there is a class imbalance problem in the dataset, with 32% of undergraduates screening positive for MDD and 68% screening negative. Addressing this issue may yield better scores for the recall. Hence the SMOTE technique was used to fix the class imbalance problem.
However, applying SMOTE creates synthetic data; hence, applying it to the training set and then using 5-fold cross validation on the same set would likely yield inflated results, since the validation folds would then also contain synthetic data. Thus, the training set was separated into two parts: SMOTE was applied to 80% of the observations, and the model was validated on the remaining 20%. Table 2 presents the results of the models fitted on the data after applying the SMOTE technique.
Table 2
Machine learning models fitted on a balanced representative sample (Results)
Machine Learning Model | Accuracy | Precision | Recall
K Nearest Neighbor Classifier | 48% | 36% | 63%
Linear Discriminant Analysis | 57% | 33% | 25%
Quadratic Discriminant Analysis | 66% | nan | 0%
Logistic Regression | 60% | 42% | 31%
Decision Tree Classifier | 58% | 29% | 13%
Support Vector Classifier | 61% | 42% | 31%
Random Forest | 66% | 50% | 13%
Gradient Boosting | 70% | 60% | 38%
Light Gradient Boosting | 63% | 44% | 25%
Extreme Gradient Boosting | 61% | 38% | 19%
Gaussian Naive Bayes | 52% | 38% | 56%
CatBoost | 63% | 44% | 25%
Artificial Neural Networks | 63% | 46% | 38%
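The split-then-oversample procedure described before Table 2 can be sketched as follows. This is a simplified, SMOTE-style interpolation written for illustration only: the source does not state which SMOTE implementation or settings were used, and in practice a library implementation would normally be preferred. The function names, the toy data, the neighbour count `k` and the coding of the minority class as 1 are all assumptions.

```python
import random

def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

def split_then_oversample(X, y, valid_fraction=0.2, seed=0):
    """Hold out a validation part first, then balance only the remaining part."""
    rng = random.Random(seed)
    order = list(range(len(X)))
    rng.shuffle(order)
    n_valid = round(valid_fraction * len(order))
    valid_idx, fit_idx = order[:n_valid], order[n_valid:]
    X_fit = [X[i] for i in fit_idx]
    y_fit = [y[i] for i in fit_idx]
    # Assumption: label 1 (screened positive) is the minority class.
    minority = [x for x, label in zip(X_fit, y_fit) if label == 1]
    n_new = max(0, y_fit.count(0) - len(minority))
    synthetic = smote_like_samples(minority, n_new, seed=seed)
    X_bal = X_fit + synthetic
    y_bal = y_fit + [1] * len(synthetic)
    X_valid = [X[i] for i in valid_idx]
    y_valid = [y[i] for i in valid_idx]
    return X_bal, y_bal, X_valid, y_valid

# Hypothetical data: 100 two-feature rows, 32 labelled positive (1).
data_rng = random.Random(1)
X = [(data_rng.random(), data_rng.random()) for _ in range(100)]
y = [1] * 32 + [0] * 68
X_bal, y_bal, X_valid, y_valid = split_then_oversample(X, y)
print(len(X_valid), y_bal.count(0), y_bal.count(1))
```

Because the validation rows are set aside before any synthetic points are generated, the validation scores are computed on real observations only.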
Final model
As observed, the highest accuracy as well as precision was obtained for the Gradient Boosting Classifier after fixing the problem of the class imbalance. Hence, this is the model that was used for further analysis.
Feature Selection
In total there were 113 independent variables after encoding. Variables were dropped one by one in the model fitting process, starting from the least important variable to the most important variable, as identified by the variable importance plot. At each stage, the model was fitted and evaluated on the test set to identify how many variables should be kept to obtain the best model. It was found that the best accuracy, sensitivity and F1 score are obtained after removing the 67 least important variables, leaving the model to work with 46 variables. The variable importance plot of the most important variables in fitting the gradient boosting classifier is shown below in Fig. 15.
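The elimination procedure described above can be sketched as a loop over a fixed importance ranking. This is an illustration under stated assumptions: the source does not say whether importances were recomputed after each removal, so the ranking is held fixed here, and `score_fn`, the feature names and the toy scoring rule are hypothetical stand-ins for refitting and evaluating the gradient boosting classifier.

```python
def backward_elimination(importances, score_fn):
    """Drop features from least to most important and keep the best-scoring set.

    importances: dict mapping feature name -> importance value.
    score_fn: callable that takes a list of feature names and returns a score
              (a stand-in for fitting the model and evaluating it).
    """
    ranked = sorted(importances, key=importances.get, reverse=True)
    best_features, best_score = list(ranked), score_fn(ranked)
    for n_keep in range(len(ranked) - 1, 0, -1):
        kept = ranked[:n_keep]  # the n_keep most important features
        score = score_fn(kept)
        if score >= best_score:  # on ties, prefer the smaller feature set
            best_features, best_score = kept, score
    return best_features, best_score

# Hypothetical importances and scoring rule (NOT the study's variables).
importances = {"a": 0.5, "b": 0.3, "n1": 0.1, "n2": 0.07, "n3": 0.03}

def toy_score(features):
    informative = sum(1 for f in features if f in {"a", "b"})
    noise = len(features) - informative
    return informative - 0.01 * noise  # noise features slightly hurt the score

best_features, best_score = backward_elimination(importances, toy_score)
print(best_features, best_score)
```

In the study, the same idea reduced the 113 encoded variables to 46 by removing the 67 least important ones.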
Hyperparameter Tuning
The hyperparameters were tuned to maximize the recall, as this is the most important score for the model being built. The recall gives the percentage of those with depression who were identified correctly, and it is of utmost importance that an individual who has the potential to develop depression is identified and directed to obtain a proper diagnosis. Identifying an individual without depression as being at risk of developing symptoms is not a major problem, since once they are directed to the relevant healthcare professional, they can be evaluated to determine whether they are at risk. However, if someone who is at risk is identified as not being at risk, then he/she will not seek professional help. Thus, the model to be developed focuses on maximizing the recall. Table 3 lists the hyperparameter values that were tried in fitting the gradient boosting classifier, together with the values that gave the maximum recall.
Table 3
Best hyperparameters of the gradient boosting classifier
Hyperparameter | Values that were tried | Best value for the hyperparameter
learning_rate | 0.1, 0.05, 0.01 | 0.05
n_estimators | 50, 100, 250 | 50
max_depth | 2, 3, 4 | 2
min_samples_split | 2, 4, 8 | 2
min_samples_leaf | 1, 2, 4 | 1
subsample | 0.5, 0.8, 1.0 | 0.8
max_features | None, sqrt, log2 | None
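The size of the search space in Table 3 can be checked with a short sketch that enumerates every combination of the listed values. The hyperparameter names match those of scikit-learn's GradientBoostingClassifier, although the source does not name the library, and it is an assumption that the search was an exhaustive grid. The `grid_search` helper and its `score_fn` argument are hypothetical stand-ins for fitting the classifier and measuring recall.

```python
from itertools import product

# Values taken from Table 3.
param_grid = {
    "learning_rate": [0.1, 0.05, 0.01],
    "n_estimators": [50, 100, 250],
    "max_depth": [2, 3, 4],
    "min_samples_split": [2, 4, 8],
    "min_samples_leaf": [1, 2, 4],
    "subsample": [0.5, 0.8, 1.0],
    "max_features": [None, "sqrt", "log2"],
}

# Best values reported in Table 3.
best_params = {
    "learning_rate": 0.05,
    "n_estimators": 50,
    "max_depth": 2,
    "min_samples_split": 2,
    "min_samples_leaf": 1,
    "subsample": 0.8,
    "max_features": None,
}

def iter_grid(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

def grid_search(grid, score_fn):
    """Return the combination with the highest score (e.g. validation recall)."""
    return max(iter_grid(grid), key=score_fn)

candidates = list(iter_grid(param_grid))
print(len(candidates))  # 3 ** 7 = 2187 combinations
```

With seven hyperparameters and three values each, an exhaustive search fits 2187 candidate models per validation split, which is feasible for a data set of this size.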
Once the optimal model was selected, after having tuned the hyperparameters, the model was evaluated on both the train and the test set. The scores obtained are shown in Table 4.
Table 4
Performance metrics of both train and test sets in the tuned model
Score | Training Set Values | Test Set Values
Accuracy | 87% | 79%
Precision | 87% | 78%
Recall | 88% | 72%
Specificity | 87% | 85%
F1 score | 87% | 75%
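The model performs reasonably well on both the train and the test set, although every score is lower on the test set, with the largest drop observed in the recall (88% to 72%).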
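The scores in Table 4 are all functions of the confusion matrix, as the following sketch shows. The counts used here are an assumption: they are one confusion matrix for 58 test observations that reproduces the rounded test-set percentages in Table 4, and the source does not report the actual counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification scores from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),    # of predicted positives, share correct
        "recall": tp / (tp + fn),       # of actual positives, share detected
        "specificity": tn / (tn + fp),  # of actual negatives, share detected
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Assumed counts (NOT reported in the source), chosen to sum to 58.
metrics = classification_metrics(tp=18, fp=5, fn=7, tn=28)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

Under these assumed counts, 7 of the 25 respondents who screened positive would be missed by the model, which is the type of error that the recall-focused tuning aims to reduce.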
Comparison with international studies
The model scores obtained in this research are comparable with international studies which have tried to predict depression using socio-demographic factors.
- A study conducted to predict depression in U.S. adults with hypertension found a model with 77% accuracy (45).
- A study which focused on children developed a model that can predict depression with an accuracy of 82% (46).
- A study conducted on citizens of Bangladesh from different age groups developed a model which had an impressive accuracy of 93% in making predictions (47).
As observed, the test accuracy of 79% obtained for the model developed in this research is comparable with the accuracies reported in these international studies.