Background: This paper explores machine learning algorithms and approaches for predicting alumni income, with the aim of identifying the strongest predictors of income and of membership in a ‘high-earner’ class.
Methods: The study examines alumni survey data from Tecnologico de Monterrey, a multicampus Mexican private university, and analyses it within the cross-industry standard process for data mining (CRISP-DM). The survey yielded 17,898 observations, of which 12,275 remained after cleaning and pre-processing. The dataset includes values for income and a large set of independent variables, including demographic and occupational attributes of the former students and academic attributes from the institution’s history. We conduct an in-depth analysis to determine whether a data science approach can improve on the accuracy of the algorithms traditionally used in econometric research to predict income. Furthermore, we present insights into the patterns found using explainable artificial intelligence techniques.
Results: The gradient boosting model outperformed the parametric models, linear and logistic regression, in predicting alumni’s current income, with statistically significant results (p < 0.05) in three tasks: ordinary least-squares regression, multi-class classification and binary classification. By contrast, linear and logistic regression were the most accurate methods for predicting alumni’s first income, for which the non-parametric models showed no significant improvement.
Conclusion: We found that age, gender, working hours per week, first income after graduation and variables related to the alumni’s job position and firm contributed to explaining their income. The findings indicate a gender wage gap, suggesting that further work is needed to achieve equality.