7.1 Experimental setup
The goal of this research is to construct a novel ensemble learning-based intelligent prediction model to forecast student performance. The proposed strategy combines baseline machine learning models with cross-validation, and the classification results of these classifiers are then fused through ensemble learning. To evaluate the effectiveness of the ensemble learning-based classifiers, experiments were carried out on the datasets described in Table 4 using a Python program built with TensorFlow, Keras, and other related libraries. The experiments ran on an 11th-generation Intel Core i7 machine with two NVIDIA GeForce RTX 3060 Laptop GPUs with 6.0 GB and 7.9 GB of memory, respectively.
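As a quick sanity check of this environment, the following minimal snippet (a sketch assuming TensorFlow 2.x) verifies that the GPUs are visible to the framework before training:

```python
# Minimal environment check, assuming TensorFlow 2.x is installed.
import tensorflow as tf

# List the GPUs visible to TensorFlow; on the machine described above,
# this should report the RTX 3060 Laptop GPU device(s).
gpus = tf.config.list_physical_devices("GPU")
print(f"TensorFlow {tf.__version__}, GPUs detected: {len(gpus)}")
for gpu in gpus:
    print(gpu)
```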
7.2 Sources of Datasets
We used seven datasets to verify the robustness and efficiency of our proposed methods. Details of these datasets are shown in Table 4. The first four datasets are used to assess the whole approach; our focus is on predicting students' grades and the influencing factors for the next upcoming semester.
The remaining datasets—Emotions, Flags, and Stanford Background—used in our investigations come from non-academic fields, including music, audio, and images, but serve as benchmarks for measuring how well our proposed model performs. As previously indicated, it can be challenging to locate suitable academic datasets for this strategy in readily accessible sources. Additionally, none of the related university-level studies that we are aware of released their datasets for experimental replication, most likely due to data privacy restrictions [51].
The statistical properties of the chosen non-academic datasets may be found in [52][53][54]. According to our dataset definition, shown in Fig. 2, these dataset types are suitable. These benchmark datasets comprise multi-label numerical outputs that are abstract representations of varying rates of influential factors, so they can be used for general performance evaluation. They are also acceptable because they offer several crucial properties, including data density, cardinality, and distinct label counts [50][51].
The first three datasets, DS1, DS2, and DS3, are generated datasets built using the characteristics suggested for the output variable in [18–20]. The fourth, AD1, is a real dataset gathered from 12 polytechnic colleges in Karnataka, India; the data were collected using questionnaires. Investigations were conducted in three stages. Stage 1 implements the baseline models. Stage 2 improves the baseline models with 10-fold cross-validation (10-CV). Stage 3 implements the proposed model, a fusion of the baseline models with 10-CV and FA.
Table 4
Description of datasets used in the study
Datasets | Source | Type | Samples | Features | Labels |
DS1 | Generated | Academic | 2000 | 30 (Domestic Factors) | 5 |
DS2 | Generated | Academic | 1000 | 5 (Soft Skill factors) | 5 |
DS3 | Generated | Academic | 3000 | 30 (Individual factors and School factors) | 5 |
AD1 | Real | Academic | 1500 | Combination of important factors from DS1, DS2 and DS3 | 5 |
Emotions | Kaggle | Non-academic | 593 | 72 | 6 |
Flags | Kaggle | Non-academic | 194 | 10 | 7 |
Stanford Background | Kaggle | Non-academic | 2407 | 294 | 6 |
7.3 Implications of Baseline Models
We implemented the nine most widely used machine learning methods: Support Vector Machine, Random Forest, K Nearest Neighbor, Logistic Regression, Artificial Neural Network, Decision Tree, XGBoost, AdaBoost, and Naïve Bayes. The tables report two performance measures: classification accuracy and RMSE. Table 5 shows that Logistic Regression performed worst, while Random Forest produced the best results in terms of both classification accuracy and RMSE, indicating that it may be a good base model.
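As an illustration of Stage 1, the sketch below instantiates the nine baseline classifiers and evaluates their accuracy and RMSE on a synthetic stand-in for DS1. It assumes scikit-learn and xgboost; the hyperparameters shown are library defaults rather than the paper's exact settings.

```python
# Sketch of the Stage 1 baseline evaluation (assumed libraries: scikit-learn,
# xgboost); synthetic data stands in for DS1 (2000 samples, 30 features, 5 labels).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=30, n_classes=5,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

baselines = {
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "K Nearest Neighbor": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Artificial Neural Network": MLPClassifier(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "XGBoost": XGBClassifier(),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}

for name, clf in baselines.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    # RMSE here treats the ordinal grade labels as numeric values.
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f"{name}: accuracy={accuracy_score(y_test, pred):.4f}, RMSE={rmse:.3f}")
```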
Table 5
Comparison of classification algorithms by prediction accuracy on the DS1 dataset
ML Algorithms | Accuracy (Baseline Models) | Accuracy (Baseline Models + 10-CV) | Accuracy (Proposed Model) |
Support Vector Machine | 85.36% | 86.66% | 87.32% |
AdaBoost | 73.50% | 75.34% | 80.50% |
XGBoost | 74.45% | 75.48% | 81.33% |
Random Forest | 85.71% | 87.28% | 91.79% |
Logistic Regression | 60.76% | 72.66% | 80.57% |
Artificial Neural Network | 77.64% | 83.62% | 87.75% |
K Nearest Neighbor | 75.87% | 81.43% | 88.65% |
Decision Tree | 80.65% | 85.34% | 90.45% |
Naïve Bayes | 65.76% | 77.45% | 83.65% |
Table 6
Comparison of classification algorithms by prediction accuracy on the DS2 dataset
ML Algorithms | Accuracy (Baseline Models) | Accuracy (Baseline Models + 10-CV) | Accuracy (Proposed Model) |
Support Vector Machine | 85.96% | 87.76% | 88.46% |
AdaBoost | 74.10% | 77.94% | 81.98% |
XGBoost | 74.89% | 76.41% | 83.67% |
Random Forest | 86.11% | 89.88% | 92.70% |
Logistic Regression | 61.76% | 73.09% | 83.07% |
Artificial Neural Network | 76.68% | 82.12% | 88.75% |
K Nearest Neighbor | 77.34% | 83.33% | 87.65% |
Decision Tree | 83.62% | 86.56% | 87.45% |
Naïve Bayes | 68.16% | 79.56% | 83.65% |
Table 7
Comparison of classification algorithms by prediction accuracy on the DS3 dataset
ML Algorithms | Accuracy (Baseline Models) | Accuracy (Baseline Models + 10-CV) | Accuracy (Proposed Model) |
Support Vector Machine | 86.16% | 88.21% | 91.12% |
AdaBoost | 74.94% | 79.42% | 85.67% |
XGBoost | 76.29% | 80.81% | 87.45% |
Random Forest | 87.15% | 90.63% | 95.89% |
Logistic Regression | 63.36% | 79.09% | 82.33% |
Artificial Neural Network | 78.78% | 87.10% | 90.89% |
K Nearest Neighbor | 79.04% | 85.56% | 89.09% |
Decision Tree | 82.60% | 89.87% | 93.34% |
Naïve Bayes | 70.16% | 77.58% | 84.31% |
7.4 Implications of Baseline Models with k-fold Cross-Validation
In prediction and classification models, the k-fold cross-validation approach is frequently used: the dataset is divided into k folds, with k-1 folds used for training and 1 fold for testing, and the folds are then rotated. Since the method performs best at this split, we performed 10-fold cross-validation in this experiment: in each fold, 10% of the data was used for testing, while 90% was used for training. The average of all assessment criteria is then calculated once all iterations have been completed. The accuracy of SVM increased by about 4%, as shown in Table 11, and the performance of the weak Logistic Regression classifier was markedly enhanced to 80.66%.
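A minimal sketch of this 10-CV procedure (assuming scikit-learn, with synthetic stand-in data) could look like:

```python
# Illustrative 10-fold cross-validation: each fold trains on 90% of the
# data and tests on the remaining 10%, and scores are averaged over folds.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, n_classes=5,
                           n_informative=10, random_state=42)  # stand-in data

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=cv, scoring="accuracy")
print(f"10-CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```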
Table 8
Comparison of classification algorithms by RMSE on the DS1 dataset
ML Algorithms | RMSE (Baseline Models) | RMSE (Baseline Models + 10-CV) | RMSE (Proposed Model) |
Support Vector Machine | 0.701 | 0.691 | 0.410 |
AdaBoost | 1.033 | 0.914 | 0.681 |
XGBoost | 1.134 | 0.721 | 0.521 |
Random Forest | 0.602 | 0.474 | 0.321 |
Logistic Regression | 1.164 | 0.931 | 0.664 |
Artificial Neural Network | 0.876 | 0.603 | 0.489 |
K Nearest Neighbor | 0.908 | 0.827 | 0.532 |
Decision Tree | 0.779 | 0.546 | 0.369 |
Naïve Bayes | 1.023 | 0.943 | 0.787 |
Table 9
Comparison of classification algorithms by RMSE on the DS2 dataset
ML Algorithms | RMSE (Baseline Models) | RMSE (Baseline Models + 10-CV) | RMSE (Proposed Model) |
Support Vector Machine | 0.611 | 0.521 | 0.401 |
AdaBoost | 0.903 | 0.804 | 0.654 |
XGBoost | 1.022 | 0.771 | 0.489 |
Random Forest | 0.634 | 0.454 | 0.309 |
Logistic Regression | 1.055 | 0.841 | 0.598 |
Artificial Neural Network | 0.890 | 0.765 | 0.633 |
K Nearest Neighbor | 0.900 | 0.667 | 0.609 |
Decision Tree | 0.800 | 0.698 | 0.577 |
Naïve Bayes | 0.955 | 0.900 | 0.799 |
Table 10
Comparison of classification algorithms by RMSE on the DS3 dataset
ML Algorithms | RMSE (Baseline Models) | RMSE (Baseline Models + 10-CV) | RMSE (Proposed Model) |
Support Vector Machine | 0.700 | 0.554 | 0.266 |
AdaBoost | 0.890 | 0.676 | 0.356 |
XGBoost | 0.877 | 0.600 | 0.334 |
Random Forest | 0.614 | 0.289 | 0.119 |
Logistic Regression | 0.767 | 0.556 | 0.455 |
Artificial Neural Network | 0.745 | 0.599 | 0.476 |
K Nearest Neighbor | 0.776 | 0.567 | 0.324 |
Decision Tree | 0.696 | 0.435 | 0.235 |
Naïve Bayes | 0.800 | 0.676 | 0.514 |
7.5 Implications of the Proposed Model
We created the proposed fusion models by combining the baseline models with a feature reduction strategy, FA. Feature extraction is one of the effective techniques in classification models for eliminating irrelevant or redundant features, and dimensionality reduction via FA [13][21] can also act as regularization to avoid overfitting and boost model accuracy. A common misconception is that FA selects certain features from the dataset while discarding others; in fact, the algorithm creates a new set of attributes by combining the original ones. Tables 5, 6, 7 and 11 demonstrate how the proposed model helps the classifiers become more accurate (a minimal code sketch of the fusion follows Table 11).
Table 11
Comparison of classification algorithms by prediction accuracy on the AD1 dataset
ML Algorithms | Accuracy (Baseline Models) | Accuracy (Baseline Models + 10-CV) | Accuracy (Proposed Model) |
Support Vector Machine | 87.46% | 91.66% | 93.32% |
AdaBoost | 75.57% | 83.34% | 89.50% |
XGBoost | 77.03% | 84.48% | 91.33% |
Random Forest | 89.73% | 93.28% | 97.79% |
Logistic Regression | 67.56% | 80.66% | 86.57% |
Artificial Neural Network | 83.78% | 88.90% | 93.89% |
K Nearest Neighbor | 79.97% | 83.14% | 92.78% |
Decision Tree | 84.09% | 91.97% | 95.07% |
Naïve Bayes | 73.45% | 81.59% | 90.31% |
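As flagged above, the following is a minimal, hedged sketch of the fusion idea: FA-based dimensionality reduction chained with an ensemble of baseline models under 10-CV. It assumes FA denotes factor analysis and uses a soft-voting combiner purely for illustration, since the paper's exact fusion rule is not reproduced here.

```python
# Hedged sketch of the proposed fusion (assumptions: FA = factor analysis,
# soft voting as the illustrative combiner; scikit-learn throughout).
from sklearn.datasets import make_classification
from sklearn.decomposition import FactorAnalysis
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=1500, n_features=30, n_classes=5,
                           n_informative=10, random_state=42)  # stand-in for AD1

# FA builds new latent attributes by combining the original features,
# rather than selecting a subset of them.
fusion = Pipeline([
    ("fa", FactorAnalysis(n_components=10, random_state=42)),
    ("ensemble", VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(random_state=42)),
            ("svm", SVC(probability=True, random_state=42)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft")),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(fusion, X, y, cv=cv, scoring="accuracy")
print(f"Fusion (FA + ensemble) 10-CV accuracy: {scores.mean():.4f}")
```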
Figures 3, 4, 5 and 9 show how each model performed in terms of accuracy at each stage. We found that the 10-CV improvement in conjunction with FA produces the best results in forecasting student performance. Figures 6, 7, 8 and 10 display the models' RMSE at each stage; the proposed fusion models yield comparatively low RMSE. In this prediction scenario, the hybrid RF algorithm achieved the lowest RMSE, demonstrating its superiority as a predictive model. These findings show that 10-CV enhances the performance of the baseline models and that the proposed fusion models further improve classification performance. The proposed models can therefore be viewed as strong prediction models for resolving classification and prediction problems.
7.6 Analysis of the Proposed Approach in Relation to Existing Approaches
Multiple classifiers have been used in numerous EDM research papers to predict student achievement. [55] recently proposed an ensemble model to recognize at-risk students and give them guidance on managing their learning, combining four ensemble algorithms—bagging, random subspace, multilayer perceptron, and random forest—with four single classifiers; the evaluation revealed that the ensemble model had an accuracy of 91.70%. Another study used logging data to identify at-risk students by estimating their learning success from their learning habits, employing Logistic Regression alongside Random Forest, Multilayer Perceptron, and Gaussian Naive Bayes; the results demonstrated that Random Forest outperformed the other models, including the baseline Logistic Regression, with 89% accuracy [56]. A model to forecast students' achievement based on their daily activities was introduced by [57]. Using data mining techniques, an AA model has been proposed to assess institutional performance based on key performance metrics; the results revealed that, compared with the other machine learning models used in that study, artificial neural networks performed better in terms of accuracy (82.9%) [59]. To improve classifier effectiveness, ensemble approaches including bagging, boosting, and random forests were applied to predict student performance in a learning management system based on behavioral variables, yielding an accuracy of 91.5% [60]. [61] proposed a data mining-based forecast model for students' performance, using a decision tree, logistic regression, naive Bayes tree, artificial neural network, support vector machine, and k-nearest neighbor, with ensemble approaches including bagging, boosting, random forest, and voting applied to increase their output. According to the findings, bagging enhanced the decision tree algorithm's accuracy from 90.4% to 91.4%; similar improvements were reported for RMSE, which went from 0.904 to 0.914, and for precision, which went from 0.905 to 0.914. Their model includes four well-known ensemble approaches—bagging, boosting, stacking, and voting—in addition to nine conventional machine learning algorithms; by integrating boosting and GBT with AdaBoost, the NB model achieved RMSE scores of 0.71 and 0.75, respectively [61]. [62] centered on the use of data mining tools and ensemble approaches to predict students' performance, and also put forth new hybrid classifiers to produce precise forecasts; the findings indicated that the hybrid model surpassed the basic classifiers and ensemble approaches used in the same research in terms of accuracy (81.67%).
Compared with these state-of-the-art ensemble methodologies proposed in EDM, the fusion ensemble-based strategy used in this study achieved the highest accuracy (97.79%) and the lowest RMSE (0.119). The study's findings therefore indicate that the suggested prediction model is reliable; in comparison with previous approaches, which stress how the fusion of ensemble techniques can increase prediction performance, our approach performs better across the board [62].
7.7 Threats to Validity and Research Limitations
We highlight the key risks that might impair the internal and external validity of our method with regard to the viability of employing the proposed hybrid/multi-label classifier models and the validity of the conducted experiments. Threats to internal validity may stem from biases in the design of the studies, while the use of non-real datasets poses a possible threat to external validity [51]. By reporting results averaged over 10 identically configured runs, we have reduced the potential for unexpected biases introduced when configuring our trials. We additionally address this issue by seeding the random number generator in Python, since random initialization of weight coefficients may otherwise lead to different measurements on each run.
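A minimal example of the seeding just described (assuming Python's random module, NumPy, and TensorFlow as the sources of randomness):

```python
# Fix the random seeds so weight initialization and data shuffling are
# reproducible across runs; the seed value itself is arbitrary.
import random
import numpy as np
import tensorflow as tf

SEED = 42  # hypothetical seed; any fixed value serves the same purpose
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
```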
Regarding the generality of our strategy, locating relevant and extensive datasets that cover student outcomes over many years and many degree programs presents a substantial problem for predictive modelling in teaching and learning. We used the best publicly available free datasets we could locate, which introduces sampling bias. Such small datasets naturally do not account for all potential variables that can affect student progress; for instance, despite its indisputable significance, student participation is neither available nor taken into consideration. The datasets show how students in certain majors performed, which may differ from students pursuing other degrees, such as psychology. Additionally, we verified our model using non-academic datasets, which can show patterns different from those seen in actual educational datasets [51].
Academic achievement is a multifaceted phenomenon that may be analyzed from several perspectives. It is uncertain whether the model can predict other measures of academic success, such as scores on standardized tests or the percentage of students who achieve their goals, with the same level of accuracy, particularly when several measures are combined to determine a student's success [51]. This is undoubtedly a problem for future investigation. Furthermore, we have not conducted additional testing on alternative datasets to verify the model's dependability.
As a result, we cannot assert the universality of our technique, and it may be necessary to adjust the hybrid/multi-label classifier models to account for different educational contexts and datasets. Future research will use authentic academic datasets to investigate the highlighted constraints in further depth.