The implementation of our proposed recommender system comprises five phases: mining data from the standard OULAD database, data preprocessing, constructing a model that combines deep learning networks (LSTM, MLP, GRU, and BiLSTM) improved with the attention technique, weighting the parameters, and training. Finally, the trained network is used to recommend resources to users.
The proposed model (Fig. 1) consists of two independent blocks that process long-term and short-term interests. Dataset records are divided into short-term and long-term sections along a specific time axis. Separating user interests in this way, and valuing them differently according to their position on the time axis, increases the system's accuracy without introducing excessive error and allows suitable, personalized resources to be suggested according to students' needs and tastes. Short-term interests play a more effective role than long-term interests in recommending educational resources.
The compression algorithm presented in this treatise is then applied to the long-term part, and the data is compressed in both the row and the column dimensions. For users in the database who have had no activity in the short-term period, we treat their last activity in the long-term section as their activity in the short-term section. After the model is designed and built, training begins. At this stage, for each user, the first record of his activity in the short-term section is repeated for all of his activities in the long-term section, and this is done for every record in the short-term section.
Finally, the remaining records of the database are used as the test set to measure the model's loss and accuracy.
2.1. Data Preprocessing
The steps in the preprocessing phase can be seen in Fig. 2. First, we extract the provided resources, student features, courses held, and student performance and evaluation in each course from the OULAD standard database. After combining these tables, we categorize and map the features, delete empty or incorrect data, and normalize the features.
Dataset
As input for the training and testing phases of our proposed model, we have used the Open University Learning Analytics Dataset (OULAD), a standard learning-analytics database stored in CSV format [47, 45].
The database contains sample data about students, including demographic data, the courses they attended, their study activities during each course, and the final result of each course. It covers more than 30,000 students interacting with the virtual learning environment (VLE) across 22 courses, where each course indicates the learning subject and the set of sessions that ended with a test. The interaction data is collected as a daily summary of student clicks on the various resources. In OULAD, the tables are linked to each person through unique identifiers. The files used are briefly described below:
Assessments
Each course includes several assessments and one final exam; their data is available in this file.
Student Info
This file holds data about students' demographics along with their results. In addition, each student can have several records.
Student Vle
Includes students' clicks and interactions with the resources available in the VLE, which can be in HTML, PDF, and other formats.
Student Assessment
Keeps each student's results for the assessments made during the course.
For more information, see [44, 45].
After merging the data, we normalize the features to the range [0, 1] using Formula 1.
$${x}^{*}=\frac{x-{x}_{min}}{{x}_{max}-{x}_{min}}$$
1
where \({x}_{min}\) is the minimum value of the feature, \({x}_{max}\) is its maximum value, \({x}^{*}\) is the normalized value, and \(x\) is the original data. Converting the string values in the database to numeric values is another step of preprocessing. Table 1 shows a sample of the values in the database, and Table 2 shows how these values are mapped to numbers.
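As an illustration, the min-max scaling of Formula 1 and the string-to-number mapping of Table 2 could be implemented as follows. This is a minimal sketch using pandas; the column names and mapping values are taken from Tables 1 and 2, but the helper function names are ours.

```python
import pandas as pd

def min_max_normalize(series: pd.Series) -> pd.Series:
    """Formula 1: x* = (x - x_min) / (x_max - x_min), scaling values to [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())

# Example categorical-to-numeric mappings taken from Table 2.
gender_map = {"F": 0.1, "M": 0.2}
final_result_map = {"Distinction": 0.1, "Fail": 0.2, "Pass": 0.3, "Withdrawn": 0.4}

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Map string-valued features to numbers.
    df["gender"] = df["gender"].map(gender_map)
    df["final_result"] = df["final_result"].map(final_result_map)
    # Normalize continuous features such as the mean assessment score and click counts.
    for col in ("score_mean", "sum_click"):
        df[col] = min_max_normalize(df[col])
    # Drop empty or incorrectly mapped rows.
    return df.dropna()
```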
Table 1
A sample of database data before mapping
code_module | code_presentation | id_student | gender | highest_education | age_band | final_result | score_mean | id_site | date | sum_click
AAA | 2013J | 11391 | M | HE Qualification | 55<= | Pass | 82 | 546669 | -5 | 16
AAA | 2013J | 11391 | M | HE Qualification | 55<= | Pass | 82 | 546662 | -5 | 44
AAA | 2013J | 11391 | M | HE Qualification | 55<= | Pass | 82 | 546652 | -5 | 1
AAA | 2013J | 11391 | M | HE Qualification | 55<= | Pass | 82 | 546668 | -5 | 2
AAA | 2013J | 11391 | M | HE Qualification | 55<= | Pass | 82 | 546652 | -5 | 1
AAA | 2013J | 11391 | M | HE Qualification | 55<= | Pass | 82 | 546670 | -7 | 2
AAA | 2013J | 11391 | M | HE Qualification | 55<= | Pass | 82 | 546671 | -7 | 2
Table 2
The values of features mapped to the number
code_module | mapped | code_presentation | mapped | age_band | mapped | gender | mapped | highest_education | mapped | final_result | mapped
AAA | 0.1 | 2013B | 540 | 0–35 | 0.1 | F | 0.1 | A Level or Equivalent | 0.1 | Distinction | 0.1
BBB | 0.2 | 2013J | 720 | 35–55 | 0.2 | M | 0.2 | HE Qualification | 0.2 | Fail | 0.2
CCC | 0.3 | 2014B | 180 | 55<= | 0.3 | - | - | Lower Than A Level | 0.3 | Pass | 0.3
DDD | 0.4 | 2014J | 360 | - | - | - | - | No Formal quals | 0.4 | Withdrawn | 0.4
EEE | 0.5 | - | - | - | - | - | - | Post Graduate Qualification | 0.5 | - | -
FFF | 0.6 | - | - | - | - | - | - | - | - | - | -
GGG | 0.7 | - | - | - | - | - | - | - | - | - | -
2.2. Records Labeling
Since the available data do not have a defined label, in this research, we have labeled the records as follows:
For each student, from the set of activities recorded in the student's common courses, we take the course in which he obtained the highest assessment score (which can reflect the effect of the resources studied) and choose the resource with the most clicks in that course (which can indicate the student's taste and interest) as the label.
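One possible realization of this labeling rule, sketched in pandas, is shown below; the column names follow Table 1, and since the paper does not specify tie-breaking or the exact aggregation, this is only an illustration.

```python
import pandas as pd

def label_records(df: pd.DataFrame) -> pd.Series:
    """For each student, pick the course with the highest mean assessment score,
    then use the most-clicked resource (id_site) in that course as the label."""
    labels = {}
    for student, rows in df.groupby("id_student"):
        # Course (code_module, code_presentation) with the highest mean score.
        best_course = (rows.groupby(["code_module", "code_presentation"])["score_mean"]
                           .mean()
                           .idxmax())
        course_rows = rows[(rows["code_module"] == best_course[0]) &
                           (rows["code_presentation"] == best_course[1])]
        # Resource with the largest total number of clicks in that course.
        labels[student] = course_rows.groupby("id_site")["sum_click"].sum().idxmax()
    return pd.Series(labels, name="label")
```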
The result of data labeling was the separation of sources into 562 categories. Their frequency can be seen in Fig. 3. At the end of the preprocessing phase, the data is divided into training, test, and validation sets.
Investigating the correlation between variables and labels
To investigate the possible correlation between the label and the variables used as model inputs, we applied a correlation test; as shown in Fig. 4, there is no significant correlation between the variables and the label.
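A minimal sketch of such a check with pandas is given below; it assumes the label has already been mapped to a numeric code, and the interpretation threshold mentioned in the comment is ours, not the paper's.

```python
import pandas as pd

def label_correlations(df: pd.DataFrame, label_col: str = "label") -> pd.Series:
    """Pearson correlation between each numeric input variable and the label."""
    corr = df.corr(numeric_only=True)[label_col].drop(label_col)
    # Sort by absolute correlation; values well below e.g. |r| = 0.3 would be
    # considered weakly correlated with the label.
    return corr.sort_values(key=abs, ascending=False)
```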
2.3.1. Symbols
In this research, our main goal is to extract users' interests and priorities by considering user-item interactions; for example, each click on a learning resource is treated as a user action. Let \(U\) and \(V\) denote the sets of users and items, respectively. Each user \(u\in U\) has a sequence of consecutive time windows \({W}^{u}=\{{w}_{1}^{u},{w}_{2}^{u},\dots ,{w}_{t}^{u}\}\), where \(t\) is the total number of time windows. Each window consists of smaller time units, \({w}_{t}^{u}=\{{d}_{1},{d}_{2},\dots ,{d}_{x}\}\), where \(x\) is the length of the window, and \({w}_{t}^{u}\) also denotes the items related to user \(u\) in window \(t\). Each time window contains a number of events \(\{{e}_{t,i}^{u}\in {R}^{m}\mid i=1,2,\dots ,|{w}_{t}^{u}|\}\), where \({e}_{t,i}^{u}\) describes event \(i\) within the time units \(({d}_{x})\) of the window. In each event, user \(u\) interacts with resources \({v}_{i}\in V\).
For a time step \(t\), the session \({S}_{t}^{u}\) represents the user's short-term interests at time \(t\), and the sessions before time step \(t\) represent the user's long-term interests, defined as \({L}_{t-1}^{u}={S}_{1}^{u}\cup {S}_{2}^{u}\cup \dots \cup {S}_{t-1}^{u}\). Our goal is to predict the next learning resource \({e}_{t,i+1}^{u}\) in the session \({S}_{t}^{u}\).
1. Short-term Interests Layer:
The task of the attention-based short-term layer is to generate recommendations that take the user's long-term interests into account while processing the sequence of learning resources in the current session to extract short-term interests. The user's short-term interests are essential for recommending the next resource, and they vary across different categories of educational resources.
In many past works, researchers have treated short-term interests as a fixed feature and therefore assigned the same weight to all items; as a result, the diversity of short-term interests has not been properly captured. In the proposed architecture, the attention technique assigns weights to both the long-term and the short-term sessions, so the characteristics of user \(u\) are fully taken into account. In addition, a bidirectional LSTM is used so that the prediction can look at both the past and the future and remain sensitive to variations in the user's short-term interests in both directions; in this way, learners' behavioral changes are covered.
To discover these features, we propose a module based on bidirectional LSTM networks that extracts periodic features and captures the temporal dependence of the input features.
In this research, we have evaluated single-layer and double-layer models.
The input of length 12, \(\text{I}=\{{i}_{1},{i}_{2},\dots ,{i}_{12}\}\), is fed into the proposed model, and the output \({y}_{t}\) is calculated according to Formula 2.
$${h}_{t}^{f}=\text{t}\text{a}\text{n}\text{h}({W}_{xh}^{f}{x}_{t}+{W}_{hh}^{f}{h}_{t-1}^{f}+{b}_{h}^{f})$$
$${h}_{t}^{b}=\text{t}\text{a}\text{n}\text{h}\left({W}_{xh}^{b}{x}_{t}+{W}_{hh}^{b}{h}_{t+1}^{b}+{b}_{h}^{b}\right)$$
2
$${y}_{t}={W}_{hy}^{f}{h}_{t}^{f}+{W}_{hy}^{b}{h}_{t}^{b}+{b}_{y}$$
As shown in Table 6, the results obtained from the two-layer BiLSTM network, in which the output \({y}_{t}\) of the lower layer is fed as the input of the upper layer, are more favorable.
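A Keras sketch of such a stacked two-layer BiLSTM for the short-term branch is shown below; the layer size and feature dimension are illustrative assumptions, since the paper does not fix them at this point.

```python
from tensorflow.keras import layers, models

def build_short_term_encoder(seq_len: int = 12, n_features: int = 1, units: int = 64):
    """Stacked (two-layer) BiLSTM: the sequence output y_t of the lower layer
    is fed as the input of the upper layer (Formula 2)."""
    inputs = layers.Input(shape=(seq_len, n_features))
    x = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(inputs)
    x = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(x)
    return models.Model(inputs, x, name="short_term_bilstm")
```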
The main idea of the attention technique is to learn to assign accurate (normalized) weights to a set of features, so that higher weights indicate that the corresponding feature carries more important information for the given task (Fig. 5).
Attention techniques fall into two categories according to how the attention scores are calculated: 1) standard (vanilla) attention and 2) collaborative attention. Vanilla attention uses a parameterized context vector, while collaborative attention learns attention weights from two sequences. In this treatise, the first method is used (Formula 3 [163]).
$${u}_{t}=\text{t}\text{a}\text{n}\text{h}(\text{W}{ĥ}_{t}+\text{b})$$
$${\alpha }_{t}= \frac{\text{e}\text{x}\text{p}\left({\text{u}}_{t}^{T}u\right)}{\sum _{t}\text{e}\text{x}\text{p}\left({\text{u}}_{t}^{T}u\right)}$$
3
$$v=\sum _{t}{\alpha }_{t}{ĥ}_{t}$$
\({u}_{t}\) : the vector that scores (values) the features
\({\alpha }_{t}\) : the normalized feature weights obtained with the softmax function
\(v\) : the context vector that summarizes the input information, computed as the weighted sum of the \({ĥ}_{t}\) with \({\alpha }_{t}\) as the corresponding weights
The vector \(v\) is then fed into a fully connected layer with softmax activation to perform the final classification. The recommendation output is a vector \(y\in {R}^{2}\) holding the probabilities of the significant and non-significant classes; using argmax, we select the class with the highest probability as the model's recommendation.
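Formula 3 corresponds to standard additive (vanilla) attention; a hedged Keras implementation might look as follows. The layer name and variable names are ours, and the layer simply realizes the three equations above.

```python
import tensorflow as tf
from tensorflow.keras import layers

class VanillaAttention(layers.Layer):
    """u_t = tanh(W h_t + b); alpha_t = softmax(u_t^T u); v = sum_t alpha_t h_t (Formula 3)."""

    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.W = self.add_weight(name="W", shape=(dim, dim), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(dim,), initializer="zeros")
        self.u = self.add_weight(name="u", shape=(dim, 1), initializer="glorot_uniform")

    def call(self, h):                                        # h: (batch, time, dim)
        u_t = tf.tanh(tf.tensordot(h, self.W, axes=1) + self.b)
        scores = tf.tensordot(u_t, self.u, axes=1)            # (batch, time, 1)
        alpha = tf.nn.softmax(scores, axis=1)                 # normalized weights alpha_t
        return tf.reduce_sum(alpha * h, axis=1)               # context vector v
```

The resulting vector \(v\) is what the fully connected softmax layer described above receives for the final classification.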
2. Long-term Interests Layer:
The task of the long-term layer is to supply the user's long-term interest information to the attention-based short-term layer before that layer processes the current session. The user's last k sessions are considered first. The attention technique calculates the importance of each item in the set of short-term items for a specific user, while the compression algorithm, in turn, assigns weights to the resources in different time windows according to a defined policy. This information is fed into the system to represent the user's interests and priorities.
In the long-term interests layer, we apply compression in both the row and the column dimensions because of the large volume of records belonging to each student. The main innovation of the long-term interests section lies in this compression phase of users' long-term data.
After applying compression to the long-term data with a window length of 7, the data shrinks from 12 × 10,543,682 to 11 × 5,167,599 and is then fed into a two-layer multi-cell GRU network, whose output \({y}_{t}\) is calculated according to Formula 4.
$${z}_{t}=\sigma \left({W}_{z}\cdot \left[{h}_{t-1},{x}_{t}\right]\right)$$
$${r}_{t}=\sigma \left({W}_{r}\cdot \left[{h}_{t-1},{x}_{t}\right]\right)$$
4
$${ĥ}_{t}=\text{tanh}\left(W\cdot \left[{r}_{t}*{h}_{t-1},{x}_{t}\right]\right)$$
$${h}_{t}=\left(1-{z}_{t}\right)*{h}_{t-1}+{z}_{t}*{ĥ}_{t}$$
The outputs of the short-term and long-term branches are concatenated and fed into an MLP layer with the ReLU activation function. After applying dropout = 0.25, the result is passed to a final MLP layer with the softmax activation function. The cost function used in this structure is categorical cross-entropy. Because of the large volume of input data, training uses mini-batches of size 1028.
To achieve better feature extraction, the model parameters in each layer were adjusted many times, and the model was repeatedly trained, tested, and evaluated. In the final structure, a learning rate of 0.0001 and ReLU activation functions in the MLP layers worked best. The dropout layer is used to counteract the drawbacks of fully connected layers and prevent overfitting of the network; its value was likewise obtained by trying different settings and retraining and testing the model.
In similar problems where there are more than two classes, the softmax activation function and the categorical cross-entropy cost function are used in the output layer of the model.
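Putting the two branches together, the fusion described above (concatenation, an MLP with ReLU, dropout of 0.25, a softmax output, categorical cross-entropy, a learning rate of 0.0001, and mini-batches of 1028) could be sketched in Keras as follows. The two-layer GRU encodes the long-term branch per Formula 4, the VanillaAttention layer is the one sketched earlier, and the hidden sizes and class count are illustrative assumptions.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam

def build_recommender(short_len=12, long_len=11, n_features=1, n_classes=562):
    # Short-term branch: stacked BiLSTM followed by the attention layer (Formulas 2-3).
    short_in = layers.Input(shape=(short_len, n_features))
    s = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(short_in)
    s = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(s)
    s = VanillaAttention()(s)  # defined in the earlier sketch

    # Long-term branch: two-layer GRU over the compressed sequence (Formula 4).
    long_in = layers.Input(shape=(long_len, n_features))
    l = layers.GRU(64, return_sequences=True)(long_in)
    l = layers.GRU(64)(l)

    # Fusion: concatenate, MLP with ReLU, dropout 0.25, softmax classifier.
    x = layers.Concatenate()([s, l])
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.25)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)

    model = models.Model([short_in, long_in], out)
    model.compile(optimizer=Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training would then use mini-batches of the size reported in the paper:
# model.fit([x_short, x_long], y, batch_size=1028, epochs=...)
```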
1- Compression at the Feature Level
To select the best features for a supervised learning model, "supervised feature selection methods" are available; these algorithms use labeled data to choose the subset of features that yields the best performance of the supervised model. In the absence of labeled data, "unsupervised feature selection methods" are used instead; they score features according to criteria such as variance, entropy, and the ability of a feature to preserve the data structure related to local similarities, among others.
In this research, given the nature of the network's input data and the useful properties of the correlation-matrix approach, the correlation matrix has been used as the feature selection algorithm.
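A minimal sketch of correlation-matrix-based feature selection is given below; the 0.9 threshold and the drop-one-of-each-pair policy are assumptions, since the paper does not state them.

```python
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute Pearson correlation
    exceeds the threshold, keeping the rest of the columns unchanged."""
    corr = df.corr(numeric_only=True).abs()
    to_drop = set()
    cols = corr.columns
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > threshold and b not in to_drop:
                to_drop.add(b)
    return df.drop(columns=list(to_drop))
```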
2- Compression at the Record Level
A lesser-known way to reduce the amount of data is to compress rows. In our proposed method (Fig. 6), none of the rows that record user interaction with the learning environment is deleted. Moreover, to improve the recommender's results, within the time vector of user behavior we value an action more the closer it lies to the present.
We consider one window (or a limited number of windows) as a background and treat user interactions in the other time windows as objects moving in the foreground. In addition, we assign weights to the time intervals so that each user action gains a weight proportional to its position in the relevant time window.
Very distant time windows receive lower weights and very recent time windows receive higher weights. The repetition of features within the studied period also matters: each weighted feature is assigned a new weight according to how many times it is repeated within the background. In the end, each feature appears only once in the session used as the background, and in the final compressed table a new field holds the frequency of the merged feature.
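Since the exact weighting policy is described only at a high level, the following pandas sketch shows one way such row compression could work: interactions with the same resource inside each 7-day window are merged, a frequency field records how many rows were merged, and a recency weight grows toward the present. The window length follows the paper; everything else is an assumption.

```python
import pandas as pd

def compress_records(df: pd.DataFrame, window_len: int = 7) -> pd.DataFrame:
    """Merge a user's interactions with the same resource inside each time window.

    Each merged row keeps the total clicks, a 'frequency' field with the number
    of merged rows, and a recency weight that increases for windows closer to
    the present."""
    df = df.copy()
    df["window"] = df["date"] // window_len
    grouped = (df.groupby(["id_student", "window", "id_site"], as_index=False)
                 .agg(sum_click=("sum_click", "sum"),
                      frequency=("id_site", "size")))
    # Recency weight: later windows (closer to the present) get larger weights.
    w_min, w_max = grouped["window"].min(), grouped["window"].max()
    grouped["recency_weight"] = (grouped["window"] - w_min + 1) / (w_max - w_min + 1)
    return grouped
```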
2. Methods And Tools Of Data Analysis
We seek to predict the best educational resources; to evaluate the performance of the proposed method, the following criteria are used:
$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
5
Accuracy indicates what percentage of the test records are correctly classified.
$$Precision=\frac{\sum _{x\in X}|R\left(x\right)\cap H\left(x\right)|}{\sum _{x\in X}\left|R\left(x\right)\right|}$$
6
$$Recall=\frac{\sum _{x\in X}|R\left(x\right)\cap H\left(x\right)|}{\sum _{x\in X}\left|H\left(x\right)\right|}$$
7
$$F1=\frac{2*Precision* Recall}{Precision+Recall}$$
8
Here, \(x\) is a student from the set of all students \(X\), \(R(x)\) denotes the learning resources recommended to student \(x\), and \(H(x)\) denotes the learning resources actually viewed by learner \(x\) [53].
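For completeness, Formulas 6-8 can be computed directly from the recommended and observed resource sets; a small sketch follows, where the function and argument names are ours.

```python
def precision_recall_f1(recommended: dict, observed: dict):
    """Formulas 6-8: set-based precision, recall, and F1 over all students.

    recommended[x] and observed[x] are the sets R(x) and H(x) for student x."""
    hits = sum(len(recommended[x] & observed[x]) for x in recommended)
    rec_total = sum(len(recommended[x]) for x in recommended)
    obs_total = sum(len(observed[x]) for x in recommended)
    precision = hits / rec_total if rec_total else 0.0
    recall = hits / obs_total if obs_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```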