This section presents the implementation and testing details of the proposed word-embedding-based query expansion. The first part of the implementation concerns the word embedding training dataset: two word embedding models, SkipGram [9] and GloVe [12], are trained using two sentence forms. The first form includes all the words in the dataset, and the second includes only the AL-Definite words (Arabic words with the prefix 'AL'). This part aims to test the effect of using a representative subset of the dataset against the whole dataset, as explained in section 4.1.
The second part of the implementation and testing concerns the application of the PRFQE schemes explained in section 4.2. The word embedding models were trained using the TREC-2001/2002 collection, which includes 383,783 Agence France-Presse (AFP) news stories. Four options of training text were used, combining two dataset options (whole dataset, representative subset) with two preprocessing options (stemming, no stemming), as presented in Table-1.
After dividing the TREC dataset into sentences, the number of sentences formed using all words of the dataset is 3,118,047, while using the AL-Definite words as a representative subset gives 2,956,145 sentences. Although the number of AL-Definite sentences is about 94.8% of the number of whole-dataset sentences, all documents are still represented by AL-Definite sentences, and the sentences themselves are much shorter: the text size of the AL-Definite sentences is about 30% of that of the complete-dataset sentences. These statistics are presented in figure-3.
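The construction of the representative-subset sentences can be sketched as follows. This is a minimal illustration, assuming the AL-Definite test is a plain string check for the prefix 'ال' (AL); the paper's actual tokenization and preprocessing may differ.

```python
def al_definite_words(sentence):
    """Keep only the AL-Definite words: words starting with the prefix 'ال' (AL)."""
    return [w for w in sentence.split() if w.startswith("ال")]

def build_subset_sentences(sentences):
    """Form the representative-subset training sentences: for each original
    sentence, keep only the AL-Definite words; drop sentences that have none."""
    subset = []
    for s in sentences:
        words = al_definite_words(s)
        if words:
            subset.append(" ".join(words))
    return subset

sentences = ["الولد قرأ الكتاب في المدرسة", "ذهب سريعا"]
# The second sentence contains no AL-Definite words, so it is dropped,
# which is why the subset has fewer (and much shorter) sentences.
print(build_subset_sentences(sentences))
```

Dropping sentences with no AL-Definite words is what reduces the sentence count (to ~94.8%) while shrinking the text size far more (to ~30%), since each kept sentence retains only a fraction of its words.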
The following settings of the SkipGram model were used: the minimum word frequency is ten, as in [17] and [20], the word vector size is 300, and the window size is 5. The settings of the GloVe model were: window size = 4, iterations = 10, and vector size = 5. The most similar words to each word are determined and stored in an array.
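The "most similar words" lookup can be sketched with plain cosine similarity over toy vectors. The real models are trained with word2vec/GloVe toolkits and produce 300-dimensional vectors; the 3-dimensional vectors and the English words below are purely illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def most_similar(word, vectors, topn=3):
    """Return the topn closest words to `word` by cosine similarity."""
    sims = [(other, cosine(vectors[word], vec))
            for other, vec in vectors.items() if other != word]
    sims.sort(key=lambda p: p[1], reverse=True)
    return [w for w, _ in sims[:topn]]

# Toy vectors (assumed for illustration; real vectors are 300-dimensional).
vectors = {
    "bank":  [0.9, 0.1, 0.0],
    "money": [0.8, 0.2, 0.1],
    "river": [0.1, 0.9, 0.2],
    "water": [0.0, 0.8, 0.3],
}
# Precompute and store the closest words for each vocabulary word.
similar = {w: most_similar(w, vectors, topn=2) for w in vectors}
print(similar["bank"])  # the two nearest neighbours of "bank"
```

Precomputing these neighbour lists once, as the paper describes, avoids repeating the similarity search for every query.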
The trained word embedding models were used to test different scenarios of PRFQE: the effect of stemming on model training; the effect of the indexing method; the effect of the number of pseudo-relevant documents; the effect of the number of words selected from each document (extended words); the effect of the method by which the top-m words are selected, i.e., selecting the top-m words from each document against selecting the top m×d words from the combination of all documents; the option of applying word embedding to only the original query against applying it to the expanded query; and finally, the effect of using a larger dataset to train the word embedding models, tested by adding sentences from the Watan-2004 [37] dataset to the sentences obtained from the TREC dataset. These experiments are described in Table-2. The methods described in Table-2 are tested on the four options described in Table-1 and for the two word embedding models (SkipGram and GloVe). The embedded words for each query word were stored in tables for each training method; a maximum of three closest embedded words was used, since this has been found to be a satisfactory word-context size, as in [11].
The PRFQE methods were applied using a Java application that implements the indexing and search system, where each expansion method was iterated to expand using two to five pseudo-relevant documents (the parameter d is varied from 2 to 5).
For each number of documents, the number of expansion words was varied from 10 to 20 (the parameter m is varied from 10 to 20). All of the word embedding experiments used the two closest embedded words (the parameter t = 2), since t = 2 was empirically found to be the best value for this parameter. The experiments were applied to the 75 topics (queries) of TREC-2001/2002. Each iteration was evaluated for the Mean Average Precision (MAP), the precision at the tenth retrieved document (P10), and the R-Precision, which is the precision at the Rth returned document, where R is the actual number of documents relevant to each query according to the TREC relevance judgments.
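The three evaluation metrics above follow standard TREC definitions and can be sketched as:

```python
def precision_at(k, ranked, relevant):
    """Precision over the top-k returned documents."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at(len(relevant), ranked, relevant)

def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant document
    is returned, divided by the total number of relevant documents.
    MAP is the mean of this value over all queries."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

# Toy ranked list and relevance judgments (illustrative, not TREC data).
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
print(precision_at(3, ranked, relevant))   # P@3
print(r_precision(ranked, relevant))       # R = 3 here, so same as P@3
print(average_precision(ranked, relevant))
```

P10 in the paper is `precision_at(10, ...)`, and the reported MAP averages `average_precision` over the 75 TREC topics.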
The indexing and search system was implemented such that stop words were removed and words were normalized by applying the Light10 stemmer. The title and the description of each topic are used to form a query, and the retrieval is further enhanced by giving the title words of each topic double the weight of the description words.
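The title/description weighting can be sketched as a term-weight map. This is an assumed additive scheme for illustration; `build_weighted_query` is a hypothetical helper, and the paper's search system may combine the fields differently.

```python
def build_weighted_query(title, description):
    """Combine a topic's title and description into one term-weight map,
    giving title terms double the weight of description terms (assumed
    additive scheme: description terms weigh 1.0, title terms 2.0)."""
    weights = {}
    for term in description.split():
        weights[term] = weights.get(term, 0.0) + 1.0
    for term in title.split():
        weights[term] = weights.get(term, 0.0) + 2.0
    return weights

q = build_weighted_query("solar energy", "use of solar panels for energy")
print(q["solar"])   # appears in both fields: 1.0 + 2.0 = 3.0
print(q["panels"])  # description only: 1.0
```

In a real system, stop-word removal and Light10 stemming would be applied to both fields before weighting.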
5.1 Results and discussion
The main objective of this study is to shorten the training time of the word embedding model by using less text while preserving the retrieval performance of the word-embedded PRFQE. To test the achievement of this objective, we compared the learning rates of training the word embedding model using all words of the dataset against using just the AL-Definite words; the result is illustrated in figure-4(a). There is only a minor change in the learning rate between using the representative subset (AL-Definite words) and all of the words. Moreover, the rate of vocabulary construction (for the SkipGram, as an example) trained using all words (ALLNOSTEM) and using only AL-Definite words with stemming (ALSTEM) is illustrated in figure-4(b), which shows a wide gap between the word vectorizing rates of the two cases; i.e., the word vectorizing rate using the AL-Definite words as a representative subset is much faster than that using the whole dataset. For example, training the model with ALSTEM gives an average of 83% enhancement over the rate in the case of ALLNOSTEM.
Figure-4: Comparing the learning rate and the word vectorization rate for the two methods ALSTEM and ALLNOSTEM
The training time is reduced to about 10% of the time needed to train the model using all words of the dataset, and the word vectorization rate improves accordingly, because of the big difference in text size between the two methods, as illustrated earlier in figure-3. These results show the effectiveness of using the AL-Definite words as a representative subset for training the word embedding models, provided that this large gain in learning and word vectorizing rates does not come at the cost of the word-embedded PRFQE's retrieval results.
The following experiments were designed to examine the retrieval effectiveness of each training criterion using two different document indexing methods: All-Words, in which each document is indexed by all of its words, and AL-Before, in which each document is indexed only by the AL-Definite words and the words before them, as explained in [35]. The experiments are evaluated at different recall levels, and the results are presented using three diagrams: one for the MAP, one for the P10, and one for the R-Precision.
The first experiment determines the effect of word embedding on retrieval by comparing PRFQE with and without word embedding, and examines the importance of training the word embedding model using the AL-Definite words as a representative subset instead of all words. In this experiment, two options for training the SkipGram model are used: the first trains the model on all words of the dataset, and the other uses only the AL-Definite words as training text. Using the terminology of Table-1 and Table-2, this experiment compares SX and ESX (using ALLNOSTEM and ALNOSTEM).
The indexing method used is AL-Before, and the results are presented in figure-5. Each label on the x-axis of this figure represents one expansion setting: labels from '1' up to '12' mean expanding by two documents, labels from '12' up to '23' by three documents, labels from '23' up to '34' by four documents, and labels from '34' to the end of the axis by five documents; within each group, the number of extended words varies from 10 to 20. This description of the x-axis is the same for all of the following figures.
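The x-axis described above enumerates the (d, m) parameter grid; a small sketch makes the label layout explicit:

```python
# Enumerate the x-axis points of the result figures: for each number of
# pseudo-relevant documents d (2..5), the number of expansion words m
# varies from 10 to 20, giving 4 * 11 = 44 labelled points.
grid = [(d, m) for d in range(2, 6) for m in range(10, 21)]
print(len(grid))   # 44 points on the x-axis
print(grid[0])     # label '1'  -> (2, 10): d = 2, m = 10
print(grid[11])    # label '12' -> (3, 10): the d = 3 block starts here
print(grid[33])    # label '34' -> (5, 10): the d = 5 block starts here
```

Each block of eleven consecutive labels therefore shares one value of d while m sweeps from 10 to 20.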
It can be observed that the maximum MAP of the three curves (figure-5(a)) is obtained at the horizontal-axis points representing d = 3 and m = 13; i.e., extending by 13 words from each of the top three ranked documents. The two training options show similar maximum MAP (figure-5(a)), and the maximum P10 is the same as well, obtained at d = 5 (figure-5(b)); however, the R-Precision of ESX + ALNOSTEM is better than that of ESX + ALLNOSTEM, obtained at d = 3, m = 13 (figure-5(c)). Moreover, using word embedding (ESX) shows a clear enhancement at the three recall levels over PRFQE without word embedding (SX). The similar results obtained by training the word embedding model on all of the words (ALLNOSTEM) and on only the AL-Definite words (ALNOSTEM) confirm that the AL-Definite words are representative of the whole dataset, and the stability of the results at the three recall levels shows that training on this subset preserves the retrieval performance.
The maximum P10 being obtained by extending the original query by five documents and 11 to 14 words per document can be explained as follows: the probability that a top-10 document is relevant can be estimated by the P10 without query expansion, which is about 54% as presented in [35], so the expansion is likely to bring more relevant documents into the top returned documents. For example, P@1 (the precision at the first retrieved document) was 60.7% without query expansion and 61.56% after word-embedded query expansion, meaning that word embedding gives the first retrieved document a better chance of being relevant.
The following experiment examines the effect of stemming on training the word embedding model: the system is fed with embedded words resulting from training the GloVe model using the AL-Definite words without stemming (ALNOSTEM) and with Light10 stemming (ALSTEM); the results are illustrated in figure-6. The results show a slight MAP enhancement for ALNOSTEM over the stemmed text ALSTEM (figure-6(a)); this slight difference is justified because ALNOSTEM shows better results at the higher recall levels used to calculate the MAP, which is consistent with the results of other researchers, as in [5] and [17]. On the other hand, ALSTEM shows better results at lower recall levels (at P10), where the maximum P10 is obtained by extending the original query by five documents, as shown in figure-6(b), which can be explained as in the previous experiment. It is worth noting that the number of words of ALSTEM is about 60% of that of ALLNOSTEM, which makes the training time shorter, so a trade-off between the MAP and the training time could give more implementation options.
The maximum R-Precision of ALNOSTEM is higher than that of ALSTEM, as in figure-6(c); it is obtained by extending the original query by the words of the top-3 returned documents, while the maximum R-Precision of ALSTEM is obtained by extending the query by the words of the top-5 documents.
Using different word embedding models with PRFQE shows insignificant differences in the MAP results; for example, the results of the SkipGram and the GloVe for the same query expansion scheme are presented in figure-7(a). This result is consistent with the results reported by other researchers, such as [22]. However, there are some differences in P10, as in figure-7(b). The slight P10 advantage shown by the SkipGram over the GloVe can be explained by the fact that GloVe determines the similarity between words based on global co-occurrence statistics within the dataset, whereas the SkipGram determines this similarity based on local co-occurrence within a context window. Hence, the results become closer at higher recall levels, such as the R-Precision, as presented in figure-7(c).
The following experiments examine different options for applying word embedding to the PRFQE. The first experiment in this context compares applying the word embedding to the original query, i.e., before PRFQE (ESX), which is the first scheme described in section 4.2, against applying the word embedding to the expanded query (SXE), the second scheme described in section 4.2. The results presented in figure-8(a) show that embedding the original query (the first scheme) gives better MAP than embedding the pseudo-relevance expanded query (the second scheme), and the P10 of ESX is better than that of SXE, as presented in figure-8(b). This result satisfies the last assumption (point 4) listed at the end of section 4.2. It can be explained as follows: the expanded query (in SXE) has words added from the pseudo-relevant documents, which could themselves have a different meaning than the original query words, so the embedded words similar to these added words are also likely to differ in meaning from the original query words. A closer look at figure-8(a) shows that as the number of extended words increases, the difference in the MAP becomes more pronounced, indicating that as more words are added, they carry a different meaning than the words of the original query.
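The ESX/SXE distinction is purely one of operation order, which a short sketch makes concrete. The helpers `embed` and `pseudo_relevance_expand` are hypothetical simplifications: real expansion selects weighted terms from retrieved documents rather than appending fixed lists.

```python
def embed(query_terms, similar_words, t=2):
    """Append up to t embedded (most similar) words for each term."""
    expanded = list(query_terms)
    for term in query_terms:
        expanded.extend(similar_words.get(term, [])[:t])
    return expanded

def pseudo_relevance_expand(query_terms, top_docs_terms):
    """Append the top-weighted terms taken from the pseudo-relevant documents."""
    return list(query_terms) + [t for doc in top_docs_terms for t in doc]

# Toy neighbour table and pseudo-relevant document terms (illustrative).
similar = {"قلب": ["القلب", "الشرايين"]}
docs = [["الجراحة", "المستشفى"]]

# ESX: embed the ORIGINAL query first, then apply pseudo-relevance expansion.
esx = pseudo_relevance_expand(embed(["قلب"], similar), docs)
# SXE: apply pseudo-relevance expansion first, then embed the EXPANDED query.
sxe = embed(pseudo_relevance_expand(["قلب"], docs), similar)
print(esx)
print(sxe)
```

In ESX only the original query terms contribute embedded neighbours; in SXE the pseudo-relevance terms are embedded too, which is how semantic drift away from the original query can enter.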
Another experiment examines the effect of selecting the top-m weighted words from each distinct pseudo-relevant expanded document (ESXD) against selecting the top m×d weighted words from the combination of the words of the d pseudo-relevant expanded documents (ESXA), as explained in section 4.2.
The result of this experiment is plotted in figure-9. It can be observed that the maximum MAP of ESXD is slightly higher than that of ESXA for most values of d and t (figure-9(a)), and the maximum MAP of both methods is found at d = 3; the maximum P10 of ESXA, however, is higher than that of ESXD (figure-9(b)), and it is found at d = 3 for ESXD and at d = 5 for ESXA. This result can be explained as follows: ESXA adds the top-weighted words (globally) from a combination of the pseudo-relevant documents, so it is more probable to find the right semantically similar words within this combination, since the top-weighted words are likely to be repeated in more than one document, making them topic-specific words. In ESXD, on the other hand, the top-weighted words are determined (locally) in each separate document, which could be irrelevant to the query. However, at higher recall levels it is more probable to find returned documents containing the exact words but with different semantics, which explains the lower MAP values of ESXA.
For further testing of the proposed model training method, more experiments were applied to test the effect of using more than one dataset and two different indexing methods. The word embedding models were trained on the text of both the TREC-2001 (AFP news) and the Watan-2004 datasets. The Watan-2004 dataset contains about 20,291 news documents from the Al-Watan Omani newspaper, distributed over six topics and prepared by Murad Abbas [37]; it is about 114 MB in size. The word embedding models (GloVe and SkipGram) are trained using sentences that contain only the AL-Definite words from the combined stemmed text of these two datasets (ALSTEM), and the documents are indexed using the AL-Before indexing method. The results of the word-embedded PRFQE are presented in Figure-10.
It can be observed from Figure-10(a) that the MAP of training the GloVe model with the AFP text only is slightly higher than that of training the model using both datasets. However, at lower recall levels, the maximum P10 and the R-Precision of applying both datasets show better results, as presented in Figure-10(b) and Figure-10(c), respectively.
These results indicate that as the model is trained on more text of the same type (news documents from nearly the same period, in this case), it can show better results at lower recall levels, but this has less effect over the whole recall space. The justification is that the news documents of Watan-2004 focus on the local news of Oman and the Gulf countries, while the AFP has an international focus, so some words (with Omani and Gulf local semantics) could be added to the context of the AFP words beside other global words. The added words of similar global semantics increase the precision at lower recall levels; however, the added words of local semantics have a negative effect on the overall recall results (as presented in Figure-10(a)).
Another experiment was applied using the SkipGram word embedding model trained on the two datasets (AFP and Watan-2004), with the word-embedded PRFQE tested on the two indexing methods, AL-Before and All-Words; the MAP and P10 results are plotted in Figure-11. The maximum MAP of the All-Words indexing shows a slight improvement over the AL-Before indexing, and the maximum MAP of the two methods is obtained at d = 3. However, the maximum precision at the 10th retrieved document (P10) of AL-Before is higher than that of the All-Words indexing method, and it is obtained at d = 5, as in figure-11(b). The All-Words method has better MAP since all words are included in its index, so as more extended words are added, it is more probable to have words semantically related to the query words that are similar to words of the returned documents at higher recall levels, or that co-occur with these words in some context; this justifies the gap between the two methods as more extended documents are used, as in figure-11(a). On the other hand, the AL-Before method has a higher maximum P10 because it is more probable to have AL-Words (of the extended documents) similar to the AL-Words embedded for the query words, since the SkipGram is trained only on the AL-Words.
5.2 Comparison to other works
To compare the results obtained by the proposed methods in this paper to those of other related works, we summarize the maximum results obtained for three different recall levels in Table-3. The percentage of enhancement is used for comparison, since each work could have a different implementation environment.
From Table-3, the MAP of using word embedding with the All-Words indexing method has a 7.2% enhancement over PRFQE without word embedding, and a 21.7% MAP enhancement over the basic All-Words without query expansion. The precision at the 10th returned document (P10, for the All-Words indexing method) is enhanced by 20.5% over the baseline indexing, and the word-embedded query expansion gives a 13.7% P10 enhancement over PRFQE without word embedding. The work most closely related to the method proposed in this paper is El Mahdaouy et al. [17], since they used the same dataset and evaluation metrics. In that work, the maximum MAP is 41.11 and the maximum P10 is 55.07; their MAP is 3% greater than the maximum MAP obtained by the methods proposed in this paper, but the P10 of this paper is 7% greater than that of [17]. Moreover, the word discovery rate of this paper is 83% greater than that of [17] (see figure-4(b)), and the training time is shortened to 10%, for the SkipGram as an example.
Table-3: A statistical summary of the maximum results obtained by the implemented methods
| Method | MAP | P10 | R-P* | %MAP (a) | %P10 (a) | %R-P (a) | %MAP (b) | %P10 (b) | %R-P (b) |
|---|---|---|---|---|---|---|---|---|---|
| AL-Before without query expansion | 36.3 | 53.3 | 37.6 | 9.01 | 9.67 | 6.52 | -3.94 | 3.50 | -4.33 |
| AL-Before PRFQE without word embedding | 39.2 | 56 | 40.6 | 17.72 | 15.23 | 15.01 | 3.73 | 8.74 | 3.31 |
| AL-Before PRFQE with word embedding | 40.37 | **58.96** | **41.94** | 21.23 | 21.32 | 18.81 | 6.83 | 14.49 | 6.72 |
| All-Words without query expansion** | 33.3 | 48.6 | 35.3 | 0.00 | 0.00 | 0.00 | -11.88 | -5.63 | -10.18 |
| All-Words PRFQE without word embedding | 37.79 | 51.5 | 39.3 | 13.48 | 5.97 | 11.33 | 0.00 | 0.00 | 0.00 |
| All-Words PRFQE with word embedding | **40.53** | 58.59 | 41.85 | 21.71 | 20.56 | 18.56 | 7.25 | 13.77 | 6.49 |

(a) % enhancement over the basic All-Words without PRFQE; (b) % enhancement over the PRFQE without word embedding.
* R-P: R-Precision.
** This is the baseline All-Words indexing with Light10 stemming; no extra weight is given to the title words, and words having DF = 1 are included. All other methods are implemented such that words with DF = 1 are excluded. Numbers in bold indicate the best results.
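The enhancement percentages in Table-3 can be reproduced directly from the absolute scores; a minimal check using the All-Words MAP figures:

```python
def enhancement(score, baseline):
    """Percentage improvement of `score` over `baseline`, rounded as in Table-3."""
    return round((score / baseline - 1.0) * 100.0, 2)

# Absolute MAP scores from Table-3 (All-Words indexing).
all_words_baseline_map = 33.3   # All-Words without query expansion
all_words_prfqe_map = 37.79     # All-Words PRFQE without word embedding
all_words_we_map = 40.53        # All-Words PRFQE with word embedding

print(enhancement(all_words_we_map, all_words_baseline_map))  # 21.71
print(enhancement(all_words_we_map, all_words_prfqe_map))     # 7.25
```

The same formula reproduces the other percentage columns against their respective baseline rows.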
Farhan YH et al. [23] used the same dataset for evaluation; their method achieved about 43.5% P10 on TREC 2001/2002, which is less than the P10 achieved by the method proposed in this paper. Hiba ALMarwi et al. [1] evaluated their proposal at the query level, giving a precision for each query without reporting an average or MAP value over all queries.
Ahmed Cherif Mazari and Abdelhamid Djeffal [13] used a semantic tree to reform the pseudo-relevance expanded query. They evaluated their proposed method using the Arabic BBC News corpora, and their results show a precision of 0.524 at P10, which is a 5.8% improvement over the baseline of their test.