This section presents the implementation and testing details of the proposed word-embedding-based query expansion. The first part of the implementation concerns the word embedding training dataset: two word embedding models, SkipGram [9] and GloVe [12], are trained using two sentence forms. The first form includes all the words in the dataset, and the second includes only the AL-Definite words (Arabic words with the prefix 'AL'). This part aims to test the effect of using a representative subset of the dataset against the whole dataset, as explained in section 4.1.
The second part of the implementation and testing concerns the application of the PRFQE schemes explained in section 4.2. The word embedding models were trained using the TREC-2001/2002 collection, which includes 383,783 Agence France-Presse (AFP) news stories. Four options of training text were used, combining two dataset options (whole dataset, representative subset) with two preprocessing options (stemming, no stemming), as presented in Table-1.
After dividing the TREC dataset into sentences, the number of sentences formed using all words of the dataset is 3,118,047, while using the AL-Definite words as a representative subset gives 2,956,145 sentences. Although the number of AL-Definite sentences is about 94.8% of the number of whole-dataset sentences, all documents are still represented by AL-Definite sentences, and the sentences themselves are much shorter: the text size of the AL-Definite sentences is about 30% of that of the complete-dataset sentences. These statistics are presented in figure-3.
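The construction of the representative-subset sentences can be sketched as follows. This is a minimal illustration, assuming the AL-Definite test is a plain string check for the prefix 'ال' (AL); the paper's actual tokenization and preprocessing may differ.

```python
def al_definite_words(sentence):
    """Keep only the AL-Definite words: words starting with the prefix 'ال' (AL)."""
    return [w for w in sentence.split() if w.startswith("ال")]

def build_subset_sentences(sentences):
    """Form the representative-subset training sentences: for each original
    sentence, keep only the AL-Definite words; drop sentences that have none."""
    subset = []
    for s in sentences:
        words = al_definite_words(s)
        if words:
            subset.append(" ".join(words))
    return subset

sentences = ["الولد قرأ الكتاب في المدرسة", "ذهب سريعا"]
# The second sentence contains no AL-Definite words, so it is dropped,
# which is why the subset has fewer (and much shorter) sentences.
print(build_subset_sentences(sentences))
```

Dropping sentences with no AL-Definite words is what reduces the sentence count (to ~94.8%) while shrinking the text size far more (to ~30%), since each kept sentence retains only a fraction of its words.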
The following settings of the SkipGram model were used: the minimum word frequency is ten, as in [17] and [20], the word vector size is 300, and the window size is 5. The settings of the GloVe model were: window size = 4, iterations = 10, and vector size = 5. The most similar words to each word are determined and stored in an array.
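The "most similar words" lookup can be sketched with plain cosine similarity over toy vectors. The real models are trained with word2vec/GloVe toolkits and produce 300-dimensional vectors; the 3-dimensional vectors and the English words below are purely illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def most_similar(word, vectors, topn=3):
    """Return the topn closest words to `word` by cosine similarity."""
    sims = [(other, cosine(vectors[word], vec))
            for other, vec in vectors.items() if other != word]
    sims.sort(key=lambda p: p[1], reverse=True)
    return [w for w, _ in sims[:topn]]

# Toy vectors (assumed for illustration; real vectors are 300-dimensional).
vectors = {
    "bank":  [0.9, 0.1, 0.0],
    "money": [0.8, 0.2, 0.1],
    "river": [0.1, 0.9, 0.2],
    "water": [0.0, 0.8, 0.3],
}
# Precompute and store the closest words for each vocabulary word.
similar = {w: most_similar(w, vectors, topn=2) for w in vectors}
print(similar["bank"])  # the two nearest neighbours of "bank"
```

Precomputing these neighbour lists once, as the paper describes, avoids repeating the similarity search for every query.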
The trained word embedding models were used to test different scenarios of PRFQE: the effect of stemming on model training; the effect of the indexing method; the effect of the number of pseudo-relevant documents; the effect of the number of words selected from each document (extended words); the effect of the method by which the top-m words are selected, i.e., selecting the top-m words from each document against selecting the top m×d words from the combination of all documents; the option of applying word embedding to only the original query against applying it to the expanded query; and finally, the effect of using a larger dataset to train the word embedding models, tested by adding sentences from the Watan-2004 [37] dataset to the sentences obtained from the TREC dataset. These experiments are described in Table-2. The methods described in Table-2 are tested on the four options described in Table-1 and for the two word embedding models (SkipGram and GloVe). The embedded words for each query word were stored in tables for each training method; a maximum of three closest embedded words was used, since this has been found to be a satisfactory word-context size, as in [11].
The PRFQE methods were applied using a Java application that implements the indexing and search system, where each expansion method was iterated to expand using two to five pseudo-relevant documents (the parameter d is varied from 2 to 5).
For each number of documents, the number of expansion words was varied from 10 to 20 (the parameter m is varied from 10 to 20). All of the word embedding experiments used the two closest embedded words (the parameter t = 2), since t = 2 was empirically found to be the best value for this parameter. The experiments were applied to the 75 topics (queries) of TREC-2001/2002. Each iteration was evaluated for the Mean Average Precision (MAP), the precision at the tenth retrieved document (P10), and the R-Precision, which is the precision at the Rth returned document, where R is the actual number of documents relevant to each query according to the TREC relevance judgments.
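The three evaluation metrics above follow standard TREC definitions and can be sketched as:

```python
def precision_at(k, ranked, relevant):
    """Precision over the top-k returned documents."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at(len(relevant), ranked, relevant)

def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant document
    is returned, divided by the total number of relevant documents.
    MAP is the mean of this value over all queries."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

# Toy ranked list and relevance judgments (illustrative, not TREC data).
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
print(precision_at(3, ranked, relevant))   # P@3
print(r_precision(ranked, relevant))       # R = 3 here, so same as P@3
print(average_precision(ranked, relevant))
```

P10 in the paper is `precision_at(10, ...)`, and the reported MAP averages `average_precision` over the 75 TREC topics.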
The indexing and search system was implemented such that stop words were removed and words were normalized by applying the Light10 stemmer. The title and the description of each topic are used to form a query, and the retrieval is further enhanced by giving the title words of each topic double the weight of the description words.
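The title/description weighting can be sketched as a term-weight map. This is an assumed additive scheme for illustration; `build_weighted_query` is a hypothetical helper, and the paper's search system may combine the fields differently.

```python
def build_weighted_query(title, description):
    """Combine a topic's title and description into one term-weight map,
    giving title terms double the weight of description terms (assumed
    additive scheme: description terms weigh 1.0, title terms 2.0)."""
    weights = {}
    for term in description.split():
        weights[term] = weights.get(term, 0.0) + 1.0
    for term in title.split():
        weights[term] = weights.get(term, 0.0) + 2.0
    return weights

q = build_weighted_query("solar energy", "use of solar panels for energy")
print(q["solar"])   # appears in both fields: 1.0 + 2.0 = 3.0
print(q["panels"])  # description only: 1.0
```

In a real system, stop-word removal and Light10 stemming would be applied to both fields before weighting.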
5.1 Results and discussion
The main objective of this study is to shorten the training time of the word embedding model by using less text while preserving the retrieval performance of the word-embedded PRFQE. To test the achievement of this objective, we compared the learning rates of training the word embedding model using all words of the dataset against using just the AL-Definite words; the result is illustrated in figure-4(a). There is only a minor change in the learning rate between using the representative subset (AL-Definite words) and all of the words. Moreover, the rate of vocabulary construction (for the SkipGram, as an example) trained using all words (ALLNOSTEM) and using only AL-Definite words with stemming (ALSTEM) is illustrated in figure-4(b), which shows a wide gap between the word vectorizing rates of the two cases; i.e., the word vectorizing rate using the AL-Definite words as a representative subset is much faster than that using the whole dataset. For example, training the model with ALSTEM gives an average of 83% enhancement over the rate in the case of ALLNOSTEM.
Figure-4: Comparing the learning rate and the word vectorization rate for the two methods ALSTEM and ALLNOSTEM
The training time is reduced to about 10% of the time needed to train the model using all words of the dataset, and the word vectorization rate improves accordingly, because of the big difference in text size between the two methods, as illustrated earlier in figure-3. These results show the effectiveness of using the AL-Definite words as a representative subset for training the word embedding models, provided that this large gain in learning and word vectorizing rates does not come at the cost of the word-embedded PRFQE's retrieval results.
The following experiments were designed to examine the retrieval effectiveness of each training criterion using two different document indexing methods: All-Words, in which each document is indexed by all of its words, and AL-Before, in which each document is indexed only by the AL-Definite words and the words before them, as explained in [35]. The experiments are evaluated at different recall levels, and the results are presented using three diagrams: one for the MAP, one for the P10, and one for the R-Precision.
The first experiment determines the effect of word embedding on retrieval by comparing PRFQE with and without word embedding, and examines the importance of training the word embedding model using the AL-Definite words as a representative subset instead of all words. In this experiment, two options for training the SkipGram model are used: the first trains the model on all words of the dataset, and the other uses only the AL-Definite words as training text. Using the terminology of Table-1 and Table-2, this experiment compares SX and ESX (using ALLNOSTEM and ALNOSTEM).
The indexing method used is AL-Before, and the results are presented in figure-5. Each label on the x-axis of this figure represents one expansion setting: labels from '1' up to '12' mean expanding by two documents, labels from '12' up to '23' by three documents, labels from '23' up to '34' by four documents, and labels from '34' to the end of the axis by five documents; within each group, the number of extended words varies from 10 to 20. This description of the x-axis is the same for all of the following figures.
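The x-axis described above enumerates the (d, m) parameter grid; a small sketch makes the label layout explicit:

```python
# Enumerate the x-axis points of the result figures: for each number of
# pseudo-relevant documents d (2..5), the number of expansion words m
# varies from 10 to 20, giving 4 * 11 = 44 labelled points.
grid = [(d, m) for d in range(2, 6) for m in range(10, 21)]
print(len(grid))   # 44 points on the x-axis
print(grid[0])     # label '1'  -> (2, 10): d = 2, m = 10
print(grid[11])    # label '12' -> (3, 10): the d = 3 block starts here
print(grid[33])    # label '34' -> (5, 10): the d = 5 block starts here
```

Each block of eleven consecutive labels therefore shares one value of d while m sweeps from 10 to 20.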
It can be observed that the maximum MAP of the three curves (figure-5(a)) is obtained at the horizontal-axis points representing d = 3 and m = 13; i.e., extending by 13 words from each of the top three ranked documents. The two training options show similar maximum MAP (figure-5(a)), and the maximum P10 is the same as well, obtained at d = 5 (figure-5(b)); however, the R-Precision of ESX + ALNOSTEM is better than that of ESX + ALLNOSTEM, obtained at d = 3, m = 13 (figure-5(c)). Moreover, using word embedding (ESX) shows a clear enhancement at the three recall levels over PRFQE without word embedding (SX). The similar results obtained by training the word embedding model on all of the words (ALLNOSTEM) and on only the AL-Definite words (ALNOSTEM) confirm that the AL-Definite words are representative of the whole dataset, and the stability of the results at the three recall levels shows that training on this subset preserves the retrieval performance.
The maximum P10 being obtained by extending the original query by five documents and 11 to 14 words per document can be explained as follows: the probability that a top-10 document is relevant can be estimated by the P10 without query expansion, which is about 54% as presented in [35], so the expansion is likely to bring more relevant documents into the top returned documents. For example, P@1 (the precision at the first retrieved document) was 60.7% without query expansion and 61.56% after word-embedded query expansion, meaning that word embedding gives the first retrieved document a better chance of being relevant.
The following experiment examines the effect of stemming on training the word embedding model: the system is fed with embedded words resulting from training the GloVe model using the AL-Definite words without stemming (ALNOSTEM) and with Light10 stemming (ALSTEM); the results are illustrated in figure-6. The results show a slight MAP enhancement for ALNOSTEM over the stemmed text ALSTEM (figure-6(a)); this slight difference is justified because ALNOSTEM shows better results at the higher recall levels used to calculate the MAP, which is consistent with the results of other researchers, as in [5] and [17]. On the other hand, ALSTEM shows better results at lower recall levels (at P10), where the maximum P10 is obtained by extending the original query by five documents, as shown in figure-6(b), which can be explained as in the previous experiment. It is worth noting that the number of words of ALSTEM is about 60% of that of ALLNOSTEM, which makes the training time shorter, so a trade-off between the MAP and the training time could give more implementation options.
The maximum R-Precision of ALNOSTEM is higher than that of ALSTEM, as in figure-6(c); it is obtained by extending the original query by the words of the top-3 returned documents, while the maximum R-Precision of ALSTEM is obtained by extending the query by the words of the top-5 documents.
Using different word embedding models with PRFQE shows insignificant differences in the MAP results; for example, the results of the SkipGram and the GloVe for the same query expansion scheme are presented in figure-7(a). This result is consistent with the results reported by other researchers, such as [22]. However, there are some differences in P10, as in figure-7(b). The slight P10 advantage shown by the SkipGram over the GloVe can be explained by the fact that GloVe determines the similarity between words based on global co-occurrence statistics within the dataset, whereas the SkipGram determines this similarity based on local co-occurrence within a context window. Hence, the results become closer at higher recall levels, such as the R-Precision, as presented in figure-7(c).
The following experiments examine different options for applying word embedding to the PRFQE. The first experiment in this context compares applying the word embedding to the original query, i.e., before PRFQE (ESX), which is the first scheme described in section 4.2, against applying the word embedding to the expanded query (SXE), the second scheme described in section 4.2. The results presented in figure-8(a) show that embedding the original query (the first scheme) gives better MAP than embedding the pseudo-relevance expanded query (the second scheme), and the P10 of ESX is better than that of SXE, as presented in figure-8(b). This result satisfies the last assumption (point 4) listed at the end of section 4.2. It can be explained as follows: the expanded query (in SXE) has words added from the pseudo-relevant documents, which could themselves have a different meaning than the original query words, so the embedded words similar to these added words are also likely to differ in meaning from the original query words. A closer look at figure-8(a) shows that as the number of extended words increases, the difference in the MAP becomes more pronounced, indicating that as more words are added, they carry a different meaning than the words of the original query.
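The ESX/SXE distinction is purely one of operation order, which a short sketch makes concrete. The helpers `embed` and `pseudo_relevance_expand` are hypothetical simplifications: real expansion selects weighted terms from retrieved documents rather than appending fixed lists.

```python
def embed(query_terms, similar_words, t=2):
    """Append up to t embedded (most similar) words for each term."""
    expanded = list(query_terms)
    for term in query_terms:
        expanded.extend(similar_words.get(term, [])[:t])
    return expanded

def pseudo_relevance_expand(query_terms, top_docs_terms):
    """Append the top-weighted terms taken from the pseudo-relevant documents."""
    return list(query_terms) + [t for doc in top_docs_terms for t in doc]

# Toy neighbour table and pseudo-relevant document terms (illustrative).
similar = {"قلب": ["القلب", "الشرايين"]}
docs = [["الجراحة", "المستشفى"]]

# ESX: embed the ORIGINAL query first, then apply pseudo-relevance expansion.
esx = pseudo_relevance_expand(embed(["قلب"], similar), docs)
# SXE: apply pseudo-relevance expansion first, then embed the EXPANDED query.
sxe = embed(pseudo_relevance_expand(["قلب"], docs), similar)
print(esx)
print(sxe)
```

In ESX only the original query terms contribute embedded neighbours; in SXE the pseudo-relevance terms are embedded too, which is how semantic drift away from the original query can enter.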
Another experiment examines the effect of selecting the top-m weighted words from each distinct pseudo-relevant expanded document (ESXD) against selecting the top m×d weighted words from the combination of the words of the d pseudo-relevant expanded documents (ESXA), as explained in section 4.2.
The result of this experiment is plotted in figure-9. It can be observed that the maximum MAP of ESXD is slightly higher than that of ESXA for most values of d and t (figure-9(a)), and the maximum MAP of both methods is found at d = 3; the maximum P10 of ESXA, however, is higher than that of ESXD (figure-9(b)), and it is found at d = 3 for ESXD and at d = 5 for ESXA. This result can be explained as follows: ESXA adds the top-weighted words (globally) from a combination of the pseudo-relevant documents, so it is more probable to find the right semantically similar words within this combination, since the top-weighted words are likely to be repeated in more than one document, making them topic-specific words. In ESXD, on the other hand, the top-weighted words are determined (locally) in each separate document, which could be irrelevant to the query. However, at higher recall levels it is more probable to find returned documents containing the exact words but with different semantics, which explains the lower MAP values of ESXA.
For further testing of the proposed model training method, more experiments were applied to test the effect of using more than one dataset and two different indexing methods. The word embedding models were trained on the text of both the TREC-2001 (AFP news) and the Watan-2004 datasets. The Watan-2004 dataset contains about 20,291 news documents from the Al-Watan Omani newspaper, distributed over six topics and prepared by Murad Abbas [37]; it is about 114 MB in size. The word embedding models (GloVe and SkipGram) are trained using sentences that contain only the AL-Definite words from the combined stemmed text of these two datasets (ALSTEM), and the documents are indexed using the AL-Before indexing method. The results of the word-embedded PRFQE are presented in Figure-10.
It can be observed from Figure-10(a) that the MAP of training the GloVe model with the AFP text only is slightly higher than that of training the model using both datasets. However, at lower recall levels, the maximum P10 and the R-Precision of applying both datasets show better results, as presented in Figure-10(b) and Figure-10(c), respectively.
These results indicate that as the model is trained on more text of the same type (news documents from nearly the same period, in this case), it can show better results at lower recall levels, but this has less effect over the whole recall space. The justification is that the news documents of Watan-2004 focus on the local news of Oman and the Gulf countries, while the AFP has an international focus, so some words (with Omani and Gulf local semantics) could be added to the context of the AFP words beside other global words. The added words of similar global semantics increase the precision at lower recall levels; however, the added words of local semantics have a negative effect on the overall recall results (as presented in Figure-10(a)).
Another experiment was applied using the SkipGram word embedding model trained on the two datasets (AFP and Watan-2004), with the word-embedded PRFQE tested on the two indexing methods, AL-Before and All-Words; the MAP and P10 results are plotted in Figure-11. The maximum MAP of the All-Words indexing shows a slight improvement over the AL-Before indexing, and the maximum MAP of the two methods is obtained at d = 3. However, the maximum precision at the 10th retrieved document (P10) of AL-Before is higher than that of the All-Words indexing method, and it is obtained at d = 5, as in figure-11(b). The All-Words method has better MAP since all words are included in its index, so as more extended words are added, it is more probable to have words semantically related to the query words that are similar to words of the returned documents at higher recall levels, or that co-occur with these words in some context; this justifies the gap between the two methods as more extended documents are used, as in figure-11(a). On the other hand, the AL-Before method has a higher maximum P10 because it is more probable to have AL-Words (of the extended documents) similar to the AL-Words embedded for the query words, since the SkipGram is trained only on the AL-Words.
5.2 Comparison to other works
To compare the results obtained by the proposed methods in this paper to those of other related works, we summarize the maximum results obtained for three different recall levels in Table-3. The percentage of enhancement is used for comparison, since each work could have a different implementation environment.
From Table-3, the MAP of using word embedding with the All-Words indexing method has a 7.2% enhancement over PRFQE without word embedding, and a 21.7% MAP enhancement over the basic All-Words without query expansion. The precision at the 10th returned document (P10, for the All-Words indexing method) is enhanced by 20.5% over the baseline indexing, and the word-embedded query expansion gives a 13.7% P10 enhancement over PRFQE without word embedding. The work most closely related to the method proposed in this paper is El Mahdaouy et al. [17], since they used the same dataset and evaluation metrics. In that work, the maximum MAP is 41.11 and the maximum P10 is 55.07; their MAP is 3% greater than the maximum MAP obtained by the methods proposed in this paper, but the P10 of this paper is 7% greater than that of [17]. Moreover, the word discovery rate of this paper is 83% greater than that of [17] (see figure-4(b)), and the training time is shortened to 10%, for the SkipGram as an example.
Table-3: A statistical summary of the maximum results obtained by the implemented methods
| Method | MAP | P10 | R-P* | %MAP (a) | %P10 (a) | %R-P (a) | %MAP (b) | %P10 (b) | %R-P (b) |
|---|---|---|---|---|---|---|---|---|---|
| AL-Before without query expansion | 36.3 | 53.3 | 37.6 | 9.01 | 9.67 | 6.52 | -3.94 | 3.50 | -4.33 |
| AL-Before PRFQE without word embedding | 39.2 | 56 | 40.6 | 17.72 | 15.23 | 15.01 | 3.73 | 8.74 | 3.31 |
| AL-Before PRFQE with word embedding | 40.37 | **58.96** | **41.94** | 21.23 | 21.32 | 18.81 | 6.83 | 14.49 | 6.72 |
| All-Words without query expansion** | 33.3 | 48.6 | 35.3 | 0.00 | 0.00 | 0.00 | -11.88 | -5.63 | -10.18 |
| All-Words PRFQE without word embedding | 37.79 | 51.5 | 39.3 | 13.48 | 5.97 | 11.33 | 0.00 | 0.00 | 0.00 |
| All-Words PRFQE with word embedding | **40.53** | 58.59 | 41.85 | 21.71 | 20.56 | 18.56 | 7.25 | 13.77 | 6.49 |

(a) % enhancement over the basic All-Words without PRFQE; (b) % enhancement over the PRFQE without word embedding.
* R-P: R-Precision.
** This is the baseline All-Words indexing with Light10 stemming; no extra weight is given to the title words, and words having DF = 1 are included. All other methods are implemented such that words with DF = 1 are excluded. Numbers in bold indicate the best results.
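The enhancement percentages in Table-3 can be reproduced directly from the absolute scores; a minimal check using the All-Words MAP figures:

```python
def enhancement(score, baseline):
    """Percentage improvement of `score` over `baseline`, rounded as in Table-3."""
    return round((score / baseline - 1.0) * 100.0, 2)

# Absolute MAP scores from Table-3 (All-Words indexing).
all_words_baseline_map = 33.3   # All-Words without query expansion
all_words_prfqe_map = 37.79     # All-Words PRFQE without word embedding
all_words_we_map = 40.53        # All-Words PRFQE with word embedding

print(enhancement(all_words_we_map, all_words_baseline_map))  # 21.71
print(enhancement(all_words_we_map, all_words_prfqe_map))     # 7.25
```

The same formula reproduces the other percentage columns against their respective baseline rows.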
Farhan YH et al. [23] used the same dataset for evaluation; their method achieved about 43.5% P10 on TREC 2001/2002, which is less than the P10 achieved by the method proposed in this paper. Hiba ALMarwi et al. [1] evaluated their proposal at the query level, giving a precision for each query without reporting an average or MAP value over all queries.
Ahmed Cherif Mazari and Abdelhamid Djeffal [13] used a semantic tree to reform the pseudo-relevance expanded query. They evaluated their proposed method using the Arabic BBC News corpora, and their results show a precision of 0.524 at P10, which is a 5.8% improvement over the baseline of their test.