In this study, we designed and trained a recurrent neural network (RNN) model to perform the NLI task on an annotated Arabic NLI dataset of P-H pairs. The goal of the NLI task is to predict the inference relationship between a given premise and hypothesis based on their content and meaning. Because RNNs require inputs of the same length and dimension, each pair of sentences (P and H) is first pre-processed, tokenized, and padded to the maximum sequence length. The padded sequences are then passed through an embedding layer, which maps them to dense vectors that capture the semantic information of the input words, allowing the RNN to understand the meaning and context of the input sequences. The resulting premise and hypothesis representations are concatenated and passed through several dense layers to extract higher-level features, and a final output layer with the appropriate activation function predicts the inference label. The proposed model can thus be used to determine the entailment relationship between any new pair of Arabic sentences, provided the pair is pre-processed, tokenized, and padded in the same way before being fed to the model.
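As a concrete illustration, the following minimal sketch shows this preparation step using the Keras text utilities; the example sentences and the maximum-length value are placeholders, not values from our experiments.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Placeholder P-H pairs; in practice these come from the annotated NLI dataset.
premises = ["الطقس جميل اليوم", "القط نائم"]
hypotheses = ["الجو لطيف", "الحيوان مستيقظ"]

MAX_LEN = 50  # assumed maximum sequence length

# Build a shared vocabulary over premises and hypotheses.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(premises + hypotheses)

# Convert each sentence to a sequence of word indices, then pad to a fixed
# length so all inputs share the same dimension, as the RNN requires.
p_seq = pad_sequences(tokenizer.texts_to_sequences(premises), maxlen=MAX_LEN, padding="post")
h_seq = pad_sequences(tokenizer.texts_to_sequences(hypotheses), maxlen=MAX_LEN, padding="post")
```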
We evaluated the models' performance using the existing Arabic NLI datasets, both separately and in combination. We found that combining the Arabic portion of the XNLI dataset with the ArNLI dataset resulted in lower performance for natural language inference models than using the ArNLI dataset alone. Upon closer examination, this decrease appears to have been caused by errors or noise in the translations within the XNLI dataset. For example, on the three-way NLI classification, the BiLSTM model, which had the highest performance among all models, saw its accuracy and F1 score drop from 63.65% and 61.38% to 49.60% and 47.50%, respectively, when the Arabic portion of the XNLI dataset was added to the ArNLI dataset. As a result, we chose to continue training our models using only the ArNLI dataset.
In addition, the evaluation results showed that removing stop words decreased the performance of our model. This is likely because stop words, including negation words, contribute to the meaning and interpretation of a sentence; removing them risks losing important context and information. Negation words in particular are essential for identifying the sense of a sentence and determining the inference relation between the premise and hypothesis, and removing them had a significant negative impact on the accuracy of our model. We therefore opted not to remove stop words, including negation words, during the preprocessing stage for our RNN model. This decision led to improved performance and more accurate results.
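For reference, the sketch below shows what selective stop-word filtering would look like, had we removed stop words while preserving negation words. The negation list is illustrative and non-exhaustive; in our final pipeline, all stop words are retained.

```python
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

# Illustrative (non-exhaustive) set of Arabic negation words to preserve.
NEGATION_WORDS = {"لا", "لم", "لن", "ليس", "ما", "غير"}

# Remove negation words from the stop-word list so they survive filtering.
FILTERED_STOPS = set(stopwords.words("arabic")) - NEGATION_WORDS

def drop_stop_words(tokens):
    """Drop generic stop words while keeping negation words, which carry
    the polarity signal needed to detect contradictions."""
    return [t for t in tokens if t not in FILTERED_STOPS]
```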
Moreover, the size of the output vectors from the embedding layer can have a significant impact on the model's performance and computational efficiency. If the output dimension of the embedding layer is too small, the vectors may not capture enough information about the words in the input data, which can hurt the model's performance. If it is too large, the vectors become unnecessarily large and complex, which increases the computational cost of training and inference and decreases the model's efficiency. To further improve the performance of our RNN model, we therefore used the Keras Embedding layer and experimented with different embedding sizes (100, 300, and 500), finding that an output dimension of 300 gave the best performance: each word in the input data is represented by a 300-dimensional vector. This allowed us to capture more detailed and nuanced information about the words in the input data, which in turn helped improve the performance of our model.
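In Keras, this corresponds to the layer's output_dim argument; the vocabulary size below is an assumed placeholder, which in practice would come from the fitted tokenizer.

```python
from tensorflow.keras.layers import Embedding

VOCAB_SIZE = 20000  # assumed placeholder; in practice len(tokenizer.word_index) + 1

# We tested output dimensions of 100, 300, and 500; 300 performed best.
embedding_layer = Embedding(input_dim=VOCAB_SIZE, output_dim=300)
```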
Five models with different architectures were tested for the inference prediction task. Each model has two input layers, one for the premise and one for the hypothesis, a recurrent layer using the corresponding architecture (Simple RNN, LSTM, GRU, BiLSTM, or BiGRU, depending on the model), a hidden dense layer with dropout, and an output layer for prediction. The models are compiled with the Adam optimizer and either the binary cross-entropy or the categorical cross-entropy loss function, depending on the classification type.
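The sketch below illustrates this shared structure for the BiLSTM variant using the Keras functional API. The sizes are assumed placeholders, and sharing the embedding and recurrent layers between premise and hypothesis is an implementation assumption not fixed by the description above.

```python
from tensorflow.keras.layers import (Input, Embedding, LSTM, Bidirectional,
                                     Concatenate, Dense, Dropout)
from tensorflow.keras.models import Model

VOCAB_SIZE, EMB_DIM, MAX_LEN, NUM_CLASSES = 20000, 300, 50, 3  # assumed sizes

# Two input layers: one for the premise and one for the hypothesis.
premise_in = Input(shape=(MAX_LEN,), name="premise")
hypothesis_in = Input(shape=(MAX_LEN,), name="hypothesis")

# Shared embedding and recurrent encoder; swap Bidirectional(LSTM(...)) for
# SimpleRNN, LSTM, GRU, or Bidirectional(GRU(...)) for the other variants.
embed = Embedding(input_dim=VOCAB_SIZE, output_dim=EMB_DIM)
encoder = Bidirectional(LSTM(64))

p_vec = encoder(embed(premise_in))
h_vec = encoder(embed(hypothesis_in))

# Concatenate the two sentence vectors, then a hidden dense layer with dropout.
merged = Concatenate()([p_vec, h_vec])
hidden = Dropout(0.5)(Dense(64, activation="relu")(merged))

# Output layer: softmax over 3 classes for the three-way task
# (a single sigmoid unit with binary cross-entropy for the binary task).
output = Dense(NUM_CLASSES, activation="softmax")(hidden)

model = Model(inputs=[premise_in, hypothesis_in], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```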
Adam optimizer. The Adam optimizer is a popular choice for training deep learning models because it is computationally efficient, has good convergence properties, and can adaptively adjust the learning rates of different model parameters based on the historical gradient information.
Binary cross-entropy. The binary cross-entropy loss function is a widely used metric for evaluating the performance of binary classification models. It is a specialized version of the categorical cross-entropy loss function designed for tasks with only two classes, and it measures how well the model predicts the correct class (entailment or not entailment) for a given input pair of sentences. The model assigns a probability to the input being classified as entailment, with a score closer to 1 indicating higher confidence in entailment.
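Formally, for $N$ training pairs with gold label $y_i \in \{0, 1\}$ and predicted entailment probability $\hat{y}_i$, the binary cross-entropy loss is

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\,\right].$$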
Categorical cross-entropy. The categorical cross-entropy loss function is a widely used metric for evaluating the performance of multi-class classification models. It calculates the difference between the predicted probability distribution of the classes and the true distribution represented by the labels, and is particularly useful when the classes are imbalanced or the cost of misclassification is high. In our case, we use the categorical cross-entropy loss function to train a 3-class classification model for the entailment prediction task. The model is trained to predict whether a given premise and hypothesis entail each other (entailment), contradict each other (contradiction), or neither (neutral). The model will return a probability score for each class, representing the likelihood that the input belongs to that class. The class with the highest probability score is typically considered the model's inference prediction for the input P-H pair.
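Formally, with $y_{i,c}$ the one-hot gold label and $\hat{y}_{i,c}$ the predicted probability of class $c$ (entailment, contradiction, or neutral) for pair $i$, the categorical cross-entropy loss over $N$ pairs is

$$\mathcal{L}_{\mathrm{CCE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{3} y_{i,c} \log \hat{y}_{i,c}.$$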
The following table (Table 2) displays the evaluation results of the RNN models for the NLI task, which were obtained using different model architectures for both binary and three-way NLI classification tasks. To assess the performance of the trained models, we used both accuracy and F1 score as evaluation metrics.
Table 2
Evaluation Results of RNN Models for Natural Language Inference in Arabic
| Classification Type | Dataset | Model | Accuracy | F1-score |
|---|---|---|---|---|
| Binary Classification (True, False) | ArbTEDS + ArNLI | Simple RNN | 68.15% | 30.71% |
| | | LSTM | 73.52% | 52.97% |
| | | BiLSTM | 73.60% | 57.77% |
| | | GRU | 72.45% | 51.20% |
| | | BiGRU | 72.81% | 56.69% |
| 3-way (Entailment, Contradiction, Neutral) | ArNLI | Simple RNN | 53.77% | 38.57% |
| | | LSTM | 61.07% | 54.70% |
| | | BiLSTM | 63.65% | 61.38% |
| | | GRU | 61.77% | 59.51% |
| | | BiGRU | 61.07% | 60.01% |
Table 2 presents the performance of various models on the classification task, with the accuracy and F1 score reflecting the models' ability to correctly classify input sentence pairs for the NLI task. These metrics provide insight into the effectiveness of the different models.
Based on the evaluation results, the Bi-LSTM model with a 300-dimensional embedding vector size outperformed all other models in both the binary and 3-way NLI classification tasks, achieving the highest accuracy and F1 score among all tested models. These results suggest that the Bi-LSTM model is highly effective at accurately classifying input data for the NLI task. The evaluation also showed that, in terms of F1 score, all models performed better in the 3-way classification task (entailment, contradiction, and neutral) than in the binary classification task (entailment, no entailment). This may be because the 3-class setting allows the models to better distinguish between the three classes, since there is a clear difference between contradiction and neutral, whereas the binary task groups contradiction and neutral together as "no entailment," which may confuse the model. These results suggest that a 3-class classification approach may be more effective at improving the performance of NLI models in certain situations.
The following table (Table 3) lists the different hyperparameters for the Bi-LSTM model, which had the best performance. To optimize the performance of the Bi-LSTM model, we conducted a hyperparameter optimization process by testing various combinations of hyperparameters and evaluating the model's performance on the validation set. Through this process, we were able to identify the optimal set of hyperparameter values that resulted in the best model performance.
Table 3
Bi-LSTM Model Hyperparameter Settings
| Parameter | Description | Value |
|---|---|---|
| LSTM units | The number of units determines the size of the hidden state vector in the LSTM layers and is a key factor in the model's ability to store and process information about the input sequence. | 64 |
| Initial learning rate | The step size the model takes towards the minimum of the loss function during training. | 0.001 |
| L2 regularization rate | A type of regularization that adds a penalty to the model's loss function based on the sum of the squares of the model's weights. | 0.001 |
| Dropout rate | The fraction of the layer's outputs randomly set to zero during training, which helps prevent overfitting and improves the generalization ability of the model. | 0.5 |
| Recurrent dropout rate | The fraction of recurrent connections in the RNN that are dropped out during training. | 0.5 |
| Batch size | The number of training examples processed together in each iteration of the training process. | 32 |
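In Keras terms, these settings would be applied roughly as follows; attaching the L2 penalty to the kernel weights is an assumption, since Table 3 does not specify which weight matrices were regularized.

```python
from tensorflow.keras.layers import LSTM, Bidirectional
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2

# Bi-LSTM layer configured with the Table 3 hyperparameters.
bilstm_layer = Bidirectional(LSTM(
    units=64,                      # hidden state size
    dropout=0.5,                   # dropout on the layer inputs
    recurrent_dropout=0.5,         # dropout on the recurrent connections
    kernel_regularizer=l2(0.001),  # L2 penalty (assumed to apply to kernel weights)
))

optimizer = Adam(learning_rate=0.001)  # initial learning rate

# Training then proceeds with batch_size=32, e.g.:
# model.fit([p_train, h_train], y_train, batch_size=32, validation_data=...)
```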
Figure 2 presents a summary of the model's architecture, including the number of layers, the number of parameters, and the output shape of each layer, covering the input and output layers as well as the word embedding and hidden layers. It gives a clear and concise overview of the design choices implemented in the model. Figure 3 shows the performance of the Bi-LSTM model on the training and validation sets over several epochs: the green line represents performance on the training set and the blue line on the validation set. Plots (a), (b), and (c) show the accuracy, F1 score, and loss, respectively; the x-axis represents the number of epochs and the y-axis the metric value.
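Curves like those in Figure 3 can be produced from the History object returned by model.fit(); the sketch below assumes a custom F1 metric was registered at compile time under the name "f1_score", since Keras provides no built-in F1 metric in this setting.

```python
import matplotlib.pyplot as plt

def plot_history(history):
    """Plot training (green) vs. validation (blue) curves for each metric."""
    for metric in ("accuracy", "f1_score", "loss"):  # metric names are assumptions
        plt.figure()
        plt.plot(history.history[metric], color="green", label="training")
        plt.plot(history.history["val_" + metric], color="blue", label="validation")
        plt.xlabel("Epochs")
        plt.ylabel(metric)
        plt.legend()
    plt.show()
```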
Most existing works on Arabic NLI have used the ArbTEDS dataset, which includes only 600 pairs of texts, and have adopted only a binary classification approach (entails or not entails). ArNLI is a larger, recently created dataset of 6,366 pairs [18]. Its authors used this dataset with different machine learning methods and features to classify the inference relationship into three categories: inference, neutral, or contradiction. They found that the best accuracy was achieved by combining the contradiction vector with either the word tri-grams vector or the chars vector, reaching accuracies between 58% and 75% with various classifiers. In our own work, we used only RNN models with word embeddings on the ArNLI dataset for Arabic NLI, without any other feature extraction. We obtained encouraging results, the best being a Bi-LSTM model that achieved an accuracy of 63.65% and an F1 score of 61.38% on the 3-way classification. These results highlight the effectiveness of RNNs, specifically Bi-LSTM with word embeddings, for NLI tasks on Arabic language data. However, we recognize that there is still room for improvement and plan to take further steps to increase performance.
Table 4 summarizes the various works that have been proposed for Arabic NLI, including ours, listing for each the type of approach, the method employed, the classification type, the evaluation scores, and the dataset used. The approaches are classified as syntactic, lexical, semantic, or deep learning based, depending on the level of abstraction at which they operate when processing the NLI task: syntactic approaches focus on the structure and arrangement of words in a sentence, lexical approaches on the meanings of individual words, semantic approaches rely on semantic techniques to extract the meanings of words and phrases, and deep learning-based approaches use neural networks to learn to represent and process natural language in order to make inferences.
Table 4
Summary of Arabic NLI Approaches and Results
| Approach | Reference | Strategy | Classification | Accuracy | F1-score | Dataset |
|---|---|---|---|---|---|---|
| Syntactic approaches | [3] | Search for the editing distance on the syntactic trees of the P-H pair | Binary | 66.3% | 66.4% | ArbTEDS |
| Lexical approaches | [4] | Extract related words; identify the negation and polarity of the P-H pair | Binary | 69.3% | - | ArbTEDS |
| | [19] | Bigram matching | Binary | - | 61% | ArbTEDS |
| | [7] | Use of semantic measure and word sense disambiguation | Binary | 70% | - | 200 question/answer pairs |
| | [11] | Enriched representation of P and H; alignment-based approach | Binary | 75.84% | - | ArbTEDS |
| | [1] | Linear combination of text similarity measures and weights; genetic algorithm to derive an optimal similarity function | Binary | 73.3% | 71.7% | ArbTEDS |
| Semantic approaches | [8; 9] | Predicate-argument representation and extraction of features | Binary | 73.33% | 70.4% | 500 question/answer pairs |
| | [5] | Use of traditional and word embedding features | Binary | 76.2% | - | ArbTEDS |
| | [12] | Calculate the degree of similarity using Earth Mover's Distance and word embeddings | Binary | 76.5% | - | ArbTEDS |
| | [18] | Variety of features with different language models (TF-IDF, n-grams, and word embeddings) | Three-way | 58% to 75% using different classifiers | - | ArNLI |
| Deep learning approaches | Our proposed work | Training different types of recurrent neural network models with word embeddings | Binary | 73.60% | 57.77% | ArbTEDS + ArNLI |
| | | | Three-way | 63.65% | 61.38% | ArNLI |