In this study, we designed and trained a recurrent neural network (RNN) model to perform the NLI task on an annotated Arabic NLI dataset of P-H pairs. The goal of the NLI task is to predict the inference relationship between a given premise and hypothesis based on their content and meaning. Because RNNs require inputs of the same length and dimension, each pair of sentences (P and H) is first pre-processed, tokenized, and padded to the maximum sequence length. The padded sequences are then passed through an embedding layer, which maps them to dense vectors that capture the semantic information of the input words, allowing the RNN to understand the meaning and context of the input sequences. The resulting premise and hypothesis representations are concatenated and passed through several dense layers to extract higher-level features, and a final output layer with the appropriate activation function predicts the inference label. The proposed model can thus be used to determine the entailment relationship between any new pair of Arabic sentences, provided the pair is pre-processed, tokenized, and padded in the same way before being fed to the model.
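As a concrete illustration, the following minimal sketch shows this preparation step using the Keras text utilities; the example sentences and the maximum-length value are placeholders, not values from our experiments.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Placeholder P-H pairs; in practice these come from the annotated NLI dataset.
premises = ["الطقس جميل اليوم", "القط نائم"]
hypotheses = ["الجو لطيف", "الحيوان مستيقظ"]

MAX_LEN = 50  # assumed maximum sequence length

# Build a shared vocabulary over premises and hypotheses.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(premises + hypotheses)

# Convert each sentence to a sequence of word indices, then pad to a fixed
# length so all inputs share the same dimension, as the RNN requires.
p_seq = pad_sequences(tokenizer.texts_to_sequences(premises), maxlen=MAX_LEN, padding="post")
h_seq = pad_sequences(tokenizer.texts_to_sequences(hypotheses), maxlen=MAX_LEN, padding="post")
```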
We evaluated the models' performance using the existing Arabic NLI datasets, both separately and in combination. We found that combining the Arabic portion of the XNLI dataset with the ArNLI dataset resulted in lower performance for natural language inference models than using the ArNLI dataset alone. Upon closer examination, this decrease appears to have been caused by errors or noise in the translations within the XNLI dataset. For example, on the three-way NLI classification, the BiLSTM model, which had the highest performance among all models, saw its accuracy and F1 score drop from 63.65% and 61.38% to 49.60% and 47.50%, respectively, when the Arabic portion of the XNLI dataset was added to the ArNLI dataset. As a result, we chose to continue training our models using only the ArNLI dataset.
In addition, the evaluation results showed that removing stop words decreased the performance of our model. This is likely because stop words, including negation words, contribute to the meaning and interpretation of a sentence; removing them risks losing important context and information. Negation words in particular are essential for identifying the sense of a sentence and determining the inference relation between the premise and hypothesis, and removing them had a significant negative impact on the accuracy of our model. We therefore opted not to remove stop words, including negation words, during the preprocessing stage for our RNN model. This decision led to improved performance and more accurate results.
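For reference, the sketch below shows what selective stop-word filtering would look like, had we removed stop words while preserving negation words. The negation list is illustrative and non-exhaustive; in our final pipeline, all stop words are retained.

```python
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

# Illustrative (non-exhaustive) set of Arabic negation words to preserve.
NEGATION_WORDS = {"لا", "لم", "لن", "ليس", "ما", "غير"}

# Remove negation words from the stop-word list so they survive filtering.
FILTERED_STOPS = set(stopwords.words("arabic")) - NEGATION_WORDS

def drop_stop_words(tokens):
    """Drop generic stop words while keeping negation words, which carry
    the polarity signal needed to detect contradictions."""
    return [t for t in tokens if t not in FILTERED_STOPS]
```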
Moreover, the size of the output vectors from the embedding layer can have a significant impact on the model's performance and computational efficiency. If the output dimension of the embedding layer is too small, the vectors may not capture enough information about the words in the input data, which can hurt the model's performance. If it is too large, the vectors become unnecessarily large and complex, which increases the computational cost of training and inference and decreases the model's efficiency. To further improve the performance of our RNN model, we therefore used the Keras Embedding layer and experimented with different embedding sizes (100, 300, and 500), finding that an output dimension of 300 gave the best performance: each word in the input data is represented by a 300-dimensional vector. This allowed us to capture more detailed and nuanced information about the words in the input data, which in turn helped improve the performance of our model.
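In Keras, this corresponds to the layer's output_dim argument; the vocabulary size below is an assumed placeholder, which in practice would come from the fitted tokenizer.

```python
from tensorflow.keras.layers import Embedding

VOCAB_SIZE = 20000  # assumed placeholder; in practice len(tokenizer.word_index) + 1

# We tested output dimensions of 100, 300, and 500; 300 performed best.
embedding_layer = Embedding(input_dim=VOCAB_SIZE, output_dim=300)
```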
Five models with different architectures were tested for the inference prediction task. Each model has two input layers, one for the premise and one for the hypothesis, a recurrent layer using the corresponding architecture (Simple RNN, LSTM, GRU, BiLSTM, or BiGRU, depending on the model), a hidden dense layer with dropout, and an output layer for prediction. The models are compiled with the Adam optimizer and either the binary cross-entropy or the categorical cross-entropy loss function, depending on the classification type.
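The sketch below illustrates this shared structure for the BiLSTM variant using the Keras functional API. The sizes are assumed placeholders, and sharing the embedding and recurrent layers between premise and hypothesis is an implementation assumption not fixed by the description above.

```python
from tensorflow.keras.layers import (Input, Embedding, LSTM, Bidirectional,
                                     Concatenate, Dense, Dropout)
from tensorflow.keras.models import Model

VOCAB_SIZE, EMB_DIM, MAX_LEN, NUM_CLASSES = 20000, 300, 50, 3  # assumed sizes

# Two input layers: one for the premise and one for the hypothesis.
premise_in = Input(shape=(MAX_LEN,), name="premise")
hypothesis_in = Input(shape=(MAX_LEN,), name="hypothesis")

# Shared embedding and recurrent encoder; swap Bidirectional(LSTM(...)) for
# SimpleRNN, LSTM, GRU, or Bidirectional(GRU(...)) for the other variants.
embed = Embedding(input_dim=VOCAB_SIZE, output_dim=EMB_DIM)
encoder = Bidirectional(LSTM(64))

p_vec = encoder(embed(premise_in))
h_vec = encoder(embed(hypothesis_in))

# Concatenate the two sentence vectors, then a hidden dense layer with dropout.
merged = Concatenate()([p_vec, h_vec])
hidden = Dropout(0.5)(Dense(64, activation="relu")(merged))

# Output layer: softmax over 3 classes for the three-way task
# (a single sigmoid unit with binary cross-entropy for the binary task).
output = Dense(NUM_CLASSES, activation="softmax")(hidden)

model = Model(inputs=[premise_in, hypothesis_in], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```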
Adam optimizer. The Adam optimizer is a popular choice for training deep learning models because it is computationally efficient, has good convergence properties, and can adaptively adjust the learning rates of different model parameters based on the historical gradient information.
Binary cross-entropy. The binary cross-entropy loss function is a widely used metric for evaluating the performance of binary classification models. It is a specialized version of the categorical cross-entropy loss function designed for tasks with only two classes, and it measures how well the model predicts the correct class (entailment or not entailment) for a given input pair of sentences. The model assigns a probability to the input being classified as entailment, with a score closer to 1 indicating higher confidence in entailment.
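Formally, for $N$ training pairs with gold label $y_i \in \{0, 1\}$ and predicted entailment probability $\hat{y}_i$, the binary cross-entropy loss is

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\,\right].$$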
Categorical cross-entropy. The categorical cross-entropy loss function is a widely used metric for evaluating the performance of multi-class classification models. It calculates the difference between the predicted probability distribution of the classes and the true distribution represented by the labels, and is particularly useful when the classes are imbalanced or the cost of misclassification is high. In our case, we use the categorical cross-entropy loss function to train a 3-class classification model for the entailment prediction task. The model is trained to predict whether a given premise and hypothesis entail each other (entailment), contradict each other (contradiction), or neither (neutral). The model will return a probability score for each class, representing the likelihood that the input belongs to that class. The class with the highest probability score is typically considered the model's inference prediction for the input P-H pair.
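Formally, with $y_{i,c}$ the one-hot gold label and $\hat{y}_{i,c}$ the predicted probability of class $c$ (entailment, contradiction, or neutral) for pair $i$, the categorical cross-entropy loss over $N$ pairs is

$$\mathcal{L}_{\mathrm{CCE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{3} y_{i,c} \log \hat{y}_{i,c}.$$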
The following table (Table 2) displays the evaluation results of the RNN models for the NLI task, which were obtained using different model architectures for both binary and three-way NLI classification tasks. To assess the performance of the trained models, we used both accuracy and F1 score as evaluation metrics.
Table 2
Evaluation Results of RNN Models for Natural Language Inference in Arabic
| Classification Type | Dataset | Model | Accuracy | F1-score |
|---|---|---|---|---|
| Binary Classification (True, False) | ArbTEDS + ArNLI | Simple RNN | 68.15% | 30.71% |
| | | LSTM | 73.52% | 52.97% |
| | | BiLSTM | 73.60% | 57.77% |
| | | GRU | 72.45% | 51.20% |
| | | BiGRU | 72.81% | 56.69% |
| 3-way (Entailment, Contradiction, Neutral) | ArNLI | Simple RNN | 53.77% | 38.57% |
| | | LSTM | 61.07% | 54.70% |
| | | BiLSTM | 63.65% | 61.38% |
| | | GRU | 61.77% | 59.51% |
| | | BiGRU | 61.07% | 60.01% |
Table 2 presents the performance of various models on the classification task, with the accuracy and F1 score reflecting the models' ability to correctly classify input sentence pairs for the NLI task. These metrics provide insight into the effectiveness of the different models.
Based on the evaluation results, the Bi-LSTM model with a 300-dimensional embedding vector size outperformed all other models in both the binary and 3-way NLI classification tasks, achieving the highest accuracy and F1 score among all tested models. These results suggest that the Bi-LSTM model is highly effective at accurately classifying input data for the NLI task. The evaluation also showed that, in terms of F1 score, all models performed better in the 3-way classification task (entailment, contradiction, and neutral) than in the binary classification task (entailment, no entailment). This may be because the 3-class setting allows the models to better distinguish between the three classes, since there is a clear difference between contradiction and neutral, whereas the binary task groups contradiction and neutral together as "no entailment," which may confuse the model. These results suggest that a 3-class classification approach may be more effective at improving the performance of NLI models in certain situations.
The following table (Table 3) lists the different hyperparameters for the Bi-LSTM model, which had the best performance. To optimize the performance of the Bi-LSTM model, we conducted a hyperparameter optimization process by testing various combinations of hyperparameters and evaluating the model's performance on the validation set. Through this process, we were able to identify the optimal set of hyperparameter values that resulted in the best model performance.
Table 3
Bi-LSTM Model Hyperparameter Settings
| Parameter | Description | Value |
|---|---|---|
| LSTM units | The number of units determines the size of the hidden state vector in the LSTM layers and is a key factor in the model's ability to store and process information about the input sequence. | 64 |
| Initial learning rate | The step size the model takes towards the minimum of the loss function during training. | 0.001 |
| L2 regularization rate | A type of regularization that adds a penalty to the model's loss function based on the sum of the squares of the model's weights. | 0.001 |
| Dropout rate | The fraction of the layer's outputs randomly set to zero during training, which helps prevent overfitting and improves the generalization ability of the model. | 0.5 |
| Recurrent dropout rate | The fraction of recurrent connections in the RNN that are dropped out during training. | 0.5 |
| Batch size | The number of training examples processed together in each iteration of the training process. | 32 |
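In Keras terms, these settings would be applied roughly as follows; attaching the L2 penalty to the kernel weights is an assumption, since Table 3 does not specify which weight matrices were regularized.

```python
from tensorflow.keras.layers import LSTM, Bidirectional
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2

# Bi-LSTM layer configured with the Table 3 hyperparameters.
bilstm_layer = Bidirectional(LSTM(
    units=64,                      # hidden state size
    dropout=0.5,                   # dropout on the layer inputs
    recurrent_dropout=0.5,         # dropout on the recurrent connections
    kernel_regularizer=l2(0.001),  # L2 penalty (assumed to apply to kernel weights)
))

optimizer = Adam(learning_rate=0.001)  # initial learning rate

# Training then proceeds with batch_size=32, e.g.:
# model.fit([p_train, h_train], y_train, batch_size=32, validation_data=...)
```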
Figure 2 presents a summary of the model's architecture, including the number of layers, the number of parameters, and the output shape of each layer, covering the input and output layers as well as the word embedding and hidden layers. It gives a clear and concise overview of the design choices implemented in the model. Figure 3 shows the performance of the Bi-LSTM model on the training and validation sets over several epochs: the green line represents performance on the training set and the blue line on the validation set. Plots (a), (b), and (c) show the accuracy, F1 score, and loss, respectively; the x-axis represents the number of epochs and the y-axis the metric value.
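Curves like those in Figure 3 can be produced from the History object returned by model.fit(); the sketch below assumes a custom F1 metric was registered at compile time under the name "f1_score", since Keras provides no built-in F1 metric in this setting.

```python
import matplotlib.pyplot as plt

def plot_history(history):
    """Plot training (green) vs. validation (blue) curves for each metric."""
    for metric in ("accuracy", "f1_score", "loss"):  # metric names are assumptions
        plt.figure()
        plt.plot(history.history[metric], color="green", label="training")
        plt.plot(history.history["val_" + metric], color="blue", label="validation")
        plt.xlabel("Epochs")
        plt.ylabel(metric)
        plt.legend()
    plt.show()
```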
Most existing works on Arabic NLI have used the ArbTEDS dataset, which includes only 600 pairs of texts, and have adopted only a binary classification approach (entails or not entails). ArNLI is a larger, recently created dataset of 6,366 pairs [18]. Its authors used this dataset with different machine learning methods and features to classify the inference relationship into three categories: inference, neutral, or contradiction. They found that the best accuracy was achieved by combining the contradiction vector with either the word tri-grams vector or the chars vector, reaching accuracies between 58% and 75% with various classifiers. In our own work, we used only RNN models with word embeddings on the ArNLI dataset for Arabic NLI, without any other feature extraction. We obtained encouraging results, the best being a Bi-LSTM model that achieved an accuracy of 63.65% and an F1 score of 61.38% on the 3-way classification. These results highlight the effectiveness of RNNs, specifically Bi-LSTM with word embeddings, for NLI tasks on Arabic language data. However, we recognize that there is still room for improvement and plan to take further steps to increase performance.
Table 4 summarizes the various works that have been proposed for Arabic NLI, including ours, listing for each the type of approach, the method employed, the classification type, the evaluation scores, and the dataset used. The approaches are classified as syntactic, lexical, semantic, or deep learning based, depending on the level of abstraction at which they operate when processing the NLI task: syntactic approaches focus on the structure and arrangement of words in a sentence, lexical approaches on the meanings of individual words, semantic approaches rely on semantic techniques to extract the meanings of words and phrases, and deep learning-based approaches use neural networks to learn to represent and process natural language in order to make inferences.
Table 4
Summary of Arabic NLI Approaches and Results
| Approach | Reference | Strategy | Classification | Accuracy | F1-score | Dataset |
|---|---|---|---|---|---|---|
| Syntactic approaches | [3] | Search for the editing distance on the syntactic trees of the P-H pair | Binary | 66.3% | 66.4% | ArbTEDS |
| Lexical approaches | [4] | Extract related words; identify the negation and polarity of the P-H pair | Binary | 69.3% | - | ArbTEDS |
| | [19] | Bigram matching | Binary | - | 61% | ArbTEDS |
| | [7] | Use of semantic measure and word sense disambiguation | Binary | 70% | - | 200 question/answer pairs |
| | [11] | Enriched representation of P and H; alignment-based approach | Binary | 75.84% | - | ArbTEDS |
| | [1] | Linear combination of text similarity measures and weights; genetic algorithm to derive an optimal similarity function | Binary | 73.3% | 71.7% | ArbTEDS |
| Semantic approaches | [8; 9] | Predicate-argument representation and extraction of features | Binary | 73.33% | 70.4% | 500 question/answer pairs |
| | [5] | Use of traditional and word embedding features | Binary | 76.2% | - | ArbTEDS |
| | [12] | Calculate the degree of similarity using Earth Mover's Distance and word embeddings | Binary | 76.5% | - | ArbTEDS |
| | [18] | Variety of features with different language models (TF-IDF, n-grams, and word embeddings) | Three-way | 58% to 75% using different classifiers | - | ArNLI |
| Deep learning approaches | Our proposed work | Training different types of recurrent neural network models with word embeddings | Binary | 73.60% | 57.77% | ArbTEDS + ArNLI |
| | | | Three-way | 63.65% | 61.38% | ArNLI |