4.2 Neural Translation Architecture
For efficiency, Vaswani et al.'s (2017) Transformer (big) NMT model was used. The Transformer stacks several identical layers, each consisting of multi-head attention and a position-wise feed-forward network. It follows the standard sequence-to-sequence design with two essential components, an encoder and a decoder: the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn), and, given z, the decoder generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next. The Transformer realises this architecture using stacked self-attention and point-wise, fully connected layers in both the encoder and decoder, and has been shown to train faster and produce better-quality translations than competing systems (Vaswani et al., 2017). In pre-processing, byte pair encoding (BPE) with 40,000 merge operations was applied to enable open-vocabulary decoding by splitting the vocabulary into sub-word units. BPE repeatedly merges every occurrence of the most frequent symbol pair and adds the resulting character n-gram to the vocabulary until the desired vocabulary size is reached, which ensures that there are no out-of-vocabulary words in the corpus. Each system was trained with a batch size of at most 1,500 tokens for a maximum of 150 epochs. Table 3 shows the statistics for the English-Persian side of the dataset and the number of sentences in the training, test and development sets.
Table 3
Statistics of the English-Persian side of the dataset
Dataset | Number of sentences |
Training | |
Development (per system) | 2,000 |
Test | 6,350 |
The MT systems were built with the training set shown in Table 3, which comprised the remaining part of the data. The test set, used to annotate the data, contained 6,350 sentences. Finally, for each system a set of 2,000 sentence pairs was randomly selected as the development set. Duplicated sentence pairs were removed for the instance selection system so that it could not retrieve sentences identical to the test sentences.
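To make the BPE pre-processing step concrete, the following is a minimal sketch of the merge loop described above, in the style of Sennrich et al.'s byte pair encoding. It is illustrative only and not the actual pre-processing pipeline used in the study, where a standard sub-word toolkit would typically be run with the full 40,000 merge operations.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the (space-separated) word vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given pair into a single new symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters, with '</w>' marking word ends.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10  # the study uses 40,000 merge operations on the real corpus
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)    # its merge becomes a new sub-word unit
```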
4.4 Manual linguistic evaluation of results and output comparison
Table 5 lists the thirty terms selected for evaluation. Six of these terms (File, HTML, Toolbar, Desktop, Username and IP address) revealed a discrepancy in performance between the two systems, in that at least one system translated each of them incorrectly. A report of the statistics and an analysis of how each system translated each of these terms follows, to help illuminate the situation.
Table 5
List of selected terms for evaluation (f = frequency)
Term | f | Term | f | Term | f |
File | 90 | Web | 4 | Server | 10 |
Filename | 3 | Web-site | 1 | Network | 6 |
Click | 7 | Load | 6 | Disk | 125 |
Full-screen | 6 | Toolbar | 4 | Print | 3 |
Internet address | 6 | Desktop | 13 | Icon | 5 |
Button | 5 | Keyboard | 4 | Browser | 4 |
Link | 8 | Operating system | 1 | Retrieve | 8 |
HTML | 8 | Username | 8 | Configure | 5 |
System | 10 | Scan | 4 | Font | 10 |
Monitor | 3 | IP address | 7 | Bookmark | 4 |
Table 6 shows frequency counts and error percentages for each system's translation of these terms. To be counted as correct, the term translation had to match the translation in the reference (test set); a rough sketch of this tallying procedure follows Table 6. Note that terms such as 'File' were treated as single terms, whereas terms such as 'IP address' were treated as multiword (two-word) terms.
Table 6
Statistics of frequency and error percentage of each system translating terms
Term | Translation | f | N of errors | Error % | Failed to translate | Mistranslated |
File | پرونده | | | | | |
Static system | | 90 | 15* | 0.5% | 2* | 13* |
Instance selection system | | 90 | 12* | 0.4% | 0 | 12* |
Desktop | رومیزی | | | | | |
Static system | | 13 | 0 | 0% | 0 | 0 |
Instance selection system | | 13 | 1* | 0.0333% | 0 | 1* |
Username | نام کاربری | | | | | |
Static system | | 8 | 3* | 0.1% | 2* | 1* |
Instance selection system | | 8 | 0 | 0 | 0 | 0 |
HTML | زنگام | | | | | |
Static system | | 8 | 6* | 0.2% | 2* | 4* |
Instance selection system | | 8 | 3* | 0.1% | 2* | 1* |
IP address | نشانی اینترنتی | | | | | |
Static system | | 6 | 5* | 0.2% | 2* | 3* |
Instance selection system | | 6 | 3* | 0.01333% | 0 | 3* |
Toolbar | نوار ابزار | | | | | |
Static system | | 6 | 6* | 0.1333% | 2* | 4* |
Instance selection system | | 6 | 0 | 0 | 0 | 0 |
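As a rough illustration of how such a tally could be automated, the sketch below counts a term occurrence as correct only when the reference Persian translation of the term appears in the system output. The function and its inputs are hypothetical, and in the study the split of errors into 'failed to translate' and 'mistranslated' was judged manually rather than by a heuristic of this kind.

```python
def term_error_stats(term_src, term_ref, pairs):
    """Approximate the per-term tally in Table 6 (illustrative helper only).

    term_src: source-language term, e.g. 'File'
    term_ref: reference Persian translation, e.g. 'پرونده'
    pairs:    (system_output, reference_sentence) tuples for every test
              sentence whose source side contains term_src
    """
    f = len(pairs)                                   # term frequency in the test set
    errors = sum(term_ref not in hyp for hyp, _ in pairs)
    return {
        'term': term_src,
        'f': f,
        'n_errors': errors,
        'error_rate': errors / f if f else 0.0,      # fraction of occurrences in error
    }

# Hypothetical usage for the term 'File':
# stats = term_error_stats('File', 'پرونده', file_sentence_pairs)
```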
The following six exemplars illustrate how the translation performance of the two systems was compared and identify the issues that arose, to inform future practice.
Example 1
Translation of the term ‘File’
For the analysis, the outputs of the two systems were evaluated and searched for prevalent, systematic errors. For the term 'File', the source sentence contains a term that should be translated into پرونده /Parvande/, the correct Persian translation, as shown in the first exemplar:
Source: Unexpected end-of-file on standard input
Static: Unexpected end on standard input: ورودی های استاندارد غیر منتظره
Instance-selection: Unexpected end-of-file on standard input: پایان غیار منتظاره ی ورودی های استاندارد پرونده
As shown in Table 6, the static system mistranslated the term 13 times, copying the English word 'File' directly into the output, and failed to translate it in two further cases, giving the 0.5% error rate reported in the table. The instance selection system mistranslated the term into the English word 'File' 12 times (0.4% error rate) but translated every occurrence. Although the statistical difference between the two systems is small, the results reveal cases where instance selection provided a better-quality translation. The example above clearly shows that the static system omitted the translation of '-of-file' and produced an incomprehensible Persian translation, whereas the instance selection system translated the term and conveyed the correct meaning. Although the focus of the study was on evaluating how the systems translated terminology, it is also evident that the static system left out the verb of the sentence, whereas the instance selection system rendered it successfully. Instance selection's more structured output appears to have been an advantage here.
Example 2
Translation of the term ‘Desktop’
The term 'Desktop', also classified as 'Sure' and a single term, shows some departure from the findings for the other cases. The static system translated the term from English (Desktop) into Persian (رومیزی /Rumizi/) without any errors. The instance selection system, by contrast, mistranslated the term on one occasion out of its 13 occurrences, giving the 0.0333% error rate shown in Table 6.
Source: Graphical desktop environment
Static: Graphical desktop environment: محیط رومیزی گرافیکی
Instance selection: Graphical desktop environment: گرافیکی desktop محیط
Thus, the static system correctly translated the term into its Persian equivalent, whereas the instance selection system carried the English word 'Desktop' directly into the output. In this case the static system outperformed the instance selection system. Of note is that the instance selection system retrieved some training sentences in which the term had not been translated into Persian but was left in English in the target, which may explain the error, since the instance selection system depends strongly on the retrieved sentences.
Example 3
Translation of the term ‘Username’
The third term, 'Username', classified as 'Sure' and a single term, had a frequency of eight, the same as the term HTML discussed below (see Table 6). Across all the selected sentences, the instance selection system translated the term without any error: every occurrence of 'username' was correctly rendered in the target language as نام کاربری /Name karbari/. In contrast, out of the eight occurrences, the static system failed to translate the term twice and, on one occasion, rendered the compound word 'username' only partially, as 'name'. As shown in Table 6, the error percentage of the static system was 0.1%, against 0% for the instance selection system. In this case, the instance selection system clearly outperformed the static system.
Source: This is your username
Static: This is your name: این نام شما است
Instance selection: This is your username: این نام کاربری شما است
As this example makes clear, the static system omitted or only partially translated the term and therefore failed to produce a correct translation of the sentence, whereas the instance selection system translated both the term and the sentence correctly, conveying the complete meaning.
Example 4
Translation of the term ‘HTML’
The next term, 'HTML', translated into Persian as زنگام /Zangam/, was also classified as 'Sure' and a single term. Out of its eight occurrences, the instance selection system made three errors (0.1% in total): in two cases it did not translate the term at all within the sentence, and in one case it copied the term directly from source to target (HTML to HTML). The static system made twice as many errors, six out of the eight occurrences (0.2%): twice it did not translate the term, and on four occasions it copied the English term directly into the Persian output.
Source: Inspect HTML
Static: Inspect: با توجه به
Instance selection: Inspect HTML: HTML با توجه به
As the example above shows, the static system did not translate the term at all. The instance selection system did not translate the term into its Persian equivalent either, but it carried the English term into the Persian output, so the meaning was conveyed and the translation is understandable.
Example 5
Translation of the term ‘IP address’
The fifth term, 'IP address', is another that the systems translated differently. It is classified as 'Sure' but is a multiword term, with a frequency of six in the data set. As Table 6 summarises, the static system correctly translated it into Persian (نشانی اینترنتی /Neshani Interneti/) only once; it failed to translate the term on two occasions and mistranslated it on the others, in one case rendering it as 'IP network' (شبکه اینترنتی) rather than IP address and otherwise carrying the English term directly into the Persian sentence. In contrast, the instance selection system copied the English term directly into the Persian output three times out of six (e.g. 'IP address' rendered as 'IP address'), the three errors reported in Table 6, and correctly translated the term into its Persian equivalent (نشانی اینترنتی /Neshani Interneti/) on the other three occasions.
Source: The IP address as seen by the machine
Static: The IP network as seen by the machine همانطور که توسط ماشین دیده شده شبکه اینترنتی
Instance selection: The IP address as seen by the machine همانطور که توسط ماشین دیده شده نشانی اینترنتی
As the example above shows, the static system mistranslated this multiword term: it translated the first part of the term but failed to translate the second part correctly. In comparison, the instance selection system translated the term correctly in half of its occurrences.
Example 6
Translation of the term 'Toolbar'
The last term, 'Toolbar', also classified as 'Sure' and a single term, had a frequency of six in the selected data, as shown in Table 6. The instance selection system translated the term without any mistake, correctly mapping 'Toolbar' to نوار ابزار /Navar abzar/ in every case. The static system, by contrast, produced errors on all six occurrences: it mistranslated the term four times, typically rendering it as 'Status bar' (میله وضعیت /Mile vazeyat/) instead of the correct Persian word, and failed to translate it at all on the other two occasions. Thus, the instance selection system again outperformed the static system. As the example shows, the static system mistranslated the term as 'Status bar' instead of 'Toolbar', whereas the instance selection system translated it correctly on every occasion.
Source: Remove from Toolbar
Static: Remove from Status bar: حذف از نوار وضعیت
Instance selection: Remove from Toolbar: حذف از نوار ابزار
4.5 Manual evaluation results
Thirty terms were selected for the manual evaluation of the results, each term having an equal chance of being chosen through random sampling. Based on the discrepancies in translation for the six identified terms and the analyses discussed in the previous section, it is clear that the instance selection system performed better in most cases. The static system outperformed the instance selection system only once, for the term 'Desktop', which the instance selection system mistranslated. This reinforces the value of analysing the outputs in detail despite the small difference in the two systems' statistical performance in terms of BLEU score: in particular, it can be argued that instance selection gave a better overall quality of translation. Empirically, the study provides evidence of the effectiveness of the instance selection approach and shows that the manual evaluation was far more informative than the BLEU score, since the BLEU score did not highlight any improvements, whereas the manual evaluation identified the nature of the issues involved.
4.5.1 BLEU score comparison
To further investigate the effect of instance selection's injection of the terms into the system, the BLEU score was computed only for the test-set sentences that contained the term(s), for both the static and the instance selection models. As Table 7 shows, these analyses were applied to three subsets: sentences containing 'Sure' terms only, sentences containing 'Possible' terms only, and sentences containing both 'Sure' and 'Possible' terms (i.e. more than one term in a sentence). Table 7 reports the BLEU scores of the two systems across these configurations.
Table 7
BLEU score of systems in different configurations
Instance-selection system | BLEU | Static system | BLEU |
Sure terms | 40.15 | Sure terms | 42.70 |
Possible terms | 47.30 | Possible terms | 44.92 |
Sure + possible terms | 43.36 | Sure + possible terms | 42.85 |
For 'Sure' terms, the static system's BLEU score of 42.70 exceeded the instance selection system's score of 40.15 by 2.55 points, as Table 7 shows. For 'Possible' terms the result was reversed: the instance selection system's score of 47.30 exceeded the static system's score of 44.92 by 2.38 points. For sentences containing both a Sure and a Possible term, the difference was minimal (0.51 points in favour of the instance selection system). Although the overall differences are small, the results suggest that the instance selection system may be preferable for translating the terms in most configurations. A possible reason is that instance selection forces the MT system to use the terms found in the retrieved sentences, which can create issues when those terms have been seen only rarely in the training data: because the system is forced to translate the individual terms correctly, the remainder of the translation can be penalised if the system does not know the correct translation of the following words. 'Sure' terms tend to be rarer and can be translated with confidence, whereas 'Possible' terms carry an element of ambiguity but occur more frequently and are therefore seen more often in the training data.
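As a minimal sketch (not the authors' script) of how such a filtered comparison can be computed, the function below scores only those test sentences whose source side contains one of the terms of interest, assuming the sacrebleu library is available; the variable names in the usage comment are illustrative placeholders rather than names from the study.

```python
from sacrebleu.metrics import BLEU

def filtered_bleu(sources, hypotheses, references, terms):
    """BLEU over the subset of sentences whose source contains any of the terms."""
    keep = [i for i, src in enumerate(sources)
            if any(t.lower() in src.lower() for t in terms)]
    hyps = [hypotheses[i] for i in keep]
    refs = [references[i] for i in keep]
    return BLEU().corpus_score(hyps, [refs]).score

# Hypothetical usage:
# filtered_bleu(src_sents, static_out, ref_sents, sure_terms)
# filtered_bleu(src_sents, instance_out, ref_sents, sure_terms + possible_terms)
```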
While the BLEU comparison indicates that the instance selection system performed only slightly better than the static system across the different scenarios, the deeper analysis of the translation data shows that it was better in more respects. In summary, the instance selection system outperformed the static system in translating single terms, with the static system performing better in only one case. The instance selection system was also superior in translating multiword terms, whereas the static system mistranslated the second part of a multiword term. Furthermore, when the BLEU score was computed only on sentences containing the terms, the instance selection system outperformed the static system in every configuration except one: it performed well both for sentences with Possible terms and for sentences containing Sure plus Possible terms. Overall, therefore, the evidence supports the argument that the instance selection system translated the given terms more effectively than the static system. It can also be concluded that, although the static system was generally less effective than the instance selection system, it performed better in two respects: it correctly translated the term 'Desktop' into Persian, where the instance selection system simply transferred the English term, and it achieved a higher BLEU score for sentences containing a Sure term only.