4.2 Neural Translation Architecture
For efficiency, Vaswani et al.'s (2017) Transformer (big) NMT model was used. The Transformer stacks several identical layers, each consisting of multi-head attention and a position-wise feed-forward network. It follows the standard sequence-to-sequence design with two essential components, an encoder and a decoder: the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn), and, given z, the decoder generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next. The Transformer realises this architecture using stacked self-attention and point-wise, fully connected layers in both the encoder and decoder, and has been shown to train faster and produce better-quality translations than competing systems (Vaswani et al., 2017). In pre-processing, byte pair encoding (BPE) with 40,000 merge operations was applied to enable open-vocabulary decoding by splitting the vocabulary into sub-word units. BPE repeatedly merges every occurrence of the most frequent symbol pair and adds the resulting character n-gram to the vocabulary until the desired vocabulary size is reached, which ensures that there are no out-of-vocabulary words in the corpus. Each system was trained with a batch size of at most 1,500 tokens for a maximum of 150 epochs. Table 3 shows the statistics for the English-Persian side of the dataset and the number of sentences in the training, test and development sets.
Table 3
Statistics of the English-Persian side of the dataset
Dataset | Number of sentences |
Training | |
Development (per system) | 2,000 |
Test | 6,350 |
The MT systems were built with the training set shown in Table 3, which comprised the remaining part of the data. The test set, used to annotate the data, contained 6,350 sentences. Finally, for each system a set of 2,000 sentence pairs was randomly selected as the development set. Duplicated sentence pairs were removed for the instance selection system so that it could not retrieve sentences identical to the test sentences.
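To make the BPE pre-processing step concrete, the following is a minimal sketch of the merge loop described above, in the style of Sennrich et al.'s byte pair encoding. It is illustrative only and not the actual pre-processing pipeline used in the study, where a standard sub-word toolkit would typically be run with the full 40,000 merge operations.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the (space-separated) word vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given pair into a single new symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters, with '</w>' marking word ends.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10  # the study uses 40,000 merge operations on the real corpus
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)    # its merge becomes a new sub-word unit
```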
4.4 Manual linguistic evaluation of results and output comparison
Table 5 lists the thirty terms selected for evaluation. Six of these terms (File, HTML, Toolbar, Desktop, Username and IP address) revealed a discrepancy in performance between the two systems, in that at least one system translated each of them incorrectly. A report of the statistics and an analysis of how each system translated each of these terms follows, to help illuminate the situation.
Table 5
List of selected terms for evaluation (f = frequency)
Term | f | Term | f | Term | f |
File | 90 | Web | 4 | Server | 10 |
Filename | 3 | Web-site | 1 | Network | 6 |
Click | 7 | Load | 6 | Disk | 125 |
Full-screen | 6 | Toolbar | 4 | Print | 3 |
Internet address | 6 | Desktop | 13 | Icon | 5 |
Button | 5 | Keyboard | 4 | Browser | 4 |
Link | 8 | Operating system | 1 | Retrieve | 8 |
HTML | 8 | Username | 8 | Configure | 5 |
System | 10 | Scan | 4 | Font | 10 |
Monitor | 3 | IP address | 7 | Bookmark | 4 |
Table 6 shows frequency counts and error percentages for each system's translation of these terms. To be counted as correct, the term translation had to match the translation in the reference (test set); a rough sketch of this tallying procedure follows Table 6. Note that terms such as 'File' were treated as single terms, whereas terms such as 'IP address' were treated as multiword (two-word) terms.
Table 6
Statistics of frequency and error percentage of each system translating terms
Term | Translation | f | N of errors | Error % | Failed to translate | Mistranslated |
File | پرونده | | | | | |
Static system | | 90 | 15* | 0.5% | 2* | 13* |
Instance selection system | | 90 | 12* | 0.4% | 0 | 12* |
Desktop | رومیزی | | | | | |
Static system | | 13 | 0 | 0% | 0 | 0 |
Instance selection system | | 13 | 1* | 0.0333% | 0 | 1* |
Username | نام کاربری | | | | | |
Static system | | 8 | 3* | 0.1% | 2* | 1* |
Instance selection system | | 8 | 0 | 0 | 0 | 0 |
HTML | زنگام | | | | | |
Static system | | 8 | 6* | 0.2% | 2* | 4* |
Instance selection system | | 8 | 3* | 0.1% | 2* | 1* |
IP address | نشانی اینترنتی | | | | | |
Static system | | 6 | 5* | 0.2% | 2* | 3* |
Instance selection system | | 6 | 3* | 0.01333% | 0 | 3* |
Toolbar | نوار ابزار | | | | | |
Static system | | 6 | 6* | 0.1333% | 2* | 4* |
Instance selection system | | 6 | 0 | 0 | 0 | 0 |
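As a rough illustration of how such a tally could be automated, the sketch below counts a term occurrence as correct only when the reference Persian translation of the term appears in the system output. The function and its inputs are hypothetical, and in the study the split of errors into 'failed to translate' and 'mistranslated' was judged manually rather than by a heuristic of this kind.

```python
def term_error_stats(term_src, term_ref, pairs):
    """Approximate the per-term tally in Table 6 (illustrative helper only).

    term_src: source-language term, e.g. 'File'
    term_ref: reference Persian translation, e.g. 'پرونده'
    pairs:    (system_output, reference_sentence) tuples for every test
              sentence whose source side contains term_src
    """
    f = len(pairs)                                   # term frequency in the test set
    errors = sum(term_ref not in hyp for hyp, _ in pairs)
    return {
        'term': term_src,
        'f': f,
        'n_errors': errors,
        'error_rate': errors / f if f else 0.0,      # fraction of occurrences in error
    }

# Hypothetical usage for the term 'File':
# stats = term_error_stats('File', 'پرونده', file_sentence_pairs)
```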
The following six exemplars illustrate how the translation performance of the two systems was compared and identify the issues that arose, to inform future practice.
Example 1
Translation of the term ‘File’
For the analysis, the outputs of the two systems were evaluated and searched for prevalent, systematic errors. For the term 'File', the source sentence contains a term that should be translated into پرونده /Parvande/, the correct Persian translation, as shown in the first exemplar:
Source: Unexpected end-of-file on standard input
Static: Unexpected end on standard input: ورودی های استاندارد غیر منتظره
Instance-selection: Unexpected end-of-file on standard input: پایان غیار منتظاره ی ورودی های استاندارد پرونده
As shown in Table 6, the static system mistranslated the term 13 times, copying the English word 'File' directly into the output, and failed to translate it in two further cases, giving the 0.5% error rate reported in the table. The instance selection system mistranslated the term into the English word 'File' 12 times (0.4% error rate) but translated every occurrence. Although the statistical difference between the two systems is small, the results reveal cases where instance selection provided a better-quality translation. The example above clearly shows that the static system omitted the translation of '-of-file' and produced an incomprehensible Persian translation, whereas the instance selection system translated the term and conveyed the correct meaning. Although the focus of the study was on evaluating how the systems translated terminology, it is also evident that the static system left out the verb of the sentence, whereas the instance selection system rendered it successfully. Instance selection's more structured output appears to have been an advantage here.
Example 2
Translation of the term ‘Desktop’
The term 'Desktop', also classified as 'Sure' and a single term, shows some departure from the findings for the other cases. The static system translated the term from English (Desktop) into Persian (رومیزی /Rumizi/) without any errors. The instance selection system, by contrast, mistranslated the term on one occasion out of its 13 occurrences, giving the 0.0333% error rate shown in Table 6.
Source: Graphical desktop environment
Static: Graphical desktop environment: محیط رومیزی گرافیکی
Instance selection: Graphical desktop environment: گرافیکی desktop محیط
Thus, the static system correctly translated the term into its Persian equivalent, whereas the instance selection system carried the English word 'Desktop' directly into the output. In this case the static system outperformed the instance selection system. Of note is that the instance selection system retrieved some training sentences in which the term had not been translated into Persian but was left in English in the target, which may explain the error, since the instance selection system depends strongly on the retrieved sentences.
Example 3
Translation of the term ‘Username’
The third term, 'Username', classified as 'Sure' and a single term, had a frequency of eight, the same as the term HTML discussed below (see Table 6). Across all the selected sentences, the instance selection system translated the term without any error: every occurrence of 'username' was correctly rendered in the target language as نام کاربری /Name karbari/. In contrast, out of the eight occurrences, the static system failed to translate the term twice and, on one occasion, rendered the compound word 'username' only partially, as 'name'. As shown in Table 6, the error percentage of the static system was 0.1%, against 0% for the instance selection system. In this case, the instance selection system clearly outperformed the static system.
Source: This is your username
Static: This is your name: این نام شما است
Instance selection: This is your username: این نام کاربری شما است
As this example makes clear, the static system omitted or only partially translated the term and therefore failed to produce a correct translation of the sentence, whereas the instance selection system translated both the term and the sentence correctly, conveying the complete meaning.
Example 4
Translation of the term ‘HTML’
The next term, 'HTML', translated into Persian as زنگام /Zangam/, was also classified as 'Sure' and a single term. Out of its eight occurrences, the instance selection system made three errors (0.1% in total): in two cases it did not translate the term at all within the sentence, and in one case it copied the term directly from source to target (HTML to HTML). The static system made twice as many errors, six out of the eight occurrences (0.2%): twice it did not translate the term, and on four occasions it copied the English term directly into the Persian output.
Source: Inspect HTML
Static: Inspect: با توجه به
Instance selection: Inspect HTML: HTML با توجه به
As the example above shows, the static system did not translate the term at all. The instance selection system did not translate the term into its Persian equivalent either, but it carried the English term into the Persian output, so the meaning was conveyed and the translation is understandable.
Example 5
Translation of the term ‘IP address’
The fifth term, 'IP address', is another that the systems translated differently. It is classified as 'Sure' but is a multiword term, with a frequency of six in the data set. As Table 6 summarises, the static system correctly translated it into Persian (نشانی اینترنتی /Neshani Interneti/) only once; it failed to translate the term on two occasions and mistranslated it on the others, in one case rendering it as 'IP network' (شبکه اینترنتی) rather than IP address and otherwise carrying the English term directly into the Persian sentence. In contrast, the instance selection system copied the English term directly into the Persian output three times out of six (e.g. 'IP address' rendered as 'IP address'), the three errors reported in Table 6, and correctly translated the term into its Persian equivalent (نشانی اینترنتی /Neshani Interneti/) on the other three occasions.
Source: The IP address as seen by the machine
Static: The IP network as seen by the machine همانطور که توسط ماشین دیده شده شبکه اینترنتی
Instance selection: The IP address as seen by the machine همانطور که توسط ماشین دیده شده نشانی اینترنتی
As the example above shows, the static system mistranslated this multiword term: it translated the first part of the term but failed to translate the second part correctly. In comparison, the instance selection system translated the term correctly in half of its occurrences.
Example 6
Translation of the term 'Toolbar'
The last term, 'Toolbar', also classified as 'Sure' and a single term, had a frequency of six in the selected data, as shown in Table 6. The instance selection system translated the term without any mistake, correctly mapping 'Toolbar' to نوار ابزار /Navar abzar/ in every case. The static system, by contrast, produced errors on all six occurrences: it mistranslated the term four times, typically rendering it as 'Status bar' (میله وضعیت /Mile vazeyat/) instead of the correct Persian word, and failed to translate it at all on the other two occasions. Thus, the instance selection system again outperformed the static system. As the example shows, the static system mistranslated the term as 'Status bar' instead of 'Toolbar', whereas the instance selection system translated it correctly on every occasion.
Source: Remove from Toolbar
Static: Remove from Status bar: حذف از نوار وضعیت
Instance selection: Remove from Toolbar: حذف از نوار ابزار
4.5 Manual evaluation results
Thirty terms were selected for the manual evaluation of the results, each term having an equal chance of being chosen through random sampling. Based on the discrepancies in translation for the six identified terms and the analyses discussed in the previous section, it is clear that the instance selection system performed better in most cases. The static system outperformed the instance selection system only once, for the term 'Desktop', which the instance selection system mistranslated. This reinforces the value of analysing the outputs in detail despite the small difference in the two systems' statistical performance in terms of BLEU score: in particular, it can be argued that instance selection gave a better overall quality of translation. Empirically, the study provides evidence of the effectiveness of the instance selection approach and shows that the manual evaluation was far more informative than the BLEU score, since the BLEU score did not highlight any improvements, whereas the manual evaluation identified the nature of the issues involved.
4.5.1 BLEU score comparison
To further investigate the effect of instance selection's injection of the terms into the system, the BLEU score was computed only for the test-set sentences that contained the term(s), for both the static and the instance selection models. As Table 7 shows, these analyses were applied to three subsets: sentences containing 'Sure' terms only, sentences containing 'Possible' terms only, and sentences containing both 'Sure' and 'Possible' terms (i.e. more than one term in a sentence). Table 7 reports the BLEU scores of the two systems across these configurations.
Table 7
BLEU score of systems in different configurations
Instance-selection system | BLEU | Static system | BLEU |
Sure terms | 40.15 | Sure terms | 42.70 |
Possible terms | 47.30 | Possible terms | 44.92 |
Sure + possible terms | 43.36 | Sure + possible terms | 42.85 |
For 'Sure' terms, the static system's BLEU score of 42.70 exceeded the instance selection system's score of 40.15 by 2.55 points, as Table 7 shows. For 'Possible' terms the result was reversed: the instance selection system's score of 47.30 exceeded the static system's score of 44.92 by 2.38 points. For sentences containing both a Sure and a Possible term, the difference was minimal (0.51 points in favour of the instance selection system). Although the overall differences are small, the results suggest that the instance selection system may be preferable for translating the terms in most configurations. A possible reason is that instance selection forces the MT system to use the terms found in the retrieved sentences, which can create issues when those terms have been seen only rarely in the training data: because the system is forced to translate the individual terms correctly, the remainder of the translation can be penalised if the system does not know the correct translation of the following words. 'Sure' terms tend to be rarer and can be translated with confidence, whereas 'Possible' terms carry an element of ambiguity but occur more frequently and are therefore seen more often in the training data.
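As a minimal sketch (not the authors' script) of how such a filtered comparison can be computed, the function below scores only those test sentences whose source side contains one of the terms of interest, assuming the sacrebleu library is available; the variable names in the usage comment are illustrative placeholders rather than names from the study.

```python
from sacrebleu.metrics import BLEU

def filtered_bleu(sources, hypotheses, references, terms):
    """BLEU over the subset of sentences whose source contains any of the terms."""
    keep = [i for i, src in enumerate(sources)
            if any(t.lower() in src.lower() for t in terms)]
    hyps = [hypotheses[i] for i in keep]
    refs = [references[i] for i in keep]
    return BLEU().corpus_score(hyps, [refs]).score

# Hypothetical usage:
# filtered_bleu(src_sents, static_out, ref_sents, sure_terms)
# filtered_bleu(src_sents, instance_out, ref_sents, sure_terms + possible_terms)
```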
While the BLEU comparison indicates that the instance selection system performed only slightly better than the static system across the different scenarios, the deeper analysis of the translation data shows that it was better in more respects. In summary, the instance selection system outperformed the static system in translating single terms, with the static system performing better in only one case. The instance selection system was also superior in translating multiword terms, whereas the static system mistranslated the second part of a multiword term. Furthermore, when the BLEU score was computed only on sentences containing the terms, the instance selection system outperformed the static system in every configuration except one: it performed well both for sentences with Possible terms and for sentences containing Sure plus Possible terms. Overall, therefore, the evidence supports the argument that the instance selection system translated the given terms more effectively than the static system. It can also be concluded that, although the static system was generally less effective than the instance selection system, it performed better in two respects: it correctly translated the term 'Desktop' into Persian, where the instance selection system simply transferred the English term, and it achieved a higher BLEU score for sentences containing a Sure term only.