3.1 Dataset Collection
Multi-source translations of the books of the Bible and articles from the Jehovah’s Witnesses website (JW.org, 2023) serve as the main sources of parallel sentences. Multi-source refers to texts translated into multiple languages, such as Bible translations, EU parliamentary proceedings, and transcripts of TED talks and United Nations meetings, where a given sentence has translations in many target languages. Dabre et al. (2021) encourage leveraging multi-source sentences whenever available, as this helps improve translation quality.
The Bible is probably the most widely used resource for parallel texts. According to Wycliffe Global Alliance (2023), the full Bible has been translated into 736 languages and the New Testament into 1,658 languages. For Chavacano, only the New Testament has been translated.
The Jehovah’s Witnesses website JW.org (2023) also contains translations of their articles in multiple languages. Aside from the general articles, the website makes translations of their Watchtower and Awake! magazines available. These magazines span more than two decades of publications.
Besides multi-source texts, bilingual translations between Chavacano and the other related languages were also collected. This was done to augment the dataset, especially since only a few resources are available for the Philippine languages.
Translations of the Chavacano texts into Spanish, English, Cebuano, and Hiligaynon from the Bible and the articles were scraped using Webscraper.io and Beautiful Soup, a web-scraping library in Python.
3.2 Preprocessing
The primary goal of preprocessing is to ensure the alignment of the scraped texts. Bible texts are aligned by verse number within each chapter of each book. The articles, however, are aligned based on the number of sentence chunks, i.e., groups of sentences, captured per text object in the site’s web pages during scraping. On JW.org (2023), the web pages share the same site map across languages, so a single script was used to scrape all articles in the target languages.
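The two alignment strategies described above can be sketched as follows. This is only an illustration under stated assumptions: the data layout, the `(book, chapter, verse)` key, and the function names `align_by_verse` and `align_chunks` are hypothetical and are not taken from the scraping scripts used for the corpus.

```python
from typing import Dict, List, Tuple

# Hypothetical verse key: (book code, chapter, verse number).
VerseKey = Tuple[str, int, int]


def align_by_verse(
    translations: Dict[str, Dict[VerseKey, str]]
) -> List[Dict[str, str]]:
    """Keep only the verses whose key is present in every language."""
    if not translations:
        return []
    shared = set.intersection(*(set(v) for v in translations.values()))
    return [
        {lang: verses[key] for lang, verses in translations.items()}
        for key in sorted(shared)
    ]


def align_chunks(pages: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Pair article chunks by position on the page.

    Assumes each language's page yields its text objects in the same
    order. If the chunk counts differ, the page is treated as
    misaligned and nothing is returned.
    """
    if len({len(chunks) for chunks in pages.values()}) != 1:
        return []
    langs = list(pages)
    return [
        dict(zip(langs, row))
        for row in zip(*(pages[lang] for lang in langs))
    ]
```

Keying Bible text by verse makes the alignment independent of scraping order, whereas the article alignment relies on the shared page structure across languages.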
3.2.1 Articles: Cleaning and Sentence Segmentation
The aligned articles were preprocessed to remove unnecessary symbols and characters such as ellipses, bullets, asterisks, brackets, and non-breaking whitespace characters. Quotation marks were also removed because some translations do not contain them. Other punctuation marks, however, were retained.
Bible verses cited as in-text references were also removed. For instance, the verse reference -Santiago 2:14-17. was removed from the following excerpt:
Ta ayuda con aquellos quien ta necesita.-Santiago 2:14-17.
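A minimal sketch of such a cleaning step is given below. The exact rules used for the corpus are not listed in this section, so the regular expressions and the name `clean_chunk` are assumptions. In particular, only double quotation marks and guillemets are removed here so that apostrophes inside words survive.

```python
import re

# Hypothetical pattern for an in-text verse reference such as
# "-Santiago 2:14-17.": a hyphen, an optional book number, a book name,
# chapter:verse, optional further verses, and an optional final period.
VERSE_REF = re.compile(
    r"-\s*(?:[1-3]\s)?[^\W\d_]+\s\d+:\d+(?:\s?[-,]\s?\d+)*\.?"
)
ELLIPSIS = re.compile(r"\u2026|\.{3,}")
# Bullets, asterisks, square brackets, double quotes, and guillemets.
SYMBOLS = re.compile(r'[\u2022*\[\]\u201c\u201d"\u00ab\u00bb]')


def clean_chunk(text: str) -> str:
    """Remove verse references and unwanted symbols; keep punctuation."""
    text = text.replace("\u00a0", " ")  # non-breaking space to plain space
    text = VERSE_REF.sub("", text)
    text = ELLIPSIS.sub("", text)
    text = SYMBOLS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_chunk("Ta ayuda con aquellos quien ta necesita.-Santiago 2:14-17."))
# prints: Ta ayuda con aquellos quien ta necesita.
```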
The scraped sentence chunks were then segmented to extract individual sentences. Sentence segmentation was done using Sentence-Splitter, a free module that splits text paragraphs into sentences. Its heuristic algorithm, however, was applied using only the English-language rules. Other segmentation tools, such as those from spaCy, were considered, but Sentence-Splitter produced more aligned chunks.
Chunks that did not produce the same number of sentences across all article translations after segmentation were considered misaligned and were therefore excluded from the final sample set.
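The filtering rule above can be illustrated with the sketch below. To keep the example self-contained, `naive_split` is a simple regex-based stand-in and is not the Sentence-Splitter algorithm; the function names and the dictionary layout are likewise assumptions.

```python
import re
from typing import Callable, Dict, List


def naive_split(text: str) -> List[str]:
    """Stand-in splitter: break after '.', '!' or '?' followed by space."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def segment_aligned(
    chunk: Dict[str, str],
    split: Callable[[str], List[str]] = naive_split,
) -> List[Dict[str, str]]:
    """Split one aligned chunk into aligned sentences.

    Returns an empty list when the languages disagree on the number of
    sentences, i.e. when the chunk is considered misaligned.
    """
    per_lang = {lang: split(text) for lang, text in chunk.items()}
    if len({len(sents) for sents in per_lang.values()}) != 1:
        return []
    langs = list(per_lang)
    return [
        dict(zip(langs, row))
        for row in zip(*(per_lang[lang] for lang in langs))
    ]
```

A callable wrapping the actual segmentation tool could be passed as `split` in place of the stand-in.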
3.2.2 Bible: Cleaning and Verse Segmentation
The Bible verses did not require the removal of unnecessary symbols. Quotation marks were retained together with the usual punctuation marks.
Verses that are combined in one Bible translation but separate in others were removed. For example, Matthew 17 and 18 appear separately in the English translation but as Matthew 17-18 in Hiligaynon.
Each verse in the Bible is treated as one sentence sample, even if it spans more than one sentence or combines sentences and sentence fragments. It is assumed that the translation was done per verse, so the whole verse was kept as a single sample to preserve its meaning. The verse number, though, was removed.
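One way to implement the verse handling is sketched below, assuming that each chapter is stored as a mapping from verse labels (such as "17" or "17-18") to text and that a scraped verse may still begin with its number. Both the layout and the function names are hypothetical.

```python
import re
from typing import Dict, Iterable, Set


def covered_by_ranges(labels: Iterable[str]) -> Set[int]:
    """Collect the verse numbers covered by combined labels like '17-18'."""
    covered: Set[int] = set()
    for label in labels:
        match = re.fullmatch(r"(\d+)-(\d+)", label)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            covered.update(range(start, end + 1))
    return covered


def drop_combined(
    chapter: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, str]]:
    """Remove, in every language, verses combined in any language.

    `chapter` maps a language code to {verse label: verse text}.
    """
    bad: Set[int] = set()
    for verses in chapter.values():
        bad |= covered_by_ranges(verses)
    return {
        lang: {
            label: text
            for label, text in verses.items()
            if label.isdigit() and int(label) not in bad
        }
        for lang, verses in chapter.items()
    }


def strip_verse_number(text: str) -> str:
    """Drop a leading verse number, assuming the text starts with one."""
    return re.sub(r"^\s*\d+\s+", "", text)
```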
3.3 Corpus Preparation
Parallel sentences of language pairs between Chavacano, Cebuano, Hiligaynon, Tagalog, Spanish, and English comprise ChavacanoMT. These language pairs were prepared from the sentence samples collected. Table 1 summarizes each language pair’s number of sentence samples.
Table 1 breaks down the sentences in each language-pair dataset into multisource (MS) and non-multisource (NMS). Most of the sentence samples for Chavacano are multisource, meaning each sample has translations in the other five languages in the corpus. The language pairs that include Chavacano also have fewer sentence samples than the other language pairs.
Table 1 Number of Sentence Samples per Language Pair in ChavacanoMT. MS indicates Multisource, while NMS is Non-Multisource. As a reference, the language codes used are as follows: cbk for Chavacano, ceb for Cebuano, hil for Hiligaynon, tl for Tagalog, en for English, and es for Spanish.
| Language Pairs | Bible MS | Bible NMS | Articles MS | Articles NMS | Total Sentences |
|---|---|---|---|---|---|
| cbk-ceb | 7,728 | | 13,931 | 35 | 21,694 |
| cbk-hil | 7,728 | | 13,931 | | 21,659 |
| cbk-es | 7,728 | | 13,931 | 35 | 21,694 |
| cbk-en | 7,728 | | 13,931 | 35 | 21,694 |
| cbk-tl | 7,728 | | 13,931 | | 21,659 |
| ceb-hil | 7,728 | 21,803 | 13,931 | 51,664 | 95,126 |
| ceb-es | 7,728 | 21,803 | 13,931 | 51,776 | 95,238 |
| ceb-en | 7,728 | 21,803 | 13,931 | 51,954 | 95,416 |
| ceb-tl | 7,728 | 21,803 | 13,931 | 46,546 | 90,008 |
| hil-es | 7,728 | 21,803 | 13,931 | 2,865 | 46,327 |
| hil-en | 7,728 | 21,803 | 13,931 | 2,904 | 46,366 |
| hil-tl | 7,728 | 21,803 | 13,931 | 2,904 | 46,366 |
| es-en | 7,728 | 21,803 | 13,931 | 13,420 | 56,882 |
| es-tl | 7,728 | 21,803 | 13,931 | | 43,462 |
| en-tl | 7,728 | 21,803 | 13,931 | | 43,462 |
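As an illustration of how language-pair datasets could be derived from the collected samples, the sketch below enumerates all pairs of the six language codes and keeps, for each pair, the samples that have text in both languages. The sample layout (a dictionary from language code to sentence) and the name `build_pairs` are assumptions.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Language codes from the Table 1 caption, ordered so that the generated
# pair names follow the row order of the table.
LANGS = ["cbk", "ceb", "hil", "es", "en", "tl"]


def build_pairs(
    samples: List[Dict[str, str]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Group aligned samples into one parallel dataset per language pair."""
    pairs: Dict[str, List[Tuple[str, str]]] = {}
    for a, b in combinations(LANGS, 2):
        pairs[f"{a}-{b}"] = [
            (sample[a], sample[b])
            for sample in samples
            if a in sample and b in sample
        ]
    return pairs
```

Six languages yield 15 unordered pairs, which matches the 15 rows of Table 1. A multisource sample contributes to every pair, while a non-multisource sample contributes only to the pairs whose languages it contains.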
3.4 Corpus Statistics
We describe the corpora using the following statistics: Average Sentence Length, Number of Unique Words, and Number of Shared or Overlapping Words per Language. The statistics are taken from all sentence samples per language, and the count may differ slightly if taken from each language pair.
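The first two statistics can be computed along the following lines. The tokenization used for the reported figures is not specified in this section, so the lowercased, letters-only tokenizer below is an assumption, as are the function names.

```python
import re
from typing import Iterable, List, Set


def tokenize(sentence: str) -> List[str]:
    """Lowercase word tokens; keeps word-internal apostrophes and hyphens."""
    return re.findall(r"[^\W\d_]+(?:['\u2019-][^\W\d_]+)*", sentence.lower())


def average_sentence_length(sentences: Iterable[str]) -> float:
    """Mean number of tokens per sentence sample."""
    lengths = [len(tokenize(s)) for s in sentences]
    return sum(lengths) / len(lengths) if lengths else 0.0


def unique_words(sentences: Iterable[str]) -> Set[str]:
    """Vocabulary: the set of distinct tokens across all samples."""
    vocab: Set[str] = set()
    for sentence in sentences:
        vocab.update(tokenize(sentence))
    return vocab
```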
Figure 1 shows that the sentence samples from the Bible are longer than those from the articles. As discussed in Section 3.2.2, these samples are Bible verses that may comprise more than one sentence or sentence fragment.
The number of unique words for each language in each resource is summarized in Figure 2. The number of unique words for Chavacano is low compared to the other languages. We attribute this to Chavacano being less morphologically rich than the other languages. Pahulaya (2022) showed that Chavacano comprises simple, compound, affixed, and reduplicated words like other languages. Its verbs, however, are not inflected for tense; instead, the markers ya (past), ta (present), and ay (future) are used. In addition, the Chavacano dictionary (de Dios, Maria Isabelita Riego, 1989) registers only about 6,500 entries, including both heads and derived forms, as described in the SEAlang Library. This suggests that the number of words captured in the ChavacanoMT corpus is reasonable.
The lexical similarity among languages can be measured from the number of overlapping words across languages. Figure 3 shows this similarity in the corpus.
Figure 3 shows that Chavacano samples share more word overlaps with Spanish: around 40% of the Chavacano words in the corpus are shared with Spanish. Overlaps with the other languages are also present. The Philippine languages Cebuano, Hiligaynon, and Tagalog share the most overlaps. The lexical similarity between Chavacano and the related languages reflects the influence of these languages on Chavacano. It is also important to note that Spanish and English have influenced the lexicon of the Philippine languages through years of colonialism.
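One plausible reading of a statement such as "around 40% of the Chavacano words are shared with Spanish" is the fraction of one language's vocabulary that also appears in another's. The sketch below computes that fraction. The exact overlap measure behind Figure 3 is not defined in this section, so this is an assumption, and the vocabularies are toy data.

```python
from typing import Set


def overlap_share(source_vocab: Set[str], other_vocab: Set[str]) -> float:
    """Fraction of source_vocab that also occurs in other_vocab."""
    if not source_vocab:
        return 0.0
    return len(source_vocab & other_vocab) / len(source_vocab)


# Toy vocabularies for illustration only, not corpus data.
cbk_vocab = {"ta", "ayuda", "con", "casa", "kame"}
es_vocab = {"ayuda", "con", "casa", "el"}

print(overlap_share(cbk_vocab, es_vocab))  # 0.6
print(overlap_share(es_vocab, cbk_vocab))  # 0.75
```

The measure is asymmetric because it is normalized by the size of the source vocabulary, so the share of Chavacano words found in Spanish generally differs from the share of Spanish words found in Chavacano.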