ChavacanoMT: A Corpus and Evaluation of Neural Machine Translation for Philippine Creole Spanish

doi:10.21203/rs.3.rs-5022127/v1

Chavacano, formally referred to as Philippine Creole Spanish, is the only Creole spoken in the Philippines. As with many languages, especially Creoles, computational studies on Chavacano are scarce because of the dearth of available corpora. This paper describes the creation of ChavacanoMT, a benchmark corpus for the machine translation study of Philippine Creole Spanish. ChavacanoMT consists of 767,053 parallel sentences between Chavacano and its related languages: Spanish, Cebuano, Hiligaynon, Tagalog, and English. It is sourced from scraped Bible translations and articles on the Jehovah’s Witnesses website.

This paper also presents the performance of a multilingual neural machine translation model trained on ChavacanoMT. We report an overall BLEU score of 17.83 for a fine-tuned mT5 model, which outperforms an mT5-based model trained from scratch. Our experiments show that ChavacanoMT can produce models on par with a similar system that translates between English and some Philippine languages, despite using fewer sentence samples in training. We also report improved Chavacano translation to and from its related languages, which can be used as a benchmark. In particular, we highlight an improvement of more than 20 BLEU points in translation between Chavacano and English.

The study opens avenues for exploring cross-linguistic interactions of Chavacano and its related languages in its translation that may benefit other low-resource languages.

Philippine Creole Spanish

Chavacano Translation Corpus

Multilingual Translation

Chavacano

1 Introduction

Chavacano is the only Creole language spoken in the Philippines (Komisyon sa Wikang Filipino, 2020) and the only Spanish-based Creole in Asia. As with many languages worldwide, computational studies of Chavacano are scarce. This is primarily due to the dearth of available corpora from which computational and other language studies can be made.

Chavacano is also one of the regional languages in the Philippines that is less computationally studied than other major languages such as Tagalog, Cebuano, and Hiligaynon. The Komisyon sa Wikang Filipino (2020) identified Chavacano as one of the 39 languages in the Philippines already in various stages of endangerment that need preservation and revival.

The OPUS Corpus Collection¹ registers only 2,529 English-Chavacano parallel sentences, and no other resources have been published. This dataset was also included in PH-MNMT (Coronia, 2022). We aim to augment the existing language resources for Chavacano to encourage computational work on the language.

This paper reports the creation of the ChavacanoMT corpus for benchmarking Chavacano machine translation. The corpus comprises parallel sentences between Chavacano and its related languages, Spanish, Cebuano, Hiligaynon, English, and Tagalog. We also report its usage in a many-to-many machine translation of Chavacano and its related languages.

The remainder of this paper is organized as follows: Section 2 presents the literature on corpus creation for Philippine languages, supporting this work’s contribution to language resources. Information on the Chavacano Creole language and its linguistic properties is also presented. Section 3 details the creation of ChavacanoMT and presents descriptive information about the corpus. The corpus is evaluated in a translation experiment discussed in Sections 4 and 5. Finally, the insights from this work, including future directions for the computational linguistic study of Chavacano, are presented in Section 6.

2 Related Works

This paper focuses on creating datasets for the under-resourced Chavacano language, in the context of leveraging related languages to improve the translation quality of low-resource languages such as Chavacano (Baliber, Cheng, Adlaon, & Mamonong, 2020; Coronia, 2022; Dabre, Chu, & Kunchukuttan, 2021; Robinson, Hogan, Fulda, & Mortensen, 2022).

As of this writing, only the work of Coronia (2022) explored the translation of Chavacano to other Philippine languages, but the results were poor. This is primarily due to poor language representation in the PH-MNMT (Coronia, 2022) dataset that was used.

Moreover, the Creole nature of Chavacano makes it an exciting topic for computational linguistic study, especially since there are few NLP works on Creole languages. The machine translation of Creoles is also under-researched, owing to the lack of publicly available datasets (Dabre & Sukhoo, 2022). Ethnologue (Eberhard, Simons, & Fennig, 2024) registers 92 Creole languages worldwide. In the literature, however, only Haitian (Robinson et al., 2022), Nigerian Pidgin (Ahia & Ogueji, 2020), and Kreol Morisien (Dabre & Sukhoo, 2022) have been explored.

2.1 Chavacano: Philippine Creole Spanish

Philippine Creole Spanish, known as Chavacano, comprises three major dialects spoken in Ternate, Cavite, and Zamboanga, Philippines (Lipski, 2001). The Ternate and Cavite dialects are both classified as Manila Bay PCS. Ternateño was the oldest Spanish-based creole, and Caviteño was an offshoot. Zamboangueño, on the other hand, has the largest group of Chavacano speakers, in Zamboanga City and neighboring towns and cities in Mindanao. Beyond its speaker population, Zamboangueño is actively used in blogs, news, and social media, which can serve as digital resources. Zamboangueño has support from its local government (DepEd-IX, 2016), while Ternateño and Caviteño Chavacano have not received such support from the national or local government (Lesho & Sippola, 2013). Chavacano is surviving in Zamboanga, but it is dying in Ternate and Cavite City (Genuino, 2005). Given this, the study assumes that the digital resources used in creating the corpus are mostly written in Zamboangueño, as this variety is distinctly recognized and used.

The formation of Chavacano in Zamboanga resulted from historical and cultural interactions in the Philippines during the Spanish colonial period. It is considered a Creole language, meaning it emerged as a stable and fully developed language from a mixture of different languages, giving it some properties unique from its source languages. Chavacano belongs to the Creole family of languages of Spanish descent (Eberhard et al., 2024).

Lipski (1992, 2001) reported an exhaustive investigation of the Chavacano language’s historical and social underpinnings in Zamboanga. Accordingly, Chavacano started to develop during the Spanish garrison in Zamboanga. Ilonggo later influenced Chavacano as Iloilo became a stopover for ships from Manila to Zamboanga. Later in the 20th century, immigration from the Central Visayan region to southwest Mindanao added some Visayan or Cebuano items to the language. Over time, Chavacano has adopted English in its lexicon. This account by Lipski (1992, 2001) served as the basis for identifying related languages for Chavacano.

2.2 Chavacano Lexicon and Orthography

The lexicon of Chavacano is largely Spanish (Lipski & Santoro, 2007) but with orthographic shifts. For Zamboangueño, in particular, several stages of relexification occurred to include lexical items of Philippine origin from regional Visayan, Ilonggo, and occasionally Tagalog (Lipski, 2001). Zamboangueño has also adopted a heavy English lexical transfer (Lipski, 1992) over time.

In 2016, the Department of Education Region IX and the Local Government of Zamboanga City published a revised Zamboanga Chavacano Orthography (DepEd-IX, 2016) to standardize written Chavacano. The standardization is based on how the present generation of Zamboangueños uses Chavacano. The orthography spells each Chavacano word using the alphabet of the word’s traced etymology. For example, the Spanish-derived words zacate (grass) and mañana (tomorrow) are spelled using the Spanish abecedario. In contrast, Chavacano words of local origin, like kanila (them) and kanamon (us), are spelled using the Philippine alphabet system. Some orthographic shifts are noted in loaned words, such as dropping the final letter r in Spanish verbs: comer (to eat) and bailar (to dance) become come and baila. Spanish writing also uses diacritics that are not necessarily applied in Chavacano. In general, Chavacano words are spelled the way they are pronounced.

2.3 Chavacano Grammar

Zamboangueño’s grammatical structure differs from that of any Spanish variety, and although the two share much of their lexicon, they are mutually unintelligible (Lipski, 2001).

Over three centuries of Philippine history influenced the morphology, grammar, and syntax of Zamboangueño (Lipski & Santoro, 2007). Even so, it has retained the Austronesian foundation it shares with the Philippine languages, as evidenced by its Verb-Subject-Object word order, albeit with many alternative possibilities (Lipski, 1992). This contrasts with the Subject-Verb-Object word order of Spanish (Lee, 2017). Verb conjugation to show tense also does not apply in Chavacano (DepEd-IX, 2016).

3 ChavacanoMT

3.1 Dataset Collection

Multi-source translations of the books of the Bible and articles from the Jehovah’s Witnesses website (JW.org, 2023) serve as the main sources of parallel sentences. Multi-source refers to texts translated into multiple languages, such as Bible translations, EU parliamentary proceedings, and transcripts of TED talks and United Nations meetings, where a given sentence has translations in many target languages. Dabre et al. (2021) encourage leveraging multi-source sentences whenever available, as this helps improve translation.

The Bible is probably the most tapped resource for parallel texts with translations. According to Wycliffe Global Alliance (2023), the full Bible has already been translated into 736 languages and the New Testament into 1,658 languages. Chavacano only has translations of the New Testament.

The Jehovah’s Witnesses website JW.org (2023) also contains translations of their articles in multiple languages. Aside from the general articles, the website makes translations of their Watchtower and Awake! magazines available. These magazines span more than two decades of publications.

Besides multi-source texts, bilingual translations between Chavacano and the other related languages were also collected. This was done to augment the dataset, especially since only a few resources are available for the Philippine languages.

Translations of the Chavacano texts to Spanish, English, Cebuano, and Hiligaynon from the Bible and articles are scraped using Webscraper.io² and Beautiful Soup³, a web-scraping library in Python.

3.2 Preprocessing

The primary goal of preprocessing is to ensure the alignment of the scraped texts. Bible texts are aligned by the verse numbers within each book chapter. The articles, however, are aligned by sentence chunks, i.e., the groups of sentences captured per text object in the site’s web pages during scraping. In JW.org (2023), the web pages across languages share the same site map, so a single script was used to scrape all articles across the target languages.

3.2.1 Articles: Cleaning and Sentence Segmentation

The aligned articles were preprocessed to remove unnecessary symbols and characters such as ellipses, bullets, asterisks, brackets, and non-breaking whitespace characters. Quotation marks were also removed because some translations in other languages do not contain them. Other punctuation marks, however, were retained.

Bible verses used as in-text references were also removed. For instance, the verse reference -Santiago 2:14-17. was removed from the following excerpt.

Ta ayuda con aquellos quien ta necesita.-Santiago 2:14-17.
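The removal of in-text verse references can be sketched with a regular expression. This is a minimal illustration that assumes references follow the shape of the excerpt above (a dash, a book name, chapter:verse numbers, and an optional trailing period); the pattern and the function name strip_verse_refs are hypothetical and are not the script used in this study.

```python
import re

# Hypothetical pattern modelled on the excerpt above: a dash, a book name
# (optionally prefixed by a digit), then chapter:verse with optional
# ranges or lists, and an optional trailing period.
VERSE_REF = re.compile(
    r"\s*[-\u2013\u2014]\s*"               # leading hyphen or dash
    r"(?:[1-3]\s?)?[^\W\d_]+"              # book name, e.g. Santiago
    r"\s\d+:\d+(?:\s?[-\u2013,]\s?\d+)*"   # 2:14-17 or 3:1, 2
    r"\.?"                                 # optional trailing period
)

def strip_verse_refs(text: str) -> str:
    """Remove in-text Bible verse references from a sentence chunk."""
    return VERSE_REF.sub("", text).strip()

cleaned = strip_verse_refs("Ta ayuda con aquellos quien ta necesita.-Santiago 2:14-17.")
print(cleaned)  # Ta ayuda con aquellos quien ta necesita.
```

Real article text would likely need a broader pattern, for example for book names made up of several words.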

The scraped sentence chunks were segmented to extract individual sentences. Sentence segmentation was done using Sentence-Splitter⁴, a free module for splitting text paragraphs into sentences. Its heuristics, however, were based only on the English language. Other segmentation tools were considered, such as those from SpaCy, but Sentence-Splitter produced more aligned chunks.

The chunks that did not produce the same number of sentences across any or all article translations after segmentation are considered misaligned and, therefore, excluded from the final sample set.
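The chunk-level filter can be sketched as follows. The function split_sentences is a naive regular-expression stand-in for Sentence-Splitter, and the chunk texts are placeholders rather than corpus data; both are assumptions made only for illustration.

```python
import re

def split_sentences(chunk: str) -> list[str]:
    # Naive stand-in for the Sentence-Splitter module: break after ., ! or ?
    # when followed by whitespace. The real tool uses richer heuristics.
    return [s for s in re.split(r"(?<=[.!?])\s+", chunk.strip()) if s]

def align_chunk(translations: dict[str, str]) -> list[dict[str, str]]:
    """Return sentence-aligned rows, or [] if the chunk is misaligned."""
    split = {lang: split_sentences(text) for lang, text in translations.items()}
    counts = {len(sentences) for sentences in split.values()}
    if len(counts) != 1:
        return []  # sentence counts differ across languages: drop the chunk
    n = counts.pop()
    return [{lang: split[lang][i] for lang in split} for i in range(n)]

# Placeholder chunks: the first is aligned, the second is not.
aligned = align_chunk({"cbk": "A1. A2.", "en": "B1. B2."})
misaligned = align_chunk({"cbk": "A1. A2.", "en": "B1 and B2."})
print(len(aligned), len(misaligned))  # 2 0
```

Dropping the whole chunk, rather than trying to repair it, trades some data for a simpler guarantee that every kept row is parallel by position.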

3.2.2 Bible: Cleaning and Verse Segmentation

The bible verses did not require the removal of unnecessary symbols. The quotation marks were retained together with the usual punctuation marks.

Verses that are combined in one Bible translation but separate in others are removed; for example, Matthew 17 and 18 are separate verses in the English translation but appear as Matthew 17-18 in Hiligaynon.

Each verse in the bible is considered a sentence sample, even if it spans more than one sentence or is a combination of sentence and sentence fragments. It is assumed that the translation was done per verse, and to preserve the meaning, the whole verse was used collectively as a sentence sample. The verse number, though, was removed.
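A minimal sketch of verse-level alignment, assuming each translation is stored as a mapping from a (book, chapter, verse label) key to the verse text. The data layout, the function name align_verses, and the placeholder texts are assumptions for illustration. Because a merged label such as "17-18" is not shared by translations that keep the verses separate, a plain key intersection drops both the merged verse and its separate counterparts.

```python
def align_verses(bibles):
    """bibles: {lang: {(book, chapter, verse_label): text}} -> aligned rows."""
    # Keep only the verse keys that occur in every translation.
    shared = set.intersection(*(set(verses) for verses in bibles.values()))

    def sort_key(key):
        book, chapter, label = key
        return (book, chapter, int(label.split("-")[0]))

    return [
        {lang: bibles[lang][key] for lang in bibles}
        for key in sorted(shared, key=sort_key)
    ]

# Placeholder texts; only verse 16 exists under the same label in both.
bibles = {
    "en": {
        ("Matthew", 1, "16"): "en verse 16",
        ("Matthew", 1, "17"): "en verse 17",
        ("Matthew", 1, "18"): "en verse 18",
    },
    "hil": {
        ("Matthew", 1, "16"): "hil verse 16",
        ("Matthew", 1, "17-18"): "hil verses 17-18",
    },
}
rows = align_verses(bibles)
print(rows)  # [{'en': 'en verse 16', 'hil': 'hil verse 16'}]
```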

3.3 Corpus Preparation

Parallel sentences of language pairs between Chavacano, Cebuano, Hiligaynon, Tagalog, Spanish, and English comprise ChavacanoMT. These language pairs were prepared from the sentence samples collected. Table 1 summarizes each language pair’s number of sentence samples.

Table 1 breaks down the sentences in each language-pair dataset into multisource (MS) and non-multisource (NMS). We note that most of the sentence samples for Chavacano are multisource, meaning each sentence sample has translations in the other five languages in the corpus. We also note that the number of sentence samples for the Chavacano pairs is small compared to the other language pairs.

Table 1 Number of Sentence Samples per Language Pair in ChavacanoMT. MS indicates Multisource, while NMS is Non-Multisource. As a reference, the language codes used are as follows: cbk for Chavacano, ceb for Cebuano, hil for Hiligaynon, tl for Tagalog, en for English, and es for Spanish.

Language Pairs	Bible MS	Bible NMS	Articles MS	Articles NMS	Total Sentences
cbk-ceb	7,728		13,931	35	21,694
cbk-hil	7,728		13,931		21,659
cbk-es	7,728		13,931	35	21,694
cbk-en	7,728		13,931	35	21,694
cbk-tl	7,728		13,931		21,659
ceb-hil	7,728	21,803	13,931	51,664	95,126
ceb-es	7,728	21,803	13,931	51,776	95,238
ceb-en	7,728	21,803	13,931	51,954	95,416
ceb-tl	7,728	21,803	13,931	46,546	90,008
hil-es	7,728	21,803	13,931	2,865	46,327
hil-en	7,728	21,803	13,931	2,904	46,366
hil-tl	7,728	21,803	13,931	2,904	46,366
es-en	7,728	21,803	13,931	13,420	56,882
es-tl	7,728	21,803	13,931		43,462
en-tl	7,728	21,803	13,931		43,462
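As a quick consistency check, the row totals and the overall corpus size can be re-derived from the four count columns of Table 1. The sketch below simply re-enters the table, with blank cells treated as 0; the variable names are illustrative.

```python
# (Bible MS, Bible NMS, Articles MS, Articles NMS) per language pair,
# copied from Table 1; blank cells are entered as 0.
TABLE_1 = {
    "cbk-ceb": (7728, 0, 13931, 35),
    "cbk-hil": (7728, 0, 13931, 0),
    "cbk-es": (7728, 0, 13931, 35),
    "cbk-en": (7728, 0, 13931, 35),
    "cbk-tl": (7728, 0, 13931, 0),
    "ceb-hil": (7728, 21803, 13931, 51664),
    "ceb-es": (7728, 21803, 13931, 51776),
    "ceb-en": (7728, 21803, 13931, 51954),
    "ceb-tl": (7728, 21803, 13931, 46546),
    "hil-es": (7728, 21803, 13931, 2865),
    "hil-en": (7728, 21803, 13931, 2904),
    "hil-tl": (7728, 21803, 13931, 2904),
    "es-en": (7728, 21803, 13931, 13420),
    "es-tl": (7728, 21803, 13931, 0),
    "en-tl": (7728, 21803, 13931, 0),
}

totals = {pair: sum(counts) for pair, counts in TABLE_1.items()}
corpus_total = sum(totals.values())
print(totals["cbk-en"], totals["ceb-en"], corpus_total)  # 21694 95416 767053
```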

3.4 Corpus Statistics

We describe the corpora using the following statistics: Average Sentence Length, Number of Unique Words, and Number of Shared or Overlapping Words per Language. The statistics are taken from all sentence samples per language, and the count may differ slightly if taken from each language pair.
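The first two statistics can be sketched as below. Whitespace tokenization, lowercasing, and stripping of leading and trailing punctuation are assumptions of this sketch, since the tokenization behind the reported figures is not specified; the function name corpus_stats is illustrative.

```python
import string

def corpus_stats(sentences):
    """Return (average sentence length in words, number of unique words)."""
    tokenized = []
    for sentence in sentences:
        # Assumption: split on whitespace, lowercase, and strip punctuation
        # from the edges of each token.
        words = [w.strip(string.punctuation) for w in sentence.lower().split()]
        tokenized.append([w for w in words if w])
    total_words = sum(len(words) for words in tokenized)
    avg_len = total_words / len(tokenized) if tokenized else 0.0
    unique_words = {w for words in tokenized for w in words}
    return avg_len, len(unique_words)

# The Chavacano excerpt from Section 3.2.1: seven words, "ta" occurs twice.
avg_len, n_unique = corpus_stats(["Ta ayuda con aquellos quien ta necesita."])
print(avg_len, n_unique)  # 7.0 6
```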

Figure 1 shows that the sentence samples from the Bible are longer than those from the articles. As discussed in Section 3.2.2, these sentences are verses in the bible that may comprise more than one sentence or sentence fragments.

The number of unique words for each language in each resource is summarized in Figure 2. The number of unique words for Chavacano is low compared to the other languages. We attribute this to Chavacano being less morphologically rich than the other languages. Pahulaya (2022) showed that Chavacano comprises simple, compound, affixed, and reduplicated words like other languages. Its verbs, however, are not inflected for tense; instead, the markers ya (past), ta (present), and ay (future) are used. In addition, the Chavacano dictionary (de Dios, Maria Isabelita Riego, 1989) registers only about 6,500 entries, including both heads and derived forms, as described in the SEAlang Library⁵. This suggests that the number of words captured in the ChavacanoMT corpus is reasonable.

The lexical similarity among languages can be measured from the number of overlapping words across languages. Figure 3 shows this similarity in the corpus.

Figure 3 shows that Chavacano samples share more word overlaps with Spanish, i.e., around 40% of the Chavacano words in the corpus are shared with Spanish. Overlaps with the other languages are also present. The Philippine languages, Cebuano, Hiligaynon, and Tagalog, share the most overlaps. The lexical similarity of Chavacano and the related languages shows the influence of these languages on Chavacano. It is also essential to consider that Spanish and English have a lexical influence on the Philippine languages due to years of colonialism.
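One way to express the overlap measure is as the share of one language's word types that also occur in another language's vocabulary. The sketch below assumes this asymmetric definition, which matches the phrasing "Chavacano words ... shared with Spanish"; the toy vocabularies reuse the word examples from Section 2.2 and are not corpus data.

```python
def overlap_share(vocab_a: set[str], vocab_b: set[str]) -> float:
    """Share of the word types in vocab_a that also occur in vocab_b."""
    if not vocab_a:
        return 0.0
    return len(vocab_a & vocab_b) / len(vocab_a)

# Toy vocabularies for illustration only.
cbk_vocab = {"zacate", "mañana", "kanila", "kanamon"}
es_vocab = {"zacate", "mañana", "comer", "bailar"}
print(overlap_share(cbk_vocab, es_vocab))  # 0.5
```

The measure is not symmetric in general: dividing by the size of the Spanish vocabulary instead would give the share of Spanish word types found in Chavacano.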

4 Chavacano Multilingual Neural Machine Translation

In this section, we present the use of the ChavacanoMT corpus in the neural machine translation of Chavacano to and from its related languages: Spanish, Cebuano, Hiligaynon, and English.

4.1 Model Training

We experiment with multilingual neural machine translation of Chavacano. The literature has established that related languages, especially high-resource ones, support the translation of low-resource languages (Dabre, Nakagawa, & Kazawa, 2017; Dabre & Sukhoo, 2022; Goyal, Kumar, & Sharma, 2020; Tubay & Costa-Jussà, 2018; Zoph, Yuret, May, & Knight, 2016). Such is the case in the neural machine translation of Chavacano. The multilingual training leverages the high-resource languages in the dataset: Spanish, Cebuano, and English.

In this experiment, we build a many-to-many machine translation model (a) from scratch, using the mT5 (Xue et al., 2021) model configuration, and (b) by fine-tuning the mT5 model, using a subset of the ChavacanoMT corpus (described in Section 4.2).

In training from scratch, a vocabulary of 32,300 sentence pieces was created using Sentencepiece (Kudo & Richardson, 2018) from combined words of languages in the corpus (Table 1). An mT5-based tokenizer, cbkTokenizer, was also trained from ChavacanoMT. The tokenizer was used to tokenize the dataset.

In the case of fine-tuning, the vocabulary of mT5 with 250,112 sentence pieces and its built-in MT5Tokenizer were used.

The tokenizer and models were trained using Huggingface’s Transformers⁶ library. Both models were trained for 8 epochs and optimized using AdamWeightDecay with a learning rate of 0.001. Bilingual Evaluation Understudy (BLEU) (Papineni, Roukos, Ward, & Zhu, 2002) scores, computed using Sacrebleu (Post, 2018), and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) (Lin, 2004) scores were collected to measure the models’ performance.
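To make the BLEU metric concrete, the following is a simplified corpus-level BLEU on a 0-100 scale: clipped n-gram precision up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It is a sketch only. It assumes whitespace tokenization, one reference per hypothesis, and no smoothing, so its values will not match Sacrebleu exactly; the English sentences are illustrative, not corpus data.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (0-100), one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams, r_ngrams = ngrams(h, n), ngrams(r, n)
            # Clipped counts: a hypothesis n-gram is credited at most as
            # often as it appears in the reference.
            matches[n - 1] += sum((h_ngrams & r_ngrams).values())
            totals[n - 1] += sum(h_ngrams.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)

score = corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
print(round(score, 2))
```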

4.2 Datasets Used

Based on the historical investigation of Lipski (1992, 2001), the creolization of Chavacano is influenced by Spanish as its lexifier and Philippine languages, Cebuano and Hiligaynon, as adstrates. It has also undergone lexical transfer from English over time.

Table 2 Multilingual Many-to-Many Dataset

Language Pair	Train	Test	Validation	Total Samples
cbk-ceb	15,185	3,255	3,254	21,694
cbk-en	15,185	3,255	3,254	21,694
cbk-es	15,185	3,255	3,254	21,694
cbk-hil	15,161	3,249	3,249	21,659
ceb-cbk	15,185	3,255	3,254	21,694
ceb-en	66,791	14,313	14,312	95,416
ceb-es	66,666	14,286	14,286	95,238
ceb-hil	66,588	14,269	14,269	95,126
en-cbk	15,185	3,255	3,254	21,694
en-ceb	66,791	14,313	14,312	95,416
en-es	39,817	8,533	8,532	56,882
en-hil	32,448	6,954	6,953	46,355
es-cbk	15,185	3,255	3,254	21,694
es-ceb	66,666	14,286	14,286	95,238
es-en	39,817	8,533	8,532	56,882
es-hil	32,428	6,950	6,949	46,327
hil-cbk	15,161	3,249	3,249	21,659
hil-ceb	66,588	14,269	14,269	95,126
hil-en	32,434	6,951	6,950	46,335
hil-es	32,428	6,950	6,949	46,327

This account became the basis of the multilingual dataset for Chavacano translation. The language pairs involving Chavacano, Spanish, Cebuano, Hiligaynon, and English were used except for Tagalog.

In this experiment’s many-to-many training set-up, the language pairs are used in both directions, that is, cbk-en and en-cbk are used. In total, 20 language pairs between Chavacano, Spanish, Cebuano, Hiligaynon, and English were included in the dataset. In this training set-up, all languages serve as source and target languages. It is hoped that by doing so, the similarities shared between Chavacano and the high-resource languages may be represented in both the encoder and decoder sides of the neural network. The summary of the dataset used in this experiment is shown in Table 2.
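The count of 20 directed pairs follows from taking every ordered pair of the five languages (5 × 4 = 20). The sketch below enumerates them and shows how one parallel sample could yield a training example in each direction. The "translate X to Y:" prefix is a hypothetical convention for telling a text-to-text model the target language; the input format actually used in the experiment is not specified here.

```python
from itertools import permutations

# Tagalog is excluded from this experiment.
LANGS = ["cbk", "es", "ceb", "hil", "en"]

directed_pairs = [f"{src}-{tgt}" for src, tgt in permutations(LANGS, 2)]
print(len(directed_pairs))  # 20

def both_directions(src_lang, tgt_lang, src_text, tgt_text):
    # Hypothetical prompt format; the prefix wording is an assumption.
    return [
        {"input": f"translate {src_lang} to {tgt_lang}: {src_text}", "target": tgt_text},
        {"input": f"translate {tgt_lang} to {src_lang}: {tgt_text}", "target": src_text},
    ]

examples = both_directions("cbk", "en", "<Chavacano sentence>", "<English sentence>")
```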

70% of the dataset was used as a training set, 15% as a validation set, and 15% as a test set. The data splits are stratified according to language pairs. The dataset is unbalanced, with more training samples from the high-resource Cebuano, English, and Spanish language pairs.
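A stratified 70/15/15 split can be sketched by splitting each language pair separately. The rounding convention below (hold out the ceiling of 30% of a pair's samples, then give the test set the ceiling of half of the held-out samples) is an assumption; it happens to reproduce the per-pair counts in Table 2, but the splitting code actually used is not given. The function names and the seed are illustrative.

```python
import random

def split_sizes(n: int) -> tuple[int, int, int]:
    """Return (train, test, validation) sizes for n samples."""
    holdout = -(-3 * n // 10)  # ceil(0.30 * n) using integer arithmetic
    test = -(-holdout // 2)    # ceil(holdout / 2)
    return n - holdout, test, holdout - test

def stratified_split(samples_by_pair, seed=0):
    """Shuffle and split each language pair separately, then pool the splits."""
    rng = random.Random(seed)
    splits = {"train": [], "test": [], "validation": []}
    for pair, samples in samples_by_pair.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        n_train, n_test, _ = split_sizes(len(shuffled))
        splits["train"] += [(pair, s) for s in shuffled[:n_train]]
        splits["test"] += [(pair, s) for s in shuffled[n_train:n_train + n_test]]
        splits["validation"] += [(pair, s) for s in shuffled[n_train + n_test:]]
    return splits

print(split_sizes(21694))  # (15185, 3255, 3254), the cbk-en row of Table 2
```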

5 Results

Table 3 shows the BLEU and ROUGE-1 scores of the models generated in the experiment. As expected, the fine-tuned model achieved better results than the model trained from scratch. This demonstrates the leverage one can get from pre-trained weights even if target languages are not included in the pre-training.

Table 3 Comparison of BLEU and ROUGE-1 scores for (a) mT5-based model from scratch and (b) fine-tuned mT5 model.

Model Type	BLEU	ROUGE-1
Scratch	0.4879	0.1222
Finetuned	17.8274	0.5457

The fine-tuned model was tested further using the individual language pairs in the test set. Table 4 presents the performance scores for each translation direction.

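Table 4 Performance results for the fine-tuned mT5 model, with the two directions of each language pair shown side by side. The right-hand columns show the direction with the better BLEU score in every pair except Cebuano-Spanish.

Direction	BLEU	ROUGE-1	Direction	BLEU	ROUGE-1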
cbk-ceb	21.95	0.62	ceb-cbk	23.54	0.66
hil-ceb	22.34	0.60	ceb-hil	22.79	0.61
en-ceb	20.02	0.59	ceb-en	26.25	0.61
en-cbk	24.06	0.68	cbk-en	35.30	0.69
ceb-es	16.61	0.50	es-ceb	16.34	0.53
hil-cbk	22.10	0.65	cbk-hil	23.51	0.64
en-hil	16.20	0.57	hil-en	23.70	0.59
es-hil	15.13	0.55	hil-es	16.63	0.52
es-cbk	21.79	0.64	cbk-es	24.68	0.60
en-es	19.45	0.54	es-en	25.94	0.60

The results show that translation from a Philippine language to a foreign language tends to be better than the other way around, except between Cebuano and Spanish, where the scores are essentially the same in both directions. It is also interesting to note that the highest-performing direction, at 35.30 BLEU, is Chavacano to English, a language pair whose number of training samples is among the lowest in the dataset.

As a way of benchmarking, we also compare the Ph-EN and EN-Ph BLEU scores obtained from the experiments of Coronia (2022) with the result of our experiments. Ph here refers to Philippine languages. Table 5 summarizes the best-performing model from Coronia (2022) and the BLEU scores from our experiments.

Table 5 Comparison of bilingual translations from finetuned mT5 multilingual models. Coronia (2022) uses the PH-MNMT dataset, while ours uses the ChavacanoMT corpus.

Model	en-ceb	ceb-en	en-hil	hil-en	en-cbk	cbk-en
Fine-tuned mT5 (Coronia)	21.25	26.67	24.4	26.11	0.08	0.79
Fine-tuned mT5 (Ours)	20.02	26.25	16.20	23.70	24.06	35.30

For context, the PH-MNMT dataset used in Coronia (2022) comprises millions of English-Cebuano and English-Hiligaynon sentence samples, far more than ChavacanoMT provides for these pairs. ChavacanoMT, however, has more sentence samples for Chavacano. In Coronia (2022), the best-performing mT5 model was fine-tuned on EN-Ph and Ph-EN sentences comprising English, Tagalog, Cebuano, Hiligaynon, Waray, and Chavacano. In contrast, our experiment generated the translation model using a many-to-many training setup involving languages that are specifically related to Chavacano. Both Coronia (2022) and our translation experiment used related languages in model training. Although an objective comparison between the models cannot be made, we can infer that ChavacanoMT can produce models that are on par with published results.
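The per-direction differences behind this comparison can be read directly off Table 5. The short sketch below subtracts the BLEU scores reported by Coronia (2022) from ours; the variable names are illustrative, and the differences are indicative only, given that the two systems were trained on different data.

```python
# BLEU scores copied from Table 5.
coronia = {"en-ceb": 21.25, "ceb-en": 26.67, "en-hil": 24.40,
           "hil-en": 26.11, "en-cbk": 0.08, "cbk-en": 0.79}
ours = {"en-ceb": 20.02, "ceb-en": 26.25, "en-hil": 16.20,
        "hil-en": 23.70, "en-cbk": 24.06, "cbk-en": 35.30}

# Positive values mean our model scores higher in that direction.
delta = {direction: round(ours[direction] - coronia[direction], 2) for direction in ours}
print(delta["en-cbk"], delta["cbk-en"])  # 23.98 34.51
```

Both Chavacano directions improve by more than 20 BLEU points, while the Cebuano and Hiligaynon directions are lower by between 0.42 and 8.2 points.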

6 Conclusions

This paper presents ChavacanoMT, a benchmark corpus for the machine translation of Chavacano to and from its related languages: Spanish, Cebuano, Hiligaynon, Tagalog, and English. The corpus consists of 767,053 parallel sentence samples from 15 language pairs. It combines multi-source and bilingual parallel sentences.

Our machine translation experiments with the ChavacanoMT corpus demonstrate its potential to enhance translation quality between Chavacano and its related languages. Models trained on ChavacanoMT performed on par with, or surpassed, existing multilingual neural machine translation systems involving Chavacano, particularly in translating between Chavacano and English, where improvements exceeded 20 BLEU points. This highlights the effectiveness of incorporating diverse related languages with multisource sentence samples in building robust machine translation models for low-resource languages such as Chavacano. Our findings suggest that leveraging related languages within the corpus improves translation accuracy. This approach underscores the importance of using carefully curated multilingual datasets to support underrepresented languages, ultimately contributing to their preservation and wider accessibility.

The insights gained from this study open avenues for exploring the influence of specific linguistic relationships in translation quality, which could guide the development of more targeted translation strategies for other low-resource languages.

Additionally, ChavacanoMT provides a foundation for further research on cross-linguistic interactions and their implications for the evolution and revitalization of Chavacano. By extending this work, researchers can deepen their understanding of how digital tools and computational methods can support Creole-speaking communities’ linguistic and cultural heritage.

Acknowledgements. We acknowledge the Jehovah’s Witness organization for the permission to scrape their website www.jw.org to support the NLP research on Philippine languages.

Funding: This work was supported by the Commission on Higher Education through its Scholarships for Instructors’ Knowledge Advancement Program (SIKAP) grant.
Conflict of interest/Competing interests: The authors have no competing interests to declare relevant to this article’s content.
Data availability: The corpus built in this study is available from the authors, but restrictions apply. Some resources used in building the corpus were under a non-commercial agreement from Watch Tower Bible and Tract Society, Philippines for the current study. The authors wish to honor the permission by ensuring that outcomes are used only for academic purposes. Data are, however, available from the authors upon reasonable request.

References

Ahia, O., & Ogueji, K. (2020). Towards Supervised and Unsupervised Neural Machine Translation Baselines for Nigerian Pidgin. AfricaNLP Workshop. Online. Retrieved from http://arxiv.org/abs/2003.12660
Baliber, R.I., Cheng, C., Adlaon, K.M., Mamonong, V. (2020, December). Bridging Philippine Languages with Multilingual Neural Machine Translation. Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages (pp. 14–22). Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.loresmt-1.2.pdf
Coronia, J.D. (2022). Exploring clustering of Philippine languages in multilingual neural machine translation. (Unpublished master’s thesis). De La Salle University, Manila, Philippines. (Retrieved from https://animorepository.dlsu.edu.ph/etdm_softtech/4)
Dabre, R., Chu, C., Kunchukuttan, A. (2021, September). A Survey of Multilingual Neural Machine Translation. ACM Computing Surveys, 53 (5), 1–38, https://doi.org/10.1145/3406095 Retrieved 2023-09-12, from https://dl.acm.org/doi/10.1145/3406095
Dabre, R., Nakagawa, T., Kazawa, H. (2017, November). An Empirical Study of Language Relatedness for Transfer Learning in Neural Machine Translation. Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation (pp. 282–286). The National University (Philippines). Retrieved from https://aclanthology.org/Y17-1038
Dabre, R., & Sukhoo, A. (2022, November). KreolMorisienMT: A dataset for Mauritian Creole machine translation. Findings of the association for computational linguistics: Aacl-ijcnlp 2022 (pp. 22–29). Online only: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.findings-aacl.3
de Dios, Maria Isabelita Riego (1989). A composite dictionary of Philippine Creole Spanish. Studies in Philippine Linguistics. Manila: Linguistic Society of the Philippines and Summer Institute of Linguistics.
DepEd-IX (2016). Zamboanga Chavacano Orthography. Local Government of Zamboanga City: Philippines.
Eberhard, D., Simons, G., & Fennig, C. (Eds.). (2024). Ethnologue: Languages of the World (27th ed.). Dallas, Texas: SIL International.
Genuino, C.F. (2005). Language extinction in process across Chabacano communities: A sociolinguistic approach. (Unpublished doctoral dissertation). De La Salle University, Manila, Philippines. (Retrieved from https://animorepository.dlsu.edu.ph/etd_doctoral/87)
Goyal, V., Kumar, S., Sharma, D.M. (2020). Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop (p. 162–168). Association for Computational Linguistics.
JW.org (2023). Official Website of Jehovah’s Witnesses. Available at https://www.jw.org/en/ (2023/10/18).
Komisyon sa Wikang Filipino (2020). Repositoryo ng mga Wika at Kultura. https://kwfwikaatkultura.ph/chabacano/.
Kudo, T., & Richardson, J. (2018, November). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. E. Blanco & W. Lu (Eds.), Proceedings of the 2018 conference on empirical methods in natural language processing: System demonstrations (pp. 66–71). Brussels, Belgium: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D18-2012
Lee, J.F. (2017). Word order and linguistic factors in the second language processing of Spanish passive sentences. Hispania, 100 (4), 580–595. Retrieved 2023-06-26, from https://www.jstor.org/stable/26387810
Lesho, M., & Sippola, E. (2013). The sociolinguistic situations of Manila Bay Chabacano-speaking communities. Language Documentation and Conservation, 7, 1–30.
Lin, C. (2004, July). ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out (pp. 74–81). Barcelona, Spain: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W04-1013
Lipski, J. (1992). New thoughts on the origins of Zamboangueño (Philippine Creole Spanish). Language Sciences, 14 (3), 197-231, https://doi.org/10.1016/0388-0001(92)90005-Y Retrieved from https://www.sciencedirect.com/science/article/pii/038800019290005Y
Lipski, J. (2001, Aug.). The place of Chabacano in the Philippine linguistic profile. Sociolinguistic Studies, 2 (2), 119–163, https://doi.org/10.1558/sols.v2i2.119 Retrieved from https://journal.equinoxpub.com/SS/article/view/11691
Lipski, J., & Santoro, M. (2007). Zamboangueño Creole Spanish [Bibliographical record]. J. Holm & P. Patrick (Eds.), Comparative creole syntax. parallel outlines of 18 creole grammars (p. 373-398). London: Battlebridge. (Much information is based on Forman (1972).)
Pahulaya, V.L. (2022). Morphological Analysis on the Structure of Chavacano Language: A Complex Mental Process. NeuroQuantology, 20 (6), 9820-9830, https://doi.org/10.14704/nq.2022.20.6.NQ22960
Papineni, K., Roukos, S., Ward, T., Zhu, W. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 311–318). Philadelphia.
Post, M. (2018, October). A Call for Clarity in Reporting BLEU Scores. Proceedings of the Third Conference on Machine Translation: Research Papers (pp. 186– 191). Belgium, Brussels: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W18-6319
Robinson, N., Hogan, C., Fulda, N., Mortensen, D.R. (2022, October). Data-adaptive Transfer Learning for Translation: A Case Study in Haitian and Jamaican. Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022) (pp. 35–42). Gyeongju, Republic of Korea: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.loresmt-1.5
Tubay, B., & Costa-Jussà, M.R. (2018, October). Neural machine translation with the transformer and multi-source Romance languages for the biomedical WMT 2018 task. Proceedings of the Third Conference on Machine Translation: Shared Task Papers (pp. 667–670). Belgium, Brussels: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W18-6449
Wycliffe Global Alliance (2023). 2023 Global Scripture Access. https://www.wycliffe.net/resources/statistics/. (Accessed: February 9, 2024)
Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., ... Raffel, C. (2021, June). mT5: A massively multilingual pre-trained text-to-text transformer. K. Toutanova et al. (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 483–498). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.naacl-main.41
Zoph, B., Yuret, D., May, J., Knight, K. (2016). Transfer learning for low-resource neural machine translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (p. 1568–1575). Association for Computational Linguistics.

¹https://opus.nlpl.eu/

²https://webscraper.io/

³https://www.crummy.com/software/BeautifulSoup/

⁴https://github.com/mediacloud/sentence-splitter?tab=readme-ov-file

⁵http://sealang.net/chavacano/dictionary.htm

⁶https://huggingface.co/docs/transformers/en/index
