VeLeSpa: An inflected verbal lexicon of Peninsular Spanish and a quantitative analysis of paradigmatic predictability

doi:10.21203/rs.3.rs-2877209/v1

Short Report

This work is licensed under a CC BY 4.0 License

Abstract

This paper presents VeLeSpa, a verbal lexicon of Peninsular Spanish that contains the full paradigms (all 63 cells), in phonological form, of 6553 verbs, along with their corresponding frequencies. The process and decisions involved in building the resource are described. In addition, based on the 3000+ most frequent verbs, a quantitative analysis of morphological predictability in Spanish verbal inflection is conducted. The results and their drivers are discussed, as well as the differences observed with respect to other Romance languages and Latin.

Spanish

paradigm

verb

inflected lexicon

morphological predictability

entropy

1 Introduction

The last decades have witnessed a resurgence of research into purely paradigmatic relations (Maiden 1992, Aronoff 1994), for example in the Word-and-Paradigm tradition (Matthews 1965, Blevins 2016). Over the last decade, in addition, it has become easier to analyze these relations (i.e. paradigmatic word-to-word predictability) quantitatively and automatically over whole inflectional systems, thanks to advances in computing power and the development of tools (e.g. Cats Claw [Stump & Finkel 2013], Qumín [Beniamine 2018]) and resources (e.g. inflected lexicons like Unimorph [Kirov et al. 2018]) specific to our field. Much progress still needs to happen, however, regarding language documentation of highly-inflected underresearched families and languages (see e.g. Cruz et al. 2020) and, even in some major languages, regarding the development of well-curated and comparable datasets, preferably in phonological rather than orthographic form.

Indo-European languages, and the Romance family in particular, are generally very well studied and have been explored in an extensive qualitative philological-comparative literature (see e.g. Maiden 2018). Regarding inflected lexicons in phonological rather than orthographic form, we have both family-wide but low-resolution¹ ones (see ODRVM, Maiden et al. 2010) and high-resolution ones for some of the major languages (namely French [Bonami et al. 2014], Latin [Pellegrini & Passarotti 2018], Italian [Pellegrini & Cignarella 2020], and Portuguese [Beniamine et al. 2021]). There is, however, perhaps surprisingly, no comparable resource for Spanish, and hence also no analogous quantitative analysis of morphological predictability in Spanish verbs. It is the goal of this paper to fill this gap. Section 2 explains how the Verbal Lexicon of Spanish (VeLeSpa) was built from existing ones, outlines the main challenges and decisions involved, and presents an overview of the final resource. Relying on it and on the Qumin toolkit (Beniamine 2018), Section 3 presents a quantitative analysis of paradigmatic predictability in Spanish verbs and discusses the interpretation of the results. Section 4 concludes the paper and presents possible avenues for future research.

^[1] The ODRVM (and its CLDF descendant the Romance Verbal Dataset 2.0 of Beniamine et al. 2020) has full and partial paradigms from 73 Romance varieties but a very low number of verbs per variety (24.5 on average). It also contains too much noise (mistakes, and variable phonemic transcription conventions) for its reliable computational use.

2 Building VeLeSpa

The construction of a verbal inflected lexicon of Spanish started from the Spanish verbal paradigms in Wiktionary, as collected in Unimorph (Kirov et al. 2018, https://github.com/unimorph/spa/blob/master/README.md, downloaded 13/03/2023). This contains the full paradigm of 6812 verb lemmas in orthographic form. Although Spanish orthography is very transparent compared to languages like French and English, it has properties that are likely to distort morphological alternations and predictability between word forms. Pairs like rezo /ˈɾeθo/ 'pray.PRS.IND.1SG' and rece /ˈɾeθe/ 'pray.PRS.SBJV.1SG', quepo /ˈkepo/ and cabes /ˈkabes/, huelo /ˈwelo/ and olemos /oˈlemos/, andaba /anˈdaba/ and andábamos /anˈdabamos/, etc. can give the impression of stem or stress alternations of some sort. However, these orthographic differences do not reflect any contrast in pronunciation. Conversely, orthography can gloss over differences in stress (e.g. amo /ˈamo/ vs amamos /aˈmamos/), lead to incorrect segment alignments (e.g. crezco /ˈkɾeθko/ vs creces /ˈkɾeθes/), etc. Phonological forms, and not written words, are also the only evidence available to native language users as they acquire the inflectional system of a language. Because of this, phonological forms are generally preferred in linguistic research.

Unlike in other Romance languages, and despite the inadequacies outlined in the paragraph above, there is in Spanish largely² no uncertainty as to how a written word should be pronounced (with uncertainty only in the opposite direction, i.e. on how a given pronunciation is encoded orthographically). Because of this, it was comparatively straightforward to transform orthographic forms into phonological forms through regular expression changes like 'que' > 'ke', 'ci' > 'θi', etc. The accentuation rules of Spanish also allow one to recover the location of word stress and (standard/prescriptive) syllabification. It must be mentioned in this respect that, although substantial variation exists between different varieties and even idiolects regarding hiatus vs diphthong pronunciations in many environments (e.g. enviando 'send.GER' /em.bján.do/ vs /em.bi.án.do/, construimos 'build.1PL.PRS.IND' /kons.tɾwí.mos/ vs /kons.tɾu.í.mos/), the syllabification prescribed by RAE was adopted as the most standard pronunciation. This involves the former pronunciation, i.e. a diphthong, in the previous cases.
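
The kind of regular-expression rewriting mentioned above can be sketched as follows. This is a minimal illustration, not the script used to build VeLeSpa: the rule list, its ordering, and the function name `phonemize` are assumptions, only a handful of grapheme-to-phoneme correspondences are covered, and stress assignment and syllabification are left out entirely.

```python
import re

# Hypothetical, deliberately incomplete rule list (not the rules actually
# used for VeLeSpa). Order matters: context-sensitive rules come first.
RULES = [
    (r"qu(?=[ei])", "k"),  # 'que' > 'ke', 'qui' > 'ki'
    (r"c(?=[ei])", "θ"),   # 'ci' > 'θi', 'ce' > 'θe'
    (r"z", "θ"),           # 'rezo' > 'reθo'
    (r"ch", "tʃ"),         # must precede the plain 'c' and 'h' rules
    (r"c", "k"),           # any remaining 'c' is /k/
    (r"ll", "ʎ"),          # Northern Peninsular /ʎ/ is kept distinct
    (r"v", "b"),           # 'b' and 'v' are not distinguished in speech
    (r"h", ""),            # remaining 'h' is silent
]


def phonemize(word: str) -> str:
    """Apply the toy rewrite rules in order to an orthographic form."""
    form = word.lower()
    for pattern, replacement in RULES:
        form = re.sub(pattern, replacement, form)
    return form


for written in ("rezo", "rece", "quepo", "cabes"):
    print(written, "->", phonemize(written))
```

Applying the rules in a fixed order is what keeps the sketch simple: once 'c' before a front vowel has been rewritten, every remaining 'c' can safely be mapped to /k/.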

Besides the transformation into phonological transcription, further processing of the Unimorph data involved discarding multiword inflectional forms and those involving postposed object clitics (e.g. da-me-lo 'give.IMP-me-it', llevemos-le 'take.SBJV.1PL-him', diciéndo-se-lo 'telling-him-it'), which are orthographically written as a single word, but involve syntactically independent and morphologically invariable elements (cf. me lo da 'me it gives'). A final adaptation involved eliminating those cases of multiple forms with the same value. Most of these involve form doublets, one of which is not used in Peninsular Spanish (e.g. voseo 2SG forms like llevás, corrés besides llevas, corres). A small number of these involve overabundance, i.e. forms in more-or-less free variation (see Thornton 2012) like roigo roo royo (all prescriptively acceptable as the 1SG.PRS.IND form of 'gnaw'). Avoiding these cases, as I did, by keeping the form most commonly used in Peninsular Spanish has the goal of describing a coherent system, i.e. one that would be learned naturally by a single speaker, eliminating spurious complexity resulting from dialectal or idiolectal variation. A focus on a single dialect (in this case Northern Standard Peninsular Spanish) is, of course, also required with regard to the phonological transcription of words, where /θ/ and /ʎ/ have been used rather than /s/ and /j/, avoiding also features of colloquial pronunciation like /gw/ in words starting with /w/ (e.g. huele 'smells'), vowel reductions, etc. Unadapted borrowings containing foreign phonemes (e.g. /ʃ/) and clusters (see Footnote 2) were also discarded.

After the process and decisions described, the phonemic inflected lexicon ended up with 6553 verbal paradigms with 63 cells each, for a total of 412839 inflected word forms. Table 1 shows the size of VeLeSpa compared to analogous resources for verbal inflection in other Romance languages. Evidently, the frequency of words in such a large lexicon is highly variable, with many of them being extremely infrequent (Zipf 1932). Token frequency is a factor that is crucial to the Paradigm Cell Filling Problem (PCFP, Ackerman & Malouf 2013) and beyond, and hence very important for morphological and psycholinguistic research. For this reason, the token frequency of each lemma is also supplied alongside its inflected forms. The frequencies of the different lemmas in the Spanish corpus Corpes XXI (Version 0.93, containing 381 million words) were added, and pattern as illustrated in Fig. 1: only 8 verbs exceed 1 million tokens, 82 exceed 100000 tokens, 650 have ≥ 10000 tokens, 2061 have ≥ 1000 tokens, 3880 ≥ 100 tokens, 5326 ≥ 10 tokens, 6115 are attested at least once, and 437 verbs are completely unattested in Corpes XXI. The token frequency of cells (also based on Corpes XXI, on its Spain subcorpus) is also given in Fig. 1. This varies between 4012564 for the 3SG.PRS.IND and 4 for the 1PL.FUT.SBJV (note that this tense is essentially extinct in the spoken language).

Table 1

Size of VeLeSpa and comparable inflected lexicons in other Romance varieties
Language	Reference	Lemmas	Word forms
Latin	Pellegrini & Passarotti 2018	3348	850392
French	Bonami et al. 2014	4991	253174
Italian	Pellegrini & Cignarella 2020	2053	108809
Portuguese	Beniamine et al. 2021	4987	324214
Spanish	VeLeSpa, this paper	6553	412839

^[2] Exceptionally, the grapheme 'x' can be pronounced as either /x/ or /ks/. Borrowed words may also preserve a spelling that does not correspond to the one in native words. In the original Unimorph dataset, for example, we can find among others crackear, hackear, shockear, fotoshopear, twitear, etc. This introduces another source of occasional uncertainty regarding the pronunciation of some graphemes or character sequences (e.g. 'sh' cf. deshacer or 'h' cf. hacer).

3 A quantitative analysis of the PCFP in Spanish verbal inflection

With the goal of making the present analysis more comparable to the extant ones in related languages, and to shorten computation times, only the 3880 verbs with token frequency ≥ 100 were selected to perform the computations that are reported in this section. This number, of necessity arbitrary, is close to the average number of verbs analyzed in the literature (see Table 1). It corresponds to a frequency ≥ 0.263 per million words in Corpes XXI. This measure is also justified by the fact that most native speakers do not know or use most of the lower-frequency verbs, which hence are generally not part of the inflectional system learned by most speakers of Peninsular Spanish.
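
The frequency cut-off described above amounts to a simple filter plus a unit conversion. The sketch below illustrates it under stated assumptions: the function names and the toy lemma counts are hypothetical, and only the corpus size (381 million words) and the threshold (100 tokens) come from the text.

```python
CORPUS_SIZE = 381_000_000  # size of Corpes XXI as reported in the text


def per_million(tokens: int, corpus_size: int = CORPUS_SIZE) -> float:
    """Convert a raw token count into a frequency per million words."""
    return tokens / corpus_size * 1_000_000


def select_frequent(lemma_counts: dict, min_tokens: int = 100) -> list:
    """Return, in alphabetical order, the lemmas that reach the threshold."""
    return sorted(
        lemma for lemma, count in lemma_counts.items() if count >= min_tokens
    )


# Hypothetical counts, for illustration only (not taken from Corpes XXI).
toy_counts = {"tener": 2_000_000, "roer": 950, "abarloar": 3}

print(select_frequent(toy_counts))
print(per_million(100))  # about 0.26 per million, cf. the figure in the text
```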

The free software toolkit Qumin (Quantitative Modelling of Inflection, Beniamine 2018) was used to perform the computations. The algorithm (Python scripts) works by automatically extracting the alternations between all the different word forms of all lemmas. Within the 63-cell paradigm of Spanish, there are 3906 (= 63*62) ordered pairs of cells, and hence of word forms, to examine for each verb. Extracting these alternations automatically (comparable manual work would not be feasible) allows us to find out which verbs behave the same and which behave differently, and how many classes exist, in every pair of cells in the paradigm.³

Table 2

Extracted morphological alternations between two cells in six verbs
lemma	gloss	'INDPRS2SG'	'IMP2SG'	('INDPRS2SG', 'IMP2SG')
tener	'have'	tjénes	tén	j_es ⇌ _
ir	'go'	bás	bé	ás ⇌ é
hacer	'do'	áθes	áθ	es ⇌
decir	'say'	díθes	dí	θes ⇌
dar	'give'	dás	dá	s ⇌
ver	'see'	bés	bé	s ⇌

As the forms in Table 2 show, the INDPRS2SG and the IMP2SG in Spanish can differ in multiple ways. The morphological contrasts between /tjénes/ and /tén/, /bás/ and /bé/, or /dás/ and /dá/ are all different. The difference between /dás/ and /dá/, and /bés/ and /bé/, by contrast, is identical, with the INDPRS2SG suffixing /s/ to the IMP2SG form. This means that, whereas tener, ir and dar belong to different classes, dar and ver belong, for the purposes of this cell pair, to the same class. The algorithm also provides the phonological context where a given alternation occurs if a coherent one can be identified (note that, apart from the inflected lexicon itself, a different file needs to be provided to Qumin specifying the language's phonemes and their description in terms of distinctive features; this file is also provided in the supplementary materials: https://osf.io/gne73/?view_only=7fc0ddb495aa4cb8b3b50b2a4dbfdb85).
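
How verbs fall into classes for a given cell pair can be illustrated with a much simpler procedure than the one Qumin implements. The sketch below is an assumption-laden simplification: it only strips the longest common prefix of the two forms, so it recovers the suffixal patterns of Table 2 (e.g. s ⇌ for dar and ver) but describes tener differently from Qumin, which also factors out shared material inside the word (j_es ⇌ _). The function names are hypothetical.

```python
from collections import defaultdict

# Forms of the INDPRS2SG and IMP2SG cells, copied from Table 2.
FORMS = {
    "tener": ("tjénes", "tén"),
    "ir": ("bás", "bé"),
    "hacer": ("áθes", "áθ"),
    "decir": ("díθes", "dí"),
    "dar": ("dás", "dá"),
    "ver": ("bés", "bé"),
}


def alternation(form_a: str, form_b: str) -> str:
    """Strip the longest common prefix and describe what remains."""
    i = 0
    while i < min(len(form_a), len(form_b)) and form_a[i] == form_b[i]:
        i += 1
    return f"{form_a[i:]} ⇌ {form_b[i:]}".strip()


def classes(forms: dict) -> dict:
    """Group lemmas that show the same alternation for this cell pair."""
    groups = defaultdict(list)
    for lemma, (form_a, form_b) in forms.items():
        groups[alternation(form_a, form_b)].append(lemma)
    return dict(groups)


for pattern, lemmas in classes(FORMS).items():
    print(pattern, lemmas)
```

Even this crude version places dar and ver in one class and keeps tener, ir, hacer and decir apart, which is the classification that matters for the cell pair discussed above.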

Once these patterns are extracted (a computationally demanding process which can take several hours), the patterns themselves can be used for quality control of the resource (checking whether exceptional or infrequent alternations are bona fide irregularities or mistakes in the original lexicon or its phonemization) and to perform further calculations with other Qumin scripts (e.g. computing conditional entropies [on the basis of single or multiple predictor cells], identifying micro- and macro-classes, etc.).

An analysis of paradigmatic predictability is often presented first in terms of categorical interpredictability. Inflected word forms can be either not mutually predictable from each other (when there are multiple ways, as in Table 2, in which one form could be altered to generate the other) or they might be mutually predictable. This is the case, for example, of the COND3SG and the INDFUT1PL in Spanish, where ía ⇌ émos describes the morphological relation between the two cells in every single Spanish verb. Out of 3906 pairs of predictor-predicted cells in the Spanish verb paradigm, 553 (14.16%) involve no uncertainty (i.e. conditional entropy = 0). The majority of these are found to be bidirectional, i.e. the conditional entropy of A given B is 0 and the conditional entropy of B given A is also 0. As is well known from previous literature on paradigmatic predictability, this means that paradigm cells can be classified into areas of mutual interpredictability (see the notion of 'stem space' in Montermini & Bonami [2013] or distillations in Stump & Finkel [2013]).
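
The notion of conditional entropy used here can be made concrete with a small sketch. It assumes that each verb has already been reduced to a pair (context, pattern), where the context stands for whatever is known about the predictor form and the pattern is the alternation leading to the predicted form. The contexts in the toy data are hypothetical labels, and the conditioning that Qumin actually uses is more refined and is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(pairs) -> float:
    """H(pattern | context), in bits, over (context, pattern) observations."""
    by_context = defaultdict(Counter)
    for context, pattern in pairs:
        by_context[context][pattern] += 1
    total = sum(sum(counts.values()) for counts in by_context.values())
    entropy = 0.0
    for counts in by_context.values():
        n = sum(counts.values())
        for k in counts.values():
            # weight of the context times the surprisal-weighted share
            entropy -= (n / total) * (k / n) * math.log2(k / n)
    return entropy


# Every verb shows the same pattern: no uncertainty at all.
future_from_conditional = [("any", "ía ⇌ émos")] * 5

# Hypothetical contexts (final segments of the predictor) with the
# alternations of Table 2 as patterns.
imperative_from_present = [
    ("es", "j_es ⇌ _"), ("es", "es ⇌"), ("es", "θes ⇌"),
    ("ás", "ás ⇌ é"), ("ás", "s ⇌"),
    ("és", "s ⇌"),
]

print(conditional_entropy(future_from_conditional))
print(conditional_entropy(imperative_from_present))
```

A value of 0 corresponds to the categorical predictability discussed above; any competition between patterns within the same context pushes the value above 0.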

In our presently-used dataset of the 3880 highest-frequency verbs in VeLeSpa, results show the 14 areas identified in Fig. 2. This is close to the number of interpredictability areas in Latin (15) and in other Romance languages (15 in Italian, 14 in French, 12 in Portuguese). Many of the patterns observed in Spanish are also observed in other Romance languages and are known to readers familiar with the literature on Romance morphomes (see e.g. Maiden 2018). The affinity of future and conditional tenses (Zone 12 in Fig. 2), for example, is well known and diachronically unsurprising, since the tenses share an origin in verbal periphrases involving the infinitive and a tensed form of the verb 'have'. Also well known is the morphological affinity of the future subjunctive and the past subjunctive (which in Spanish comes in two synonymous versions, e.g. amara (A) and amase (B) 'love'), which derives from the Latin perfectum tenses, all of which shared the same stem. All person-number forms within the imperfect indicative (e.g. amabas and amábamos) are also mutually predictable in Spanish, as in other Romance languages like Italian and Portuguese, and in Latin.
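
Grouping cells into areas of mutual interpredictability can be sketched as finding connected components among cells whose conditional entropy is 0 in both directions. The code below is an illustration under assumptions, not Qumin's procedure: the function name, the union-find implementation, and the toy entropy values are all hypothetical (only the zero entropy between COND3SG and INDFUT1PL reflects a claim made in the text).

```python
def zones(cells, entropy, tol=0.0):
    """Group cells linked by zero conditional entropy in both directions.

    `entropy[(a, b)]` is read as the entropy of predicting b from a.
    """
    parent = {cell: cell for cell in cells}

    def find(cell):
        # Walk up to the representative, halving the path on the way.
        while parent[cell] != cell:
            parent[cell] = parent[parent[cell]]
            cell = parent[cell]
        return cell

    for a in cells:
        for b in cells:
            if a != b and entropy[(a, b)] <= tol and entropy[(b, a)] <= tol:
                parent[find(a)] = find(b)

    groups = {}
    for cell in cells:
        groups.setdefault(find(cell), []).append(cell)
    return sorted(groups.values())


toy_cells = ["COND3SG", "INDFUT1PL", "INDPRS1SG"]
toy_entropy = {
    ("COND3SG", "INDFUT1PL"): 0.0, ("INDFUT1PL", "COND3SG"): 0.0,
    ("COND3SG", "INDPRS1SG"): 0.2, ("INDPRS1SG", "COND3SG"): 0.4,
    ("INDFUT1PL", "INDPRS1SG"): 0.2, ("INDPRS1SG", "INDFUT1PL"): 0.4,
}

print(zones(toy_cells, toy_entropy))
```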

On a broader picture, Spanish is also like other Romance languages in generally not showing natural-class, but instead morphomic (see Aronoff 1994, Maiden 2018, Herce 2023) distributions of its morphological domains. Focusing on multi-cell areas (the ones where naturalness or unnaturalness can be distinguished), all of them except the indicative past imperfect (Z7) can be taken to be unnatural: 2PL.IMP + 1PL.PRS.IND + 2PL.PRS.IND (Z2), 2SG.PRS.IND + 3SG.PRS.IND + 3PL.PRS.IND (Z4), SG.PRS.SBJV + 3PL.PRS.SBJV (Z5), etc. The patterns of interpredictability also vary strongly from one tense to the other (more on this below), which suggests that these domains are morphosyntactically largely arbitrary, and might be regarded as an inherited purely morphological property.

On a more general picture still, Spanish is seen to behave not only like other Romance languages but also in line with well-known cross-linguistic tendencies in that a strong correlation is observed between frequency and irregularity (Herce 2019, Wu et al. 2019). This holds both at the inflection class level (see Fig. 3a), with verbs from singleton or low type frequency classes more common on average than verbs from large classes, and at the cell level (see Fig. 3b), with cells from single-cell (Z1, Z3, Z8, Z10, Z13, Z14 in Fig. 2) or small predictability areas (Z4, Z6, Z9, Z2, Z5), more frequent on average than cells from larger morphological domains (Z7, Z12, Z11). It is a logical necessity that, given the Zipfian nature of linguistic input (Blevins et al. 2017), infrequent verbs and forms are less likely to be learned by rote and more likely to be produced online following more general rules of the language, thus losing irregular morphological traits (see Lieberman et al. 2007). Spanish verbal inflection is not, and arguably cannot be, an exception to this (Herce 2016).

In this sea of similarity, the main differences between the predictability patterns of Spanish and those of its closest siblings concern the present indicative and the past indicative. The former is, in every other Romance language and in Latin, the tense which is split into the most areas of interpredictability (6 in Italian, 5 in Portuguese and Latin, 4 in French). This is not so in Spanish, where only 3 domains are found: 1SG, 2SG/3SG/3PL, and 1PL/2PL. The tense is hence simpler, in this respect, than in the rest of Romance. It is striking, or at least unparalleled in the family, that such frequent cells as 2SG, 3SG, and 3PL present indicative are all mutually predictable. The past indicative shows the opposite tendency. Whereas other Romance languages split this tense into 2 domains of interpredictability at most, Spanish boasts 4 distinct domains. The lack of predictability is caused, among other things, by the alternations displayed in Table 3. As can be seen, there are multiple cross-classifying ways in which cells from those domains can differ from each other in different verbs.

Table 3

Some past indicative inflected forms and alternations in Spanish
INDPST1SG	INDPST2SG	INDPST3SG	INDPST3PL	('INDPSTPFV1SG', 'INDPSTPFV2SG')	('INDPSTPFV1SG', 'INDPSTPFV3SG')	('INDPSTPFV1SG', 'INDPSTPFV3PL')
fwí	fwíste	fwé	fwéɾon	⇌ ste	í ⇌ é	í ⇌ éɾon
estúbe	estubíste	estúbo	estubjéɾon	ú_ ⇌ u_íst	e ⇌ o	ú_e ⇌ u_jéɾon
túbe	tubíste	túbo	tubjéɾon	ú_ ⇌ u_íst	e ⇌ o	ú_e ⇌ u_jéɾon
púde	pudíste	púdo	pudjéɾon	ú_ ⇌ u_íst	e ⇌ o	ú_e ⇌ u_jéɾon
úbe	ubíste	úbo	ubjéɾon	ú_ ⇌ u_íst	e ⇌ o	ú_e ⇌ u_jéɾon
díxe	dixíste	díxo	dixéɾon	í_ ⇌ i_íst	e ⇌ o	í_e ⇌ i_éɾon
dí	díste	djó	djéɾon	⇌ ste	í ⇌ jó	í ⇌ jéɾon
bí	bíste	bjó	bjéɾon	⇌ ste	í ⇌ jó	í ⇌ jéɾon

Apart from discussing the areas of perfect interpredictability, we can also present and comment on the overall uncertainty (i.e. conditional entropies) involved in predicting between word forms among these domains. Zone 3, i.e. the 1SG.PRS.IND, is, as Table 4 shows, the least informative cell (see also Table 5). This is also clearly the case in Portuguese (see Beniamine et al. 2021).⁴ Also shared with the other national standard Ibero-Romance language (and beyond in this case) is the fact that the rhizotonic cells of the present (the so-called N-morphome, i.e. domains Z1, Z3, Z4 and Z5) are the least predictable from other cells in the paradigm. In Spanish, this is largely the result of unpredictable stem-vowel diphthongizations (e.g. costar cuesta, cerrar cierra, but cortar corta, cenar cena) that distinguish the rhizotonic (i.e. root-stressed) forms from the rest.

Table 4

Conditional entropies (column given row) between the distillations in Figure 2.
	Z1	Z2	Z3	Z4	Z5	Z6	Z7	Z8	Z9	Z10	Z11	Z12	Z13	Z14
Z1		0.066	0.013	0.001	0.013	0.041	0.045	0.045	0.045	0.043	0.029	0.080	0.028	0.046
Z2	0.191		0.190	0.197	0.190	0.034	0.003	0.005	0.005	0.013	0.013	0.008	0.009	0.006
Z3	0.523	0.429		0.516	0.632	0.585	0.429	0.426	0.424	0.418	0.403	0.421	0.383	0.401
Z4	0.013	0.064	0.029		0.029	0.041	0.044	0.046	0.047	0.042	0.029	0.079	0.027	0.046
Z5	0.006	0.060	0.000	0.003		0.039	0.044	0.044	0.044	0.041	0.028	0.059	0.028	0.044
Z6	0.205	0.105	0.205	0.204	0.205		0.036	0.039	0.039	0.017	0.017	0.096	0.018	0.035
Z7	0.237	0.158	0.226	0.246	0.226	0.075		0.010	0.010	0.058	0.064	0.142	0.061	0.006
Z8	0.230	0.131	0.222	0.220	0.221	0.046	0.004		0.000	0.049	0.055	0.128	0.049	0.005
Z9	0.222	0.132	0.221	0.234	0.222	0.047	0.005	0.014		0.073	0.082	0.129	0.050	0.006
Z10	0.326	0.258	0.295	0.316	0.314	0.205	0.194	0.183	0.183		0.336	0.253	0.336	0.176
Z11	0.187	0.116	0.186	0.188	0.187	0.031	0.051	0.044	0.038	0.019		0.108	0.001	0.040
Z12	0.236	0.012	0.192	0.237	0.192	0.024	0.009	0.010	0.010	0.018	0.018		0.016	0.011
Z13	0.184	0.131	0.181	0.180	0.182	0.060	0.052	0.040	0.042	0.005	0.005	0.127		0.042
Z14	0.230	0.142	0.215	0.223	0.215	0.060	0.000	0.003	0.003	0.052	0.057	0.133	0.056

Alongside the areas of higher unpredictability (gray), Table 4 also shows those of low uncertainty (white). Some distillations that come close to perfect interpredictability are some of the unusual past indicative domains mentioned in Table 3 (namely Z8, Z9 and Z11), as well as Z11 and Z13.

The working hypothesis embedded in the PCFP is that perfect and high predictability relations must be picked up by language users as they learn their language and must be actively used when they produce unobserved forms online on the basis of other forms. Predictability relations, hence, must drive analogical morphological changes in diachrony as well. The morphological affinity of Z11 and Z13 in Spanish, for example, must be the reason why we sometimes witness the emergence of analogical gerunds like *dijendo, *tuviendo, *hiciendo, *pusiendo, etc. in some Peninsular Spanish varieties, where these innovative forms replace standard diciendo, teniendo, haciendo, poniendo, etc. (see Pato & O'Neill 2013). The new forms, which borrow the stem from the Z11 former-perfectum domain, represent an elimination of the very few unpredictable alternations between the two distillations. In the overwhelming majority of Spanish verbs (e.g. amaron amando, corrieron corriendo, vivieron viviendo, murieron muriendo, vinieron viniendo, pidieron pidiendo, etc.), a simple ɾo_ ⇌ _do captures the change between the 3PL past indicative and the gerund. Only in a few irregular high-frequency verbs (e.g. dijeron diciendo, tuvieron teniendo, hicieron haciendo, pusieron poniendo, etc.) are there additional changes in the stem as well. Erasing these, as the mentioned substandard varieties do, would bring Z11 and Z13 into the same distillation, which would reduce the overall complexity of the Spanish verbal inflectional system. This (i.e. 'regularity') is generally taken to be the "purpose" of analogical morphological change (see Sturtevant 1947).
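
The ɾo_ ⇌ _do pattern can be applied mechanically, which makes the analogical gerunds easy to reproduce. The sketch below is an illustration only: the function name is hypothetical, and the phonemic input forms follow the transcription conventions visible in Table 3 (stress marked with an acute accent, /x/ for orthographic j).

```python
def gerund_from_3pl_past(form_3pl: str) -> str:
    """Apply the pattern ɾo_ ⇌ _do: drop 'ɾo' before final 'n', add 'do'."""
    ending = "ɾon"
    if not form_3pl.endswith(ending):
        raise ValueError(f"expected a form ending in {ending!r}: {form_3pl!r}")
    return form_3pl[: -len(ending)] + "ndo"


# Regular verb: the pattern yields the standard gerund (amando).
print(gerund_from_3pl_past("amáɾon"))

# Irregular verb: the pattern yields the analogical gerund (*dijendo)
# instead of the standard diciendo.
print(gerund_from_3pl_past("dixéɾon"))
```

A speaker who generalizes the majority pattern to every verb will therefore produce exactly the innovative forms described above for the small set of verbs whose standard gerund uses a different stem.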

Table 5

Average predictability and predictiveness of Spanish distillations
Predictability of distillations		Predictiveness of distillations
Z4	0.19635134	Z3	0.4443304
Z1	0.19544314	Z10	0.27654857
Z5	0.18648071	Z7	0.10242521
Z3	0.17336842	Z9	0.10013906
Z2	0.09558762	Z14	0.09408421
Z12	0.08995327	Z8	0.09156166
Z6	0.04875002	Z13	0.07605122
Z8	0.03951685	Z6	0.07165787
Z7	0.0382461	Z11	0.06894532
Z14	0.03490731	Z4	0.04574052
Z9	0.03461818	Z12	0.04484319
Z11	0.03454562	Z1	0.04333178
Z10	0.03331269	Z2	0.03912075
Z13	0.03167366	Z5	0.03739493

Table 5 shows the predictability and predictiveness of all distillations. As in other Romance languages, it can be observed that differences in predictiveness are much more pronounced than differences in predictability. It awaits further exploration whether this is a property of Romance verbs, or a cross-linguistic trend motivated by the necessity to predict all word forms (but not so much for all word forms to be predictive). The average implicative entropy between Spanish cells is overall 0.073787 (0.12109 between distillations), which is somewhat lower than in the other Romance languages that have been analyzed in a comparable way: 0.28 for Latin (Pellegrini & Passarotti 2018), 0.18 for French, and 0.17 for Portuguese (Beniamine 2018).
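
One straightforward way to derive per-distillation summaries from a matrix like Table 4 is to average its rows and columns. The sketch below does this for a three-zone excerpt of Table 4. Treating the summaries as unweighted means is an assumption: the exact averaging behind Table 5 is not stated in the text, so the toy results are not expected to reproduce its figures. The function name is hypothetical.

```python
def summarize(labels, h):
    """Return (predictability, predictiveness) as mean entropies.

    `h[(row, col)]` is the entropy of col given row, as in Table 4.
    predictiveness[x]: mean entropy of the other zones given x (row mean).
    predictability[x]: mean entropy of x given the other zones (column mean).
    """
    predictability, predictiveness = {}, {}
    for x in labels:
        others = [y for y in labels if y != x]
        predictiveness[x] = sum(h[(x, y)] for y in others) / len(others)
        predictability[x] = sum(h[(y, x)] for y in others) / len(others)
    return predictability, predictiveness


# Excerpt of Table 4 restricted to Z1, Z3 and Z4.
excerpt = {
    ("Z1", "Z3"): 0.013, ("Z1", "Z4"): 0.001,
    ("Z3", "Z1"): 0.523, ("Z3", "Z4"): 0.516,
    ("Z4", "Z1"): 0.013, ("Z4", "Z3"): 0.029,
}

predictability, predictiveness = summarize(["Z1", "Z3", "Z4"], excerpt)
print(predictability)
print(predictiveness)
```

Even in this excerpt, the row mean for Z3 is far higher than the others, in line with the observation that the 1SG.PRS.IND is the least informative cell.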

It is tempting, but highly speculative at present, to link these differences to the sociolinguistic history of the different languages. There is abundant literature suggesting a link between high levels of contact or historical L2 adult language acquisition and morphological simplification (Kusters 2003, McWhorter 2007, Lupyan & Dale 2010, Trudgill 2011, Bentz & Winter 2013). Spanish is the most widely spoken Romance language and has historically expanded dramatically from its small home in North-Western Castile to close to 500 million people nowadays, spread across 22 countries. This might provide a partial explanation for the greater degree of simplification found in Spanish. Of course, future research should be conducted to check that these differences are not due to properties specific to the inflected lexicons used or to other factors.

^[3] For a more detailed explanation of how the alternations are identified, for example when multiple descriptions are possible (e.g. /éɾes/ /sé/: éɾe_ ⇌ _é, or _ɾes ⇌ s_), see Beniamine et al. 2021.

^[4] Not so in Italian (Pellegrini & Cignarella 2020), where other paradigm cells have innovated so-called 'superstable' markers (Wurzel 1978) like the 2SG.PRS.IND -i and the 1PL.PRS.IND -iamo.

4 Conclusion

This paper has presented a novel resource, VeLeSpa: a verbal inflected lexicon of Peninsular Spanish in phonological form. The resource, built for computational use and thoroughly checked, contains 6553 verbal paradigms with 63 cells each, for a total of 412839 inflected word forms. The full lexicon and the file with the distinctive features of Spanish phonemes are made freely available for further not-for-profit linguistic research at https://osf.io/gne73/?view_only=7fc0ddb495aa4cb8b3b50b2a4dbfdb85.

The second part of this paper, based on a high-frequency subset of the paradigms in VeLeSpa, contains a quantitative analysis of morphological predictability in Spanish verbs, using the toolkit Qumin (Beniamine 2018). The results allow us to compare Spanish to other Romance languages for which the same analyses have been conducted, such as Latin, French, Italian, and Portuguese. In terms of the number of mutual-predictability domains in the paradigm (14), Spanish shows a degree of complexity very similar to other languages in its family. The structuring principles of paradigmatic predictability in the language are found to be largely morphomic, reflecting the accidents of history (e.g. historical sound changes), rather than morphosyntactic or semantic. This agrees with received wisdom from the philological and Romance historical-comparative literature (e.g. Maiden 2018). At the same time, and in line with cross-linguistic findings, results also reveal a correlation between frequency and irregularity (see Fig. 3).

Alongside these similarities, some differences have also been found between verbal morphological predictability in Spanish and its sister languages. While the PRS.IND tense has been found to be simpler than in the rest of Romance, the PST.PFV.IND shows the opposite trend, i.e. more complexity, as measured by number of interpredictability domains, than the other languages in the family that have been explored so far.⁵ Across all cells, Spanish seems to be generally simpler (i.e. shows lower average conditional entropies) than other Romance languages. It remains to be seen whether this should be attributed to its rapid spread and a large number of incoming L2 speakers from the 10th century onwards.

These subtle differences notwithstanding, stability prevails regarding the paradigmatic and morphological predictability structures observed across Romance verbs. The notable stability of paradigmatic structure has long been argued for in much qualitative diachronic literature, in Romance and beyond (e.g. Meillet 1958, Nichols 1996, Maiden 2018). It is thus striking how similar the systems of contemporary Spanish, Portuguese, French, and Italian are after nearly 2 millennia of separate evolution. Future research would profit from pursuing a 'diachronic turn' (Blevins 2013) in the quantitative exploration of paradigmatic predictability. This is urgently needed to test whether these relations indeed bear a strong long-lasting phylogenetic signal, as most extant synchronic research suggests, and to find out whether or how they can inform language taxonomies and reconstruction alongside traditional methods based on systematic sound correspondences. This is left for now to future research.

^[5] The morphological simplicity of the PST.PFV.IND beyond Spanish could be partially explained by the obsolescence of this tense in many Romance languages (e.g. French and Italian), where it is often replaced, as in German, by a present perfect construction involving the participle and a conjugated auxiliary verb 'be' or 'have'.

Declarations

The author has no relevant financial or non-financial interests to disclose.

Acknowledgements

Funding

No special funding associated with this research

Authors contribution

References

Ackerman, Farrell & Robert Malouf. 2013. Morphological organization: The low conditional entropy conjecture. Language 89, 3: 429-464.
Aronoff, Mark. 1994. Morphology by itself: Stems and inflectional classes. Cambridge (MA): MIT press.
Beniamine, Sacha. 2018. Typologie quantitative des systèmes de classes flexionnelles: Université Paris Diderot dissertation.
Beniamine, Sacha, Olivier Bonami, & Ana R. Luís. 2021. The fine implicative structure of European Portuguese conjugation. Isogloss. Open Journal of Romance Linguistics 7: 1-35.
Bentz, Christian, & Bodo Winter. 2013. Languages with more second language learners tend to lose nominal case. Language Dynamics and Change 3, no. 1: 1-27.
Blevins, James P. 2013. The information-theoretic turn. Psihologija 46, no. 4: 355-375.
Blevins, James P. 2016. Word and paradigm morphology. Oxford University Press.
Blevins, James P., Petar Milin, & Michael Ramscar. 2017. The Zipfian paradigm cell filling problem. In Perspectives on morphological organization: 139-158. Leiden: Brill.
Bonami, Olivier, Gauthier Caron, and Clément Plancq. 2014. Construction d'un lexique flexionnel phonétisé libre du français. In SHS Web of Conferences, vol. 8, pp. 2583-2596. EDP Sciences.
Cruz, Hilaria, Gregory Stump, and Antonios Anastasopoulos. 2020. A resource for studying Chatino verbal morphology. arXiv preprint arXiv:2004.02083.
Herce, Borja. 2016. Why frequency and morphological irregularity are not independent variables in Spanish: A response to Fratini et al. (2014). Corpus Linguistics and Linguistic Theory 12, no. 2: 389-406.
Herce, Borja. 2019. Deconstructing (ir)regularity. Studies in Language 43, 1: 44-91.
Kirov, Christo, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui et al. 2018. UniMorph 2.0: universal morphology. arXiv preprint arXiv:1810.11101.
Kusters, Wouter. 2003. Linguistic complexity. Netherlands Graduate School of Linguistics.
Lieberman, Erez, Jean-Baptiste Michel, Joe Jackson, Tina Tang, and Martin A. Nowak. 2007. Quantifying the evolutionary dynamics of language. Nature 449, no. 7163: 713-716.
Lupyan, Gary, & Rick Dale. 2010. Language structure is partly determined by social structure. PloS one 5, no. 1: e8559.
Maiden, Martin. 1992. Irregularity as a determinant of morphological change. Journal of Linguistics 28, 2: 285-312.
Maiden, Martin. 2018. The Romance verb: Morphomic structure and diachrony. Oxford University Press.
Matthews, Peter H. 1965. The inflectional component of a word-and-paradigm grammar. Journal of Linguistics 1, no. 2: 139-171.
McWhorter, John. 2007. Language interrupted: Signs of non-native acquisition in standard language grammars. Oxford: Oxford University Press.
Meillet, Antoine. 1958. Linguistique historique et linguistique générale. Société Linguistique de Paris, Collection Linguistique, 8. Librairie Honoré Champion, Paris.
Montermini, Fabio & Bonami, Olivier. 2013. Stem spaces and predictability in verbal inflection. Lingue E Linguaggio, 12, 2: 171–190. https://doi.org/10.1418/75040.
Nichols, Johanna. 1996. The Comparative Method as heuristic. In Durie, Mark, and Malcolm Ross (Eds.), The comparative method reviewed: Regularity and irregularity in language change: 39-71.
Pato, Enrique, & Paul O'Neill. 2013. Los gerundios analógicos en la historia del español (e iberorromance). Nueva Revista de Filología Hispánica 61, no. 1: 1-27.
Pellegrini, Matteo, and Marco Passarotti. 2018. LatInfLexi: an inflected lexicon of Latin verbs. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). 10-12 December 2018, Torino: 324-329. Accademia University Press.
Pellegrini, Matteo, & Alessandra Teresa Cignarella. 2020. (Stem and Word) Predictability in Italian verb paradigms: An Entropy-Based Study Exploiting the New Resource LeFFI. In Proceedings of the 7th Italian Conference on Computational Linguistics (CLiC-it 2020): 1-6. CEUR.
Stump, Gregory, & Raphael A. Finkel. 2013. Morphological typology: From word to paradigm. Vol. 138. Cambridge: Cambridge University Press.
Sturtevant, Edgar H. 1947. An Introduction to Linguistic Science. New Haven: Yale University Press.
Trudgill, Peter. 2011. Sociolinguistic typology: Social determinants of linguistic complexity. Oxford: Oxford University Press.
Wu, Shijie, Ryan Cotterell, and Timothy J. O'Donnell. 2019. Morphological irregularity correlates with frequency. arXiv preprint arXiv:1906.11483.
Zipf, George Kingsley. 1932. Selected studies of the principle of relative frequency in language.

No competing interests reported.

Editorial decision: Revision requested
13 Jun, 2024
Reviews received at journal
13 Apr, 2024
Reviewers agreed at journal
17 Mar, 2024
Reviews received at journal
12 Jun, 2023
Reviewers agreed at journal
25 May, 2023
Reviewers agreed at journal
18 May, 2023
Reviewers agreed at journal
17 May, 2023
Reviewers invited by journal
17 May, 2023
Editor assigned by journal
17 May, 2023
Submission checks completed at journal
06 May, 2023
First submitted to journal
29 Apr, 2023
