With the goal of making the present analysis more comparable to the extant ones in related languages, and to shorten computation times, only the 3880 verbs with token frequency ≥ 100 were selected to perform the computations that are reported in this section. This number, of necessity arbitrary, is close to the average number of verbs analyzed in the literature (see Table 1). It corresponds to a frequency ≥ 0.263 per million words in Corpes XXI. This measure is also justified by the fact that most native speakers do not know or use most of the lower-frequency verbs, which hence are generally not part of the inflectional system learned by most speakers of Peninsular Spanish.
The free software toolkit Qumin (Quantitative Modelling of Inflection, Beniamine 2018) was used to perform the computations. The algorithm (implemented as Python scripts) works by automatically extracting the alternations between all the different word forms of all lemmas. Within the 63-cell paradigm of Spanish, there are 3906 (= 63*62) ordered pairs of word forms to examine for each verb. The automatic extraction of these alternations (comparable manual work would of course be impossible) allows us to find out which verbs behave the same and which behave differently, and how many classes exist, in every pair of cells in the paradigm.[3]
Table 2
Extracted morphological alternations between two cells in six verbs
| lemma | gloss | 'INDPRS2SG' | 'IMP2SG' | ('INDPRS2SG', 'IMP2SG') |
|---|---|---|---|---|
| tener | 'have' | tjénes | tén | j_es ⇌ _ |
| ir | 'go' | bás | bé | ás ⇌ é |
| hacer | 'do' | áθes | áθ | es ⇌ |
| decir | 'say' | díθes | dí | θes ⇌ |
| dar | 'give' | dás | dá | s ⇌ |
| ver | 'see' | bés | bé | s ⇌ |
As the forms in Table 2 show, the INDPRS2SG and the IMP2SG in Spanish can differ in multiple ways. The morphological contrasts between /tjénes/ and /tén/, /bás/ and /bé/, or /dás/ and /dá/ are all different. The difference between /dás/ and /dá/, and /bés/ and /bé/, by contrast, is identical, with the INDPRS2SG suffixing /s/ to the IMP2SG form. This means that, whereas tener, ir and dar belong to different classes, dar and ver belong, for the purposes of this cell pair, to the same class. The algorithm also provides the phonological context where a given alternation occurs if a coherent one can be identified. Note that, apart from the inflected lexicon itself, a separate file needs to be provided to Qumin specifying the language's phonemes and their description in terms of distinctive features. This file is also provided in the supplementary materials: https://osf.io/gne73/?view_only=7fc0ddb495aa4cb8b3b50b2a4dbfdb85
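The grouping of verbs by alternation can be illustrated with a minimal sketch. It is not Qumin's algorithm: it assumes each character is one segment and only compares the material after the longest shared prefix, so it cannot recover discontinuous patterns such as j_es ⇌ _ or any phonological context. The function names are hypothetical; the forms are those of Table 2.

```python
from collections import defaultdict


def common_prefix_length(a: str, b: str) -> int:
    """Return the number of leading segments shared by both forms."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def suffix_alternation(form_a: str, form_b: str) -> str:
    """Describe how two forms differ after their longest common prefix.

    Simplifying assumption: only final material is compared, so this is a
    rough stand-in for the richer patterns extracted by Qumin.
    """
    k = common_prefix_length(form_a, form_b)
    return f"{form_a[k:]} ⇌ {form_b[k:]}".strip()


# Forms from Table 2: lemma -> (INDPRS2SG, IMP2SG)
forms = {
    "tener": ("tjénes", "tén"),
    "ir": ("bás", "bé"),
    "hacer": ("áθes", "áθ"),
    "decir": ("díθes", "dí"),
    "dar": ("dás", "dá"),
    "ver": ("bés", "bé"),
}

# Verbs sharing an alternation fall into the same class for this cell pair.
classes = defaultdict(list)
for lemma, (indprs2sg, imp2sg) in forms.items():
    classes[suffix_alternation(indprs2sg, imp2sg)].append(lemma)

for pattern, lemmas in classes.items():
    print(f"{pattern!r}: {lemmas}")
```

Even under this simplification, dar and ver end up in the same class while the other four verbs each form a class of their own, which matches the grouping described for Table 2.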
Once these patterns are extracted (a computationally demanding process which can take several hours), the patterns themselves can be used for quality control of the resource (checking whether exceptional or infrequent alternations are bona fide irregularities or mistakes in the original lexicon or its phonemization) and to perform further calculations with other Qumin scripts (e.g. conditional entropies [on the basis of single or multiple predictor cells], identifying micro- and macro-classes, etc.).
An analysis of paradigmatic predictability is often presented first in terms of categorical interpredictability. Two inflected word forms may fail to be mutually predictable (when there are multiple ways, as in Table 2, in which one form could be altered to generate the other) or they may be mutually predictable. The latter is the case, for example, of the COND3SG and the INDFUT1PL in Spanish, where ía ⇌ émos describes the morphological relation between the two cells in every single Spanish verb. Out of 3906 pairs of predictor-predicted cells in the Spanish verb paradigm, 553 (14.16%) involve no uncertainty (i.e. conditional entropy = 0). The majority of these are found to be bidirectional, i.e. the conditional entropy of A given B is 0 and the conditional entropy of B given A is also 0. This, as is also well known from the previous literature on paradigmatic predictability, means that paradigm cells can be classified into areas of mutual interpredictability (see the notion of 'stem space' in Montermini & Bonami [2013] or distillations in Stump & Finkel [2013]).
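What "conditional entropy = 0" amounts to can be made concrete with a small sketch. It assumes each verb has already been reduced to a pair (context, pattern), where the context is some classification of the predictor form; the toy contexts and counts below are hypothetical and the function is not part of Qumin.

```python
from collections import Counter
from math import log2


def conditional_entropy(pairs):
    """Return H(pattern | context) in bits for an iterable of (context, pattern)."""
    pairs = list(pairs)
    total = len(pairs)
    joint = Counter(pairs)
    context_counts = Counter(context for context, _ in pairs)
    h = 0.0
    for (context, _pattern), n in joint.items():
        # Each term is -p(context, pattern) * log2 p(pattern | context).
        h -= (n / total) * log2(n / context_counts[context])
    return h


# Hypothetical toy lexicon: one context fully determines the pattern,
# the other is split evenly between two patterns.
uncertain = [
    ("-as", "s ⇌"),
    ("-as", "s ⇌"),
    ("-es", "s ⇌"),
    ("-es", "es ⇌"),
]

# A cell pair related by a single pattern in every verb, as with
# COND3SG and INDFUT1PL in the text.
categorical = [("-ía", "ía ⇌ émos")] * 3

print(conditional_entropy(uncertain))
print(conditional_entropy(categorical))
```

When every verb in a given context follows the same pattern, each logarithm is log2(1) = 0 and the pair of cells involves no uncertainty; any split within a context makes the entropy positive.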
In the dataset used here, i.e. the 3880 highest-frequency verbs in VeLeSpa, results show the 14 areas identified in Fig. 2. This is close to the number of interpredictability areas in Latin (15) and in other Romance languages (15 in Italian, 14 in French, 12 in Portuguese). Many of the patterns observed in Spanish are also observed in other Romance languages and are known to readers familiar with the literature on Romance morphomes (see e.g. Maiden 2018). The affinity of future and conditional tenses (Zone 12 in Fig. 2), for example, is well known and diachronically unsurprising, since the tenses share an origin in verbal periphrases involving the infinitive and a tensed form of the verb 'have'. Also well known is the morphological affinity of the future subjunctive and the past subjunctive (which in Spanish comes in two synonymous versions, e.g. amara (A) and amase (B) 'love'), which derives from the Latin perfectum tenses, all of which shared the same stem. All person-number forms within the imperfect indicative (e.g. amabas and amábamos) are also mutually predictable in Spanish, as in other Romance languages like Italian and Portuguese, and in Latin.
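The step from pairwise entropies to areas of interpredictability can be sketched as follows, assuming a table of conditional entropies keyed by (predictor, predicted) cell. The function name, the tolerance and the non-zero entropy value are hypothetical and chosen only for illustration; the sketch simply merges cells that predict each other with zero entropy in both directions.

```python
from itertools import permutations


def interpredictability_zones(cells, entropy, tol=1e-9):
    """Group cells linked by zero conditional entropy in both directions."""
    parent = {c: c for c in cells}

    def find(c):
        # Follow parent links to the representative of c's group.
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a in cells:
        for b in cells:
            # Visit each unordered pair once and merge if mutually predictable.
            if a < b and entropy[(a, b)] <= tol and entropy[(b, a)] <= tol:
                parent[find(a)] = find(b)

    zones = {}
    for c in cells:
        zones.setdefault(find(c), []).append(c)
    return sorted(sorted(zone) for zone in zones.values())


cells = ["COND3SG", "INDFUT1PL", "INDPRS2SG", "IMP2SG"]

# Hypothetical entropies: 0.1 everywhere except the one pair that the
# text describes as mutually predictable.
entropy = {pair: 0.1 for pair in permutations(cells, 2)}
entropy[("COND3SG", "INDFUT1PL")] = 0.0
entropy[("INDFUT1PL", "COND3SG")] = 0.0

zones = interpredictability_zones(cells, entropy)
print(zones)
```

A union-find structure is used so that chains of mutually predictable cells end up in one zone even when they are discovered pair by pair.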
More broadly, Spanish is also like other Romance languages in that its morphological domains generally show morphomic rather than natural-class distributions (see Aronoff 1994, Maiden 2018, Herce 2023). Focusing on multi-cell areas (the ones where naturalness or unnaturalness can be distinguished), all of them except the indicative past imperfect (Z7) can be taken to be unnatural: 2PL.IMP + 1PL.PRS.IND + 2PL.PRS.IND (Z2), 2SG.PRS.IND + 3SG.PRS.IND + 3PL.PRS.IND (Z4), SG.PRS.SBJV + 3PL.PRS.SBJV (Z5), etc. The patterns of interpredictability also vary strongly from one tense to the other (more on this below), which suggests that these domains are morphosyntactically largely arbitrary, and might be regarded as an inherited purely morphological property.
On a more general picture still, Spanish is seen to behave not only like other Romance languages but also in line with well-known cross-linguistic tendencies in that a strong correlation is observed between frequency and irregularity (Herce 2019, Wu et al. 2019). This holds both at the inflection class level (see Fig. 3a), with verbs from singleton or low type frequency classes more common on average than verbs from large classes, and at the cell level (see Fig. 3b), with cells from single-cell (Z1, Z3, Z8, Z10, Z13, Z14 in Fig. 2) or small predictability areas (Z4, Z6, Z9, Z2, Z5), more frequent on average than cells from larger morphological domains (Z7, Z12, Z11). It is a logical necessity that, given the Zipfian nature of linguistic input (Blevins et al. 2017), infrequent verbs and forms are less likely to be learned by rote and more likely to be produced online following more general rules of the language, thus losing irregular morphological traits (see Lieberman et al. 2007). Spanish verbal inflection is not, and arguably cannot be, an exception to this (Herce 2016).
In this sea of similarity, the main differences between the predictability patterns of Spanish and those of its closest siblings concern the present indicative and the past indicative. The former is, in every other Romance language and Latin, the tense which is split into most areas of interpredictability (6 in Italian, 5 in Portuguese and Latin, 4 in French). This is not so in Spanish, where only 3 domains are found: 1SG, 2SG/3SG/3PL, and 1PL/2PL. The tense is hence simpler, in this respect, than in the rest of Romance. It is striking, or at least unparalleled in the family, that such frequent cells as 2SG, 3SG, and 3PL present indicative are all mutually predictable. The past indicative shows the opposite tendency. Whereas other Romance languages split this tense into 2 domains of interpredictability at most, Spanish boasts 4 distinct domains. The lack of predictability is caused, among other things, by the alternations displayed in Table 3. As can be seen, there are multiple cross-classifying ways in which cells from those domains can differ from each other in different verbs.
Table 3
Some past indicative inflected forms and alternations in Spanish
| INDPST1SG | INDPST2SG | INDPST3SG | INDPST3PL | ('INDPSTPFV1SG', 'INDPSTPFV2SG') | ('INDPSTPFV1SG', 'INDPSTPFV3SG') | ('INDPSTPFV1SG', 'INDPSTPFV3PL') |
|---|---|---|---|---|---|---|
| fwí | fwíste | fwé | fwéɾon | ⇌ ste | í ⇌ é | í ⇌ éɾon |
| estúbe | estubíste | estúbo | estubjéɾon | ú_ ⇌ u_íst | e ⇌ o | ú_e ⇌ u_jéɾon |
| túbe | tubíste | túbo | tubjéɾon | ú_ ⇌ u_íst | e ⇌ o | ú_e ⇌ u_jéɾon |
| púde | pudíste | púdo | pudjéɾon | ú_ ⇌ u_íst | e ⇌ o | ú_e ⇌ u_jéɾon |
| úbe | ubíste | úbo | ubjéɾon | ú_ ⇌ u_íst | e ⇌ o | ú_e ⇌ u_jéɾon |
| díxe | dixíste | díxo | dixéɾon | í_ ⇌ i_íst | e ⇌ o | í_e ⇌ i_éɾon |
| dí | díste | djó | djéɾon | ⇌ ste | í ⇌ jó | í ⇌ jéɾon |
| bí | bíste | bjó | bjéɾon | ⇌ ste | í ⇌ jó | í ⇌ jéɾon |
Apart from discussing the areas of perfect interpredictability, we can also present and comment on the overall uncertainty (i.e. conditional entropies) involved in predicting between word forms among these domains. Zone 3, i.e. the 1SG.PRS.IND is, as Table 4 shows, the least informative cell (see also Table 5). This is also clearly the case in Portuguese (see Beniamine et al. 2021).[4] Also shared with the other national standard Ibero-Romance language (and beyond in this case) is the fact that the rhizotonic cells of the present (the so-called N-morphome, i.e. domains Z1, Z3, Z4 and Z5) are the least predictable from other cells in the paradigm. In Spanish, this is largely the result of unpredictable stem-vowel diphthongizations (e.g. costar cuesta, cerrar cierra, but cortar corta, cenar cena) that distinguish the rhizotonic (i.e. root-stressed) forms from the rest.
Table 4
Conditional entropies (column given row) between the distillations in Figure 2.
|  | Z1 | Z2 | Z3 | Z4 | Z5 | Z6 | Z7 | Z8 | Z9 | Z10 | Z11 | Z12 | Z13 | Z14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Z1 |  | 0.066 | 0.013 | 0.001 | 0.013 | 0.041 | 0.045 | 0.045 | 0.045 | 0.043 | 0.029 | 0.080 | 0.028 | 0.046 |
| Z2 | 0.191 |  | 0.190 | 0.197 | 0.190 | 0.034 | 0.003 | 0.005 | 0.005 | 0.013 | 0.013 | 0.008 | 0.009 | 0.006 |
| Z3 | 0.523 | 0.429 |  | 0.516 | 0.632 | 0.585 | 0.429 | 0.426 | 0.424 | 0.418 | 0.403 | 0.421 | 0.383 | 0.401 |
| Z4 | 0.013 | 0.064 | 0.029 |  | 0.029 | 0.041 | 0.044 | 0.046 | 0.047 | 0.042 | 0.029 | 0.079 | 0.027 | 0.046 |
| Z5 | 0.006 | 0.060 | 0.000 | 0.003 |  | 0.039 | 0.044 | 0.044 | 0.044 | 0.041 | 0.028 | 0.059 | 0.028 | 0.044 |
| Z6 | 0.205 | 0.105 | 0.205 | 0.204 | 0.205 |  | 0.036 | 0.039 | 0.039 | 0.017 | 0.017 | 0.096 | 0.018 | 0.035 |
| Z7 | 0.237 | 0.158 | 0.226 | 0.246 | 0.226 | 0.075 |  | 0.010 | 0.010 | 0.058 | 0.064 | 0.142 | 0.061 | 0.006 |
| Z8 | 0.230 | 0.131 | 0.222 | 0.220 | 0.221 | 0.046 | 0.004 |  | 0.000 | 0.049 | 0.055 | 0.128 | 0.049 | 0.005 |
| Z9 | 0.222 | 0.132 | 0.221 | 0.234 | 0.222 | 0.047 | 0.005 | 0.014 |  | 0.073 | 0.082 | 0.129 | 0.050 | 0.006 |
| Z10 | 0.326 | 0.258 | 0.295 | 0.316 | 0.314 | 0.205 | 0.194 | 0.183 | 0.183 |  | 0.336 | 0.253 | 0.336 | 0.176 |
| Z11 | 0.187 | 0.116 | 0.186 | 0.188 | 0.187 | 0.031 | 0.051 | 0.044 | 0.038 | 0.019 |  | 0.108 | 0.001 | 0.040 |
| Z12 | 0.236 | 0.012 | 0.192 | 0.237 | 0.192 | 0.024 | 0.009 | 0.010 | 0.010 | 0.018 | 0.018 |  | 0.016 | 0.011 |
| Z13 | 0.184 | 0.131 | 0.181 | 0.180 | 0.182 | 0.060 | 0.052 | 0.040 | 0.042 | 0.005 | 0.005 | 0.127 |  | 0.042 |
| Z14 | 0.230 | 0.142 | 0.215 | 0.223 | 0.215 | 0.060 | 0.000 | 0.003 | 0.003 | 0.052 | 0.057 | 0.133 | 0.056 |  |
Alongside the areas of higher unpredictability (gray), Table 4 also shows those of low uncertainty (white). Some distillations that come close to perfect interpredictability are some of the unusual past indicative domains mentioned in Table 3 (namely Z8, Z9 and Z11), as well as Z11 and Z13.
The working hypothesis embedded in the PCFP is that perfect and high predictability relations must be picked up by language users as they learn their language and must be actively used when they produce unobserved forms online on the basis of other forms. Predictability relations, hence, must drive analogical morphological changes in diachrony as well. The morphological affinity of Z11 and Z13 in Spanish, for example, must be the reason why we sometimes witness the emergence of analogical gerunds like *dijendo, *tuviendo, *hiciendo, *pusiendo, etc. in some Peninsular Spanish varieties, where these innovative forms replace standard diciendo, teniendo, haciendo, poniendo, etc. (see Pato & O'Neill 2013). The new forms, which borrow the stem from the Z11 former-perfectum domain, represent an elimination of the very few unpredictable alternations between the two distillations. In the overwhelming majority of Spanish verbs (e.g. amaron amando, corrieron corriendo, vivieron viviendo, murieron muriendo, vinieron viniendo, pidieron pidiendo, etc.), a simple ɾo_ ⇌ _do captures the change between 3PL past indicative and the gerund. Only in a few irregular high-frequency verbs (e.g. dijeron diciendo, tuvieron teniendo, hicieron haciendo, pusieron poniendo, etc.) are there additional changes in the stem as well. Erasing these, as the mentioned substandard varieties do, would bring Z11 and Z13 into the same distillation, which would reduce the overall complexity of the Spanish verbal inflectional system. This (i.e. 'regularity') is generally taken to be the "purpose" of analogical morphological change (see Sturtevant 1947).
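The effect of the ɾo_ ⇌ _do pattern can be illustrated with a short sketch that applies its orthographic counterpart (-ron to -ndo) to the 3PL past forms cited above. The function name is hypothetical and the sketch works on spelling rather than on the phonemic forms used by Qumin.

```python
def gerund_from_3pl_past(form: str) -> str:
    """Apply the orthographic counterpart of ɾo_ ⇌ _do to a 3PL past form."""
    if not form.endswith("ron"):
        raise ValueError(f"expected a form ending in -ron, got {form!r}")
    # Drop 'ro' before the final 'n' and add 'do' after it: X-ro-n -> X-n-do.
    return form[:-3] + "n" + "do"


# Regular pairs cited in the text: the pattern yields the standard gerund.
regular = {
    "amaron": "amando",
    "corrieron": "corriendo",
    "vivieron": "viviendo",
    "murieron": "muriendo",
    "vinieron": "viniendo",
    "pidieron": "pidiendo",
}

# Irregular verbs: the pattern yields the analogical, non-standard gerunds.
irregular = ["dijeron", "tuvieron", "hicieron", "pusieron"]

for past, gerund in regular.items():
    print(past, "->", gerund_from_3pl_past(past), "(standard:", gerund + ")")

for past in irregular:
    print(past, "->", "*" + gerund_from_3pl_past(past))
```

Applied to the irregular verbs, the same rule produces exactly the innovative forms reported for some Peninsular varieties, which is what extending the majority pattern to the whole lexicon would predict.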
Table 5
Average predictability and predictiveness of Spanish distillations
| Distillation | Predictability | Distillation | Predictiveness |
|---|---|---|---|
| Z4 | 0.19635134 | Z3 | 0.4443304 |
| Z1 | 0.19544314 | Z10 | 0.27654857 |
| Z5 | 0.18648071 | Z7 | 0.10242521 |
| Z3 | 0.17336842 | Z9 | 0.10013906 |
| Z2 | 0.09558762 | Z14 | 0.09408421 |
| Z12 | 0.08995327 | Z8 | 0.09156166 |
| Z6 | 0.04875002 | Z13 | 0.07605122 |
| Z8 | 0.03951685 | Z6 | 0.07165787 |
| Z7 | 0.0382461 | Z11 | 0.06894532 |
| Z14 | 0.03490731 | Z4 | 0.04574052 |
| Z9 | 0.03461818 | Z12 | 0.04484319 |
| Z11 | 0.03454562 | Z1 | 0.04333178 |
| Z10 | 0.03331269 | Z2 | 0.03912075 |
| Z13 | 0.03167366 | Z5 | 0.03739493 |
Table 5 shows the predictability and predictiveness of all distillations. As in other Romance languages, it can be observed that differences in predictiveness are much more pronounced than differences in predictability. It awaits further exploration whether this is a property of Romance verbs, or a cross-linguistic trend motivated by the necessity to predict all word forms (but not so much for all word forms to be predictive). The average implicative entropy between Spanish cells is overall 0.073787 (0.12109 between distillations), which is somewhat lower than in the other Romance languages that have been analyzed in a comparable way: 0.28 for Latin (Pellegrini & Passarotti 2018), 0.18 for French, and 0.17 for Portuguese (Beniamine 2018).
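The two measures can be related to the entropy matrix with a minimal sketch. It assumes that predictiveness is the mean entropy of the other distillations given a distillation (a row mean in Table 4) and predictability the mean entropy of a distillation given each of the others (a column mean); higher values thus mean less predictive or less predictable. Only a three-zone excerpt of Table 4 is used, and unweighted means over Table 4 do not reproduce the Table 5 figures exactly, so the original values were presumably computed differently (for instance over individual cells).

```python
from statistics import mean

# Excerpt of Table 4, keyed by (predictor zone, predicted zone).
entropy = {
    ("Z1", "Z2"): 0.066, ("Z1", "Z3"): 0.013,
    ("Z2", "Z1"): 0.191, ("Z2", "Z3"): 0.190,
    ("Z3", "Z1"): 0.523, ("Z3", "Z2"): 0.429,
}
zones = ["Z1", "Z2", "Z3"]


def predictiveness(zone):
    """Mean entropy of the other zones given this one (row mean)."""
    return mean(h for (given, _target), h in entropy.items() if given == zone)


def predictability(zone):
    """Mean entropy of this zone given each other zone (column mean)."""
    return mean(h for (_given, target), h in entropy.items() if target == zone)


for zone in zones:
    print(zone, round(predictability(zone), 4), round(predictiveness(zone), 4))
```

Even in this excerpt, Z3 stands out as a poor predictor (a high row mean), in line with its position at the top of the predictiveness column of Table 5.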
It is tempting, but highly speculative at present, to link these differences to the sociolinguistic history of the different languages. There is abundant literature suggesting a link between high levels of contact or historical L2 adult language acquisition and morphological simplification (Kusters 2003, McWhorter 2007, Lupyan & Dale 2010, Trudgill 2011, Bentz & Winter 2013). Spanish is the most widely spoken Romance language and has historically expanded dramatically from its small home in North-Western Castile to close to 500 million people nowadays, spread across 22 countries. This might provide a partial explanation for the greater degree of simplification found in Spanish. Of course, future research should be conducted to check that these differences are not due to properties specific to the inflected lexicons used or to other factors.
[3] For a more detailed explanation of how the alternations are identified, for example when multiple descriptions are possible (e.g. /éɾes/ /sé/: éɾe_ ⇌ _é, or _ɾes ⇌ s_), see Beniamine et al. 2021.
[4] Not so in Italian (Pellegrini & Cignarella 2020), where other paradigm cells have innovated so-called 'superstable' markers (Wurzel 1978) like the 2SG.PRS.IND -i, and the 1PL.PRS.IND -iamo.