UD for Spanish
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. Description of exceptions follows.
- According to typographical rules, many punctuation marks are attached to a neighboring word. We usually tokenize them as separate tokens (words) with the exception of abbreviations such as etc. “etc.” which are kept as one token with the period.
- There are two main classes of multi-word tokens:
- Contractions of prepositions and definite articles. Example: al = a + el “to the”, del = de + el “of the”.
- Certain verb forms (infinitives, imperatives, present participles) are written together with object clitic pronouns, while with other verb forms the clitics are written as separate words. Examples: convertirse = convertir + se “to become” (lit. “to convert itself”), hacerlo = hacer + lo “to do it”.
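The two classes of multi-word tokens can be illustrated with a minimal sketch. The lookup table covers only the two contractions named above, and the clitic list and the function names `split_contraction` and `split_infinitive_clitic` are assumptions made for this illustration, not part of any UD tool. The clitic splitter is deliberately naive: it handles a single clitic after an infinitive and ignores imperatives, gerunds, clitic clusters and accent shifts.

```python
# Illustrative only: the two contractions mentioned in the text.
CONTRACTIONS = {"al": ["a", "el"], "del": ["de", "el"]}

# Hypothetical, non-exhaustive list of enclitic object pronouns.
ENCLITICS = ("se", "lo", "la", "los", "las", "le", "les", "me", "te", "nos", "os")


def split_contraction(token):
    """Return the syntactic words of a preposition + article contraction."""
    return CONTRACTIONS.get(token.lower(), [token])


def split_infinitive_clitic(token):
    """Split one trailing clitic off an infinitive ending in -ar/-er/-ir."""
    lower = token.lower()
    # Try longer clitics first so that "los" wins over "os".
    for clitic in sorted(ENCLITICS, key=len, reverse=True):
        stem = lower[: -len(clitic)]
        if lower.endswith(clitic) and stem.endswith(("ar", "er", "ir")):
            return [stem, clitic]
    return [token]


print(split_contraction("del"))                # ['de', 'el']
print(split_infinitive_clitic("convertirse"))  # ['convertir', 'se']
```

A real segmenter would need a lexicon or a tagger, since many ordinary words happen to end in the same letter sequences.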
Morphology
Tags
This is an overview only. For more detailed discussion and examples, see the list of Spanish POS tags.
- Spanish uses all 17 universal POS categories, including particles (PART).
- The only word to be tagged as particle is no “not”.
- TODO: rules for the PRON vs. DET distinction.
- Spanish auxiliary verbs (AUX) are:
- ser and estar “to be”, used as copulas
- ser “to be” for the passive (la sentencia fue publicada “the sentence was published”)
- estar “to be” for the progressive (mis hijos están estudiando inglés “my children are studying English”)
- haber “to be/have” for the perfect tenses (ha venido hoy “he has come today”)
- There are four main (de)verbal forms, distinguished by the UPOS tag and the value of the VerbForm feature:
- Infinitive (VerbForm=Inf), tagged VERB or AUX, e.g. estudiar “to study”.
- Finite verb (VerbForm=Fin), tagged VERB or AUX, e.g. estudio “I study”.
- Participle (VerbForm=Part), tagged VERB, AUX or ADJ, e.g. estudiado “studied”.
- Gerund (VerbForm=Ger; Spanish gerundio) or present participle, e.g. estudiando “studying”. The gerund can be used as a present participle together with the auxiliary estar: Adidas está ayudando a limpiar los océanos. “Adidas is helping to clean up the oceans.” It can also be used with a pseudo-copular verb such as seguir, in which case it is attached to the pseudo-copula as its xcomp: Este gobierno seguirá teniendo que trabajar con él. “This government will still have to work with him.” Finally, it can be used as a converb, in which case it is attached to the main verb as advcl: Lo obtiene viendo a sus amigos. “She obtains it seeing her friends.”
Nominal Features
- Nominal words (NOUN, PROPN and PRON) have an inherent Gender feature with one of two values: Masc or Fem.
- The following parts of speech inflect for Gender because they must agree with nouns: ADJ, DET. Only a subset of adjectives can inflect for gender, with the suffix -o indicating the masculine and -a the feminine. A large group of adjectives (e.g. grande “big” or feliz “happy”) have just one form regardless of the gender of the modified noun. These adjectives have the gender feature empty.
- The two values of the Number feature are Sing and Plur. The following parts of speech inflect for number: NOUN, PROPN, PRON, ADJ, DET, VERB, AUX (finite and participles).
- Case has 4 possible values: Nom, Dat, Acc, Com. It occurs only with personal pronouns (PRON). The “case” (i.e., role w.r.t. predicates or other phrases) of other nominals is expressed using prepositions, not morphologically.
- Definite has 2 values: Ind, Def. It is used to distinguish the indefinite and definite articles (DET).
Degree and Polarity
- Degree applies to adjectives (ADJ) and adverbs (ADV) and has one of three possible values: Pos, Cmp, Abs. The absolute superlative is marked morphologically on adjectives. Otherwise, the comparative and superlative of most adjectives are formed periphrastically, and Degree=Cmp is only used with a few irregular forms.
- Polarity is used to mark the negative particle no, i.e., only the Neg value is used.
Verbal Features
- Infinitives have only the VerbForm=Inf feature.
- Finite verbs always have one of four values of Mood: Ind, Imp, Sub and Cnd.
- Finite verbs can have one of four values of Tense: Past, Imp, Pres, Fut.
- Imperative and conditional forms do not have the Tense feature. (In Spanish grammar, the conditional is itself often classified as a tense. However, it is a mood in Universal Dependencies.)
- The Tense feature is also used with the past participles (venido “come”).
- The Aspect feature is currently not used in Spanish. It is not needed for the imperfect past tense because UD has the special value Tense=Imp. And it is not needed for the perfect tenses because they are constructed periphrastically.
- The Voice feature is not used in Spanish because the passive voice is expressed periphrastically.
- Gerunds have only the VerbForm=Ger feature. They do not inflect for gender or number; the suffix is always -ndo.
- Participles have VerbForm=Part, Tense=Past, Gender (Masc or Fem), and Number (Sing or Plur). Gender and number are also annotated in periphrastic perfect constructions, where the form is obligatorily masculine singular.
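The constraints on verbal features listed above lend themselves to a mechanical check. The following is a minimal sketch, assuming the input is the FEATS column of a CoNLL-U line; the function names `parse_feats` and `check_verbal_feats` and the wording of the messages are invented for this illustration and do not correspond to the official UD validator.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict ('_' means no features)."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_verbal_feats(feats):
    """Return a list of violations of the verbal-feature rules stated above."""
    f = parse_feats(feats)
    problems = []
    form = f.get("VerbForm")
    # Infinitives and gerunds carry VerbForm and nothing else.
    if form in ("Inf", "Ger") and set(f) != {"VerbForm"}:
        problems.append(f"{form} should carry only VerbForm")
    if form == "Fin":
        if f.get("Mood") not in ("Ind", "Imp", "Sub", "Cnd"):
            problems.append("finite verb needs Mood")
        # Imperative and conditional forms have no Tense.
        if f.get("Mood") in ("Imp", "Cnd") and "Tense" in f:
            problems.append("imperative/conditional must not have Tense")
    if form == "Part" and f.get("Tense") != "Past":
        problems.append("participle needs Tense=Past")
    if "Aspect" in f or "Voice" in f:
        problems.append("Aspect and Voice are not used")
    return problems


print(check_verbal_feats("Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin"))  # []
print(check_verbal_feats("Mood=Cnd|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"))
```

The second call reports the conditional form carrying a Tense value, which the rules above exclude.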
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns (PRON) and determiners (DET).
- NumType is used with numerals (NUM), adjectives (ADJ) and determiners (DET).
- NumForm is used with numerals (NUM) and adjectives (ADJ).
- The Poss feature marks possessive personal determiners (e.g. mi “my”), possessive personal pronouns (e.g. mío “mine”), and possessive interrogative or relative determiners (e.g. cuyo “whose”).
- The Reflex feature is always used together with PronType=Prs and it marks reflexive pronouns (me, te, se, nos, os). Note that their forms in the first and second person are ambiguous with irreflexive accusative forms, and the Reflex feature must be decided by context.
- Person is a lexical feature of personal pronouns (PRON) and has three values, 1, 2 and 3. With personal possessive determiners (DET), the feature actually encodes the person of the possessor. Person is not marked on other types of pronouns and on nouns, although they can almost always be interpreted as the 3rd person.
- The Polite feature distinguishes informal second-person pronouns (tú, vosotros, Polite=Infm) from the formal usted, ustedes (Polite=Form).
- There is one layered feature, Number[psor]. It appears with possessive determiners and encodes the lexical number of the possessor. The extra layer is needed to distinguish this lexical feature from the inflectional number that marks agreement with the modified (possessed) noun.
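To make the role of the extra layer concrete, here is a small sketch that reads a FEATS string and keeps the plain Number apart from Number[psor]. The function name `parse_layered_feats` and the (name, layer) key scheme are choices made for this illustration. The feature string for nuestro “our” is an assumed example, not taken from a treebank: a singular possessed noun with a plural possessor.

```python
import re

# A layered feature looks like Name[layer], e.g. Number[psor].
LAYERED = re.compile(r"^(?P<name>[A-Za-z]+)\[(?P<layer>[a-z]+)\]$")


def parse_layered_feats(feats):
    """Map (feature name, layer) to value; layer is None for plain features."""
    result = {}
    if feats == "_":
        return result
    for pair in feats.split("|"):
        key, value = pair.split("=", 1)
        match = LAYERED.match(key)
        if match:
            result[(match.group("name"), match.group("layer"))] = value
        else:
            result[(key, None)] = value
    return result


# Assumed annotation of "nuestro": agrees with a singular noun, possessor is plural.
nuestro = parse_layered_feats(
    "Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"
)
print(nuestro[("Number", None)])    # Sing  (agreement with the possessed noun)
print(nuestro[("Number", "psor")])  # Plur  (number of the possessor)
```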
Other Features
Syntax
This is an overview only. For more detailed discussion and examples, see the list of Spanish relations.
Core Arguments, Oblique Arguments and Adjuncts
- The dominant word order in Spanish is SVO, but other word orders, especially OVS and SOV, are also possible.
- Nominal subject (nsubj) is a bare noun phrase without preposition. If it is a personal pronoun, it must be in the nominative form (note however that Spanish is a pro-drop language, where pronominal subjects can be omitted). It typically occurs preverbally, but it can occur after the verb as well. The morphology of a finite verb (or auxiliary) cross-references the person and number of its subject.
- Direct nominal object (obj) is either a bare noun phrase (for inanimate objects)
or a prepositional phrase with the preposition a (for animate objects)
or a personal pronoun in the accusative form.
Note that the preposition a is otherwise used to mark a range of oblique dependents.
A nominal with that preposition counts as a core argument only if it is animate and it can be substituted by
an accusative third-person pronoun (lo, la, los, las). If it would be substituted by a dative pronoun (le, les)
in the context of the given verb, then it is not core, it is oblique.
- The accusative pronoun is a clitic and its position in the word order is fixed. With finite verbs in indicative or subjunctive, it occurs immediately before the verb and is written as a separate word. With imperatives, infinitives and gerunds, it occurs immediately after the verb (or after a dative clitic, if both are present), and is written together with the verb as one multiword token; we still treat it as a separate syntactic word.
- The accusative clitic may occur even together with the object noun; this construction is called clitic doubling. Both the noun and the clitic are attached directly to the verb. However, the clitic is labeled as the object only if the noun is absent. In case of clitic doubling, the noun is attached as obj and the clitic as expl (expletive).
- The term ‘indirect object’ is traditionally used in Spanish grammar for the argument that represents the
recipient or beneficiary of an action. However, these participants are not core arguments (they use oblique
marking, either a preposition or a dative pronoun), hence they cannot be called indirect objects in UD
and the relation iobj has no use in Spanish. To distinguish them from temporal and local adjuncts, we
use the relation obl:arg for the recipients.
- Under certain circumstances, the dative pronoun le may be used instead of the accusative pronoun lo to denote the direct object. This is called leísmo (Erichsen, Gerald. “Leísmo and the Use of ‘Le’ in Spanish.” ThoughtCo, Apr. 5, 2023). The UD annotation does not distinguish these cases from the standard usage of dative pronouns. They are still tagged as Case=Dat and their dependency is obl:arg, not obj, despite the fact that leísmo can occur also with primary transitive verbs (e.g. in Spanish PUD: la Revolución le derrocó en 1879 “the Revolution overthrew him in 1879”).
# text = Jorge mató al dragón.
# text_en = George killed the dragon.
1 Jorge Jorge PROPN _ Gender=Masc|Number=Sing 2 nsubj _ Gloss=George
2 mató matar VERB _ Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 0 root _ Gloss=killed
3-4 al _ _ _ _ _ _ _ _
3 a a ADP _ _ 5 case _ Gloss=to
4 el el DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 5 det _ Gloss=the
5 dragón dragón NOUN _ Gender=Masc|Number=Sing 2 obj _ Gloss=dragon|SpaceAfter=No
6 . . PUNCT _ _ 2 punct _ _
# text = Jorge lo mató.
# text_en = George killed it.
1 Jorge Jorge PROPN _ Gender=Masc|Number=Sing 3 nsubj _ Gloss=George
2 lo él PRON _ Case=Acc|Gender=Masc|Number=Sing|Person=3|PronType=Prs 3 obj _ Gloss=him
3 mató matar VERB _ Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 0 root _ Gloss=killed|SpaceAfter=No
4 . . PUNCT _ _ 3 punct _ _
# text = El límite sur lo forma la costa.
# text_en = The southern border is formed by the coast.
1 El el DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 2 det _ Gloss=the
2 límite límite NOUN _ Gender=Masc|Number=Sing 5 obj _ Gloss=border
3 sur sur NOUN _ Gender=Masc|Number=Sing 2 nmod _ Gloss=south
4 lo él PRON _ Case=Acc|Gender=Masc|Number=Sing|Person=3|PronType=Prs 5 expl _ Gloss=him
5 forma formar VERB _ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ Gloss=forms
6 la el DET _ Definite=Def|Gender=Fem|Number=Sing|PronType=Art 7 det _ Gloss=the
7 costa costa NOUN _ Gender=Fem|Number=Sing 5 nsubj _ Gloss=coast|SpaceAfter=No
8 . . PUNCT _ _ 5 punct _ _
# text = Mi padre no alquilará su tierra a los irlandeses.
# text_en = My father won't rent his land to the Irish.
1 Mi mi DET _ Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs 2 det _ Gloss=my
2 padre padre NOUN _ Gender=Masc|Number=Sing 4 nsubj _ Gloss=father
3 no no PART _ Polarity=Neg 4 advmod _ Gloss=not
4 alquilará alquilar VERB _ Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin 0 root _ Gloss=will.rent
5 su su DET _ Number=Sing|Person=3|Poss=Yes|PronType=Prs 6 det _ Gloss=his
6 tierra tierra NOUN _ Gender=Fem|Number=Sing 4 obj _ Gloss=land
7 a a ADP _ _ 9 case _ Gloss=to
8 los el DET _ Definite=Def|Gender=Masc|Number=Plur|PronType=Art 9 det _ Gloss=the
9 irlandeses irlandés NOUN _ Gender=Masc|Number=Plur 4 obl:arg _ Gloss=Irish|SpaceAfter=No
10 . . PUNCT _ _ 4 punct _ _
# text = Mi padre no les alquilará su tierra.
# text_en = My father won't rent his land to them.
1 Mi mi DET _ Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs 2 det _ Gloss=my
2 padre padre NOUN _ Gender=Masc|Number=Sing 5 nsubj _ Gloss=father
3 no no PART _ Polarity=Neg 5 advmod _ Gloss=not
4 les él PRON _ Case=Dat|Number=Plur|Person=3|PronType=Prs 5 obl:arg _ Gloss=them
5 alquilará alquilar VERB _ Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin 0 root _ Gloss=will.rent
6 su su DET _ Number=Sing|Person=3|Poss=Yes|PronType=Prs 7 det _ Gloss=his
7 tierra tierra NOUN _ Gender=Fem|Number=Sing 5 obj _ Gloss=land|SpaceAfter=No
8 . . PUNCT _ _ 5 punct _ _
# text = Pedro le dio un libro a María.
# text_en = Pedro gave a book to María.
1 Pedro Pedro PROPN _ Gender=Masc|Number=Sing 3 nsubj _ Gloss=Pedro
2 le él PRON _ Case=Dat|Number=Sing|Person=3|PronType=Prs 3 expl _ Gloss=her
3 dio dar VERB _ Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 0 root _ Gloss=gave
4 un un DET _ Definite=Ind|Gender=Masc|Number=Sing|PronType=Art 5 det _ Gloss=a
5 libro libro NOUN _ Gender=Masc|Number=Sing 3 obj _ Gloss=book
6 a a ADP _ _ 7 case _ Gloss=to
7 María María PROPN _ Gender=Fem|Number=Sing 3 obl:arg _ Gloss=María|SpaceAfter=No
8 . . PUNCT _ _ 3 punct _ _
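The clitic-doubling pattern in the examples above (a pronoun labeled expl next to a full nominal that carries the real relation) can be detected with a few lines of code. This is a sketch under stated assumptions: the input is tab-separated CoNLL-U with 10 columns, the sample is an abridged copy of the sentence El límite sur lo forma la costa with the MISC column emptied, and the function names `parse_sentence` and `doubled_clitics` are invented here.

```python
def parse_sentence(lines):
    """Parse CoNLL-U token lines (tab-separated, 10 columns) into dicts."""
    words = []
    for line in lines:
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip multiword-token ranges such as 3-4
            continue
        words.append({"id": int(cols[0]), "form": cols[1], "upos": cols[3],
                      "feats": cols[5], "head": int(cols[6]), "deprel": cols[7]})
    return words


def doubled_clitics(words):
    """Pair each expl clitic with the nominal it doubles under the same head."""
    pairs = []
    for w in words:
        if w["deprel"] == "expl" and w["upos"] == "PRON":
            # Accusative clitics double an obj, dative clitics an obl:arg.
            wanted = "obj" if "Case=Acc" in w["feats"] else "obl:arg"
            for other in words:
                if other["head"] == w["head"] and other["deprel"] == wanted:
                    pairs.append((w["form"], other["form"]))
    return pairs


SAMPLE = [
    "# text = El límite sur lo forma la costa.",
    "2\tlímite\tlímite\tNOUN\t_\tGender=Masc|Number=Sing\t5\tobj\t_\t_",
    "4\tlo\tél\tPRON\t_\tCase=Acc|Gender=Masc|Number=Sing|Person=3|PronType=Prs\t5\texpl\t_\t_",
    "5\tforma\tformar\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t0\troot\t_\t_",
]
print(doubled_clitics(parse_sentence(SAMPLE)))  # [('lo', 'límite')]
```

Matching on the Case feature keeps a dative clitic such as le from being paired with a direct object that shares its head.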
- Extra attention has to be paid to the reflexive pronoun se. It can function as:
- Core object (obj): él se vio en el espejo “he saw himself in the mirror.”
- Dative oblique argument (obl:arg): ella se dio un regalo “she gave herself a gift.”
- Reciprocal core object (obj): se besaron “they kissed each other.”
- Reflexive passive (expl:pass): se celebran los cien años del club “the hundred years of the club are celebrated” (lit. “celebrate themselves”); se dice que la escribió en París “it is said that he wrote it in Paris.”
- Inherently reflexive verb: it cannot exist without the reflexive clitic, and the clitic cannot be substituted by an irreflexive pronoun or a noun phrase. In many cases, an irreflexive counterpart of the verb actually exists but its meaning is different because it denotes a different action performed by the agent. In accord with the current UD guidelines, we label the relation between the verb and the clitic as expl:pv, not compound. Example: se trataba de un negocio nuevo “it was a matter of a new business deal.”
- Arrepentirse “regret” is an example of an inherently reflexive verb: there is no *arrepentir.
- Acordarse “remember” is an example where the reflexive morpheme carries derivation: the meaning has significantly shifted from the irreflexive acordar “agree on”.
- In passive clauses, a nominal subject is labeled nsubj:pass and a clausal subject csubj:pass.
- The auxiliary verb in periphrastic passive is labeled aux:pass.
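The pairing of aux:pass with passive subject labels can be checked automatically. The sketch below covers only the periphrastic passive; it assumes words are given as (id, form, head, deprel) tuples, and the function name `unmarked_passive_subjects` is hypothetical. The sample sentence is la sentencia fue publicada from the list of auxiliaries, with heads and relations filled in for illustration.

```python
def unmarked_passive_subjects(words):
    """Return forms of subjects that lack :pass although their head has aux:pass."""
    # Heads that govern a passive auxiliary are passive predicates.
    passive_heads = {head for _, _, head, rel in words if rel == "aux:pass"}
    return [form for _, form, head, rel in words
            if head in passive_heads and rel in ("nsubj", "csubj")]


good = [
    (1, "la", 2, "det"),
    (2, "sentencia", 4, "nsubj:pass"),
    (3, "fue", 4, "aux:pass"),
    (4, "publicada", 0, "root"),
]
# Same sentence with the subject mislabeled as plain nsubj.
bad = [w if w[1] != "sentencia" else (2, "sentencia", 4, "nsubj") for w in good]

print(unmarked_passive_subjects(good))  # []
print(unmarked_passive_subjects(bad))   # ['sentencia']
```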
Non-verbal Clauses
- The copula verbs ser and estar “to be” are used in equational, attributional, locative, possessive and benefactive nonverbal clauses.
- Existential clauses use a different verb, hay (be), and the entity whose existence is asserted is its object: hay algo para comer “there is something to eat.”
Relations Overview
- The following relation subtypes are used in Spanish:
- acl:relcl for relative clauses
- nsubj:pass for nominal subjects of passive verbs
- csubj:pass for clausal subjects of passive verbs
- aux:pass for passive auxiliaries
- The following relation types are not used in Spanish at all: clf, dislocated, iobj
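A treebank can be screened for the relations that are not used in Spanish with a short filter. This is a minimal sketch: the set simply restates the list above, and the function name `flag_unused` is an assumption of this example. Subtypes are compared by their base relation, so a hypothetical iobj:xyz would be flagged as well.

```python
# Relation types that the text above lists as unused in Spanish.
UNUSED_IN_SPANISH = {"clf", "dislocated", "iobj"}


def flag_unused(deprels):
    """Return the relations whose base type should not occur in Spanish data."""
    return [rel for rel in deprels if rel.split(":", 1)[0] in UNUSED_IN_SPANISH]


print(flag_unused(["nsubj", "iobj", "obl:arg", "dislocated"]))  # ['iobj', 'dislocated']
```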
Treebanks
There are three Spanish UD treebanks: