UD for Catalan 
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. Description of exceptions follows.
- According to typographical rules, many punctuation marks are attached to a neighboring word. We usually tokenize them as separate tokens (words) with the exception of abbreviations such as etc. “etc.” which are kept as one token with the period.
- There are two main classes of multi-word tokens:
- Contractions of prepositions and definite articles. Example: al = a + el “to the”, del = de + el “of the”.
- Certain verb forms (infinitives, imperatives, present participles) are written together with object clitic pronouns, while with other verb forms the clitics are written as separate words. Examples: convertir-se = convertir + se “to become” (lit. “to convert itself”), fer-ho “to do it”. Since the verb-clitic combination is written with a hyphen in Catalan, it could be split during the low-level tokenization. However, we treat it as a multi-word token to emphasize parallelism with Spanish, where it is written as one word.
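The two classes of multi-word tokens can be illustrated with a minimal Python sketch. It assumes CoNLL-U-style output, in which a multi-word token receives a range ID such as 2-3 and is followed by its syntactic words. The names CONTRACTIONS, CLITICS, split_token and segment are hypothetical, and the lookup tables cover only the examples given above, not the full inventory of Catalan contractions and clitics.

```python
# Hypothetical lookup tables; they cover only the examples from the text.
CONTRACTIONS = {"al": ["a", "el"], "del": ["de", "el"]}
CLITICS = {"se", "ho"}  # clitics seen in convertir-se and fer-ho


def split_token(token):
    """Return the syntactic words of one surface token."""
    lower = token.lower()
    if lower in CONTRACTIONS:
        # Note: the expansion is lowercase even if the surface form was not.
        return list(CONTRACTIONS[lower])
    # Verb + hyphenated clitic, e.g. convertir-se -> convertir + se.
    head, hyphen, clitic = token.rpartition("-")
    if hyphen and head and clitic.lower() in CLITICS:
        return [head, clitic]
    return [token]


def segment(tokens):
    """Return (ID, FORM) pairs, with a range row before each multi-word token."""
    rows = []
    next_id = 1
    for token in tokens:
        words = split_token(token)
        if len(words) > 1:
            # Range row for the surface token, e.g. ("2-3", "al").
            rows.append((f"{next_id}-{next_id + len(words) - 1}", token))
        for word in words:
            rows.append((str(next_id), word))
            next_id += 1
    return rows
```

For instance, segment(["convertir-se"]) yields a range row for the surface token followed by one row each for convertir and se.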
Morphology
Tags
- Catalan uses all 17 universal POS categories, including particles (PART).
- The only word to be tagged as particle is no “not”.
- TODO: rules for the PRON vs. DET distinction.
- Catalan auxiliary verbs (AUX) are:
- ser and estar “to be”, used as copulas
- ser “to be” for the passive (la guia va ser presentada “the guide was presented”)
- estar “to be” for the progressive (la globalització està causant els canvis “globalization is causing changes”)
- haver “to have” for the perfect tenses (¿Què ha passat? “What happened?”)
- There are four main (de)verbal forms, distinguished by the UPOS tag and the value of the VerbForm feature:
Nominal Features
- Nominal words (NOUN, PROPN and PRON) have an inherent Gender feature with one of two values: Masc or Fem.
- The following parts of speech inflect for Gender because they must agree with nouns: ADJ, DET. Only a subset of adjectives can inflect for gender. A large group of adjectives (e.g. firal “fair” or gran “big”) have just one form regardless of the gender of the modified noun. These adjectives have the gender feature empty.
- The two values of the Number feature are Sing and Plur. The following parts of speech inflect for number: NOUN, PROPN, PRON, ADJ, DET, VERB, AUX (finite and participles).
- Case has three possible values: Nom, Dat, Acc. It occurs only with personal pronouns (PRON). The “case” (i.e., role w.r.t. predicates or other phrases) of other nominals is expressed using prepositions, not morphologically.
- Definite has 2 values: Ind, Def. It is used to distinguish the indefinite and definite articles (DET).
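The value inventories above lend themselves to a simple consistency check. The following Python sketch is one possible way to encode them; it assumes features arrive as a CoNLL-U-style FEATS string (Name=Value pairs joined by |), and the names NOMINAL_VALUES, RESTRICTED_TO, parse_feats and check_nominal_feats are hypothetical. It encodes only the restrictions stated explicitly in this section (Case on PRON, Definite on DET) and is not a substitute for the official UD validator.

```python
# Allowed values as listed in the text.
NOMINAL_VALUES = {
    "Gender": {"Masc", "Fem"},
    "Number": {"Sing", "Plur"},
    "Case": {"Nom", "Dat", "Acc"},
    "Definite": {"Ind", "Def"},
}
# Features that the text ties to a single part of speech (an interpretation).
RESTRICTED_TO = {"Case": "PRON", "Definite": "DET"}


def parse_feats(feats):
    """Parse a FEATS string such as 'Gender=Fem|Number=Sing' into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_nominal_feats(upos, feats):
    """Return a list of problems found in the nominal features of one word."""
    problems = []
    for name, value in parse_feats(feats).items():
        if name in NOMINAL_VALUES and value not in NOMINAL_VALUES[name]:
            problems.append(f"{name}={value} is not a listed value")
        if name in RESTRICTED_TO and upos != RESTRICTED_TO[name]:
            problems.append(
                f"{name} is only expected on {RESTRICTED_TO[name]}, not {upos}"
            )
    return problems
```

An empty result means that no listed restriction was violated; it does not guarantee that the annotation is otherwise correct.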
Degree and Polarity
- Degree applies to adjectives (ADJ) and adverbs (ADV) and has one of three possible values: Pos, Cmp, Abs. The absolute superlative is marked morphologically on adjectives. Otherwise, the comparative and superlative of most adjectives are formed periphrastically, and Degree=Cmp is only used with a few irregular forms.
- Polarity is used to mark the negative particle no, i.e., only the Neg value is used.
Verbal Features
- Finite verbs always have one of four values of Mood: Ind, Imp, Sub and Cnd.
- Finite verbs can have one of four values of Tense: Past, Imp, Pres, Fut.
- Imperative and conditional forms do not have the Tense feature. (In Catalan grammar, the conditional is itself often classified as a tense. However, it is a mood in Universal Dependencies.)
- The Tense feature is also used with the past participles (vingut “come”).
- The Aspect feature is currently not used in Catalan. It is not needed for the imperfect past tense because UD has the special value Tense=Imp. And it is not needed for the perfect tenses because they are constructed periphrastically.
- The Voice feature is not used in Catalan because the passive voice is expressed periphrastically.
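The interplay of Mood and Tense described above can be written down as a small set of checks. The Python sketch below is only an illustration under stated assumptions: features are passed as a dict, finiteness is signalled by VerbForm=Fin (the standard UD value, not spelled out in this section), and the name check_verbal_feats is hypothetical. Note that Imp means imperative under Mood but imperfect under Tense.

```python
MOODS = {"Ind", "Imp", "Sub", "Cnd"}     # Imp = imperative
TENSES = {"Past", "Imp", "Pres", "Fut"}  # Imp = imperfect
UNUSED = {"Aspect", "Voice"}             # not used in Catalan, per the text


def check_verbal_feats(feats):
    """feats: dict of feature name -> value for one verb form.

    Returns a list of problems; an empty list means no rule was violated.
    """
    problems = []
    mood = feats.get("Mood")
    tense = feats.get("Tense")
    if mood is not None and mood not in MOODS:
        problems.append(f"unknown Mood={mood}")
    if tense is not None and tense not in TENSES:
        problems.append(f"unknown Tense={tense}")
    # Imperative and conditional forms do not carry Tense.
    if mood in {"Imp", "Cnd"} and tense is not None:
        problems.append(f"Mood={mood} should not carry Tense")
    # Assumption: VerbForm=Fin marks finite forms, which always have Mood.
    if feats.get("VerbForm") == "Fin" and mood is None:
        problems.append("finite verb without Mood")
    for name in sorted(UNUSED.intersection(feats)):
        problems.append(f"{name} is not used in Catalan")
    return problems
```

A past participle such as {"VerbForm": "Part", "Tense": "Past"} passes, since Tense without Mood is allowed for participles.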
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns (PRON) and determiners (DET).
- NumType is used with numerals (NUM), adjectives (ADJ) and determiners (DET).
- NumForm is used with numerals (NUM) and adjectives (ADJ).
- The Poss feature marks possessive personal determiners (e.g. meu “my”) and possessive personal pronouns (e.g. meva “mine”).
- The Reflex feature is always used together with PronType=Prs and it marks reflexive pronouns. Note that their forms in the first and second person are ambiguous with irreflexive accusative forms, and the Reflex feature must be decided by context.
- Person is a lexical feature of personal pronouns (PRON) and has three values, 1, 2 and 3. With personal possessive determiners (DET), the feature actually encodes the person of the possessor. Person is not marked on other types of pronouns and on nouns, although they can be almost always interpreted as the 3rd person.
- The Polite feature distinguishes informal second-person pronouns (tu, vosaltres, Polite=Infm) from the formal vostè, vostès (Polite=Form).
- There is one layered feature, Number[psor]. It appears with possessive determiners and encodes the lexical number of the possessor. The extra layer is needed to distinguish this lexical feature from the inflectional number that marks agreement with the modified (possessed) noun.
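Two of the pronoun rules above are simple enough to state as code. The Python sketch below is illustrative only: the names POLITE, polite_value and reflex_is_consistent are hypothetical, and the lookup table contains just the four pronoun forms named in this section.

```python
# Polite values for the pronoun forms named in the text.
POLITE = {"tu": "Infm", "vosaltres": "Infm", "vostè": "Form", "vostès": "Form"}


def polite_value(form):
    """Return the Polite value for a listed pronoun form, or None."""
    return POLITE.get(form.lower())


def reflex_is_consistent(feats):
    """Reflex may only appear together with PronType=Prs.

    feats is a dict of feature name -> value for one word.
    """
    return "Reflex" not in feats or feats.get("PronType") == "Prs"
```

Whether an ambiguous first- or second-person form is reflexive cannot be decided by such a lookup; as noted above, that depends on context.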
Other Features
Syntax
Core Arguments, Oblique Arguments and Adjuncts
- The dominant word order in Catalan is SVO, but other word orders, especially SOV and OVS, are also possible.
- Nominal subject (nsubj) is a bare noun phrase without preposition. If it is a personal pronoun, it must be in the nominative form (note however that Catalan is a pro-drop language, where pronominal subjects can be omitted). It typically occurs preverbally, but it can occur after the verb as well. The morphology of a finite verb (or auxiliary) cross-references the person and number of its subject.
- Direct nominal object (obj) is either a bare noun phrase or a personal pronoun in the accusative form.
- The accusative pronoun is a clitic and its position in the word order is fixed. With finite verbs in indicative or subjunctive, it occurs immediately before the verb and is written as a separate word. With imperatives, infinitives and gerunds, it occurs immediately after the verb (or after a dative clitic, if both are present), and is written together with the verb as one multiword token; we still treat it as a separate syntactic word.
- The accusative clitic may occur even together with the object noun; this construction is called clitic doubling. Both the noun and the clitic are attached directly to the verb. However, the clitic is labeled as the object only if the noun is absent. In case of clitic doubling, the noun is attached as obj and the clitic as expl (expletive).
- The term ‘indirect object’ is traditionally used in Catalan grammar for the argument that represents the recipient or beneficiary of an action. However, these participants are not core arguments (they use oblique marking, either a preposition or a dative pronoun), hence they cannot be called indirect objects in UD and the relation iobj has no use in Catalan. To distinguish them from temporal and local adjuncts, we use the relation obl:arg for the recipients.
- Extra attention has to be paid to the reflexive pronoun es. It can function as:
- Core object (obj): es va veure al mirall “he saw himself in the mirror.”
- Reciprocal core object (obj): es van besar “they kissed each other.”
- Reflexive passive (expl:pass): s’ha ofert una atenció psicològica a les persones afectades “psychological attention has been offered to the people affected” (lit. “offered itself”).
- Inherently reflexive verb: it cannot exist without the reflexive clitic, and the clitic cannot be substituted by an irreflexive pronoun or a noun phrase. In many cases, an irreflexive counterpart of the verb actually exists but its meaning is different because it denotes a different action performed by the agent. In accord with the current UD guidelines, we label the relation between the verb and the clitic as expl:pv, not compound. Example: es tracta d’una immigració “the matter is immigration;” s’havia de riure “he had to laugh.”
- In passive clauses, the subject is labeled nsubj:pass if it is nominal and csubj:pass if it is clausal.
- The auxiliary verb in periphrastic passive is labeled aux:pass.
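The labeling decisions in this section can be summarized as a few small functions. The Python sketch below is an illustration under stated assumptions: the function names and the string keys for the uses of es are hypothetical, and deciding which use applies in a given sentence still requires linguistic judgment that the code does not attempt.

```python
# Relation of the clitic es to its verb, by use (keys are hypothetical names).
ES_RELATIONS = {
    "reflexive": "obj",        # es va veure al mirall
    "reciprocal": "obj",       # es van besar
    "passive": "expl:pass",    # s'ha ofert una atenció psicològica
    "inherent": "expl:pv",     # es tracta d'una immigració
}


def accusative_clitic_relation(object_noun_present):
    """With clitic doubling the noun is obj and the clitic is expl."""
    return "expl" if object_noun_present else "obj"


def recipient_relation():
    """Recipients and beneficiaries are oblique arguments; iobj is not used."""
    return "obl:arg"


def subject_relation(is_clausal, is_passive):
    """Pick nsubj/csubj and add the :pass subtype in passive clauses."""
    base = "csubj" if is_clausal else "nsubj"
    return f"{base}:pass" if is_passive else base


def es_relation(use):
    """Look up the relation for one of the four uses of es listed above."""
    return ES_RELATIONS[use]
```

For example, subject_relation(is_clausal=False, is_passive=True) gives the label for the subject of la guia va ser presentada.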
Non-verbal Clauses
- The copula verbs ser and estar (be) are used in equational, attributional, locative, possessive and benefactive nonverbal clauses.
Subordination
- Instead of a nominal, a clause may act as the subject of another clause. Such clausal subjects are attached as csubj:
- Some verbs take clauses as complements and these clauses alternate with direct objects (nouns or pronouns). Such clausal complements are attached as ccomp:
- Clauses that modify other clauses but do not correspond to core arguments are called adverbial (advcl), but the class is broader than what non-UD theories may recognize as adverbial clauses. They are clausal counterparts of oblique nominals and adverbs.
- Clauses that modify nominals are called adnominal (acl). They are clausal counterparts of nmod and amod.
- A special type of adnominal clause is the relative clause. It contains the relative pronoun que (that is, que is not a subordinating conjunction here), which has the same referent as the nominal modified by the clause. The relation subtype acl:relcl is used for relative clauses.
- There is a productive construction in which an article governs a relative clause. Together they fill any slot in the superordinate clause that allows a nominal. Formally the slot is filled by the article, hence if the slot corresponds to an object of a verb, the article is attached as obj but the clause is not attached as ccomp. The clause formally modifies a nominal (the article), in the same way in which relative clauses are constructed, so the relation between the article and the subordinate clause is acl:relcl. If the governing article is definite (el, la, els, les), it corresponds to English “the one”; if it is indefinite (un, una), it corresponds to “one”. Note that the same analysis would also arise if we posited an elided noun and the promotion of the article to the head position; in the example below, the article in fact represents the nominal els futbolistes “the footballers”, which can be inferred from the preceding sentence in the corpus.
- If a verb expects another predicate (i.e., clause) as complement and the subject of the subordinate clause is obligatorily coreferential with an argument (subject or object) of the main verb, then the relation between the two verbs is xcomp. The subordinate verb is typically (but not necessarily) infinitive, sometimes accompanied with a preposition selected by the main verb. Such complements are considered core arguments but they do not necessarily alternate with a direct nominal object; in fact, for certain main verbs they occur together with an object, which is the argument that the subject of the xcomp clause is coreferential with.
- In some cases the traditional grammar may list a verb as auxiliary but it does not fit in the more narrow definition of auxiliaries in UD and is analyzed as the main verb of an xcomp construction.
- The xcomp relation is also used for certain cases of secondary predication (except for optional depictives, for which advcl is used). Secondary predication is often realized using a nominal or an adjective that makes additional claims about the subject (how it looked during the main action, what it became as a result of the action etc.)
- In some cases the traditional grammar may list a verb as a (pseudo-)copula but it cannot be a copula in UD (where only ser and estar have the copula status). Instead, the putative copula is analyzed as the main verb in an xcomp construction.
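The choice among the clausal relations in this section can be condensed into a decision function. The Python sketch below is a simplification under stated assumptions: the function name clause_relation, its parameters and the string labels for the function of the clause are all hypothetical, and the article-plus-relative-clause construction is covered only through the acl:relcl branch.

```python
def clause_relation(function, relative=False, controlled_subject=False,
                    optional_depictive=False):
    """Return the relation for a subordinate clause or secondary predicate.

    function: one of "subject", "complement", "secondary_predicate",
              "clause_modifier", "nominal_modifier" (hypothetical labels).
    relative: the clause is a relative clause.
    controlled_subject: the subject of the subordinate clause is obligatorily
              coreferential with an argument of the main verb.
    optional_depictive: the secondary predicate is an optional depictive.
    """
    if function == "subject":
        return "csubj"
    if function == "complement":
        return "xcomp" if controlled_subject else "ccomp"
    if function == "secondary_predicate":
        return "advcl" if optional_depictive else "xcomp"
    if function == "clause_modifier":
        return "advcl"
    if function == "nominal_modifier":
        return "acl:relcl" if relative else "acl"
    raise ValueError(f"unknown function: {function}")
```

Verbs that traditional grammar treats as auxiliaries or pseudo-copulas but that UD does not fall under the "complement" case with controlled_subject=True.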
Relations Overview
- The following relation subtypes are used in Catalan:
- acl:relcl for relative clauses
- nsubj:pass for nominal subjects of passive verbs
- csubj:pass for clausal subjects of passive verbs
- aux:pass for passive auxiliaries
- The following relation types are not used in Catalan at all: clf, dislocated, iobj
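A treebank can be scanned for the relation types that are not used in Catalan. The Python sketch below assumes input in the CoNLL-U format (ten tab-separated columns per word line, with DEPREL in the eighth column); the name find_unused_relations is hypothetical, and the check is far less thorough than the official UD validation tools.

```python
# Relation types not used in Catalan, per the list above.
UNUSED_RELATIONS = {"clf", "dislocated", "iobj"}


def find_unused_relations(conllu_text):
    """Return (line number, form, deprel) for each word using an unused relation."""
    hits = []
    for line_no, line in enumerate(conllu_text.splitlines(), start=1):
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # not a well-formed word line
        deprel = columns[7]
        # Compare the base type so that subtypes such as iobj:x are caught too.
        base = deprel.split(":", 1)[0]
        if base in UNUSED_RELATIONS:
            hits.append((line_no, columns[1], deprel))
    return hits
```

Multi-word token rows carry an underscore in the DEPREL column and are therefore never reported.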
Treebanks
There is one Catalan UD treebank: