
This page pertains to UD version 2.

Universal features

For core part-of-speech categories, see the universal POS tags. The features listed here distinguish additional lexical and grammatical properties of words, not covered by the POS tags.

Lexical features: PronType, NumType, Poss, Reflex, Foreign, Abbr
Inflectional features, Nominal*: Gender, Animacy, Number, Case, Definite, Degree
Inflectional features, Verbal*: VerbForm, Mood, Tense, Aspect, Voice, Evident, Polarity, Person, Polite
  Index: A abbreviation, abessive, ablative, absolute superlative, absolutive, accusative, active, additive, adessive, admirative, adverbial participle, affirmative, allative, animate, antipassive, aorist, article, aspect, associative, B benefactive, C cardinal, case, causative case, causative voice, collective noun, collective numeral, collective pronominal, comitative, common gender, comparative case, comparative degree, complex definiteness, conditional, conjunctive, considerative, construct state, converb, count plural, counting form, D dative, definite, definiteness, degree of comparison, delative, demonstrative, desiderative, destinative, direct case, direct voice, directional allative, distributive case, distributive numeral, dual, E elative, elevated referent, emphatic, equative case, equative degree, ergative, essive, evidentiality, exclamative, F factive, feminine, finite verb, first person, firsthand, foreign word, formal, fourth person, fraction, frequentative, future, G gender, genitive, gerund, gerundive, greater paucal, greater plural, H habitual, human, humbled speaker, I illative, imperative, imperfect tense, imperfective aspect, inanimate, indefinite, indefinite pronominal, indicative, inessive, infinitive, informal, injunctive, instructive, instrumental, interrogative, inverse number, inverse voice, iterative, J jussive, L lative, locative, M masculine, masdar, mass noun, middle voice, modality, mood, motivative, multiplicative numeral, N narrative, necessitative, negative polarity, negative pronominal, neuter, nominative, non-finite verb, non-firsthand, non-human, non-specific indefinite, number, numeral type, O oblique case, optative, ordinal, P participle, partitive, passive, past, past perfect, paucal, perfective aspect, perlative, person, personal, pluperfect, plural, plurale tantum, polarity, politeness, positive degree, positive polarity, possessive, potential, present, preterite, progressive, prolative, pronominal type, prospective, 
purposive case, purposive mood, Q quantifier, quantitative plural, quotative, R range numeral, reciprocal pronominal, reciprocal voice, reduced definiteness, reflexive, register, relative, S second person, set numeral, singular, singulare tantum, specific indefinite, subjunctive, sublative, superessive, superlative, supine, T temporal, tense, terminal allative, terminative, third person, total, transgressive, translative, trial, U uter, V verb form, verbal adjective, verbal adverb, verbal noun, vocative, voice, Z zero person
* The labels Nominal and Verbal are used as approximate categories only. There is no universal rule that a particular feature can only occur with verbs or nominals (although language-specific rules may define such constraints). Even the boundary between lexical and inflectional features is sometimes blurred: for example, gender is a lexical feature of nouns but an inflectional feature of adjectives or verbs.

Abbr: abbreviation

Values: Yes

Boolean feature. Is this an abbreviation? Note that the abbreviated word(s) typically belong to a part of speech other than u-pos/X.

Note: This feature is new in UD version 2. It was used as a language-specific addition in several treebanks in version 1.

Yes: it is an abbreviation

Examples: [en] etc., J., UK


AbsErgDatNumber: number agreement with absolutive/ergative/dative argument

Number[abs], Number[erg], Number[dat]

Finite verbs in many Indo-European languages agree in person and number with their subject. In Basque (a polypersonal language), certain verbs overtly mark agreement with up to three arguments: one in the absolutive case, one in ergative and one in dative. Thus in dakarkiogu “we bring it to him/her”, akar is the stem (ekarri = “bring”), d stands for “it” (absolutive argument is the direct object of transitive verbs), ki stands for the dative case, o stands for “he” and gu stands for “we” (ergative argument is the subject of transitive verbs).

One may want to use just Number instead of Number[abs]. However, there are two issues with that (at least in Basque). First, the absolutive argument is not always the subject. For transitive verbs, it is the object, so the parallelism with nominative-accusative languages would be weak anyway. Second, and more important, some Basque finite verbs have additional morphemes of nominal inflection. Thus their form reflects the person-number agreement with the absolutive argument (nor), and nominal inflection (case, number etc.) at the same time. Examples: dena (Number=Sing|Number[abs]=Sing), dituena (Number=Sing|Number[abs]=Plur|Number[erg]=Sing), dugunak (Number=Plur|Number[abs]=Sing|Number[erg]=Plur), direnak (Number=Plur|Number[abs]=Plur). So we reserve the Number feature for nominal inflection, and the Number[abs] feature for agreement.

Note that we also define Person[abs] and Polite[abs], although there is no direct conflict for these features. But it is better to have these features aligned with Person[erg], Polite[erg], Person[dat] and Polite[dat].
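The layered notation can be handled mechanically. The following is a minimal sketch, not part of any official UD tooling, of how a feature string such as those in the Basque examples might be split into (feature, layer) pairs. The function name parse_feats and the dictionary layout are assumptions made for illustration; the underscore for an empty feature set follows the CoNLL-U convention.

```python
import re

# Matches a feature name with an optional layer in square brackets,
# e.g. "Number" or "Number[abs]".
_NAME = re.compile(r"^([A-Za-z0-9]+)(?:\[([a-z0-9]+)\])?$")


def parse_feats(feats):
    """Parse 'A=x|B[abs]=y' into {(name, layer): value}.

    The layer is None for an unlayered feature. This helper is
    hypothetical and only illustrates the notation.
    """
    result = {}
    if feats in ("", "_"):  # "_" marks an empty feature set in CoNLL-U
        return result
    for pair in feats.split("|"):
        name, value = pair.split("=", 1)
        match = _NAME.match(name)
        if match is None:
            raise ValueError(f"malformed feature name: {name!r}")
        result[(match.group(1), match.group(2))] = value
    return result


dituena = parse_feats("Number=Sing|Number[abs]=Plur|Number[erg]=Sing")
print(dituena[("Number", None)])   # nominal inflection
print(dituena[("Number", "abs")])  # agreement with the absolutive argument
```

Keeping the layer as a separate key component means that Number and Number[abs] never collide, which is the reason the two are kept apart in the annotation.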

Sing: singular

Examples: [eu] dakarkiogu Number[abs]=Sing|Number[dat]=Sing

Plur: plural

Examples: [eu] dakarkiogu Number[erg]=Plur


AbsErgDatPerson: person agreement with absolutive/ergative/dative argument

Person[abs], Person[erg], Person[dat]

Finite verbs in many Indo-European languages agree in person and number with their subject. In Basque (a polypersonal language), certain verbs overtly mark agreement with up to three arguments: one in the absolutive case, one in ergative and one in dative. Thus in dakarkiogu “we bring it to him/her”, akar is the stem (ekarri = “bring”), d stands for “it” (absolutive argument is the direct object of transitive verbs), ki stands for the dative case, o stands for “he” and gu stands for “we” (ergative argument is the subject of transitive verbs).

One may want to use just Person instead of Person[abs]. However, there are two issues with that (at least in Basque). First, the absolutive argument is not always the subject. For transitive verbs, it is the object, so the parallelism with nominative-accusative languages would be weak anyway. Second, we cannot avoid Number[abs] (both Number and Number[abs] can occur on one word) and thus we keep Person[abs] to show that both features belong to the same layer of agreement.

1: first person

Examples: [eu] dakarkiogu Person[erg]=1

2: second person

Examples: [eu] dakarkiozu Person[erg]=2

3: third person

Examples: [eu] dakarkiogu Person[abs]=3|Person[dat]=3


AbsErgDatPolite: politeness agreement with absolutive/ergative/dative argument

Polite[abs], Polite[erg], Polite[dat]

Finite verbs in many Indo-European languages agree in person and number with their subject; for the second person this also affects the politeness register. In Basque (a polypersonal language), certain verbs overtly mark agreement with up to three arguments: one in the absolutive case, one in ergative and one in dative. Thus in dakarkiogu “we bring it to him/her”, akar is the stem (ekarri = “bring”), d stands for “it” (absolutive argument is the direct object of transitive verbs), ki stands for the dative case, o stands for “he” and gu stands for “we” (ergative argument is the subject of transitive verbs).

One may want to use just Polite instead of Polite[abs]. However, there are two issues with that (at least in Basque). First, the absolutive argument is not always the subject. For transitive verbs, it is the object, so the parallelism with nominative-accusative languages would be weak anyway. Second, we cannot avoid Number[abs] (both Number and Number[abs] can occur on one word) and thus we keep Polite[abs] to show that both features belong to the same layer of agreement.

Inf: informal

Examples: [eu] ezan, ezak Polite[erg]=Inf

Pol: polite, formal

Examples: [eu] ezazu Polite[erg]=Pol (politeness-neutral form is ezazue)


AdpType: adposition type

Prep: preposition

Examples: in, on, to, from

Post: postposition

Examples: German “entlang” in “der Strasse entlang” (along the street)

Circ: circumposition

Examples: German “von … an” in “von dieser Stelle an” (from this place on)

Voc: vocalized preposition

In Slavic languages, some prepositions are non-syllabic and their form has to be changed in some contexts to facilitate pronunciation.

Czech examples: ke, ku, se, ve, ze

(Non-vocalized equivalents are: k, k, s, v, z)

The same phenomenon exists in Slovak, Russian and probably elsewhere.
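The Czech pairs listed above amount to a small lookup table. A minimal sketch follows, with the helper names adp_type and base_form chosen here purely for illustration. It assumes the caller already knows that the token is a preposition: the Czech form se, for instance, is also a reflexive pronoun, and this sketch does not disambiguate.

```python
# Vocalized Czech prepositions and their non-vocalized equivalents,
# exactly as listed in the text above.
VOCALIZED_CS = {"ke": "k", "ku": "k", "se": "s", "ve": "v", "ze": "z"}


def adp_type(form):
    """Return 'Voc' for a listed vocalized preposition, else 'Prep'.

    Hypothetical helper: it assumes the token is already known to be
    a preposition and does no part-of-speech disambiguation.
    """
    return "Voc" if form.lower() in VOCALIZED_CS else "Prep"


def base_form(form):
    """Map a vocalized preposition to its non-vocalized equivalent."""
    lowered = form.lower()
    return VOCALIZED_CS.get(lowered, lowered)


for token in ("ve", "v", "Ku"):
    print(token, adp_type(token), base_form(token))
```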


AdvType: adverb type

Semantic subclasses of adverbs. They are annotated in some tagsets (e.g. Bulgarian, Czech, Hindi, Japanese) and would probably apply to many other languages if their tagsets cared to cover them. Note that the “PronType” feature also applies to some adverbs and is orthogonal to “AdvType”.

Man: adverb of manner (“how”)

Loc: adverb of location (“where, where to, where from”)

Tim: adverb of time (“when, since when, till when”)

Deg: adverb of quantity or degree (“how much”)

Note that there is a fuzzy borderline between adverbs of degree and indefinite numerals (as they are called in some grammars). This has not yet been solved in Interset.

Cau: adverb of cause (“why”)

Mod: adverb of modal nature

The Czech examples below are similar to modal verbs: they take infinitives as arguments and add the meaning of possibility, necessity or recommendedness. I suspect that the Bulgarian example (transliteration of French “à propos”) is used differently but its native tagset also calls it “modal”.

Examples: [bg] апропо, [cs] možno, nutno, radno, třeba

Sta: adverb of state

Note that while the English translations of the Czech examples below might hint that they are adjectives, morphologically and syntactically they are adverbs (and some of them ambiguous with nouns).

Examples: [cs] plno (full), zima (cold), chyba (wrong), škoda (pity), volno (available), nanic (no good)

Finally, Interset also has two values of “AdvType” that somewhat deviate from the rest. I do not know what exactly should be done with them but I had to distinguish them somehow.

Ex: existential “there” in English

What part of speech is this “there” in the universal parts of speech?

Adadj: ad-adjective in Finnish

Derived from adjectives, used only to modify other adjectives (http://archives.conlang.info/pei/juenchen/phaelbhaduen.html).


Animacy: animacy

Values: Anim Hum Inan Nhum

Similarly to Gender (and to the African noun classes), animacy is usually a lexical feature of nouns and an inflectional feature of other parts of speech (pronouns, adjectives, determiners, numerals, verbs) that mark agreement with nouns. It is independent of gender, therefore it is encoded separately in some tagsets (e.g. all the Multext-East tagsets). On the other hand, in Czech the (almost) only grammatical implications occur within the masculine gender, which is why the PDT tagset does not have animacy as a separate feature and instead defines four genders: masculine animate, masculine inanimate, feminine and neuter. We follow the two-feature approach used in Multext-East (many languages) because it is safer.
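To illustrate the two-feature approach, the sketch below maps the four PDT-style genders onto separate Gender and Animacy features. The dictionary keys are hypothetical labels for the four genders named above, not actual PDT tag characters, and the Gender values Masc, Fem and Neut are the usual UD values for that feature.

```python
# Hypothetical keys naming the four PDT-style genders described above;
# real PDT tags encode them differently.
PDT_GENDER_TO_UD = {
    "masculine animate": {"Gender": "Masc", "Animacy": "Anim"},
    "masculine inanimate": {"Gender": "Masc", "Animacy": "Inan"},
    "feminine": {"Gender": "Fem"},
    "neuter": {"Gender": "Neut"},
}


def to_feats(pdt_gender):
    """Render the two-feature encoding as a 'Name=Value|...' string.

    Features are sorted by name, so Animacy precedes Gender.
    """
    feats = PDT_GENDER_TO_UD[pdt_gender]
    return "|".join(f"{name}={value}" for name, value in sorted(feats.items()))


for gender in PDT_GENDER_TO_UD:
    print(gender, "->", to_feats(gender))
```

Animacy is simply left out where it has no grammatical consequence, so feminine and neuter carry only Gender.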

Polish is special in that it also distinguishes grammatically human vs. non-human animates. It can be demonstrated by inflection of words with adjectival inflection, for example, the word który “which” (boldface forms differ from the middle row):

gender sg-nom sg-gen sg-dat sg-acc sg-ins sg-loc pl-nom pl-gen pl-dat pl-acc pl-ins pl-loc
animate human który którego któremu którego którym którym którzy których którym których którymi których
animate non-human który którego któremu którego którym którym które których którym które którymi których
inanimate który którego któremu który którym którym które których którym które którymi których

More generally: Some languages distinguish animate vs. inanimate (e.g. Czech masculines), some languages distinguish human vs. non-human (e.g. Yuwan, a Ryukyuan language), and others distinguish three values, human vs. non-human animate vs. inanimate (e.g. Polish masculines).

Anim: animate

Human beings, animals, fictional characters, names of professions etc. are all animate. Even nouns that are normally inanimate can be inflected as animate if they are personified. For instance, consider a children’s story about cars where cars live and talk as people; then the cars may become and be inflected as animates.

Inan: inanimate

Nouns that are not animate are inanimate.

Hum: human

A subset of animates that only includes human beings (and personified characters) but not animals.

Nhum: non-human

In languages that only distinguish human from non-human, this value includes inanimates. In languages that distinguish human animates, non-human animates and inanimates, this value is used only for non-human animates, while Inan is used for inanimates.


Aspect: aspect

Values: Hab Imp Iter Perf Prog Prosp

Aspect is typically a feature of verbs. It may also occur with other parts of speech (nouns, adjectives, adverbs), depending on whether borderline word forms such as gerunds and participles are classified as verbs or as the other category.

Aspect is a feature that specifies duration of the action in time, whether the action has been completed etc. In some languages (e.g. English), some tenses are actually combinations of tense and aspect. In other languages (e.g. Czech), aspect and tense are separate, although not completely independent of each other.

In Czech and other Slavic languages, aspect is a lexical feature. Pairs of imperfective and perfective verbs exist and are often morphologically related but the space is highly irregular and the verbs are considered to belong to separate lemmas.

Since we proceed bottom-up, the current standard covers only a few aspect values found in corpora. See Wikipedia (http://en.wikipedia.org/wiki/Grammatical_aspect) for a long list of other possible aspects.

Imp: imperfect aspect

The action took / takes / will take some time span and there is no information whether and when it was / will be completed.

Examples

Perf: perfect aspect

The action has been / will have been completed. Since there is emphasis on one point on the time scale (the point of completion), this aspect does not work well with the present tense. For example, Czech morphology can create present forms of perfective verbs but these actually have a future meaning.

Examples

Prosp: prospective aspect

In general, prospective aspect can be described as relative future: the action is/was/will be expected to take place at a moment that follows the reference point; the reference point itself can be in past, present or future. In the English sentence When I got home yesterday, John called and said he would arrive soon, the last clause (he would arrive soon) is in prospective aspect. Nevertheless, English does not have overt affixal morphemes dedicated to the prospective aspect, and we do not need the label in English. But other languages do; the -ko suffix in Basque is an example.

Note that this value was called Pro in UD v1 and it has been renamed Prosp in UD v2.

Examples

Prog: progressive aspect

English progressive tenses (I am eating, I have been doing …) have this aspect. They are constructed analytically (auxiliary + present participle) but the -ing participle is so bound to progressive meaning that it seems a good idea to annotate it with this feature (we have to distinguish it from the past participle somehow; we may use both the “Tense” and the “Aspect” features).

In languages other than English, the progressive meaning may be expressed by morphemes bound to the main verb, which makes this value even more justified. An example is Turkish with its two distinct progressive morphemes, -yor and -mekte.

Examples

Hab: habitual aspect

English simple present has this aspect.

Iter: iterative / frequentative aspect

Denotes repeated action. Attested e.g. in Hungarian. Iteratives also exist in Czech with this name and meaning but they can be formed only from imperfective verbs and they are usually not classified as a separate aspect; they are just Aspect=Imp.

Note: This value is new in UD v2 but a similar value has been used in UD v1 as language-specific for Hungarian, though it was called frequentative there (Freq).
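The renames mentioned in the notes on Prosp and Iter could be applied to existing annotation roughly as follows. This is only a sketch under stated assumptions: the function name upgrade_aspect is made up for illustration, and treating the Hungarian-specific Freq as equivalent to Iter is a simplification, since the note above describes it only as a similar value.

```python
# Value renames stated in the notes above. Freq was a language-specific
# value for Hungarian in UD v1; mapping it to Iter is an assumption.
ASPECT_V1_TO_V2 = {"Pro": "Prosp", "Freq": "Iter"}


def upgrade_aspect(feats):
    """Rewrite Aspect values in a 'Name=Value|...' string to v2 names.

    Other features and unknown Aspect values are passed through unchanged.
    """
    if feats in ("", "_"):  # "_" marks an empty feature set in CoNLL-U
        return feats
    upgraded = []
    for pair in feats.split("|"):
        name, value = pair.split("=", 1)
        if name == "Aspect":
            value = ASPECT_V1_TO_V2.get(value, value)
        upgraded.append(f"{name}={value}")
    return "|".join(upgraded)


print(upgrade_aspect("Aspect=Pro|VerbForm=Part"))
```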

Examples


Case: case

Values: Core: Abs Acc Erg Nom
Non-core: Abe Ben Cau Cmp Cns Com Dat Dis Equ Gen Ins Par Tem Tra Voc
Local: Abl Add Ade All Del Ela Ess Ill Ine Lat Loc Per Sub Sup Ter
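The three groups of values listed above can be captured as plain sets, for instance to check that a Case value is known and to report its group. The sketch below copies the values from the list; the name case_group is an illustrative assumption, not an existing validator.

```python
# Case values grouped exactly as in the list above.
CASE_GROUPS = {
    "Core": {"Abs", "Acc", "Erg", "Nom"},
    "Non-core": {
        "Abe", "Ben", "Cau", "Cmp", "Cns", "Com", "Dat", "Dis",
        "Equ", "Gen", "Ins", "Par", "Tem", "Tra", "Voc",
    },
    "Local": {
        "Abl", "Add", "Ade", "All", "Del", "Ela", "Ess", "Ill",
        "Ine", "Lat", "Loc", "Per", "Sub", "Sup", "Ter",
    },
}


def case_group(value):
    """Return the group name for a Case value, or None if it is not listed."""
    for group, values in CASE_GROUPS.items():
        if value in values:
            return group
    return None


for value in ("Erg", "Gen", "Ine", "Xyz"):
    print(value, case_group(value))
```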

Case is usually an inflectional feature of nouns and, depending on language, other parts of speech (pronouns, adjectives, determiners, numerals, verbs) that mark agreement with nouns. In some tagsets it is also a valency feature of adpositions (saying that the adposition requires its argument to be in that case). Annotating preposition valency case in UD treebanks would be superfluous because the same case feature can be found at the nominal to which the preposition belongs.

Case helps specify the role of the noun phrase in the sentence, especially in free-word-order languages. For example, the nominative and accusative cases often distinguish subject and object of the verb, while in fixed-word-order languages these functions would be distinguished merely by the positions of the nouns in the sentence.

Here on the level of morphosyntactic features we are dealing with case expressed morphologically, i.e. by bound morphemes (affixes). Note that on a higher level case can be understood more broadly as the role, and it can be also expressed by adding an adposition to the noun. What is expressed by affixes in one language can be expressed using adpositions in another language. Cf. the u-dep/case dependency label.

Examples

The descriptions of the individual case values below include semantic hints about the prototypical meaning of the case. Bear in mind that quite often a case will be used for a meaning that is totally unrelated to the meaning mentioned here. Valency of verbs, adpositions and other words will determine that the noun phrase must be in a particular grammatical case to fill a particular valency slot (semantic role). It is much the same as trying to explain the meaning of prepositions: most people would agree that the central meaning of English in is location in space or time but there are phrases where the meaning is less locational: In God we trust. Say it in English.

Note that Indian corpora based on the so-called Paninian model use a related feature called vibhakti. It is a merger of the Case feature described here and of various postpositions. Values of the feature are language-dependent because they are copies of the relevant morphemes (either bound morphemes or postpositions). Vibhakti can be mapped on the Case values described here if we know 1. which source values are bound morphemes (postpositions are separate nodes for us) and 2. what is their meaning. For instance, the genitive case (Gen) in Bengali is marked using the suffix -ra (-র), i.e. vib=era. In Hindi, the suffix has been split off the noun and it is now written as a separate word – the postposition kā/kī/ke (का/की/के). Even if the postpositional phrase can be understood as a genitive noun phrase, the noun is not in genitive. Instead, the postposition requires that it takes one of three case forms that are marked directly on the noun: the oblique case (Acc).

Nom: nominative / direct

The base form of the noun, typically used as citation form (lemma). In many languages this is the word form used for subjects of clauses. If the language has only two cases, which are called “direct” and “oblique”, the direct case will be marked Nom.

Acc: accusative / oblique

Perhaps the second most widely spread morphological case. In many languages this is the word form used for direct objects of verbs. If the language has only two cases, which are called “direct” and “oblique”, the oblique case will be marked Acc.

Abs: absolutive

Some languages (e.g. Basque) do not use nominative-accusative to distinguish subjects and objects. Instead, they use the contrast of absolutive-ergative.

The absolutive case marks the subject of an intransitive verb and the direct object of a transitive verb.

Erg: ergative

Some languages (e.g. Basque) do not use nominative-accusative to distinguish subjects and objects. Instead, they use the contrast of absolutive-ergative.

The ergative case marks the subject of a transitive verb.

Dat: dative

In many languages this is the word form used for indirect objects of verbs.

Examples

Gen: genitive

The prototypical meaning of the genitive is that the noun phrase somehow belongs to its governor; it would often be translated by the English preposition of. English has the “Saxon genitive” formed by the suffix ’s; but we will normally not need the feature in English because the suffix gets separated from the noun during tokenization.

Note that despite considerable semantic overlap, the genitive case is not the same as the feature of possessivity (Poss). Possessivity is a lexical feature, i.e. it applies to lemma and its whole paradigm. Genitive is a feature of just a subset of word forms of the lemma. Semantics of possessivity is much more clearly defined while the genitive (as many other cases) may be required in situations that have nothing to do with possessing. For example, [cs] bez prezidentovy dcery “without the president’s daughter” is a prepositional phrase containing the preposition bez “without”, the possessive adjective prezidentovy “president’s” and the noun dcery “daughter”. The possessive adjective is derived from the noun prezident but it is really an adjective (with separate lemma and paradigm), not just a form of the noun. In addition, both the adjective and the noun are in their genitive forms (the nominative would be prezidentova dcera). There is nothing possessive about this particular occurrence of the genitive. It is there because the preposition bez always requires its argument to be in genitive.

Examples

Note that in Basque, Gen should be used for possessive genitive (as opposed to locative genitive): diktadorearen erregimena “dictator’s regime”; diktadore “dictator”.

Voc: vocative

The vocative case is a special form of noun used to address someone. Thus it predominantly appears with animate nouns (see the feature of Animacy). Nevertheless this is not a grammatical restriction and inanimate things can be addressed as well.

Examples

Loc: locative

The locative case often expresses location in space or time, which gave it its name. As elsewhere, non-locational meanings also exist and they are not rare. Uralic languages have a complex set of fine-grained locational and directional cases (see below) instead of the locative. Even in languages that have locative, some location roles may be expressed using other cases (e.g. because those cases are required by a preposition).

In Slavic languages this is the only case that is used exclusively in combination with prepositions (but such a restriction may not hold in other languages that have locative).

Examples

Ins: instrumental / instructive

The role from which the name of the instrumental case is derived is that the noun is used as instrument to do something (as in [cs] psát perem “to write using a pen”). Many other meanings are possible, e.g. in Czech the instrumental is required by the preposition s “with” and thus it includes the meaning expressed in other languages by the comitative case.

In Czech the instrumental is also used for the agent-object in passive constructions (cf. the English preposition by).

Examples

A semantically similar case called instructive is used rarely in Finnish to express “with (the aid of)”. It can be applied to infinitives that behave much like nouns in Finnish. We propose one label for both instrumental and instructive (instrumental is not defined in Finnish).

Examples

Par: partitive

In Finnish the partitive case expresses indefinite identity and unfinished actions without result.

Examples

Examples comparing partitive with accusative: ammuin karhun “I shot a bear.Acc” (and I know that it is dead); ammuin karhua “I shot at a bear.Par” (but I may have missed).

Using accusative instead of partitive may also substitute the missing future tense: luen kirjan “I will read the book.Acc”; luen kirjaa “I am reading the book.Par”.

Dis: distributive

The distributive case conveys that something happened to every member of a set, one at a time. It may also express frequency.

Examples

Ess: essive / prolative

The essive case expresses a temporary state; it often corresponds to English “as a …”. A similar case in Basque is called prolative and it should be tagged Ess too.

Examples

Tra: translative / factive

The translative case expresses a change of state (“it becomes X”, “it changes to X”). Also used for the phrase “in language X”. In the Szeged Treebank, this case is called factive.

Examples

Com: comitative / associative

The comitative (also called associative) case corresponds to English “together with …”

Examples

Abe: abessive

The abessive case corresponds to the English preposition without.

Examples

Ine: inessive

The inessive case expresses location inside of something.

Examples

Ill: illative

The illative case expresses direction into something.

Examples

Ela: elative

The elative case expresses direction out of something.

Examples

Add: additive

Distinguished by some scholars in Estonian, not recognized by traditional grammar, exists in the Multext-East Estonian tagset and in the Eesti keele puudepank. It has the meaning of illative, and some grammars will thus consider the additive just an alternative form of illative. Forms of this case exist only in singular and not for all nouns.

Examples

Ade: adessive

The adessive case expresses location at or on something. The corresponding directional cases are allative (towards something) and ablative (from something).

Examples

Note that adessive is used to express location on the surface of something in Finnish and Estonian, but does not carry this meaning in Hungarian.

All: allative

The allative case expresses direction to something (destination is adessive, i.e. at or on that something).

Examples

Abl: ablative

Prototypical meaning: direction from some point.

Examples

Sup: superessive

Used, chiefly in Hungarian, to indicate location on top of something or on the surface of something.

Examples

Sub: sublative

The sublative case is used in Finno-Ugric languages to express the destination of movement, originally to the surface of something (e.g. “to climb a tree”), and, by extension, in other figurative meanings as well (e.g. “to university”).

Examples

Del: delative

Used, chiefly in Hungarian, to express the movement from the surface of something (like “moved off the table”). Other meanings are possible as well, e.g. “about something”.

Examples

Lat: lative / directional allative

The lative case denotes movement towards/to/into/onto something. A similar case in Basque is called directional allative (Spanish adlativo direccional). However, lative is typically thought of as a union of allative, illative and sublative, while in Basque it is derived from allative, which also exists independently.

Examples

Per: perlative

The perlative case denotes movement along something. It is used in Warlpiri (Andrews 2007, p.162). Note that Unimorph mentions the English preposition “along” in connection with what they call prolative/translative; but we have different definitions of those two cases.

Examples

Tem: temporal

The temporal case is used to indicate time.

Examples

Ter: terminative / terminal allative

The terminative case specifies where something ends in space or time. A similar case in Basque is called terminal allative (Spanish adlativo terminal).

Examples

Cau: causative / motivative / purposive

A noun in this case is the cause of something. In Hungarian it also seems to be used frequently with currency (“to buy something for the money”) and it can also mean the goal of something.

Examples

Ben: benefactive / destinative

The benefactive case corresponds to the English preposition for.

Examples

Cns: considerative

The considerative case denotes something that is given in exchange for something else. It is used in Warlpiri (Andrews 2007, p.164).

Examples

Cmp: comparative

The comparative case means “than X”. It marks the standard of comparison and it differs from the comparative Degree, which marks the property being compared. It occurs in Dravidian and Northeast-Caucasian languages.

Equ: equative

The equative case means “X-like”, “similar to X”, “same as X”. It marks the standard of comparison and it differs from the equative Degree, which marks the property being compared. It occurs in Turkish.

Examples

References


Clusivity: clusivity

Values: Ex In

Clusivity is a feature of first-person plural personal pronouns.

In: inclusive

Includes the listener, i.e. we = I + you (+ optionally they).

Examples

Ex: exclusive

Excludes the listener, i.e. we = I + they.

Examples


ConjType: conjunction type

We already distinguished the two main types, coordinating and subordinating conjunctions, at the level of POS tags. However, there are other (sub?)types that are not yet accounted for.

Comp: comparing conjunction

Examples: [de] wie (as), als (than)

Oper: mathematical operator

Note that operators can be expressed either using symbols or using words.

Examples: [cs] krát (times), plus, minus


Definite: definiteness or state

Values: Com Cons Def Ind Spec

Definiteness is typically a feature of nouns, adjectives and articles. Its value distinguishes whether we are talking about something known and concrete, or something general or unknown. It can be marked on definite and indefinite articles, or directly on nouns, adjectives etc. In Arabic, definiteness is also called the “state”.

Ind: indefinite

In languages where Spec is distinguished the value Ind is interpreted as non-specific indefinite, i.e. “any (one) stick”.

Examples

Spec: specific indefinite

Specific indefinite, e.g. “a certain stick”. Occurs e.g. in Lakota. In languages where it is used the value Ind is interpreted as non-specific indefinite, i.e. “any (one) stick”.

Def: definite

Examples

Cons: construct state / reduced definiteness

Used in construct state in Arabic. If two nouns are in genitive relation, the first one (the “nomen regens”) has “reduced definiteness,” the second is the genitive and can be either definite or indefinite. Reduced form has neither the definite morpheme (article), nor the indefinite morpheme (nunation).

Note that in UD v1 this value was called Red. It has been renamed Cons in UD v2.

Examples

Com: complex

Used in improper annexation in Arabic. The genitive construction described above normally consists of two nouns (first reduced, second genitive). That is called proper annexation or iḍāfa. If the first member is an adjective or adjectivally used participle and the second member is a definite noun, the construction is called improper annexation or false iḍāfa. The result is a compound adjective that is usually used as an attributive adjunct and thus must agree in definiteness with the noun it modifies. Its first part (the adjective or participle) may get again the definite article. Although it may look the same as the form for the definite state, it is assigned a special value of complex state to reflect the different origin. See also Hajič et al. page 3.

Examples:

Degree: degree of comparison

Values: Abs Cmp Equ Pos Sup

Degree of comparison is typically an inflectional feature of some adjectives and adverbs.

Pos: positive, first degree

This is the base form that merely states a quality of something, without comparing it to qualities of others. Note that although this degree is traditionally called “positive”, negative properties can be compared, too.

Examples

Equ: equative

The quality of one object is compared to the same quality of another object, and the result is that they are identical or similar (“as X as”). Note that it marks the adjective and it is distinct from the equative Case, which marks the standard of comparison.

Examples

Cmp: comparative, second degree

The quality of one object is compared to the same quality of another object.

Examples

Sup: superlative, third degree

The quality of one object is compared to the same quality of all other objects within a set.

Examples

Abs: absolute superlative

Some languages can express morphologically that the studied quality of the given object is so strong that there is hardly any other object exceeding it. The quality is not actually compared to any particular set of objects.

Examples

Echo: is this an echo word or a reduplicative?

Is this a reduplicative or echo word? Such words occur in Hindi and other Indian languages. In Hyderabad Dependency Treebank they get their own part-of-speech tags RDP and ECH, respectively. We do not want to treat them as separate parts of speech because they could be assigned a POS independent of their RDP or ECH status (same as the word that they echo). Perhaps we should merge this also with the “hyph” feature to something called “compound”?

Rdp: reduplicative

The word is a copy of a previous word. In Hindi, this would add the meaning of distribution (“one rupee each”), separation (“sit separately”), variety, diversity or just emphasis.

Examples: [hi] “कभी - कभी” = “kabhī - kabhī” = “sometimes”, “कभी” = “kabhī” = “sometimes”; “एक एक” = “eka eka” = “one each”, “एक” = “eka” = “one”

Ech: echo

The word rhymes with a previous word but it is not identical to it and typically it does not have any meaning of its own. In Hindi it generalizes the meaning of the previous word and eventually translates as “or something”, “etc.” etc.

Examples: [hi] “चाय वाय” = “čāya vāya” = “tea or something” (as in “Have some tea or something.”)

For more details see Rupert Snell and Simon Weightman: Teach Yourself Hindi, Section 16.4 and 16.5, pages 210 – 211.

ErgDatGender: gender agreement with ergative/dative argument

Gender[erg], Gender[dat]

Finite verbs in many Indo-European languages agree in person and number with their subject. In Basque (a polypersonal language), certain verbs overtly mark agreement with up to three arguments: one in the absolutive case, one in ergative and one in dative. Thus in dakarkiogu “we bring it to him/her”, akar is the stem (ekarri = “bring”), d stands for “it” (absolutive argument is the direct object of transitive verbs), ki stands for the dative case, o stands for “he” and gu stands for “we” (ergative argument is the subject of transitive verbs).

In the informal register, there are also separate forms for masculine and feminine arguments, although gender is otherwise not distinguished in Basque.

Masc: masculine gender

Examples: [eu] ukan ezak “have it” Gender[erg]=Masc|Number[abs]=Sing|Number[erg]=Sing|Person[abs]=3|Person[erg]=2|Polite[erg]=Inf (imperative addressing a man)

Fem: feminine gender

Examples: [eu] ukan ezan “have it” Gender[erg]=Fem|Number[abs]=Sing|Number[erg]=Sing|Person[abs]=3|Person[erg]=2|Polite[erg]=Inf (imperative addressing a woman)
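The bracketed layer names in the examples above (`[erg]`, `[abs]`) can be separated from the feature name when reading such a string. The following is a rough sketch, not an official parser; the function name `parse_layered` and the regular expression are assumptions made for illustration.

```python
import re

# Matches "Name=Value" or "Name[layer]=Value".
_FEATURE = re.compile(
    r"^(?P<name>[A-Za-z0-9]+)(?:\[(?P<layer>[a-z0-9]+)\])?=(?P<value>.+)$"
)


def parse_layered(feats: str) -> dict:
    """Map (feature name, layer or None) to the feature value."""
    parsed = {}
    for item in feats.split("|"):
        match = _FEATURE.match(item)
        if match is None:
            raise ValueError(f"not a feature: {item!r}")
        parsed[(match["name"], match["layer"])] = match["value"]
    return parsed


# Feature string copied from the "ukan ezak" example above.
ezak = parse_layered(
    "Gender[erg]=Masc|Number[abs]=Sing|Number[erg]=Sing|"
    "Person[abs]=3|Person[erg]=2|Polite[erg]=Inf"
)
print(ezak[("Gender", "erg")])
print(ezak[("Person", "abs")])
```

Keying on the pair (name, layer) keeps `Number[abs]` and `Number[erg]` apart, which a plain name-to-value dictionary would not.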

Evident: evidentiality

Values: Fh Nfh

Evidentiality is the morphological marking of a speaker’s source of information (Aikhenvald, 2004). It is sometimes viewed as a category of mood and modality.

Many different values are attested in the world’s languages. At present we only cover the firsthand vs. non-firsthand distinction, needed in Turkish. It distinguishes there the normal past tense (firsthand, also definite past tense, seen past tense) from the so-called miş-past (non-firsthand, renarrative, indefinite, heard past tense).

Aikhenvald also distinguishes reported evidentiality, occurring in Estonian and Latvian, among others. We currently use the quotative Mood for this.

Note: Evident is a new universal feature in UD version 2. It was used as a language-specific feature (under the name Evidentiality) in UD v1 for Turkish.

Fh: firsthand

Examples

Nfh: non-firsthand

Examples

References

Foreign: is this a foreign word?

Values: Yes

Boolean feature. Is this a foreign word? Not a loan word and not a foreign name but a genuinely foreign word appearing inside native text, e.g. inside direct speech, titles of books etc. This feature would apply either to the u-pos/X part of speech (unanalyzable token), or to other parts of speech if we know and are willing to annotate the class to which the word belongs in its original language.

Note: This feature is new in UD version 2. It was used as a language-specific addition in several treebanks in version 1 but it was not considered boolean and three values were foreseen. Since the additional values were used extremely rarely, they are not part of the universal definition of this feature in UD v2.

Yes: it is foreign

Example: [en] He said I could “dra åt helvete!”

Gender: gender

Values: Com Fem Masc Neut

Gender is usually a lexical feature of nouns and an inflectional feature of other parts of speech (pronouns, adjectives, determiners, numerals, verbs) that mark agreement with nouns. In English, gender affects only the choice of the personal pronoun (he / she / it) and the feature is usually not encoded in English tagsets.

See also the related feature of Animacy.

African languages have an analogous feature of noun classes: there might be separate grammatical categories for flat objects, long thin objects etc. African noun classes are not covered in the current guidelines because none of the languages covered by UD so far has such classes. They might be added in future.

Masc: masculine gender

Nouns denoting male persons are masculine. Other nouns may be also grammatically masculine, without any relation to sex.

Examples

Fem: feminine gender

Nouns denoting female persons are feminine. Other nouns may be also grammatically feminine, without any relation to sex.

Examples

Neut: neuter gender

Some languages have only the masculine/feminine distinction while others also have this third gender for nouns that are neither masculine nor feminine (grammatically).

Examples

Com: common gender

Some languages do not distinguish masculine/feminine most of the time but they do distinguish neuter vs. non-neuter (Swedish neutrum / utrum). The non-neuter is called common gender.

Note that it could also be expressed as a combined value Gender=Fem,Masc. Nevertheless we keep Com also as a separate value. Combined feature values should only be used in exceptional, undecided cases, not for something that occurs systematically in the grammar. Language-specific extensions to these guidelines should determine whether the Com value is appropriate for a particular language.

Note further that the Com value is not intended for cases where we just cannot derive the gender from the word itself (without seeing the context), while the language actually distinguishes Masc and Fem. For example, in Spanish, nouns distinguish two genders, masculine and feminine, and every noun can be classified as either Masc or Fem. Adjectives are supposed to agree with nouns in gender (and number), which they typically achieve by alternating -o / -a. But then there are adjectives such as grande or feliz that have only one form for both genders. So we cannot tell whether they are masculine or feminine unless we see the context. Yet they are either masculine or feminine (feminine in una ciudad grande, masculine in un puerto grande). Therefore in Spanish we should not tag grande with Gender=Com. Instead, we should either drop the gender feature entirely (suggesting that this word does not inflect for gender) or tag individual instances of grande as either masculine or feminine, depending on context.
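The two points above can be illustrated with a small sketch: a combined value is a comma-separated list, and an invariant Spanish adjective such as grande can be tagged in context by copying the gender of the noun it modifies. Both function names are hypothetical, and how the noun's gender is obtained is outside the sketch.

```python
def gender_values(value: str) -> frozenset:
    """Split a possibly combined value such as "Fem,Masc" into its parts."""
    return frozenset(value.split(","))


def tag_invariant_adjective(noun_gender: str) -> str:
    """Tag an adjective like "grande" with the gender of the modified noun."""
    if noun_gender not in ("Masc", "Fem"):
        raise ValueError("a Spanish noun is expected to be Masc or Fem")
    # Gender=Com is deliberately never produced here.
    return f"Gender={noun_gender}"


print(sorted(gender_values("Fem,Masc")))
print(tag_invariant_adjective("Fem"))   # as in "una ciudad grande"
print(tag_invariant_adjective("Masc"))  # as in "un puerto grande"
```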

Hyph: hyphenated compound or part of it

Boolean feature. Is this part of a hyphenated compound? Depending on tokenization, the compound may be one token or be split to several tokens; then the tokens need tags.

Yes: it is part of hyphenated compound

Examples: “anglo-” in “anglo-saxon”; [cs] “česko-slovenský” (Czecho-Slovak; the second part is a normal adjective, including adjectival morphological paradigm, but the first part is special.)

Mood: mood

Values: Adm Cnd Des Imp Ind Jus Nec Opt Pot Prp Qot Sub

Mood is a feature that expresses modality and subclassifies finite verb forms.

Ind: indicative

The indicative can be considered the default mood. A verb in indicative merely states that something happens, has happened or will happen, without adding any attitude of the speaker.

Examples

Imp: imperative

The speaker uses imperative to order or ask the addressee to do the action of the verb.

Examples

Cnd: conditional

The conditional mood is used to express actions that would have taken place under some circumstances but they actually did not / do not happen. Grammars of some languages may classify conditional as tense (rather than mood) but e.g. in Czech it combines with two different tenses (past and present).

Examples

Pot: potential

The action of the verb is possible but not certain. This mood corresponds to the modal verbs can, might, be able to. Used e.g. in Finnish. See also the optative.

Examples

Sub: subjunctive / conjunctive

The subjunctive mood is used under certain circumstances in subordinate clauses, typically for actions that are subjective or otherwise uncertain. In German, it may be also used to convey the conditional meaning.

Examples

Jus: jussive / injunctive

The jussive mood expresses the desire that the action happens; it is thus close to both imperative and optative. Unlike in desiderative, it is the speaker, not the subject who wishes that it happens. Used e.g. in Arabic. We also map the Sanskrit injunctive to Mood=Jus.

Examples

Prp: purposive

Means “in order to”, occurs in Amazonian languages.

Qot: quotative

The quotative mood is used e.g. in Estonian to denote reported speech.

Opt: optative

Expresses exclamations like “May you have a long life!” or “If only I were rich!” In Turkish it also expresses suggestions. In Sanskrit it may express possibility (cf. the potential mood in other languages).

Examples

Des: desiderative

The desiderative mood corresponds to the modal verb “want to”: “He wants to come.” Used e.g. in Turkish.

Nec: necessitative

The necessitative mood expresses necessity and corresponds to the modal verbs “must, should, have to”: “He must come.”

Examples

Adm: admirative

Expresses surprise, irony or doubt. Occurs in Albanian, other Balkan languages, and in Caddo (Native American from Oklahoma).

NameType: type of named entity

Classification of named entities (token-based, no nesting of entities etc.) The feature applies mainly to the cs-pos/PROPN tag; in multi-word foreign names, adjectives may also have this feature (they preserve the ADJ tag but at the same time they would not exist in Czech otherwise than in the named entity).

Geo: geographical name

Names of cities, countries, rivers, mountains etc.

Examples

Prs: name of person

This value is used if it is not known whether it is a given or a family name, but it is known that it is a personal name.

Giv: given name of person

Given name (not family name). This is usually the first name in European and American names. In Chinese names, the last two syllables (of three) are usually the given name.

Sur: surname / family name of person

Family name (surname). This is usually the last name in European and American names. In Chinese names, the first syllable (of three) is usually the surname.

Nat: nationality

Name denoting a member of a particular nation, or inhabitant of a particular territory.

Examples

Com: company, organization

Pro: product

Oth: other

Names of stadiums, guerilla bases, events etc.

NounType: noun type

We already split common and proper nouns at the level of POS tags but some tagsets mark other distinctions.

Class: classifier

Chinese classifiers between cardinal numbers and nouns. Note that this is the only value of NounType. Interset also has the values “com” and “prop” but in Universal Treebanks, we decided to distinguish these two already at the level of POS tags.

NumForm: numeral form

Feature of cardinal and ordinal numbers. Is the number expressed by digits or as a word? This feature appears in 10+ tagsets that I studied. Note that it is a bit Euro-centric because it distinguishes (in some tagsets) (Euro)Arabic digits and Roman numerals, but what about digits in various other scripts? In texts in many Indian scripts and in the Arabic script both native digits and Euro-Arabic digits can appear (e.g. 2014 vs. २०१४ in Devanagari).

Word: number expressed as word

Examples: one, two, three

Digit: number expressed using digits

Examples: 1, 2, 3

Roman: Roman numeral

Examples: I, II, III
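A naive way to assign the three values is sketched below. It assumes the token is already known to be a numeral (otherwise the English pronoun “I” would be misread as a Roman numeral); the function name `num_form` and the Roman-numeral pattern are illustrative choices. Python's `str.isdigit` also accepts native digits in other scripts, such as the Devanagari example above.

```python
import re

# Strict upper-case Roman numerals for the values 1 to 3999.
_ROMAN = re.compile(
    r"^(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)


def num_form(token: str) -> str:
    """Return Digit, Roman or Word for a token known to be a numeral."""
    if token.isdigit():  # true for "2014" and for "२०१४"
        return "Digit"
    if _ROMAN.match(token):
        return "Roman"
    return "Word"


for token in ["three", "2014", "२०१४", "III", "MMXIV"]:
    print(token, num_form(token))
```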

NumType: numeral type

Values: Card Dist Frac Mult Ord Range Sets

Some languages (especially Slavic) have a complex system of numerals. For example, in the school grammar of Czech, the main part of speech is “numeral”, it includes almost everything where counting is involved and there are various subtypes. It also includes interrogative, relative, indefinite and demonstrative words referring to numbers (words like kolik / how many, tolik / so many, několik / some, a few), so at the same time we may have a non-empty value of PronType. (In English, these words are called quantifiers and they are considered a subgroup of determiners.)

From the syntactic point of view, some numtypes behave like adjectives and some behave like adverbs. We tag them u-pos/ADJ and u-pos/ADV respectively. Thus the NumType feature applies to several different parts of speech:

Card: cardinal number or corresponding interrogative / relative / indefinite / demonstrative word

Note that in some Indo-European languages there is a fuzzy borderline between numerals and nouns for thousand, million and billion.

Examples

Ord: ordinal number or corresponding interrogative / relative / indefinite / demonstrative word

This is a subtype of adjective or (in some languages) of adverb.

Examples

Mult: multiplicative numeral or corresponding interrogative / relative / indefinite / demonstrative word

This is a subtype of adjective or adverb.

Examples

Frac: fraction

This is a subtype of cardinal numbers, occasionally distinguished in corpora. It may denote a fraction or just the denominator of the fraction. In various languages these words may behave morphologically and syntactically as nouns or ordinal numerals.

Examples

Sets: number of sets of things; collective numeral

Morphologically distinct class of numerals used to count sets of things, or nouns that are pluralia tantum. Some authors call this type collective numeral.

Examples

Dist: distributive numeral

Used to express that the same quantity is distributed to each member in a set of targets.

Examples

Range: range of values

This could be considered a subtype of cardinal numbers, occasionally distinguished in corpora.

Examples

NumValue: numeric value

Low-value (<5) cardinal numbers in Slavic languages behave morphologically and syntactically differently from the rest, therefore some tagsets distinguish them (so far seen in Czech, Polish, and also Arabic, although it is not Slavic).

In Czech, number “one” agrees with the counted noun in gender, number and case. Number “two” agrees in gender and case and numbers “three” and “four” agree in case. These numerals behave similarly to adjectives. Numbers “five”, “six” etc. behave differently. If the case of the counted phrase is genitive, dative, locative or instrumental, the numeral agrees in case with the noun. However, if the case of the whole phrase is nominative, accusative or vocative, then the numeral dictates that the noun is in genitive. This behavior is similar to nouns modified by other nouns in genitive. (Note that this is why in the Czech PDT some numeral nodes are annotated as governing nouns instead of modifying them.)

1: numeric value 1

2: numeric value 2

3: numeric value 3 or 4
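The value assignment and the Czech case rule described above can be written out as follows. This is a sketch for positive whole numbers only; the function names are hypothetical, and the case labels (Nom, Gen, Dat, Acc, Voc, Loc, Ins) are assumed to be the usual Case abbreviations.

```python
def num_value(n: int):
    """Return the NumValue for a cardinal, or None when the feature stays empty."""
    if n == 1:
        return "1"
    if n == 2:
        return "2"
    if n in (3, 4):
        return "3"  # one value covers both 3 and 4
    return None


def counted_noun_case(n: int, phrase_case: str) -> str:
    """Case of the counted noun in Czech, given the case of the whole phrase."""
    if n >= 5 and phrase_case in {"Nom", "Acc", "Voc"}:
        return "Gen"  # the numeral dictates the genitive
    return phrase_case  # numeral and noun agree in case


print(num_value(4), counted_noun_case(4, "Nom"))
print(num_value(5), counted_noun_case(5, "Nom"))
print(num_value(5), counted_noun_case(5, "Dat"))
```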

Number: number

Values: Coll Count Dual Grpa Grpl Inv Pauc Plur Ptan Sing Tri

Number is usually an inflectional feature of nouns and, depending on language, other parts of speech (pronouns, adjectives, determiners, numerals, verbs) that mark agreement with nouns.

Sing: singular number

A singular noun denotes one person, animal or thing.

Examples

Plur: plural number

A plural noun denotes several persons, animals or things.

Examples

Dual: dual number

A dual noun denotes two persons, animals or things.

Examples

Tri: trial number

A trial pronoun denotes three persons, animals or things. It occurs in pronouns of several Austronesian languages.

Pauc: paucal number

A paucal noun denotes “a few” persons, animals or things.

Grpa: greater paucal number

A greater paucal noun denotes “more than several but not many” persons, animals or things. It occurs in Sursurunga, an Austronesian language.

Grpl: greater plural number

A greater plural noun denotes “many, all possible” persons, animals or things. Precise semantics varies across languages.

Inv: inverse number

Inverse number means non-default for that particular noun. (Some nouns are by default assumed to be singular, some plural.) Occurs e.g. in Kiowa.

Count: count plural

Attested in Bulgarian and Macedonian. It is known variously as “counting form”, “count plural” or “quantitative plural” (Sussex and Cubberley 2006, p. 324). It is a special plural form of nouns if they occur after numerals. (The form originates in the Proto-Slavic dual but it should not be marked Number=Dual because 1. the dual vanished from Bulgarian and 2. the form is no longer semantically tied to the number two.)

Examples

Ptan: plurale tantum

Some nouns appear only in the plural form even though they denote one thing (semantic singular); some tagsets mark this distinction. Grammatically they behave like plurals, so Plur is obviously the back-off value here; however, if the language also marks gender, the non-existence of a singular form sometimes means that the gender is unknown. In Czech, a special type of numeral is used when counting nouns that are pluralia tantum (NumType = Sets).

Examples

Coll: collective / mass / singulare tantum

Collective or mass or singulare tantum is a special case of singular. It applies to words that use grammatical singular to describe sets of objects, i.e. semantic plural. Although in theory they might be able to form plural, in practice it would be rarely semantically plausible. Sometimes, the plural form exists and means “several sorts of” or “several packages of”.

Examples

References

PartType: particle type

Types of particles found in various tagsets. I am merely presenting here what I have in Interset now. We will have to make it match our new definition of particles.

Mod: modal particle

Examples: [bg] май (possibly), нека (let), [cs] ať, kéž, nechť (let)

Emp: particle of emphasis

Examples: [bg] даже (even)

Res: particle of response

Examples: yes, no

Inf: infinitive marker

Did we say that these are subordinating conjunctions?

If so, do we want to have this feature value in “conjtype”?

Examples: [en] to, [de] zu, [da] at, [sv] att

Vbp: separated verb prefix in German

They are analogous to verbal particles in other Germanic languages, which again overlap with adpositions and adverbs. Do we want to tag them as adpositions/adverbs and add this feature? Examples: [de] vor (in “stellen Sie sich vor”)

Besides these, various languages also have question particles (they turn the sentence into a question, i.e. they are a sort of pronounced question mark) and negative particles (English “not”, German “nicht” etc.; some people would say that these are adverbs). I have been abusing “prontype” values “int” and “neg” to capture these two types in Interset but I am not particularly happy with that, as prontype otherwise applies to a different class of words. So if we keep the “PartType” feature, we may want to also add the “int” and “neg” values here.

Person: person

Values: 0 1 2 3 4

Person is typically a feature of personal and possessive pronouns / determiners, and of verbs. On verbs it is in fact an agreement feature that marks the person of the verb’s subject (some languages, e.g. Basque, can also mark person of objects). Person marked on verbs makes it unnecessary to always add a personal pronoun as subject and thus subjects are sometimes dropped (pro-drop languages).

0: zero person

Zero person is for impersonal statements; it appears in Finnish as well as in Santa Ana Pueblo Keres. (The construction is distinctive in Finnish but it does not use unique morphology that would necessarily require a feature. However, it is morphologically distinct in Keres (Davis 1964:75).)

1: first person

In singular, the first person refers just to the speaker / author. In plural, it must include the speaker and one or more additional persons. Some languages (e.g. Taiwanese) distinguish inclusive and exclusive 1st person plural pronouns: the former include the addressee of the utterance (i.e. I + you), the latter exclude them (i.e. I + they).

Examples

2: second person

In singular, the second person refers to the addressee of the utterance / text. In plural, it may mean several addressees and optionally some third persons too.

Examples

3: third person

The third person refers to one or more persons that are neither speakers nor addressees.

Examples

4: fourth person

The fourth person can be understood as a third person argument morphologically distinguished from another third person argument, e.g. in Navajo.

References

Polarity: polarity

Values: Neg Pos

Polarity is typically a feature of verbs, adjectives, sometimes also adverbs and nouns in languages that negate using bound morphemes. In languages that negate using a function word, Polarity is used to mark that function word, unless it is a pro-form already marked with PronType=Neg (see below).

Positive polarity (affirmativeness) is rarely, if at all, encoded using overt morphology. The feature value Polarity=Pos is usually used to signal that a lemma has negative forms but this particular form is not negative. Using the feature in such cases is somewhat optional for words that can be negated but rarely are.

For instance, all Czech verbs and adjectives can be negated using the prefix ne-. In theory, all nouns can be negated too, with the meaning “anything except the entities denotable by the original noun”. However, negated nouns are rare and it is not necessary to annotate every positive noun with Polarity=Pos. Language-specific documentation should define under which circumstances the positive polarity is annotated.

In English, verbs are negated using the particle not and adjectives are also negated using prefixes, although the process is less productive than in Czech (wise – unwise, probable – improbable).

Note that Polarity=Neg is not the same thing as PronType=Neg. For pronouns and other pronominal parts of speech there is no such binary opposition as for verbs and adjectives. (There is no such thing as “affirmative pronoun”.)

The Polarity feature can be also used to distinguish response interjections yes and no.

Pos: positive, affirmative

Examples

Neg: negative

Examples

Polite: politeness

Values: Elev Form Humb Infm

Various languages have various means to express politeness or respect; some of the means are morphological. Three to four dimensions of politeness are distinguished in linguistic literature. The Polite feature currently covers (and mixes) two of them; a more elaborate system of feature values may be devised in future versions of UD if needed. The two axes covered are the speaker-referent axis and the speaker-addressee axis.

Changing pronouns and/or person and/or number of the verb forms when respectable persons are addressed in Indo-European languages belongs to the speaker-referent axis because the honorific pronouns are used to refer to the addressee.

In Czech, formal second person has the same form for singular and plural, and is identical to informal second person plural. This involves both the pronoun and the finite verb but not a participle, which has no special formal form (that is, formal singular is identical to informal singular, not to informal plural).

In German, Spanish or Hindi, both number and person are changed (informal third person is used as formal second person) and in addition, special pronouns are used that only occur in the formal register ([de] Sie; [es] usted, ustedes; [hi] आप āpa).

In Japanese, verbs and other words have polite and informal forms but the polite forms are not referring to the addressee (they are not in second person). They are just used because of who the addressee is, even if the topic does not involve the addressee at all. This kind of polite language is called teineigo (丁寧語) and belongs to the speaker-addressee axis. Nevertheless, we currently use the same values for both axes, i.e. Polite=Form can be used for teineigo too. This approach may be refined in future.

Infm: informal register

Usage varies but if the language distinguishes levels of politeness, then the informal register is usually meant for communication with family members and close friends.

Examples:

Form: formal register

Usage varies but if the language distinguishes levels of politeness, then the polite register is usually meant for communication with strangers and people of higher social status than the one of the speaker.

Examples:

Elev: referent elevating

This register belongs to the speaker-referent axis and can be seen as a subtype of the formal register there. As an example, Japanese sonkeigo (尊敬語) is a set of honorific forms that elevate the status of the referent.

Humb: speaker humbling

This register belongs to the speaker-referent axis and can be seen as a subtype of the formal register there. As an example, Japanese kenjōgo (謙譲語) is a set of honorific forms that lower the speaker’s status, thereby raising the referent’s status by comparison.

References

Poss: possessive

Values: Yes

Boolean feature of pronouns, determiners or adjectives. It tells whether the word is possessive.

While many tagsets would have “possessive” as one of the various pronoun types, this feature is intentionally separate from PronType, as it is orthogonal to pronominal types. Several of the pronominal types can be optionally possessive, and adjectives can too.

Yes: it is possessive

Note that there is no No value. If the word is not possessive, the Poss feature will just not be mentioned in the FEATS column. (This means that the empty value has the No meaning.)
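In code, this means that a test for possessiveness checks for the presence of `Poss=Yes` and treats everything else as “not possessive”. A minimal sketch, assuming a `Name=Value|Name=Value` feature string with `_` for an empty set; the function name `is_possessive` and the sample strings are illustrative.

```python
def is_possessive(feats: str) -> bool:
    """True only when Poss=Yes is present; absence of Poss means "not possessive"."""
    if feats == "_":  # assumption: "_" stands for "no features"
        return False
    return "Poss=Yes" in feats.split("|")


print(is_possessive("Number=Sing|Person=1|Poss=Yes|PronType=Prs"))
print(is_possessive("Case=Nom|Number=Sing|Person=1|PronType=Prs"))
print(is_possessive("_"))
```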

Examples

PossGender: possessor’s gender

Possessive adjectives and pronouns may have two different genders: that of the possessed object (gender agreement with modified noun) and that of the possessor (lexical feature, inherent gender). The PossGender feature captures the possessor’s gender. For simplicity, the set of possible values is identical to Gender, although only a subset has been observed in corpora so far.

In the Czech examples below, the masculine PossGender implies using one of the suffixes -ův, -ova, -ovo, and the feminine PossGender implies using one of -in, -ina, -ino.

Masc: masculine possessor

Examples: [cs] otcův syn (father’s son; PossGender=Masc|Gender=Masc); otcova dcera (father’s daughter; PossGender=Masc|Gender=Fem); otcovo dítě (father’s child; PossGender=Masc|Gender=Neut).

Fem: feminine possessor

Examples: [cs] matčin syn (mother’s son; PossGender=Fem|Gender=Masc); matčina dcera (mother’s daughter; PossGender=Fem|Gender=Fem); matčino dítě (mother’s child; PossGender=Fem|Gender=Neut).
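The six Czech examples above amount to a lookup keyed on the pair (PossGender, Gender). The sketch below encodes exactly those six forms; the names `SUFFIX` and `possessive_suffix` are hypothetical, and other cases and numbers of the paradigm are not covered.

```python
# (PossGender, Gender) -> suffix, taken from the examples above.
SUFFIX = {
    ("Masc", "Masc"): "-ův",   # otcův syn
    ("Masc", "Fem"): "-ova",   # otcova dcera
    ("Masc", "Neut"): "-ovo",  # otcovo dítě
    ("Fem", "Masc"): "-in",    # matčin syn
    ("Fem", "Fem"): "-ina",    # matčina dcera
    ("Fem", "Neut"): "-ino",   # matčino dítě
}


def possessive_suffix(feats: str) -> str:
    """Look up the suffix for a string such as "PossGender=Masc|Gender=Fem"."""
    values = dict(item.split("=", 1) for item in feats.split("|"))
    return SUFFIX[(values["PossGender"], values["Gender"])]


print(possessive_suffix("PossGender=Masc|Gender=Fem"))
print(possessive_suffix("PossGender=Fem|Gender=Neut"))
```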

PossNumber: possessor’s number

Possessives may have two different numbers: that of the possessed object (number agreement with modified noun) and that of the possessor. The PossNumber feature captures the possessor’s number. For simplicity, the set of possible values is identical to Number, although only a subset has been observed in corpora so far.

Sing: singular possessor

Examples: [en] my, his, her, its; [cs] můj pes (my dog; PossNumber=Sing|Number=Sing); moji psi (my dogs; PossNumber=Sing|Number=Plur).

Plur: plural possessor

Examples: [en] our, their; [cs] náš pes (our dog; PossNumber=Plur|Number=Sing); naši psi (our dogs; PossNumber=Plur|Number=Plur).

PossPerson: possessor’s person

PossPerson is possessor’s person, marked e.g. on Hungarian nouns. These noun forms would be translated to English as possessive pronoun + noun.

Note that there is currently a sort of inconsistency in Interset: since this feature was introduced, it would be logical to use it also for possessive pronouns in other languages. Yet the possessor’s person of these pronouns is traditionally captured in the “Person” feature. Also note that using PossPerson for possessive pronouns might introduce inconsistency at the other end because in some languages, possessive pronouns are actually identical to personal pronouns in the genitive case.

1: first person possessor

Examples: [hu] kutya = dog; kutyám = my dog; kutyánk = our dog.

2: second person possessor

Examples: [hu] kutya = dog; kutyád = your.Sing dog; kutyátok = your.Plur dog.

3: third person possessor

Examples: [hu] kutya = dog; kutyája = his/her/its dog; kutyájuk = their dog.

PossedNumber: possessed object’s number

PossedNumber is the possessee’s (possessed, owned noun phrase’s) number. In Hungarian, possession can be marked on the possessor or on the possessed. It is possible, though rare, that a noun has three distinct number features: its own grammatical number, number of its possessor and number of its possession. Examples from the Multext-East Hungarian lexicon:

Words marked for plural possessions are very rare, though. Note that in the following example from Multext-East, Columbus is marked for plural possession, but not for his own owner.

Sing: singular possession

Plur: plural possession

Prefix: Word functions as a prefix in a compound construction

Boolean feature. Is this a prefix word in a compound that usually cannot stand on its own?

These are words corresponding to prefixes such as inter- (inter disciplinary), post- (post traumatic), un- (un avoidable), di- (di transitive) and so on in English, but which are realized as distinct tokens (without the hyphen) in different languages.

Yes: it is a prefix of a compound

PrepCase: case form sensitive to prepositions

Personal pronouns in some languages have different forms depending on whether they are objects of prepositions or not. For instance, Czech on (he) without prepositions has the forms jemu/DAT, jeho/ACC, jím/INS, while with a preposition it is němu/DAT, něho/ACC, ním/INS. Similarly, Portuguese pronouns in prepositional oblique case take forms different from oblique pronouns serving as direct objects of verbs: eu/NOM (I), me/ACC (give me that), mim/PREP-ACC (come to me).

Default empty value means that the word form is neutral w.r.t. prepositions.

Npr: non-prepositional case

This word form must not be used after a preposition.

Examples: [cs] jemu = him (dative).

Pre: prepositional case

This word form must be used after a preposition.

Examples: [cs] k němu = to him (dative).
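
The Npr/Pre distinction can be sketched as a small lookup. The table below contains only the Czech forms of on (he) quoted above; the names FORMS and pronoun_form are assumptions made for this illustration and do not come from any UD tool.

```python
# Czech forms of "on" (he) quoted in the text, keyed by case and PrepCase value.
FORMS = {
    "Dat": {"Npr": "jemu", "Pre": "němu"},
    "Acc": {"Npr": "jeho", "Pre": "něho"},
    "Ins": {"Npr": "jím", "Pre": "ním"},
}


def pronoun_form(case, after_preposition):
    """Pick the form of the pronoun depending on whether a preposition precedes it.

    Hypothetical helper: covers only the three cases listed in FORMS.
    """
    prep_case = "Pre" if after_preposition else "Npr"
    return FORMS[case][prep_case]


dative_after_k = pronoun_form("Dat", after_preposition=True)
dative_bare = pronoun_form("Dat", after_preposition=False)
```

Forms that are neutral with respect to prepositions would simply have no PrepCase value and would not need such a table.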

PronType: pronominal type

Values: Art Dem Emp Exc Ind Int Neg Prs Rcp Rel Tot

This feature typically applies to pronouns, pronominal adjectives (determiners), pronominal numerals (quantifiers) and pronominal adverbs.

Prs: personal or possessive personal pronoun or determiner

See also the Poss feature that distinguishes normal personal pronouns from possessives. Note that Prs also includes reflexive personal/possessive pronouns (e.g. [cs] se / svůj; see the Reflex feature).

Examples

Rcp: reciprocal pronoun

Examples

Art: article

Article is a special case of determiner that bears the feature of definiteness (in other languages, the feature may be marked directly on nouns).

Examples

Int: interrogative pronoun, determiner, numeral or adverb

Note that possessive interrogative determiners (whose) can be distinguished by the Poss feature.

Examples:

Rel: relative pronoun, determiner, numeral or adverb

Note that in many languages this class heavily overlaps with interrogatives, yet there are pronouns that are only relative, and in some languages (Bulgarian, Hindi) the two classes are distinct.

Examples:

Exc: exclamative determiner

Exclamative pro-adjectives (determiners) express the speaker’s surprise towards the modified noun, e.g. what in “What a surprise!” In many languages, exclamative determiners are recruited from the set of interrogative determiners. Therefore, not all tagsets distinguish them.

Examples:

Dem: demonstrative pronoun, determiner, numeral or adverb

These are often parallel to interrogatives. Some tagsets might also distinguish a separate feature of distance (here / there; [es] aquí / ahí / allí).

Examples

Emp: emphatic determiner

Emphatic pro-adjectives (determiners) emphasize the nominal they depend on. There are similarities with reflexive and demonstrative pronouns / determiners.

Examples

Tot: total (collective) pronoun, determiner or adverb

Examples

Neg: negative pronoun, determiner or adverb

Negative pronominal words are distinguished from negating particles and from words that inflect for polarity (verbs, adjectives etc.). Such particles and polarity-inflecting words do not use PronType=Neg; they use Polarity=Neg instead. See the Polarity feature for further details.

Examples:

Ind: indefinite pronoun, determiner, numeral or adverb

Note that some tagsets might further subclassify this category to distinguish “some” from “any” etc. Such distinctions are not part of universal features but may be added in language-specific extensions.

Examples

PunctSide: which side of paired punctuation is this?

Distinguishes between the initial and final forms of paired punctuation (brackets, quotation marks, question and exclamation marks in Spanish). Note that “initial” and “final” are better terms than “left” and “right”. The latter would be confusing in languages written from right to left, like Arabic.

Ini: initial (left bracket in English texts)

Fin: final (right bracket in English texts)
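
A minimal sketch of how the two values might be assigned, assuming a hand-written table of characters whose shape alone identifies the side. Quotation marks are deliberately left out because their opening and closing shapes differ between languages; the names INITIAL, FINAL and punct_side are hypothetical.

```python
# Illustrative tables only: brackets plus the Spanish inverted marks.
INITIAL = set("([{¿¡")
FINAL = set(")]}")


def punct_side(token):
    """Return 'Ini', 'Fin', or None when the side cannot be read off the character."""
    if token in INITIAL:
        return "Ini"
    if token in FINAL:
        return "Fin"
    return None


sides = [punct_side(t) for t in ["(", ")", "¿", ","]]
```

A real converter would need language-specific rules, for example to treat a Spanish closing question mark as final while leaving an English question mark unmarked.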

PunctType: punctuation type

Many tagsets have just one tag for punctuation. Others (including the Penn Treebank and the Swedish Mamba tagset) classify punctuation in more detail.

Peri: period at the end of sentence; in Penn tagset, includes question and exclamation

Qest: question mark

Excl: exclamation mark

Quot: quotation marks (various sorts in various languages)

Brck: bracket

Comm: comma

Colo: colon; in Penn tagset, “:” is in fact tag for generic other punctuation

Semi: semicolon

Dash: dash, hyphen

Symb: symbol

Reflex: reflexive

Values: Yes

Boolean feature, typically of pronouns or determiners. It tells whether the word is reflexive, i.e. refers to the subject of its clause.

While many tagsets would have “reflexive” as one of the various pronoun types, this feature is intentionally separate from PronType, as it is orthogonal to pronominal types.

Note that while some languages also have reflexive verbs, these are in fact fused verbs with reflexive pronouns, as in Spanish despertarse or Russian проснуться (both meaning “to wake up”). Thus in these cases the fused token will be split to two syntactic words, one of them being a reflexive pronoun.

Yes: it is reflexive

Note that there is no No value. If the word is not reflexive, the Reflex feature will simply not be mentioned in the FEATS column. (This means that an empty value has the meaning of No.)
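
The convention that absence means No can be sketched as follows. This is only an illustration: is_reflexive is a hypothetical helper, the feature strings are made up for the example, and the code assumes the usual Name=Value pairs joined by |, with _ standing for an empty feature set.

```python
def is_reflexive(feats):
    """Return True only if the feature string contains Reflex=Yes.

    A missing Reflex feature (or an empty feature set) is read as 'not reflexive'.
    """
    if feats in ("", "_"):
        return False
    return "Reflex=Yes" in feats.split("|")


# Illustrative feature strings, not taken from a specific treebank.
with_reflex = is_reflexive("Case=Acc|PronType=Prs|Reflex=Yes")
without_reflex = is_reflexive("Case=Acc|PronType=Prs")
no_features = is_reflexive("_")
```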

Examples

Style: style or sublanguage to which this word form belongs

This may be a lexical feature (some words-lemmas are archaic, some are colloquial) or a morphological feature (inflectional patterns may systematically change between dialects or styles). It could be used in many languages but only a few choose to actually annotate it. Seen in Bulgarian, Czech, Danish, Finnish and Hungarian.

Arch: archaic, obsolete

Rare: rare

Form: formal, literary

Poet: poetic

Norm: normal, neutral

Coll: colloquial

Vrnc: vernacular

Slng: slang

Expr: expressive, emotional

Derg: derogative

Vulg: vulgar

Subcat: subcategorization

Lexical feature of verbs. Some tagsets distinguish intransitive and transitive verbs. In many languages however, subcategorization of verbs is much more complex than this.

Intr: intransitive verb

A verb that does not take arguments other than the subject.

Examples: [en] to go

Tran: transitive verb

A verb that takes a direct (accusative) object as argument (in addition to the subject). These verbs can be passivized, then the direct object becomes the subject.

Examples: [en] to do something, to be done by somebody

Tense: tense

Values: Fut Imp Past Pqp Pres

Tense is typically a feature of verbs. It may also occur with other parts of speech (nouns, adjectives, adverbs), depending on whether borderline word forms such as participles are classified as verbs or as the other category.

Tense is a feature that specifies the time when the action took / takes / will take place, in relation to a reference point. The reference is often the moment of producing the sentence, but it can be also another event in the context. In some languages (e.g. English), some tenses are actually combinations of tense and aspect. In other languages (e.g. Czech), aspect and tense are separate, although not completely independent of each other.

Note that we are defining features that apply to a single word. If a tense is constructed periphrastically (two or more words, e.g. auxiliary verb indicative + participle of the main verb) and none of the participating words are specific to this tense, then the features will probably not directly reveal the tense. For instance, [en] I had been there is in the past perfect (pluperfect) tense, formed periphrastically by the simple past tense of the auxiliary to have and the past participle of the main verb to be. The auxiliary will be tagged VerbForm=Fin|Mood=Ind|Tense=Past and the participle will have VerbForm=Part|Tense=Past; neither of the two will have Tense=Pqp. On the other hand, Portuguese can form the pluperfect morphologically as just one word, such as estivera, which will thus be tagged VerbForm=Fin|Mood=Ind|Tense=Pqp.
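
As a rough illustration of this point, the following Python sketch parses the three feature strings quoted above and checks which words carry Tense=Pqp. The helper parse_feats is an assumption made for this example, not part of any UD tool; it only assumes Name=Value pairs joined by |, with _ for an empty feature set.

```python
def parse_feats(feats):
    """Turn a feature string such as 'VerbForm=Fin|Tense=Past' into a dict.

    Hypothetical helper for illustration; '_' or '' means no features.
    """
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Feature strings taken from the text above.
had = parse_feats("VerbForm=Fin|Mood=Ind|Tense=Past")      # [en] auxiliary "had"
been = parse_feats("VerbForm=Part|Tense=Past")             # [en] participle "been"
estivera = parse_feats("VerbForm=Fin|Mood=Ind|Tense=Pqp")  # [pt] one-word pluperfect

# Neither English word is tagged Pqp; the pluperfect reading only emerges
# from the combination of the two words.
english_has_pqp = any(word.get("Tense") == "Pqp" for word in (had, been))
portuguese_has_pqp = estivera.get("Tense") == "Pqp"
```

A tool that needs to recognize the English pluperfect would therefore have to look at the auxiliary and the participle together rather than at the features of either word alone.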

Past: past tense / preterite / aorist

The past tense denotes actions that happened before a reference point. In the prototypical case, the reference point is the moment of producing the sentence and the past event happened before the speaker speaks about it. However, Tense=Past is also used to distinguish past participles from other kinds of participles, and past converbs from other kinds of converbs; in these cases, the reference point may itself be in past or future, when compared to the moment of speaking. For instance, the Czech converb spatřivše “having seen” in the sentence spatřivše vojáky, velmi se ulekli “having seen the soldiers, they got very scared” describes an event that is anterior to the event of getting scared. It also happens to be anterior to the moment of speaking, but that fact is not encoded in the converb itself, it is rather a consequence of “getting scared” being in the past tense.

Among finite forms, the simple past in English is an example of Tense=Past. In German, this is the Präteritum. In Turkish, this is the non-narrative past. In Bulgarian, this is aorist, the aspect-neutral past tense that can be used freely with both imperfective and perfective verbs (see also imperfect).

Examples

Pres: present tense

The present tense denotes actions that are in progress (or states that are valid) in a reference point; it may also describe events that usually happen. In the prototypical case, the reference point is the moment of producing the sentence; however, Tense=Pres is also used to distinguish present participles from other kinds of participles, and present converbs from other kinds of converbs. In these cases, the reference point may be in past or future when compared to the moment of speaking. For instance, the English present participle may be used to form a past progressive tense: he was watching TV when I arrived.

Examples

Fut: future tense

The future tense denotes actions that will happen after a reference point; in the prototypical case, the reference point is the moment of producing the sentence.

Examples

Imp: imperfect

Used in e.g. Bulgarian and Croatian, imperfect is a special case of the past tense. Note that, unfortunately, imperfect tense is not always the same as past tense + imperfective aspect. For instance, in Bulgarian, there is lexical aspect, inherent in verb meaning, and grammatical aspect, which does not necessarily always match the lexical one. In main clauses, imperfective verbs can have imperfect tense and perfective verbs have perfect tense. However, both rules can be violated in embedded clauses.

Examples

Pqp: pluperfect

The pluperfect denotes an action that happened before another action in the past. This value does not apply to English, where the pluperfect (past perfect) is constructed analytically. It applies e.g. to Portuguese.

Examples

Typo: is this a misspelled word?

Values: Yes

Indicates bad spelling, grammatical error etc. Does not say what the correct form looks like.

Yes: it is a typo

Examples

VerbForm: form of verb or deverbative

Values: Conv Fin Gdv Ger Inf Part Sup Vnoun

Even though the name of the feature seems to suggest that it is used exclusively with verbs, it is not the case. Some verb forms in some languages actually form a gray zone between verbs and other parts of speech (nouns, adjectives and adverbs). For instance, participles may be either classified as verbs or as adjectives, depending on language and context. In both cases VerbForm=Part may be used to separate them from other verb forms or other types of adjectives.

Fin: finite verb

Rule of thumb: if it has non-empty Mood, it is finite. But beware that some tagsets conflate verb forms and moods into one feature.
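
The rule of thumb can be written down as a small heuristic. This is a sketch under the assumption that features come as Name=Value pairs joined by |; looks_finite is a hypothetical name, and the result is only a guess for tagsets that conflate verb forms and moods.

```python
def looks_finite(feats):
    """Heuristic from the rule of thumb: a non-empty Mood suggests a finite form."""
    if feats in ("", "_"):
        return False
    names = {pair.split("=", 1)[0] for pair in feats.split("|")}
    return "Mood" in names


finite_guess = looks_finite("VerbForm=Fin|Mood=Ind|Tense=Past")
participle_guess = looks_finite("VerbForm=Part|Tense=Past")
```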

Examples

Inf: infinitive

Infinitive is the citation form of verbs in many languages. Unlike in English, it often has a morphological form that is distinct from the finite forms. Infinitives may be used together with auxiliaries to form periphrastic tenses (e.g. future tense [cs] budu sedět v letadle “I will sit in a plane”), they appear as arguments of modal verbs etc. In some languages they behave similarly to nouns and are used as such (similar to the gerund in English).

Examples

Sup: supine

Supine is a rare verb form. It survives in some Slavic languages (Slovenian) and is used instead of infinitive as the argument of motion verbs (old [cs] jdu spat lit. I-go sleep).

A form called “supine” also exists in Swedish where it is a special form of the participle, used to form the composite past form of a verb. It is used after the auxiliary verb ha (to have) but not after vara (to be):

Part: participle, verbal adjective

Participle is a non-finite verb form that shares properties of verbs and adjectives. Its usage varies across languages. It may be used to form various periphrastic verb forms such as complex tenses and passives; it may be also used purely adjectively.

Other features may help to distinguish past/present participles (English), active/passive participles (Czech), imperfect/perfect participles (Hindi) etc.

Examples

Conv: converb, transgressive, adverbial participle, verbal adverb

The converb, also called adverbial participle or transgressive, is a non-finite verb form that shares properties of verbs and adverbs. It appears e.g. in Slavic and Indo-Aryan languages.

Note that this value was called Trans in UD v1 and it has been renamed Conv in UD v2.
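
When converting UD v1 data, the rename amounts to replacing one value. The sketch below assumes the feature string is a |-joined list of Name=Value pairs; upgrade_verbform is a hypothetical helper name, and the code touches nothing except VerbForm=Trans.

```python
def upgrade_verbform(feats):
    """Rewrite the UD v1 value VerbForm=Trans as the UD v2 value VerbForm=Conv."""
    if feats in ("", "_"):
        return feats
    pairs = ["VerbForm=Conv" if pair == "VerbForm=Trans" else pair
             for pair in feats.split("|")]
    return "|".join(pairs)


upgraded = upgrade_verbform("Tense=Past|VerbForm=Trans")
untouched = upgrade_verbform("Tense=Past|VerbForm=Part")
```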

Examples

Gdv: gerundive

Used in Latin and Ancient Greek. Not to be confused with the gerund.

Ger: gerund

Gerund is a non-finite verb form that shares properties of verbs and nouns. In English it shares its morphological form with the present participle, which may mean that the tagset will not distinguish it from the participle.

Using VerbForm=Ger is discouraged and alternatives should be considered first because the term gerund is rather confusing: in Spanish (and other Romance languages) it denotes the present participle and should be thus labeled Tense=Pres|VerbForm=Part; some Slavists use it to denote converbs (adverbial participles), which should be labeled VerbForm=Conv; and UD version 1 recommended (inspired by English) to use it for verbal nouns, which in UD v2 should use VerbForm=Vnoun.

However, the feature is still available in UDv2 and can be used if the alternatives do not seem acceptable. The feature may be removed in future versions but comprehensive investigation has to be done first.

Examples

Vnoun: verbal noun, masdar

Verbal nouns other than infinitives. Also called masdars by some authors, e.g. Haspelmath, 1995.

Examples

References

VerbType: verb type

We already split auxiliary and non-auxiliary verbs at the level of POS tags but some tagsets mark other distinctions.

Aux: auxiliary verb

Verb used to create periphrastic verb forms (tenses, passives etc.) In many languages there will be ambiguity between auxiliary and other usages, thus the same verb should get different tags or feature values depending on context.

Cop: copula verb

Verb used to make nominal predicates from adjectives, nouns or participles. Some languages omit the copula or use other means to create nominal predicates. In languages that have copula, it is often the equivalent of the verbs “to be” or “to become”.

Examples: It is purple. He just became a father.

Mod: modal verb

A group of verbs traditionally distinguished in grammars of some languages. They take the infinitive of another verb as an argument (with or without an infinitive-marking conjunction, in languages that have it) and add various modes of possibility, necessity etc. to the meaning of the infinitive. There are other verbs that take infinitives as arguments but they are not considered modal (e.g. phasal verbs such as “to begin to do something”). The set of modal verbs for a language is closed and can be enumerated.

Note that some languages (e.g. Turkish) use special forms of the main verb instead of combining it with a modal verb.

German examples: dürfen (may), können (can), mögen (want/like to), müssen (must), sollen (shall), wollen (want to), wissen (know to)

Czech examples: muset (must), mít (shall, have to), moci (can), smět (may, be allowed to), umět (know to), chtít (want to)

Light: light (support) verb

Light or support verb is used in verbo-nominal constructions where the main part of the meaning is contributed by a noun complement. An English example would be to take a nap, where take is the light verb. It is often the case that the light verb can also function as a normal verb in the language (cf. to take two dollars). If the light verb constructions are used frequently in a language (e.g. Hindi or Japanese) or if there is a dedicated light verb that cannot be used as normal verb, it makes sense to mark light verbs with a dedicated feature value.

Japanese example: suru

Voice: voice

Values: Act Antip Cau Dir Inv Mid Pass Rcp

Voice is typically a feature of verbs. It may also occur with other parts of speech (nouns, adjectives, adverbs), depending on whether borderline word forms such as gerunds and participles are classified as verbs or as the other category.

For Indo-European speakers, voice means mainly the active-passive distinction. In other languages, other shades of verb meaning are categorized as voice.

Act: active voice

The subject of the verb is the doer of the action (agent), the object is affected by the action (patient).

Examples

Mid: middle voice

Between active and passive, needed e.g. in Ancient Greek or Sanskrit.

Pass: passive voice

The subject of the verb is affected by the action (patient). The doer (agent) is either unexpressed or it appears as an object of the verb.

Examples

Antip: antipassive voice

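In ergative-absolutive languages, the ergative subject of a transitive verb becomes the absolutive subject of an intransitive verb form, while the original object is omitted or demoted to an oblique.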

Dir: direct voice

Used in direct-inverse voice systems, e.g. in North American languages. Direct means that the argument that is higher in salience hierarchy is the subject. Example hierarchy: human 1st person – 2nd – 3rd – non-human animate – inanimate.

Inv: inverse voice

Used in direct-inverse voice systems, e.g. in North American languages. Inverse voice marking means that the argument lower in the hierarchy functions as subject.
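
The choice between Dir and Inv can be sketched with the example hierarchy given under Dir. This is a simplified illustration: the labels in HIERARCHY and the function voice_for are assumptions made for this example, and real languages may use different hierarchies and additional rules (for instance when both arguments have the same rank).

```python
# Example salience hierarchy from the text, most salient first.
HIERARCHY = ["1st", "2nd", "3rd", "non-human animate", "inanimate"]


def voice_for(subject, obj):
    """Return 'Dir' if the subject outranks the object, 'Inv' if it is outranked.

    Returns None for equal ranks, which this sketch does not try to resolve.
    """
    subject_rank = HIERARCHY.index(subject)
    object_rank = HIERARCHY.index(obj)
    if subject_rank < object_rank:
        return "Dir"
    if subject_rank > object_rank:
        return "Inv"
    return None


first_on_third = voice_for("1st", "3rd")
inanimate_on_second = voice_for("inanimate", "2nd")
```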

Rcp: reciprocal voice

Examples

Cau: causative voice

Documentation of the METU Sabanci treebank classifies causative as voice (page 26). Note that this is a feature of verbs. There are languages that have also the causative case of nouns.

Examples
