
Morphology: General Principles

The UD scheme allows the specification of a complete morpho-syntactic representation that can be applied cross-linguistically. This effectively means that grammatical notions may be indicated via word forms (morphologically) or via dependency relations (syntactically). The morphological specification of a (syntactic) word in the UD scheme consists of three levels of representation: a lemma, a part-of-speech tag, and a set of features that encode grammatical properties.

Lemmas are typically determined by language-specific dictionaries and lexica. In contrast, the part-of-speech tags and grammatical properties are taken from two universal inventories defined below.

Unlike various language-specific tagsets, the universal tags and features do not include means to mark fusion words (words that result from merging two other words that are syntactically independent and belong to different parts of speech): Czech dělals (dělal + jsi … main verb + auxiliary); proň (pro + něj … preposition + pronoun); German zum (zu + dem … preposition + article); Spanish dámelo (da + me + lo … verb + clitics) etc. The only truly general approach to fusion words is to apply a language-specific processing step that splits tokens into syntactic words where necessary. Every syntactic word then gets its own part-of-speech tag and features. See also Tokenization and Format.
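The splitting step described above could be sketched as follows. This is only an illustration, assuming a hypothetical lookup table (FUSION_TABLE) seeded with three of the examples from the text; the function names are invented for this sketch, and a real system would need a language-specific lexicon or morphological analyzer rather than a fixed table.

```python
# Hypothetical lookup table: surface token -> list of syntactic words.
# The entries are taken from the examples in the text; the table itself
# is an assumption made for illustration.
FUSION_TABLE = {
    "zum": ["zu", "dem"],          # German: preposition + article
    "proň": ["pro", "něj"],        # Czech: preposition + pronoun
    "dámelo": ["da", "me", "lo"],  # Spanish: verb + clitics
}


def split_token(token):
    """Return the syntactic words for one surface token.

    Tokens that are not in the table are returned unchanged.
    The lookup is case-sensitive in this sketch.
    """
    return FUSION_TABLE.get(token, [token])


def split_sentence(tokens):
    """Expand every fused token in a list of surface tokens."""
    words = []
    for token in tokens:
        words.extend(split_token(token))
    return words
```

After this step, each element of the returned list is a syntactic word that receives its own part-of-speech tag and features.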

Lemmas

The LEMMA field should contain the canonical or base form of the word, such as the form typically found in dictionaries.

If the lemma is not available, an underscore (“_”) can be used to indicate its absence.

The LEMMA field should not be used to encode features or other similar properties of the word (use FEATS and MISC instead; see format).

Part-of-Speech Tags

The list of universal POS tags is a fixed list containing 17 tags.
Some tags may not be used in some languages. However, the list cannot be extended with language-specific tags. Instead, a more fine-grained classification of words can be achieved via the use of features (see below).

Also, note that the CoNLL-U format allows an additional POSTAG, taken from a language-specific (or corpus-specific) tagset. Such language-specific POSTAGs have their own data column and are not mixed with the universal POS tags.

The universal POS tags consist of uppercase English letters [A-Z] only. Just one tag per word is expected, and it should not be empty. (Use the X tag instead of underscore if no other tag is appropriate.)
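The constraints above (uppercase letters only, exactly one tag, never empty) lend themselves to a simple check. The sketch below assumes the 17-tag inventory of UD version 2; the names UPOS_TAGS and is_valid_upos are hypothetical and not part of any official tool.

```python
import re

# Assumed inventory: the 17 universal POS tags of UD v2.
UPOS_TAGS = frozenset({
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
})


def is_valid_upos(tag):
    """Check that a tag is a single, non-empty universal POS tag.

    The regular expression enforces the uppercase-only rule; the set
    lookup enforces membership in the fixed inventory. An underscore
    is rejected because X should be used when no other tag fits.
    """
    return re.fullmatch(r"[A-Z]+", tag) is not None and tag in UPOS_TAGS
```

A validator along these lines would reject both an underscore and a lowercase tag such as noun, while accepting X as the fallback.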

Using a word vs. mentioning it

The universal POS tags focus more on what the word is than on which construction it is used in (the latter is specified by the dependency relation labels). In particular, the POS tags do not distinguish actual usage of a word from just mentioning it. Thus in both the following examples, yes will be tagged as interjection:

Similarly, in both the following examples, precede will be tagged as verb:

See also

The guidelines for the following cases are documented on the referenced pages for specific POS tags:

Features

Features are additional pieces of information about the word, its part of speech and morphosyntactic properties. Every feature has the form Name=Value and every word can have any number of features, separated by the vertical bar, as in Gender=Masc|Number=Sing.
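The Name=Value format with vertical-bar separators can be read into a dictionary with a few lines of code. This is a minimal sketch: the function name parse_feats is hypothetical, and treating a lone underscore as "no features" is an assumption based on the CoNLL-U convention for empty fields.

```python
def parse_feats(feats):
    """Parse a feature string such as 'Gender=Masc|Number=Sing'.

    Returns a dict mapping feature names to values. A lone underscore
    is treated as an empty feature set (assumed CoNLL-U convention).
    """
    if feats == "_":
        return {}
    result = {}
    for pair in feats.split("|"):
        name, sep, value = pair.partition("=")
        if not sep or not name or not value:
            raise ValueError(f"malformed feature: {pair!r}")
        result[name] = value
    return result
```

For the example in the text, parse_feats("Gender=Masc|Number=Sing") yields a dictionary with the keys Gender and Number.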

We provide an inventory of features that are attested in multiple corpora, so it is desirable to encode them in a uniform way. The list is not exhaustive, and later versions of the standard may include new features or values found in new languages, corpora or tagsets.

Users can extend this set of universal features and add language-specific features when necessary. Such features should be described in the language-specific documentation and follow the general principles outlined here. Universal and language-specific features of a word are listed together in the FEATS column.

Lexical Features

All of these can be considered attributes of lexemes or lemmas (rather than individual word forms) and they represent a fine-grained sub-classification of words.

Inflectional Features

These are mostly features of word forms rather than lemmas. There are exceptions: for instance, gender of nouns is usually a lexical feature (all word forms of one lemma have the same gender). However, other parts of speech (adjectives, pronouns, verbs) may inflect for gender because of agreement with nouns.

Layered Features

In some languages, some features are marked more than once on the same word. We say that there are several layers of the feature. The exact meaning of individual layers is language-dependent.

For example, possessive adjectives, determiners and pronouns may have two different values of Gender and two of Number. One of the values is determined by agreement with the modified (possessed) noun. This is parallel to other (non-possessive) adjectives and determiners that agree in gender and number with the nouns they modify. The other value is determined lexically because it is a property of the possessor.

For detailed examples of layered features, see Layered Features.

If a feature can be layered in a language, the name of the feature must indicate the layer. An additional identifier in square brackets is used to distinguish layers, e.g. Gender[psor] for the possessor’s gender. We recommend that the layer identifiers consist of lowercase English letters [a-z] and/or digits [0-9]. The layers, their meaning and their identifiers must be defined in a language-specific extension to this documentation. For each layered feature, one layer may be defined as the default, and the corresponding features then appear without an identifier, e.g. Gender=Masc|Gender[psor]=Fem.
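The bracket notation for layers can be separated from the base feature name with a regular expression. The sketch below follows the recommendation that layer identifiers use only lowercase letters and digits; the pattern for the base name (letters and digits) and the function name split_feature_name are assumptions made for illustration.

```python
import re

# Base name, optionally followed by a layer identifier in square brackets.
# The base-name pattern is an assumption; the layer pattern follows the
# recommendation of lowercase letters [a-z] and/or digits [0-9].
FEATURE_NAME = re.compile(r"(?P<name>[A-Za-z0-9]+)(?:\[(?P<layer>[a-z0-9]+)\])?")


def split_feature_name(feature_name):
    """Split a feature name into (base name, layer).

    The layer is None when the feature appears without an identifier,
    which corresponds to the default layer.
    """
    match = FEATURE_NAME.fullmatch(feature_name)
    if match is None:
        raise ValueError(f"malformed feature name: {feature_name!r}")
    return match.group("name"), match.group("layer")
```

Applied to the example above, Gender maps to the default layer (no identifier) and Gender[psor] maps to the layer psor, so both values can coexist on one word.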