
This page pertains to UD version 2.

Morphology: General Principles

The UD scheme allows the specification of a complete morpho-syntactic representation that can be applied cross-linguistically. This effectively means that grammatical notions may be indicated via word forms (morphologically) or via dependency relations (syntactically). The morphological specification of a (syntactic) word in the UD scheme consists of three levels of representation: a lemma, a part-of-speech tag, and a set of features (grammatical properties).

Lemmas are typically determined by language-specific dictionaries and lexica. In contrast, the part-of-speech tags and grammatical properties are taken from two universal inventories defined below.

Unlike various language-specific tagsets, the universal tags and features do not include means to mark fusion words (a word that is the result of merging two other words, which are syntactically independent and belong to different parts of speech): Czech dělals (dělal + jsi … main verb + auxiliary); proň (pro + něj … preposition + pronoun); German zum (zu + dem … preposition + article); Spanish dámelo (da + me + lo … verb + clitics) etc. The only truly general approach to fusion words is to apply a language-specific processing step that splits tokens into syntactic words where necessary. Every syntactic word then gets its own part-of-speech tag and features. See also Tokenization and Format.
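The splitting step can be sketched as a lookup pass over surface tokens. This is only an illustration: the table `FUSIONS` and the function `split_fusions` are hypothetical names, the entries are taken from the examples above, and a real treebank would need language-specific, usually context-sensitive rules.

```python
# Hypothetical hand-written table: surface token -> syntactic words.
# Entries come from the examples in the text; a real system would be
# language-specific and would often need context to decide.
FUSIONS = {
    "zum": ["zu", "dem"],        # German: preposition + article
    "proň": ["pro", "něj"],      # Czech: preposition + pronoun
    "dělals": ["dělal", "jsi"],  # Czech: main verb + auxiliary
}


def split_fusions(tokens):
    """Return (surface_token, syntactic_words) pairs for each input token."""
    result = []
    for token in tokens:
        # Tokens not in the table are kept as a single syntactic word.
        words = FUSIONS.get(token.lower(), [token])
        result.append((token, words))
    return result


for surface, words in split_fusions(["Er", "geht", "zum", "Arzt"]):
    print(surface, "->", words)
```

Each syntactic word returned here would then receive its own lemma, tag and features, while the surface token is kept so that the original text can be reconstructed.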


Lemmas

The LEMMA field should contain the canonical or base form of the word, such as the form typically found in dictionaries. If the lemma is not available, an underscore (“_”) can be used to indicate its absence.

At present, treebanks have considerable leeway in interpreting what “canonical or base form” means. In general, a canonical form should collapse inflectional and minor orthographic/spelling variation (such as casing, accents/diacritics, and typos). In the lemma field, some treebanks may choose to aggressively normalize spelling variation that may reflect dialect or authorial style.

In addition to normalizing spelling in lemmas, treebanks are encouraged to adopt the optional morphological feature Typo=Yes for clear accidental misspellings of a word (e.g. ltake for take or too for to). Typos of words in closed-class categories can be found in a corpus by inspecting word frequencies in each category. Treebank maintainers should take care not to use Typo=Yes for words that may reflect actual linguistic variation, e.g., dialect, style, or nonnative grammar.
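The frequency inspection mentioned above could look roughly like the following sketch. The set `CLOSED_CLASS`, the function name, the threshold and the sample data are all assumptions made for illustration; the output is only a list of candidates that still needs manual review, since a rare form may be genuine variation rather than a typo.

```python
from collections import Counter, defaultdict

# Assumption: an illustrative choice of closed-class UPOS tags.
CLOSED_CLASS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ"}


def rare_closed_class_forms(tagged_words, max_count=1):
    """Return, per closed-class tag, the forms seen at most max_count times."""
    counts = defaultdict(Counter)
    for form, upos in tagged_words:
        if upos in CLOSED_CLASS:
            counts[upos][form.lower()] += 1
    return {
        upos: sorted(f for f, n in forms.items() if n <= max_count)
        for upos, forms in counts.items()
    }


# Made-up sample of (form, UPOS) pairs, not taken from any treebank.
sample = [
    ("to", "PART"), ("to", "PART"), ("too", "PART"),
    ("the", "DET"), ("the", "DET"), ("teh", "DET"),
    ("dog", "NOUN"),
]
print(rare_closed_class_forms(sample))
```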

Abbreviated/shortened forms can be mapped to their full spelling as the lemma in conjunction with the feature Abbr=Yes, provided that the full spelling is a single word. For abbreviations that would expand to multiple words, the lemma should retain the abbreviated form.

On occasion, a typo or abbreviation will affect an inflected word (e.g. hadd for had); in that case the lemma should both normalize the spelling and remove the inflection. Treebanks may wish to use the MISC field to store the normalized but not lemmatized form.

(There is currently no UD-wide policy for lemmas of apparently erroneous extra words, missing words, or incorrectly segmented words.)

The LEMMA field should not be used to encode features or other similar properties of the word (use FEATS and MISC instead; see format).

Some corpora use numerical specifiers to distinguish homonymous lemmas, different word senses etc. (e.g. [en] can-1 vs. can-2). In UD, such specifiers must not appear in the LEMMA field because they are not part of the canonical surface form. If unique lemma identifiers are available, they can be preserved in the MISC column in the optional LId attribute (LId=can-1).
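A conversion from a source corpus might move such specifiers out of the lemma along these lines. The sketch assumes that specifiers always have the shape `<lemma>-<digits>` as in can-1; the function name `split_lemma_id` is hypothetical.

```python
import re

# Assumption: specifiers look like "<lemma>-<digits>", as in can-1.
_SPECIFIER = re.compile(r"^(?P<lemma>.+)-(?P<num>\d+)$")


def split_lemma_id(raw_lemma):
    """Return (lemma, misc); misc is 'LId=...' or '_' when there is no specifier."""
    match = _SPECIFIER.match(raw_lemma)
    if match is None:
        return raw_lemma, "_"
    # The full original identifier is preserved in the LId attribute.
    return match.group("lemma"), f"LId={raw_lemma}"


print(split_lemma_id("can-1"))  # ('can', 'LId=can-1')
print(split_lemma_id("take"))   # ('take', '_')
```

A pattern this simple would also strip a legitimate trailing number from a hyphenated lemma, so a real conversion should only be applied to corpora known to use this convention.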

Part-of-Speech Tags

The list of universal POS tags is a fixed list containing 17 tags.
It is possible that some tags will not be used in some languages. However, the list cannot be extended with language-specific tags. Instead, a more fine-grained classification of words can be achieved via features (see below).

Also, note that the CoNLL-U format allows an additional POSTAG, taken from a language-specific (or corpus-specific) tagset. Such language-specific POSTAGs have their own data column and are not mixed with the universal POS tags.

The universal POS tags consist of uppercase English letters [A-Z] only. Just one tag per word is expected, and it should not be empty. (Use the X tag instead of underscore if no other tag is appropriate.)
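These constraints are easy to check mechanically. The sketch below hard-codes the 17 tags as they appear in the UD v2 tag inventory; the names `UPOS_TAGS` and `is_valid_upos` are hypothetical, and the list should be checked against the tag documentation rather than taken from here.

```python
# The 17 universal POS tags of UD v2 (verify against the tag inventory page).
UPOS_TAGS = frozenset({
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
})


def is_valid_upos(tag):
    """True only for exactly one known uppercase tag; '_' and '' are rejected."""
    return tag in UPOS_TAGS


print(is_valid_upos("INTJ"))  # True
print(is_valid_upos("_"))     # False
```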

Using a word vs. mentioning it

The universal POS tags should capture regular, prevailing syntactic behavior, as well as morphological characteristics when available, and should not reflect sentence-specific exceptional behavior. In particular, the POS tags do not distinguish actual usage of a word from just mentioning it. Thus in both the following examples, yes will be tagged as interjection:

Similarly, in both the following examples, precede will be tagged as verb:

Pronominal words

Pronominal words are pronouns, determiners (articles and pronominal adjectives), pronominal adverbs (where, when, how), and in traditional grammars of some languages also pronominal numerals (how much).

See also

The guidelines for the following special cases are documented on the referenced pages for specific POS tags:


Features

Features are additional pieces of information about the word, its part of speech and morphosyntactic properties. Every feature has the form Name=Value and every word can have any number of features, separated by the vertical bar, as in Gender=Masc|Number=Sing.
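As a minimal sketch, a feature string in this format can be turned into a dictionary as follows. The function name `parse_feats` is hypothetical, and treating an underscore as "no features" is an assumption borrowed from the CoNLL-U convention for empty fields.

```python
def parse_feats(feats):
    """Parse 'Name=Value|Name=Value' into a dict.

    Assumption: '_' (or an empty string) means the word has no features.
    """
    if feats in ("_", ""):
        return {}
    parsed = {}
    for item in feats.split("|"):
        name, sep, value = item.partition("=")
        if not sep or not name or not value:
            raise ValueError(f"malformed feature: {item!r}")
        parsed[name] = value
    return parsed


print(parse_feats("Gender=Masc|Number=Sing"))
# {'Gender': 'Masc', 'Number': 'Sing'}
```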

We provide an inventory of features that are attested in multiple corpora and should therefore be encoded in a uniform way. The list is certainly not exhaustive and later versions of the standard may include new features or values found in new languages, corpora or tagsets.

Users can extend this set of universal features and add language-specific features when necessary. Such features should be described in the language-specific documentation and follow the general principles outlined here. Universal and language-specific features of a word are listed together in the FEATS column.

Lexical Features

All of these can be considered attributes of lexemes or lemmas (rather than individual word forms) and they represent a fine-grained sub-classification of words.

Inflectional Features

These are mostly features of word forms rather than lemmas. There are exceptions: for instance, gender of nouns is usually a lexical feature (all word forms of one lemma have the same gender). However, other parts of speech (adjectives, pronouns, verbs) may inflect for gender because of agreement with nouns.

Layered Features

In some languages, some features are marked more than once on the same word. We say that there are several layers of the feature. The exact meaning of individual layers is language-dependent.

For example, possessive adjectives, determiners and pronouns may have two different values of Gender and two of Number. One of the values is determined by agreement with the modified (possessed) noun. This is parallel to other (non-possessive) adjectives and determiners that agree in gender and number with the nouns they modify. The other value is determined lexically because it is a property of the possessor.

For detailed examples of layered features, see Layered Features.

If a feature is (can be) layered in a language, the name of the feature must indicate the layer. An additional identifier in square brackets is used to distinguish layers, e.g. Gender[psor] for the possessor’s gender. We recommend that the layer identifiers consist of lowercase English letters [a-z] and/or digits [0-9]. The layers, their meaning and their identifiers must be defined in a language-specific extension to this documentation. For each layered feature, one layer may be defined as default and the corresponding features then appear without identifier, e.g. Gender=Masc|Gender[psor]=Fem.
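The naming convention can be illustrated with a small parser that separates the base feature name from its layer identifier. The function names and the regular expression are assumptions made for this sketch. Note that the sketch rejects layer identifiers outside [a-z0-9], although the text above only recommends that restriction.

```python
import re

# Base name, optionally followed by a layer identifier in square brackets.
_FEATURE_NAME = re.compile(
    r"^(?P<name>[A-Za-z0-9]+)(?:\[(?P<layer>[a-z0-9]+)\])?$"
)


def split_layer(feature_name):
    """Return (base_name, layer); layer is None for the default layer."""
    match = _FEATURE_NAME.match(feature_name)
    if match is None:
        raise ValueError(f"unexpected feature name: {feature_name!r}")
    return match.group("name"), match.group("layer")


def group_layers(feats):
    """Group a FEATS string by base feature, then by layer identifier."""
    layers = {}
    for item in feats.split("|"):
        name, _, value = item.partition("=")
        base, layer = split_layer(name)
        layers.setdefault(base, {})[layer] = value
    return layers


print(group_layers("Gender=Masc|Gender[psor]=Fem"))
# {'Gender': {None: 'Masc', 'psor': 'Fem'}}
```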