
This page pertains to UD version 2.

Morphology: General Principles

UD specifies a complete morpho-syntactic representation that can be applied cross-linguistically. This effectively means that grammatical notions may be indicated via word forms (morphologically) or via dependency relations (syntactically). The morphological specification of a (syntactic) word in the UD scheme consists of three levels of representation: a lemma, a part-of-speech tag, and a set of features describing grammatical properties.

Lemmas are typically determined by language-specific dictionaries and lexica. In contrast, the part-of-speech tags and grammatical properties are taken from two universal inventories defined below.

Unlike various language-specific tagsets, the universal tags and features do not include means to mark fused words (a word that is the result of merging two other words, which are syntactically independent and belong to different parts of speech): Czech dělals (dělal + jsi … main verb + auxiliary); proň (pro + něj … preposition + pronoun); German zum (zu + dem … preposition + article); Spanish dámelo (da + me + lo … verb + clitics) etc. The only truly general approach to fused words in UD is to exploit the distinction between tokens and (syntactic) words, and to apply a language-specific processing step that splits tokens into syntactic words where necessary. Every syntactic word then gets its own part-of-speech tag and features. See also Tokenization and Format.
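The token/word distinction can be illustrated with a minimal Python sketch. It assumes a hypothetical lookup table (FUSED) that maps a surface token to its syntactic words; real treebanks use language-specific tools for this step. The function name split_tokens is likewise an illustration only. The output mimics the ID and FORM columns of CoNLL-U, where a fused token is listed on a range line (e.g. 3-4) followed by its syntactic words.

```python
# Hypothetical lookup table (an assumption for illustration, not part of UD):
# surface token -> syntactic words it is split into.
FUSED = {
    "zum": ["zu", "dem"],    # German: preposition + article
    "proň": ["pro", "něj"],  # Czech: preposition + pronoun
}


def split_tokens(tokens):
    """Return (ID, FORM) rows; a fused token gets a range row plus one row per word."""
    rows = []
    next_id = 1
    for token in tokens:
        parts = FUSED.get(token.lower())
        if parts is None:
            # Ordinary token: one token, one syntactic word.
            rows.append((str(next_id), token))
            next_id += 1
        else:
            # Fused token: range row for the surface token, then the words.
            last_id = next_id + len(parts) - 1
            rows.append((f"{next_id}-{last_id}", token))
            for part in parts:
                rows.append((str(next_id), part))
                next_id += 1
    return rows


for word_id, form in split_tokens(["Er", "geht", "zum", "Arzt"]):
    print(word_id, form, sep="\t")
```

Only the rows with a plain integer ID (zu, dem) would carry a part-of-speech tag and features; the range row records the surface token.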

Lemmas

The LEMMA field should contain the canonical or base form of the word, which is the form typically found in dictionaries. If a language is agglutinative, this is typically the form with no inflectional affixes; in fusional languages, the lemma is usually the result of a language-particular convention. If the lemma is not available, an underscore (“_”) can be used to indicate its absence.

At present, treebanks have considerable leeway in interpreting what “canonical or base form” means. Except perhaps in rare cases of suppletion, one form should be chosen as the lemma of a verb, noun, determiner, or pronoun paradigm. The lemma of adjectives and adverbs should be the positive form (in languages with comparative and superlative forms). The lemma does not remove derivational morphology, so the lemma of [en] organizations is organization, not organize (nor organ). In general, a canonical form should collapse inflectional and minor orthographic/spelling variation (such as casing, accents/diacritics, and typos). In the lemma field, some treebanks may choose to aggressively normalize spelling variation that may reflect dialect or authorial style.

Abbreviated/shortened forms can be mapped to their full spelling as the lemma in conjunction with the feature Abbr=Yes, provided that the full spelling is a single word. Abbreviations that would expand to multiple words should be retained in the lemma.

The LEMMA field should not be used to encode features or other similar properties of the word (use FEATS and MISC instead; see format).

Some corpora use numerical specifiers to distinguish homonymous lemmas, different word senses etc. (e.g. [en] can-1 vs. can-2). In UD, such specifiers must not appear in the LEMMA field because they are not part of the canonical surface form. If unique lemma identifiers are available, they can be preserved in the MISC column in the optional LId attribute (LId=can-1).
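As a rough sketch of how a conversion script might move such specifiers out of the lemma, the following Python function separates a trailing numeric specifier and produces an LId attribute for MISC. The function name split_lemma_id and the assumption that specifiers always take the form "-<digits>" are illustrative only; source corpora may use other conventions.

```python
import re

# Assumed specifier shape: a hyphen followed by digits at the end of the lemma.
_SPECIFIER = re.compile(r"^(?P<base>.+)-(?P<num>\d+)$")


def split_lemma_id(raw_lemma):
    """Return (LEMMA, MISC) for a source lemma that may carry a numeric specifier."""
    match = _SPECIFIER.match(raw_lemma)
    if match is None:
        # No specifier: keep the lemma, leave MISC empty ("_").
        return raw_lemma, "_"
    # Keep the bare lemma; preserve the full identifier as LId in MISC.
    return match.group("base"), f"LId={raw_lemma}"


print(split_lemma_id("can-1"))  # ('can', 'LId=can-1')
print(split_lemma_id("book"))   # ('book', '_')
```

A pattern this simple would also strip genuine lemmas that happen to end in a hyphen and digits, so a real converter would need a corpus-specific check before applying it.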

Lemmas of misspelled words

Details at: Typos and Other Errors

In addition to normalizing spelling in lemmas, treebanks are encouraged to adopt the optional morphological feature Typo=Yes for clear accidental misspellings of a word (e.g. ltake for take or too for to). Typos of words in closed-class categories can be found in a corpus by inspecting word frequencies in each category. Treebank maintainers should take care not to use Typo=Yes for words that may reflect actual linguistic variation, e.g., dialect, style, or nonnative grammar.

On occasion, a typo or abbreviation will apply to an inflected word (e.g. hadd for had), and thus the lemma should both normalize the spelling and remove the inflection.

For incorrectly split words, the first segment should be treated as morphologically representing the entire word, so it should have the lemma for the entire word, as described at Typos and Other Errors.

Part-of-Speech Tags

The list of universal POS tags is a fixed list containing 17 tags. It is possible that some tags will not be used in some languages. However, the list cannot be extended to cover language-specific extensions. Instead, more fine-grained classification of words can be achieved via the use of features (see below).

Also, note that the CoNLL-U format allows an additional XPOS, taken from a language-specific (or corpus-specific) tagset. Such language-specific XPOSes have their own data column and are not mixed with the universal POS tags.

The universal POS tags consist of uppercase English letters [A-Z] only. Just one tag per word is expected, and it should not be empty. (Use the X tag instead of underscore if no other tag is appropriate.)
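A simple validity check for the UPOS column might look like the following Python sketch. The set lists the 17 universal tags; the helper name is_valid_upos is an assumption made for this example and is not part of any official validator.

```python
# The 17 universal POS tags.
UPOS_TAGS = frozenset({
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
})


def is_valid_upos(tag):
    """True if tag is exactly one of the universal tags (so never empty or '_')."""
    return tag in UPOS_TAGS


print(is_valid_upos("NOUN"))  # True
print(is_valid_upos("_"))     # False: use X when no other tag is appropriate
print(is_valid_upos("noun"))  # False: tags are uppercase only
```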

Tagging principles

A word’s category should be primarily determined by prototypical (expected) syntactic behavior, as typically recorded in a dictionary, rather than by the context of a particular sentence. Syntax still plays an important role, especially in cross-linguistic mapping of same-named categories. However, prototypical (expected) syntactic behavior is of more importance than function performed in exceptional contexts.

Morphological behavior may be a good indicator in some languages. If, for example, a language uses distinct inflection patterns for nouns and adjectives, then morphology can be used to distinguish these two categories. Exceptions cannot be excluded but they should be really exceptional and well grounded; when in doubt, use the category determined by morphology (if available).

Ambiguous words (belonging to two or more categories) do exist. Sometimes by pure coincidence ([en] the can vs. can = to be able to). Sometimes the two words are related but differ morphologically ([en] the book(s) vs. to book, booked, booking).

Perhaps the most difficult cases are ambiguous function words that do not inflect (i.e. morphology does not help us), yet perform two or more significantly different syntactic functions, which we normally associate with different parts of speech. The two functions may not be equally frequent, but each of them is more frequent than what could be labeled as a mere exception (by contrast, the wait for his ‘yes’ example below is exceptional). Disambiguating such pairs clearly depends on the context of the given sentence where the word is used. So how do we know that the difference is “significant enough”? One clue is that the word, when translated to another language, gets two different translations with different POS tags (e.g., the English no as response interjection vs. negative determiner). Another clue comes from contrasting the UD relations used for the two functions. For example, distinguishing PRON from SCONJ ([en] that, [es] que, [ru] что / čto) is important because pronouns, unlike conjunctions, may become core arguments and fill valency slots of verbs. Distinguishing SCONJ from ADP, or CCONJ from ADV, seems less crucial, and we can keep just one POS tag for each such word, based on prototypical usage.

Using a word vs. mentioning it

The universal POS tags should capture regular, prevailing syntactic behavior, as well as morphological characteristics when available, and should not reflect sentence-specific exceptional behavior. In particular, the POS tags do not distinguish actual usage of a word from just mentioning it. Thus in both the following examples, yes will be tagged as interjection:

Similarly, in both the following examples, precede will be tagged as verb:

Pronominal words

Pronominal words are pronouns, determiners (articles and pronominal adjectives), pronominal adverbs (where, when, how), and in traditional grammars of some languages also pronominal numerals (how much).

See also

The guidelines for the following special cases are documented on the referenced pages for specific POS tags:

Features

Features are additional pieces of information about the word, its part of speech and morphosyntactic properties. Every feature has the form Name=Value and every word can have any number of features, separated by the vertical bar, as in Gender=Masc|Number=Sing.
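The Name=Value format is easy to process mechanically. The Python sketch below parses a FEATS string into a dictionary and serializes it back. Two details are assumptions not stated on this page: an underscore stands for an empty feature set, and features are written in case-insensitive alphabetical order of their names (both follow the CoNLL-U format). The function names are hypothetical.

```python
def parse_feats(feats):
    """Parse 'Name=Value|Name=Value' into a dict; '_' (assumed empty marker) gives {}."""
    if feats == "_":
        return {}
    result = {}
    for pair in feats.split("|"):
        # Split on the first '=' only, so the value is kept intact.
        name, value = pair.split("=", 1)
        result[name] = value
    return result


def format_feats(features):
    """Serialize a dict back to a FEATS string, sorted by name (case-insensitive)."""
    if not features:
        return "_"
    ordered = sorted(features.items(), key=lambda item: item[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)


print(parse_feats("Gender=Masc|Number=Sing"))
print(format_feats({"Number": "Sing", "Gender": "Masc"}))
```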

As with part-of-speech tags, features describe the word form but not necessarily its exact function in the given sentence. Most features locate the form in a slot of a morphological paradigm and serve as canonical labels for that slot. Thus, for example, the Voice feature is used in Czech to distinguish passive participles (prodán “sold” Voice=Pass), which are morphologically distinct from active participles (prodal “sold” Voice=Act); however, the feature is not used with participles in German, where the same form is used in active and passive constructions alike (cf. Er hat es verkauft “He has sold it” vs. Es wurde verkauft “It was sold”). On the other hand, some word forms are homonymous and context must be used to identify the paradigm slot to which they belong. For example, the morphological paradigm of Czech nouns distinguishes nominative and accusative (among other Case values), as in matka “mother” Case=Nom vs. matku Case=Acc. Nevertheless, due to case syncretism, some other lexemes have the same form in these two paradigm slots, e.g. píseň “song” is either Case=Nom or Case=Acc and it has to be disambiguated by context.

We provide an inventory of features that are attested in multiple corpora and it is thus desirable that they are encoded in a uniform way. The list is certainly not exhaustive and later versions of the standard may include new features or values found in new languages, corpora or tagsets.

Users can extend this set of universal features and add language-specific features when necessary. Such features should be described in the language-specific documentation and follow the general principles outlined here. Universal and language-specific features of a word are listed together in the FEATS column.

Lexical Features

All of these can be considered attributes of lexemes or lemmas (rather than individual word forms) and they represent a fine-grained sub-classification of words.

Inflectional Features

These are mostly features of word forms rather than lemmas. There are exceptions: for instance, gender of nouns is usually a lexical feature (all word forms of one lemma have the same gender). However, other parts of speech (adjectives, pronouns, verbs) may inflect for gender because of agreement with nouns.

Layered Features

In some languages, some features are marked more than once on the same word. We say that there are several layers of the feature. The exact meaning of individual layers is language-dependent.

For example, possessive adjectives, determiners and pronouns may have two different values of Gender and two of Number. One of the values is determined by agreement with the modified (possessed) noun. This is parallel to other (non-possessive) adjectives and determiners that agree in gender and number with the nouns they modify. The other value is determined lexically because it is a property of the possessor.

For detailed examples of layered features, see Layered Features.

If a feature is (can be) layered in a language, the name of the feature must indicate the layer. An additional identifier in square brackets is used to distinguish layers, e.g. Gender[psor] for the possessor’s gender. We recommend that the layer identifiers consist of lowercase English letters [a-z] and/or digits [0-9]. The layers, their meaning and their identifiers must be defined in a language-specific extension to this documentation. For each layered feature, one layer may be defined as default and the corresponding features then appear without identifier, e.g. Gender=Masc|Gender[psor]=Fem.
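The layer notation can be separated from the feature name with a small pattern, as in this Python sketch. The layer part follows the recommendation above (lowercase letters and digits); the pattern used for the feature name itself and the helper name split_layer are assumptions made for illustration.

```python
import re

# Feature name (assumed: letters and digits), optionally followed by [layer],
# where the layer identifier uses lowercase letters and/or digits.
_FEATURE_NAME = re.compile(r"^(?P<name>[A-Za-z0-9]+)(?:\[(?P<layer>[a-z0-9]+)\])?$")


def split_layer(feature_name):
    """Return (name, layer); layer is None for the default (unmarked) layer."""
    match = _FEATURE_NAME.match(feature_name)
    if match is None:
        raise ValueError(f"not a well-formed feature name: {feature_name!r}")
    return match.group("name"), match.group("layer")


feats = "Gender=Masc|Gender[psor]=Fem"
for pair in feats.split("|"):
    name, value = pair.split("=", 1)
    base, layer = split_layer(name)
    print(base, layer, value)
```

In this example the first feature belongs to the default layer (reported as None) and the second to the psor layer, so both Gender values can coexist on one word.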