Morphology: General Principles
The UD scheme allows the specification of a complete morpho-syntactic representation that can be applied cross-linguistically. This effectively means that grammatical notions may be indicated via word forms (morphologically) or via dependency relations (syntactically). The morphological specification of a (syntactic) word in the UD scheme consists of three levels of representation:
- A lemma representing the semantic content of the word.
- A part-of-speech tag representing the abstract lexical category associated with the word.
- A set of features representing lexical and grammatical properties that are associated with the particular word form.
Lemmas are typically determined by language-specific dictionaries and lexica. In contrast, the part-of-speech tags and grammatical properties are taken from two universal inventories defined below.
Unlike various language-specific tagsets, the universal tags and features do not include means to mark fusion words (a word that results from merging two or more words that are syntactically independent and belong to different parts of speech): Czech dělals (dělal + jsi … main verb + auxiliary); proň (pro + něj … preposition + pronoun); German zum (zu + dem … preposition + article); Spanish dámelo (da + me + lo … verb + clitics) etc. The only truly general approach to fusion words is to apply a language-specific processing step that will split tokens into syntactic words where necessary. Every syntactic word will then get its own part-of-speech tag and features. See also Tokenization and Format.
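The splitting step described above could be sketched roughly as follows. This is only an illustration: `FUSION_TABLE`, `split_token` and `to_syntactic_words` are hypothetical names, the table contains just the examples mentioned in the text, and a real splitter would need a language-specific lexicon or rules rather than an exact-match lookup.

```python
# Hypothetical lookup table built from the examples in the text.
# A real system would rely on language-specific resources instead.
FUSION_TABLE = {
    "dělals": ["dělal", "jsi"],    # Czech: main verb + auxiliary
    "proň": ["pro", "něj"],        # Czech: preposition + pronoun
    "zum": ["zu", "dem"],          # German: preposition + article
    "dámelo": ["da", "me", "lo"],  # Spanish: verb + clitics
}


def split_token(token: str) -> list[str]:
    """Return the syntactic words of a token (exact-match lookup only)."""
    return list(FUSION_TABLE.get(token, [token]))


def to_syntactic_words(tokens: list[str]) -> list[str]:
    """Flatten a tokenized sentence into its syntactic words."""
    words: list[str] = []
    for token in tokens:
        words.extend(split_token(token))
    return words


print(to_syntactic_words(["Wir", "gehen", "zum", "Bahnhof"]))
```

Each element of the resulting list would then receive its own part-of-speech tag and features.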
Lemmas
The LEMMA field should contain the canonical or base form of the word, such as the form typically found in dictionaries.
If the lemma is not available, an underscore (“_”) can be used to indicate its absence.
The LEMMA field should not be used to encode features or other similar properties of the word (use FEATS and MISC instead; see format).
Part-of-Speech Tags
The universal POS tags form a fixed list of 17 tags. Some tags may not be used in some languages. However, the list cannot be extended with language-specific tags. Instead, a more fine-grained classification of words can be achieved via the use of features (see below).
Also, note that the CoNLL-U format allows an additional POSTAG, taken from a language-specific (or corpus-specific) tagset. Such language-specific POSTAGs have their own data column and are not mixed with the universal POS tags.
The universal POS tags consist of uppercase English letters [A-Z] only. Just one tag per word is expected, and it should not be empty. (Use the X tag instead of underscore if no other tag is appropriate.)
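The formal constraints above (uppercase letters only, exactly one tag, never empty or underscore) can be checked with a short sketch like the one below. The function name `is_wellformed_upos` is hypothetical, and the check covers only the shape of the tag; it does not test membership in the inventory of 17 tags.

```python
import re

# One or more uppercase English letters and nothing else.
UPOS_SHAPE = re.compile(r"[A-Z]+")


def is_wellformed_upos(tag: str) -> bool:
    """Check the shape of a universal POS tag (not its inventory membership)."""
    return UPOS_SHAPE.fullmatch(tag) is not None


for candidate in ["NOUN", "X", "_", "", "Noun", "NOUN|VERB"]:
    print(repr(candidate), is_wellformed_upos(candidate))
```

A full validator would additionally compare the tag against the fixed list of universal tags.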
Using a word vs. mentioning it
The universal POS tags focus more on what the word is than on which construction it is used in (the latter is specified by the dependency relation labels). In particular, the POS tags do not distinguish actual usage of a word from just mentioning it. Thus in both the following examples, yes will be tagged as interjection:
- Yes, I think so.
- I am waiting for his ‘yes’ on the matter.
Similarly, in both the following examples, precede will be tagged as verb:
- Such discussion must precede every decision.
- He pronounced ‘precede’ in a funny way.
See also
The guidelines for the following cases are documented on the referenced pages for specific POS tags:
- Abbreviations and acronyms: described under SYM
Features
Features are additional pieces of information about the word, its part of speech and morphosyntactic properties. Every feature has the form Name=Value and every word can have any number of features, separated by the vertical bar, as in Gender=Masc|Number=Sing.
We provide an inventory of features that are attested in multiple corpora and it is thus desirable that they are encoded in a uniform way. The list is certainly not exhaustive and later versions of the standard may include new features or values found in new languages, corpora or tagsets.
Users can extend this set of universal features and add language-specific features when necessary. Such features should be described in the language-specific documentation and follow the general principles outlined here. Universal and language-specific features of a word are listed together in the FEATS column.
- There are two types of identifiers:
  - feature names = features
  - feature values = values
- All identifiers (both features and values) consist of English letters or, occasionally, digits 0-9. The first letter is always uppercase. The other letters are generally lowercase, except for positions where new internal words are marked for better readability (e.g. NumType). This makes features distinct from the universal POS tags (all uppercase) and from the universal dependency relations (all lowercase).
- A feature of a word should always be fully specified in the data, i.e. both the feature name and the value should be identified: PronType=Prs. Note that the values are not guaranteed to be unique across features, e.g. Sup could denote the superessive case, superlative degree of comparison or supine (a verb form).
- Not mentioning a feature in the data implies the empty value, which means that the feature is either irrelevant for this part of speech, or its value cannot be determined for this word form due to language-specific reasons.
- It is possible to declare that a feature has two or more values for a given word: Case=Acc,Dat. The interpretation is that the word may have one of these values but we cannot decide between them. Such multivalues should be used sparingly. They should not be used if the value list would cover the whole value space, or the subspace valid for the given language. That would mean that we cannot tell anything about this feature for the given word, and then it is preferable to just leave the feature out.
- Canonical ordering: features of one word (appearing on the same line) are always ordered alphabetically; if a feature has multiple values, these are ordered alphabetically, too. This rule facilitates cases when it is necessary to compare feature sets of two words. Alphabetical sorting means that uppercase letters are considered identical to their lowercase counterparts. So for example, Number precedes NumType.
- Descriptions of individual features usually hint at which parts of speech the feature is likely to appear with. This information is intended to help understand the typical usage of the feature; however, it is not a strict rule! Applicability of features to parts of speech is very language-dependent and it should never be assumed that the feature cannot appear together with a particular POS tag.
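The Name=Value syntax, multivalues and canonical ordering described in the list above can be illustrated with a small sketch. The function names `parse_feats` and `format_feats` are hypothetical, and treating an underscore as "no features" is an assumption about the data format rather than something stated in this section.

```python
def parse_feats(feats: str) -> dict[str, list[str]]:
    """Parse a FEATS string into a mapping from feature name to its values."""
    if feats in ("", "_"):  # assumption: underscore means no features
        return {}
    parsed: dict[str, list[str]] = {}
    for pair in feats.split("|"):
        name, sep, value = pair.partition("=")
        if not sep or not name or not value:
            raise ValueError(f"feature is not fully specified: {pair!r}")
        parsed[name] = value.split(",")  # multivalues such as Case=Acc,Dat
    return parsed


def format_feats(feats: dict[str, list[str]]) -> str:
    """Serialize features in canonical (case-insensitive alphabetical) order."""
    if not feats:
        return "_"
    ordered = sorted(feats.items(), key=lambda item: item[0].lower())
    return "|".join(
        f"{name}={','.join(sorted(values, key=str.lower))}"
        for name, values in ordered
    )


print(format_feats(parse_feats("NumType=Card|Number=Sing|Case=Dat,Acc")))
# Case=Acc,Dat|Number=Sing|NumType=Card
```

Sorting on the lowercased name is what places Number before NumType; a plain case-sensitive sort would put NumType first, because uppercase T sorts before lowercase b.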
Lexical Features
All of these can be considered attributes of lexemes or lemmas (rather than individual word forms) and they represent a fine-grained sub-classification of words.
Inflectional Features
These are mostly features of word forms rather than lemmas. There are exceptions: for instance, gender of nouns is usually a lexical feature (all word forms of one lemma have the same gender). However, other parts of speech (adjectives, pronouns, verbs) may inflect for gender because of agreement with nouns.
Layered Features
In some languages, some features are marked more than once on the same word. We say that there are several layers of the feature. The exact meaning of individual layers is language-dependent.
For example, possessive adjectives, determiners and pronouns may have two different values of Gender and two of Number. One of the values is determined by agreement with the modified (possessed) noun. This is parallel to other (non-possessive) adjectives and determiners that agree in gender and number with the nouns they modify. The other value is determined lexically because it is a property of the possessor.
For detailed examples of layered features, see Layered Features.
If a feature is (can be) layered in a language, the name of the feature must indicate the layer. An additional identifier in square brackets is used to distinguish layers, e.g. Gender[psor] for the possessor’s gender. We recommend that the layer identifiers consist of lowercase English letters [a-z] and/or digits [0-9].
The layers, their meaning and their identifiers must be defined in a language-specific extension to this documentation. For each layered feature, one layer may be defined as default and the corresponding features then appear without identifier, e.g. Gender=Masc|Gender[psor]=Fem.
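A feature name with an optional layer identifier could be taken apart as in the sketch below. The function name `split_layer` is hypothetical, and the pattern encodes the recommended shape of layer identifiers (lowercase letters and digits) as if it were a strict rule, which is an assumption of this example.

```python
import re
from typing import Optional

# Feature name (uppercase first letter), optionally followed by [layer].
LAYERED_NAME = re.compile(r"([A-Z][A-Za-z0-9]*)(?:\[([a-z0-9]+)\])?")


def split_layer(feature_name: str) -> tuple[str, Optional[str]]:
    """Split a feature name into (base name, layer); layer is None for the default layer."""
    match = LAYERED_NAME.fullmatch(feature_name)
    if match is None:
        raise ValueError(f"unexpected feature name: {feature_name!r}")
    return match.group(1), match.group(2)


print(split_layer("Gender"))
print(split_layer("Gender[psor]"))
```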