Morphology: General Principles
UD specifies a complete morpho-syntactic representation that can be applied cross-linguistically. This effectively means that grammatical notions may be indicated via word forms (morphologically) or via dependency relations (syntactically). The morphological specification of a (syntactic) word in the UD scheme consists of three levels of representation:
- A lemma representing the semantic content of the word.
- A part-of-speech tag representing the abstract lexical category associated with the word.
- A set of features representing lexical and grammatical properties that are associated with the particular word form.
Lemmas are typically determined by language-specific dictionaries and lexica. In contrast, the part-of-speech tags and grammatical properties are taken from two universal inventories defined below.
Unlike in various language-specific tagsets, the universal tags and features do not include means to mark fused words (a word that is the result of merging two other words, which are syntactically independent and belong to different parts of speech): Czech dělals (dělal + jsi … main verb + auxiliary); proň (pro + něj … preposition + pronoun); German zum (zu + dem … preposition + article); Spanish dámelo (da + me + lo … verb + clitics) etc. The only truly general approach to fused words in UD is to exploit the distinction between tokens and (syntactic) words, and to apply a language-specific processing step that splits tokens into syntactic words where necessary. Every syntactic word then gets its own part-of-speech tag and features. See also Tokenization and Format.
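As a rough illustration of the token/word distinction, the following sketch splits surface tokens into syntactic words using a small lookup table. The table `FUSED_WORDS` and the function `split_into_syntactic_words` are hypothetical names introduced here for illustration only; a real treebank would rely on a language-specific tokenizer or lexicon rather than a hand-written dictionary.

```python
# Hypothetical lookup table: surface token -> list of syntactic words.
# The entries are the examples from the text above; this is an assumption-laden
# sketch, not the splitting procedure of any particular UD treebank.
FUSED_WORDS = {
    "zum": ["zu", "dem"],          # German: preposition + article
    "proň": ["pro", "něj"],        # Czech: preposition + pronoun
    "dámelo": ["da", "me", "lo"],  # Spanish: verb + clitics
}


def split_into_syntactic_words(tokens):
    """Return (token, words) pairs; tokens not in the table map to themselves."""
    return [(tok, FUSED_WORDS.get(tok.lower(), [tok])) for tok in tokens]


for token, words in split_into_syntactic_words(["Er", "geht", "zum", "Arzt"]):
    print(token, "->", words)
```

Each syntactic word in the right-hand lists would then receive its own part-of-speech tag and features, while the original token is kept so that the surface text can be recovered.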
Lemmas
The LEMMA field should contain the canonical or base form of the word, which is the form typically found in dictionaries. If a language is agglutinative, this is typically the form with no inflectional affixes; in fusional languages, the lemma is usually the result of a language-particular convention. If the lemma is not available, an underscore (“_”) can be used to indicate its absence.
At present, treebanks have considerable leeway in interpreting what “canonical or base form” means. Except perhaps in rare cases of suppletion, one form should be chosen as the lemma of a verb, noun, determiner, or pronoun paradigm. The lemma of adjectives and adverbs should be the positive form (in languages with comparative and superlative forms). The lemma does not remove derivational morphology, so the lemma of [en] organizations is organization, not organize (nor organ). In general, a canonical form should collapse inflectional and minor orthographic/spelling variation (such as casing, accents/diacritics, and typos). In the lemma field, some treebanks may choose to aggressively normalize spelling variation that may reflect dialect or authorial style.
Abbreviated/shortened forms can be mapped to their full spelling as the lemma in conjunction with the feature Abbr=Yes, provided that the full spelling is a single word. Abbreviations that would expand to multiple words should be retained in the lemma.
The LEMMA field should not be used to encode features or other similar properties of the word (use FEATS and MISC instead; see format).
Some corpora use numerical specifiers to distinguish homonymous lemmas, different word senses etc. (e.g. [en] can-1 vs. can-2). In UD, such specifiers must not appear in the LEMMA field because they are not part of the canonical surface form. If unique lemma identifiers are available, they can be preserved in the MISC column in the optional LId attribute (LId=can-1).
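A conversion step along these lines could be sketched as follows. The function name `move_specifier_to_misc` is hypothetical, and the sketch assumes that the source corpus marks specifiers with a trailing hyphen and digits, as in the can-1 example; other corpora may use different conventions.

```python
import re

# Assumption: the source corpus marks homonym indices as a trailing "-<digits>".
SPECIFIER = re.compile(r"^(?P<base>.+)-(?P<index>\d+)$")


def move_specifier_to_misc(lemma, misc="_"):
    """Strip a numeric specifier from the lemma and record it as LId in MISC.

    Returns the (lemma, misc) pair unchanged if no specifier is found.
    """
    match = SPECIFIER.match(lemma)
    if match is None:
        return lemma, misc
    # "_" denotes an empty MISC column; otherwise attributes are "|"-separated.
    attributes = [] if misc == "_" else misc.split("|")
    attributes.append(f"LId={lemma}")
    return match.group("base"), "|".join(attributes)


print(move_specifier_to_misc("can-1"))
print(move_specifier_to_misc("can-2", "SpaceAfter=No"))
```

Note that the pattern would also strip legitimate lemmas that happen to end in a hyphen and digits, so in practice it should only be applied to corpora known to use this specifier convention.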
Lemmas of misspelled words
Details at: Typos and Other Errors
In addition to normalizing spelling in lemmas, treebanks are encouraged to adopt the optional morphological feature Typo=Yes for clear accidental misspellings of a word (e.g. ltake for take or too for to).
Typos of words in closed-class categories can be found in a corpus by inspecting word frequencies in each category.
Treebank maintainers should take care not to use Typo=Yes for words that may reflect actual linguistic variation, e.g., dialect, style, or nonnative grammar.
On occasion, a typo or abbreviation will apply to an inflected word (e.g. hadd for had), and thus the lemma should both normalize the spelling and remove the inflection.
For incorrectly split words, the first segment should be treated as morphologically representing the entire word, so it should have the lemma for the entire word, as described at Typos and Other Errors.
Part-of-Speech Tags
The list of universal POS tags is a fixed list containing 17 tags. It is possible that some tags will not be used in some languages. However, the list cannot be extended to cover language-specific extensions. Instead, more fine-grained classification of words can be achieved via the use of features (see below).
Also, note that the CoNLL-U format allows an additional XPOS, taken from a language-specific (or corpus-specific) tagset. Such language-specific XPOSes have their own data column and are not mixed with the universal POS tags.
The universal POS tags consist of uppercase English letters [A-Z] only. Just one tag per word is expected, and it should not be empty. (Use the X tag instead of underscore if no other tag is appropriate.)
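The constraints above lend themselves to a simple check. The sketch below is illustrative: the name `is_valid_upos` is hypothetical, and the tag inventory is the standard UD v2 set of 17 tags, which this section refers to but does not enumerate.

```python
# The 17 universal POS tags of UD v2 (listed here as background knowledge;
# the surrounding text only states that there are 17 of them).
UPOS_TAGS = frozenset({
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
})


def is_valid_upos(tag):
    """True if tag is one of the universal POS tags; "_" and "" are rejected."""
    return tag in UPOS_TAGS


print(is_valid_upos("NOUN"))
print(is_valid_upos("_"))
```

Because the list is fixed, a membership test is sufficient; language-specific distinctions belong in XPOS or in features rather than in new tags.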
Tagging principles
A word’s category should be primarily determined by prototypical (expected) syntactic behavior, as typically recorded in a dictionary, rather than by the context of a particular sentence. Syntax still plays an important role, especially in cross-linguistic mapping of same-named categories. However, prototypical (expected) syntactic behavior is of more importance than function performed in exceptional contexts.
Morphological behavior may be a good indicator in some languages. If, for example, a language uses distinct inflection patterns for nouns and adjectives, then morphology can be used to distinguish these two categories. Exceptions cannot be excluded but they should be really exceptional and well grounded; when in doubt, use the category determined by morphology (if available).
Ambiguous words (belonging to two or more categories) do exist. Sometimes by pure coincidence ([en] the can vs. can = to be able to). Sometimes the two words are related but differ morphologically ([en] the book(s) vs. to book, booked, booking).
Perhaps the most difficult cases are ambiguous function words that do not inflect (i.e. morphology
does not help us), yet perform two or more significantly different syntactic functions,
which we normally associate with different parts of speech. The two functions may not be equally
frequent but each of them is more frequent than what could be labeled as a mere exception (i.e.,
the wait for his ‘yes’ example below is exceptional). Disambiguating such pairs clearly depends
on the context of the given sentence where the word is used.
So how do we know that the difference is “significant enough”? One clue is that the word, when
translated to another language, gets two different translations with different POS tags (e.g.,
the English no as response interjection, vs. negative determiner). Another clue comes from
contrasting the UD relations used for the two functions. For example, distinguishing PRON from SCONJ
([en] that, [es] que, [ru] что / čto) is important because pronouns, unlike
conjunctions, may become core arguments and fill valency slots of verbs. Distinguishing SCONJ from ADP,
or CCONJ from ADV, seems less crucial and we can keep just one POS tag
for each such word, based on prototypical usage.
Using a word vs. mentioning it
The universal POS tags should capture regular, prevailing syntactic behavior, as well as morphological characteristics when available, and should not reflect sentence-specific exceptional behavior. In particular, the POS tags do not distinguish actual usage of a word from just mentioning it. Thus in both the following examples, yes will be tagged as interjection:
- Yes, I think so.
- I am waiting for his ‘yes’ on the matter.
Similarly, in both the following examples, precede will be tagged as verb:
- Such discussion must precede every decision.
- He pronounced ‘precede’ in a funny way.
Pronominal words
Pronominal words are pronouns, determiners (articles and pronominal adjectives), pronominal adverbs (where, when, how), and in traditional grammars of some languages also pronominal numerals (how much).
- In most cases it is straightforward to determine whether a word is pronominal (see also the PronType feature) but the borderline between indefinite determiners and adjectives is slightly fuzzy. Related languages should synchronize the lists of words they treat as pronominal. The rest of these guidelines demarcate borders within the pronominal group.
- Pronominal adverbs are tagged ADV. Their pronominality is encoded using the PronType feature. Their typical syntactic function is to modify verbs.
- Articles (the, a, an) are always tagged DET; their PronType is Art.
- Pronominal numerals (quantifiers) are tagged DET; besides PronType, they also use the NumType feature.
- Words that behave similarly to adjectives are DET. (We understand the DET class as pro-adjectives, which is a slightly broader sense than what is usually regarded as determiners in English. In particular, it is possible that one nominal is modified by more than one determiner.) Similar behavior means:
  - They are more likely to be used attributively (modifying a noun phrase) than substantively (replacing a noun phrase). They may occur alone, though. If they do, it is either because of ellipsis, or because the hypothetical modified noun is something unspecified and general, as in All [visitors] must pay.
  - Their inflection is similar to that of adjectives, and distinct from nouns. They agree with the nouns they modify. Especially the ability to inflect for gender is typical for adjectives and determiners. (Gender of nouns is determined lexically and determiners may be required by the grammar to agree with their nouns in gender; therefore they need to inflect for gender.)
- Non-possessive personal, reflexive or reciprocal pronouns are always tagged PRON.
- Possessives vary across languages. In some languages the above tests put them in the DET category. In others, they are more like a normal personal pronoun in a specific case (often the genitive), or a personal pronoun with an adposition; they are tagged PRON.
- When the above rules do not help, the category should be based on what the traditional grammar of the language says.
- Ideally, language-specific documentation should list pronominal words and their category. These are all closed classes so it should not be difficult.
See also
The guidelines for the following special cases are documented on the referenced pages for specific POS tags:
- Abbreviations and acronyms: described under SYM
Features
Features are additional pieces of information about the word, its part of speech
and morphosyntactic properties. Every feature has the form Name=Value and
every word can have any number of features, separated by the vertical bar, as in
Gender=Masc|Number=Sing.
Analogously to part-of-speech tags, features describe the word form but not
necessarily its exact function in the given sentence. Most of the features are
for locating the form in a slot of a morphological paradigm, and are canonical
labels for the slot. Thus, for example, the Voice feature is used in Czech
to distinguish passive participles (prodán “sold” Voice=Pass), which are
morphologically distinct from active participles (prodal “sold” Voice=Act);
however, the feature is not used with participles in German, where the same
form is used in active and passive constructions alike (cf. Er hat es
verkauft “He has sold it” vs. Es wurde verkauft “It was sold”).
On the other hand, some word forms are homonymous and context must be used to
identify the paradigm slot to which they belong. For example, the morphological
paradigm of Czech nouns distinguishes nominative and accusative (among other
Case values), as in matka “mother” Case=Nom vs. matku Case=Acc.
Nevertheless, due to case syncretism, some other lexemes have the same form in these
two paradigm slots, e.g. píseň “song” is either Case=Nom or Case=Acc and
it has to be disambiguated by context.
We provide an inventory of features that are attested in multiple corpora and it is thus desirable that they are encoded in a uniform way. The list is certainly not exhaustive and later versions of the standard may include new features or values found in new languages, corpora or tagsets.
Users can extend this set of universal features and add language-specific features when necessary. Such features should be described in the language-specific documentation and follow the general principles outlined here. Universal and language-specific features of a word are listed together in the FEATS column.
- There are two types of identifiers:
  - feature names = features
  - feature values = values
- All identifiers (both features and values) consist of English letters or, occasionally, digits 0-9. The first letter is always uppercase. The other letters are generally lowercase, except for positions where new internal words are marked for better readability (e.g. NumType). This makes features distinct from the universal POS tags (all uppercase) and from the universal dependency relations (all lowercase).
- A feature of a word should always be fully specified in the data, i.e. both the feature name and the value should be identified: PronType=Prs. Note that the values are not guaranteed to be unique across features, e.g. Sup could denote the superessive case, superlative degree of comparison or supine (a verb form).
- Not mentioning a feature in the data implies the empty value, which means that the feature is either irrelevant for this part of speech, or its value cannot be determined for this word form due to language-specific reasons.
- It is possible to declare that a feature has two or more values for a given word: Case=Acc,Dat. The interpretation is that the word may have one of these values but we cannot decide between them. Such multivalues should be used sparingly. They should not be used if the value list would cover the whole value space, or the subspace valid for the given language. That would mean that we cannot tell anything about this feature for the given word, and then it is preferable to just leave the feature out.
- Canonical ordering: features of one word (appearing on the same line) are always ordered alphabetically; if a feature has multiple values, these are ordered alphabetically, too. This rule facilitates cases when it is necessary to compare feature sets of two words. Alphabetical sorting means that uppercase letters are considered identical to their lowercase counterparts. So for example, Number precedes NumType.
- The description of an individual feature usually hints at which parts of speech the feature is likely to appear with. This information is intended to help understand the typical usage of the feature; however, it is not a strict rule! Applicability of features to parts of speech is very language-dependent and it should never be assumed that the feature cannot appear together with a particular POS tag.
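The formatting rules in the list above (Name=Value pairs separated by a vertical bar, comma-separated multivalues, case-insensitive alphabetical ordering) can be sketched in a few lines. The function names `parse_feats` and `canonical_feats` are hypothetical, and the use of "_" for an empty feature set is assumed from the CoNLL-U format rather than stated in this section.

```python
def parse_feats(feats):
    """Parse a FEATS string into {name: [values]}; "_" is taken to mean no features."""
    if feats == "_":
        return {}
    parsed = {}
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        # Both the feature name and the value must be present.
        if not name or not value:
            raise ValueError(f"feature not fully specified: {pair!r}")
        parsed[name] = value.split(",")
    return parsed


def canonical_feats(parsed):
    """Serialize features with names and multivalues sorted case-insensitively."""
    if not parsed:
        return "_"
    names = sorted(parsed, key=str.lower)
    return "|".join(
        f"{name}={','.join(sorted(parsed[name], key=str.lower))}" for name in names
    )


print(canonical_feats(parse_feats("NumType=Card|Number=Sing|Case=Dat,Acc")))
# prints Case=Acc,Dat|Number=Sing|NumType=Card
```

Sorting with `key=str.lower` is what places Number before NumType; a plain byte-wise sort would order them the other way round, because uppercase T sorts before lowercase b.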
Lexical Features
All of these can be considered attributes of lexemes or lemmas (rather than individual word forms) and they represent a fine-grained sub-classification of words.
Inflectional Features
These are mostly features of word forms rather than lemmas. There are exceptions: for instance, gender of nouns is usually a lexical feature (all word forms of one lemma have the same gender). However, other parts of speech (adjectives, pronouns, verbs) may inflect for gender because of agreement with nouns.
Layered Features
In some languages, some features are marked more than once on the same word. We say that there are several layers of the feature. The exact meaning of individual layers is language-dependent.
For example, possessive adjectives, determiners and pronouns may have two different values of Gender and two of Number. One of the values is determined by agreement with the modified (possessed) noun. This is parallel to other (non-possessive) adjectives and determiners that agree in gender and number with the nouns they modify. The other value is determined lexically because it is a property of the possessor.
For detailed examples of layered features, see Layered Features.
If a feature is (can be) layered in a language, the name of the feature must
indicate the layer. An additional identifier in square brackets is used to
distinguish layers, e.g. Gender[psor] for the possessor’s gender.
We recommend that the layer identifiers consist of lowercase English letters
[a-z] and/or digits [0-9].
The layers, their meaning and their
identifiers must be defined in a language-specific extension to this
documentation. For each layered feature, one layer may be defined as default
and the corresponding features then appear without identifier,
e.g. Gender=Masc|Gender[psor]=Fem.
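To show how a layered feature name decomposes, here is a minimal sketch that separates the base name from the optional layer identifier. The function name `split_layer` is hypothetical, and the pattern enforces the recommended identifier shape (lowercase letters and digits), which makes it stricter than the guideline strictly requires.

```python
import re

# Base name: an uppercase letter followed by letters or digits.
# Optional layer: lowercase letters and/or digits in square brackets
# (the recommended shape, treated here as mandatory for simplicity).
FEATURE_NAME = re.compile(
    r"^(?P<name>[A-Z][A-Za-z0-9]*)(?:\[(?P<layer>[a-z0-9]+)\])?$"
)


def split_layer(feature_name):
    """Return (base_name, layer); layer is None for the default layer."""
    match = FEATURE_NAME.match(feature_name)
    if match is None:
        raise ValueError(f"not a well-formed feature name: {feature_name!r}")
    return match.group("name"), match.group("layer")


print(split_layer("Gender[psor]"))
print(split_layer("Gender"))
```

With this decomposition, Gender=Masc|Gender[psor]=Fem yields the same base name twice, once on the default layer and once on the psor layer.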