Part-of-speech tags in UD v2
For v2, we propose to:
- Extend the use of AUX to copula verbs and nonverbal TAMVE particles.
- Minimize the use of PART: each language keeps a small language-specific list of particles, and a case must be made for each word on it.
- Loosen the borderline between PRON and DET, replacing the purely functional rule with a set of recommendations.
- Provide more general recommendations for setting language-specific borderlines between word categories.
For related discussion see
- Morphology overview page has a section on POS tags
- Current guidelines for POS tags
- Form vs. function in POS tags
- Issue 159 (DET) and the Uppsala report on determiners
- Issue 178 (question particles)
- Issue 237 (
- Issue 275 (copula
- List of open issues with the POS label
Rename CONJ to CCONJ
We propose to rename the CONJ tag to CCONJ (coordinating conjunction) and thus make it analogous to SCONJ (subordinating conjunction). This should reduce the confusion that we observed in v1: people assumed that the CONJ tag corresponded to the conj relation, while in fact it usually went with the relation cc and practically never with conj.
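For existing treebanks the rename is mechanical. The following is a minimal sketch, assuming CoNLL-U input in which UPOS is the fourth of ten tab-separated columns; the function name `rename_conj` is hypothetical and not part of any UD tool.

```python
def rename_conj(conllu_text: str) -> str:
    """Rewrite the UPOS value CONJ as CCONJ in CoNLL-U text (illustrative helper)."""
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        # Token lines have 10 tab-separated columns; comment and blank lines do not.
        is_token_line = not line.startswith("#") and len(cols) == 10
        # Column index 3 is UPOS. The deprel column is left untouched.
        if is_token_line and cols[3] == "CONJ":
            cols[3] = "CCONJ"
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)
```

Only the UPOS column is compared, so the lowercase relation label conj in the DEPREL column is never affected.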
Keep PROPN
We considered removing PROPN as a separate POS tag because
1. the motivation to have it is weak in some languages (no grammatical implications except for capital letters);
2. the possibility to say that a language does not have PROPN does not really solve anything – these languages indeed have the category and typically will want to use it if it exists, to increase cross-linguistic parallelism;
3. disambiguation from NOUN is hard in many languages;
4. it would be better to have phrase-level named entity annotation, which would include non-noun words (adjectives) and nesting.
We decided to keep the tag in v2 because
1. the category has been traditionally distinguished in a significant number of languages and tagsets, and people do not want to lose it there;
2. the reasons above do not seem strong enough to remove it;
3. we are not going to have named entity annotation in UD v2.
Keep PART but minimize its usage
We considered removing PART because it is a very small category, which is defined mostly negatively (a word is a particle if it is not anything else). We decided to keep it because
1. we would have to take the current particles one by one in each language and decide where to put them;
2. in some instances it would still be hard to find a suitable category;
3. removing it might actually lead to the creation of several new and even smaller categories.
However, the lists of particles in individual languages should be reviewed anyway. Language-specific documentation must list all particles in the language, and ideally also explain why they are particles.
Note that the current guidelines say that [en] not, [de] nicht etc. are negative particles (but negative determiners like [en] no, or negative auxiliary verbs like [cs] není are not particles). This is the only positive part of the current PART definition at the language-universal level, because such words were traditionally tagged as adverbs in some languages, and they could be adverbs in the other languages too, had we decided to remove PART.
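The requirement that every particle be listed in the language-specific documentation can be checked mechanically. The sketch below assumes that tokens are available as (lemma, UPOS) pairs and that the documented particles are given as a set of lemmas; the function name `undocumented_particles` and both input shapes are hypothetical.

```python
def undocumented_particles(tokens, documented):
    """Return the lemmas tagged PART that are missing from the documented list.

    tokens: iterable of (lemma, upos) pairs taken from a treebank.
    documented: set of lemmas that the language documentation lists as particles.
    """
    missing = {
        lemma
        for lemma, upos in tokens
        if upos == "PART" and lemma not in documented
    }
    # Sort so that the report is stable across runs.
    return sorted(missing)
```

An empty result means that every PART token in the treebank is covered by the documented list; any lemma returned either needs an entry in the documentation or a different tag.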
Extend the use of AUX
The AUX category is currently used for auxiliary verbs, that is, verbs used with the syntactic relation aux (regardless of whether these verbs can be used as main verbs in other contexts). We propose to extend its use in two ways. First, it can be used for nonverbal particles that express TAMVE (tense, aspect, mood, voice, evidentiality) categories, which in v2 will also be analyzed using the aux relation. Second, it will be extended to copula verbs, which perform a grammaticalized function in nominal clauses.
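As an illustration of what the two extensions mean for existing data, here is a minimal sketch of a retagging pass. It assumes tokens are dictionaries with `upos` and `deprel` keys, that copulas are already attached with the cop relation, and that nonverbal TAMVE particles are currently tagged PART and attached with aux; these are assumptions about the input, and the function name `retag_aux` is hypothetical.

```python
def retag_aux(tokens):
    """Return a copy of tokens in which copula verbs and aux-attached
    verbs or particles are tagged AUX (illustrative heuristic only)."""
    result = []
    for tok in tokens:
        tok = dict(tok)  # copy, so the caller's data is not modified
        # Compare only the universal part of the relation (e.g. aux:pass -> aux).
        rel = tok["deprel"].split(":")[0]
        is_copula_verb = rel == "cop" and tok["upos"] == "VERB"
        is_tamve_marker = rel == "aux" and tok["upos"] in {"VERB", "PART"}
        if is_copula_verb or is_tamve_marker:
            tok["upos"] = "AUX"
        result.append(tok)
    return result
```

Non-verbal copulas (for example pronominal ones) are deliberately left alone here, because the proposal speaks only of copula verbs.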
Borderlines between categories
Parts of speech have traditionally been defined using a mix of morphological, syntactic and semantic criteria. Given that UD is concerned with grammatical annotation, we in general want to give less priority to (purely) semantic criteria. Morphological criteria are very useful but are not applicable to all languages, or even to all categories within morphologically rich languages. It follows that, in many cases, we have to rely primarily on syntactic criteria. However, we believe that the part-of-speech classification is most useful if it captures regular, prevailing syntactic behavior and does not reflect sentence-specific exceptional behavior. Therefore, we should avoid purely functional guidelines that make the part-of-speech tag completely predictable from the syntactic function. In addition to the obvious argument that such guidelines make (some) tags uninformative, they also make it harder to find interesting cross-lingual differences, for example that language X allows words of category A to have syntactic function b but language Y does not.
We admit that there are ambiguous words that cannot be tagged without considering the context. However, we prefer distinctions based on prevailing syntactic behavior and (when relevant) morphological properties. We should avoid the extreme where (almost) all ambiguities between two categories are resolved solely by syntactic structure, as was the case with pronouns vs. determiners in v1. Here the current definition of DET should be loosened. It will be possible in many languages to enumerate words in both classes, and preferably the lists should be based on prevailing syntactic function and morphological properties, rather than the actual context in each sentence.
There are many pairs of categories with unclear border zones, although not all of them in all languages:
- NOUN can be confused with ADJ and some forms of
- PROPN can be confused with
- ADJ can be confused with NUM (ordinals) and some forms of
- PRON can be confused with ADV (pronominal adverbs) and possibly
- DET can be confused with ADV (pronominal adverbs) and possibly
- NUM can be confused with NOUN (high-value cardinals, e.g. “million”), DET (pronominal quantifiers) and ADV (adverbs of degree/quantity, multiplicative numerals like [cs] sedmkrát “seven times” etc.)
- ADV can be confused with NUM (adverbial numerals and adverbs of degree/quantity) and some forms of VERB (transgressives/converbs); furthermore, some are on the border to
- VERB can be confused with AUX (but we are proposing to remove
- ADP can be confused with ADV, some secondary prepositions also with NOUN and other categories
- SCONJ can be confused with
- CCONJ can be confused with
- PART should in theory not be confused with anything because then it should not be PART; nevertheless, at least some traditional particles are now CCONJ, and the particle not would be traditionally ADV in some languages
- INTJ can sometimes be confused with ADV if they are used like exclamations; but arguably these could then not be tagged
- PUNCT can be confused with
- SYM can be confused with
- X as a tag for foreign word can be confused with PROPN when the word itself is not a proper noun in the foreign language, but is a part of a longer book/movie title
Some pairs are less problematic than others because some categories are functionally more compatible than others.
It is acceptable if PRON and DET are pre-categorized and distinguished mostly by word list or morphology, because they can be seen as two subcategories of a broader category of nominals; if a word is classified as DET but occurs in place of a noun phrase, this can be explained by ellipsis and we do not have to switch the tag to PRON in such contexts. However, some categories are not compatible, and if a word occurs in both, it has to be treated as two separate lemmas; consequently it has to be disambiguated according to sentence context. A good example is [en] that and [es] que, which can be both a relative pronoun and a subordinating conjunction (complementizer). We cannot say that all occurrences are either PRON or SCONJ, because a pronoun can act as a core argument of a predicate, while a complementizer cannot. So we have to distinguish the two functions, although historically the complementizer may actually come from a grammaticalized pronoun.
Furthermore, we should encourage authors of language-specific documentation to document all borderline cases, but this is not really a change of guidelines.
Proposed principles for UD v2
This text could be added to the morphology overview page, section on POS tags:
- A word’s category should be primarily determined by prototypical (expected) syntactic behavior, as typically recorded in a dictionary, rather than by the context of a particular sentence.
- Morphological behavior may be a good indicator in some languages. If, for example, a language uses distinct inflection patterns for nouns and adjectives, then morphology can be used to distinguish these two categories. Exceptions cannot be excluded but they should be really exceptional and well grounded; when in doubt, use the category determined by morphology (if available).
- Ambiguous words (belonging to two or more categories) do exist. Sometimes the ambiguity is a pure coincidence ([en] the can vs. can = to be able to). Sometimes the two words are related but differ morphologically ([en] the book(s) vs. to book, booked, booking).
- Perhaps the most difficult cases are ambiguous function words that do not inflect (so morphology does not help us), yet perform two or more significantly different syntactic functions that we normally associate with different parts of speech. The two functions may not be equally frequent, but each of them is more frequent than what could be labeled a mere exception (i.e. the wait for his ‘yes’ example is exceptional). Disambiguating such pairs clearly depends on the context of the sentence in which the word is used. So how do we know that the difference is “significant enough”? One clue is that the word, when translated to another language, gets two different translations with different POS tags (e.g. the English no as a response interjection vs. a negative determiner). Another clue comes from contrasting the UD relations used for the two functions. For example, distinguishing PRON from SCONJ ([en] that, [es] que, [ru] что / čto) is important because pronouns, unlike conjunctions, may become core arguments and fill valency slots of verbs. Distinguishing
ADV seems less crucial and we can probably keep just one POS tag for each such word, based on prototypical usage.
Revised guidelines for pronominal words
This text could be added to the morphology overview page, section on POS tags, and it should also be reflected in the documentation of individual POS tags:
Pronominal words are pronouns, determiners (articles and pronominal adjectives), pronominal adverbs (where, when, how), and in traditional grammars of some languages also pronominal numerals (how much).
- In most cases it is straightforward to determine whether a word is pronominal (see also the PronType feature) but the borderline between indefinite determiners and adjectives is slightly fuzzy. Related languages should synchronize the lists of words they treat as pronominal. The rest of these guidelines demarcate borders within the pronominal group.
- Pronominal adverbs are tagged ADV. Their pronominality is encoded using the PronType feature. Their typical syntactic function is to modify verbs.
- Articles (the, a, an) are always tagged DET.
- Pronominal numerals (quantifiers) are tagged DET; in addition to PronType, they also use the NumType feature.
- Words that behave similarly to adjectives are DET. (We understand the DET class as pro-adjectives, which is a slightly broader sense than what is usually regarded as determiners in English. In particular, it is possible that one nominal is modified by more than one determiner.) Similar behavior means:
  - They are more likely to be used attributively (modifying a noun phrase) than substantively (replacing a noun phrase). They may occur alone, though. If they do, it is either because of ellipsis, or because the hypothetical modified noun is something unspecified and general, as in All [visitors] must pay.
  - Their inflection is similar to that of adjectives, and distinct from that of nouns. They agree with the nouns they modify. The ability to inflect for gender is especially typical of adjectives and determiners. (Gender of nouns is determined lexically and determiners may be required by the grammar to agree with their nouns in gender; therefore they need to inflect for gender.)
- Non-possessive personal, reflexive or reciprocal pronouns are always tagged PRON.
- Possessives vary across languages. In some languages the above tests put them in the DET category. In others, they are more like a normal personal pronoun in a specific case (often the genitive), or a personal pronoun with an adposition; they are tagged PRON.
- When the above rules do not help, the category should be based on what the traditional grammar of the language says.
- Ideally, language-specific documentation should list pronominal words and their category. These are all closed classes so it should not be difficult.
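The rules above can be read as a decision procedure. The following sketch shows one possible ordering of the tests. The class `PronominalWord`, its fields and the function `pronominal_upos` are hypothetical names introduced for illustration, and where the rule text does not spell out a tag (articles, quantifiers, personal pronouns) the value returned follows common UD practice and should be treated as an assumption.

```python
from dataclasses import dataclass


@dataclass
class PronominalWord:
    # One of: "adverb", "article", "quantifier", "personal",
    # "reflexive", "reciprocal", "possessive", "other".
    kind: str
    # True if the word is mostly attributive, inflects like an adjective
    # and agrees with the noun it modifies.
    adjective_like: bool = False
    # Tag suggested by the traditional grammar of the language (fallback).
    traditional_tag: str = "PRON"


def pronominal_upos(word: PronominalWord) -> str:
    """Pick a UPOS tag for a pronominal word (illustrative sketch)."""
    if word.kind == "adverb":
        return "ADV"  # pronominality is then recorded in PronType
    if word.kind in {"article", "quantifier"}:
        return "DET"  # assumed tag; quantifiers also carry NumType
    if word.kind in {"personal", "reflexive", "reciprocal"}:
        return "PRON"  # non-possessive pronouns, assumed tag
    if word.kind == "possessive":
        # Language-dependent: adjective-like possessives are determiners.
        return "DET" if word.adjective_like else "PRON"
    if word.adjective_like:
        return "DET"
    # When no rule applies, defer to the traditional grammar.
    return word.traditional_tag
```

Because the result depends only on lexical properties of the word, the procedure can be run once per word to produce the per-language lists that the documentation should contain, rather than once per token in context.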