
This page pertains to UD version 2.

Part-of-speech tags in UD v2

Summary

For v2, we propose

For related discussion see

Rename CONJ to CCONJ

We propose to rename the CONJ tag to CCONJ (coordinating conjunction) and thus make it analogous to SCONJ (subordinating conjunction). That will hopefully reduce the confusion that we observed in v1: people have assumed that the CONJ tag would correspond to the conj relation, while in fact it usually went with the relation cc and practically never with conj.
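As a rough illustration of what the rename could look like for treebank data, here is a minimal sketch. It assumes the data is stored in CoNLL-U format (ten tab-separated columns per token line, with UPOS in the fourth column); the function name rename_conj is hypothetical and not part of any UD tool. Only the tag column is touched, so the conj relation in the DEPREL column stays as it is.

```python
def rename_conj(conllu_text: str) -> str:
    """Return CoNLL-U text with the UPOS value CONJ replaced by CCONJ."""
    out = []
    for line in conllu_text.split("\n"):
        # Comment lines and sentence-separating blank lines pass through unchanged.
        if line.startswith("#") or not line.strip():
            out.append(line)
            continue
        cols = line.split("\t")
        # Assumption: a token line has exactly ten columns and UPOS is the fourth.
        if len(cols) == 10 and cols[3] == "CONJ":
            cols[3] = "CCONJ"
        out.append("\t".join(cols))
    return "\n".join(out)
```

Matching on the whole column value rather than doing a plain string replacement avoids accidentally rewriting word forms or lemmas that happen to contain the string CONJ.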

Keep PROPN

We considered removing PROPN as a separate POS tag because 1. the motivation to have it is weak in some languages (no grammatical implications except for capital letters); 2. allowing a language to declare that it does not have PROPN does not really solve anything: such languages do have the category and will typically want to use the tag if it exists, to increase cross-linguistic parallelism; 3. disambiguation from NOUN is hard in many languages; 4. it would be better to have phrase-level named entity annotation, which would include non-noun words (adjectives) and nesting.

We decided to keep the tag in v2 because 1. the category has been traditionally distinguished in a significant number of languages and tagsets, and people do not want to lose it there; 2. the reasons above do not seem strong enough to remove it; 3. we are not going to have named entity annotation in UD v2.

Keep PART but minimize its usage

We considered removing PART because it is a very small category, which is defined mostly negatively (a word is a particle if it is not anything else). We decided to keep it because 1. we would have to take the current particles one by one in each language and decide where to put them; 2. in some instances it would still be hard to find a suitable category; 3. doing so might actually lead to the creation of several new and even smaller categories.

However, the lists of particles in individual languages should be reviewed anyway. Language-specific documentation must list all particles in the language, and ideally also explain why they are particles.

Note that the current guidelines say that [en] not, [de] nicht etc. are negative particles (but negative determiners like [en] no, or negative auxiliary verbs like [cs] není are not particles). This is the only positive part of the current PART definition at the language-universal level, because such words were traditionally tagged as adverbs in some languages, and they could be adverbs in the other languages too, had we decided to remove PART.

Extend the use of AUX

The AUX category is currently used for auxiliary verbs, that is, verbs used with the syntactic relation aux (regardless of whether these verbs can be used as main verbs in other contexts). We propose to extend its use in two ways. First, it can be used for nonverbal particles that express TAMVE categories (tense, aspect, mood, voice, evidentiality), which in v2 will also be analyzed using the aux relation. Second, it will be extended to copula verbs, which perform a grammaticalized function in nominal clauses.
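One way a treebank maintainer might look for tokens affected by this extension is sketched below. It assumes CoNLL-U input (UPOS in the fourth column, DEPREL in the eighth) and assumes that copulas are attached with the cop relation, which this section does not state explicitly. The function name find_aux_mismatches is hypothetical, and the result is a list of candidates for manual review rather than a list of errors.

```python
def find_aux_mismatches(conllu_text: str, relations=("aux", "cop")):
    """List (id, form, upos, deprel) for tokens whose relation is in
    `relations` but whose UPOS tag is not AUX."""
    mismatches = []
    for line in conllu_text.split("\n"):
        # Skip comments and blank lines between sentences.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        token_id, form, upos, deprel = cols[0], cols[1], cols[3], cols[7]
        # Compare on the universal part of the relation, so aux:pass counts as aux.
        base = deprel.split(":")[0]
        if base in relations and upos != "AUX":
            mismatches.append((token_id, form, upos, deprel))
    return mismatches
```

Passing relations=("aux",) restricts the check to the first half of the proposal and leaves copulas out.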

Borderlines between categories

Parts of speech have traditionally been defined using a mix of morphological, syntactic and semantic criteria. Given that UD is concerned with grammatical annotation, we in general want to give less priority to (purely) semantic criteria. Morphological criteria are very useful but are not applicable to all languages, or even to all categories within morphologically rich languages. It follows that, in many cases, we have to rely primarily on syntactic criteria. However, we believe that the part-of-speech classification is most useful if it captures regular, prevailing syntactic behavior and does not reflect sentence-specific exceptional behavior. Therefore, we should avoid purely functional guidelines that make the part-of-speech tag completely predictable from the syntactic function. In addition to the obvious argument that such guidelines make (some) tags uninformative, they also make it harder to find interesting cross-lingual differences, for example that language X allows words of category A to have syntactic function b while language Y does not.

We admit that there are ambiguous words that cannot be tagged without considering the context. However, we prefer distinctions based on prevailing syntactic behavior and (when relevant) morphological properties. We should avoid the extreme where (almost) all ambiguities between two categories are resolved solely by syntactic structure, as was the case with pronouns vs. determiners in v1. Here the current definition of PRON and DET should be loosened. It will be possible in many languages to enumerate words in both classes, and preferably the lists should be based on prevailing syntactic function and morphological properties, rather than the actual context in each sentence.

There are many pairs of categories with unclear border zones, although not all of them in all languages:

Some pairs are less problematic than others because some categories are functionally more compatible than others. It is acceptable if PRON and DET are pre-categorized and distinguished mostly by word list or morphology, because they can be seen as two subcategories of a broader category of nominals; if a word is classified as DET but occurs in place of a noun phrase, this can be explained by ellipsis and we do not have to switch the tag to PRON in such contexts. However, some categories are not compatible, and if a word occurs in both, it has to be treated as two separate lemmas; consequently it has to be disambiguated according to sentence context. A good example is [en] that and [es] que, which can be both a relative pronoun and a subordinating conjunction (complementizer). We cannot say that all occurrences are either PRON or SCONJ because a pronoun can act as a core argument of a predicate, while a complementizer cannot. So we have to distinguish the two functions, although historically the complementizer may actually come from a grammaticalized pronoun.
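The difference between pre-categorized words and words that need contextual disambiguation can be sketched as a simple lexicon lookup. The lexicon below is a hypothetical toy example invented for illustration (in particular, listing this only as DET is an assumption); real word lists would come from the language-specific documentation. The function name lexical_tag is likewise hypothetical.

```python
# Hypothetical toy lexicon, for illustration only.
LEXICON = {
    "this": {"DET"},            # pre-categorized, kept as DET even without a noun
    "she": {"PRON"},            # pre-categorized
    "that": {"PRON", "SCONJ"},  # incompatible categories: needs sentence context
}


def lexical_tag(form, lexicon=LEXICON):
    """Return (tag, needs_context).

    If the word list assigns exactly one tag, that tag is used regardless of
    the syntactic function in the sentence. If the word is listed under
    several tags, or is not listed at all, no tag is returned and the caller
    has to decide from context.
    """
    tags = lexicon.get(form.lower(), set())
    if len(tags) == 1:
        return next(iter(tags)), False
    return None, True
```

With this toy lexicon, lexical_tag("This") yields a fixed DET tag, while lexical_tag("that") reports that the sentence context has to decide between PRON and SCONJ.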

Furthermore, we should encourage authors of language-specific documentation to document all borderline cases, but this is not really a change of guidelines.

Proposed principles for UD v2

This text could be added to the morphology overview page, section on POS tags:

Revised guidelines for pronominal words

This text could be added to the morphology overview page, section on POS tags, and it should also be reflected in the documentation of individual POS tags:

Pronominal words are pronouns, determiners (articles and pronominal adjectives), pronominal adverbs (where, when, how), and in traditional grammars of some languages also pronominal numerals (how much).