Part-of-speech tags in UD v2
Summary
For v2, we propose to:
- Rename CONJ to CCONJ.
- Extend the use of AUX to copula verbs and nonverbal TAMVE particles.
- Minimize the usage of PART: a small language-specific list of words, with a case made for each.
- Loosen the borderline between PRON and DET: a set of recommendations instead of a purely functional rule.
- Provide more general recommendations about setting language-specific borderlines between word categories.
For related discussion see
- Morphology overview page has a section on POS tags
- Current guidelines for POS tags
- Form vs. function in POS tags
- Issue 159 (PRON vs. DET) and the Uppsala report on determiners
- Issue 178 (question particles)
- Issue 237 (ADV vs. CONJ)
- Issue 275 (copula VERB vs. AUX)
- List of open issues with the POS label
Rename CONJ to CCONJ
We propose to rename the CONJ tag to CCONJ (coordinating conjunction) and thus make it analogous to SCONJ (subordinating conjunction). That will hopefully reduce the confusion that we observed in v1: people have assumed that the CONJ tag would correspond to the conj relation, while in fact it usually went with the relation cc and practically never with conj.
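For treebank maintainers, the renaming amounts to rewriting one column. The following is a minimal sketch, assuming the data is in CoNLL-U format (ten tab-separated columns per token line, with UPOS in the fourth column); the function name `rename_conj` and the sample sentence are made up for illustration.

```python
def rename_conj(conllu_text: str) -> str:
    """Rewrite the UPOS value CONJ as CCONJ in CoNLL-U text."""
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        # Token lines have 10 tab-separated columns; comments and blank lines do not.
        if not line.startswith("#") and len(cols) == 10 and cols[3] == "CONJ":
            cols[3] = "CCONJ"
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)


# Illustrative sample: the tag on "and" changes, the conj relation on "dogs" does not.
sample = "\n".join([
    "# text = cats and dogs",
    "1\tcats\tcat\tNOUN\t_\t_\t0\troot\t_\t_",
    "2\tand\tand\tCONJ\t_\t_\t3\tcc\t_\t_",
    "3\tdogs\tdog\tNOUN\t_\t_\t1\tconj\t_\t_",
])
converted = rename_conj(sample)
```

Only the UPOS column is compared, so the lowercase conj relation in the DEPREL column is left alone.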
Keep PROPN
We considered removing PROPN as a separate POS tag because
- the motivation to have it is weak in some languages (no grammatical implications except for capital letters);
- the possibility to say that a language does not have PROPN does not really solve anything: these languages indeed have the category and typically will want to use it if it exists, to increase cross-linguistic parallelism;
- disambiguation from NOUN is hard in many languages;
- it would be better to have phrase-level named entity annotation, which would include non-noun words (adjectives) and nesting.
We decided to keep the tag in v2 because
- the category has been traditionally distinguished in a significant number of languages and tagsets, and people do not want to lose it there;
- the reasons above do not seem strong enough to remove it;
- we are not going to have named entity annotation in UD v2.
Keep PART but minimize its usage
We considered removing PART because it is a very small category, which is defined mostly negatively (a word is a particle if it is not anything else). We decided to keep it because
- we would have to take the current particles one by one in each language and decide where to put them;
- in some instances it would still be hard to find a suitable category;
- this may actually lead to the creation of several new and even smaller categories.
However, the lists of particles in individual languages should be reviewed anyway. Language-specific documentation must list all particles in the language, and ideally also explain why they are particles.
Note that the current guidelines say that [en] not, [de] nicht etc. are negative particles (but negative determiners like [en] no, or negative auxiliary verbs like [cs] není are not particles). This is the only positive part of the current PART definition at the language-universal level, because such words were traditionally tagged as adverbs in some languages, and they could be adverbs in the other languages too, had we decided to remove PART.
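The requirement that language-specific documentation list all particles can be checked mechanically. The sketch below is one possible way to do it, not a prescribed tool: the set `DOCUMENTED_PARTICLES` is a hypothetical stand-in for whatever list a language's documentation provides, and tokens are assumed to be available as (lemma, UPOS) pairs.

```python
from collections import Counter

# Hypothetical documented particle list for one language (illustration only).
DOCUMENTED_PARTICLES = {"not", "n't"}


def undocumented_particles(tokens, documented=DOCUMENTED_PARTICLES):
    """Count PART-tagged lemmas that are missing from the documented list.

    `tokens` is an iterable of (lemma, upos) pairs.
    """
    return Counter(
        lemma.lower()
        for lemma, upos in tokens
        if upos == "PART" and lemma.lower() not in documented
    )


# Illustrative input: "to" is tagged PART twice but is not in the assumed list.
example_tokens = [("not", "PART"), ("to", "PART"), ("no", "DET"), ("to", "PART")]
missing = undocumented_particles(example_tokens)
```

Each lemma reported this way either needs an entry (and ideally a justification) in the documentation or a different tag.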
Extend the use of AUX
The AUX category is currently used for auxiliary verbs, that is, verbs used with the syntactic relation aux (regardless of whether these verbs can be used as main verbs in other contexts). We propose to extend its use in two ways. First, it can be used for nonverbal particles that express TAMVE categories (tense, aspect, mood, voice, evidentiality), which in v2 will also be analyzed using the aux relation. Second, it will be extended to copula (cop) verbs, which perform a grammaticalized function in nominal clauses.
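To see how much of a treebank the extension would affect, one can list the words attached via aux or cop that are not yet tagged AUX. This is a rough sketch under the assumption that the data is in CoNLL-U format (UPOS in the fourth column, DEPREL in the eighth); the function name and sample are illustrative. Note that the proposal speaks of copula verbs, so a flagged non-verbal copula is a candidate for review rather than an automatic error.

```python
def non_aux_function_words(conllu_text):
    """Return (id, form, upos, deprel) for aux/cop dependents not tagged AUX."""
    flagged = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        # Skip comments, blank lines and anything that is not a 10-column token line.
        if line.startswith("#") or len(cols) != 10:
            continue
        # Drop a relation subtype (the part after a colon) before comparing.
        base_rel = cols[7].split(":")[0]
        if base_rel in {"aux", "cop"} and cols[3] != "AUX":
            flagged.append((cols[0], cols[1], cols[3], cols[7]))
    return flagged


# Illustrative v1-style analysis in which the copula is still tagged VERB.
copula_sample = "\n".join([
    "# text = She is a teacher",
    "1\tShe\tshe\tPRON\t_\t_\t4\tnsubj\t_\t_",
    "2\tis\tbe\tVERB\t_\t_\t4\tcop\t_\t_",
    "3\ta\ta\tDET\t_\t_\t4\tdet\t_\t_",
    "4\tteacher\tteacher\tNOUN\t_\t_\t0\troot\t_\t_",
])
to_review = non_aux_function_words(copula_sample)
```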
Borderlines between categories
Parts of speech have traditionally been defined using a mix of morphological, syntactic and semantic criteria. Given that UD is concerned with grammatical annotation, we in general want to give less priority to (purely) semantic criteria. Morphological criteria are very useful but are not applicable to all languages, or even to all categories within morphologically rich languages. It follows that, in many cases, we have to rely primarily on syntactic criteria. However, we believe that the part-of-speech classification is most useful if it captures regular, prevailing syntactic behavior and does not reflect sentence-specific exceptional behavior. Therefore, we should avoid completely functional guidelines for part-of-speech tags that make the tag completely predictable from the syntactic function. In addition to the obvious argument that such guidelines make (some) tags uninformative, they also make it harder to find interesting cross-lingual differences. For example, language X allows words of category A to have syntactic function b, but language Y doesn’t.
We admit that there are ambiguous words that cannot be tagged without considering the context. However, we prefer distinctions based on prevailing syntactic behavior and (when relevant) morphological properties. We should avoid the extreme where (almost) all ambiguities between two categories are resolved solely by syntactic structure, as was the case with pronouns vs. determiners in v1. Here the current definition of PRON and DET should be loosened. It will be possible in many languages to enumerate words in both classes, and preferably the lists should be based on prevailing syntactic function and morphological properties, rather than the actual context in each sentence.
There are many pairs of categories with unclear border zones, although not all of them in all languages:
- NOUN can be confused with PROPN, PRON, ADJ and some forms of VERB (gerunds, infinitives)
- PROPN can be confused with NOUN and X (foreign names)
- ADJ can be confused with DET, NUM (ordinals) and some forms of VERB (participles)
- PRON can be confused with DET, NUM (pronominal quantifiers), ADV (pronominal adverbs) and possibly NOUN
- DET can be confused with PRON, NUM (pronominal quantifiers), ADV (pronominal adverbs) and possibly ADJ
- NUM can be confused with ADJ (ordinals), NOUN (high-value cardinals, e.g. “million”), PRON, DET (pronominal quantifiers) and ADV (adverbs of degree/quantity, multiplicative numerals like [cs] sedmkrát “seven times” etc.)
- ADV can be confused with ADJ, PRON, DET (pronominal adverbs), NUM (adverbial numerals and adverbs of degree/quantity) and some forms of VERB (transgressives/converbs); furthermore, some are on the border to ADP, SCONJ and CCONJ
- VERB can be confused with AUX (but we are proposing to remove AUX), NOUN (gerunds), ADJ (participles), ADV (transgressives/converbs)
- ADP can be confused with SCONJ, ADV, some secondary prepositions also with NOUN and other categories
- SCONJ can be confused with ADP, ADV and CCONJ
- CCONJ can be confused with ADV and SCONJ
- PART should in theory not be confused with anything because then it should not be PART; nevertheless, at least some traditional particles are now ADP, ADV, SCONJ or CCONJ, and the particle not would be traditionally ADV in some languages
- INTJ can sometimes be confused with NOUN, ADJ, VERB or ADV if they are used like exclamations; but arguably these could then not be tagged INTJ
- PUNCT can be confused with SYM
- SYM can be confused with PUNCT
- X as a tag for foreign word can be confused with PROPN when the word itself is not a proper noun in the foreign language, but is a part of a longer book/movie title
Some pairs are less problematic than others because some categories are functionally more compatible than others.
It is acceptable if PRON and DET are pre-categorized and distinguished mostly by word list or morphology, because they can be seen as two subcategories of a broader category of nominals; if a word is classified as DET but it occurs in place of a noun phrase, it can be explained by ellipsis and we do not have to switch the tag to PRON in such contexts. However, some categories are not compatible and if a word occurs in both, it has to be taken as two separate lemmas; consequently it has to be disambiguated according to sentence context. A good example is [en] that and [es] que, which can be both a relative pronoun and a subordinating conjunction (complementizer). We cannot say that all occurrences are either PRON or SCONJ because a pronoun can act as a core argument of a predicate, while a complementizer cannot. So we have to distinguish the two functions, although historically the complementizer may actually come from a grammaticalized pronoun.
Furthermore, we should encourage authors of language-specific documentation to document all borderline cases, but this is not really a change of guidelines.
Proposed principles for UD v2
This text could be added to the morphology overview page, section on POS tags:
- A word’s category should be primarily determined by prototypical (expected) syntactic behavior, as typically recorded in a dictionary, rather than by the context of a particular sentence.
- Morphological behavior may be a good indicator in some languages. If, for example, a language uses distinct inflection patterns for nouns and adjectives, then morphology can be used to distinguish these two categories. Exceptions cannot be excluded but they should be really exceptional and well grounded; when in doubt, use the category determined by morphology (if available).
- Ambiguous words (belonging to two or more categories) do exist. Sometimes by pure coincidence ([en] the can vs. can = to be able to). Sometimes the two words are related but differ morphologically ([en] the book(s) vs. to book, booked, booking).
- Perhaps the most difficult cases are ambiguous function words that do not inflect (i.e. morphology does not help us),
yet they perform two or more significantly different syntactic functions, which we normally associate with different
parts of speech. The two functions may not be equally frequent but each of them is more frequent than what could be labeled
as a mere exception (i.e. the wait for his ‘yes’ example is exceptional).
Disambiguating such pairs clearly depends on the context of the given sentence where the word is used.
So how do we know that the difference is “significant enough”?
One clue is that the word, when translated to another language, gets two different translations with different POS tags
(e.g. the English no as response interjection, vs. negative determiner).
Another clue comes from contrasting the UD relations used for the two functions.
For example, distinguishing PRON from SCONJ ([en] that, [es] que, [ru] что / čto) is important because pronouns, unlike conjunctions, may become core arguments and fill valency slots of verbs. Distinguishing SCONJ from ADP, or CCONJ from ADV, seems less crucial and we can probably keep just one POS tag for each such word, based on prototypical usage.
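The relation-based clue for words like [en] that can be sketched as a small lookup. This is an illustration only, not a tagging rule from the guidelines: it assumes the v2 relation names nsubj, obj and iobj for nominal core arguments and mark for complementizers, and it deliberately leaves every other relation undecided (the function name is hypothetical).

```python
from typing import Optional

# Assumed v2 relation names: nominal core arguments vs. the marker relation.
CORE_NOMINAL_RELATIONS = {"nsubj", "obj", "iobj"}


def tag_by_relation(deprel: str) -> Optional[str]:
    """Suggest PRON or SCONJ for an ambiguous word from its incoming relation.

    Returns None when the relation does not settle the question.
    """
    # Ignore a relation subtype such as the ":pass" in "nsubj:pass".
    base = deprel.split(":")[0]
    if base in CORE_NOMINAL_RELATIONS:
        return "PRON"   # fills a valency slot of the verb
    if base == "mark":
        return "SCONJ"  # introduces a subordinate clause
    return None         # e.g. det or advmod: needs other evidence
```

Returning None rather than a default keeps the sketch honest about uses the text does not discuss, such as that as a demonstrative modifying a noun.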
Revised guidelines for pronominal words
This text could be added to the morphology overview page, section on POS tags, and it should also be reflected in the documentation of individual POS tags:
Pronominal words are pronouns, determiners (articles and pronominal adjectives), pronominal adverbs (where, when, how), and in traditional grammars of some languages also pronominal numerals (how much).
- In most cases it is straightforward to determine whether a word is pronominal (see also the PronType feature) but the borderline between indefinite determiners and adjectives is slightly fuzzy. Related languages should synchronize the lists of words they treat as pronominal. The rest of these guidelines demarcate borders within the pronominal group.
- Pronominal adverbs are tagged ADV. Their pronominality is encoded using the PronType feature. Their typical syntactic function is to modify verbs.
- Articles (the, a, an) are always tagged DET; their PronType is Art.
- Pronominal numerals (quantifiers) are tagged DET; besides PronType, they also use the NumType feature.
- Words that behave similarly to adjectives are DET. (We understand the DET class as pro-adjectives, which is a slightly broader sense than what is usually regarded as determiners in English. In particular, it is possible that one nominal is modified by more than one determiner.) Similar behavior means:
  - They are more likely to be used attributively (modifying a noun phrase) than substantively (replacing a noun phrase). They may occur alone, though. If they do, it is either because of ellipsis, or because the hypothetical modified noun is something unspecified and general, as in All [visitors] must pay.
  - Their inflection is similar to that of adjectives, and distinct from nouns. They agree with the nouns they modify. Especially the ability to inflect for gender is typical for adjectives and determiners. (Gender of nouns is determined lexically and determiners may be required by the grammar to agree with their nouns in gender; therefore they need to inflect for gender.)
- Non-possessive personal, reflexive or reciprocal pronouns are always tagged PRON.
- Possessives vary across languages. In some languages the above tests put them in the DET category. In others, they are more like a normal personal pronoun in a specific case (often the genitive), or a personal pronoun with an adposition; they are tagged PRON.
- When the above rules do not help, the category should be based on what the traditional grammar of the language says.
- Ideally, language-specific documentation should list pronominal words and their category. These are all closed classes so it should not be difficult.
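Read as a decision procedure, the guidelines above could be sketched roughly as follows. Everything in this block is an illustrative assumption: the `PronominalWord` record, its `kind` labels, and the single `adjective_like` flag (standing in for the attributive-use and inflection tests) are not part of UD. The fixed-tag rules are checked first because the text marks them as "always"; possessives and remaining words go through the adjective-likeness test, and anything still undecided falls back to traditional grammar.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PronominalWord:
    # Hypothetical labels: "adverb", "article", "numeral", "personal",
    # "reflexive", "reciprocal", "possessive" or "other".
    kind: str
    # Outcome of the adjective-likeness tests; None if they do not decide.
    adjective_like: Optional[bool] = None
    # Tag suggested by the traditional grammar of the language, if any.
    traditional_tag: Optional[str] = None


def tag_pronominal(word: PronominalWord) -> Optional[str]:
    """Suggest a UPOS tag for a pronominal word, or None if undecided."""
    if word.kind == "adverb":
        return "ADV"
    if word.kind in {"article", "numeral"}:
        return "DET"
    if word.kind in {"personal", "reflexive", "reciprocal"}:
        return "PRON"
    if word.adjective_like is True:
        return "DET"
    if word.kind == "possessive" and word.adjective_like is False:
        return "PRON"
    # Last resort: follow the traditional grammar (may be None).
    return word.traditional_tag
```

In practice the outcome would be stored once per word in the language-specific list rather than computed per sentence, in line with the preference for prototypical behavior over sentence context.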