Form vs. Function in UD v2: POS tags and Syntax
(Dan’s position)
The UD guidelines consist of three main parts: part-of-speech tags, morphological features and dependency relations. While the relations clearly encode the function of the word in the sentence, the tags and features may relate to both the function and the form of the word. Parts of speech are traditionally defined using a mixture of morphological, syntactic and semantic criteria.
As we want to have a cross-linguistically consistent set of guidelines in UD, we cannot always rely on dictionaries and grammatical traditions followed by scholars of individual languages: the approaches will often conflict. Sometimes we have to (re-)define word classes, to make sure that “the same thing is not annotated in different ways”. In doing so, we have to think about the extent to which word classes should be influenced by the function of the words in particular contexts.
Main objective: Ideally we want to minimize the number of deviations from traditional grammars. (We want people to love UD, don’t we?) But we cannot eliminate the deviations because traditional grammars are not mutually compatible (at least cross-linguistically; but the difficult words may have different analyses even within one language).
Of form and function, the latter is more portable across languages (most similarities in form are only recognizable in closely related languages). However, once we have a rough idea about a category C in language X, which functionally overlaps with same-named categories in other languages, we can include form in defining the precise borders of the category (either by morphological processes it undergoes, or directly by enumerating word forms that belong to C). Now the main question that we address here is: What do we do if the same word form can perform two quite different functions? Or: what if the word behaves morphologically (form) like category C, but syntactically (function) like category D?
Extreme 1: Function rules! It is relatively easy to automatically re-tag words based on the tree structure and dependency relation labels. Drawbacks: The tree only helps when converting an existing treebank. This approach may be a disaster for taggers because tagging will incorporate an unusually large portion of syntax. High level of ambiguity of words. Redundancy in annotation (many tags and features follow from dependency relations in a straightforward way; do we then need the tags at all?). Quite likely it will make life harder for human annotators, too. Sometimes it may not be easy to reconstruct the function because it would involve reconstructing elided material. The impact on the main objective above would have to be evaluated; I bet that if the approach is too function-oriented, there will be more deviations from the traditional grammar, because the tradition will (probably) try to limit ambiguity in word categories when possible. (We have actually seen this with the PRON/DET distinction, which is defined purely functionally in UD v1. There have been complaints that the result significantly deviates from the traditional categories in many languages, including English.)
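To make the redundancy argument concrete, here is a minimal sketch of a purely function-driven PRON/DET rule of the kind described above. The `Token` record and the `retag_by_function` helper are hypothetical names invented for this illustration, not part of any UD tool, and the rule is simplified to a single relation check.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Simplified token record (an assumption for this sketch)."""
    form: str
    upos: str
    deprel: str


def retag_by_function(token: Token) -> Token:
    """Re-tag PRON/DET purely from the dependency relation.

    Anything attached as `det` becomes DET; every other pro-form
    becomes PRON. Tokens with other tags are returned unchanged.
    """
    if token.upos not in {"PRON", "DET"}:
        return token
    new_upos = "DET" if token.deprel == "det" else "PRON"
    return token._replace(upos=new_upos)


# "That car is expensive" vs. "That is expensive"
print(retag_by_function(Token("That", "PRON", "det")).upos)
print(retag_by_function(Token("That", "DET", "nsubj")).upos)
```

Note that the function never looks at the word form: the tag is fully predictable from the relation, which is exactly the redundancy listed among the drawbacks.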
Extreme 2: Form rules! If we take this literally, there are no homonyms, no ambiguity. Probably good news for taggers but less so for users. If a word has two completely different functions (and maybe its translation to other languages differs according to function), then we have a genuine ambiguity. Without it, we could not even distinguish content and function words—a distinction that is central in Universal Dependencies.
The result? It should now be apparent that we need both. How much of each we need is yet to be determined. Maybe there is no good, quantifiable way of defining this borderline. But we should at least intuitively try to make the ratio similar in all decisions we make. Hence it might be helpful to compile a list of word-classification issues where we had to balance form with function.
List of form-function issues
- If a word is mentioned rather than used, we keep its original part of speech. Thus if [en] yes is normally an interjection (INTJ), and it appears in a sentence like “We are waiting for his ‘yes’ on the matter”, it is still tagged INTJ and not NOUN. This rule is part of UD v1 (see here) and it clearly prefers form to function.
- Verbal particles in Germanic languages, e.g. up in [en] come up, or on in come on. Despite traditionally being called particles in this usage, these words are originally adverbs or prepositions, and that’s how we tag them. This rule is part of UD v1 (see here) and it clearly prefers form to function.
- Multi-word expressions. Whenever we use the mwe relation we construct a multi-node unit that is frozen and (together) performs a grammatical function (often a preposition, conjunction or adverb). The individual words keep their part of speech that they would have outside the MWE, and there is no POS tag for the MWE as a whole. Hence in POS tagging we again prefer form to function, and the function of the MWE may look “incompatible” with the word that serves as the head of the MWE.
- Adverbs (ADV) vs. coordinating conjunctions (CONJ). See also issue 237. No consensus yet. But one could argue that some adverbs have been grammaticalized to perform a connective function. As an analogy to multi-word expressions, we would continue tagging them ADV, while their syntactic relation would be cc (reflecting the function). That would mean preferring form to function in tagging.
- [en] no: in “No. We have no bananas.” the first occurrence is an interjection, while the second is a negative determiner. The syntactic functions are very different and the word is translated differently to other languages (in [cs], the first no would be ne, while the second would be žádné). On the other hand, the Czech word ne can function as either English no (INTJ) or not (PART). In both cases we project the different functions down to the POS tag, hence the function is preferred to the form. (Note that similar ambiguities exist also between function and content words, such as English can, which is either AUX or NOUN.)
- Pronouns (PRON) vs. determiners (DET). See also issue 159 and the Uppsala discussion group on determiners. UD v1 follows a definition that can be traced back to the EAGLES project and is purely functional. In “That car is expensive”, that is DET. In “That is expensive”, that is PRON. As mentioned above, complaints have been made that this approach violates traditional (or Penn Treebank-like?) categories even in English. Moreover, the rule is not bullet-proof. One could argue that a sentence is elliptical and that there actually is a modified noun in the underlying meaning, although it is not visible on the surface. We may want to revise this rule for UD v2. (My proposal: pre-classify all pro-forms in the dictionary as either PRON or DET, based on their possible or frequent usage, and when applicable, on their morphological behavior. Then if a DET appears without the modified nominal, we ascribe it to ellipsis. If a PRON modifies another nominal, we interpret it as a variant of a relation between two nouns, which is quite frequent in many languages. Then in both the above examples, that would be tagged as DET. Note however that this particular word can also be used as SCONJ, as in “I know that the car is expensive”. Here I would recommend that we keep it separate from the DET (/PRON) homonym, because the two syntactic functions are too different and incompatible.)
- Nouns (NOUN) vs. adjectives (ADJ). Somewhat parallel to the PRON-DET clash, but different in that most tagsets/languages have both categories. However, sometimes a word that is historically an adjective lexicalizes as a noun. Its inflection (if applicable in the language) is still adjectival but it is no longer used as an adjectival modifier of nouns (although it can modify nouns in the same manner as other nouns do, e.g. as a genitive post-modifier). The absence of a modified noun can no longer be explained by ellipsis. For example, [cs] hajný “gamekeeper” was originally an adjective derived from the noun háj “forest”. Its inflection paradigm is still adjectival but it is never used as an adjective. That is, you cannot say something like *hajný muž “gamekeeping man”. The word is listed as NOUN in the dictionary and it is tagged NOUN in the UD corpus. Hence we prefer function to form in this case. There is no universal rule for how to demarcate the border though. Individual words have to be considered separately in every language. Some words may not be far enough on their way from adjectives to nouns. Note that the other direction is easier. If a noun pre-modifies another noun, as is a common case in English (“fish market”), the modifier is still a NOUN. If it acquires adjectival morphology ([cs] rybí trh “fish market”, from ryba “fish”), it immediately becomes a derived ADJ.
- Main verbs (VERB) vs. auxiliaries (AUX). In the questions “How do you do?” and “Have you had breakfast?”, the first occurrence of to do and to have, respectively, is regarded as auxiliary, while the second functions as a normal verb. With the verb to be, we distinguish auxiliary usage in periphrastic verb forms (I am eating; he was hired) and existential usage (I think, therefore I am; the castle is on the hill). The former is AUX, the latter is VERB. We also have the copula, as in he is smart; this usage was tagged VERB according to the UD guidelines v1 but it should probably be AUX in v2 (see issue 275). So even if there are purely auxiliary verbs (English can, could, must, shall will never be tagged VERB), some verbs may work as both and their actual tag is based on the function. The tag can usually be deduced from the syntactic structure, unless the main verb has been elided and the auxiliary promoted to its position; then it would still be tagged AUX but it would have a non-auxiliary relation to its parent.
- Question particles are a class of words that lacks unified treatment across languages. In some languages they overlap with other parts of speech, in other languages they seem to be distinct. See issue 178 for a survey of question particles in UD 1.2 treebanks.
Proposed principles for UD v2
- A word’s category should be primarily determined by dictionary rather than by context of a particular sentence. Syntax still plays an important role, especially in cross-linguistic mapping of same-named categories. However, prototypical (expected) syntactic behavior is of more importance than function performed in exceptional contexts.
- Morphological behavior may be a good indicator in some languages. If, for example, a language uses distinct inflection patterns for nouns and adjectives, then morphology can be used to distinguish these two categories. Exceptions cannot be excluded (see the [cs] hajný example above) but they should be really exceptional and well grounded; when in doubt, use the category determined by morphology.
- Ambiguous words (belonging to two or more categories) do exist. Sometimes by pure coincidence ([en] the can vs. can = to be able to). Sometimes the two words are related but differ morphologically ([en] the book(s) vs. to book, booked, booking).
- Perhaps the most difficult part is ambiguous function words that do not inflect (i.e. morphology does not help us), yet they perform two or more significantly different syntactic functions, which we normally associate with different parts of speech. The two functions may not be equally frequent but each of them is more frequent than what could be labeled as a mere exception (the waiting for his ‘yes’ example, by contrast, is exceptional). Disambiguating such pairs clearly depends on the context of the given sentence where the word is used. We should minimize this sort of ambiguity (because we want to decide as much as possible with the dictionary). But I don’t think we can avoid it. So how do we know that the difference is “significant enough”? One clue is that the word, when translated to another language, gets two different translations with different POS tags (e.g. the English no example above). Another clue comes from contrasting the UD relations used for the two functions. For example, distinguishing PRON from SCONJ ([en] that, [es] que, [ru] что / čto) is important because pronouns, unlike conjunctions, may become core arguments and fill valency slots of verbs. Distinguishing SCONJ and ADP (or the corresponding relations mark and case) seems less crucial and we can probably keep just one POS tag for each such word, based on prototypical usage. Distinguishing between PRON and DET lies on the importance scale somewhere in between; but the nature of both these categories is nominal and we can probably live with pre-categorizing most of these words in the dictionary.
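The principles above amount to "dictionary first, context only for listed homonyms". Below is a minimal sketch of that lookup order. The lexicon entries, the override table, the `X` fallback for unknown words and the `assign_upos` helper are all assumptions made for illustration; a real language-specific word list would be much longer.

```python
# Hypothetical dictionary: one prototypical tag per word form.
LEXICON = {
    "that": "DET",   # pre-classified pro-form; bare use is ascribed to ellipsis
    "yes": "INTJ",   # stays INTJ even when merely mentioned
}

# Hypothetical list of genuine homonyms: only these (form, relation)
# pairs may override the dictionary tag based on sentence context.
CONTEXT_OVERRIDES = {
    ("that", "mark"): "SCONJ",
}


def assign_upos(form: str, deprel: str) -> str:
    """Return the dictionary tag unless the word is a listed homonym."""
    key = form.lower()
    if (key, deprel) in CONTEXT_OVERRIDES:
        return CONTEXT_OVERRIDES[(key, deprel)]
    return LEXICON.get(key, "X")  # X = unknown word, an assumption of this sketch


print(assign_upos("That", "det"))    # That car is expensive
print(assign_upos("That", "nsubj"))  # That is expensive
print(assign_upos("that", "mark"))   # I know that the car is expensive
```

The design choice is that context can only select among tags that someone explicitly listed for a word, so exceptional uses never create new ambiguity on their own.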
Revised guidelines for pronominal words
Pronominal words are pronouns, determiners (articles and pronominal adjectives), pronominal adverbs (where, when, how), and in traditional grammars of some languages also pronominal numerals (how much).
- In most cases it is straightforward to determine whether a word is pronominal (see also the PronType feature) but the borderline between indefinite determiners and adjectives is slightly fuzzy. Related languages should synchronize the lists of words they treat as pronominal. The rest of these guidelines demarcate borders within the pronominal group.
- Pronominal adverbs are tagged ADV. Their pronominality is encoded using the PronType feature. Their typical syntactic function is to modify verbs.
- Articles (the, a, an) are always tagged DET; their PronType is Art.
- Pronominal numerals (quantifiers) are tagged DET; besides PronType, they also use the NumType feature.
- Words that behave similarly to adjectives are DET. (We understand the DET class as pro-adjectives, which is a slightly broader sense than what is usually regarded as determiners in English. In particular, it is possible that one nominal is modified by more than one determiner.) Similar behavior means:
  - They are more likely to be used attributively (modifying a noun phrase) than substantively (replacing a noun phrase). They may occur alone, though. If they do, it is either because of ellipsis, or because the hypothetical modified noun is something unspecified and general, as in All [visitors] must pay.
  - Their inflection is similar to that of adjectives, and distinct from nouns. They agree with the nouns they modify. Especially the ability to inflect for gender is typical for adjectives and determiners. (Gender of nouns is determined lexically and determiners may be required by the grammar to agree with their nouns in gender; therefore they need to inflect for gender.)
- Non-possessive personal, reflexive or reciprocal pronouns are always tagged PRON.
- Possessives vary across languages. In some languages the above tests put them in the DET category. In others, they are more like a normal personal pronoun in a specific case (often the genitive), or a personal pronoun with an adposition; they are tagged PRON.
- When in doubt, the category should be based on what the traditional grammar of the language says.
- Ideally, language-specific documentation should list pronominal words and their category. These are all closed classes so it should not be difficult.
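The guidelines above can be read as a small decision procedure. The sketch below is one possible reading, not a definitive implementation: the `PronominalWord` record, its boolean properties and the order of the checks are assumptions introduced here, and a real word list would record these properties per language.

```python
from dataclasses import dataclass


@dataclass
class PronominalWord:
    """Hypothetical lexicon entry for a pronominal word."""
    form: str
    is_adverb: bool = False        # where, when, how
    is_article: bool = False       # the, a, an
    is_quantifier: bool = False    # pronominal numerals such as "how much"
    is_personal: bool = False      # non-possessive personal, reflexive, reciprocal
    adjective_like: bool = False   # mostly attributive, adjectival inflection
    traditional_upos: str = "PRON" # what the traditional grammar says


def classify(word: PronominalWord) -> str:
    """Assign ADV, DET or PRON following the guidelines in order."""
    if word.is_adverb:
        return "ADV"
    if word.is_article or word.is_quantifier:
        return "DET"
    if word.is_personal:
        return "PRON"
    if word.adjective_like:
        return "DET"
    # When in doubt, fall back to the traditional grammar of the language.
    return word.traditional_upos


print(classify(PronominalWord("where", is_adverb=True)))
print(classify(PronominalWord("all", adjective_like=True)))
```

Possessives are deliberately not given a flag of their own: depending on the language they either pass the adjective-like tests or fall through to the traditional category.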
Morphological features
We have discussed the problem of assigning words to POS categories and the extent to which we want it to depend on actual context. Similar problems arise with morphological features. It is not uncommon that a word form is ambiguous and may have several different values of a feature, depending on context.
- For example, the Czech noun růže “rose” has only three singular forms: růže (nominative, genitive or vocative), růži (dative, accusative or locative) and růží (instrumental). We have to distinguish seven case values because other nouns have other forms. The noun žena “woman” has six singular forms, with only dative and locative sharing a form; but these two cases often have different forms in plural. It is customary to assign just one value of Case that is correct in the given context. If we were only allowed to use a dictionary, we would have to tag růže with a set of values instead (Case=Gen,Nom,Voc) but we do not do that. Note that this is different from saying that a word category is only permitted in a given syntactic function. If we say that det always implies DET, we create redundancy; but saying that we must look at context to disambiguate Case assignment does not mean that the Case value deterministically follows from the syntactic function. There is a clear benefit from the Case being disambiguated.
- However, other situations may be less clear. Many Spanish adjectives inflect for Gender and must agree in gender with the noun they modify. Hence we have un niño pequeño “a small baby boy” (masculine) but una ciudad pequeña “a small town” (feminine). It is natural to tag the adjective with the gender corresponding to the actual form (pequeño vs. pequeña). Then there is a large group of adjectives that do not inflect and cannot show the agreement, e.g. grande: un palacio grande “a big palace” and una ciudad grande “a big town”. In most cases their gender could be determined from syntactic context, namely from the head noun. Nevertheless, requesting it to be part of the annotation means that we create a redundancy (the gender directly follows from syntax and the noun gender, while there is no trace of it in the form of the adjective itself). So it seems more natural to say that there is a subset of adjectives which do not inflect for gender, and their Gender feature is always empty (i.e. it does not appear at all among the features of the adjective).
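The contrast between the two examples can be sketched in a few lines. The tables below contain only the forms mentioned above (singular readings only), and the helper names `case_feature` and `gender_feature` are hypothetical; a real system would get the candidates from a morphological analyzer.

```python
from typing import Optional

# Hypothetical analyzer output: Case values each singular form can express.
CASE_CANDIDATES = {
    "růže": {"Nom", "Gen", "Voc"},
    "růži": {"Dat", "Acc", "Loc"},
    "růží": {"Ins"},
}

# Hypothetical adjective lexicon: None marks adjectives that do not
# inflect for gender at all.
ADJ_GENDER = {"pequeño": "Masc", "pequeña": "Fem", "grande": None}


def case_feature(form: str, case_in_context: str) -> str:
    """Keep the single Case value that context selects among the candidates."""
    if case_in_context not in CASE_CANDIDATES[form]:
        raise ValueError(f"{form} cannot be Case={case_in_context}")
    return f"Case={case_in_context}"


def gender_feature(form: str) -> Optional[str]:
    """Return the Gender feature shown by the form, or None if it shows none."""
    gender = ADJ_GENDER[form]
    return f"Gender={gender}" if gender else None


print(case_feature("růže", "Gen"))
print(gender_feature("pequeña"))
print(gender_feature("grande"))
```

In the Czech case the context narrows down a set that the form itself allows; in the Spanish case there is nothing in the form to narrow down, so the feature is simply left out.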
In analogy to the principles for part-of-speech assignment, we could say that both the dictionary (with morphological analyzer) and the actual context may be needed to disambiguate a feature value. It is OK if in some instances the context is not needed. Creating ambiguities that can be resolved solely by context is discouraged, although there is no universally applicable rule for separating the Czech and the Spanish examples above. (How large must the subset be before we make it a separate sub-category, treated differently from the gender-aware adjectives?) Each language should have its own exact guidelines to determine what feature values are distinguished where. And as always, similar languages should synchronize with each other and make their respective guidelines as similar as possible.