Features in UD v2
We propose for v2:
- Rename Negative to u-feat/Polarity and rename individual values of Aspect, VerbForm and Definite.
- Remove Tense=Nar (but keep the other values of u-feat/Tense).
- For a number of existing features, add new values that may be or have already been needed in some languages.
- Add four new features to the universal set of features: evidentiality, politeness, abbreviation and foreign.
- A number of other potential changes are suggested for general discussion but are not formally proposed yet.
The proposals in this chapter are based on
- experience with applying the v1 features (see also the issue tracker)
- survey of language-specific features and values used in current UD treebanks (especially those that are also documented)
- draft proposal from the UniMorph project, whose goals are similar to those of UD features (Sylak-Glassman 2016; see below for a comparison of UniMorph and UD features)
Another reference that could contribute to the universal features is the GOLD ontology; see also the general project page. In particular, items in the ontology under Morphosemantic property and Morphosyntactic property roughly correspond to our features.
Renaming existing features and/or values
See issue 219 for related discussion.
- u-feat/Aspect: values Pro (prospective, used in Basque) and Prog (progressive, used in Basque, Turkish and Chinese) are highly confusing. I propose to change the prospective value to Aspect=Prosp, following the label used in UniMorph.
- Negative: the current proposal is to rename the feature to u-feat/Polarity (and keep values Pos and Neg). Another option would be to keep the name but use only one value, Yes, because positive polarity is rarely marked morphologically. Negative=Pos looks weird and it probably just marks words that can take the negative morpheme but do not have it in the given form. Nevertheless, Negative=Pos is currently used in 13 treebanks so we probably want to keep it but rename the feature to Polarity.
- u-feat/VerbForm: rename Trans (transgressive) to Conv (converb). Transgressive is a term that comes from Slavic languages and is alien to e.g. Turkish or Hindi, where functionally similar forms exist. It turns out that even within Slavic linguistics, the term transgressive is not widely used (the Slavic languages naturally have their native terms; the translation transgressive, of Latin-German-English etymology, is almost unknown outside Czech, Slovak and Sorbian). English literature on Slavic languages sometimes uses the term gerund but it is confusing and unsuitable because it is similar neither in form nor in function to the form we mark VerbForm=Ger in English and Spanish (and incidentally these are also quite different from each other, but at least Spanish has the term gerundio as its own, not only as an English translation). More neutral terms are adverbial participle or converb (Haspelmath, 1995), so I propose to relabel these forms VerbForm=Conv.
- u-feat/Definite: rename Red (reduced) to Cons (construct state); see issue 135 for related discussion.
- TO DISCUSS: What is NumType=Pers in Irish? (Defined but not used.)
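The renames proposed above could be applied to existing v1 annotation mechanically. The following is a minimal sketch, not an official conversion tool: the function name `migrate_feats` and the two lookup tables are hypothetical, and the sketch assumes that features are serialized as a `|`-separated list of `Name=Value` pairs, sorted case-insensitively by feature name, with `_` standing for an empty set.

```python
# Hypothetical lookup tables holding only the renames proposed in this section.
FEATURE_RENAMES = {"Negative": "Polarity"}
VALUE_RENAMES = {
    ("Aspect", "Pro"): "Prosp",
    ("VerbForm", "Trans"): "Conv",
    ("Definite", "Red"): "Cons",
}


def migrate_feats(feats: str) -> str:
    """Rewrite a v1 feature string using the proposed v2 names (sketch only)."""
    if feats == "_":  # assumption: "_" marks an empty feature set
        return feats
    migrated = []
    for pair in feats.split("|"):
        name, _, values = pair.partition("=")
        # A value may be a comma-separated disjunction such as Masc,Neut.
        new_values = [VALUE_RENAMES.get((name, v), v) for v in values.split(",")]
        new_name = FEATURE_RENAMES.get(name, name)
        migrated.append((new_name, ",".join(sorted(new_values))))
    # Renaming Negative to Polarity can change the alphabetical position.
    migrated.sort(key=lambda item: item[0].lower())
    return "|".join(f"{name}={value}" for name, value in migrated)


print(migrate_feats("Negative=Neg|Number=Sing|Person=3"))
print(migrate_feats("Aspect=Pro|VerbForm=Trans"))
```

Note that the value renames are keyed by the v1 feature name, so Aspect=Prog and other values that are not being renamed pass through unchanged.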
Adding/removing values to/from existing features
- u-feat/Animacy
  - Add Animacy=Hum (human). We currently have three values of animacy, Anim (animate), Inan (inanimate) and Nhum (non-human). The last one is used so far only in Polish. As a side effect, it restricts the meaning of Anim to human (also called personal) nouns. While it is not unusual that feature meaning slightly shifts across languages (e.g. plural means “more than one” in some languages and “more than two” in others), it will be more intuitive to divide the Polish nouns into Hum, Nhum and Inan. More importantly, there are languages (e.g. Yuwan, a Ryukyuan language) that only distinguish human vs. non-human, and the latter includes inanimates. On the other hand, languages like Czech will keep their animate vs. inanimate two-way distinction. The precise meaning will thus remain language-specific, but more appropriate labels will now be available.
- u-feat/Case
  - Add Case=Equ (equative, means “X-like”, “similar to X”, “same as X”). It is already used in UD Turkish and it is also proposed in UniMorph.
  - Add Case=Cmp (comparative, means “than X”). It occurs in Dravidian and Northeast-Caucasian languages; it is proposed in UniMorph.
  - TO DISCUSS: Chinese “cases” Advb, Comp, Rel.
- u-feat/Degree
  - Add Degree=Equ (equative, means “as X as”; note that it marks the adjective and it is distinct from the equative case, which marks the standard of comparison). One of the examples in UniMorph is Estonian pikkune (pikkus+ne) “as tall as”. UD Estonian contains 6 occurrences of pikkune but it does not define equative as a language-specific feature; it simply uses Degree=Pos here.
- u-feat/Definite
  - Add Definite=Spec (specific indefinite, e.g. “a certain stick”). Occurs e.g. in Lakota, proposed in UniMorph. In languages where it is used, the value Ind is interpreted as non-specific indefinite, i.e. “any (one) stick”.
  - TO DISCUSS: Definite=2 in Hungarian. Description: definiteness-like agreement of verbs with a second person object in Hungarian. Hungarian verbs have to be conjugated in harmony with the definiteness of the object, making a difference between a definite object (nézem a filmet “I am watching the film”), an indefinite object (nézek egy filmet “I am watching a film”) and a second person object (nézlek téged “I am watching you”). So Definite=2 is actually not about definiteness proper; maybe it should be Person[obj]. Perhaps we should leave this value specific to Hungarian.
- u-feat/Number
  - Add Number=Count from Bulgarian (and Macedonian). It is known variously as “counting form”, “count plural” or “quantitative plural” (Sussex and Cubberley 2006, p. 324). It is a special plural form of nouns if they occur after numerals: tri stola “three chairs” vs. stolove “chairs”. (The form originates in the Proto-Slavic dual but it should not be marked Number=Dual because 1. the dual vanished from Bulgarian and 2. the form is no longer semantically tied to the number two.)
  - Add Number=Tri (trial). Occurs in pronouns of several Austronesian languages; proposed in UniMorph.
  - Add Number=Pauc (paucal, means “a few”). Proposed in UniMorph.
  - Add Number=Grpa (greater paucal, means “more than several but not many”). Occurs in Sursurunga, an Austronesian language. Proposed in UniMorph.
  - Add Number=Grpl (greater plural, means “many, all possible”; precise semantics varies across languages). Proposed in UniMorph.
  - Add Number=Inv (inverse number, i.e. non-default for that particular noun). Occurs e.g. in Kiowa. Proposed in UniMorph.
- u-feat/VerbForm
  - Add VerbForm=Gdv (gerundive, not gerund) in Latin and Ancient Greek.
  - Add VerbForm=Vnoun for verbal nouns other than infinitives (also called masdars by some authors, e.g. Haspelmath, 1995). In UD v1 we were advising to use VerbForm=Ger for them, using the English gerunds as model. However, the term gerund is rather confusing: in Spanish (and other Romance languages?) it denotes the present participle and should thus be labeled Tense=Pres|VerbForm=Part; some Slavists use it to denote converbs (adverbial participles), which we now propose to label VerbForm=Conv (previously VerbForm=Trans).
  - Using VerbForm=Ger is discouraged and alternatives should be considered first. However, the feature is still available in UD v2 and can be used if the alternatives do not seem acceptable. The feature may be removed in future versions but comprehensive investigation has to be done first.
    - Observations from UD 1.4: VerbForm=Ger occurs in most of the Romance languages (Catalan, Spanish, Galician, Italian, Portuguese, Romanian) and Latin. I assume that in all these languages the form and function are similar to the Spanish gerundio, hence it should be replaced by VerbForm=Part|Tense=Pres (but I am unsure whether it can be extended to Latin). It occurs in one Slavic language (Polish) and it should be replaced by VerbForm=Vnoun there. The same feature could be used in all Slavic languages but verbal nouns are currently not distinguished there. It occurs in two Germanic languages (English, Danish) and I do not know whether it can be relabeled as verbal noun there. Finally, it also occurs in Irish, Sanskrit, Tamil, Kazakh and Turkish (I am not able to judge what should happen there if VerbForm=Ger is not available).
  - TO DISCUSS: VerbForm=PartFut|PartPast|PartPres in Hungarian. These should probably be two features, VerbForm=Part and Tense=Fut, as in other UD languages.
  - TO DISCUSS: VerbForm=Cop in Irish. There is no documentation so we will need some input from Teresa if we want to do anything with it.
  - TO DISCUSS: VerbForm=Stem, currently only one occurrence in Swedish. Verb stems also occur regularly in Hindi but they have the function of adverbial participles (converbs, transgressives) there, so VerbForm=Conv should be used for them.
- u-feat/Mood
  - Add Mood=Prp (purposive, means “in order to”). Occurs in Amazonian languages; proposed in UniMorph.
  - TO DISCUSS: Add Mood=Int (intentive, indicates that the speaker strongly intends for the action of the verb to be realized). Occurs in Tonkawa; proposed in UniMorph.
  - TO DISCUSS: Mood=Int in Irish (what does it mean? Interrogative?)
  - TO DISCUSS: Mood=Inter in Chinese (what does it mean?)
  - Add Mood=Adm (admirative; expresses surprise, irony or doubt). Occurs in Albanian, other Balkan languages, and in Caddo (Native American from Oklahoma). Proposed in UniMorph.
  - TO DISCUSS: Add Mood=Per (permissive; means “may, is permitted”).
  - TO DISCUSS: Add Mood=Ded (deductive, inferential, speculative; means “[I believe that it] ought to, must be”).
  - TO DISCUSS: Add Mood=Sim (simulative, means “as if”).
  - Do not add Mood=Abil, which is currently used in UD Turkish. The Turkish data should use Mood=Pot (potential) instead.
  - Do not add sequences of mood markings, which are currently used in Turkish and may be needed in other agglutinating languages. Leave them language-specific. [tr] AbilCnd, AbilDes, AbilGen, AbilGenNec, AbilImp, AbilNec, AbilPrs (all of these should start with Pot instead of Abil), GenNec.
  - TO DISCUSS: Mood=Prs (persuasive) in Turkish. Reportedly similar in meaning to imperative, but tries to persuade the addressee rather than issuing a direct command. But it could also be analyzed as a politeness distinction (comment by John Sylak-Glassman), perhaps Polite=Elev (see below). Hence we should be careful and at least discuss this more with Çağrı before we possibly add the value.
- u-feat/Tense
  - Do not add Tense=Aor (aorist), despite its current usage in Ancient Greek and Turkish. It is a confusing term with different meanings in grammars of different languages. In Slavic languages we use plain Tense=Past to denote aorist. In Turkish it is the unmarked non-past form.
  - Do not add sequences of tense markings, which are currently used in Turkish and may be needed in other agglutinating languages. Leave them language-specific. [tr] AorPast, FutPast.
  - Remove Tense=Nar. It has not been used anywhere yet. In Turkish, for which it was intended, the renarrative past is encoded as Evidentiality=Nfh|Tense=Past, and we are proposing to adopt evidentiality as a new universal feature.
- u-feat/Aspect
  - Add Aspect=Iter (iterative). It is already used in Hungarian UD, although it is called frequentative there (Aspect=Freq). It is called iterative in UniMorph and I also think the term iterative is more common cross-linguistically, although I have not checked Hungarian grammar. (Note: Iteratives also exist in Czech with this name and meaning but they can be formed only from imperfective verbs and they are usually not classified as a separate aspect; they are just Aspect=Imp.) Hungarian example: üt “hit”, ütöget “hit several times”.
  - Add Aspect=Hab (habitual). Proposed in UniMorph and in Turkish documentation (although not used in current Turkish data). It is the most accurate description for the simple present in English, and it is said to be useful for a variety of other languages.
  - TO DISCUSS: Add Aspect=Rapid? Used in UD Turkish, suffix -iver; Kornfilt (1995, p.361) calls this rapid or sudden aspect. John: In another grammar, Göksel and Kerslake (2005, p.79), the -iver morpheme seems to have a considerably wider range of meaning than simply ‘rapid’ or ‘sudden’. In fact, it seemed to me to imply probability that some event would occur. Native Turkish speakers should definitely weigh in on this.
  - TO DISCUSS: Add Aspect=Dur? Proposed in Turkish documentation but in the data it appears only as the first part of morpheme sequences DurPerf, DurPerfProg and DurProg.
  - Do not add Aspect=Res (resultative) from Old Church Slavonic. It is used there for forms that are arguably Aspect=Perf. And it should not be used to mark a particular form because in Slavic languages aspect is primarily a lexical feature (perfective vs. imperfective lemmas).
  - Do not add sequences of aspect markings, which are currently used in Turkish and may be needed in other agglutinating languages. Leave them language-specific. [tr] DurPerf, DurPerfProg, DurProg, ProgRapid.
- u-feat/Voice
  - Add Voice=Mid (middle voice), currently used in fo, grc, grc_proiel, sa.
  - Add Voice=Antip (antipassive): in ergative-absolutive languages, an ergative subject is demoted to an absolutive subject. Proposed in UniMorph.
  - Add Voice=Dir (direct). Used in direct-inverse voice systems, e.g. in North American languages. Proposed in UniMorph. Direct means that the argument that is higher in the salience hierarchy is the subject. Example hierarchy: human 1st person – 2nd – 3rd – non-human animate – inanimate.
  - Add Voice=Inv (inverse). Used in direct-inverse voice systems, e.g. in North American languages. Proposed in UniMorph. Inverse voice marking means that the argument lower in the hierarchy functions as subject.
  - TO DISCUSS: Voice=Auto (ga).
  - Do not add sequences of voice markings, which are currently used in Turkish and may be needed in other agglutinating languages. Leave them language-specific. [tr] CauPass.
- u-feat/PronType
  - Add PronType=Emp (emphatic) from Romanian. There are similarities with reflexive and demonstrative pronouns / determiners. Example: himself as in “He himself did it.” Czech sám, Romanian însuși.
  - Add PronType=Exc (exclamative) from Italian (but it can be defined in other languages, too). It expresses the speaker’s surprise towards the modified noun, e.g. what in “What a surprise!” In many languages, exclamative determiners are recruited from the set of interrogative determiners. Therefore, not all tagsets distinguish them.
  - NOTE: The Italian data in UD v1 contain three additional values of PronType: Clit, Predet and Ord. We do not propose to adopt these values as universal in UD v2. We propose to change the Italian guidelines so that these values are no longer needed. See Issue 353 for details.
- u-feat/NumType
  - Remove NumType=Gen. It is poorly defined and serves as a garbage can: some of the words should actually be cardinals, and some are better classified as NumType=Mult.
- u-feat/Person
  - Add Person=0. Proposed in UniMorph. Zero person is for impersonal statements; it appears in Finnish as well as in Santa Ana Pueblo Keres. (The construction is distinctive in Finnish but it does not use unique morphology that would necessarily require a feature; the current UD Finnish also lives without it.) However, it is morphologically distinct in Keres (Davis 1964:75).
  - Add Person=4. Proposed in UniMorph. John: 4th person could be distinguished by additional, independently-needed features, such as obviation status (e.g. proximate [prx] or obviative [obv], which are not included in person features currently), so while the feature 4 is convenient and part of paradigmatic contrasts (e.g. in Navajo), it may not be strictly necessary. The features proximate (prx) and obviative (obv) should be included somehow if direct and inverse voice are allowed, since languages that mark these voice categories tend to also mark 3rd person arguments as proximate or obviative (esp. when all arguments in the clause are 3rd person).
Adding new features
- Evident (evidentiality) is currently used only in Turkish but it seems to be an important feature in non-Indo-European languages. We could take the values from UniMorph. At present we only need Evident=Nfh (non-first hand).
- Polite (politeness) is currently used in 9 treebanks: ca, da, de, es, es_ancora, eu, hi, sa, ta. UniMorph distinguishes four axes along which politeness may be scaled (see below), one of them covering another feature currently used in a few UD treebanks, Style. I propose to add, for the time being, the feature Polite with the UniMorph-like values from the speaker-referent axis, i.e. Infm (informal), Form (formal), Elev (elevated status of referent; interpreted as a subtype of Form), Humb (humbled status of speaker; subtype of Form). That will let us cover the Indo-European tu/vous pronouns, as well as part of Japanese honorifics.
  - TO DISCUSS: The Turkish treebank has Register, which resembles politeness/style in having values Form and Inf, contrasting e.g. the 3rd person verb forms etmekte (Form) and ediyor (Inf) “he is doing”. However, it is possible that the -mekte forms are restricted to more formal settings and do not directly reflect the relation between the speaker and the addressee. If this is true, then they are not covered by what we currently propose to cover under Polite.
- Abbr=Yes (abbreviation) is not language-specific and is currently used in 12 treebanks: ar, cs, cs_cac, cs_cltt, da, et, fi, fi_ftb, fo, la_ittb, pl, ro.
- Foreign is not language-specific and is currently used in 13 treebanks: ar, cs, cs_cac, da, de, es, et, fi, fo, hi, nl, sl, sl_sst. The values should be discussed though. The currently used values are Foreign, Fscript and Tscript but most treebanks use only the first one. Alternatively we could make it just a binary feature, Foreign=Yes, which it was originally.
- TO DISCUSS: NumForm=Digit|Roman|Word. The values can arguably be easily deduced from the word form; nevertheless, it is now used in 12 treebanks: ar, ca, cs, cs_cac, es_ancora, et, la_ittb, nl, pt, ro, sl, ta. Inconsistency in Estonian: NumForm=Letter instead of Word, which is used elsewhere.
- TO DISCUSS: PartType (particle type, not participle type). Given how diverse the u-pos/PART category is, it would make sense to define its subcategories. Currently used in ga, da, nl, ro, with the following values: Inf (infinitive marker; used in ga, da, nl, ro), Vbp (used for separable verb prefixes in [nl] but they should be tagged ADP, not PART, and would not get this feature), Ad, Cmpl, Comp, Cop, Deg, Num, Pat, Vb, Voc (meaning unknown, no documentation; used in Irish). On the other hand, the function of the particle can sometimes (often?) be expressed using other features that already exist. For instance, the particles marking infinitives could have the feature VerbForm=Inf. Similarly, negative particles like [en] not could have Polarity=Neg.
- TO DISCUSS: Interrog (interrogativity). In some sense it is parallel to (but separate from) polarity (negativity). It may mark independent question particles, which exist in some languages (but note that these could also be covered by PartType=Int if we approve the feature), as well as interrogative forms of verbs. These currently appear in Irish and are tagged Mood=Int there; however, the interrogative morphemes may probably be combined with other mood categories (examples? Turkish?), which supports interrogativity as a separate feature. It would not apply to interrogative pronouns, determiners and adverbs, which are already marked by PronType=Int, much like Polarity=Neg is not used where PronType=Neg is. Similarly to polarity, the feature is proposed in UniMorph. It has two values there, Decl and Int, but the former is not expected to be used frequently (similar to Pos in Polarity) because declarativeness is usually not marked.
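The remark above that NumForm values can be deduced from the word form can be illustrated with a rough heuristic. This is only a sketch under stated assumptions: the function name `guess_num_form` is hypothetical, the Roman-numeral pattern covers the classical subtractive notation only, and the function is meant for tokens already known to be numerals.

```python
import re

# Classical Roman numerals up to 3999; matched case-insensitively.
ROMAN_RE = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$",
    re.IGNORECASE,
)


def guess_num_form(form: str) -> str:
    """Guess Digit, Roman or Word from the surface form of a numeral token."""
    if any(ch.isdigit() for ch in form):
        return "Digit"
    # The pattern also matches the empty string, hence the extra check.
    if form and ROMAN_RE.match(form):
        return "Roman"
    return "Word"


for token in ["2016", "XIV", "three"]:
    print(token, guess_num_form(token))
```

The heuristic is only safe on tokens that are already tagged as numerals, because ordinary words such as English “mix” or “I” also match the Roman pattern. That limitation is one argument for keeping the feature explicit rather than deducing it.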
Comparison with UniMorph
General differences:
- We work “bottom-up”. We wait for a feature to appear in a language (or source treebank), then we think about where to put it in the schema. They work “top-down”. They surveyed literature on language typology and collected all features that could possibly occur in any natural language.
- Their schema “is responsible for capturing only the meanings of overt inflectional morphemes, which considerably limits the semantic space that must be formally described by the UniMorph Schema features.” In contrast, we also include some features that are not inflectional but provide a more fine-grained partitioning of the part-of-speech space, e.g. PronType.
- They build upon the Leipzig Glossing Rules and their labels can be applied, if needed, to words, morphemes or phrases. We focus on individual words and don’t mark some complex forms that can be expressed only periphrastically. (But it is actually possible that UniMorph ignores periphrastic forms, too. They often stress that something is/is not distinguished by overt affixal morphology. So maybe there is no difference in this point.)
- We need a fully qualified feature+value pair to get a unique string, e.g. Degree=Sup is something else than VerbForm=Sup or Case=Sup. They distinguish “dimensions” (our features) but their values are globally unique even without the dimension name. They also have templatic features (combined of several atoms), and they often rely on feature (value) combinations. We have combined values too (e.g. Gender=Masc,Neut) but for us it expresses disjunction, used when we cannot select just one of the values. They have disjunction too, but they also mark conjunction of features, or elaboration, e.g. IN+ABL.
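The UD side of the last point can be made concrete with a small parser. The sketch below is illustrative only: `parse_feats` is a hypothetical helper, and it assumes a `|`-separated serialization with `_` for an empty set. It treats a comma-separated value as a disjunction and keeps the optional bracketed layer (as in Person[abs], which UD uses for Basque) as part of the key, so that a value such as Sup is only meaningful together with its feature name.

```python
import re
from typing import Dict, Optional, Set, Tuple

# Feature name with an optional layer in square brackets, e.g. Person[abs].
KEY_RE = re.compile(r"^(?P<name>[A-Za-z0-9]+)(\[(?P<layer>[a-z0-9]+)\])?$")

FeatKey = Tuple[str, Optional[str]]


def parse_feats(feats: str) -> Dict[FeatKey, Set[str]]:
    """Parse a UD feature string into {(name, layer): set of alternative values}."""
    parsed: Dict[FeatKey, Set[str]] = {}
    if feats == "_":  # assumption: "_" marks an empty feature set
        return parsed
    for pair in feats.split("|"):
        key, _, values = pair.partition("=")
        match = KEY_RE.match(key)
        if match is None or not values:
            raise ValueError(f"malformed feature: {pair!r}")
        # A comma expresses disjunction: any one of the listed values may hold.
        parsed[(match.group("name"), match.group("layer"))] = set(values.split(","))
    return parsed


print(parse_feats("Gender=Masc,Neut|Number[abs]=Plur|Person[abs]=3"))
```

Because the key carries the feature name, `parse_feats("Case=Sup")` and `parse_feats("Degree=Sup")` yield different dictionaries even though the value string is identical, which is the contrast with UniMorph's globally unique labels.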
UniMorph dimensions (draft v2)
- Aktionsart, values: STAT (stative), DYN (dynamic), TEL (telic), ATEL (atelic), PCT (punctual), DUR (durative), ACH (achievement), ACCMP (accomplishment), SEMEL (semelfactive), ACTY (activity). Aktionsart is a feature that we don’t have in UD but it is closely related to our Aspect. Aspect in Slavic languages is treated as a lexical feature; change of aspect is considered a derivation. Perfective verbs in Slavic languages correspond to telic verbs in UniMorph, imperfective verbs correspond to atelic verbs and statives. However, aktionsart could be defined for other languages including English, while aspect is not marked in UD English.
- Animacy, values: ANIM (animate), INAN (inanimate), HUM (human), NHUM (non-human). In UD we don’t have human but we do have the other three. We are now proposing to add Animacy=Hum to UD, see above.
- Argument Marking for head-marking languages. UniMorph uses templatic features ARG+Case+Person+Number, e.g. ARGNO1S means that the nominative argument of the current verb is 1st person singular. Available cases are nominative, accusative, absolutive, ergative, dative, benefactive. We mostly only need to annotate agreement of the verb with its subject, i.e. the nominative argument, and we use the Person and Number features of the verb for this. So far only Basque needs more, as the verbs may agree there with up to three arguments (absolutive, ergative and dative). We use the layered features, i.e. Person[abs], Person[erg], Person[dat], Number[abs] etc.
- Aspect, values: IPFV (imperfective), PFV (perfective), PRF (perfect), PROG (progressive), PROSP (prospective), ITER (iterative), HAB (habitual). Their aspect + aktionsart is not compatible with our aspect, although we have a few values in common (perfect/ive, imperfective, progressive, prospective). We also mix aspect with tense by allowing the value Tense=Imp.
- Case
  - Core case: can be defined in terms of three “meta-arguments,” S (subject), A (agent), and P (patient). Values: NOM (nominative; Case=Nom), ACC (accusative; Case=Acc), ERG (ergative; Case=Erg), ABS (absolutive; Case=Abs), NOMS (nominative, subject only). We have all these values, except that we do not distinguish NOMS from NOM.
  - Non-core, non-local case: DAT (dative; Case=Dat), BEN (benefactive; Case=Ben), PRP (purposive; Case=Cau), GEN (genitive; Case=Gen), REL (relative), PRT (partitive; Case=Par), INS (instrumental; Case=Ins), COM (comitative; Case=Com), VOC (vocative; Case=Voc), COMPV (comparative), EQTV (equative), PRIV (privative; Case=Abe), PROPR (proprietive), AVR (aversive), FRML (essive formal; Case=Ess), TRANS (translative; Case=Tra), BYWAY (essive modal). We currently lack values of 6 cases in this category, although equative seems to already occur in our Turkish data (if it is what Case=Equ refers to). Our causative (Case=Cau) might be (or overlap with) UniMorph’s purposive. Our abessive (Case=Abe) is their privative; we use the term from Uralic languages, they use the one from Australian languages. Our essive/prolative (Case=Ess, used in [hu, et, fi, eu]) is their essive formal. Their essive modal (BYWAY) comes from Hungarian and “marks the notion of ‘by way of’ a location;” I suspect that we subsume it within instrumental. Their relative (REL) “marks possessor and A role”, hence it looks like a merger of genitive and ergative.
  - Local / place: INTER (“among”), AT (“at”), POST (“behind”), IN (“in”), CIRC (“near”), ANTE (“near, in front of”), APUD (“next to”), ON (“on”), ONHR (“on” horizontal), ONVR (“on” vertical), SUB (“under”).
  - Local / distance: REM (distal), PROX (proximate).
  - Local / motion: ESS (essive), ALL (allative), ABL (ablative).
  - Local / aspect: APPRX (approximative), TERM (terminative), PROL (prolative/translative), VERS (versative).
  - UD has Case=Loc, which is used in a number of Indo-European languages (especially Slavic) but also in Basque, Turkish and others. It has mostly a locative meaning, where the placement and direction are not precisely specified. It can also have a non-locative meaning. The closest counterpart of the locative case in UniMorph is plain ESS.
  - The UniMorph draft accounts for compositionality of locative morphemes in some languages: “from [the place] between us” could be encoded as we + INTER+ABL. In contrast, we use established terms for some of the combinations but definitely cannot encode all possible combinations in all languages. Our local cases are: inessive (Case=Ine; IN+ESS); illative (Case=Ill; IN+ALL); elative (Case=Ela; IN+ABL); adessive (Case=Ade; ON/AT+ESS); allative (Case=All; ON/AT+ALL); ablative (Case=Abl; ON/AT+ABL); superessive (Case=Sup; ON/ONVR+ESS); sublative (Case=Sub; ON+ALL); delative (Case=Del; ON/ONVR+ABL); lative (Case=Lat; ALL, i.e. it says it’s motion towards something, without distinguishing on/at/in/under); terminative (Case=Ter; ALL+TERM, i.e. it specifies motion up to some point, also called terminal allative).
  - They do not have additive Case=Add because they encode atomic meaning and additive is equal in meaning to illative. It is questionable whether we want to keep it in UD but I would keep it because it is actively used in UD Estonian, so there apparently is some demand.
  - They currently do not have temporal Case=Tem (hu) and distributive Case=Dis (hu).
  - We should add Case=Equ to the universal features. We already use it in Turkish. Similarly, we should add Case=Cmp for comparative (“than X”), occurring in Dravidian and Northeast-Caucasian languages.
  - We do not have Case=Prp for proprietive (“having X”), a positive counterpart of abessive, occurring in Australian languages. (But we have comitative Case=Com which can also be viewed as a positive counterpart of abessive.) There appears to be debate about whether proprietive is in fact inflectional or derivational. Blake (2001:156) cites only Kalkatungu (an extinct Pama-Nyungan language from Queensland) as a specific example. Moreover, Heine and Kuteva (2002:88) identify a historical grammaticalization pathway for comitative case to the sense of “having”.
  - We do not have Case=Avr for aversive (“fearing X”). Blake (2001:156) notes that this case is “common in Australian languages” but only provides Kalkatungu as an example.
  - As for the local cases, there are too many possible combinations and we should probably wait until the need for one of them arises.
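The correspondence between UD local cases and UniMorph compositions listed above can be written down as a lookup table. This is a sketch of that list only, not a complete or authoritative mapping: the table `UD_LOCAL_CASES` and both helper functions are hypothetical names, and a slash in the list (e.g. ON/AT+ESS) is read as a set of alternative place labels.

```python
from typing import Dict, List, Tuple

# UD Case value -> (alternative UniMorph place labels, remaining labels),
# transcribed from the list of local cases in this section.
UD_LOCAL_CASES: Dict[str, Tuple[Tuple[str, ...], Tuple[str, ...]]] = {
    "Ine": (("IN",), ("ESS",)),
    "Ill": (("IN",), ("ALL",)),
    "Ela": (("IN",), ("ABL",)),
    "Ade": (("ON", "AT"), ("ESS",)),
    "All": (("ON", "AT"), ("ALL",)),
    "Abl": (("ON", "AT"), ("ABL",)),
    "Sup": (("ON", "ONVR"), ("ESS",)),
    "Sub": (("ON",), ("ALL",)),
    "Del": (("ON", "ONVR"), ("ABL",)),
    "Lat": ((), ("ALL",)),          # motion towards, place unspecified
    "Ter": ((), ("ALL", "TERM")),   # motion up to some point
}


def unimorph_candidates(ud_case: str) -> List[str]:
    """Return the UniMorph label combinations listed for a UD local case."""
    places, rest = UD_LOCAL_CASES[ud_case]
    suffix = "+".join(rest)
    return [f"{place}+{suffix}" for place in places] if places else [suffix]


def ud_cases_for(unimorph_label: str) -> List[str]:
    """Return all UD local cases whose listed compositions include the label."""
    return sorted(
        ud_case
        for ud_case in UD_LOCAL_CASES
        if unimorph_label in unimorph_candidates(ud_case)
    )


print(unimorph_candidates("Ade"))
print(ud_cases_for("ON+ESS"))
print(ud_cases_for("INTER+ABL"))
```

The reverse lookup shows two properties of the list: some UniMorph combinations (ON+ESS, ON+ALL, ON+ABL) are listed under two UD cases, and others (INTER+ABL) have no UD counterpart at all.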
- Comparison, values: CMPR (comparative), SPRL (superlative), AB (absolute for superlatives), RL (relative for superlatives), EQT (equative). We have Degree=Cmp for comparative, Degree=Sup for relative superlative (SPRL+RL) and Degree=Abs for absolute superlative (SPRL+AB). We also have Degree=Pos (positive) that denotes the basic degree, i.e. an adjective that is not compared. This comes from traditional grammars of various languages, although it would be possible to tag such adjectives by omitting the Degree feature. We could not distinguish adjectives that cannot take the comparative/superlative morphemes, but in fact we avoid that distinction with most features. On the other hand, we do not have the equative degree (note that it marks the adjective and it is distinct from the equative case, which marks the standard of comparison). One of the examples in UniMorph is Estonian pikkune (pikkus+ne) “as tall as”. UD Estonian contains 6 occurrences of pikkune but it does not define equative as a language-specific feature; it simply uses Degree=Pos here. We may want to define the value Degree=Equ and see if anyone uses it.
- Definiteness, values: DEF (definite), INDF (indefinite), SPEC (specific), NSPEC (non-specific). The last two are elaborations of indefinite. Specific indefinite: a certain stick; non-specific indefinite: any stick but still only one, not many. We have Definite=Def and Definite=Ind. If we include the distinction of specificity, it will be just one value cutting off the current Ind, maybe Definite=Spec. On the other hand, we have two peculiar values coming from the Prague Arabic Dependency Treebank. Definite=Red (reduced) is used for a noun that is modified by another noun in genitive and has neither definite nor indefinite morpheme. This is also called the construct state and appears in other Semitic languages, e.g. Hebrew. Definite=Com (complex) is used in improper annexation / false iḍāfa (related to the construct state but more complex) in Arabic.
- Deixis subclassifies demonstrative pronouns, which in some languages are also used instead of 3rd person pronouns. We have only PronType=Dem (or Prs) but we do not distinguish the other dimensions at present. At least distance would make sense for the languages we already have in UD, but the original tagsets did not care about it.
  - Distance, values: PROX (proximate), MED (medial), REMT (remote).
  - Reference Point, values: REF1 (speaker), REF2 (addressee), NOREF (distal, i.e. neither speaker nor addressee), PHOR (phoric, i.e. either anaphoric or cataphoric; previously mentioned or to be disambiguated). This feature sometimes overlaps with distance and sometimes is explicitly separated.
  - Visibility, values: VIS (visible), NVIS (not visible).
  - Verticality, values: ABV (above the level plane of the speaker), EVEN (at the same level), BEL (below the level of the speaker).
- Evidentiality, values: FH (firsthand), DRCT (direct), SEN (sensory), VISU (visual), NVSEN (non-visual sensory), AUD (auditory), NFH (non-firsthand), QUOT (quotative), RPRT (reported), HRSY (hearsay), INFER (inferred), ASSUM (assumed). UD v1 does not have this feature, although we have Mood=Qot (et, lv) and Tense=Nar (re-narrative past tense; the value was intended for Turkish but in the end it was not used there, and language-specific Evidentiality=Nfh was introduced instead). We propose to make Evidentiality a universal feature in UD v2. Maybe we can just adopt the values used in UniMorph. We will have to check how it interacts with the quotative mood in Estonian and Latvian.
- Finiteness, values: FIN (finite), NFIN (nonfinite). We have VerbForm=Fin for finite; any other verb form is nonfinite.
- Gender and Noun Class, values: MASC (masculine), FEM (feminine), NEUT (neuter), BANTU1-23 (noun classes in Bantu languages), NAKH1-8 (noun classes in Nakh-Daghestanian languages). We have Gender=Masc, Fem, Neut for the three genders; in addition, we have Gender=Com for the common gender in Scandinavian languages, which only distinguish neutrum (Neut) and utrum (Com). At present we do not cover any Bantu or Nakh-Daghestanian language.
- Information Structure, values: TOP (topic), FOC (focus). We do not have a feature for information structure but there are not many languages where it is marked via overt affixal morphology.
- Interrogativity, values: DECL (declarative), INT (interrogative). Used for verbs. We do not have a feature for this. PronType=Int could possibly be abused to encode verbs with an interrogative morpheme, but it would be much better not to mix the feature with pronominal types; we also don’t mix Negative=Neg with PronType=Neg.
- Language-specific features LGSPEC1, LGSPEC2 etc. UniMorph uses them to distinguish alternating forms whose selection is not tied to meaning. For example, genitive of German Buch is either Buchs or Buches. One form will be LGSPEC1 and the other LGSPEC2. We do not have anything similar in UD.
- Mood, values: IND (indicative, Mood=Ind), SBJV (subjunctive, Mood=Sub), REAL (realis), IRR (irrealis), AUPRP (Australian purposive), AUNPRP (Australian non-purposive), IMP (imperative-jussive, Mood=Imp, Mood=Jus), COND (conditional, Mood=Cnd), PURP (general purposive, “in order to”), INTEN (intentive), POT (potential, Mood=Pot), LKLY (likely), ADM (admirative), OBLIG (obligative, Mood=Nec), DEB (debitive, Mood=Nec), PERM (permissive), DED (deductive), SIM (simulative, “as if”), OPT (optative-desiderative, Mood=Opt, Mood=Des).
  - We do not have realis and irrealis but I wonder whether we actually have to distinguish them from indicative and subjunctive, respectively.
  - There are no Australian languages in UD at present, and we do not encode the Australian purposive vs. non-purposive. The discussion in UniMorph suggests that the non-purposive vs. purposive opposition can be viewed as on par with the realis vs. irrealis and indicative vs. subjunctive oppositions.
  - Imperatives are direct commands for the addressee while hortatives and jussives include more suggestive forms, such as “let them/us X”. UniMorph conflates all three into IMP (imperative-jussive). We have a separate label for jussive but we only use it in Arabic. Nevertheless, Arabic also has the imperative, so it makes sense to distinguish the two values.
  - UniMorph conflates optative and desiderative into one mood, OPT, while we have both Mood=Opt and Mood=Des. UD Turkish uses both values but the Turkish documentation says that desiderative is morphologically identical to conditional, so it is questionable whether we want to keep Des there. In addition, optative is used in Finnish, Gothic, Ancient Greek and Sanskrit.
  - John’s comment on conditional: ambiguous between designating the protasis / condition (“if it rained”), and the apodosis / result from that condition being fulfilled (“, the game would be cancelled”). Languages differ in whether the conditional applies to the protasis, the apodosis or both. Spanish uses the imperfect subjunctive for the protasis, but the conditional for the apodosis.
  - UniMorph distinguishes debitive from obligative while we have only one corresponding value, necessitative.
  - Other moods that are used in UD Turkish: Abil, AbilCnd, AbilDes, AbilGen, AbilGenNec, AbilImp, AbilNec, AbilPrs, Gen, GenNec, Prs. Abil should probably be replaced by the universal value Mood=Pot (potential). Prs means persuasive, reportedly related to imperative but different in that we try to persuade the addressee to do something instead of just commanding.
  - We currently lack values for purposive, intentive, admirative, permissive, deductive and simulative, and we may want to add them.
- Number, values: SG (singular, Number=Sing), PL (plural, Number=Plur), DU (dual, Number=Dual), TRI (trial), PAUC (paucal), GPAUC (greater paucal), GRPL (greater plural), INVN (inverse).
  - We may want to add values Number=Tri, Pauc, Grpa, Grpl, Inv.
  - We have two additional values, Ptan (plurale tantum) and Coll (singulare tantum), which may be viewed as elaborations of Plur and Sing, respectively, and which encode an inherent property of nouns rather than inflection. Agreeing adjectives and verbs never take these values; they use Sing and Plur instead.
- Part of Speech, values: N (noun), PROPN (proper name), ADJ (adjective), PRO (pronoun), CLF (classifier), ART (article), DET (determiner), V (verb), ADV (adverb), AUX (auxiliary), V.PTCP (participle, verbal adjective), V.MSDR (masdar, verbal noun), V.CVB (converb, verbal adverb), ADP (adposition), COMP (complementizer), CONJ (conjunction), NUM (numeral), PART (particle), INTJ (interjection).
  - We do not have a separate tag for classifiers. I believe that we subsume them under nouns in UD Chinese.
  - We subsume articles within determiners but we distinguish them by PronType=Art.
  - We do not treat participles, verbal nouns and verbal adverbs as separate parts of speech. We distinguish them by the VerbForm feature and we allow language-specific guidelines to put them under VERB, or under ADJ / NOUN / ADV. Our v1 guidelines propose to call verbal nouns gerunds (VerbForm=Ger). This is motivated by the English gerunds but it should probably be revised because the term has quite different meanings in different languages, which causes confusion. On the other hand, I do not quite like the term masdar used in UniMorph, which is taken from Arabic but not widely understood elsewhere. Perhaps VerbForm=Vnoun would be enough? (Note that infinitives can also be used like nouns in some languages; these would keep VerbForm=Inf.) In contrast, I find the term converb (Haspelmath, 1995) quite appropriate and language-neutral. We currently use VerbForm=Trans (transgressive) but this term turned out to be known only in a few Slavic languages. So we may rename it to VerbForm=Conv (or maybe VerbForm=Vadv, verbal adverb, adverbial participle).
  - They have a tag for auxiliary verbs while we are now proposing to get rid of it.
  - Unlike us, they conflate coordinating and subordinating conjunctions in one tag CONJ. However, they have a separate tag for complementizers while we include them in SCONJ.
  - Like us, they distinguish NUM, PART and INTJ. They actually refer to us when explaining why they include these categories.
- Person, values: 0, 1, 2, 3, 4, INCL (inclusive we), EXCL (exclusive we), PRX (proximate), OBV (obviative). We have only the classical 1st–3rd persons. Zero person is for impersonal statements; it appears in Finnish as well as in Santa Ana Pueblo Keres. The fourth person is used in some languages to describe an otherwise third-person referent that is differentiated from other third-person referents. Clusivity could be encoded as a separate language-specific feature, which would be in line with UniMorph, which combines 1+INCL or 1+EXCL.
- Polarity, values: NEG (negative), POS (positive, affirmative). We have been calling the feature Negative(ness) but we propose to rename it to Polarity in UD v2.
- Politeness: we have it only as a language-specific feature, used rarely.
  - Speaker-Referent Axis (whether or not the referent happens to also be the addressee). INFM (informal; the tu 2nd person singular pronoun in Indo-European languages). FORM (formal; the vous 2nd person singular pronoun in Indo-European languages). Sublevels of the formal level: FORM+ELEV (referent elevating; sonkeigo forms in Japanese). FORM+HUMB (speaker humbling; kenjougo forms in Japanese).
  - Speaker-Addressee Axis (not referring to the addressee). Japanese teineigo is an example of an addressee honorific system. POL (polite), MPOL (medium polite).
  - Speaker-Bystander Axis. AVOID (avoidance style, taboo language = used in the presence of anyone to whom the avoidance relationship applies, e.g. mother-in-law), LOW (low status = language used in the presence of only those having a low status), HIGH (high status = in the presence of the secondary chief(tess)), STELV (elevated status = in the presence of the primary chieftess), STSUPR (supreme status = in the presence of the primary chief). The neutral level is unspecified.
  - Speaker-Setting Axis; referred to as register in sociolinguistics. LIT (literary, Style=Form), FOREG (formal register, Style=Form), COL (colloquial, Style=Coll). We have corresponding features in the section of language-specific extensions but they are currently used only in a few treebanks (cs, da, fi).
- Possession is a templatic feature that may incorporate features of the possessor such as person and number. We encode the same situation using the boolean feature Poss=Yes, and separate features for Person, Number etc. If it is necessary to distinguish them from same-named inflectional features of the possessive word, we use layered features on the [psor] (“possessor”) layer: Person[psor], Number[psor] etc. UniMorph defines the following combinations: PSS1S (possession by 1st person singular), PSS2S, PSS2SM (2nd person singular masculine), PSS2SF, PSS2SINFM (informal), PSS2SFORM, PSS3S, PSS3SM, PSS3SF, PSS1D, PSS1DI (dual inclusive), PSS1DE (exclusive), PSS2D, PSS2DM, PSS2DF, PSS3D, PSS3DM, PSS3DF, PSS1P, PSS1PI, PSS1PE, PSS2P, PSS2PM, PSS2PF, PSS3P, PSS3PM, PSS3PF. In addition, they define simple PSSD (possessive but without marking features of the possessor), and also ALN for alienable and NALN for inalienable possession. Alienable means that the ownership can change (“my house”) while inalienable means that it cannot change (“my back”).
- Switch Reference, values: SS (same subject), DS (different subject), SSADV, DSADV. When there are two verbs in a row, switch-reference is morphological marking of whether they have or do not have the same subject. We do not have this feature in UD.
- Tense, values: PRS (present, Tense=Pres), PST (past, Tense=Past), FUT (future, Tense=Fut), IMMED (immediate), HOD (hodiernal, i.e. today), 1DAY (within one day), RCT (recent), RMT (remote).
  - They envisage combining their features, e.g. FUT+HOD or PST+RCT.
  - We currently only have present, past and future without the more specific values like recent and remote.
  - Moreover, we cover two tense-aspect combinations that may have separate morphological forms and sometimes cannot be represented by Tense+Aspect because there is also the lexical aspect (as in Bulgarian). We would have to redesign our scheme and add aktionsart, or use two layered aspects on one word to solve this. The combinations are Tense=Imp (imperfect tense) and Tense=Pqp (pluperfect).
- Valency, values: IMPRS (impersonal), INTR (intransitive), TR (transitive), DITR (ditransitive), REFL (reflexive), RECP (reciprocal), CAUS (causative), APPL (applicative). At present we do not have a valency feature in UD. We only have a suggestion for a language-specific feature, with only two values, Subcat=Intr and Subcat=Tran, which are currently used only in UD Dutch. We do account for causativity and reciprocality in the Voice feature. We also have a boolean feature Reflex=Yes but most of the time we use it to mark reflexive pronouns. The Valency feature in UniMorph captures number of arguments (arity) of the verb: e.g. the causative morpheme adds one participant (the person who is forced to do the thing). The Voice feature in UD is more about switching roles of the participants. Obviously, the two features must interact with each other.
- Voice, values: ACT (active, Voice=Act), MID (middle, Voice=Mid), PASS (passive, Voice=Pass), ANTIP (antipassive), DIR (direct), INV (inverse), AGFOC (agent focus), PFOC (patient focus), LFOC (location focus), BFOC (beneficiary focus), ACFOC (accompanier focus), IFOC (instrument focus), CFOC (conveyed focus).
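The correspondences noted in the comparison above could be captured in a small lookup, and the templatic UniMorph possession tags could be unpacked into Poss=Yes plus layered features. The following Python sketch is only an illustration: the names SIMPLE and pss_to_ud are hypothetical, the table covers just a few of the pairs listed above, and suffixes for clusivity and politeness are returned unmapped because no UD counterpart is settled here.

```python
import re

# A few correspondences named in the comparison above. None marks a UniMorph
# value for which the text says UD currently has no counterpart.
SIMPLE = {
    "IND": "Mood=Ind", "SBJV": "Mood=Sub", "COND": "Mood=Cnd", "POT": "Mood=Pot",
    "SG": "Number=Sing", "PL": "Number=Plur", "DU": "Number=Dual",
    "PRS": "Tense=Pres", "PST": "Tense=Past", "FUT": "Tense=Fut",
    "ACT": "Voice=Act", "MID": "Voice=Mid", "PASS": "Voice=Pass",
    "TRI": None, "PAUC": None, "RCT": None, "RMT": None, "ANTIP": None,
}

# PSS + person digit + number letter + optional suffix (M, F, I, E, INFM, FORM).
PSS = re.compile(r"PSS([123])([SDP])(.*)")
NUMBER = {"S": "Sing", "D": "Dual", "P": "Plur"}
GENDER = {"M": "Masc", "F": "Fem"}


def pss_to_ud(tag):
    """Decode a UniMorph possession template into UD layered features.

    Returns (feats, leftover): feats is a '|'-joined, sorted feature string;
    leftover is any suffix (clusivity, politeness) this sketch does not map.
    """
    if tag == "PSSD":  # possessive, possessor features not marked
        return "Poss=Yes", ""
    match = PSS.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a possession tag: {tag!r}")
    person, number, suffix = match.groups()
    feats = ["Poss=Yes", f"Person[psor]={person}", f"Number[psor]={NUMBER[number]}"]
    if suffix in GENDER:
        feats.append(f"Gender[psor]={GENDER[suffix]}")
        suffix = ""
    return "|".join(sorted(feats)), suffix


print(SIMPLE["SBJV"])
print(pss_to_ud("PSS2SM"))
print(pss_to_ud("PSS1DI"))
```

Returning the unmapped suffix separately keeps the open questions (clusivity, politeness of the possessor) visible instead of silently dropping them.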
Stuff to check
- Does Hungarian have Case=Abs?
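A question like this could be checked by scanning a treebank for the values a feature actually takes. The sketch below assumes standard CoNLL-U input (ten tab-separated columns, FEATS in the sixth); the function name count_feature_values is hypothetical and the sample lines are placeholders, not real Hungarian data.

```python
from collections import Counter


def count_feature_values(lines, feature):
    """Count the values of one feature in the FEATS column of CoNLL-U lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # sentence boundary or comment line
        cols = line.split("\t")
        if len(cols) != 10 or cols[5] == "_":
            continue  # malformed line or empty FEATS
        for item in cols[5].split("|"):
            name, _, value = item.partition("=")
            if name == feature:
                counts[value] += 1
    return counts


# Placeholder tokens, for illustration only.
sample = [
    "# text = placeholder\n",
    "1\tw1\t_\tNOUN\t_\tCase=Abs|Number=Sing\t0\troot\t_\t_\n",
    "2\tw2\t_\tNOUN\t_\tCase=Nom|Number=Sing\t1\tdep\t_\t_\n",
    "\n",
]
print(count_feature_values(sample, "Case"))
```

With a real treebank, the list would be replaced by an open file handle, since the function only iterates over lines.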
Inventory of features that will stay language-specific
- AdpType distinguishes prepositions from postpositions, but also a few other types. It is used in a number of treebanks, but the usage is not consistent and I have some doubts whether it is useful. Many languages have a strong preference towards either pre- or postpositions.
- AdvType: potentially useful in many languages but currently almost unused.
- Clitic: specific to Finnish
- ConjType: used only in Czech
- Connegative: Finnish and Estonian
- Derivation: specific to Finnish
- Dialect: used only in Irish
- Echo=Rdp (reduplicative), used in Hindi and Turkish
- Form: specific to Irish
- HebBinyan: specific to Hebrew (but I believe it could be converted to aspect and voice)
- HebExistential: specific to Hebrew
- HebSource: debugging feature
- Hyph=Yes: used only in a few treebanks
- InfForm: specific to Finnish
- NameType: used only in Czech
- NounType: specific to Irish
- NumValue: used in Czech and Arabic
- PartForm: specific to Finnish
- Position: used only in Romanian
- Prefix=Yes: specific to Hebrew
- PrepCase: used in Catalan, Czech, Polish, Portuguese, Spanish
- PrepForm: specific to Irish (maybe it could be renamed to AdpType, which is used elsewhere)
- PunctSide: used in ca, es_ancora, fi_ftb, nl
- PunctType: used in ca, es_ancora, nl, ta; not consistent
- Style: used in Czech, Danish and Finnish
- Subcat: used only in Dutch
- Typo=Yes could be useful in all treebanks but we first need a general guideline for handling typos. Should the form in FORM be original, or fixed? And should we have a MISC attribute with the fixed or original form? See also issue 330.
- Variant; Variant=Brev in Russian denotes the short-form adjective (холоден, as opposed to холодный). The corresponding forms are distinguished by definiteness in South Slavic, and by Variant=Short (vs. Long) in West Slavic. Maybe this opposition would deserve a Slavic-specific feature, AdjForm=Short|Long.
- VerbType=Aux|Mod|Cop|Main; currently used in Hebrew, Dutch and Latin; it has to be seen how much such a feature will be demanded if we remove the AUX tag.
- Xtra=Junk: used in Hebrew
All layered features
We may want to standardize some of the layers but they seem to be de-facto standardized anyway.
- Gender: [dat], [erg], [psor]
- Number: [abs], [dat], [erg], [psed], [psor]
- Person: [abs], [dat], [erg], [psor]
- Polite: [abs], [dat], [erg]
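To make the layer notation concrete, a FEATS string can be split into feature, layer and value. This is a minimal sketch assuming the usual Name=Value and Name[layer]=Value items joined by a vertical bar; the function name parse_feats and the example string are illustrative only.

```python
import re

# Feature name, optional [layer], then the value.
FEAT = re.compile(r"([A-Za-z0-9]+)(?:\[([a-z0-9]+)\])?=(.+)")


def parse_feats(feats):
    """Map (feature, layer) -> value; layer is None for unlayered features."""
    if feats == "_":  # empty FEATS column
        return {}
    parsed = {}
    for item in feats.split("|"):
        match = FEAT.fullmatch(item)
        if match is None:
            raise ValueError(f"malformed feature: {item!r}")
        name, layer, value = match.groups()
        parsed[(name, layer)] = value
    return parsed


feats = parse_feats("Number=Sing|Number[psor]=Plur|Person[psor]=1|Poss=Yes")
print(feats[("Number", None)], feats[("Number", "psor")])
```

Keying on the (feature, layer) pair keeps Number and Number[psor] apart, which is the point of layering.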
References
- Blake, Barry J. 2001. Case. Cambridge: Cambridge University Press. 2nd edition.
- Davis, Irvine. 1964. The language of Santa Ana Pueblo (anthropological papers, no. 69). Smithsonian Institution Bureau of American Ethnology, Bulletin 191: Anthropological Papers, Numbers 68-74, Washington, DC: United States Government Printing Office, 53–190.
- Göksel, Asli and Kerslake, Celia. 2005. Turkish: A Comprehensive Grammar. New York: Routledge.
- Haspelmath, Martin. 1995. The converb as a cross-linguistically valid category. Converbs in Cross-Linguistic Perspective: Structure and Meaning of Adverbial Verb Forms – Adverbial Participles, Gerunds –, edited by Martin Haspelmath and Ekkehard König, Berlin: Mouton de Gruyter, Empirical Approaches to Language Typology, 1–56.
- Heine, Bernd and Kuteva, Tania. 2002. World Lexicon of Grammaticalization. Cambridge: Cambridge University Press.
- Kornfilt, Jaklin. 1997. Turkish. London and New York: Routledge.
- Sussex, Roland and Cubberley, Paul. 2006. The Slavic Languages. Cambridge: Cambridge University Press.
- Sylak-Glassman, John. 2016. The Composition and Use of the Universal Morphological Feature Schema (UniMorph Schema) Working Draft v2, May 25 2016.