
Additional language-specific values for universal features

The following features belong to the universal set, but the values listed here are not yet defined there. These values are likely to be included in future versions of the universal set.

Aspect

Aspect = Freq: frequentative aspect in Hungarian: üt “hit” – ütöget “hit several times”

Definite

Definite = 2: definiteness-like agreement of verbs with a second person object in Hungarian. Hungarian verbs are conjugated in agreement with the definiteness of the object, distinguishing a definite object (nézem a filmet “I am watching the film”), an indefinite object (nézek egy filmet “I am watching a film”) and a second person object (nézlek téged “I am watching you”).

Degree

Degree = Dim: diminutive, used for nouns, e.g. in Dutch: stoeltje, huisje, nippertje

PronType

PronType = Exc: exclamative pronoun or determiner. It expresses the speaker’s surprise towards the modified noun, e.g. what in What a surprise! In many languages, exclamative determiners are recruited from the set of interrogative determiners, so not all tagsets distinguish them. They are distinguished, for instance, in Spanish (es::conll2009), Catalan (ca::conll2009), Italian (it::isdt) and Persian (fa::conll).
PronType = Clit: used for clitic pronouns in Italian, e.g. Si è rotto “It has broken”

Tense

Tense = Aor: aorist (as opposed to Past) in Ancient Greek. Note that other languages may have a past tense that is traditionally called aorist, but they mark it with the normal Past value because their other past tenses have their own special values. Bulgarian is an example: the Bulgarian aorist is labeled Past and the imperfect past is labeled Imp.

VerbForm

VerbForm = Gdv: gerundive (as opposed to the gerund) in Latin.

Voice

Voice = Mid: middle voice in Ancient Greek. (The mediopassive voice can be expressed as Voice=Mid,Pass.)
Voice = Int: intensive voice/aspect (the PIEL binyan) in Hebrew.
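
The combined value Voice=Mid,Pass above shows that one feature can carry several comma-separated values. The following is a minimal sketch of how such annotations might be read. It assumes that feature-value pairs are joined by a vertical bar and that an underscore marks an empty feature set; neither convention is stated on this page. The function name parse_feats is hypothetical.

```python
def parse_feats(feats):
    """Turn 'Name=Val1,Val2|Name2=Val' into a dict of name -> list of values.

    Assumption: pairs are separated by '|' and '_' means "no features".
    """
    if not feats or feats == "_":
        return {}
    result = {}
    for pair in feats.split("|"):
        # Split only on the first '=' so the value part stays intact.
        name, _, values = pair.partition("=")
        # Several values of one feature are separated by commas.
        result[name] = values.split(",")
    return result


parsed = parse_feats("Tense=Aor|Voice=Mid,Pass")
empty = parse_feats("_")
```

Keeping every value in a list, even when there is only one, lets downstream code treat Voice=Mid and Voice=Mid,Pass uniformly.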

Language-specific features

In addition to the universal set of features, it is desirable to recognize word features that are particular to one language or a small group of related languages. We also include here features that are not language-specific but rather treebank-specific: they encode something that could occur in many languages but that only a few treebanks choose to tag (for example, whether a word is an abbreviation).

These features are not part of the core universal set, but if they appear in more than one language, they should be encoded identically in all of those languages.

For the universal features, there may be additional language-specific values that are not (yet) defined at the universal level.

Features that have brackets in their name (such as Gender[psor]) are layered features. A layered feature applies more than once to the same word, in layers, and the layer is indicated in the brackets. Layered features are clones of existing non-layered universal or language-specific features. They have their own language-specific documentation that describes the meaning of the layer, explains how the list of values is modified for the layer (if at all), and provides layer-specific examples.
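
As an illustration of the bracket notation, the sketch below separates a feature name such as Gender[psor] into its base feature and its layer. It assumes that a name carries at most one bracketed layer at the very end; the helper name split_layer is hypothetical and not part of any official tool.

```python
import re

# Assumption: a feature name is a base name optionally followed by
# exactly one bracketed layer, e.g. "Gender[psor]".
LAYERED_NAME = re.compile(r"^(?P<base>[^\[\]]+)(?:\[(?P<layer>[^\[\]]+)\])?$")


def split_layer(name):
    """Return (base, layer); layer is None for a non-layered feature."""
    match = LAYERED_NAME.match(name)
    if match is None:
        raise ValueError(f"not a well-formed feature name: {name!r}")
    return match.group("base"), match.group("layer")


layered = split_layer("Gender[psor]")
plain = split_layer("Gender")
```

Because a layered feature is a clone of a non-layered one, the base name returned here can be used to look up the value inventory of the original feature.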

The universal features are mostly derived from the Interset Project (Zeman, 2008). Interset contains additional features that have not yet been adopted as universal features. However, they may be used, if necessary, as part of the “language-specific extensions” to the universal features.

There are automatically generated approximate conversion tables from existing tagsets of various languages to the universal part-of-speech tags and universal + language-specific features.