Additional language-specific values for universal features
The following features belong to the universal set, but some of their values are not yet defined there. These values are likely to be included in future versions of the universal set.
- Definite=2: definiteness-like agreement of verbs with a second person object in Hungarian. Hungarian verbs have to be conjugated in harmony with the definiteness of the object, making a difference between a definite object (nézem a filmet “I am watching the film”), an indefinite object (nézek egy filmet “I am watching a film”) and a second person object (nézlek téged “I am watching you”).
- PronType=Exc: exclamative pronoun or determiner. It expresses the speaker’s surprise towards the modified noun, e.g. what in What a surprise! In many languages, exclamative determiners are recruited from the set of interrogative determiners. Therefore, not all tagsets distinguish them. For instance, they are distinguished in Spanish (es::conll2009), Catalan (ca::conll2009), Italian (it::isdt) and Persian (fa::conll).
- PronType=Clit: used for clitic pronouns in Italian, e.g. Si è rotto “It has broken”.
- Tense=Aor: aorist (as opposed to Past) in Ancient Greek. Note that other languages may have a past tense that they traditionally call aorist, but they mark it using the normal Past value because the other past tenses have their own special values. Bulgarian is an example: Bulgarian aorist is labeled Past and imperfect past is labeled Imp.
- Voice=Mid: middle voice in Ancient Greek. (The mediopassive voice can be expressed as Voice=Mid,Pass.)
- Voice=Int: intensive voice/aspect (the PIEL binyan) in Hebrew.
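A value list such as Voice=Mid,Pass means that one feature carries several values at once. The following is a minimal sketch of how such annotations could be read, assuming the features are serialized in the CoNLL-U FEATS style (pairs separated by `|`, multiple values separated by commas, `_` for an empty set). The function name `parse_feats` is hypothetical and not part of any official tool.

```python
from typing import Dict, List


def parse_feats(feats: str) -> Dict[str, List[str]]:
    """Parse a feature string such as 'Tense=Aor|Voice=Mid,Pass'.

    Assumes CoNLL-U style serialization: 'Name=Value' pairs joined by '|',
    several values of one feature joined by ',', and '_' for no features.
    """
    if feats in ("", "_"):
        return {}
    parsed: Dict[str, List[str]] = {}
    for pair in feats.split("|"):
        # Split only on the first '=' so the value part is kept intact.
        name, _, values = pair.partition("=")
        parsed[name] = values.split(",")
    return parsed


print(parse_feats("Tense=Aor|Voice=Mid,Pass"))
# {'Tense': ['Aor'], 'Voice': ['Mid', 'Pass']}
```

Keeping the values as a list makes it straightforward to test for a single value (for example, whether `"Mid"` is among the values of `Voice`) without treating `Mid,Pass` as a separate value of its own.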
Language-specific features
In addition to the universal set of features, it is desirable to recognize word features that are particular to one language or a small group of related languages. We also include here features that are not language-specific but rather treebank-specific. They encode something that could occur in many languages but only a few treebanks choose to tag it (for example, whether a word is an abbreviation).
These features are not part of the core universal set but if they appear in more than one language, they should be encoded in all the languages identically.
For the universal features, there may be additional language-specific values that are not (yet) defined at the universal level.
Features that have brackets in their name (such as Gender[psor]) are layered features. A layered feature applies more than once to a word, in layers; the layer is indicated in the brackets. Layered features are clones of existing non-layered universal or language-specific features. They have their own language-specific documentation that describes the meaning of the layer and how the list of values is modified for the layer (if at all), and provides layer-specific examples.
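As an illustration of the naming convention, the sketch below separates a layered feature name into the base feature and its layer. It assumes only what is stated above, namely that the layer appears in square brackets at the end of the name; the helper `split_layer` is a hypothetical name chosen for this example.

```python
import re
from typing import Optional, Tuple

# Base name followed by one bracketed layer, e.g. "Gender[psor]".
_LAYERED = re.compile(r"^(?P<base>[^\[\]]+)\[(?P<layer>[^\[\]]+)\]$")


def split_layer(name: str) -> Tuple[str, Optional[str]]:
    """Return (base feature, layer), with layer None for non-layered names."""
    match = _LAYERED.match(name)
    if match is None:
        return name, None
    return match.group("base"), match.group("layer")


print(split_layer("Gender[psor]"))  # ('Gender', 'psor')
print(split_layer("Gender"))        # ('Gender', None)
```

Because a layered feature is a clone of a non-layered one, the base name returned here can be used to look up the value inventory of the original feature, subject to any layer-specific modifications described in the language-specific documentation.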
The universal features are mostly derived from the Interset Project (Zeman, 2008). Interset contains additional features that have not yet been adopted as universal features. However, they may be used, if necessary, as part of the “language-specific extensions” to the universal features.
There are automatically generated approximate conversion tables from existing tagsets of various languages to the universal part-of-speech tags and universal + language-specific features.
- Arabic values: Yes
- Czech values: Yes
- Estonian values: Yes
- Finnish values: Yes
- Latin values: Yes
- Slovenian values: Yes
- Interset: Yes
- Ancient Greek values: Prep
- Arabic values: Prep
- Czech values: Prep, Voc, Comprep
- Dutch values: Prep, Post, Circ
- Estonian values: Prep, Post
- Latin values: Prep
- Portuguese values: Prep
- Tamil values: Post
- Interset: Prep, Post, Circ, Voc
- Interset: Man, Loc, Tim, Deg, Cau, Mod, Sta, Ex, Adadj
- Czech values: Oper
- Interset: Comp, Oper
- Finnish values: Yes
- Finnish values: Minen, Sti, Inen, Lainen, Ja, Ton, Vs, Ttain, Ttaa
- Interset: Rdp, Ech
- Arabic values: Foreign
- Czech values: Foreign, Fscript, Tscript
- Dutch values: Foreign
- Finnish values: Foreign, Fscript
- Slovenian values: Yes
- Spanish values: Foreign, Fscript
- Interset: Foreign, Fscript, Tscript
- Basque values: Masc, Fem
- Interset: Masc, Fem
- Basque values: Masc, Fem
- Interset: Masc, Fem
- Czech values: Masc, Fem
- Slovenian values: Masc, Fem, Neut
- Interset: Masc, Fem
- Czech values: Yes
- Portuguese values: Yes
- Interset: Yes
- Finnish values: 1, 2, 3
- Czech values: Geo, Prs, Giv, Sur, Nat, Com, Pro, Oth
- Estonian values: Nat
- Interset: Geo, Prs, Giv, Sur, Nat, Com, Pro, Oth
- Interset: Com, Prop, Class
- Basque values: Sing, Plur
- Interset: Sing, Plur
- Basque values: Sing, Plur
- Interset: Sing, Plur
- Basque values: Sing, Plur
- Interset: Sing, Plur
- Interset: Sing, Plur
- Czech values: Sing, Plur
- Finnish values: Sing, Plur
- Portuguese values: Sing, Plur
- Slovenian values: Sing, Dual, Plur
- Interset: Sing, Plur
- Arabic values: Digit, Word
- Czech values: Digit, Roman, Word
- Estonian values: Digit, Word
- Latin values: Digit
- Slovenian values: Digit, Roman, Word
- Tamil values: Digit
- Interset: Digit, Roman, Word
- Arabic values: 1, 2, 3
- Czech values: 1, 2, 3
- Interset: 1, 2, 3
- Finnish values: Pres, Past, Agt, Neg
- Dutch values: Inf, Vbp
- Interset: Mod, Emp, Res, Inf, Vbp
- Basque values: 1, 2, 3
- Interset: 1, 2, 3
- Basque values: 1, 2, 3
- Interset: 1, 2, 3
- Basque values: 1, 2, 3
- Interset: 1, 2, 3
- Finnish values: 1, 2, 3
- Interset: 1, 2, 3
- Basque values: Inf, Pol
- Spanish values: Pol
- Tamil values: Pol
- Interset: Inf, Pol
- Basque values: Inf, Pol
- Interset: Inf, Pol
- Basque values: Inf, Pol
- Interset: Inf, Pol
- Basque values: Inf, Pol
- Interset: Inf, Pol
- Interset: Yes
- Czech values: Npr, Pre
- Portuguese values: Pre
- Spanish values: Npr, Pre
- Interset: Npr, Pre
- Dutch values: Ini, Fin
- Interset: Ini, Fin
- Dutch values: Brck, Colo, Comm, Excl, Peri, Qest, Quot, Semi
- Estonian values: Comm, Excl, Peri, Qest
- Tamil values: Comm, Peri
- Interset: Peri, Qest, Excl, Quot, Brck, Comm, Colo, Semi, Dash
- Czech values: Arch, Rare, Poet, Norm, Coll, Vrnc, Slng, Expr, Derg, Vulg
- Finnish values: Arch, Coll
- Interset: Arch, Rare, Poet, Norm, Coll, Vrnc, Slng, Expr, Derg, Vulg
- Dutch values: Intr, Tran
- Interset: Intr, Tran
- Finnish values: Yes
- Portuguese values: Yes
- Interset: Yes
- Czech values: Short
- Dutch values: Short
- Slovenian values: Bound, Short
- Dutch values: Aux, Cop, Mod
- Estonian values: Mod
- Italian values: Aux, Cop, Mod
- Latin values: Mod
- Interset: Aux, Cop, Mod, Light