Additional language-specific values for universal features
The following features are included in the universal set, but some values are missing there. It is likely that these values will be included in future versions of the universal set.
Aspect
- u-feat/Aspect
= Freq
: frequentative aspect in Hungarian: üt “hit” – ütöget “hit several times”
Definite
- u-feat/Definite
= 2
: definiteness-like agreement of verbs with a second person object in Hungarian. Hungarian verbs have to be conjugated in harmony with the definiteness of the object, making a difference between a definite object (nézem a filmet “I am watching the film”), an indefinite object (nézek egy filmet “I am watching a film”) and a second person object (nézlek téged “I am watching you”).
Degree
- u-feat/Degree
= Dim
: diminutive (used for nouns e.g. in Dutch: stoeltje, huisje, nippertje)
PronType
- u-feat/PronType
= Exc
: exclamative pronoun or determiner. It expresses the speaker’s surprise towards the modified noun, e.g. what in What a surprise! In many languages, exclamative determiners are recruited from the set of interrogative determiners. Therefore, not all tagsets distinguish them. For instance, they are distinguished in Spanish (es::conll2009), Catalan (ca::conll2009), Italian (it::isdt) and Persian (fa::conll).
= Clit
: used for clitic pronouns in Italian, e.g. Si è rotto “It has broken”
Tense
- u-feat/Tense
= Aor
: aorist (as opposed to Past) in Ancient Greek. Note that other languages may have a past tense that they traditionally call aorist, but they mark it using the normal Past value because the other past tenses have their own special values. Bulgarian is an example: the Bulgarian aorist is labeled Past and the imperfect past is labeled Imp.
VerbForm
- u-feat/VerbForm
= Gdv
: gerundive (as opposed to the gerund) in Latin.
Voice
- u-feat/Voice
= Mid
: middle voice in Ancient Greek. (The mediopassive voice can be expressed as Voice=Mid,Pass.)
= Int
: intensive voice/aspect (the PIEL binyan) in Hebrew.
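A combined value such as Voice=Mid,Pass means that one feature carries more than one value. The following minimal sketch shows how such an annotation could be read. It assumes a CoNLL-U style FEATS string in which `|` separates Name=Value pairs and `,` separates multiple values of one feature; the function name `parse_feats` is hypothetical and not part of any official tool.

```python
def parse_feats(feats):
    """Parse a FEATS string into a dict mapping feature names to value lists.

    Assumed format: pairs separated by '|', multiple values of one
    feature separated by ',', and '_' standing for "no features".
    """
    if feats in ("", "_"):
        return {}
    parsed = {}
    for pair in feats.split("|"):
        name, _, values = pair.partition("=")
        parsed[name] = values.split(",")
    return parsed


# Mediopassive aorist form, using the values described above.
example = parse_feats("Tense=Aor|Voice=Mid,Pass")
print(example)
```

Keeping every value in a list, even when there is only one, lets downstream code treat single and combined values uniformly.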
Language-specific features
In addition to the universal set of features, it is desirable to recognize word features that are particular to one language or a small group of related languages. We also include here features that are not language-specific but rather treebank-specific. They encode something that could occur in many languages but only a few treebanks choose to tag it (for example, whether a word is an abbreviation).
These features are not part of the core universal set, but if they appear in more than one language, they should be encoded identically in all of them.
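One rough way to spot encoding differences between languages is to compare each language's value list with a reference inventory. The sketch below uses the Foreign values listed further down this page and treats the Interset list as the reference. That choice is an assumption made for illustration, and the helper name `values_outside_reference` is hypothetical.

```python
def values_outside_reference(inventories, reference="Interset"):
    """Return, per language, the values that are absent from the reference list.

    `inventories` maps a source name (a language or the reference) to the
    list of values it uses for one feature. Treating one source as the
    reference is an assumption of this sketch, not an official rule.
    """
    ref_values = set(inventories.get(reference, ()))
    report = {}
    for source, values in inventories.items():
        if source == reference:
            continue
        extra = sorted(set(values) - ref_values)
        if extra:
            report[source] = extra
    return report


# Value lists for the Foreign feature, copied from the inventory on this page.
FOREIGN = {
    "Arabic": ["Foreign"],
    "Czech": ["Foreign", "Fscript", "Tscript"],
    "Dutch": ["Foreign"],
    "Finnish": ["Foreign", "Fscript"],
    "Slovenian": ["Yes"],
    "Spanish": ["Foreign", "Fscript"],
    "Interset": ["Foreign", "Fscript", "Tscript"],
}

print(values_outside_reference(FOREIGN))  # {'Slovenian': ['Yes']}
```

A non-empty report is only a prompt for manual review: a language may legitimately need a value that the reference list lacks.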
For the universal features, there may be additional language-specific values that are not (yet) defined at the universal level.
Features that have brackets in their name (such as Gender[psor]) are layered features. This means that a feature applies more than once to a word, in layers; the layer is indicated in the brackets. Layered features are clones of existing non-layered universal or language-specific features. They have their own language-specific documentation that describes the meaning of the layer and how the list of values is modified for the layer (if at all), and that provides layer-specific examples.
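The bracket convention can be illustrated with a small sketch that splits a feature name into its base feature and its layer. The regular expression and the function name `split_layer` are assumptions made for this example; the only part taken from the text is that the layer appears in square brackets after the feature name.

```python
import re

# Base feature name, optionally followed by a layer in square brackets.
# The permitted character classes are an assumption of this sketch.
LAYERED = re.compile(r"^(?P<base>[A-Za-z0-9]+)(?:\[(?P<layer>[a-z0-9]+)\])?$")


def split_layer(feature_name):
    """Return (base, layer); layer is None for a non-layered feature."""
    match = LAYERED.match(feature_name)
    if match is None:
        raise ValueError(f"not a recognizable feature name: {feature_name!r}")
    return match.group("base"), match.group("layer")


print(split_layer("Gender[psor]"))  # ('Gender', 'psor')
print(split_layer("Abbr"))          # ('Abbr', None)
```

Because a layered feature is a clone of a non-layered one, the base name returned here is the place to look up the default list of values before any layer-specific changes are applied.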
The universal features are mostly derived from the Interset Project (Zeman, 2008). Interset contains additional features that have not yet been adopted as universal features. However, they may be used, if necessary, as part of the “language-specific extensions” to the universal features.
There are automatically generated approximate conversion tables from existing tagsets of various languages to the universal part-of-speech tags and universal + language-specific features.
Abbr
- Arabic values: Yes
- Czech values: Yes
- Estonian values: Yes
- Finnish values: Yes
- Latin values: Yes
- Slovenian values: Yes
- Interset: Yes
AdpType
- Ancient Greek values: Prep
- Arabic values: Prep
- Czech values: Prep, Voc, Comprep
- Dutch values: Prep, Post, Circ
- Estonian values: Prep, Post
- Latin values: Prep
- Portuguese values: Prep
- Tamil values: Post
- Interset: Prep, Post, Circ, Voc
AdvType
- Interset: Man, Loc, Tim, Deg, Cau, Mod, Sta, Ex, Adadj
ConjType
- Czech values: Oper
- Interset: Comp, Oper
Connegative
- Finnish values: Yes
Derivation
- Finnish values: Minen, Sti, Inen, Lainen, Ja, Ton, Vs, Ttain, Ttaa
Echo
- Interset: Rdp, Ech
Foreign
- Arabic values: Foreign
- Czech values: Foreign, Fscript, Tscript
- Dutch values: Foreign
- Finnish values: Foreign, Fscript
- Slovenian values: Yes
- Spanish values: Foreign, Fscript
- Interset: Foreign, Fscript, Tscript
Gender[dat]
- Basque values: Masc, Fem
- Interset: Masc, Fem
Gender[erg]
- Basque values: Masc, Fem
- Interset: Masc, Fem
Gender[psor]
- Czech values: Masc, Fem
- Slovenian values: Masc, Fem, Neut
- Interset: Masc, Fem
Hyph
- Czech values: Yes
- Portuguese values: Yes
- Interset: Yes
InfForm
- Finnish values: 1, 2, 3
NameType
- Czech values: Geo, Prs, Giv, Sur, Nat, Com, Pro, Oth
- Estonian values: Nat
- Interset: Geo, Prs, Giv, Sur, Nat, Com, Pro, Oth
NounType
- Interset: Com, Prop, Class
Number[abs]
- Basque values: Sing, Plur
- Interset: Sing, Plur
Number[dat]
- Basque values: Sing, Plur
- Interset: Sing, Plur
Number[erg]
- Basque values: Sing, Plur
- Interset: Sing, Plur
Number[psee]
- Interset: Sing, Plur
Number[psor]
- Czech values: Sing, Plur
- Finnish values: Sing, Plur
- Portuguese values: Sing, Plur
- Slovenian values: Sing, Dual, Plur
- Interset: Sing, Plur
NumForm
- Arabic values: Digit, Word
- Czech values: Digit, Roman, Word
- Estonian values: Digit, Word
- Latin values: Digit
- Slovenian values: Digit, Roman, Word
- Tamil values: Digit
- Interset: Digit, Roman, Word
NumValue
- Arabic values: 1, 2, 3
- Czech values: 1, 2, 3
- Interset: 1, 2, 3
PartForm
- Finnish values: Pres, Past, Agt, Neg
PartType
- Dutch values: Inf, Vbp
- Interset: Mod, Emp, Res, Inf, Vbp
Person[abs]
- Basque values: 1, 2, 3
- Interset: 1, 2, 3
Person[dat]
- Basque values: 1, 2, 3
- Interset: 1, 2, 3
Person[erg]
- Basque values: 1, 2, 3
- Interset: 1, 2, 3
Person[psor]
- Finnish values: 1, 2, 3
- Interset: 1, 2, 3
Polite
- Basque values: Inf, Pol
- Spanish values: Pol
- Tamil values: Pol
- Interset: Inf, Pol
Polite[abs]
- Basque values: Inf, Pol
- Interset: Inf, Pol
Polite[erg]
- Basque values: Inf, Pol
- Interset: Inf, Pol
Polite[dat]
- Basque values: Inf, Pol
- Interset: Inf, Pol
Prefix
- Interset: Yes
PrepCase
- Czech values: Npr, Pre
- Portuguese values: Pre
- Spanish values: Npr, Pre
- Interset: Npr, Pre
PunctSide
- Dutch values: Ini, Fin
- Interset: Ini, Fin
PunctType
- Dutch values: Brck, Colo, Comm, Excl, Peri, Qest, Quot, Semi
- Estonian values: Comm, Excl, Peri, Qest
- Tamil values: Comm, Peri
- Interset: Peri, Qest, Excl, Quot, Brck, Comm, Colo, Semi, Dash
Style
- Czech values: Arch, Rare, Poet, Norm, Coll, Vrnc, Slng, Expr, Derg, Vulg
- Finnish values: Arch, Coll
- Interset: Arch, Rare, Poet, Norm, Coll, Vrnc, Slng, Expr, Derg, Vulg
Subcat
- Dutch values: Intr, Tran
- Interset: Intr, Tran
Typo
- Finnish values: Yes
- Portuguese values: Yes
- Interset: Yes
Variant
- Czech values: Short
- Dutch values: Short
- Slovenian values: Bound, Short
VerbType
- Dutch values: Aux, Cop, Mod
- Estonian values: Mod
- Italian values: Aux, Cop, Mod
- Latin values: Mod
- Interset: Aux, Cop, Mod, Light