
Additional language-specific values for universal features

The following features belong to the universal set, but some of the values needed by individual languages are not yet defined there. These values will likely be added in future versions of the universal set.

Aspect

Definite

Degree

PronType

Tense

VerbForm

Voice

Language-specific features

In addition to the universal set of features, it is desirable to recognize word features that are particular to one language or a small group of related languages. We also include here features that are not language-specific but rather treebank-specific. They encode something that could occur in many languages but that only a few treebanks choose to annotate (for example, whether a word is an abbreviation).

These features are not part of the core universal set, but if they appear in more than one language, they should be encoded identically in all of those languages.

For the universal features, there may be additional language-specific values that are not (yet) defined at the universal level.

Features that have brackets in their name (such as Gender[psor]) are layered features. This means that the same feature applies to a word more than once, in layers. The layer is indicated in the brackets. Layered features are clones of existing non-layered universal or language-specific features. Each has its own language-specific documentation that describes the meaning of the layer, explains how the list of values is modified for the layer (if at all), and provides layer-specific examples.
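The bracket notation can be handled mechanically when reading annotated data. The following Python sketch is illustrative only: the helper names `split_feature_name` and `parse_feats` are hypothetical, and the code assumes that features are serialized as `Name=Value` pairs joined by `|`, with `_` standing for an empty feature set, as in the CoNLL-U FEATS column. It splits each feature name into its base feature and its optional layer.

```python
import re

# Assumed shape of a feature name: a capitalized base name, optionally
# followed by a lowercase layer in square brackets, e.g. "Gender[psor]".
_FEATURE_NAME = re.compile(r"([A-Z][A-Za-z0-9]*)(?:\[([a-z0-9]+)\])?")


def split_feature_name(name):
    """Return (base, layer); layer is None for a non-layered feature."""
    match = _FEATURE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid feature name: {name!r}")
    return match.group(1), match.group(2)


def parse_feats(feats):
    """Parse 'Name=Value|Name[layer]=Value' into {(base, layer): value}.

    Assumes the CoNLL-U convention that '_' means "no features".
    """
    if feats == "_":
        return {}
    parsed = {}
    for pair in feats.split("|"):
        name, value = pair.split("=", 1)
        parsed[split_feature_name(name)] = value
    return parsed


if __name__ == "__main__":
    # Made-up feature string: the word's own gender and number,
    # plus the gender of its possessor in a separate layer.
    print(parse_feats("Gender=Masc|Gender[psor]=Fem|Number=Sing"))
```

Keying the result by the pair (base, layer) keeps `Gender` and `Gender[psor]` as separate entries while still making it easy to collect all layers of one base feature.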

The universal features are mostly derived from the Interset Project (Zeman, 2008). Interset contains additional features that have not yet been adopted as universal features. However, they may be used, if necessary, as part of the “language-specific extensions” to the universal features.

There are automatically generated, approximate conversion tables that map existing tagsets of various languages to the universal part-of-speech tags and to the universal and language-specific features.

Abbr

AdpType

AdvType

ConjType

Connegative

Derivation

Echo

Foreign

Gender[dat]

Gender[erg]

Gender[psor]

Hyph

InfForm

NameType

NounType

Number[abs]

Number[dat]

Number[erg]

Number[psee]

Number[psor]

NumForm

NumValue

PartForm

PartType

Person[abs]

Person[dat]

Person[erg]

Person[psor]

Polite

Polite[abs]

Polite[erg]

Polite[dat]

Prefix

PrepCase

PunctSide

PunctType

Style

Subcat

Typo

Variant

VerbType