
This page pertains to UD version 2.

Tokenization

The tokenization of Slovenian UD treebanks follows the same principles as the original ssj500k corpus and the rule-based Obeliks tokenizer. Namely:

A token that is not followed by a space (e.g. d.o.o. as opposed to d. o. o.) is marked with the feature SpaceAfter=No in the MISC column.
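As an illustration, the following minimal Python sketch rebuilds the surface text of a sentence from its CoNLL-U token lines by checking the MISC column for SpaceAfter=No. The function name `detokenize` and the helper `make_line` are hypothetical, and the sample rows are made up for this example: they assume that d.o.o. is split into the tokens d., o. and o., and every column other than ID, FORM and MISC is left as `_`.

```python
def detokenize(conllu_sentence: str) -> str:
    """Rebuild surface text from CoNLL-U token lines using SpaceAfter=No."""
    pieces = []
    for line in conllu_sentence.splitlines():
        # Skip blank lines and sentence-level comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        tok_id, form, misc = cols[0], cols[1], cols[9]
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in tok_id or "." in tok_id:
            continue
        pieces.append(form)
        misc_items = [] if misc == "_" else misc.split("|")
        if "SpaceAfter=No" not in misc_items:
            pieces.append(" ")
    return "".join(pieces).rstrip()


def make_line(tok_id: str, form: str, misc: str) -> str:
    """Hypothetical helper: build a 10-column row with unused columns as '_'."""
    return "\t".join([tok_id, form] + ["_"] * 7 + [misc])


# Made-up sample: the same three tokens, with and without SpaceAfter=No.
tight = "\n".join([
    make_line("1", "d.", "SpaceAfter=No"),
    make_line("2", "o.", "SpaceAfter=No"),
    make_line("3", "o.", "_"),
])
spaced = "\n".join([
    make_line("1", "d.", "_"),
    make_line("2", "o.", "_"),
    make_line("3", "o.", "_"),
])

print(detokenize(tight))   # d.o.o.
print(detokenize(spaced))  # d. o. o.
```

The sketch treats MISC as a `|`-separated list, so SpaceAfter=No is still found when other MISC attributes are present on the same token.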

Note that the current version of the Slovenian UD Treebank does not yet comply with the universal guidelines' recommendation to split fused words, such as combinations of a preposition and a pronoun, e.g. name “on me”, zanj “for him”, vase “in/to oneself”. Instead, these tokens are currently annotated as pronouns with the feature Variant=Bound.
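Because these fused forms are kept as single tokens, they can be located by their annotation. The following is a minimal Python sketch that collects the forms of tokens tagged PRON with Variant=Bound in the FEATS column. The function name `bound_pronouns`, the helper `row` and the sample sentence are hypothetical; the sample rows are made up for illustration, and their feature values other than Variant=Bound are assumptions rather than actual treebank data.

```python
from typing import List


def bound_pronouns(conllu_text: str) -> List[str]:
    """Return the forms of tokens tagged PRON with Variant=Bound in FEATS."""
    forms = []
    for line in conllu_text.splitlines():
        # Skip blank lines and sentence-level comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        form, upos, feats = cols[1], cols[3], cols[5]
        feat_items = [] if feats == "_" else feats.split("|")
        if upos == "PRON" and "Variant=Bound" in feat_items:
            forms.append(form)
    return forms


def row(tok_id: str, form: str, upos: str, feats: str = "_", misc: str = "_") -> str:
    """Hypothetical helper: build a 10-column row, unused columns as '_'."""
    return "\t".join([tok_id, form, "_", upos, "_", feats, "_", "_", "_", misc])


# Made-up sample sentence: "Glasoval je zanj."
sample = "\n".join([
    row("1", "Glasoval", "VERB"),
    row("2", "je", "AUX"),
    row("3", "zanj", "PRON", "PronType=Prs|Variant=Bound", "SpaceAfter=No"),
    row("4", ".", "PUNCT"),
])

print(bound_pronouns(sample))  # ['zanj']
```

A list produced this way could serve as a starting point for reviewing which tokens would need to be split to follow the universal recommendation.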