This is part of archived UD v1 documentation. See http://universaldependencies.org/ for the current version.

Tokenization

Uralic languages are written in Latin or Cyrillic alphabets and most commonly use whitespace-based tokenization with the usual punctuation conventions. Multiword tokens are generally avoided; some considerations are described in the Multiword tokens section. As a general guideline, current Uralic annotations avoid tokens that contain spaces and avoid splitting a single word into multiple tokens.
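The whitespace-and-punctuation convention described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the tokenizer used by any Uralic treebank: the function name `tokenize` is hypothetical, every run of letters or digits is treated as one token, and every other non-space character is treated as its own punctuation token.

```python
import re
import unicodedata

# Assumption: a token is either a run of word characters (Latin or Cyrillic
# letters, digits, underscore) or a single non-space, non-word character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text on whitespace and separate punctuation from words."""
    # Normalize to NFC so that letters with diacritics are single code
    # points where possible and stay inside one \w+ run.
    text = unicodedata.normalize("NFC", text)
    return TOKEN_RE.findall(text)
```

Under these assumptions, `tokenize("En tiedä, onko hän kotona.")` yields seven tokens, with the comma and the full stop split off and each word, including `onko` with its enclitic, kept whole. Real treebank tokenizers handle abbreviations, numbers and hyphenated forms with more care.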

Multiword tokens

Some cases in current Uralic tokenizations can raise questions, for example compounds containing multiple content words, contraction-like forms, and some enclitic particles.

Some corner cases with compounding

Some compound forms in Uralic languages behave like contractions and can be encoded as multiple tokens per word. For example, Finnish conjunctions form such a compound with the negation verb, and Finnish FTB UD encodes these as multiple tokens.

Alternatively, the compound can be kept as a single token, as in Finnish UD.
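The difference between the two encodings can be made concrete with a small Python sketch that writes CoNLL-U-style lines. It is an illustration only: the helper `conllu_lines` is hypothetical, all columns other than ID and FORM are left as `_`, and the word `ettei` (the conjunction `että` fused with the negation verb `ei`) is chosen here for illustration rather than copied from either treebank.

```python
def conllu_lines(start_id, surface, parts, split):
    """Return CoNLL-U-style lines for one surface word.

    split=True  -> a range line (e.g. "1-2") followed by one line per part
    split=False -> a single line carrying the whole surface form
    """
    blank = ["_"] * 8  # LEMMA through MISC are left unspecified in this sketch
    if not split:
        return ["\t".join([str(start_id), surface] + blank)]
    end_id = start_id + len(parts) - 1
    # The range line carries the surface form; the parts follow with their own IDs.
    lines = ["\t".join([f"{start_id}-{end_id}", surface] + blank)]
    for offset, form in enumerate(parts):
        lines.append("\t".join([str(start_id + offset), form] + blank))
    return lines
```

Calling `conllu_lines(1, "ettei", ["että", "ei"], split=True)` gives three lines with the IDs `1-2`, `1` and `2`, while `split=False` gives one line with ID `1` and the form `ettei`. Which of the two a treebank uses is an annotation decision; the sketch only shows how each looks in the file.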

Enclitic particles

Enclitic particles are common in Uralic languages. They are not separate lexical units, but they could be treated as such in future revisions. The current advice is to encode enclitic particles as part of the token they attach to, as in Finnish UD.
