Tokenization
Uralic languages are written in Latin or Cyrillic alphabets and most commonly use whitespace-based tokenization with the usual punctuation conventions. Multiword tokens are generally avoided; some considerations are described in the Multiword tokens section. As a general guideline, current Uralic annotations avoid tokens that contain spaces and avoid splitting one word into multiple tokens.
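The guideline above can be illustrated with a minimal sketch in Python. The function name `tokenize` and the regular expression are assumptions made for illustration, not the tokenizer of any particular treebank: words are split at whitespace, punctuation marks become separate tokens, and no token can contain a space.

```python
import re

# A token is either a run of word characters (optionally joined by
# word-internal hyphens or apostrophes) or a single punctuation mark.
# In Python 3, \w matches Unicode letters, so Latin and Cyrillic text both work.
TOKEN_RE = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")

def tokenize(sentence: str) -> list[str]:
    """Hypothetical helper: split on whitespace and detach punctuation."""
    return TOKEN_RE.findall(sentence)

print(tokenize("Kissa istuu, eikö niin?"))
# ['Kissa', 'istuu', ',', 'eikö', 'niin', '?']
```

Because whitespace never matches either alternative of the pattern, every resulting token is free of spaces, which is the property the guideline asks for.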
Multiword tokens
Some cases in current Uralic tokenization can raise questions: compounds that contain multiple content words, contraction-like forms, and some enclitic particles.
Some corner cases with compounding
Some compound forms in Uralic languages behave like contractions and can be encoded as multiple tokens per word. For example, Finnish conjunctions form such compounds with the negation verb, and Finnish FTB UD encodes them as multiple tokens.
It is also possible to keep the compound as a single token, as Finnish UD does.
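To make the difference between the two encodings concrete, here is a hedged sketch that prints both in CoNLL-U layout. The helper names are hypothetical, the Finnish form "ettei" and its split into "että" + "ei" are chosen for illustration and are not copied from either treebank, and all annotation columns other than ID and FORM are left as `_`.

```python
def conllu_line(idx: str, form: str) -> str:
    """Build one 10-column CoNLL-U line with only ID and FORM filled in."""
    # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    return "\t".join([idx, form] + ["_"] * 8)

def as_multiword_token(start: int, surface: str, parts: list[str]) -> list[str]:
    """Range line for the surface form, then one line per syntactic word."""
    end = start + len(parts) - 1
    lines = [conllu_line(f"{start}-{end}", surface)]
    lines += [conllu_line(str(start + i), part) for i, part in enumerate(parts)]
    return lines

def as_single_token(idx: int, surface: str) -> list[str]:
    """The whole compound stays one token with one ID."""
    return [conllu_line(str(idx), surface)]

# Illustrative split only: conjunction + negation verb.
print("\n".join(as_multiword_token(1, "ettei", ["että", "ei"])))
print()
print("\n".join(as_single_token(1, "ettei")))
```

The multiword encoding yields a range line with ID `1-2` followed by two word lines, while the single-token encoding yields one line. Downstream tools that count words will therefore see different sentence lengths depending on which convention a treebank follows.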
Enclitic particles
Enclitic particles are common in Uralic languages. They are not treated as separate lexical units at present, but could be in future revisions. Currently the advice is to encode an enclitic particle as part of its host token, as in Finnish UD.
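As a rough sketch of this convention, the snippet below keeps the particle inside the token and records it only in the FEATS column. It assumes a `Clitic` feature of the kind used in Finnish UD; the helper name, the example word "onko" (olla + -ko), and the feature value shown are illustrative assumptions rather than a copy of treebank data.

```python
def token_line(idx: int, form: str, lemma: str, upos: str,
               feats: dict[str, str]) -> str:
    """Hypothetical helper: one CoNLL-U line with the clitic kept in FORM."""
    # CoNLL-U expects features sorted alphabetically and joined with "|".
    names = sorted(feats, key=str.lower)
    feat_str = "|".join(f"{name}={feats[name]}" for name in names) or "_"
    # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    return "\t".join([str(idx), form, lemma, upos, "_", feat_str,
                      "_", "_", "_", "_"])

# The question particle stays attached to its host; it is not a separate token.
print(token_line(1, "onko", "olla", "AUX", {"Clitic": "Ko"}))
```

The token count is unchanged by the particle, and the information that a clitic is present remains recoverable from the feature column if a later revision decides to split such forms.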