This page pertains to UD version 2.

Tokenization

The tokenization in the Swedish UD treebanks mostly follows the principles of the Stockholm-Umeå Corpus, Version 2.0 (SUC, 2006), the de facto standard for Swedish tokenization and part-of-speech tagging. The segmentation is straightforward, based on whitespace and punctuation, but the following special case deserves mention:

Swedish UD treebanks do not contain multiword tokens.
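
The absence of multiword tokens can be checked mechanically. In the CoNLL-U format, a multiword token is written on a line whose ID is a range such as `1-2`, so a file without multiword tokens has no range IDs. The following is a minimal sketch of such a check, not an official validation tool; the function name `find_multiword_tokens`, the helper `row`, and both sample inputs are made up for illustration, and the second sample uses a Spanish contraction only to show what a range line looks like.

```python
def find_multiword_tokens(conllu_text):
    """Return (line_number, ID) pairs for multiword-token range lines."""
    hits = []
    for lineno, line in enumerate(conllu_text.splitlines(), start=1):
        # Skip blank lines (sentence boundaries) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        # The ID is the first tab-separated column.
        token_id = line.split("\t", 1)[0]
        # A multiword token has a range ID such as "1-2".
        if "-" in token_id:
            hits.append((lineno, token_id))
    return hits


def row(token_id, form):
    # Hypothetical helper: builds a 10-column CoNLL-U line,
    # leaving every field except ID and FORM as "_".
    return "\t".join([token_id, form] + ["_"] * 8)


# Illustrative Swedish sentence: one line per token, no range IDs.
swedish = "\n".join([
    "# text = Hon läser boken.",
    row("1", "Hon"),
    row("2", "läser"),
    row("3", "boken"),
    row("4", "."),
])

# Illustrative non-Swedish input with a multiword token (Spanish "del").
with_range = "\n".join([
    "# text = del",
    row("1-2", "del"),
    row("1", "de"),
    row("2", "el"),
])

print(find_multiword_tokens(swedish))     # []
print(find_multiword_tokens(with_range))  # [(2, '1-2')]
```

Only the ID column is inspected, so hyphens inside word forms do not trigger a false match.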

References

The Stockholm-Umeå Corpus. Version 2.0. 2006. Stockholm University: Department of Linguistics.