Tokenization in the Swedish UD treebanks mostly follows the principles of the Stockholm-Umeå Corpus, Version 2.0 (SUC, 2006), which is the de facto standard for Swedish tokenization and part-of-speech tagging. Segmentation is straightforward and based on whitespace and punctuation, but the following special cases are worth noting:
- Numerical expressions (including dates) are treated as single words as long as they do not contain spaces, for example “1.1.1970” and “11:00”.
- Abbreviations are treated as single words even when they contain spaces, for example, “t ex” (a variant of “t.ex.”, meaning “for example”).
Swedish UD treebanks do not contain multiword tokens.
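
The rules above can be illustrated with a minimal regex-based sketch. It is not the tokenizer used to build SUC or the treebanks. The abbreviation list, the function names, and the set of separators allowed inside numerical expressions are all assumptions made for illustration, and cases such as hyphenated words are not addressed.

```python
import re

# Hypothetical abbreviation list, for illustration only; a real tokenizer
# would need a much larger lexicon.
ABBREVIATIONS = ["t ex", "t.ex.", "bl a"]


def build_token_pattern(abbreviations):
    """Compile a pattern that tries abbreviations, then numbers, words, punctuation."""
    # Longest entries first so that longer abbreviations win over shorter ones.
    ordered = sorted(abbreviations, key=len, reverse=True)
    # Allow any run of whitespace where an abbreviation contains a space.
    alternatives = [
        r"\s+".join(re.escape(part) for part in abbr.split())
        for abbr in ordered
    ]
    abbr_pattern = r"(?<!\w)(?:" + "|".join(alternatives) + r")(?!\w)"
    # Assumed separators inside space-free numerical expressions: . : , / -
    number_pattern = r"\d+(?:[.:,/-]\d+)*"
    word_pattern = r"\w+"
    punct_pattern = r"[^\w\s]"
    return re.compile(
        "|".join([abbr_pattern, number_pattern, word_pattern, punct_pattern]),
        re.IGNORECASE,
    )


TOKEN_PATTERN = build_token_pattern(ABBREVIATIONS)


def tokenize(text):
    """Return the tokens of `text`, one string per token."""
    return [match.group(0) for match in TOKEN_PATTERN.finditer(text)]


print(tokenize("Mötet börjar 11:00 den 1.1.1970, t ex i Umeå."))
# ['Mötet', 'börjar', '11:00', 'den', '1.1.1970', ',', 't ex', 'i', 'Umeå', '.']
```

Placing the abbreviation alternative first is what lets “t ex” survive as one token despite the internal space; the numerical pattern must precede the plain word pattern so that “11:00” is not cut at the colon.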
SUC. 2006. The Stockholm-Umeå Corpus, Version 2.0. Stockholm University: Department of Linguistics.