home edit page issue tracker

This page still pertains to UD version 1.

Tokenization

The low-level tokenization of the Belarusian UD treebank generally adopts the RNC standard.

Some special cases worth mentioning: * Numerical expressions including decimal numbers, such as 245, 3,14, are treated as single tokens. * Time expressions like 20:55 are splitted into separate tokens (in this case, three { 20 , : , 55 }). * Dates like 20.04.2012 are splitted into separate tokens (in this case, five { 20 , . , 04 , . , 2012 }). * Special symbols before and after numerical expressions, as in $500 , 2,67% , +27°С , are tokenised separately (so, the tokens are { $ , 500 } , { 2,67 , % } , { + , 27 , °С }). * Numerical expressions with hyphen and cyrillic endings (e.g. 1-ый “1st”, 3-м “3rd.Ins”) as well as adjectives and other non-numerals which contain digits (e.g. 79-гадовы “79 year old”, 500-годдзе “500th anniversary”) are treated as single tokens. * Other words with hyphen are treated as single tokens, except for the cases then the first part is inflected. Examples: { з-за } “because of”, { зялёна-шэрых } “green-gray”, { Санкт-Пецярбург } “St. Petersburg”, but { Ростове , - , на , - , Дону} “(in) Rostov on Don”. * Abbreviations are treated as single tokens, whitespaces split the abbreviations. * Abbreviations marked by a period, as in стр. “p. (page)”, П. “P. (for Peter)”, are treated as single tokens. If the period overlaps with the end of sentence period then it is written once as a separate token (denoting end-of-sentence), e.g. { 1914 , г , . } “year 1914”. * Abbreviations can not contain a period inside, i.e. the patterns like і т.д. “and so on”, да т.п. “and so forth” are splitted into three tokens: { i , т. , д. }, { да , т. , п. }. * Email addresses, URLs, and tweet-style names are treated as single tokens: {no@mail.ru}, {https://github.com}, {@anna_li}

The Belarusian UD treebank does not contain multiword tokens.

Indefinite pronouns and adverbs

Verb forms, analytical grammatical forms, negation

Character set

-,;:!?.’’”“”()/&#%°+0123456789aAábdDeěfFghHiIjkKlLmn№oOpPrRsStTuvVwWXyаАбБвВгГдДеЕёЁжЖзЗіІйкКлЛмМнНоОпПрРсСтТуУўфФхХцЦчЧшШыьэЭюЮяЯ