
This page still pertains to UD version 1.

Tokenization

The low-level tokenization of the Russian UD treebanks generally adopts the RNC (Russian National Corpus) standard.

Some special cases worth mentioning:

* Numerical expressions, including decimal numbers such as 245 and 3,14, are treated as single tokens.
* Time expressions like 20:55 are split into separate tokens (in this case, three: { 20 , : , 55 }).
* Dates like 20.04.2012 are split into separate tokens (in this case, five: { 20 , . , 04 , . , 2012 }).
* Special symbols before and after numerical expressions, as in $500 , 2,67% , +27°С , are tokenised separately (so the tokens are { $ , 500 } , { 2,67 , % } , { + , 27 , °С }).
* Numerical expressions with a hyphen and Cyrillic endings (e.g. 1-ый “1st”, 3-м “3rd.Ins”), as well as adjectives and other non-numerals that contain digits (e.g. 79-летний “79 year old”, 500-летие “500th anniversary”), are treated as single tokens.
* Other hyphenated words are treated as single tokens, except when the first part is inflected. Examples: { из-за } “because of”, { зелено-серых } “green-gray”, { Санкт-Петербург } “St. Petersburg”, but { Ростове , - , на , - , Дону } “(in) Rostov on Don”.
* Discourse particles such as -то are tokenised separately, e.g. Вася-то { Вася , - , то }. Exception: indefinite pronouns and adverbs, see below.
* Abbreviations are treated as single tokens; whitespace splits an abbreviation into separate tokens.
* Abbreviations marked by a period, as in стр. “p. (page)”, П. “P. (for Peter)”, are treated as single tokens. If the period coincides with the sentence-final period, it is written once as a separate token (marking the end of the sentence), e.g. { 1914 , г , . } “year 1914”.
* Abbreviations cannot contain an internal period, i.e. patterns like и т.д. “and so on” and и т.п. “and so forth” are split into three tokens: { и , т. , д. } and { и , т. , п. }.
* Email addresses, URLs, and tweet-style names are treated as single tokens: {no@mail.ru}, {https://github.com}, {@anna_li}.
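The numeric, symbol, hyphen, and address rules above can be approximated with a single ordered regular expression. The following is only a minimal sketch, not the tokenizer actually used for the treebanks: the names `TOKEN_RE` and `tokenize` are hypothetical, and the sketch deliberately ignores the cases that need lexical or morphological knowledge (inflected first parts as in Ростове-на-Дону, the particle -то, and abbreviations marked by a period).

```python
import re

# Ordered alternatives: earlier patterns win, so addresses and
# digit-containing words are tried before plain numbers and words.
# Hypothetical sketch; not part of any UD or RNC tooling.
TOKEN_RE = re.compile(
    r"""
      https?://\S+                    # URL kept as one token
    | [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # email address kept as one token
    | @\w+                            # tweet-style name
    | \d+-[А-Яа-яЁё]+                 # 1-ый, 79-летний: digits + hyphen + Cyrillic
    | \d+(?:,\d+)?                    # integer or comma decimal (3,14)
    | °[C\u0421]                      # degree sign with Latin or Cyrillic C
    | [^\W\d_]+(?:-[^\W\d_]+)*        # word, possibly hyphenated (из-за)
    | \S                              # any other symbol on its own
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> list[str]:
    """Split text into tokens following a subset of the rules listed above."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    for sample in ["20:55", "20.04.2012", "$500", "2,67%", "79-летний"]:
        print(sample, tokenize(sample))
```

Because only the comma is accepted as a decimal separator, times and dates fall apart into digits and punctuation without any dedicated rule, which matches the three-token and five-token splits described in the list.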

The Russian UD treebanks do not contain multiword tokens. (The UD_Russian-Syntagrus treebank v.1.3 and v.1.4 contained multitokens following the Syntagrus standard.)

Pronouns and adverbs

Verb forms, analytical grammatical forms, negation