This page pertains to UD version 2.

UD for Russian

Tokenization and Word Segmentation

Russian UD treebanks do not contain multiword tokens.
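
One way to check this claim against a treebank file is to scan the ID column for range IDs, which is how CoNLL-U marks multiword tokens. The sketch below is a minimal illustration, not part of any official UD tooling; the function name `find_multiword_tokens` and the hand-written `sample` sentence are assumptions made for the example.

```python
def find_multiword_tokens(conllu_text):
    """Return (line_number, token_id) for every multiword-token line."""
    hits = []
    for lineno, line in enumerate(conllu_text.splitlines(), start=1):
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        token_id = line.split("\t", 1)[0]
        # Multiword tokens carry a range ID such as "1-2";
        # empty nodes use a decimal ID such as "1.1" and are not counted.
        if "-" in token_id:
            hits.append((lineno, token_id))
    return hits


# Hand-written sample sentence (hypothetical, not taken from a treebank).
sample = (
    "# text = Он спит.\n"
    "1\tОн\tон\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tспит\tспать\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

print(find_multiword_tokens(sample))  # prints []
```

An empty result for every sentence in a file is what the statement above would lead one to expect.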



All corpora use the full range of UPOS tags. In the GSD and Taiga treebanks, the XPOS column uses a version of the Penn Treebank tagset; see https://github.com/olesar/ruUD/blob/master/conversion/RussianUD_XPOSlist.md.
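
To see which tags a given file actually uses, the UPOS and XPOS columns (the fourth and fifth CoNLL-U columns) can be tallied. This is a rough sketch only; `tag_inventory` and the `sample` sentence are hypothetical, and the sample leaves XPOS as `_` rather than guessing at real tag values.

```python
from collections import Counter


def tag_inventory(conllu_text):
    """Count UPOS and XPOS values over ordinary word lines."""
    upos, xpos = Counter(), Counter()
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Keep only well-formed word lines with a plain integer ID.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        upos[cols[3]] += 1  # UPOS column
        xpos[cols[4]] += 1  # XPOS column
    return upos, xpos


# Hand-written sample sentence (hypothetical); XPOS is left unspecified.
sample = (
    "# text = Он спит.\n"
    "1\tОн\tон\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tспит\tспать\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

upos_counts, xpos_counts = tag_inventory(sample)
print(sorted(upos_counts))
```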


Morphological features are included in all corpora. In GSD and Taiga, they are tagged manually; in SynTagRus, they are converted from the features manually tagged in the source treebank. In PUD, they are added automatically and then manually checked.

The following feature subtypes are used in Russian:

The following universal features are not used in Russian: Clusivity, Definite, Evident, NounClass, Polite.
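
The list of unused features can serve as a simple sanity check on the FEATS column (the sixth CoNLL-U column). The following is a minimal sketch under stated assumptions: `unexpected_features` is a hypothetical helper, and the `sample` sentence is hand-written for illustration rather than taken from a treebank.

```python
# Universal features that, per the list above, are not used in Russian.
UNUSED = {"Clusivity", "Definite", "Evident", "NounClass", "Polite"}


def unexpected_features(conllu_text, unused=UNUSED):
    """Return the set of listed-as-unused features that occur in FEATS."""
    found = set()
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or cols[5] == "_":
            continue
        for pair in cols[5].split("|"):
            name = pair.split("=", 1)[0]
            # Layered features look like "Number[psor]"; compare the base name.
            base = name.split("[", 1)[0]
            if base in unused:
                found.add(base)
    return found


# Hand-written sample sentence (hypothetical).
sample = (
    "1\tОн\tон\tPRON\t_\tCase=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs\t2\tnsubj\t_\t_\n"
    "2\tспит\tспать\tVERB\t_\tAspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act\t0\troot\t_\t_\n"
)

print(unexpected_features(sample))  # prints set()
```

A non-empty result would point to either an annotation error or a change in the feature inventory.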


There are four Russian UD treebanks: