UD for Belarusian
Tokenization and Word Segmentation
The low-level tokenization generally adopts the RNC standard.
- In general, tokens are delimited by whitespace. The regexp
[А-zА-яЁёУўі\-’'‘]+
usually corresponds to one token.
- An apostrophe ⟨’⟩ (occasionally ⟨'⟩ and ⟨‘⟩ in the source texts) is part of the token when used to separate the non-palatalized consonant and the iotated vowel: ⟨п’я п’е п’і п’ё п’ю⟩ /pja pjɛ pi pjɔ pju/.
- Punctuation (recognized by the corresponding Unicode property) that is conventionally written adjacent to the preceding or following word is separated during tokenization.
- Each punctuation mark is treated as a single token, e.g. the sequence )”, - becomes four tokens: ) , ” , , and - . Exceptions are conventional multi-character punctuation marks such as – , … , ?! as well as emojis and smileys: :) , ^_^, etc.
- Conventional non-Cyrillic multi-character terms are tokenized as single tokens: °С, км2.
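The punctuation rules above can be illustrated with a minimal Python sketch. It is not the tokenizer used to build the treebank: the function names, the short list of multi-character marks, and the "flanked by letters or digits" test for word-internal hyphens and apostrophes are all assumptions made for illustration. Numbers, symbols, and abbreviations are deliberately left out of scope here.

```python
import unicodedata

# Assumption: an illustrative subset of marks that are kept whole.
MULTI_CHAR_MARKS = ("^_^", "?!", ":)")
# Hyphen and apostrophe variants stay inside a word when flanked by letters/digits.
WORD_INTERNAL = "-’'‘"


def split_chunk(chunk):
    """Split one whitespace-delimited chunk into word and punctuation tokens."""
    tokens, word, i = [], "", 0
    while i < len(chunk):
        ch = chunk[i]
        multi = next((m for m in MULTI_CHAR_MARKS if chunk.startswith(m, i)), None)
        flanked = (0 < i < len(chunk) - 1
                   and chunk[i - 1].isalnum() and chunk[i + 1].isalnum())
        is_punct = unicodedata.category(ch).startswith("P")
        if multi is None and (not is_punct or (ch in WORD_INTERNAL and flanked)):
            word += ch          # ordinary character: extend the current word
            i += 1
            continue
        if word:                # a punctuation mark ends the current word
            tokens.append(word)
            word = ""
        tokens.append(multi or ch)
        i += len(multi or ch)
    if word:
        tokens.append(word)
    return tokens


def tokenize(text):
    """Whitespace split first, then punctuation split inside each chunk."""
    return [tok for chunk in text.split() for tok in split_chunk(chunk)]
```

For instance, `tokenize(')”, -')` yields the four tokens from the example above, and `tokenize('(п’ю)”,')` keeps the apostrophe inside п’ю while separating the brackets, quote, and comma.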
Some special cases worth mentioning:
- Numerical expressions including decimal numbers, such as 245, 3,14, are treated as single tokens.
- Time expressions like 20:55 are split into separate tokens (in this case, three: { 20 , : , 55 }).
- Dates like 20.04.2012 are split into separate tokens (in this case, five: { 20 , . , 04 , . , 2012 }).
- Special symbols before and after numerical expressions, as in $500 , 2,67% , +27°С , are tokenized separately (so the tokens are { $ , 500 } , { 2,67 , % } , { + , 27 , °С }).
- Numerical expressions with a hyphen and Cyrillic endings (e.g. 1-ый “1st”, 3-м “3rd.Ins”) as well as adjectives and other non-numerals which contain digits (e.g. 79-гадовы “79 year old”, 500-годдзе “500th anniversary”) are treated as single tokens.
- Other words with a hyphen are treated as single tokens, except for cases when the first part is inflected. Examples: { з-за } “because of”, { зялёна-шэрых } “green-gray”, { Санкт-Пецярбург } “St. Petersburg”, but { Ростове , - , на , - , Дону } “(in) Rostov on Don”.
- Abbreviations are treated as single tokens; whitespace inside an abbreviation splits it into separate tokens.
- Abbreviations marked by a period, as in стр. “p. (page)”, П. “P. (for Peter)”, are treated as single tokens. If the abbreviation period coincides with the sentence-final period, it is written once, as a separate token marking the end of the sentence, e.g. { 1914 , г , . } “year 1914”.
- Abbreviations cannot contain an internal period, i.e. patterns like і т.д. “and so on”, да т.п. “and so forth” are split into three tokens: { і , т. , д. }, { да , т. , п. }.
- Email addresses, URLs, and tweet-style names are treated as single tokens: {no@mail.ru}, {https://github.com}, {@anna_li}
The Belarusian UD treebank does not contain multiword tokens.
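Several of the special cases listed above (numbers, times, dates, symbols around numbers, digit-plus-ending words, e-mail addresses, URLs, and tweet-style names) can be approximated with an ordered regular-expression alternation. The following Python sketch is only an illustration under stated assumptions: the patterns and the name `tokenize_special` are hypothetical, the order of alternatives is a design choice (more specific patterns first), and rules that need lexical knowledge, such as abbreviations with periods or hyphenated names with an inflected first part, are not covered.

```python
import re

# Ordered alternatives: earlier patterns win over later ones.
TOKEN_RE = re.compile("|".join([
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",     # e-mail address (simplified)
    r"https?://\S+",                      # URL (simplified, up to next whitespace)
    r"@\w+",                              # tweet-style name
    r"°[CС]",                             # degree unit with Latin or Cyrillic letter
    r"\d+-[^\W\d_]+",                     # digits + hyphen + letters: 1-ый, 79-гадовы
    r"\d+(?:,\d+)?",                      # integer or decimal number with a comma
    r"[^\W\d_]+(?:[-’'‘][^\W\d_]+)*",     # word, optionally with hyphen or apostrophe
    r"\S",                                # any other single non-space character
]))


def tokenize_special(text):
    """Return the list of tokens found by the ordered alternation above."""
    return TOKEN_RE.findall(text)
```

With these patterns, `tokenize_special("20.04.2012")` gives five tokens and `tokenize_special("2,67%")` gives the number followed by the percent sign, in line with the examples above. Note that this sketch would keep Ростове-на-Дону as one token, so the inflected-first-part exception would still need a word list or morphological check.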
Morphology
Tags
- Belarusian uses 16 universal POS categories; the current data does not contain any occurrences of the INTJ category.
- The only auxiliary verb is the copula быць “to be”. The conditional mood marker б “would”, etymologically related to быць, is also tagged AUX.
- The pronoun (PRON) vs. determiner (DET) distinction is based on word lists because the traditional grammar does not define determiners. The DET category includes possessive (including reflexive and relative possessive), demonstrative, interrogative/relative, indefinite, negative, and universal (total) determiners that inflect for gender. The relative pronoun які is tagged either PRON or DET depending on its syntactic role.
Features
- There are five main verbal forms, distinguished by the UPOS tag and the value of the VerbForm feature:
- (De)verbal nouns are not considered part of the verb paradigm and tagged NOUN.
- Verb Aspect has two possible values: Perf and Imp.
- Tense has three possible values: Past, Pres, and Fut.
- Voice has three possible values: Act and Mid on the non-participle forms, and Act and Pass on the participle forms.
- Mood has three possible values: Ind, Imp, and Cnd; the latter is tagged in AUX or SCONJ.
- Number has two possible values: Sing and Plur.
- Gender has three possible values: Masc, Fem, and Neut. Gender is a lexical feature for NOUN, PROPN, and PRON and an inflectional agreement feature for ADJ, DET, a few NUM, and participle forms (VERB and AUX).
- Animacy has two values: Anim and Inan. Animacy is a lexical feature for NOUN, PROPN, and PRON and a grammatical feature for ADJ, DET, a few NUM, and participle forms (VERB and AUX). Only accusative forms (except for feminine singular) distinguish Animacy grammatically.
- Case has six possible values: Nom, Gen, Dat, Acc, Ins, Loc.
- Person has three possible values: 1, 2, 3. It applies to the non-past finite verb forms (VERB and AUX) as an inflectional feature and to pronouns (PRON) as a lexical feature.
- Degree has three possible values: Pos, Cmp, Sup. It applies to ADJ and ADV. Cmp also applies to the ‘second’ comparative with the prefix po-.
- Polarity has one value, Neg, and applies primarily to PART. The negative particle used with the verb is considered a separate token. However, Polarity also applies to a number of lexemes in which negation is incorporated within the verb, predicative, or adverb (VERB, ADV: няма “there is no”, нягледзячы “regardless”).
- PronType is used with pronouns (PRON) and determiners (DET).
- NumType is used with numerals (NUM).
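The value inventory above lends itself to a simple consistency check on the FEATS column of a CoNLL-U file. The Python sketch below is a hypothetical helper, not part of any UD tooling: the names `ALLOWED`, `parse_feats`, and `check_feats` are assumptions, and only the features whose values are enumerated above are checked (features such as PronType or NumType, whose values are not listed here, are skipped). It relies on the standard CoNLL-U conventions that features are written as `Name=Value` pairs joined by `|`, that several values may be joined by a comma, and that `_` marks an empty feature set.

```python
# Value inventory transcribed from the feature list above.
ALLOWED = {
    "Aspect": {"Perf", "Imp"},
    "Tense": {"Past", "Pres", "Fut"},
    "Voice": {"Act", "Mid", "Pass"},
    "Mood": {"Ind", "Imp", "Cnd"},
    "Number": {"Sing", "Plur"},
    "Gender": {"Masc", "Fem", "Neut"},
    "Animacy": {"Anim", "Inan"},
    "Case": {"Nom", "Gen", "Dat", "Acc", "Ins", "Loc"},
    "Person": {"1", "2", "3"},
    "Degree": {"Pos", "Cmp", "Sup"},
    "Polarity": {"Neg"},
}


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into {feature name: set of values}."""
    if feats == "_":
        return {}
    pairs = (pair.split("=", 1) for pair in feats.split("|"))
    return {name: set(values.split(",")) for name, values in pairs}


def check_feats(feats):
    """List Name=Value pairs whose value is outside the inventory above."""
    problems = []
    for name, values in parse_feats(feats).items():
        allowed = ALLOWED.get(name)
        if allowed is None:      # feature not enumerated above: skip it
            continue
        problems += [f"{name}={v}" for v in sorted(values - allowed)]
    return problems
```

For example, `check_feats("Case=Nom|Gender=Masc|Number=Sing")` returns an empty list, while `check_feats("Case=Voc|Number=Sing")` reports `Case=Voc`, since only six case values are listed above.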
Other Lexical Features
- Abbr applies to NOUN and PROPN.
- Foreign applies to X.
- Poss applies to possessive determiners (DET).
- Reflex applies to pronouns (PRON).
Language-Specific Features
- Variant distinguishes short and long forms of adjectives and participles, a Slavic-wide phenomenon. It has one value, Short; long forms are not labeled.
Syntax
Core Arguments, Oblique Arguments and Adjuncts
- Nominal subject (nsubj) is a noun phrase in the nominative case or in the genitive case (under negation or with a quantifier).
- The objects are divided into core objects, labeled obj or iobj, and oblique objects, labeled obl.
- In passive clauses, the nominal subject is labeled with nsubj:pass; the semantic subject in the instrumental case is labeled obl:agent.
Non-verbal Clauses
- Nominal clauses either use the copula (быць) “to be” or present simple juxtaposition without copula.
Relation Subtypes
- The following relation subtypes are used:
- acl:relcl for relative adnominal clauses;
- aux:pass for auxiliaries in passive clauses;
- nsubj:pass for nominal subjects of passive verbs;
- obl:agent for agents of passive verbs;
- nummod:gov for cardinal numbers that are attached as children of the counted noun but govern its case;
- nummod:entity for numerals that label or identify the entity named by the noun rather than count it;
- advmod:discourse for adverbs or particles that modify noun phrases and emphasize them;
- flat:foreign for flat relations in foreign multi-word expressions and named entities;
- flat:name for flat relations in names.
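As a rough illustration of how the subtype list above might be used, the following Python sketch scans CoNLL-U text and reports subtyped relations that are not in the list. The function name `unknown_subtypes` and the toy rows in the usage note are hypothetical and are not taken from the treebank; the only format facts relied on are that CoNLL-U token lines have ten tab-separated columns with DEPREL in the eighth, and that comment lines start with `#`.

```python
# Relation subtypes transcribed from the list above.
SUBTYPES = {
    "acl:relcl", "aux:pass", "nsubj:pass", "obl:agent",
    "nummod:gov", "nummod:entity", "advmod:discourse",
    "flat:foreign", "flat:name",
}


def unknown_subtypes(conllu_text):
    """Return sorted subtyped DEPREL values that are not in SUBTYPES."""
    found = set()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank and comment lines
        columns = line.split("\t")
        if len(columns) != 10:
            continue                      # not a well-formed token line
        deprel = columns[7]               # DEPREL is the eighth column
        if ":" in deprel and deprel not in SUBTYPES:
            found.add(deprel)
    return sorted(found)
```

Given toy rows such as `"1\tA\t_\t_\t_\t_\t2\tnsubj:pass\t_\t_"` and `"3\tC\t_\t_\t_\t_\t2\tobl:tmod\t_\t_"`, the function would flag only `obl:tmod`, because `nsubj:pass` appears in the list above.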
Treebanks
There is one Belarusian UD treebank.