This page pertains to UD version 2.

Tokenization

The low-level tokenization of the UD Armenian Treebanks (both Eastern and Western Armenian) generally adopts the Հայերէնի ծառադարան - ArmTDP standard:

In general, tokens are delimited by whitespace.
Punctuation (recognized by the corresponding Unicode property) that is conventionally written adjacent to the preceding or following word is separated during tokenization. Some special cases worth mentioning:
- An abbreviation marked by a period, as in թ. “year”, becomes two tokens, թ and . .
- A compound containing a hyphen becomes three tokens (two words and the hyphen), as in առասպելա-բանաստեղծական “fabulo-poetic”, or five tokens (three words and two hyphens), as in ռուսա-իրանա-սուրիական “Russian-Iranian-Syrian”. In these cases, the first token is a special form of adjective that never occurs independently.
- Compounds without a hyphen are not split, thus պաղեստինեւիսրայէլեան “Israeli-Palestinian” is one token but բիւզանդական-արեւելեան “Byzantine-Eastern” would be three tokens.
- Another common case of splitting on a hyphen is reduplicative or echo words, as in մէկիկ-մէկիկ “one by one”, միս-մինակ “all alone”.
- Inflectional bound morphemes and hyphens that follow phrases or sentences used as names in quotation marks, or that follow abbreviations marked by a period, as in «Ցեղին սիրտը»էն “from ‘The Heart of the Tribe’” or 2020 թ.-ին “in year 2020”, are split off and treated as separate tokens: { « , Ցեղին , սիրտը , » , էն } and { 2020 , թ , . , - , ին } . The word before the hyphen is the head, and the bound morpheme is attached to it with the deprel dep. Tokenizing and segmenting this way seems easier for parsing.
- Words that contain “infixed” punctuation (question, exclamation, emphasis and Armenian abbreviation marks), as in ինչպէ՞ս “How?”, are treated as multi-word tokens and become two tokens, ինչպէս and ՞ . EXCEPTION: the apostrophe, as in Ժան տ՚Արք “Joan of Arc” or կ՚ընէ. Here the word is split after the apostrophe, which stays with the preceding token: { Ժան , տ՚ , Արք } , { կ՚ , ընէ }.
- Generally, every punctuation character constitutes a token of its own. Thus »,— will become three tokens.
- EXCEPTIONS: conventional multi-character punctuation marks ( … , …. ) and emojis and smileys ( :) , ^_^ , ։Ճ etc.) are kept as single tokens. Conventional non-Armenian multi-character punctuation marks and terms, such as ?! , are also tokenized as single tokens.
Special symbols before and after numerical expressions, as in $250 , 4,81% , +32°С , are tokenized separately (so the tokens are { $ , 250 } , { 4,81 , % } , { + , 32 , °С }).
Email addresses, URLs, and tweet-style names are treated as single tokens: muster@muster.am , https://github.com , @gov_am .
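
As an illustration of the punctuation and hyphen rules above, here is a minimal Python sketch. It is not the ArmTDP tokenizer: the names `ATOMIC`, `MULTI_PUNCT`, `WORD_INTERNAL`, `split_chunk` and `tokenize` are hypothetical, and the regular expressions for emails, URLs and tweet-style names are simplifying assumptions. The sketch leaves words with infixed marks or an apostrophe unsplit, and it does not cover numerical expressions or currency and percent symbols.

```python
import re
import unicodedata

# Assumed patterns for chunks kept as single tokens (URL, email, @name).
ATOMIC = re.compile(r"^(?:https?://\S+|[\w.+-]+@[\w-]+(?:\.[\w-]+)+|@\w+)$")
# Multi-character marks kept together; longest first.
MULTI_PUNCT = ("\u2026.", "?!")
# Armenian marks written inside a word; this sketch does not split on them.
WORD_INTERNAL = set("\u055A\u055B\u055C\u055E\u055F")


def is_punct(ch):
    """True if the Unicode general category of ch is a punctuation class."""
    return ch not in WORD_INTERNAL and unicodedata.category(ch).startswith("P")


def split_chunk(chunk):
    """Split one whitespace-delimited chunk into word and punctuation tokens."""
    if ATOMIC.match(chunk):
        return [chunk]
    tokens, word, i = [], "", 0
    while i < len(chunk):
        multi = next((m for m in MULTI_PUNCT if chunk.startswith(m, i)), None)
        piece = multi if multi else (chunk[i] if is_punct(chunk[i]) else None)
        if piece is None:
            word += chunk[i]  # ordinary character: extend the current word
            i += 1
            continue
        if word:  # close the word before emitting punctuation
            tokens.append(word)
            word = ""
        tokens.append(piece)
        i += len(piece)
    if word:
        tokens.append(word)
    return tokens


def tokenize(text):
    """Whitespace split, then separate punctuation inside each chunk."""
    return [tok for chunk in text.split() for tok in split_chunk(chunk)]
```

For example, `tokenize("2020 թ.-ին")` yields the five tokens listed above for that expression, and a hyphenated compound such as առասպելա-բանաստեղծական yields two words and the hyphen.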

Some special cases worth mentioning:

Numerical expressions are treated as single words as long as they do not contain spaces or hyphens, for example, 355,089.40 . Decimal numbers (with Armenian decimal comma or English decimal point) are also kept as one token, e.g. 2.1 , 2,1 .
EXCEPTION: Time expressions and dates like 19:45 , 20.05.2000 or 20/05/2000 are split into separate tokens (three for the time, { 19 , : , 45 }, and five for each date, { 20 , . , 05 , . , 2000 } , { 20 , / , 05 , / , 2000 }).
Numerical expressions with Armenian endings, with or without a hyphen, as well as adjectives and other non-numerals that contain digits (e.g. 2րդ “2nd” , 44օրեայ “44-day” , 1700ամյա “1700-year-old” , 5նոց “in 5ths” , ՆԱՏՕ-ական “belonging-to-NATO” , ՏՈՒ-154Մ “TU-154M”) are treated as single tokens as long as they do not contain inflectional endings. Forms with inflectional endings (e.g. 96ի “of 96.Dat” , 1956ին “in 1956.Dat” , 196-ամեակի “196th anniversary.Dat”) are split into two or three tokens: { 96 , ի } , { 1956 , ին } , { 196 , - , ամեակի }.
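
The difference between plain numbers and time or date expressions can be sketched as follows. This is a heuristic illustration only: the patterns `TIME` and `DATE` and the function `tokenize_numeric` are hypothetical names, and the assumed shapes (one or two digits for hours, days and months) are not taken from the treebank tools. Splitting off inflectional endings is not attempted here, since it needs a list of endings.

```python
import re

# Assumed shapes: HH:MM or HH:MM:SS for times, and D.M.Y or D/M/Y for dates
# (the same separator must be used twice in a date).
TIME = re.compile(r"^\d{1,2}:\d{2}(?::\d{2})?$")
DATE = re.compile(r"^\d{1,2}([./])\d{1,2}\1\d{2,4}$")


def tokenize_numeric(expr):
    """Split times and dates at their separators; keep other numbers whole."""
    if TIME.match(expr) or DATE.match(expr):
        # The capturing group makes re.split keep the separators as tokens.
        return [tok for tok in re.split(r"([:./])", expr) if tok]
    return [expr]
```

With these assumptions, `tokenize_numeric("19:45")` gives three tokens and `tokenize_numeric("20.05.2000")` gives five, while 355,089.40 , 2.1 and 2,1 each stay as one token.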

Multi-word tokens

See above, the “infixed” punctuation.
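
A word with infixed punctuation can be written in CoNLL-U as a multi-word token: a range line carrying the surface form, followed by one line per syntactic word. The sketch below builds such rows for illustration. The function name `mwt_rows` and the constant `INFIXED` are hypothetical, and all annotation columns are left as underscores rather than filled with real analyses.

```python
# Armenian question, exclamation, emphasis and abbreviation marks.
INFIXED = "\u055E\u055C\u055B\u055F"
EMPTY_COLUMNS = "\t_" * 8  # the eight CoNLL-U columns after ID and FORM


def mwt_rows(surface, start_id=1):
    """Return CoNLL-U rows for a surface word, splitting off infixed marks."""
    marks = [ch for ch in surface if ch in INFIXED]
    word = "".join(ch for ch in surface if ch not in INFIXED)
    if not marks or not word:
        # Nothing to split: a single ordinary token row.
        return [f"{start_id}\t{surface}{EMPTY_COLUMNS}"]
    last_id = start_id + len(marks)
    # Range line with the original surface form, then the word, then marks.
    rows = [f"{start_id}-{last_id}\t{surface}{EMPTY_COLUMNS}"]
    rows.append(f"{start_id}\t{word}{EMPTY_COLUMNS}")
    for offset, mark in enumerate(marks, start=1):
        rows.append(f"{start_id + offset}\t{mark}{EMPTY_COLUMNS}")
    return rows
```

Calling `mwt_rows("ինչպէ՞ս")` returns a range row with ID 1-2 for the surface form, then rows for ինչպէս and ՞ .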

Pronouns and adverbs

Indefinite pronouns and adverbs like ինչ-որ, ինչ-ինչ “something, somewhat”, etc. are split like compounds containing a hyphen and become three tokens (two words and the hyphen).

Verb forms, analytical grammatical forms, negation

The forms of the indicative mood, complex tenses, analytical causatives, complex comparatives, etc. are split according to the orthographic principle: { կ՚ , ըսուի } “is said”, { պայքարած , են } “have fought”, { պէտք , է , սկսած , ըլլար } “it might have been started”, { շինել , տուեր , էին } “they had smth. built/made”, { աւելի , յաճախ } “more often”.
մի՛ and ոչ used as negation markers with verbs, adjectives, adverbs, pronouns and other words are tokenized according to the orthographic rules: { մի , ՛ , ընէք } “don’t do!”, { ոչ , հինցած } “not outdated”, { ոչ , հիմա } “not now” , { ոչ , մէկ , տեղ } “nowhere, (lit.: in no place)”.

Sentence splitting

Each sentence contains only one root. Splitting is usually performed after an end-of-sentence full stop or after a dot, ellipsis or colon when these punctuation marks separate unrelated subparts of a sentence. Items in a list may sometimes be rendered as separate sentences.