Tokenization
The low-level tokenization of the UD Armenian Treebanks (both Eastern and Western Armenian) generally adopts the Հայերէնի ծառադարան - ArmTDP standard:
- In general, tokens are delimited by whitespace.
- Punctuation (recognized by the corresponding Unicode property) that is conventionally written adjacent to the preceding or following word is separated during tokenization.
Some special cases worth mentioning:
- An abbreviation marked by a period, as in թ. “year”, becomes two tokens: { թ , . }.
- A compound containing a hyphen becomes three tokens (two words and the hyphen), as in առասպելա-բանաստեղծական “fabulo-poetic”, or five tokens (three words and two hyphens), as in ռուսա-իրանա-սուրիական “Russian-Iranian-Syrian”. In these cases, the first token is a special form of adjective that never occurs independently.
- Compounds without a hyphen are not split, thus պաղեստինեւիսրայէլեան “Israeli-Palestinian” is one token but բիւզանդական-արեւելեան “Byzantine-Eastern” would be three tokens.
- Another common case of splitting on the hyphen is reduplicative or echo words, as in մէկիկ-մէկիկ “one by one”, միս-մինակ “all alone”.
- Inflectional bound morphemes and hyphens that follow phrases or sentences used as names in quotation marks, or that follow abbreviations marked by a period, as in «Ցեղին սիրտը»էն “from ‘The Heart of the Tribe’” or 2020 թ.-ին “in year 2020”, are split off and treated as separate tokens: { « , Ցեղին , սիրտը , » , էն } and { 2020 , թ , . , - , ին } . The word before the hyphen is the head, and the bound morpheme is attached to it with the deprel dep. Tokenizing and segmenting this way seems easier for parsing.
- Words that contain “infixed” punctuation (question, exclamation, emphasis and Armenian abbreviation marks), as in ինչպէ՞ս “How?”, are considered multi-word tokens and become two tokens, ինչպէս and ՞ . EXCEPTION: the apostrophe, as in Ժանն տ՚Արք “Joan of Arc” or կ՚ընէ, stays attached to the preceding word, and the split is made after it: { Ժան , տ՚ , Արք } , { կ՚ , ընէ }.
- Generally, every punctuation character constitutes a token of its own. Thus »,— will become three tokens.
- EXCEPTIONs are conventional multi-character punctuation marks: … , …. , and emojis and smileys: :) , ^_^ , ։Ճ etc. Conventional non-Armenian multi-character punctuation marks and terms are tokenized as single tokens: ?! .
- Special symbols before and after numerical expressions, as in $250 , 4,81% , +32°С , are tokenized separately (so the tokens are { $ , 250 } , { 4,81 , % } , { + , 32 , °С }).
- Email addresses, URLs, and tweet-style names are treated as single tokens: muster@muster.am , https://github.com , @gov_am .
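A few of the rules above can be sketched in Python. This is only an illustrative sketch, not the tokenizer used for the treebanks: the names `split_word` and `tokenize` are hypothetical, the code points chosen for the Armenian marks are an assumption, and the exceptions for multi-character punctuation, emojis, symbols around numbers, e-mail addresses and URLs are deliberately left out.

```python
import unicodedata

# Assumed code points: Armenian emphasis, exclamation, question and
# abbreviation marks, which may appear "infixed" inside a word.
INFIXED_MARKS = {"\u055B", "\u055C", "\u055E", "\u055F"}
# Assumed code point: Armenian apostrophe, kept with the preceding letters.
APOSTROPHE = "\u055A"


def split_word(word):
    """Split one whitespace-delimited word into tokens (simplified sketch)."""
    tokens, current, pending = [], "", []

    def flush():
        # Emit the word collected so far, then any infixed marks pulled out of it.
        nonlocal current
        if current:
            tokens.append(current)
            current = ""
        tokens.extend(pending)
        pending.clear()

    for ch in word:
        if ch in INFIXED_MARKS:
            pending.append(ch)          # removed from the word, emitted after it
        elif ch == APOSTROPHE:
            current += ch               # apostrophe belongs to the preceding part
            flush()
        elif unicodedata.category(ch).startswith("P"):
            flush()
            tokens.append(ch)           # every other punctuation character is a token
        else:
            current += ch
    flush()
    return tokens


def tokenize(text):
    """Whitespace split first, then word-internal splitting."""
    return [tok for word in text.split() for tok in split_word(word)]
```

For instance, `tokenize("2020 թ.-ին")` goes through the abbreviation, hyphen and bound-morpheme rules in a single pass. Because numbers with separators, such as 4,81 , would be split by this simplified version, a real tokenizer would need to recognize the exceptional patterns before falling back to per-character splitting.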
Special cases involving numerical expressions:
- Numerical expressions are treated as single words as long as they do not contain spaces or hyphens, for example, 355,089.40 . Decimal numbers (with Armenian decimal comma or English decimal point) are also kept as one token, e.g. 2.1 , 2,1 .
- EXCEPTION: Time expressions and dates like 19:45 , 20.05.2000 or 20/05/2000 are split into separate tokens (in this case, three { 19 , : , 45 } and five { 20 , . , 05 , . , 2000 } , { 20 , / , 05 , / , 2000 }).
- Numerical expressions with or without a hyphen and Armenian endings, as well as adjectives and other non-numerals that contain digits (e.g. 2րդ “2nd” , 44օրեայ “44-day” , 1700ամյա “1700-year-old” , 5նոց “in 5ths” , ՆԱՏՕ-ական “belonging-to-NATO” , ՏՈՒ-154Մ “TU-154M”), are treated as single tokens as long as they do not contain inflectional endings. Forms with inflectional endings (e.g. 96ի “of 96.Dat” , 1956ին “in 1956.Dat” , 196-ամեակի “196th anniversary.Dat”) are split into separate tokens, in this case two or three: { 96 , ի } , { 1956 , ին } , { 196 , - , ամեակի }.
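The contrast between plain numbers (kept whole) and time or date expressions (split) can be sketched with two regular expressions. The function name `split_numeric` and the exact patterns are assumptions made for illustration. Deciding whether an Armenian ending is inflectional requires lexical knowledge and is not attempted here.

```python
import re

# Assumed shapes: HH:MM for times; D.M.Y or D/M/Y (same separator twice) for dates.
TIME = re.compile(r"\d{1,2}:\d{2}")
DATE = re.compile(r"\d{1,2}([./])\d{1,2}\1\d{2,4}")


def split_numeric(token):
    """Split time and date expressions; keep other numerical tokens whole."""
    if TIME.fullmatch(token) or DATE.fullmatch(token):
        # The capturing group makes re.split keep the separators as tokens.
        return [part for part in re.split(r"([:./])", token) if part]
    return [token]
```

Under these patterns, 2.1 , 2,1 and 355,089.40 are returned unchanged, since none of them has the two-separator date shape, while 19:45 and 20.05.2000 are broken into three and five tokens.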
Multi-word tokens
See above, the “infixed” punctuation.
Pronouns and adverbs
- Indefinite pronouns and adverbs like ինչ-որ, ինչ-ինչ “something, somewhat”, etc. are split like compounds containing a hyphen and become three tokens (two words and the hyphen).
Verb forms, analytical grammatical forms, negation
- The forms of the indicative mood, complex tenses, the analytical causative, complex comparatives, etc. are split according to the orthographic principle: { կ՚ , ըսուի } “is said”, { պայքարած , են } “have fought”, { պէտք , է , սկսած , ըլլար } “it might have been started”, { շինել , տուեր , էին } “they had smth. built/made”, { աւելի , յաճախ } “more often”.
- մի՛ and ոչ used as negation markers with verbs, adjectives, adverbs, pronouns and other words are tokenized according to the orthographic rules: { մի , ՛ , ընէք } “don’t do!”, { ոչ , հինցած } “not outdated”, { ոչ , հիմա } “not now” , { ոչ , մէկ , տեղ } “nowhere, (lit.: in no place)”.
Sentence splitting
Each sentence contains only one root. Splitting is usually performed after an end-of-sentence full stop or after a dot, ellipsis or colon when these punctuation marks separate unrelated subparts of a sentence. Items in a list may sometimes be rendered as separate sentences.