This page pertains to UD version 2.

Tokenization

The low-level tokenization of the UD Armenian Treebanks (both Eastern and Western Armenian) generally adopts the Հայերէնի ծառադարան - ArmTDP standard:

Some special cases worth mentioning:

Multi-word tokens

See above, the “infixed” punctuation.

Pronouns and adverbs

Verb forms, analytical grammatical forms, negation

Sentence splitting

Each sentence has exactly one root. Splitting is usually performed after an end-of-sentence full stop, or after a dot, ellipsis or colon when one of these punctuation marks separates unrelated subparts of a sentence. Items in a list may sometimes be rendered as separate sentences.
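
The splitting rule can be illustrated with a rough sketch. This is not the treebank's actual tooling: the function names `split_sentences` and `candidate_split_points` are hypothetical, and the sketch assumes that the end-of-sentence full stop is the Armenian verjaket (U+0589) followed by whitespace. The dot, ellipsis and colon are only reported as candidate positions, because deciding whether they separate unrelated subparts requires human or syntactic judgment.

```python
import re

# Assumption: the end-of-sentence full stop is the Armenian verjaket, U+0589.
FULL_STOP = "\u0589"

# Assumption: these marks end a sentence only sometimes, so they are flagged,
# not split on automatically (dot, horizontal ellipsis, ASCII colon).
CANDIDATE_MARKS = {".", "\u2026", ":"}


def split_sentences(text):
    """Split after every full stop that is followed by whitespace."""
    parts = re.split(f"(?<={FULL_STOP})\\s+", text)
    return [part.strip() for part in parts if part.strip()]


def candidate_split_points(sentence):
    """Return character offsets of marks that may need a manual split."""
    return [i for i, ch in enumerate(sentence) if ch in CANDIDATE_MARKS]


if __name__ == "__main__":
    sample = "Ա բ\u0589 Գ դ. ե զ\u0589"
    for sentence in split_sentences(sample):
        print(sentence, candidate_split_points(sentence))
```

Keeping the unconditional split separate from the candidate list mirrors the wording of the guideline: the full stop always closes a sentence, while the other marks do so only when the material on either side is unrelated.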