This page pertains to UD version 2.

Introduction

UD currently contains one treebank for Eastern Armenian:

UD Armenian-ArmTDP

GitHub repository

UD_Armenian-ArmTDP is based on the ՀայՇտեմ - ArmTDP-East dataset (version 1.0), a mix of random sentences sampled from different sources and representing different genres and domains (local online newspaper and journal articles, contemporary fiction). The dataset is released in several formats and was originally annotated by a team led by Marat M. Yavrumyan at Yerevan State University. The annotation scheme was developed in accordance with the UD guidelines. All data were checked manually. Tokenization and POS tagging were carried out through alternating steps of automatic scripting and manual revision at the YerevaNN research lab (led by Hrant H. Khachatrian).

UD_Armenian comprises 564 sentences and 12213 tokens. Documentation is provided by Marat M. Yavrumyan and Anna S. Danielyan.

The first preliminary release was issued in April 2018, for the CoNLL-2018 shared task.

Source of annotations

This table summarizes the origins and checking of the various columns of the CoNLL-U data.

Column Status
ID Sentence segmentation and tokenization were done automatically using the ՀայՆիշ-ArmTDP tokenizer. Additional changes (splitting and merging) were made manually during annotation.
FORM  
LEMMA Manual selection from possibilities provided by morphological analysis using Eastern Armenian lexicons: two annotators and then an arbiter.
UPOSTAG Manual selection from possibilities provided by morphological analysis: two annotators and then an arbiter.
XPOSTAG _ (currently unused)
FEATS Generated automatically from UPOSTAG, and then hand-corrected.
HEAD Original UD annotation is manual, done by two independent annotators and then an arbiter.
DEPREL Original UD annotation is manual, done by two independent annotators and then an arbiter.
DEPS _ (currently unused)
MISC Information about token spacing.
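
To make the column layout above concrete, here is a minimal Python sketch that splits CoNLL-U text into sentences and maps each token line onto the ten columns named in the table. It is not part of the treebank tooling: the function name `parse_conllu` and the `SAMPLE` string are hypothetical, the sample is an invented English fragment rather than treebank data, and the `SpaceAfter=No` value in MISC is assumed from the general UD convention for recording token spacing.

```python
# Ten CoNLL-U columns, in file order, named as in the table above.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of column dicts.

    Hypothetical helper for illustration only.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Sentence-level comment lines (e.g. "# text = ...") are skipped.
            continue
        else:
            fields = line.split("\t")
            if len(fields) != len(COLUMNS):
                raise ValueError(f"expected 10 tab-separated fields: {line!r}")
            current.append(dict(zip(COLUMNS, fields)))
    if current:
        # Handle a final sentence that is not followed by a blank line.
        sentences.append(current)
    return sentences


# Invented two-token sample (not taken from the treebank).
SAMPLE = (
    "# text = Hello!\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "2\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for token in sentence:
            print(token["ID"], token["FORM"], token["UPOSTAG"],
                  token["HEAD"], token["DEPREL"], token["MISC"])
```

All values are kept as strings, so unused columns such as XPOSTAG and DEPS simply appear as `_`, exactly as they are stored in the file.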

Acknowledgments