This is part of archived UD v1 documentation. See http://universaldependencies.org/ for the current version.

Introduction

The UD Tamil treebank is based on the Tamil Dependency Treebank (TamilTB), created at Charles University in Prague by Loganathan Ramasamy. The treebank was part of HamleDT, a collection of treebanks converted to the Prague dependency style (since 2011). Later versions of HamleDT added conversions to Stanford dependencies (2014) and to Universal Dependencies (HamleDT 3.0, 2015). The first release of Universal Dependencies to include this treebank was UD v1.2, in November 2015. The UD data is essentially the HamleDT conversion, but it is not identical to HamleDT 3.0 because the conversion procedure has since been improved.

Source of annotations

This table summarizes the origins and checking of the various columns of the CoNLL-U data.

Column	Status
ID	Sentence segmentation and tokenization (including cutting off certain suffixes that constitute independent syntactic words) were done automatically and then hand-corrected.
FORM	Identical to TamilTB form.
LEMMA	Gold (preprocessed and then manually corrected).
UPOSTAG	Converted automatically from XPOSTAG (via Interset).
XPOSTAG	Gold (preprocessed and then manually corrected).
FEATS	Converted automatically from XPOSTAG (via Interset).
HEAD	Original TamilTB annotation is manual (preprocessed by a rule-based parser and then manually corrected). Automatic conversion to UD; human checking of patterns revealed by automatic consistency tests.
DEPREL	Original TamilTB annotation is manual (preprocessed by a rule-based parser and then manually corrected). Automatic conversion to UD; human checking of patterns revealed by automatic consistency tests.
DEPS	— (currently unused)
MISC	Information about token spacing restored using heuristics. Mapping between multi-word tokens and syntactic words verified against the source text.
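
For readers who want to inspect these columns programmatically, the following is a minimal sketch of reading CoNLL-U text into the ten columns listed above. It is not part of the treebank tooling. The function names `parse_conllu` and `is_multiword_token` and the sample sentence are hypothetical and were invented for illustration; the sample is not taken from the Tamil data. The sketch assumes the standard CoNLL-U layout: tab-separated fields, comment lines starting with `#`, blank lines between sentences, and ID ranges such as `1-2` for multi-word tokens.

```python
from typing import Dict, List

# Column names as used in UD v1 (the table above).
COLUMNS = ["ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

# Invented placeholder sentence, NOT real treebank data.
SAMPLE = (
    "# sent_id = hypothetical-1\n"
    "1-2\tab\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\ta\ta\tNOUN\tXX\t_\t2\tnsubj\t_\t_\n"
    "2\tb\tb\tVERB\tYY\t_\t0\troot\t_\t_\n"
    "\n"
)


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of column dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comment; skipped in this sketch.
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        # Handle input that does not end with a blank line.
        sentences.append(current)
    return sentences


def is_multiword_token(row: Dict[str, str]) -> bool:
    """Multi-word token lines carry an ID range such as '1-2'."""
    return "-" in row["ID"]


if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for row in sentence:
            kind = "multi-word token" if is_multiword_token(row) else "syntactic word"
            print(row["ID"], row["FORM"], row["DEPS"], kind)
```

In this sketch a multi-word token line is kept as an ordinary row and flagged by its ID range, so the mapping between surface tokens and syntactic words stays visible. The DEPS field is simply read as the underscore placeholder, matching its unused status in this treebank.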

References

Loganathan Ramasamy, Zdeněk Žabokrtský. 2012. Prague Dependency Style Treebank for Tamil. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), İstanbul, Turkey, ISBN 978-2-9517408-7-7, pp. 1888–1894.
