
This page still pertains to UD version 1.

Introduction

The UD Tamil treebank is based on the Tamil Dependency Treebank (TamilTB), created at Charles University in Prague by Loganathan Ramasamy. The treebank was part of HamleDT, a collection of treebanks converted to the Prague dependency style (since 2011). Later versions of HamleDT added conversions to Stanford dependencies (2014) and to Universal Dependencies (HamleDT 3.0, 2015). The first release of Universal Dependencies to include this treebank was UD v1.2, in November 2015. The UD data is essentially the HamleDT conversion, but it is not identical to HamleDT 3.0 because the conversion procedure has since been improved further.

Source of annotations

This table summarizes the origins and checking of the various columns of the CoNLL-U data.

| Column | Status |
|---|---|
| ID | Sentence segmentation and tokenization (including splitting off certain suffixes that constitute independent syntactic words) were done automatically and then hand-corrected. |
| FORM | Identical to TamilTB form. |
| LEMMA | Gold (preprocessed and then manually corrected). |
| UPOSTAG | Converted automatically from XPOSTAG (via Interset). |
| XPOSTAG | Gold (preprocessed and then manually corrected). |
| FEATS | Converted automatically from XPOSTAG (via Interset). |
| HEAD | Original TamilTB annotation is manual (preprocessed by a rule-based parser and then manually corrected). Automatic conversion to UD; human checking of patterns revealed by automatic consistency tests. |
| DEPREL | Original TamilTB annotation is manual (preprocessed by a rule-based parser and then manually corrected). Automatic conversion to UD; human checking of patterns revealed by automatic consistency tests. |
| DEPS | Currently unused. |
| MISC | Information about token spacing restored using heuristics. Mapping between multi-word tokens and syntactic words verified against the source text. |
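
To make the column layout concrete, the following is a minimal Python sketch of reading the ten tab-separated CoNLL-U columns and rebuilding the surface text from FORM, the multi-word token ranges, and the SpaceAfter=No attribute in MISC. The function names parse_conllu and surface_text are hypothetical, and the sample sentence is an invented placeholder, not data from the treebank. The sketch assumes UD v1 data, where IDs are either integers or ranges such as 1-2.

```python
COLUMNS = ("ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of column dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():            # a blank line closes a sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):      # sentence-level comment, skipped here
            continue
        else:
            fields = line.split("\t")
            if len(fields) != len(COLUMNS):
                raise ValueError(f"expected 10 columns, got {len(fields)}")
            current.append(dict(zip(COLUMNS, fields)))
    if current:
        sentences.append(current)
    return sentences


def surface_text(sentence):
    """Rebuild the surface string from FORM and SpaceAfter=No in MISC."""
    pieces, skip_until = [], 0
    for row in sentence:
        if "-" in row["ID"]:
            # Multi-word token line (e.g. 1-2): use its FORM and remember
            # which syntactic words it covers so they are not emitted again.
            skip_until = int(row["ID"].split("-")[1])
        elif int(row["ID"]) <= skip_until:
            continue                    # syntactic word inside the range
        pieces.append(row["FORM"])
        if "SpaceAfter=No" not in row["MISC"].split("|"):
            pieces.append(" ")
    return "".join(pieces).strip()


# Invented placeholder sentence (NOT from the treebank): the surface token
# "ab" is split into the syntactic words "a" and "b".
SAMPLE = "# sent_id = toy-1\n" + "\n".join("\t".join(r) for r in [
    ("1-2", "ab", "_", "_", "_", "_", "_", "_", "_", "SpaceAfter=No"),
    ("1", "a", "a", "NOUN", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "b", "b", "VERB", "_", "_", "0", "root", "_", "_"),
    ("3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
]) + "\n"

sentences = parse_conllu(SAMPLE)
print(surface_text(sentences[0]))
```

Keeping every field as a string mirrors the file format directly; a real pipeline would likely convert HEAD to an integer and split FEATS into attribute-value pairs.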

References