Introduction
The UD Tamil treebank is based on the Tamil Dependency Treebank (TamilTB), created at Charles University in Prague by Loganathan Ramasamy. The treebank has been part of HamleDT, a collection of treebanks converted to the Prague dependency style, since 2011. Later versions of HamleDT added conversions to Stanford dependencies (2014) and to Universal Dependencies (HamleDT 3.0, 2015). The first release of Universal Dependencies to include this treebank was UD v1.2, in November 2015. The UD data is essentially the HamleDT conversion, but it is not identical to HamleDT 3.0 because the conversion procedure has been improved further.
Source of annotations
This table summarizes the origins and checking of the various columns of the CoNLL-U data.
Column | Status |
---|---|
ID | Sentence segmentation and tokenization (including splitting off certain suffixes that constitute independent syntactic words) were done automatically and then hand-corrected. |
FORM | Identical to TamilTB form. |
LEMMA | Gold (preprocessed and then manually corrected). |
UPOSTAG | Converted automatically from XPOSTAG (via Interset). |
XPOSTAG | Gold (preprocessed and then manually corrected). |
FEATS | Converted automatically from XPOSTAG (via Interset). |
HEAD | Original TamilTB annotation is manual (preprocessed by a rule-based parser and then manually corrected). Automatic conversion to UD; human checking of patterns revealed by automatic consistency tests. |
DEPREL | Original TamilTB annotation is manual (preprocessed by a rule-based parser and then manually corrected). Automatic conversion to UD; human checking of patterns revealed by automatic consistency tests. |
DEPS | — (currently unused) |
MISC | Information about token spacing restored using heuristics. Mapping between multi-word tokens and syntactic words verified against the source text. |
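
The ten columns summarized above can be read with a few lines of Python. The following is a minimal sketch, not part of the treebank tooling: the helper names (`parse_token_line`, `read_sentence`, `detokenize`) and the sample sentence are invented for illustration and are not taken from the treebank. It assumes that token spacing in MISC is recorded with the standard CoNLL-U `SpaceAfter=No` attribute and that multi-word tokens appear as ID ranges such as `1-2`.

```python
from typing import Dict, List

# Column names as used in the table above (CoNLL-U field order).
COLUMNS = ["ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_token_line(line: str) -> Dict[str, str]:
    """Split one tab-separated CoNLL-U line into a column -> value dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


def read_sentence(block: str) -> List[Dict[str, str]]:
    """Parse one sentence block, skipping comment and blank lines."""
    return [parse_token_line(line)
            for line in block.splitlines()
            if line.strip() and not line.startswith("#")]


def detokenize(tokens: List[Dict[str, str]]) -> str:
    """Rebuild the surface text from FORM and MISC.

    Multi-word token ranges (ID like "1-2") supply the surface form;
    the syntactic words they cover are skipped.
    """
    pieces: List[str] = []
    skip_until = 0  # last word ID covered by the current multi-word token
    for tok in tokens:
        tid = tok["ID"]
        if "-" in tid:
            skip_until = int(tid.split("-")[1])
        elif "." in tid:
            continue  # empty node, not part of the surface text
        elif int(tid) <= skip_until:
            continue  # syntactic word inside a multi-word token
        pieces.append(tok["FORM"])
        if "SpaceAfter=No" not in tok["MISC"].split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()


# Invented placeholder sentence (not real treebank data).
SAMPLE = "\n".join([
    "# text = abcd ef.",
    "1-2\tabcd\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tab\tab\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "2\tcd\tcd\tPART\t_\t_\t1\tdep\t_\t_",
    "3\tef\tef\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
])

if __name__ == "__main__":
    sentence = read_sentence(SAMPLE)
    print(detokenize(sentence))
```

Keeping each token as a plain dictionary keyed by column name makes it straightforward to inspect a single column, for example to compare UPOSTAG against XPOSTAG when checking the automatic conversion.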
Links
- TamilTB
- HamleDT
- Treex is the software used for conversion
- Interset was used to convert POS tags and features
References
- Loganathan Ramasamy, Zdeněk Žabokrtský. 2012. Prague Dependency Style Treebank for Tamil. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), İstanbul, Turkey, ISBN 978-2-9517408-7-7, pp. 1888–1894.