
This page pertains to UD version 2.

UD for Tamil

Preprocessing

Before UD annotation, all texts should be normalised and checked for ambiguous characters. The aim is not to correct the language, but to ensure consistent text representation.

Unicode Normalisation

Vowel modifier/issue | Example | Normalised form | Possible non-normalised form
ொ / o | கொ | க + ொ | க + ெ + ா
ோ / ō | கோ | க + ோ | க + ே + ா
ௌ / au | கௌ | க + ௌ | க + ெ + ௗ
Independent vowel au ஒள
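
The three vowel-sign rows above correspond to canonical compositions in Unicode, so one possible way to apply them is NFC normalisation. The following is a minimal sketch, not a prescribed UD tool: the function name `normalise_tamil` is hypothetical, and the code points are written as escapes so the two-part and one-part forms stay distinguishable. It does not address the independent vowel au row.

```python
import unicodedata


def normalise_tamil(text: str) -> str:
    """Compose split vowel signs into their single-code-point forms (NFC).

    Assumption: NFC is an acceptable target form for the corpus.
    """
    return unicodedata.normalize("NFC", text)


# ka + sign e + sign aa  ->  ka + sign o
split_o = "\u0b95\u0bc6\u0bbe"
# ka + sign ee + sign aa ->  ka + sign oo
split_oo = "\u0b95\u0bc7\u0bbe"
# ka + sign e + au length mark -> ka + sign au
split_au = "\u0b95\u0bc6\u0bd7"

for raw in (split_o, split_oo, split_au):
    fixed = normalise_tamil(raw)
    # Each three-code-point sequence becomes a two-code-point sequence.
    print(len(raw), "->", len(fixed))
```

Text that is already in the normalised form passes through unchanged, so the function can be applied to every sentence before annotation.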

Ambiguous Characters

Ambiguous character/Tamil numeral | Correct character | Example correction
௨ | உ | ௨லகம் → உலகம்
௭ | எ | ௭ன்று → என்று
௮ | அ | ௮வன் → அவன்
௧ | க | ௧லம் → கலம்
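
A check for these lookalikes could be sketched as follows. The heuristic is an assumption, not part of the guidelines: a Tamil numeral is treated as a typo only when a Tamil letter or sign follows it directly, so standalone numerals are left alone. The names `LOOKALIKES` and `fix_ambiguous` are hypothetical, and any automatic replacement should still be reviewed by hand.

```python
import re

# Numeral -> intended letter, taken from the example corrections above.
LOOKALIKES = {
    "\u0be8": "\u0b89",  # Tamil digit two   -> letter u
    "\u0bed": "\u0b8e",  # Tamil digit seven -> letter e
    "\u0bee": "\u0b85",  # Tamil digit eight -> letter a
    "\u0be7": "\u0b95",  # Tamil digit one   -> letter ka
}

# Assumed heuristic: match a lookalike numeral only if the next character
# is a Tamil letter, vowel sign, or virama (U+0B82..U+0BD7, which excludes digits).
_PATTERN = re.compile("([\u0be7\u0be8\u0bed\u0bee])(?=[\u0b82-\u0bd7])")


def fix_ambiguous(text: str) -> str:
    """Replace lookalike numerals that appear to start or sit inside a word."""
    return _PATTERN.sub(lambda m: LOOKALIKES[m.group(1)], text)


# First example from the table: digit two + "lakam" becomes "ulakam".
wrong = "\u0be8\u0bb2\u0b95\u0bae\u0bcd"
right = "\u0b89\u0bb2\u0b95\u0bae\u0bcd"
print(fix_ambiguous(wrong) == right)
```

Restricting the replacement to numerals followed by a letter keeps genuine numerals intact, at the cost of missing a lookalike at the very end of a word.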

General Rules

Tokenization and Word Segmentation

Morphology

Tags

Features

Syntax

Tamil uses four relation subtypes:

References


Treebanks

There are two Tamil UD treebanks at present: