UD for Tamil

Preprocessing
Before UD annotation, all texts should be normalised and checked for ambiguous characters. The aim is not to correct the language, but to ensure consistent text representation.
Unicode Normalisation
- Apply Unicode normalisation before tokenisation and annotation.
- Tamil vowel modifiers may be encoded inconsistently, especially the e-series, o-series, and au forms.
- Use one consistent Unicode form throughout the corpus, preferably NFC.
| Vowel modifier/issue | Example | Normalised form | Possible non-normalised form |
|---|---|---|---|
| ொ / o | கொ | க + ொ | க + ெ + ா |
| ோ / ō | கோ | க + ோ | க + ே + ா |
| ௌ / au | கௌ | க + ௌ | க + ெ + ௗ |
| Independent vowel au | ஔ | ஔ | ஒள |
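
The normalisation step above can be sketched with Python's standard `unicodedata` module. The helper name `normalise` and the `PAIRS` list are illustrative assumptions, not part of any treebank tooling. The code points are written as escapes so the decomposed and precomposed forms stay distinguishable in the source.

```python
import unicodedata

def normalise(text):
    """Return `text` in NFC, the form recommended above."""
    return unicodedata.normalize("NFC", text)

KA = "\u0b95"  # க

# Decomposed sequences from the table, paired with the precomposed form NFC yields.
PAIRS = [
    (KA + "\u0bc6\u0bbe", KA + "\u0bca"),  # க + ெ + ா -> கொ
    (KA + "\u0bc7\u0bbe", KA + "\u0bcb"),  # க + ே + ா -> கோ
    (KA + "\u0bc6\u0bd7", KA + "\u0bcc"),  # க + ெ + ௗ -> கௌ
    ("\u0b92\u0bd7", "\u0b94"),            # ஒ + ௗ (au length mark) -> ஔ
]

for decomposed, composed in PAIRS:
    assert normalise(decomposed) == composed

# ஒ followed by the consonant ள (U+0BB3) only looks like ஔ; NFC leaves it unchanged.
lookalike = "\u0b92\u0bb3"
assert normalise(lookalike) == lookalike
```

Note that NFC only merges canonically equivalent sequences. The sequence ஒ + ள also occurs in ordinary words, so any repair of a visually similar ஔ probably needs manual review rather than blind replacement.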
Ambiguous Characters
- Replace visually similar or wrongly encoded characters with the intended Tamil character.
- This is especially important in OCR text and old digitally converted texts.
- All replacements should be documented.
| Ambiguous character/Tamil numeral | Correct character | Example correction |
|---|---|---|
| ௨ | உ | ௨லகம் → உலகம் |
| ௭ | எ | ௭ன்று → என்று |
| ௮ | அ | ௮வன் → அவன் |
| ௧ | க | ௧லம் → கலம் |
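
A minimal sketch of such a replacement pass, which also records every change so that it can be documented. The function name `fix_lookalikes` and the context heuristic are assumptions: a numeral is treated as a mis-encoded letter only when it is directly followed by a Tamil letter, vowel sign, or pulli and is not part of a run of numerals. Real OCR output may need a stricter or looser rule.

```python
import re

# Mapping taken from the table above: Tamil numeral -> intended letter.
LOOKALIKES = {
    "\u0be8": "\u0b89",  # ௨ (digit two)   -> உ
    "\u0bed": "\u0b8e",  # ௭ (digit seven) -> எ
    "\u0bee": "\u0b85",  # ௮ (digit eight) -> அ
    "\u0be7": "\u0b95",  # ௧ (digit one)   -> க
}

# Heuristic (an assumption): not preceded by a Tamil numeral (U+0BE6-U+0BF2),
# and immediately followed by a Tamil letter, vowel sign or pulli (U+0B82-U+0BCD).
PATTERN = re.compile(
    r"(?<![\u0be6-\u0bf2])([\u0be7\u0be8\u0bed\u0bee])(?=[\u0b82-\u0bcd])"
)

def fix_lookalikes(text):
    """Return the corrected text and a log of (offset, old, new) replacements."""
    log = []

    def repl(match):
        old = match.group(1)
        new = LOOKALIKES[old]
        log.append((match.start(), old, new))
        return new

    return PATTERN.sub(repl, text), log

corrected, changes = fix_lookalikes("\u0be8\u0bb2\u0b95\u0bae\u0bcd")  # ௨லகம்
print(corrected, changes)
```

The returned log can be written to a side file next to the corpus, which keeps the original text recoverable.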
General Rules
- Do not change the linguistic content.
- Do not modernise spelling unless the project requires it.
Tokenization and Word Segmentation
- As in most tokenization schemes, words are delimited by whitespace or punctuation.
- Multiword tokens are relatively common in Tamil. For example, the coordinating clitic -உம் / -um is analyzed as a separate syntactic word.
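
To illustrate how a multiword token is written in CoNLL-U, the sketch below emits the range line followed by one line per syntactic word. The helper `mwt_lines` is hypothetical, and the example word அவனும் (அவன் + உம்) is chosen for illustration rather than taken from a treebank. The syntactic words are passed in explicitly because the surface form is not a plain concatenation of its parts.

```python
def mwt_lines(start_id, surface, parts):
    """Return CoNLL-U lines for one multiword token split into `parts`."""
    end_id = start_id + len(parts) - 1
    blank = ["_"] * 8  # LEMMA .. MISC left unspecified in this sketch
    # Range line: ID is "start-end", FORM is the surface token.
    lines = ["\t".join([f"{start_id}-{end_id}", surface] + blank)]
    # One line per syntactic word, numbered consecutively.
    for offset, form in enumerate(parts):
        lines.append("\t".join([str(start_id + offset), form] + blank))
    return lines

for line in mwt_lines(1, "அவனும்", ["அவன்", "உம்"]):
    print(line)
```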
Morphology
Tags
- Tamil uses 14 universal tags (SCONJ, INTJ, and SYM do not occur in the corpus at present).
- Auxiliary verbs (AUX) include:
  - போ / po “go” for future tense, follows the infinitive of the main verb
  - மாட்டேன் / māṭṭen “will not” (lemma மாட்டு māṭṭu) for negative future tense with human subject
  - படு / paṭu “experience” for the passive voice
  - வை / vai “put” for the causative voice
  - இல் / il (இல்லை / illai) “not be” for negation
  - உள் / uḷ “within”, இரு / iru “be”, வரு / varu “come”, கொள் / kòḷ “take”, செய் / cèy “do”, விடு / viṭu “let”, வா / vā “come”
  - வேண்டு / veṇṭu “must”
  - முடியும் / muṭiyum “can” (lemma முடி muṭi): modal auxiliary, follows the infinitive of the main verb
Features
- 7 cases are annotated as morphological features of nouns: nominative, genitive, dative, accusative, instrumental, comitative, locative. Tamil is an agglutinating language and other spatiotemporal and/or case-like morphemes may be analyzed as postpositions.
- Verbs occur as finite forms, participles, infinitives, and gerunds.
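
As a rough illustration, the seven cases correspond to the following values of the universal `Case` feature. The helper `feats` is a hypothetical convenience for building a CoNLL-U FEATS string, and the `Number=Sing` value in the usage line is an assumption added only to show the sorting.

```python
# UD Case values for the seven cases listed above.
CASE_VALUES = {
    "nominative": "Nom",
    "genitive": "Gen",
    "dative": "Dat",
    "accusative": "Acc",
    "instrumental": "Ins",
    "comitative": "Com",
    "locative": "Loc",
}

def feats(case, **other):
    """Build a FEATS string with features sorted by name, ignoring letter case."""
    pairs = {"Case": CASE_VALUES[case], **other}
    ordered = sorted(pairs.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)

print(feats("dative", Number="Sing"))
```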
Syntax
- Tamil is a verb-final language; both SOV and OSV orders are possible.
- Core arguments are marked by the morphological cases nominative (subject) and accusative (object). Core arguments are bare noun phrases without postpositions.
- Subjects have the following characteristics:
  - Case marking: Subjects occur in nominative case without adpositions.
  - Passivization: Subjects are suppressed when verbs are passivized.
- Objects have the following characteristics:
  - Case marking: Objects occur in accusative case without adpositions.
  - Passivization: Objects become (non-expletive) subjects when verbs are passivized.
- Bare nominal arguments (i.e., verb-licensed dependents) in the dative case are not considered core arguments. They are attached as obl:arg.
- Postpositional arguments (i.e., verb-licensed dependents) are not considered core arguments. They are attached as obl:arg.
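
The attachment rules above can be summarised as a small decision function. This is only a sketch: the function name, its parameters, and the fallback to plain `obl` for all remaining dependents are assumptions, and real annotation decisions depend on more context than these four inputs.

```python
def argument_relation(case, has_adposition, verb_licensed, passive=False):
    """Pick a relation for a nominal dependent of a verb, following the rules above."""
    if not has_adposition:
        if case == "Nom":
            # Bare nominative: subject (passive subject in a passive clause).
            return "nsubj:pass" if passive else "nsubj"
        if case == "Acc":
            # Bare accusative: object.
            return "obj"
        if case == "Dat" and verb_licensed:
            # Bare dative argument: oblique argument, not core.
            return "obl:arg"
    elif verb_licensed:
        # Argument introduced by an adposition: oblique argument.
        return "obl:arg"
    # Assumption: everything else is treated as an adjunct.
    return "obl"

print(argument_relation("Dat", has_adposition=False, verb_licensed=True))
```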
Tamil uses 4 relation subtypes:
- advmod:emph for adverbials emphasizing noun phrases
- compound:prt to attach verbal particles to verbs
- nsubj:pass for nominal subjects in passive clauses
- obl:arg for oblique arguments (to distinguish them from other oblique dependents, i.e., adjuncts)
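
A quick consistency check for annotated data might look like the sketch below, which scans the DEPREL column of CoNLL-U text and reports any subtyped relation other than the four named here. The function name and the sample lines are hypothetical.

```python
ALLOWED_SUBTYPES = {"advmod:emph", "compound:prt", "nsubj:pass", "obl:arg"}

def unexpected_subtypes(conllu_text):
    """Return subtyped relations in DEPREL (column 8) that are not listed above."""
    found = set()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        deprel = cols[7]
        if ":" in deprel and deprel not in ALLOWED_SUBTYPES:
            found.add(deprel)
    return sorted(found)

sample = "\n".join([
    "1\tx\t_\t_\t_\t_\t2\tobl:arg\t_\t_",
    "2\ty\t_\t_\t_\t_\t0\troot\t_\t_",
    "3\tz\t_\t_\t_\t_\t2\tnmod:poss\t_\t_",
])
print(unexpected_subtypes(sample))
```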
References
- See also http://www.southasia.sas.upenn.edu/tamil/grammar/tamilgrammar12.html
- Tamil at the Language Gulper
Treebanks
There are two Tamil UD treebanks at present: