UD Thai TUD
Language: Thai (code: th)
Family: Tai-Kadai
This treebank has been part of Universal Dependencies since the UD v2.17 release.
The following people have contributed to making this treebank part of UD: Panyut Sriwirote, Wei Qi Leong, Charin Polpanumas, Santhawat Thanyawong, William Chandra Tjhi, Wirote Aroonmanakun, Attapol T. Rutherford, Ratanon Jiamsundutsadee, Punyanuch Maitreenukul.
Repository: UD_Thai-TUD
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.17
License: CC BY-SA 4.0
Genre: wiki, news, fiction, nonfiction, academic, legal
Questions, comments? General annotation questions (either Thai-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on Github. If you want to collaborate, please contact [attapol • t (æt) chula • ac • th, punyanuch • maitree (æt) gmail • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
| Annotation | Source |
|---|---|
| Lemmas | not available |
| UPOS | annotated manually, natively in UD style |
| XPOS | not available |
| Features | not available |
| Relations | annotated manually, natively in UD style |
Description
UD Thai TUD (Thai Universal Dependency Treebank) is a treebank of 3,627 syntactic trees from the Thai National Corpus and Wikipedia, annotated in Universal Dependencies, covering diverse text types and topics across various domains.
The UD Thai TUD treebank was created to provide a broad-coverage syntactic resource for the Thai language under the Universal Dependencies (UD) framework. Text was randomly sampled from two major sources: the Thai National Corpus and the November 2020 dump of Thai Wikipedia. To ensure diversity, 5,000 paragraphs were selected from various document types—news articles, Wikipedia entries, essays, advertisements, interviews, and stories—covering a wide range of topics such as politics, crime, entertainment, sports, history, religion, culture, and science. After annotation and rigorous quality control, 3,627 well-formed dependency trees were retained in the final dataset.
Annotation Process
All paragraphs were tokenized using the newmm tokenizer from the PyThaiNLP library, then annotated on the Datasaur platform. A team of 10 annotators with linguistics backgrounds was trained to: (1) correct tokenization errors, (2) assign Universal POS (UPOS) tags, (3) identify dependency arcs, and (4) label relations (DEPREL) without subtypes. The LEMMA field was excluded due to the lack of inflectional morphology in Thai.
Following pilot annotations and manual review, two annotators demonstrating the highest accuracy were selected to complete the remaining data. Agreement was evaluated on 20 double-annotated sentences (399 tokens), achieving Cohen’s Kappa scores of 0.92 (UPOS) and 0.84 (DEPREL), and UAS/LAS scores of 0.85 and 0.78. The annotated data was then converted into CoNLL-U format and split into individual trees based on dependency structure. Trees with incomplete labels, multiple roots, or structural errors were corrected, with additional quality assurance performed via manual inspection of 50 randomly selected trees.
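The agreement figures above can be reproduced in principle with a few lines of code. The following is a minimal sketch, not the authors' evaluation script: it assumes both annotators worked on identical tokenization, and the function names and toy labels are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of tokens where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

def attachment_scores(arcs_a, arcs_b):
    """arcs_*: lists of (head, deprel) per token. Returns (UAS, LAS)."""
    n = len(arcs_a)
    uas = sum(a[0] == b[0] for a, b in zip(arcs_a, arcs_b)) / n
    las = sum(a == b for a, b in zip(arcs_a, arcs_b)) / n
    return uas, las

# Toy data (hypothetical, not taken from the treebank).
upos_a = ["NOUN", "VERB", "NOUN", "ADP"]
upos_b = ["NOUN", "VERB", "PROPN", "ADP"]
arcs_a = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "case")]
arcs_b = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "case")]

kappa = cohen_kappa(upos_a, upos_b)
uas, las = attachment_scores(arcs_a, arcs_b)
```

Kappa corrects raw agreement for chance, which is why it is reported alongside UAS and LAS rather than instead of them.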
The treebank consists of randomly shuffled sentences sampled from the Thai National Corpus (TNC) and the November 2020 dump of Thai Wikipedia, rather than complete documents. Each filename encodes the source document and the portion of the document from which the sentences were extracted:
- Wikipedia trees: filenames follow the format wiki_<wgArticleID>.
- TNC trees: filenames follow the format [tnc/][Original TNC filename][Part of the document].
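The two naming conventions can be told apart by prefix. This is a rough sketch under assumptions: it treats "tnc/" as a literal prefix, does not try to separate the TNC filename from the document part (the exact layout is not specified here), and uses made-up example names.

```python
def describe_source(name):
    """Classify a filename by the prefix conventions described above.

    Assumption: 'wiki_' and 'tnc/' appear literally at the start of the name.
    The remainder of a TNC name is returned unparsed.
    """
    if name.startswith("wiki_"):
        return {"source": "wikipedia", "article_id": name[len("wiki_"):]}
    if name.startswith("tnc/"):
        return {"source": "tnc", "rest": name[len("tnc/"):]}
    return {"source": "unknown", "rest": name}

# Hypothetical names for illustration only.
wiki_info = describe_source("wiki_12345")
tnc_info = describe_source("tnc/EXAMPLE_part1")
```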
Train–Test Split
The final treebank was split into training, development, and test sets in an 8:1:1 ratio. It consists of syntactically complete trees rather than full documents. Filenames identify the source paragraph of each tree and typically reflect its origin in the Thai National Corpus or Wikipedia. Sentence IDs in the final treebank, however, do not encode genre or domain metadata.
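An 8:1:1 split of shuffled trees could look like the sketch below. The seed, the rounding rule, and the function name are assumptions; the actual procedure used for this treebank is not described beyond the ratio.

```python
import random

def split_811(items, seed=0):
    """Shuffle items and cut them into train/dev/test at roughly 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]      # the remainder absorbs rounding leftovers
    return train, dev, test

train, dev, test = split_811(range(100))
```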
Tokenization and Word Segmentation
- This corpus contains 3627 sentences and 77215 tokens.
- This corpus contains 68893 tokens (89%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 70 types of words that contain both letters and punctuation. Examples: พ.ศ., ค.ศ., อ., ธ.ค., ม.ค., ต., ก.พ., น.ส., จ., ตร., น., พ.ย., ศก., ผบ., พ.ต.อ., สน., Lo-Society, ดร., ผช., ผญบ., ล., สภ., อบจ., ก.ล.ต., กม., พ.ร.บ., พล.ต.อ., ส.ท., โทร., 802.1x/RADIUS, A., CSMA/CA, S., http://www.unseencar.com/content5.php, l'Opéra, ก.น., ก.ย., ก.หน., กทม., จนท., จอง-อิล, ฉก., ด.ญ., ดี.ซี., ท., นบ., นศ., บช., บช.น., ผกก.
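The token and spacing counts above can be recomputed from the CoNLL-U files. The sketch below assumes the standard 10-column CoNLL-U layout with SpaceAfter=No recorded in the MISC column; the sample sentence is invented for illustration.

```python
def space_stats(conllu_text):
    """Return (token_count, tokens_not_followed_by_space) for CoNLL-U text."""
    total = no_space = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        total += 1
        if "SpaceAfter=No" in cols[9].split("|"):
            no_space += 1
    return total, no_space

# Invented three-token sentence; the last token has an empty MISC column.
SAMPLE = "\n".join([
    "# text = ฉันกินข้าว",
    "\t".join(["1", "ฉัน", "_", "PRON", "_", "_", "2", "nsubj", "_", "SpaceAfter=No"]),
    "\t".join(["2", "กิน", "_", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", "ข้าว", "_", "NOUN", "_", "_", "2", "obj", "_", "_"]),
    "",
])

sample_total, sample_no_space = space_stats(SAMPLE)
reported_share = round(100 * 68893 / 77215)  # figures reported for this corpus
```

With the reported corpus figures, 68893 of 77215 tokens rounds to the 89% stated above.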
Morphology
Tags
- This corpus uses 15 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB
- This corpus does not use the following tags: INTJ, X
- This corpus contains 62 word types tagged as particles (PART): ก็, ก็ดี, ก็แล้วกัน, ขนาด, ขี้, ครับ, คือ, ค่ะ, ช่าง, ซะ, ซะแล้ว, ซึ่ง, ดังเช่น, ด้วย, ตาม, ทั้ง, ทั้งสิ้น, ที, ที่, นะ, นัก, นั่น, นั่นเอง, นั้น, นี่, นี่เอง, นี่แหละ, นี้, น่า, บ้าง, ฟะ, ล่ะ, ล่ะกัน, หรอ, หรอก, หรือไม่, ห้าม, อย่าง, อย่างน้อย, อย่างไร, อะไร, อาทิ, อีก, เช่น, เช่นกัน, เดียว, เป็นต้น, เลย, เสีย, เอง, แต่, แม้กระทั่ง, แม้แต่, แล้ว, แหละ, โดย, ใคร, ใด, ให้, ไม่, ไหน, ไหม
- This corpus contains 1 lemma tagged as pronoun (PRON): _
- This corpus contains 1 lemma tagged as determiner (DET): _
- Out of the above, 1 lemma occurred sometimes as PRON and sometimes as DET: _
- This corpus contains 1 lemma tagged as auxiliary (AUX): _
- Out of the above, 1 lemma occurred sometimes as AUX and sometimes as VERB: _
- This corpus does not use the VerbForm feature.
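The tag inventory check is easy to express in code. This sketch assumes the 17 UPOS tags of UD v2 and the standard CoNLL-U column order; the helper names are hypothetical.

```python
UNIVERSAL_UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

def iter_upos(conllu_text):
    """Yield the UPOS column of every regular token line in CoNLL-U text."""
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) == 10 and cols[0].isdigit():
            yield cols[3]

def unused_upos(tags):
    """Return the universal tags that never occur in the given tags."""
    return sorted(UNIVERSAL_UPOS - set(tags))

# The 15 tags this page reports as used.
REPORTED = ("ADJ ADP ADV AUX CCONJ DET NOUN NUM PART PRON "
            "PROPN PUNCT SCONJ SYM VERB").split()
print(unused_upos(REPORTED))  # ['INTJ', 'X']
```

On real data, `unused_upos(iter_upos(text))` would perform the same check directly on a treebank file's contents.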
Nominal Features
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
Other Features
- ExtPos
  - ADJ
    - ADJ: ต่อ, ทั่ว, เดียว, ขยัน, ถัด, ที่, เขียว, แวด
  - ADP
    - ADP: จน, ตั้ง, ท่าม, โดย, พร้อม, ภาย, หลัง, เช่น, ให้, ของ
    - PART: อาทิ
  - ADV
    - ADV: ต่อ, ก็, แต่, ใน, ตลอด, ทรง, ทัน, มาก, เพียง, ค่อน
    - PART: นะ, เลย, แต่
  - AUX
    - AUX: ยัง, ควร, จะ, อาจ, คง, ที่, อยู่, เพิ่ง
  - CCONJ
    - CCONJ: อย่างไร, ขณะ, รวม, นอก, แต่, ดัง, ทั้ง, ใน, พร้อม, อย่าง
  - DET
    - DET: ดัง, นั่น, เหล่า
  - PRON
    - PRON: ที่, ฝ่า
  - SCONJ
    - PART: ไม่
    - SCONJ: เนื่อง, หลัง, ไม่, ถึง, เป็น, ตัวอย่าง, แม้, ขณะ, นอก, ภาย
    - VERB: ยก
- ADJ
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as copula (cop). Example: _.
- This corpus uses 1 lemma as auxiliary (aux). Example: _.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (2452)
  - VERB--NOUN-ADP(_) (49)
  - VERB--PRON (1978)
  - VERB--PRON-ADP(_) (1)
- obj
  - VERB--NOUN (5932)
  - VERB--NOUN-ADP(_) (29)
  - VERB--PRON (549)
  - VERB--PRON-ADP(_) (1)
- iobj
  - VERB--NOUN (29)
  - VERB--NOUN-ADP(_) (2)
  - VERB--PRON (12)
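Counts of this kind can be gathered with a single pass over each tree. The sketch below is one possible reading of the pattern labels, not the script that produced the numbers: it assumes that a suffix such as -ADP(_) marks a child that has its own ADP dependent attached as case, with that adposition's lemma in parentheses. The token dictionaries and the toy sentence are hypothetical.

```python
from collections import Counter

CORE_RELATIONS = ("nsubj", "obj", "iobj")

def core_argument_patterns(sentence):
    """Count VERB-parent / NOUN-or-PRON-child patterns in one sentence.

    sentence: list of dicts with keys id, lemma, upos, head, deprel.
    """
    by_id = {tok["id"]: tok for tok in sentence}
    # Lemma of the first ADP attached as 'case' to each token (assumed reading).
    case_adp = {}
    for tok in sentence:
        if tok["deprel"] == "case" and tok["upos"] == "ADP":
            case_adp.setdefault(tok["head"], tok["lemma"])
    counts = Counter()
    for tok in sentence:
        if tok["deprel"] not in CORE_RELATIONS or tok["upos"] not in ("NOUN", "PRON"):
            continue
        parent = by_id.get(tok["head"])
        if parent is None or parent["upos"] != "VERB":
            continue
        pattern = "VERB--" + tok["upos"]
        if tok["id"] in case_adp:
            pattern += "-ADP(" + case_adp[tok["id"]] + ")"
        counts[(tok["deprel"], pattern)] += 1
    return counts

# Toy tree; lemmas are "_" because this treebank has no lemma annotation.
toy = [
    {"id": 1, "lemma": "_", "upos": "PRON", "head": 2, "deprel": "nsubj"},
    {"id": 2, "lemma": "_", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "lemma": "_", "upos": "ADP", "head": 4, "deprel": "case"},
    {"id": 4, "lemma": "_", "upos": "NOUN", "head": 2, "deprel": "obj"},
]
toy_counts = core_argument_patterns(toy)
```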
Relations Overview
- This corpus does not use relation subtypes.
- The following 2 relation types are not used in this corpus at all: goeswith, reparandum
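Both statements can be checked against a list of DEPREL values. This sketch assumes the 37 universal relations of UD v2 and treats anything after a colon as a subtype; the function name is hypothetical.

```python
UNIVERSAL_RELATIONS = set(
    "acl advcl advmod amod appos aux case cc ccomp clf compound conj cop "
    "csubj dep det discourse dislocated expl fixed flat goeswith iobj list "
    "mark nmod nsubj nummod obj obl orphan parataxis punct reparandum root "
    "vocative xcomp".split()
)

def relation_report(deprels):
    """Return (unused universal relations, subtyped labels found)."""
    used_base = set()
    subtyped = set()
    for rel in deprels:
        base, _, subtype = rel.partition(":")
        used_base.add(base)
        if subtype:
            subtyped.add(rel)
    return sorted(UNIVERSAL_RELATIONS - used_base), sorted(subtyped)

# A corpus that uses every universal relation except the two named above.
unused, subtypes = relation_report(UNIVERSAL_RELATIONS - {"goeswith", "reparandum"})
```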