UD Indonesian CSUI
Language: Indonesian (code: id)
Family: Austronesian, Malayo-Sumbawan
This treebank has been part of Universal Dependencies since the UD v2.7 release.
The following people have contributed to making this treebank part of UD: Ika Alfina, Jessica Naraiswari Arwidarasti, Muhammad Yudistira Hanifmuti, Arawinda Dinakaramani, Ruli Manurung, Fam Rashel, Andry Luthfi.
Repository: UD_Indonesian-CSUI
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.14
License: CC BY-SA 4.0
Genre: nonfiction, news
Questions, comments? General annotation questions (either Indonesian-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [ika • alfina (æt) cs • ui • ac • id]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
Annotation | Source |
Lemmas | assigned by a program, with some manual corrections, but not a full manual verification |
UPOS | annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion |
XPOS | annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion |
Features | assigned by a program, with some manual corrections, but not a full manual verification |
Relations | annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion |
UD Indonesian-CSUI is a conversion of Kethu, an Indonesian constituency treebank in the Penn Treebank format, which was itself converted from a constituency treebank built by Dinakaramani et al. (2015). We named this treebank Indonesian-CSUI because all three versions of the treebank were built at the Faculty of Computer Science, Universitas Indonesia.
The UD Indonesian-CSUI treebank was converted automatically from the Kethu treebank, an Indonesian constituency treebank in the Penn Treebank format. The Kethu treebank itself was converted from a constituency treebank built by Dinakaramani et al. (2015).
Other characteristics of the treebank:
- Genre: news in formal Indonesian (the majority is economic news)
- This treebank consists of 1030 sentences and 28K words. The CSUI treebank is divided into testing and training datasets:
  - The testing dataset consists of around 10K words
  - The training dataset consists of around 18K words
- The average sentence length is around 27.4 words per sentence, which is high compared to the Indonesian-PUD treebank, whose average sentence length is 19.4.
- The original constituency treebank was built with manual annotation by Arawinda Dinakaramani, Fam Rashel, Andry Luthfi, and Ruli Manurung at the Faculty of Computer Science, Universitas Indonesia, in 2015.
- The previous treebank was converted to the Penn Treebank format by Ika Alfina and Jessica Naraiswari Arwidarasti in 2019-2020. This PTB version was named Kethu.
- The Kethu treebank was converted automatically to this UD treebank by Alfina et al. (2020).
- The lemma (LEMMA) and morphological features (FEATS) were generated using Aksara and manually corrected.
References
- Ika Alfina, Indra Budi, and Heru Suhartanto. “Tree Rotations for Dependency Trees: Converting the Head-Directionality of Noun Phrases”. In Journal of Computer Science, 2020, Vol 16 No 11.
- M. Yudistira Hanifmuti and Ika Alfina. “Aksara: An Indonesian Morphological Analyzer that Conforms to the UD v2 Annotation Guidelines”. In Proceedings of the 2020 International Conference on Asian Language Processing (IALP), Kuala Lumpur, Malaysia, 4-6 December 2020.
Statistics of UD Indonesian CSUI
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X
Features
Clusivity – Definite – Degree – Foreign – Mood – Number – NumType – Person – Polarity – Polite – PronType – Reflex – Voice
Relations
acl – acl:relcl – advcl – advmod – advmod:emph – amod – appos – aux – case – case:adv – cc – cc:preconj – ccomp – clf – compound:a – conj – cop – csubj – dep – det – discourse – dislocated – fixed – flat – flat:foreign – flat:name – iobj – mark – nmod – nmod:lmod – nmod:poss – nmod:tmod – nsubj – nsubj:pass – nummod – obj – obl – obl:agent – obl:tmod – orphan – parataxis – punct – root – xcomp
Tokenization and Word Segmentation
- This corpus contains 1030 sentences, 27771 tokens and 28263 syntactic words.
- This corpus contains 3923 tokens (14%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 148 types of words that contain both letters and punctuation. Examples: rata-rata, APBN-P, masing-masing, Ltd., non-migas, 's, non-keuangan, AA-idn, II/2007, Ka'ban, Pte., langkah-langkah, negara-negara, No., RAPBN-P, bank-bank, baru-baru, idA-, syarat-syarat, C/D, Co., I/2007, II/2003, LLC., S., S/A, Tbk., anak-anak, benar-benar, berbeda-beda, berturut-turut, minus/idn, monyet-monyet, nama-nama, non-residence, obligasi-obligasi, peringkat-peringkat, perusahaan-perusahaan, prinsip-prinsip, rasio-rasio, semata-mata, sumber-sumber, terus-menerus, 03-Oct, 05-May, 10-Jan, 17-Mar, 23-Aug, 26-Sep, 34/PMK.011/2007
- This corpus contains 492 multi-word tokens. On average, one multi-word token consists of 2.00 syntactic words.
- There are 200 types of multi-word tokens. Examples: katanya, adanya, menurutnya, ujarnya, laporannya, pihaknya, lainnya, tambahnya, membaiknya, apakah, pernyataannya, sahamnya, jelasnya, masuknya, sisanya, tingginya, bersihnya, walaupun, Keuangannya, besarnya, meningkatnya, meskipun, naiknya, rencananya, ucapnya, Dikatakannya, antaranya, bunganya, jumlahnya, kalinya, nilainya, penjelasannya, persnya, turunnya, usahanya, Dijelaskannya, Disebutkannya, Ditambahkannya, Misalnya, akhirnya, artinya, aslinya, baiknya, banyaknya, bukanlah, halnya, informasinya, inilah, instrumennya, investasinya.
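The counts above follow the CoNLL-U convention, in which a multi-word token such as katanya occupies one range line (for example `4-5`) followed by one line per syntactic word. The sketch below counts sentences, tokens, syntactic words, and multi-word tokens under that convention. The function name `count_units` and the sample sentence are illustrative assumptions; the sample is hand-made and not copied from the treebank.

```python
def count_units(conllu_text):
    """Count sentences, surface tokens, syntactic words and multi-word tokens."""
    sentences = tokens = words = multiword_tokens = 0
    covered_until = 0      # last word ID covered by the current multi-word token
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():           # a blank line ends the sentence
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):       # sentence-level comment
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        token_id = line.split("\t")[0]
        if "-" in token_id:            # range line, e.g. "4-5"
            multiword_tokens += 1
            tokens += 1
            covered_until = int(token_id.split("-")[1])
        elif token_id.isdigit():       # ordinary syntactic word
            words += 1
            if int(token_id) > covered_until:
                tokens += 1            # word that is also a surface token
        # IDs such as "5.1" (empty nodes) are ignored


    return {
        "sentences": sentences,
        "tokens": tokens,
        "words": words,
        "multiword_tokens": multiword_tokens,
    }


# Hand-made illustration (assumption): "Harga naik, katanya."
SAMPLE = "\n".join([
    "# text = Harga naik, katanya.",
    "1\tHarga\tharga\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tnaik\tnaik\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "3\t,\t,\tPUNCT\t_\t_\t4\tpunct\t_\t_",
    "4-5\tkatanya\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No",
    "4\tkata\tkata\tVERB\t_\t_\t2\tparataxis\t_\t_",
    "5\tnya\tnya\tPRON\t_\t_\t4\tnsubj\t_\t_",
    "6\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
])

print(count_units(SAMPLE))
```

The reported figures are consistent with this scheme: 27771 tokens plus one extra word for each of the 492 two-word multi-word tokens gives 28263 syntactic words, or about 27.4 words per sentence over 1030 sentences.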
- This corpus uses 17 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
- This corpus contains 8 word types tagged as particles (PART): belum, bukan, jangan, kah, lah, pun, tak, tidak
- This corpus contains 18 lemmas tagged as pronouns (PRON): anda, apa, begitu, demikian, dia, diri, ia, ini, itu, kami, kita, mana, mereka, nya, saya, sesuatu, siapa, yang
- This corpus contains 22 lemmas tagged as determiners (DET): bagi, banyak, beberapa, berbagai, buah, ini, itu, masing-masing, nya, orang, para, sana, sebut, sedikit, segala, seluruh, semua, sendiri, setiap, si, suatu, yang
- Out of the above, 4 lemmas occurred sometimes as PRON and sometimes as DET: ini, itu, nya, yang
- This corpus contains 12 lemmas tagged as auxiliaries (AUX): adalah, akan, bisa, boleh, dapat, harus, ialah, mungkin, sedang, sudah, telah, tengah
- Out of the above, 1 lemma occurred sometimes as AUX and sometimes as VERB: dapat
- This corpus does not use the VerbForm feature.
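The overlap between the PRON and DET lemma lists can be checked with a set intersection. The sketch below uses the two lemma lists exactly as given above; the helper name `shared_lemmas` is a hypothetical choice for illustration.

```python
# Lemma lists copied from the statistics above.
PRON_LEMMAS = {
    "anda", "apa", "begitu", "demikian", "dia", "diri", "ia", "ini", "itu",
    "kami", "kita", "mana", "mereka", "nya", "saya", "sesuatu", "siapa", "yang",
}
DET_LEMMAS = {
    "bagi", "banyak", "beberapa", "berbagai", "buah", "ini", "itu",
    "masing-masing", "nya", "orang", "para", "sana", "sebut", "sedikit",
    "segala", "seluruh", "semua", "sendiri", "setiap", "si", "suatu", "yang",
}


def shared_lemmas(first, second):
    """Return the lemmas that occur in both sets, in alphabetical order."""
    return sorted(first & second)


print(shared_lemmas(PRON_LEMMAS, DET_LEMMAS))  # ['ini', 'itu', 'nya', 'yang']
```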
Nominal Features
- Number
  - Plur
    - DET: beberapa, banyak, para, berbagai
    - NOUN: langkah-langkah, negara-negara, bank-bank, syarat-syarat, anak-anak, monyet-monyet, nama-nama, obligasi-obligasi, peringkat-peringkat, perusahaan-perusahaan
    - PRON: kita, mereka, kami
  - Sing
    - NOUN: persen, Rp, tahun, dolar, sebesar, saham, perusahaan, pemerintah, negara, pertumbuhan
    - PRON: nya, dia, ia, saya, anda
- Definite
  - Def
    - DET: nya, yang
  - Ind
    - DET: sebuah, seorang, suatu
Degree and Polarity
- Degree
  - Sup
    - ADJ: terakhir, terbesar, tertinggi, terbaik, tertentu, terkaya, terdekat, terbanyak, terendah, terutama
- Polarity
  - Neg
    - PART: tidak, belum, bukan, tak, jangan
Verbal Features
- Mood
  - Ind
    - VERB: kata, menjadi, mencapai, mengatakan, ada, meningkat, naik, dibandingkan, lalu, merupakan
- Voice
  - Act
    - VERB: kata, menjadi, mencapai, mengatakan, ada, meningkat, naik, lalu, merupakan, turun
  - Pass
    - VERB: dibandingkan, dibanding, terjadi, dilakukan, diperkirakan, termasuk, terdiri, diharapkan, didorong, diterbitkan
Pronouns, Determiners, Quantifiers
- PronType
  - Art
    - DET: nya, sebuah, seorang, yang, suatu
  - Dem
    - DET: ini, tersebut, itu, si, sana, sebagian
    - PRON: itu, demikian, ini, mana, begitu
  - Emp
    - DET: sendiri
  - Ind
    - DET: beberapa, banyak, para, berbagai, sedikit
    - PRON: sesuatu
  - Int
    - PRON: Apa
  - Prs
    - PRON: nya, dia, kita, ia, mereka, saya, kami, diri, anda
  - Rel
    - ADV: bagaimana
    - PRON: yang, apa, siapa
  - Tot
    - DET: seluruh, semua, masing-masing, setiap, segala
    - NUM: Ke-23
- NumType
  - Card
    - NUM: 2007, triliun, miliar, 2006, juta, 2008, satu, dua, 30, 10
  - Ord
    - ADJ: pertama, kedua, ketiga, keenam, kedelapan, kelima, ke-10, ke-2, ke-4, ke-40
- Reflex
  - Yes
    - PRON: diri
- Person
  - 1
    - PRON: kita, saya, kami
  - 2
    - PRON: anda
  - 3
    - PRON: nya, dia, ia, mereka
- Polite
  - Form
    - PRON: saya, anda
Other Features
- Clusivity
  - Ex
    - PRON: kami
  - In
    - PRON: kita
- Foreign
  - Yes
    - X: rate, year, rating, mortgage, subprime, on, listed, net, netto, outlook
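In CoNLL-U files, feature-value pairs such as those listed above are stored together in the FEATS column as a `|`-separated string, with `_` marking an empty set. A minimal sketch of reading that column follows. The function name `parse_feats` is hypothetical, and the example string for kami is assembled from the lists above (Number=Plur, PronType=Prs, Person=1, Clusivity=Ex) rather than copied from a treebank row.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict of feature name -> value."""
    if feats == "_":                   # "_" means the word has no features
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Assumed FEATS string for "kami", built from the feature lists on this page.
kami = parse_feats("Clusivity=Ex|Number=Plur|Person=1|PronType=Prs")
print(kami["Clusivity"])  # Ex
```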
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop): adalah.
- This corpus uses 10 lemmas as auxiliaries (aux). Examples: akan, telah, bisa, dapat, sudah, harus, sedang, mungkin, tengah, boleh.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
- VERB--NOUN (687)
- VERB--PRON (503)
- obj
- VERB--NOUN (943)
- VERB--PRON (31)
- iobj
- VERB--NOUN (1)
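Counts of this kind can be reproduced by pairing each noun or pronoun with the UPOS tag of its head. The sketch below is one way to do that over CoNLL-U text, under the assumption that only the exact labels nsubj, obj, and iobj are counted (subtypes such as nsubj:pass are left out). The function names and the sample sentence are hypothetical, and the sample is hand-made rather than taken from the treebank.

```python
from collections import Counter

CORE_RELATIONS = {"nsubj", "obj", "iobj"}   # assumption: exact labels only
NOMINAL_TAGS = {"NOUN", "PRON"}


def read_sentences(conllu_text):
    """Yield, per sentence, the rows of its syntactic words as field lists."""
    rows = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if rows:
                yield rows
            rows = []
        elif not line.startswith("#"):
            fields = line.split("\t")
            if fields[0].isdigit():     # skip multi-word ranges and empty nodes
                rows.append(fields)
    if rows:
        yield rows


def count_core_pairs(conllu_text):
    """Count (relation, 'VERB--CHILD') pairs for verb parents and nominal children."""
    counts = Counter()
    for rows in read_sentences(conllu_text):
        upos_by_id = {row[0]: row[3] for row in rows}
        for row in rows:
            upos, head, deprel = row[3], row[6], row[7]
            if (deprel in CORE_RELATIONS and upos in NOMINAL_TAGS
                    and upos_by_id.get(head) == "VERB"):
                counts[(deprel, "VERB--" + upos)] += 1
    return counts


# Hand-made illustration (assumption): "Harga naik, katanya."
SAMPLE = "\n".join([
    "# text = Harga naik, katanya.",
    "1\tHarga\tharga\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tnaik\tnaik\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "3\t,\t,\tPUNCT\t_\t_\t4\tpunct\t_\t_",
    "4-5\tkatanya\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No",
    "4\tkata\tkata\tVERB\t_\t_\t2\tparataxis\t_\t_",
    "5\tnya\tnya\tPRON\t_\t_\t4\tnsubj\t_\t_",
    "6\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
])

for (relation, pair), count in sorted(count_core_pairs(SAMPLE).items()):
    print(relation, pair, count)
```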
Verbs with Reflexive Core Objects
- This corpus contains 5 lemmas that occur at least once with a reflexive core object (obj or iobj). Examples: beri diri, daftar diri, tahu diri, tarik diri, tempat diri
Relations Overview
- This corpus uses 13 relation subtypes: acl:relcl, advmod:emph, case:adv, cc:preconj, compound:a, flat:foreign, flat:name, nmod:lmod, nmod:poss, nmod:tmod, nsubj:pass, obl:agent, obl:tmod
- The following main type is never used alone; it is always subtyped: compound
- The following 5 relation types are not used in this corpus at all: vocative, expl, list, goeswith, reparandum
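The three statements above can be derived from the list of relations used in the corpus and the 37 universal relations of UD v2. The sketch below shows the set arithmetic; the function name `summarize_relations` is a hypothetical choice, and the list of used relations is copied from this page.

```python
# The 37 universal dependency relations of UD v2.
UNIVERSAL = set("""
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj dep
det discourse dislocated expl fixed flat goeswith iobj list mark nmod nsubj
nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split())

# Relations used in this corpus, as listed on this page.
USED = """
acl acl:relcl advcl advmod advmod:emph amod appos aux case case:adv cc
cc:preconj ccomp clf compound:a conj cop csubj dep det discourse dislocated
fixed flat flat:foreign flat:name iobj mark nmod nmod:lmod nmod:poss nmod:tmod
nsubj nsubj:pass nummod obj obl obl:agent obl:tmod orphan parataxis punct root
xcomp
""".split()


def summarize_relations(used, universal):
    """Return (subtypes, main types only seen subtyped, unused main types)."""
    subtypes = sorted(rel for rel in used if ":" in rel)
    bare = {rel for rel in used if ":" not in rel}
    subtyped_bases = {rel.split(":")[0] for rel in subtypes}
    only_subtyped = sorted(subtyped_bases - bare)
    unused = sorted(universal - bare - subtyped_bases)
    return subtypes, only_subtyped, unused


subtypes, only_subtyped, unused = summarize_relations(USED, UNIVERSAL)
print(len(subtypes), only_subtyped, unused)
```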