UD for Indonesian 
Tokenization and Word Segmentation
In general, words are delimited by whitespace characters. Multiword tokens and punctuation receive special treatment.
- Special treatment of multiword tokens:
- Multiword tokens that end with the particles -lah/-kah/-tah/-pun are split into two tokens. These particles usually emphasize the word before them. The particles -lah/-kah/-tah are always clitics, while pun can be written either as a clitic or as a separate token. These clitic particles are tokenized as follows:
- bacalah is split into baca “read” and lah
- diakah is split into dia “he/she” and kah
- apatah is split into apa “what” and tah
- walaupun is split into walau “although” and pun
- Multiword tokens that contain the clitics -ku “me/my”, -mu “you/your”, or -nya “he/him/she/her/it” are split into two tokens, with exceptions for some words ending in -nya.
- Words ending in -nya where -nya itself serves as a pronoun or determiner are split into two tokens. For example:
- -nya as a pronoun: mencintainya “love him/her/it” is split into mencintai “love” and nya “him/her/it”.
- -nya as a possessive pronoun: bukunya “his/her/its book” is split into buku “book” and nya “his/her/its”.
- -nya as a determiner in predicate nominalisation: meningkatnya “the increase” is split into meningkat “increase” and nya “the”.
- Words ending in -nya that function as adverbs, adjectives, or auxiliaries are not split. For example:
- adverbs ending in -nya: khususnya “especially”, awalnya “initially”, akhirnya “finally”
- adjectives ending in -nya: sebelumnya “previous”, sesudahnya “next”, berikutnya “next”
- auxiliaries ending in -nya: seharusnya/sebaiknya “shall/should”
- Special treatment of punctuation. All punctuation symbols are separated from words, except in two cases:
- Hyphens in reduplicated words. Indonesian has many reduplicated words among nouns (both singular and plural), verbs, adjectives, adverbs, and so on. These reduplicated words are not split and remain one token. Examples of reduplicated words:
- Singular noun: mata-mata “spy”
- Plural noun: anak-anak “children”
- Verb: merobek-robek “shredding”
- Adjective: hiruk-pikuk “noisy”
- Adverb: terus-menerus “continuously”
- Abbreviations. Abbreviations such as Mr., M.Sc., and Tn. are not split and remain one token.
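The tokenization rules above can be sketched in Python as follows. This is a minimal illustration, not the treebanks' actual tokenizer: the word lists `ABBREVIATIONS`, `NYA_UNSPLIT`, and `STEMS` are hypothetical and contain only the examples from this section. Two simplifications are assumed: every hyphenated word is kept whole (not only reduplications), and a clitic is split off only when the remaining stem is in the small lexicon, since plain suffix matching would wrongly split words such as sekolah “school” or buku “book”. In real annotation, whether -nya is split depends on its function in context, which a word list cannot decide.

```python
import re

# Hypothetical, deliberately tiny word lists taken from the examples above.
ABBREVIATIONS = ("Mr.", "M.Sc.", "Tn.")
NYA_UNSPLIT = {
    "khususnya", "awalnya", "akhirnya",        # adverbs
    "sebelumnya", "sesudahnya", "berikutnya",  # adjectives
    "seharusnya", "sebaiknya",                 # auxiliaries
}
STEMS = {"baca", "dia", "apa", "walau", "mencintai", "buku", "meningkat"}
SUFFIXES = ("lah", "kah", "tah", "pun", "ku", "mu", "nya")

_abbr = "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))
TOKEN_RE = re.compile(
    rf"(?:{_abbr})"     # listed abbreviations stay whole
    r"|\w+(?:-\w+)*"    # words, including hyphenated reduplications
    r"|[^\w\s]"         # every other punctuation symbol is its own token
)


def split_clitic(token: str) -> list[str]:
    """Split a trailing particle or pronoun clitic off a token, if any."""
    word = token.lower()
    if word in NYA_UNSPLIT:
        return [token]
    for suffix in SUFFIXES:
        stem = token[: -len(suffix)]
        # Split only when the stem is a known word.
        if word.endswith(suffix) and stem.lower() in STEMS:
            return [stem, token[-len(suffix):]]
    return [token]


def tokenize(text: str) -> list[str]:
    """Separate punctuation, then split clitics from each word."""
    tokens = []
    for raw in TOKEN_RE.findall(text):
        tokens.extend(split_clitic(raw))
    return tokens
```

For instance, `tokenize("Anak-anak membaca bukunya.")` keeps the reduplicated noun whole and splits bukunya into buku and nya, while `tokenize("Tn. Budi akhirnya datang.")` leaves both the abbreviation and the adverb akhirnya intact.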
Morphology
Tags
- We refer to KBBI (Kamus Besar Bahasa Indonesia/Indonesian Great Dictionary) as the reference dictionary. However, since this dictionary only defines 7 word classes: noun, verb, adjective, adverb, pronoun, particle and number, we need to make adjustments so that the tags conform to UD v2.
- Indonesian UD treebanks use all 17 universal POS categories.
- PART is used for:
- negation words: tidak/tak/bukan “no/not”, belum “not yet”, jangan “don’t + VERB”
- particles of -lah, -kah, -tah, pun, that have been discussed in the previous section.
- The auxiliary (AUX) vs. VERB distinction is based on examples from the English treebanks, since KBBI has no AUX word class. We defined 14 Indonesian words as AUX as follows:
- adalah and ialah “be” serve as copulas.
- Tense-related AUX:
- akan/bakal “will/would” for the future tense.
- sedang/tengah “be” for the present tense.
- telah/sudah “have/has/had” for the past tense.
- Modal-related AUX:
- harus/mesti/wajib as the equivalents of modal “must”.
- sebaiknya/seharusnya/perlu as the equivalents of modal “shall/should”.
- bisa/dapat/sanggup/mampu as the equivalents of modal “can/could”.
- boleh as the equivalent of modal “may”.
- mungkin as the equivalent of modal “might”.
- The pronoun (PRON) vs. determiner (DET) distinction is also based on examples from the English treebanks, since KBBI does not define a DET word class either.
- The following word types are tagged as PRON:
- personal pronouns, such as saya/aku/ku “I”, kamu/mu/anda “you”, dia/ia/nya “he/she/it/him/her/its”, kami/kita “we/us/our”, mereka “they/them/their”
- interrogative pronouns: apa “what”, siapa “who” as in Apa yang kamu inginkan? “What do you want?”
- relative pronouns: apa “what”, siapa “who” as in Saya tahu siapa yang kamu maksud. “I know who you mean.”
- indefinite pronouns: seseorang “someone/somebody”, sesuatu “something”
- total pronouns, such as semua “all” as in Semua kecuali bukumu “All except your books”.
- demonstrative pronouns: ini “this” as in Ini bukan salahmu. “This is not your fault”.
- The following word types are tagged as DET:
- demonstrative determiners: ini “this” as in Kota ini sangat indah. “This city is very beautiful.”
- pronominal numerals: beberapa, berbagai, para “some/many”, semua “all” as in semua siswa “all students”
- Indonesian has the following coordinating conjunction words (CCONJ):
- dan, serta, maupun as the equivalents of “and” in English
- atau “or”
- tapi, tetapi, namun, melainkan as the equivalents of “but” in English
Features
- We propose the use of 14 of 24 features defined in UD v2 that are relevant to Indonesian grammar:
- Abbr, with one possible value: Yes. This feature can be applied to all UPOS categories, except PUNCT and SYM.
- Clusivity, applies to PRON with two possible values: Ex and In.
- Clusivity=Ex for kami “we/our”
- Clusivity=In for kita “we/our”
- Degree, applies to ADJ with one possible value: Sup.
- Degree=Sup for superlative adjectives, such as terbaik “best”, tercantik “most beautiful”, etc.
- Foreign, with one possible value: Yes. This feature only applies to X.
- Mood, with two possible values: Ind and Imp.
- Number, applies to DET, NOUN, and PRON, with two possible values: Sing or Plur.
- NumType, applies to NUM and ADJ, with two possible values: Card or Ord.
- NumType=Card is used for NUM.
- NumType=Ord is used for ordinal numbers tagged as ADJ.
- Person, applies to PRON with three possible values: 1, 2, 3.
- Polarity, with one possible value: Neg, applies to PART and INTJ.
- Poss, applies to PRON, with one possible value: Yes for PRON that serves as possessive pronouns.
- PronType, applies to PRON, DET, and ADV. For Indonesian, 7 possible values can be applied:
- PronType=Dem, applies to ADV, DET, and PRON, such as for itu “that” in Itu masalahmu. “That is your problem.”
- PronType=Emp, applies to DET, such as for sendiri “self” in Kamu harus percaya pada dirimu sendiri “You have to believe in yourself”.
- PronType=Ind, applies to ADV, DET, and PRON, such as for seseorang “someone/somebody” or sesuatu “something”
- PronType=Int, applies to PRON and ADV.
- PronType=Int for PRON, such as for apa “what” and siapa “who” in interrogative sentences
- PronType=Int for ADV, such as for bagaimana “how” and kapan “when” in interrogative sentences
- PronType=Prs, applies to PRON for all personal pronouns.
- PronType=Rel, applies to PRON and ADV.
- PronType=Rel for PRON, such as for apa “what”, siapa “who”, yang “that”.
- PronType=Rel for ADV, such as for di mana “where”, bagaimana “how” and kapan/saat/ketika “when” in non-interrogative sentences
- PronType=Tot, applies to ADV, DET, and PRON.
- PronType=Tot for PRON, such as for semua “all” in Semua adalah milikmu. “All is yours.”
- PronType=Tot for DET, such as for semua “all” in Semua siswa terlihat senang. “All students look happy.”
- PronType=Tot for ADV, such as for selalu “always” in Dia selalu terlambat. “She is always late.”
- Reflex, applies to PRON with one possible value: Yes.
- Typo, with one possible value, Yes. This feature can be applied to all UPOS categories except PUNCT and SYM.
- Voice, applies to VERB with two possible values: Act and Pass. Voice alternation is treated as inflection and the active and passive counterparts have the same lemma.
- Voice=Act for active verbs that have characteristic of using base word, prefixes me-, ber-.
- Active verbs without affix: duduk “sit”, pergi “go”
- Active verbs with prefix me-: memperbaiki “fix”, mengakui “admit”
- Active verbs with prefix ber-: belajar “study”, bekerja “work”
- Voice=Pass for passive verbs that have characteristic of using prefixes di-, ter- or circumfix ke-an.
- Passive verbs with prefix di-: dipublikasikan “be published”, dilepaskan “be released”
- Passive verbs with prefix ter-: terbakar “on fire”, terjatuh “fell”, terkejut “shocked”
- Passive verbs with confix ke-an: ketinggalan “lag behind”, kecurian “be stolen”
- We consider these 10 UD v2 features not relevant to Indonesian grammar:
- Gender. Indonesian words have no gender.
- Animacy. As with Gender, Indonesian has no agreement requirements between words.
- NounClass, for the same reason as Gender and Animacy.
- Case, for the same reason as Gender, Animacy, and NounClass.
- Tense. Indonesian verbs have the same form in all tenses.
- Aspect, for the same reason as Tense.
- Definite
- Evident
- Polite
- VerbForm
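The feature inventory in this section can be encoded as a small table and used to check a token's FEATS column. The sketch below is an illustration under assumptions, not an official validator: the names `FEATURES` and `check_feats` are hypothetical, and features for which the text states no UPOS restriction (such as Mood) are left unrestricted.

```python
ALL_UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}
NOT_PUNCT_SYM = ALL_UPOS - {"PUNCT", "SYM"}

# feature -> (allowed values, allowed UPOS tags; None = no restriction stated)
FEATURES = {
    "Abbr": ({"Yes"}, NOT_PUNCT_SYM),
    "Clusivity": ({"Ex", "In"}, {"PRON"}),
    "Degree": ({"Sup"}, {"ADJ"}),
    "Foreign": ({"Yes"}, {"X"}),
    "Mood": ({"Ind", "Imp"}, None),
    "Number": ({"Sing", "Plur"}, {"DET", "NOUN", "PRON"}),
    "NumType": ({"Card", "Ord"}, {"NUM", "ADJ"}),
    "Person": ({"1", "2", "3"}, {"PRON"}),
    "Polarity": ({"Neg"}, {"PART", "INTJ"}),
    "Poss": ({"Yes"}, {"PRON"}),
    "PronType": ({"Dem", "Emp", "Ind", "Int", "Prs", "Rel", "Tot"},
                 {"PRON", "DET", "ADV"}),
    "Reflex": ({"Yes"}, {"PRON"}),
    "Typo": ({"Yes"}, NOT_PUNCT_SYM),
    "Voice": ({"Act", "Pass"}, {"VERB"}),
}


def check_feats(upos: str, feats: str) -> list[str]:
    """Return a list of problems for one token's UPOS tag and FEATS string."""
    problems = []
    if feats == "_":  # CoNLL-U placeholder for "no features"
        return problems
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        if name not in FEATURES:
            problems.append(f"{name}: not in the proposed feature set")
            continue
        values, tags = FEATURES[name]
        if value not in values:
            problems.append(f"{name}={value}: value not used")
        if tags is not None and upos not in tags:
            problems.append(f"{name}: not applied to {upos}")
    return problems
```

With this table, a pronoun such as kami annotated as `Clusivity=Ex|Number=Plur|Person=1|PronType=Prs` passes, while a feature from the unused group, such as Gender, is reported.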
Syntax
Core Arguments, Oblique Arguments and Adjuncts
- The default word order is SVO, so the subject (nsubj) normally precedes and the object follows the verb (with the exception of inverted sentences).
- A verb may serve as the subject and is labeled as clausal subject, either as csubj or csubj:pass.
- Transitive verbs will have a noun phrase as the object (obj).
- Passive verbs can be followed by an agent (obl:agent). For example, in Pesan yang dikirimkan presiden “Messages sent by the president”, presiden “president” is the agent of the predicate dikirimkan “be sent”.
- Verbs can have oblique arguments (obl). Temporal modifiers are labeled obl:tmod.
Non-verbal Clauses
- The copula ialah or adalah “be” is optionally used in equational, attributional, locative, possessive and benefactive nonverbal clauses. The two forms are interchangeable, but adalah is more common. For example, “This is my house.” can be written in Indonesian as:
- Ini rumahku., without copula
- Ini adalah rumahku., with copula adalah
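As a rough illustration, the copular sentence Ini adalah rumahku. could be encoded in CoNLL-U as sketched below. The analysis is an assumption based on the general UD convention that the nominal predicate is the root and the copula attaches to it as cop; it is not copied from a treebank. The helper `to_conllu` and the `TOKENS`/`MWT` structures are hypothetical, and the lemma, XPOS, and FEATS columns are left empty.

```python
# (id, form, upos, head, deprel); an assumed analysis, not treebank data.
TOKENS = [
    (1, "Ini", "PRON", 3, "nsubj"),
    (2, "adalah", "AUX", 3, "cop"),
    (3, "rumah", "NOUN", 0, "root"),
    (4, "ku", "PRON", 3, "nmod:poss"),
    (5, ".", "PUNCT", 3, "punct"),
]
# Multiword tokens: first word id -> (last word id, surface form).
MWT = {3: (4, "rumahku")}


def to_conllu(tokens, mwt):
    """Render tokens as 10-column CoNLL-U lines, with multiword token ranges."""
    lines = []
    for tid, form, upos, head, deprel in tokens:
        if tid in mwt:
            last, surface = mwt[tid]
            # A range line carries only ID and FORM; other columns are "_".
            lines.append("\t".join([f"{tid}-{last}", surface] + ["_"] * 8))
        lines.append("\t".join(
            [str(tid), form, "_", upos, "_", "_", str(head), deprel, "_", "_"]
        ))
    return "\n".join(lines)


print(to_conllu(TOKENS, MWT))
```

Dropping the second entry of `TOKENS` (and renumbering) would give the copula-less variant Ini rumahku., in which rumah remains the root.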
Relations Overview
- Among the 37 universal dependency relations in UD v2, 33 deprels are represented in the Indonesian-PUD and Indonesian-CSUI treebanks.
- The following universal deprels are not represented in either Indonesian-PUD or Indonesian-CSUI:
dep
expl
list
reparandum
- We provide additional documentation with examples in Indonesian for some of the universal deprels:
- The following 14 relation subtypes could be used in Indonesian UD treebank:
- acl:relcl for relative clauses that modify a noun phrase.
- advmod:emph for particles (PART) -lah, -kah, -tah, and pun that emphasize other words.
- case:adv for adposition (ADP) that is not a nominal dependent.
- cc:preconj for word baik in clause baik A maupun B “both A and B”.
- compound:a for adjective compounds
- csubj:pass for clausal subjects of passive verbs.
- flat:foreign to label sequences of foreign words.
- flat:name to label sequences of names of PROPN-PROPN pairs.
- nmod:lmod for locative nouns.
- nmod:poss for possessive relationship.
- nmod:tmod for temporal modifier of a noun phrase.
- nsubj:pass for nominal subjects of passive verbs.
- obl:agent for agents of passive verbs.
- obl:tmod for temporal modifier for a VERB/ADJ.
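The relation inventory above can be written down as sets, which also confirms the count of 33 attested universal relations. This is only a sketch: the names `UNIVERSAL`, `UNATTESTED`, `SUBTYPES`, and `is_expected` are hypothetical, and the check reflects the lists in this section rather than any official validation tool.

```python
# The 37 universal relations of UD v2.
UNIVERSAL = {
    "acl", "advcl", "advmod", "amod", "appos", "aux", "case", "cc", "ccomp",
    "clf", "compound", "conj", "cop", "csubj", "dep", "det", "discourse",
    "dislocated", "expl", "fixed", "flat", "goeswith", "iobj", "list", "mark",
    "nmod", "nsubj", "nummod", "obj", "obl", "orphan", "parataxis", "punct",
    "reparandum", "root", "vocative", "xcomp",
}
# Universal relations not represented in Indonesian-PUD or Indonesian-CSUI.
UNATTESTED = {"dep", "expl", "list", "reparandum"}
# The 14 relation subtypes listed above.
SUBTYPES = {
    "acl:relcl", "advmod:emph", "case:adv", "cc:preconj", "compound:a",
    "csubj:pass", "flat:foreign", "flat:name", "nmod:lmod", "nmod:poss",
    "nmod:tmod", "nsubj:pass", "obl:agent", "obl:tmod",
}


def is_expected(label: str) -> bool:
    """True if a deprel is one of the relations or subtypes described here."""
    if ":" in label:
        return label in SUBTYPES
    return label in UNIVERSAL - UNATTESTED


print(len(UNIVERSAL - UNATTESTED))  # number of attested universal relations
```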
Remark
This Indonesian documentation applies only to the Indonesian-PUD and Indonesian-CSUI treebanks. The Indonesian-GSD treebank does not conform to these guidelines.
Treebanks
There are 3 Indonesian UD treebanks: