UD for Indonesian 
Tokenization and Word Segmentation
In general, words are delimited by whitespace characters. Multiword tokens and punctuation receive special treatment.
- Special treatment of multiword tokens:
- Multiword tokens that end with the particles -lah/-kah/-tah/-pun are split into two tokens. These particles usually emphasize the word before them. The particles -lah/-kah/-tah are always clitics, while the particle pun can be written either as a clitic or as a separate token. These clitic particles are tokenized as follows:
- bacalah is split into baca “read” and lah
- diakah is split into dia “he/she” and kah
- apatah is split into apa “what” and tah
- walaupun is split into walau “although” and pun
- The particle kah marks yes-no questions, and its position may emphasize the preceding word as the focus of the question. The word apa “what”, when placed at the beginning of a sentence, also functions as a question particle and may optionally be strengthened by kah, resulting in apakah, as in Apa(kah) dia guru? “Is she a teacher?” However, apa is also used as an interrogative pronoun, even sentence-initially, as in Apa pendapatmu? “What do you think?” Finally, kah can also be added to interrogative words (apa “what”, siapa “who”, di mana “where”, kapan “when”, bagaimana “how”) in open questions; adding -kah makes the tone more polite.
- Multiword tokens that contain the clitics -ku “me/my”, -mu “you/your”, or -nya “he/him/she/her/it” are split into two tokens, with exceptions for certain words ending in -nya.
- Words ending in -nya where -nya itself serves as a pronoun or determiner are split into two tokens. For example:
- -nya as a pronoun: mencintainya “love him/her/it” is split into mencintai “love” and nya “him/her/it”.
- -nya as a possessive pronoun: bukunya “his/her/its book” is split into buku “book” and nya “his/her/its”.
- -nya as a determiner in predicate nominalisation: meningkatnya “the increase” is split into meningkat “increase” and nya “the”.
- Words ending in -nya that function as adverbs, adjectives, or auxiliaries are not split. For example:
- adverbs ending in -nya: khususnya “especially”, awalnya “initially”, akhirnya “finally”
- adjectives ending in -nya: sebelumnya “previous”, sesudahnya “next”, berikutnya “next”
- auxiliaries ending in -nya: seharusnya/sebaiknya “shall/should”
- Special treatment of punctuation. All punctuation symbols are separated from words, except in two cases:
- Hyphens in reduplicated words. Indonesian has many reduplicated words, including nouns (both singular and plural), verbs, adjectives, and adverbs. These reduplicated words are not split and remain one token. Examples of reduplicated words:
- Singular noun: mata-mata “spy”
- Plural noun: anak-anak “children”, from anak “child”
- Verb: merobek-robek “shredding”
- Adjective: hiruk-pikuk “noisy”
- Adverb: terus-menerus “continuously”
- Periods in abbreviations. Abbreviations such as Mr., M.Sc., and Tn. are not split and remain one token.
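The segmentation rules above can be sketched in a few lines of Python. This is a minimal illustration, not the tokenizer used to build any of the treebanks. The names `HOST_WORDS`, `NYA_EXCEPTIONS`, `ABBREVIATIONS`, `split_clitic`, and `tokenize` are hypothetical, the host-word lexicon is a toy stand-in for a real dictionary, and only trailing punctuation is handled. A lexicon check is assumed because plain suffix stripping would wrongly split words such as sekolah “school”.

```python
import re

# Hypothetical mini-lexicon of host words; a real tokenizer needs a full dictionary.
HOST_WORDS = {"baca", "dia", "apa", "walau", "buku", "mencintai", "meningkat", "rumah"}

# Words ending in -nya that stay unsplit (adverbs, adjectives, auxiliaries).
NYA_EXCEPTIONS = {
    "khususnya", "awalnya", "akhirnya",
    "sebelumnya", "sesudahnya", "berikutnya",
    "seharusnya", "sebaiknya",
}

# Abbreviations keep their periods and remain one token.
ABBREVIATIONS = {"Mr.", "M.Sc.", "Tn."}

CLITICS = ("lah", "kah", "tah", "pun", "ku", "mu", "nya")


def split_clitic(word):
    """Split one trailing clitic off `word` if the remainder is a known host."""
    lower = word.lower()
    if "-" in word or lower in NYA_EXCEPTIONS:
        return [word]  # reduplications and lexicalized -nya words stay whole
    for clitic in CLITICS:
        if lower.endswith(clitic) and lower[: -len(clitic)] in HOST_WORDS:
            return [word[: -len(clitic)], word[-len(clitic):]]
    return [word]


def tokenize(text):
    """Whitespace tokenization plus clitic splitting and punctuation separation."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        # Separate trailing punctuation from the word it is attached to.
        word, punct = re.match(r"^(.*?)([.,!?;:]*)$", chunk).groups()
        if word:
            tokens.extend(split_clitic(word))
        tokens.extend(punct)  # each punctuation mark becomes its own token
    return tokens


print(tokenize("Bacalah bukunya."))
```

In this sketch, any hyphenated word is left whole, so a reduplicated word that also carries a clitic is not split; the guidelines above do not cover that case.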
Morphology
Tags
- We refer to KBBI (Kamus Besar Bahasa Indonesia/Indonesian Great Dictionary) as the reference dictionary. However, since this dictionary only defines 7 word classes: noun, verb, adjective, adverb, pronoun, particle and number, we need to make adjustments so that the tags conform to UD v2.
- Indonesian UD treebanks use all 17 universal POS categories.
- PART is used for:
- negation words: tidak/tak/bukan “no/not”, belum “not yet”, jangan “don’t + VERB”
- particles of -lah, -kah, -tah, pun, that have been discussed in the previous section.
- The auxiliary (AUX) vs. VERB distinction is based on examples from the English treebanks, since KBBI has no AUX word class. We defined 14 Indonesian words as AUX as follows:
- adalah and ialah “be” serve as copulas.
- Tense-related AUX:
- akan “will/would” for the future tense.
- sedang/tengah “be” for the present tense.
- telah/sudah “have/has/had” for the past tense.
- Modal-related AUX:
- harus/mesti/wajib as the equivalents of modal “must”.
- sebaiknya/seharusnya as the equivalents of modal “shall/should”.
- bisa/dapat/sanggup/mampu as the equivalents of modal “can/could”.
- boleh as the equivalent of modal “may”.
- mungkin as the equivalent of modal “might”.
- The pronoun (PRON) vs. determiner (DET) distinction is also based on examples from the English treebanks, since KBBI does not define a DET word class either.
- The following word types are tagged as PRON:
- personal pronouns, such as saya/aku/ku “I”, kamu/mu/anda “you”, dia/ia/nya “he/she/it/him/her/its”, kami/kita “we/us/our”, mereka “they/them/their”
- interrogative pronouns: apa “what”, siapa “who” as in Apa yang kamu inginkan? “What do you want?”
- relative pronouns: apa “what”, siapa “who” as in Saya tahu siapa yang kamu maksud. “I know who you mean.”
- indefinite pronouns: seseorang “someone/somebody”, sesuatu “something”
- total pronouns: semua “all” as in Semua kecuali bukumu “All except your books”.
- demonstrative pronouns: ini “this” as in Ini bukan salahmu. “This is not your fault”.
- The following word types are tagged as DET:
- demonstrative determiners: ini “this” as in Kota ini sangat indah “This city is beautiful”
- pronominal numerals: beberapa, berbagai, para “some/many”, semua “all” as in semua siswa “all students”
- Indonesian has the following coordinating conjunction words (CCONJ):
- dan, serta, maupun as the equivalents of “and” in English
- atau “or”
- tapi, tetapi, namun, melainkan as the equivalents of “but” in English
- Clauses can be nominalized by attaching the clitic -nya to the predicate. In the annotation, the clitic is analyzed as a separate syntactic word, functioning as a determiner. However, the predicate keeps the VERB tag, so there may be a verb with a determiner attached to it.
- meningkat “to increase” is a verb
- meningkatnya “the increase” is a nominalized form; however, since -nya is also used with regular nouns and functions like a definite article, meningkatnya is treated as a multi-word token meningkat+nya, where nya is attached as a det to the verb meningkat
- Since meningkat stays tagged as a verb, it will attach to its parent as a clause rather than a nominal. So if it is a subject of another clause, it will be csubj rather than nsubj.
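To make the nominalization analysis concrete, the sketch below renders meningkatnya as a CoNLL-U multiword token: one range line for the surface form, followed by one line per syntactic word. The helper `mwt_rows` is hypothetical and written only for illustration. The head and relation of meningkat are left as `_` because they depend on the rest of the sentence, and lemmas and features are omitted.

```python
def mwt_rows(first_id, surface, parts):
    """Return CoNLL-U rows for one multiword token.

    `parts` is a list of (form, upos, head, deprel) tuples, one per
    syntactic word; unknown values are passed as "_".
    """
    last_id = first_id + len(parts) - 1
    # Range line: ID and FORM are filled, the other eight columns are "_".
    rows = ["\t".join([f"{first_id}-{last_id}", surface] + ["_"] * 8)]
    for offset, (form, upos, head, deprel) in enumerate(parts):
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [str(first_id + offset), form, "_", upos, "_", "_",
                str(head), deprel, "_", "_"]
        rows.append("\t".join(cols))
    return rows


rows = mwt_rows(
    1,
    "meningkatnya",
    [
        ("meningkat", "VERB", "_", "_"),  # keeps the VERB tag; head depends on context
        ("nya", "DET", 1, "det"),         # determiner attached to the verb
    ],
)
print("\n".join(rows))
```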
Features
- We propose the use of 15 of 24 features defined in UD v2 that are relevant to Indonesian grammar:
- Abbr, with one possible value: Yes. This feature can be applied to all UPOS categories except PUNCT and SYM.
- Clusivity, applies to PRON, with two possible values: Ex and In.
- Clusivity=Ex for kami “we/our”
- Clusivity=In for kita “we/our”
- Degree, applies to ADJ, with one possible value: Sup.
- Degree=Sup for superlative adjectives, such as terbaik “the best”, tercantik “the most beautiful”.
- Foreign, with one possible value: Yes. This feature only applies to X.
- Mood, applies to VERB, with two possible values: Ind and Imp.
- Mood=Ind for verbs in declarative sentences.
- Mood=Imp for verbs in imperative sentences.
- Number, applies to DET, NOUN, and PRON, with two possible values: Sing or Plur.
- Number=Sing is used for singular nouns, determiners, or pronouns.
- Number=Plur is used for plural nouns, determiners, or pronouns.
- NumType, applies to NUM and ADJ, with two possible values: Card or Ord.
- NumType=Card is used for cardinal numbers tagged as NUM.
- NumType=Ord is used for ordinal numbers tagged as ADJ.
- Person, applies to PRON, with three possible values: 1, 2, 3.
- Polarity, with one possible value: Neg, applies to PART and INTJ.
- Polite, applies to PRON, with two possible values: Form and Infm.
- Polite=Form, applies to PRON, such as saya “I”, anda “you”, and beliau “him/her”.
- Polite=Infm, applies to PRON, such as aku “I”, kamu “you” (singular), and kalian “you” (plural).
- PronType, applies to PRON, DET, and ADV. For Indonesian, eight values are used:
- PronType=Art, applies to DET, such as sebuah, seorang, and -nya.
- PronType=Dem, applies to ADV, DET, and PRON, such as itu “that” in Itu masalahmu. “That is your problem.”
- PronType=Emp, applies to DET, such as sendiri “self” in Kamu harus percaya pada dirimu sendiri “You have to believe in yourself”.
- PronType=Ind, applies to ADV, DET, and PRON, such as seseorang “someone/somebody” or sesuatu “something”.
- PronType=Int, applies to PRON and ADV.
- PronType=Int for PRON, such as apa “what” and siapa “who” in interrogative sentences.
- PronType=Int for ADV, such as bagaimana “how” and kapan “when” in interrogative sentences.
- PronType=Prs, applies to PRON for all personal pronouns.
- PronType=Rel, applies to PRON and ADV.
- PronType=Rel for PRON, such as apa “what”, siapa “who”, yang “that”.
- PronType=Rel for ADV, such as di mana “where”, bagaimana “how”, and kapan/saat/ketika “when” in non-interrogative sentences.
- PronType=Tot, applies to ADV, DET, and PRON.
- PronType=Tot for PRON, such as semua “all” in Semua adalah milikmu. “All is yours.”
- PronType=Tot for DET, such as semua “all” in Semua siswa terlihat senang. “All students look happy.”
- PronType=Tot for ADV, such as selalu “always” in Dia selalu terlambat. “She is always late.”
- Reflex, applies to PRON, with one possible value: Yes. Only one word qualifies for this feature: diri “self”.
- Typo, with one possible value: Yes. This feature can be applied to all UPOS categories except PUNCT and SYM.
- Voice, applies to VERB, with two possible values: Act and Pass. Voice alternation is treated as inflection, and the active and passive counterparts have the same lemma.
- Voice=Act for active verbs, which are characterized by a bare base word or by the prefixes me- or ber-:
- Active verbs without affix: duduk “sit”, pergi “go”
- Active verbs with prefix me-: memperbaiki “fix”, mengakui “admit”
- Active verbs with prefix ber-: belajar “study”, bekerja “work”
- Voice=Pass for passive verbs, which are characterized by the prefixes di- or ter- or by the circumfix ke-an:
- Passive verbs with prefix di-: dipublikasikan “be published”, dilepaskan “be released”
- Passive verbs with prefix ter-: terbakar “on fire”, terjatuh “fell”, terkejut “shocked”
- Passive verbs with circumfix ke-an: ketinggalan “lag behind”, kecurian “be stolen”
- We consider the following 9 UD v2 features not relevant to Indonesian grammar:
- Gender. Indonesian words have no gender.
- Animacy. As with Gender, Indonesian has no agreement requirements between words.
- NounClass, for the same reason as Gender and Animacy.
- Case, for the same reason as Gender, Animacy, and NounClass.
- Tense. Indonesian verbs have the same form in all tenses.
- Aspect, for the same reason as Tense.
- Evident
- Poss
- VerbForm
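As a rough illustration of how the Voice criteria and the feature inventory fit together, the sketch below guesses Voice from the affixes listed above and serializes a feature set into the CoNLL-U FEATS column (alphabetically sorted `Name=Value` pairs joined by `|`). The functions `guess_voice` and `format_feats` are hypothetical. The affix test is a naive assumption made for illustration: it would mislabel any non-passive word that merely begins with di or ter, such as diam “be silent”, so it should not be read as the annotation procedure used in the treebanks.

```python
def guess_voice(verb):
    """Naive affix-based guess of the Voice feature for a word tagged VERB."""
    v = verb.lower()
    if v.startswith(("di", "ter")):
        return "Pass"  # prefixes di- and ter-
    if v.startswith("ke") and v.endswith("an"):
        return "Pass"  # circumfix ke-an
    return "Act"       # bare base word, me-, ber-, and everything else


def format_feats(feats):
    """Serialize a dict of features into a CoNLL-U FEATS string."""
    if not feats:
        return "_"
    pairs = sorted(feats.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={value}" for name, value in pairs)


for verb in ["duduk", "memperbaiki", "belajar", "dipublikasikan", "terbakar", "kecurian"]:
    print(verb, format_feats({"Voice": guess_voice(verb), "Mood": "Ind"}))
```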
Syntax
Core Arguments, Oblique Arguments and Adjuncts
- The default word order is SVO, so the subject (nsubj) normally precedes and the object follows the verb (with the exception of inverted sentences).
- A verb may serve as the subject and is labeled as clausal subject, either as csubj or csubj:pass.
- Transitive verbs will have a noun phrase as the object (obj).
- Passive verbs can be followed by an agent (obl:agent). For example, in Pesan yang dikirimkan presiden “Messages sent by the president”, presiden “president” is the agent of the predicate dikirimkan “be sent”.
- Verbs can have oblique arguments (obl). Temporal modifiers are labeled with the subtype obl:tmod.
Non-verbal Clauses
- The copula ialah or adalah “be” is optionally used in equational, attributional, locative, possessive, and benefactive nonverbal clauses. The two forms are interchangeable, but adalah is more common. For example, “This is my house.” can be written in Indonesian as:
- Ini rumahku., without copula
- Ini adalah rumahku., with copula adalah
Relations Overview
- Among the 37 universal dependency relations in UD v2:
- 31 deprels are represented in the Indonesian-CSUI (except: compound, expl, goeswith, list, reparandum, and vocative)
- 33 deprels are represented in the Indonesian-PUD (except: dep, expl, list, and reparandum)
- 34 deprels are represented in the Indonesian-GSD (except: dislocated, expl, and reparandum)
- We provide additional documentation with examples in Indonesian for some of the universal deprels:
- The following 14 relation subtypes could be used in Indonesian UD treebank:
- acl:relcl for relative clauses that modify a noun phrase.
- advmod:emph for the particles (PART) -lah, -kah, -tah, and pun that emphasize other words.
- case:adv for adposition (ADP) that is not a nominal dependent.
- cc:preconj for word baik in clause baik A maupun B “both A and B”.
- compound:a for adjective compounds
- csubj:pass for clausal subjects of passive verbs.
- flat:foreign to label sequences of foreign words.
- flat:name to label sequences of names of PROPN-PROPN pairs.
- nmod:lmod for locative nouns.
- nmod:poss for possessive relationship.
- nmod:tmod for temporal modifier of a noun phrase.
- nsubj:pass for nominal subjects of passive verbs.
- obl:agent for agents of passive verbs.
- obl:tmod for temporal modifier for a VERB/ADJ.
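The coverage figures and the subtype inventory above can be cross-checked with a short script. This is a sketch only: the names `UNIVERSAL`, `NOT_ATTESTED`, `SUBTYPES`, `coverage`, and `is_known_deprel` are introduced here for illustration, and the list of 37 universal relations is taken from the UD v2 relation inventory rather than from this page.

```python
# The 37 universal dependency relations of UD v2.
UNIVERSAL = set("""
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj
dep det discourse dislocated expl fixed flat goeswith iobj list mark nmod
nsubj nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split())

# Relations listed above as not represented in each treebank.
NOT_ATTESTED = {
    "Indonesian-CSUI": {"compound", "expl", "goeswith", "list", "reparandum", "vocative"},
    "Indonesian-PUD": {"dep", "expl", "list", "reparandum"},
    "Indonesian-GSD": {"dislocated", "expl", "reparandum"},
}

# The 14 relation subtypes listed above.
SUBTYPES = [
    "acl:relcl", "advmod:emph", "case:adv", "cc:preconj", "compound:a",
    "csubj:pass", "flat:foreign", "flat:name", "nmod:lmod", "nmod:poss",
    "nmod:tmod", "nsubj:pass", "obl:agent", "obl:tmod",
]


def coverage(treebank):
    """Number of universal relations represented in the given treebank."""
    return len(UNIVERSAL - NOT_ATTESTED[treebank])


def is_known_deprel(label):
    """True if `label` is a universal relation or one of the listed subtypes."""
    return label in UNIVERSAL or label in SUBTYPES


for name in NOT_ATTESTED:
    print(name, coverage(name))
```

Every subtype extends a universal relation, so splitting a label on `:` and looking up the first part in `UNIVERSAL` is a quick sanity check for new annotations.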
Treebanks
There are 3 Indonesian UD treebanks: