UD for Javanese
- Javanese has several language levels, such as Krama, Krama Inggil, and Ngoko. For certain words, we indicate the level: Kr. for Krama, KI for Krama Inggil, and Ng. for Ngoko.
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. Multiword tokens and punctuation receive special treatment.
- Special treatments of multiword tokens:
  - Multiword tokens that contain clitics are split into two tokens. Examples of clitics in Javanese:
    - Examples of enclitics:
      - -ku (Ng.) “me/my”, as in bojoku “my wife”
      - -mu (Ng.) “you/your”, as in omahmu “your house”
      - -é (Ng.) “he/him/she/her/it”, as in gawéané “his/her work”
      - -ipun (Kr.) “he/him/she/her/it”, as in ramanipun “his/her father”
    - Examples of proclitics:
      - tak-/dak- “I”, as in takbukak “I open”, dakopenane “I take care”
      - kok-, mbok- “you”, as in kokjupuk “you take”, mbokpangan “you eat”
      - ma-, me- “to”, as in mangulon “to the west”, mengetan “to the east”
- Special treatment of punctuation. All punctuation symbols are separated from words, except in two cases:
  - Hyphens in reduplicated words. Javanese has many reduplicated words, including nouns, verbs, determiners, and adverbs. These reduplicated words are not split and remain one token. Examples:
    - Plural noun: bangsa-bangsa “nations”
    - Verb: mlaku-mlaku (Ng.) “traveling”
    - Determiner: pinten-pinten (Kr.), pira-pira (Ng.) “several”
    - Adverb: akèh-akèhan (Ng.) “as much as possible”
  - Abbreviations. All abbreviations, such as Dr., Tn. “Mr.”, and Ny. “Mrs.”, are not split and remain one token.
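The tokenization rules above can be sketched in Python. This is a minimal illustration, not the treebank's actual tokenizer: the names `tokenize`, `split_clitics`, and `CLITIC_SPLITS`, the regular expression, and the surface forms of the split words are all assumptions made for this example. The clitic lexicon holds only a few forms from the examples above, because stripping affixes blindly would also split words that merely begin or end with the same letters. The sketch keeps every hyphenated word intact, which is broader than the reduplication rule.

```python
import re
import unicodedata

# Abbreviations named in the guidelines; each stays one token with its period.
ABBREVIATIONS = ("Dr.", "Tn.", "Ny.")

# Hypothetical mini-lexicon: surface token -> syntactic words.
# How the treebank spells the split words is an assumption of this sketch.
CLITIC_SPLITS = {
    "bojoku": ("bojo", "ku"),      # enclitic -ku
    "omahmu": ("omah", "mu"),      # enclitic -mu
    "takbukak": ("tak", "bukak"),  # proclitic tak-
    "kokjupuk": ("kok", "jupuk"),  # proclitic kok-
}

TOKEN_RE = re.compile(
    "|".join(re.escape(a) for a in ABBREVIATIONS)  # known abbreviations first
    + r"|\w+(?:-\w+)*"                             # words, internal hyphens kept
    + r"|[^\w\s]"                                  # any other symbol on its own
)


def tokenize(text):
    """Split text into surface tokens."""
    # \w does not match combining marks, so compose accented letters first.
    text = unicodedata.normalize("NFC", text)
    return TOKEN_RE.findall(text)


def split_clitics(token):
    """Return the syntactic words of a token (one word unless listed above)."""
    return CLITIC_SPLITS.get(token.lower(), (token,))


# A made-up string of example words, not a grammatical sentence.
for tok in tokenize("Dr. bojoku mlaku-mlaku, takbukak."):
    print(tok, split_clitics(tok))
```

A lexicon lookup is used here instead of affix rules only to keep the sketch safe; a real pipeline would need a morphological analyzer or a much larger word list.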
Morphology
Tags
- The UD Javanese treebank uses all UPOS tags.
- The auxiliary (AUX). We define these Javanese words as AUX:
  - yaiku (Ng.) or inggih punika (Kr.) “be” serve as copulas. For inggih punika, only inggih is tagged as AUX; punika is attached as a child of inggih with the deprel fixed.
  - Tense-related AUX:
    - bakal (Ng.), bade (Kr.) “will/would” for the future tense.
    - lagi (Ng.), saweg (Kr.) “be” for the present tense.
    - wis (Ng.), sampun (Kr.) “have/has/had” for the simple/past perfect tense.
  - Modal-related AUX:
    - kudu (Ng.), mesti (Kr.) as the equivalents of modal “must”.
    - sekuduné (Ng.), semestiné (Kr.) as the equivalents of modal “shall/should”.
    - bisa (Ng.), saged (Kr.) as the equivalents of modal “can/could”.
- PART is used for:
  - negation words: ora (Ng.) or boten (Kr.): “no/not”
  - particles like ta, ya that are used to emphasize something
- Javanese has the following coordinating conjunction words (CCONJ):
  - lan/karo (Ng.), kaliyan (Kr.): as the equivalents of “and” in English
  - atawa (Ng.), utawi (Kr.): as the equivalents of “or” in English
  - nanging (Ng.), namung (Kr.): as the equivalents of “but” in English
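The closed-class word lists above lend themselves to a simple lookup. The following Python sketch is only an illustration: the function name `suggest_upos` and the set names are assumptions, and the lookup ignores context, so a listed word may still need a different tag in other uses.

```python
import unicodedata

# Closed-class words listed in the guidelines, keyed by lowercase form.
AUX_WORDS = {
    "yaiku", "inggih",                                  # copulas (inggih heads "inggih punika")
    "bakal", "bade", "lagi", "saweg", "wis", "sampun",  # tense-related
    "kudu", "mesti", "sekuduné", "semestiné",           # modal-related
    "bisa", "saged",
}
PART_WORDS = {"ora", "boten", "ta", "ya"}
CCONJ_WORDS = {"lan", "karo", "kaliyan", "atawa", "utawi", "nanging", "namung"}


def suggest_upos(form):
    """Return a candidate UPOS tag for a listed word, or None if it is not listed."""
    word = unicodedata.normalize("NFC", form).lower()
    if word in AUX_WORDS:
        return "AUX"
    if word in PART_WORDS:
        return "PART"
    if word in CCONJ_WORDS:
        return "CCONJ"
    return None


print(suggest_upos("Wis"), suggest_upos("boten"), suggest_upos("utawi"))
```

Such a table is best used to pre-annotate or to flag disagreements for review, not to assign tags automatically.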
Features
- We propose the use of 13 of 24 features defined in UD v2 that are relevant to Javanese grammar:
  - Abbr, with one possible value: Yes. This feature can be applied to all UPOS categories, except PUNCT and SYM.
  - Definite, applies to DET with two possible values: Def and Ind.
  - Foreign, with one possible value: Yes. This feature only applies to X.
  - Mood, applies to VERB with two possible values: Imp and Ind.
    - Mood=Imp for imperative clauses
    - Mood=Ind for declarative clauses
  - Number, applies to DET, NOUN, and PRON, with two possible values: Sing or Plur.
  - NumType, applies to NUM and ADJ, with two possible values: Card or Ord.
    - NumType=Card is used for NUM.
    - NumType=Ord is used for ordinal numbers tagged as ADJ.
  - Person, applies to PRON with three possible values: 1, 2, 3.
  - Polarity, with one possible value: Neg, applies to PART and INTJ.
  - Polite with four possible values: Infm, Form, Elev, Humb
    - Polite=Infm for words of Ngoko language
    - Polite=Form for words of Krama language
    - Polite=Elev for words of Krama Inggil
    - Polite=Humb for words of Krama Andhap
  - PronType with eight possible values: Art, Dem, Emp, Ind, Int, Prs, Rel, and Tot
    - PronType=Art is used for DET along with Definite feature.
    - PronType=Dem is used for DET or PRON, such as for words iki (Ng.), punika (Kr.) “this”
    - PronType=Emp is used for DET for word piyambak “self”
    - PronType=Ind is used for DET such as for akèh (Ng.) “many”
    - PronType=Int is used for ADV or PRON, such as sapa (Ng.), sinten (Kr.) “who”, apa (Ng.) “what”
    - PronType=Prs is used for PRON along with Person feature.
    - PronType=Rel is used for PRON such as kang, sing (Ng.), ingkang (Kr.) “which/that”, sapa (Ng.), sinten (Kr.) “who”, apa (Ng.) “what”
    - PronType=Tot is used for PRON or DET such as sedaya (Kr.) “all”
  - Reflex with one possible value: Yes. For PRON, such as for word dhèknè “self”.
  - Typo, with one possible value, Yes. This feature can be applied to all UPOS categories except PUNCT and SYM.
  - Voice, applies to VERB with two possible values: Act and Pass.
    - Voice=Act for active verbs
    - Voice=Pass for passive verbs
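The feature inventory above can be encoded as a table and used to check the FEATS column of an annotated token. The Python sketch below is a rough illustration under stated assumptions: the names `FEATURES` and `check_feats` are made up for this example, Polite is allowed on every UPOS because the list gives no restriction for it, and PronType is checked only against the union of the tags mentioned for its values (DET, PRON, ADV), not per value.

```python
# The 17 universal POS tags of UD v2.
ALL_UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}
NOT_PUNCT_SYM = ALL_UPOS - {"PUNCT", "SYM"}

# feature name -> (allowed values, UPOS tags the feature may appear on)
FEATURES = {
    "Abbr": ({"Yes"}, NOT_PUNCT_SYM),
    "Definite": ({"Def", "Ind"}, {"DET"}),
    "Foreign": ({"Yes"}, {"X"}),
    "Mood": ({"Imp", "Ind"}, {"VERB"}),
    "Number": ({"Sing", "Plur"}, {"DET", "NOUN", "PRON"}),
    "NumType": ({"Card", "Ord"}, {"NUM", "ADJ"}),
    "Person": ({"1", "2", "3"}, {"PRON"}),
    "Polarity": ({"Neg"}, {"PART", "INTJ"}),
    "Polite": ({"Infm", "Form", "Elev", "Humb"}, ALL_UPOS),  # assumption: unrestricted
    "PronType": ({"Art", "Dem", "Emp", "Ind", "Int", "Prs", "Rel", "Tot"},
                 {"DET", "PRON", "ADV"}),                    # assumption: union of tags
    "Reflex": ({"Yes"}, {"PRON"}),
    "Typo": ({"Yes"}, NOT_PUNCT_SYM),
    "Voice": ({"Act", "Pass"}, {"VERB"}),
}


def check_feats(upos, feats):
    """Return a list of problems for a CoNLL-U FEATS string such as 'Mood=Ind|Voice=Act'."""
    problems = []
    if feats == "_":  # CoNLL-U placeholder for "no features"
        return problems
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        if name not in FEATURES:
            problems.append(f"{name}: not in the Javanese feature set")
            continue
        values, tags = FEATURES[name]
        for v in value.split(","):  # CoNLL-U allows comma-separated values
            if v not in values:
                problems.append(f"{name}={v}: value not listed")
        if upos not in tags:
            problems.append(f"{name}: not expected on {upos}")
    return problems


print(check_feats("VERB", "Mood=Ind|Polite=Infm|Voice=Act"))
print(check_feats("NOUN", "Voice=Act"))
```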
Syntax
- The default word order is SVO: the subject (nsubj) normally precedes the verb and the object follows it, except in inverted sentences.
- A verb may serve as the subject; it is then labeled as a clausal subject, either csubj or csubj:pass.
- Transitive verbs take a noun phrase as the object (obj).
- Passive verbs may be followed by an agent (obl:agent).
- Verbs can have oblique arguments (obl). Temporal modifiers are labeled obl:tmod.
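The default word order described above can be expressed as a small check over a dependency tree. The Python sketch below is an illustration only: the `Token` type and `order_violations` function are hypothetical, the forms are placeholders instead of Javanese words, and subtypes such as nsubj:pass are treated like their base relation, which is an assumption. Since inverted sentences are legitimate, a hit marks a sentence for review; it is not an annotation error.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int      # 1-based position in the sentence
    form: str    # surface form (placeholders in this example)
    head: int    # id of the head token, 0 for the root
    deprel: str  # dependency relation, possibly with a subtype


def order_violations(tokens):
    """Return ids of subjects that follow, or objects that precede, their head."""
    flagged = []
    for tok in tokens:
        base = tok.deprel.split(":")[0]  # assumption: ignore the subtype
        if base == "nsubj" and tok.id > tok.head:
            flagged.append(tok.id)
        elif base == "obj" and tok.id < tok.head:
            flagged.append(tok.id)
    return flagged


default_order = [
    Token(1, "SUBJECT", 2, "nsubj"),
    Token(2, "VERB", 0, "root"),
    Token(3, "OBJECT", 2, "obj"),
]
inverted = [
    Token(1, "VERB", 0, "root"),
    Token(2, "SUBJECT", 1, "nsubj"),
]
print(order_violations(default_order), order_violations(inverted))
```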
Non-verbal Clauses
- The copula yaiku (Ng.) or inggih punika (Kr.) “be” is optionally used in equational, attributional, locative, possessive, and benefactive nonverbal clauses.
Relations Overview
- The following 12 relation subtypes can be used in the UD Javanese treebank:
  - acl:relcl for relative clauses that modify a noun phrase.
  - advmod:emph for particles (PART) that emphasize other words.
  - case:adv for case markers on an ADJ or VERB that functions as an adverb.
  - csubj:pass for clausal subjects of passive verbs.
  - flat:foreign to label sequences of foreign words.
  - flat:name to label names made up of PROPN-PROPN sequences.
  - nmod:poss for possessive relationships.
  - nmod:lmod for locative modifiers of a noun phrase.
  - nmod:tmod for temporal modifiers of a noun phrase.
  - nsubj:pass for nominal subjects of passive verbs.
  - obl:agent for agents of passive verbs.
  - obl:tmod for temporal modifiers of a VERB or ADJ.
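The subtype list above can serve as a whitelist when checking the DEPREL column. The following Python sketch is a minimal illustration; the names `JV_SUBTYPES` and `is_known_deprel` are assumptions, and plain relations without a subtype are passed through without being validated against the universal inventory.

```python
# The 12 relation subtypes listed for the UD Javanese treebank.
JV_SUBTYPES = {
    "acl:relcl", "advmod:emph", "case:adv", "csubj:pass",
    "flat:foreign", "flat:name", "nmod:poss", "nmod:lmod",
    "nmod:tmod", "nsubj:pass", "obl:agent", "obl:tmod",
}


def is_known_deprel(deprel):
    """Accept any plain relation, and subtyped relations only if listed above."""
    if ":" not in deprel:
        return True  # not checked against the universal relation set here
    return deprel in JV_SUBTYPES


for rel in ("nsubj", "obl:tmod", "obl:arg"):
    print(rel, is_known_deprel(rel))
```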
Treebanks
There is 1 Javanese UD treebank: