UD for Bhojpuri
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. Description of exceptions follows.
- According to typographical rules, some punctuation marks (e.g., comma) are attached to a neighboring word, while others (e.g., the sentence-terminating danda) are not. We tokenize punctuation as separate tokens (words).
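The two rules above can be illustrated with a minimal sketch, assuming that "punctuation" means any character in a Unicode `P*` category (which covers both the comma and the danda `।`) and that only punctuation at the edges of a whitespace-delimited chunk is split off. The function names `is_punct` and `tokenize` are hypothetical, and the exceptions mentioned above are not modelled.

```python
import unicodedata


def is_punct(ch: str) -> bool:
    # Assumption: any Unicode punctuation category (Po, Ps, Pe, ...) counts.
    # Devanagari vowel signs are marks (Mn/Mc), so they stay inside the word.
    return unicodedata.category(ch).startswith("P")


def tokenize(text: str) -> list[str]:
    tokens = []
    for chunk in text.split():  # words are delimited by whitespace
        start, end = 0, len(chunk)
        leading, trailing = [], []
        # Peel punctuation off the left edge, one character per token.
        while start < end and is_punct(chunk[start]):
            leading.append(chunk[start])
            start += 1
        # Peel punctuation off the right edge.
        while end > start and is_punct(chunk[end - 1]):
            end -= 1
            trailing.append(chunk[end])
        tokens.extend(leading)
        if start < end:
            tokens.append(chunk[start:end])
        tokens.extend(reversed(trailing))
    return tokens


# Illustrative string built from two lemmas on this page, not a real sentence.
print(tokenize("रह, कर ।"))
```

The comma attached to the first word and the free-standing danda both come out as separate tokens, so the typographical difference between them disappears after tokenization.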
Morphology
Tags
- Bhojpuri uses all 17 universal POS categories, including particles (PART).
- Bhojpuri has the following auxiliary verbs (AUX):
  - हऽ ha’, आ ā, स sa, बा bā, छी chi, भा bhā and ना nā are equivalents of “to be”. They are used as copulas and in periphrastic tenses.
  - गइल gaila for the past tense.
  - रह raha (“to stay”) for the progressive aspect (with the stem of the main verb and the auxiliary).
  - कर kara (“to do”) for the habitual aspect (with the perfective participle of the main verb).
  - जा jā (“to go”) for the passive (with the perfective participle of the main verb).
  - Modal auxiliaries:
    - सक saka (“be able, can”)
    - पा pā (“to manage”)
    - चाही cāhī (“needed, should, ought to”)
    - हो ho (“to have to”)
    - पड़ paṛa (“must”)
  - Phase verbs:
    - लग laga (“to start”)
    - चुक cuka (“to finish”)
- The current data also treats as auxiliaries some verbs that regularly appear in compound verbs as the semantically less salient element. Since compound verbs are not periphrastic tense/aspect/voice forms, these verbs do not fit the UD definition of auxiliaries well, and they should be given a different analysis in future releases. The following verbs are used as semantic auxiliaries in compound verbs:
  - जा jā (“to go”) (note that this verb can also be used as a true auxiliary in passives)
  - ले le (“to take”)
  - दे de (“to give”), मार mār (“to hit”)
  - डाल ḍāla (“to throw”)
  - पड़ paṛa (“to fall”) (note that this verb can also be used as the modal “must”)
  - बैठ baiṭha (“to sit”)
  - उठ uṭha (“to rise”)
  - रख rakha (“to keep”)
  - आ ā (“to come”)
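Because several forms appear in more than one list above (जा, पड़ and आ), a lookup from lemma to listed functions can help when checking AUX annotations. The following is a rough sketch only: the role labels (`"copula"`, `"compound"`, and so on) and the names `AUX_BY_ROLE` and `aux_roles` are hypothetical, and the lemma spellings are copied from this page rather than taken from the treebank. Lemmas are NFC-normalized so that the two Unicode encodings of ड़ compare equal.

```python
import unicodedata
from collections import defaultdict

# Lemmas as listed on this page, grouped by hypothetical role labels.
AUX_BY_ROLE = {
    "copula": ["हऽ", "आ", "स", "बा", "छी", "भा", "ना"],
    "past": ["गइल"],
    "progressive": ["रह"],
    "habitual": ["कर"],
    "passive": ["जा"],
    "modal": ["सक", "पा", "चाही", "हो", "पड़"],
    "phase": ["लग", "चुक"],
    "compound": ["जा", "ले", "दे", "मार", "डाल", "पड़", "बैठ", "उठ", "रख", "आ"],
}


def _norm(lemma: str) -> str:
    # NFC maps the precomposed letter U+095C to U+0921 + U+093C (nukta),
    # so both spellings of the same lemma produce the same key.
    return unicodedata.normalize("NFC", lemma)


ROLES_BY_LEMMA = defaultdict(list)
for role, lemmas in AUX_BY_ROLE.items():
    for lemma in lemmas:
        ROLES_BY_LEMMA[_norm(lemma)].append(role)


def aux_roles(lemma: str) -> list[str]:
    """Return every listed function of a lemma, or [] if it is not listed."""
    return list(ROLES_BY_LEMMA.get(_norm(lemma), []))


print(aux_roles("जा"))  # listed both as passive auxiliary and in compound verbs
```

A lemma that returns more than one role is ambiguous between a true auxiliary and the compound-verb use, which is exactly the case the note above flags for reanalysis.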
Features
*
Syntax
*
Treebanks
There is only one Bhojpuri UD treebank at present: