UD for Hindi 
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. Exceptions are described below.
- According to typographical rules, some punctuation marks (e.g., comma) are attached to a neighboring word, while others (e.g., the sentence-terminating danda) are not. We tokenize punctuation as separate tokens (words).
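The two rules above can be illustrated with a minimal sketch. It is not the tokenizer used to build the treebanks. The function name `tokenize` is hypothetical, and the sketch assumes that "punctuation" means any character whose Unicode general category starts with `P` (this covers the comma and the danda). It emits every punctuation character as its own token, so hyphenated words, abbreviations and multi-character marks such as an ellipsis would need extra handling.

```python
import unicodedata


def tokenize(text: str) -> list[str]:
    """Split on whitespace, then emit each punctuation character as its own token.

    Assumption: a character counts as punctuation when its Unicode general
    category starts with "P". Devanagari vowel signs are combining marks
    (categories Mn/Mc), so they stay attached to their word.
    """
    tokens: list[str] = []
    for chunk in text.split():
        current = ""
        for ch in chunk:
            if unicodedata.category(ch).startswith("P"):
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens


# The comma is attached to the preceding word; the danda is already separate.
print(tokenize("राम घर गया, और सो गया ।"))
```

In this example the comma is split off from गया, while the danda, which is already surrounded by whitespace, is kept as a token of its own.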
Morphology
Tags
- Hindi uses all 17 universal POS categories, including particles (PART).
- Hindi has the following auxiliary verbs (AUX):
- है hai and था thā are present and past equivalents of “to be”. They are used as copulas and in periphrastic tenses.
- रह raha (“to stay”) for the progressive aspect (with the stem of the main verb and the auxiliary है/था).
- कर kara (“to do”) for the habitual aspect (with the perfective participle of the main verb).
- जा jā (“to go”) for the passive (with the perfective participle of the main verb).
- Modal auxiliaries:
- सक saka (“be able, can”)
- पा pā (“to manage”)
- चाहिए cāhie (“needed, should, ought to”)
- हो ho (“to have to”)
- पड़ paṛa (“must”)
- There are other verbs that are not auxiliaries under the UD definition, although some authors would call them auxiliaries outside the UD context. Some of them regularly appear in compound verbs as the semantically less salient element; others are control or raising verbs. Some examples follow:
- लग laga (“to start”)
- चुक cuka (“to finish”)
- जा jā (“to go”) (note that this verb can also be used as a real auxiliary in passives)
- ले le (“to take”)
- दे de (“to give”)
- डाल ḍāla (“to throw”)
- पड़ paṛa (“to fall”) (note that this verb can also be used as the modal “must”)
- बैठ baiṭha (“to sit”)
- उठ uṭha (“to rise”)
- रख rakha (“to keep”)
- आ ā (“to come”)
Treebanks
There are 2 Hindi UD treebanks: