This page pertains to UD version 2.

UD for Hindi

Tokenization and Word Segmentation

In general, words are delimited by whitespace characters. Exceptions are described below.
According to typographical rules, some punctuation marks (e.g., the comma) are attached to a neighboring word, while others (e.g., the sentence-terminating danda) are not. In either case, we tokenize each punctuation mark as a separate token (word).
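
The rule above can be illustrated with a minimal Python sketch. It splits on whitespace and then detaches punctuation from neighboring words. The function name `tokenize` and the punctuation inventory `PUNCT` are assumptions made for this illustration; they are not the tokenizer or the symbol list actually used to build the treebanks.

```python
# Hypothetical punctuation inventory, for illustration only.
# It includes the danda (U+0964) and the double danda (U+0965).
PUNCT = set(",;:!?()\"'।॥")


def tokenize(text):
    """Split on whitespace, then emit every punctuation mark as its own token."""
    tokens = []
    for chunk in text.split():
        word = ""
        for ch in chunk:
            if ch in PUNCT:
                # Flush the word collected so far, then emit the punctuation mark.
                if word:
                    tokens.append(word)
                    word = ""
                tokens.append(ch)
            else:
                word += ch
        if word:
            tokens.append(word)
    return tokens


# The comma is attached to the preceding word; the danda is set off by a space.
print(tokenize("राम, श्याम और सीता आए ।"))
```

Both the attached comma and the free-standing danda end up as separate tokens, so the output does not depend on how the punctuation was typeset.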

Morphology

Tags

Hindi uses all 17 universal POS categories, including particles (PART).
Hindi has the following auxiliary verbs (AUX):
- है hai and था thā are present and past equivalents of “to be”. They are used as copulas and in periphrastic tenses.
- रह raha (“to stay”) for the progressive aspect (with the stem of the main verb and the auxiliary है/था).
- कर kara (“to do”) for the habitual aspect (with the perfective participle of the main verb).
- जा jā (“to go”) for the passive (with the perfective participle of the main verb).
- Modal auxiliaries:
  - सक saka (“be able, can”)
  - पा pā (“to manage”)
  - चाहिए cāhie (“needed, should, ought to”)
  - हो ho (“to have to”)
  - पड़ paṛa (“must”)
There are other verbs that are not auxiliaries under the UD definition, although some authors would call them auxiliaries outside the UD context. Some of them regularly appear in compound verbs as the semantically less salient element; others are control and raising verbs. Some examples follow:
- लग laga (“to start”)
- चुक cuka (“to finish”)
- जा jā (“to go”) (note that this verb can also be used as a real auxiliary in passives)
- ले le (“to take”)
- दे de (“to give”)
- डाल ḍāla (“to throw”)
- पड़ paṛa (“to fall”) (note that this verb can also be used as the modal “must”)
- बैठ baiṭha (“to sit”)
- उठ uṭha (“to rise”)
- रख rakha (“to keep”)
- आ ā (“to come”)

Treebanks

There are 2 Hindi UD treebanks: