UD for Urdu
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. Exceptions are described below.
- According to typographical rules, some punctuation marks (e.g., comma) are attached to a neighboring word, while others are not. We tokenize punctuation as separate tokens (words).
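The two rules above can be sketched as a small tokenizer. This is a minimal illustration, not the procedure used to build any treebank: the function name `tokenize` is hypothetical, and treating every character in a Unicode punctuation category as its own token is a simplifying assumption that will over-split abbreviations, decimal numbers, and hyphenated forms.

```python
import unicodedata


def tokenize(text):
    """Split on whitespace, then detach punctuation into separate tokens.

    Assumption (not from the guidelines): any character whose Unicode
    general category starts with "P" counts as punctuation. This covers
    the Urdu full stop (U+06D4), the Arabic comma (U+060C) and the
    Arabic question mark (U+061F) as well as ASCII punctuation.
    """
    tokens = []
    for chunk in text.split():
        current = ""
        for ch in chunk:
            if unicodedata.category(ch).startswith("P"):
                # Flush the word collected so far, then emit the mark itself.
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens


if __name__ == "__main__":
    # "yah kitāb hai." written with the Urdu full stop attached to the last word.
    print(tokenize("یہ کتاب ہے۔"))
```

A production tokenizer would need an explicit exception list on top of this, since the sketch knows nothing about multiword tokens or script-specific spacing conventions.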
Morphology
Tags
- Urdu uses all 17 universal POS categories, including particles (PART).
- Urdu has the following auxiliary verbs (AUX):
- ہے hai and تھا thā are present and past equivalents of “to be”. They are used as copulas and in periphrastic tenses.
- رہ rah (“to stay”) for the progressive aspect (with the stem of the main verb and the auxiliary ہے/تھا).
- کر kar (“to do”) for the habitual aspect (with the perfective participle of the main verb).
- جا jā (“to go”) for the passive (with the perfective participle of the main verb).
- Modal auxiliaries:
- سک sak (“be able, can”)
- پا pā (“to manage”)
- چاہیئے cāhie (“needed, should, ought to”)
- ہو ho (“to have to”)
- پڑ paṛ (“must”)
- There are other verbs that are not auxiliaries under the UD definition, although some authors would call them auxiliaries outside the UD context. Some of them regularly appear in compound verbs as the semantically less salient element, while others are control or raising verbs. Some examples follow:
- لگ lag (“to start”)
- چک cuk (“to finish”)
- جا jā (“to go”) (note that this verb can also be used as a true auxiliary in passives)
- لے le (“to take”)
- دے de (“to give”)
- ڈال ḍāl (“to throw”)
- پڑ paṛ (“to fall”) (note that this verb can also be used as modal “must”)
- بیٹھ baiṭh (“to sit”)
- اٹھ uṭh (“to rise”)
- رکھ rakh (“to keep”)
- آ ā (“to come”)
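The two verb lists above can be encoded as lookup tables, which makes the overlap between them explicit. This is a rough sketch under stated assumptions: the names `AUX_LEMMAS`, `NON_AUX_LEMMAS`, and `candidate_upos` are hypothetical, the entries are only the forms listed in this section, and mapping the non-auxiliary verbs to the tag VERB is an assumption. The function returns candidate tags only; choosing between them requires looking at the construction the verb appears in.

```python
# Lemmas that appear in both lists above, defined once so the two tables agree.
JA = "جا"    # jā: passive auxiliary, or "to go" in compound verbs
PAR = "پڑ"   # paṛ: modal "must", or "to fall" in compound verbs

# Auxiliaries (AUX) from this section: form -> function described in the text.
AUX_LEMMAS = {
    "ہے": "present 'to be': copula, periphrastic tenses",
    "تھا": "past 'to be': copula, periphrastic tenses",
    "رہ": "progressive aspect",
    "کر": "habitual aspect",
    JA: "passive",
    "سک": "modal: be able, can",
    "پا": "modal: to manage",
    "چاہیئے": "modal: needed, should, ought to",
    "ہو": "modal: to have to",
    PAR: "modal: must",
}

# Verbs listed above as NOT auxiliaries under the UD definition.
NON_AUX_LEMMAS = {
    "لگ", "چک", JA, "لے", "دے", "ڈال", PAR, "بیٹھ", "اٹھ", "رکھ", "آ",
}


def candidate_upos(lemma):
    """Return the set of UPOS tags this section allows for a verb form.

    An empty set means the form is not mentioned in either list.
    """
    tags = set()
    if lemma in AUX_LEMMAS:
        tags.add("AUX")
    if lemma in NON_AUX_LEMMAS:
        tags.add("VERB")  # assumption: non-auxiliary verbs are tagged VERB
    return tags


if __name__ == "__main__":
    # Forms whose tag cannot be decided from the lemma alone.
    ambiguous = set(AUX_LEMMAS) & NON_AUX_LEMMAS
    print(sorted(ambiguous))
```

Keeping the ambiguous forms in both tables, rather than forcing a single tag, mirrors the notes in the list: جا and پڑ are auxiliaries only in the passive and modal constructions respectively.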
Features
Instruction: Describe inherent and inflectional features for major word classes (at least NOUN and VERB). Describe other noteworthy features. Include links to language-specific feature definitions if any.
Syntax
Instruction: Give criteria for identifying core arguments (subjects and objects), and describe the range of copula constructions in nonverbal clauses. List all subtype relations used. Include links to language-specific relations definitions if any.
Treebanks
There is 1 Urdu UD treebank: