UD for Bengali
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. Description of exceptions follows.
- According to typographical rules, some punctuation marks (e.g., comma) are attached to a neighboring word, while others (e.g., the sentence-terminating danda) are not. We tokenize punctuation as separate tokens (words).
Morphology
Tags
- Bengali uses all 17 universal POS categories, including particles (PART).
- The following words are classified as particles (PART):
- না nā, নাই nāi the negative particle (“not”).
- কি ki when it functions as a question particle (rather than the pronoun “what”) that marks a yes-no question.
- At present, only some modal verbs are treated as auxiliaries (AUX):
- পারা pārā (“be able, can, could”) combines with the imperfect participle (also called the infinitive) of the main verb
- হওয়া haoŷā (“be, become”) combines with the imperfect participle (also called the infinitive) of the main verb to create a necessitative meaning (“to have to”)
- চাওয়া cāoŷā (“want”) combines with the infinitive (also called the verbal noun) of the main verb
- There are verb-verb compounds, where the semantically more salient verb comes first in the form
of a participle, and the semantically less salient verb comes second and has a finite form.
There is a restricted set of verbs that can occur at the second position of verb-verb compounds.
Note that these are not considered auxiliary verbs in UD.
- যাওয়া yāoŷā (“to go”)
- নেয়া neŷā (“to take”)
- দেত্তয়া dettaŷā (“to give”)
- বসা basā (“to sit”)
- আসা āsā (“to come”)
- There are noun-verb compounds where the noun is the semantically more salient part and it comes first.
These are treated as light-verb constructions, i.e., the noun is attached to the verb as compound:lvc
rather than an object. The verb is not treated as auxiliary. Examples:
- শেষ করা śeṣa karā “to finish” (lit. “end do”)
- There are five main (de)verbal forms, distinguished by the UPOS tag and the value of the VerbForm feature:
- Infinitive
Inf
, tagged VERB or AUX. This is the citation form (lemma) of the verb. When used in a nominal position (as a subject of another verb, with a case suffix, with possessive modifiers etc.), it is instead tagged NOUN and itsVerbForm
isVnoun
(verbal noun) rather thanInf
. - Finite verb
Fin
, tagged VERB or AUX. - Participle
Part
, tagged VERB or AUX. It does not express person. The two main types of participles are distinguished by Aspect. The third type, conditional participle, is distinguished by Mood.- Imperfect participle. Some sources (e.g., Wiktionary) may use the term infinitive for this form and verbal noun for what we call infinitive.
- Perfect participle.
- Conditional participle:
Mood=Cnd
.
- Infinitive
Nominal Features
- The two main values of the Number feature are
Sing
andPlur
. Nouns (NOUN and PROPN) can inflect forNumber
but the default singular form is usually used unless plural needs to be explicitly signaled. For personal pronouns (PRON), both singular and plural forms are used. TheNumber
of subject (unlike itsPerson
) is not cross-referenced by the finite verb. - Case has up to 4 possible values:
Nom
,Gen
,Acc
,Loc
. It occurs with the nominal words, i.e., NOUN, PROPN, PRON. The accusative form is also called objective and used in contexts where other languages would use a dative form. While the accusative form of personal pronouns is regularly used, for nouns (and especially inanimate nouns), the nominative form is often used also in object position.
Degree and Polarity
Verbal Features
- Two participle types are distinguished by Aspect, either imperfective (
Imp
) or perfective (Perf
).Aspect=Perf
is further used with finite verbs in the present perfect form. Similarly, present continuous (progressive) finite forms will getAspect=Prog
. - Finite verbs always have one of two values of Mood:
Ind
orImp
. The conditional moodCnd
is only used with the conditional participle. - Finite verbs distinguish the Person of the subject but not its
Number
. Infinitives and participles do not distinguishPerson
. - Verbs in the indicative mood always have one of three values of Tense:
Past
,Pres
orFut
. - There are two values of the Voice feature:
Act
andPass
.
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns (PRON), determiners (DET) and adverbs (ADV).
- NumType is used with numerals (NUM), adjectives (ADJ), determiners (DET) and adverbs (ADV).
- Person is a lexical feature of personal pronouns (PRON) and has three values,
1
,2
and3
. Person is not marked on other types of pronouns and on nouns, although they can be almost always interpreted as the 3rd person. - Personal pronouns also have one of two values of Number:
Sing
orPlur
. They are lemmatized to the singular form, i.e., grammatical number is treated as (irregular) inflection. Note that the number of subject is not cross-referenced by finite verbs. - Personal pronouns have up to three Case forms:
Nom
(the base form),Acc
(the objective form, also used in contexts where other languages use dative), andGen
(the genitive form).- There is no specific category of possessive pronouns. Instead, the genitive form of personal pronouns is used to encode possession.
Other Features
- There are the following language-specific features:
- PartType:
PartType=Int
(interrogative) marks the word কি ki when it functions as a question particle (rather than the pronoun “what”).PartType=Neg
(negative) marks the negative particles না nā, নাই nāi (“not”).
- PartType:
Syntax
Core Arguments, Oblique Arguments and Adjuncts
- Nominal subject (nsubj) is a noun phrase in the nominative case, without preposition.
- Objects may be bare noun phrases in nominative or accusative form.
- Nominals in the locative form, if modifying a verb, are considered oblique (obl).
- In passive clauses, the subject is labeled with nsubj:pass.
- If the demoted agent is present, it has the form of a bare instrumental phrase and its relation is labeled obl:agent.
Non-verbal Clauses
- In the present tense, simple juxtaposition of the subject and the nominal/adjectival predicate is used, without an overt copula.
Relations Overview
- The following relation subtypes are used in Bengali:
- nsubj:pass for nominal subjects of passive verbs
- csubj:pass for clausal subjects of passive verbs
- obl:agent for agents of passive verbs
- acl:relcl for relative clauses
- nmod:poss for possessively used genitive modifiers
Treebanks
There are 2 Bengali UD treebanks: