
This page pertains to UD version 2.

UD for Southern Kurdish

Tokenization and Word Segmentation

Whitespace and punctuation: Words are generally delimited by whitespace and punctuation marks, following standard conventions.

Multiword tokens: The treebank contains 48 multiword tokens that are split into multiple syntactic words. Common patterns include:

Possessive constructions like bawkî → bawk + î (noun + possessive pronoun)
Prepositional contractions like mîveganî → mîvegan + î (noun + pronoun)
Pronominal clitics systematically separated from their hosts

Clitics: Pronominal clitics are segmented as separate tokens and marked with full morphological features (Person, Number, Animacy). Morphological affixes for definiteness and case remain attached to their hosts.

Segmentation policy:

Pronominal clitics: Always separated
Prepositional clitics: Separated if grammatically distinct
Postpositional elements: Remain host-attached
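As a rough illustration of how such a split appears in CoNLL-U, the sketch below encodes bawkî as a multiword token (ID range 1-2) over two syntactic words and recovers the split in Python. Only the forms, the NOUN and PRON tags, and the nmod:poss relation are taken from this page; the root attachment, the empty columns, and the helper name multiword_tokens are assumptions made for the example.

```python
# Illustrative CoNLL-U fragment: only the split "bawkî -> bawk + î", the UPOS
# tags, and the nmod:poss relation come from this page. The root attachment
# and the empty (_) columns are placeholders, not treebank data.
SAMPLE = "\n".join([
    "1-2\tbawkî\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tbawk\t_\tNOUN\t_\t_\t0\troot\t_\t_",
    "2\tî\t_\tPRON\t_\t_\t1\tnmod:poss\t_\t_",
])


def multiword_tokens(conllu: str) -> list[tuple[str, list[str]]]:
    """Return (surface form, [syntactic words]) for each multiword token."""
    rows = [
        line.split("\t")
        for line in conllu.splitlines()
        if line and not line.startswith("#")
    ]
    # Map integer word IDs to forms; ranges (1-2) and empty nodes (1.1) are skipped.
    words = {int(r[0]): r[1] for r in rows if r[0].isdigit()}
    result = []
    for r in rows:
        if "-" in r[0]:
            start, end = (int(x) for x in r[0].split("-"))
            result.append((r[1], [words[i] for i in range(start, end + 1)]))
    return result


print(multiword_tokens(SAMPLE))  # [('bawkî', ['bawk', 'î'])]
```

The function expects the lines of a single sentence, since word IDs restart at 1 in every sentence.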

Morphology

Tags

PART: The PART tag is used for:

Discourse particles like ewse (that’s it), xodi (hey!)
Emphatic particles like he (just, really)
Distributive particles like erî (each, every)

AUX vs VERB:

AUX is used sparingly (4 instances) for auxiliary verbs from daştin (to be) and copular e (is/are)
VERB covers all main verbs, including light verbs in compound constructions

Light Verb Constructions: A highly productive pattern combines semantically heavy nouns with light verbs (primarily dan “give”); it is annotated with compound:lvc.

Unused tags: All major POS tags are represented. NUM (52 instances) and PUNCT (58 instances) are among the less frequent categories.

Features

NOUN features:

Number: Sing, Plur
Definiteness marking when overt

VERB features:

Person: 1, 2, 3
Number: Sing, Plur
Tense: Past, Pres, Fut
Mood: Ind (indicative), Sub (subjunctive), Imp (imperative)
Voice: Act (active), Pass (passive), Cau (causative)
Aspect: Perf (perfective), Imp (imperfective)
VerbForm: Fin (only finite forms in current dataset)
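The inventory above can be checked mechanically against the FEATS column of a CoNLL-U file. The following is a minimal sketch, assuming the standard Name=Value|Name=Value encoding. The function names and the sample FEATS strings are hypothetical and are not taken from the treebank.

```python
# Values copied from the VERB feature list above. Imp is the label for both
# the imperative mood and the imperfective aspect.
VERB_FEATURES = {
    "Person": {"1", "2", "3"},
    "Number": {"Sing", "Plur"},
    "Tense": {"Past", "Pres", "Fut"},
    "Mood": {"Ind", "Sub", "Imp"},
    "Voice": {"Act", "Pass", "Cau"},
    "Aspect": {"Perf", "Imp"},
    "VerbForm": {"Fin"},
}


def parse_feats(feats: str) -> dict[str, str]:
    """Parse a CoNLL-U FEATS string such as 'Mood=Ind|Number=Sing'."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def unexpected_verb_feats(feats: str) -> list[str]:
    """List Name=Value pairs that fall outside the inventory above."""
    return [
        f"{name}={value}"
        for name, value in parse_feats(feats).items()
        if value not in VERB_FEATURES.get(name, set())
    ]


# Hypothetical FEATS strings, used only to exercise the check.
print(unexpected_verb_feats("Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin"))  # []
print(unexpected_verb_feats("Tense=Past|VerbForm=Part"))  # ['VerbForm=Part']
```

A feature name that is absent from the table is reported as well, which makes the check conservative: it flags anything the list above does not mention.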

Pronoun features:

Person: 1, 2, 3
Number: Sing, Plur
Animacy: Hum (human), Anim (animate), Inan (inanimate)
Definiteness: Def, Ind, Spec

Other notable features:

ExtPos: Used for multi-word adpositions
Systematic animacy marking on pronouns and some nouns

Syntax

Core arguments:

Subjects identified with nsubj relation (68 instances)
Direct objects use obj relation (35 instances)
Oblique arguments frequently marked with obl (98 instances), often involving adpositions

Light verb constructions:

Most frequent syntactic pattern (88 instances of compound:lvc)
Structure: NOUN[compound:lvc] → VERB[root]
Example: řikab dey = “to pedal” (lit. “give pedal”)
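The structure above can be read off the HEAD and DEPREL columns of a CoNLL-U sentence. The sketch below encodes řikab dey with the noun attached to the verb as compound:lvc and extracts the pair. The forms, tags, and relations follow the example above; the empty columns and the function name light_verb_pairs are assumptions for illustration.

```python
# Illustrative fragment for "řikab dey": the forms, UPOS tags, and relations
# follow the example above; the empty (_) columns are placeholders.
SAMPLE = "\n".join([
    "1\třikab\t_\tNOUN\t_\t_\t2\tcompound:lvc\t_\t_",
    "2\tdey\t_\tVERB\t_\t_\t0\troot\t_\t_",
])


def light_verb_pairs(sentence: str) -> list[tuple[str, str]]:
    """Return (noun form, light verb form) pairs linked by compound:lvc."""
    rows = [
        line.split("\t")
        for line in sentence.splitlines()
        if line and not line.startswith("#")
    ]
    forms = {r[0]: r[1] for r in rows}
    # Column 7 (index 6) is HEAD and column 8 (index 7) is DEPREL.
    return [(r[1], forms[r[6]]) for r in rows if r[7] == "compound:lvc"]


print(light_verb_pairs(SAMPLE))  # [('řikab', 'dey')]
```

The function takes one sentence at a time, because HEAD values refer to word IDs within the same sentence.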

Possessive constructions:

Marked with nmod:poss relation (68 instances)
Structure: NOUN ← PRON[nmod:poss]
Possessive pronouns carry full pronominal features

Subtype relations used:

nmod:poss for possessive modification
compound:lvc for light verb constructions
advmod:emph for emphatic adverbs
fixed for multi-word expressions (a universal relation rather than a subtype)

Other syntactic features:

Rich inventory of discourse particles marked with discourse relation
Coordination frequently used (conj, 106 instances)
Parataxis common for independent clauses (32 instances)
SOV word order patterns evident in dependency structures

Coordination and subordination:

Extensive coordination without overt coordinators
Subordination marked with mark relation (11 instances)
Relative clauses introduced by ki
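Instance counts like those quoted in this section can be recomputed from the DEPREL column of the treebank file. The following is a minimal sketch that uses only the Python standard library. The sample sentence reuses the bawkî split from the tokenization section, with a root attachment assumed for illustration, and the function name deprel_counts is hypothetical.

```python
from collections import Counter


def deprel_counts(conllu: str) -> Counter:
    """Count DEPREL values over syntactic words in CoNLL-U text."""
    counts = Counter()
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword token ranges (1-2) and empty nodes (1.1).
        if not cols[0].isdigit():
            continue
        counts[cols[7]] += 1
    return counts


# Illustrative sentence, used only to exercise the function.
SAMPLE = "\n".join([
    "# text = bawkî",
    "1-2\tbawkî\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tbawk\t_\tNOUN\t_\t_\t0\troot\t_\t_",
    "2\tî\t_\tPRON\t_\t_\t1\tnmod:poss\t_\t_",
])
print(dict(deprel_counts(SAMPLE)))
```

Applied to the full text of a treebank file, deprel_counts(text)["conj"] would give the number of conj relations, and likewise for the other relations listed above.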

Treebanks

There is 1 (../treebanks/sdh-comparison.html) Southern Kurdish UD treebank: