UD for Arabic
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. The exceptions are described below.
- According to typographical rules, many punctuation marks are attached to a neighboring word. We always tokenize them as separate tokens (words); that holds even for hyphenated compounds and for abbreviations.
- Enclitic pronouns, proclitic conjunctions and prepositions are cut off during tokenization and marked as multi-word tokens.
- Definite articles are treated as bound morphemes and they are not cut off during tokenization.
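The effect of these rules on a CoNLL-U file can be sketched as follows. This is a minimal illustration, assuming the standard CoNLL-U convention that a multi-word token is given as an ID range (such as 1-2) followed by its syntactic words. The function name split_tokens_and_words and the example fragment (the conjunction و attached to the verb قال) are hypothetical and not taken from any treebank.

```python
def split_tokens_and_words(conllu_lines):
    """Separate multi-word token ranges from syntactic words in one sentence."""
    surface_tokens = {}  # (start, end) -> surface form of the multi-word token
    words = []           # (id, form) for every syntactic word
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        token_id, form = cols[0], cols[1]
        if "-" in token_id:
            # Range line: the orthographic token that was split into words.
            start, end = token_id.split("-")
            surface_tokens[(int(start), int(end))] = form
        elif "." in token_id:
            continue  # empty nodes are ignored in this sketch
        else:
            words.append((int(token_id), form))
    return surface_tokens, words


# Hypothetical fragment: a proclitic conjunction cut off from the verb,
# and a punctuation mark tokenized separately.
example = [
    "# illustrative fragment, not taken from a treebank",
    "1-2\tوقال\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tو\t_\tCCONJ\t_\t_\t2\tcc\t_\t_",
    "2\tقال\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "3\t.\t_\tPUNCT\t_\t_\t2\tpunct\t_\t_",
]
surface_tokens, words = split_tokens_and_words(example)
```

Here surface_tokens holds the single range (1, 2) with the unsplit form, while words holds the three syntactic words that carry the annotation.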
Morphology
Tags
- Arabic uses 16 universal POS categories; at present, subordinating and coordinating conjunctions are not distinguished (the CCONJ tag is used).
- Some Arabic auxiliaries inflect like verbs, some do not inflect at all.
The following auxiliaries are found in the data:
- The copula كَان (kān) and the negative copulas لَيس (lays) and لسنا (lasnā).
- Future tense / modal auxiliary سَوفَ (sawfa) or its clitic version سَ (sa) “will”
- Modal particle قَد (qad) “may”
- Modal particle رُبَّمَا (rubbamā) “maybe, perhaps”
- Modal particle عَلَّ (ʿalla) “perhaps”
- Aspectual auxiliary verb عَاد (ʿād) “return, no longer do”
Nominal Features
- Nominal words (NOUN, PROPN and PRON) have an inherent Gender feature with values Masc or Fem.
- Verbs inflect for Gender because they must agree with nouns.
- Number has 3 possible values: Sing, Dual and Plur.
- Case has 3 possible values: Nom, Gen, Acc. It occurs with the nominal words, i.e., NOUN, PROPN, PRON, ADJ, DET, NUM.
Degree and Polarity
Verbal Features
- Aspect is inflectional, either imperfective (Imp) or perfective (Perf).
- Mood has 4 possible values: Ind, Imp, Jus, Sub.
- Voice has 2 possible values: Act, Pass.
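The value inventories listed under Nominal Features and Verbal Features can be checked mechanically. The sketch below assumes the standard CoNLL-U FEATS syntax (Name=Value pairs joined by "|", with "_" for no features). The table ALLOWED and the function names are hypothetical, and the table contains only the values named in this document; actual treebank data may use further values.

```python
# Only the values named in this document; treebank data may contain others.
ALLOWED = {
    "Gender": {"Masc", "Fem"},
    "Number": {"Sing", "Dual", "Plur"},
    "Case": {"Nom", "Gen", "Acc"},
    "Aspect": {"Imp", "Perf"},
    "Mood": {"Ind", "Imp", "Jus", "Sub"},
    "Voice": {"Act", "Pass"},
}


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict of feature name -> value."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def unexpected_values(feats):
    """Return (feature, value) pairs that are not in the ALLOWED table."""
    problems = []
    for name, value in parse_feats(feats).items():
        allowed = ALLOWED.get(name)
        if allowed is None:
            continue  # features not covered by the table are not checked
        for single in value.split(","):  # CoNLL-U allows comma-joined values
            if single not in allowed:
                problems.append((name, single))
    return problems
```

Note that Imp is a legal value of both Aspect (imperfective) and Mood (imperative), so a value has to be checked against its own feature name rather than against one shared list.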
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns (PRON) and determiners (DET).
- Person is a lexical feature of personal pronouns (PRON) and has three values: 1, 2 and 3.
Other Features
Syntax
Core Arguments, Oblique Arguments and Adjuncts
- Nominal subject (nsubj) is a noun phrase in the nominative case, without preposition.
- For the purposes of UD, objects are divided into core objects, labeled obj or iobj, and oblique objects, labeled obl:arg.
- Bare accusative and genitive objects are considered core.
- All prepositional objects are considered oblique.
- In passive clauses, the subject is labeled nsubj:pass if it is nominal and csubj:pass if it is clausal.
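The object and passive-subject rules above can be restated as a small decision procedure. This is only a sketch of the rules as worded here: the function names and parameters are hypothetical, and the choice between obj and iobj is left open because this section does not say how the two are told apart.

```python
def argument_label(has_preposition, case=None):
    """Classify an object of a verb following the rules stated above.

    Returns "obl:arg" for prepositional objects, "core" (to be labeled
    obj or iobj) for bare accusative or genitive objects, and None when
    the rules above do not apply.
    """
    if has_preposition:
        return "obl:arg"  # all prepositional objects are oblique
    if case in {"Acc", "Gen"}:
        return "core"     # bare accusative and genitive objects are core
    return None


def passive_subject_label(is_clausal):
    """Pick the subject relation used in a passive clause."""
    return "csubj:pass" if is_clausal else "nsubj:pass"
```

For example, argument_label(True, "Gen") yields "obl:arg" because the preposition takes priority over the case of the noun it governs.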
Non-verbal Clauses
- Besides the auxiliary copulas mentioned above, a personal pronoun can also serve as a copula.
Relations Overview
- The following relation subtypes are used in Arabic:
- nsubj:pass for nominal subjects of passive verbs
- csubj:pass for clausal subjects of passive verbs
- obl:arg for prepositional objects
- nmod:poss for possessive/genitive modifiers
- advmod:emph for adverbs or particles that modify noun phrases and emphasize or negate them
- flat:name for non-first personal names
- flat:foreign for non-first words in quoted foreign phrases
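The list of subtypes lends itself to a simple consistency check over the DEPREL column. The sketch below is an assumption-laden illustration: ARABIC_SUBTYPES copies only the subtypes listed above, and unlisted_subtypes is a hypothetical helper, not part of any UD tool.

```python
# Relation subtypes listed in this document for Arabic.
ARABIC_SUBTYPES = {
    "nsubj:pass",
    "csubj:pass",
    "obl:arg",
    "nmod:poss",
    "advmod:emph",
    "flat:name",
    "flat:foreign",
}


def unlisted_subtypes(deprels):
    """Return the subtyped relations (those containing ':') not listed above."""
    return sorted({d for d in deprels if ":" in d and d not in ARABIC_SUBTYPES})


# Plain universal relations such as nsubj are ignored; only subtypes are checked.
report = unlisted_subtypes(["nsubj", "obl:arg", "acl:relcl", "nmod:poss", "acl:relcl"])
```

A non-empty result does not by itself indicate an error, since an individual treebank may use subtypes that this overview does not mention.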
Treebanks
There are three Arabic UD treebanks: