UD for Zaar 
Tokenization and Word Segmentation
Since the dependencies in the Universal Dependencies framework are based on a lexical approach to syntax, the first step of the processing chain is to decide how to tokenize the language. Breaking the sentence down into tokens makes it possible to extract the syntactic information attached to words in the discourse chain.
- The Zaar treebank is an extension of an oral corpus (https://cortypo.huma-num.fr/index.html) interlinearized and glossed on a morphological basis.
- Tokenization had to take into account the fact that syntactic information in Zaar can be distributed in different ways across words, affixes and clitics. We decided to keep only words (with or without affixes) and clitics as tokens; the syntactic information carried by affixes is annotated through morphological features on the affixed words. Clitics are tagged PRON and convey syntactic functions such as complement and modifier.
- As we are dealing with oral data, we have chosen the illocutionary unit as the basic transcription unit. Punctuation tokens (e.g. <, >, //) organise the illocutionary unit as follows: pre-nucleus < nucleus > post-nucleus //
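The pre-nucleus < nucleus > post-nucleus // layout can be illustrated with a minimal sketch. It assumes that a unit contains at most one < and at most one >, and that // closes the unit; the function name and the placeholder tokens w1 to w4 are hypothetical and do not come from the treebank.

```python
def split_illocutionary_unit(tokens):
    """Split one illocutionary unit into pre-nucleus, nucleus and post-nucleus.

    Assumption: at most one "<" and one ">" per unit, "//" closes the unit.
    """
    # Drop the closing "//" marker if present.
    if tokens and tokens[-1] == "//":
        tokens = tokens[:-1]
    pre, post = [], []
    nucleus = list(tokens)
    # Everything before "<" belongs to the pre-nucleus.
    if "<" in nucleus:
        i = nucleus.index("<")
        pre, nucleus = nucleus[:i], nucleus[i + 1:]
    # Everything after ">" belongs to the post-nucleus.
    if ">" in nucleus:
        j = nucleus.index(">")
        nucleus, post = nucleus[:j], nucleus[j + 1:]
    return {"pre_nucleus": pre, "nucleus": nucleus, "post_nucleus": post}


# Placeholder tokens, not real Zaar words.
unit = ["w1", "<", "w2", "w3", ">", "w4", "//"]
zones = split_illocutionary_unit(unit)
```

A unit without < or > is returned with an empty pre-nucleus or post-nucleus, so the nucleus is always the default zone.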
Morphology
This is an overview only. For more detailed discussion and examples, see the list of Zaar POS tags and Zaar features.
Tags
- The language-specific tagset is the original annotation, made with the extended version of the Leipzig Glossing Rules (available here).
- The UD tagset is based on a conversion from the previous annotation to UPOS.
- Zaar uses 16 of the universal tags (with the exception of SYM, which is not relevant for oral data).
- As in other African languages (e.g. Hausa, Wolof), the verbal inflections in Zaar are gathered in a single AUX that precedes the VERB, and expresses various combinations of Tense (2 values), Aspect (4 values) and Mood (4 values). In addition to the TAM auxiliaries, Zaar has 2 copulas.
- The following auxiliaries are recognized in Zaar:
  - a for future (tense)
  - á for subjunctive (mood)
  - áː for perfect (aspect)
  - nə for the identifier/focus (copula)
  - tə̀ for jussive (mood)
  - taynàː for past (tense)
  - yáː for imperfect (aspect)
  - yǎː for conditional (mood)
  - yi for the locative/predicative (copula)
  - yí for irrealis (mood)
  - yiː for iterative (aspect)
  - yiká for progressive (aspect)
- These auxiliaries can be combined to produce complex TAM values.
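The auxiliary inventory above can be written down as a small lookup table, which also makes it easy to check the stated counts (2 tense, 4 aspect and 4 mood values, plus 2 copulas). This is only a sketch: the names AUXILIARIES and describe are hypothetical, and the labels are the descriptive ones used in the list, not UD feature values.

```python
from collections import Counter

# Form -> (category, value), copied from the list of auxiliaries above.
AUXILIARIES = {
    "a": ("tense", "future"),
    "á": ("mood", "subjunctive"),
    "áː": ("aspect", "perfect"),
    "nə": ("copula", "identifier/focus"),
    "tə̀": ("mood", "jussive"),
    "taynàː": ("tense", "past"),
    "yáː": ("aspect", "imperfect"),
    "yǎː": ("mood", "conditional"),
    "yi": ("copula", "locative/predicative"),
    "yí": ("mood", "irrealis"),
    "yiː": ("aspect", "iterative"),
    "yiká": ("aspect", "progressive"),
}


def describe(form):
    """Return a short description such as 'future (tense)' for a listed form."""
    category, value = AUXILIARIES[form]
    return f"{value} ({category})"


# Number of listed forms per category.
counts = Counter(category for category, _ in AUXILIARIES.values())
```

The lookup is by exact string, so tone marks and vowel length matter (a and á are different entries), and input text may need Unicode normalization before it matches these keys.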
Features
- The Zaar treebank uses 34 universal features
- 10 language specific values have been added to the scheme:
  - 7 PartTypes for the PART POS (Adv = Adverbial; Disc = Discourse; Foc = Focus; Illoc = Illocution; Neg = Negation; Pred = Predicative; Top = Topic)
  - 3 PastTypes for the past tense (Immediate, Recent and Remote)
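As an illustration, the ten language-specific values can be checked against a FEATS string in the CoNLL-U Key=Value|Key=Value notation. The sketch below assumes that the two features are spelled PartType and PastType in the FEATS column; the function names are hypothetical, and comma-joined multiple values are not handled.

```python
# Language-specific values as listed above. The keys "PartType" and "PastType"
# are assumed spellings for the FEATS column.
LANGUAGE_SPECIFIC = {
    "PartType": {"Adv", "Disc", "Foc", "Illoc", "Neg", "Pred", "Top"},
    "PastType": {"Immediate", "Recent", "Remote"},
}


def parse_feats(feats):
    """Parse a CoNLL-U FEATS string such as 'A=x|B=y' into a dict."""
    if feats == "_":  # "_" marks an empty FEATS column
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def invalid_language_specific(feats):
    """Return (key, value) pairs whose value is not in the listed inventory."""
    parsed = parse_feats(feats)
    return [
        (key, value)
        for key, value in parsed.items()
        if key in LANGUAGE_SPECIFIC and value not in LANGUAGE_SPECIFIC[key]
    ]
```

For example, invalid_language_specific("PartType=Neg") returns an empty list, while a value outside the inventory is reported together with its key.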
Syntax
- The dependency analysis is a conversion of the manual annotation to SUD format. For more information, see SUD guidelines.
- Zaar is mostly an SVO language. The only exception is found in the progressive aspect, where the direct object can precede the nominalized verb (a Vnoun).
- Zaar is a pro-drop language with a high proportion of dislocated subjects and complements. In addition to a possible independent lexical or pronominal subject (tagged nsubj), the AUX contains agreement features for Person and Number.
- Direct objects are annotated with obj and indirect objects with iobj.
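The word-order observation above can be checked mechanically on dependency annotations: an obj dependent whose index is lower than its head's index precedes it. The following sketch works on (ID, HEAD, DEPREL) triples taken from the corresponding CoNLL-U columns. The two example skeletons are invented for illustration and are not treebank sentences, and the aux label for the auxiliary is an assumption.

```python
def object_positions(rows):
    """Classify each `obj` dependent as preceding (OV) or following (VO) its head.

    `rows` holds (ID, HEAD, DEPREL) triples for one sentence, with integer
    word indices as in the CoNLL-U ID and HEAD columns.
    """
    result = []
    for token_id, head, deprel in rows:
        if deprel == "obj":
            order = "OV" if token_id < head else "VO"
            result.append((token_id, head, order))
    return result


# Hypothetical skeleton: subject (1), auxiliary (2), verb (3), object (4).
svo = [(1, 3, "nsubj"), (2, 3, "aux"), (3, 0, "root"), (4, 3, "obj")]
# Hypothetical progressive-like skeleton: object (3) before the verbal noun (4).
ov = [(1, 4, "nsubj"), (2, 4, "aux"), (3, 4, "obj"), (4, 0, "root")]
```

Run over a whole treebank, the same comparison would give the share of objects that precede their head, which could then be cross-checked against the aspect of the clause.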
Treebanks
There is 1 Zaar UD treebank: