This page pertains to UD version 2.

UD for Zaar

Tokenization and Word Segmentation

Because the dependencies in the Universal Dependencies framework rest on a lexical approach to syntax, the first step of the processing chain is to decide how to tokenize the language. Breaking the sentence down into tokens makes it possible to extract the syntactic information carried by each word in the discourse chain.

The Zaar treebank is an extension of an oral corpus (https://cortypo.huma-num.fr/index.html) that was interlinearized and glossed on a morphological basis.
Tokenization had to take into account that syntactic information in Zaar can be distributed across words, affixes and clitics. Only words (with or without affixes) and clitics are kept as tokens; the syntactic information carried by affixes is annotated as morphological features on the affixed word. Clitics are tagged PRON and convey syntactic functions such as complement and modifier.
Because the data are oral, the illocutionary unit was chosen as the basic transcription unit. Punctuation tokens (e.g. <, >, //) organise the illocutionary unit as follows: pre-nucleus < nucleus > post-nucleus //
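
The layout of an illocutionary unit described above can be illustrated with a minimal Python sketch. It assumes that a unit arrives as a flat list of tokens, that every "<" closes a pre-nucleus segment, that the first ">" opens the post-nucleus, and that "//" ends the unit. The function name and the placeholder tokens (w1, w2, ...) are hypothetical and are not part of the treebank tooling.

```python
from typing import Dict, List


def split_illocutionary_unit(tokens: List[str]) -> Dict[str, List[str]]:
    """Split one illocutionary unit into pre-nucleus, nucleus and post-nucleus.

    Assumed marker semantics (an illustration, not the official tooling):
    "<" closes a pre-nucleus segment, the first ">" opens the post-nucleus,
    and "//" ends the unit.
    """
    pre: List[str] = []
    nucleus: List[str] = []
    post: List[str] = []
    buffer: List[str] = []
    in_post = False

    for tok in tokens:
        if tok == "//":
            break  # end of the illocutionary unit
        if tok == "<":
            if not in_post:
                # everything collected so far belongs to the pre-nucleus
                pre.extend(buffer)
                buffer = []
        elif tok == ">":
            if not in_post:
                # everything collected since the last "<" is the nucleus
                nucleus, buffer, in_post = buffer, [], True
        else:
            buffer.append(tok)

    # Whatever remains is the post-nucleus if ">" was seen, else the nucleus.
    if in_post:
        post = buffer
    else:
        nucleus = buffer
    return {"pre_nucleus": pre, "nucleus": nucleus, "post_nucleus": post}


# Placeholder tokens only; these are not Zaar words.
unit = ["w1", "w2", "<", "w3", "w4", ">", "w5", "//"]
parts = split_illocutionary_unit(unit)
```

A unit without any "<" or ">" marker is returned entirely as nucleus, which matches the reading that the pre-nucleus and post-nucleus are optional.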

Morphology

This is an overview only. For more detailed discussion and examples, see the list of Zaar POS tags and Zaar features.

Features

The Zaar treebank uses 34 universal features.
Ten language-specific values have been added to the scheme:
- 7 PartTypes for the PART POS (Adv = Adverbial; Disc = Discourse; Foc = Focus; Illoc = Illocution; Neg = Negation; Pred = Predicative; Top = Topic)
- 3 PastTypes for the past tense (Immediate, Recent and Remote)
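
As a rough illustration, the ten language-specific values listed above can be stored in a lookup table and used to filter the FEATS column of a CoNLL-U token. The sketch assumes the feature names are spelled PartType and PastType and that the PastType values are spelled as in the list above; the actual treebank may use different spellings or abbreviations. The helper names are hypothetical.

```python
from typing import Dict

# Assumed spellings, taken from the list above; check against the treebank.
LANGUAGE_SPECIFIC_VALUES = {
    "PartType": {"Adv", "Disc", "Foc", "Illoc", "Neg", "Pred", "Top"},
    "PastType": {"Immediate", "Recent", "Remote"},
}


def parse_feats(feats: str) -> Dict[str, str]:
    """Parse a CoNLL-U FEATS string such as 'Number=Sing|Person=3'."""
    if feats == "_":  # CoNLL-U uses "_" for an empty FEATS column
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def language_specific(feats: str) -> Dict[str, str]:
    """Keep only the features whose value is one of the ten listed above."""
    return {
        name: value
        for name, value in parse_feats(feats).items()
        if value in LANGUAGE_SPECIFIC_VALUES.get(name, set())
    }


total_values = sum(len(values) for values in LANGUAGE_SPECIFIC_VALUES.values())
```

For example, `language_specific("PartType=Neg")` keeps the negation particle type, while a FEATS string containing only universal features yields an empty dictionary.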

Syntax

The dependency analysis is converted from a manual annotation in SUD format. For more information, see the SUD guidelines.
Zaar is mostly an SVO language. The only exception is found in the progressive aspect, where the direct object can precede the nominalized verb (a Vnoun).
Zaar is a pro-drop language with a high proportion of dislocated subjects and complements. In addition to an optional independent lexical or pronominal subject (tagged nsubj), the AUX carries agreement features for Person and Number.
Direct objects are annotated with obj and indirect objects with iobj.
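
The interplay between an optional nsubj and the agreement features on the AUX can be sketched as follows. This is only an illustration: it assumes tokens are represented as dictionaries with head, upos, deprel and feats keys, and that the AUX is attached to the verb it accompanies. The function name and the placeholder forms are hypothetical and are not Zaar data.

```python
from typing import Any, Dict, List

Token = Dict[str, Any]


def describe_subject(tokens: List[Token], verb_id: int) -> Dict[str, Any]:
    """Report whether a verb has an overt nsubj and what its AUX agrees with."""
    dependents = [t for t in tokens if t["head"] == verb_id]

    # An overt subject is any dependent whose relation is nsubj (or a subtype).
    overt = any(t["deprel"].split(":")[0] == "nsubj" for t in dependents)

    # Person and Number are read from the AUX, even when no nsubj is present.
    agreement: Dict[str, str] = {}
    for t in dependents:
        if t["upos"] == "AUX":
            agreement = {
                key: t["feats"][key]
                for key in ("Person", "Number")
                if key in t["feats"]
            }
    return {"overt_subject": overt, "agreement": agreement}


# Placeholder clause without an independent subject (forms are not Zaar words).
clause = [
    {"id": 1, "form": "aux1", "upos": "AUX", "head": 2, "deprel": "aux",
     "feats": {"Person": "3", "Number": "Sing"}},
    {"id": 2, "form": "verb1", "upos": "VERB", "head": 0, "deprel": "root",
     "feats": {}},
]
info = describe_subject(clause, verb_id=2)
```

In this placeholder clause no nsubj is present, so the subject is recoverable only from the Person and Number features on the AUX.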

Treebanks

There is 1 Zaar UD treebank: