UD for Turkish
This is a work-in-progress overview of the UD annotation for Turkish.
Unfortunately, different treebanks follow slightly different annotation guidelines, and as of v2.4 several uncoordinated correction efforts were known. As of v2.14, a group named the UD Turkic Group is working on unifying the Turkish treebanks.
Tokenization and Word Segmentation
- In general, words are delimited by whitespace or punctuation. However, whitespace or punctuation may appear inside some abbreviations or numeric expressions.
- Clitics are treated as separate words. This includes clitics that undergo word-internal processes (e.g., vowel harmony), such as the question clitic mI and the adverbial DA.
- Due to the interaction of syntax and rich morphology, the following affixes introduce new “syntactic words”:
  - The copular suffix attached to nouns or adverbs (if not null), including the conditional -(y)sA and the converbial -(y)ken
  - -ki
  - -lI
  - -sIz
  - -lIk
For more details, see tokenization.
Morphology
Turkish has a rich inflectional and derivational morphology.
Some morphological phenomena are not satisfactorily annotated as of UD v2. This includes some missing feature-value pairs, e.g., ‘reflexive voice’, which is marked using the language-specific value Voice=Rfl.
Another open issue is multiple values for certain UD morphological features. For example, a form like gelemeselerdi “if they were not able to come” expresses two different modalities, which requires assigning both Pot and Cnd to the Mood feature. Currently, such multiple values are expressed by concatenating them in alphabetical order, resulting in feature-value pairs like Mood=CndPot. Besides Mood, Voice may also have multiple values.
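The concatenation convention described above can be sketched in a few lines of Python. This is only an illustration, not code from any treebank's tooling: the helper names merge_values and format_feats are hypothetical, and the feature dictionary in the usage line is an assumed partial analysis chosen for the example. The only parts taken from the text are the alphabetical concatenation of values and the resulting Mood=CndPot pair; the `|` separator and the `_` placeholder follow the CoNLL-U FEATS column format.

```python
def merge_values(values):
    """Concatenate several values of one feature in alphabetical order.

    Hypothetical helper: ["Pot", "Cnd"] becomes "CndPot".
    """
    return "".join(sorted(set(values)))


def format_feats(feats):
    """Render a {feature: [values]} mapping as a CoNLL-U style FEATS string."""
    if not feats:
        return "_"  # CoNLL-U placeholder for an empty FEATS column
    # Feature names are ordered case-insensitively, then joined with "|".
    items = sorted(feats.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={merge_values(vals)}" for name, vals in items)


# Assumed, partial feature set for illustration only.
print(format_feats({"Mood": ["Pot", "Cnd"], "Polarity": ["Neg"]}))
```

One consequence of this design is that a consumer looking for conditional mood has to check whether Cnd occurs inside the value rather than test for equality with Mood=Cnd.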
Tags
This is an overview only. For more detailed discussion and examples, see the list of Turkish POS tags and Turkish features.
- The use of UD POS tags varies among treebanks.
  - PART is not used in any of the current treebanks.
  - The negation particle değil is marked as AUX in some treebanks and as VERB in others.
  - The question particle mI is tagged as AUX.
  - The copular suffix -(y)-, which is treated as a syntactic word, and its clitic counterpart i- are marked as AUX.
- Treatment of auxiliary/copula ol differs among different treebanks.
- There are four main (de)verbal forms, distinguished by the value of the VerbForm feature:
  - Finite verb: Fin.
  - Participle: Part.
  - Converb: Conv.
  - Verbal noun: Vnoun (this includes the citation forms with -mak, sometimes called the infinitive).
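As a rough illustration of how the four VerbForm values listed above might be read off an annotated token, the following Python sketch parses a CoNLL-U style FEATS string and looks up the VerbForm value. The names VERB_FORMS, parse_feats, and describe_verb_form are hypothetical, and the feature string in the usage line is made up for the example rather than taken from a treebank.

```python
# The four VerbForm values named in the overview, with their descriptions.
VERB_FORMS = {
    "Fin": "finite verb",
    "Part": "participle",
    "Conv": "converb",
    "Vnoun": "verbal noun",
}


def parse_feats(feats):
    """Split a FEATS string such as 'A=x|B=y' into a dict; '_' means empty."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def describe_verb_form(feats):
    """Return the description of the token's VerbForm, or None if absent or unknown."""
    return VERB_FORMS.get(parse_feats(feats).get("VerbForm"))


# Made-up feature string for illustration only.
print(describe_verb_form("Case=Nom|VerbForm=Vnoun"))
```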
Syntax
This is an overview only. For more detailed discussion and examples, see the list of relations.
Relations Overview
- The following relation subtypes are used in Turkish:
Treebanks
As of UD 2.13, there are nine Turkish UD treebanks, with more treebanks in progress.