UD Kyrgyz TueCL
Language: Kyrgyz (code: ky)
Family: Turkic
This treebank has been part of Universal Dependencies since the UD v2.14 release.
The following people have contributed to making this treebank part of UD: Bermet Chontaeva, Çağrı Çöltekin.
Repository: UD_Kyrgyz-TueCL
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.17
License: CC BY-SA 4.0
Genre: grammar-examples
Questions, comments? General annotation questions (either Kyrgyz-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [bermet • chontaeva (æt) student • uni-tuebingen • de, cagri • coeltekin (æt) uni-tuebingen • de]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
| Annotation | Source |
|---|---|
| Lemmas | annotated manually |
| UPOS | annotated manually, natively in UD style |
| XPOS | not available |
| Features | annotated manually, natively in UD style |
| Relations | annotated manually, natively in UD style |
Description
This is a small treebank of grammatical examples for Kyrgyz.
The Kyrgyz-TueCL treebank is part of a parallel Universal Dependencies corpus containing 148 sentences across four Turkic languages (Turkish - UD_Turkish-TueCL, Azerbaijani - UD_Azerbaijani-TueCL, Kyrgyz - UD_Kyrgyz-TueCL, and Uzbek - UD_Uzbek-TueCL), designed to facilitate cross-linguistic research on these related languages.
- Total sentences: 173
- Total tokens: 1250
- Unique word forms (types): 464
- Unique lemmas: 287
The Kyrgyz-TueCL treebank consists of 173 carefully selected sentences compiled from multiple sources, including the Cairo corpus (20 sentences), the UDTW23 corpus (20 sentences), and 97 additional examples illustrating specific grammatical constructions of interest. It serves as a source treebank for a parallel corpus spanning four Turkic languages from distinct branches of the family: Turkish and Azerbaijani (Oghuz), Kyrgyz (Kipchak), and Uzbek (Karluk).
The treebank includes various syntactic phenomena relevant to Turkic languages, such as pro-drop constructions, auxiliary chains, postverbal structures, and non-canonical word orders. Each sentence has been manually annotated following UD guidelines, with particular attention to morphosyntactic features that highlight both shared typological characteristics and language-specific traits. To support comparative research, every sentence also carries a gloss, a transliteration, and translations into Azerbaijani, Turkish, Uzbek, and English as metadata.
Dependency relations, glossing, lemmatization, morphological features, POS tagging, tokenization, and transliteration were manually annotated.
This resource is significant because it belongs to the first fully aligned parallel UD treebanks for these Turkic languages, enabling systematic cross-linguistic comparisons previously hindered by the lack of parallel resources. The treebank supports research in comparative Turkic syntax, cross-lingual parsing, and language education.
Acknowledgments
This work was supported by COST Action CA21167 - Universality, diversity and idiosyncrasy in language technology (UniDive). We thank the Turkic UD working group for fruitful discussions of linguistic issues and annotation approaches. We extend special thanks to the Kyrgyz team — Jonathan North Washington, Aida Kasieva, Gulnura Dzumalieva, Aigul Tursunova, Meerim Ryspakova, and Aizat Kadyrbekova — for their consistent support, as well as their valuable weekly meetings and discussions that greatly contributed to this work.
References
Please cite the following paper if you use the Kyrgyz-TueCL UD treebank:
@inproceedings{akhundjanova-etal-2025-parallel,
title = "Parallel {U}niversal {D}ependencies Treebanks for {T}urkic Languages",
author = "Akhundjanova, Arofat and
Akkurt, Furkan and
Chontaeva, Bermet and
Eslami, Soudabeh and
Coltekin, Cagri",
editor = {Bouma, Gosse and
{\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i}},
booktitle = "Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025)",
month = aug,
year = "2025",
address = "Ljubljana, Slovenia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.udw-1.14/",
pages = "129--136",
ISBN = "979-8-89176-292-3",
abstract = "We introduce the first fully aligned and manually annotated parallel Universal Dependencies (UD) treebanks for four Turkic languages: Azerbaijani, Kyrgyz, Turkish, and Uzbek. These resources currently consist of 148 strategically selected sentences that illustrate typologically significant morphosyntactic phenomena across these related yet distinct languages. These parallel treebanks enable systematic comparative studies of Turkic syntax and may be instrumental in cross-lingual NLP applications. All treebanks are available as part of UD v2.16."
}
Statistics of UD Kyrgyz TueCL
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB
Features
Aspect – Case – Definite – Degree – Evident – Mood – Number – Number[psor] – Person – Person[psor] – PronType – Tense – VerbForm – Voice
Relations
acl – acl:relcl – advcl – advmod – advmod:emph – amod – appos – aux – case – cc – ccomp – compound – compound:lvc – compound:svc – conj – cop – csubj – det – discourse – fixed – flat – mark – nmod – nmod:poss – nsubj – nsubj:outer – nsubj:pass – nummod – obj – obl – obl:cau – obl:tmod – orphan – parataxis – punct – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 173 sentences, 1214 tokens and 1250 syntactic words.
- This corpus contains 232 tokens (19%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus does not contain words that contain both letters and punctuation.
- This corpus contains 35 multi-word tokens. On average, one multi-word token consists of 2.03 syntactic words.
- There are 22 types of multi-word tokens. Examples: бердиби, Окугандарын, барбы, бекен, беле, жатабы, Дениздикинин, ашкананыкын, бересиңби, жаттыңызбы, жокпу, келдиби, кичинекейби, коёсуңбу, сеникинен, текчесиндегилер, туурабы, үйдөбү, үйдөгү, үйдөгүнүкү, өрөөнбү, өткөрүлөбү.
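The figures above fit together arithmetically: 35 multi-word tokens averaging 2.03 syntactic words account for about 71 words, so 1214 tokens minus 35 multi-word tokens plus 71 words gives the reported 1250 syntactic words. The following is a minimal sketch of how such counts can be derived from a CoNLL-U file. The function name `count_units`, the helper `row`, and the two-sentence sample are assumptions made for illustration; the sample is a toy fragment, not text taken from the treebank, and all annotation columns are left as `_` because only the ID column matters here.

```python
def row(*fields):
    """Join fields with tabs, padding with '_' to the 10 CoNLL-U columns."""
    fields = list(fields) + ["_"] * (10 - len(fields))
    return "\t".join(fields)


# Toy fragment (hypothetical, not from the treebank): "барбы" is a
# multi-word token spanning the syntactic words 2 and 3.
SAMPLE = "\n".join([
    "# text = Китеп барбы?",
    row("1", "Китеп"),
    row("2-3", "барбы"),
    row("2", "бар"),
    row("3", "бы"),
    row("4", "?"),
    "",
    "# text = Ал келди.",
    row("1", "Ал"),
    row("2", "келди"),
    row("3", "."),
    "",
])


def count_units(conllu_text):
    """Count sentences, surface tokens, syntactic words and multi-word tokens."""
    sentences = tokens = words = mwts = 0
    covered_until = 0   # last word ID covered by the current multi-word token
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():            # a blank line ends the sentence
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):        # sentence-level comment
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tok_id = line.split("\t")[0]
        if "-" in tok_id:               # range ID: a multi-word token
            mwts += 1
            tokens += 1
            covered_until = int(tok_id.split("-")[1])
        elif "." in tok_id:             # empty node: neither token nor word
            continue
        else:                           # integer ID: a syntactic word
            words += 1
            if int(tok_id) > covered_until:
                tokens += 1             # a word that is also a surface token
    return {"sentences": sentences, "tokens": tokens,
            "words": words, "mwts": mwts}


print(count_units(SAMPLE))
```

On the toy fragment this yields 2 sentences, 6 tokens, 7 syntactic words and 1 multi-word token; run on the treebank's CoNLL-U file, the same logic should reproduce the totals reported above, assuming the file follows the standard CoNLL-U ID conventions.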
Morphology
Tags
- This corpus uses 16 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB
- This corpus does not use the following tags: X
- This corpus contains 8 word types tagged as particles (PART): б, би, бу, бы, бү, пу, тургандыгы, ээ
- This corpus contains 13 lemmas tagged as pronouns (PRON): _, Питер, ал, алар, биз, бир, бул, ки, ким, мен, сен, эмне, өз
- This corpus contains 7 lemmas tagged as determiners (DET): бардык, бир, бул, көп, ошо, эч, өз
- Out of the above, 3 lemmas occurred sometimes as PRON and sometimes as DET: бир, бул, өз
- This corpus contains 12 lemmas tagged as auxiliaries (AUX): ал, бер, бол, жат, жок, кал, кет, кой, окшо, тур, э, экен
- Out of the above, 7 lemmas occurred sometimes as AUX and sometimes as VERB: ал, бер, бол, жат, кал, кет, окшо
- There are 4 (de)verbal forms:
  - Conv
    - VERB: Түнөп
  - Fin
    - AUX: болчу, жаткан
    - VERB: ач, такылдатыптыр, өткөрүлө
  - Inf
    - VERB: жыйнап
  - Part
    - VERB: берген
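The overlap statements in this list (lemmas that occur sometimes as PRON and sometimes as DET, or sometimes as AUX and sometimes as VERB) can be recomputed from the LEMMA and UPOS columns of a CoNLL-U file. The sketch below is one way to do that. The names `upos_by_lemma`, `lemmas_with_both` and `row` are assumptions for illustration, and the sample is a flat list of toy rows built to mirror the kind of overlap reported here, not sentences from the treebank.

```python
from collections import defaultdict


def row(*fields):
    """Join fields with tabs, padding with '_' to the 10 CoNLL-U columns."""
    fields = list(fields) + ["_"] * (10 - len(fields))
    return "\t".join(fields)


# Toy rows (hypothetical): ID, FORM, LEMMA, UPOS; remaining columns are '_'.
SAMPLE = "\n".join([
    row("1", "бул", "бул", "DET"),
    row("2", "бул", "бул", "PRON"),
    row("3", "ал", "ал", "PRON"),
    row("4", "бол", "бол", "AUX"),
    row("5", "бол", "бол", "VERB"),
    row("6", "ал", "ал", "AUX"),
])


def upos_by_lemma(conllu_text):
    """Map every lemma to the set of UPOS tags it occurs with."""
    tags = defaultdict(set)
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Keep syntactic words only: skip multi-word token ranges and empty nodes.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        tags[cols[2]].add(cols[3])
    return tags


def lemmas_with_both(conllu_text, upos_a, upos_b):
    """Return the sorted lemmas attested with both upos_a and upos_b."""
    wanted = {upos_a, upos_b}
    return sorted(lemma for lemma, seen in upos_by_lemma(conllu_text).items()
                  if wanted <= seen)


print(lemmas_with_both(SAMPLE, "PRON", "DET"))
print(lemmas_with_both(SAMPLE, "AUX", "VERB"))
```

In the toy rows, `ал` is attested as PRON and AUX but not as VERB, so it is excluded from the AUX/VERB overlap. In the treebank itself it is listed in that overlap.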
Nominal Features
- Number
  - Plur
    - PRON: Алар, Сен
  - Sing
    - AUX-Fin: болчу, жаткан
    - NOUN: Жамгыр, күйөөсүнө, унааны, Кыз, гүнү, досуна, кат, кү, терезени, үйдө
    - PRON: ал, Сен, Менин
    - PROPN: Сэм
    - VERB: жуудурду, такылдатыптыр, өткөрүлө
    - VERB-Fin: такылдатыптыр, өткөрүлө
- Case
  - Abl
    - NOUN: ресторандан
  - Acc
    - NOUN: унааны, терезени, эшикти, үйдү
  - Dat
    - NOUN: күйөөсүнө, досуна
  - Gen
    - NOUN: гүнү, үйдүн
    - PRON: Менин
  - Loc
    - NOUN: үйдө
  - Nom
    - NOUN: Жамгыр, конок, тамак, Кыз, буюртма, кат, кү, ээси
    - PRON: ал, Сен, Алар
    - PROPN: Сэм
- Definite
  - Def
    - NOUN: эшикти, үйдү, үйдүн
  - Ind
    - NOUN: буюртма
Degree and Polarity
- Degree
  - Cmp
    - ADJ: кыйыныраак
Verbal Features
- Aspect
  - Imp
    - AUX-Fin: болчу
  - Perf
    - VERB-Conv: Түнөп
    - VERB-Inf: жыйнап
- Mood
  - Imp
    - VERB-Fin: ач
  - Ind
    - AUX-Fin: болчу, жаткан
    - VERB: жуудурду, такылдатыптыр, өткөрүлө
    - VERB-Fin: такылдатыптыр, өткөрүлө
- Tense
  - Past
    - AUX-Fin: болчу, жаткан
    - VERB: жуудурду, берген, такылдатыптыр
    - VERB-Fin: такылдатыптыр
    - VERB-Part: берген
  - Pres
    - VERB-Fin: ач, өткөрүлө
- Voice
  - Cau
    - VERB: жуудурду, такылдатыптыр
    - VERB-Fin: такылдатыптыр
  - Pass
    - VERB-Fin: өткөрүлө
- Evident
  - Nfh
    - VERB-Fin: такылдатыптыр
Pronouns, Determiners, Quantifiers
- PronType
  - Prs
    - PRON: ал, Сен, Алар, Менин
- Person
  - 2
    - PRON: Сен
  - 3
    - AUX-Fin: болчу, жаткан
    - PRON: ал, Алар, Сен
    - VERB: жуудурду, ач, такылдатыптыр, өткөрүлө
    - VERB-Fin: ач, такылдатыптыр, өткөрүлө
- Number[psor]
  - Plur,Sing
    - NOUN: ээси
Other Features
- Person[psor]
  - 3
    - NOUN: ээси
- 3
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop): э.
- This corpus uses 12 lemmas as auxiliaries (aux). Examples: жат, кал, ал, бол, э, экен, жок, кой, тур, бер, кет, окшо.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (28)
  - VERB--NOUN-Nom (4)
  - VERB--PRON (11)
  - VERB--PRON-Nom (14)
  - VERB-Fin--NOUN-Nom (1)
  - VERB-Inf--NOUN-Nom (1)
- obj
  - VERB--NOUN (56)
  - VERB--NOUN-Acc (2)
  - VERB--NOUN-Nom (1)
  - VERB--PRON (7)
  - VERB-Fin--NOUN-Acc (2)
  - VERB-Inf--NOUN-Acc (1)
  - VERB-Part--NOUN-Nom (1)
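Counts of this kind pair each nsubj or obj dependent that is a noun or pronoun with a verbal head. The sketch below shows one way to tally such pairs from the UPOS, HEAD and DEPREL columns. It is a simplified illustration: it ignores the Case and VerbForm refinements shown in the list, and the names `read_sentences`, `count_core_arguments` and `row`, as well as the two toy sentences and their annotation, are assumptions rather than material from the treebank.

```python
from collections import Counter


def row(*fields):
    """Join fields with tabs, padding with '_' to the 10 CoNLL-U columns."""
    fields = list(fields) + ["_"] * (10 - len(fields))
    return "\t".join(fields)


# Toy sentences (hypothetical annotation): ID, FORM, LEMMA, UPOS, XPOS,
# FEATS, HEAD, DEPREL; the last two columns are padded with '_'.
SAMPLE = "\n".join([
    row("1", "Кыз", "кыз", "NOUN", "_", "_", "3", "nsubj"),
    row("2", "кат", "кат", "NOUN", "_", "_", "3", "obj"),
    row("3", "жазды", "жаз", "VERB", "_", "_", "0", "root"),
    row("4", ".", ".", "PUNCT", "_", "_", "3", "punct"),
    "",
    row("1", "Ал", "ал", "PRON", "_", "_", "2", "nsubj"),
    row("2", "келди", "кел", "VERB", "_", "_", "0", "root"),
    row("3", ".", ".", "PUNCT", "_", "_", "2", "punct"),
    "",
])


def read_sentences(conllu_text):
    """Yield each sentence as a list of column lists (syntactic words only)."""
    sentence = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():       # skip range IDs and empty nodes
                sentence.append(cols)
    if sentence:
        yield sentence


def count_core_arguments(conllu_text):
    """Count nsubj/obj links from a VERB parent to a NOUN or PRON child."""
    counts = Counter()
    for sentence in read_sentences(conllu_text):
        upos_of = {cols[0]: cols[3] for cols in sentence}
        for cols in sentence:
            child_upos, head, deprel = cols[3], cols[6], cols[7]
            if (deprel in ("nsubj", "obj")
                    and child_upos in ("NOUN", "PRON")
                    and upos_of.get(head) == "VERB"):
                counts[(deprel, f"VERB--{child_upos}")] += 1
    return counts


for (deprel, pattern), n in sorted(count_core_arguments(SAMPLE).items()):
    print(deprel, pattern, n)
```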
Relations Overview
- This corpus uses 9 relation subtypes: acl:relcl, advmod:emph, compound:lvc, compound:svc, nmod:poss, nsubj:outer, nsubj:pass, obl:cau, obl:tmod
- The following 8 relation types are not used in this corpus at all: iobj, expl, dislocated, clf, list, goeswith, reparandum, dep
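The two counts above can be cross-checked against the relation inventory listed under Relations. The sketch below takes that inventory as given, separates the subtyped labels, and compares the remaining base labels with the 37 universal relations defined by UD v2. The variable names are illustrative assumptions.

```python
# The 37 universal dependency relations of UD v2.
UNIVERSAL = set("""
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj
dep det discourse dislocated expl fixed flat goeswith iobj list mark nmod
nsubj nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split())

# Relation labels used in this treebank, copied from the Relations list.
USED = """
acl acl:relcl advcl advmod advmod:emph amod appos aux case cc ccomp compound
compound:lvc compound:svc conj cop csubj det discourse fixed flat mark nmod
nmod:poss nsubj nsubj:outer nsubj:pass nummod obj obl obl:cau obl:tmod orphan
parataxis punct root vocative xcomp
""".split()

# A subtype is any label of the form base:subtype.
subtypes = sorted(label for label in USED if ":" in label)

# A universal relation counts as used if it occurs bare or as the base of a subtype.
used_bases = {label.split(":")[0] for label in USED}
unused = sorted(UNIVERSAL - used_bases)

print(len(subtypes), subtypes)
print(len(unused), unused)
```

The result agrees with the overview: 9 subtypes, and 8 universal relations (clf, dep, dislocated, expl, goeswith, iobj, list, reparandum) that do not occur in the treebank.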