
This page pertains to UD version 2.

UD for Old Turkish

On this page, Old Turkish refers to the historical Turkic language identified by the ISO 639-3 code otk; Old Turkic refers to one of the scripts in which Old Turkish was written. The currently released UD treebank is based on Old Turkic script texts, but the language documentation is intentionally written so that it can also accommodate future Old Uyghur and Karakhanid material written in other scripts. Old Turkish should therefore not be annotated by simply projecting the habits of present-day Turkish or of another modern Turkic treebank onto an older corpus.

The key language-specific point is that the data are script-faithful and syntactically fine-grained: many elements that look like bound morphology from a modern whitespace-based perspective are separate syntactic tokens in the current UD analysis.

Tokenization and Word Segmentation

Orthography

Morphology

The currently released treebank provides UPOS tags and basic dependency relations, but it does not provide lemmas, XPOS tags, or morphological features. For that reason, this page documents the current segmentation, UPOS, and syntactic decisions rather than a full Old Turkish feature inventory.
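In CoNLL-U terms, the absence of lemmas, XPOS tags, and features means that the LEMMA, XPOS, and FEATS columns should all hold the underscore placeholder. The following is a minimal sketch of such a check, assuming the standard ten-column CoNLL-U layout; the sample sentence and the function name `unexpected_annotations` are hypothetical, and the word forms are placeholders rather than real Old Turkish.

```python
# Hypothetical two-token sentence; FORM values are placeholders, not real Old Turkish.
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tw1\t_\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tw2\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
)


def unexpected_annotations(conllu_text):
    """Return (token_id, column, value) for every non-empty LEMMA, XPOS, or FEATS cell."""
    # Zero-based positions of the three columns in a CoNLL-U token line.
    columns = {2: "LEMMA", 4: "XPOS", 5: "FEATS"}
    problems = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        fields = line.split("\t")
        if len(fields) != 10:
            continue  # not a well-formed token line
        for index, name in columns.items():
            if fields[index] != "_":
                problems.append((fields[0], name, fields[index]))
    return problems


print(unexpected_annotations(SAMPLE))  # prints []
```

A non-empty result would indicate either a newer release that has started to add these layers or a file that does not follow the current conventions.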

Tags

The current treebank uses 13 UPOS tags: ADJ, ADP, ADV, AUX, CCONJ, DET, NOUN, NUM, PRON, PROPN, PUNCT, SCONJ, and VERB. The tags PART, INTJ, SYM, and X are currently unused in the released data.
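The tag inventory above can be checked mechanically. The sketch below counts UPOS tags in a CoNLL-U string and reports any tag outside the 13 listed; it assumes the standard ten-column layout, and the sample sentence, the placeholder word forms, and the function name `upos_report` are illustrative only.

```python
from collections import Counter

# The 13 UPOS tags used in the current release, as listed on this page.
USED_UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "NOUN",
    "NUM", "PRON", "PROPN", "PUNCT", "SCONJ", "VERB",
}

# Hypothetical sentence; FORM values are placeholders, not real Old Turkish.
SAMPLE = (
    "1\tw1\t_\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tw2\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t_\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)


def upos_report(conllu_text):
    """Return (tag counts, sorted list of tags outside the current inventory)."""
    counts = Counter()
    for line in conllu_text.splitlines():
        fields = line.split("\t")
        # Keep only ordinary token lines; this skips comments, blank lines,
        # multiword-token ranges (e.g. "1-2"), and empty nodes (e.g. "1.1").
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        counts[fields[3]] += 1
    unexpected = sorted(set(counts) - USED_UPOS)
    return counts, unexpected


counts, unexpected = upos_report(SAMPLE)
print(dict(counts), unexpected)
```

Any tag in the second return value, such as PART, INTJ, SYM, or X, would fall outside what the current release uses.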

Features

No morphological feature inventory is released for Old Turkish at present, and all FEATS values in the current treebank are empty. This is not a documentation omission. It reflects the current state of the treebank, where distinctions that might later become features are represented mainly through tokenization, UPOS tags, and dependencies.

When features are introduced, they should follow the universal UD feature inventory wherever possible. Language-specific features or values should be added only when they are necessary for Old Turkish and can be documented across the relevant corpora and scripts. Future feature work must also be coordinated with the tokenization policy above: categories currently represented as independent syntactic tokens should not automatically be duplicated as features on the lexical host.

Syntax

Old Turkish is predominantly head-final and suffixing. Embedded clauses normally precede the main clause, but the rich morphology permits non-canonical order, notably in pragmatically marked or translated material. The released treebank uses only universal dependency relations and no language-specific relation subtypes.
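One rough way to quantify "predominantly head-final" in annotated data is to measure how often a dependent precedes its head. The sketch below is an illustrative heuristic, not a method described by the treebank authors: it assumes standard CoNLL-U input, ignores root and punct attachments, and uses a hypothetical sample with placeholder word forms.

```python
# Hypothetical sentence; FORM values are placeholders, not real Old Turkish.
SAMPLE = (
    "1\tw1\t_\tADJ\t_\t_\t2\tamod\t_\t_\n"
    "2\tw2\t_\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tw3\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
)


def head_final_share(conllu_text):
    """Fraction of non-root, non-punct tokens whose head occurs later in the sentence."""
    head_later = 0
    total = 0
    for line in conllu_text.splitlines():
        fields = line.split("\t")
        # Keep ordinary token lines with a numeric HEAD value.
        if len(fields) != 10 or not fields[0].isdigit() or not fields[6].isdigit():
            continue
        token_id, head, deprel = int(fields[0]), int(fields[6]), fields[7]
        if head == 0 or deprel == "punct":
            continue  # the root has no head; punctuation says little about word order
        total += 1
        if head > token_id:
            head_later += 1
    return head_later / total if total else 0.0


print(head_final_share(SAMPLE))  # prints 1.0
```

Because ID and HEAD are both sentence-local, the comparison stays valid across multi-sentence files. A share well below 1.0 in a given text could point to the non-canonical orders mentioned above.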

The current treebank uses the following relations: advcl, advmod, amod, aux, case, cc, ccomp, compound, conj, cop, det, mark, nmod, nsubj, nummod, obj, obl, punct, and root.
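The relation inventory and the no-subtypes policy can be verified in the same spirit. The following sketch flags any DEPREL value that carries a subtype (a colon) or falls outside the 19 relations listed above; it assumes standard CoNLL-U input, and the sample, placeholder word forms, and the function name `deprel_problems` are hypothetical.

```python
# The 19 relations used in the current release, as listed on this page.
USED_DEPRELS = {
    "advcl", "advmod", "amod", "aux", "case", "cc", "ccomp", "compound",
    "conj", "cop", "det", "mark", "nmod", "nsubj", "nummod", "obj", "obl",
    "punct", "root",
}

# Hypothetical sentence; FORM values are placeholders, not real Old Turkish.
SAMPLE = (
    "1\tw1\t_\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tw2\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
)


def deprel_problems(conllu_text):
    """Return (token_id, deprel, reason) for relations outside the current conventions."""
    problems = []
    for line in conllu_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 10 or not fields[0].isdigit():
            continue  # skip comments, blank lines, ranges, and empty nodes
        deprel = fields[7]
        if ":" in deprel:
            problems.append((fields[0], deprel, "subtype"))
        elif deprel not in USED_DEPRELS:
            problems.append((fields[0], deprel, "not in current inventory"))
    return problems


print(deprel_problems(SAMPLE))  # prints []
```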

Treebanks

For the initial annotation rationale and tooling, see Universal Dependencies for Old Turkish.

Diffs

Old Turkish-Clausal

The released treebank is intentionally small, synthetic, and conservative. Compared with the broader plans described in early project documentation, the current release uses 13 UPOS tags rather than a larger projected tag inventory, and it has no lemmas, XPOS tags, morphological features, or relation subtypes.