UD Nenets Tundra
Language: Nenets (code: yrk)
Family: Uralic
This treebank has been part of Universal Dependencies since the UD v2.16 release.
The following people have contributed to making this treebank part of UD: Morgane Bona, Bruno Guillaume, Sylvain Kahane, Aleksandra Miletić, Nikolett Mus, Daniel Zeman.
Repository: UD_Nenets-Tundra
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.18
License: CC BY-SA 4.0
Genre: spoken
Questions, comments? General annotation questions (either Nenets-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on Github. If you want to collaborate, please contact [mus • nikolett (æt) gmail • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
| Annotation | Source |
|---|---|
| Lemmas | annotated manually |
| UPOS | annotated manually, natively in UD style |
| XPOS | not available |
| Features | annotated manually, natively in UD style |
| Relations | annotated manually, natively in UD style |
Description
The Tundra Nenets UD treebank is converted from the Tundra Nenets mSUD treebank. The conversion from mSUD to UD is performed automatically, followed by a comprehensive manual revision to ensure compliance with the UD annotation standards.
The treebank currently comprises 171 manually annotated sentences (approximately 11 minutes of recorded speech). The data were collected during a fieldwork session conducted in Moscow in 2017 with a native speaker of Tundra Nenets (Yamal dialect). The session elicited semi-spontaneous speech through visual stimulus–based tasks, including a modified version of the [HCRC Map Task](https://groups.inf.ed.ac.uk/maptask/maptasknxt.html) and the so-called Pear Story narrative task (Chafe, 1980).
The spoken data were transcribed by the native speaker using standard Tundra Nenets orthography rather than phonetic or IPA-based notation. Sentence segmentation builds on the original transcription and follows a combined prosodic and semantic approach, in which intonational boundaries are treated as sentence boundaries only when they correspond to semantically complete units; otherwise, adjacent material is merged. The recordings were manually time-aligned at both sentence and lexeme levels in Praat to support morphological and syntactic analysis. The transcription reflects normalized forms and does not capture phonetic variation or morphophonological processes (e.g. sandhi), although the audio data are available for reference.
Annotation focuses on spoken-language phenomena relevant to word-level analysis, especially word boundary identification. An inductive approach was adopted, identifying recurrent phenomena in the data and assigning dedicated tags, informed by spoken UD practices and adapted to the typological properties of Tundra Nenets. The dataset consists mainly of narrative monologues and lacks interactional features such as overlap. Spoken phenomena are divided into non-lexical items (e.g. noises, pauses, hesitation markers), assigned discourse relations, and lexical disruptions (e.g. unfinished words, false starts, repetitions), which affect syntactic structure; pauses are treated either as punctuation or discourse elements depending on function.
Lemmatization preserves dialectal variation due to the absence of a unified written standard. Only inflectional morphology is segmented from stems, while derivational morphology is retained; inflectional suffixes remain in their surface forms, and linking vowels are treated as part of the stem. POS tagging and morphological analysis were performed manually, with inflectional features segmented and glossed following established descriptive traditions and a tagset based on the Leipzig Glossing Rules, adapted to Tundra Nenets.
To account for its complex morphology and syntactic encoding, annotation is based on the morphologically enhanced Surface-Syntactic Universal Dependencies (mSUD) framework (Guillaume et al., 2024). Within the UniDive COST Action (CA21167), UD is extended with a layer for Information Structure roles (e.g. topic and focus). In Tundra Nenets, topicality is reflected in agreement: when a verb agrees with a topical object, the object is annotated as ISRole=Top and the corresponding verbal agreement as ISMarker[Top]=Agr.
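A minimal sketch of how these information-structure annotations could be retrieved from a CoNLL-U file. Whether `ISRole` and `ISMarker[Top]` are stored in the FEATS or the MISC column is an assumption here, so the helper checks both. The function names are hypothetical, and the two token lines are invented placeholders, not treebank data.

```python
def parse_kv(column: str) -> dict:
    """Parse a CoNLL-U FEATS or MISC column ('A=b|C=d' or '_') into a dict."""
    if column == "_":
        return {}
    return dict(item.split("=", 1) for item in column.split("|") if "=" in item)


def is_annotations(token_line: str) -> dict:
    """Collect information-structure keys from one 10-column token line."""
    cols = token_line.rstrip("\n").split("\t")
    # Assumption: the keys may sit in FEATS (column 6) or MISC (column 10).
    merged = {**parse_kv(cols[5]), **parse_kv(cols[9])}
    return {k: v for k, v in merged.items() if k.startswith(("ISRole", "ISMarker"))}


# Hypothetical placeholder tokens: a topical object and the verb agreeing with it.
SAMPLE = [
    "1\tobjform\tobjlemma\tNOUN\t_\tNumber=Sing\t2\tobj\t_\tISRole=Top",
    "2\tverbform\tverblemma\tVERB\t_\tNumber=Sing|Person=3\t0\troot\t_\tISMarker[Top]=Agr",
]

for line in SAMPLE:
    print(line.split("\t")[1], is_annotations(line))
```

Checking both columns keeps the sketch usable whichever storage convention a given release follows.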
The original Cyrillic transcription was transliterated into Latin script, taking into account language-specific properties. Conversion from mSUD to UD is performed in two stages, first from mSUD to SUD and then from SUD to UD, using iteratively applied Grew (Guillaume, 2021) rules.
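The idea of applying rewriting rules iteratively can be illustrated with a toy sketch in plain Python. This is neither the Grew API nor the treebank's actual rule set: the `Tree` representation and the functions `reverse_adposition`, `relabel_modifier` and `rewrite_to_fixpoint` are hypothetical, and the two rules are simplified examples of typical SUD-to-UD differences (a noun becoming the head of its adposition, and a relation being relabelled afterwards).

```python
from typing import Any, Callable, Dict, List

Token = Dict[str, Any]          # keys: "upos", "head", "deprel"
Tree = Dict[int, Token]         # token id -> token; head 0 marks the root


def reverse_adposition(tree: Tree) -> bool:
    """ADP --comp:obj--> NOUN becomes NOUN --case--> ADP (one match per call)."""
    for tid, tok in tree.items():
        head = tree.get(tok["head"])
        if tok["deprel"] == "comp:obj" and tok["upos"] == "NOUN" and head and head["upos"] == "ADP":
            # The noun takes over the adposition's attachment ...
            tok["head"], tok["deprel"] = head["head"], head["deprel"]
            # ... and the adposition becomes its case dependent.
            head["head"], head["deprel"] = tid, "case"
            return True
    return False


def relabel_modifier(tree: Tree) -> bool:
    """A NOUN attached to a VERB as 'mod' is relabelled 'obl:mod' (illustrative)."""
    for tok in tree.values():
        head = tree.get(tok["head"])
        if tok["deprel"] == "mod" and tok["upos"] == "NOUN" and head and head["upos"] == "VERB":
            tok["deprel"] = "obl:mod"
            return True
    return False


def rewrite_to_fixpoint(tree: Tree, rules: List[Callable[[Tree], bool]]) -> int:
    """Apply the first matching rule repeatedly until none matches; return the rewrite count."""
    steps = 0
    while any(rule(tree) for rule in rules):
        steps += 1
    return steps


# Invented three-token structure: noun <- adposition <- verb.
tree: Tree = {
    1: {"upos": "NOUN", "head": 2, "deprel": "comp:obj"},
    2: {"upos": "ADP", "head": 3, "deprel": "mod"},
    3: {"upos": "VERB", "head": 0, "deprel": "root"},
}
steps = rewrite_to_fixpoint(tree, [reverse_adposition, relabel_modifier])
print(steps, tree)
```

The second rule only becomes applicable once the first has moved the noun under the verb, which is why the rules are applied in a loop rather than in a single pass.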
This treebank is described in detail in Mus et al. (2025), where further information on annotation decisions and linguistic analysis can be found.
Acknowledgments
The development of this treebank was supported by two research projects: Autogramm: Induction of Descriptive Grammar from Annotated Corpora (ANR-21-CE38-0017), and ThEA: Theoretical and Experimental Approaches to Dialectal Variation and Contact-Induced Change – A Case Study of Tundra Nenets (NKFIH FK 129235). These projects contributed to both the data collection and the creation of the treebank. In addition, this work was supported by COST Action CA21167: Universality, diversity and idiosyncrasy in language technology (UniDive).
References
Chafe, Wallace L. (ed.) 1980. The pear stories: Cognitive, cultural, and linguistic aspects of narrative production. Advances in Discourse Processes, vol. III. Ablex, Norwood, NJ, USA.
Guillaume, Bruno. 2021. Graph matching and graph rewriting: GREW tools for corpus exploration, maintenance and conversion. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pp. 168–175, Online. Association for Computational Linguistics.
Guillaume, Bruno, Gerdes, Kim, Guiller, Kirian, Kahane, Sylvain and Li, Yixuan. 2024. Joint annotation of morphology and syntax in dependency treebanks. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 9568–9577, Torino, Italia. ELRA and ICCL.
Mus, Nikolett, Guillaume, Bruno, Kahane, Sylvain and Zeman, Daniel. 2025. Creating a multi-layer Treebank for Tundra Nenets. In Proceedings of the 10th International Workshop on Computational Linguistics for Uralic Languages, pp. 77–86.
Statistics of UD Nenets Tundra
POS Tags
ADJ – ADP – ADV – AUX – DET – INTJ – NOUN – NUM – PRON – PUNCT – VERB – X
Features
Number – Person – PronType – VerbForm
Relations
acl – advcl – advmod – amod – aux – case – ccomp – cop – csubj – dep – det – discourse – mark – nmod – nmod:poss – nsubj – nsubj:outer – nummod – obj – obl:mod – parataxis – punct – reparandum – root – vocative
Tokenization and Word Segmentation
- This corpus contains 170 sentences and 1272 tokens.
- All tokens in this corpus are followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 22 types of words that contain both letters and punctuation. Examples: ӈацекэко?мэ?, ӈацекэр?мэ?, тохо', ?хан?, ?таремʼ?, ?ёсь?, велосипед?мэ?, велосипедам', грушамда?мэ?, грушидам', иня_няӈыʼ, маˮламбада?мэ?, мюд?, нида?мэ?, тикандоʼ?мэ?, тохо', тяхана?мэ?, ха"маэм, хасава?ва?, Ӈоб?мэ?, ӈацекэкоʼ?мэ?, ӈопой?мэ?
Morphology
Tags
- This corpus uses 12 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, DET, INTJ, NOUN, NUM, PRON, PUNCT, VERB, X
- This corpus does not use the following tags: PROPN, SCONJ, CCONJ, PART, SYM
- This corpus contains 9 lemmas tagged as pronouns (PRON): _, нер-, няби, нянда, та, тикы, тюку, харта, ӈамгэ
- This corpus contains 4 lemmas tagged as determiners (DET): […], няби, ханяӈыˮ, ӈаниʼ
- Out of the above, 1 lemma occurred sometimes as PRON and sometimes as DET: няби
- This corpus contains 3 lemmas tagged as auxiliaries (AUX): ни, тара, ӈa
- Out of the above, 1 lemma occurred sometimes as AUX and sometimes as VERB: ӈa
- There is 1 (de)verbal form:
- Conv
- VERB: ӈохолё, мада, пыдабтамба
Nominal Features
- Plur
- PRON: ваˮ
- Sing
- ADP: нерни, нерниʼ, нянда
- AUX: тара, ӈэвы
- NOUN: марядʼ, махалэянда, нёнда, тарканда, сидеранда, харданда, марядʼ, нёнд, сэвˮни, таркахаюта
- PRON: нерниʼ
- VERB: таня, миманиʼ, танявыˮ, танявэхэˮ, ядваниʼ, яӈговы
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
- Dem
- PRON: тикы, тикар, та, тика, тюку
- 1
- ADP: нерни, нерниʼ
- NOUN: сэвˮни
- PRON: ваˮ, нерниʼ
- VERB: миманиʼ, ядваниʼ
- 2
- NOUN: марядʼ, марядʼ
- 3
- ADP: нянда
- AUX: тара, ӈэвы
- NOUN: махалэянда, нёнда, тарканда, сидеранда, харданда, нёнд, таркахаюта, хэвувнанда, ядувнанда
- VERB: таня, танявыˮ, танявэхэˮ, яӈговы
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop). Example: ӈa.
- This corpus uses 3 lemmas as auxiliaries (aux). Examples: ни, тара, ӈa.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
- VERB--NOUN (38)
- VERB--PRON (2)
- obj
- VERB--NOUN (74)
- VERB--PRON (1)
- VERB-Conv--NOUN (2)
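Counts of this kind can be recomputed from a CoNLL-U file. The following is a minimal sketch rather than the script that generated the figures above: `core_argument_counts` and `parse_feats` are hypothetical helper names, the label format (`VERB--NOUN`, `VERB-Conv--NOUN`) merely imitates the listing, and the sample sentences are invented placeholders.

```python
from collections import Counter


def parse_feats(column):
    """Parse a CoNLL-U FEATS column ('A=b|C=d' or '_') into a dict."""
    if column == "_":
        return {}
    return dict(item.split("=", 1) for item in column.split("|"))


def core_argument_counts(conllu_text, relations=("nsubj", "obj", "iobj")):
    """Count core-argument relations whose parent is a VERB and whose child is a NOUN or PRON."""
    counts = Counter()
    for block in conllu_text.strip().split("\n\n"):
        rows = {}
        for line in block.splitlines():
            cols = line.split("\t")
            # Keep plain word lines only: skips comments, multiword ranges, empty nodes.
            if len(cols) == 10 and cols[0].isdigit():
                rows[cols[0]] = cols
        for cols in rows.values():
            parent = rows.get(cols[6])
            if parent is None or parent[3] != "VERB":
                continue
            if cols[7] in relations and cols[3] in ("NOUN", "PRON"):
                verb_form = parse_feats(parent[5]).get("VerbForm")
                parent_label = f"VERB-{verb_form}" if verb_form else "VERB"
                counts[(cols[7], f"{parent_label}--{cols[3]}")] += 1
    return counts


# Invented placeholder sentences, not treebank data.
SAMPLE = "\n".join([
    "# sent_id = demo-1",
    "1\tsubj\tsubj\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "2\tobj\tobj\tNOUN\t_\t_\t3\tobj\t_\t_",
    "3\tverb\tverb\tVERB\t_\t_\t0\troot\t_\t_",
    "",
    "# sent_id = demo-2",
    "1\tobj\tobj\tPRON\t_\t_\t2\tobj\t_\t_",
    "2\tconv\tconv\tVERB\t_\tVerbForm=Conv\t3\tadvcl\t_\t_",
    "3\tverb\tverb\tVERB\t_\t_\t0\troot\t_\t_",
])

result = core_argument_counts(SAMPLE)
for key, n in sorted(result.items()):
    print(key, n)
```

To run it on the real data, read the treebank's `.conllu` file into a string and pass it to `core_argument_counts`.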
Relations Overview
- This corpus uses 3 relation subtypes: nmod:poss, nsubj:outer, obl:mod
- The following 1 main types are not used alone, they are always subtyped: obl
- The following 14 relation types are not used in this corpus at all: iobj, xcomp, expl, dislocated, appos, clf, conj, cc, fixed, flat, compound, list, orphan, goeswith
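The three figures in this overview follow directly from the relation list given under Relations. The sketch below derives them, assuming the standard inventory of 37 universal relations; `summarize_relations` is a hypothetical helper name, not part of any UD tool.

```python
# The 37 universal dependency relations of UD v2.
UD_MAIN = """acl advcl advmod amod appos aux case cc ccomp clf compound conj cop
csubj dep det discourse dislocated expl fixed flat goeswith iobj list mark nmod
nsubj nummod obj obl orphan parataxis punct reparandum root vocative xcomp""".split()

# Relations listed for this treebank in the Relations section.
USED = """acl advcl advmod amod aux case ccomp cop csubj dep det discourse mark
nmod nmod:poss nsubj nsubj:outer nummod obj obl:mod parataxis punct reparandum
root vocative""".split()


def summarize_relations(used, inventory):
    """Return (subtypes, main types that only occur subtyped, unused main types)."""
    subtypes = sorted(r for r in used if ":" in r)
    bare = {r for r in used if ":" not in r}
    only_subtyped = sorted({r.split(":")[0] for r in subtypes} - bare)
    main_used = {r.split(":")[0] for r in used}
    unused = [r for r in inventory if r not in main_used]
    return subtypes, only_subtyped, unused


subtypes, only_subtyped, unused = summarize_relations(USED, UD_MAIN)
print(len(subtypes), subtypes)
print(len(only_subtyped), only_subtyped)
print(len(unused), unused)
```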