UD Nenets Tundra
Language: Nenets (code: yrk)
Family: Uralic
This treebank has been part of Universal Dependencies since the UD v2.16 release.
The following people have contributed to making this treebank part of UD: Bruno Guillaume, Sylvain Kahane, Nikolett Mus, Daniel Zeman.
Repository: UD_Nenets-Tundra
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.17
License: CC BY-SA 4.0
Genre: spoken
Questions, comments? General annotation questions (either Nenets-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [mus • nikolett (æt) gmail • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
| Annotation | Source |
|---|---|
| Lemmas | annotated manually |
| UPOS | annotated manually, natively in UD style |
| XPOS | not available |
| Features | annotated manually, natively in UD style |
| Relations | annotated manually, natively in UD style |
Description
The Tundra Nenets UD treebank is converted from the Tundra Nenets mSUD treebank. The conversion from mSUD to UD is performed automatically and is followed by a comprehensive manual revision to ensure compliance with the UD annotation standards.
The treebank currently consists of 93 manually annotated sentences (5.6758783 seconds of recorded speech). The data originates from a fieldwork session conducted in Moscow in 2017 with a native speaker of Tundra Nenets, representing the Yamal dialect. The session involved semi-spontaneous speech elicitation using visual stimulus-based tasks, based on a modified version of the HCRC Map Task.
The morphological and syntactic annotation of the original mSUD treebank was created manually. The conversion from mSUD to UD was designed and implemented by Bruno Guillaume.
The transcription of the spoken data was carried out by the speaker and follows the standard orthographic conventions of Tundra Nenets, rather than a phonetic or IPA-based system.
To further support the analysis of prosodic and discourse-related phenomena, the recordings were aligned phonetically using Praat, and relevant features of spoken language were incorporated into the annotation.
The original transcription in Cyrillic script was transliterated into Latin script, taking into account certain linguistic particularities of Tundra Nenets.
Acknowledgments
The development of this treebank was supported by two research projects: Autogramm: Induction of Descriptive Grammar from Annotated Corpora (ANR-21-CE38-0017), and ThEA: Theoretical and Experimental Approaches to Dialectal Variation and Contact-Induced Change – A Case Study of Tundra Nenets (NKFIH FK 129235). These projects contributed to both the data collection and the creation of the treebank.
Statistics of UD Nenets Tundra
POS Tags
ADJ – ADP – ADV – AUX – DET – INTJ – NOUN – NUM – PRON – PUNCT – VERB – X
Features
Number – Person – PronType – VerbForm
Relations
acl – advcl – advmod – amod – aux – case – ccomp – cop – csubj – dep – det – discourse – mark – nmod – nmod:poss – nsubj – nsubj:outer – nummod – obj – obl:mod – parataxis – punct – reparandum – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 170 sentences and 1272 tokens.
- All tokens in this corpus are followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 22 types of words that contain both letters and punctuation. Examples: ӈацекэко?мэ?, ӈацекэр?мэ?, тохо', ?хан?, ?таремʼ?, ?ёсь?, велосипед?мэ?, велосипедам', грушамда?мэ?, грушидам', иня_няӈыʼ, маˮламбада?мэ?, мюд?, нида?мэ?, тикандоʼ?мэ?, тохо', тяхана?мэ?, ха"маэм, хасава?ва?, Ӈоб?мэ?, ӈацекэкоʼ?мэ?, ӈопой?мэ?
Morphology
Tags
- This corpus uses 12 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, DET, INTJ, NOUN, NUM, PRON, PUNCT, VERB, X
- This corpus does not use the following tags: PROPN, SCONJ, CCONJ, PART, SYM
- This corpus contains 8 lemmas tagged as pronouns (PRON): _, нер-, нянда, та, тикы, тюку, харта, ӈамгэ
- This corpus contains 4 lemmas tagged as determiners (DET): , няби, ханяӈыˮ, ӈаниʼ
- This corpus contains 3 lemmas tagged as auxiliaries (AUX): ни, тара, ӈa
- Out of the above, 1 lemma occurred sometimes as AUX and sometimes as VERB: ӈa
- There is 1 (de)verbal form:
  - Conv
    - VERB: ӈохолё, мада, пыдабтамба
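The tag inventory above can be recomputed from the treebank's CoNLL-U file. The following is a minimal sketch, not the script that generated this page: the helper names (`read_words`, `upos_report`, `row`) are hypothetical, and the three-word sample is a toy placeholder, not data from the treebank.

```python
from collections import Counter

# The 17 universal POS tags of UD v2.
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def read_words(conllu_text):
    """Yield the 10 CoNLL-U columns of every syntactic word.

    Comment lines, blank lines, multiword-token ranges (ID like 1-2)
    and empty nodes (ID like 1.1) are skipped.
    """
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) == 10 and cols[0].isdigit():
            yield cols


def upos_report(conllu_text):
    """Return (counts per tag, sorted used tags, sorted unused tags)."""
    counts = Counter(cols[3] for cols in read_words(conllu_text))
    used = sorted(counts)
    unused = sorted(UPOS_TAGS - set(counts))
    return counts, used, unused


def row(idx, form, upos, head, deprel):
    # Hypothetical helper: builds one 10-column CoNLL-U line for the toy sample.
    return "\t".join(
        [str(idx), form, form, upos, "_", "_", str(head), deprel, "_", "_"]
    )


# Toy placeholder sentence (assumption: NOT taken from the treebank).
TOY = "\n".join([
    "# sent_id = toy-1",
    row(1, "foo", "NOUN", 2, "nsubj"),
    row(2, "bar", "VERB", 0, "root"),
    row(3, ".", "PUNCT", 2, "punct"),
    "",
])

counts, used, unused = upos_report(TOY)
print("used:", used)
print("unused:", unused)
```

Run on the actual treebank file, `used` and `unused` should correspond to the two tag lists reported in this section.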
Nominal Features
- Number
  - Plur
    - PRON: ваˮ
  - Sing
    - ADP: нерни, нерниʼ, нянда
    - AUX: тара, ӈэвы
    - NOUN: марядʼ, махалэянда, нёнда, тарканда, сидеранда, харданда, марядʼ, нёнд, сэвˮни, таркахаюта
    - PRON: нерниʼ
    - VERB: таня, миманиʼ, танявыˮ, танявэхэˮ, ядваниʼ, яӈговы
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
- PronType
  - Dem
    - PRON: тикы, тикар, та, тика, тюку
- Person
  - 1
    - ADP: нерни, нерниʼ
    - NOUN: сэвˮни
    - PRON: ваˮ, нерниʼ
    - VERB: миманиʼ, ядваниʼ
  - 2
    - NOUN: марядʼ, марядʼ
  - 3
    - ADP: нянда
    - AUX: тара, ӈэвы
    - NOUN: махалэянда, нёнда, тарканда, сидеранда, харданда, нёнд, таркахаюта, хэвувнанда, ядувнанда
    - VERB: таня, танявыˮ, танявэхэˮ, яӈговы
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop). Example: ӈa.
- This corpus uses 3 lemmas as auxiliaries (aux). Examples: ни, тара, ӈa.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (38)
  - VERB--PRON (3)
- obj
  - VERB--NOUN (70)
  - VERB--PRON (1)
  - VERB-Conv--NOUN (2)
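Counts of this kind can be derived by pairing each dependent with its head inside a sentence. The sketch below is one possible way to do it, under stated assumptions: the function names and the toy sentences are hypothetical, relation labels are matched exactly (so `nsubj:outer` is not counted as `nsubj`), and the split by verb form seen in `VERB-Conv--NOUN` is not reproduced.

```python
from collections import Counter


def sentences(conllu_text):
    """Yield each sentence as a dict: word ID -> (UPOS, HEAD, DEPREL)."""
    words = {}
    # The extra "" guarantees that the last sentence is flushed.
    for line in conllu_text.splitlines() + [""]:
        if not line.strip():
            if words:
                yield words
                words = {}
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        # Keep only regular words with a numeric ID and a numeric HEAD.
        if len(cols) == 10 and cols[0].isdigit() and cols[6].isdigit():
            words[int(cols[0])] = (cols[3], int(cols[6]), cols[7])


def core_argument_pairs(conllu_text, relations=("nsubj", "obj")):
    """Count VERB parents with NOUN or PRON children, per relation."""
    counts = Counter()
    for words in sentences(conllu_text):
        for upos, head, deprel in words.values():
            parent = words.get(head)  # None for the root (HEAD = 0)
            if (
                deprel in relations
                and parent is not None
                and parent[0] == "VERB"
                and upos in ("NOUN", "PRON")
            ):
                counts[(deprel, f"VERB--{upos}")] += 1
    return counts


def row(idx, form, upos, head, deprel):
    # Hypothetical helper: builds one 10-column CoNLL-U line for the toy sample.
    return "\t".join(
        [str(idx), form, form, upos, "_", "_", str(head), deprel, "_", "_"]
    )


# Two toy placeholder sentences (assumption: NOT taken from the treebank).
TOY = "\n".join([
    "# sent_id = toy-1",
    row(1, "foo", "NOUN", 3, "nsubj"),
    row(2, "baz", "NOUN", 3, "obj"),
    row(3, "bar", "VERB", 0, "root"),
    "",
    "# sent_id = toy-2",
    row(1, "qux", "PRON", 2, "nsubj"),
    row(2, "bar", "VERB", 0, "root"),
    "",
])

pairs = core_argument_pairs(TOY)
for (deprel, pattern), n in sorted(pairs.items()):
    print(deprel, pattern, n)
```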
Relations Overview
- This corpus uses 3 relation subtypes: nmod:poss, nsubj:outer, obl:mod
- The following main type is never used alone; it is always subtyped: obl
- The following 13 relation types are not used in this corpus at all: iobj, expl, dislocated, appos, clf, conj, cc, fixed, flat, compound, list, orphan, goeswith
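The three statements in this overview follow from the label inventory in the Relations section together with the 37 universal relations of UD v2. A minimal sketch of that derivation, with a hypothetical function name (`relation_overview`) and the label list copied from the Relations section of this page:

```python
# The 37 universal dependency relations of UD v2.
UNIVERSAL_RELATIONS = {
    "acl", "advcl", "advmod", "amod", "appos", "aux", "case", "cc", "ccomp",
    "clf", "compound", "conj", "cop", "csubj", "dep", "det", "discourse",
    "dislocated", "expl", "fixed", "flat", "goeswith", "iobj", "list", "mark",
    "nmod", "nsubj", "nummod", "obj", "obl", "orphan", "parataxis", "punct",
    "reparandum", "root", "vocative", "xcomp",
}


def relation_overview(labels):
    """Return (subtypes, main types that only occur subtyped, unused types)."""
    labels = set(labels)
    # A subtype is any label of the form main:sub.
    subtypes = sorted(label for label in labels if ":" in label)
    # Main types that occur at all, with or without a subtype.
    main_used = {label.split(":", 1)[0] for label in labels}
    # Main types that never appear as a bare label.
    always_subtyped = sorted(m for m in main_used if m not in labels)
    unused = sorted(UNIVERSAL_RELATIONS - main_used)
    return subtypes, always_subtyped, unused


# Labels as listed in the Relations section of this page.
LABELS = (
    "acl advcl advmod amod aux case ccomp cop csubj dep det discourse mark "
    "nmod nmod:poss nsubj nsubj:outer nummod obj obl:mod parataxis punct "
    "reparandum root vocative xcomp"
).split()

subtypes, always_subtyped, unused = relation_overview(LABELS)
print(subtypes)         # ['nmod:poss', 'nsubj:outer', 'obl:mod']
print(always_subtyped)  # ['obl']
print(len(unused))      # 13
```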