This page pertains to UD version 2.

UD Slovenian SST

Language: Slovenian (code: sl)
Family: Indo-European, Slavic

This treebank has been part of Universal Dependencies since the UD v1.3 release.

The following people have contributed to making this treebank part of UD: Kaja Dobrovoljc, Joakim Nivre.

Repository: UD_Slovenian-SST
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.14

License: CC BY-NC-SA 4.0

Genre: spoken

Questions, comments? General annotation questions (either Slovenian-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [kaja • dobrovoljc (æt) ff • uni-lj • si]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation Source
Lemmas: annotated manually in non-UD style, automatically converted to UD
UPOS: annotated manually in non-UD style, automatically converted to UD
XPOS: annotated manually
Features: annotated manually in non-UD style, automatically converted to UD
Relations: annotated manually, natively in UD style

Description

The Spoken Slovenian Treebank (SST) is a manually annotated collection of transcribed audio recordings featuring spontaneous speech in various everyday situations. It includes 344 unique speech events (documents) amounting to approximately 10 hours of speech, encompassing a total of 6,104 utterances and 76,341 tokens.

The Spoken Slovenian Treebank (SST) is a manually annotated sample of the GOS reference corpus of spoken Slovenian. It contains transcribed audio recordings of monologic, dialogic and multi-party spontaneous speech in different everyday situations, balanced so as to be representative of speaker demographics (sex, age, region, education), channels (TV, radio, telephone, personal contact) and communication settings (TV and radio shows, lectures, meetings, consultations, services, conversations between friends etc.).

The spelling, tokenization and segmentation principles follow the transcription guidelines of the reference corpus (Verdonik et al. 2013), with the syntactic trees spanning individual utterances (semantically, syntactically and acoustically delimited units, roughly corresponding to written-like sentences). The annotation has been performed on top of normalized transcriptions, i.e. words with standardized spelling. To accommodate the structural and pragmatic particularities of spoken language data, such as self-repairs, fillers, discourse markers and parentheticals, we relied on the guidelines proposed by Dobrovoljc and Nivre (2016) and Dobrovoljc (2022).
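As a rough illustration of how such spoken-language phenomena can be located in the data, the sketch below counts a few relation labels that appear in this treebank's relation inventory (reparandum, discourse, discourse:filler, parataxis:discourse, parataxis:restart). The toy utterance and its analysis are invented for the example and are not taken from the treebank; the function name `count_relations` is likewise hypothetical.

```python
from collections import Counter

# Relation labels from this treebank's inventory that typically mark
# spoken-language phenomena (the selection is an assumption of this sketch).
SPOKEN_RELATIONS = {
    "reparandum",
    "discourse",
    "discourse:filler",
    "parataxis:discourse",
    "parataxis:restart",
}


def count_relations(conllu_text, wanted=SPOKEN_RELATIONS):
    """Count how often each wanted DEPREL occurs in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        deprel = cols[7]  # DEPREL is the eighth CoNLL-U column
        if deprel in wanted:
            counts[deprel] += 1
    return counts


# Invented toy utterance ("uh I I go"), not an actual treebank sentence.
TOY = "\n".join([
    "# text = eee jaz jaz grem",
    "\t".join(["1", "eee", "eee", "INTJ", "_", "_", "4", "discourse:filler", "_", "_"]),
    "\t".join(["2", "jaz", "jaz", "PRON", "_", "_", "3", "reparandum", "_", "_"]),
    "\t".join(["3", "jaz", "jaz", "PRON", "_", "_", "4", "nsubj", "_", "_"]),
    "\t".join(["4", "grem", "iti", "VERB", "_", "_", "0", "root", "_", "_"]),
])

counts = count_relations(TOY)
for label, n in sorted(counts.items()):
    print(label, n)
```

To run this over the real data, the toy string would be replaced by the contents of one of the treebank's CoNLL-U files.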

As of UD release v2.14 in May 2024, the original version of the SST UD treebank (Dobrovoljc and Nivre 2016) has been partially revised and substantially extended with new data from GOS v2 (Verdonik et al. 2024), such as parliamentary debates, round tables and online events. The latest version of the SST treebank thus includes 6,104 utterances (76,341 tokens), produced by 676 speakers in 344 different speech events (48% public and 52% non-public tokens), amounting to approximately 10 hours of recordings.

The train-dev-test split has been randomized at the document level. The CoNLL-U files include links to the original audio recordings and the GOS speaker and event IDs, which can be used to retrieve additional metadata from the original GOS corpus, such as speaker demographics, speech event details or transcribed prosody markers.
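A minimal sketch of how this sentence-level metadata might be read, assuming it is stored in the usual CoNLL-U `# key = value` comment lines. The key names `speaker_id` and `sound_url`, the ID values and the URL in the sample are placeholders made up for illustration; the actual names should be checked against the treebank files.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_conllu(lines: Iterable[str]) -> Iterator[Tuple[Dict[str, str], List[List[str]]]]:
    """Yield (metadata, token_rows) for each sentence block in CoNLL-U input."""
    meta: Dict[str, str] = {}
    rows: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence block.
            if meta or rows:
                yield meta, rows
            meta, rows = {}, []
        elif line.startswith("#"):
            # Sentence-level comment; keep only "key = value" pairs.
            key, sep, value = line[1:].partition("=")
            if sep:
                meta[key.strip()] = value.strip()
        else:
            rows.append(line.split("\t"))
    if meta or rows:  # input may not end with a blank line
        yield meta, rows


# Hypothetical sample: the metadata keys and values below are placeholders.
SAMPLE = "\n".join([
    "# sent_id = example-doc.s1",
    "# speaker_id = SPK-0001",
    "# sound_url = https://example.org/audio/example-doc.s1.mp3",
    "# text = dober dan",
    "\t".join(["1", "dober", "dober", "ADJ", "_", "_", "2", "amod", "_", "_"]),
    "\t".join(["2", "dan", "dan", "NOUN", "_", "_", "0", "root", "_", "_"]),
])

sentences = list(read_conllu(SAMPLE.splitlines()))
for meta, rows in sentences:
    print(meta.get("speaker_id"), meta.get("sound_url"), len(rows))
```

The IDs collected this way could then serve as lookup keys into the GOS corpus metadata; that lookup is not shown here because it depends on how the GOS data is distributed.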

Acknowledgments

We wish to thank all the collaborators who have helped with dependency annotation (Nives Hüll, Karolina Zgaga, Luka Terčon, Matija Škofljanec), JOS-MTE lemmatization and morphological annotation (Jaka Čibej, Tina Munda, Matija Škofljanec), data sampling (Darinka Verdonik, Nikola Ljubešić, Peter Rupnik), automatic pre-annotation (Luka Krsnik), JOS-to-UD morphology conversion (Jaka Čibej), and guidelines consulting (Joakim Nivre). This work was financially supported by the Slovenian Research and Innovation Agency (grant no. Z6-4617 - A Treebank-Driven Approach to the Study of Spoken Slovenian, Young Researcher Programme 2013) and IC1207 COST Action PARSEME.

References

@inproceedings{dobrovoljc-nivre-2016-universal,
title = "The {U}niversal {D}ependencies Treebank of Spoken {S}lovenian",
author = "Dobrovoljc, Kaja and Nivre, Joakim",
booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation ({LREC}'16)",
year = "2016",
publisher = "European Language Resources Association (ELRA)",
url = "https://aclanthology.org/L16-1248",
pages = "1566--1573",
}

Other

Changelog

2024-04-11 v2.14
* Extended original dataset with 2,916 new sentences (46,853 tokens)
* Revised original dataset to implement guideline changes (e.g. reparandum and discourse)
* Removed conj:extend label
* Added Gos2.1 document/sentence/token IDs for easier lookup
* Changed license to CC-BY-SA
* Updated readme


2023-04-12 v2.12
* Added metadata information on speaker ID and soundfile URL
* Renamed sentence IDs to comply with the GOS 2.0 nomenclature
* Corrected mistakes pertaining to Reflex and Polarity features
* Corrected inconsistent UPOS tags for non-lexical tokens (all PUNCT)
* Corrected some minor errors in manual annotation
* Removed old msd info from MISC and renamed 'word' to 'pronunciation'

2022-04-20 v2.10
* Manual relabelling of the few examples raising validation errors, mostly from goeswith to fixed

2019-10-30 v2.5
* Fixed legacy validation errors, i.e.:
  * Re-tagged the [gap]-like punctuation from X to PUNCT
  * Re-attached the [gap]-like punctuation causing non-projectivity
  * Re-attached leaves of unlike parents
* Fixed random mistakes in annotation

2015-03-15 v2.2
* Manual corrections of some mistakes in previous versions
* New (text-level) data randomization
* Resizing of train-test datasets (in accordance with CoNLL ST 2018)

2015-01-30 v2.0
* Manual and automatic conversions from UDv1 to UDv2 guidelines
* Manual corrections of some mistakes in previous versions
* Resizing of train-dev-test (in accordance with CoNLL ST 2017 requirements)
* Random utterance shuffling to ensure more representative genre distributions

Statistics of UD Slovenian SST

POS Tags

ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, SCONJ, VERB, X

Features

Abbr, Animacy, Aspect, Case, Definite, Degree, Foreign, Gender, Gender[psor], Mood, Number, Number[psor], NumForm, NumType, Person, Polarity, Poss, PronType, Reflex, Tense, Typo, Variant, VerbForm

Relations

acl, advcl, advmod, amod, appos, aux, case, cc, cc:preconj, ccomp, conj, cop, csubj, dep, det, discourse, discourse:filler, dislocated, expl, fixed, flat, flat:foreign, flat:name, goeswith, iobj, mark, nmod, nsubj, nummod, obj, obl, orphan, parataxis, parataxis:discourse, parataxis:restart, reparandum, root, vocative, xcomp

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

Verbs with Reflexive Core Objects

Relations Overview