This page pertains to UD version 2.

Introduction

The Swedish Sign Language UD treebank is based on the Swedish Sign Language Corpus (SSLC), which consists of video recordings and accompanying annotation files (Mesch et al., 2012b; Mesch et al., 2015). The corpus consists of naturalistic, dyadic signing; most of the data comes from conversational texts, and a smaller part from elicited narratives. In total, 300 texts have been recorded, distributed over 42 different signers (Mesch, 2012; Mesch et al., 2012a). Annotation is done with the ELAN software (Wittenburg et al., 2006), which produces annotation cells on tiers that are time-aligned with the video files.

The most recent (2016) update of the SSLC contains 48,690 sign tokens, spanning just over 6 hours of video data, distributed across 85 files and 42 signers. The annotation files contain six main tiers: four for the sign glosses (one for each hand of each of the two signers) and two for written translations into Swedish (one tier per signer) (Mesch and Wallin, 2015). In addition, the SSLC contains part-of-speech (PoS) tags that are attached to the gloss annotations on the sign-gloss tier (e.g., “PRO1[PN]”). The tags were first assigned by a semiautomatic method applied to an earlier version of the corpus (Östling et al., 2015); subsequent expansions have been tagged manually.

Recently, a small part of the corpus has also been syntactically annotated. This process involves segmenting the sign stream into clauses and annotating predicates, arguments and optional modifiers within these clauses (Börstell et al., 2016). The Swedish Sign Language UD treebank is based on a semiautomatic mapping of the part-of-speech tags and a manual conversion of the syntactic annotation to the corresponding UD categories and relations.
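As a rough illustration of the gloss-plus-tag format mentioned above (e.g., “PRO1[PN]”), the following Python sketch splits an annotation cell into its gloss and tag and looks the tag up in a small mapping table. The function names and the mapping entries are assumptions made for this example only; the actual tagset and its mapping to UD are defined by the project's own conversion scripts.

```python
import re

# Illustrative only: a tiny, assumed tag mapping. The real SSLC tagset and its
# UD mapping are defined by the project's conversion scripts, not here.
ASSUMED_UPOS_MAP = {"PN": "PRON", "NN": "NOUN", "VB": "VERB"}

# A gloss followed by a bracketed tag, e.g. "PRO1[PN]".
GLOSS_TAG_RE = re.compile(r"^(?P<gloss>.+)\[(?P<tag>[^\[\]]+)\]$")


def split_gloss(cell):
    """Split an annotation cell into (gloss, tag); tag is None if absent."""
    text = cell.strip()
    match = GLOSS_TAG_RE.match(text)
    if match is None:
        return text, None
    return match.group("gloss"), match.group("tag")


def to_upos(tag):
    """Map a corpus tag to a UPOS tag, falling back to "X" when unknown."""
    return ASSUMED_UPOS_MAP.get(tag, "X")


gloss, tag = split_gloss("PRO1[PN]")
print(gloss, tag, to_upos(tag))
```

Falling back to "X" for unknown tags is a design choice of this sketch; a real conversion would more likely report unmapped tags so that they can be checked manually.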

Source of annotations

Manual annotation is performed in ELAN. The ud-swl project repository contains the original annotation files and conversion scripts, along with preliminary annotation guidelines (currently in Swedish only).

The following table summarizes the origin and checking status of each column of the CoNLL-U data.

Column     Status
ID         “Sentence” segmented according to dependency tree, “tokenization” from original SSLC annotation.
FORM       From the original SSLC annotation.
LEMMA      (currently unused)
UPOSTAG    Mapped from XPOSTAG.
XPOSTAG    From Östling et al. (2015).
FEATS      (currently unused)
HEAD       Manual annotation.
DEPREL     Manual annotation.
DEPS       (currently unused)
MISC       (currently unused)
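To make the column layout concrete, here is a minimal Python sketch that reads one tab-separated CoNLL-U token line into the ten columns listed above and reports which of them are empty. The function name `parse_token_line` and the example row are made up for illustration; the row is not taken from the treebank.

```python
# Column names as listed in the table above, in CoNLL-U order.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_token_line(line):
    """Turn one tab-separated token line into a dict; "_" becomes None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    return {name: (None if value == "_" else value)
            for name, value in zip(COLUMNS, fields)}


# Made-up example row, not taken from the treebank.
example = "1\tPRO1\t_\tPRON\tPN\t_\t2\tnsubj\t_\t_"
token = parse_token_line(example)

# Columns left empty in this row.
unused = [name for name, value in token.items() if value is None]
print(unused)  # ['LEMMA', 'FEATS', 'DEPS', 'MISC']
```

In a row like this one, the empty columns are exactly those marked as currently unused in the table.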

References