home edit page issue tracker

This page pertains to UD version 2.

UD Norwegian Nynorsk

Language: Norwegian (code: no)
Family: Indo-European, Germanic

This treebank has been part of Universal Dependencies since the UD v2.0 release.

The following people have contributed to making this treebank part of UD: Lilja Øvrelid, Fredrik Jørgensen, Petter Hohle.

Repository: UD_Norwegian-Nynorsk
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.5

License: CC BY-SA

Genre: news, blog, nonfiction

Questions, comments? General annotation questions (either Norwegian-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on Github. If you want to collaborate, please contact [liljao (æt) ifi • uio • no]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation Source
Lemmas annotated manually in non-UD style, automatically converted to UD
UPOS annotated manually in non-UD style, automatically converted to UD
XPOS not available
Features annotated manually in non-UD style, automatically converted to UD
Relations annotated manually in non-UD style, automatically converted to UD

Description

The Norwegian UD treebank is based on the Nynorsk section of the Norwegian Dependency Treebank (NDT), which is a syntactic treebank of Norwegian. NDT has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.

NDT was developed 2011-2014 at the National Library of Norway in collaboration with the Text Laboratory and the Department of Informatics at the University of Oslo. NDT contains around 300,000 tokens taken from a variety of genres. The treebank texts have been manually annotated for morphosyntactic information. The morphological annotation mainly follows the Oslo-Bergen Tagger. The syntactic annotation follows, to a large extent, the Norwegian Reference Grammar, as well as a dependency annotation scheme formulated at the outset of the annotation project and iteratively refined throughout the construction of the treebank. For more information, see the references below.

DATA SPLITS

In creating the data splits, care has been taken to preserve contiguous texts in the different splits and also to keep a fair balance of genres in each of the splits. Petter Hohle created the splits for the Norwegian UD treebank. The splits were created by concatenating the following files (available with the distribution of NDT):

Train:

blogg-nn001_0000 – blogg-nn003_0001 dot001_0000 – dot014_0007 firda-nn001_0000 – firda-nn004_0005 kk-nn001_0000 – kk-nn006_0002 mom001_0000 – mom003_0004 st-nn001_0000 – st-nn002_0000 vtb-nn001_0000 – vtb-nn006_0004

Dev:

blogg-nn003_0002 dot014_0008 – dot015_0002 firda-nn004_0006 – firda-nn005_0002 kk-nn006_0003 – kk-nn007_0002 mom003_0005 st-nn002_0001 vtb-nn006_0005 – vtb-nn007_0001

Test:

blogg-nn003_0003 dot015_0003 – dot016_0004 firda-nn005_0003 – firda-nn005_0006 kk-nn007_0003 – kk-nn007_0006 mom003_0006 st-nn002_0002 vtb-nn007_0002 – vtb-nn007_0004

BASIC STATISTICS

TOKENIZATION

White space always indicates a token boundary and punctuation constitute separate tokens, except:

The treebank does not contain multiword tokens.

MORPHOLOGY

The PoS-tags follow the universal tag set and does not add any language-specific PoS-tags. The morphological features follow the Oslo-Bergen Tagger scheme (Hagen et. al., 2000). PoS-tags and morphological features were converted automatically to the UD scheme.

SYNTAX

The syntactic annotation in the Norwegian UD treebank conforms to the UD guidelines, adding language-specific relations for relative clauses (acl:relcl) and verb particles (compound:prt). The annotation has been automatically converted to UD from the original dependency scheme described in Solberg et. al. (2014) and further described in the NDT guidelines (Kinn et. al.). The conversion has not been manually checked. There are a few known discrepancies from UD:

REFERENCES

Kristin Hagen, Janne Bondi Johannessen and Anders Nøklestad: “A Constraint-based Tagger for Norwegian”. 2000. Proceedings of the 17th Scandinavian Conference in Linguistics.

Kari Kinn, Per Erik Solberg and Pål Kristian Eriksen. “NDT Guidelines for Morphological Annotation”. National Library Tech Report.

Per Erik Solberg, Arne Skjærholt, Lilja Øvrelid, Kristin Hagen and Janne Bondi Johannessen. 2014.”The Norwegian Dependency Treebank”, Proceedings of LREC 2014, Reykjavik

Velldal, Erik; Øvrelid, Lilja & Hohle, Petter (2017). Joint UD Parsing of Norwegian Bokmål and Nynorsk , In Jörg Tiedemann (ed.), Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa). Linköping University Electronic Press. ISBN 978-91-7685-601-7. Article no. 001. s 1 - 10

Acknowledgments

NDT has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo. Petter Hohle created the data splits and Fredrik Jørgensen aligned the treebank to the original texts. We thank the annotators of the original NDT: Pål Kristian Eriksen, Kari Kinn and Per Erik Solberg.

CHANGELOG

Statistics of UD Norwegian Nynorsk

POS Tags

ADJADPADVAUXCCONJDETINTJNOUNNUMPARTPRONPROPNPUNCTSCONJSYMVERBX

Features

AbbrAnimacyCaseDefiniteDegreeGenderMoodNumberNumTypePersonPolarityPossPronTypeTenseVerbForm

Relations

aclacl:cleftacl:relcladvcladvmodamodapposauxaux:passcaseccccompcompoundcompound:prtconjcopcsubjcsubj:passdetdiscourseexplflat:foreignflat:nameiobjmarknmodnsubjnsubj:passnummodobjoblorphanparataxispunctreparandumrootxcomp

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

Relations Overview