
This page pertains to UD version 2.

UD Icelandic IcePaHC

Language: Icelandic (code: is)
Family: Indo-European, Germanic

This treebank has been part of Universal Dependencies since the UD v2.7 release.

The following people have contributed to making this treebank part of UD: Þórunn Arnardóttir, Hinrik Hafsteinsson, Einar Freyr Sigurðsson, Hildur Jónsdóttir, Kristín Bjarnadóttir, Anton Karl Ingason, Kristján Rúnarsson, Steinþór Steingrímsson, Joel C. Wallenberg, Eiríkur Rögnvaldsson.

Repository: UD_Icelandic-IcePaHC
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.14

License: CC BY-SA 4.0

Genre: fiction, bible, nonfiction, legal

Questions, comments? General annotation questions (either Icelandic-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [thar (æt) hi • is, hinrik • hafst (æt) gmail • com, einar • freyr • sigurdsson (æt) arnastofnun • is]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation Source
Lemmas annotated manually in non-UD style, automatically converted to UD
UPOS annotated manually in non-UD style, automatically converted to UD
XPOS annotated manually
Features assigned by a program, not checked manually
Relations annotated manually in non-UD style, automatically converted to UD

Description

UD_Icelandic-IcePaHC is a conversion of the Icelandic Parsed Historical Corpus (IcePaHC) to the Universal Dependencies scheme.

The conversion was done using UDConverter.

The Icelandic Parsed Historical Corpus (IcePaHC) is a one-million-word diachronic corpus of 61 texts from the 12th to the 21st century. The texts were originally parsed manually according to the Penn Parsed Corpora of Historical English (PPCHE) annotation scheme and were later converted automatically to the Universal Dependencies scheme to create UD_Icelandic-IcePaHC.

Text categories

UD_Icelandic-IcePaHC contains the following main genres:

Further subclassification is reflected in the extended genre label. For example, NAR-SAG means narrative-saga and REL-BIB means religious text-bible.

Each sentence ID in UD_Icelandic-IcePaHC carries the following information:

1150.FIRSTGRAMMAR.SCI-LIN,1.1

Using the sentence IDs within UD_Icelandic-IcePaHC, specific genres or periods can be extracted or filtered from the treebank CoNLL-U files.
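The filtering described above can be sketched in a few lines of Python. This is a minimal sketch, not part of the treebank tooling. It assumes that sentence IDs follow the layout suggested by the example ID and the genre labels (year, text name, genre with an optional subgenre, then a position after the comma) and that each sentence in the CoNLL-U file carries a standard `# sent_id = ...` comment. The names `parse_sent_id` and `filter_sentences` are hypothetical.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Optional


class SentId(NamedTuple):
    year: int                # assumed: first field is the year of the text
    text: str                # assumed: second field is the text name
    genre: str               # main genre label, e.g. "SCI"
    subgenre: Optional[str]  # extended label after the hyphen, e.g. "LIN"
    position: str            # everything after the comma, left uninterpreted


# Assumed layout: YEAR.TEXTNAME.GENRE[-SUBGENRE],POSITION
_SENT_ID = re.compile(r"^(\d{4})\.([^.]+)\.([A-Z]+)(?:-([A-Z]+))?,(.+)$")


def parse_sent_id(sent_id: str) -> SentId:
    """Split a sentence ID into its assumed components."""
    match = _SENT_ID.match(sent_id.strip())
    if match is None:
        raise ValueError(f"unrecognised sentence ID: {sent_id!r}")
    year, text, genre, subgenre, position = match.groups()
    return SentId(int(year), text, genre, subgenre, position)


def iter_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of lines per blank-line-separated CoNLL-U sentence."""
    block: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block


def filter_sentences(
    lines: Iterable[str],
    genre: Optional[str] = None,
    first_year: Optional[int] = None,
    last_year: Optional[int] = None,
) -> Iterator[List[str]]:
    """Yield sentences whose ID matches the requested genre and period."""
    for block in iter_sentences(lines):
        raw = next(
            (l.split("=", 1)[1].strip() for l in block if l.startswith("# sent_id")),
            None,
        )
        if raw is None:
            continue  # sentence without an ID comment
        try:
            info = parse_sent_id(raw)
        except ValueError:
            continue  # ID does not follow the assumed layout
        if genre is not None and info.genre != genre:
            continue
        if first_year is not None and info.year < first_year:
            continue
        if last_year is not None and info.year > last_year:
            continue
        yield block
```

To use it, pass an open CoNLL-U file to `filter_sentences`, for example with `genre="REL"` to keep religious texts or with `last_year=1500` to keep the older part of the corpus.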

Data split

For further info on each text, see the IcePaHC documentation.

TRAIN:

TEST:

DEV:

Acknowledgments

This project was funded by The Strategic Research and Development Programme for Language Technology, grant no. 180020-5301. Thanks are due to Örvar Kárason, whose previous work was used as a basis for the conversion.

The Icelandic Parsed Historical Corpus (IcePaHC) is available at https://linguist.is/icelandic_treebank/Download and https://repository.clarin.is/repository/xmlui/handle/20.500.12537/62.

Morphological features were generated using ABLTagger, a PoS tagger for Icelandic, developed by Steinþór Steingrímsson, Örvar Kárason and Hrafn Loftsson and available here.

References

@inproceedings{arnardottir-etal-2020-universal,
title = "A {U}niversal {D}ependencies Conversion Pipeline for a {P}enn-format Constituency Treebank",
author = "Arnard{\'o}ttir, {\TH}{\'o}runn and
Hafsteinsson, Hinrik and
Sigur{\dh}sson, Einar Freyr and
Bjarnad{\'o}ttir, Krist{\'\i}n and
Ingason, Anton Karl and
J{\'o}nsd{\'o}ttir, Hildur and
Steingr{\'\i}msson, Stein{\th}{\'o}r",
booktitle = "Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.udw-1.3",
pages = "16--25",
abstract = "The topic of this paper is a rule-based pipeline for converting constituency treebanks based on the Penn Treebank format to Universal Dependencies (UD). We describe an Icelandic constituency treebank, its annotation scheme and the UD scheme. The conversion is discussed, the methods used to deliver a fully automated UD corpus and complications involved. To show its applicability to corpora in different languages, we extend the pipeline and convert a Faroese constituency treebank to a UD corpus. The result is an open-source conversion tool, published under an Apache 2.0 license, applicable to a Penn-style treebank for conversion to a UD corpus, along with the two new UD corpora.",
}

@inproceedings{arnardottir-etal-2023-evaluating,
title = "Evaluating a {U}niversal {D}ependencies Conversion Pipeline for {I}celandic",
author = "Arnard{\'o}ttir, {\TH}{\'o}runn and
Hafsteinsson, Hinrik and
Jasonarson, Atli and
Ingason, Anton and
Steingr{\'\i}msson, Stein{\th}{\'o}r",
editor = {Alum{\"a}e, Tanel and
Fishel, Mark},
booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
month = may,
year = "2023",
address = "T{\'o}rshavn, Faroe Islands",
publisher = "University of Tartu Library",
url = "https://aclanthology.org/2023.nodalida-1.69",
pages = "698--704",
abstract = "We describe the evaluation and development of a rule-based treebank conversion tool, UDConverter, which converts treebanks from the constituency-based PPCHE annotation scheme to the dependency-based Universal Dependencies (UD) scheme. The tool has already been used in the production of three UD treebanks, although no formal evaluation of the tool has been carried out as of yet. By manually correcting new output files from the converter and comparing them to the raw output, we measured the labeled attachment score (LAS) and unlabeled attachment score (UAS) of the converted texts. We obtain an LAS of 82.87 and a UAS of 87.91. In comparison to other tools, UDConverter currently provides the best results in automatic UD treebank creation for Icelandic.",
}

Statistics of UD Icelandic IcePaHC

POS Tags

ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB, X

Features

Case, Definite, Degree, Foreign, Gender, Mood, Number, NumType, Person, PronType, Tense, VerbForm, Voice

Relations

acl, acl:relcl, advcl, advmod, amod, appos, aux, case, cc, ccomp, compound:prt, conj, cop, csubj, dep, det, discourse, dislocated, expl, fixed, flat, flat:foreign, flat:name, iobj, mark, nmod, nmod:poss, nsubj, nummod, obj, obl, parataxis, punct, root, vocative, xcomp

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).
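As a rough illustration of this restriction, the sketch below counts dependency relations in one CoNLL-U sentence whose parent is tagged VERB and whose child is a noun or pronoun. It is not the script used to generate these statistics. The function name `verb_nominal_relations` is hypothetical, and the default child tag set (NOUN and PRON, without PROPN) is an assumption that can be changed through the `child_upos` parameter.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, Tuple


def verb_nominal_relations(
    sentence_lines: Iterable[str],
    child_upos: FrozenSet[str] = frozenset({"NOUN", "PRON"}),  # assumed tag set
) -> Counter:
    """Count deprels where the head is a VERB and the child is nominal."""
    # Map token ID -> (UPOS, HEAD, DEPREL) for ordinary word lines only.
    tokens: Dict[str, Tuple[str, str, str]] = {}
    for line in sentence_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword token ranges and empty nodes
        tokens[cols[0]] = (cols[3], cols[6], cols[7])

    counts: Counter = Counter()
    for upos, head, deprel in tokens.values():
        parent = tokens.get(head)  # None for the root (head "0")
        if upos in child_upos and parent is not None and parent[0] == "VERB":
            counts[deprel] += 1
    return counts
```

Summing the returned counters over all sentences of a file gives one relation distribution for verb-headed nominal dependents.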

Relations Overview