This page pertains to UD version 2.

UD German HDT

Language: German (code: de)
Family: Indo-European, Germanic

This treebank has been part of Universal Dependencies since the UD v2.4 release.

The following people have contributed to making this treebank part of UD: Emanuel Borges Völker, Felix Hennig, Arne Köhn, Maximilian Wendt.

Repository: UD_German-HDT
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.2

License: HZSK-ACA (Text) / CC BY-SA-4.0 (Annotation)

Genre: news, nonfiction, web

Questions, comments? General annotation questions (either German-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [Arne Köhn <arne (æt) chark • eu>]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation Source
Lemmas annotated manually in non-UD style, automatically converted to UD
UPOS annotated manually in non-UD style, automatically converted to UD
XPOS assigned by a program, with some manual corrections, but not a full manual verification
Features annotated manually in non-UD style, automatically converted to UD
Relations annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion

Description

UD German-HDT is a conversion of the Hamburg Dependency Treebank, created at the University of Hamburg through manual annotation in conjunction with a standard for morphologically and syntactically annotating sentences as well as a constraint-based parser.

The Hamburg Dependency Treebank consists of 261,821 sentences (4.8M tokens). The sentences were all sourced from the German news site heise.de, from articles published between 1996 and 2001. The content of the articles ranges from formulaic periodic updates on new BIOS revisions and processor models, or quarterly earnings of tech companies, through features about general trends in the hardware and software market, to general coverage of social, legal and political issues in cyberspace, sometimes in the form of extensive weekly editorial comments. The creation of the treebank through manual annotation was largely interleaved with the creation of a standard for morphologically and syntactically annotating sentences as well as a constraint-based parser.

For UD_German-HDT, 206,794 sentences (3.8M tokens) from the original HDT were converted with TrUDucer, a treebank conversion tool created by Felix Hennig and extended by Maximilian Wendt and Emanuel Borges Völker. The conversion reaches an accuracy of 97%, as checked on a manually converted subset of the treebank. Information required by UD but not captured in the original annotation was added using external data sources (Wiktionary) and manual input from annotators.

Acknowledgments

The following people worked on the conversion:

References

If you use this treebank, please cite the upcoming paper describing the conversion of the HDT to UD.

The TrUDucer paper describing the formalism behind the conversion:

Hennig, Felix, & Köhn, Arne (2017). Dependency tree transformation with tree transducers. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017) (pp. 58–66). Gothenburg, Sweden: Association for Computational Linguistics. url: http://www.aclweb.org/anthology/W17-0407

The paper describing the HDT:

Foth, K. A., Köhn, A., Beuck, N., & Menzel, W. (2014). Because Size Does Matter: The Hamburg Dependency Treebank. In Proceedings of the Language Resources and Evaluation Conference 2014 (pp. 2326–2333). Reykjavik, Iceland: European Language Resources Association (ELRA). url: http://nbn-resolving.de/urn:nbn:de:gbv:18-228-7-2013

The annotation guidelines of the original HDT:

Foth, K. A. (2006). Eine umfassende Constraint-Dependenz-Grammatik des Deutschen. url: http://nbn-resolving.de/urn:nbn:de:gbv:18-228-7-2048

Software

TrUDucer, the software used to convert the HDT. It comes with a pipeline to replicate the conversion of the HDT.

jwcdg, the successor of the parser used for initial automatic annotation of the HDT. It contains the lexicon with the relevant morpho-syntactic features annotated.

DECCA, a tool to detect and correct errors in annotated corpora

Statistics of UD German HDT

POS Tags

ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB, X

Features

Case, Degree, Gender, Mood, Number, Person, Tense

Relations

acl, advcl, advmod, amod, appos, aux, aux:pass, case, cc, ccomp, compound:prt, conj, cop, csubj, csubj:pass, det, det:poss, expl, expl:pv, flat, flat:name, iobj, mark, nmod, nmod:poss, nsubj, nsubj:pass, nummod, obj, obl, obl:arg, orphan, parataxis, punct, root, xcomp
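The inventories of POS tags, features and relations listed in this section could be recomputed from the treebank's CoNLL-U files. The following is a minimal Python sketch, assuming the standard ten-column CoNLL-U layout; the function name `inventory` and the three-word sample sentence are hypothetical illustrations and are not taken from the treebank.

```python
from collections import Counter

def inventory(conllu_text):
    """Count UPOS tags, feature names and relation labels in CoNLL-U text."""
    upos, feats, rels = Counter(), Counter(), Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank line between sentences, or a comment line
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword token ranges (1-2) and empty nodes (1.1)
        upos[cols[3]] += 1                      # column 4: UPOS
        if cols[5] != "_":                      # column 6: FEATS
            for pair in cols[5].split("|"):
                feats[pair.split("=", 1)[0]] += 1
        rels[cols[7]] += 1                      # column 8: DEPREL
    return upos, feats, rels

# Hypothetical sample sentence, invented for illustration only.
SAMPLE = "\n".join([
    "# text = Der Hund schläft.",
    "1\tDer\tder\tDET\tART\tCase=Nom|Gender=Masc|Number=Sing\t2\tdet\t_\t_",
    "2\tHund\tHund\tNOUN\tNN\tCase=Nom|Gender=Masc|Number=Sing\t3\tnsubj\t_\t_",
    "3\tschläft\tschlafen\tVERB\tVVFIN\tNumber=Sing|Person=3\t0\troot\t_\t_",
    "4\t.\t.\tPUNCT\t$.\t_\t3\tpunct\t_\t_",
])

upos, feats, rels = inventory(SAMPLE)
print(sorted(upos))
print(sorted(feats))
print(sorted(rels))
```

To run this on the real data, the text of a treebank file would be passed in place of `SAMPLE`.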

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).
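The restriction to verb parents and noun or pronoun children can be expressed as a filter over the UPOS and HEAD columns of a CoNLL-U sentence. The sketch below is only an illustration under stated assumptions: "nouns or pronouns" is read as the UPOS tags NOUN and PRON (whether PROPN is also counted is not stated here), and the function name and the sample sentence are hypothetical.

```python
from collections import Counter

# Assumption: "nouns or pronouns" means UPOS NOUN or PRON; add "PROPN" if needed.
NOMINAL = {"NOUN", "PRON"}

def verb_nominal_relations(sentence_lines):
    """Count relation labels where the parent is a VERB and the child is nominal."""
    tokens = {}
    for line in sentence_lines:
        cols = line.split("\t")
        # Keep only basic tokens with an integer ID and an integer HEAD.
        if len(cols) == 10 and cols[0].isdigit() and cols[6].isdigit():
            tokens[int(cols[0])] = (cols[3], int(cols[6]), cols[7])
    counts = Counter()
    for upos, head, deprel in tokens.values():
        # head 0 is the artificial root and is never in `tokens`.
        if upos in NOMINAL and head in tokens and tokens[head][0] == "VERB":
            counts[deprel] += 1
    return counts

# Hypothetical sample sentence ("Der Hund sieht ihn."), invented for illustration.
SAMPLE = [
    "1\tDer\tder\tDET\tART\t_\t2\tdet\t_\t_",
    "2\tHund\tHund\tNOUN\tNN\t_\t3\tnsubj\t_\t_",
    "3\tsieht\tsehen\tVERB\tVVFIN\t_\t0\troot\t_\t_",
    "4\tihn\ter\tPRON\tPPER\t_\t3\tobj\t_\t_",
    "5\t.\t.\tPUNCT\t$.\t_\t3\tpunct\t_\t_",
]

result = verb_nominal_relations(SAMPLE)
print(dict(result))
```

In the sample, the determiner and the punctuation are excluded because the child is not nominal or the parent is not a verb, so only the subject and object relations are counted.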

Reflexive Verbs

Relations Overview