
This page pertains to UD version 2.

UD Croatian SET

Language: Croatian (code: hr)
Family: Indo-European, Slavic

This treebank has been part of Universal Dependencies since the UD v1.1 release.

The following people have contributed to making this treebank part of UD: Željko Agić, Nikola Ljubešić, Daniel Zeman.

Repository: UD_Croatian-SET
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.13

License: CC BY-SA 4.0

Genre: news, web, wiki

Questions, comments? General annotation questions (either Croatian-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [zeljko • agic (æt) gmail • com]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation Source
Lemmas annotated manually in non-UD style, automatically converted to UD
UPOS annotated manually in non-UD style, automatically converted to UD
XPOS annotated manually
Features annotated manually in non-UD style, automatically converted to UD
Relations annotated manually, natively in UD style

Description

The Croatian UD treebank is based on the hr500k corpus, an extension of the SETimes-HR corpus.

The sentences are partially parallel with the smaller Serbian UD treebank, which comes from the Serbian edition of SETimes. For UD release 2.4, the Croatian and Serbian corpora were enriched with newdoc metadata and re-split so that corresponding documents are in the same section (train/dev/test) in Croatian and Serbian.

Sentence ids have also been changed to reflect the domain or source the data comes from, rather than the section (train/dev/test) a sentence belongs to, as was the case in previous releases.
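Because the source is now encoded in the sentence id, sentences can be grouped by domain regardless of the section they sit in. The following is a minimal sketch of that idea, not a tool shipped with the treebank. It relies only on the standard CoNLL-U `# sent_id = ...` comment line and on the prefixes listed on this page; the part of each id after the prefix in the sample is made up for illustration.

```python
from collections import Counter

# Prefixes taken from the sentence ids listed on this page.
SOURCE_PREFIXES = ("set.hr", "news.hr", "web.hr", "wiki.hr")


def source_of(sent_id: str) -> str:
    """Return the source prefix of a sentence id, or 'unknown'."""
    for prefix in SOURCE_PREFIXES:
        if sent_id.startswith(prefix):
            return prefix
    return "unknown"


def count_by_source(conllu_text: str) -> Counter:
    """Count sentences per source from '# sent_id = ...' comment lines."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id ="):
            sent_id = line.split("=", 1)[1].strip()
            counts[source_of(sent_id)] += 1
    return counts


# Hypothetical ids: everything after the prefix is an assumption.
sample = (
    "# sent_id = set.hr-s1\n"
    "# sent_id = news.hr-s7\n"
    "# sent_id = set.hr-s2\n"
)
print(count_by_source(sample))
```

Running `count_by_source` over the text of a whole section file would give the per-source sentence counts for that section.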

Training set.

Contains 6,844 sentences (151,226 tokens) from three sources:

  1. sentence ids set.hr*: pseudorandom 80% of documents of newspaper text from the Southeast European Times news website, obtained from the SETimes parallel corpus. This part of the treebank is built on top of the SETimes.HR dependency treebank of Croatian.
  2. sentence ids news.hr*: pseudorandom 80% of documents of Croatian news web sources.
  3. sentence ids web.hr*: pseudorandom 80% of sentences of Croatian web sources.
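A document-level pseudorandom split keeps every sentence of a document in one section. The sketch below illustrates the general technique only; the actual procedure, seed, and document ids used by the treebank maintainers are not described on this page, so `split_documents`, its parameters, and the `doc0`…`doc19` ids are all hypothetical.

```python
import random


def split_documents(doc_ids, seed=0, train=0.8, dev=0.1):
    """Shuffle document ids with a fixed seed and cut them into three parts.

    Whatever is left after the train and dev cuts goes to test.
    """
    ids = sorted(doc_ids)             # sort first so the result is reproducible
    random.Random(seed).shuffle(ids)  # seeded, hence pseudorandom but repeatable
    n_train = int(len(ids) * train)
    n_dev = int(len(ids) * dev)
    return {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "test": ids[n_train + n_dev:],
    }


docs = [f"doc{i}" for i in range(20)]  # hypothetical document ids
parts = split_documents(docs)
print({name: len(members) for name, members in parts.items()})
```

Splitting by document rather than by sentence means the sentence counts per section only approximate the nominal percentages, since documents differ in length.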

Development set.

Contains 954 sentences (21,952 tokens) from three sources:

  1. sentence ids set.hr*: pseudorandom 10% of documents of newspaper text from the Southeast European Times.
  2. sentence ids news.hr*: pseudorandom 10% of documents of Croatian news web sources.
  3. sentence ids web.hr*: pseudorandom 10% of sentences of Croatian web sources.

Test set.

Contains 1,214 sentences (26,263 tokens) from four sources:

  1. sentence ids set.hr*: pseudorandom 10% of documents of newspaper text from the Southeast European Times (+ the previous test set from the same source).
  2. sentence ids wiki.hr*: old Wikipedia-based test set.
  3. sentence ids news.hr*: pseudorandom 10% of documents of Croatian news web sources.
  4. sentence ids web.hr*: pseudorandom 10% of sentences of Croatian web sources.
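The section sizes reported above can be summed and compared with a few lines of arithmetic. This is only a convenience sketch over the figures quoted on this page; the variable names are illustrative.

```python
# (sentences, tokens) per section, as reported on this page.
SPLITS = {
    "train": (6844, 151226),
    "dev": (954, 21952),
    "test": (1214, 26263),
}

total_sentences = sum(s for s, _ in SPLITS.values())
total_tokens = sum(t for _, t in SPLITS.values())

for name, (sentences, tokens) in SPLITS.items():
    share = sentences / total_sentences
    print(f"{name}: {sentences} sentences ({share:.1%}), "
          f"{tokens / sentences:.1f} tokens per sentence")
print(f"total: {total_sentences} sentences, {total_tokens} tokens")
```

The sentence shares need not come out at exactly 80/10/10, because most sources were split by document and the test set also carries the older SETimes and Wikipedia test data.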

Details

Sentence and word segmentation was manually checked. The treebank does not include multiword tokens. No language-specific features and relations were used. The POS tags and features were converted from MULTEXT-East v6 (present in the XPOS column) and manually checked. The syntactic annotation was done manually.
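To see where these annotations live in the data, a CoNLL-U token line can be split into its ten standard columns, with the original tag in XPOS and the converted features in FEATS. The sketch below assumes plain tab-separated CoNLL-U; the example line is hand-made for illustration and not copied from the treebank, and the helper names are hypothetical.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_token(line: str) -> dict:
    """Split one CoNLL-U token line into its ten named columns."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected 10 columns, got {len(values)}")
    token = dict(zip(FIELDS, values))
    # FEATS is '_' when empty, otherwise 'Name=Value' pairs joined by '|'.
    if token["feats"] == "_":
        token["feats"] = {}
    else:
        token["feats"] = dict(
            pair.split("=", 1) for pair in token["feats"].split("|")
        )
    return token


def is_multiword_range(token: dict) -> bool:
    """Multiword tokens carry range ids such as '3-4'."""
    return "-" in token["id"]


# Hand-made example line (an assumption, not treebank data).
line = "1\tVlada\tvlada\tNOUN\tNcfsn\tCase=Nom|Gender=Fem|Number=Sing\t2\tnsubj\t_\t_"
tok = parse_token(line)
print(tok["xpos"], tok["feats"], is_multiword_range(tok))
```

Since the treebank is stated to contain no multiword tokens, `is_multiword_range` would be expected to return `False` for every token line in it.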

Acknowledgments

When using the Croatian UD treebank, please cite the following paper:

See file LICENSE.txt for further licensing information.

Statistics of UD Croatian SET

POS Tags

ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X

Features

Animacy – Case – Definite – Degree – Foreign – Gender – Gender[psor] – Mood – Number – Number[psor] – NumType – Person – Polarity – Poss – PronType – Reflex – Tense – VerbForm – Voice

Relations

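acl – advcl – advmod – advmod:emph – amod – appos – aux – case – cc – ccomp – compound – conj – cop – csubj – dep – det – det:numgov – discourse – dislocated – expl – fixed – flat – flat:foreign – iobj – list – mark – nmod – nsubj – nummod – nummod:gov – obj – obl – orphan – parataxis – punct – root – vocative – xcomp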

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).
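The restriction to verb parents and noun or pronoun children can be expressed as a simple filter over dependency edges. This is a sketch under assumptions: tokens are represented as plain dictionaries, the example sentence is hand-made, and whether proper nouns (PROPN) count as nouns here is not stated on this page, so they are left out.

```python
# UPOS tags treated as "nouns or pronouns"; excluding PROPN is an assumption.
NOMINAL_UPOS = {"NOUN", "PRON"}


def verb_nominal_relations(sentence):
    """Yield (verb_form, deprel, child_form) for verb -> noun/pronoun edges.

    `sentence` is a list of dicts with keys id, form, upos, head, deprel
    (ids and heads as integers, head 0 for the root).
    """
    by_id = {tok["id"]: tok for tok in sentence}
    for tok in sentence:
        parent = by_id.get(tok["head"])  # None for the root (head 0)
        if parent and parent["upos"] == "VERB" and tok["upos"] in NOMINAL_UPOS:
            yield parent["form"], tok["deprel"], tok["form"]


# Hand-made sentence "Ona čita knjigu." ('She reads a book.')
sentence = [
    {"id": 1, "form": "Ona", "upos": "PRON", "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "čita", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "form": "knjigu", "upos": "NOUN", "head": 2, "deprel": "obj"},
    {"id": 4, "form": ".", "upos": "PUNCT", "head": 2, "deprel": "punct"},
]
pairs = list(verb_nominal_relations(sentence))
print(pairs)
```

The punctuation edge and the root itself are dropped, leaving only the subject and object relations of the verb.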

Verbs with Reflexive Core Objects

Relations Overview