This page pertains to UD version 2.

UD Croatian SET

Language: Croatian (code: hr)
Family: Indo-European, Slavic

This treebank has been part of Universal Dependencies since the UD v1.1 release.

The following people have contributed to making this treebank part of UD: Željko Agić, Nikola Ljubešić, Daniel Zeman.

Repository: UD_Croatian-SET
License: CC BY-SA 4.0

Genre: news, web, wiki

Questions, comments? General annotation questions (either Croatian-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [zeljko • agic (æt) gmail • com]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation Source
Lemmas annotated manually in non-UD style, automatically converted to UD
UPOS annotated manually in non-UD style, automatically converted to UD
XPOS annotated manually
Features annotated manually in non-UD style, automatically converted to UD
Relations annotated manually, natively in UD style


The Croatian UD treebank is based on the SETimes-HR corpus.

The sentences are partially parallel with the smaller Serbian UD treebank, which comes from the Serbian edition of SETimes. For the CoNLL 2018 shared task in parsing (and for UD release 2.2), the Croatian corpus was re-split so that corresponding sentences are in the same section (train/dev/test) in Croatian and Serbian. The re-split had to be done on the Croatian side because the Serbian corpus is smaller and most of it corresponds to what used to be training data in Croatian.

For the time being, sentence ids have not been changed, although they contain references to train/dev/test. It is therefore possible that, e.g., the sentence id “train-s2852” occurs in the development data rather than in the training data. This may be changed in future releases.
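Since the label inside a sentence id no longer has to match the file it appears in, it can help to count the labels per file. The following is a minimal sketch, assuming the ids are stored in the standard CoNLL-U "# sent_id = ..." comment line and that dev and test ids follow the same pattern as “train-s2852” (e.g. "dev-s1", which is an assumption). The function names are hypothetical.

```python
import re
from collections import Counter

# Matches the standard CoNLL-U sentence-id comment, e.g. "# sent_id = train-s2852".
SENT_ID_RE = re.compile(r"^#\s*sent_id\s*=\s*(\S+)")


def split_label(sent_id):
    """Return the part of a sentence id before the first hyphen, e.g. 'train'."""
    return sent_id.split("-", 1)[0]


def count_id_labels(conllu_text):
    """Count how many sentence ids in a CoNLL-U string carry each split label."""
    counts = Counter()
    for line in conllu_text.splitlines():
        match = SENT_ID_RE.match(line)
        if match:
            counts[split_label(match.group(1))] += 1
    return counts
```

Applied to the contents of the development file, a non-zero count for "train" would simply reflect the re-split described above, not an error in the data.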

Also note that the following description of data split and sources refers to the old data split. Thus, sentences 0001-3557 of the “training set” have ids “train-s1” to “train-s3557” but some of them are now in the dev file and some in the test file.

Training set.

Contains 7,689 sentences (169,283 tokens) from three sources:

  1. Sentences 0001-3557: newspaper text from the Southeast European Times news website, obtained from the SETimes parallel corpus. This part of the treebank is built on top of the SETimes.HR dependency treebank of Croatian.
  2. Sentences 3558-5792: text from various Croatian web sources.
  3. Sentences 5793-7689: Croatian news web sources.
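Because the ids still follow the old training-set numbering, the source of a sentence can be recovered from its id regardless of which file it now sits in. A minimal sketch, assuming ids of the form “train-s<number>” as in the examples above; the table and function names are hypothetical, and the source labels are shorthand for the list items.

```python
# Old training-set ranges (inclusive), taken from the list above.
TRAIN_SOURCES = [
    (1, 3557, "SETimes newspaper text"),
    (3558, 5792, "various Croatian web sources"),
    (5793, 7689, "Croatian news web sources"),
]


def training_source(sent_id):
    """Map an old training-set id such as 'train-s3558' to its source."""
    prefix, _, number = sent_id.partition("-s")
    if prefix != "train" or not number.isdigit():
        raise ValueError(f"not an old training-set id: {sent_id!r}")
    n = int(number)
    for first, last, source in TRAIN_SOURCES:
        if first <= n <= last:
            return source
    raise ValueError(f"sentence number out of range: {n}")
```

For example, training_source("train-s2852") falls in the first range, so it reports the SETimes newspaper text even if the sentence is now stored in the development file.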

Development set.

Contains 600 sentences (14,533 tokens) from two sources:

  1. sentences 001-200: newspaper text from the Croatian SETimes, and
  2. sentences 201-600: Croatian news web sources.

Test set.

Contains 600 sentences (13,228 tokens) from four sources:

  1. sentences 001-100: newspaper text,
  2. sentences 101-200: Wikipedia,
  3. sentences 201-297: web sources, and
  4. sentences 298-600: Croatian news web sources.
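The ranges listed for the old split can be checked against the stated sentence counts. The sketch below is an illustration only: the numbers are copied from the lists above, and the names OLD_SPLIT and ranges_cover are hypothetical.

```python
# Stated sentence count and listed ranges (inclusive) for each old section.
OLD_SPLIT = {
    "train": (7689, [(1, 3557), (3558, 5792), (5793, 7689)]),
    "dev": (600, [(1, 200), (201, 600)]),
    "test": (600, [(1, 100), (101, 200), (201, 297), (298, 600)]),
}


def ranges_cover(total, ranges):
    """Return True if the ranges are contiguous from 1 and end at total."""
    expected_start = 1
    for first, last in ranges:
        if first != expected_start or last < first:
            return False
        expected_start = last + 1
    return expected_start - 1 == total


# Every section's ranges should cover exactly its stated number of sentences.
all_consistent = all(ranges_cover(total, ranges) for total, ranges in OLD_SPLIT.values())
total_sentences = sum(total for total, _ in OLD_SPLIT.values())
```

With the figures above, the three sections together hold 7,689 + 600 + 600 = 8,889 sentences.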


Sentence and word segmentation was manually checked. The treebank does not include multiword tokens. No language-specific features and relations were used. The POS tags and features were converted from Multext East v4 and manually checked. The syntactic annotation was done manually.
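The statement about multiword tokens can be checked directly on the data. In the CoNLL-U format, a multiword token is a line whose ID column is a range such as "1-2". A minimal sketch under that assumption; the function name is hypothetical.

```python
def multiword_token_ids(conllu_text):
    """Return the ID fields of all multiword-token lines in a CoNLL-U string."""
    ids = []
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        token_id = line.split("\t", 1)[0]
        # Range IDs like "1-2" mark multiword tokens; "8.1" (empty node) has no hyphen.
        if "-" in token_id:
            ids.append(token_id)
    return ids
```

If the treebank contains no multiword tokens, as stated, this function should return an empty list for each of its files.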


When using the Croatian UD treebank, please cite the following paper:

See file LICENSE.txt for further licensing information.

Statistics of UD Croatian SET

