
This page pertains to UD version 2.

UD Czech CAC

Language: Czech (code: cs)
Family: Indo-European, Slavic

This treebank has been part of Universal Dependencies since the UD v1.3 release.

The following people have contributed to making this treebank part of UD: Barbora Hladká, Daniel Zeman.

Repository: UD_Czech-CAC
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.13

License: CC BY-SA 4.0

Genre: news, nonfiction, legal, reviews, medical

Questions, comments? General annotation questions (either Czech-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [zeman (æt) ufal • mff • cuni • cz]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation Source
Lemmas: annotated manually in non-UD style, automatically converted to UD
UPOS: annotated manually in non-UD style, automatically converted to UD
XPOS: annotated manually
Features: annotated manually in non-UD style, automatically converted to UD
Relations: annotated manually in non-UD style, automatically converted to UD
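
The annotation layers listed above correspond to columns of the CoNLL-U format in which UD treebanks are distributed. The following is a minimal sketch of reading those columns, assuming the standard ten-column, tab-separated layout. The sample sentence, its annotation and the function name `read_tokens` are invented for illustration and are not taken from the treebank.

```python
from typing import Dict, List

# Standard CoNLL-U column names, in file order.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

# Invented three-token sample (assumption: not actual treebank data).
SAMPLE = "\n".join([
    "# sent_id = example-1",
    "# text = Pes spí.",
    "\t".join(["1", "Pes", "pes", "NOUN", "NNMS1-----A----",
               "Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos",
               "2", "nsubj", "_", "_"]),
    "\t".join(["2", "spí", "spát", "VERB", "VB-S---3P-AA---",
               "Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|"
               "Tense=Pres|VerbForm=Fin|Voice=Act",
               "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "Z:-------------",
               "_", "2", "punct", "_", "_"]),
])


def read_tokens(conllu: str) -> List[Dict[str, str]]:
    """Return one dict per token line, keyed by CoNLL-U column name."""
    tokens = []
    for line in conllu.splitlines():
        # Skip blank lines (sentence separators) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        tokens.append(dict(zip(COLUMNS, fields)))
    return tokens


for token in read_tokens(SAMPLE):
    print(token["FORM"], token["LEMMA"], token["UPOS"], token["XPOS"], token["DEPREL"])
```

The sketch treats every non-comment line as a plain token; multiword-token ranges and empty nodes, which CoNLL-U also allows, would need extra handling.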

Description

The UD_Czech-CAC treebank is based on the Czech Academic Corpus 2.0 (CAC; Český akademický korpus; ČAK), created at Charles University in Prague.

CAC consists of both written data and transcripts of spoken language. Only the written part is included in this treebank because no syntactic annotation is available for the spoken data. Of the 650,000 tokens in CAC, 493,306 appear in the treebank.

The first version of CAC was created between 1971 and 1985 by a team from the Institute of the Czech Language, Czechoslovak Academy of Sciences, led by Marie Těšitelová; its original name was “Korpus věcného stylu”. It was reshaped and made compatible with the Prague Dependency Treebank between 2007 (CAC 1.0) and 2008 (CAC 2.0); these corpora are distributed by the Linguistic Data Consortium. The corpus has now been converted to Universal Dependencies and made freely available under the Creative Commons license (see LICENSE.txt).

See the following websites for more information on CAC 2.0:

CAC contains mostly unabridged articles taken from a wide range of media, including newspapers, magazines and other sources. The articles cover three genres: administration, journalism and science. These three genres can be distinguished by the sentence id: in

Acknowledgments

We wish to thank all of the contributors to the original annotation effort, as well as the team responsible for the corpus’ revival in 2008.

References

Statistics of UD Czech CAC

POS Tags

ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB

Features

Abbr, AdpType, Animacy, Aspect, Case, ConjType, Degree, Foreign, Gender, Gender[psor], Hyph, Mood, NameType, Number, Number[psor], NumForm, NumType, Person, Polarity, Poss, PrepCase, PronType, Reflex, Style, Tense, Variant, VerbForm, Voice
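
Feature names such as Gender[psor] and Number[psor] carry a bracketed layer in addition to the base name. The sketch below shows one way to split a CoNLL-U FEATS string into base name, layer and values, assuming the usual `Name=Value|Name=Value` encoding with comma-separated multiple values. The function name `parse_feats` and the example strings are hypothetical, not taken from the treebank.

```python
import re
from typing import Dict, List, Optional, Tuple

# A feature name is a base name optionally followed by a layer in brackets,
# e.g. "Gender" or "Gender[psor]".
FEATURE_NAME = re.compile(r"^(?P<name>[A-Za-z0-9]+)(?:\[(?P<layer>[a-z0-9]+)\])?$")

FeatureKey = Tuple[str, Optional[str]]


def parse_feats(feats: str) -> Dict[FeatureKey, List[str]]:
    """Map (base name, layer or None) to the list of values for that feature."""
    if feats == "_":  # "_" marks an empty FEATS column
        return {}
    parsed: Dict[FeatureKey, List[str]] = {}
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        match = FEATURE_NAME.match(name)
        if match is None or not value:
            raise ValueError(f"malformed feature: {pair!r}")
        parsed[(match["name"], match["layer"])] = value.split(",")
    return parsed


# Invented example of a possessive pronoun's features (assumption).
example = "Case=Nom|Gender=Fem,Neut|Gender[psor]=Masc|Number[psor]=Sing|Poss=Yes"
for (name, layer), values in parse_feats(example).items():
    print(name, layer, values)
```

Keeping the layer separate from the base name lets a consumer ask for all Gender values regardless of layer, or only the possessor layer.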

Relations

acl, acl:relcl, advcl, advmod, advmod:emph, amod, appos, aux, aux:pass, case, cc, ccomp, compound, conj, cop, csubj, csubj:pass, dep, det, det:numgov, det:nummod, discourse, expl:pass, expl:pv, fixed, flat, flat:foreign, iobj, mark, nmod, nsubj, nsubj:pass, nummod, nummod:gov, obj, obl, obl:arg, orphan, parataxis, punct, root, vocative, xcomp
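
Several of the relation labels above consist of a universal relation and a subtype separated by a colon, for example nsubj:pass. As a small sketch, the labels can be grouped by their universal part as follows. The helper names `split_relation` and `group_by_universal` are hypothetical, and the list is only a subset of the labels above, chosen for illustration.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Subset of the relation labels listed for this treebank.
RELATIONS = ["acl", "acl:relcl", "advmod", "advmod:emph", "aux", "aux:pass",
             "det", "det:numgov", "det:nummod", "expl:pass", "expl:pv",
             "nsubj", "nsubj:pass", "obl", "obl:arg"]


def split_relation(label: str) -> Tuple[str, Optional[str]]:
    """Split 'nsubj:pass' into ('nsubj', 'pass'); plain labels get subtype None."""
    base, _, subtype = label.partition(":")
    return base, subtype or None


def group_by_universal(labels: Iterable[str]) -> Dict[str, List[str]]:
    """Map each universal relation to the subtypes seen with it."""
    groups: Dict[str, List[str]] = {}
    for label in labels:
        base, subtype = split_relation(label)
        subtypes = groups.setdefault(base, [])
        if subtype is not None:
            subtypes.append(subtype)
    return groups


for base, subtypes in group_by_universal(RELATIONS).items():
    print(base, subtypes)
```

Grouping this way makes it easy to collapse subtypes when comparing against a treebank that uses only the universal relations.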

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).
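
The restriction described above can be sketched as a filter over dependency edges: keep an edge only when the parent is a verb and the child is a noun or pronoun. The code below assumes that "verb" means UPOS VERB and that "nouns or pronouns" means UPOS NOUN or PRON; whether proper nouns (PROPN) are also counted is not stated here, so the set of child tags is a parameter. The sentence data and the function name `verb_nominal_relations` are invented for illustration.

```python
from collections import Counter
from typing import FrozenSet, List, Tuple

# (ID, UPOS, HEAD, DEPREL) for one invented sentence (assumption, not treebank data).
Token = Tuple[int, str, int, str]

SENTENCE: List[Token] = [
    (1, "NOUN", 2, "nsubj"),
    (2, "VERB", 0, "root"),
    (3, "PRON", 2, "obj"),
    (4, "ADP", 5, "case"),
    (5, "NOUN", 2, "obl"),
    (6, "ADJ", 7, "amod"),
    (7, "NOUN", 5, "nmod"),   # child is a noun, but its parent is a noun too
    (8, "PUNCT", 2, "punct"),
]


def verb_nominal_relations(
    sentence: List[Token],
    child_upos: FrozenSet[str] = frozenset({"NOUN", "PRON"}),
) -> Counter:
    """Count relation labels on edges from a VERB parent to a nominal child."""
    upos_by_id = {tok_id: upos for tok_id, upos, _, _ in sentence}
    counts: Counter = Counter()
    for _, upos, head, deprel in sentence:
        # HEAD 0 is the artificial root and has no UPOS, so .get() returns None.
        if upos in child_upos and upos_by_id.get(head) == "VERB":
            counts[deprel] += 1
    return counts


print(verb_nominal_relations(SENTENCE))
```

Passing `child_upos=frozenset({"NOUN", "PRON", "PROPN"})` would include proper nouns as well.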

Reflexive Verbs

Reflexive Passive

Verbs with Reflexive Core Objects

Relations Overview