Tokenization
The low-level tokenization of the Czech UD treebank follows the tokenization of the Prague Dependency Treebank 3.0 (PDT):
- In general, tokens are delimited by whitespace.
- Punctuation (recognized by the corresponding Unicode property) that is conventionally written adjacent to the preceding or following word is separated during tokenization.
Some special cases worth mentioning:
- An abbreviation marked by a period, as in atd. “etc.”, becomes two tokens: atd and the period. The same holds for ordinal numbers (e.g. 1.).
- A compound containing a hyphen becomes three tokens (two words and the hyphen), as in česko-slovenský “Czech-Slovak”, česko-německý “Czech-German” or německo-český “German-Czech”. In these cases, the first token is a special form of adjective that never occurs independently. Compounds without a hyphen are not split, thus středopravý “right-centrist” is one token but středo-pravý would be three tokens. Another common case of splitting-on-hyphen is the conjunction li “if”, attached to verbs as in bude-li “if will be”.
- Exception: Decimal numbers are normalized (the Czech decimal comma is converted to the English decimal point) and kept as one token, e.g. 2.1.
- Most of the time, every punctuation character constitutes a token of its own. Thus an ellipsis typed as three periods (...) becomes three tokens.
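The rules above can be illustrated with a minimal sketch. It is not the PDT tokenizer: the function name `tokenize`, the regular expression for decimal numbers, and the use of the Unicode general category `P*` as the punctuation test are assumptions made for illustration only.

```python
import re
import unicodedata

# Assumption: a decimal number is digits, one Czech decimal comma, digits.
# Any other single character is handled one at a time.
PIECE_RE = re.compile(r"\d+,\d+|.", re.DOTALL)


def is_punct(ch):
    """True if the character has a Unicode punctuation category (Pc, Pd, Po, ...)."""
    return unicodedata.category(ch).startswith("P")


def tokenize(text):
    """Split on whitespace, then separate every punctuation character."""
    tokens = []
    for chunk in text.split():
        word = ""
        for match in PIECE_RE.finditer(chunk):
            piece = match.group()
            if len(piece) > 1:
                # Decimal number: flush any pending word, normalize comma to point.
                if word:
                    tokens.append(word)
                    word = ""
                tokens.append(piece.replace(",", "."))
            elif is_punct(piece):
                # Each punctuation character becomes its own token.
                if word:
                    tokens.append(word)
                    word = ""
                tokens.append(piece)
            else:
                word += piece
        if word:
            tokens.append(word)
    return tokens
```

With this sketch, `tokenize("bude-li")` yields the three tokens bude, - and li, and `tokenize("2,1")` yields the single token 2.1. A comma-separated list of integers written without spaces would be misread as a decimal number, so a real tokenizer needs more context than this.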
Words and Tokens
In Czech there are fused words that correspond to multiple syntactic words. The original PDT data use special part-of-speech tags to identify fused words. Nevertheless, a fused token is not split in PDT and corresponds to just one node in the dependency tree. (Note: An exception was the splitting of aby and kdyby in PDT 1.0, but it was abandoned in later versions.)
In contrast, the UD format requires that certain types of fused words be split. We say that there is a multi-word token consisting of several syntactic words, each having its own node in the tree (see also universal tokenization).
Preposition + Personal Pronoun on in the Accusative (něj)
- proň = pro něj = “for him”
- naň = na něj = “on him”
- oň = o něj = “about him”
- zaň = za něj = “behind/for him”
This category covers words that would be tagged by the PDT tag P0-------------.
However, no such word occurs in the PDT 3.0 data.
Preposition + Interrogative/Relative Pronoun co in the Accusative
- nač = na co = “on what”
- oč = o co = “about what”
- zač = za co = “behind/for what”
This category covers words that would be tagged by the PDT tag PY-------------.
No such word occurs in the PDT 3.0 data but there are a few occurrences in the CAC 2.0 data.
Note: There is another analogously fused word, proč “why”. In contrast to the above, proč has grammaticalized into an interrogative/relative adverb. It is more frequent than the three fusions listed above, but it is not used to replace a prepositional object. We do not split it into pro co.
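Because both preposition + pronoun categories are closed lists, the split can be sketched as a plain lookup. The table below contains only the forms listed above; the names `FUSED_PREP_PRON` and `split_prep_pron` are hypothetical, and returning lowercased parts is a simplification.

```python
# Fused forms listed in this section, mapped to their syntactic words.
# proč is deliberately absent: it is an adverb and is not split.
FUSED_PREP_PRON = {
    "proň": ("pro", "něj"),
    "naň": ("na", "něj"),
    "oň": ("o", "něj"),
    "zaň": ("za", "něj"),
    "nač": ("na", "co"),
    "oč": ("o", "co"),
    "zač": ("za", "co"),
}


def split_prep_pron(form):
    """Return the syntactic words of a fused form, or the form itself."""
    parts = FUSED_PREP_PRON.get(form.lower())
    return list(parts) if parts else [form]
```

For example, `split_prep_pron("naň")` gives na and něj, while `split_prep_pron("proč")` leaves the word untouched.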
Participle, Pronoun or Subordinating Conjunction + the Auxiliary být in the 2nd Person Singular (jsi)
- udělals = udělal jsi = “you have done”
- tys = ty jsi = “you have”
- ses = jsi se (se jsi) = “you have … yourself”
- sis = jsi si (si jsi) = “you have … yourself”
- cos = co jsi = “what you have”
- tos = to jsi = “you have … that”
- žes = že jsi = “that you have”
Note: This rule does not include the words bys, abys and kdybys. They resemble the words above but bys is an independent form of the auxiliary verb být “to be”, and abys and kdybys are in fact fused words, but they were formed using bys, not jsi.
This category does not have its own tag in PDT.
The pronouns ses and sis are tagged as P7.* pronouns in the second person.
The tys pronoun can be distinguished by having more verbal features in its tag (PP-S1--2P-AA---) than ty (PP-S1--2-------).
The žes conjunction is tagged J,-S---2------- while že is tagged J,-------------.
The participles can be distinguished by the value of person: the normal participle udělal does not inflect for person (VpYS---XR-AA---), while the participle fused with jsi, i.e. udělals, is tagged as being in the second person (VpYS---2R-AA---).
None of these occur in the PDT 3.0 data.
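The tag-based distinctions described above can be sketched as a small predicate over PDT positional tags, where position 8 encodes person and position 9 encodes tense. The function name `looks_fused_with_jsi` is hypothetical, and the checks cover only the four cases mentioned in this section; this is a heuristic reading of the text, not an official rule set.

```python
def looks_fused_with_jsi(tag):
    """Guess from a PDT positional tag whether the word is fused with jsi."""
    if len(tag) < 9:
        return False
    second_person = tag[7] == "2"   # position 8: person
    has_tense = tag[8] != "-"       # position 9: tense
    if tag.startswith("Vp"):
        # Past participle: plain udělal has person X, udělals has person 2.
        return second_person
    if tag.startswith("J,"):
        # Subordinating conjunction: žes carries person, že does not.
        return second_person
    if tag.startswith("P7"):
        # Reflexive pronoun: ses, sis are marked for second person.
        return second_person
    if tag.startswith("PP"):
        # Personal pronoun: tys carries extra verbal features, ty does not.
        return second_person and has_tense
    return False
```

Applied to the tags quoted above, the predicate accepts VpYS---2R-AA--- and rejects VpYS---XR-AA---.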
Subordinating Conjunction aby or kdyby
- abych = aby bych = “so that I would”
- abys = aby bys = “so that you would”
- aby = aby by = “so that he/she/it/they would”
- abychom = aby bychom = “so that we would”
- abyste = aby byste = “so that you would”
- kdybych = když bych = “if I were”
- kdybys = když bys = “if you were”
- kdyby = když by = “if he/she/it/they were”
- kdybychom = když bychom = “if we were”
- kdybyste = když byste = “if you were”
Note: It is not clear even to a native speaker what exactly the first word should be (aby, až, kdyby or když); in any case, it is a conjunction. However, it is clear that the second word is a conditional form of být.
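The list above is regular enough to be generated rather than enumerated. The following sketch follows the convention used in the list (aby for the a- forms, když for the kdy- forms); the names `FIRST_WORD`, `AUX_FORMS` and `split_conditional_conjunction` are hypothetical.

```python
# First syntactic word chosen for each fused stem, as in the list above.
FIRST_WORD = {"a": "aby", "kdy": "když"}

# Conditional forms of the auxiliary být that can be the second word.
AUX_FORMS = ("bych", "bys", "by", "bychom", "byste")


def split_conditional_conjunction(form):
    """Split aby/kdyby forms into conjunction + conditional auxiliary."""
    low = form.lower()
    for stem, conjunction in FIRST_WORD.items():
        rest = low[len(stem):]
        if low.startswith(stem) and rest in AUX_FORMS:
            return [conjunction, rest]
    # Anything else, including the independent auxiliary bys, is left alone.
    return [form]
```

For example, `split_conditional_conjunction("kdybychom")` gives když and bychom, and `split_conditional_conjunction("aby")` gives aby and by.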
Heuristic to transform the tree if only surface tokens are desired as nodes: attach the fused token (e.g. abychom) to the parent of its first part (aby) and give it that part's dependency label. Tag it as a subordinating conjunction and merge the features of both parts:
3-4 abychom _ _ _ _ _ _ _ _
3 aby aby SCONJ J,------------- _ 7 mark _ _
4 bychom být AUX Vc-P---1------- Mood=Cnd|Number=Plur|Person=1|VerbForm=Fin 7 aux _ _
will be transformed to
3 abychom aby SCONJ J,-P---1------- Mood=Cnd|Number=Plur|Person=1|VerbForm=Fin 6 mark _ _
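A rough sketch of this heuristic over one sentence, given as a list of 10-column CoNLL-U rows, might look as follows. The helper names (`merge_xpos`, `merge_feats`, `collapse_mwts`) are hypothetical. The sketch assumes no comment lines or empty nodes, assumes that the head of the first part lies outside the multi-word token, and merges the positional tags by filling unset positions of the first tag from the second.

```python
def merge_xpos(first, second):
    """Keep POS and subPOS of the first tag; fill '-' positions from the second."""
    if len(first) != len(second):
        return first
    rest = "".join(a if a != "-" else b for a, b in zip(first[2:], second[2:]))
    return first[:2] + rest


def merge_feats(first, second):
    """Union of two FEATS strings; on conflict the first part wins."""
    feats = {}
    for column in (first, second):
        if column != "_":
            for pair in column.split("|"):
                key, value = pair.split("=", 1)
                feats.setdefault(key, value)
    ordered = sorted(feats.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in ordered) or "_"


def collapse_mwts(rows):
    """Replace every multi-word token by one node and renumber the sentence."""
    out = []
    new_id = {}  # old word ID -> new word ID
    i = 0
    while i < len(rows):
        row = rows[i]
        if "-" in row[0]:
            start, end = map(int, row[0].split("-"))
            parts = rows[i + 1 : i + 2 + end - start]
            merged = list(parts[0])  # lemma, UPOS, head, label of the first part
            merged[1] = row[1]       # surface form of the fused token
            for part in parts[1:]:
                merged[4] = merge_xpos(merged[4], part[4])
                merged[5] = merge_feats(merged[5], part[5])
            out.append(merged)
            for part in parts:
                new_id[part[0]] = str(len(out))
            i += 1 + len(parts)
        else:
            out.append(list(row))
            new_id[row[0]] = str(len(out))
            i += 1
    for position, row in enumerate(out, start=1):
        row[0] = str(position)
        row[6] = new_id.get(row[6], row[6])  # root head "0" stays unchanged
    return out
```

Dependents of a removed part are re-attached to the merged node through the same ID map, which is one possible design choice; the heuristic above does not specify it.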
Verb + Conjunction neboť
- dělámť = neboť dělám = “because I do”
- děláť = neboť dělá = “because he/she/it does”
- dělalť = neboť dělal = “because he did”
The word forms in this group can be considered archaic.
There is only one occurrence in the PDT 3.0 data of the word neníť “because it is not” (tagged Vt-S---3P-NA--2).