POS tags
Open class words | Closed class words | Other |
---|---|---|
ADJ | ADP | PUNCT |
ADV | AUX | SYM |
INTJ | CONJ | X |
NOUN | DET | |
PROPN | NUM | |
VERB | PART | |
| | PRON | |
| | SCONJ | |
ADJ: adjective
Definition
Adjectives are words that typically modify nouns and specify their properties or attributes. They may also function as predicates, as in
To auto je zelené. “The car is green.”
The ADJ tag is intended for ordinary adjectives only. See DET for determiners and NUM for cardinal numerals.
In accord with the UD approach, adjectival ordinal numerals (první, sedmý, stopadesátý) are tagged as adjectives, although traditional grammar classifies them as numerals. They behave like adjectives both morphologically and syntactically, with the exception that they cannot be compared or negated.
Most Czech adjectives inflect for cs-feat/Gender (velký – velká – velké) “big”, cs-feat/Number (velký – velcí), cs-feat/Case (velký – velkého – velkému – velkém – velkým), cs-feat/Degree (velký – větší – největší), and Negation (velký – nevelký).
Examples
- velký “big”
- starý “old”
- zelený “green”
- nejneobhospodařovávatelnějšímu “to the most uncultivatable” (the longest Czech word)
- otcův, matčin “father’s, mother’s” (possessive adjectives)
- první, druhý, třetí “first, second, third”
- udělaný “done” (passive participial adjective, see below)
- scvrklý “shrivelled” (past participial adjective)
- dělající “doing” (present participial adjective, derived from present transgressive)
- udělavší “having done” (past participial adjective, derived from past transgressive)
Border cases
Passive participles lie on the border between verbs and adjectives.
Core participial forms (ending in a consonant or a short vowel) are tagged VERB.
Long forms are participial adjectives and are tagged ADJ.
For example:
- Verb: nesen, nesena, neseno, neseni, neseny “carried”
- Adjective: nesený, nesená, nesené, nesení, nesené “carried”
Their meanings are almost identical, but their usage differs slightly. Both groups can be used in nominal predication with a copula. Only true participles (verbs) can form the passive voice (although this can be difficult to distinguish from copula constructions; see AUX). The participial adjectives, on the other hand, inflect for case and can thus modify nouns.
There is an analogy with some adjectives that have preserved so-called nominal (short) forms, although these adjectives are not derived from verbs. Example:
- Short (nominal) forms: stár, stára, stáro “old”
- Normal (pronominal) forms: starý, stará, staré “old”
Here both groups are ADJ. The nominal forms are used in predication, the standard forms both in predication and to modify nouns.
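The short/long distinction for passive participles described above can be sketched in Python. This is a minimal illustration, not part of any UD or PDT tool: the function name and the ending list are assumptions, the input is assumed to be already known as a passive participle or participial adjective, and only nominative forms are handled (oblique case forms such as neseného are out of scope).

```python
import unicodedata

# Nominative endings of the long (adjectival) forms: a long vowel.
# Assumption for this sketch: the input is a nominative form.
LONG_ENDINGS = ("ý", "á", "é", "í")


def participle_upos(form: str) -> str:
    """Return ADJ for long forms and VERB for short (core) forms."""
    # Normalize so that decomposed accents compare equal to the literals above.
    normalized = unicodedata.normalize("NFC", form).lower()
    return "ADJ" if normalized.endswith(LONG_ENDINGS) else "VERB"


short_forms = ["nesen", "nesena", "neseno", "neseni", "neseny"]
long_forms = ["nesený", "nesená", "nesené", "nesení"]

print([participle_upos(f) for f in short_forms])
print([participle_upos(f) for f in long_forms])
```

The check is purely orthographic, so it should only be applied after the word has been identified as a participial form; a short adjective such as stár would otherwise be mislabeled.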
ADP: adposition
Definition
Czech has only prepositions but no postpositions or circumpositions. They occur before a complement noun phrase (noun, pronoun) and they form a single structure with the complement to express its grammatical and semantic relation to another unit within a clause.
Some prepositions take the form of fixed multiword expressions, e.g. na rozdíl od “in contrast to”, v souvislosti s “in connection with”. The component words are then still tagged according to their basic use (na is ADP, rozdíl is NOUN, etc.) and their status as multiword expressions is accounted for in the syntactic annotation.
Examples
- v “in, at”
- k “to”
- během “during”
ADV: adverb
Definition
Adverbs are words that typically modify verbs for such categories as time, place, direction or manner. They may also modify adjectives and other adverbs, as in velmi významně “very significantly” or prokazatelně chybný “provably wrong”.
There is a closed subclass of pronominal adverbs that refer to circumstances in context, rather than naming them directly; similarly to pronouns, these can be categorized as interrogative, relative, demonstrative etc. Pronominal adverbs also get the ADV part-of-speech tag but they are differentiated by additional features.
In accord with the UD approach, adverbial ordinal numerals (poprvé, posedmé, postopadesáté) are tagged ADV, although the traditional grammar classifies them as numerals.
The same holds for multiplicative numerals (jednou, sedmkrát, stopadesátkrát).
Note that Czech transgressives (also called adverbial participles) are tagged VERB, not ADV.
Examples
- velmi “very”
- dobře “well”
- přesně “exactly”
- zítra “tomorrow”
- nahoru, dolů “up, down”
- ordinal numeral adverbs: poprvé, podruhé, potřetí “for the first time, for the second time, for the third time”
- multiplicative numeral adverbs: jednou, dvakrát, třikrát “once, twice, three times”
- interrogative adverbs: kde, kam, kdy, jak, proč “where, where to, when, how, why”
- demonstrative adverbs: tady, tam, teď, tehdy, tak “here, there, now, then, so”
- indefinite adverbs: někde, někam, někdy, nějak “somewhere, to somewhere, sometime, somehow”
- total adverbs: všude, vždy “everywhere, always”
- negative adverbs: nikde, nikdy “nowhere, never”
AUX: auxiliary verb
Definition
The only truly auxiliary verb in Czech is být “to be”, and its variant (with separate lemma) bývat “to usually be”. It accompanies the lexical verb of a verb phrase and expresses grammatical distinctions not carried by the lexical verb.
Examples
- Future tense. Finite future form of být is combined with infinitive of the lexical verb. The auxiliary expresses person, number and tense: budu dělat “I will do”, budeš dělat “you will do”, budou dělat “they will do”. Note that a limited set of verbs can form future morphologically, without the auxiliary.
- Past tense in the first and second person. Finite present form of být is combined with past participle of the lexical verb. The auxiliary expresses person and number, the participle expresses gender and number: dělal jsem “I did.Masc”, dělala jsem “I did.Fem”, dělal jsi “you did.Masc”, dělali jsme “we did.Masc”.
- Conditional mood. Conditional form (historically aorist) of být is combined with past participle of the lexical verb. The auxiliary expresses person and number, the participle expresses gender and number: dělal bych “I would do.Masc”, dělala bych “I would do.Fem”, dělal bys “you would do.Masc”, dělali bychom “we would do.Masc”.
- Passive voice. A form of být (in various tenses and moods or in the infinitive) is combined with passive participle of the lexical verb. The auxiliary expresses person, number, tense and mood, the participle expresses gender, number and voice: je udělán “he is done”, bude udělán “he will be done”, byl udělán “he was done”, byl by udělán “he would be done”, buď udělán “be done”, být udělán “to be done”.
Note that the verb být will not be tagged AUX if it is used as copula (Moje auto je nové. “My car is new.”) or as a content verb (V Praze je nové divadlo. “There is a new theatre in Prague.”). It will be tagged VERB in these cases.
It is also possible that an auxiliary být modifies a lexical být (V Praze by bylo nové divadlo. “There would be a new theatre in Prague.”).
Note that the passive participle may also be used as a nominal predicate with a copula. Hence it may be difficult to distinguish a passive construction from a copula construction. The former focuses on the process while the latter emphasizes the result.
- Passive: Smlouva byla.AUX podepsána v Bílém domě. “The contract was signed in the White House.”
- Copula: Smlouva byla.VERB podepsána červeným inkoustem. “The contract was signed in red ink.”
Modal verbs are not auxiliaries
Czech modal verbs are not considered auxiliary and they are tagged VERB, in accord with the annotation in the Prague Dependency Treebank. Their behavior is only slightly different from other content verbs.
Constructions with mít and passive participle
There is a construction parallel to the perfect tenses of Germanic and Romance languages: mít “to have” + neuter singular passive participle, e.g. mít (něco) uděláno “to have (something) done”. It can also apply to intransitive verbs: mít vyhráno “to have won”. Sometimes the verb mít shares the subject (actor) with the participle, but in other contexts such a relation is not guaranteed: mít (někde něco) napsáno “to have (something) written (somewhere)”. None of these constructions is considered a separate tense in Czech grammar and the verb mít is not analyzed as an auxiliary.
References
- Loos, Eugene E., et al. 2003. Glossary of linguistic terms: What is an auxiliary verb?
- Wikipedia
- Jarmila Panevová, Eva Benešová, Petr Sgall. 1971. Čas a modalita v češtině (Tense and modality in Czech). Acta Universitatis Carolinae, Philologica Monographia XXXIV, Universita Karlova Praha
CONJ: coordinating conjunction
Definition
A coordinating conjunction is a word that links words or larger constituents without syntactically subordinating one to the other and expresses a semantic relationship between them.
For subordinating conjunctions, see SCONJ.
Examples
- a “and”
- nebo “or”
- ale “but”
References
- Loos, Eugene E., et al. 2003. Glossary of linguistic terms: What is a coordinating conjunction?
- Wikipedia
DET: determiner
Definition
Determiners are words that modify nouns or noun phrases and express the reference of the noun phrase in context. That is, a determiner may indicate whether the noun is referring to a definite or indefinite element of a class, to a closer or more distant element, to an element belonging to a specified person or thing, to a particular number or quantity, etc.
An important point to note is that the traditional grammar of Czech does not define determiners as a separate word class. Czech does not have articles. Most determiners are traditionally called pronouns; that is, a UD-conformant annotation of Czech must distinguish between substantive pronouns (UD tag PRON) and attributive pronouns (UD tag DET).
Also note that the DET tag includes (pronominal) quantifiers (words like mnoho, málo “many, few”), which the traditional grammar classifies as a special subclass of numerals. However, cardinal numerals in the narrow sense (jeden, pět, sto) are not tagged DET even though some authors would include them in quantifiers. Cardinal numbers have their own tag NUM.
Conversion from the Prague Dependency Treebank
Since the PDT tagset (like all other Czech tagsets) does not distinguish substantive and attributive pronouns, morphological tags alone are not enough to find the correct universal POS tag.
Morphological rules could help, as the inflection patterns of some pronouns bear similarities to adjectival inflection; nevertheless, there will be other cases that cannot be solved this way.
We have to examine the dependency tree. If a pronoun modifies a noun, it should be tagged DET. Otherwise it is PRON. As a result, all words that can be tagged DET can also be tagged PRON, but some words can only be tagged PRON. (We cannot recognize cases where the pronoun is in fact attributive, but the modified noun has been elided and is not represented in the tree.)
For instance, tohle “this” is either pronoun (Tohle jsem viděl včera. “I saw this yesterday.”) or determiner (Tohle auto jsem viděl včera. “I saw this car yesterday.”)
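The tree-based rule above can be sketched as follows. This is only an illustration under stated assumptions: the `Token` class and `pronoun_upos` function are hypothetical, the `pdt_pos` field is assumed to hold the first position of the PDT positional tag (P for pronoun, N for noun, V for verb, D for adverb), and `head` is a 1-based index of the governing token with 0 for the root. The head indices in the two example sentences are supplied by hand for the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    form: str
    pdt_pos: str  # assumed: first character of the PDT positional tag
    head: int     # 1-based index of the governor, 0 for the root


def pronoun_upos(sentence: List[Token], index: int) -> str:
    """Return DET if the pronoun at `index` (0-based) modifies a noun, else PRON."""
    token = sentence[index]
    if token.pdt_pos != "P":
        raise ValueError(f"{token.form!r} is not a PDT pronoun")
    if token.head == 0:
        return "PRON"
    parent = sentence[token.head - 1]
    return "DET" if parent.pdt_pos == "N" else "PRON"


# Tohle auto jsem viděl včera. "I saw this car yesterday."
with_noun = [
    Token("Tohle", "P", 2), Token("auto", "N", 4), Token("jsem", "V", 4),
    Token("viděl", "V", 0), Token("včera", "D", 4),
]
# Tohle jsem viděl včera. "I saw this yesterday."
without_noun = [
    Token("Tohle", "P", 3), Token("jsem", "V", 3),
    Token("viděl", "V", 0), Token("včera", "D", 3),
]

print(pronoun_upos(with_noun, 0))
print(pronoun_upos(without_noun, 0))
```

As the text notes, a rule of this kind cannot detect attributive pronouns whose noun has been elided, so such cases would come out as PRON.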
Examples
- possessive determiners: můj, tvůj, jeho, její, náš, váš, jejich “my, your, his, her, our, your, their”
- reflexive possessive determiner: svůj “one’s own”
- demonstrative determiners: tohle as in Tohle auto jsem viděl včera. “I saw this car yesterday.”
- interrogative determiners: který as in Které auto se ti líbí? “Which car do you like?”
- relative determiners: který as in Zajímá mě, které auto se ti líbí. “I wonder which car you like.”
- relative possessive determiner: jehož “whose”
- indefinite determiners: nějaký, některý
- total determiners: každý, všechen
- negative determiners: žádný as in Nemáme žádná auta. “We have no cars available.”
INTJ: interjection
Definition
An interjection is a word that is used most often as an exclamation or part of an exclamation. It typically expresses an emotional reaction, is not syntactically related to other accompanying expressions, and may include a combination of sounds not otherwise found in the language.
As a special case of interjections, we recognize feedback particles such as ano, jo, ne, etc. Note that these words are considered particles in the PDT tagset and have to be retagged during the conversion process.
Examples
(Note that no direct translation of interjections is possible. The approximate translations below are for orientation purposes and they cannot serve to judge the part of speech from the English perspective.)
- ach “oh”
- pink
- inu “well”
- hle “look”
- proboha “for God’s sake”
Diffs
Prague Dependency Treebank
At present the UD-conversion of PDT keeps the PDT convention on tagging the response words (“yes, no”) as particles. Automatic conversion would not be straightforward because the negative particle ne is sometimes used as the response particle/interjection (English “no”) and sometimes as a free negative morpheme (English “not”). These two usages would have to be distinguished and only the first one converted to interjection.
NOUN: noun
Definition
Nouns are a part of speech typically denoting a person, place, thing, animal or idea.
The NOUN tag is intended for common nouns only. See PROPN for proper nouns and PRON for pronouns.
Czech nouns have the lexical feature cs-feat/Gender. Furthermore, the nouns inflect for cs-feat/Number and cs-feat/Case.
A verbal noun can be derived productively from almost every verb (e.g. dělat “to do” → dělání “doing”). While in other languages a corresponding form may be called gerund and tagged VERB, in Czech it is tagged NOUN. It always has neuter gender and the full number-case inflectional paradigm.
Examples
- dívka “girl”
- kočka “cat”
- strom “tree”
- vzduch “air”
- krása “beauty”
- plavání “swimming”
NUM: numeral
Definition
A numeral is a word, functioning most typically as a determiner, adjective or pronoun, that expresses a number and a relation to the number, such as quantity, sequence, frequency or fraction.
Note that cardinal numerals are covered by NUM whether they are used as determiners or not (as in Windows 7) and whether they are expressed as words (čtyři), digits (4) or Roman numerals (IV).
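For the digit and Roman-numeral cases, a simple pattern check can serve as a first approximation. The sketch below is an assumption-laden illustration, not a rule taken from the guidelines: the function name is hypothetical, numerals written as words would need a lexicon and are not covered, and single letters such as I or V may just as well be initials or abbreviations in real text.

```python
import re

# Plain digits, optionally with one decimal separator (dot or comma).
DIGITS = re.compile(r"\d+(?:[.,]\d+)?")
# Standard-form Roman numerals up to 3999 (the pattern also matches the
# empty string, which is excluded explicitly below).
ROMAN = re.compile(r"M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})")


def looks_like_numeric_num(token: str) -> bool:
    """Heuristic: is the token written in digits or as a Roman numeral?"""
    if not token:
        return False
    return bool(DIGITS.fullmatch(token) or ROMAN.fullmatch(token))


for candidate in ["4", "2014", "3.14159265359", "IV", "MMXIV", "čtyři"]:
    print(candidate, looks_like_numeric_num(candidate))
```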
Czech grammar distinguishes several subclasses of pronominal numerals (quantifiers): interrogative and relative (kolik “how many”); demonstrative (tolik “this many”); indefinite (několik, mnoho, málo “several, many, few”). These words behave similarly to (most) cardinal numbers, e.g. they require that the counted noun phrase be in genitive. They are not similar to adjectives (unlike their English counterparts). However, in accord with the UD standard, they should be tagged DET, not NUM.
In addition, several types of (non-pronominal) numerals, such as ordinal numerals and multiplicative numerals, are tagged ADJ or ADV, based on their syntactic and morphological behavior.
Examples
- 0, 1, 2, 3, 4, 5, 2014, 1000000, 3.14159265359
- I, II, III, IV, V, MMXIV
- jeden, dva, tři, čtyři, pět, sedmdesát “one, two, three, four, five, seventy”
- polovina, třetina, čtvrtina, pětina “one-half, one third, quarter, one fifth”: denominators of fractions constitute a separate class of cardinal numerals.
- čtvero, patero “four, five” (These are special forms, so-called generic numerals. They are used rarely, in literary or archaic style.)
- jedny, dvoje, troje, čtvery, patery, sedmdesátery “one set of, two sets of, three sets of, four sets of, five sets of, seventy sets of”
Counterexamples
- první, druhý, třetí “first, second, third”: adjectival ordinal numerals. They are tagged ADJ, and the cs-feat/NumType feature reveals their semantic relation to numbers.
- poprvé, podruhé, potřetí “for the first time, for the second time, for the third time”: adverbial ordinal numerals. They are tagged ADV, and the cs-feat/NumType feature reveals their semantic relation to numbers.
- jednou, dvakrát, třikrát “once, twice, three times”: multiplicative numerals. They are tagged ADV, and the cs-feat/NumType feature reveals their semantic relation to numbers.
- dvojí, trojí, čtverý, paterý, sedmdesáterý “twofold, three kinds of, four kinds of, five kinds of, seventy kinds of”: generic numerals. They are tagged ADJ.
- dvojice, trojice, čtveřice “pair, triplet, foursome”: n-tuples (n-tice) are not considered numerals in the Czech grammar. They are tagged NOUN.
- jednička, dvojka, trojka, čtyřka, pětka “number one, number two, number three, number four, number five”: names of numbers, or of objects identified by the number (e.g. of a bus route). They are not considered numerals and they are tagged NOUN.
- tisíc, milión, miliarda, bilión “thousand, million, billion, trillion”: words for large quantities are ambiguous between cardinal numerals (tagged NUM) and nouns. If they inflect as nouns, they are tagged NOUN; but the borderline is fuzzy. For instance, in phrases like tisíce lidí demonstrovaly v ulicích (“thousands of people demonstrated in the streets”), tisíce is a noun. In numeric expressions, e.g. 110 tisíc dolarů (“110 thousand dollars”), it is a cardinal numeral.
PART: particle
Definition
Particles are function words that must be associated with another word or phrase to impart meaning and that do not satisfy definitions of other universal parts of speech (e.g. adpositions, coordinating conjunctions, subordinating conjunctions or auxiliary verbs). Particles may encode grammatical categories such as negation, mood, tense etc. Czech particles are not inflected.
Note that response words such as ano, jo “yes”, ne “no”, etc. are considered particles in the PDT tagset but they should be retagged as interjections under the UD standard. Also note that ne can be used in two ways: one corresponds to English “no” and the other to “not”. Only the former should become an interjection, while the latter stays a particle.
Examples
- Sentence modality: ať, kéž, nechť (“Let’s do it!” “If only I could do it over.” “May you have an enjoyable stay!”)
- jen “just, only”
- až “only, as late as, even, up to” Use case: až po stovky tisíc let “up to hundreds of thousands of years”
- asi “about, roughly, maybe”
Diffs
Prague Dependency Treebank
- li “if”: This is an encliticized morpheme that functions as subordinating conjunction but it always immediately follows the predicate of the subordinate clause. For example: Nebude-li pršet, nezmoknem. lit. Will-not-if rain, we-will-not-get-wet. “We will not get wet if it does not rain.” PDT tags the li morpheme as particle and it is currently kept so in the UD conversion but it might be changed to SCONJ in the future releases.
- At present the UD-conversion of PDT keeps the PDT convention on tagging the response words (“yes, no”) as particles. Automatic conversion would not be straightforward because the negative particle ne is sometimes used as the response particle/interjection (English “no”) and sometimes as a free negative morpheme (English “not”). These two usages would have to be distinguished and only the first one converted to interjection.
PRON: pronoun
Definition
Pronouns are words that substitute for nouns or noun phrases, whose meaning is recoverable from the linguistic or extralinguistic context.
Pronouns under this definition function like nouns. Note that Czech grammar traditionally extends the term pronoun to words that substitute for adjectives. Such words are not tagged PRON under our universal scheme. They are tagged as determiners in order to annotate the same thing the same way across languages.
For instance, tohle “this” is traditionally called a pronoun in Czech grammar, regardless of context (the notion of determiners does not exist in Czech grammar). To make the annotation parallel across languages, it should now be tagged PRON in Tohle jsem viděl včera. “I saw this yesterday.” and DET in Tohle auto jsem viděl včera. “I saw this car yesterday.”
Examples
- personal pronouns: já, ty, on, ona, ono, my, vy, oni, ony “I, you, he, she, it, we, you, they, they”
- reflexive pronouns: sebe, se, sobě, si, sebou “oneself”
- demonstrative pronouns: tohle as in Tohle jsem viděl včera. “I saw this yesterday.”
- interrogative pronouns: kdo, co “who, what” as in Co si myslíš? “What do you think?”
- relative pronouns: kdo, co “who, what” as in Zajímalo by mě, co si myslíš. “I wonder what you think.”
- indefinite pronouns: někdo, něco “somebody, something”
- total pronouns: každý, všichni “everybody, all”
- negative pronouns: nikdo, nic “nobody, nothing”
PROPN: proper noun
Definition
A proper noun is a noun that is the name of a specific individual, place, or object. Czech proper nouns are always written starting with an uppercase letter. Note that names of days of week (pondělí, úterý, středa, čtvrtek, pátek, sobota, neděle) and names of months (leden, únor, březen, duben, květen, červen, červenec, srpen, září, říjen, listopad, prosinec) are not written capitalized (unlike in English) and are not considered proper nouns.
Single-word named entities should be tagged PROPN even if they originate from a common noun (Zajíc, Huť) or an adjective (Veselý, Teplá). Even if they were originally adjectives and inflect according to adjectival paradigms, they behave syntactically as nouns. For instance, Teplá (a river and city in western Bohemia) is originally the feminine form of the adjective teplý “warm” but as a geographical name, it is a noun. It denotes a concrete location (rather than a property of somebody/something) and its feminine gender is fixed (while adjectives have forms in all three genders).
Note that names of languages (čeština, angličtina) and adjectives derived from geographical names (český, anglický “Czech, English”) are written in lowercase and are not tagged PROPN.
Personal names are typically treated as a sequence of proper nouns (one or more given names and one or more surnames). If the name contains prepositions, conjunctions or articles (foreign names and old Czech names), these are tagged as ADP, CONJ and DET, respectively.
Czech (and other Slavic) multi-word named entities have internal syntactic structure, which is preserved in the annotation. The headword is always a noun and there may be other nouns involved. They will be tagged either PROPN or NOUN and possible ambiguities must be resolved individually.
Modifying adjectives are never tagged PROPN. Even if an adjective is the first word of a multi-word name, and thus it starts with an uppercase letter, it is still tagged ADJ.
Similarly, function words in named entities retain their normal tags.
These rules are less strict for foreign named entities where the original part of speech is hidden for a Czech speaker.
Examples
- Bečov.PROPN nad.ADP Teplou.PROPN is a city. Bečov is the head and the nad Teplou part refers to the river flowing through the city, to distinguish it from other Bečovs.
- Červený.ADJ Újezd.PROPN is a village. Újezd is the head and it is tagged PROPN although it originates in the common noun újezd “district, riding”. There are many locations named Újezd and the noun is perceived as a proper noun in current Czech. Červený is an adjective meaning “red” and it is tagged ADJ.
- Červená.ADJ řeka.NOUN “Red River”. Even though the two words together are a name of a particular river, řeka is a common noun and is tagged as such.
- Organizace.NOUN spojených.ADJ národů.NOUN “United Nations Organization” consists of three words, none of which is proper noun. However, the acronym OSN “UNO” is a single-token name and is tagged PROPN.
Conversion from the Prague Dependency Treebank
The PDT set of morphological (part-of-speech) tags does not distinguish common and proper nouns. However, lemmas in PDT contain additional features that also encode types of named entities. When converting the PDT annotation to UD, these lemma features are removed, the PROPN tag is used and the feature cs-feat/NameType is added to the universal features to preserve the type. Only nouns are treated this way. Foreign adjectives are not converted to PROPN despite the fact that they entered Czech as parts of foreign names and their lemmas contain the name type feature.
The following table lists the name types together with the most frequent examples. See http://ufal.mff.cuni.cz/techrep/tr27.pdf, page 8, section 2.1 (Lemma structure) for more details.
Lemma feature | Name type | Most frequent examples | English |
---|---|---|---|
_;Y | given name | Jan, Jiří, Václav, Petr, Josef | “Jan, Jiří, Václav, Petr, Josef” |
_;S | surname | Klaus, Havel, Němec, Jelcin, Svoboda | “Klaus, Havel, Němec, Yeltsin, Svoboda” |
_;E | member of a particular nation, inhabitant of a particular territory | Němec, Čech, Srb, Američan, Slovák | “German, Czech, Serb, American, Slovak” |
_;G | geographical name | Praha, ČR, Evropa, Německo, Brno | “Prague, CR, Europe, Germany, Brno” |
_;K | company, organization, institution | ODS, OSN, Sparta, ODA, Slavia | “ODS, UN, Sparta, ODA, Slavia” |
_;R | product | LN, Mercedes, Tatra, PC, MF | “LN, Mercedes, Tatra, PC, MF” |
_;m | other proper name: names of mines, stadiums, guerilla bases etc. | US, PVP, Prix, Rapaport, Tour | “US, PVP, Prix, Rapaport, Tour” |
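The conversion of noun lemmas described in this section might look roughly like the sketch below. It is an illustration only: the function name is hypothetical, the lemma is simply cut at the first underscore (real PDT lemmas can carry further technical suffixes that are not handled here), and the NameType values (Giv, Sur, Nat, Geo, Com, Pro, Oth) are assumed from the UD feature inventory and should be checked against cs-feat/NameType.

```python
import re
from typing import Optional, Tuple

# Assumed mapping from the PDT lemma markers in the table above
# to UD NameType values.
NAME_TYPES = {
    "Y": "Giv", "S": "Sur", "E": "Nat", "G": "Geo",
    "K": "Com", "R": "Pro", "m": "Oth",
}
MARKER = re.compile(r"_;([A-Za-z])")


def convert_noun_lemma(pdt_lemma: str) -> Tuple[str, str, Optional[str]]:
    """Return (bare lemma, UPOS, NameType or None) for a PDT noun lemma."""
    codes = MARKER.findall(pdt_lemma)
    types = sorted({NAME_TYPES[c] for c in codes if c in NAME_TYPES})
    bare = pdt_lemma.split("_", 1)[0]  # drop everything from the first marker on
    if not types:
        return bare, "NOUN", None
    return bare, "PROPN", ",".join(types)


print(convert_noun_lemma("Praha_;G"))
print(convert_noun_lemma("Havel_;S"))
print(convert_noun_lemma("strom"))
```

The function is meant for nouns only; as stated above, foreign adjectives keep their tag even when their lemma carries a name type marker.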
Diffs
Prague Dependency Treebank
Articles in foreign names (the, die, le) are tagged ADJ, not DET. Otherwise, the morphological analysis usually includes the original part of speech of foreign words.
PUNCT: punctuation
Definition
Punctuation marks are non-alphabetical characters and character groups used to delimit linguistic units in printed text.
Punctuation is not taken to include logograms such as $, %, and §, which are instead tagged as SYM.
Examples
- Period: .
- Comma: ,
- Parentheses: ()
Diffs
Prague Dependency Treebank
The PDT texts are from the early 1990s and there are no e-mail addresses.
If they were there, the PDT tokenization rules would break them up on all dots and at signs.
The same holds for telephone numbers. For example, tel.: (05) 4321 6014 is analyzed as eight tokens (NOUN PUNCT PUNCT PUNCT NUM PUNCT NUM NUM).
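The splitting behavior described here can be approximated with a single regular expression. This is a rough stand-in and not the actual PDT tokenizer: it treats every run of word characters as one token and every other non-space character as a token of its own.

```python
import re

# A run of word characters, or any single non-space, non-word character.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list:
    """Split text so that every punctuation character becomes its own token."""
    return TOKEN.findall(text)


print(tokenize("tel.: (05) 4321 6014"))
# ['tel', '.', ':', '(', '05', ')', '4321', '6014']
print(tokenize("john.doe@universal.org"))
# ['john', '.', 'doe', '@', 'universal', '.', 'org']
```

The first call yields the eight tokens of the telephone example; the second shows how an e-mail address would fall apart on dots and the at sign.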
SCONJ: subordinating conjunction
Definition
A subordinating conjunction is a conjunction that links constructions by making one of them a constituent of the other. The subordinating conjunction typically marks the incorporated constituent which has the status of a (subordinate) clause.
For coordinating conjunctions, see CONJ.
Examples
- že “that”
- aby “so that”
- zda “if”
- jako “as”
- než “than”
References
- Loos, Eugene E., et al. 2003. Glossary of linguistic terms: What is a subordinating conjunction?
- Wikipedia
SYM: symbol
Definition
A symbol is a word-like entity that differs from ordinary words by form, function, or both.
Many symbols are or contain special non-alphanumeric characters, similarly to punctuation. What makes them different from punctuation is that they can be substituted by normal words. This involves all currency symbols, e.g. $ 75 is identical to seventy-five dollars.
Mathematical operators form another group of symbols.
Another group of symbols is emoticons and emoji.
Strings that consist entirely of alphanumeric characters are not symbols but they may be proper nouns: 130XE, DC10; others may be tagged PROPN (rather than SYM) even if they contain special characters: DC-10.
Similarly, abbreviations for single words are not symbols but are assigned the part of speech of the full form. For example, Mr. (mister), kg (kilogram), km (kilometr), dr (doktor) should be tagged as nouns.
Acronyms for proper names such as OSN and NATO should be tagged as proper nouns.
Characters used as bullets in itemized lists (•, ‣) are not symbols, they are punctuation.
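For single characters, the symbol/punctuation split can be approximated with Unicode general categories. The sketch below is a heuristic under stated assumptions, not a rule from the guidelines: the function name is hypothetical, and % and § are listed explicitly because Unicode files them under punctuation while the text above treats them as symbols. Multi-character symbols such as emoticons, e-mail addresses or URLs are out of scope.

```python
import unicodedata

# Logograms that Unicode categorizes as punctuation but that are
# treated as symbols here. Illustrative list, not exhaustive.
LOGOGRAMS = {"%", "§"}


def sym_or_punct(char: str) -> str:
    """Classify one character as SYM, PUNCT or OTHER."""
    if len(char) != 1:
        raise ValueError("expected a single character")
    if char in LOGOGRAMS:
        return "SYM"
    category = unicodedata.category(char)
    if category.startswith("S"):   # Sc currency, Sm math, So other symbols
        return "SYM"
    if category.startswith("P"):   # all punctuation categories
        return "PUNCT"
    return "OTHER"


for ch in ["$", "%", "§", "©", "+", ",", "(", "•"]:
    print(ch, sym_or_punct(ch))
```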
Examples
- $, %, §, ©
- +, −, ×, ÷, =, <, >
- :), ♥‿♥, 😝
- john.doe@universal.org, http://universaldependencies.org/, 1-800-COMPANY
Diffs
Prague Dependency Treebank
The PDT part-of-speech tagset does not distinguish symbols from punctuation, hence all non-alphanumeric characters in the converted data are currently tagged PUNCT.
The PDT texts are from the early 1990s and there are no e-mail addresses.
If they were there, the PDT tokenization rules would break them up on all dots and at signs.
The same holds for telephone numbers. For example, tel.: (05) 4321 6014 is analyzed as eight tokens (NOUN PUNCT PUNCT PUNCT NUM PUNCT NUM NUM).
VERB: verb
Definition
A verb is a member of the syntactic class of words that typically signal events and actions, can constitute a minimal predicate in a clause, and govern the number and types of other constituents which may occur in the clause.
Note that the VERB tag covers main verbs (content verbs), modal verbs and copulas but it does not cover auxiliary verbs, for which there is the AUX tag. (Czech modal verbs are not considered auxiliary.) See the description of AUX for more information on the borderline between VERB and AUX.
Czech verbs can take the following morphological forms:
- Infinitive (this is the citation form)
- Finite verb (indicative and imperative forms; conditional is constructed periphrastically)
- Past participle (used to construct past and conditional)
- Passive participle (used to construct passive voice; also used separately as an adjective)
- Transgressive (also called adverbial participle)
There are participial forms that are tagged as adjectives (ADJ) rather than verbs. See below for examples.
A verbal noun can be derived productively from almost every verb (e.g. dělat “to do” → dělání “doing”). While in other languages a corresponding form may be called gerund and tagged VERB, in Czech it is tagged NOUN. It always has the neuter cs-feat/Gender and it inflects for cs-feat/Number and cs-feat/Case.
Examples
- nést “to carry”
- nesu, neseš, nese, neseme, nesete, nesou “I carry, you carry, he/she/it carries, we carry, you carry, they carry”
- nes, nesme, neste “carry” (imperative in different persons and numbers)
- nesl, nesla, neslo, nesli, nesly “carried” (past participle in different genders and numbers)
- nesen, nesena, neseno, neseni, neseny “carried” (passive participle in different genders and numbers)
- nesa, nesouc, nesouce “carrying” (present transgressive in different genders and numbers)
Border cases
There are passive participles as verb forms (VERB) and participial adjectives (ADJ). For example:
- Verb: nesen, nesena, neseno, neseni, neseny “carried”
- Adjective: nesený, nesená, nesené, nesení, nesené “carried”
Their meanings are almost identical, but their usage differs slightly. Both groups can be used in nominal predication with a copula. Only true participles (verbs) can form the passive voice (although this can be difficult to distinguish from copula constructions; see AUX). The participial adjectives, on the other hand, inflect for case and can thus modify nouns.
There is an analogy with some adjectives that have preserved so-called nominal (short) forms, although these adjectives are not derived from verbs. Example:
- Short (nominal) forms: stár, stára, stáro “old”
- Normal (pronominal) forms: starý, stará, staré “old”
Here both groups are ADJ. The nominal forms are used in predication, the standard forms both in predication and to modify nouns.
X: other
Definition
The tag X is used for words that for some reason cannot be assigned a real part-of-speech category.
A special usage of X is for cases of code-switching where it is not possible (or meaningful) to analyze the intervening language grammatically (and where the dependency relation foreign is typically used in the syntactic analysis). This rarely applies to the PDT data where many foreign words are tagged with their original part of speech.
Even if foreign words are tagged X, this usage does not extend to ordinary loan words which should be assigned a normal part-of-speech. For example, in Skotové nosí kilt “Scots wear kilts”, kilt is an ordinary NOUN.
Examples
- A on pak akorát xfgh pdl jklw “And then he just xfgh pdl jklw”