
Variant: alternative form of word

Sometimes there are multiple word forms for the same lemma and set of features. The Variant feature helps distinguish alternate forms.

In Czech there are two groups of words where double forms are regular and worth capturing: short forms of adjectives and short (clitic) forms of personal pronouns. This feature only marks the non-standard short forms, hence there is only one value, Short. For the long standard forms the Variant feature remains unspecified.
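In CoNLL-U data the feature would appear in the FEATS column as a `Name=Value` pair. The following is a minimal sketch of reading it, assuming pipe-separated features and `_` for an empty column. The helper names `parse_feats` and `variant_of` are hypothetical and not part of any UD tool, and the FEATS strings are hand-written illustrations rather than treebank excerpts.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict; '_' means no features."""
    if feats == "_" or not feats:
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def variant_of(feats):
    """Return 'Short' for marked short forms, None when Variant is unspecified."""
    return parse_feats(feats).get("Variant")


# Illustrative FEATS strings (hand-written, not copied from the treebank).
short_clitic = "Case=Acc|PronType=Prs|Reflex=Yes|Variant=Short"
long_form = "Case=Acc|PronType=Prs|Reflex=Yes"

print(variant_of(short_clitic))  # Short
print(variant_of(long_form))     # None
```

Because the long standard forms carry no Variant value at all, a missing key (here `None`) is the signal for "long form", not a second feature value.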

Short: short form of adjectives

The short form is called the nominal form of the adjective (jmenný tvar přídavného jména), as opposed to the long form, which is called pronominal because it originated historically as a combination of a nominal form and a personal pronoun. In modern Czech, only a subset of the nominal forms survives, and using them sometimes sounds slightly archaic. They are used as nominal predicates with a copula, but they do not appear as premodifiers of nouns. The pronominal forms are considered standard, except for two frequent adjectives that do not have them: třeba, rád.

Examples

Short: short (clitic) form of personal pronouns

Some personal pronouns have double forms in the dative and accusative Case. The normal (long) form is freer in the word-order positions it can take. The short forms are clitics (http://cs.wikipedia.org/wiki/P%C5%99%C3%ADklonka). They are separate words (unlike in some other languages), but in the word order they usually stick to the second position.


Treebank Statistics (UD_Czech)

This feature is language-specific. It occurs with 1 value: Short.

29070 tokens (2%) have a non-empty value of Variant. 159 types (0%) occur at least once with a non-empty value of Variant. 57 lemmas (0%) occur at least once with a non-empty value of Variant. The feature is used with 2 part-of-speech tags: cs-pos/PRON (27181; 2% of instances), cs-pos/ADJ (1889; 0% of instances).
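Counts of this kind (tokens, types, lemmas, and a per-tag breakdown) could be reproduced roughly as follows. This is only a sketch: it assumes that "types" means distinct word forms, the function names are hypothetical, and the sample tokens are hand-made for illustration, not taken from UD_Czech.

```python
from collections import Counter

# Hand-made sample of (form, lemma, upos, feats) tuples; hypothetical data,
# not taken from UD_Czech.
SAMPLE = [
    ("se", "se", "PRON", "Case=Acc|PronType=Prs|Reflex=Yes|Variant=Short"),
    ("se", "se", "PRON", "Case=Acc|PronType=Prs|Reflex=Yes|Variant=Short"),
    ("mi", "já", "PRON", "Case=Dat|Number=Sing|Person=1|PronType=Prs|Variant=Short"),
    ("mně", "já", "PRON", "Case=Dat|Number=Sing|Person=1|PronType=Prs"),
    ("rád", "rád", "ADJ", "Gender=Masc|Number=Sing|Variant=Short"),
    ("nový", "nový", "ADJ", "Case=Nom|Gender=Masc|Number=Sing"),
]


def has_variant(feats):
    """True if the FEATS string contains any Variant=... pair."""
    return any(pair.startswith("Variant=") for pair in feats.split("|"))


def variant_stats(tokens):
    """Count tokens, distinct forms, distinct lemmas and tags that carry Variant."""
    marked = [t for t in tokens if has_variant(t[3])]
    return {
        "tokens": len(marked),
        "types": len({form for form, _, _, _ in marked}),
        "lemmas": len({lemma for _, lemma, _, _ in marked}),
        "by_upos": dict(Counter(upos for _, _, upos, _ in marked)),
    }


stats = variant_stats(SAMPLE)
print(stats)
```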

PRON

27181 cs-pos/PRON tokens (37% of all PRON tokens) have a non-empty value of Variant.

The most frequent other feature values with which PRON and Variant co-occurred: PronType=Prs (27181; 100%), Gender=EMPTY (25948; 95%), Reflex=Yes (25163; 93%), Number=EMPTY (25163; 93%), Person=EMPTY (25163; 93%), Case=Acc (22246; 82%).
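The co-occurrence figures above treat a missing feature as the value EMPTY. A minimal sketch of that kind of tally is shown below; the function name `cooccurrence`, the choice of which features to inspect, and the FEATS strings are all assumptions made for illustration, not the actual script behind these statistics.

```python
from collections import Counter


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict; '_' or '' means no features."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def cooccurrence(feats_list, other_features):
    """Tally values of other_features on tokens that carry Variant.

    A feature absent from a token is counted as '<name>=EMPTY'.
    """
    counts = Counter()
    total = 0
    for feats in feats_list:
        parsed = parse_feats(feats)
        if "Variant" not in parsed:
            continue  # only tokens with a non-empty Variant are counted
        total += 1
        for name in other_features:
            counts[f"{name}={parsed.get(name, 'EMPTY')}"] += 1
    return total, counts


# Hand-written FEATS strings for illustration only.
PRON_FEATS = [
    "Case=Acc|PronType=Prs|Reflex=Yes|Variant=Short",
    "Case=Dat|PronType=Prs|Reflex=Yes|Variant=Short",
    "Case=Dat|Number=Sing|Person=1|PronType=Prs|Variant=Short",
    "Case=Dat|Number=Sing|Person=1|PronType=Prs",
]

total, counts = cooccurrence(PRON_FEATS, ["PronType", "Reflex", "Number", "Case"])
for key, n in counts.most_common():
    print(f"{key} ({n}; {round(100 * n / total)}%)")
```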

PRON tokens may have the following values of Variant:

ADJ

1889 cs-pos/ADJ tokens (1% of all ADJ tokens) have a non-empty value of Variant.

The most frequent other feature values with which ADJ and Variant co-occurred: Degree=EMPTY (1889; 100%), Case=EMPTY (1889; 100%), Negative=Pos (1844; 98%), Animacy=EMPTY (1529; 81%), Number=Sing (1333; 71%).

ADJ tokens may have the following values of Variant:

Variant seems to be a lexical feature of ADJ. 100% of lemmas (51) occur with only one value of Variant.