
This page pertains to UD version 2.

Hyph: hyphenated compound or part of it

Boolean feature. Is this part of a hyphenated compound? Depending on tokenization, the compound may be one token or be split into several tokens; in the latter case, the tokens need tags.

These are words corresponding to prefixes such as inter- (inter disciplinary), post- (post traumatic), un- (un avoidable), di- (di transitive) and so on in English, but which are realized as distinct tokens (without the hyphen) in other languages.

Yes: it is part of a hyphenated compound

Note that this depends on the tokenization conventions used in the language. For example, in Czech (see below), česko-slovenský is tokenized as three tokens: česko, the hyphen, and slovenský. While slovenský is a normal adjective in Czech, česko is derived from an adjectival stem but is in a form that can never occur as a separate word. On the other hand, it can be combined with many other adjectives denoting affiliation with a country or region: česko-moravský, česko-německý, česko-americký etc. If tokenization left it as one token, the whole word česko-slovenský would simply be an adjective and no Hyph=Yes would be used in the annotation.
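The Czech case above can be sketched in a few lines of Python. This is a minimal illustration, not part of the UD tooling: the names `ROWS`, `parse_feats` and `hyphenated_parts` are hypothetical, each row holds only the ID, FORM, UPOS and FEATS values of a token, and every tag other than Hyph=Yes on česko is an assumption made for the example.

```python
# Three-token analysis of "česko-slovenský" as (ID, FORM, UPOS, FEATS).
# Only Hyph=Yes on "česko" comes from the text above; the remaining
# UPOS and FEATS values are illustrative assumptions.
ROWS = [
    (1, "česko", "ADJ", "Hyph=Yes"),
    (2, "-", "PUNCT", "_"),
    (3, "slovenský", "ADJ", "Degree=Pos|Polarity=Pos"),
]


def parse_feats(feats):
    """Turn a FEATS string such as 'A=1|B=2' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def hyphenated_parts(rows):
    """Return the forms of all tokens annotated with Hyph=Yes."""
    return [
        form
        for _, form, _, feats in rows
        if parse_feats(feats).get("Hyph") == "Yes"
    ]


print(hyphenated_parts(ROWS))
```

Under a tokenization that keeps česko-slovenský as a single token, there would be one adjective row without Hyph=Yes, and `hyphenated_parts` would return an empty list.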

Examples


Hyph in other languages: [cs] [hy] [u]