Hyph
: hyphenated compound or part of it
Boolean feature. Is this part of a hyphenated compound? Depending on tokenization, the compound may be one token or may be split into several tokens; in the latter case, the individual tokens need tags.
These are words corresponding to prefixes such as inter- (interdisciplinary),
post- (post-traumatic), un- (unavoidable), di- (ditransitive) and so on in
English, but which are realized as distinct tokens (without the hyphen) in
other languages.
Yes
: it is part of a hyphenated compound
Note that this depends on the tokenization conventions used in the language.
For example, in Czech (see below), česko-slovenský is tokenized as three
tokens: česko, the hyphen, and slovenský. While slovenský is a normal
adjective in Czech, česko is derived from an adjectival stem but it is in
a form that can never occur as a separate word. On the other hand, it can be
combined with many other adjectives denoting affiliation with a country or
region: česko-moravský, česko-německý, česko-americký etc. If tokenization
left it as one token, the whole word česko-slovenský would simply be an
adjective and no Hyph=Yes would be used in the annotation.
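The effect of tokenization on this feature can be illustrated with a minimal sketch in Python. It assumes CoNLL-U-style rows (ten tab-separated columns, with FEATS in the sixth column). The helper names `row`, `parse_feats` and `hyph_forms` are hypothetical, and columns that the text above does not specify (lemma, head, relation and so on) are left as `_` rather than guessed. Placing Hyph=Yes on česko only is an assumption based on the description above, not a quotation from a treebank.

```python
def row(*cols):
    """Join ten column values into one tab-separated CoNLL-U-style line."""
    assert len(cols) == 10
    return "\t".join(cols)


def parse_feats(feats):
    """Turn a FEATS string such as 'Hyph=Yes|Case=Nom' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def hyph_forms(conllu):
    """Return the word forms of all tokens annotated with Hyph=Yes."""
    forms = []
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a token line in this simplified sketch
        if parse_feats(cols[5]).get("Hyph") == "Yes":
            forms.append(cols[1])
    return forms


# Tokenization A (as described for Czech): three tokens; the bound form
# česko carries Hyph=Yes (assumed placement), slovenský is a normal adjective.
SPLIT = "\n".join([
    row("1", "česko", "_", "ADJ", "_", "Hyph=Yes", "_", "_", "_", "_"),
    row("2", "-", "_", "PUNCT", "_", "_", "_", "_", "_", "_"),
    row("3", "slovenský", "_", "ADJ", "_", "_", "_", "_", "_", "_"),
])

# Tokenization B: the compound stays one token, so no Hyph=Yes is used.
WHOLE = row("1", "česko-slovenský", "_", "ADJ", "_", "_", "_", "_", "_", "_")

print(hyph_forms(SPLIT))  # ['česko']
print(hyph_forms(WHOLE))  # []
```

The same input word therefore yields one Hyph=Yes token under the split tokenization and none when the compound is kept whole.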
Examples
- [cs] česko-slovenský “Czecho-Slovak”
- [en] Anglo-Saxon
Hyph in other languages: [cs] [hy] [u]