UD for Upper Sorbian
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. The exceptions are described below.
- According to typographical rules, many punctuation marks are attached to a neighboring word. We always tokenize them as separate tokens (words); that holds even for hyphenated compounds such as syrisko-arabska “Syrian-Arabic” (three tokens) and for abbreviations such as atd. “etc.” (two tokens).
- Whitespace separating groups of digits in a large number is not treated as a word separator. For example, 1 000 000 (“1,000,000” by English rules) is one token.
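The tokenization rules above can be approximated with a single regular expression. The following is only a minimal sketch, not the tokenizer used to build the treebank: the pattern and the function name `tokenize` are assumptions made for illustration, and the digit-group rule is a heuristic that cannot tell a spaced large number from two adjacent numbers.

```python
import re

# Heuristic pattern; alternatives are tried left to right:
#   1. groups of digits separated by single spaces ("1 000 000") -> one token
#   2. a run of letters and/or digits -> one token
#   3. any other non-space character (punctuation, hyphen) -> its own token
TOKEN_RE = re.compile(r"\b\d{1,3}(?: \d{3})+(?!\d)|\w+|[^\w\s]")


def tokenize(text: str) -> list:
    """Split text into tokens following the rules sketched above."""
    return TOKEN_RE.findall(text)


print(tokenize("syrisko-arabska"))  # ['syrisko', '-', 'arabska']
print(tokenize("atd."))             # ['atd', '.']
print(tokenize("1 000 000"))        # ['1 000 000']
```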
Morphology
Tags
- Upper Sorbian uses all 17 universal POS categories, including particles (PART). At present, three word types are tagged PART: hakle, jenož “only” and nic “not”.
- The pronoun (PRON) vs. determiner (DET) distinction is based on word lists because the traditional grammar does not define determiners. In general, words that inflect for gender, to be able to agree with a modified noun, are tagged DET, even if they act independently in a given sentence; that includes possessives. Pronominal quantifiers (which the traditional grammar includes in numerals) are DET as well.
- Upper Sorbian has just one auxiliary verb (AUX), być (“to be”).
The auxiliary verb is used in several types of constructions:
- The copula with non-verbal predicates.
- Periphrastic future tense (future form of być + infinitive of the main verb).
- Periphrastic perfect tenses (present/past form of być + l-participle of the main verb).
- Periphrastic conditional (conditional form of być + l-participle of the main verb).
- Periphrastic passive.
- In other words, być is the only lemma that occurs with the AUX tag. It may still occur as a normal VERB when it is used in purely existential sentences (i.e., sentences that do not even indicate location; if they do, być is treated as a copula).
- Note that this may change in the future. Existential sentences could be treated as elliptical versions of locational sentences; then the verb would be the root, but it could still be tagged as AUX, and the AUX-VERB distinction could be anchored in the lexicon.
- Verbs with modal meaning are not considered auxiliary in Upper Sorbian.
- There are five main (de)verbal forms, distinguished by the UPOS tag and the value of the VerbForm feature:
Nominal Features
- Nominal words (NOUN, PROPN and PRON) have an inherent Gender feature with one of three values: Masc, Fem or Neut. In some cases the masculine gender is further subclassified by the Animacy values Anim, Nhum and Inan. Feminine and neuter nominals do not distinguish animacy grammatically.
- The three main values of the Number feature are Sing, Dual and Plur. The following parts of speech inflect for number: NOUN, PROPN, PRON, ADJ, DET, VERB, AUX (finite, participles and converbs).
- Selected nouns are plurale tantum (Ptan) or singulare tantum (Coll). These two values are lexical and cannot be used with the agreeing adjectives, determiners or verbs. They also never occur with pronouns.
- Case has 7 possible values: Nom, Gen, Dat, Acc, Voc, Loc, Ins. It occurs with the nominal words, i.e., NOUN, PROPN, PRON, ADJ, DET, and marginally NUM. It can occur with participles but only with those tagged as ADJ. It never occurs with verbs.
- The Case feature also occurs with prepositions (ADP). Here it is a lexical feature. Prepositions do not inflect for case but they subcategorize for the case of their noun phrase.
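The constraints on nominal features in this section can be checked mechanically against the UPOS and FEATS columns of a CoNLL-U file. The sketch below is illustrative only: the function names `parse_feats` and `check_nominal` are hypothetical, and the checks cover just the restrictions stated above (Case never on verbs, Animacy only with masculine gender, Ptan and Coll not on agreeing words or pronouns).

```python
# UPOS tags that may carry Case according to the description above
# (ADP carries it as a lexical feature).
CASE_UPOS = {"NOUN", "PROPN", "PRON", "ADJ", "DET", "NUM", "ADP"}
# UPOS tags that must not carry the lexical Number values Ptan / Coll.
NO_LEXICAL_NUMBER = {"ADJ", "DET", "VERB", "AUX", "PRON"}


def parse_feats(feats: str) -> dict:
    """Parse a CoNLL-U FEATS string such as 'Case=Nom|Gender=Masc'."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_nominal(upos: str, feats: str) -> list:
    """Return a list of human-readable problems (empty if none found)."""
    f = parse_feats(feats)
    problems = []
    if "Case" in f and upos not in CASE_UPOS:
        problems.append(f"Case is not expected with {upos}")
    if "Animacy" in f and f.get("Gender") in {"Fem", "Neut"}:
        problems.append("Animacy is only distinguished for Gender=Masc")
    if f.get("Number") in {"Ptan", "Coll"} and upos in NO_LEXICAL_NUMBER:
        problems.append(f"Number={f['Number']} is lexical and not used with {upos}")
    return problems


print(check_nominal("NOUN", "Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing"))
print(check_nominal("VERB", "Case=Nom"))
```

The first call returns an empty list; the second reports that Case is not expected with VERB.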
Degree and Polarity
- Degree applies to adjectives (ADJ) and adverbs (ADV) and has one of three possible values: Pos, Cmp, Sup.
- Polarity has two values, Pos and Neg, and applies primarily to verbs (VERB, AUX), adjectives (ADJ) and adverbs (ADV) that can be negated using the bound morpheme nje-.
- Negating nouns is usually limited to those derived from verbs.
- The Polarity feature is not used with pronouns and determiners, although there is a subset of negative pronouns and determiners. The PronType=Neg feature is used there instead.
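A small consistency check for Degree and Polarity might look as follows. This is a sketch under assumptions: the function name `check_degree_polarity` is hypothetical, and treating Degree as restricted to ADJ and ADV is a strict reading of the description above rather than a documented validation rule.

```python
def parse_feats(feats: str) -> dict:
    """Parse a CoNLL-U FEATS string such as 'Degree=Cmp|Polarity=Neg'."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_degree_polarity(upos: str, feats: str) -> list:
    """Return problems with Degree / Polarity for one word (empty if none)."""
    f = parse_feats(feats)
    problems = []
    if "Degree" in f:
        if upos not in {"ADJ", "ADV"}:  # assumption: strict reading of the text
            problems.append(f"Degree is not expected with {upos}")
        elif f["Degree"] not in {"Pos", "Cmp", "Sup"}:
            problems.append(f"unknown Degree value {f['Degree']}")
    if "Polarity" in f:
        if upos in {"PRON", "DET"}:
            problems.append("Polarity is not used here; use PronType=Neg instead")
        elif f["Polarity"] not in {"Pos", "Neg"}:
            problems.append(f"unknown Polarity value {f['Polarity']}")
    return problems


print(check_degree_polarity("ADJ", "Degree=Cmp|Polarity=Neg"))
print(check_degree_polarity("PRON", "Polarity=Neg"))
```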
Verbal Features
- Verbs have a lexical Aspect, either imperfective (Imp) or perfective (Perf). A few verbs are biaspectual and they lack the Aspect feature. Some imperfective verbs could be further classified as iteratives but they are not marked as such (although UD provides Aspect=Iter).
- The Aspect feature should also be used with the corresponding derived nouns and adjectives (participles), if they have the VerbForm feature.
- Finite verbs always have one of three values of Mood: Ind, Imp or Cnd. The conditional mood is only used with conditional auxiliaries (bych, by, bychmoj, byštaj, byštej, bychmy, byšće, bychu). The l-participle of the main verb, which is needed to form a periphrastic conditional, is not marked with this feature.
- Verbs in the indicative mood always have one of three values of Tense: Past, Pres or Fut. Note that Tense=Pres is also used with forms of perfective verbs, which are formally present but semantically future. Hence both du domoj “I am going home” and přińdu domoj “I will come home” end up marked as Tense=Pres. On the other hand, a few imperfective verbs can form a genuine future form using prefixes, and they are marked Tense=Fut: póńdu domoj “I will go home”.
- Imperative and conditional forms do not have the Tense feature (note that past and present conditionals are distinguished analytically).
- The Tense feature is also used to mark present adjectival participles (dźěłacy “doing”). The l-participle (tagged VERB or AUX) has Tense=Past because its primary function is to form the past (perfect) tense.
- There are two values of the Voice feature: Act and Pass. Only the passive participle has Voice=Pass. All other verb forms have Voice=Act.
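The interaction of Mood, Tense and Voice described in this section can be expressed as a handful of checks on a FEATS string. The sketch below is not an official validator. The name `check_verbal` is hypothetical, the feature strings in the usage lines are illustrative rather than copied from the treebank, and the code assumes that finite forms carry VerbForm=Fin and that the passive participle carries VerbForm=Part.

```python
def parse_feats(feats: str) -> dict:
    """Parse a CoNLL-U FEATS string such as 'Mood=Ind|Tense=Pres'."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_verbal(feats: str) -> list:
    """Return problems with Mood / Tense / Voice (empty list if none)."""
    f = parse_feats(feats)
    problems = []
    mood = f.get("Mood")
    # Finite verbs always have a mood.
    if f.get("VerbForm") == "Fin" and mood not in {"Ind", "Imp", "Cnd"}:
        problems.append("finite form needs Mood=Ind, Imp or Cnd")
    # Indicative forms always have a tense.
    if mood == "Ind" and f.get("Tense") not in {"Past", "Pres", "Fut"}:
        problems.append("indicative form needs Tense=Past, Pres or Fut")
    # Imperative and conditional forms never have a tense.
    if mood in {"Imp", "Cnd"} and "Tense" in f:
        problems.append(f"Mood={mood} must not be combined with Tense")
    # Assumption: the passive participle is marked VerbForm=Part.
    if f.get("Voice") == "Pass" and f.get("VerbForm") != "Part":
        problems.append("Voice=Pass is only expected on the passive participle")
    return problems


# Illustrative feature strings for "du" and for a mis-annotated conditional.
print(check_verbal("Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin"))
print(check_verbal("Mood=Cnd|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin"))
```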
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns (PRON), determiners (DET) and adverbs (ADV).
- NumType is used with numerals (NUM), adjectives (ADJ), determiners (DET) and adverbs (ADV).
- The Poss feature marks possessive personal determiners (e.g. mój “my”), possessive interrogative, indefinite or negative determiners (e.g. čeji “whose”), possessive relative determiners (e.g. čejiž “whose”) and possessive adjectives (e.g. nanowy “father’s”).
- The Reflex feature marks reflexive pronouns (so, sej) and determiners (swój). In Upper Sorbian it is always used together with PronType=Prs.
- Person is a lexical feature of personal pronouns (PRON) and has three values, 1, 2 and 3. With personal possessive determiners (DET), the feature actually encodes the person of the possessor. Person is not marked on other types of pronouns and on nouns, although they can almost always be interpreted as the 3rd person.
- There are two layered features, Gender[psor] and Number[psor]. They appear with certain possessive adjectives and determiners and encode the lexical gender/number of the possessor. The extra layer is needed to distinguish these lexical features from the inflectional gender and number that mark agreement with the modified (possessed) noun.
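Layered features such as Gender[psor] need slightly more care when parsing FEATS, because the layer is part of the feature name. The following sketch separates the base name from the layer and checks two statements made above: Reflex occurs only together with PronType=Prs, and the psor layer appears only on possessives. The helper names are hypothetical, and the code assumes that possessives are marked Poss=Yes and reflexives Reflex=Yes, as is usual in UD.

```python
import re

# Matches layered feature names such as "Gender[psor]".
LAYERED_RE = re.compile(r"^(?P<name>[A-Za-z]+)\[(?P<layer>[a-z]+)\]$")


def split_feature_name(name: str) -> tuple:
    """Return (base_name, layer); layer is None for plain features."""
    m = LAYERED_RE.match(name)
    return (m["name"], m["layer"]) if m else (name, None)


def check_pronominal(feats: str) -> list:
    """Return problems with Reflex and [psor] features (empty if none)."""
    f = {} if feats in ("", "_") else dict(p.split("=", 1) for p in feats.split("|"))
    problems = []
    if f.get("Reflex") == "Yes" and f.get("PronType") != "Prs":
        problems.append("Reflex=Yes must be combined with PronType=Prs")
    for name in f:
        _, layer = split_feature_name(name)
        if layer == "psor" and f.get("Poss") != "Yes":
            problems.append(f"{name} is only expected on possessives")
    return problems


print(split_feature_name("Gender[psor]"))  # ('Gender', 'psor')
print(check_pronominal("Case=Acc|PronType=Prs|Reflex=Yes"))  # []
```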
Other Features
- Besides the layered features listed above, there are several other language-specific features:
- The following universal features are not used in Upper Sorbian: Definite, Evident, Polite.
Syntax
Core Arguments, Oblique Arguments and Adjuncts
- Nominal subject (nsubj) is a noun phrase in the nominative case, without preposition.
- If the noun phrase is quantified, it may be in the genitive, which is required by the quantifier. If this is the case, then the quantifier is attached using a special relation, either nummod:gov or det:numgov.
- An infinitive verb may serve as the subject and is labeled as clausal subject, csubj. On the other hand, verbal nouns as subjects are just nsubj.
- A finite subordinate clause may serve as the subject and is labeled csubj.
- Objects defined in the Upper Sorbian grammar may be bare noun phrases in accusative, dative, genitive or instrumental, or prepositional phrases in accusative, dative, genitive, locative or instrumental. For the purpose of UD the objects are divided into core objects, labeled obj or iobj, and oblique objects, labeled obl:arg.
- Bare accusative, dative, genitive and instrumental objects are considered core.
- All prepositional objects are considered oblique.
- Accusative objects of some verbs alternate with finite clausal complements, which are labeled ccomp.
- If a verb subcategorizes for the infinitive (e.g. modal verbs or verbs of control), the infinitival complement is labeled xcomp.
- If a verb subcategorizes for two core objects, one of them accusative (or ccomp) and the other non-accusative, then the non-accusative object is labeled iobj. Core nominal objects in other situations are labeled just obj.
- Adjuncts (or, following the Upper Sorbian grammar, adverbial modifiers realized as noun phrases) are usually
prepositional phrases, but they can be bare noun phrases as well. They are labeled obl:
- Temporal modifiers realized as accusative noun phrases: přijědu přichodnu sobotu “I will come next Saturday.”
- Instrumental noun phrases expressing the way or means with which something was done. Example: rěčachu akkadšćinu “they spoke Akkadian.”
- All prepositional phrases that are not prepositional objects (i.e., their role and form is not defined lexically by the predicate) are adjuncts.
- Extra attention has to be paid to clitic forms of reflexive pronouns so (accusative) and sej (dative). They can function as:
- Core objects (obj or iobj).
- Reciprocal core objects (obj or iobj).
- Reflexive passive (expl:pass): skłonjować so “to be declined,” lit. “to decline itself.”
- Inherently reflexive verbs, which cannot exist without the reflexive clitic. In accord with the current UD guidelines, we label the relation between the verb and the clitic as expl:pv, not compound. Example: hodźić so “to fit.”
- In passive clauses (both reflexive and periphrastic passive), the subject is labeled nsubj:pass if it is nominal and csubj:pass if it is clausal.
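The rules for labeling objects given above amount to a small decision procedure. The sketch below illustrates it. It assumes that the complements passed in have already been identified as arguments of the verb (deciding between a prepositional object and an adjunct is lexical and is not attempted here), and the class `Complement` and function `label_complements` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Complement:
    kind: str                      # "nominal", "finite_clause" or "infinitive"
    case: Optional[str] = None     # e.g. "Acc", "Dat"; only for nominals
    has_preposition: bool = False  # True for prepositional phrases


def label_complements(comps: list) -> list:
    """Assign a relation label to each complement of one verb."""
    # An accusative bare object or a finite clausal complement licenses iobj
    # for a second, non-accusative core object.
    has_acc_like = any(
        c.kind == "finite_clause"
        or (c.kind == "nominal" and not c.has_preposition and c.case == "Acc")
        for c in comps
    )
    labels = []
    for c in comps:
        if c.kind == "infinitive":
            labels.append("xcomp")
        elif c.kind == "finite_clause":
            labels.append("ccomp")
        elif c.has_preposition:
            labels.append("obl:arg")  # all prepositional objects are oblique
        elif c.case != "Acc" and has_acc_like:
            labels.append("iobj")
        else:
            labels.append("obj")
    return labels


print(label_complements([Complement("nominal", "Acc"), Complement("nominal", "Dat")]))
# ['obj', 'iobj']
print(label_complements([Complement("nominal", "Dat")]))
# ['obj']
```

Note that a lone dative object comes out as obj, not iobj, because iobj is reserved for the non-accusative member of a pair of core objects.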
Non-verbal Clauses
- The copula verb być (“to be”) is used in equational, attributional, locative, possessive and benefactive nonverbal clauses. Purely existential clauses (without indicating location) use być as well, but there it is treated as the head of the clause and tagged VERB.
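Because być is the only lemma that may be tagged AUX, while it can also appear as VERB in existential clauses, a simple lemma check only needs to look at AUX tokens. The following is a minimal sketch: the function name `find_bad_aux` is hypothetical, and the sample tokens are made up for illustration (the modal móc is included as an example of a verb that should be tagged VERB, not AUX).

```python
def find_bad_aux(tokens):
    """Return (form, lemma) pairs tagged AUX whose lemma is not 'być'.

    `tokens` is an iterable of (form, lemma, upos) triples, e.g. taken from
    the FORM, LEMMA and UPOS columns of a CoNLL-U file.
    """
    return [
        (form, lemma)
        for form, lemma, upos in tokens
        if upos == "AUX" and lemma != "być"
    ]


sample = [
    ("je", "być", "AUX"),     # copula or auxiliary: fine
    ("je", "być", "VERB"),    # existential use: also fine
    ("móže", "móc", "AUX"),   # modal verb wrongly tagged AUX
]
print(find_bad_aux(sample))  # [('móže', 'móc')]
```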
Relations Overview
- The following relation subtypes are used in Upper Sorbian:
- nsubj:pass for nominal subjects of passive verbs
- csubj:pass for clausal subjects of passive verbs
- obl:agent for agents of passive verbs
- obl:arg for prepositional objects
- expl:pv for reflexive clitics of inherently reflexive verbs
- expl:pass for reflexive clitics in reflexive passives
- aux:pass for passive auxiliaries
- nummod:gov for cardinal numbers that are attached as children of the counted noun but govern its case
- det:numgov for pronominal quantifiers that are attached as children of the quantified noun but govern its case
- det:nummod for pronominal quantifiers in cases in which they do not govern the case of the quantified noun
- advmod:emph for adverbs or particles that modify noun phrases and emphasize or negate them
- flat:foreign for non-first words in quoted foreign phrases
- The following main types are not used alone and must be subtyped: expl
- The following relation types are not used in Upper Sorbian at all: clf, dislocated
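The inventory above can be turned into a lookup that flags relation labels not expected in Upper Sorbian data. This is a sketch, not the official UD validator: the function name `check_deprel` is hypothetical, and rejecting every subtype that is missing from the list above is an assumption about how strictly the list should be read.

```python
# Subtypes listed in the overview above.
ALLOWED_SUBTYPES = {
    "nsubj:pass", "csubj:pass", "obl:agent", "obl:arg",
    "expl:pv", "expl:pass", "aux:pass",
    "nummod:gov", "det:numgov", "det:nummod",
    "advmod:emph", "flat:foreign",
}
MUST_BE_SUBTYPED = {"expl"}      # never used without a subtype
UNUSED = {"clf", "dislocated"}   # not used in Upper Sorbian at all


def check_deprel(deprel: str):
    """Return a problem description, or None if the label looks fine."""
    base = deprel.split(":", 1)[0]
    if base in UNUSED:
        return f"{base} is not used in Upper Sorbian"
    if deprel in MUST_BE_SUBTYPED:
        return f"{deprel} must be subtyped"
    if ":" in deprel and deprel not in ALLOWED_SUBTYPES:
        return f"{deprel} is not among the listed subtypes"
    return None


for label in ("nsubj", "expl", "expl:pv", "clf", "obl:tmod"):
    print(label, "->", check_deprel(label))
```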
Treebanks
There is one Upper Sorbian UD treebank: