UD for Latvian
Tokenization and Word Segmentation
In general, words are delimited by whitespace characters, and punctuation is split off into separate tokens. The exceptions are as follows:
- Whitespace separating groups of digits in a large number is not treated as a word separator. For example, 1 000 000 (“1,000,000” by English rules) is one token.
- Abbreviations without spaces are treated as single words and may contain punctuation (utt. “etc.”). In the following cases the abbreviation is treated as a single token even if whitespace is used between a part of the abbreviation and a punctuation mark: u.t.jpr., u.c., u.tml., v.tml., u.t.t., N.B., P.S. and P.P.S.
- Double surnames such as Vīķe-Freiberga and words abbreviated with dashes, such as e-pasts “e-mail” and k-dze “Ms.”, are tokenized as a single token.
- In Latvian, ordinal numerals are written with a punctuation mark and no whitespace, like abbreviations (1.), so the ordinal numeral and its punctuation mark are tokenized together as one token.
- Multiple dots (… and ..) are considered one token. Multiple ?! are considered one token; ?!… is considered to be two tokens (?! and …).
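The rules above can be approximated with a single regular expression. The following is only a minimal sketch, not the tokenizer used for the treebanks: the names ABBREVIATIONS, TOKEN_RE and tokenize are hypothetical, the abbreviation list contains only the examples mentioned above, abbreviations written with internal whitespace are not handled, and every number followed by a dot is assumed to be an ordinal (a real tokenizer would have to disambiguate sentence-final numbers).

```python
import re

# Hypothetical, deliberately short abbreviation list taken from the rules above;
# a real tokenizer would need a full lexicon.
ABBREVIATIONS = ["u.t.jpr.", "u.tml.", "v.tml.", "u.t.t.", "P.P.S.",
                 "u.c.", "N.B.", "P.S.", "utt."]

TOKEN_RE = re.compile("|".join([
    "|".join(re.escape(a) for a in ABBREVIATIONS),  # listed abbreviations stay whole
    r"\d{1,3}(?: \d{3})+(?!\d)",  # 1 000 000: space-separated digit groups
    r"\d+\.",                     # number with a dot, assumed to be an ordinal: 1.
    r"\d+",                       # any other digit sequence
    r"[^\W\d_]+(?:-[^\W\d_]+)*",  # words, incl. Vīķe-Freiberga, e-pasts, k-dze
    r"…|\.{2,}",                  # … and .. are one token
    r"[?!]{2,}",                  # runs such as ?! are one token
    r"\S",                        # any remaining punctuation mark on its own
]))


def tokenize(text):
    """Split text into tokens following the simplified rules above."""
    return TOKEN_RE.findall(text)
```

For example, tokenize("Ko?!… 1. vieta, e-pasts") keeps ?! and … apart, keeps 1. together, and splits the comma off from vieta.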
Paragraph boundaries from the original text are indicated by the comment line # newpar when the paragraph boundary coincides with a sentence boundary, and by the MISC value NewPar=Yes on the token that follows a mid-sentence paragraph break. The MISC value SpaceAfter=No marks tokens that are not followed by whitespace.
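To illustrate how these MISC values can be consumed, here is a small sketch that rebuilds the surface text of one sentence from (FORM, MISC) pairs. The helper names parse_misc and detokenize are hypothetical, and the sketch assumes that a mid-sentence paragraph break should be rendered as a newline; the sentence-level # newpar comment is not handled here.

```python
def parse_misc(misc):
    """Parse a CoNLL-U MISC column ('_' or 'Key=Value|Key=Value') into a dict."""
    attrs = {}
    if misc != "_":
        for item in misc.split("|"):
            key, _, value = item.partition("=")
            attrs[key] = value
    return attrs


def detokenize(tokens):
    """Rebuild text from (form, misc) pairs using SpaceAfter=No and NewPar=Yes."""
    parts = []
    for form, misc in tokens:
        attrs = parse_misc(misc)
        if attrs.get("NewPar") == "Yes" and parts:
            # Assumption: render a mid-sentence paragraph break as a newline.
            if parts[-1] == " ":
                parts.pop()
            parts.append("\n")
        parts.append(form)
        if attrs.get("SpaceAfter") != "No":
            parts.append(" ")
    return "".join(parts).rstrip(" ")
```

A call such as detokenize([("Jā", "SpaceAfter=No"), (".", "_")]) glues the full stop to the preceding word.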
Morphology
Tags
Latvian uses all 17 universal POS categories.
Particles
The PART tag is used for the following function words: acīmredzot, ak, ar, arī, arīdzan, da, diemžēl, diez, diezin, droši, gan, i, ij, ik, ir, it, itin, ja, jau, jā, jel, jo, kaut, kā, lai, laikam, mjā, ne, nea, nebūt, nez, nezin, nē, nu, nudien, nujā, nū, nūja, nūjā, pat, patiesi, patiešām, protams, proti, taču, tad, tak, tā, tāpat, tātad, tiešām, tik, tikai, tikpat, tipa, tomēr, turklāt, vai, varbūt, vēl, vien, vienīgi, vis.
Particles can be homonymous with other parts of speech, most notably conjunctions (CCONJ and SCONJ), interjections (INTJ), and adjectives (ADJ). The correct POS is assigned based on the sentence context.
Pronouns and Determiners
Effectively distinguishing the PRON and DET categories in Latvian is very hard, as words used as DET can also be used as PRON; accordingly, traditional Latvian grammar does not define determiners as a distinct POS. Since version 2.15, the pronoun (PRON) vs. determiner (DET) distinction is made by lemma (similarly to what is done in PDT). In earlier versions the distinction was made based on the tree structure.
Currently, the DET lemmas are: abas, abi, cikais, cikas, ciki, cita, cits, daudzi, daža, dažs, ikkatra, ikkatrs, ikkura, ikkurš, ikviena, ikviens, jebkāda, jebkāds, jebkura, jebkurš, jelkāda, jelkāds, jūsējs, kāda, kādā, kādais, kāds, katra, katrs, kura, kurā, kurais, kurs, kurš, manējs, mana, mans, mūsējs, nekāda, nekādā, nekādais, nekāds, neviena, neviens, pate, pati, pats, savējs, sava, savs, šāda, šāds, šī, šis, šitāda, šitāds, šitaids, šitejāda, šitejāds, šitā, šitais, šitas, šitentāda, šitentāds, šitentas, štā, štas, štis, tāda, tāds, tā, tas, taste, tāte, tavējs, tava, tavs, vairāki, vēlviena, vēlviens, vienotra, vienotrs, viņējs, viņā, viņais, visa, viss.
The PRON lemmas are: daudzkas, es, jebkas, jelkas, jis, jūs, kas, mēs, nekas, nezinkas, sevis, tu, viņa, viņš, viš.
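Because the distinction is made by lemma, it can be expressed as a plain lookup. The sketch below is illustrative only: the function name pronominal_upos is hypothetical, PRON_LEMMAS copies the full PRON list above, and DET_LEMMAS_SAMPLE is a small, non-exhaustive sample of the DET list.

```python
# Full PRON lemma list as given above.
PRON_LEMMAS = {
    "daudzkas", "es", "jebkas", "jelkas", "jis", "jūs", "kas", "mēs",
    "nekas", "nezinkas", "sevis", "tu", "viņa", "viņš", "viš",
}

# Only a sample of the DET lemma list above, kept short for illustration.
DET_LEMMAS_SAMPLE = {
    "abi", "cits", "dažs", "katrs", "kurš", "mans", "neviens", "pats",
    "savs", "šāds", "šis", "tāds", "tas", "tavs", "vairāki", "viss",
}


def pronominal_upos(lemma):
    """Return 'PRON' or 'DET' for a known lemma, or None if it is in neither set."""
    lemma = lemma.lower()
    if lemma in PRON_LEMMAS:
        return "PRON"
    if lemma in DET_LEMMAS_SAMPLE:
        return "DET"
    return None
```

Note that the lookup is accent-sensitive: viņa is in the PRON list, while viņā belongs to the DET list.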
The syntactic relation det is used for those Latvian pronominal words that modify a noun in the sentence and agree with that noun in gender, number and case. The pronominal quantifiers daudzi “many” and vairāki “several” and the personal possessives manējais, tavējais, mūsējais, jūsējais, viņējais are tagged DET, although Latvian grammar describes them as adjectives.
Auxiliary Verbs
Latvian has three auxiliary verbs (AUX): būt “to be”, tikt “to get”, and tapt “to become” (obsolete). Auxiliary verbs are used in several types of constructions:
- Analytic word forms of verbs (būt, tikt).
- The copula in non-verbal predicates (būt).
- The copula in infinitive predicates (būt).
Būt, tikt and tapt may still occur as normal VERB if they are used in purely existential sentences or indicate location. Verbs with modal meaning are not considered auxiliary in Latvian.
Deverbal Nouns, Participles, Converbs
Deverbal nouns with the endings -šana, -šanās (skriešana “running”) are tagged as NOUN. Most converbs with the endings -ot, -oties, -am, -ām, -amies, -āmies, -dams, -dama, -damies, -damās are tagged as VERB or AUX. Most adjectival participles (redzams, aizgājis, negaidīts, velkošs) are tagged as VERB. Exceptions are lexicalized uses with a separate meaning, such as protams “of course” and acīmredzot “obviously”, which are tagged as PART, and iespējams “possible”, which is tagged as ADJ.
Features
Nominal Features
- Nominal words (NOUN, PROPN and PRON) have an inherent Gender feature with one of two values: Masc or Fem.
- The following parts of speech inflect for Gender as they must agree with nouns: ADJ, DET, NUM, VERB, AUX. For verbs (including auxiliaries), only participles inflect for Gender. Finite verbs don’t.
- The two main values of the Number feature are Sing and Plur. The following parts of speech inflect for number: NOUN, PROPN, PRON, ADJ, DET, VERB and AUX (finite, participles and verbal nouns), marginally NUM. Selected nouns are plurale tantum (Ptan) or singulare tantum (Coll).
- Case has 6 possible values: Nom, Gen, Dat, Acc, Loc, Voc. It occurs with the nominal words, i.e., NOUN, PROPN, PRON, ADJ, DET, NUM, VERB and AUX (participles and verbal nouns).
- Definite has 2 possible values: Ind and Def. The following parts of speech inflect for definiteness: ADJ, VERB and AUX (participles).
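The applicability rules in this list can be encoded as two lookup tables. The following is a rough sketch under simplifying assumptions: the names ALLOWED_UPOS, ALLOWED_VALUES and check_nominal_feats are hypothetical, and the check ignores the finer conditions stated above (for instance, that among verbs only participles inflect for Gender).

```python
NOMINAL_LIKE = {"NOUN", "PROPN", "PRON", "ADJ", "DET", "NUM", "VERB", "AUX"}

# Which UPOS tags may carry each feature, per the list above.
ALLOWED_UPOS = {
    "Gender": NOMINAL_LIKE,
    "Number": NOMINAL_LIKE,
    "Case": NOMINAL_LIKE,
    "Definite": {"ADJ", "VERB", "AUX"},
}

# Which values each feature may take, per the list above.
ALLOWED_VALUES = {
    "Gender": {"Masc", "Fem"},
    "Number": {"Sing", "Plur", "Ptan", "Coll"},
    "Case": {"Nom", "Gen", "Dat", "Acc", "Loc", "Voc"},
    "Definite": {"Ind", "Def"},
}


def check_nominal_feats(upos, feats):
    """Return a list of problems for the nominal features in feats (a dict)."""
    problems = []
    for name, value in feats.items():
        if name not in ALLOWED_VALUES:
            continue  # not a nominal feature covered by this sketch
        if upos not in ALLOWED_UPOS[name]:
            problems.append(f"{name} is not expected on {upos}")
        elif value not in ALLOWED_VALUES[name]:
            problems.append(f"{name}={value} is not a listed value")
    return problems
```

An empty list means that none of the coarse rules is violated; it does not guarantee that the annotation is correct.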
Degree and Polarity
- Degree applies to adjectives (ADJ), adverbs (ADV), and some participles (VERB, AUX), and has one of three possible values: Pos, Cmp, Sup.
- Polarity has two values, Pos and Neg, and applies to verbs (VERB, AUX).
  - The words ne, nē “no” occur as independent negation particles (PART) and are marked with Polarity=Neg.
  - Occasionally ne occurs as a part of a correlative conjunction and is marked with Polarity=Neg.
  - The word jā occurs as an independent affirmation particle (PART) and is marked with Polarity=Pos.
  - The Polarity feature is not used with pronouns and determiners, although a subset of pronouns and determiners is traditionally considered negated. The PronType=Neg feature is used there instead.
Verbal Features
- There are five main (de)verbal form types, distinguished by the UPOS tag and the value of the VerbForm feature:
- Aspect applies only to some participles (VERB, AUX) and is either imperfective (Imp) or perfective (Perf).
- Finite verbs always have one of five values of Mood: Ind, Imp, Cnd, Qot or Nec.
- Tense is used for verbs and participles:
  - Verbs in the indicative mood always have one of three Tense values: Past, Pres or Fut.
  - Infinitive, imperative, conditional, quotative, and necessitative forms do not have the Tense feature.
  - The Tense feature is also used to distinguish declinable participles (tagged VERB or AUX) into two groups: present participles (ziedošs “[it is] flowering” and lasāms “[it is] readable”) and past participles (darījis “[he has] been doing” and pateikts “[it has] been said”).
- There are two values used for the Voice feature, Act and Pass:
  - Passive participles (lasāms “[it is] readable” and pateikts “[it has] been said”) have Voice=Pass.
  - Finite verb forms and active participles (ziedošs “[it is] flowering” and darījis “[he has] been doing”) have Voice=Act.
- Evident applies to finite verb forms (VERB, AUX) and depends on the value of Mood: quotatives have the value Nfh, while indicatives have the value Fh.
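The interplay of Mood, Tense and Evident described in this list can be sketched as a consistency check for finite verb forms. This is a simplified illustration rather than an official validator: the function name check_finite_verb is hypothetical, and Evident is only checked when it is present, since the text does not say what value (if any) the other moods take.

```python
TENSE_VALUES = {"Past", "Pres", "Fut"}
UNTENSED_MOODS = {"Imp", "Cnd", "Qot", "Nec"}
EVIDENT_BY_MOOD = {"Ind": "Fh", "Qot": "Nfh"}


def check_finite_verb(feats):
    """Return a list of problems for a finite verb's features (a dict)."""
    problems = []
    mood = feats.get("Mood")
    tense = feats.get("Tense")
    if mood != "Ind" and mood not in UNTENSED_MOODS:
        return [f"unexpected Mood value: {mood}"]
    if mood == "Ind" and tense not in TENSE_VALUES:
        problems.append("indicative forms need Tense=Past, Pres or Fut")
    if mood in UNTENSED_MOODS and tense is not None:
        problems.append(f"Mood={mood} forms should not have Tense")
    expected = EVIDENT_BY_MOOD.get(mood)
    # Assumption: only check Evident when the annotation actually provides it.
    if "Evident" in feats and expected is not None and feats["Evident"] != expected:
        problems.append(f"Mood={mood} expects Evident={expected}")
    return problems
```

For example, check_finite_verb({"Mood": "Ind", "Tense": "Pres", "Evident": "Fh"}) reports no problems, while a quotative form carrying Tense is flagged.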
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns (PRON), determiners (DET) and pronominal adverbs (ADV), with 8 permissible values: Prs, Rcp, Int, Rel, Dem, Tot, Neg, Ind.
- NumType is used with numerals (also cardinal numbers) NUM, ordinal numbers ADJ, and some adverbs ADV:
  - Numerals and ordinal numbers have one of three possible values: Card, Ord or Frac.
  - The adverbs vienreiz “once”, divreiz “twice”, trīsreiz “thrice”, četrreiz, piecreiz, sešreiz, septiņreiz, astoņreiz, deviņreiz, desmitreiz “ten times”, pusotrreiz “one and a half times” have NumType=Mult.
- The Poss feature marks possessive personal pronouns and determiners (e.g., mans “my”) and possessive adjectives (e.g., tavējais “yours”) with the value Yes.
- The Reflex feature marks the reflexive pronoun sevis.
  - Reflexivity is also marked on reflexive verbs and participles (VERB, e.g., mazgāties, pusapģērbusies).
- Person is marked for pronouns and finite verbs and has three values: 1, 2 and 3.
  - It is a lexical feature of personal pronouns (PRON) like es “I”, tu “you” (singular), viņš “he”, viņa “she”, mēs “we”, jūs “you” (plural), viņi “they” (plural, masculine), viņas “they” (plural, feminine).
  - It is a lexical feature of personal possessives (DET/PRON) mans, manējais “my/mine”, tavs, tavējais “your/yours” (singular), mūsējais “our/ours”, jūsējais “your/yours” (plural), viņējais “his/hers/theirs”.
  - Person is also marked on some demonstrative pronouns with the value 3.
  - As a cross-reference to the subject, person is also marked on finite verbs (VERB, AUX).
- Foreign is annotated Yes for foreign words (X).
- Abbr is annotated Yes for abbreviations, which can be nouns NOUN (DJ), PROPN (NATO), ADJ (god. “honored”), VERB (skat. “see”), ADV (v.j.l. “above sea level”), SYM (utt. “etc.”).
Unused Features
Features not applicable for Latvian:
Syntax
Core Arguments
- Nominal subject (nsubj) is a noun phrase usually in the nominative case. However:
- If the noun phrase is quantified, it may be in the genitive, which is required by the quantifier.
- With the predicates nebūt, trūkt, pietikt, netrūkt, nepietikt, the noun phrase can be in the genitive.
- A finite subordinate clause may serve as the subject and is labeled csubj.
- The noun phrase may be in the dative, if the predicate is in the necessitative mood (man jāskatās spēle “I have to watch the game”) or if the predicate is with modal meaning and has subordinated infinitive (viņam vajadzētu pasteigties “he should hurry”).
- Objects as defined in Latvian grammar may be either bare noun phrases in the accusative, dative, or genitive, or prepositional phrases in the accusative, dative, or genitive. All objects are labeled obj or iobj.
  - However, if the predicate is in the necessitative mood, the object may be in the nominative (zēnam jāuzraksta mājasdarbs “the boy has to write a homework”), and it is labeled obj.
  - Accusative objects are considered obj.
  - Objects in the dative and genitive cases and prepositional objects are considered iobj.
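The obj/iobj rules for objects can be summarized as a small decision function. This is only a sketch of the rules as stated here: the function name object_relation and its parameters are hypothetical, and it assumes the phrase has already been identified as an object (it does not decide objecthood itself).

```python
def object_relation(case, prepositional=False, necessitative=False):
    """Pick obj or iobj for an object phrase, following the rules above.

    case: UD Case value of the noun phrase ('Acc', 'Dat', 'Gen' or 'Nom').
    prepositional: True if the object is a prepositional phrase.
    necessitative: True if the predicate is in the necessitative mood.
    """
    if prepositional:
        return "iobj"  # prepositional objects are iobj regardless of case
    if case == "Acc":
        return "obj"
    if case == "Nom" and necessitative:
        return "obj"   # nominative object of a necessitative predicate
    if case in ("Dat", "Gen"):
        return "iobj"
    raise ValueError(f"not an object case under these rules: {case}")
```

In zēnam jāuzraksta mājasdarbs, the nominative mājasdarbs would thus be labeled obj via object_relation("Nom", necessitative=True).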
Non-verbal Clauses
The copula verb būt “be” is used in equational and attributional nonverbal clauses. Purely existential clauses (also indicating location) use būt as well, but it is treated as the head of the clause and tagged VERB.
Relations Overview
The following relation subtypes are used in Latvian:
- nsubj:pass for nominal subjects of passive verbs
- csubj:pass for clausal subjects of passive verbs
- aux:pass for passive auxiliaries
- flat:foreign for non-first words in quoted foreign phrases
- flat:name for exocentric complex name
The following relation types are not used for Latvian: clf, dislocated, list, reparandum. However, reparandum should be introduced in the future, once appropriate speech texts are annotated.
Annotating Textual Errors
The following MISC values can be used to annotate errors in the source text that interfere with treebank annotation:
- CorrectionType=Spelling for typos (FORM is given as in the text, while LEMMA, UPOS, XPOS and FEATS are given as for the word without the error)
- CorrectionType=Spacing for missing or unnecessary whitespace
- CorrectionType=InsertedPunctAfter for cases where a punctuation mark (usually a comma) is missing after this token
- CorrectionType=RemovedPunctuation for unnecessary punctuation (usually a comma)
- In the case of CorrectionType=Spelling, the additional feature CorrectedForm=… gives the corrected form.
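As an illustration of how a consumer of the treebank might use these values, the sketch below returns the corrected form of a token when a spelling correction is annotated and the original FORM otherwise. The function name corrected_form is hypothetical, and the misspelling skloa (for skola) in the usage note is an invented example, not taken from the treebank.

```python
def corrected_form(form, misc):
    """Return CorrectedForm for spelling corrections, else the original form.

    misc is the raw CoNLL-U MISC column ('_' or 'Key=Value|Key=Value').
    """
    attrs = {}
    if misc != "_":
        for item in misc.split("|"):
            key, _, value = item.partition("=")
            attrs[key] = value
    if attrs.get("CorrectionType") == "Spelling":
        # Fall back to the surface form if CorrectedForm is absent.
        return attrs.get("CorrectedForm", form)
    return form
```

With the invented token above, corrected_form("skloa", "CorrectionType=Spelling|CorrectedForm=skola") yields skola, while tokens carrying other CorrectionType values keep their FORM.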
Treebanks
There are 2 Latvian UD treebanks: