UD for Latvian 
Tokenization and Word Segmentation
In general, words are delimited by whitespace characters, and punctuation is separated from words. The exceptions are as follows:
- A whitespace separating digits in a large number is not treated as a word separator. For example, 1 000 000 (“1,000,000” by English rules) is one token.
- Abbreviations without spaces are treated as single words and may contain punctuation (utt. “etc.”). In the following cases we treat the abbreviation as a single token even if whitespace is used between a part of the abbreviation and a punctuation mark: u.t.jpr., u.c., u.tml., v.tml., u.t.t., N.B., P.S. and P.P.S.
- Double surnames such as Vīķe-Freiberga and words abbreviated with dashes such as e-pasts “e-mail”, k-dze “Ms.” are tokenized as a single token.
- In Latvian, ordinal numerals are written with a punctuation mark and no whitespace, like abbreviations (1.), so we tokenize an ordinal numeral together with its punctuation mark as one token.
- Multiple dots (… and ..) are considered one token. Multiple ?! are considered one token, whereas ?!… is considered to be two tokens (?! and …).
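Some of the exceptions above can be sketched as a regular-expression tokenizer. This is only an illustration under stated assumptions: the pattern `TOKEN_RE` and the helper `tokenize` are hypothetical names, not the tool used to build the treebanks. The sketch does not cover the abbreviation list, and it cannot tell an ordinal such as 1. from a number that happens to end a sentence.

```python
import re

# Illustrative pattern only (an assumption, not the treebank's tokenizer).
# Alternatives are tried left to right, so the more specific ones come first.
TOKEN_RE = re.compile(
    r"""
      \d{1,3}(?:\ \d{3})+   # digit groups separated by single spaces
    | \d+\.                 # ordinal numeral written with a dot
    | \d+                   # any other number
    | \w+(?:-\w+)*          # words, including hyphenated ones
    | [?!]+                 # a run of question and exclamation marks
    | …+ | \.{2,}           # ellipsis character(s) or several dots
    | [^\w\s]               # any other single punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens found in text (hypothetical helper)."""
    return TOKEN_RE.findall(text)


print(tokenize("Tas maksā 1 000 000 eiro?!…"))
print(tokenize("1. e-pasts .."))
```

A real tokenizer would need sentence context to decide whether a digit sequence followed by a dot is an ordinal or a sentence-final number.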
Paragraph borders from the original text are indicated by the comment line # newpar when the paragraph border coincides with a sentence border, and by the MISC value NewPar=Yes on the token following a mid-sentence paragraph break. The MISC value SpaceAfter=No marks tokens that are not followed by any whitespace.
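As a minimal sketch of how SpaceAfter=No can be consumed, the following rebuilds the surface text from (FORM, MISC) pairs. The function name `detokenize` and the sample tokens are assumptions made for illustration; MISC is taken to be a |-separated list of Key=Value items, or _ when empty.

```python
def detokenize(tokens):
    """Rebuild surface text from (FORM, MISC) pairs (hypothetical helper)."""
    parts = []
    for form, misc in tokens:
        parts.append(form)
        # Insert a space unless the token is explicitly marked SpaceAfter=No.
        if "SpaceAfter=No" not in misc.split("|"):
            parts.append(" ")
    # Drop the space that may follow the final token.
    return "".join(parts).rstrip()


sample = [
    ("Sveiki", "SpaceAfter=No"),
    (",", "_"),
    ("pasaule", "SpaceAfter=No"),
    ("!", "_"),
]
print(detokenize(sample))
```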
Morphology
Tags
Latvian uses all 17 universal POS categories.
Particles
The PART tag is used for the following function words: acīmredzot, ak, ar, arī, arīdzan, da, diemžēl, diez, diezin, droši, gan, i, ij, ik, ir, it, itin, ja, jau, jā, jel, jo, kaut, kā, lai, laikam, mjā, ne, nea, nebūt, nez, nezin, nē, nu, nudien, nujā, nū, nūja, nūjā, pat, patiesi, patiešām, protams, proti, taču, tad, tak, tā, tāpat, tātad, tiešām, tik, tikai, tikpat, tipa, tomēr, turklāt, vai, varbūt, vēl, vien, vienīgi, vis.
Particles can be homonymous with other POS, most notably conjunctions (CCONJ and SCONJ), interjections (INTJ), and adverbs (ADV); the correct POS is assigned based on sentence context.
Pronouns and Determiners
Effectively distinguishing the PRON and DET categories in Latvian is very hard, as words used as DET can also be used as PRON; for this reason, traditional Latvian grammar does not define determiners as a distinct POS. Since version 2.15, the pronoun (PRON) vs. determiner (DET) distinction is made by lemma (as is done in PDT). In earlier versions, the distinction was made based on tree structure.
Currently DET are: abas, abi, cikais, cikas, ciki, cita, cits, daudzi, daža, dažs, ikkatra, ikkatrs, ikkura, ikkurš, ikviena, ikviens, jebkāda, jebkāds, jebkura, jebkurš, jelkāda, jelkāds, jūsējs, kāda, kādā, kādais, kāds, katra, katrs, kura, kurā, kurais, kurs, kurš, manējs, mana, mans, mūsējs, nekāda, nekādā, nekādais, nekāds, neviena, neviens, pate, pati, pats, savējs, sava, savs, šāda, šāds, šī, šis, šitāda, šitāds, šitaids, šitejāda, šitejāds, šitā, šitais, šitas, šitentāda, šitentāds, šitentas, štā, štas, štis, tāda, tāds, tā, tas, taste, tāte, tavējs, tava, tavs, vairāki, vēlviena, vēlviens, vienotra, vienotrs, viņējs, viņā, viņais, visa, viss.
PRON are: daudzkas, es (“I”), jebkas, jelkas, jis, jūs (“you”, plural), kas, mēs (“we”), nekas, nezinkas, sevis, tu (“you”, singular), viņa (“she”), viņš (“he”), viš (“they/he/she”).
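The lemma-based distinction amounts to a table lookup. The sketch below assumes only short excerpts of the two lemma lists above; the names `DET_LEMMAS`, `PRON_LEMMAS` and `pronominal_upos` are hypothetical and chosen for illustration.

```python
# Excerpts of the lemma lists above; the full lists are longer.
DET_LEMMAS = {"abi", "cits", "dažs", "katrs", "kurš", "mans", "pats",
              "savs", "šis", "tas", "tāds", "viss"}
PRON_LEMMAS = {"es", "tu", "viņš", "viņa", "mēs", "jūs", "kas", "nekas",
               "sevis"}


def pronominal_upos(lemma):
    """Return "DET" or "PRON" for a listed lemma, otherwise None."""
    if lemma in DET_LEMMAS:
        return "DET"
    if lemma in PRON_LEMMAS:
        return "PRON"
    return None


print(pronominal_upos("tas"), pronominal_upos("es"), pronominal_upos("galds"))
```

Because the decision depends only on the lemma, the same word form always receives the same tag regardless of its position in the tree.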
The syntactic role det is used for words of the Latvian pronoun category that modify a noun in the sentence and agree with that noun in gender, number and case. The pronominal quantifiers daudzi “many” and vairāki “several”, and the personal possessives manējais, tavējais, mūsējais, jūsējais, viņējais are DET, although in Latvian grammar they are described as adjectives.
Auxiliary Verbs
Latvian has three auxiliary verbs AUX: būt “to be”, tikt “to get”, and tapt “to become” (obsolete). The auxiliary verb is used in several types of constructions:
- Analytic word forms of verbs (būt, tikt).
- The copula in non-verbal predicates (būt).
- The copula in infinitive predicates (būt).
Būt, tikt and tapt may still occur as normal VERB if they are used in purely existential sentences or indicate location. Verbs with modal meaning are not considered auxiliary in Latvian.
Deverbal Nouns, Participles, Converbs
Deverbal nouns with endings -šana, -šanās (skriešana “running”) are tagged as NOUN. Most converbs with endings -ot, -oties, -am, -ām, -amies, -āmies, -dams, -dama, -damies, -damās are tagged as VERB or AUX. Most adjectival participles (redzams, aizgājis, negaidīts, velkošs) are tagged as VERB. Exceptions are lexicalized uses with a separate meaning, like protams “of course”, acīmredzot “obvious”, which are tagged as PART, and iespējams “possible”, which is tagged as ADJ.
Features
Nominal Features
- Nominal words (NOUN, PROPN and PRON) have an inherent Gender feature with one of two values: Masc or Fem.
- The following parts of speech inflect for Gender as they must agree with nouns: ADJ, DET, NUM, VERB, AUX. For verbs (including auxiliaries), only participles inflect for Gender. Finite verbs don’t.
- The two main values of the Number feature are Sing and Plur. The following parts of speech inflect for number: NOUN, PROPN, PRON, ADJ, DET, VERB and AUX (finite, participles and verbal nouns), marginally NUM. Selected nouns are plurale tantum (Ptan) or singulare tantum (Coll).
- Case has 6 possible values: Nom, Gen, Dat, Acc, Loc, Voc. It occurs with the nominal words, i.e., NOUN, PROPN, PRON, ADJ, DET, NUM, VERB and AUX (participles and verbal nouns).
- Definite has 2 possible values: Ind and Def. The following parts of speech inflect for definiteness: ADJ, VERB and AUX (participles).
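The value sets listed above can be collected into a small lookup and used to flag unexpected values in a FEATS string. This is a sketch under assumptions: `ALLOWED`, `parse_feats` and `invalid_nominal_feats` are hypothetical names, FEATS is taken to be a |-separated list of Key=Value pairs (or _ when empty), and multi-valued features are not handled.

```python
# Permitted values per nominal feature, as listed above.
ALLOWED = {
    "Gender": {"Masc", "Fem"},
    "Number": {"Sing", "Plur", "Ptan", "Coll"},
    "Case": {"Nom", "Gen", "Dat", "Acc", "Loc", "Voc"},
    "Definite": {"Ind", "Def"},
}


def parse_feats(feats):
    """Turn a FEATS string into a dict; "_" means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def invalid_nominal_feats(feats):
    """Return the nominal features whose value is not in ALLOWED."""
    parsed = parse_feats(feats)
    return {key: value for key, value in parsed.items()
            if key in ALLOWED and value not in ALLOWED[key]}


print(invalid_nominal_feats("Case=Nom|Gender=Masc|Number=Sing"))
print(invalid_nominal_feats("Case=Ins|Gender=Neut|Number=Plur"))
```

Features outside the four nominal ones are ignored by this check rather than reported.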
Degree and Polarity
- Degree applies to adjectives (ADJ), adverbs (ADV), and some participles (VERB, AUX), and has one of three possible values: Pos, Cmp, Sup.
- Polarity has two values, Pos and Neg, and applies to verbs (VERB, AUX).
  - The words ne, nē “no” occur as independent negation particles (PART) and are marked with Polarity=Neg.
  - Occasionally ne occurs as part of a correlative conjunction and is marked with Polarity=Neg.
  - The word jā occurs as an independent affirmation particle (PART) and is marked with Polarity=Pos.
  - The Polarity feature is not used with pronouns and determiners, although there is a subset of pronouns and determiners which are traditionally considered to be negated. The PronType=Neg feature is used there instead.
Verbal Features
- There are five main (de)verbal form types, distinguished by the UPOS tag and the value of the VerbForm feature:
- Aspect applies only to some participles (VERB, AUX) and is either imperfective Imp or perfective Perf.
- Finite verbs always have one of five values of Mood: Ind, Imp, Cnd, Qot or Nec.
- Tense is used for verbs and participles:
  - Verbs in the indicative mood always have one of three Tense values: Past, Pres or Fut.
  - Infinitive, imperative, conditional, quotative, and necessitative forms do not have the Tense feature.
  - The Tense feature is also used to distinguish declinable participles (tagged VERB or AUX) into two groups: present participles (ziedošs “[it is] flowering” and lasāms “[it is] readable”) and past participles (darījis “[he has] been doing” and pateikts “[it has] been said”).
- There are two values used for the Voice feature, Act and Pass:
  - Passive participles (lasāms “[it is] readable” and pateikts “[it has] been said”) have Voice=Pass.
  - Finite verb forms and active participles (ziedošs “[it is] flowering” and darījis “[he has] been doing”) have Voice=Act.
- Evident applies to finite verb forms (VERB, AUX) and depends on the value of Mood: quotatives have the value Nfh, while indicatives have the value Fh.
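The dependencies between Mood, Tense and Evident described above can be written down as a small consistency check. This is a sketch that follows only the rules stated in this section; the names `EVIDENT_BY_MOOD`, `TENSELESS_MOODS` and `check_finite_feats` are hypothetical, and the input is assumed to be a dict of features for a finite verb form.

```python
# Evident value implied by Mood, as described above.
EVIDENT_BY_MOOD = {"Ind": "Fh", "Qot": "Nfh"}
# Moods whose forms are described above as having no Tense feature.
TENSELESS_MOODS = {"Imp", "Cnd", "Qot", "Nec"}


def check_finite_feats(feats):
    """Return a list of problems for a finite verb's feature dict."""
    problems = []
    mood = feats.get("Mood")
    if mood == "Ind" and feats.get("Tense") not in {"Past", "Pres", "Fut"}:
        problems.append("indicative form without Tense=Past/Pres/Fut")
    if mood in TENSELESS_MOODS and "Tense" in feats:
        problems.append("Tense is not expected with Mood=" + mood)
    expected = EVIDENT_BY_MOOD.get(mood)
    if expected is not None and feats.get("Evident") != expected:
        problems.append("expected Evident=" + expected)
    return problems


print(check_finite_feats({"Mood": "Ind", "Tense": "Pres", "Evident": "Fh"}))
print(check_finite_feats({"Mood": "Qot", "Tense": "Pres", "Evident": "Fh"}))
```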
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns PRON, determiners DET and pronominal adverbs ADV with 8 permissible values: Prs, Rcp, Int, Rel, Dem, Tot, Neg, Ind.
- NumType is used with numerals (also cardinal numbers) NUM, ordinal numbers ADJ, and some adverbs ADV:
  - Numerals and ordinal numbers have one of three possible values: Card, Ord or Frac.
  - Adverbs vienreiz “once”, divreiz “twice”, trīsreiz “thrice”, četrreiz, piecreiz, sešreiz, septiņreiz, astoņreiz, deviņreiz, desmitreiz “ten times”, pusotrreiz “one and a half times” have NumType=Mult.
- The Poss feature marks possessive personal pronouns and determiners (e.g., mans “my”) and possessive adjectives (e.g., tavējais “yours”) with value Yes.
- The Reflex feature marks reflexive pronoun sevis.
- Reflexivity is also marked on reflexive verbs and participles (VERB, e.g., mazgāties, pusapģērbusies).
- Person is marked for pronouns and finite verbs and has three values: 1, 2 and 3.
  - It is a lexical feature of personal pronouns PRON like es “I”, tu “you” (singular), viņš “he”, viņa “she”, mēs “we”, jūs “you” (plural), viņi “they” (plural, masculine), viņas “they” (plural, feminine).
  - It is a lexical feature of personal possessives DET/PRON mans, manējais “my/mine”, tavs, tavējais “your/yours” (singular), mūsējais “our/ours”, jūsējais “your/yours” (plural), viņējais “his/hers/theirs”. Person is also marked on some demonstrative pronouns with value 3.
  - As a cross-reference to subject, person is also marked on finite verbs (VERB, AUX).
- Foreign is annotated Yes for foreign words X.
- Abbr is annotated Yes for abbreviations, which can be NOUN (DJ), PROPN (NATO), ADJ (god. “honored”), VERB (skat. “see”), ADV (v.j.l. “above sea level”), SYM (utt. “etc.”).
ExtPos
ExtPos is currently used for annotating fixed constructions. See ExtPos for Latvian for currently used values and examples.
Unused Features
Features not applicable for Latvian:
Syntax
Core Arguments
- Nominal subject (nsubj) is a noun phrase usually in the nominative case. However:
- If the noun phrase is quantified, it may be in the genitive, which is required by the quantifier.
- With the predicates nebūt, trūkt, pietikt, netrūkt, nepietikt, the noun phrase can be in the genitive.
- A finite subordinate clause may serve as the subject and is labeled csubj.
- The noun phrase may be in the dative if the predicate is in the necessitative mood (man jāskatās spēle “I have to watch the game”) or if the predicate has a modal meaning and a subordinate infinitive (viņam vajadzētu pasteigties “he should hurry”).
- Objects as defined in the Latvian grammar may be either bare noun phrases in accusative, dative, or genitive, or prepositional phrases in accusative, dative, genitive. All objects are labeled as obj or iobj.
  - However, if the predicate is in the necessitative mood, the object may be in the nominative (zēnam jāuzraksta mājasdarbs “the boy has to write the homework”), and it is labeled as obj.
  - Accusative objects are considered obj.
  - Objects in dative and genitive cases and prepositional objects are considered iobj.
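The choice between obj and iobj described above can be sketched as a function of the object's case. The function name `object_relation` and its parameters are assumptions for illustration, and the sketch presupposes that the phrase has already been identified as an object (for example, a dative noun phrase with a necessitative predicate may instead be the subject).

```python
def object_relation(case, prepositional=False, necessitative=False):
    """Pick obj or iobj for a phrase already known to be an object.

    case is the UD Case value of the noun phrase; necessitative tells
    whether the governing predicate is in the necessitative mood.
    """
    if prepositional:
        return "iobj"
    if case == "Acc":
        return "obj"
    if case == "Nom" and necessitative:
        return "obj"
    if case in {"Dat", "Gen"}:
        return "iobj"
    return None  # not covered by the rules above


print(object_relation("Acc"))
print(object_relation("Nom", necessitative=True))
print(object_relation("Gen"))
```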
Non-verbal Clauses
The copula verb būt “be” is used in equational and attributional nonverbal clauses. Purely existential clauses (also indicating location) use būt as well, but it is treated as the head of the clause and tagged VERB.
Relations Overview
The following relation subtypes are used in Latvian:
- nsubj:pass for nominal subjects of passive verbs
- csubj:pass for clausal subjects of passive verbs
- aux:pass for passive auxiliaries
- flat:foreign for non-first words in quoted foreign phrases
- flat:name for exocentric complex names
- advmod:neg for negative particles
- advmod:emph for emphasizing particles
The following relation types are not used for Latvian: clf, dislocated, list, reparandum. However, reparandum should be introduced in the future, once suitable speech texts are annotated.
Annotating Textual Errors
The following MISC values can be used to annotate errors in the source text that interfere with treebank annotation:
- CorrectionType=Spelling for typos (FORM is given as in the text, while LEMMA, UPOS, XPOS and FEATS are given as for the word without the error)
- CorrectionType=Spacing for missing or unnecessary whitespace
- CorrectionType=InsertedPunctAfter for cases when a punctuation mark (usually a comma) is missing after this token
- CorrectionType=RemovedPunctuation for unnecessary punctuation (usually a comma)
- In the case of CorrectionType=Spelling, the additional feature CorrectedForm=… gives the corrected form.
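A consumer of the treebank could read these MISC values roughly as follows. This is a minimal sketch: `parse_misc` and `corrected_form` are hypothetical helper names, MISC is assumed to be a |-separated list of Key=Value items (or _ when empty), and the misspelled sample form is made up for illustration.

```python
def parse_misc(misc):
    """Turn a MISC string into a dict; "_" means no annotations."""
    if misc == "_":
        return {}
    # partition tolerates items that lack "=" (their value becomes "").
    return {key: value
            for key, _, value in (item.partition("=")
                                  for item in misc.split("|"))}


def corrected_form(form, misc):
    """Return CorrectedForm for spelling corrections, else the form itself."""
    fields = parse_misc(misc)
    if fields.get("CorrectionType") == "Spelling":
        return fields.get("CorrectedForm", form)
    return form


# "sklola" is an invented typo used only for this example.
print(corrected_form("sklola", "CorrectionType=Spelling|CorrectedForm=skola"))
print(corrected_form("skola", "_"))
```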
Treebanks
There are 2 Latvian UD treebanks: