UD for Latgalian 
Note that the UD guidelines for annotating Latgalian are at a very early stage, as little text has been annotated so far. They are currently based closely on the Latvian guidelines, and anything not described here is assumed to be annotated as in Latvian. NB! This is subject to change as more texts are annotated.
Tokenization and Word Segmentation
In general, words are delimited by whitespace characters, and punctuation is split off as separate tokens. The exceptions are as follows:
- Whitespace separating groups of digits in a large number is not treated as a word separator. For example, 1 000 000 (“1,000,000” by English rules) is one token.
- Abbreviations written without spaces are treated as single words and may contain punctuation (v.tml. “etc.”). The following abbreviations are treated as a single token even if whitespace separates part of the abbreviation from a punctuation mark: v.tml., N.B., P.S. and P.P.S.
- Double surnames such as Vīke-Freiberga and words abbreviated with hyphens, such as e-posts “e-mail” and k-dze “Ms.”, are each treated as a single token.
- Ordinal numerals in Latgalian are written with a punctuation mark and no intervening whitespace, like abbreviations (1.), so the numeral and the punctuation mark are tokenized together as one token.
- Multiple dots (… and ..) are treated as one token. A sequence of ?! is treated as one token, whereas ?!… is two tokens (?! and …).
Paragraph borders from the original text are indicated by the comment line # newpar when the paragraph border coincides with a sentence border, and by the MISC value NewPar=Yes on the token that follows a mid-sentence paragraph break. The MISC value SpaceAfter=No marks tokens that are not followed by whitespace.
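The tokenization rules above can be approximated with an ordered regular expression. The following is only a minimal sketch, not the tokenizer used for the treebank: the names TOKEN_RE and tokenize are hypothetical, the abbreviation list covers only the four items named above, and the ordinal rule cannot tell an ordinal such as 1. apart from a number followed by a sentence-final period. A token is marked SpaceAfter=No here only when another non-space character follows it directly.

```python
import re

# Alternatives are tried in order, so the more specific patterns come first.
TOKEN_RE = re.compile(
    r"""
      v\.\s?tml\. | N\.\s?B\. | P\.\s?P\.\s?S\. | P\.\s?S\.  # listed abbreviations
    | \d{1,3}(?:[ ]\d{3})+(?!\d)   # digit groups separated by spaces
    | \d+\.(?!\d)                  # ordinal numeral written with a dot
    | \w+(?:-\w+)*                 # plain and hyphenated words
    | \.{2,} | …                   # multiple dots form one token
    | [?!]{2,}                     # runs of question and exclamation marks
    | \S                           # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return (form, misc) pairs; misc is 'SpaceAfter=No' or '_'."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        end = match.end()
        # Glued to the next token if a non-space character follows directly.
        glued = end < len(text) and not text[end].isspace()
        tokens.append((match.group(), "SpaceAfter=No" if glued else "_"))
    return tokens


# A synthetic string built from the examples above, not a real sentence.
for form, misc in tokenize("Vīke-Freiberga, e-posts v.tml. 1 000 000?!…"):
    print(form, misc, sep="\t")
```

Ordering the alternatives from most to least specific keeps the sketch short, at the cost of hard-coding the abbreviation list.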
Morphology
Tags
Latgalian uses all 17 universal POS categories.
Particles
The PART tag is used for the following function words: ar, ari, až, ba, da, dīvamžāļ, dīz, gon, ik, it, kab, kazyn, konče, koč, kod, kuo, lai, laikam, mošeit, mož, na, nabejs, naviņ, naz, nazyn, nui, nā, pat, prūtams, rikti, ta, tak, tik, tikai, to, tok, tože, varbyut, viņ, vys, vīneigi. This list may be expanded in the future.
Pronouns and Determiners
Reliably distinguishing the PRON and DET categories in Latgalian (as in Latvian) is very hard, and no clear guidelines have been developed yet. Following the example of Latvian, the distinction is made by lemma.
Currently DET are: itei, itys, kaida, kaids, kura, kurs, muna, muns, sova, sovs, tei, tis, toveja, tovejs.
PRON are: es, jei, jis, jī, kas, tu.
These lists will be expanded in the future.
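Because the distinction is made by lemma, it can be expressed as a simple lookup. The sketch below is illustrative only: the function name pronominal_upos is hypothetical, and the two sets merely copy the current lists above, so they would need updating whenever those lists are expanded.

```python
# Lemma lists copied from the current guidelines above.
DET_LEMMAS = frozenset({
    "itei", "itys", "kaida", "kaids", "kura", "kurs", "muna", "muns",
    "sova", "sovs", "tei", "tis", "toveja", "tovejs",
})
PRON_LEMMAS = frozenset({"es", "jei", "jis", "jī", "kas", "tu"})


def pronominal_upos(lemma):
    """Return 'DET' or 'PRON' for a listed lemma, otherwise None."""
    lemma = lemma.lower()
    if lemma in DET_LEMMAS:
        return "DET"
    if lemma in PRON_LEMMAS:
        return "PRON"
    return None  # not covered by the current lists


print(pronominal_upos("muns"), pronominal_upos("kas"))
```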
Auxiliary Verbs
Latgalian has one auxiliary verb AUX: byut “to be”. The auxiliary verb is used in several types of constructions:
- Analytic word forms of verbs.
- The copula in non-verbal predicates.
- The copula in infinitive predicates.
Byut may still occur as a normal VERB if it is used in purely existential sentences or indicates location.
Verbs with modal meaning are not considered auxiliary in Latgalian.
Deverbal Nouns, Participles, Converbs
Latgalian features a rich set of deverbal derivations, and not all of them have been analyzed for alignment with the UD guidelines yet. However, deverbal nouns with endings -šona, -šonuos (skrīšona “running”) are tagged as NOUN. Most converbs with endings -ūt, -ūts, -ūte, -ūtīs, -om, -omīs, -dams, -dama, -damīs, -damuos are tagged as VERB or AUX. Most adjectival participles (radzams, aizguojs, nagaideits, valkūšs) are tagged as VERB. Exceptions are lexicalized uses with a separate meaning, like prūtams “of course”, acimradzūt “obvious”, which are tagged as PART, and īspiejams “possible”, which is tagged as ADJ.
Features
Nominal Features
- Nominal words (NOUN, PROPN and PRON) have an inherent Gender feature with one of two values: Masc or Fem.
- The following parts of speech inflect for Gender as they must agree with nouns: ADJ, DET, NUM, VERB, AUX. For verbs (including auxiliaries), only participles inflect for Gender. Finite verbs don’t.
- The two main values of the Number feature are Sing and Plur. The following parts of speech inflect for number: NOUN, PROPN, PRON, ADJ, DET, VERB and AUX (finite, participles and verbal nouns), marginally NUM. Selected nouns are plurale tantum Ptan or singulare tantum Coll.
- Case has 6 possible values: Nom, Gen, Dat, Acc, Loc, Voc. It occurs with the nominal words, i.e., NOUN, PROPN, PRON, ADJ, DET, NUM, VERB and AUX (participles and verbal nouns).
- Definite has 2 possible values: Ind and Def. The following parts of speech inflect for definiteness: ADJ, NUM, VERB and AUX (participles).
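The value sets and part-of-speech restrictions listed above lend themselves to a table-driven check of a CoNLL-U FEATS string. This is a rough sketch under stated assumptions: the names NOMINAL_FEATURES, parse_feats and check_nominal_feats are hypothetical, the table only restates the lists above, and multi-valued features (comma-separated values) are not handled.

```python
ALL_NOMINAL = {"NOUN", "PROPN", "PRON", "ADJ", "DET", "NUM", "VERB", "AUX"}

# feature name -> (permitted values, UPOS tags that may carry the feature)
NOMINAL_FEATURES = {
    "Gender": ({"Masc", "Fem"}, ALL_NOMINAL),
    "Number": ({"Sing", "Plur", "Ptan", "Coll"}, ALL_NOMINAL),
    "Case": ({"Nom", "Gen", "Dat", "Acc", "Loc", "Voc"}, ALL_NOMINAL),
    "Definite": ({"Ind", "Def"}, {"ADJ", "NUM", "VERB", "AUX"}),
}


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_nominal_feats(upos, feats):
    """Return a list of problems found among the nominal features."""
    problems = []
    for name, value in parse_feats(feats).items():
        if name not in NOMINAL_FEATURES:
            continue  # other features are outside the scope of this sketch
        values, tags = NOMINAL_FEATURES[name]
        if upos not in tags:
            problems.append(f"{name} is not expected on {upos}")
        elif value not in values:
            problems.append(f"{name}={value} is not a listed value")
    return problems


print(check_nominal_feats("NOUN", "Case=Nom|Gender=Masc|Number=Sing"))
print(check_nominal_feats("NOUN", "Definite=Def"))
```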
Verbal Features
- There are five main (de)verbal form types, distinguished by the UPOS tag and the value of the VerbForm feature:
- Aspect applies only to some participles (VERB, AUX) and is either imperfective Imp or perfective Perf.
- Finite verbs always have one of five values of Mood: Ind, Imp, Cnd, Qot or Nec.
- Tense is used for verbs and participles:
  - Verbs in the indicative mood always have one of three Tense values: Past, Pres or Fut.
  - Infinitive, imperative, conditional, quotative, and necessitative forms do not have the Tense feature.
  - The Tense feature is also used to distinguish declinable participles (tagged VERB or AUX) into two groups: present participles (zīdūšs “[it is] flowering” and skaitams “[it is] readable”) and past participles (darejs “[he has] been doing” and pasaceits “[it has] been said”).
- There are two values used for the Voice feature, Act and Pass:
  - Passive participles (skaitams “[it is] readable” and pasaceits “[it has] been said”) have Voice=Pass.
  - Finite verb forms and active participles (zīdūšs “[it is] flowering” and darejs “[he has] been doing”) have Voice=Act.
- Evident applies to finite verb forms (VERB, AUX) and depends on the value of Mood: quotatives have the value Nfh, while indicatives have the value Fh.
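The dependencies between Mood, Tense, Evident and Voice described above for finite verbs can be sketched as a small consistency check. The function name check_finite_verb is hypothetical, the input is assumed to be a dict of features for a form already known to be finite, and Evident is checked only for the two moods for which a value is stated above.

```python
MOODS = {"Ind", "Imp", "Cnd", "Qot", "Nec"}
INDICATIVE_TENSES = {"Past", "Pres", "Fut"}
# Only these two moods have an Evident value stated in the guidelines.
EVIDENT_BY_MOOD = {"Ind": "Fh", "Qot": "Nfh"}


def check_finite_verb(feats):
    """Return a list of problems for the features of one finite verb form."""
    mood = feats.get("Mood")
    if mood not in MOODS:
        return [f"Mood={mood} is not one of the five listed values"]
    problems = []
    if mood == "Ind":
        if feats.get("Tense") not in INDICATIVE_TENSES:
            problems.append("Mood=Ind requires Tense=Past, Pres or Fut")
    elif "Tense" in feats:
        problems.append("Tense is only expected with Mood=Ind")
    expected = EVIDENT_BY_MOOD.get(mood)
    if expected is not None and feats.get("Evident") != expected:
        problems.append(f"Evident should be {expected} for Mood={mood}")
    if feats.get("Voice") != "Act":
        problems.append("finite verb forms are expected to have Voice=Act")
    return problems


print(check_finite_verb(
    {"Mood": "Qot", "Tense": "Pres", "Evident": "Fh", "Voice": "Act"}
))
```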
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns PRON, determiners DET and pronominal adverbs ADV with 8 permissible values: Prs, Rcp, Int, Rel, Dem, Tot, Neg, Ind.
- NumType is used with numerals (also cardinal numbers) NUM, ordinal numbers ADJ, and some adverbs ADV:
  - Numerals and ordinal numbers have one of three possible values: Card, Ord or Frac.
  - Adverbs vīnreiz “once”, divreiz “twice”, treisreiz “thrice”, četrreiz, pīcreiz, sešreiz, septeņreiz, ostoņreiz, deveņreiz, desmitreiz “ten times” have NumType=Mult.
- The Poss feature marks possessive personal pronouns and determiners (e.g., muns “my”) and possessive adjectives (e.g., tovejs “yours”) with value Yes.
- The Reflex feature marks reflexive pronouns seve, sevi.
- Reflexivity is also marked on reflexive verbs and participles (VERB, e.g., apsamozguot, mozguotīs, apsavāruse, vārusīs).
- Person is marked for pronouns and finite verbs and has three values: 1, 2 and 3.
  - It is a lexical feature of personal pronouns PRON like es “I”, tu “you” (singular), jis “he”, jei “she”, mes “we”, jius “you” (plural), jī “they” (plural, masculine), juos “they” (plural, feminine).
  - It is a lexical feature of personal possessives DET/PRON muns, munejs, munejais “my/mine”, tovs, tovejs, tovejais “your/yours” (singular), myusejs, myusejais “our/ours”, jiusejs, jiusejais “your/yours” (plural). Person is also marked on some demonstrative pronouns with value 3.
  - As a cross-reference to the subject, person is also marked on finite verbs (VERB, AUX).
- Foreign is annotated Yes for foreign words X.
- Abbr is annotated Yes for abbreviations, which can be NOUN (DJ), PROPN (NATO), ADJ (gūd. “honored”), VERB (sal. “compare”), ADV (p.Kr. “anno Domini”), SYM (v.tml. “etc.”).
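Since Person is a lexical feature of the personal pronouns, it can be read off a lookup table. The sketch below is only an illustration: the names PERSONAL_PRONOUNS and personal_pronoun_feats are hypothetical, the Number and Gender values are inferred from the English glosses above, and the assumption that these forms receive PronType=Prs follows general UD practice rather than anything stated explicitly here.

```python
# Nominative forms and the values implied by the glosses above.
PERSONAL_PRONOUNS = {
    "es": {"Person": "1", "Number": "Sing"},
    "tu": {"Person": "2", "Number": "Sing"},
    "jis": {"Person": "3", "Number": "Sing", "Gender": "Masc"},
    "jei": {"Person": "3", "Number": "Sing", "Gender": "Fem"},
    "mes": {"Person": "1", "Number": "Plur"},
    "jius": {"Person": "2", "Number": "Plur"},
    "jī": {"Person": "3", "Number": "Plur", "Gender": "Masc"},
    "juos": {"Person": "3", "Number": "Plur", "Gender": "Fem"},
}


def personal_pronoun_feats(form):
    """Return a FEATS string for a listed personal pronoun, or '_' otherwise."""
    entry = PERSONAL_PRONOUNS.get(form.lower())
    if entry is None:
        return "_"
    feats = dict(entry, PronType="Prs")  # PronType=Prs is an assumption
    # CoNLL-U expects features sorted alphabetically by name.
    return "|".join(f"{k}={v}" for k, v in sorted(feats.items(), key=lambda kv: kv[0].lower()))


print(personal_pronoun_feats("jis"))
```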
ExtPos
ExtPos is currently used for annotating fixed constructions. See ExtPos for Latgalian for currently used values and examples.
Unused Features
Features not applicable for Latgalian:
Syntax
Core Arguments
- Nominal subject (nsubj) is a noun phrase usually in the nominative case. However:
- If the noun phrase is quantified, it may be in the genitive, which is required by the quantifier.
- With the predicates nabyut, tryukt, pītikt, natryukt, napītikt, the noun phrase can be in the genitive.
- A finite subordinate clause may serve as the subject and is labeled csubj.
- The noun phrase may be in the dative, if the predicate is in the necessitative mood (maņ juosaver spēle “I have to watch the game”) or if the predicate is with modal meaning and has subordinated infinitive (jam vajadzātu pasasteigt “he should hurry”).
- Objects as defined in the Latgalian grammar may be either bare noun phrases in the accusative, dative, or genitive, or prepositional phrases in the accusative, dative, or genitive. All objects are labeled as obj or iobj.
  - However, if the predicate is in the necessitative mood, the object may be in the nominative (puikam juoatnas iudiņs “the boy has to bring the water.”), and it is labeled as obj.
  - Accusative objects are considered obj.
  - Objects in dative and genitive cases and prepositional objects are considered iobj.
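The choice between obj and iobj described above can be summarized as a small decision function. This is a sketch under stated assumptions: the function name object_relation and its parameters are hypothetical, and the phrase is assumed to have already been identified as an object rather than, for example, a dative subject of a necessitative predicate.

```python
def object_relation(case, prepositional=False, necessitative=False):
    """Pick obj or iobj for a phrase already known to be an object.

    case          -- Case value of the noun phrase (e.g. 'Acc')
    prepositional -- True if the object is a prepositional phrase
    necessitative -- True if the predicate is in the necessitative mood
    """
    if prepositional:
        return "iobj"  # prepositional objects, whatever their case
    if case == "Acc":
        return "obj"
    if case in {"Dat", "Gen"}:
        return "iobj"
    if case == "Nom" and necessitative:
        return "obj"  # nominative object of a necessitative predicate
    return None  # not covered by the rules above


print(object_relation("Nom", necessitative=True))
print(object_relation("Acc", prepositional=True))
```

The prepositional test comes first because prepositional objects are iobj regardless of case, including the accusative.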
Non-verbal Clauses
The copula verb byut “be” is used in equational and attributional nonverbal clauses. Purely existential clauses (including those indicating location) also use byut, but there it is treated as the head of the clause and tagged VERB.
Relations Overview
The following relation subtypes are used in Latgalian:
- nsubj:pass for nominal subjects of passive verbs
- csubj:pass for clausal subjects of passive verbs
- aux:pass for passive auxiliaries
- flat:foreign for non-first words in quoted foreign phrases
- flat:name for exocentric complex names
- advmod:neg for negative particles
- advmod:emph for emphasizing particles
The following relation types are not used for Latgalian: clf, dislocated, list, reparandum. However, reparandum should be introduced in the future, once appropriate speech texts are annotated.
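The two lists above can be used to flag unexpected DEPREL values. The following is a minimal sketch with hypothetical names (USED_SUBTYPES, UNUSED_RELATIONS, check_deprel); it checks only subtypes and the unused relations, and does not validate the base relation against the full universal inventory.

```python
USED_SUBTYPES = {
    "nsubj:pass", "csubj:pass", "aux:pass",
    "flat:foreign", "flat:name",
    "advmod:neg", "advmod:emph",
}
UNUSED_RELATIONS = {"clf", "dislocated", "list", "reparandum"}


def check_deprel(deprel):
    """Return a message if the relation is unexpected, otherwise None."""
    base = deprel.split(":", 1)[0]
    if base in UNUSED_RELATIONS:
        return f"{base} is not used for Latgalian"
    if ":" in deprel and deprel not in USED_SUBTYPES:
        return f"{deprel} is not a listed subtype"
    return None


for label in ("nsubj:pass", "obl:arg", "list"):
    print(label, "->", check_deprel(label))
```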
Treebanks
There is 1 Latgalian UD treebank: