UD for Finnish
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters.
- Multitoken words (a single syntactic word corresponds to multiple orthographic tokens) include:
- Emoticons like “: )”
- Numerical expressions like “20 000”
- Multiword tokens (a single orthographic token corresponds to multiple syntactic words) include:
- “ettei” → “että ei”
- “miksen” → “miksi en”
- “emmä” → “en mä”
- Punctuation marks are treated as separate tokens; the exceptions include:
- ordinal numbers (1. tammikuuta)
- abbreviations (esim.)
- Emoticons are single tokens
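As a rough illustration of the multiword-token rule above, the following sketch expands a contracted form into its syntactic words and numbers them the way CoNLL-U does, with an ID range line for the surface token followed by one line per word. The lookup table `MWT_TABLE` and the function `to_conllu_ids` are hypothetical names introduced here, the table holds only the three contractions listed above, and the input is assumed to be already split into orthographic tokens.

```python
# Hypothetical lookup table, seeded only with the three contractions listed above.
MWT_TABLE = {
    "ettei": ["että", "ei"],
    "miksen": ["miksi", "en"],
    "emmä": ["en", "mä"],
}


def to_conllu_ids(tokens):
    """Return (ID, FORM) pairs, adding an ID range row before each split token."""
    rows = []
    next_id = 1
    for token in tokens:
        words = MWT_TABLE.get(token.lower())
        if words is None:
            # Ordinary token: one orthographic token, one syntactic word.
            rows.append((str(next_id), token))
            next_id += 1
        else:
            # Multiword token: a range row for the surface form,
            # then one numbered row per syntactic word.
            last_id = next_id + len(words) - 1
            rows.append((f"{next_id}-{last_id}", token))
            for word in words:
                rows.append((str(next_id), word))
                next_id += 1
    return rows


rows = to_conllu_ids(["Tiedän", "ettei", "hän", "tule"])
for row_id, form in rows:
    print(row_id, form, sep="\t")
```

A real tokenizer would need a much larger inventory of contractions and would also have to preserve the capitalization of the surface form; the table-driven approach is only meant to make the ID numbering concrete.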
Morphology
Tags
- Finnish uses all 17 universal POS categories
- Finnish has the following auxiliary verbs:
- olla, ei, voida, pitää, saattaa, täytyä, joutua, aikoa, taitaa, tarvita, mahtaa (Finnish-TDT and Finnish-PUD)
- Finnish has one copula verb:
- olla
Verbal Features
- There are three main verbal forms distinguished by the value of the VerbForm feature:
- Mood has four values: Cnd, Imp, Ind or Pot.
- Tense has two values: Past or Pres.
- Voice has two values: Act and Pass.
- Person has four values: 0, 1, 2 and 3.
- Number has values Sing or Plur.
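The value inventories above lend themselves to a mechanical check. The sketch below parses a CoNLL-U FEATS string and reports any verbal feature value that falls outside the lists given in this section. The names `VERBAL_FEATURES`, `parse_feats` and `invalid_verbal_features` are hypothetical, VerbForm is left out because its values are not enumerated here, and each feature is assumed to carry a single value.

```python
# Allowed values copied from the list above. VerbForm is omitted because
# its values are not enumerated in this section.
VERBAL_FEATURES = {
    "Mood": {"Cnd", "Imp", "Ind", "Pot"},
    "Tense": {"Past", "Pres"},
    "Voice": {"Act", "Pass"},
    "Person": {"0", "1", "2", "3"},
    "Number": {"Sing", "Plur"},
}


def parse_feats(feats):
    """Parse a CoNLL-U FEATS string such as 'Mood=Ind|Tense=Pres' into a dict."""
    if feats == "_":  # CoNLL-U uses an underscore for an empty field
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def invalid_verbal_features(feats):
    """Return the (feature, value) pairs that fall outside VERBAL_FEATURES."""
    parsed = parse_feats(feats)
    return [
        (name, value)
        for name, value in parsed.items()
        if name in VERBAL_FEATURES and value not in VERBAL_FEATURES[name]
    ]


ok = invalid_verbal_features(
    "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act"
)
bad = invalid_verbal_features("Mood=Sub|Tense=Fut|Voice=Act")
print(ok)
print(bad)
```

Features that are not in the table, such as VerbForm, are passed through without comment, so the check only flags values that contradict the lists above.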
Nominal Features
- Finnish does not have the Gender feature.
- The Number feature has two possible values: Sing and Plur.
- Case has 15 possible values: Abe, Abl, Acc, Ade, All, Com, Ela, Ess, Gen, Ill, Ine, Ins, Nom, Par, Tra.
Degree and Polarity
- Degree applies to adjectives (ADJ), adverbs (ADV) and participle verbs (VERB or AUX), and has one of three possible values: Pos, Cmp, Sup.
- Polarity has only one value, Neg, and applies to the negative verb ‘ei’.
- Connegative has only one value, Yes, and applies to verbs which have been negated by ‘ei’.
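Because a connegative verb form is one that has been negated by ‘ei’, the two features can be cross-checked within a sentence. The following sketch flags forms marked Connegative=Yes when the sentence contains no token with lemma ‘ei’ and Polarity=Neg. The function name `unlicensed_connegatives` and the dictionary-based token layout are assumptions made for illustration, and the sentence-level check is only a heuristic, since it does not follow the dependency tree.

```python
def unlicensed_connegatives(sentence):
    """Return forms marked Connegative=Yes in a sentence with no negative 'ei'."""
    has_negation = any(
        tok["lemma"] == "ei" and tok["feats"].get("Polarity") == "Neg"
        for tok in sentence
    )
    if has_negation:
        return []
    return [
        tok["form"]
        for tok in sentence
        if tok["feats"].get("Connegative") == "Yes"
    ]


# "En tiedä" ("I do not know"): negative verb plus connegative main verb.
negated = [
    {"form": "En", "lemma": "ei", "feats": {"Polarity": "Neg"}},
    {"form": "tiedä", "lemma": "tietää", "feats": {"Connegative": "Yes"}},
]
# The same connegative form with the negative verb missing.
suspicious = [
    {"form": "tiedä", "lemma": "tietää", "feats": {"Connegative": "Yes"}},
]

print(unlicensed_connegatives(negated))
print(unlicensed_connegatives(suspicious))
```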
Syntax
- A nominal subject (nsubj) is a nominal in the nominative, genitive or partitive case (pronouns also accusative), without a preposition.
- Objects (obj) can be nominals in the nominative, genitive or partitive case (pronouns also accusative), without a preposition.
- The copula verb olla (be) is used in equational, attributional, locative, possessive and benefactory nonverbal clauses.
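The case restrictions on subjects and objects described above can be written as a small predicate. This is a minimal sketch under several assumptions: the name `case_allowed` is hypothetical, subtypes such as nsubj:cop are treated like their base relation, and tokens without a Case feature are not checked at all.

```python
# Cases that the description above allows for both nsubj and obj nominals.
CORE_ARGUMENT_CASES = {"Nom", "Gen", "Par"}


def case_allowed(deprel, upos, case):
    """Check the case of a subject or object against the description above."""
    base = deprel.split(":")[0]  # treat e.g. nsubj:cop like nsubj (an assumption)
    if base not in {"nsubj", "obj"}:
        return True  # the rule only covers nsubj and obj
    if case is None:
        return True  # tokens without a Case feature are not checked
    if case in CORE_ARGUMENT_CASES:
        return True
    # Accusative is allowed for pronouns only.
    return upos == "PRON" and case == "Acc"


checks = [
    case_allowed("nsubj", "NOUN", "Nom"),
    case_allowed("obj", "PRON", "Acc"),
    case_allowed("obj", "NOUN", "Acc"),
    case_allowed("nsubj:cop", "NOUN", "Ine"),
]
print(checks)
```

A failed check is better read as a prompt to inspect the annotation than as proof of an error, since the description above is a summary rather than an exhaustive rule.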
Relations Overview
- The following relation subtypes are used in Finnish:
- acl:relcl for relative clauses
- aux:pass for passive auxiliaries
- cc:preconj for constructions like sekä … että
- compound:nn for noun compound modifiers and appellation modifiers
- compound:prt for adverbial components of particle verbs
- cop:own for possessive copula clauses like Minulla on kynä
- csubj:cop for clausal subject in copula clauses
- flat:foreign for foreign-language token sequences
- flat:name for multi-token proper names
- nmod:gobj for genitive objects of noun derivations of verbs (like talon rakentaminen)
- nmod:gsubj for genitive subjects of noun derivations of verbs (like maljakon särkyminen)
- nmod:poss for possessive genitive modifiers
- nsubj:cop for nominal subject in copula clauses
- xcomp:ds for clausal complements with different subject
- The following relation types are not used in Finnish at all:
- clf, dislocated, iobj, list
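The two lists above can be turned into a simple label check. The sketch below classifies a DEPREL value as acceptable, as one of the relations not used in Finnish, or as a subtype that is not in the list. The names `FINNISH_SUBTYPES`, `UNUSED_RELATIONS` and `check_deprel` are hypothetical, and the base relation itself is not validated against the full universal inventory.

```python
# Subtypes and unused relations copied from the lists above.
FINNISH_SUBTYPES = {
    "acl:relcl", "aux:pass", "cc:preconj", "compound:nn", "compound:prt",
    "cop:own", "csubj:cop", "flat:foreign", "flat:name", "nmod:gobj",
    "nmod:gsubj", "nmod:poss", "nsubj:cop", "xcomp:ds",
}
UNUSED_RELATIONS = {"clf", "dislocated", "iobj", "list"}


def check_deprel(deprel):
    """Classify a DEPREL label as 'ok', 'unused-relation' or 'unknown-subtype'."""
    base = deprel.split(":")[0]
    if base in UNUSED_RELATIONS:
        return "unused-relation"
    if ":" in deprel and deprel not in FINNISH_SUBTYPES:
        return "unknown-subtype"
    # Plain universal relations are accepted without further checks.
    return "ok"


for label in ["nsubj:cop", "obj", "iobj", "nmod:tmod"]:
    print(label, check_deprel(label))
```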
Treebanks
There are three Finnish UD treebanks: