UD for Latin
Tokenization and Word Segmentation
Index Thomisticus Treebank, LLCT and UDante
- The tokenization of the Index Thomisticus Treebank (IT-TB) is inherited from that of the original Index Thomisticus corpus by Father Roberto Busa. In general, words are delimited by whitespace characters. Punctuation marks are each assigned a token of their own. The exceptions are described below.
- Words that include the enclitics -que, -ve or -ne are split into two tokens: one for the word without the enclitic and one for the enclitic. Example: the word corporique (lit. “and to the body”) is split into the two tokens corpori and que. In the IT-TB, this is the only exception to the original tokenization provided by the Index Thomisticus. As of yet, multiword tokens are missing from the IT-TB (but not from LLCT or UDante).
- Hyphenated compounds such as necesse-esse are not split into two tokens. They are considered one token.
- Dots are not assigned a token on their own when they are part of an abbreviation (e.g., etc. and metaph.).
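The IT-TB rules listed above can be sketched in a few lines of Python. This is only an illustration, not the tool used to build any treebank: the names `KNOWN_STEMS`, `ABBREVIATIONS`, `split_enclitic` and `tokenize` are hypothetical, and the tiny word lists stand in for the lexical knowledge a real tokenizer would need (a bare suffix test would wrongly split lexicalized forms such as atque).

```python
import re

ENCLITICS = ("que", "ve", "ne")

# Hypothetical, hand-picked data for this sketch only; a real tokenizer
# would rely on a full lexicon or on the corpus's own word list.
KNOWN_STEMS = {"corpori"}            # forms that may host an enclitic
ABBREVIATIONS = {"etc.", "metaph."}  # the two abbreviations cited above


def split_enclitic(word):
    """Return [stem, enclitic] if the stem is a known form, else [word]."""
    lower = word.lower()
    for enclitic in ENCLITICS:
        stem_len = len(word) - len(enclitic)
        if lower.endswith(enclitic) and lower[:stem_len] in KNOWN_STEMS:
            return [word[:stem_len], word[stem_len:]]
    return [word]


def tokenize(text):
    """Whitespace tokenization with the exceptions described above."""
    tokens = []
    for chunk in text.split():
        # Abbreviations keep their dot.
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        # Hyphens stay inside words; every other punctuation mark
        # becomes a token of its own.
        for piece in re.findall(r"[\w-]+|[^\w\s]", chunk):
            if re.fullmatch(r"[\w-]+", piece):
                tokens.extend(split_enclitic(piece))
            else:
                tokens.append(piece)
    return tokens


print(tokenize("corporique necesse-esse, etc."))
```

Requiring the stem to be a known form is a design choice of this sketch; it trades recall for precision and leaves unlisted words untouched.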
PROIEL
- In general, words are delimited by whitespace characters. There is no punctuation in the syntactic trees. The exception is that words that include the enclitics -que, -ve or -ne are split into two tokens: one for the word without the enclitic and one for the enclitic. Example: the word corporique (lit. “and to the body”) is split into the two tokens corpori and que.
Perseus
- Tokenization is whitespace-based, with the exception of enclitic -que, -ve, and -ne being split.
Morphology
Tags
Index Thomisticus Treebank, LLCT and UDante
- The IT-TB uses 15 universal POS categories. Interjections (INTJ) and Symbols (SYM) are not used.
- Among determiners (DET), we notice the peculiar proto-article ly (only IT-TB).
- We generally register only one main auxiliary verb (AUX), the Classical sum (“to be”). The form iri (present passive infinitive of the verb eo, “to go”), Classically attested for the periphrastic future passive infinitive, is never used in the IT-TB, nor in the LLCT or UDante. Also habeo ‘to have’, a later Romance innovation, does not appear as an auxiliary in the IT-TB or in UDante, despite the time period, but does in the LLCT. The auxiliary sum is used in several types of constructions:
  - The copula with non-verbal predicates.
  - Periphrastic future active infinitive.
  - Periphrastic passive perfect indicative.
  - Periphrastic passive pluperfect indicative.
  - Periphrastic passive future perfect indicative.
  - Periphrastic passive perfect subjunctive.
  - Periphrastic passive pluperfect subjunctive.
  - Periphrastic perfect passive infinitive.
- In the IT-TB, sum is the only lemma that occurs with the AUX tag, and it always occurs as such. In some contexts, though (e.g. purely existential sentences), it is considered the head of the predication.
- Verbs with modal meaning are not considered auxiliaries in Latin.
PROIEL
- PROIEL uses 14 universal POS categories. Punctuation (PUNCT), Particles (PART) and Symbols (SYM) are not used.
- Among the traditional pronouns of Latin grammar, those that can occur adnominally are treated as determiners (DET).
- The only common auxiliary verb (AUX) is sum (“to be”), although there are a few cases of eo as used in the periphrastic future passive infinitive.
Perseus
- Perseus uses 12 POS tags. Notably, DET, PART, and AUX are not used.
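The tag counts given above can be cross-checked against the 17 universal POS tags of UD with simple set arithmetic. The sketch below is only a sanity check; the dictionary `UNUSED` and the function `used_tags` are hypothetical names, and the unused tags are exactly those named in the text.

```python
# The 17 universal POS tags defined by Universal Dependencies.
UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

# Tags reported above as unused, per treebank group.
UNUSED = {
    "IT-TB": {"INTJ", "SYM"},
    "PROIEL": {"PUNCT", "PART", "SYM"},
    "Perseus": {"DET", "PART", "AUX"},  # only the "notable" ones are named
}


def used_tags(treebank):
    """Tags that remain once the unused ones named in the text are removed."""
    return UPOS - UNUSED[treebank]


for name in UNUSED:
    print(name, len(used_tags(name)))
```

For IT-TB and PROIEL the result matches the stated 15 and 14 categories. For Perseus the three named tags leave 14, so the stated total of 12 implies that two further tags, not named in the text, are also unused.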
Nominal Features
Index Thomisticus Treebank, LLCT and UDante
- Nominal words (common/proper nouns NOUN/PROPN) have an inherent Gender feature with one of three values: Masc, Fem or Neut; pronouns (PRON) might have one, but can also not express one (e.g. ego ‘I’, first person singular).
- The two values of the Number feature are Sing and Plur. The following parts of speech inflect for number: NOUN, PROPN, PRON, ADJ, DET, NUM, VERB, AUX.
- Case has 7 possible values: nominative Nom, genitive Gen, dative Dat, accusative Acc, vocative Voc (marginal), locative Loc (marginal), ablative Abl. It occurs with nominal words, i.e., NOUN, PROPN, PRON, ADJ, DET, NUM, and also with attributive nominal forms of verbs (VERB/AUX) (participles, including gerundives/gerunds).
PROIEL
- Same as for the Index Thomisticus Treebank, except that PROIEL does not recognize a locative case in Latin.
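As a small illustration of the Case inventories described in this section, the following sketch checks a CoNLL-U style FEATS string against the values allowed for a given treebank. The helper names `allowed_cases` and `check_case` are hypothetical, and the sample feature strings are hand-written for the example.

```python
# Case values listed above for IT-TB, LLCT and UDante.
CASES = ("Nom", "Gen", "Dat", "Acc", "Voc", "Loc", "Abl")


def allowed_cases(treebank):
    """PROIEL has the same inventory minus the locative."""
    if treebank == "PROIEL":
        return tuple(c for c in CASES if c != "Loc")
    return CASES


def check_case(feats, treebank):
    """True if the FEATS string has no Case or an allowed Case value."""
    for item in feats.split("|"):
        name, _, value = item.partition("=")
        if name == "Case":
            return value in allowed_cases(treebank)
    return True


print(check_case("Case=Loc|Gender=Fem|Number=Sing", "IT-TB"))
print(check_case("Case=Loc|Gender=Fem|Number=Sing", "PROIEL"))
```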
Perseus
TODO
Degree
Index Thomisticus Treebank, LLCT and UDante
- Degree applies to adjectives (ADJ), some adverbs (ADV), and possibly determiners (DET), with the values of comparative Cmp and absolute superlative Abs (positive Pos is also traditionally considered a degree, but it is rather the absence of one). All these classes and nouns (NOUN) can show a diminutive degree; verbs (VERB), in their nominal forms, can take any of the above-mentioned values analogously.
PROIEL
Perseus
TODO
Verbal Features
Index Thomisticus Treebank, LLCT and UDante
- Verbs (both finite and non-finite forms) have a lexical Aspect: imperfective (Imp; traditionally infectum), perfective (Perf), prospective (Prosp; traditionally also “future”; only on nominal forms), and possibly inchoative (Inch, if considered inflectional and not derivational).
- Finite verbs always bear one of three values of Mood: indicative Ind (realis), imperative Imp or subjunctive Sub (irrealis).
- There are four values of Tense: past Past, present Pres, pluperfect Pqp and future Fut.
- There are two values of the Voice feature: active Act and passive Pass. Diathesis (voice) is tied to morphology, not syntax: this means that deponent verbs are always tagged as passive.
- Verb forms are annotated by means of VerbForm, which distinguishes the so-called finite form Fin from nominal forms, oriented towards the possible other lexical parts of speech: participles Part (verbal ADJ; including gerundives and also gerunds), verb nouns (verbal NOUN; traditionally infinitives Inf, and possibly also supines) and converbs (verbal ADV; traditionally the active supine Sup).
- Since the annotation of verbal features in the UD formalism deviates in some points from traditional denominations, TraditionalMood and TraditionalTense features are implemented in the MISC field so as to ease the retrieval of verb forms according to a more classical schema.
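Retrieving the traditional labels from the MISC field could look like the sketch below. It assumes the standard ten-column, tab-separated CoNLL-U token line with MISC as the last column. The sample line is hand-written, and the values `Indicativus` and `Praesens` are placeholders chosen for illustration; the actual value inventory should be checked in the treebank data. The function names are hypothetical.

```python
def parse_pairs(field):
    """Turn 'A=x|B=y' into {'A': 'x', 'B': 'y'}; '_' means empty."""
    if field == "_":
        return {}
    return dict(item.split("=", 1) for item in field.split("|") if "=" in item)


def traditional_labels(conllu_line):
    """Return (TraditionalMood, TraditionalTense) from the MISC column."""
    columns = conllu_line.rstrip("\n").split("\t")
    misc = parse_pairs(columns[9])  # MISC is the tenth column
    return misc.get("TraditionalMood"), misc.get("TraditionalTense")


# Hand-written token line; feature and MISC values are illustrative only.
LINE = "\t".join([
    "1", "potest", "possum", "VERB", "_",
    "Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act",
    "0", "root", "_",
    "TraditionalMood=Indicativus|TraditionalTense=Praesens",
])

print(traditional_labels(LINE))
```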
PROIEL
- There are four values of Tense: Past, Pres, Pqp or Fut.
- There are two values of Aspect: Imp and Perf.
- Traditional categories like the imperfect and the future perfect are decomposed into Aspect=Imp|Tense=Past and Aspect=Perf|Tense=Fut.
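The decomposition of traditional tenses can be expressed as a lookup table. The sketch below contains only the two correspondences stated above (imperfect and future perfect); it deliberately does not guess at the remaining traditional tenses. The names `TRADITIONAL_TO_UD`, `to_feats` and `traditional_name` are hypothetical.

```python
# Only the two decompositions given in the text are listed here.
TRADITIONAL_TO_UD = {
    "imperfect": {"Aspect": "Imp", "Tense": "Past"},
    "future perfect": {"Aspect": "Perf", "Tense": "Fut"},
}


def to_feats(traditional):
    """Render the feature bundle in CoNLL-U FEATS notation."""
    bundle = TRADITIONAL_TO_UD[traditional]
    return "|".join(f"{name}={value}" for name, value in sorted(bundle.items()))


def traditional_name(feats):
    """Return the traditional label matching a FEATS string, or None."""
    present = dict(item.split("=", 1) for item in feats.split("|") if "=" in item)
    for label, bundle in TRADITIONAL_TO_UD.items():
        if all(present.get(name) == value for name, value in bundle.items()):
            return label
    return None


print(to_feats("imperfect"))
print(traditional_name("Aspect=Perf|Mood=Ind|Tense=Fut"))
```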
Perseus
TODO
Pronouns, Determiners, Quantifiers
Index Thomisticus Treebank, LLCT and UDante
- PronType is used with pronouns (PRON), determiners (DET), some numerals (NUM), derived adverbs (ADV), and in a more etymological sense also conjunctions (CCONJ/SCONJ).
- NumType has four values: cardinal Card, distributive Dist, multiplicative Mult, ordinal Ord. NumType is used with numerals (NUM), adjectives (ADJ) and adverbs (ADV).
- The Poss feature marks possessive personal adjectives (e.g. noster ‘our’).
- The Reflex feature marks reflexive elements such as sui.
- Person is a feature of finite verbs (VERB/AUX) and has three values, 1, 2 and 3.
- Also other features, mostly of lexical type, are used.
PROIEL
Perseus
TODO
Other Features
Index Thomisticus Treebank, LLCT and UDante
Besides the layered features listed above, there are also the following language-specific features:
- Abbr for abbreviations. One value: Yes. Example: metaph. (standing for metaphysica)
- NumForm for Roman numerals. One value: Roman. Example: VIII (Roman numeral for ‘8’)
- VerbType has been used for modals. One value: Mod. Example: potest ‘(it/she/he) can’.
Perseus
TODO
Syntax
Core Arguments, Oblique Arguments and Adjuncts
Index Thomisticus Treebank, LLCT and UDante
- Nominal subject (nsubj) is a noun phrase (possibly headed by a verbal nominal form) mostly in the nominative, but also in the accusative or ablative case (and actually not limited to that), without preposition.
  - Clausal subjects (csubj) are very often expressed by means of verb nouns (“infinitives”), but can also be finite predications and then nearly always introduced by a (subordinating) conjunction.
- Objects (obj) (core arguments) are bare noun phrases in the accusative. Secondary nominal objects (iobj), also in the accusative, are a marginal and very restricted phenomenon in Latin (but the combination obj + ccomp is slightly more frequent).
  - A complement/adjunct distinction (not part of the UD formalism) is maintained by means of the subtyped relation obl:arg for those complements which are not expressed as core arguments.
- All other arguments (usually other cases than nominative/accusative or prepositional phrases) are oblique (obl). Being in the accusative case is a necessary, but not sufficient condition to be obj: particular cases are, for example, the relational (“Greek-style”) accusative noun phrases, but also temporal or locative arguments in the bare accusative.
- If a verb subcategorizes for the infinitive (e.g. modal verbs or verbs of control), the infinitival complement is labeled xcomp.
- In passive clauses:
  - the subject is labeled with either nsubj:pass or csubj:pass;
  - the auxiliary verb in periphrastic passives is labeled with aux:pass;
  - if the agent is present, it has the form of an ablative noun phrase, possibly introduced by the preposition ab ‘by’, and it is labeled obl:agent.
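The passive-clause labels described above can be collected from a CoNLL-U sentence with a few lines of code. In the sketch below, the sentence puella a matre laudata est (‘the girl has been praised by her mother’) and its annotation are hand-built for illustration and are not taken from any treebank; the name `passive_frame` is hypothetical.

```python
PASSIVE_RELATIONS = {"nsubj:pass", "csubj:pass", "aux:pass", "obl:agent"}

# Hand-built example, not taken from a treebank.
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SENTENCE = (
    "1\tpuella\tpuella\tNOUN\t_\t_\t4\tnsubj:pass\t_\t_\n"
    "2\ta\tab\tADP\t_\t_\t3\tcase\t_\t_\n"
    "3\tmatre\tmater\tNOUN\t_\t_\t4\tobl:agent\t_\t_\n"
    "4\tlaudata\tlaudo\tVERB\t_\t_\t0\troot\t_\t_\n"
    "5\test\tsum\tAUX\t_\t_\t4\taux:pass\t_\t_\n"
)


def passive_frame(conllu):
    """Map each passive-related relation to the form that bears it."""
    frame = {}
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        columns = line.split("\t")
        form, deprel = columns[1], columns[7]
        if deprel in PASSIVE_RELATIONS:
            frame[deprel] = form
    return frame


print(passive_frame(SENTENCE))
```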
PROIEL
PROIEL maintains the following differences, more proper to the original LDT formalism, with respect to the points explained above.
- Objects may be bare noun phrases in accusative, dative, genitive or ablative, or prepositional phrases. For the purpose of UD the objects are divided into core objects, labeled obj or iobj, and oblique objects, labeled obl:arg.
  - A bare accusative is considered core.
  - All prepositional objects are considered oblique.
  - If a verb subcategorizes for two core objects, one of them accusative (or ccomp) and the other non-accusative, then the non-accusative object is labeled iobj. Core nominal objects in other situations are labeled just obj.
- Adjuncts (or adverbial modifiers realized as noun phrases) are usually prepositional phrases, but they can be bare noun phrases as well. They are labeled obl. E.g.:
  - dative noun phrases with benefactive or possessive role (i.e. if the verb does not subcategorize for a single dative object and if it is not a verb of giving or similar, where the dative could be interpreted as the recipient).
  - ablative noun phrases expressing the way or means with which something was done.
  - all prepositional phrases that are not prepositional objects (i.e., their role and form is not defined lexically by the predicate) are adjuncts.
- In passive clauses, the agent is labeled with obl:arg.
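The PROIEL object-labelling rules above can be read as a small decision procedure. The sketch below is one possible reading, not the conversion code actually used: it assumes that every bare (non-prepositional) object counts as core, which the text does not state explicitly, and the function name `label_objects` and its input format are hypothetical.

```python
def label_objects(objects, has_ccomp=False):
    """Assign obj / iobj / obl:arg to the objects of one verb.

    objects: list of dicts with keys 'case' (e.g. 'Acc', 'Dat') and
    'prepositional' (bool). has_ccomp marks a clausal complement that
    fills the slot of the accusative object.
    Assumption of this sketch: every bare object is a core object.
    """
    core = [o for o in objects if not o["prepositional"]]
    n_core = len(core) + (1 if has_ccomp else 0)
    has_accusative_like = has_ccomp or any(o["case"] == "Acc" for o in core)

    labels = []
    for o in objects:
        if o["prepositional"]:
            labels.append("obl:arg")   # prepositional objects are oblique
        elif n_core == 2 and has_accusative_like and o["case"] != "Acc":
            labels.append("iobj")      # non-accusative next to Acc or ccomp
        else:
            labels.append("obj")       # all other core nominal objects
    return labels


print(label_objects([
    {"case": "Acc", "prepositional": False},
    {"case": "Dat", "prepositional": False},
]))
```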
Perseus
TODO
Non-verbal Clauses
Index Thomisticus Treebank and PROIEL
- The copula verb sum ‘to be’ is used in equational, attributional, locative, possessive and benefactory nonverbal clauses.
- Purely existential clauses (without indicating location) use sum as well, but this is treated as the head of the clause (and tagged VERB instead of AUX in PROIEL).
LLCT and UDante
The copula sum ‘to be’ is always AUX and always annotated as a functional dependent with cop of the given (more) lexical word, barring elliptic constructions.
Perseus
TODO
Relations Overview
Index Thomisticus Treebank, LLCT and UDante
A selection of the more specific (and interesting) relation subtypes used in these treebanks:
- acl:relcl for adnominal clauses functioning as relative clauses (which in Latin means that they contain a reference to the head by means of a so-called relative pronoun)
  - Note: adnominal clauses headed by participial verb forms do not receive the :relcl subtype.
- advcl:abs to highlight ablativi absoluti (embedded non-finite clauses whose subject does not also appear as argument in the matrix clause)
- advcl:cmp for the standard of comparison expressed as a(n elliptic) clause
  - and similarly obl:cmp for purely nominal standards of comparison
- advcl:pred for secondary predications of all kinds, including e.g. “conjunct participles” or “floating quantifiers” (embedded non-finite, possibly nominal, clauses whose subject is co-referent to an argument in the matrix clause)
  - xcomp:pred for secondary predications which are a core extension of the main predication
- ccomp:reported for object clauses expressing direct speech
  - parataxis:reporting for the reversed phenomenon
- csubj:relcl for so-called free relatives acting as subjects, containing a “double pronoun”
  - so similarly ccomp:relcl, xcomp:relcl, advcl:relcl
- advmod:emph for adverbs or particles that modify noun phrases and emphasize or negate them (focalisers)
- expl:pass for reflexive clitics in reflexive passives (only IT-TB)
- obl:arg for oblique arguments which are considered complements
- conj:expl for appositive additions functioning as expansions (also by repetition) of any element (the scilicet ‘that is’ type)
- various subtypes of dislocated according to the original relation of the displaced element
- The following relation types are not used at all: clf, list
PROIEL
- aux:pass for passive auxiliaries
- csubj:pass for clausal subjects of passive verbs
- flat:foreign for passages not in Latin (typically quoted passages in Greek)
- flat:name for multi-token names
- nsubj:pass for nominal subjects of passive verbs
- obl:agent for the agent of passive verbs
Perseus
No subtypes are used.
Treebanks
There are five Latin UD treebanks:
Documentation
The writing up of the Latin-specific documentation, detailing the guidelines pursued by the three currently active Latin treebanks (ITTB, LLCT and UDante), is a work in (slow) progress and has been focusing on the more language-specific issues and values, before going into the details of more universally defined elements. As of now, all research, setting of the guidelines and compilation of documentation pages is being performed by Flavio M. Cecchini (Università Cattolica del Sacro Cuore di Milano), so send any question, doubt or criticism to flavio.cecchini[at]unicatt[.]it, or, better, open an issue on UD’s GitHub!