Executive summary of changes from v1 to v2
This document summarizes the major changes from v1 to v2 of the universal guidelines. The primary purpose is to provide a checklist for treebank developers who want to start the transition from v1 to v2. More background and discussion can be found in the thematic reports on v2, which we cross-reference below. Complete documentation of the new guidelines will follow as quickly as possible.
Word segmentation
The ban on spaces inside words in v1 is lifted in v2 in two circumstances:
- For languages with writing systems that use spaces to mark units smaller than word (typically syllables), spaces are allowed in any word provided that this is declared in the language-specific documentation.
- For other languages, spaces are allowed only for an approved and restricted list of exceptions like numbers (“100 000”) and abbreviations (“i. e.”) that have to be listed explicitly in the language-specific documentation.
More discussion can be found in the section on word segmentation.
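As a rough illustration of the two circumstances above, a treebank developer could check word forms along the following lines. This is only a sketch: the function name, the flag and the allow-list are hypothetical, and a real list of exceptions would come from the language-specific documentation.

```python
# Hypothetical allow-list of forms that may contain spaces; a real treebank
# would take these exceptions from its language-specific documentation.
APPROVED_WITH_SPACES = {"100 000", "i. e."}


def form_is_allowed(form, spaces_allowed_everywhere=False):
    """Return True if a word form satisfies the v2 rule on internal spaces.

    Set spaces_allowed_everywhere=True for languages whose writing system
    uses spaces to mark units smaller than the word.
    """
    if " " not in form:
        return True
    if spaces_allowed_everywhere:
        return True
    return form in APPROVED_WITH_SPACES
```

With the default settings, `form_is_allowed("100 000")` is accepted while `form_is_allowed("New York")` is rejected.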
Part-of-speech tags
The universal tagset from v1 is basically kept intact in v2 with two minor revisions:
- The tag CONJ is renamed CCONJ to make it parallel to SCONJ (and more similar to the syntactic relation cc and less similar to conj).
- The guidelines for tags are modified in three cases:
  - The use of AUX is extended from auxiliary verbs in a narrow sense to also include copula verbs and nonverbal TAME particles (tense, aspect, mood, evidentiality, and, sometimes, voice or polarity particles).
  - The use of PART is restricted to a small set of words that must be listed in the language-specific documentation.
  - The borderline between PRON and DET is made more flexible to accommodate cross-linguistic variation.
More discussion can be found in the section on part-of-speech tags.
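Of the revisions above, only the renaming of CONJ is purely mechanical; the changes to AUX, PART, PRON and DET need linguistic judgment. A minimal sketch of the mechanical part, assuming one tab-separated CoNLL-U token line with ten columns and no trailing newline (the function name is hypothetical):

```python
def rename_conj_tag(line):
    """Rename the v1 tag CONJ to CCONJ in a single CoNLL-U token line.

    Comment lines, blank lines and lines without ten columns are
    returned unchanged.
    """
    if not line.strip() or line.startswith("#"):
        return line
    cols = line.split("\t")
    # Column 4 (index 3) holds the universal part-of-speech tag.
    if len(cols) == 10 and cols[3] == "CONJ":
        cols[3] = "CCONJ"
    return "\t".join(cols)
```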
Morphological features
The set of universal features is updated from v1 to v2 in the following ways:
- Existing features and feature values are renamed:
  - Negative → Polarity
  - Aspect=Pro → Aspect=Prosp
  - VerbForm=Trans → VerbForm=Conv
  - Definite=Red → Definite=Cons
- New features are added (or promoted to universal features):
  - Evident (evidentiality) with value Nfh (non-first hand)
  - Polite (politeness) with values Infm (informal), Form (formal), Elev (elevated status of referent; subtype of Form), Humb (humbled status of speaker; subtype of Form)
  - Abbr (abbreviation) with value Yes
  - Foreign with value Yes
- New values are added (or promoted) to existing features:
  - Animacy=Hum (human)
  - Case=Equ (equative) and Case=Cmp (comparative)
  - Degree=Equ (equative)
  - Definite=Spec (specific indefinite)
  - Number=Count (counting form or count plural), Number=Tri (trial), Number=Pauc (paucal), Number=Grpa (greater paucal), Number=Grpl (greater plural), Number=Inv (inverse number)
  - VerbForm=Gdv (gerundive, not gerund), VerbForm=Vnoun (verbal nouns other than infinitives)
  - Mood=Prp (purposive), Mood=Adm (admirative)
  - Aspect=Iter (iterative), Aspect=Hab (habitual)
  - Voice=Mid (middle voice), Voice=Antip (antipassive), Voice=Dir (direct), Voice=Inv (inverse)
  - PronType=Emp (emphatic), PronType=Exc (exclamative)
  - Person=0, Person=4
- Unused or poorly defined values are removed from existing features:
  - Tense=Nar (narrative)
  - NumType=Gen (generic)
Wherever possible, the feature system has been revised to improve clarity and consistency with other systems such as UniMorph. More discussion can be found in the section on features.
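The renamings and removals listed above lend themselves to a table-driven conversion of the FEATS column. The following is a sketch rather than an official conversion tool: the function and table names are hypothetical, the input is assumed to be a well-formed `Name=Value|Name=Value` string (or `_`), and removed values are only reported, since the summary does not say what should replace them.

```python
FEATURE_RENAMES = {"Negative": "Polarity"}
VALUE_RENAMES = {
    ("Aspect", "Pro"): "Prosp",
    ("VerbForm", "Trans"): "Conv",
    ("Definite", "Red"): "Cons",
}
REMOVED_VALUES = {("Tense", "Nar"), ("NumType", "Gen")}


def convert_feats(feats):
    """Convert a v1 FEATS string to v2 names; return (new_feats, warnings)."""
    if feats == "_":
        return feats, []
    converted, warnings = {}, []
    for pair in feats.split("|"):
        name, _, values = pair.partition("=")
        name = FEATURE_RENAMES.get(name, name)
        new_values = []
        for value in values.split(","):
            if (name, value) in REMOVED_VALUES:
                # Left as-is: a human has to decide on the replacement.
                warnings.append(f"{name}={value} was removed in v2")
            new_values.append(VALUE_RENAMES.get((name, value), value))
        converted[name] = ",".join(new_values)
    # Renaming a feature can change its position, so re-sort by name.
    ordered = sorted(converted.items(), key=lambda item: item[0].lower())
    return "|".join(f"{n}={v}" for n, v in ordered), warnings
```

For example, `convert_feats("Aspect=Pro|Negative=Neg")` yields `"Aspect=Prosp|Polarity=Neg"` with no warnings, whereas `Tense=Nar` is passed through unchanged and flagged.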
Syntactic relations
Although most syntactic relations are the same in v2 as in v1, the guidelines have often been improved by providing more explicit criteria and examples from multiple languages. In this summary, we only list cases where relations have been removed, added or renamed, or where the use of an existing relation has changed significantly.
Clauses and dependents of predicates
1. The dobj relation is renamed obj because this seems to be more easily reconcilable with the intended interpretation of “second core argument” or “P/A argument” (without connection to specific cases or semantic roles).
2. The nsubjpass, csubjpass and auxpass relations are removed. The use of subtypes nsubj:pass, csubj:pass and aux:pass is strongly encouraged for languages where these distinctions are relevant.
3. The nmod relation, which in v1 was used for nominals modifying either predicates or other nominals, is in v2 restricted to modifying nominals. A new relation obl (oblique) is introduced for oblique dependents of predicates.
4. The cop relation is restricted to function words (verbal or nonverbal) whose sole function is to link a nonverbal predicate to its subject and which do not add any meaning other than grammaticalized TAME categories. The range of constructions that are analyzed using the cop relation is subject to language-specific variation but can be identified using universal guidelines.
More discussion about 1-3 and 4 can be found in the sections on core-dependents and copula and nonverbal predicates, respectively.
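The first two changes above can be applied by simple renaming, while the nmod/obl split and the narrower cop relation depend on the construction and cannot be decided from the label alone. A minimal sketch under that assumption (the names below are hypothetical): rename what can be renamed and flag the rest for manual review.

```python
DEPREL_RENAMES = {
    "dobj": "obj",
    "nsubjpass": "nsubj:pass",
    "csubjpass": "csubj:pass",
    "auxpass": "aux:pass",
}

# Relations whose v2 use changed in ways that a label mapping cannot capture.
NEEDS_REVIEW = {"nmod", "cop"}


def convert_deprel(deprel):
    """Return (v2_deprel, needs_review) for a v1 relation label."""
    new = DEPREL_RENAMES.get(deprel, deprel)
    # Compare on the universal part of the label, ignoring any subtype.
    return new, new.split(":")[0] in NEEDS_REVIEW
```

Here `convert_deprel("dobj")` gives `("obj", False)`, while `convert_deprel("nmod")` returns the label unchanged together with `True`, signalling that a human (or a more informed heuristic) has to choose between nmod and obl.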
Coordination and ellipsis
1. Coordinating conjunctions (cc) and punctuation (punct) inside coordinated structures are in v2 attached to the immediately succeeding conjunct (instead of the first conjunct as in v1).
2. The remnant relation used to analyze ellipsis in v1 is removed. A new relation orphan is introduced in order to analyze ellipsis in a way that preserves the integrity of clauses and minimizes the use of special relations.
More discussion about 1 and 2 can be found in the sections on coordination and ellipsis, respectively.
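The new attachment of cc and punct can be approximated automatically. The sketch below assumes a simplified in-memory representation (a list of dicts with integer "id" and "head" and a string "deprel", all hypothetical names) and uses a heuristic: any cc or punct dependent of a head that has a later conj dependent is moved to the nearest such conjunct. Punctuation that precedes a conjunct for unrelated reasons would be moved as well, so the output should be reviewed.

```python
def reattach_coordination_markers(tokens):
    """Move cc/punct dependents of a first conjunct to the next conjunct.

    The list is modified in place and returned.
    """
    # Map each first conjunct (the head) to the ids of its conj dependents.
    conjuncts = {}
    for tok in tokens:
        if tok["deprel"].split(":")[0] == "conj":
            conjuncts.setdefault(tok["head"], []).append(tok["id"])

    for tok in tokens:
        if tok["deprel"].split(":")[0] not in ("cc", "punct"):
            continue
        # Conjuncts of the current head that come after this token.
        following = [c for c in conjuncts.get(tok["head"], []) if c > tok["id"]]
        if following:
            tok["head"] = min(following)
    return tokens


# v1-style analysis of "apples , pears and plums"
sentence = [
    {"id": 1, "head": 0, "deprel": "root"},
    {"id": 2, "head": 1, "deprel": "punct"},
    {"id": 3, "head": 1, "deprel": "conj"},
    {"id": 4, "head": 1, "deprel": "cc"},
    {"id": 5, "head": 1, "deprel": "conj"},
]
reattach_coordination_markers(sentence)
```

After the call, the comma is attached to token 3 and the conjunction to token 5, while the conjuncts themselves remain attached to token 1.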
Functional relations
1. A new relation clf (classifier) is added for nominal classifiers.
2. The aux relation is extended from auxiliary verbs in a narrow sense to also include nonverbal TAME particles (in analogy with the extended use of the part-of-speech tag AUX).
3. The auxpass relation is removed from the set of universal relations (see above).
4. The cop relation is restricted to pure linking words (see above).
5. The neg relation is removed from the set of universal relations, and polarity is instead encoded in a feature.
More discussion about 1-4 and 5 can be found in the sections on functional relations and semantic categories, respectively.
Multiword expressions
- The mwe relation is renamed fixed to make clear that it is only to be used for fixed grammaticized expressions that behave like function words or short adverbials.
- The name relation is removed because it has been misinterpreted as applying to all names.
- The foreign relation is removed because it has been judged superfluous.
- A new relation flat is added for semi-fixed multiword expressions for which there is no clear evidence that one of the components is the linguistic head. This covers the originally intended uses of the name and foreign relations but also other cases like title-name combinations and date expressions that do not have a clear endocentric syntactic structure. Subtypes like flat:name and flat:foreign can be used to preserve information in existing treebanks.
- The compound relation is extended to cover all types of complex predicates including not only particle verbs (compound:prt) but also serial verbs (compound:svc).
More discussion can be found in the sections on multiword expressions and semantic categories.
The CoNLL-U format
The following changes in the CoNLL-U format are adopted for v2:
- Spaces are allowed in the FORM and LEMMA fields (see above).
- Rows corresponding to empty nodes, with an indexing scheme distinct from both tokens and words, are allowed for the representation of ellipsis in the enhanced dependencies.
- The DEPS field should contain the entire enhanced dependency graph (not only additional relations on top of the basic trees).
- Sentence-level metadata are standardized using the comment prefixes sent_id and text, of which the former is now obligatory in all treebanks.
- The use of the MISC field is restricted for ease of processing by requiring that it can be split on the | (bar) character without any complex processing of escaping.
More discussion can be found in the section on the CoNLL-U format.
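To make these format changes concrete, here is a small reader-side sketch. It assumes the conventions of the CoNLL-U format documentation as we understand them: comments of the form `# key = value`, multiword token IDs written as ranges such as `1-2`, and empty-node IDs written with a decimal point such as `8.1`. The helper names are hypothetical.

```python
def parse_comments(lines):
    """Collect `# key = value` comment lines such as sent_id and text."""
    meta = {}
    for line in lines:
        if line.startswith("#") and "=" in line:
            key, _, value = line[1:].partition("=")
            meta[key.strip()] = value.strip()
    return meta


def classify_id(token_id):
    """Classify the ID column as 'range' (multiword token), 'empty' or 'word'."""
    if "-" in token_id:
        return "range"
    if "." in token_id:
        return "empty"
    return "word"


def parse_misc(misc):
    """Split the MISC column on '|' without any escape handling."""
    return [] if misc == "_" else misc.split("|")


def parse_deps(deps):
    """Split the DEPS column into (head, relation) pairs of the enhanced graph."""
    if deps == "_":
        return []
    # Heads stay strings because an empty node such as "8.1" can be a head.
    return [tuple(item.split(":", 1)) for item in deps.split("|")]
```

A converter could use `parse_comments` to verify that every sentence carries the now obligatory sent_id, and `classify_id` to keep empty nodes out of the basic tree while retaining them in the enhanced graph.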