Executive summary of changes from v1 to v2
This document summarizes the major changes from v1 to v2 of the universal guidelines. The primary purpose is to provide a checklist for treebank developers who want to start the transition from v1 to v2. More background and discussion can be found in the thematic reports on v2, which we cross-reference below. Complete documentation of the new guidelines will follow as quickly as possible.
The ban on spaces inside words in v1 is lifted in v2 in two circumstances:
- For languages with writing systems that use spaces to mark units smaller than a word (typically syllables), spaces are allowed in any word provided that this is declared in the language-specific documentation.
- For other languages, spaces are allowed only for an approved and restricted list of exceptions like numbers (“100 000”) and abbreviations (“i. e.”) that have to be listed explicitly in the language-specific documentation.
More discussion can be found in the section on word segmentation.
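As a rough illustration of the two circumstances above, a treebank check could look like the following sketch. The function name, the flag and the exception set are hypothetical; a real exception list would come from the language-specific documentation and may well be expressed as patterns rather than literal strings.

```python
def space_is_allowed(form, spaces_mark_subword_units=False, approved=frozenset()):
    """Return True if `form` is acceptable with respect to internal spaces.

    `spaces_mark_subword_units` and `approved` are hypothetical stand-ins for
    what the language-specific documentation declares.
    """
    if " " not in form:
        return True  # nothing to check
    if spaces_mark_subword_units:
        return True  # first circumstance: spaces allowed in any word
    return form in approved  # second circumstance: explicit exceptions only


exceptions = {"100 000", "i. e."}
print(space_is_allowed("100 000", approved=exceptions))
print(space_is_allowed("New York", approved=exceptions))
```

The two calls illustrate an approved exception and a form that would still have to be split into separate words.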
The universal tagset from v1 is basically kept intact in v2 with two minor revisions:
- The tag CONJ is renamed CCONJ to make it parallel to SCONJ (and more similar to the syntactic relation cc and less similar to conj).
- The guidelines for tags are modified in three cases:
  - The use of AUX is extended from auxiliary verbs in a narrow sense to also include copula verbs and nonverbal TAME particles (tense, aspect, mood, evidentiality, and, sometimes, voice or polarity particles).
  - The use of PART is restricted to a small set of words that must be listed in the language-specific documentation.
  - The borderline between PRON and DET is made more flexible to accommodate cross-linguistic variation.
More discussion can be found in the section on part-of-speech tags.
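For the tagset, only the rename of the coordinating conjunction tag (CONJ in v1, CCONJ in v2) can be applied mechanically; the guideline changes require human judgment. A minimal sketch, with hypothetical names, that renames the tag and flags tokens whose tag is affected by a modified guideline:

```python
UPOS_RENAMES = {"CONJ": "CCONJ"}

# Tags whose guidelines changed. PRON is included on the assumption that the
# more flexible DET borderline concerns pronouns. Copula verbs and TAME
# particles that should become AUX cannot be found from the tag alone.
REVIEW_TAGS = {"PART", "PRON", "DET"}


def migrate_upos(tag):
    """Return (v2_tag, needs_manual_review) for a v1 universal tag."""
    new_tag = UPOS_RENAMES.get(tag, tag)
    return new_tag, new_tag in REVIEW_TAGS


print(migrate_upos("CONJ"))
print(migrate_upos("PART"))
```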
The set of universal features is updated from v1 to v2 in the following ways:
- Existing features and feature values are renamed:
- New features are added (or promoted to universal features):
  - Evident (evidentiality)
  - Polite (politeness), with values including:
    - Elev (elevated status of referent)
    - Humb (humbled status of speaker)
  - Abbr (abbreviation)
- New values are added (or promoted) to existing features:
  - Number=Count (counting form or count plural)
  - VerbForm=Gdv (gerundive, not gerund)
  - VerbForm=Vnoun (verbal nouns other than infinitives)
- Unused or poorly defined values are removed from existing features:
Wherever possible, revisions to the feature system have been made to improve clarity and consistency with other systems such as UniMorph. More discussion can be found in the section on features.
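Feature renames can be applied to the FEATS column once it is parsed into name/value pairs. The sketch below assumes the usual CoNLL-U serialization (pairs joined by "|", "_" for an empty set, features sorted by name). The function names are our own, and the rename table is deliberately left to the caller rather than spelled out here.

```python
def parse_feats(feats):
    """Parse a FEATS value such as 'Number=Count|VerbForm=Gdv' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def format_feats(feats):
    """Serialize a feature dict, sorted by feature name (case-insensitive)."""
    if not feats:
        return "_"
    ordered = sorted(feats.items(), key=lambda item: item[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)


def rename_feats(feats, renames):
    """Apply a caller-supplied {(name, value): (new_name, new_value)} table."""
    return dict(renames.get(item, item) for item in feats.items())


feats = parse_feats("VerbForm=Vnoun|Number=Count")
print(format_feats(feats))
```

Keeping the rename table as data makes it easy to review the mapping separately from the code that applies it.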
Although most syntactic relations are the same in v2 as in v1, the guidelines have often been improved by providing more explicit criteria and examples from multiple languages. In this summary, we only list cases where relations have been removed, added or renamed, or where the use of an existing relation has changed significantly.
Clauses and dependents of predicates
- The dobj relation is renamed obj because this seems to be more easily reconcilable with the intended interpretation of “second core argument” or “P/A argument” (without connection to specific cases or semantic roles).
- The nsubjpass, csubjpass and auxpass relations are removed. The use of subtypes nsubj:pass, csubj:pass and aux:pass is strongly encouraged for languages where these distinctions are relevant.
- The nmod relation, which in v1 was used for nominals modifying either predicates or other nominals, is in v2 restricted to modifying nominals. A new relation obl (oblique) is introduced for oblique dependents of predicates.
- The cop relation is restricted to function words (verbal or nonverbal) whose sole function is to link a nonverbal predicate to its subject and which do not add any meaning other than grammaticalized TAME categories. The range of constructions that are analyzed using the cop relation is subject to language-specific variation but can be identified using universal guidelines.
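The relabeling of clause-level relations can be approximated as in the sketch below. The passive mappings assume that the v1 labels nsubjpass, csubjpass and auxpass correspond to the subtypes nsubj:pass, csubj:pass and aux:pass. The split of nmod into nmod and obl is decided here by the part-of-speech tag of the head, which is only a heuristic of ours: nominal predicates in particular need manual inspection.

```python
RENAMES = {
    "dobj": "obj",
    "nsubjpass": "nsubj:pass",
    "csubjpass": "csubj:pass",
    "auxpass": "aux:pass",
}

# Heuristic assumption: heads with these tags are treated as predicates.
PREDICATE_HEAD_UPOS = {"VERB", "ADJ", "ADV"}


def migrate_clause_deprel(deprel, head_upos):
    """Map a v1 relation to v2, given the universal tag of its head."""
    if deprel in RENAMES:
        return RENAMES[deprel]
    base, sep, subtype = deprel.partition(":")
    if base == "nmod" and head_upos in PREDICATE_HEAD_UPOS:
        return "obl" + sep + subtype  # keep any language-specific subtype
    return deprel


print(migrate_clause_deprel("nmod", "VERB"))
print(migrate_clause_deprel("nmod", "NOUN"))
```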
Coordination and ellipsis
- Coordinating conjunctions (cc) and punctuation (punct) inside coordinated structures are in v2 attached to the immediately succeeding conjunct (instead of the first conjunct as in v1).
- The remnant relation used to analyze ellipsis in v1 is removed. A new relation orphan is introduced in order to analyze ellipsis in a way that preserves the integrity of clauses and minimizes the use of special relations.
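The reattachment of cc and punct can be sketched as follows, assuming a sentence is held as a list of dictionaries with integer "id" and "head" values and a "deprel" string (a representation chosen for this example only). A token is moved only if it follows the first conjunct and some conjunct of the same coordination follows it; this is a heuristic and will not catch every configuration, for instance punctuation that belongs inside the first conjunct.

```python
def reattach_to_next_conjunct(tokens):
    """Move cc/punct from the first conjunct to the next conjunct on the right."""
    # Collect the ids of later conjuncts, grouped by the first conjunct.
    conjuncts = {}
    for tok in tokens:
        if tok["deprel"] == "conj":
            conjuncts.setdefault(tok["head"], []).append(tok["id"])

    for tok in tokens:
        if tok["deprel"] not in ("cc", "punct") or tok["id"] < tok["head"]:
            continue
        later = [c for c in conjuncts.get(tok["head"], []) if c > tok["id"]]
        if later:
            tok["head"] = min(later)  # the immediately succeeding conjunct
    return tokens


# "apples , pears and plums" in the v1 analysis
sentence = [
    {"id": 1, "form": "apples", "head": 0, "deprel": "root"},
    {"id": 2, "form": ",", "head": 1, "deprel": "punct"},
    {"id": 3, "form": "pears", "head": 1, "deprel": "conj"},
    {"id": 4, "form": "and", "head": 1, "deprel": "cc"},
    {"id": 5, "form": "plums", "head": 1, "deprel": "conj"},
]
print([tok["head"] for tok in reattach_to_next_conjunct(sentence)])
```

The conjuncts themselves stay attached to the first conjunct; only the conjunction and the comma move.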
- A new relation clf (classifier) is added for nominal classifiers.
- The aux relation is extended from auxiliary verbs in a narrow sense to also include nonverbal TAME particles (in analogy with the extended use of the part-of-speech tag AUX).
- The auxpass relation is removed from the set of universal relations (see above).
- The cop relation is restricted to pure linking words (see above).
- The neg relation is removed from the set of universal relations, and polarity is instead encoded in a feature.
- The mwe relation is renamed fixed to make clear that it is only to be used for fixed grammaticized expressions that behave like function words or short adverbials.
- The name relation is removed because it has been misinterpreted as applying to all names.
- The foreign relation is removed because it has been judged superfluous.
- A new relation flat is added for semi-fixed multiword expressions for which there is no clear evidence that one of the components is the linguistic head. This covers the originally intended uses of the name and foreign relations but also other cases like title-name combinations and date expressions that do not have a clear endocentric syntactic structure. Subtypes like flat:foreign can be used to preserve information in existing treebanks.
- The compound relation is extended to cover all types of complex predicates including not only particle verbs (compound:prt) but also serial verbs (compound:svc).
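For the multiword expression relations, a conversion pass might relabel what can be relabeled and report the rest. In the sketch below, mapping name to plain flat is our simplification (a subtype analogous to flat:foreign would preserve more information), and neg is only flagged because its replacement depends on the construction. All names are hypothetical.

```python
from collections import Counter

MWE_RENAMES = {"mwe": "fixed", "name": "flat", "foreign": "flat:foreign"}
# Removed relations for which no single replacement label is assumed here.
REVIEW_RELATIONS = {"neg"}


def migrate_mwe_relations(deprels):
    """Return (migrated labels, rename counts, indices needing review)."""
    migrated, review = [], []
    counts = Counter()
    for index, deprel in enumerate(deprels):
        new_label = MWE_RENAMES.get(deprel, deprel)
        if new_label != deprel:
            counts[(deprel, new_label)] += 1
        if deprel in REVIEW_RELATIONS:
            review.append(index)
        migrated.append(new_label)
    return migrated, counts, review


labels, counts, review = migrate_mwe_relations(["mwe", "name", "neg", "det"])
print(labels, review)
```

Because name was often applied to all names in v1, its relabeled instances are themselves worth a second look.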
The CoNLL-U format
The following changes in the CoNLL-U format are adopted for v2:
- Spaces are allowed in the FORM and LEMMA fields (see above).
- Rows corresponding to empty nodes, with an indexing scheme distinct from both tokens and words, are allowed for the representation of ellipsis in the enhanced dependencies.
- The DEPS field should contain the entire enhanced dependency graph (not only additional relations on top of the basic trees).
- Sentence-level metadata are standardized using the comment prefixes sent_id and text, of which the former is now obligatory in all treebanks.
- The use of the MISC field is restricted for ease of processing by requiring that it can be split on the | (bar) character without any complex processing of escaping.
More discussion can be found in the section on the CoNLL-U format.
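The format changes affect how a reader has to treat the ID, DEPS and MISC columns and the comment lines. The helpers below are a minimal sketch under the following assumptions: empty nodes use decimal IDs such as 8.1, multiword tokens use ranges such as 1-2, DEPS holds head:relation pairs joined by "|", and metadata comments have the shape "# key = value". The function names are our own.

```python
import re

WORD_ID = re.compile(r"^[1-9][0-9]*$")
RANGE_ID = re.compile(r"^[1-9][0-9]*-[1-9][0-9]*$")
EMPTY_ID = re.compile(r"^[0-9]+\.[1-9][0-9]*$")
COMMENT = re.compile(r"^#\s*(\w+)\s*=\s*(.*)$")


def classify_id(token_id):
    """Tell syntactic words, multiword token ranges and empty nodes apart."""
    if WORD_ID.match(token_id):
        return "word"
    if RANGE_ID.match(token_id):
        return "multiword"
    if EMPTY_ID.match(token_id):
        return "empty"
    raise ValueError(f"unrecognized ID: {token_id!r}")


def parse_deps(deps):
    """Parse DEPS into (head, relation) pairs; heads stay strings (e.g. '8.1')."""
    if deps == "_":
        return []
    # Split on the first colon only, since relations may contain colons.
    return [tuple(item.split(":", 1)) for item in deps.split("|")]


def parse_misc(misc):
    """Split MISC on the bar character, with no escape handling."""
    return [] if misc == "_" else misc.split("|")


def parse_comment(line):
    """Return (key, value) for '# key = value' lines, else None."""
    match = COMMENT.match(line)
    return (match.group(1), match.group(2)) if match else None


print(classify_id("8.1"), parse_deps("2:nsubj|8.1:conj"))
print(parse_comment("# sent_id = doc1-s1"))
```

Keeping DEPS heads as strings avoids losing the distinction between a word ID and an empty-node ID.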