Enhanced Dependencies
We always intended the Universal Dependencies representation to be used in shallow natural language understanding tasks such as
relation extraction or biomedical event extraction. For such tasks, one is typically interested in the relation between certain
entities, e.g., the relation between two persons or whether one protein interacts with another. UD is particularly well suited
for such tasks as UD trees contain many direct dependencies between content words and many of the dependency labels provide a
lot of information about the type of relation between two content words. However, for some constructions, the dependency path
between two content words of interest can be very long in a UD tree, which complicates determining how the content words are
related. Further, some dependency types such as `obl` or `nmod` are used for many different types of
arguments and modifiers, and therefore they are not very informative on their own. For these reasons, we also provide guidelines
for an enhanced representation, which makes some of the implicit relations between words more explicit, and augments some of
the dependency labels to facilitate the disambiguation of types of arguments and modifiers.
Enhanced UD graphs may contain some or all of the following enhancements, which are described in the sections below. If a corpus does not annotate any of the enhancements defined in the guidelines, it should always have the underscore character in the DEPS column. That is, the enhanced graph should not be just an exact copy of the basic tree for all sentences in the corpus. Otherwise it creates the impression that the user can expect some enhancements while there are actually none.
- Empty (null) nodes for elided predicates
- Propagation of incoming dependencies to conjuncts
- Propagation of outgoing dependencies from conjuncts
- Additional subject relations for control and raising constructions
- Coreference in relative clause constructions
- Modifier labels that contain the preposition, other case marker or conjunction
Note that the enhanced graph is not necessarily a supergraph of the basic tree, i.e., the graph is not required to contain all the basic dependency relations. For this reason, all relations of the enhanced graph (including the ones that are present in the basic UD tree) have to be included in the DEPS column of a CoNLL-U file. See the specification of the CoNLL-U file format for details.
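As a rough illustration of how the DEPS column can be read, the following sketch parses a DEPS value into head/relation pairs and checks whether a basic relation is repeated there. The function names `parse_deps` and `missing_from_enhanced` are hypothetical and not part of any UD tool; the sketch only assumes the CoNLL-U conventions that pairs are separated by `|`, that head and relation are separated by the first colon, and that `_` marks an empty value.

```python
def parse_deps(deps_field):
    """Parse a CoNLL-U DEPS value into a list of (head_id, relation) pairs.

    Head IDs stay strings because empty nodes use decimal IDs such as "5.1".
    """
    if deps_field == "_":
        return []  # no enhanced annotation on this token
    pairs = []
    for item in deps_field.split("|"):
        # Split on the first colon only: the relation itself may contain colons.
        head, relation = item.split(":", 1)
        pairs.append((head, relation))
    return pairs


def missing_from_enhanced(head, deprel, deps_field):
    """Return True if the basic relation (HEAD, DEPREL) is not repeated in DEPS."""
    return (head, deprel) not in parse_deps(deps_field)
```

Keeping head IDs as strings avoids losing the distinction between a regular token `5` and an empty node `5.1`.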
Furthermore, the dependency relation labels in the enhanced graph in DEPS may contain certain extensions that are not permitted
in the basic relation type in the DEPREL column. The regular expression restricting relation labels in DEPREL is pretty simple;
the label can contain only lowercase English letters and at most one colon, which separates the universal and the language-specific
part of the label: `^[a-z]+(:[a-z]+)?$`. In contrast, the relation label in DEPS may contain up to three colons, separating up to
four sections. One of the sections (never the first one) may also contain lowercase Unicode letters and the underscore character:
`^[a-z]+(:[a-z]+)?(:[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(_[\p{Ll}\p{Lm}\p{Lo}\p{M}]+)*)?(:[a-z]+)?$`.
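The two patterns can be checked programmatically. Python's built-in `re` module does not support `\p{...}` property classes, so the sketch below approximates the Unicode part with `unicodedata.category` instead of using the regular expression verbatim. The helper names are hypothetical, and this is an illustration of the rules stated above rather than the official validator.

```python
import re
import unicodedata


def is_valid_basic_label(label):
    """DEPREL: lowercase ASCII letters, optionally one colon plus a subtype."""
    return re.fullmatch(r"[a-z]+(:[a-z]+)?", label) is not None


def is_ascii_section(section):
    return re.fullmatch(r"[a-z]+", section) is not None


def is_case_section(section):
    """One or more words joined by single underscores; every character must be
    in Unicode category Ll, Lm, Lo or one of the mark categories (M*)."""
    def ok(ch):
        cat = unicodedata.category(ch)
        return cat in ("Ll", "Lm", "Lo") or cat.startswith("M")
    return all(word and all(ok(ch) for ch in word) for word in section.split("_"))


def _match(sections, slots):
    """Match sections against optional slots, in order (each slot used at most once)."""
    if not sections:
        return True
    if not slots:
        return False
    use_slot = slots[0](sections[0]) and _match(sections[1:], slots[1:])
    return use_slot or _match(sections, slots[1:])


def is_valid_enhanced_label(label):
    """DEPS relation: mandatory universal relation plus up to three optional sections."""
    first, *rest = label.split(":")
    slots = [is_ascii_section, is_case_section, is_ascii_section]
    return is_ascii_section(first) and _match(rest, slots)
```

For example, `is_valid_enhanced_label("obl:před:ins")` is accepted, while the same string is rejected by `is_valid_basic_label` because of the non-ASCII letter and the second colon.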
Only the first section, the universal relation, is mandatory. The other sections are optional but if they appear, they must appear
in the order described below. We provide a more detailed explanation of the extra sections later on this page; here is a summary:
- Universal dependency relation. In addition to the 37 relations defined in the basic representation, the relation can also be ref.
- Documented relation subtype (either language-specific or more general) from the basic representation.
- The string xsubj, denoting external subject relations of xcomp predicates. This extension is used only with nsubj, csubj, and their subtypes such as nsubj:pass. It does not combine with the other extensions described below because they do not apply to subjects.
- Case and similar information: adposition or conjunction that occurs as a `case`, `mark` or `cc` dependent of the node whose relation to its parent is being enhanced. Note that this is the only part where non-ASCII letters are permitted within the enhanced relation label. The word should be normalized (lowercased, no typos), i.e., in general we take its lemma. However, if the case/mark dependent is a fixed multi-word expression, the lemma of the expression is not necessarily composed of lemmas of the individual member words. For instance, the string representing the English expression “As Opposed To” is `as_opposed_to`. That is, the casing is normalized from “As” to “as” etc., but “opposed” is not replaced by its lemma “oppose” because the expression is fixed. Similarly, grammaticalized deverbal connectives such as “regarding” may in some languages (if required by the language-specific guidelines) still be tagged VERB, despite being attached as case, and their lemma will thus be verbal (“regard”); nevertheless, the corresponding deprel extension should be the grammaticalized form, i.e., “regarding”. Language-specific guidelines may also specify that certain synonyms (e.g., “toward” and “towards”) be mapped on the same enhanced label, despite having different lemmas. We use the underscore character (“_”) to connect member words. The same approach can also be taken when a node has multiple case markers that are not annotated as a fixed expression, e.g., `out_of` for “out of business”.
- Case information: morphological case of the node whose relation to its parent is being enhanced. Value corresponds to the value of the Case feature but it is lowercased (e.g., `gen` instead of `Gen`). Unlike in morphological features, multivalues with comma (`Case=Acc,Dat`) are not allowed. Case information in enhanced relations must be fully disambiguated.
Ellipsis
(See also the guidelines on ellipsis.)
In the enhanced representation, we add special empty (null) nodes in clauses in which a predicate is elided. (Although the node is termed ‘empty’ in the CoNLL-U format specification, and although it does not correspond to an overt surface token, its FORM, LEMMA, UPOS, XPOS and FEATS may be optionally filled with the assumed values; here they can be copied from the overt occurrence of the predicate.)
Note that this is a case in which the enhanced UD graph is not a supergraph of the basic tree as the basic tree contains orphan
relations, which are not present in the enhanced UD graph.
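To make the point about orphan relations concrete, here is a small sketch over a gapping sentence (“Mary won gold and Peter bronze.”). The rows are a reduced, hand-written view of the CoNLL-U columns (ID, FORM, HEAD, DEPREL, DEPS) and the annotation is illustrative only; the helper names are hypothetical. The code lists the empty nodes and the basic relations that have no counterpart in the enhanced graph.

```python
# Reduced CoNLL-U rows: (ID, FORM, HEAD, DEPREL, DEPS). Illustrative annotation.
ROWS = [
    ("1", "Mary", "2", "nsubj", "2:nsubj"),
    ("2", "won", "0", "root", "0:root"),
    ("3", "gold", "2", "obj", "2:obj"),
    ("4", "and", "5", "cc", "5.1:cc"),
    ("5", "Peter", "2", "conj", "5.1:nsubj"),
    ("5.1", "won", "_", "_", "2:conj"),  # empty node for the elided predicate
    ("6", "bronze", "5", "orphan", "5.1:obj"),
    ("7", ".", "2", "punct", "2:punct"),
]


def is_empty_node(token_id):
    """Empty nodes carry decimal IDs such as "5.1"."""
    return "." in token_id


def basic_edges(rows):
    """(head, dependent, relation) triples of the basic tree; empty nodes have none."""
    return {(head, tid, rel) for tid, _, head, rel, _ in rows if not is_empty_node(tid)}


def enhanced_edges(rows):
    """(head, dependent, relation) triples read from the DEPS values."""
    edges = set()
    for tid, _, _, _, deps in rows:
        if deps == "_":
            continue
        for item in deps.split("|"):
            head, rel = item.split(":", 1)
            edges.add((head, tid, rel))
    return edges


empty_nodes = [tid for tid, *_ in ROWS if is_empty_node(tid)]
# Basic relations that do not survive into the enhanced graph.
dropped = basic_edges(ROWS) - enhanced_edges(ROWS)
```

In this example `dropped` contains the `orphan` relation together with the `cc` and `conj` relations that were re-attached to the empty node, which is why the enhanced graph is not a supergraph of the basic tree here.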
Propagation of incoming dependencies to conjuncts
In the basic representation, the governor and dependents of a conjoined phrase are all attached to the first conjunct. This often leads to very long dependency paths between content words. The enhanced representation therefore also contains dependencies between the other conjuncts and the governor and dependents of the phrase.
Conjoined subjects and objects
When the subject is a conjoined noun phrase, each of the conjuncts is attached to the predicate.
The same is true for conjoined objects.
This leads to slightly strange dependencies in the case of collective subjects or objects:
However, as the distinction between distributive and collective readings is often context-dependent, we take the simplest approach and always attach all conjuncts to the predicate.
When the subject is attached to a control or raising predicate, there is a dependency between the matrix verb and each conjunct and between the embedded verb and each conjunct.
Conjoined modifiers
Each conjunct in a conjoined modifier phrase gets attached to the governor of the modifier phrase. For example, the following phrase contains a conjoined adjectival phrase that modifies a noun. In the enhanced representation, there is an additional `amod` relation between the noun river and the second conjunct wide.
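The propagation of incoming dependencies can be sketched as a simple graph operation: for every `conj` edge, each non-`conj` edge that enters the first conjunct is copied to the other conjunct. The function name and the edge representation (`(head, dependent, relation)` triples with integer token IDs) are assumptions made for this illustration, and the token IDs below describe a phrase like “a long and wide river”.

```python
def is_conj(relation):
    return relation == "conj" or relation.startswith("conj:")


def propagate_incoming(edges):
    """Copy the incoming relation of a first conjunct to its other conjuncts.

    edges: set of (head, dependent, relation) triples from the basic tree.
    Returns a new set that also contains the propagated edges.
    """
    enhanced = set(edges)
    for first, other, rel in edges:
        if not is_conj(rel):
            continue
        for head, dep, incoming in edges:
            # Skip conj itself so nested coordination is not duplicated blindly.
            if dep == first and not is_conj(incoming):
                enhanced.add((head, other, incoming))
    return enhanced


# 1 a, 2 long, 3 and, 4 wide, 5 river
basic = {(5, 1, "det"), (5, 2, "amod"), (2, 4, "conj"), (4, 3, "cc")}
enhanced = propagate_incoming(basic)
```

The only edge added in this example is `(5, 4, "amod")`, linking the noun to the second adjective.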
Propagation of outgoing dependencies from conjuncts
As noted above, the basic representation attaches the dependents of a conjoined phrase to the first conjunct only. The enhanced representation therefore also contains dependencies between the other conjuncts and the dependents that they share with the first conjunct.
Conjoined verbs and verb phrases
When two verbs share their objects (or other complements), the subject and the object of the conjoined verbs are attached to every conjunct.
However, if the complements of the second verb are not shared, only the shared dependents are attached to every conjunct.
Similarly, the enhanced representation can also distinguish private dependents of the first verb. Note however that in this case it cannot be inferred from the basic representation automatically.
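Because shared and private dependents cannot always be told apart automatically, a propagation routine needs that decision as input. The sketch below assumes the caller supplies the set of shared dependents (from manual annotation or some heuristic); the function name, the edge triples and the example sentence (“The store buys and sells cameras”) are illustrative assumptions.

```python
def propagate_outgoing(edges, shared):
    """Attach shared dependents of the first conjunct to every other conjunct.

    edges:  set of (head, dependent, relation) triples from the basic tree.
    shared: IDs of dependents judged to be shared by all conjuncts.
    """
    enhanced = set(edges)
    for first, other, rel in edges:
        if rel != "conj" and not rel.startswith("conj:"):
            continue
        for head, dep, outgoing in edges:
            if head == first and dep in shared:
                enhanced.add((other, dep, outgoing))
    return enhanced


# 1 The, 2 store, 3 buys, 4 and, 5 sells, 6 cameras
basic = {
    (2, 1, "det"), (3, 2, "nsubj"), (0, 3, "root"),
    (5, 4, "cc"), (3, 5, "conj"), (3, 6, "obj"),
}
both_shared = propagate_outgoing(basic, shared={2, 6})
subject_only = propagate_outgoing(basic, shared={2})
```

With `shared={2, 6}` both the subject and the object are attached to the second verb; with `shared={2}` the object stays private to the first verb.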
Controlled/raised subjects
The basic trees lack a subject dependency between a controlled verb and its controller
or between an embedded verb and its raised subject. In the enhanced graph, there is an
additional dependency between the embedded verb and the subject of the matrix clause.
This dependency can be recognized by the extension (subtype) `:xsubj`.
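A minimal sketch of this enhancement, under simplifying assumptions: it only handles the case described above, where the controller is the subject of the matrix clause, and it deliberately skips matrix predicates that also have an object, since those may involve object control. The label is built from the universal part of the matrix subject relation plus `:xsubj`; voice in the embedded clause is not checked. The function name and edge representation are hypothetical.

```python
def base(relation):
    """Universal part of a relation label, e.g. "nsubj" for "nsubj:pass"."""
    return relation.split(":")[0]


def add_xsubj(edges):
    """Add an :xsubj edge from each xcomp predicate to the matrix subject.

    edges: set of (head, dependent, relation) triples from the basic tree.
    """
    enhanced = set(edges)
    for matrix, embedded, rel in edges:
        if base(rel) != "xcomp":
            continue
        dependents = [(dep, r) for head, dep, r in edges if head == matrix]
        if any(base(r) in ("obj", "iobj") for _, r in dependents):
            continue  # possible object control: left undecided in this sketch
        for dep, r in dependents:
            if base(r) in ("nsubj", "csubj"):
                enhanced.add((embedded, dep, base(r) + ":xsubj"))
    return enhanced


# 1 She, 2 wants, 3 to, 4 leave
basic = {(2, 1, "nsubj"), (0, 2, "root"), (2, 4, "xcomp"), (4, 3, "mark")}
enhanced = add_xsubj(basic)
```

Here the single added edge is `(4, 1, "nsubj:xsubj")`, connecting the embedded verb to the matrix subject.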
Relative clauses
In basic trees, relative pronouns are attached to the main predicate of the relative clause (typically with an `nsubj`
or `obj` relation). In the corresponding enhanced graphs, the relative pronoun is attached to its antecedent with the
special `ref` relation and the antecedent is attached as a dependent of the node that is the parent of the relative
pronoun in the basic tree. Typically this parent is the main predicate of the relative clause, but it is not always so
(see examples below).
In the case where there is no explicit relative pronoun, there is no `ref` relation in the enhanced graph but the
antecedent is still annotated as a dependent of a node in the relative clause, depending on the role it plays in the
relative clause.
Note that such graphs contain a cycle.
Adverbial relativizers receive the same treatment.
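The rewiring for relative clauses with an explicit relative pronoun can be sketched as follows, using “the boy who lived” as an illustrative phrase. The sketch assumes the pronoun's basic relation is replaced (not kept) in the enhanced graph, and it does not cover the nominal-predicate case discussed further below, where the pronoun itself heads the relative clause. Function names and the edge representation are hypothetical.

```python
def enhance_relative(edges, antecedent, pronoun):
    """Re-attach a relative pronoun with `ref` and give its role to the antecedent.

    edges: set of (head, dependent, relation) triples from the basic tree.
    """
    enhanced = set()
    for head, dep, rel in edges:
        if dep == pronoun:
            if head == antecedent:
                # Pronoun heads the relative clause: needs separate handling.
                raise ValueError("nominal-predicate relative clause not supported")
            enhanced.add((head, antecedent, rel))  # antecedent takes the pronoun's role
        else:
            enhanced.add((head, dep, rel))
    enhanced.add((antecedent, pronoun, "ref"))
    return enhanced


def has_two_node_cycle(edges):
    """True if some pair of nodes is connected in both directions."""
    pairs = {(head, dep) for head, dep, _ in edges}
    return any((dep, head) in pairs for head, dep in pairs)


# 1 the, 2 boy, 3 who, 4 lived
basic = {(2, 1, "det"), (2, 4, "acl:relcl"), (4, 3, "nsubj")}
enhanced = enhance_relative(basic, antecedent=2, pronoun=3)
```

The resulting graph contains both `(2, 4, "acl:relcl")` and `(4, 2, "nsubj")`, which is the cycle mentioned above.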
The enhanced relations include deep syntactic relations. Therefore, in case marking languages the enhanced dependencies
may link verb dependents that are not in the expected morphological case, required by surface syntax. In the following
Czech example, the relative modifier phrase v němž “in which” is obligatorily in the locative case form
(`Case=Loc`). If it were a main clause, the referent dům “house” would have to be in locative too: v domě
“in house”. However, here it is in the nominative (`Case=Nom`), and the enhanced dependency `obl` going to a nominative
dependent is something we would not expect to see, given the morpho-syntactic rules of the language.
The relative element does not always depend directly on the predicate of the relative clause. It may be embedded deeper as in the following example.
If the relative clause has a nominal predicate, the relative pronoun may occupy the head position within the clause.
Unlike most relative clauses, here the parent of the relative pronoun in the basic tree is not inside the relative
clause, and its antecedent will not have an additional enhanced relation attaching it to a (non-existent) parent in
the relative clause. Instead, we add an `nsubj` relation from the antecedent to the `nsubj` of the relative clause
(and remove the corresponding `nsubj` relation between the relative pronoun and the subject). The `acl:relcl` should
remain the same as in basic dependencies.
Case Information
Adding prepositions (or case information) to the relation name of non-core dependents often makes it possible to disambiguate their
semantic role. We therefore augment certain relation labels with the case information of the modifier.
The augmented relations are `nmod`, `acl`, `obl` and `advcl`; if it makes sense in the language, some core relations may also be
augmented: `obj`, `iobj`, `ccomp`.
Case information may be represented by the lemma of an adposition attached via a `case` relation.
For clauses, the corresponding information may be represented by the lemma of a `mark` dependent instead.
Case information may also be represented by the value of the morphological feature Case.
In some languages, there is both the adposition and the morphological case, and their combination must be reflected in the enhanced relation.
In a similar manner, enhanced UD graphs also contain `conj` relations that are augmented with their coordinating conjunction.
This makes the type of coordination between two phrases more explicit, which is particularly useful in phrases with multiple
coordinating conjunctions.
The following formal rules apply (copied from the summary at the beginning of this page):
- Adposition or conjunction that occurs as a `case` or `mark` or `cc` dependent of the node whose relation to its parent is being enhanced. Note that this is the only part where non-ASCII letters are permitted within the enhanced relation label. The word should be normalized (lowercased, no typos), i.e., in general we take its lemma. However, if the case/mark dependent is a fixed multi-word expression, the lemma of the expression is not necessarily composed of lemmas of the individual member words. For instance, the string representing the English expression “As Opposed To” is `as_opposed_to`. That is, the casing is normalized from “As” to “as” etc., but “opposed” is not replaced by its lemma “oppose” because the expression is fixed. Similarly, grammaticalized deverbal connectives such as “regarding” may in some languages (if required by the language-specific guidelines) still be tagged VERB, despite being attached as case, and their lemma will thus be verbal (“regard”); nevertheless, the corresponding deprel extension should be the grammaticalized form, i.e., “regarding”. Language-specific guidelines may also specify that certain synonyms (e.g., “toward” and “towards”) be mapped on the same enhanced label, despite having different lemmas. We use the underscore character (“_”) to connect member words. The same approach can also be taken when a node has multiple case markers that are not annotated as a fixed expression, e.g., `out_of` for “out of business”.
  - Multiple `case` or `mark` nodes may occur even if it is not a fixed expression. For example, a type of adverbial clause in Dutch uses two markers om and te, the first one roughly corresponding to English “so that”, the second one being an infinitive marker. The incoming dependency of the subordinate clause will then be labeled `advcl:om_te`.
  - Case markers may be coordinated, as in they transport goods to and from Prague. Here there are two different relations between the verb and the nominal: `obl:to` and `obl:from`. Both will be added to the enhanced graph.
- Morphological case of the node whose relation to its parent is being enhanced. Value corresponds to the value of the Case feature but it is lowercased (e.g., `gen` instead of `Gen`). Unlike in morphological features, multivalues with comma (`Case=Acc,Dat`) are not allowed. Case information in enhanced relations must be fully disambiguated.
  - In certain languages and situations, the morphological case is combined with a lexical case marker (adposition). This is particularly useful if adpositions in the language select a subset of the morphological cases available and if the same adposition may have different meanings with different morphological cases.
  - It may happen that two adpositions are coordinated, each selects a different morphological case and the noun can satisfy only one of the case requirements. For instance, [cs] Lidé se rozutekli před a během útoku. “People ran away before and during the attack.” The first preposition requires instrumental, the second requires genitive, the noun is in genitive. However, the relations in the enhanced graph should be `obl:před:ins` and `obl:během:gen`. The first relation should indicate instrumental despite the fact that the surface form of the noun in the current sentence is not instrumental, and its morphological feature is `Case=Gen`. The relation `obl:před:gen` does not exist in the language and has no meaning. (Note however that instrumental is not the only option with this preposition; accusative is also possible, and `obl:před:acc` does not mean the same thing as `obl:před:ins`.)
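The formal rules above amount to a small string-building procedure. The sketch below is one possible reading of them, with a hypothetical helper `case_label`; it assumes the caller already supplies the normalized marker forms (lowercasing here is only a safety net, not lemmatization) and a single, disambiguated morphological case.

```python
def case_label(relation, markers=(), morph_case=None):
    """Build an enhanced label such as "obl:před:ins".

    relation:   basic relation, e.g. "obl" or "advcl".
    markers:    normalized case/mark/cc words, joined with underscores.
    morph_case: value of the Case feature (e.g. "Ins"), or None.
    """
    label = relation
    if markers:
        label += ":" + "_".join(word.lower() for word in markers)
    if morph_case is not None:
        if "," in morph_case:
            # Multivalues such as "Acc,Dat" must be disambiguated first.
            raise ValueError("ambiguous Case value: " + morph_case)
        label += ":" + morph_case.lower()
    return label


# Coordinated case markers yield one enhanced relation per marker.
coordinated = [case_label("obl", [marker]) for marker in ("to", "from")]
```

For the Czech example, the two relations would be built as `case_label("obl", ["před"], "Ins")` and `case_label("obl", ["během"], "Gen")`; note that the case passed for před is the one the preposition requires, not the one found on the noun.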
Additional enhancements
Some postprocessing steps, such as demoting light nouns that behave like quantificational determiners (as described, for example, in Schuster and Manning (2016)), can improve the usability of the dependency graphs for downstream applications. However, as most of these additions are highly language-specific, we do not provide any universal guidelines for such a representation. Anything beyond the above additions is not part of the UD standard and should not be added to the officially released treebanks.
DZ: Here are some additional thoughts on things that are not part of the officially approved guidelines but I think that they should be considered for addition in the future (based on experience with the treebanks that already contain some enhanced annotation).
- While individual enhancement types are optional, once a particular enhancement type is annotated somewhere in the corpus, the authors should annotate it everywhere in the corpus. This cannot be checked automatically for some enhancement types, but obviously the user will then assume that non-presence of the annotation in a sentence means that the phenomenon does not occur there.
- It would be useful if one could recognize from the enhanced relation type what type of enhancement it represents. (Some relations may be a result of two enhancement types combined.) The Stanford Enhancer does this at least for the controlled subjects (generating `nsubj:xsubj`, `nsubj:pass:xsubj`, `csubj:xsubj`, or `csubj:pass:xsubj` for the new enhanced relation).