
This page pertains to UD version 2.

Enhanced Dependencies

We always intended the Universal Dependencies representation to be used in shallow natural language understanding tasks such as relation extraction or biomedical event extraction. For such tasks, one is typically interested in the relation between certain entities, e.g., the relation between two persons or whether one protein interacts with another. UD is particularly well suited for such tasks as UD trees contain many direct dependencies between content words and many of the dependency labels provide a lot of information about the type of relation between two content words. However, for some constructions, the dependency path between two content words of interest can be very long in a UD tree, which complicates determining how the content words are related. Further, some dependency types such as obl or nmod are used for many different types of arguments and modifiers, and therefore they are not very informative on their own. For these reasons, we also provide guidelines for an enhanced representation, which makes some of the implicit relations between words more explicit, and augments some of the dependency labels to facilitate the disambiguation of types of arguments and modifiers.

Enhanced UD graphs may contain some or all of the enhancements described in the sections below. If a corpus does not annotate any of the enhancements defined in the guidelines, its DEPS column should always contain the underscore character. That is, the enhanced graph should not be just an exact copy of the basic tree for all sentences in the corpus; otherwise users get the impression that they can expect some enhancements when there are actually none.

Note that the enhanced graph is not necessarily a supergraph of the basic tree, i.e., the graph is not required to contain all the basic dependency relations. For this reason, all relations of the enhanced graph (including the ones that are also present in the basic UD tree) have to be included in the DEPS column of a CoNLL-U file. See the specification of the CoNLL-U file format for details.
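As an illustration of how the DEPS column can be read, here is a minimal Python sketch. It assumes the CoNLL-U convention that DEPS is a `|`-separated list of `head:relation` pairs, with `_` for an empty value. The function names `parse_deps` and `is_copy_of_basic` are hypothetical helpers, not part of any UD tool.

```python
def parse_deps(deps_field):
    """Parse a CoNLL-U DEPS value into a list of (head, relation) pairs.

    Heads are kept as strings because empty nodes use decimal IDs such as "5.1".
    """
    if deps_field == "_":
        return []
    pairs = []
    for item in deps_field.split("|"):
        # Split on the first colon only: the relation itself may contain colons.
        head, relation = item.split(":", 1)
        pairs.append((head, relation))
    return pairs


def is_copy_of_basic(head, deprel, deps_field):
    """True if DEPS holds exactly the basic HEAD/DEPREL pair and nothing else."""
    return parse_deps(deps_field) == [(head, deprel)]
```

A check like `is_copy_of_basic` could be run over a whole corpus to detect the situation discouraged above, where DEPS merely repeats the basic tree in every sentence.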

Furthermore, the dependency relation labels in the enhanced graph in DEPS may contain certain extensions that are not permitted in the basic relation type in the DEPREL column. The regular expression restricting relation labels in DEPREL is pretty simple; the label can contain only lowercase English letters and at most one colon, which separates the universal and the language-specific part of the label: ^[a-z]+(:[a-z]+)?$. In contrast, the relation label in DEPS may contain up to three colons, separating up to four sections. One of the sections (never the first one) may also contain lowercase Unicode letters and the underscore character: ^[a-z]+(:[a-z]+)?(:[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(_[\p{Ll}\p{Lm}\p{Lo}\p{M}]+)*)?(:[a-z]+)?$. Only the first section, the universal relation, is mandatory. The other sections are optional but if they appear, they must appear in the order described below. We provide a more detailed explanation of the extra sections later on this page; here is a summary:

  1. Universal dependency relation. In addition to the 37 relations defined in the basic representation, the relation can also be ref.
  2. Documented relation subtype (either language-specific or more general) from the basic representation.
  3. The string xsubj, denoting external subject relations of xcomp predicates. This extension is used only with nsubj, csubj, and their subtypes such as nsubj:pass. It does not combine with the other extensions described below because they do not apply to subjects.
  4. Case and similar information – adposition or conjunction that occurs as a case, mark or cc dependent of the node whose relation to its parent is being enhanced. Note that this is the only part where non-ASCII letters are permitted within the enhanced relation label. The word should be normalized (lowercased, no typos), i.e., in general we take its lemma. However, if the case/mark dependent is a fixed multi-word expression, the lemma of the expression is not necessarily composed of lemmas of the individual member words. For instance, the string representing the English expression “As Opposed To” is as_opposed_to. That is, the casing is normalized from “As” to “as” etc., but “opposed” is not replaced by its lemma “oppose” because the expression is fixed. Similarly, grammaticalized deverbal connectives such as “regarding” may in some languages (if required by the language-specific guidelines) still be tagged VERB, despite being attached as case, and their lemma will thus be verbal (“regard”); nevertheless, the corresponding deprel extension should be the grammaticalized form, i.e., “regarding”. Language-specific guidelines may also specify that certain synonyms (e.g., “toward” and “towards”) be mapped on the same enhanced label, despite having different lemmas. We use the underscore character (“_”) to connect member words. The same approach can also be taken when a node has multiple case markers that are not annotated as a fixed expression, e.g., out_of for “out of business”.
  5. Case information – morphological case of the node whose relation to its parent is being enhanced. Value corresponds to the value of the Case feature but it is lowercased (e.g., gen instead of Gen). Unlike in morphological features, multivalues with comma (Case=Acc,Dat) are not allowed. Case information in enhanced relations must be fully disambiguated.
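The two label patterns above can be checked with a short Python sketch. The standard `re` module does not support `\p{...}` classes, so this version first maps each character to a "shape" (`a` for ASCII lowercase, `u` for any other letter or mark in the categories Ll, Lm, Lo, M) and then matches a pattern of the same structure over that reduced alphabet. The names `is_valid_deprel` and `is_valid_enhanced_deprel` are hypothetical; this is one possible way to apply the expressions, not an official validator.

```python
import re
import unicodedata

DEPREL_PATTERN = re.compile(r"[a-z]+(:[a-z]+)?")
# Same structure as the DEPS expression, over the reduced alphabet {a, u, :, _}.
ENHANCED_SHAPE_PATTERN = re.compile(r"a+(:a+)?(:[au]+(_[au]+)*)?(:a+)?")


def _shape(label):
    """Map each character to 'a', 'u', ':', '_' or '?' (not permitted)."""
    out = []
    for ch in label:
        category = unicodedata.category(ch)
        if "a" <= ch <= "z":
            out.append("a")
        elif category in ("Ll", "Lm", "Lo") or category.startswith("M"):
            out.append("u")
        elif ch in ":_":
            out.append(ch)
        else:
            out.append("?")
    return "".join(out)


def is_valid_deprel(label):
    """Check a basic relation label (DEPREL column)."""
    return DEPREL_PATTERN.fullmatch(label) is not None


def is_valid_enhanced_deprel(label):
    """Check an enhanced relation label (the relation part of a DEPS entry)."""
    return ENHANCED_SHAPE_PATTERN.fullmatch(_shape(label)) is not None
```

For example, `nsubj:pass:xsubj` and `obl:as_opposed_to` are accepted as enhanced labels but rejected as basic ones. Note that the pattern only checks the form of a label; it does not verify that the sections carry the meanings listed above.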

Ellipsis

(See also the guidelines on ellipsis.)

In the enhanced representation, we add special empty (null) nodes in clauses in which a predicate is elided. (Although the node is termed ‘empty’ in the CoNLL-U format specification, and although it does not correspond to an overt surface token, its FORM, LEMMA, UPOS, XPOS and FEATS may be optionally filled with the assumed values; here they can be copied from the overt occurrence of the predicate.)

Note that this is a case in which the enhanced UD graph is not a supergraph of the basic tree as the basic tree contains orphan relations, which are not present in the enhanced UD graph.
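To make this concrete, the following Python sketch encodes one possible annotation of the sentence "Sue likes coffee and Bill tea" with an empty node 5.1 for the elided predicate, and then compares the basic and enhanced edge sets. The annotation is illustrative, the columns are separated by spaces here only for readability (real CoNLL-U files use tabs), and the variable names are hypothetical.

```python
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SENTENCE = """
1   Sue    Sue    PROPN _ _ 2 nsubj  2:nsubj   _
2   likes  like   VERB  _ _ 0 root   0:root    _
3   coffee coffee NOUN  _ _ 2 obj    2:obj     _
4   and    and    CCONJ _ _ 5 cc     5.1:cc    _
5   Bill   Bill   PROPN _ _ 2 conj   5.1:nsubj _
5.1 likes  like   VERB  _ _ _ _      2:conj    _
6   tea    tea    NOUN  _ _ 5 orphan 5.1:obj   _
"""

basic, enhanced = set(), set()
for line in SENTENCE.strip().splitlines():
    cols = line.split()
    node_id, head, deprel, deps = cols[0], cols[6], cols[7], cols[8]
    if head != "_":  # the empty node has no basic HEAD/DEPREL
        basic.add((head, deprel, node_id))
    for item in deps.split("|"):
        e_head, e_rel = item.split(":", 1)
        enhanced.add((e_head, e_rel, node_id))

# Basic edges that have no counterpart in the enhanced graph.
only_in_basic = sorted(basic - enhanced)
```

In this sketch `only_in_basic` contains the orphan relation together with the conj and cc relations that were re-attached to the empty node, which is why the enhanced graph is not a supergraph of the basic tree here.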

Propagation of incoming dependencies to conjuncts

In the basic representation, the governor and dependents of a conjoined phrase are all attached to the first conjunct. This often leads to very long dependency paths between content words. The enhanced representation therefore also contains dependencies between the other conjuncts and the governor and dependents of the phrase.

Conjoined subjects and objects

When the subject is a conjoined noun phrase, each of the conjuncts is attached to the predicate.

The same is true for conjoined objects.

This leads to slightly strange dependencies in the case of collective subjects or objects:

However, as the distinction between distributive and collective readings is often context-dependent, we take the simplest approach and always attach all conjuncts to the predicate.

When the subject is attached to a control or raising predicate, there is a dependency between the matrix verb and each conjunct and between the embedded verb and each conjunct.

Conjoined modifiers

Each conjunct in a conjoined modifier phrase gets attached to the governor of the modifier phrase. For example, the following phrase contains a conjoined adjectival phrase that modifies a noun. In the enhanced representation, there is an additional amod relation between the noun river and the second conjunct wide.
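The propagation of incoming dependencies can be sketched in Python as follows. Edges are `(head, relation, dependent)` triples over token indices, and `propagate_incoming` is a hypothetical helper. As simplifying assumptions, the sketch does not copy `conj` or `root` relations and does not attempt to handle nested coordination.

```python
def propagate_incoming(edges):
    """Copy the incoming relation of a first conjunct to its other conjuncts.

    edges: set of (head, relation, dependent) triples from the basic tree.
    Returns a new set that also contains the propagated relations.
    """
    enhanced = set(edges)
    for first, rel, other in edges:
        if rel != "conj":
            continue
        # 'first' is the first conjunct, 'other' a later conjunct attached to it.
        for governor, incoming_rel, dependent in edges:
            if dependent == first and incoming_rel not in ("conj", "root"):
                enhanced.add((governor, incoming_rel, other))
    return enhanced


# "the long and wide river": 1 the, 2 long, 3 and, 4 wide, 5 river
basic = {
    (5, "det", 1),
    (5, "amod", 2),
    (4, "cc", 3),
    (2, "conj", 4),
    (0, "root", 5),
}
enhanced = propagate_incoming(basic)
added = enhanced - basic
```

For this phrase the only added relation is the amod relation between river (5) and wide (4). The same function covers conjoined subjects and objects, since each conjunct simply inherits the relation that the first conjunct has to the predicate.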

Propagation of outgoing dependencies from conjuncts

As described above, the basic representation attaches the dependents of a conjoined phrase only to the first conjunct, which often leads to very long dependency paths between content words. The enhanced representation therefore also attaches the shared dependents to the other conjuncts.

Conjoined verbs and verb phrases

When two verbs share their objects (or other complements), the subject and the object of the conjoined verbs are attached to every conjunct.

However, if the complements of the second verb are not shared, only the shared dependents are attached to every conjunct.

Similarly, the enhanced representation can also distinguish private dependents of the first verb. Note, however, that this distinction cannot be inferred automatically from the basic representation.

Controlled/raised subjects

The basic trees lack a subject dependency between a controlled verb and its controller or between an embedded verb and its raised subject. In the enhanced graph, there is an additional dependency between the embedded verb and the subject of the matrix clause. This dependency can be recognized by the extension (subtype) :xsubj.
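A minimal Python sketch of this enhancement is given below. It assumes subject control or raising only, i.e., that the controller is always the subject of the matrix predicate, and it drops any subtype of the matrix subject relation for simplicity. The helper `add_xsubj` is hypothetical, and edges are `(head, relation, dependent)` triples.

```python
def add_xsubj(edges):
    """Add an :xsubj relation from each xcomp predicate to the matrix subject.

    Simplifying assumption: the controller is the matrix subject.
    """
    enhanced = set(edges)
    for matrix, rel, embedded in edges:
        if rel.split(":")[0] != "xcomp":
            continue
        for head, subj_rel, subject in edges:
            base = subj_rel.split(":")[0]
            if head == matrix and base in ("nsubj", "csubj"):
                enhanced.add((embedded, base + ":xsubj", subject))
    return enhanced


# "Mary wants to buy a book": 1 Mary, 2 wants, 3 to, 4 buy, 5 a, 6 book
basic = {
    (2, "nsubj", 1),
    (0, "root", 2),
    (4, "mark", 3),
    (2, "xcomp", 4),
    (6, "det", 5),
    (4, "obj", 6),
}
added = add_xsubj(basic) - basic
```

Here the single added relation links buy (4) to Mary (1) with the label nsubj:xsubj. Constructions in which another argument of the matrix predicate is the controller would need additional logic that this sketch does not attempt.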


Relative clauses

In basic trees, relative pronouns are attached to the main predicate of the relative clause (typically with an nsubj or obj relation). In the corresponding enhanced graphs, the relative pronoun is attached to its antecedent with the special ref relation, and the antecedent is attached as a dependent of the node that is the parent of the relative pronoun in the basic tree. Typically this parent is the main predicate of the relative clause, but it is not always so (see examples below).
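The rewiring described in this paragraph can be sketched in Python as follows. The helper `enhance_relative` is hypothetical and takes the indices of the relative pronoun and its antecedent as given (in real data they would have to be identified first). It assumes that the pronoun's basic relation is replaced rather than kept, and it does not cover relative pronouns that head the relative clause.

```python
def enhance_relative(edges, pronoun, antecedent):
    """Re-route the relative pronoun's relation to its antecedent and add ref.

    edges: set of (head, relation, dependent) triples from the basic tree.
    """
    enhanced = set()
    for head, rel, dep in edges:
        if dep == pronoun:
            # The antecedent takes over the pronoun's role in the clause.
            enhanced.add((head, rel, antecedent))
        else:
            enhanced.add((head, rel, dep))
    # The pronoun itself is attached to the antecedent.
    enhanced.add((antecedent, "ref", pronoun))
    return enhanced


# "the boy who lived": 1 the, 2 boy, 3 who, 4 lived
basic = {
    (2, "det", 1),
    (0, "root", 2),
    (4, "nsubj", 3),
    (2, "acl:relcl", 4),
}
enhanced = enhance_relative(basic, pronoun=3, antecedent=2)
```

In the result, boy (2) is both the head of lived (4) via acl:relcl and its dependent via nsubj, so the enhanced graph is no longer a tree.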

In the case where there is no explicit relative pronoun, there is no ref relation in the enhanced graph but the antecedent is still annotated as a dependent of a node in the relative clause, depending on the role it plays in the relative clause.

Note that such graphs contain a cycle.

Adverbial relativizers receive the same treatment.

The enhanced relations include deep syntactic relations. Therefore, in case-marking languages the enhanced dependencies may link verb dependents that are not in the morphological case that surface syntax would require. In the following Czech example, the relative modifier phrase v němž “in which” is obligatorily in the locative case form (Case=Loc). If it were a main clause, the referent dům “house” would have to be in the locative too: v domě “in house”. However, here it is in the nominative (Case=Nom), and the enhanced dependency obl going to a nominative dependent is something we would not expect to see, given the morpho-syntactic rules of the language.

The relative element does not always depend directly on the predicate of the relative clause. It may be embedded deeper as in the following example.

If the relative clause has a nominal predicate, the relative pronoun may occupy the head position within the clause. Unlike most relative clauses, here the parent of the relative pronoun in the basic tree is not inside the relative clause, and its antecedent will not have an additional enhanced relation attaching it to a (non-existent) parent in the relative clause. Instead, we add an nsubj relation from the antecedent to the nsubj of the relative clause (and remove the corresponding nsubj relation between the relative pronoun and the subject). The acl:relcl relation should remain the same as in the basic dependencies.

Case Information

Adding prepositions (or case information) to the relation name of non-core dependents often makes it possible to disambiguate its semantic role. We therefore augment certain relation labels with the case information of the modifier. The augmented relations are nmod, acl, obl and advcl; if it makes sense in the language, some core relations may also be augmented: obj, iobj, ccomp. Case information may be represented by the lemma of an adposition attached via a case relation. For clauses, the corresponding information may be represented by the lemma of a mark dependent instead. Case information may also be represented by the value of the morphological feature Case. In some languages, there is both the adposition and the morphological case, and their combination must be reflected in the enhanced relation.

In a similar manner, enhanced UD graphs also contain conj relations that are augmented with their coordinating conjunction. This makes the type of coordination between two phrases more explicit, which is particularly useful in phrases with multiple coordinating conjunctions.
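The label augmentation can be sketched in Python as follows. Tokens are plain dictionaries with hypothetical keys (`id`, `lemma`, `head`, `deprel`, and an optional `case` holding the value of the Case feature), and `enhanced_label` is a hypothetical helper. The sketch only looks at direct case, mark and cc dependents; it does not handle fixed multi-word expressions, conjuncts without their own cc dependent, or ambiguous Case values, which it simply leaves out of the label.

```python
AUGMENTED = {"nmod", "obl", "acl", "advcl"}


def enhanced_label(token, tokens):
    """Return the relation label of `token`, augmented with case information."""
    base = token["deprel"].split(":")[0]
    if base == "conj":
        wanted = {"cc"}
    elif base in AUGMENTED:
        wanted = {"case", "mark"}
    else:
        return token["deprel"]
    # Lemmas of the relevant function-word dependents, in word order.
    markers = [
        t["lemma"].lower()
        for t in sorted(tokens, key=lambda t: t["id"])
        if t["head"] == token["id"] and t["deprel"].split(":")[0] in wanted
    ]
    label = token["deprel"]
    if markers:
        label += ":" + "_".join(markers)
    case = token.get("case")
    # Only a single, disambiguated Case value may be appended.
    if base != "conj" and case and "," not in case:
        label += ":" + case.lower()
    return label


# "went out of business"
english = [
    {"id": 1, "lemma": "go", "head": 0, "deprel": "root"},
    {"id": 2, "lemma": "out", "head": 4, "deprel": "case"},
    {"id": 3, "lemma": "of", "head": 4, "deprel": "case"},
    {"id": 4, "lemma": "business", "head": 1, "deprel": "obl"},
]
# Czech "bydlí v domě" (lives in a house)
czech = [
    {"id": 1, "lemma": "bydlet", "head": 0, "deprel": "root"},
    {"id": 2, "lemma": "v", "head": 3, "deprel": "case"},
    {"id": 3, "lemma": "dům", "head": 1, "deprel": "obl", "case": "Loc"},
]
english_label = enhanced_label(english[3], english)
czech_label = enhanced_label(czech[2], czech)
```

With these inputs the English label combines both case markers with an underscore, and the Czech label carries both the adposition and the morphological case.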

The formal rules that apply to these extensions are those given in items 4 and 5 of the summary at the beginning of this page.

Additional enhancements

Some postprocessing steps, such as demoting light nouns that behave like quantificational determiners (described, for example, in Schuster and Manning (2016)), can improve the usability of the dependency graphs for downstream applications. However, most of these additions are highly language-specific, so we do not provide any universal guidelines for such a representation. Anything beyond the enhancements described above is not part of the UD standard and should not be added to the officially released treebanks.


DZ: Here are some additional thoughts on things that are not part of the officially approved guidelines but I think that they should be considered for addition in the future (based on experience with the treebanks that already contain some enhanced annotation).
