Syntax: General Principles
Syntactic annotation in the UD scheme consists of typed dependency relations between words. Each word is either the dependent of one other word in the sentence or of a notional ROOT of the sentence. This means that the dependencies can be thought of as a directed acyclic graph which is a tree (i.e., which has a single root). The goal of the typed dependency relations is a set of broadly observed “universal dependencies” that work across languages. Such dependencies seek to maximize parallelism by allowing the same grammatical relation to be annotated the same way across languages, while making enough crucial distinctions such that different things can be differentiated. The goal of parallelism has limits: The standard does not postulate and annotate “empty” things that do not appear in various languages, and it allows the use of language-specific refinements of universal dependencies to represent particular relations of language-particular importance. We now try to lay down some general principles that should guide the use of universal dependencies to achieve as much parallelism as possible (but not more) across languages. (More specific guidelines can be found in the documentation of the specific dependency relations.)
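The tree condition described above can be checked mechanically. The following is a minimal Python sketch, not part of the UD specification. It assumes a sentence is given as a list of 1-based head indices, with 0 standing for the notional ROOT, and it further assumes that exactly one word attaches to ROOT. The function name `is_rooted_tree` is hypothetical.

```python
def is_rooted_tree(heads):
    """Return True if the head list describes a single-rooted dependency tree.

    heads[i] is the 1-based index of the head of word i + 1;
    0 denotes the notional ROOT (an assumption of this sketch).
    """
    n = len(heads)
    # Every head must be ROOT (0) or a word of the sentence.
    if any(h < 0 or h > n for h in heads):
        return False
    # Assumption: exactly one word depends directly on ROOT.
    if sum(1 for h in heads if h == 0) != 1:
        return False
    # Following head links from any word must reach ROOT
    # without visiting the same word twice (no cycles).
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


# "He came home": "came" is the root, the other two words depend on it.
print(is_rooted_tree([2, 0, 2]))
```

Because each word stores exactly one head, the single-head requirement holds by construction; only the root count and the absence of cycles need explicit checks.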
The principles primarily apply to the basic version of the universal dependencies, where dependencies are assumed to form a rooted tree representing the backbone of the syntactic structure. In addition to the basic dependency structure, certain syntactic constructions may introduce additional dependencies (examples include dependencies that propagate over coordination structures and secondary predication). These dependencies can be represented in the enhanced version of the universal dependencies, where they are encoded in the DEPS field of the CoNLL-U format. The total set of dependencies in the enhanced representation will commonly no longer be a rooted tree, but a rooted directed graph. In particular, the result need not be a directed acyclic graph (DAG). While the graph is mostly tree-like, the enhanced representation of relative clauses introduces small cycles:
The dependency relations added in the enhanced representation are taken from the same inventory as the basic dependencies, but may add additional language-particular subtyping. Detailed guidelines for the enhanced representation still have to be developed. In the meantime, the documentation of the basic dependencies sometimes refers to additional dependencies that we expect to be present in the enhanced representation.
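To make the point about cycles concrete, here is a hedged Python sketch that reads DEPS values and tests the resulting graph for a directed cycle. It assumes the CoNLL-U convention that DEPS holds `head:relation` pairs separated by `|`, and it assumes integer head indices. The annotation of "the boy who lived" below is an illustration, not an excerpt from a treebank, and the helper names `parse_deps` and `has_cycle` are hypothetical.

```python
def parse_deps(deps_field):
    """Parse a DEPS value such as '0:root|4:nsubj' into (head, relation) pairs."""
    if deps_field == "_":
        return []
    pairs = []
    for item in deps_field.split("|"):
        # Split on the first colon only: subtyped relations contain ':' too.
        head, relation = item.split(":", 1)
        pairs.append((int(head), relation))
    return pairs


def has_cycle(deps_by_word):
    """Return True if the head-to-dependent graph contains a directed cycle."""
    children = {}
    for dependent, field in deps_by_word.items():
        for head, _relation in parse_deps(field):
            children.setdefault(head, []).append(dependent)

    in_progress, done = set(), set()

    def visit(node):
        in_progress.add(node)
        for child in children.get(node, []):
            if child in in_progress:
                return True  # back edge: we returned to a node on the current path
            if child not in done and visit(child):
                return True
        in_progress.discard(node)
        done.add(node)
        return False

    return any(node not in done and visit(node) for node in list(children))


# Illustrative enhanced annotation: 1 the, 2 boy, 3 who, 4 lived.
enhanced = {1: "2:det", 2: "0:root|4:nsubj", 3: "2:ref", 4: "2:acl:relcl"}
# Illustrative basic tree for the same words.
basic = {1: "2:det", 2: "0:root", 3: "4:nsubj", 4: "2:acl:relcl"}

print(has_cycle(enhanced), has_cycle(basic))
```

In the enhanced version, "boy" governs "lived" through the relative clause relation while "lived" governs "boy" as its subject, which is the kind of small cycle the text refers to.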
The Primacy of Content Words
Dependency relations hold primarily between content words, rather than being indirect relations mediated by function words.
Function words attach as direct dependents of the most closely related content word.
Punctuation marks attach to the head of the clause or phrase to which they belong.
Putting this together gives a complete dependency tree where internal nodes are content words and where function words and punctuation appear as leaves.
Preferring content words as heads maximizes parallelism between languages because content words vary less than function words between languages. In particular, one commonly finds the same grammatical relation being expressed by morphology in some languages or constructions and by function words in other languages or constructions, while some languages may not mark the information at all (such as not marking tense or definiteness).
The Status of Function Words
The primacy of content words implies that function words normally do not have dependents of their own. In particular, it means that multiple function words related to the same content word always appear as siblings, never in a nested structure, regardless of their interpretation. A typical case is that of auxiliary verbs, which never depend on each other.
Note that copula verbs are also counted as auxiliaries in this respect. In copula constructions, auxiliaries will therefore often be attached to predicates that are not verbs.
Similarly, multiple determiners are always attached to the head noun.
However, there are four important exceptions to the rule that function words do not take dependents:
- Multiword function words
- Coordinated function words
- Function word modifiers
- Promotion by head elision
Multiword Function Words
The word forms that make up a fixed multiword expression are connected into a head-initial structure using the special dependency relation u-dep/mwe (see below). When the multiword expression is a functional element, the initial word form will then superficially look like a function word with dependents.
Whether an expression should be treated as a mwe has to be decided for each language, and in some cases this will require somewhat arbitrary conventions, because it involves choosing a cut point along a path of grammaticalization. Nevertheless, most languages have some very common multiword expressions that effectively behave like other function words as linkers, marks, or case particles, and it would be highly undesirable not to recognize them as multiword function words. Examples in English include in spite of (like despite), as well as (like and), and prior to (like before).
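The head-initial structure for fixed expressions can be sketched in a few lines of Python. This is an illustration only: the function name `head_initial_edges` and the `(relation, head, dependent)` tuple layout are assumptions made for this sketch, and word positions are 1-based.

```python
def head_initial_edges(word_ids, relation="mwe"):
    """Attach every non-first word of a fixed expression to its first word."""
    head = word_ids[0]
    return [(relation, head, dependent) for dependent in word_ids[1:]]


# "in spite of" occupying positions 1, 2 and 3 of a sentence.
print(head_initial_edges([1, 2, 3]))
# prints [('mwe', 1, 2), ('mwe', 1, 3)]
```

Both "spite" and "of" depend directly on "in", so the expression stays flat rather than forming a chain.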
Coordinated Function Words
Head coordination is a syntactic process that can apply to almost any word category, including function words like conjunctions and prepositions.
Function Word Modifiers
Certain types of function words can take a restricted class of modifiers, mainly negation (u-dep/neg) and light adverbials (u-dep/advmod or u-dep/nmod). Typical cases are modified determiners like not every (linguist) and exactly two (papers) and modifiers of subordinating conjunctions.
Negation can modify any function word, but other types of modifiers are disallowed for function words that express properties of the head word often expressed morphologically in other languages. This class, which we refer to as pure function words, includes auxiliary verbs, case markers (adpositions), and articles, but needs to be defined explicitly for each language. When pure function words appear with modifiers other than negation, we take the modifier to apply to the entire phrase and therefore attach it to the head word of the function word, as illustrated in the following example.
The analysis here is that right modifies the entire phrase before midnight and therefore attaches to midnight, which is the head of this phrase. (It is a general property of dependency trees that phrase modification is structurally indistinguishable from head modification.) Further support for this analysis comes from the possibility of replacing before midnight by the adverb then.
Making sure that pure function words do not have dependents of their own facilitates the comparison with languages where the corresponding properties are expressed morphologically as well as the conversion to the enhanced representation where this difference is neutralized.
To sum up, our treatment of function word modifiers can be expressed in three principles:
- Pure function words can only be modified by negation (neg).
- Other function words can also take light adverbial modifiers (advmod, nmod).
- When in doubt, prefer a flat structure where function words attach to a content word.
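The first two principles lend themselves to a simple consistency check. The Python sketch below flags light modifiers that are attached to a pure function word. It is a hedged illustration: the token dictionaries, the function name `misattached_modifiers`, and the choice of which word counts as a pure function word are assumptions, since that class has to be defined per language. Exceptions such as promotion by head elision are not handled.

```python
LIGHT_MODIFIERS = {"advmod", "nmod"}


def misattached_modifiers(tokens, pure_ids):
    """Return ids of light modifiers whose head is a pure function word."""
    pure = set(pure_ids)
    return [
        token["id"]
        for token in tokens
        if token["deprel"] in LIGHT_MODIFIERS and token["head"] in pure
    ]


# "right before midnight" with the flat analysis: "right" attaches to "midnight".
flat = [
    {"id": 1, "form": "right", "head": 3, "deprel": "advmod"},
    {"id": 2, "form": "before", "head": 3, "deprel": "case"},
    {"id": 3, "form": "midnight", "head": 0, "deprel": "root"},
]
# The same phrase with "right" wrongly attached to the preposition.
nested = [dict(token) for token in flat]
nested[0]["head"] = 2

print(misattached_modifiers(flat, pure_ids={2}))    # prints []
print(misattached_modifiers(nested, pure_ids={2}))  # prints [1]
```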
Note also that the language-specific documentation should specify what words (if any) are treated as pure function words in that language.
Promotion by Head Elision
When the natural head of a function word is elided, the function word will be “promoted” to the function normally assumed by the content word head. This type of analysis should in general be preferred over an analysis using the u-dep/remnant relation, because it disrupts the structure less. The remnant analysis should be used only when there is no function word that can be promoted. The following examples illustrate promotion of auxiliaries, prepositions and subordinating conjunctions.
Key ideas of the relation taxonomy
Core arguments vs. oblique modifiers
The UD taxonomy is centered around the fairly clear distinction between core arguments (subjects, objects, clausal complements) and other dependents. It does not distinguish adjuncts from oblique arguments. This latter distinction is taken to be sufficiently subtle, unclear, and argued over that it is eliminated (echoing the viewpoint of the original Penn Treebank annotators).
A mixed functional-structural system
One major role of dependencies is to represent function, but the Universal Dependencies also encode structural notions. On the structural side, languages are taken to principally involve three things:
- Nominal phrases (which are the usual means of entity expression, but may also be used for other things)
- Clauses headed by a predicate (most commonly a verb, but it may be other things, such as an adjective or adverb, or even a predicate nominal, such as He is a wreck)
- Miscellaneous other kinds of modifier words, which may themselves allow some modification, but do not expand into the same rich structures as nominal phrases and predicates.
This three-way distinction is generally encoded in dependency names. For example, if a verb is taking an adverbial modifier, it may bear one of three relations u-dep/nmod, u-dep/advcl, or u-dep/advmod depending on which of these three sorts it is:
Similarly, the core grammatical relations differentiate core arguments that are clauses (e.g., u-dep/csubj, u-dep/ccomp) from those that are nominal phrases (e.g., u-dep/nsubj, u-dep/dobj).
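The way the structural category of a dependent selects the relation name can be summarized as a lookup table. The Python sketch below covers only the relations named in this section. The category labels (`"nominal"`, `"clause"`, `"modifier"`), the function labels (`"subject"`, `"complement"`), and the helper names are hypothetical conveniences, not UD terminology.

```python
# Adverbial modifiers of a verb, keyed by the structural type of the dependent.
ADVERBIAL_RELATION = {
    "nominal": "nmod",
    "clause": "advcl",
    "modifier": "advmod",
}

# Core arguments, keyed by (grammatical function, structural type).
CORE_RELATION = {
    ("subject", "nominal"): "nsubj",
    ("subject", "clause"): "csubj",
    ("complement", "nominal"): "dobj",
    ("complement", "clause"): "ccomp",
}


def adverbial_relation(structural_type):
    """Pick the relation for an adverbial dependent of the given type."""
    return ADVERBIAL_RELATION[structural_type]


def core_relation(function, structural_type):
    """Pick the relation for a core argument of the given function and type."""
    return CORE_RELATION[(function, structural_type)]


print(adverbial_relation("clause"))          # prints advcl
print(core_relation("subject", "nominal"))   # prints nsubj
```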
Voice
Relation names attempt to differentiate canonical voice (where the proto-agent argument is the subject) from non-canonical voice constructions (where another argument is the subject). This is marked as appropriate on both the subject argument (e.g., nsubjpass) and auxiliaries indicating this (auxpass). Marking both is helpful, as either may be missing.
Clausal dependents
To classify clausal dependents, the UD taxonomy obeys the following principles:
- differentiate core arguments from noncore arguments and adjuncts (see “Core arguments vs. oblique modifiers” above)
- differentiate subjects from complements
- differentiate subjects of passives from other subjects (see “Voice” above)
- differentiate clauses with obligatory control from clauses with other types of subject licensing
- differentiate attachment to predicates from attachment to entities
- be able to capture clausal modifiers of nouns that do not take the form of a relative clause
Note that the UD taxonomy does not attempt to differentiate finite from nonfinite clauses.
Coordination
We treat coordinate structures asymmetrically: The head of the relation is the first conjunct and all the other conjuncts depend on it via the u-dep/conj relation. Coordinating conjunctions and punctuation delimiting the conjuncts are attached using the u-dep/cc and u-dep/punct relations respectively.
He came home , took a shower and immediately went to bed .
conj(came, took)
conj(came, went)
punct(came, ,-4)
cc(came, and)
See u-dep/conj for more discussion of related issues (shared dependents, nested coordination).
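The asymmetric treatment can be expressed as a small Python sketch that generates the relations for one coordinate structure. The function name `coordination_edges` and the `(relation, head, dependent)` tuple layout are assumptions made for illustration. Positions are 1-based word indices in the example sentence above, and shared dependents and nested coordination are not handled.

```python
def coordination_edges(conjuncts, conjunctions=(), punctuation=()):
    """Attach later conjuncts, conjunctions and delimiters to the first conjunct."""
    head = conjuncts[0]
    edges = [("conj", head, conjunct) for conjunct in conjuncts[1:]]
    edges += [("cc", head, conjunction) for conjunction in conjunctions]
    edges += [("punct", head, mark) for mark in punctuation]
    return edges


# He(1) came(2) home(3) ,(4) took(5) a(6) shower(7) and(8)
# immediately(9) went(10) to(11) bed(12) .(13)
for edge in coordination_edges(conjuncts=[2, 5, 10], conjunctions=[8], punctuation=[4]):
    print(edge)
```

The four tuples correspond to conj(came, took), conj(came, went), cc(came, and) and the punct relation for the comma at position 4.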
Special Relations
Some of the universal relations do not really encode syntactic dependency relations but are used to represent punctuation, various kinds of multiword units, or unanalyzable segments. The use of these relations is subject to special restrictions explained below.
Punctuation
Tokens with the relation u-dep/punct always attach to content words (except in cases of ellipsis) and can never have dependents.
Since punct is not a normal dependency relation, the usual criteria for determining the head word do not apply. Instead, we use the following principles:
- A punctuation mark separating coordinated units is attached to the first conjunct.
- A punctuation mark preceding or following a subordinated unit is attached to this unit.
- Within the relevant unit, a punctuation mark is attached at the highest possible node that preserves projectivity.
- Paired punctuation marks (quotes and brackets) should be attached to the same word unless that would create non-projectivity. This word is usually the head of the phrase enclosed in the paired punctuation.
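Two of the principles above depend on projectivity, which this section does not define. The Python sketch below uses a common definition as an assumption: a tree is projective if no two arcs cross when drawn above the sentence, with the notional ROOT placed at position 0. The function name `is_projective` and the head-list input (1-based heads, 0 for ROOT) are hypothetical.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the 1-based head of word i + 1; 0 denotes ROOT, which is
    treated as a position to the left of the first word.
    """
    # Represent each arc by its left and right endpoints.
    arcs = [
        (min(head, dependent), max(head, dependent))
        for dependent, head in enumerate(heads, start=1)
    ]
    for left_a, right_a in arcs:
        for left_b, right_b in arcs:
            # Arc b starts strictly inside arc a and ends strictly outside it.
            if left_a < left_b < right_a < right_b:
                return False
    return True


print(is_projective([2, 0, 2]))     # prints True
print(is_projective([3, 4, 0, 3]))  # prints False
```

A punctuation attachment can be tested by adding the candidate head for the punctuation token to the head list and rejecting it if the check fails. The quadratic loop is adequate for sentence-length inputs.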
Multiword Structures
The following types of expressions are annotated in a head-initial structure, where all non-first elements depend on the first, and where only the first element can have dependents:
- Fixed multiword expressions (u-dep/mwe)
- Multiword names (u-dep/name)
- Foreign phrases (u-dep/foreign)
In contrast, compounds are annotated to show their modification structure, including a regular concept of head: