Syntax: General Principles
Syntactic annotation in the UD scheme consists of typed dependency relations between words. The basic dependency representation forms a tree, where exactly one word is the head of the sentence, dependent on a notional ROOT, and every other word is dependent on another word in the sentence, as exemplified below (where we explicitly represent the root dependency, which will otherwise be left implicit).
ROOT she wanted to buy and eat an apple
nsubj(wanted, she)
root(ROOT, wanted)
mark(buy, to)
xcomp(wanted, buy)
cc(eat, and)
conj(buy, eat)
det(apple, an)
obj(buy, apple)
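The tree constraint can be checked mechanically. The following Python sketch encodes the relations above as (relation, head, dependent) triples and tests that every word has exactly one head, that exactly one word depends on ROOT, and that every word reaches ROOT without a cycle. The function name `is_basic_tree` and the triple encoding are illustrative assumptions, not part of the UD specification, and words are identified by their form, which only works because no form is repeated in this sentence.

```python
# Relations from the example above, as (relation, head, dependent) triples.
BASIC_RELATIONS = [
    ("nsubj", "wanted", "she"),
    ("root", "ROOT", "wanted"),
    ("mark", "buy", "to"),
    ("xcomp", "wanted", "buy"),
    ("cc", "eat", "and"),
    ("conj", "buy", "eat"),
    ("det", "apple", "an"),
    ("obj", "buy", "apple"),
]
WORDS = ["she", "wanted", "to", "buy", "and", "eat", "an", "apple"]


def is_basic_tree(words, relations):
    """Return True if the relations form a tree rooted in the notional ROOT."""
    head_of = {}
    for _rel, head, dep in relations:
        if dep in head_of:  # a word may have only one head
            return False
        head_of[dep] = head
    # Every word needs a head, and exactly one word depends on ROOT.
    if set(head_of) != set(words):
        return False
    if list(head_of.values()).count("ROOT") != 1:
        return False
    # Following heads upward from any word must reach ROOT without a cycle.
    for word in words:
        seen = set()
        node = word
        while node != "ROOT":
            if node in seen or node not in head_of:
                return False
            seen.add(node)
            node = head_of[node]
    return True
```

Calling `is_basic_tree(WORDS, BASIC_RELATIONS)` returns `True` for the example; adding a second head for any word makes it return `False`.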
In addition to the basic dependency representation, which is obligatory for all UD treebanks, it is possible to give an enhanced dependency representation, which adds (and in a few cases changes) relations in order to give a more complete basis for semantic interpretation. The enhanced representation is in general not a tree but a general graph structure, as shown below (enhanced dependencies in blue).
# visual-style 5 2 nsubj color:blue
# visual-style 7 2 nsubj color:blue
# visual-style 3 7 xcomp color:blue
# visual-style 7 4 mark color:blue
# visual-style 7 9 obj color:blue
1 ROOT _ _ _ _ 0 root _ _
2 she _ _ _ _ 3 nsubj 5:nsubj|7:nsubj _
3 wanted _ _ _ _ 1 root _ _
4 to _ _ _ _ 5 mark 7:mark _
5 buy _ _ _ _ 3 xcomp _ _
6 and _ _ _ _ 7 cc _ _
7 eat _ _ _ _ 5 conj 3:xcomp _
8 an _ _ _ _ 9 det _ _
9 apple _ _ _ _ 5 obj 7:obj _
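To see why the enhanced representation is a graph rather than a tree, one can collect all incoming edges per token. The sketch below parses the rows shown above, assuming (as in this example) that column 7 holds the basic head, column 8 the basic relation, and column 9 only the additional enhanced edges as `head:relation` pairs separated by `|`. The names `ENHANCED_ROWS` and `incoming_edges` are hypothetical.

```python
ENHANCED_ROWS = """\
1 ROOT _ _ _ _ 0 root _ _
2 she _ _ _ _ 3 nsubj 5:nsubj|7:nsubj _
3 wanted _ _ _ _ 1 root _ _
4 to _ _ _ _ 5 mark 7:mark _
5 buy _ _ _ _ 3 xcomp _ _
6 and _ _ _ _ 7 cc _ _
7 eat _ _ _ _ 5 conj 3:xcomp _
8 an _ _ _ _ 9 det _ _
9 apple _ _ _ _ 5 obj 7:obj _
""".splitlines()


def incoming_edges(rows):
    """Map each token id to its (head, relation) edges, basic edge first."""
    edges = {}
    for row in rows:
        cols = row.split()
        token_id, head, deprel, extra = int(cols[0]), int(cols[6]), cols[7], cols[8]
        edges[token_id] = [(head, deprel)]
        if extra != "_":
            # Each additional edge is written as head:relation.
            for item in extra.split("|"):
                extra_head, extra_rel = item.split(":", 1)
                edges[token_id].append((int(extra_head), extra_rel))
    return edges


EDGES = incoming_edges(ENHANCED_ROWS)
# Tokens with more than one incoming edge are what break the tree property.
MULTI_HEADED = [tid for tid, incoming in EDGES.items() if len(incoming) > 1]
```

Here `MULTI_HEADED` is `[2, 4, 7, 9]`, that is, she, to, eat, and apple each have more than one head once the enhanced edges are included.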
In the rest of this document, we discuss the fundamental principles of our dependency annotation, focusing on aspects that are common to both the basic and the enhanced representation. For more information about basic and enhanced dependencies, we refer to the detailed annotation guidelines:
- Basic dependencies
- Enhanced dependencies
The goal of the typed dependency relations is a set of broadly observed “universal dependencies” that work across languages. Such dependencies seek to maximize parallelism by allowing the same grammatical relation to be annotated the same way across languages, while making enough crucial distinctions that different constructions can be told apart. Two things should be noted from the outset:
- The goal of parallelism has limits: The standard does not postulate and annotate “empty” things that do not appear in various languages, and it allows the use of language-specific refinements of universal dependencies to represent particular relations of language-particular importance.
- The notion of dependency has limits: Not all grammatical relations can be reduced to binary asymmetric relations between a syntactic head and a subordinate element, and some of our typed “dependency” relations therefore must be understood as convenient encodings of other relations without implications about syntactic headedness.
We now try to lay down some general principles that should guide the use of universal dependencies to achieve as much parallelism as possible (but not more) across languages.
The Primacy of Content Words
Dependency relations hold primarily between content words, rather than being indirect relations mediated by function words.
Function words attach as direct dependents of the most closely related content word.
Punctuation attaches to the head of the clause or phrase to which it belongs.
Putting this together gives a complete dependency tree where internal nodes are content words and where function words and punctuation appear as leaves.
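As a rough illustration, the leaf property can be tested on a set of (relation, head, dependent) triples. The sketch below assumes that the relations `aux`, `cop`, `mark`, `det`, `clf`, `case`, `cc`, and `punct` identify function words and punctuation; that set, the function name, and the word-form encoding (which requires forms to be unique within the sentence) are assumptions made for this example. Because the guidelines allow a small number of exceptions, a hit is a candidate for review rather than a definite error.

```python
# Relations assumed, for this sketch, to mark function words and punctuation.
FUNCTION_RELATIONS = {"aux", "cop", "mark", "det", "clf", "case", "cc", "punct"}

SENTENCE_RELATIONS = [
    ("nsubj", "wanted", "she"),
    ("root", "ROOT", "wanted"),
    ("mark", "buy", "to"),
    ("xcomp", "wanted", "buy"),
    ("cc", "eat", "and"),
    ("conj", "buy", "eat"),
    ("det", "apple", "an"),
    ("obj", "buy", "apple"),
]


def non_leaf_function_words(relations):
    """List function words or punctuation that have dependents of their own."""
    heads = {head for _rel, head, _dep in relations}
    return [
        dep
        for rel, _head, dep in relations
        if rel in FUNCTION_RELATIONS and dep in heads
    ]
```

For the sentence she wanted to buy and eat an apple, the function returns an empty list: to, and, and an are all leaves.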
Preferring content words as heads maximizes parallelism between languages because content words vary less than function words between languages. In particular, one commonly finds the same grammatical relation being expressed by morphology in some languages or constructions and by function words in other languages or constructions, while some languages may not mark the information at all (such as not marking tense or definiteness).
The Status of Function Words
The primacy of content words implies that function words normally do not have dependents of their own. In particular, it means that multiple function words related to the same content word always appear as siblings, never in a nested structure, regardless of their interpretation. A typical case is that of auxiliary verbs, which never depend on each other.
Note that copula verbs are also counted as auxiliaries in this respect. In copula constructions, auxiliaries will therefore often be attached to predicates that are not verbs.
Similarly, multiple determiners are always attached to the head noun.
We are aware that the choice to treat function words formally as dependents of content words is at odds with many versions of dependency grammar, which prefer the opposite relation for many syntactic constructions. We prefer to view the relations between content words and function words, not as dependency relations in the narrow sense, but as operations that modify the grammatical category of the content word so that it can participate in different dependency relations with other content words. We refer to these relations as functional relations or function word relations when we want to emphasize that they are different from dependency relations between content words. This view makes function words functionally (but not structurally) similar to morphological operations and is compatible with Tesnière’s notion of the nucleus as the locus of syntactic dependencies.
Nevertheless, there are four important exceptions to the rule that function words do not take dependents:
- Multiword function words
- Coordinated function words
- Function word modifiers
- Promotion by head elision
Multiword Function Words
The word forms that make up a fixed function-word multiword expression (MWE) are connected using the special dependency relation fixed (see below). By convention, the first word is always taken as the head, so when the multiword expression is a functional element, the initial word form will then superficially look like a function word with dependents.
Whether an expression should be treated as a fixed multiword expression has to be decided for each language, and in some cases this will require somewhat arbitrary conventions, because it involves choosing a cut point along a path of grammaticalization. Nevertheless, most languages have some very common multiword expressions that effectively behave like other function words as linkers, marks, or case particles, and it would be highly undesirable not to recognize them as multiword function words. Examples in English include in spite of (like despite), as well as (like and), and prior to (like before).
Coordinated Function Words
Head coordination is a syntactic process that can apply to almost any word category, including function words like conjunctions and prepositions. In such cases, the standard analysis of coordination is used and function words have dependents.
Function Word Modifiers
Certain types of function words can take a restricted class of modifiers, mainly light adverbials (including negation). Typical cases are modified determiners like not every (linguist) and exactly two (papers) and modifiers of subordinating conjunctions.
Negation can modify any function word, but other types of modifiers are disallowed for function words that express properties of the head word often expressed morphologically in other languages. This class, which we refer to as pure function words, includes auxiliary verbs, case markers (adpositions), and articles, but needs to be defined explicitly for each language. When pure function words appear with modifiers other than negation, we take the modifier to apply to the entire phrase and therefore attach it to the head word of the function word, as illustrated by the phrase right before midnight.
The analysis here is that right modifies the entire phrase before midnight and therefore attaches to midnight, which is the head of this phrase. (It is a general property of dependency trees that phrase modification is structurally indistinguishable from head modification.) Further support for this analysis comes from the possibility of replacing before midnight by the adverb then.
Making sure that pure function words do not have dependents of their own facilitates comparison with languages where the corresponding properties are expressed morphologically as well as conversion to the enhanced representation where this difference is neutralized.
To sum up, our treatment of function word modifiers can be expressed in three principles:
- Pure function words can only be modified by negation.
- Other function words can also take (other) light adverbial modifiers.
- When in doubt, prefer a flat structure where function words attach to a content word.
Note also that the language-specific documentation should specify what words (if any) are treated as pure function words in that language.
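The first two principles amount to a small decision rule, sketched below in Python. The function name and its boolean parameters are hypothetical; whether a given word counts as a pure function word, and whether a modifier counts as negation, must come from the language-specific documentation and is simply passed in here.

```python
def modifier_attachment(function_word, content_head, is_negation, is_pure):
    """Return the word that a modifier of a function word should attach to.

    function_word: the function word the modifier appears to modify
    content_head:  the content word that the function word depends on
    is_negation:   True if the modifier is a negation
    is_pure:       True if the function word is a pure function word
    """
    if is_pure and not is_negation:
        # The modifier is taken to apply to the whole phrase,
        # so it attaches to the head of the function word.
        return content_head
    return function_word
```

For right before midnight, where before is a case marker and thus a pure function word, `modifier_attachment("before", "midnight", is_negation=False, is_pure=True)` returns `"midnight"`. For not every linguist, with every treated as a determiner that is not a pure function word, `modifier_attachment("every", "linguist", is_negation=True, is_pure=False)` returns `"every"`.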
Promotion by Head Elision
When the natural head of a function word is elided, the function word will be “promoted” to the function normally assumed by the content word head. This type of analysis should in general be preferred over an analysis using the orphan relation, because it disrupts the structure less. The orphan analysis of ellipsis should only be used when there is no function word that can be promoted. The following examples illustrate promotion of auxiliaries, prepositions and subordinating conjunctions (but only the first example illustrates the exception from the rule that function words have no dependents).
The Taxonomy of Typed Dependencies
We now review some of the key ideas underlying our taxonomy of typed dependency relations, focusing first on the central dependency relations between content words.
Core Arguments vs. Oblique Modifiers
The UD taxonomy is centered around the fairly clear distinction between core arguments (subjects, objects, clausal complements) and other dependents. It does not distinguish adjuncts (general modifiers) from oblique arguments (arguments said to be selected by a head but not expressed as a core argument). The rest of this section expands on the linguistic basis of these choices, and may be skipped.
The definition of core arguments
The core/oblique distinction is ultimately an information packaging distinction. All or nearly all languages have a basic way of expressing the one or two arguments of most verbs (intransitive and transitive verbs), and this unmarked form of argument expression is as a core argument. If additional arguments can appear that are treated similarly to these arguments, they may also be regarded as core arguments. (Some languages have no additional core arguments, while other languages allow multiple object arguments, for instance.) Status as a core argument is decoupled from the semantic roles of participants. Normally, depending on the meaning of a verb, many different semantic roles can be expressed by the same means of encoding core arguments. Nevertheless, there is a correlation: agent and patient or theme roles of predicates in their unmarked valence are normally realized as core arguments.
Syntactically, there is not a single criterion which can be used crosslinguistically to distinguish core arguments from obliques, though there are often good and useful criteria for particular languages. These include:
- Verbs usually only agree with core arguments
- Oblique arguments are usually or always marked by an adposition, while core arguments appear as bare nominals
- Certain cases, traditionally called nominative, accusative, and absolutive, typically mark core arguments
- Core arguments in many languages occupy special positions in the clause, often adjacent to the verb
- Syntactic phenomena such as being the controller of a subordinate clause argument or the target of relativization are limited to core arguments in some languages
Ultimately, the distinction must be drawn and documented on language-particular grounds. For example, many languages have certain verbs which take arguments in oblique cases such as dative or an experiencer case, but these arguments should be regarded as core arguments because their syntactic behavior parallels that of the arguments of other transitive verbs.
Avoiding an argument/adjunct distinction
Many grammatical frameworks suggest that some obliques are selected by or are arguments of a head (for instance, the source expression from the Queen is an argument of the head receive), while other obliques are general adjuncts, which can appear with any predicate without the head selecting for them (for instance, a temporal modifier such as after the holidays).
However, the argument/adjunct distinction is subtle, unclear, and frequently argued over. For instance, syntacticians at certain times have argued for various obliques to be arguments, while at other times arguing that they are adjuncts, particularly for certain semantic roles such as oblique instruments or sources. We take the distinction to be sufficiently subtle (and its existence as a categorical distinction sufficiently questionable) that the best practical solution is to eliminate it. This approach echoes the viewpoint of the original Penn Treebank annotators.
The core-oblique distinction is generally accepted in language typology as being both more relevant and easier to apply cross-linguistically than the argument-adjunct distinction. See, for example:
- Avery D. Andrews. 2007. The Major Functions of the Noun Phrase. In Timothy Shopen (ed.) Language Typology and Syntactic Description: Clause Structure (2nd ed), Cambridge University Press, Cambridge, United Kingdom, pp. 132-223. (1st edition, 1985.)
- Sandra A. Thompson. 1997. Discourse Motivations for the Core-Oblique Distinction as a Language Universal. In Akio Kamio (ed.) Directions in Functional Linguistics. Benjamins, Amsterdam, the Netherlands, pp. 59-82.
A Mixed Functional-Structural System
One major role of dependencies is to represent function, but the Universal Dependencies also encode structural notions. On the structural side, languages are taken to principally involve three things:
- Nominal phrases (which are the usual means of entity expression, but may also be used for other things)
- Clauses headed by a predicate (most commonly a verb, but it may be other things, such as an adjective or adverb, or even a predicate nominal, such as He is a wreck)
- Miscellaneous other kinds of modifier words, which may themselves allow some modification, but do not expand into the same rich structures as nominal phrases and predicates.
This three-way distinction is generally encoded in dependency names. For example, an adverbial modifier of a verb bears one of three relations depending on which of these three sorts it is: obl for a nominal phrase, advcl for a clause, or advmod for another modifier word.
Similarly, the core grammatical relations differentiate core arguments that are clauses (e.g., csubj, ccomp) from those that are nominal phrases (e.g., nsubj, obj).
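The way structural type is folded into relation names can be summarized as a lookup table. The sketch below is a deliberate simplification covering only the relations named in this section; the function labels ("subject", "object", "adverbial"), the type labels, and the pairing of obj with ccomp as its clausal counterpart are assumptions made for illustration.

```python
# (grammatical function, structural type of the dependent) -> relation name
RELATION_TABLE = {
    ("subject", "nominal"): "nsubj",
    ("subject", "clause"): "csubj",
    ("object", "nominal"): "obj",
    ("object", "clause"): "ccomp",
    ("adverbial", "nominal"): "obl",
    ("adverbial", "clause"): "advcl",
    ("adverbial", "modifier"): "advmod",
}


def relation_for(function, dependent_type):
    """Look up the relation name for a dependent of a predicate."""
    try:
        return RELATION_TABLE[(function, dependent_type)]
    except KeyError:
        raise ValueError(
            f"no relation in this sketch for {function!r} realized as {dependent_type!r}"
        ) from None
```

For example, `relation_for("adverbial", "clause")` returns `"advcl"`, while `relation_for("subject", "modifier")` raises `ValueError` because the table has no entry for that combination.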
Clausal Dependents
To classify dependents of the main predicate in a clause, the UD taxonomy obeys the following principles:
- differentiate core arguments from noncore arguments and adjuncts (see “Core arguments vs. oblique modifiers” above)
- differentiate subjects from complements
- differentiate clauses with obligatory control from clauses with other types of subject licensing
- differentiate attachment to predicates from attachment to nominal phrases
- capture clausal modifiers of nouns that do not take the form of a relative clause
Additional distinctions (for example, with respect to voice) can be captured via language-specific subtypes (such as nsubj:pass for the subject of a passivized verb).
Note that the UD taxonomy does not attempt to differentiate finite from nonfinite clauses.
Coordination
We treat coordinate structures asymmetrically: The head of the relation is the first conjunct and all the other conjuncts depend on it via the conj relation. Coordinating conjunctions and punctuation delimiting the conjuncts are attached to the immediately following conjunct using the cc and punct relations, respectively.
He came home , took a shower and immediately went to bed .
conj(came, took)
conj(came, went)
punct(took, ,-4)
cc(went, and)
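The attachment rule for coordination can be written as a short procedure. In the Python sketch below, the function name `coordinate` and the input encoding are hypothetical: `conjuncts` lists the head words of the conjuncts in order, and `delimiters` gives, for each non-first conjunct, the (relation, word) pairs for the conjunction or punctuation that immediately precedes it.

```python
def coordinate(conjuncts, delimiters):
    """Build (relation, head, dependent) triples for a coordinate structure."""
    first = conjuncts[0]
    relations = []
    for conjunct, marks in zip(conjuncts[1:], delimiters):
        # Every non-first conjunct depends on the first conjunct.
        relations.append(("conj", first, conjunct))
        # Delimiters attach to the conjunct that immediately follows them.
        for rel, word in marks:
            relations.append((rel, conjunct, word))
    return relations


EXAMPLE = coordinate(
    ["came", "took", "went"],
    [[("punct", ",")], [("cc", "and")]],
)
```

`EXAMPLE` contains the four relations listed above: conj(came, took), punct(took, ,), conj(came, went), and cc(went, and).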
Multiword Expressions
Multiword expressions (MWEs) are combinations of words that (in some respect and to different degrees) behave as lexical units rather than compositional syntactic phrases. The UD taxonomy contains three special relations for analyzing MWEs:
- fixed is used to analyze fixed grammaticized function-word MWEs like in spite of (see above)
- flat is used to analyze exocentric semi-fixed MWEs like Barack Obama with no clear head
- compound is used to analyze (headed or endocentric) compounds like noun phrase, with head phrase
Structures analyzed with fixed and flat are headless by definition and are consistently annotated by attaching all non-first elements to the first and only allowing outgoing dependents from the first element.
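The convention for headless structures is simple enough to state as code. The following Python sketch, with the hypothetical function name `headless_attachments`, attaches every non-first word of an expression to the first word using either fixed or flat.

```python
def headless_attachments(words, relation="fixed"):
    """Attach all non-first words of a headless MWE to its first word."""
    if relation not in {"fixed", "flat"}:
        raise ValueError("only fixed and flat are annotated this way")
    first, *rest = words
    return [(relation, first, word) for word in rest]
```

For example, `headless_attachments(["in", "spite", "of"])` returns `[("fixed", "in", "spite"), ("fixed", "in", "of")]`, and `headless_attachments(["Barack", "Obama"], "flat")` returns `[("flat", "Barack", "Obama")]`.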
By contrast, compounds are annotated to show their modification structure, including a regular concept of head.
Special Relations
Besides core dependency relations, functional relations, and relations for analyzing coordination, MWEs and punctuation, the UD taxonomy includes a number of special relations for handling things like orthographic errors in text, disfluencies in speech, and list structures without internal syntactic structure.