
This page still pertains to UD version 1.

Uppsala Group on Ellipsis

(Arne Skjærholt, Chris Manning, Dan Zeman, Verginica Mititelu)

This topic is related to GitHub issues #164 and #188.

Ellipsis is a fairly universal phenomenon, yet it is admittedly underspecified in the Universal Dependencies guidelines. It should be documented in the Specific constructions section of each language because it affects many different relations. As of today, there is a section on this topic in the universal part and in the Czech and English documentation. However, not all instances of ellipsis have satisfactory solutions.

We do not have to handle ellipsis if the elided nodes are leaves in our representation. Following one of the core principles of UD, “do not annotate things that are not there,” we do not, for instance, add all the missing subjects in pro-drop languages. However, if the deleted node has dependents that were not deleted, we have to specify where these orphans should now be attached and what the label of the relation should be.

Especially tricky are cases where a verb is elided and there is more than one orphaned dependent.

Possible approaches

In a nutshell:

We have identified several possible approaches to orphaned dependents. Some of them are not used in the current version of the UD standard but they are used in other dependency treebanks.

Promotion of a dependent to the head position

This is the easiest thing to do when there is only one orphan. If there are two or more orphans, it may not always be apparent which one should be promoted. Promotion is currently used in UD in various places. We need to document the examples more carefully, but it is difficult to search for them in the data, as there is no specific label that would give them away.

Examples

In English:

While auxiliaries are normally not analyzed as heads, when a verb has been elided through VP ellipsis, the auxiliary inherits the head status. This includes the nonfinite auxiliary to.

Mary did n't leave , John did
parataxis(leave, did-7)
nsubj(did-7, John)

So please update whatever you need to
dobj(update, whatever)
rcmod(whatever, need)
xcomp(need, to)

Similarly, when a preposition is stranded in a passive construction, the preposition receives the nmod label on account of lacking a nominal head.

That matter was talked about in detail already
nmod(talked, about)

In Czech:

If the head noun is missing from a noun phrase, i.e. there is just an adjective, possibly also a numeral or a determiner, then one orphan is selected as the main dependent and it gets promoted:

Zatímco mně zbylo pět malých zelených jablíček , Petra měla tři velká červená . \n While to-me remained five small green apples , Petra had three big red .
dobj(měla, červená)
dobj(had, red)
nummod(červená, tři)
nummod(red, three)
amod(červená, velká)
amod(red, big)

Labels

Attaching all orphans to the grandparent

This is similar to promotion but instead of selecting just one orphan to be promoted, we attach all orphans to the grandparent node (or more precisely: to the next available ancestor in the hierarchy). This approach is taken in the Prague family of treebanks. It is not officially used in the current version of UD but in practice it can be found at least in the Czech UD 1.1 data because the current conversion procedure ignores ellipsis.
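
The reattachment step can be sketched in a few lines of Python. This is only an illustration, not part of any UD tool: it assumes the input tree still contains the elided nodes (for example, as NULL nodes), that heads are stored in a plain dictionary with 0 standing for the artificial ROOT, and the function name attach_to_ancestor and the toy tree are hypothetical.

```python
def attach_to_ancestor(heads, elided):
    """Reattach orphans to the next available (non-elided) ancestor.

    heads  -- dict mapping each node id to its head id (0 = artificial ROOT)
    elided -- set of node ids that are not present in the surface sentence
    Returns a new head dict that no longer contains the elided nodes.
    """
    new_heads = {}
    for node, head in heads.items():
        if node in elided:
            continue  # the elided node itself disappears from the tree
        # Climb until we reach an ancestor that was not elided (or ROOT).
        while head in elided:
            head = heads[head]
        new_heads[node] = head
    return new_heads


# Toy tree (hypothetical): nodes 1 and 3 depend on the elided node 2,
# which depends on node 4, which is attached to ROOT.
heads = {1: 2, 2: 4, 3: 2, 4: 0}
result = attach_to_ancestor(heads, elided={2})
print(result)
```

Note that the sketch only moves the attachments; which label the reattached orphans should carry is a separate question.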

Labels

The options listed for the promotion approach also apply here, and there is one additional option:

Examples

In the Latin sentence (segment) beatus qui legit et qui audiunt verba prophetiae et servant ea quae in ea scripta sunt tempus enim prope est, a copula is missing; this would not be a problem in UD but the Latin treebank uses the Prague annotation style and the copula was supposed to head the whole sentence. The orphans are attached directly to the artificial ROOT node and the relations are labeled with chained labels such as PNOM_ExD0_PRED. The first in the chain is the relation of the orphan to the missing copula: PNOM. The indexed ExD actually represents the missing node, not a relation. And the PRED is the Prague label for the root relation.

ROOT beatus qui legit et qui audiunt verba prophetiae et servant ea …
PNOM_ExD0_PRED(ROOT, beatus)
COORD_ExD0_PRED(ROOT, et-5)
SBJ_CO(et-5, legit)
COORD(et-5, et-10)
SBJ_CO(et-10, audiunt)
SBJ_CO(et-10, servant)
SBJ(legit, qui-3)
SBJ(audiunt, qui-6)
OBJ(audiunt, verba)
ATR(verba, prophetiae)
OBJ(servant, ea)

One problem with the grandparent approach, which can also be seen in the above example, is that it may result in several nodes attached directly to the artificial ROOT node. In UD, this would mean that multiple nodes have the root relation. While this is not explicitly banned in version 1 of the guidelines (and it occasionally appears in release 1.1 of the data), there is a community consensus that we want to avoid it. So we cannot use the grandparent solution, at least not at the top level of the tree.
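
To make the chained labels of the Latin example more concrete, here is a minimal Python sketch that splits a label such as PNOM_ExD0_PRED into its three parts and counts how many nodes end up directly under ROOT. It assumes that a chain contains exactly one indexed ExD element; the function name split_chained_label and the tuple layout of the edges are hypothetical and chosen only for this illustration.

```python
import re

# One "_ExD<index>_" element separates the orphan's own relation
# from the relation of the missing node (assumption: one per label).
EXD = re.compile(r"_ExD(\d+)_")


def split_chained_label(label):
    """Return (orphan_relation, missing_node_index, missing_node_relation),
    or None if the label is not a chained label."""
    match = EXD.search(label)
    if match is None:
        return None
    return label[:match.start()], int(match.group(1)), label[match.end():]


# A few edges from the Latin example, as (head, dependent, label).
edges = [
    ("ROOT", "beatus", "PNOM_ExD0_PRED"),
    ("ROOT", "et-5", "COORD_ExD0_PRED"),
    ("et-5", "legit", "SBJ_CO"),
]

decoded = [split_chained_label(label) for _, _, label in edges]
root_children = [dep for head, dep, _ in edges if head == "ROOT"]
print(decoded)
print(len(root_children), "nodes attached to ROOT")
```

Ordinary labels that merely contain an underscore, such as SBJ_CO, are left alone because they do not match the ExD pattern.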

An empty NULL node

It is possible to insert an empty node that represents the elided word. The orphans are then attached to this empty node and all relations can keep their labels. We do not do this in the current version of UD, but it is used e.g. in the Hindi treebank, in the Russian treebank and elsewhere. Instead of word forms, these nodes are often labeled NULL; in SynTagRus they are called #Fantom, and elsewhere there may be just an underscore _ representing an empty word.

While this is the most expressive mechanism, it also postulates content for which there is no direct evidence in the sentence. Hence we should be careful and the introduction of NULL nodes should be restricted. The situation in which they are most needed is when there are several orphans and it is not clear whether and which of them could be promoted to the head position.

There are concerns about the influence of NULL nodes on parsing (a parser now has to learn where to introduce a NULL node in the input sentence). Also, some people believe that a structure with empty nodes is less intuitive for users lacking linguistic background (but other people think the opposite, and we are not aware of studies that would measure intuitiveness :-)).

Examples

दीवाली के दिन जुआ खेलें मगर NULL घर में या होटल में \n dīvālī ke dina juā kheleṁ magara NULL ghara meṁ yā hoṭala meṁ \n Diwali of day gambling play but play house in or hotel in
r6(दिन, दीवाली)
r6(dina, dīvālī)
r6(day, Diwali)
lwg_psp(दीवाली, के)
lwg_psp(dīvālī, ke)
lwg_psp(Diwali, of)
k7t(खेलें, दिन)
k7t(kheleṁ, dina)
k7t(play-31, day)
k2(खेलें, जुआ)
k2(kheleṁ, juā)
k2(play, gambling)
ccof(मगर, खेलें)
ccof(magara, kheleṁ)
ccof(but, play-31)
ccof(मगर, NULL-7)
ccof(magara, NULL-20)
ccof(but, play-33)
k7p(NULL-7, या)
k7p(NULL-20, yā)
k7p(play-33, or)
ccof(या, घर)
ccof(yā, ghara)
ccof(or, house)
lwg_psp(घर, में-9)
lwg_psp(ghara, meṁ-22)
lwg_psp(house, in-35)
ccof(या, होटल)
ccof(yā, hoṭala)
ccof(or, hotel)
lwg_psp(होटल, में-12)
lwg_psp(hoṭala, meṁ-25)
lwg_psp(hotel, in-38)

The remnant relation

This is an approach currently used in UD, and remnant is actually the only relation we designed specifically for ellipsis; nevertheless, it only targets one special case of ellipsis. That case usually arises from coordination of clauses that share the same predicate but differ in arguments. Example: John won bronze, Mary silver, and Sandy gold.

John won bronze , Mary silver , and Sandy gold
nsubj(won-2, John-1)
dobj(won-2, bronze-3)
remnant(John-1, Mary-5)
remnant(Mary-5, Sandy-9)
remnant(bronze-3, silver-6)
remnant(silver-6, gold-10)
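
The information carried by the remnant chains in this example can be recovered mechanically. The following Python sketch follows each chain back to its origin and reads off the relation and predicate that the remnant would have had if the verb had been repeated. It is only an illustration: the function name resolve_remnants and the (relation, head, dependent) triple format are assumptions made here, not an existing tool or file format.

```python
def resolve_remnants(edges):
    """Map each remnant to the (relation, predicate) of its chain origin.

    edges -- list of (relation, head, dependent) triples
    """
    role = {}        # dependent -> (relation, head) for ordinary edges
    remnant_of = {}  # remnant -> the node it is attached to
    for rel, head, dep in edges:
        if rel == "remnant":
            remnant_of[dep] = head
        else:
            role[dep] = (rel, head)

    resolved = {}
    for dep in remnant_of:
        origin, seen = dep, set()
        # Follow the chain (guarding against cycles in malformed input).
        while origin in remnant_of and origin not in seen:
            seen.add(origin)
            origin = remnant_of[origin]
        if origin in role:
            resolved[dep] = role[origin]
    return resolved


edges = [
    ("nsubj", "won-2", "John-1"),
    ("dobj", "won-2", "bronze-3"),
    ("remnant", "John-1", "Mary-5"),
    ("remnant", "Mary-5", "Sandy-9"),
    ("remnant", "bronze-3", "silver-6"),
    ("remnant", "silver-6", "gold-10"),
]
resolved = resolve_remnants(edges)
print(resolved)
```

The sketch recovers the role of each remnant, but it does not group the remnants into clauses; the remnant edges alone do not say that Mary belongs with silver and Sandy with gold.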

The motivation here is to preserve the types of dependency relations without introducing a new node. However, there are also drawbacks:

Recommendation

There was a consensus that the remnant relation is not a good solution and should eventually be replaced by something else. We did not reach a consensus on what that something else should be. Restricted usage of NULL nodes was found somewhat appealing by some group members, and opposed by others. Chained labels are a possible alternative that encodes almost the same information as the NULL nodes (unless there are multiple nodes missing, resulting in attachment ambiguity). It was also proposed to use the promotion approach, even at the price of having to select one argument of the verb and attach the other arguments to it.

Obviously we need to continue the discussion in a larger group online. This is a long-term goal anyway, as we are not allowed to modify the guidelines before October 1, 2015, and we are not going to implement any change in the planned November release. But we may want to add a warning to the description of remnant that a revision is under consideration.

As a short-term goal, we should get rid of the multi-root structures in all the languages, although this is not explicitly required by the current guidelines. Since we cannot use remnant for them, we have to resort to promotion. It is not clear which of the multiple orphans to promote: we could say that e.g. dobj has precedence over nsubj, but we would have to construct large tables for all possible combinations of orphans. We therefore propose that the first orphan is always selected, and the other orphans are attached directly to it. The promoted orphan inherits the root relation from the deleted verb, while the other orphans keep their dependency relation labels as if they were attached to the verb. If the orphans have their own dependents, their relations will be preserved as in the following example:

ROOT And Robert the fourth place .
root(ROOT, And)
nsubj(And, Robert)
dobj(And, place)
punct(And, .)
amod(place, fourth)
det(place, the)
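
The proposed rule is simple enough to state as a short Python sketch. It is a hedged illustration only: the function name promote_first_orphan, the integer token ids and the (head, dependent, relation) output format are assumptions, and the label cc given to And before promotion is a guess that plays no role in the result, since the promoted orphan takes over the relation of the deleted verb.

```python
def promote_first_orphan(orphans, elided_head, elided_deprel):
    """Promote the first orphan (in word order) to the elided node's position.

    orphans       -- list of (token_id, deprel) pairs, the direct dependents
                     of the deleted node
    elided_head   -- head id of the deleted node (0 = artificial ROOT)
    elided_deprel -- relation the deleted node had to its head
    Returns a list of (head, dependent, relation) triples.
    """
    if not orphans:
        return []
    ordered = sorted(orphans)  # word order = ascending token id
    promoted_id, _ = ordered[0]
    # The promoted orphan inherits the relation of the deleted node ...
    edges = [(elided_head, promoted_id, elided_deprel)]
    # ... and the remaining orphans keep their labels, attached to it.
    for token_id, deprel in ordered[1:]:
        edges.append((promoted_id, token_id, deprel))
    return edges


# "And Robert the fourth place ."
# ids: And=1 Robert=2 the=3 fourth=4 place=5 .=6
orphans = [(1, "cc"), (2, "nsubj"), (5, "dobj"), (6, "punct")]
edges = promote_first_orphan(orphans, elided_head=0, elided_deprel="root")
print(edges)
```

Dependents of the orphans themselves (here the and fourth under place) are not touched, which is why they do not appear in the input.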