
This page pertains to UD version 2.

Working Group on Multiword Expressions

This page was created by Kim Gerdes. Comments have so far been added by Joakim Nivre, Nathan Schneider and Agata Savary.

MWEs in UD

The purpose of this page is to serve as a basis for discussion for the improvement of multi-word annotation in UD.

UD is a syntactic annotation scheme. Thus, the three MWE relations (fixed, flat, and compound) should encode syntactically problematic constructions that cannot be analyzed by other “regular” dependency relations. The objective of a syntactic annotation scheme is not primarily to encode idioms and named entities.

Nathan: A general comment (forgive me if this is obvious or naïve) is that MWEs typically arise over time by reuse and adaptation of core/canonical constructions in the language (such as modification, compounding, and so forth). In MWEs/non-canonical constructions these become specialized, sometimes acquiring idiomatic meaning, sometimes becoming grammatically fossilized, and sometimes forming “minigrammars” (Fillmore’s term; maybe corresponds to “sub-system” below) for certain kinds of productive expressions, like names, numbers, dates, addresses, and kinship terms. Often, as in English light verb constructions, there is a mostly canonical form despite an idiomatic meaning, in which case I think the UD policy is and should be to analyze it with a canonical structure: UD is syntax, not semantics. For minigrammars with frequent and readily identifiable slots, like titles in personal names in English, I think it may be appropriate to identify an appropriate canonical relation and subtype it to provide clarity to annotators and parsers that a special overall construction is present: e.g., nmod:title or nmod:prename. The core slots in a personal name (given names and surnames) do not have any apparent head, though the order is significant within the minigrammar, so flat:name makes sense to me. Similarly with flat:foreign. The hardest one is fixed, because for special grammatical expressions one can often see a trace of the original structure yet recognize it as serving a coherent and lexicalized grammatical purpose, so there may be no clear test that determines where to draw the line. But there is clear room for improvement in the guidelines in any case.

Agata: In this discussion we will probably mainly (constructively) criticize the current proposal. So let me start by saying that I truly admire what has already been done in UD, and I’m grateful for all these efforts that very substantially benefit a very large community, including PARSEME.

Proposals for the MWE relations fixed, flat, and compound:

The fixed relation is used for syntactically irregular constructions, while flat and compound are used for productive constructions corresponding to syntactic sub-systems. We should reserve flat for constructions where a head cannot be determined, and we should use compound for very cohesive constructions, neighboring morphology.

The current definitions of the three relations are problematic or not applied consistently:

All syntactically regular constructions should receive a normal compositional analysis. Their idiomatic status can be annotated on a different level of (semantic) annotation. This includes semantically compositional (the dog slept), semi-compositional ([wide] awake, [heavy] smoker, rain [cats and dogs]), and semantically non-compositional constructions (kick the bucket, green card, cats and dogs, in the light (of), Fr. pomme de terre 'potato'). See the article Kahane, Courtin, Gerdes “Multi-word annotation in syntactic treebanks”, TLT 2018 for a detailed definition of the terms. Agata: I like the idea of the 2-dimensional typology in this paper (along the syntactic axis, and along the semantic one). But I have serious problems with their interactions. Most importantly, the notion of semantic non-compositionality is unclear or even inconsistent. For me:

Constraining more precisely the extent of headless and irregular relations is generally expected to lead to a much smaller set of constructions using flat, compound, or fixed. A mainly syntactic definition will also make the semantic extent of the usage of these relations more variable among languages. We cannot expect, for example, that all person or organization names will syntactically behave the same in different languages.

In addition to the more precise definition, we also have to make proposals on how to preserve previously annotated units of MWE. Agata: There has been a discussion going on (involving Dan, Joakim, Filip, Carlos, Silvio and Agata) about defining an extension of the conllu format to accommodate annotations from other initiatives (like PARSEME), which build on UD treebanks, or more generally on the conllu format. A first version is available here (it still needs validation by the UD-PARSEME group). Note that first a meta-format is defined, which basically consists of (i) adding new columns to the 10 conllu columns, (ii) specifying the names of columns used (in the first line of a file), (iii) standardizing identifiers for source sentences. Each initiative can instantiate this meta-format by defining the names and syntax of its own additional columns. For PARSEME, we defined the cupt format (used in our brand new PARSEME corpus edition 1.1) which adds 1 column named PARSEME:MWE containing VMWE annotations. The same meta-format could be used to conserve the previous MWE annotations from the current version of UD2. A column would just be added with a standardized name (e.g. UD2-obsolete:MWE) and syntax (as currently in UD2).
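To make the meta-format idea concrete, here is a minimal Python sketch of a reader. It assumes that the column names are declared on the first line in the form `# global.columns = ...` (the exact header syntax is not given on this page, so treat it as an assumption), and the PARSEME:MWE values in the sample (`1:VID`, `1`, `*`) are purely illustrative. The function name `read_extended_conllu` is hypothetical.

```python
from typing import Dict, List

# Assumed header syntax; the page only says the names go "in the first line".
HEADER_PREFIX = "# global.columns ="


def read_extended_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Parse a conllu-like string whose first line names its columns.

    Returns a list of sentences; each sentence is a list of tokens,
    and each token maps a column name to its string value.
    """
    lines = text.strip("\n").split("\n")
    if not lines[0].startswith(HEADER_PREFIX):
        raise ValueError("first line must declare the column names")
    columns = lines[0][len(HEADER_PREFIX):].split()

    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in lines[1:]:
        if not line.strip():          # blank line closes a sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):    # sentence-level comment, skipped here
            continue
        else:
            values = line.split("\t")
            if len(values) != len(columns):
                raise ValueError("token line does not match declared columns")
            current.append(dict(zip(columns, values)))
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample: the 10 conllu columns plus one extra MWE column.
SAMPLE = "\n".join([
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC PARSEME:MWE",
    "# text = He kicked the bucket",
    "\t".join(["1", "He", "he", "PRON", "_", "_", "2", "nsubj", "_", "_", "*"]),
    "\t".join(["2", "kicked", "kick", "VERB", "_", "_", "0", "root", "_", "_", "1:VID"]),
    "\t".join(["3", "the", "the", "DET", "_", "_", "4", "det", "_", "_", "1"]),
    "\t".join(["4", "bucket", "bucket", "NOUN", "_", "_", "2", "obj", "_", "_", "1"]),
])

SENTENCES = read_extended_conllu(SAMPLE)
```

A column such as UD2-obsolete:MWE would be read in exactly the same way, since the reader only relies on the names declared in the header.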

Here we explore to what extent some basic non-embedded structures can be expressed using the dual MWEPOS/INMWE features that will be presented below (see thread https://github.com/bguil/UD-French-discussion/issues/16 for a discussion in French on the feature names). Joakim: These features are not compatible with v2 of the guidelines, are they? I think we need to clearly distinguish what we propose as improvements under the current guidelines and what we would like to see in future versions of the guidelines. Agata: I realize that my proposal of not taking fixedness/flexibility into account at all may be much too radical for v2. The only way to save it would be to totally abandon the fixed label and use flat for all headless constructions, whether fixed or semi-fixed. Would this still be too radical? Also, I don’t follow the definition of a headless construction (see below).

Headless structures

Both the flat and the fixed relation are used for headless (=flat) constructions, i.e. when no head can be determined. A construction AB is headless if either both A and B can replace AB or neither A nor B can replace AB – both cases will be described separately below. Headless constructions are annotated in one arbitrary but fixed style. UD chooses the bouquet style, i.e. all words depend on the first word.

If, however, AB can be replaced only by A, A is the head; if AB can be replaced only by B, B is the head. Then the construction is headed and the relation between A and B is not flat or fixed.

Case N: neither A nor B can replace AB:

He reads El País. *He reads El. *He reads País.

Agata: This would mean that any Det+N construction is headless (if the noun requires a determiner)! For instance, I see the girl. *I see the. *I see girl. Where do I go wrong?

Idem for “ad hoc” and “parce que”

We will determine below which relation, flat or fixed, should be used in these cases.

This case should be considered as default also if the annotator just does not know whether this substitution is possible because the text is in a foreign language, for example.

Case B: both A and B can replace AB:

Note that this description also fits to appos relations. We will have to provide criteria to distinguish when flat, fixed, or appos should be used. Joakim: The description also fits relations like conj, parataxis and list.

For “Hillary Clinton”, both parts can replace the whole. Thus, we cannot determine a head. Note that we cannot permute the tokens: *Clinton Hillary.

“the president Obama” can be replaced by “the president” and by “Obama”. Again, we cannot determine a head. This time we can permute word order: Obama, the president.

However, “Mister Miller” can be replaced by Miller and not by Mister (Mister can only be used alone as a vocative). It's a headed construction Mister ← Miller. Cf. the section on compounds for noun-noun constructions in English. The same holds for “President Obama” and “French actor Gaspard Ulliel”.
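The substitution test and the bouquet convention described above can be sketched in a few lines of Python. This is only an illustration: the two boolean inputs stand for the annotator's own judgments, and the function names `substitution_test` and `bouquet` are hypothetical.

```python
from typing import List, Tuple


def substitution_test(a_can_replace: bool, b_can_replace: bool) -> str:
    """Apply the replacement test to a two-part construction AB.

    a_can_replace / b_can_replace record whether A (resp. B) alone
    can stand in for the whole construction AB.
    """
    if a_can_replace and not b_can_replace:
        return "head=A"
    if b_can_replace and not a_can_replace:
        return "head=B"
    # Both or neither: no head can be determined.
    return "headless (case B)" if a_can_replace else "headless (case N)"


def bouquet(token_ids: List[int]) -> List[Tuple[int, int]]:
    """Return (governor, dependent) pairs with every token on the first one."""
    first = token_ids[0]
    return [(first, dep) for dep in token_ids[1:]]


# The judgments below follow the examples discussed in this section.
EXAMPLES = {
    "El País": substitution_test(False, False),
    "Hillary Clinton": substitution_test(True, True),
    "Mister Miller": substitution_test(False, True),
}
```

As Agata and Joakim point out, the test as stated also returns “headless” for Det+N, conj, parataxis and list, so it cannot be the only criterion.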

Which relation names for headless constructions?

Headless constructions always have a bouquet structure. But when to use flat, fixed, and appos?

We propose:

If a headless construction…

  1. has a permutable word order (and case B), we use appos. Joakim: This does not seem consistent with the v2 guidelines. The appos relation is currently restricted to the case of loose (or wide) apposition (as discussed below). This may be changed in future versions, although I would personally prefer subsuming all appositions under nmod (possibly with special subtypes). Under the current guidelines, nmod seems to be the best candidate here.
  2. designates as a whole a proper noun, i.e. we would like to give it PROPN as POS, we use flat. A test could include whether the sentence remains grammatical if the construction is replaced by a single proper noun. Depending on the language, this might also include dates and addresses. This by no means implies that all proper nouns, dates, or addresses should be flat. They should be flat if and only if they are also headless. Question: maybe include POS NOUN for orang utan? Joakim: It seems a bit arbitrary to single out the name category and put (almost) all other headless constructions into fixed. What if they are not fixed expressions but productive constructions?
  3. cannot be analyzed in any way by the annotator, we use flat.
  4. is in a foreign language known to the annotator, we use flat:foreign and add a language feature lang=xxx to each token. This does not imply that all foreign text segments should be flat. Firstly, if the annotator knows the structure, the structures are no longer headless and should be annotated. Secondly, if the construction is a proper noun that is used in the main language of the corpus, the word is not foreign. Hong Kong is flat but not foreign even if we don't know the internal structure in Cantonese (“Perfumed Harbor”).
  5. Remaining headless constructions will be annotated with fixed relations. Joakim: As noted above, it seems wrong to me to use fixed unless the expression is really fixed. And what happened to the idea that fixed is only used for irregular constructions?

Thus, we obtain

  1. “the president -appos→ Obama”, idem: “le président Macron”, “the billionaire Perot”, “the Prophet Mohammed”

  2. “Hillary -flat→ Clinton”, “Hong -flat→ Kong”, “Burkina -flat→ Faso”

  3. “Al -flat→ Qaida”, “El -flat→ País”, “He sang Mahna -flat→ Mahna badi bidibi” (see Title section below)

  4. “orang -flat:foreign→ utan”, “And then she went: gjiko -flat:foreign→ frac zen”

  5. “ad -fixed→ hoc”, “parce -fixed→ que”.

Note that we do not have fixed:foreign which could be a worthwhile distinction if cases such as the Latin construction “ad -fixed→ hoc” are common in a language.
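The five proposed rules for headless constructions can be read as an ordered decision procedure. The Python sketch below is one possible reading: the order in which the rules are checked is an assumption (the proposal does not say which rule wins when several apply), and the class `Judgment` and function `headless_relation` are hypothetical names for the annotator's answers and the resulting label.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """Annotator's answers about a construction already judged headless."""
    permutable: bool = False              # word order can be permuted
    both_parts_can_replace: bool = False  # case B of the substitution test
    is_proper_noun: bool = False          # the whole would get PROPN as POS
    unanalyzable: bool = False            # annotator cannot analyze it at all
    known_foreign_language: bool = False  # foreign language the annotator knows


def headless_relation(j: Judgment) -> str:
    """Pick a relation label following rules 1 to 5 in the listed order."""
    if j.permutable and j.both_parts_can_replace:
        return "appos"          # rule 1
    if j.is_proper_noun:
        return "flat"           # rule 2
    if j.unanalyzable:
        return "flat"           # rule 3
    if j.known_foreign_language:
        return "flat:foreign"   # rule 4
    return "fixed"              # rule 5: everything that remains


# Judgments chosen to reproduce the examples listed above.
RESULTS = {
    "the president Obama": headless_relation(
        Judgment(permutable=True, both_parts_can_replace=True)),
    "Hillary Clinton": headless_relation(
        Judgment(both_parts_can_replace=True, is_proper_noun=True)),
    "El País": headless_relation(Judgment(unanalyzable=True)),
    "orang utan": headless_relation(Judgment(known_foreign_language=True)),
    "ad hoc": headless_relation(Judgment()),
}
```

Writing the rules this way makes Joakim's objection visible: fixed is simply the fall-through case, whether or not the expression is actually fixed.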

flat:

flat should only be used for headless constructions.

Discussion of some putative flat constructions:

Difficult cases:

How to handle Papua New Guinea, North Rhine-Westphalia, Provence-Alpes-Côtes d'Azur (PACA), Tamil Nadu?

Proposal: No flat: Papua <compound Guinea, Rhine -conj> Westphalia, Tamil <compound Nadu (Tamil is a language and an English word), la Nouvelle-Orléans.

flat:foreign

Joakim: This and the following two sections seem to presuppose conventions that are neither part of the current guidelines, nor described on this page, such as “MWEPOS=PROPN”. This needs to be clarified. I think it will be much clearer if the section entitled flat is followed by a section entitled fixed. The intervening sections should either be marked as subsections of the flat section or marked elsewhere.

Discussion of some putative flat:foreign constructions:

Ludwig van Beethoven

Although “van” is not a German word (it is mistaken as such in the current guidelines), it is sufficiently similar to the German preposition “von” that the structure is transparent to any German speaker, and many family names have a “von”, e.g. Hildegard von Bingen. However, the lang=nl feature can be applied to the ADP van.

Equally, for

Los Angeles, Rio de la Plata

the annotators have two choices for these named entities

  1. analyzing them internally. Then

it is easy to kick out these parts for any training on the treebanks or to downgrade the treebanks to the second solution:

  2. not analyzing them. Then

Note that the current French Google UD has already analyzed English Language subsystems. The same treebank also shows that sometimes this practice leads to errors: For example, Sun Yat-Sen is analyzed as compound> instead of flat> (it's a person's name)

l' Université de Zhongshan ( Sun ←compound- Yat-Sen ) à Canton

Question:

Is it a problem if some annotators know that we have “Los <det Angeles”, “Al <det Qaida” or “Hong <amod Kong” and annotate accordingly and others don't?

Another problem is personal names in different cultures (see the Hillary Rodham Clinton problem above). In order to analyze the name, one will have to know something about the language. But the same holds for any journalist that mentions such a name: How to abbreviate Xi Jinping? Xi or Jinping? If the journalist knows that, the head question can be answered, too.

Encoding

POS encoding: For flat and fixed constructions, we cannot assign an interior structure. So usually all words carry the POS assigned to the whole entity, often PROPN but also ADV, SCONJ etc.

For example, Hong Kong has no internal structure accessible to the English annotator. It receives a flat relation and both words have PROPN as POS.

Text elements that are of foreign origin, that are not named entities, and whose language is known receive the lang feature on each token. For example, to Hong and to Kong we add lang=Cantonese or, with the ISO abbreviation, lang=yue.

In order to preserve existing MWE annotation of syntactically transparent text segments that are now analyzed compositionally, such as “New York”, we can loosely follow the PARSEME proposal: the head (here “York”) can carry a special feature MWEPOS giving the category of the compound (here MWEPOS=PROPN), and the other members of the compound (here only “New”) can carry an INMWE=Yes feature.

Titles of art work

In the same vein, titles that have an internal structure should be analyzed syntactically, for example: “The Lord of the Rings”

The head, “Lord”, should have MWEPOS=PROPN and all the other elements should have INMWE=Yes.
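As an illustration of the dual MWEPOS/INMWE encoding, the Python sketch below marks a non-embedded MWE in a dependency tree. It assumes a simplified token representation (dicts with `id`, `head` and `feats`), and the names `mark_mwe` and `feats_to_string` are hypothetical; the head of the MWE is taken to be the one token whose governor lies outside the span.

```python
from typing import Dict, List


def mark_mwe(tokens: List[dict], span_ids: List[int], mwe_pos: str) -> List[dict]:
    """Add MWEPOS to the head of the span and INMWE=Yes to its other members."""
    span = set(span_ids)
    heads = [t for t in tokens if t["id"] in span and t["head"] not in span]
    if len(heads) != 1:
        raise ValueError("the span must form one subtree with a single head")
    for tok in tokens:
        if tok["id"] not in span:
            continue
        if tok is heads[0]:
            tok["feats"]["MWEPOS"] = mwe_pos
        else:
            tok["feats"]["INMWE"] = "Yes"
    return tokens


def feats_to_string(feats: Dict[str, str]) -> str:
    """Serialize features as in the FEATS column, sorted by feature name."""
    items = sorted(feats.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in items) or "_"


# "I read The Lord of the Rings", with a compositional analysis of the title.
TOKENS = [
    {"id": 1, "form": "I", "head": 2, "feats": {}},
    {"id": 2, "form": "read", "head": 0, "feats": {}},
    {"id": 3, "form": "The", "head": 4, "feats": {}},
    {"id": 4, "form": "Lord", "head": 2, "feats": {}},
    {"id": 5, "form": "of", "head": 7, "feats": {}},
    {"id": 6, "form": "the", "head": 7, "feats": {}},
    {"id": 7, "form": "Rings", "head": 4, "feats": {}},
]

mark_mwe(TOKENS, span_ids=[3, 4, 5, 6, 7], mwe_pos="PROPN")
```

Because each token carries at most one of the two features, this encoding records a single level of MWE only.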

Problem: this simple MWEPOS, complement feature cannot capture embedded groupings:

I've finally seen Dr. Strangelove or: How I learned to love the bomb.

The sentence has no flat relation and “Strangelove”, the head of the title, has a MWEPOS=PROPN feature and all other words of the title have a complement=Yes feature. This does not capture the extent of the name Dr. Strangelove but in this case the compound relation Strangelove compound> Dr is endocentric as it carries PROPN on Strangelove.

Equally, for the following example, we can only encode one level of MWE, but it is sufficient to express that Sun Yat-Sen University is one semantic unit with MWEPOS=PROPN on University:

the Sun flat> Yat-Sen <compound University

fixed

fixed is used for “certain fixed grammaticized expressions that behave like function words or short adverbials”.

It concerns mainly complex prepositions, complementizers and determiners.

Nathan: Another set of constructions worth discussing, at least for English, are what Quirk et al. call “quasi-modals”: “used to” (habitual past), “about to” (prospective), “have to” (= must), “be going to” (future), etc. It looks like the infinitive marker to is attached to the infinitive verb as usual, which is fine, but it’s interesting that have, used, be going, and about are idiosyncratic in different morphosyntactic ways, and it’s not even clear what the right POS is for about.

Currently this relation is used too extensively because many of these function words are actually syntactically transparent such as “on top of”, “top of the range”. Joakim: As noted above, they may be syntactically transparent but they are nevertheless irregular in that they lack articles.

Three main changes:

If we regularize an existing fixed MWE, we use the same dual MWEPOS/INMWE feature encoding, introduced above for fixed, to keep the information of the extent of the MWE in the treebank (excluding the case/preposition sub-categorization). Joakim: Again, note that this is not available in v2.

Currently, in many languages, many of the multi-word prepositions and determiners that form semantic units are already annotated compositionally, often with the noun as the head. For example

“on top of that” and similar complex prepositions should now coherently have “top” as NOUN as head, “on” as case and “that” as nmod. “top” could also have a MWEPOS=ADP feature and “on” an INMWE=Yes feature. “of” shall not be marked in any special way.

Equally, “because of” shall not receive any fixed relation or special MWE features.

Cases where we keep the fixed relation:

French: “parce que” 'because'. Reason: We cannot even assign a POS to “parce” and the relation between the two tokens.

German: als ob, nach wie vor. We no longer use fixed for “unter anderem” because the expression is a transparent PP. Idem for the postpositions in expressions such as “von x an” that should have a common compositional analysis (-case→ADP).

Proposal for English: We keep “Of course”? “As of”? “As well”? “Rather than”? “Kind of”? Sy: “rather than” is syntactically clear: “I would rather wait here than go.” “rather” is an ADV and “than” is its regime. But UD English has many inconsistencies concerning “than”.

And no longer should be analyzed as fixed: Sy: any/no/a little longer: no modifies the ADV longer

“be up to sth”, “instead of”, “according, due, prior to”, “so that”, “more than”, “whether or not”

compound:

Compound should be used for very cohesive regular constructions that are neighboring morphology. Cohesive means that the meaning is often non-compositional although the construction remains productive. Joakim: I think I basically agree with this point, but I would prefer morphosyntactic criteria rather than semantic.

Currently it is used for regular systems of “compounding” in different languages

In particular

If a language does not have a regular system of compounding, the compound relation is not used. In most current French treebanks, noun-noun compounding is annotated by different means than the compound relation, see discussion below, and compound only appears for foreign segments of text.

Just like for the two other MWE relations, compound should only be used if no other regular dependency is available. For example, in English, compound is currently used for “Prime Minister” although prime is just an adjective and the structure is transparently amod.

Question: Even Lake Michigan, Mount Rushmore, Fort Alamo should be a left-to-right compound? Yes. That would use compound for two (slightly) different constructions.

Discussion from the current flat page:

My answers in bold

This paragraph briefly records some of the arguments that have been made in the past on relations for name structure. It is an issue over which there has historically been variation and about which there is some continuing debate. Examples like French actor Gaspard Ulliel: Some treebanks have used nmod for titles and honorifics like Mr. or French actor. Most people think this is inappropriate, since an nmod dependent should be a full phrase, which will typically take its own case as a modifier in a cased language. In contrast, these titles seem to be part of the same phrase as the name that follows them; they show case concord in a cased language.

Answer: This argument would imply that all languages use nmod only where the modifier is a complete noun phrase. This would make the very productive noun-noun compounds in French impossible to analyze: Joakim: I agree, there is nothing in the guidelines to suggest that an nmod needs to be a complete noun phrase.

Imprimante →? laser; accès → handicapés; espace → fumeur

Some grammatical traditions, descending from Latin, call French actor in such cases a “fixed (or close) apposition” and take the name as the head. UD has restricted the appos relation to following appositives (corresponding to “loose (or wide) apposition” in the Latin tradition). The relation appos is only used when you have two full nominals, typically joined loosely, and often separated by a punctuation mark like a comma. So appos is not correct for these cases. Sometimes the relation compound has been used, but this does not seem right. It implies headedness, and titles do not usually behave like compounds: in German, they are not joined to the following words, as compounds are normally joined in German, and they appear at the beginning of names in both German and Hebrew, even though German compounds are head last and Hebrew compounds are head first.

Answer: Is this an argument based on spelling conventions? Joakim: No. The restrictive use of the appos relation is certainly a point worth discussing, but I would say we have to live with it in v2 and use nmod for the other cases.

So compound does not seem appropriate either.

No, this does not follow.

Some UDv1 treebanks used name for honorifics like Mr., although some felt that was wrong and name should be restricted to joining the proper nouns of multi-word names. In UDv2, name was removed and replaced by flat, which allowed a broader notion of a chunk of unheaded material. In the UDv2 guidelines, cases of both titles and honorifics are joined to names with flat.

Work program of the MWE group

Joakim: It is not clear how this and the next section relate to the material above. One question is whether our work should start from (a refined version of) the plan below or from the tentative proposals above. Perhaps it needs to be a continuation of the two.

Issues to address, priorities

What is a MWE?

Classification of semantic and syntactic irregularities

| | Compositional | Semi-compositional | Non-compositional |
|---|---|---|---|
| Regular construction | Typical syntax (the dog slept) | [wide] awake, [heavy] smoker, rain [cats and dogs] | kick the bucket, green card, cats and dogs, in the light (of), Fr. pomme de terre 'potato' |
| Sub-system | Dates: 5th of July, tomorrow morning; Titles: Miss Smith | Ludwig van Beethoven in German (van is a Dutch word similar to Ger. von) | on top (of), in case (of), Fr. à côté (de) 'next (to)'; Meaningful dates: September 11th, 4th of July; Mount Rushmore, Fort Alamo |
| Irregular construction | Fr. peser lourd 'weigh a lot', lit. weigh heavy | Fr. cucul la praline 'very silly', lit. silly the praline | a) not to mention, a lot (ADJ-er), top of the range, Fr. Dieu sait quoi 'heaven knows what'; b) Fr. n'importe quoi 'anything', by and large; c) each other, Fr. à qui mieux mieux 'each trying to do better than the other', lit. to whom better better; d) ad hoc, Al Qaeda, Fr. parce que 'because' |

Table 1. Different types of MWEs

Bibliography

old stuff from the page: