This is part of archived UD v1 documentation. See http://universaldependencies.org/ for the current version.

Segmentation in UD v2

The UD scheme distinguishes between tokens, which are word-like elements that can be identified using simple rules (often relying only on whitespace and punctuation), and words, which are the linguistically relevant units needed for morphological and syntactic analysis. In v1, it was assumed that words would never contain spaces, and that “multitoken words” should always be handled using special relations like mwe and goeswith. For v2, we propose to relax this assumption in two ways:

Throughout the remainder of this text, the symbol ⎵ will be used to indicate orthographic space.

Problems with current treatment of word segmentation

There are two main problems with the current treatment of word segmentation:

Spaces as syllable delimiters

There is near-unanimous agreement that words with spaces should be allowed in the Vietnamese treebank, because the alternative of treating all polysyllabic words as multiword expressions would artificially make Vietnamese look very different from all other languages. As far as we know, Vietnamese is the only language where this is necessary, but all tools will nevertheless need to support spaces in CoNLL-U columns. Consider the following example, “Minh is (a) teacher.”, where giáo viên is a bisyllabic word meaning “teacher”. (The example currently uses an underscore, “giáo⎵viên”, because even the tree visualization tool cannot work with word-internal spaces.)

If a language allows spaces inside words on a language-wide basis, this must be declared in the language-specific documentation.
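Because CoNLL-U columns are separated by tabs, a reader that splits each line on tabs only, never on arbitrary whitespace, keeps word-internal spaces intact. The following is a minimal sketch of that idea in Python. The function name read_forms and the sample rows are illustrative assumptions: only the ID and FORM columns are filled in, the remaining columns are left as "_" placeholders, and the sentence (including the copula là) is a reconstruction for illustration rather than treebank data.

```python
def read_forms(conllu_text):
    """Return the FORM column of every token line in a CoNLL-U string."""
    forms = []
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence boundaries) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        # Split on tabs only, so that spaces inside a column survive.
        columns = line.split("\t")
        if len(columns) != 10:
            raise ValueError(
                f"expected 10 tab-separated columns, got {len(columns)}"
            )
        forms.append(columns[1])
    return forms


# Hypothetical rows: only ID and FORM carry real values.
SAMPLE = "\n".join(
    "\t".join([str(i), form] + ["_"] * 8)
    for i, form in enumerate(["Minh", "là", "giáo viên", "."], start=1)
)

print(read_forms(SAMPLE))  # ['Minh', 'là', 'giáo viên', '.']
```

A tool that instead splits lines on any whitespace would see eleven fields in the third row and misread every column after FORM, which is why all tools need to be checked for this.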

Other cases

There was a general consensus that, for the remainder of the languages, we should essentially maintain the ban on spaces in words. However, we propose that for a highly restricted closed class of orthographic phenomena, we may make exceptions (with prior approval).

Spaces as numeral separators

In the existing French treebank, space-delimited numerals, e.g. “100 000”, are collapsed into a single numeral. For example, “… de 8 500 à 20 000 euros.” becomes:

We do not see this as an improvement over simply allowing the space. The alternative, having each 000 as a separate token and using goeswith or mwe, is unwieldy and does not quite fit: writing 100 000 is not an orthographic error but orthographically normative, and neither is it a fixed expression.
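As a rough illustration of keeping the space, a tokenizer could recognise space-grouped numerals with a single pattern and emit each one as one word. The sketch below is an assumption-laden example, not the tokenizer used for any treebank: the pattern, the name tokenise, and the choice to handle only the plain ASCII space are all hypothetical.

```python
import re

# First alternative: 1-3 digits followed by one or more groups of a space
# plus exactly three digits (e.g. "8 500", "20 000"), not followed by a
# further digit. Otherwise: a run of word characters, or one punctuation mark.
TOKEN_RE = re.compile(r"\d{1,3}(?: \d{3})+(?!\d)|\w+|[^\w\s]")


def tokenise(text):
    """Split text into words, keeping space-grouped numerals together."""
    return TOKEN_RE.findall(text)


print(tokenise("de 8 500 à 20 000 euros."))
# ['de', '8 500', 'à', '20 000', 'euros', '.']
```

A pattern like this cannot tell a grouped numeral from two unrelated adjacent numbers, so a real tokenizer would need additional context to decide such cases.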

The new tokenisation would be:

Spaces in normalising abbreviations

Spaces should be allowed in order to normalise abbreviations. In Swedish, for instance, “e.g.” can be written either “t.ex.” or “t ex”.

With space “t ex”:

Without space “t.ex.”:

Spaces between a syntactic word and a bound morpheme

In Tuvan, in some tenses, the person/number agreement is written separately from the verbal morpheme. We propose allowing these to be tokenised as one unit.
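Taken together, the cases in this proposal suggest a simple check that a validator could apply to any word form containing a space. The sketch below is only one possible reading of the proposal: the function space_allowed, its parameters, and the idea of passing the approved exceptions as an explicit set are hypothetical, and nothing here describes existing UD validation tooling.

```python
import re

# Space-grouped numerals such as "20 000" (ASCII space only, by assumption).
NUMERAL_RE = re.compile(r"\d{1,3}(?: \d{3})+")


def space_allowed(form, language_allows_spaces=False, approved_forms=frozenset()):
    """Return True if `form` may be used as a single word.

    language_allows_spaces: the language declares word-internal spaces
        language-wide (as proposed for Vietnamese).
    approved_forms: a closed set of individually approved exceptions,
        e.g. normalised abbreviations.
    """
    if " " not in form:
        return True  # No space, so nothing to check.
    if language_allows_spaces:
        return True
    if NUMERAL_RE.fullmatch(form):
        return True
    return form in approved_forms


print(space_allowed("giáo viên", language_allows_spaces=True))  # True
print(space_allowed("20 000"))                                  # True
print(space_allowed("t ex", approved_forms={"t ex"}))           # True
print(space_allowed("New York"))                                # False
```

Cases such as the Tuvan agreement markers cannot be captured by a surface pattern, so in this sketch they would have to be covered by the approved set.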
