This page pertains to UD version 2.

Word segmentation in UD v2

The UD scheme distinguishes between tokens and words. Tokens are word-like elements that can be identified using simple rules, often relying (only) on whitespace and punctuation; words are the linguistically relevant units needed for morphological and syntactic analysis. In v1, it was assumed that words would never contain spaces, and that “multitoken words” should always be handled using special relations like u-dep/fixed and u-dep/goeswith. For v2, we relax this assumption in two ways: spaces inside words may be allowed on a language-wide basis (as in Vietnamese), and in all other languages they may be allowed for a highly restricted closed class of orthographic phenomena.

Throughout the remainder of this text, the symbol ⎵ will be used to indicate orthographic space.

Problems with current treatment of word segmentation

There are two main problems with the current treatment of word segmentation:

Spaces as syllable delimiters

There is unanimous agreement that words with spaces should be allowed in the Vietnamese treebank, because the alternative of treating all polysyllabic words as multiword expressions would artificially make Vietnamese look very different from all other languages. As far as we know, Vietnamese is the only language where this is necessary; nevertheless, all tools will need to support spaces in CoNLL-U columns. Consider the following example, “Minh is (a) teacher.”, where giáo viên is a bisyllabic word meaning “teacher”. (Currently using underscore, “giáo⎵viên”, because even the tree visualization tool cannot work with word-internal spaces.)

1	Minh	Minh	PROPN	_	_	3	nsubj	_	_
2	là	là	VERB	_	_	3	cop	_	_
3	giáo⎵viên	giáo⎵viên	NOUN	_	_	0	root	_	_
4	.	.	PUNCT	_	_	3	punct	_	_

If a language allows spaces inside words on a language-wide basis, this must be declared in the language-specific documentation.
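
Because a word may now contain a space, a reader for the tab-separated CoNLL-U columns has to split on tabs only, never on arbitrary whitespace. The following is a minimal Python sketch of that idea, assuming the ten tab-separated columns used in the example above. The function name `parse_conllu_line` and the field names are illustrative and not part of any UD tool.

```python
# Illustrative field names for the ten CoNLL-U columns (an assumption of this sketch).
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

def parse_conllu_line(line):
    """Split one CoNLL-U word line on tabs only, so spaces inside columns survive."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
    return dict(zip(FIELDS, columns))

# FORM and LEMMA contain a real space here (a hypothetical rendering of the example).
word = parse_conllu_line("3\tgiáo viên\tgiáo viên\tNOUN\t_\t_\t0\troot\t_\t_")
print(word["form"])
```

A reader that splits on any whitespace would see eleven or more fields in such a line, which is why the sketch rejects lines whose tab-separated column count is not ten.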

Other cases

For the remainder of the languages, we should essentially maintain the ban on spaces in words. However, we propose that for a highly restricted closed class of orthographic phenomena, we may make exceptions (with prior approval).

Spaces as numeral separators

In the existing French treebank, space-delimited numerals such as “100 000” are collapsed into a single numeral without the space. For example, “… de 8 500 à 20 000 euros.” becomes:

1	de	de	ADP	_	_	2	case	_	_
2	8500	8500	NUM	_	_	0	nmod	_	_
3	à	à	ADP	_	_	5	case	_	_
4	20000	20000	NUM	_	_	5	nummod	_	_
5	euros	euro	NOUN	_	Gender=Masc|Number=Plur	0	nmod	_	_

We do not see this as an improvement over simply allowing the space. The alternative, treating each 000 as a separate token attached with goeswith or mwe, is unwieldy and does not fit exactly: writing 100 000 is not an orthographic error but orthographically normative, and it is not a fixed expression either.

The new tokenisation would be:

de 8⎵500 à 20⎵000 euros \n from 8,500 to 20,000 euros
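
One way a tokeniser could recognise such numerals is sketched below in Python. It assumes, beyond what this text states, that a space-delimited numeral consists of one to three digits followed by one or more groups of exactly three digits, each preceded by a single space. The names `SPACED_NUMERAL` and `tokenise` are hypothetical, and punctuation splitting is left out for brevity.

```python
import re

# Assumption: a leading group of 1-3 digits, then groups of exactly three digits,
# each preceded by a single space. All other words are runs of non-space characters.
SPACED_NUMERAL = re.compile(r"\d{1,3}(?: \d{3})+(?!\d)|\S+")

def tokenise(text):
    """Return the words of text, keeping space-delimited numerals as single words."""
    return SPACED_NUMERAL.findall(text)

words = tokenise("de 8 500 à 20 000 euros")
# Show word-internal spaces with the ⎵ symbol used in this text.
print(" ".join(w.replace(" ", "⎵") for w in words))
```

On the French example this prints `de 8⎵500 à 20⎵000 euros`. A pattern like this cannot tell a numeral from two adjacent numbers that happen to fit the digit grouping, so a real tokeniser would need further checks.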

Spaces in normalising abbreviations

Spaces should also be allowed in order to normalise abbreviations. In Swedish, for example, “e.g.” can be written either “t.ex.” or “t ex”.

With space “t ex”:

Idrottsmateriel t⎵ex spikskor , kompass , kartfodral \n Sporting.goods e.⎵g. spiked.shoes , compass , map.holder

Without space “t.ex.”:

Det gäller t.ex. säsongarbetslösa byggnadsarbetare . \n This applies.to e.g. seasonally.unemployed building.workers .
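
Since such exceptions are meant to form a closed class, a tokeniser could handle them with an explicit list. The Python sketch below merges adjacent whitespace-separated tokens into one word when the pair appears in such a list. The list contents and the names `SPACED_ABBREVIATIONS` and `merge_abbreviations` are hypothetical, and two-part abbreviations are assumed for simplicity.

```python
# Hypothetical closed list of approved abbreviations that are written with a space.
SPACED_ABBREVIATIONS = {("t", "ex")}

def merge_abbreviations(tokens):
    """Join adjacent tokens that form a listed abbreviation into a single word."""
    words, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in SPACED_ABBREVIATIONS:
            # Keep the orthographic space inside the word.
            words.append(" ".join(pair))
            i += 2
        else:
            words.append(tokens[i])
            i += 1
    return words

tokens = "Idrottsmateriel t ex spikskor , kompass , kartfodral".split()
words = merge_abbreviations(tokens)
print(words)
```

Here “t ex” comes out as one word containing a space, while every other token is left unchanged.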

Spaces between a syntactic word and a bound morpheme

In Tuvan, in some tenses, the person/number agreement marker is written separately from the verbal morpheme. We propose allowing these to be tokenised as one unit.

Мен Кызылга чурттап турган⎵мен .  \n I Kyzyl.to living standing.am
nsubj(чурттап, Мен)
aux(чурттап, турган⎵мен)
nmod(чурттап, Кызылга)