This page pertains to UD version 2.

Tokenization

The French tokenization follows the universal guidelines: contractions are undone (e.g., au becomes two tokens, à le). Otherwise, tokenization is based on white space and punctuation, except that hyphenated multiword expressions are not split (e.g., Etats-Unis “United States” and sous-marin “submarine” each remain one token).
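The rule above can be sketched in a few lines of Python. This is only a minimal illustration, not the tool used to build any treebank: the function name `tokenize` and the `CONTRACTIONS` table are hypothetical, and only the entry au → à le comes from the text (the other entries are assumptions added for the example). The sketch also keeps every hyphen-joined word together, which is a simplification of "multiword expressions with hyphens are not split".

```python
import re

# Illustrative contraction table. Only "au" -> "à le" is taken from the
# guideline text; "aux" and "du" are assumptions added for this example.
CONTRACTIONS = {
    "au": ["à", "le"],
    "aux": ["à", "les"],
    "du": ["de", "le"],
}

# A word (optionally joined to further words by hyphens) or a single
# punctuation mark. Hyphenated words such as "sous-marin" stay whole.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize(text):
    """Split on white space and punctuation, then undo contractions."""
    tokens = []
    for tok in TOKEN_RE.findall(text):
        # Replace a contraction by its parts; keep any other token as is.
        tokens.extend(CONTRACTIONS.get(tok.lower(), [tok]))
    return tokens


print(tokenize("Il va au sous-marin."))
```

Note that the lookup lowercases the token, so a sentence-initial Au is expanded to lowercase à le in this sketch.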

FrenchSpoken uses a strictly formal tokenization in which hyphens are treated as separate tokens. This means that, in a sentence like A-t-elle bien dormi ?, a-t-elle consists of five tokens: a, -, t, - and elle. The first hyphen is the head (i.e., it receives the nsubj link) and the remaining tokens are attached to it with goeswith:

a - t - elle bien dormi ? \n did she sleep well?
nsubj(a,-)
goeswith(-,t)
goeswith(--2,--4)
goeswith(-,elle)
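As a rough illustration of the analysis above, the following Python sketch splits every hyphen into its own token and then rebuilds the same four relations for a verb followed by a hyphen-separated subject. The function names `strict_tokenize` and `inverted_subject_edges` are hypothetical, and the caller is assumed to already know which token is the verb. Indices are 1-based, matching the --2 and --4 notation in the example.

```python
import re


def strict_tokenize(text):
    """Every hyphen and every punctuation mark becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def inverted_subject_edges(tokens, verb):
    """Build (relation, head, dependent) triples with 1-based indices.

    `verb` is the 0-based position of the verb in `tokens`. The first
    hyphen after the verb receives nsubj; every following token of the
    hyphen chain is attached to that hyphen with goeswith.
    """
    edges = []
    head = verb + 1  # position of the first hyphen
    if head >= len(tokens) or tokens[head] != "-":
        return edges  # no hyphen chain after the verb
    edges.append(("nsubj", verb + 1, head + 1))
    i = head + 1
    while i < len(tokens) and tokens[i][0].isalnum():
        edges.append(("goeswith", head + 1, i + 1))  # word after a hyphen
        if i + 1 < len(tokens) and tokens[i + 1] == "-":
            edges.append(("goeswith", head + 1, i + 2))  # next hyphen
            i += 2
        else:
            break
    return edges


tokens = strict_tokenize("A-t-elle bien dormi ?")
for relation, head, dependent in inverted_subject_edges(tokens, 0):
    print(f"{relation}({tokens[head - 1]}-{head}, {tokens[dependent - 1]}-{dependent})")
```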

As for the POS, the hyphens could be tagged PUNCT and the t PART (and, of course, elle PRON). This way of tokenizing and segmenting seems to make parsing easier and does not require an external resource, since everything is kept separate. We also do not have to decide where to attach the hyphens, as they are separate tokens. It thus becomes easier to analyze automatically, in a uniform way, cases like là-dessus and là-bas, where the hyphen belongs to the left part (là- and dessus or bas), and cases like est-elle, where we can choose to attach the hyphen to the right part (-elle).
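The uniform treatment can be illustrated with a small Python sketch that tags only the hyphen material, following the tentative suggestion above (hyphens as PUNCT, a t between two hyphens as PART). The function name `tag_hyphen_material` is hypothetical, and all other tokens are left untagged (None) because their POS would come from a real tagger.

```python
def tag_hyphen_material(tokens):
    """Tag hyphens as PUNCT and a euphonic t between hyphens as PART.

    All other tokens get None; tagging them is outside this sketch.
    """
    tags = []
    last = len(tokens) - 1
    for i, tok in enumerate(tokens):
        between_hyphens = (
            0 < i < last and tokens[i - 1] == "-" and tokens[i + 1] == "-"
        )
        if tok == "-":
            tags.append("PUNCT")
        elif tok == "t" and between_hyphens:
            tags.append("PART")
        else:
            tags.append(None)
    return tags


# The same rule applies whether the hyphen "belongs" to the left part
# (là-dessus) or to the right part (est-elle).
for example in (["là", "-", "dessus"], ["est", "-", "elle"], ["a", "-", "t", "-", "elle"]):
    print(list(zip(example, tag_hyphen_material(example))))
```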

N.B.: This tokenization and segmentation choice is arbitrary, and other French treebanks could choose otherwise (for example, they could treat a-t- or -t-elle as a single token, or annotate the -t- with expl).