
This page pertains to UD version 2.


Tokenization

The French tokenization follows the universal guidelines: contractions are undone (e.g., au becomes two tokens, à + le). Otherwise, tokenization is based on white space and punctuation, except that the symbols - and ’ are not split when they occur inside a named entity or a single word (Etats-Unis, sous-marin and aujourd’hui are not split).
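As an illustration of undoing contractions, here is a minimal Python sketch. The function name `expand_contractions` and the lookup table are hypothetical; only the au → à + le entry comes from the text above, and the other entries are standard French contractions added as assumptions. It is not the tool used to build the treebanks.

```python
# Hypothetical lookup table. Only "au" is taken from the guidelines text;
# "aux" and "du" are added for illustration and the table is not exhaustive.
CONTRACTIONS = {
    "au": ["à", "le"],
    "aux": ["à", "les"],
    "du": ["de", "le"],
}


def expand_contractions(tokens):
    """Replace each contracted token with its component tokens.

    Matching is case-insensitive; the components are emitted in lower case,
    so the original capitalization of a contraction is not preserved.
    """
    expanded = []
    for token in tokens:
        # Tokens that are not in the table are passed through unchanged.
        expanded.extend(CONTRACTIONS.get(token.lower(), [token]))
    return expanded


print(expand_contractions(["Il", "va", "au", "marché"]))
```

A real pipeline would also need to decide, in context, whether a form is a contraction at all; this sketch treats every listed form as one.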

When the symbol - occurs between two different syntactic units, it is kept with the second part (usually a pronoun). Ex: vient-il → vient + -il. The apostrophe (’) is kept with the preceding part. Ex: l’école → l’ + école and j’arrive → j’ + arrive.
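The hyphen and apostrophe rules can be sketched in Python as follows. This is only an approximation under stated assumptions: `KEEP_WHOLE`, `CLITIC_PRONOUNS`, and the function names are hypothetical, the word lists are far from exhaustive, and the sketch decides whether a hyphen separates two syntactic units by checking for a known pronoun after it, which is a simplification of the guideline.

```python
import re

# Hypothetical exception list: words that contain - or ’ but stay whole.
KEEP_WHOLE = {"aujourd’hui", "etats-unis", "sous-marin"}

# Hypothetical, non-exhaustive list of pronouns that may follow a hyphen.
CLITIC_PRONOUNS = {
    "je", "tu", "il", "elle", "on", "nous", "vous", "ils", "elles",
    "ce", "le", "la", "les", "lui", "leur", "moi", "toi", "y", "en",
}

# A chunk is either text ending in an apostrophe, or text with no apostrophe.
APOSTROPHE_CHUNK = re.compile(r"[^’']*[’']|[^’']+")


def split_hyphen(piece):
    """Peel hyphenated pronouns off the end; the hyphen goes with the pronoun."""
    suffixes = []
    while "-" in piece:
        head, _, tail = piece.rpartition("-")
        if not head or tail.lower() not in CLITIC_PRONOUNS:
            break
        suffixes.insert(0, "-" + tail)
        piece = head
    return [piece] + suffixes


def split_word(word):
    """Split one whitespace-delimited word into tokens."""
    if word.lower() in KEEP_WHOLE:
        return [word]
    tokens = []
    # The apostrophe stays attached to the part that precedes it.
    for chunk in APOSTROPHE_CHUNK.findall(word):
        tokens.extend(split_hyphen(chunk))
    return tokens


for example in ["vient-il", "l’école", "j’arrive", "aujourd’hui", "sous-marin"]:
    print(example, "→", " + ".join(split_word(example)))
```

Using an explicit exception list keeps words such as aujourd’hui intact; a production tokenizer would more likely rely on a lexicon or named-entity information for that decision.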