This is part of archived UD v1 documentation. See http://universaldependencies.org/ for the current version.

Tokenization

French tokenization follows the universal guidelines: contractions are undone (e.g., au becomes the two tokens à le). Otherwise, tokenization is based on white space and punctuation, except that hyphenated multiword expressions are not split (e.g., Etats-Unis “United States” and sous-marin “submarine” each remain one token).
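
The rules above can be illustrated with a minimal Python sketch. It is not the tokenizer used to build the treebank; the function name tokenize, the CONTRACTIONS table, and the regular expression are assumptions made for illustration. Only the au → à le mapping comes from the text above; aux → à les is added as a second example of the same kind. The sketch keeps every hyphenated word together and treats each other punctuation character as its own token, so it does not cover elided forms with apostrophes or ambiguous contractions.

```python
import re

# Hypothetical contraction table: only "au" is taken from the guidelines text;
# "aux" is added for illustration. Ambiguous forms are deliberately left out.
CONTRACTIONS = {"au": ["à", "le"], "aux": ["à", "les"]}

# A word is a run of word characters, optionally joined by internal hyphens
# (so "sous-marin" stays one token); any other non-space character is
# emitted as a single punctuation token.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    """Split text on white space and punctuation, then undo contractions."""
    tokens = []
    for tok in TOKEN_RE.findall(text):
        # Lookup is case-insensitive; the original capitalization of a
        # contraction is not preserved in this sketch.
        tokens.extend(CONTRACTIONS.get(tok.lower(), [tok]))
    return tokens

print(tokenize("Il va au sous-marin."))
```

A real pipeline would likely need a lexicon to decide which hyphenated strings count as multiword expressions, since this sketch never splits on a hyphen between two words.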