
This page pertains to UD version 2.

UD for Breton

Tokenisation and Word Segmentation

Tokenisation was originally done using the Apertium morphological analyser for Breton, which joins certain multiword expressions containing spaces into single tokens. Where the number of spaces in the original token matches the number of spaces in the multiword token, these are split into separate tokens in UD: the first token receives the part of speech of the multiword token, and each subsequent token receives the part of speech X and is attached with the fixed relation.
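The splitting rule above can be sketched in a few lines of Python. This is only an illustration, not the conversion script used for the treebank: the function name `split_spaced_token`, the dictionary layout, and the placeholder forms `w1 w2 w3` are hypothetical, and attaching the later words to the first word follows the general UD convention for `fixed` rather than anything stated explicitly here.

```python
def split_spaced_token(surface, analysed, upos, start_id=1):
    """Split an analyser token containing spaces into separate UD words.

    Returns None when the space counts differ, because only the matching
    case is described in the text.
    """
    if surface.count(" ") != analysed.count(" "):
        return None
    words = []
    for offset, form in enumerate(surface.split(" ")):
        if offset == 0:
            # The first word keeps the multiword token's part of speech;
            # its own head and relation are decided elsewhere.
            words.append({"id": start_id, "form": form, "upos": upos,
                          "head": None, "deprel": None})
        else:
            # Later words get X and attach to the first word as `fixed`
            # (assumption: head is the first word, per UD convention).
            words.append({"id": start_id + offset, "form": form,
                          "upos": "X", "head": start_id, "deprel": "fixed"})
    return words


# Placeholder forms, not real Breton data.
example = split_spaced_token("w1 w2 w3", "w1 w2 w3", "ADP", start_id=4)
```

Returning `None` for a mismatch keeps the undescribed case visible to the caller instead of guessing at a segmentation.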

The most important tokenisation decision concerns the words traditionally described as inflected or conjugated prepositions. Here we analyse them as contractions of prepositions and pronouns. For example, dit is tokenised as a multiword token constructed from da “to” and it “you”.
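As a rough illustration of how such a contraction is laid out in CoNLL-U, the sketch below builds the range line for the surface form followed by one line per syntactic word. The helper name `contraction_lines` is hypothetical, and lemmas, features, heads and relations are left as `_` because they are not specified here; the ADP and PRON tags simply reflect the preposition plus pronoun analysis.

```python
def contraction_lines(surface, parts, start_id=1):
    """Build CoNLL-U lines for a multiword token and its component words.

    `parts` is a list of (form, upos) pairs. Only the ID, FORM and UPOS
    columns are filled in; every other column is left as "_".
    """
    end_id = start_id + len(parts) - 1
    # Range line: ID and FORM, then eight unspecified columns (10 total).
    lines = ["\t".join([f"{start_id}-{end_id}", surface] + ["_"] * 8)]
    for offset, (form, upos) in enumerate(parts):
        # Word line: ID, FORM, LEMMA, UPOS, then six unspecified columns.
        fields = [str(start_id + offset), form, "_", upos] + ["_"] * 6
        lines.append("\t".join(fields))
    return lines


# dit = da "to" + it "you", as in the example above.
dit = contraction_lines("dit", [("da", "ADP"), ("it", "PRON")])
print("\n".join(dit))
```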



Some comments on various parts of speech:

Words tagged as AUX:


Auxiliary verbs:

Verbal “particles”:





The following relation subtypes are used in the Breton data:


There is 1 Breton UD treebank: