UD for Portuguese

UD Portuguese contains data from multiple treebanks created by different teams at different times, often with different conversion tools. As a result, differences may be found across treebanks, though we work to harmonize them as issues are identified.

Tokenization and Word Segmentation

Words are generally delimited by whitespace or punctuation. No tokens in any of the UD Portuguese corpora currently contain whitespace. Most corpora use multiword tokens, since contractions are ubiquitous in Portuguese and involve both prepositions and verbs. Examples include do = de+o (of+the) and fi-lo = fiz+o (did+it).
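As a minimal sketch of how such a contraction appears in a CoNLL-U file, the snippet below builds a small illustrative fragment (not taken from any of the treebanks) in which the multiword token do occupies the ID range 3-4 and is followed by its syntactic words de and o. The helper name multiword_tokens and the sample rows are assumptions made for illustration; only the range-ID convention comes from the CoNLL-U format itself.

```python
def row(*cols):
    """Join the ten CoNLL-U columns of one line with tabs."""
    return "\t".join(cols)

# Illustrative fragment for "A casa do rei"; unused columns are left as "_".
MWT_SAMPLE = "\n".join([
    row("1", "A", "o", "DET", "_", "_", "_", "_", "_", "_"),
    row("2", "casa", "casa", "NOUN", "_", "_", "_", "_", "_", "_"),
    row("3-4", "do", "_", "_", "_", "_", "_", "_", "_", "_"),  # multiword token
    row("3", "de", "de", "ADP", "_", "_", "_", "_", "_", "_"),
    row("4", "o", "o", "DET", "_", "_", "_", "_", "_", "_"),
    row("5", "rei", "rei", "NOUN", "_", "_", "_", "_", "_", "_"),
])

def multiword_tokens(conllu_text):
    """Return (surface form, [syntactic word forms]) for each multiword token."""
    rows = [line.split("\t") for line in conllu_text.splitlines()
            if line and not line.startswith("#")]
    # Syntactic words have plain integer IDs.
    words = {r[0]: r[1] for r in rows if r[0].isdigit()}
    result = []
    for r in rows:
        if "-" in r[0]:  # range ID such as "3-4"
            start, end = (int(x) for x in r[0].split("-"))
            parts = [words[str(i)] for i in range(start, end + 1)]
            result.append((r[1], parts))
    return result

print(multiword_tokens(MWT_SAMPLE))  # [('do', ['de', 'o'])]
```

The range line carries only the surface form; all linguistic annotation sits on the syntactic words that follow it.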

Units that are generally tokenized apart include:

Units that are not tokenized apart include:



This is an overview only. We follow the UD guidelines as closely as possible. In addition, we use MWEPOS in the MISC field to specify the POS tag of a multiword expression as a whole.

Multiword expressions

As documented in [1], indicating the POS tag is particularly relevant for ‘fixed’ MWEs, as these expressions are crystallized in such a way that their components can have completely different POS tags from the expression as a whole. Having the POS tag of the entire MWE in the MISC field helps to justify some dependency relations. We have adopted MWEPOS=VAL for such cases, where VAL is any valid value for the UPOSTAG field.
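The MWEPOS convention can be read back from a MISC value with a few lines of code. The following is a rough sketch only: the function names parse_misc and mwepos are hypothetical, and the example MISC strings are invented for illustration. It assumes the usual CoNLL-U layout of MISC as "_" or a list of Key=Value items separated by "|", and checks the value against the 17 universal POS tags.

```python
# The 17 universal POS tags accepted in the UPOSTAG field.
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

def parse_misc(misc):
    """Split a MISC value into a dict; "_" means the field is empty."""
    if misc == "_":
        return {}
    attrs = {}
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        attrs[key] = value
    return attrs

def mwepos(misc):
    """Return the MWEPOS value of a MISC field, or None if it is absent."""
    value = parse_misc(misc).get("MWEPOS")
    if value is not None and value not in UPOS_TAGS:
        raise ValueError(f"MWEPOS value is not a valid UPOS tag: {value!r}")
    return value

print(mwepos("SpaceAfter=No|MWEPOS=ADV"))  # ADV
print(mwepos("_"))                         # None
```

Raising an error on an unknown tag is a design choice for this sketch; a validation script might prefer to collect such cases and report them together.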


Morphological features are included in all corpora, though only partially in some. In some corpora they are added automatically using MorphoBr, and in some cases they are supplemented with information from other annotation layers (e.g. Bosque).



Only ser and estar should be considered copulas.
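A rule like this lends itself to an automatic check. The sketch below is one possible way to flag tokens attached with the cop relation whose lemma is neither ser nor estar. The function name copula_violations and both sample fragments are hypothetical and are not drawn from any treebank; the second fragment is deliberately wrong to show what a violation looks like.

```python
ALLOWED_COPULAS = {"ser", "estar"}

def copula_violations(conllu_text):
    """List (id, form, lemma) for cop-attached words with a disallowed lemma."""
    violations = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges and anything that is not a word line.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        lemma, deprel = cols[2], cols[7]
        if deprel.split(":")[0] == "cop" and lemma not in ALLOWED_COPULAS:
            violations.append((cols[0], cols[1], lemma))
    return violations

def row(*cols):
    """Join the ten CoNLL-U columns of one line with tabs."""
    return "\t".join(cols)

# Illustrative fragment for "Ela é feliz".
GOOD = "\n".join([
    row("1", "Ela", "ela", "PRON", "_", "_", "3", "nsubj", "_", "_"),
    row("2", "é", "ser", "AUX", "_", "_", "3", "cop", "_", "_"),
    row("3", "feliz", "feliz", "ADJ", "_", "_", "0", "root", "_", "_"),
])
# Deliberately incorrect variant: "ficar" annotated as a copula.
BAD = GOOD.replace("é\tser", "fica\tficar")

print(copula_violations(GOOD))  # []
print(copula_violations(BAD))   # [('2', 'fica', 'ficar')]
```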

The following relation subtypes are used: passives (nsubj:pass, csubj:pass), possessives (nmod:poss), predeterminers (det:predet for “ambos” in “ambos os filhos”), and preconjunctions (cc:preconj for “ou” in “ou X ou Y”).
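To see which subtyped relations a given file actually uses, one can count the DEPREL values that contain a colon. The snippet below is a small sketch under that assumption; the function name subtyped_relations and the fragment for “ambos os filhos” are illustrative, with all columns other than ID, FORM, HEAD and DEPREL left unspecified.

```python
from collections import Counter

def row(*cols):
    """Join the ten CoNLL-U columns of one line with tabs."""
    return "\t".join(cols)

# Illustrative fragment for "ambos os filhos"; unused columns are "_".
SUBTYPE_SAMPLE = "\n".join([
    row("1", "ambos", "_", "_", "_", "_", "3", "det:predet", "_", "_"),
    row("2", "os", "_", "_", "_", "_", "3", "det", "_", "_"),
    row("3", "filhos", "_", "_", "_", "_", "0", "root", "_", "_"),
])

def subtyped_relations(conllu_text):
    """Count DEPREL values of the form relation:subtype."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Only word lines carry a dependency relation.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if ":" in cols[7]:
            counts[cols[7]] += 1
    return counts

print(subtyped_relations(SUBTYPE_SAMPLE))  # Counter({'det:predet': 1})
```

Running such a count per treebank is one way to spot the cross-treebank differences that harmonization aims to remove.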

Auxiliary verbs



There are 4 Portuguese UD treebanks: