This page pertains to UD version 2.

UD for Persian

UD Persian contains data from multiple treebanks created by different teams at different times, often with different conversion tools. Currently there are two treebanks: Seraji and PerDT.

Tokenization and Word Segmentation

Words are generally delimited by whitespace or punctuation. No tokens in any of the UD Persian corpora currently contain whitespace. The data contain some multi-word tokens, which are used for clitics.
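As a rough illustration of how these two properties can be checked, the sketch below scans CoNLL-U text for multi-word token lines (those whose ID is a range such as 1-2) and for word forms containing whitespace. The function name scan_conllu and the one-word sample are hypothetical: the sample is an invented illustration of a noun with a pronominal clitic, not a sentence taken from either treebank, and all annotation columns are left as underscores.

```python
def scan_conllu(text):
    """Return (multiword_tokens, forms_with_whitespace) for CoNLL-U text.

    multiword_tokens is a list of (form, first_id, last_id) tuples taken
    from range lines such as "1-2"; forms_with_whitespace lists every
    FORM value that contains a whitespace character.
    """
    multiword = []
    with_space = []
    for line in text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:  # CoNLL-U lines have exactly ten fields
            continue
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:  # range ID marks a multi-word token
            first, last = tok_id.split("-")
            multiword.append((form, int(first), int(last)))
        if any(ch.isspace() for ch in form):
            with_space.append(form)
    return multiword, with_space


# Invented example (not treebank data): one surface token split into
# a noun and an attached clitic.
SAMPLE = "\n".join([
    "# text = کتابم",
    "1-2\tکتابم\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tکتاب\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tم\t_\t_\t_\t_\t_\t_\t_\t_",
])

multiword_tokens, forms_with_whitespace = scan_conllu(SAMPLE)
```

To run the same check on a real file, read it as UTF-8 and pass its contents to scan_conllu. If the statement above holds for the data, the second list comes back empty.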

Morphology

See the documentation specific to each treebank.

Tags

See the documentation specific to each treebank.

Features

See the documentation specific to each treebank.

Syntax

Standard deprels are used, except for some of the following relations:

Treebanks

There are two Persian UD treebanks: