home edit page issue tracker

This page pertains to UD version 2.

UD Persian Seraji

Language: Persian (code: fa)
Family: IE

This treebank has been part of Universal Dependencies since the UD v1.1 release.

The following people have contributed to making this treebank part of UD: Mojgan Seraji, Filip Ginter, Joakim Nivre, Martin Popel, Daniel Zeman.

Repository: UD_Persian-Seraji
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15

License: CC BY-SA 4.0

Genre: news, fiction, medical, legal, social, spoken, nonfiction

Questions, comments? General annotation questions (either Persian-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on Github. If you want to collaborate, please contact [mojgan • seraji96 (æt) gmail • com]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation Source
Lemmas annotated manually
UPOS annotated manually in non-UD style, automatically converted to UD
XPOS annotated manually
Features annotated manually, natively in UD style
Relations annotated manually, natively in UD style

Description

The Persian Universal Dependency Treebank (Seraji) is based on Uppsala Persian Dependency Treebank (UPDT). The conversion of the UPDT to the Universal Dependencies was performed semi-automatically with extensive manual checks and corrections.

The Persian Universal Dependency Treebank (Persian UD) is the converted version of the Uppsala Persian Dependency Treebank (UPDT) (Seraji, 2015). The treebank has its original annotation scheme based on Stanford Typed Dependencies (de Marneffe et al., 2006; de Marneffe and Manning, 2008). The scheme was extended for Persian to include the language specific syntactic relations that could not be covered by the primary scheme developed for English. The treebank consists of ca 6000 annotated and validated sentences of written texts with large domain variations, in terms of different genres (containing newspaper articles, fiction, technical descriptions, and documents about culture and art) and tokenization. The variations in the tokenization are due to the orthographic variations of compound words and fixed expressions in the language.

Apart from the universal annotation scheme and the general rules in UD, the Persian UD and the UPDT differ further in tokenization. All words containing unsegmented clitics (pronominal and copula clitics) annotated with complex labels in UPDT have been separated from the clitics and received distinct labels in the Persian UD.

The conversion of the UPDT to the Universal Dependencies has been carried out semi-automatically. In this process, we used a conversion script for reversing the head and dependent relations in the prepositional modifier (prep) and object of a preposition (pobj). Furthermore, we have used other scripts tailored for Persian to separate different types of clitics from their host. Subsequently we added different rules for rewriting the coarse-grained part-of-speech tags and the dependency labels. Morphological features were then mapped semi-automatically. In the current release, lemmas are added for a large number of tokens. This process is further done semi-automatically. The entire process has been manually validated.

Acknowledgments

The conversion of the UPDT to the Persian UD has been performed by Mojgan Seraji in collaboration with Filip Ginter. The annotations (PoS tags and dependency relations) were manually checked and corrected by Mojgan Seraji. The universal morphological features and lemmas were further added by Mojgan. The process has been carried out in consultation with Joakim Nivre. The original UPDT was also developed by Mojgan Seraji at Uppsala University. Mojgan is deeply thankful to Joakim Nivre and Carina Jahani for their consultations during the development of the UPDT.

Statistics of UD Persian Seraji

POS Tags

ADJADPADVAUXCCONJDETINTJNOUNNUMPARTPRONPUNCTSCONJVERBX

Features

CaseDegreeMoodNumberNumTypePersonPolarityPronTypeReflexTenseVerbForm

Relations

aclacl:relcladvcladvmodamodapposauxaux:passcasecccc:preconjccompcompoundcompound:lvccompound:prtconjcopdepdetdet:predetdiscoursedislocatedfixedflatflat:foreignmarknmodnmod:possnsubjnsubj:ncnsubj:passnummodobjoblparataxispunctrootvocativexcomp

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

Verbs with Reflexive Core Objects

Relations Overview