home edit page issue tracker

This page pertains to UD version 2.

UD Persian PerDT

Language: Persian (code: fa)
Family: Indo-European, Iranian

This treebank has been part of Universal Dependencies since the UD v2.7 release.

The following people have contributed to making this treebank part of UD: Mohammad Sadegh Rasooli, Pegah Safari, Amirsaeid Moloodi, Alireza Nourian.

Repository: UD_Persian-PerDT
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.7

License: CC BY-SA 4.0

Genre: news, fiction, nonfiction, academic, web, blog

Questions, comments? General annotation questions (either Persian-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on Github. If you want to collaborate, please contact [rasooli (æt) seas • upenn • edu, pegh • safari (æt) gmail • com]. Development of the treebank happens in the UD repository but not directly in the final CoNLL-U files. You may submit bug fixes as pull requests against the dev branch but you have to go to the folder called not-to-release and locate the source files there. Contact the treebank maintainers if in doubt.

Annotation Source
Lemmas annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion
UPOS annotated manually in non-UD style, automatically converted to UD
XPOS annotated manually
Features annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion
Relations annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion

Description

The Persian Universal Dependency Treebank (PerUDT) is the result of automatic coversion of Persian Dependency Treebank (PerDT) with extensive manual corrections. Please refer to the follwoing work, if you use this data:

The Persian Universal Dependency Treebank (PerUDT) is based on Persian Dependency Treebank (PerDT) (Rasooli et al.,2013). The original Treebank consists of 29K sentences sampled from contemporary Persian text in different genres including: news, academic papers, magazine articles and fictions.

This treebank was annotated based on a language-specific schema and its automatic conversion involved three main steps: revising tokenization, POS mapping and dependency mapping.

In tokenization step, in order to separate multiword inflections of simple verbs grouped as one token in PerDT, we followed the guidelines in (Rasooli et al., 2013, Table 3) to automatically find the main verbs. Also we automatically separated pronominal clitics.

In POS conversion step, we used the state of the art BERT-based Persian NER tagger (Taher et al.,2020) with manual corrections to extend recall. Through seven different entities detected by tagger, we used Person and Location to mark PROPN tags.

PerDT contains 43 syntactic relations with no straightforward mapping for most of them, conjunctions arranged from the beginning of the sentence to the end and more importantly, prepositions regarded as the head of prepositional phrases and auxiliary verbs as the head of sentences. So we rearranged the order of conjunctions from end to the beginning through a script and tailored rules to convert each kind of relation to its UD version properly. Through the whole process and at the end of each step, we investigated the results and applied manual corrections if it was needed.

Acknowledgments

Thanks to Morteza Rezaei-Sharifabadi for helping with the copyright of this data.

Statistics of UD Persian PerDT

POS Tags

ADJADPADVAUXCCONJDETINTJNOUNNUMPARTPRONPROPNPUNCTSCONJVERBX

Features

MoodNumberPersonPolarityPronTypeTenseVerbFormVoice

Relations

acladvcladvmodamodapposauxaux:passcaseccccompcompoundcompound:lvcompound:lvcconjcopcsubjdepdetfixedflat:nameflat:numgoeswithiobjmarknmodnsubjnsubj:passnummodobjoblobl:argparataxispunctrootvocativexcomp

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

Relations Overview