
This page pertains to UD version 2.

UD Persian PerDT

Language: Persian (code: fa)
Family: Indo-European, Iranian

This treebank has been part of Universal Dependencies since the UD v2.7 release.

The following people have contributed to making this treebank part of UD: Mohammad Sadegh Rasooli, Pegah Safari, Amirsaeid Moloodi, and Alireza Nourian. Please refer to the following work if you use this data:

Repository: UD_Persian-PerDT

License: CC BY-SA 4.0

Genre: news, fiction, medical, legal, social, spoken, nonfiction

Questions, comments? General annotation questions (either Persian-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [rasooli • seas.upenn.edu]. Development of the treebank happens in a parent repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Please submit pull requests against the dev branch of the UD repository.

Annotation | Source
Lemmas | annotated manually
UPOS | annotated manually in non-UD style, automatically converted to UD
XPOS | annotated manually
Features | annotated manually, automatically converted to UD
Relations | annotated manually, automatically converted to UD with some manual corrections

Description

The Persian Universal Dependency Treebank (PerUDT) is based on the Persian Dependency Treebank (PerDT) (Rasooli et al., 2013). The original treebank consists of 29K sentences sampled from contemporary Persian text in different genres, including news, academic papers, magazine articles, and fiction.

PerDT was annotated with a language-specific scheme. Its automatic conversion to UD involved three main steps: revising the tokenization, mapping the POS tags, and mapping the dependencies.

In the tokenization step, we separated the multiword inflections of simple verbs, which PerDT groups into a single token. To find the main verbs automatically, we followed the guidelines in Rasooli et al. (2013, Table 3). We also separated pronominal clitics automatically.

In the POS conversion step, we used the state-of-the-art BERT-based Persian NER tagger (Taher et al., 2020), with manual corrections to improve recall. Of the seven entity types detected by the tagger, we used Person and Location to assign PROPN tags.
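The PROPN assignment described above can be sketched as follows. This is a minimal illustration, not the conversion code used for the treebank: the entity labels (`PER`, `LOC`), the BIO prefixes, and the restriction to tokens already tagged NOUN are all assumptions made for the example.

```python
# Hypothetical entity labels; the tagger's actual label set may differ.
PROPN_ENTITIES = {"PER", "LOC"}


def apply_propn(upos, ner_labels):
    """Return a copy of `upos` in which NOUN tokens inside a Person or
    Location entity are retagged as PROPN."""
    result = list(upos)
    for i, label in enumerate(ner_labels):
        # Drop an assumed BIO prefix such as "B-" or "I-"; "O" stays "O".
        entity = label.split("-", 1)[-1]
        # Assumption: only nouns are retagged, so that function words
        # inside an entity span keep their original tag.
        if entity in PROPN_ENTITIES and upos[i] == "NOUN":
            result[i] = "PROPN"
    return result
```

Entity types other than Person and Location leave the tags unchanged, which mirrors the choice stated above to use only two of the seven entity types.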

PerDT uses 43 syntactic relations, most of which have no straightforward mapping to UD. It also differs structurally: conjunctions are arranged from the beginning of the sentence to the end, prepositions are treated as the heads of prepositional phrases, and auxiliary verbs are treated as the heads of sentences. We therefore used a script to rearrange the order of conjunctions from the end to the beginning, and wrote tailored rules to convert each kind of relation to its UD counterpart. Throughout the process, we inspected the results at the end of each step and applied manual corrections where needed.
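To illustrate one of the structural changes, the sketch below re-attaches a prepositional phrase so that the complement becomes the head and the preposition depends on it as `case`, as UD requires. It is an assumption-laden simplification rather than the actual conversion rules: the token dictionaries, the source relation names in the usage example, and the choice between `nmod` and `obl` based only on the governor's UPOS are all hypothetical.

```python
NOMINAL_UPOS = {"NOUN", "PROPN", "PRON"}


def flip_prepositions(tokens):
    """tokens: list of dicts with keys id, upos, head, deprel.
    Returns a new list in which each preposition depends on its complement."""
    out = [dict(t) for t in tokens]
    by_id = {t["id"]: t for t in out}
    for src in tokens:  # read the unmodified source tree
        if src["upos"] != "ADP":
            continue
        # Assumption: the first source dependent of the preposition is its complement.
        children = [t for t in tokens if t["head"] == src["id"]]
        if not children:
            continue
        prep = by_id[src["id"]]
        comp = by_id[children[0]["id"]]
        governor = by_id.get(src["head"])
        # The complement takes over the preposition's attachment point.
        comp["head"] = src["head"]
        # Simplification: nmod under a nominal governor, obl otherwise.
        if governor is not None and governor["upos"] in NOMINAL_UPOS:
            comp["deprel"] = "nmod"
        else:
            comp["deprel"] = "obl"
        # The preposition becomes a case marker on its former dependent.
        prep["head"] = comp["id"]
        prep["deprel"] = "case"
    return out


# Usage with English glosses; "PP" and "COMP" are placeholder source labels.
source = [
    {"id": 1, "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 2, "upos": "ADP", "head": 1, "deprel": "PP"},
    {"id": 3, "upos": "NOUN", "head": 2, "deprel": "COMP"},
]
converted = flip_prepositions(source)
```

A real rule set would also need to handle other dependents of the preposition and the source relation it carried, which this sketch ignores.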

References

Acknowledgments

Thanks to Morteza Rezaei-Sharifabadi for helping with the copyright of this data.

Statistics of UD Persian PerDT

Split | #Sent. | #Tok. | #Word types | #Lemmas | #Verbs
Train | 26196 | 459K | 34.9K | 20.7K | 5275
Dev | 1456 | 26K | 7.0K | 5.2K | 1427
Test | 1455 | 24K | 6.7K | 5.1K | 1671
All | 29107 | 501K | 36.7K | 21.6K | 5413
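Counts of this kind can be recomputed from the released CoNLL-U files with a short script. The sketch below is one possible way to do it and rests on assumptions: this page does not define the columns, so it is assumed here that tokens are syntactic words (multiword-token ranges and empty nodes are skipped) and that "#Verbs" counts distinct lemmas tagged VERB. The results may therefore not match the table exactly.

```python
def conllu_stats(lines):
    """Count sentences, words, word types, lemmas, and verb lemmas
    in an iterable of CoNLL-U lines."""
    sentences = 0
    words = 0
    forms, lemmas, verb_lemmas = set(), set(), set()
    for line in lines:
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:  # CoNLL-U token lines have ten columns
            continue
        token_id, form, lemma, upos = cols[0], cols[1], cols[2], cols[3]
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in token_id or "." in token_id:
            continue
        if token_id == "1":  # word IDs restart at 1 in every sentence
            sentences += 1
        words += 1
        forms.add(form)
        lemmas.add(lemma)
        if upos == "VERB":
            verb_lemmas.add(lemma)
    return {
        "sentences": sentences,
        "words": words,
        "word_types": len(forms),
        "lemmas": len(lemmas),
        "verb_lemmas": len(verb_lemmas),
    }


# Usage (the file name is an assumption based on the usual UD naming scheme):
# with open("fa_perdt-ud-train.conllu", encoding="utf-8") as f:
#     print(conllu_stats(f))
```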

POS Tags

ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – VERB – X

Features

Mood – Number – Person – Polarity – PronType – Tense – VerbForm – Voice

Relations

acl – advcl – advmod – amod – appos – aux – aux:pass – case – cc – ccomp – compound – compound:lv – compound:lvc – conj – cop – csubj – dep – det – dislocated – fixed – flat:name – flat:num – goeswith – iobj – mark – nmod – nsubj – nsubj:pass – nummod – obj – obl – obl:arg – parataxis – punct – root – vocative – xcomp

Tokenization and Word Segmentation

Morphology

Tags

Pronouns, Determiners, Quantifiers

Syntax

Auxiliary Verbs and Copula

Relations Overview