
This page pertains to UD version 2.


UD Persian Seraji

Language: Persian (code: fa)
Family: IE

This treebank has been part of Universal Dependencies since the UD v1.1 release.

The following people have contributed to making this treebank part of UD: Mojgan Seraji, Filip Ginter, Joakim Nivre, Martin Popel, Daniel Zeman, Minoo Nassajian.

Repository: UD_Persian-Seraji
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.17

License: CC BY-SA 4.0

Genre: news, fiction, medical, legal, social, spoken, nonfiction

Questions, comments? General annotation questions (either Persian-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on Github. If you want to collaborate, please contact [mojgan • seraji96 (æt) gmail • com]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation	Source
Lemmas	annotated manually
UPOS	annotated manually in non-UD style, automatically converted to UD
XPOS	annotated manually
Features	annotated manually, natively in UD style
Relations	annotated manually, natively in UD style

Description

The Persian Universal Dependency Treebank (Seraji) is based on the Uppsala Persian Dependency Treebank (UPDT). The conversion of the UPDT to Universal Dependencies was performed semi-automatically with extensive manual checks and corrections.

The Persian Universal Dependency Treebank (Persian UD) is the converted version of the Uppsala Persian Dependency Treebank (UPDT) (Seraji, 2015). The original annotation scheme of the treebank is based on Stanford Typed Dependencies (de Marneffe et al., 2006; de Marneffe and Manning, 2008). The scheme was extended for Persian to include language-specific syntactic relations that the primary scheme, developed for English, could not cover. The treebank consists of approximately 6,000 annotated and validated sentences of written text. The texts vary widely in genre (newspaper articles, fiction, technical descriptions, and documents about culture and art) and in tokenization. The variation in tokenization stems from the orthographic variation of compound words and fixed expressions in the language.

Apart from the universal annotation scheme and the general rules in UD, the Persian UD and the UPDT also differ in tokenization. In the UPDT, words containing unsegmented clitics (pronominal and copula clitics) are annotated with complex labels. In the Persian UD, these words have been separated from their clitics, and each part has received its own label.

The conversion of the UPDT to Universal Dependencies was carried out semi-automatically. First, we used a conversion script to reverse the head and dependent in the prepositional modifier (prep) and object of a preposition (pobj) relations. We then used other scripts tailored for Persian to separate the different types of clitics from their hosts. Subsequently, we added rules for rewriting the coarse-grained part-of-speech tags and the dependency labels. Morphological features were then mapped semi-automatically. In the current release, lemmas have been added for a large number of tokens, again semi-automatically. The entire process has been manually validated.
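The head reversal for prep and pobj can be pictured with a small sketch. This is not the conversion script used for the treebank. It is a minimal illustration that assumes each word is a dict with hypothetical keys `id`, `head`, `upos` and `deprel`, and that the new label is `obl` under a verbal governor and `nmod` otherwise (a simplification of the UD v2 guidelines). The three-word sentence is a toy example, not taken from the corpus.

```python
def reverse_prep_pobj(words):
    """Rewrite governor -prep-> P -pobj-> N as governor -obl/nmod-> N -case-> P.

    `words` is a list of dicts with the (hypothetical) keys
    id, form, upos, head, deprel. The list is modified in place.
    Other dependents of the preposition are left untouched in this sketch.
    """
    by_id = {w["id"]: w for w in words}
    for noun in words:
        if noun["deprel"] != "pobj":
            continue
        prep = by_id[noun["head"]]
        if prep["deprel"] != "prep":
            continue
        governor = by_id.get(prep["head"])  # None if the preposition attaches to the root (0)
        # Assumption: obl under verbs, nmod under everything else.
        is_verbal = governor is not None and governor["upos"] == "VERB"
        noun["head"], noun["deprel"] = prep["head"], "obl" if is_verbal else "nmod"
        prep["head"], prep["deprel"] = noun["id"], "case"
    return words


# Toy input in the Stanford-style analysis: the preposition heads the noun.
toy = [
    {"id": 1, "form": "به", "upos": "ADP", "head": 3, "deprel": "prep"},
    {"id": 2, "form": "تهران", "upos": "NOUN", "head": 1, "deprel": "pobj"},
    {"id": 3, "form": "رفت", "upos": "VERB", "head": 0, "deprel": "root"},
]
converted = reverse_prep_pobj(toy)
for w in converted:
    print(w["id"], w["form"], w["head"], w["deprel"])
```

After the call, the noun attaches directly to the verb and the preposition becomes its `case` dependent, which is the direction UD uses for adpositions.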

Acknowledgments

The conversion of the UPDT to the Persian UD was performed by Mojgan Seraji in collaboration with Filip Ginter. The annotations (PoS tags and dependency relations) were manually checked and corrected by Mojgan Seraji, who also added the universal morphological features and lemmas. The process was carried out in consultation with Joakim Nivre. The original UPDT was also developed by Mojgan Seraji at Uppsala University. Mojgan is deeply thankful to Joakim Nivre and Carina Jahani for their consultations during the development of the UPDT.

Statistics of UD Persian Seraji

POS Tags

ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PUNCT – SCONJ – VERB – X

Features

Case – Degree – Mood – Number – NumType – Person – Polarity – PronType – Reflex – Tense – VerbForm

Relations

acl – acl:relcl – advcl – advmod – amod – appos – aux – aux:pass – case – cc – cc:preconj – ccomp – compound – compound:lvc – compound:prt – conj – cop – dep – det – det:predet – discourse – dislocated – fixed – flat – flat:foreign – mark – nmod – nmod:poss – nsubj – nsubj:nc – nsubj:pass – nummod – obj – obl – parataxis – punct – root – vocative – xcomp

Tokenization and Word Segmentation

This corpus contains 5997 sentences, 151627 tokens and 152923 syntactic words.

This corpus contains 13135 tokens (9%) that are not followed by a space.

This corpus does not contain words with spaces.

This corpus does not contain words that contain both letters and punctuation.

This corpus contains 1292 multi-word tokens. On average, one multi-word token consists of 2.00 syntactic words.
There are 715 types of multi-word tokens. Examples: خودش، خودشان، خودم، مرا، معتقدند، بدین، برایش، خودت، دلم، اوست، چیست، کشورمان، ماست، پیداست، خودمان، پدرش، درین، بدان، سرش، مادرش، همسرم، پدرم، آخرش، آن‌هاست، امیدواریم، خانواده‌اش، نامه‌ات، ازین، امیدوارم، امیدوارند، بهش، دستش، دلش، همه‌اش، پایش، کارش، کجاست، آزادند، ارزشهاست، برخوردارند، توست، خانه‌اش، خداست، خودتان، رویش، زین، صدایش، قبلی‌اش، مدتهاست، منظورم.
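The relationship between sentences, tokens, syntactic words and multi-word tokens can be checked with a short script. The sketch below assumes the standard CoNLL-U layout, in which a multi-word token is a range line such as `1-2` followed by one line per syntactic word. The helper `row` and the one-sentence sample are illustrative and not taken from the corpus. With the figures above, 152923 words minus 151627 tokens leaves 1296 extra words, so the 1292 multi-word tokens cover 2588 words, or about 2.00 each.

```python
def row(token_id, form):
    """Build a 10-column CoNLL-U line; only ID and FORM are filled in."""
    return "\t".join([token_id, form] + ["_"] * 8)


def count_units(conllu_text):
    """Count sentences, surface tokens, syntactic words and multi-word tokens."""
    counts = {"sentences": 0, "tokens": 0, "words": 0, "multiword_tokens": 0}
    in_sentence = False
    covered_until = 0  # last word ID covered by the current multi-word token
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the sentence
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        if not in_sentence:
            counts["sentences"] += 1
            in_sentence = True
        token_id = line.split("\t")[0]
        if "." in token_id:  # empty node (enhanced dependencies), not a word
            continue
        if "-" in token_id:  # range line: one surface token, several words
            covered_until = int(token_id.split("-")[1])
            counts["multiword_tokens"] += 1
            counts["tokens"] += 1
            continue
        counts["words"] += 1
        if int(token_id) > covered_until:  # word that is also a surface token
            counts["tokens"] += 1
    return counts


# Illustrative sentence: the surface token خودش is split into خود + ش.
SAMPLE = "\n".join([
    "# text = خودش آمد.",
    row("1-2", "خودش"),
    row("1", "خود"),
    row("2", "ش"),
    row("3", "آمد"),
    row("4", "."),
    "",
])

print(count_units(SAMPLE))
```

For the sample, the function reports one sentence, three tokens, four syntactic words and one multi-word token.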

Morphology

Nominal Features

Number

Plur
- AUX: بودند, باشند, خواهند, می‌توانند, نیستند, باشیم, بوده‌اند, نمی‌توانند, می‌خواهند, می‌خواهیم
- AUX-Fin: باشند, خواهند, باشیم, باشید, بشوند, بتوانند, داریم, دارند, بخواهیم, نباشند
- AUX-Part: بوده‌اند, شده‌اند, بوده‌ایم, توانسته‌اند, شده‌اید, شده‌ایم, نتوانسته‌اند, نگردیده‌اند
- NOUN: انتخابات, افراد, مواد, کسانی, کشورهای, برنامه‌های, اصلاحات, حدود, مطبوعات, آثار
- PRON: ما, آن‌ها, شان, شما, آنان, این‌ها, مان, ایشان, تان, اینان
- VERB: کنند, می‌کنند, هستند, دارند, کردند, ند, کرده‌اند, می‌شوند, کنیم, داشتند
- VERB-Fin: کنند, کنیم, کنید, شوند, دهند, بکنند, بگذارند, بگیرند, بدهند, بیایند
- VERB-Part: کرده‌اند, شده‌اند, داده‌اند, داشته‌اند, گرفته‌اند, گفته‌اند, نشده‌اند, نکرده‌اند, آمده‌اند, نوشته‌اند

Sing
- ADJ-Part: آمده, ساخته, یادشده, بسته, توقیف‌شده, ناخواسته, انجام‌شده, عقب‌افتاده, کاسته, کشته
- AUX: است, بود, خواهد, باشد, ست, بوده, شده, می‌تواند, می‌شود, نبود
- AUX-Fin: خواهد, باشد, شود, نخواهد, گردد, نباشد, بتواند, باشم, دارد, خواهم
- AUX-Part: بوده, شده, نبوده, گردیده, نشده, شده‌ای, می‌شده, نمی‌توانسته, نگردیده, بوده‌ام
- NOUN: ایران, سال, مردم, کشور, روز, کار, قرار, دست, انقلاب, تهران
- PRON: خود, آن, او, این, ش, من, وی, تو, م, اش
- VERB: کرد, گفت, شد, شده, می‌شود, دارد, می‌کند, کرده, نیست, است
- VERB-Fin: کند, شود, کنم, دهد, بشود, باشد, کن, گیرد, بگیرد, گو
- VERB-Part: شده, کرده, داشته, داده, گرفته, آمده, نوشته, دیده, ساخته, رسیده

Case

Loc
- ADV: بالا, اینجا, آنجا, پیش, بیرون, پایین, آن‌جا, زیر

Tem
- ADV: بعد, پس, پیش, حالا, امروز, قبل, اکنون, کنون, همیشه, دیروز

Voc
- INTJ: ای
- NOUN: پروردگارا, الهی, حافظا

Degree and Polarity

Degree

Cmp
- ADJ: بیشتر, بیش, کمتر, بیشتری, بهتر, بالاتر, برتر, بزرگتر, فراتر, نزدیکتر

Pos
- ADJ: اسلامی, دیگر, سیاسی, دوم, گذشته, فرهنگی, جدید, چند, ملی, پیدا

Sup
- ADJ: نخستین, بهترین, اولین, آخرین, بزرگترین, دومین, مهمترین, بیشترین, سومین, بالاترین

Polarity

Neg
- ADJ-Part: ناخواسته, نیاورده, ناخواسته‌ای, نبوده, نجنگیده, نمرده, نیفزوده, پرداخت‌نشده, کشف‌ناشده‌ای
- ADV: نه, غیر, هرگز, دیگر, هیچ, خیر, هیچ‌گاه
- AUX: نباید, نیستند, نخواهد, نباشد, نمی‌تواند, نمی‌توانند, نبوده, نباشند, نمی‌توانیم, نبودند
- AUX-Fin: نخواهد, نباشد, نباشند, نتوانند, نخواهم, نخواهند, نباشم, نشود, نباشید, نشوند
- AUX-Inf: نباید, نبایستی, نمی‌بایست, نمی‌شود
- AUX-Part: نبوده, نشده, نمی‌توانسته, نگردیده, نتوانسته, نتوانسته‌اند, نگردیده‌اند
- VERB: نیست, نداشته, ندارند, نکرده, نشده, نمی‌کند, نمی‌کنند, نمی‌دانستند, نیستم, نمی‌دانم
- VERB-Fin: نشود, نکند, نکنند, نباشد, نکنید, نیاید, مکن, نداند, ندهد, نماند
- VERB-Part: نداشته, نکرده, نشده, نیامده, نداده, نشده‌اند, نکرده‌اند, نرسیده, نمانده, نتوانسته

Verbal Features

Mood

Imp
- AUX-Fin: باش
- VERB-Fin: کن, گو, بگو, بگیر, بده, ببین, برو, بیا, بدانید, بزن

Ind
- AUX-Fin: خواهد, خواهند, نخواهد, دارد, داریم, دارند, خواهم, دارم, خواهیم, می‌باید

Sub
- AUX-Fin: باشد, باشند, باشیم, شود, گردد, نباشد, بتواند, باشم, باشید, بشوند
- VERB-Fin: کنند, کند, شود, کنیم, کنم, دهد, بشود, کنید, باشد, شوند

Tense

Fut
- AUX-Fin: خواهد, خواهند, نخواهد, خواهم, خواهیم, نخواهم, نخواهند, خواهی, خواهید, نخواهی

Past
- AUX: بود, بودند, نبود, شد, بودم, می‌خواست, توانست, بودیم, توانستند, گردید
- AUX-Fin: داشت
- VERB: کرد, گفت, شد, داشت, کردند, داد, افزود, بود, گرفت, می‌کرد

Pres
- AUX: است, باشد, ست, باشند, می‌تواند, می‌شود, می‌باشد, می‌توانند, می‌خواهد, نیستند
- AUX-Fin: باشد, باشند, باشیم, شود, گردد, نباشد, بتواند, باشم, دارد, باشید
- VERB: می‌شود, دارد, می‌کند, کنند, نیست, است, کند, می‌کنند, هستند, دارند
- VERB-Fin: کنند, کند, شود, کنیم, کنم, دهد, بشود, کنید, باشد, شوند

Pronouns, Determiners, Quantifiers

PronType

Dem
- PRON: آن, این, آن‌ها, آنان, این‌ها, همین, آنرا, همان, اینان, دان

Ind
- PRON: بعضی, برخی, دیگران, هرکس, یک, دیگری, بعضی‌ها, بسیاری, تک‌تک, دیگر

Int
- ADV: چرا, چه, چی, چگونه, کجا, مگر, چقدر, چطور, کی, چه‌طور
- DET: چه
- PRON: هرکه, کی

Neg
- DET: هیچ, غیر
- PRON: هیچکدام, هیچکس, هیچیک, هیچ‌یک

Prs
- PRON: خود, او, ما, ش, من, وی, تو, شان, م, شما

Rcp
- PRON: هم, یکدیگر, همدیگر

Rel
- PRON: آنچه

Tot
- PRON: همه, همهٔ, همگی, همگان, همه‌, تمام, هرکدام

NumType

Card
- NUM: یک, دو, یکی, هزار, سه, میلیون, ۲, چهار, ۵, ۳

Ord
- ADJ: دوم, هفتم, اول, سوم, شانزدهم, هشتم, پنجمین, چهاردهم

Reflex

Yes
- PRON: خود, خودم, خودت, خودمو, خویشتن

Person

1
- AUX: بودم, باشیم, باشم, می‌خواهیم, می‌خواهم, داریم, بودیم, می‌توانم, بخواهیم, خواهم
- AUX-Fin: باشیم, باشم, داریم, بخواهیم, خواهم, دارم, خواهیم, نخواهم, بتوانم, بتوانیم
- AUX-Part: بوده‌ایم, بوده‌ام, شده‌ایم
- PRON: ما, من, م, مان, ام, منم, خودم, خودمو, منِ
- VERB: می‌کنم, کنیم, کردیم, کنم, کردم, می‌کنیم, دارم, داریم, هستیم, گفتم
- VERB-Fin: کنیم, کنم, بگویم, بدهیم, بگوییم, ببینم, باشیم, بدهم, برسیم, برویم
- VERB-Part: کرده‌ایم, کرده‌ام, آمده‌ام, آمده‌ایم, دیده‌ایم, شده‌ایم, خوانده‌ام, داده‌ام, دیده‌ام, شده‌ام

2
- AUX: باشید, می‌توانید, می‌خواهید, باش, باشی, بودید, خواستید, خواهی, نمی‌توانید, شده‌ای
- AUX-Fin: باشید, باش, باشی, خواهی, نباشید, بخواهی, خواهید, داری, شوید, نخواهی
- AUX-Part: شده‌ای, بوده‌ای, شده‌اید
- PRON: تو, شما, ت, تان, ات, جنابعالی, شماها, خود, خودت
- VERB: کنید, کن, گو, بگو, دارید, کنی, نیستی, هستی, کردید, ببینید
- VERB-Fin: کنید, کن, گو, بگو, کنی, ببینید, بگیر, بده, بدهید, ببین
- VERB-Part: شنیده‌اید, کرده‌ای, کرده‌اید, نوشته‌ای, داده‌ای, ساخته‌ای, آمده‌ای, آمده‌اید, افکنده‌ای, بوده‌اید

3
- ADJ-Part: آمده, ساخته, یادشده, بسته, توقیف‌شده, ناخواسته, انجام‌شده, عقب‌افتاده, کاسته, کشته
- AUX: است, بود, خواهد, باشد, بودند, ست, باشند, بوده, شده, می‌تواند
- AUX-Fin: خواهد, باشد, باشند, خواهند, شود, نخواهد, گردد, نباشد, بتواند, دارد
- AUX-Part: بوده, شده, بوده‌اند, نبوده, گردیده, شده‌اند, نشده, می‌شده, نمی‌توانسته, نگردیده
- PRON: خود, او, ش, وی, شان, اش, ایشان, خویش, حضرتعالی, و
- VERB: کرد, گفت, شد, شده, می‌شود, دارد, می‌کند, کرده, کنند, نیست
- VERB-Fin: کنند, کند, شود, دهد, بشود, باشد, شوند, دهند, گیرد, بگیرد
- VERB-Part: شده, کرده, داشته, داده, گرفته, آمده, کرده‌اند, نوشته, دیده, ساخته

Other Features

Syntax

Auxiliary Verbs and Copula

This corpus uses 5 lemmas as copulas (cop). Examples: است، بود، شد، گردید، هست.

This corpus uses 8 lemmas as auxiliaries (aux). Examples: است، خواست، بود، بایست، توانست، توان، داشت، کرد.
This corpus uses 4 lemmas as passive auxiliaries (aux:pass). Examples: کرد، گشت، بود، گردید.

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

nsubj
- VERB--NOUN (3513)
- VERB--NOUN-ADP(ای) (2)
- VERB--NOUN-ADP(را) (8)
- VERB--PRON (774)
- VERB-Fin--NOUN (463)
- VERB-Fin--NOUN-ADP(ای) (1)
- VERB-Fin--NOUN-ADP(را) (2)
- VERB-Fin--PRON (130)
- VERB-Fin--PRON-ADP(را) (1)
- VERB-Part--NOUN (820)
- VERB-Part--PRON (118)
- VERB-Part--PRON-ADP(را) (1)

obj
- VERB--NOUN (813)
- VERB--NOUN-ADP(را) (1172)
- VERB--PRON (21)
- VERB--PRON-ADP(را) (175)
- VERB-Fin--NOUN (234)
- VERB-Fin--NOUN-ADP(را) (510)
- VERB-Fin--NOUN-ADP(را)-ADP(را) (1)
- VERB-Fin--NOUN-ADP(رو) (2)
- VERB-Fin--PRON (9)
- VERB-Fin--PRON-ADP(را) (85)
- VERB-Part--NOUN (220)
- VERB-Part--NOUN-ADP(را) (332)
- VERB-Part--PRON (6)
- VERB-Part--PRON-ADP(را) (35)
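Pattern counts of this kind can be approximated with a few lines of code. The sketch below is not the script that generated these statistics. It assumes each word is a dict with hypothetical keys `id`, `form`, `upos`, `head`, `deprel` and an optional `verbform`, and it builds the pattern string from the verb's tag, the child's tag and the forms of any ADP children attached as `case`. The sample sentence is a toy example.

```python
from collections import Counter


def argument_patterns(sentence, relation):
    """Count VERB--NOUN/PRON patterns for one relation, including case markers."""
    by_id = {w["id"]: w for w in sentence}
    patterns = Counter()
    for child in sentence:
        parent = by_id.get(child["head"])
        if parent is None or child["deprel"] != relation:
            continue
        if parent["upos"] != "VERB" or child["upos"] not in ("NOUN", "PRON"):
            continue
        # Adpositions attached to the argument as case markers, in sentence order.
        markers = [
            w["form"]
            for w in sentence
            if w["head"] == child["id"] and w["deprel"] == "case" and w["upos"] == "ADP"
        ]
        parent_tag = "VERB"
        if parent.get("verbform"):  # e.g. Fin or Part, when the feature is present
            parent_tag += "-" + parent["verbform"]
        pattern = parent_tag + "--" + child["upos"]
        pattern += "".join("-ADP(" + m + ")" for m in markers)
        patterns[pattern] += 1
    return patterns


# Toy sentence: a direct object marked by را.
toy_sentence = [
    {"id": 1, "form": "کتاب", "upos": "NOUN", "head": 3, "deprel": "obj"},
    {"id": 2, "form": "را", "upos": "ADP", "head": 1, "deprel": "case"},
    {"id": 3, "form": "خواند", "upos": "VERB", "head": 0, "deprel": "root"},
]
print(argument_patterns(toy_sentence, "obj"))
```

Summing the counters over all sentences of a treebank would give a table comparable to the lists above, although details such as how the adposition is labeled (form or lemma) are assumptions here.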

iobj

Verbs with Reflexive Core Objects

This corpus contains 15 lemmas that occur at least once with a reflexive core object (obj or iobj). Examples: کرد خود، دانست خود، داد خود، زد خود، دید خود، رساند خود، رسانید خود، آورد خود، افکند خود، بست خود، حلیم خود، دیدار خود، نامید خود، پوشاند خود، کشید خود

Relations Overview

This corpus uses 10 relation subtypes: acl:relcl, aux:pass, cc:preconj, compound:lvc, compound:prt, det:predet, flat:foreign, nmod:poss, nsubj:nc, nsubj:pass
The following 8 relation types are not used in this corpus at all: iobj, csubj, expl, clf, list, orphan, goeswith, reparandum
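Both figures can be rederived from the relation list given earlier on this page. The sketch below assumes the UD v2 inventory of 37 universal relations as spelled out in `UNIVERSAL_RELATIONS`. The names `USED_IN_CORPUS` and `summarize_relations` are introduced here for illustration only.

```python
# Assumed UD v2 inventory of universal relations (37 labels).
UNIVERSAL_RELATIONS = {
    "acl", "advcl", "advmod", "amod", "appos", "aux", "case", "cc", "ccomp",
    "clf", "compound", "conj", "cop", "csubj", "dep", "det", "discourse",
    "dislocated", "expl", "fixed", "flat", "goeswith", "iobj", "list", "mark",
    "nmod", "nsubj", "nummod", "obj", "obl", "orphan", "parataxis", "punct",
    "reparandum", "root", "vocative", "xcomp",
}

# Relations listed for this treebank in the "Relations" section of the page.
USED_IN_CORPUS = [
    "acl", "acl:relcl", "advcl", "advmod", "amod", "appos", "aux", "aux:pass",
    "case", "cc", "cc:preconj", "ccomp", "compound", "compound:lvc",
    "compound:prt", "conj", "cop", "dep", "det", "det:predet", "discourse",
    "dislocated", "fixed", "flat", "flat:foreign", "mark", "nmod", "nmod:poss",
    "nsubj", "nsubj:nc", "nsubj:pass", "nummod", "obj", "obl", "parataxis",
    "punct", "root", "vocative", "xcomp",
]


def summarize_relations(used, universal=UNIVERSAL_RELATIONS):
    """Return (subtypes, unused universal relations), both sorted."""
    subtypes = sorted(r for r in used if ":" in r)
    base_types = {r.split(":")[0] for r in used}  # strip the subtype suffix
    unused = sorted(universal - base_types)
    return subtypes, unused


subtypes, unused = summarize_relations(USED_IN_CORPUS)
print(len(subtypes), subtypes)
print(len(unused), unused)
```

The script finds 10 subtypes and 8 unused universal relations, matching the two statements above.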