This page pertains to UD version 2.

UD Yiddish YiTB

Language: Yiddish (code: yi)
Family: IE

This treebank has been part of Universal Dependencies since the UD v2.17 release.

The following people have contributed to making this treebank part of UD: Kirk Andrews.

Repository: UD_Yiddish-YiTB
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.18

License: CC BY-SA 4.0

Genre: grammar-examples, learner-essays, bible, wiki, fiction, nonfiction, spoken, web

Questions, comments? General annotation questions (either Yiddish-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [m • kirkandrews (æt) gmail • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.

Annotation	Source
Lemmas	assigned by a program, not checked manually
UPOS	annotated manually, natively in UD style
XPOS	not available
Features	not available
Relations	annotated manually, natively in UD style

Description

YiTB is a treebank of linguistically annotated Yiddish data in the Universal Dependencies framework, created via a bootstrapping machine-learning method. The treebank currently contains 27,872 tokens drawn from a variety of sources and textual genres.

Yiddish is classified as a West Germanic language, although it also includes many elements from Semitic and Slavic languages. It is written in a modified Hebrew alphabet. Yiddish is structurally similar to German, but it also has many interesting structures not found in other Germanic languages, such as periphrastic verbs.

Of the 27,872 tokens in the treebank, roughly 60% come from the Tatoeba source and consist of short sentences provided by both native and non-native speakers of Yiddish. These sentences contain occasional grammatical errors. One is the use of the auxiliary zayn ‘be’ instead of hobn ‘have’ in past-tense constructions of periphrastic verbs formed with the verb zayn. Another is incorrect word order in periphrastic verbs, which have an underlying complement-head (OV) order and do not follow the order typically expected of an SVO language like Yiddish. Such errors appear to be common among intermediate L2 Yiddish speakers. The remaining 40% of tokens come from a variety of native-speaker texts and genres. The source texts and genres are listed below.

Lemmas and transliterations into Latin script are also provided. Both are produced by self-made models and are not 100% accurate. The transliteration model, which can be accessed here, was trained on Wiktionary and transliterated Bible data. The lemmatization model was trained on Wiktionary data and can be found here. Rough translations have been provided using a model trained on Tatoeba sentences and parallel Bible verses that is accessible here. Many of these translations need manual correction, and that process is underway. Morphological features are not included at this time.

Source	Author	Genre	Added	Split
tatoeba.org	Various	grammar/learner	2.17	all
Book of Exodus	Yehoyesh translation	bible	2.17	all
Beethoven’s Moonlight Sonata	Shloyme Bastomski	fiction	2.17	train
Yiddish proverbs	Various	proverb	2.17	all
Haggadahs and Elijah the Prophet	Proste Yiddish	web	2.17	test
Bulletin No. 3: At the Border	Various	nonfiction	2.17	test
A Story with a Cat and Yiddish Dialects	Proste Yiddish	web	2.17	dev
Sholem Aleichem	Proste Yiddish	web	2.17	train
Hirshke Glik	Shmerke Kaczerginski	nonfiction	2.17	dev
Book of Proverbs	Yehoyesh translation	bible	2.17	test
Shavuot and an Old Joke	Proste Yiddish	web	2.17	test
Bankrupt	Katie Brown	fiction	2.17	train
Jews and Yiddish	Nokhem Shtif	nonfiction	2.17	train
Fathers and Children	Chaim Malitz	nonfiction	2.17	train
Wikipedia	Various	nonfiction	2.17	train
A Foolish Child	Jacob Dinezon	fiction	2.17	test
From the Land of Consumption	Shloyme Gilbert	fiction	2.17	dev
The Four Questions	Traditional	liturgical	2.17	test
A Bit of Clarity and Simplicity Regarding the Language Question	Hillel Zeitlin	nonfiction	2.17	train
Song of Songs	Yehoyesh translation	bible	2.17	train
Yiddish: Volume 1	Sheva Zucker	grammar	2.18	train

Acknowledgments

To the best of our knowledge, the source texts used for the creation of this treebank are either in the public domain or are orphan works for which no copyright holder can be found. If you hold the copyright to any of the texts used in this treebank and would like them removed, please contact us at the email address given on this page.

Statistics of UD Yiddish YiTB

POS Tags

ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – VERB – X

Features

ExtPos – Typo

Relations

acl – acl:relcl – advcl – advcl:relcl – advmod – amod – appos – aux – aux:pass – case – cc – ccomp – compound – compound:lvc – compound:prt – compound:redup – conj – cop – csubj – dep – det – det:poss – discourse – dislocated – expl – expl:pv – fixed – flat – flat:foreign – flat:name – goeswith – iobj – mark – nmod – nmod:poss – nsubj – nsubj:outer – nsubj:pass – nummod – obj – obl – obl:agent – obl:arg – orphan – parataxis – punct – reparandum – root – vocative – xcomp

Tokenization and Word Segmentation

This corpus contains 3113 sentences, 27954 tokens and 28348 syntactic words.
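
How the three counts relate can be illustrated with a minimal sketch that tallies sentences, surface tokens, and syntactic words in CoNLL-U text. It assumes the standard ten-column CoNLL-U layout, in which a range ID such as `1-2` marks a multi-word token. The two-sentence sample and its annotation are hypothetical (not taken from the treebank), and `make_conllu` and `count_units` are illustrative helper names.

```python
def make_conllu(sentences):
    """Build CoNLL-U text from rows of 10 columns (tab-separated on output)."""
    blocks = ["\n".join("\t".join(row) for row in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n\n"


# Hypothetical sample; the annotation is for illustration only.
SAMPLE = make_conllu([
    [
        ("1-2", "ס׳איז", "_", "_", "_", "_", "_", "_", "_", "_"),
        ("1", "ס׳", "עס", "PRON", "_", "_", "3", "nsubj", "_", "_"),
        ("2", "איז", "זײַן", "AUX", "_", "_", "3", "cop", "_", "_"),
        ("3", "גוט", "גוט", "ADJ", "_", "_", "0", "root", "_", "SpaceAfter=No"),
        ("4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"),
    ],
    [
        ("1", "איך", "איך", "PRON", "_", "_", "2", "nsubj", "_", "_"),
        ("2", "לייען", "לייענען", "VERB", "_", "_", "0", "root", "_", "_"),
        ("3", "אַ", "אַ", "DET", "_", "_", "4", "det", "_", "_"),
        ("4", "בוך", "בוך", "NOUN", "_", "_", "2", "obj", "_", "SpaceAfter=No"),
        ("5", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
    ],
])


def count_units(conllu_text):
    """Return (sentences, surface tokens, syntactic words)."""
    sentences = tokens = words = 0
    for block in conllu_text.strip().split("\n\n"):
        lines = [l for l in block.split("\n") if l and not l.startswith("#")]
        if not lines:
            continue
        sentences += 1
        mwt_end = 0  # last word ID covered by the current multi-word token
        for line in lines:
            tok_id = line.split("\t")[0]
            if "-" in tok_id:
                # A range line is one surface token spanning several words.
                tokens += 1
                mwt_end = int(tok_id.split("-")[1])
            elif tok_id.isdigit():
                words += 1
                if int(tok_id) > mwt_end:
                    tokens += 1  # a word that is its own surface token
            # Empty nodes (IDs like "1.1") are counted as neither.
    return sentences, tokens, words


print(count_units(SAMPLE))  # (2, 8, 9)
```

In the sample, the single multi-word token accounts for the one-word difference between the token and word counts.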

This corpus contains 4448 tokens (16%) that are not followed by a space.
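
The figure above is consistent with 4448 / 27954 ≈ 15.9%, which rounds to 16%. A minimal sketch of how such a count could be obtained follows, assuming the standard CoNLL-U convention that a token not followed by a space carries `SpaceAfter=No` in the MISC column. The sample rows are hypothetical, and `count_space_after_no` is an illustrative name.

```python
# Hypothetical one-sentence sample (10 CoNLL-U columns per row).
ROWS = [
    ("1-2", "ס׳איז", "_", "_", "_", "_", "_", "_", "_", "_"),
    ("1", "ס׳", "עס", "PRON", "_", "_", "3", "nsubj", "_", "_"),
    ("2", "איז", "זײַן", "AUX", "_", "_", "3", "cop", "_", "_"),
    ("3", "גוט", "גוט", "ADJ", "_", "_", "0", "root", "_", "SpaceAfter=No"),
    ("4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS) + "\n"


def count_space_after_no(conllu_text):
    """Return (tokens without a following space, all surface tokens)."""
    glued = total = 0
    mwt_end = 0  # last word ID covered by the current multi-word token
    for line in conllu_text.splitlines():
        if not line.strip():
            mwt_end = 0  # sentence boundary
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        tok_id, misc = cols[0], cols[9]
        if "-" in tok_id:
            mwt_end = int(tok_id.split("-")[1])
        elif not tok_id.isdigit() or int(tok_id) <= mwt_end:
            continue  # empty node, or a word inside a multi-word token
        total += 1
        if "SpaceAfter=No" in misc.split("|"):
            glued += 1
    return glued, total


glued, total = count_space_after_no(SAMPLE)
print(f"{glued} of {total} tokens are not followed by a space")
```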

This corpus does not contain words with spaces.

This corpus contains 196 types of words that contain both letters and punctuation. Examples: מדינת־ישׂראל, שלום־עליכם, ניו־יאָרק, שלום־עליכמען, ארץ־ישׂראל, באָבע־זיידע, מאַמע־לשון, בני־ישׂראל, ד״ר, הר־הבית, חיי־שעה, יונגער־מאַן, לשון־קודש, עם־ישׂראל, אַלף־בית, אַריכת־ימים, געשטאַפּאָ־מאַן, גרויס־עלטערן, ד“ר, חיי־עולם, טאַטע־מאַמע, טעלעפֿאָן־נומער, ייִדיש־גריכיש, ייִדיש־קינד, ייִדיש־שפּאַניש, ישו־הנוצרי, יש״ו, כּישוף־מאַכערין, לבֿנה־ליכט, לבֿנה־סאָנאַטע, מזרח־סלאַווישע, מלוכה־שפּראַך, מלחמה־העלד, מערבֿ־ייִדישע, משה־רבינו, קריפּטאָ־מזומנים, שומר־מיצוות, תּרגום־לשון, 15־יעריקער, 8–טן, prosteyiddish@gmail.com, א"ב, אַזכּרה־אַקטן, אַל־אַקסאַ־פּלאַץ, אַלף־בת, אַנטי־ייִדישע, אַנטי־פֿאַשיסטישן, אַרבעטס־פּלאַץ, אאַז"װ, אויגן־דאָקטער

This corpus contains 394 multi-word tokens. On average, one multi-word token consists of 2.00 syntactic words.
There are 73 types of multi-word tokens. Examples: ס׳איז, כ׳בין, פֿונעם, צום, כ׳האָב, האָסטו, אויפֿן, ביסטו, נישטאָ, מיטן, בײַם, ס׳זענען, ס'איז, אינעם, אױפֿן, זאָלסטו, לאָמיר, ס׳רובֿ, קענסטו, כ'בין, לאָמיך, מ׳האָט, ס׳וועט, רעדסטו, ווילסטו, וועסטו, כ'האָב, כ׳וועל, כ׳לערן, װעסטו, וואָלטסטו, כ׳הייס, כ׳וויל, ס'װעט, ס′איז, פֿאַרשטײסטו, אַם, אויפען, אונטערן, איבערן, ביזן, בלייבסטו, בײַן, גלייבסטו, געדענקסטו, ווייסטו, זעסטו, כ'וועל, כ'לעב, כ'פֿײַף.
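
These figures agree with the tokenization counts reported on this page: 28348 syntactic words minus 27954 tokens gives 394, which is what 394 multi-word tokens of two words each would contribute. A minimal sketch of collecting multi-word token types and their average length from CoNLL-U text follows. The sample fragment is hypothetical and `mwt_stats` is an illustrative name. The sketch treats surface forms as distinct whenever their characters differ, so forms that differ only in the apostrophe character are counted as separate types.

```python
from collections import Counter

# Hypothetical fragment; columns not needed for this count are left as "_".
ROWS = [
    ("1-2", "ס׳איז", "_", "_", "_", "_", "_", "_", "_", "_"),
    ("1", "ס׳", "עס", "PRON", "_", "_", "_", "_", "_", "_"),
    ("2", "איז", "זײַן", "AUX", "_", "_", "_", "_", "_", "_"),
    ("3-4", "צום", "_", "_", "_", "_", "_", "_", "_", "_"),
    ("3", "צו", "צו", "ADP", "_", "_", "_", "_", "_", "_"),
    ("4", "דעם", "דער", "DET", "_", "_", "_", "_", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS) + "\n"


def mwt_stats(conllu_text):
    """Return (Counter of multi-word token forms, average words per token)."""
    forms = Counter()
    word_total = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0]:  # range ID such as "1-2"
            start, end = map(int, cols[0].split("-"))
            forms[cols[1]] += 1
            word_total += end - start + 1
    occurrences = sum(forms.values())
    average = word_total / occurrences if occurrences else 0.0
    return forms, average


forms, average = mwt_stats(SAMPLE)
print(len(forms), "types;", f"{average:.2f}", "words per multi-word token")
```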

Morphology

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

ExtPos
- ADP
  - DET: אַ
- ADV
  - ADP: אין

Typo
- Yes
  - ADV: דער
  - NOUN: קאַֹװע

Syntax

Auxiliary Verbs and Copula

This corpus uses 1 lemma as a copula (cop). Example: זײַן.

This corpus uses 13 lemmas as auxiliaries (aux). Examples: האָבן, זײַן, װעלן, זאָלן, קענען, וועלן, דאַרפֿן, װאָלט, מוזן, טאָרן, מעגן, פֿלעגן, קערן.
This corpus uses 3 lemmas as passive auxiliaries (aux:pass). Examples: װערן, זײַן, ווערן.
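
Two pairs in the lists above, װעלן / וועלן and װערן / ווערן, appear to differ only in encoding: one member uses the single code point U+05F0 (HEBREW LIGATURE YIDDISH DOUBLE VAV), the other two separate vav letters. The following minimal sketch folds the ligature into two letters before counting lemmas. It assumes that treating both spellings as one lemma is desired, which this page does not state. `fold_double_vov` and `distinct_lemmas` are illustrative names.

```python
DOUBLE_VOV = "\u05F0"       # HEBREW LIGATURE YIDDISH DOUBLE VAV
TWO_VOVS = "\u05D5\u05D5"   # two separate HEBREW LETTER VAV code points


def fold_double_vov(text):
    """Replace the double-vov ligature with two separate vav letters."""
    return text.replace(DOUBLE_VOV, TWO_VOVS)


def distinct_lemmas(lemmas):
    """Return the set of lemmas after folding the double-vov ligature."""
    return {fold_double_vov(lemma) for lemma in lemmas}


# The same lemma spelled with the ligature and with two letters.
ligature_form = "\u05F0\u05E2\u05DC\u05DF"
two_letter_form = "\u05D5\u05D5\u05E2\u05DC\u05DF"

print(ligature_form == two_letter_form)                        # False
print(len(distinct_lemmas([ligature_form, two_letter_form])))  # 1
```

The Unicode Hebrew block also has ligature code points for vav-yod (U+05F1) and double yod (U+05F2). They are left untouched here because the position of combining vowel marks would need separate handling.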

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

nsubj
- VERB--NOUN (483)
- VERB--PRON (1379)

obj
- VERB--NOUN (859)
- VERB--PRON (276)

iobj
- VERB--PRON (2)
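
Counts of this kind can be sketched as a tally of parent-child pairs in which the parent is a VERB and the child is a NOUN or PRON. The sample sentence is hypothetical and `core_argument_pairs` is an illustrative name. The sketch matches relation labels exactly, so subtypes such as `nsubj:pass` are not folded in; whether the figures above do so is not stated here.

```python
from collections import Counter

# Hypothetical sample sentence (10 CoNLL-U columns per row).
ROWS = [
    ("1", "איך", "איך", "PRON", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "לייען", "לייענען", "VERB", "_", "_", "0", "root", "_", "_"),
    ("3", "אַ", "אַ", "DET", "_", "_", "4", "det", "_", "_"),
    ("4", "בוך", "בוך", "NOUN", "_", "_", "2", "obj", "_", "SpaceAfter=No"),
    ("5", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS) + "\n\n"


def core_argument_pairs(conllu_text, relations=("nsubj", "obj", "iobj")):
    """Count (relation, 'VERB--CHILD_UPOS') pairs for NOUN and PRON children."""
    counts = Counter()
    for block in conllu_text.strip().split("\n\n"):
        words = {}
        for line in block.split("\n"):
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            if cols[0].isdigit():  # skip multi-word ranges and empty nodes
                words[cols[0]] = cols
        for cols in words.values():
            upos, head, deprel = cols[3], cols[6], cols[7]
            parent = words.get(head)  # None for the root (head "0")
            if (
                deprel in relations
                and upos in ("NOUN", "PRON")
                and parent is not None
                and parent[3] == "VERB"
            ):
                counts[(deprel, "VERB--" + upos)] += 1
    return counts


for (relation, pair), n in sorted(core_argument_pairs(SAMPLE).items()):
    print(relation, pair, n)
```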

Reflexive Verbs

This corpus contains 148 lemmas that occur at least once with an expl:pv child. Examples: לערנען זיך, פֿילן זיך, טאָן זיך, טרעפֿן זיך, באַטראַכטן זיך, אַנטוויקלען זיך, באַזעצן זיך, דערהערן זיך, הײבן זיך, זעצן זיך, לייגן זיך, נעמען זיך, פֿאַרשטיין זיך, שפּילן זיך, אָנהײבן זיך, באַוויז זיך, באַקענען זיך, בײַטן זיך, דערוויסן זיך, דערנענטערן זיך, האַלטן זיך, וואַשן זיך, לאָזן זיך, לײגן זיך, מאַכן זיך, מערן זיך, ענדערן זיך, פֿאַראינטערעסירן זיך, פֿאַרקילן זיך, פֿירן זיך, פֿרעגן זיך, צוהערן זיך, קוקן זיך, שטעל זיך, שטעלן זיך, אַוועקלײַגן זיך, אַוועקצוזעצן זיך, אַוועקשטעלן זיך, אַראַביש זיך, אַראָפּלאָזן זיך, אַראָפּנידערן זיך, אַרויסרײַסן זיך, אַרײַנגיסן זיך, אַרײַנקוועטש זיך, אָנלען זיך, אָננעמען זיך, אָפּוואַשן זיך, אָפּטײלן זיך, אָפּלאָזן זיך, אָפּצוגעבן זיך
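
A list like the one above can be approximated by collecting the lemma of every word that governs an `expl:pv` child, paired with the child's lemma. The following is a minimal sketch; the sample sentence is hypothetical and `reflexive_lemmas` is an illustrative name. Because the treebank's lemmas are assigned automatically and not checked manually, the output of such a script reflects any lemmatization errors in the data.

```python
# Hypothetical sample sentence (10 CoNLL-U columns per row).
ROWS = [
    ("1", "איך", "איך", "PRON", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "לערן", "לערנען", "VERB", "_", "_", "0", "root", "_", "_"),
    ("3", "זיך", "זיך", "PRON", "_", "_", "2", "expl:pv", "_", "SpaceAfter=No"),
    ("4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS) + "\n\n"


def reflexive_lemmas(conllu_text):
    """Return 'head-lemma child-lemma' strings for every expl:pv relation."""
    found = set()
    for block in conllu_text.strip().split("\n\n"):
        words = {}
        for line in block.split("\n"):
            if line and not line.startswith("#"):
                cols = line.split("\t")
                if cols[0].isdigit():  # skip multi-word ranges and empty nodes
                    words[cols[0]] = cols
        for cols in words.values():
            if cols[7] == "expl:pv" and cols[6] in words:
                head = words[cols[6]]
                found.add(head[2] + " " + cols[2])
    return found


print(sorted(reflexive_lemmas(SAMPLE)))
```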

Relations Overview

This corpus uses 15 relation subtypes: acl:relcl, advcl:relcl, aux:pass, compound:lvc, compound:prt, compound:redup, det:poss, expl:pv, flat:foreign, flat:name, nmod:poss, nsubj:outer, nsubj:pass, obl:agent, obl:arg
The following 2 relation types are not used in this corpus at all: clf, list
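
Both statements can be checked against the relation inventory listed under Relations on this page. The sketch below copies that inventory, compares it with the 37 universal relations of UD v2, and derives the subtypes and the unused universal relations. The variable names are illustrative.

```python
# The 37 universal dependency relations of UD v2.
UNIVERSAL = set("""
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj
dep det discourse dislocated expl fixed flat goeswith iobj list mark nmod
nsubj nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split())

# Relations listed for this treebank on this page.
USED = """
acl acl:relcl advcl advcl:relcl advmod amod appos aux aux:pass case cc ccomp
compound compound:lvc compound:prt compound:redup conj cop csubj dep det
det:poss discourse dislocated expl expl:pv fixed flat flat:foreign flat:name
goeswith iobj mark nmod nmod:poss nsubj nsubj:outer nsubj:pass nummod obj obl
obl:agent obl:arg orphan parataxis punct reparandum root vocative xcomp
""".split()

# A subtype is any label of the form "relation:subtype".
subtypes = sorted(r for r in USED if ":" in r)

# A universal relation is unused if it appears neither bare nor as the base
# of a subtype.
unused = sorted(UNIVERSAL - {r.split(":")[0] for r in USED})

print(len(subtypes))  # 15
print(unused)         # ['clf', 'list']
```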