UD Yiddish YiTB
Language: Yiddish (code: yi)
Family: IE
This treebank has been part of Universal Dependencies since the UD v2.17 release.
The following people have contributed to making this treebank part of UD: Kirk Andrews.
Repository: UD_Yiddish-YiTB
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.17
License: CC BY-SA 4.0
Genre: grammar-examples, learner-essays, bible, wiki, fiction, nonfiction, spoken, web
Questions, comments? General annotation questions (either Yiddish-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [m • kirkandrews (æt) gmail • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
| Annotation | Source |
|---|---|
| Lemmas | assigned by a program, not checked manually |
| UPOS | annotated manually, natively in UD style |
| XPOS | not available |
| Features | not available |
| Relations | annotated manually, natively in UD style |
Description
YiTB is a treebank of linguistically annotated Yiddish data in the Universal Dependencies framework, created via a bootstrapping machine learning method. The treebank currently contains a total of 27,872 tokens from a variety of sources and textual genres.
Yiddish is classified as a West Germanic language, although it also includes many elements from Semitic and Slavic languages. It is written in a modified Hebrew alphabet. Yiddish is structurally similar to German, but it also has many interesting structures not found in other Germanic languages, such as periphrastic verbs.
Roughly 60% of the treebank's 27,872 tokens stem from the Tatoeba source and consist of short sentences provided by both native and non-native speakers of Yiddish. These sentences contain occasional grammatical errors. One is the use of the auxiliary zayn ‘be’ instead of hobn ‘have’ in past tense constructions of periphrastic verbs formed with the verb zayn. Another is incorrect syntax of periphrastic verbs, which have an underlying complement-head (OV) order and do not follow the typical order expected of an SVO language like Yiddish. This appears to be a common mistake of intermediate L2 Yiddish speakers. The remaining 40% of tokens stem from a variety of native speaker texts and genres. The source texts and genres are shown below.
Lemmas and transliterations into Latin script are also provided. Both were produced by self-made models and are not 100% accurate. The transliteration model, which can be accessed here, was trained on Wiktionary and transliterated Bible data. The lemmatization model was trained on Wiktionary data and can be found here. Translations are not provided at this time, but a model trained on Tatoeba sentences and parallel Bible verses is accessible here. Morphological features are also not included at this time.
| Source | Author | Genre | Added | Split |
|---|---|---|---|---|
| tatoeba.org | Various | grammar/learner | 2.17 | all |
| Book of Exodus | Yehoyesh translation | bible | 2.17 | all |
| Beethoven’s Moonlight Sonata | Shloyme Bastomski | fiction | 2.17 | train |
| Yiddish proverbs | Various | proverb | 2.17 | all |
| Haggadahs and Elijah the Prophet | Proste Yiddish | web | 2.17 | test |
| Bulletin No. 3: At the Border | Various | nonfiction | 2.17 | test |
| A Story with a Cat and Yiddish Dialects | Proste Yiddish | web | 2.17 | dev |
| Sholem Aleichem | Proste Yiddish | web | 2.17 | train |
| Hirshke Glik | Shmerke Kaczerginski | nonfiction | 2.17 | dev |
| Book of Proverbs | Yehoyesh translation | bible | 2.17 | test |
| Shavuot and an Old Joke | Proste Yiddish | web | 2.17 | test |
| Bankrupt | Katie Brown | fiction | 2.17 | train |
| Jews and Yiddish | Nokhem Shtif | nonfiction | 2.17 | train |
| Fathers and Children | Chaim Malitz | nonfiction | 2.17 | train |
| Wikipedia | Various | nonfiction | 2.17 | train |
| A Foolish Child | Jacob Dinezon | fiction | 2.17 | test |
| From the Land of Consumption | Shloyme Gilbert | fiction | 2.17 | dev |
| The Four Questions | Traditional | liturgical | 2.17 | test |
| A Bit of Clarity and Simplicity Regarding the Language Question | Hillel Zeitlin | nonfiction | 2.17 | train |
| Song of Songs | Yehoyesh translation | bible | 2.17 | train |
Acknowledgments
To the best of our knowledge, the source texts used for the creation of this treebank are either in the public domain or are orphan works for which no copyright holder can be found. If you hold the copyright to any of the texts used in this treebank and would like it removed, please contact us at the email address given on this page.
Statistics of UD Yiddish YiTB
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – VERB – X
Features
Relations
acl – acl:relcl – advcl – advcl:relcl – advmod – amod – appos – aux – aux:pass – case – cc – ccomp – compound – compound:lvc – compound:prt – compound:redup – conj – cop – csubj – dep – det – det:poss – discourse – dislocated – expl – expl:pv – fixed – flat – flat:foreign – flat:name – goeswith – iobj – mark – nmod – nmod:poss – nsubj – nsubj:outer – nsubj:pass – nummod – obj – obl – obl:agent – obl:arg – orphan – parataxis – punct – reparandum – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 3054 sentences, 27488 tokens and 27879 syntactic words.
- This corpus contains 4377 tokens (16%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 196 types of words that contain both letters and punctuation. Examples: מדינת־ישׂראל, שלום־עליכם, ניו־יאָרק, שלום־עליכמען, ארץ־ישׂראל, באָבע־זיידע, מאַמע־לשון, בני־ישׂראל, ד״ר, הר־הבית, חיי־שעה, יונגער־מאַן, לשון־קודש, עם־ישׂראל, אַלף־בית, אַריכת־ימים, געשטאַפּאָ־מאַן, גרויס־עלטערן, ד“ר, חיי־עולם, טאַטע־מאַמע, טעלעפֿאָן־נומער, ייִדיש־גריכיש, ייִדיש־קינד, ייִדיש־שפּאַניש, ישו־הנוצרי, יש״ו, כּישוף־מאַכערין, לבֿנה־ליכט, לבֿנה־סאָנאַטע, מזרח־סלאַווישע, מלוכה־שפּראַך, מלחמה־העלד, מערבֿ־ייִדישע, משה־רבינו, קריפּטאָ־מזומנים, שומר־מיצוות, תּרגום־לשון, 15־יעריקער, 8–טן, prosteyiddish@gmail.com, א"ב, אַזכּרה־אַקטן, אַל־אַקסאַ־פּלאַץ, אַלף־בת, אַנטי־ייִדישע, אַנטי־פֿאַשיסטישן, אַרבעטס־פּלאַץ, אאַז"װ, אויגן־דאָקטער
- This corpus contains 391 multi-word tokens. On average, one multi-word token consists of 2.00 syntactic words.
- There are 73 types of multi-word tokens. Examples: ס׳איז, כ׳בין, פֿונעם, כ׳האָב, צום, האָסטו, אויפֿן, ביסטו, נישטאָ, מיטן, בײַם, ס׳זענען, ס'איז, אינעם, אױפֿן, זאָלסטו, לאָמיר, ס׳רובֿ, קענסטו, כ'בין, לאָמיך, מ׳האָט, ס׳וועט, רעדסטו, ווילסטו, וועסטו, כ'האָב, כ׳וועל, כ׳לערן, װעסטו, וואָלטסטו, כ׳הייס, כ׳וויל, ס'װעט, ס′איז, פֿאַרשטײסטו, אַם, אויפען, אונטערן, איבערן, ביזן, בלייבסטו, בײַן, גלייבסטו, געדענקסטו, ווייסטו, זעסטו, כ'וועל, כ'לעב, כ'פֿײַף.
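The counts above are consistent with each other: 27488 tokens plus one extra word for each of the 391 two-word multi-word tokens gives 27879 syntactic words. A minimal sketch of how such counts could be derived from a CoNLL-U file follows. The `row` helper, the `count_conllu` function, and the transliterated toy sentences are hypothetical illustrations, not the script that generated this page and not data from the treebank.

```python
def row(tok_id, form, lemma, upos, head, deprel, misc="_"):
    """Build one 10-column CoNLL-U line (XPOS, FEATS and DEPS left empty)."""
    return "\t".join([tok_id, form, lemma, upos, "_", "_", head, deprel, "_", misc])


# Hypothetical toy data in Latin transliteration, not taken from the treebank.
SAMPLE = "\n".join([
    "# text = s'iz gut.",
    row("1-2", "s'iz", "_", "_", "_", "_"),
    row("1", "s'", "es", "PRON", "3", "nsubj"),
    row("2", "iz", "zayn", "AUX", "3", "cop"),
    row("3", "gut", "gut", "ADJ", "0", "root", "SpaceAfter=No"),
    row("4", ".", ".", "PUNCT", "3", "punct"),
    "",
    "# text = ikh leyen.",
    row("1", "ikh", "ikh", "PRON", "2", "nsubj"),
    row("2", "leyen", "leyenen", "VERB", "0", "root", "SpaceAfter=No"),
    row("3", ".", ".", "PUNCT", "2", "punct"),
    "",
])


def count_conllu(text):
    """Count sentences, surface tokens, syntactic words and multi-word tokens."""
    stats = {"sentences": 0, "tokens": 0, "words": 0, "mwt": 0, "no_space_after": 0}
    for block in text.strip().split("\n\n"):
        rows = [line.split("\t") for line in block.splitlines()
                if line and not line.startswith("#")]
        if not rows:
            continue
        stats["sentences"] += 1
        mwt_end = 0  # last word ID covered by the most recent multi-word token
        for cols in rows:
            tok_id, misc = cols[0], cols[9].split("|")
            if "." in tok_id:
                continue  # empty nodes are neither tokens nor words
            if "-" in tok_id:
                # A range line such as "1-2" is one surface token.
                mwt_end = int(tok_id.split("-")[1])
                stats["mwt"] += 1
                stats["tokens"] += 1
                stats["no_space_after"] += "SpaceAfter=No" in misc
                continue
            stats["words"] += 1
            if int(tok_id) > mwt_end:
                # A word outside any range is also a surface token.
                stats["tokens"] += 1
                stats["no_space_after"] += "SpaceAfter=No" in misc
    return stats


print(count_conllu(SAMPLE))
```

On the toy data this reports 2 sentences, 6 tokens, 7 syntactic words, 1 multi-word token and 2 tokens not followed by a space.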
Morphology
Tags
- This corpus uses 16 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB, X
- This corpus does not use the following tags: SYM
- This corpus contains 6 word types tagged as particles (PART): הלוואַי, ניט, נישט, סך, צו, צי
- This corpus contains 23 lemmas tagged as pronouns (PRON): _, אַלעמען, אַלץ, איך, איר, אײַער, גאָרנישט, דו, וואָס, וואס, זי, זיי, זיך, זײ, מיר, מען, עמעץ, עס, עפּעס, ער, װאָס, װעלך, װער
- This corpus contains 36 lemmas tagged as determiners (DET): _, אַ, אַזאַ, אונדזער, אונדזערע, איטלעך, איין, אירן, אירע, אײַער, אײַרע, אײנער, דאָזיק, דיין, דיינען, דער, דײַן, וועמע, וועמען, ווײניק, זייער, זעלביק, זײַן, זײַער, זײער, יעדערער, יענער, מיינע, מיינען, מער, מײַן, מײַנע, סאַמע, קײַן, קײנער, רובֿ
- Out of the above, 2 lemmas occurred sometimes as PRON and sometimes as DET: _, אײַער
- This corpus contains 16 lemmas tagged as auxiliaries (AUX): _, דאַרפֿן, האָבן, וועלן, ווערן, זאָלן, זײַן, טאָרן, מוזן, מעגן, פֿלעגן, קענען, קערן, װאָלט, װעלן, װערן
- Out of the above, 9 lemmas occurred sometimes as AUX and sometimes as VERB: _, דאַרפֿן, האָבן, וועלן, זײַן, פֿלעגן, קענען, קערן, װערן
- This corpus does not use the VerbForm feature.
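The overlap lists above (lemmas tagged both PRON and DET, or both AUX and VERB) can be reproduced by collecting the set of UPOS tags seen for each lemma. The sketch below assumes standard 10-column CoNLL-U input; `lemmas_with_both_tags` and the transliterated toy sentences are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def row(tok_id, form, lemma, upos, head, deprel):
    """Build one 10-column CoNLL-U line with the unused columns left empty."""
    return "\t".join([tok_id, form, lemma, upos, "_", "_", head, deprel, "_", "_"])


# Hypothetical toy data: "hobn" once as an auxiliary, once as a main verb.
SAMPLE = "\n".join([
    "# text = ikh hob geleyent",
    row("1", "ikh", "ikh", "PRON", "3", "nsubj"),
    row("2", "hob", "hobn", "AUX", "3", "aux"),
    row("3", "geleyent", "leyenen", "VERB", "0", "root"),
    "",
    "# text = ikh hob a bukh",
    row("1", "ikh", "ikh", "PRON", "2", "nsubj"),
    row("2", "hob", "hobn", "VERB", "0", "root"),
    row("3", "a", "a", "DET", "4", "det"),
    row("4", "bukh", "bukh", "NOUN", "2", "obj"),
    "",
])


def lemmas_with_both_tags(conllu_text, tag_a, tag_b):
    """Return the sorted lemmas that occur with both UPOS tags."""
    tags_by_lemma = defaultdict(set)
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multi-word token ranges and empty nodes
        tags_by_lemma[cols[2]].add(cols[3])
    return sorted(lemma for lemma, tags in tags_by_lemma.items()
                  if tag_a in tags and tag_b in tags)


print(lemmas_with_both_tags(SAMPLE, "AUX", "VERB"))  # ['hobn']
```

Because the treebank's lemmas are assigned automatically, an unlemmatized `_` is treated like any other lemma here, which matches its appearance in the lists above.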
Nominal Features
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
Other Features
- ExtPos
  - ADP
    - DET: אַ
  - ADV
    - ADP: אין
  - ADP
- Typo
  - Yes
    - ADV: דער
    - NOUN: קאַֹװע
  - Yes
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop). Examples: זײַן.
- This corpus uses 14 lemmas as auxiliaries (aux). Examples: האָבן, זײַן, װעלן, קענען, זאָלן, וועלן, דאַרפֿן, װאָלט, מוזן, טאָרן, מעגן, פֿלעגן, _, קערן.
- This corpus uses 3 lemmas as passive auxiliaries (aux:pass). Examples: װערן, זײַן, ווערן.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (447)
  - VERB--PRON (1354)
- obj
  - VERB--NOUN (828)
  - VERB--PRON (274)
- iobj
  - VERB--PRON (2)
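Counts of this kind pair each dependent's UPOS with that of its head. The sketch below shows one way to compute them, assuming exact matches on the relation label (so subtypes such as nsubj:pass are not folded in, which may differ from how the figures above were produced). The function names and the toy sentences are hypothetical.

```python
from collections import Counter


def row(tok_id, form, lemma, upos, head, deprel):
    """Build one 10-column CoNLL-U line with the unused columns left empty."""
    return "\t".join([tok_id, form, lemma, upos, "_", "_", head, deprel, "_", "_"])


# Hypothetical toy data in Latin transliteration, not taken from the treebank.
SAMPLE = "\n".join([
    "# text = ikh hob geleyent",
    row("1", "ikh", "ikh", "PRON", "3", "nsubj"),
    row("2", "hob", "hobn", "AUX", "3", "aux"),
    row("3", "geleyent", "leyenen", "VERB", "0", "root"),
    "",
    "# text = ikh hob a bukh",
    row("1", "ikh", "ikh", "PRON", "2", "nsubj"),
    row("2", "hob", "hobn", "VERB", "0", "root"),
    row("3", "a", "a", "DET", "4", "det"),
    row("4", "bukh", "bukh", "NOUN", "2", "obj"),
    "",
])


def sentences(conllu_text):
    """Yield each sentence as a list of column lists for plain syntactic words."""
    for block in conllu_text.strip().split("\n\n"):
        rows = [line.split("\t") for line in block.splitlines()
                if line and not line.startswith("#")]
        yield [cols for cols in rows if cols[0].isdigit()]


def core_argument_pairs(conllu_text, relations=("nsubj", "obj", "iobj")):
    """Count VERB parents with NOUN or PRON children for the given relations."""
    counts = Counter()
    for words in sentences(conllu_text):
        upos_by_id = {cols[0]: cols[3] for cols in words}
        for cols in words:
            child_upos, head, deprel = cols[3], cols[6], cols[7]
            if (deprel in relations
                    and upos_by_id.get(head) == "VERB"
                    and child_upos in ("NOUN", "PRON")):
                counts[(deprel, f"VERB--{child_upos}")] += 1
    return counts


for (deprel, pair), n in sorted(core_argument_pairs(SAMPLE).items()):
    print(f"{deprel}: {pair} ({n})")
```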
Reflexive Verbs
- This corpus contains 149 lemmas that occur at least once with an expl:pv child. Examples: לערנען זיך, פֿילן זיך, טאָן זיך, טרעפֿן זיך, באַטראַכטן זיך, אַנטוויקלען זיך, באַוויז זיך, באַזעצן זיך, דערהערן זיך, זעצן זיך, לייגן זיך, לײַגן זיך, פֿאַרשטיין זיך, שפּילן זיך, אָנהויבן זיך, באַקענען זיך, בײַטן זיך, דערוויסן זיך, דערנענטערן זיך, האַלטן זיך, הײַבן זיך, וואַשן זיך, לאָזן זיך, מאַכן זיך, נעמען זיך, ענדערן זיך, פֿאַראינטערעסירן זיך, פֿאַרקילן זיך, פֿירן זיך, פֿרעגן זיך, צוגעהערן זיך, קוקן זיך, שטעל זיך, שטעלן זיך, אַוועקלײַגן זיך, אַוועקצוזעצן זיך, אַוועקשטעלן זיך, אַראַביש זיך, אַראָפּלאָזן זיך, אַראָפּנידערן זיך, אַרויסרײַסן זיך, אַרײַנגיסן זיך, אַרײַנקוועטש זיך, אָנלען זיך, אָננעמען זיך, אָפּוואַשן זיך, אָפּטײלען זיך, אָפּלאָזן זיך, אָפּצוגעבן זיך, אויסטיילן זיך
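The examples above join the lemma of a verb with the lemma of its expl:pv child. A minimal sketch of that lookup follows, again assuming 10-column CoNLL-U input; `reflexive_lemmas` and the toy sentences are hypothetical. Since the lemmas are machine-assigned and unchecked, some pairs in the real list (for example non-infinitive forms) reflect lemmatizer output, not verified dictionary forms.

```python
def row(tok_id, form, lemma, upos, head, deprel):
    """Build one 10-column CoNLL-U line with the unused columns left empty."""
    return "\t".join([tok_id, form, lemma, upos, "_", "_", head, deprel, "_", "_"])


# Hypothetical toy data in Latin transliteration, not taken from the treebank.
SAMPLE = "\n".join([
    "# text = ikh lern zikh yidish",
    row("1", "ikh", "ikh", "PRON", "2", "nsubj"),
    row("2", "lern", "lernen", "VERB", "0", "root"),
    row("3", "zikh", "zikh", "PRON", "2", "expl:pv"),
    row("4", "yidish", "yidish", "NOUN", "2", "obj"),
    "",
    "# text = ikh leyen",
    row("1", "ikh", "ikh", "PRON", "2", "nsubj"),
    row("2", "leyen", "leyenen", "VERB", "0", "root"),
    "",
])


def reflexive_lemmas(conllu_text):
    """Return sorted 'head-lemma child-lemma' strings for every expl:pv relation."""
    found = set()
    for block in conllu_text.strip().split("\n\n"):
        words = [line.split("\t") for line in block.splitlines()
                 if line and not line.startswith("#")]
        by_id = {cols[0]: cols for cols in words}
        for cols in words:
            head, deprel = cols[6], cols[7]
            if deprel == "expl:pv" and head in by_id:
                found.add(f"{by_id[head][2]} {cols[2]}")
    return sorted(found)


print(reflexive_lemmas(SAMPLE))  # ['lernen zikh']
```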
Relations Overview
- This corpus uses 15 relation subtypes: acl:relcl, advcl:relcl, aux:pass, compound:lvc, compound:prt, compound:redup, det:poss, expl:pv, flat:foreign, flat:name, nmod:poss, nsubj:outer, nsubj:pass, obl:agent, obl:arg
- The following 2 relation types are not used in this corpus at all: clf, list
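Both figures can be checked against the relation list in the Relations section of this page. The sketch below copies that list, treats any label containing a colon as a subtype, and compares the base labels with the 37 universal relations of UD v2. The variable names are hypothetical.

```python
# Relation labels as listed in the Relations section of this page.
USED = """acl acl:relcl advcl advcl:relcl advmod amod appos aux aux:pass case
cc ccomp compound compound:lvc compound:prt compound:redup conj cop csubj dep
det det:poss discourse dislocated expl expl:pv fixed flat flat:foreign
flat:name goeswith iobj mark nmod nmod:poss nsubj nsubj:outer nsubj:pass
nummod obj obl obl:agent obl:arg orphan parataxis punct reparandum root
vocative xcomp""".split()

# The 37 universal dependency relations defined by UD v2.
UNIVERSAL = """acl advcl advmod amod appos aux case cc ccomp clf compound conj
cop csubj dep det discourse dislocated expl fixed flat goeswith iobj list mark
nmod nsubj nummod obj obl orphan parataxis punct reparandum root vocative
xcomp""".split()

# A subtype is any label of the form "relation:subtype".
subtypes = sorted(rel for rel in USED if ":" in rel)

# Base relations that occur either bare or as the first part of a subtype.
used_base = {rel.split(":")[0] for rel in USED}
unused = sorted(rel for rel in UNIVERSAL if rel not in used_base)

print(len(subtypes), "subtypes:", ", ".join(subtypes))
print("unused:", ", ".join(unused))  # unused: clf, list
```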