
This page pertains to UD version 2.

Universal Dependencies

Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 600 contributors producing over 200 treebanks in over 150 languages (see the bottom of this page for updated numbers from the latest release). If you are new to UD, you should start by reading the first part of the Short Introduction and then browsing the annotation guidelines.
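UD treebanks are distributed as CoNLL-U files, in which each token line has ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The sketch below is a minimal reader that shows how the three annotation layers named above (parts of speech, morphological features, and syntactic dependencies) appear in that format. The function names `read_conllu` and `parse_feats` and the three-token sample sentence are assumptions made for illustration only; they are not taken from any UD treebank or official tool.

```python
from typing import Dict, List

# The ten column names of a CoNLL-U token line, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn a FEATS value such as 'Number=Plur' into a dict ('_' means none)."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Hypothetical hand-written sentence, for illustration only.
SAMPLE = "\n".join([
    "# text = Dogs bark.",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "Mood=Ind|Tense=Pres", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

for token in read_conllu(SAMPLE)[0]:
    # Print the part of speech, features, and dependency of each token.
    print(token["form"], token["upos"], parse_feats(token["feats"]),
          token["head"], token["deprel"])
```

Counting the sentences and tokens returned by such a reader is one way to sanity-check size figures like those quoted for individual treebanks, although multiword tokens and empty nodes are skipped in this sketch and may be counted differently in official statistics.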

If you want to receive news about Universal Dependencies, you can subscribe to the UD mailing list. If you want to discuss individual annotation questions, use the GitHub issue tracker.

Current UD Languages

Information about language families (and genera for families with multiple branches) is mostly taken from WALS Online (IE = Indo-European).

Abaza treebanks

UD_Abaza-ATB is a treebank based on the [Spoken corpus of Abaza](http://lingconlab.ru/spoken_abaza/).

 

Language documentation

See the language documentation page.

Abkhaz treebanks

UD_Abkhaz-AbNC is a treebank based on texts from the Abkhaz National Corpus, [AbNC](https://clarino.uib.no/abnc).

 

Language documentation

See the language documentation page.

Afrikaans treebanks

UD Afrikaans-AfriBooms is a conversion of the AfriBooms Dependency Treebank, originally annotated with a simplified PoS set and dependency relations according to a subset of the Stanford tag set. The corpus consists of public government documents.

 

Language documentation

See the language documentation page.

Akkadian treebanks

162 royal inscriptions of four early Neo-Assyrian kings.

 

A small set of sentences from Babylonian royal inscriptions.

 

See here for comparative statistics of Akkadian treebanks.

Language documentation

See the language documentation page.

Akuntsu treebanks

UD_Akuntsu-TuDeT is a collection of annotated sentences in <a href="https://glottolog.org/resource/languoid/id/akun1241">Akuntsú</a>. The sentences stem from the grammatical description by Aragon (2014) and Aragon's field work. Sentence annotation and documentation by Carolina Aragon, Fabrício Ferraz Gerardi, and Luana dos Santos.

 

Language documentation

See the language documentation page.

Albanian treebanks

The UD Albanian Treebank is a small treebank for Standard Albanian, developed within a project framework at Uppsala University. The data was extracted from Wikipedia.

 

Language documentation

See the language documentation page.

Amharic treebanks

UD_Amharic-ATT is a manually developed treebank for Amharic. Sentences were collected from grammar books, fiction, biographies, religious texts, and news.

 

Language documentation

See the language documentation page.

Ancient Greek treebanks

UD Ancient Greek PTNK contains portions of the Septuagint according to the Codex Alexandrinus.

 

UD_Ancient_Greek-PROIEL is converted from the Ancient Greek data in the PROIEL treebank, and consists of the New Testament plus selections from Herodotus.

 

This Universal Dependencies Ancient Greek Treebank consists of an automatic conversion of a selection of passages from the Ancient Greek and Latin Dependency Treebank 2.1.

 

See here for comparative statistics of Ancient Greek treebanks.

Language documentation

See the language documentation page.

Ancient Hebrew treebanks

UD Ancient Hebrew PTNK contains portions of the Biblia Hebraica Stuttgartensia with morphological annotations from [ETCBC](https://github.com/etcbc/bhsa).

 

Language documentation

See the language documentation page.

Apurina treebanks

This is an Apurinã treebank consisting of sentences from a grammatical description of the language by Maília Fernanda.

 

Language documentation

See the language documentation page.

Arabic treebanks

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Luma Ateyah, Martin Popel, Daniel Zeman, Nizar Habash, Dima Taji

 

The NYUAD Arabic UD treebank is based on the Penn Arabic Treebank (PATB), parts 1, 2, and 3, through conversion to CATiB dependency trees.

 

The Arabic-PADT UD treebank is based on the [Prague Arabic Dependency Treebank](http://ufal.mff.cuni.cz/padt/) (PADT), created at Charles University in Prague.

 

See here for comparative statistics of Arabic treebanks.

Language documentation

See the language documentation page.

Armenian treebanks

A Universal Dependencies treebank for Eastern Armenian developed for UD originally by the ArmTDP team led by Marat M. Yavrumyan at the V. Brusov State University in Yerevan.

 

A Universal Dependencies treebank for Eastern Armenian developed for UD originally by the ArmTDP team led by Marat M. Yavrumyan at the Yerevan State University.

 

See here for comparative statistics of Armenian treebanks.

Language documentation

See the language documentation page.

Assyrian treebanks

The Uppsala Assyrian Treebank is a small treebank for Modern Standard Assyrian. The corpus is collected and annotated manually. The data was randomly collected from different textbooks and a short translation of The Merchant of Venice.

 

Language documentation

See the language documentation page.

Azerbaijani treebanks

This is a small treebank of grammatical examples for Azerbaijani. The treebank tries to be neutral about the particular variety (North or South Azerbaijani) and hence uses the ISO code for the macrolanguage (`az`).

 

Language documentation

See the language documentation page.

Bambara treebanks

The UD Bambara treebank is a section of the Corpus Référence du Bambara annotated natively with Universal Dependencies.

 

Language documentation

See the language documentation page.

Basque treebanks

The Basque UD treebank is based on an automatic conversion from part of the Basque Dependency Treebank (BDT), created at the University of the Basque Country by the IXA NLP research group. The treebank consists of 8,993 sentences (121,443 tokens) and covers mainly literary and journalistic texts.

 

Language documentation

See the language documentation page.

Bavarian treebanks

MaiBaam is manually annotated with part-of-speech tags and syntactic dependencies. The treebank encompasses diverse text genres (wiki articles and discussions, grammar examples, fiction, and commands for virtual assistants) and dialects from the North, Central and South Bavarian areas as well as the dialectal transition areas in between.

 

Language documentation

See the language documentation page.

Beja treebanks

A Universal Dependencies corpus for Beja, a language of the North Cushitic branch of the Afro-Asiatic phylum, mainly spoken in Sudan, Egypt and Eritrea.

 

Language documentation

See the language documentation page.

Belarusian treebanks

The Belarusian UD treebank is based on a sample of the news texts included in the Belarusian-Russian parallel subcorpus of the Russian National Corpus, online search available at: http://ruscorpora.ru/search-para-be.html.

 

Language documentation

See the language documentation page.

Bengali treebanks

The BRU Bengali treebank has been created at Begum Rokeya University, Rangpur, by the members of Semantics Lab.

 

Language documentation

See the language documentation page.

Bhojpuri treebanks

The [Bhojpuri](https://en.wikipedia.org/wiki/Bhojpuri_language) UD Treebank (BHTB) is a part of the [Universal Dependency treebank](http://universaldependencies.org/) project.

 

Language documentation

See the language documentation page.

Bororo treebanks

UD_Bororo-BDT is a compilation of annotated sentences in [Bororo](https://glottolog.org/resource/languoid/id/boro1282). The corpus encompasses sentences derived from diverse sources: grammar examples, mythological narratives, fieldwork material, and other sources. Sentence annotation and documentation by [Fabrício Ferraz Gerardi](https://languagestructure.github.io).

 

Language documentation

See the language documentation page.

Breton treebanks

UD Breton-KEB is a treebank of Breton that has been manually annotated according to the Universal Dependencies guidelines. The tokenisation guidelines and morphological annotation come from a finite-state morphological analyser of Breton released as part of the [Apertium project](http://www.apertium.org).

 

Language documentation

See the language documentation page.

Bulgarian treebanks

UD_Bulgarian-BTB is based on the HPSG-based BulTreeBank, created at the Institute of Information and Communication Technologies, Bulgarian Academy of Sciences. The original consists of 215,000 tokens (over 15,000 sentences). All the texts were processed automatically at the tokenization, morphological, and chunk levels. The full syntactic analysis was then performed manually by trained annotators.

 

Language documentation

See the language documentation page.

Buryat treebanks

The UD Buryat treebank was manually annotated natively in UD and contains grammar book sentences, along with news and some fiction.

 

Language documentation

See the language documentation page.

Cantonese treebanks

A Cantonese treebank (in Traditional Chinese characters) of film subtitles and of legislative proceedings of Hong Kong, parallel with the Chinese-HK treebank.

 

Language documentation

See the language documentation page.

Cappadocian treebanks

This is a treebank of Pharasiot, a critically endangered Greek dialect originally spoken near Cappadocia. The source material is fairy tales collected during field study.

 

Language documentation

See the language documentation page.

Catalan treebanks

Catalan data from the [AnCora](http://clic.ub.edu/corpus/) corpus.

 

Language documentation

See the language documentation page.

Cebuano treebanks

UD_Cebuano_GJA is a collection of annotated Cebuano sample sentences randomly taken from three different sources: community-contributed samples from the website Tatoeba, a Cebuano grammar book by Bunye & Yap (1971) and Tanangkinsing's reference grammar on Cebuano (2011). This project is currently a work in progress.

 

Language documentation

See the language documentation page.

Chinese treebanks

A treebank of Chinese sentences adapted for learners at levels A1 to C1 (HSK 1 to 5), collected from the [Chinese Grammar Wiki](https://resources.allsetlearning.com/chinese/grammar/) website (CC BY-NC-SA 3.0 license). The treebank was manually annotated by researchers at Paris Nanterre University (Modyco) in the mSUD annotation schema (morpheme-level Surface Universal Dependencies).

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Josie Li, Cheuk Ying Li, Martin Popel, Daniel Zeman, Herman Leung

 

A Traditional Chinese treebank of film subtitles and of legislative proceedings of Hong Kong, parallel with the Cantonese-HK treebank.

 

The Chinese-CFL UD treebank is manually annotated by Keying Li with minor manual revisions by Herman Leung and John Lee at City University of Hong Kong, based on essays written by learners of Mandarin Chinese as a foreign language. The data is in Simplified Chinese.

 

A treebank of Chinese patent application texts collected from the Chinese patent office's website CNIPA. The sentences are randomly selected from the patent claims of the IPC section "G" from November 2017 to September 2018.

 

Simplified Chinese Universal Dependencies dataset converted from the GSD (traditional) dataset with manual corrections.

 

Traditional Chinese Universal Dependencies Treebank annotated and converted by Google.

 

See here for comparative statistics of Chinese treebanks.

Language documentation

See the language documentation page.

Chukchi treebanks

This data is a manual annotation of the multimedia annotated corpus of the [Chuklang](http://chuklang.ru/) project, a dialectal corpus of the Amguema variant of Chukchi.

 

Language documentation

See the language documentation page.

Classical Armenian treebanks

The present release includes a treebank of the Classical Armenian translation of the four Gospels (95,370 tokens in 4,146 sentences) as part of the UD Classical Armenian-CAVaL treebank project. It results from a conversion of the PROIEL annotation of the Classical Armenian Gospels, which has been manually corrected and extended with additional information.

 

Language documentation

See the language documentation page.

Classical Chinese treebanks

Classical Chinese Universal Dependencies Treebank annotated and converted by the Institute for Research in Humanities, Kyoto University.
  • Contributors: Koichi Yasuoka, Christian Wittern, Tomohiko Morioka, Takumi Ikeda, Naoki Yamazaki, Yoshihiro Nikaido, Shingo Suzuki, Shigeki Moro, Yuan Li, Hiroyuki Shirasu, Kazunori Fujita

 

A dependency treebank of "逍遥游" (Enjoyment in Untroubled Ease), written by Zhuangzi.

 

See here for comparative statistics of Classical Chinese treebanks.

Language documentation

See the language documentation page.

Coptic treebanks

UD Coptic contains manually annotated Sahidic Coptic texts, including Biblical texts, sermons, letters, and hagiography.

 

Language documentation

See the language documentation page.

Croatian treebanks

The Croatian UD treebank is based on the extension of the SETimes-HR corpus, the [hr500k](http://hdl.handle.net/11356/1183) corpus.

 

Language documentation

See the language documentation page.

Czech treebanks

The UD_Czech-CAC treebank is based on the Czech Academic Corpus 2.0 (CAC; Český akademický korpus; ČAK), created at Charles University in Prague.

 

The Czech-PDT UD treebank is based on the Prague Dependency Treebank – Consolidated 1.0 (PDT-C), created at Charles University in Prague.

 

FicTree is a treebank of Czech fiction, automatically converted into the UD format. The treebank was built at Charles University in Prague.

 

The UD_Czech-CLTT treebank is based on the Czech Legal Text Treebank 2.0, created at Charles University in Prague.

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Václava Kettnerová, Jan Hajič jr., Silvie Cinková, Zdeňka Urešová, Milan Straka, Jan Hajič, Jaroslava Hlaváčová, Daniel Zeman

 

UD_Czech-Poetry contains random samples of Czech 19th-century poetry from the Corpus of Czech Verse parsed with UDPipe2 (trained on UD Czech-PDT 2.11) and manually corrected.

 

See here for comparative statistics of Czech treebanks.

Language documentation

See the language documentation page.

Danish treebanks

The Danish UD treebank is a conversion of the Danish Dependency Treebank.

 

Language documentation

See the language documentation page.

Dutch treebanks

This corpus contains sentences from the Wikipedia section of the Lassy Small Treebank. Universal Dependency annotation was generated automatically from the original annotation in Lassy.

 

This corpus consists of samples from various treebanks annotated at the University of Groningen using the Alpino annotation tools and guidelines.

 

See here for comparative statistics of Dutch treebanks.

Language documentation

See the language documentation page.

Egyptian treebanks

Egyptian-UJaen is the first morphosyntactic treebank created for Pre-Coptic Egyptian in Universal Dependencies. It contains sentences manually annotated at the University of Jaén (Spain) that were selected from texts written in Old Egyptian, Middle Egyptian, Late Egyptian and Demotic.

 

Language documentation

See the language documentation page.

English treebanks

Universal Dependencies syntax annotations from the GUM corpus (https://gucorpling.org/gum/).

 

A Gold Standard Universal Dependencies Corpus for English, built over the source material of the English Web Treebank LDC2012T13 (https://catalog.ldc.upenn.edu/LDC2012T13).
  • Contributors: Natalia Silveira, Timothy Dozat, Christopher Manning, Sebastian Schuster, Ethan Chi, John Bauer, Miriam Connor, Marie-Catherine de Marneffe, Nathan Schneider, Sam Bowman, Hanzhi Zhu, Daniel Galbraith

 

UD Atis Treebank is a manually annotated treebank consisting of the sentences in the Atis (Airline Travel Information) dataset, which includes transcriptions of people asking automated inquiry systems for flight information.

 

UD_English-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

 

Repository for the Genre Tests for Linguistic Evaluation (GENTLE) Corpus.
  • Contributors: Tatsuya Aoyama, Shabnam Behzad, Luke Gessler, Lauren Levine, Yi-Ju Jessica Lin, Yang Janet Liu, Siyao Logan Peng, Yilun Zhu, Amir Zeldes

 

This is the English portion of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies (http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Jesse Kirchner, Lorenzo Lambertino, Martin Popel, Daniel Zeman, Christopher Manning, Sebastian Schuster, Siva Reddy

 

UD English_LinES is the English half of the LinES Parallel Treebank with the original dependency annotation first automatically converted into Universal Dependencies and then partially reviewed. Its contents cover literature, an online manual and Europarl data.

 

UD English-Pronouns is a dataset created to make pronoun identification more accurate and with a more balanced distribution across genders. The dataset initially targets the independent genitive pronouns "hers", (independent) "his", (singular) "theirs", "mine", and (singular) "yours".

 

This repository includes the Dependency Treebank of Spoken L2 English (SL2E), which consists of Universal Dependency annotations for a random sample of sentences from the <a href="https://alaginrc.nict.go.jp/nict_jle/index_E.html" target="_blank">NICT JLE</a>, a corpus of spoken second language English. <a href="https://github.com/LCR-ADS-Lab/SL2E-Dependency-Treebank" target="_blank">The homepage of the project is here.</a>

 

UD_English-CTeTex is a technical text corpus annotated in Universal Dependency syntax containing 196 software requirements.

 

Universal Dependencies syntax annotations from the Reddit portion of the GUM corpus (https://gucorpling.org/gum/).

 

See here for comparative statistics of English treebanks.

Language documentation

See the language documentation page.

Erzya treebanks

UD Erzya is the original annotation (CoNLL-U) for texts in the Erzya language. It originally consists of a sample from a number of fiction authors writing originals in Erzya.

 

Language documentation

See the language documentation page.

Estonian treebanks

UD Estonian is a converted version of the Estonian Dependency Treebank (EDT), originally annotated in the Constraint Grammar (CG) annotation scheme, and consisting of genres of fiction, newspaper texts and scientific texts. The treebank contains 30,972 trees, 437,769 tokens.

 

UD EWT treebank consists of different genres of new media. The treebank contains 7,190 trees, 90,585 tokens.

 

See here for comparative statistics of Estonian treebanks.

Language documentation

See the language documentation page.

Faroese treebanks

This is a treebank of Faroese based on the Faroese Wikipedia.

 

UD_Faroese-FarPaHC is a conversion of the [Faroese Parsed Historical Corpus (FarPaHC)](https://github.com/einarfs/farpahc) to the Universal Dependencies scheme. The conversion was done using [UDConverter](https://github.com/thorunna/UDConverter).
  • Contributors: Þórunn Arnardóttir, Hinrik Hafsteinsson, Einar Freyr Sigurðsson, Anton Karl Ingason, Eiríkur Rögnvaldsson, Joel C. Wallenberg

 

See here for comparative statistics of Faroese treebanks.

Language documentation

See the language documentation page.

Finnish treebanks

UD_Finnish-TDT is based on the Turku Dependency Treebank (TDT), a broad-coverage dependency treebank of general Finnish covering numerous genres. The conversion to UD was followed by extensive manual checks and corrections, and the treebank closely adheres to the UD guidelines.

 

FinnTreeBank 1 consists of manually annotated grammatical examples from VISK. The UD version of FinnTreeBank 1 was converted from a native annotation model with a script and later manually revised.

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).

 

Finnish-OOD is an external out-of-domain test set for Finnish-TDT, annotated natively in the UD scheme.

 

See here for comparative statistics of Finnish treebanks.

Language documentation

See the language documentation page.

French treebanks

**UD_French-GSD** was converted in 2015 from the content-head version of the universal dependency treebank v2.0 (https://github.com/ryanmcd/uni-dep-tb). It has been updated since 2015 independently of that source.
  • Contributors: Marie-Catherine de Marneffe, Bruno Guillaume, Ryan McDonald, Alane Suhr, Joakim Nivre, Matias Grioni, Carly Dickerson, Guy Perrier

 

**UD_French-Sequoia** is an automatic conversion of the [SUD_French-Sequoia](https://github.com/surfacesyntacticud/SUD_French-Sequoia) treebank, which comes from the earlier [French Sequoia corpus](http://deep-sequoia.inria.fr).

 

UD_French-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

 

Paris Stories is a corpus of oral French collected and transcribed by Linguistics students from Sorbonne Nouvelle and corrected by students from the Plurital Master's Degree of Computational Linguistics (Inalco, Paris Nanterre, Sorbonne Nouvelle) between 2017 and 2021. It contains monologues and dialogues from speakers living in the Parisian region.

 

A Universal Dependencies corpus for spoken French.

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Jana Strnadová, Gauthier Caron, Martin Popel, Daniel Zeman, Marie-Catherine de Marneffe, Bruno Guillaume

 

The corpus **UD_French-FQB** is an automatic conversion of the [French QuestionBank v1](http://alpage.inria.fr/Treebanks/FQB/), a corpus entirely made of questions.

 

See here for comparative statistics of French treebanks.

Language documentation

See the language documentation page.

Frisian Dutch treebanks

UD_Frisian_Dutch-Fame is a selection of 400 sentences from the FAME! speech corpus by Yilmaz et al. (2016a, 2016b). The treebank is manually annotated using the UD scheme.

 

Language documentation

See the language documentation page.

Galician treebanks

The Galician-TreeGal is a treebank for Galician developed at LyS Group (Universidade da Coruña) and at CiTIUS (Universidade de Santiago de Compostela).

 

The Galician PUD is a treebank for Galician developed at CiTIUS (Universidade de Santiago de Compostela).

 

The Galician UD treebank is based on the automatic parsing of the Galician Technical Corpus (http://sli.uvigo.gal/CTG) created at the University of Vigo by the TALG NLP research group.

 

See here for comparative statistics of Galician treebanks.

Language documentation

See the language documentation page.

Georgian treebanks

The Georgian UD Treebank (UD_Georgian-GLC) is the first syntactically annotated corpus of Georgian, based on a collection of annotated sentences selected from the Georgian Language Corpus (GLC) available at http://corpora.iliauni.edu.ge/ and sentences selected from Wiki in accordance with the 132 scientific fields.

 

Language documentation

See the language documentation page.

German treebanks

The German UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Georg Rehm, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Sebastian Bank, Martin Popel, Daniel Zeman

 

This treebank aims to gather texts from German literary history. Currently, it hosts Fragments of early Romanticism, i.e. aphorism-like texts mainly dealing with philosophical issues concerning art, beauty and related topics.

 

UD German-HDT is a conversion of the Hamburg Dependency Treebank, created at the University of Hamburg through manual annotation in conjunction with a standard for morphologically and syntactically annotating sentences as well as a constraint-based parser.

 

See here for comparative statistics of German treebanks.

Language documentation

See the language documentation page.

Gheg treebanks

UD Gheg Pear Stories (GPS) contains renarrations of Wallace Chafe's Pear Stories video (pearstories.org) by heritage speakers of Gheg Albanian living in Switzerland and speakers from Prishtina.

 

Language documentation

See the language documentation page.

Gothic treebanks

The UD Gothic treebank is based on the Gothic data from the PROIEL treebank, and consists of Wulfila's Bible translation.

 

Language documentation

See the language documentation page.

Greek treebanks

GUD is a resource for Greek (EL), manually annotated for morphology and syntax. It is an ongoing project led by Stella Markantonatou and Vivian Stamou (hereinafter: the GUD team), both researchers at the [Institute for Language and Speech Processing](http://www.ilsp.gr/) (ILSP/Athena Research Centre).

 

The Greek UD treebank (UD_Greek-GDT) is derived from the Greek Dependency Treebank (http://gdt.ilsp.gr), a resource developed and maintained by researchers at the Institute for Language and Speech Processing/Athena R.C. (http://www.ilsp.gr).

 

See here for comparative statistics of Greek treebanks.

Language documentation

See the language documentation page.

Guajajara treebanks

UD_Guajajara-TuDeT is a collection of annotated sentences in <a href="https://glottolog.org/resource/languoid/id/guaj1255">Guajajara</a>. Sentences stem from multiple sources such as descriptions of the language, short stories, dictionaries and translations from the New Testament. Sentence annotation and documentation by Lorena Martín Rodríguez and Fabrício Ferraz Gerardi.

 

Language documentation

See the language documentation page.

Guarani treebanks

UD_Guarani-OldTuDeT is a collection of annotated texts in <a href="https://glottolog.org/resource/languoid/id/oldp1258">Old Guaraní</a>. All known sources in this language are being annotated: catechisms, grammars (seventeenth and eighteenth century), sentences from dictionaries, and other texts. Sentence annotation and documentation by Fabrício Ferraz Gerardi and Lorena Martín Rodríguez.

 

Language documentation

See the language documentation page.

Gujarati treebanks

GujTB is an in-progress treebank of Gujarati (an Indo-Aryan language) in Gujarati script.

 

Language documentation

See the language documentation page.

Haitian Creole treebanks

This is a treebank of Haitian Creole. It contains 144 sentences selected from three major genres: Bible, literary texts, and newspapers. Kreyòl (Kreyòl Ayisyen, Haitian Creole, ISO 639-1: ht) is the main language of Haïti. The dialect described here is the Cap Haïtien dialect, which differs slightly in its lexicon from the Center and South varieties.

 

Language documentation

See the language documentation page.

Hausa treebanks

This treebank contains data of Southern Autogramm, for the Zaria dialect of Nigeria (Southern Hausa).

 

This treebank contains data of Northern Autogramm, for the Ader dialect of Niger Republic (Northern Hausa).

 

See here for comparative statistics of Hausa treebanks.

Language documentation

See the language documentation page.

Hebrew treebanks

Publicly available subset of the IAHLT UD Hebrew Treebank's Wikipedia section (https://www.iahlt.org/).

 

A Universal Dependencies Corpus for Hebrew.

 

See here for comparative statistics of Hebrew treebanks.

Language documentation

See the language documentation page.

Highland Puebla Nahuatl treebanks

UD_Highland_Puebla_Nahuatl-ITML is a collection of texts in the Highland Puebla variety of Nahuatl (ISO-639: `azz`) spoken in 24 municipalities in the state of Puebla in Mexico. The treebank contains spoken monologue and dialogue, scientific texts translated from Spanish, and some miscellaneous grammatical examples from a language course.

 

Language documentation

See the language documentation page.

Hindi treebanks

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Esha Banerjee, Pinkey Nainwani, Martin Popel, Daniel Zeman

 

The Hindi UD treebank is based on the Hindi Dependency Treebank (HDTB), created at IIIT Hyderabad, India.

 

See here for comparative statistics of Hindi treebanks.

Language documentation

See the language documentation page.

Hittite treebanks

UD_Hittite-HitTB is a small Universal Dependencies treebank for Hittite, containing original sentences from Hoffner and Melchert's tutorial to A Grammar of the Hittite Language.

 

Language documentation

See the language documentation page.

Hungarian treebanks

The Hungarian UD treebank is derived from the Szeged Dependency Treebank (Vincze et al. 2010).

 

Language documentation

See the language documentation page.

Icelandic treebanks

UD_Icelandic-IcePaHC is a conversion of the [Icelandic Parsed Historical Corpus (IcePaHC)](https://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC)) to the Universal Dependencies scheme. The conversion was done using [UDConverter](https://github.com/thorunna/UDConverter).
  • Contributors: Þórunn Arnardóttir, Hinrik Hafsteinsson, Einar Freyr Sigurðsson, Hildur Jónsdóttir, Kristín Bjarnadóttir, Anton Karl Ingason, Kristján Rúnarsson, Steinþór Steingrímsson, Joel C. Wallenberg, Eiríkur Rögnvaldsson
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

UD_Icelandic-Modern is a conversion of the [modern additions](https://github.com/antonkarl/icecorpus/tree/master/additions2019) to the Icelandic Parsed Historical Corpus (IcePaHC) to the Universal Dependencies scheme.
  • Contributors: Kristján Rúnarsson, Þórunn Arnardóttir, Hinrik Hafsteinsson, Starkaður Barkarson, Hildur Jónsdóttir, Steinþór Steingrímsson, Einar Freyr Sigurðsson
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

Icelandic-PUD is the Icelandic part of the Parallel Universal Dependencies (PUD) treebanks.

 

UD_Icelandic-GC is a conversion of the gold part of [GreynirCorpus](https://github.com/mideind/GreynirCorpus), which has been manually corrected and verified. The corpus is parsed into full constituency trees, and converted using [UDConverter-GreynirCorpus](https://github.com/thorunna/UDConverter-GreynirCorpus).
  • Contributors: Vilhjálmur Þorsteinsson, Hulda Óladóttir, Þórunn Arnardóttir, Sveinbjörn Þórðarson, Haukur Barri Símonarson, Katla Ásgeirsdóttir
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

See here for comparative statistics of Icelandic treebanks.

Language documentation

See the language documentation page.

Indonesian treebanks

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Ruli Manurung, Muh Shohibussirri, Martin Popel, Daniel Zeman, Ika Alfina, Arawinda Dinakaramani, Muhammad Yudistira Hanifmuti, Jessica Naraiswari Arwidarasti, Yogi Lesmana Sulestio
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

The Indonesian-GSD treebank was originally converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb) in 2015. In order to comply with the latest Indonesian annotation guidelines, the treebank has undergone a major revision between UD releases v2.8 and v2.9 (2021).

 

UD Indonesian-CSUI is a conversion from an Indonesian constituency treebank in the Penn Treebank format named [**Kethu**](https://github.com/ialfina/kethu) that was also a conversion from a constituency treebank built by [**Dinakaramani et al. (2015)**](https://github.com/famrashel/idn-treebank). We named this treebank **Indonesian-CSUI**, since all three versions of the treebank were built at the Faculty of Computer Science, Universitas Indonesia.
  • Contributors: Ika Alfina, Jessica Naraiswari Arwidarasti, Muhammad Yudistira Hanifmuti, Arawinda Dinakaramani, Ruli Manurung, Fam Rashel, Andry Luthfi
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

See here for comparative statistics of Indonesian treebanks.

Language documentation

See the language documentation page.

Irish treebanks

A Universal Dependencies 4910-sentence treebank for modern Irish.

 

A Universal Dependencies treebank of 2596 tweets in modern Irish.

 

This is the Cadhan Aonair UD treebank, consisting of 150 sentences randomly sampled from six pre-standard Irish texts. It was subsequently augmented with a late Early Modern Irish syllabic poem representing 43 sentences, described in a [separate section below](#bardic-segment).

 

See here for comparative statistics of Irish treebanks.

Language documentation

See the language documentation page.

Italian treebanks

The Italian corpus annotated according to the UD annotation scheme was obtained by conversion from ISDT (Italian Stanford Dependency Treebank), released for the dependency parsing shared task of Evalita-2014 (Bosco et al. 2014).

 

The UD_Italian-VIT corpus was obtained by conversion from VIT (Venice Italian Treebank), developed at the Laboratory of Computational Linguistics of the Università Ca' Foscari in Venice (Delmonte et al. 2007; Delmonte 2009; http://rondelmo.it/resource/VIT/Browser-VIT/index.htm).

 

Italian-Old is a treebank containing **Dante Alighieri's Comedy**, based on the 1994 Petrocchi edition and taken from the [**DanteSearch corpus**](https://dantesearch.dantenetwork.it), originally created at the University of Pisa, Italy. The syntactic annotation has been done from scratch, following the UD annotation scheme. It is a treebank of Old Italian, specifically Florentine. The Comedy was composed between approximately 1306 and 1321.

 

UD_Italian-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

 

ParlaMint-It is a collection of transcriptions of parliamentary sessions of the Italian Senate annotated in Universal Dependencies. The corpus is part of a larger multilingual collection of parliamentary transcripts built during the ParlaMint project (https://www.clarin.eu/parlamint).

 

TWITTIRÒ-UD is a collection of ironic Italian tweets annotated in Universal Dependencies. The treebank can be exploited for the training of NLP systems to enhance their performance on social media texts, and in particular, for irony detection purposes.

 

Manually corrected treebank of Learner Italian drawn from the Valico corpus, together with the corresponding corrected sentences.

 

PoSTWITA-UD is a collection of Italian tweets annotated in Universal Dependencies that can be exploited for the training of NLP systems to enhance their performance on social media texts.

 

The MarkIT resource contains around 800 sentences extracted from students' essays, manually annotated with syntactic dependencies. The treebank covers seven types of marked constructions, plus some ambiguous sentences whose syntax can be wrongly classified as marked.

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Antonio Stella, Davide Rovati, Martin Popel, Daniel Zeman, Maria Simi, Manuela Sanguinetti
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

See here for comparative statistics of Italian treebanks.

Language documentation

See the language documentation page.

Japanese treebanks

This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation. The original sentences are from Google UDT 2.0.
  • Contributors: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Ryan McDonald, Joakim Nivre, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation. The original sentences are from Google UDT 2.0.
  • Contributors: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Ryan McDonald, Joakim Nivre, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Kaoru Ito, Taishi Chika, Shinsuke Mori, Sumire Uematsu, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Atsuko Shimada, Anna Trukhina, Martin Popel, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Kaoru Ito, Taishi Chika, Shinsuke Mori, Sumire Uematsu, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Atsuko Shimada, Anna Trukhina, Martin Popel, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation. The original sentences are from the 'Balanced Corpus of Contemporary Written Japanese' (BCCWJ).
  • Contributors: Mai Omura, Masayuki Asahara, Yusuke Miyao, Takaaki Tanaka, Hiroshi Kanayama, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Yugo Murawaki
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation. The original sentences are from the 'Balanced Corpus of Contemporary Written Japanese' (BCCWJ). UD-Japanese-BCCWJLUW is an alternative word segmentation version of UD-Japanese-BCCWJ that uses the **Long Unit Word (LUW)** as the syntactic word in the UD definition.
  • Contributors: Mai Omura, Masayuki Asahara, Yusuke Miyao, Takaaki Tanaka, Hiroshi Kanayama, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Yugo Murawaki
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

See here for comparative statistics of Japanese treebanks.

Language documentation

See the language documentation page.

Javanese treebanks

UD Javanese-CSUI is a dependency treebank in Javanese, a regional language in Indonesia with more than 68 million users. It was developed by Alfina et al. from the Faculty of Computer Science, Universitas Indonesia. The newest version has 1000 sentences and 14K words with manual annotation.

 

Language documentation

See the language documentation page.

Kaapor treebanks

**UD_Kaapor-TuDeT** is a collection of annotated sentences in [Ka'apor](https://glottolog.org/resource/languoid/id/urub1250). The project is a work in progress and the treebank is being updated on a regular basis.

 

Language documentation

See the language documentation page.

Kangri treebanks

The Kangri UD Treebank (KDTB) is a part of the Universal Dependency treebank project.

 

Language documentation

See the language documentation page.

Karelian treebanks

UD Karelian-KKPP is a new, manually annotated corpus of Karelian made in the Universal Dependencies annotation scheme. The data is collected from [VepKar corpora](http://dictorpus.krc.karelia.ru/en/corpus/text) and consists of mostly modern news texts but also some stories and educational texts.

 

Language documentation

See the language documentation page.

Karo treebanks

UD_Karo-TuDeT is a collection of annotated sentences in [Karo](https://glottolog.org/resource/languoid/id/karo1306). The sentences stem from the only grammatical description of the language (Gabas, 1999) and from the sentences in the dictionary by the same author (Gabas, 2007). Sentence annotation and documentation by Fabrício Ferraz Gerardi.

 

Language documentation

See the language documentation page.

Kazakh treebanks

The UD Kazakh treebank is a combination of text from various sources including Wikipedia, some folk tales, sentences from the UDHR, news and phrasebook sentences. Sentence IDs include partial document identifiers.

 

Language documentation

See the language documentation page.

Khunsari treebanks

The AHA Khunsari Treebank is a small treebank for contemporary Khunsari. Its corpus is collected and annotated manually. We have prepared this treebank based on interviews with Khunsari speakers.

 

Language documentation

See the language documentation page.

Kiche treebanks

UD Kʼicheʼ-IU is a treebank consisting of sentences from a variety of text domains but principally dictionary example sentences and linguistic examples.

 

Language documentation

See the language documentation page.

Komi Permyak treebanks

This is a Komi-Permyak literary language treebank consisting of original and translated texts.

 

Language documentation

See the language documentation page.

Komi Zyrian treebanks

UD Komi-Zyrian Lattice is a treebank of written standard Komi-Zyrian.

 

This treebank consists of dialectal transcriptions of spoken Komi-Zyrian. The current texts are short recorded segments from different areas where the Iźva dialect of Komi language is spoken.

 

See here for comparative statistics of Komi Zyrian treebanks.

Language documentation

See the language documentation page.

Korean treebanks

The KAIST Korean Universal Dependency Treebank is generated by Chun et al., 2018 from the constituency trees in the [KAIST Tree-Tagging Corpus](http://semanticweb.kaist.ac.kr/home/index.php/Corpus4).

 

The Google Korean Universal Dependency Treebank is first converted from the [Universal Dependency Treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb), and then enhanced by Chun et al., 2018.

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Sookyoung Kwak, Yongseok Cho, Martin Popel, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

See here for comparative statistics of Korean treebanks.

Language documentation

See the language documentation page.

Kurmanji treebanks

The UD Kurmanji corpus is a corpus of Kurmanji Kurdish. It contains fiction and encyclopaedic texts in roughly equal measure. It has been annotated natively in accordance with the UD annotation scheme.

 

Language documentation

See the language documentation page.

Kyrgyz treebanks

UD_Kyrgyz-KTMU is a dependency treebank for the Kyrgyz language. Sentences were selected partly from Kyrgyz story and novel books, partly from Kyrgyz news websites.

 

This is a small treebank of grammatical examples for Kyrgyz.

 

See here for comparative statistics of Kyrgyz treebanks.

Language documentation

See the language documentation page.

Latgalian treebanks

UD_Latgalian-Cairo is an example treebank that provides a minimal dataset for Latgalian based on the Cairo sample sentences. Created by [AI Lab](http://ailab.lv) at the Institute of Mathematics and Computer Science, University of Latvia.

 

Language documentation

See the language documentation page.

Latin treebanks

Latin data from the _Index Thomisticus_ Treebank. Data are taken from the _Index Thomisticus_ corpus by Roberto Busa SJ, which contains the complete work by Thomas Aquinas (1225–1274; Medieval Latin) and by 61 other authors related to Thomas.

 

This Universal Dependencies version of the **LLCT** (Late Latin Charter Treebank) consists of an automated conversion of the **LLCT2** treebank from the Latin Dependency Treebank (LDT) format into the Universal Dependencies standard.

 

The **UDante** treebank is based on the Latin texts of Dante Alighieri, taken from the [**DanteSearch corpus**](https://dantesearch.dantenetwork.it), originally created at the University of Pisa, Italy. It is a treebank of Latin language, more precisely of **literary Medieval Latin** (XIVth century).
  • Contributors: Flavio Massimiliano Cecchini, Giovanni Moretti, Marco Passarotti, Rachele Sprugnoli, Daniela Corbetta, Federica Favero, Federica Gamba, Martina de Laurentiis, Giulia Pedonese, Andrea Peverelli, Elena Vagnoni, Mirko Tavoni
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

This Universal Dependencies Latin Treebank consists of an automatic conversion of a selection of passages from the Ancient Greek and Latin Dependency Treebank 2.1.

 

UD_Latin-CIRCSE is a repository of treebanks featuring Latin texts natively annotated at the CIRCSE Research Centre in Milan (https://centridiricerca.unicatt.it/circse/en.html) following the Universal Dependencies (UD) (https://universaldependencies.org) annotation scheme. The repository includes prose and poetry texts from different periods.
  • Contributors: Federica Iurescia, Federica Gamba, Flavio Massimiliano Cecchini, Francesco Mambrini, Giovanni Moretti, Marco Passarotti, Paolo Ruffolo
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

The Latin PROIEL treebank is based on the Latin data from the PROIEL treebank, and contains most of the Vulgate New Testament translations plus selections from Caesar's Gallic War, Cicero's Letters to Atticus, Palladius' Opus Agriculturae and the first book of Cicero's De officiis.

 

See here for comparative statistics of Latin treebanks.

Language documentation

See the language documentation page.

Latvian treebanks

Latvian UD Treebank is based on Latvian Treebank ([LVTB](http://sintakse.korpuss.lv)), being created at University of Latvia, Institute of Mathematics and Computer Science, [Artificial Intelligence Laboratory](http://ailab.lv).

 

This is an example treebank made to illustrate UD annotation choices made for Latvian based on the Cairo sample sentences. Created by [AI Lab](http://ailab.lv) at the Institute of Mathematics and Computer Science, University of Latvia.

 

See here for comparative statistics of Latvian treebanks.

Language documentation

See the language documentation page.

Ligurian treebanks

The Genoese Ligurian Treebank is a small, manually annotated collection of contemporary Ligurian prose. The focus of the treebank is written Genoese, the koiné variety of Ligurian which is associated with today's literary, journalistic and academic ligurophone sphere.

 

Language documentation

See the language documentation page.

Lithuanian treebanks

Lithuanian treebank annotated manually (dependencies) using the Morphological Annotator by CCL, Vytautas Magnus University (http://tekstynas.vdu.lt/) and manual disambiguation. A pilot version which includes news and an essay by Tomas Venclova is available here.

 

The Lithuanian dependency treebank ALKSNIS v3.0 (Vytautas Magnus University).
  • Contributors: Andrius Utka, Erika Rimkutė, Agnė Bielinskienė, Jolanta Kovalevskaitė, Loïc Boizou, Gabrielė Aleksandravičiūtė, Kristina Brokaitė, Daniel Zeman, Natalia Perkova, Bernadeta Griciūtė
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

See here for comparative statistics of Lithuanian treebanks.

Language documentation

See the language documentation page.

Livvi treebanks

UD Livvi-KKPP is a new, manually annotated corpus of Livvi-Karelian made directly in the Universal Dependencies annotation scheme. The data is collected from [VepKar corpora](http://dictorpus.krc.karelia.ru/en/corpus/text) and consists of mostly modern news texts but also some stories and educational texts.

 

Language documentation

See the language documentation page.

Low Saxon treebanks

The UD Low Saxon LSDC dataset consists of sentences in 8 major Low Saxon dialect groups from both Germany and the Netherlands. These sentences are (or are to become) part of the LSDC dataset and represent the language from mostly the 19th and early 20th century in genres such as short stories, novels, speeches, letters and fairytales.

 

Language documentation

See the language documentation page.

Luxembourgish treebanks

The LuxBank corpus currently consists of the translated Cairo CICLing examples, and will be extended to include examples from a national dataset. It is the first comprehensive treebank dataset for Luxembourgish.

 

Language documentation

See the language documentation page.

Macedonian treebanks

The Macedonian-MTB treebank is a collection of annotated sentences taken from the Macedonian version of the Cairo CICLing Corpus and from the university textbook in syntax "Contemporary Macedonian Language 4" by Simov Sazdov.

 

Language documentation

See the language documentation page.

Madi treebanks

UD_Madi-Jarawara is a collection of annotated sentences in Madí (Jarawara dialect) from a variety of sources, including grammar examples, oral stories, didactic material, and dictionary examples.

 

Language documentation

See the language documentation page.

Maghrebi Arabic French treebanks

A Universal Dependencies corpus for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent usage of code-switching. In addition to the UD annotations, we added NER annotations extending the French Treebank NER scheme (Sagot et al., 2012) and offensive language classification, and corrected many of the translations (still ongoing).

 

Language documentation

See the language documentation page.

Makurap treebanks

UD_Makuráp-TuDeT is a collection of annotated texts in Makuráp. The project is a work in progress and the treebank is being updated on a regular basis. The sentences are being annotated by Carolina Aragon, Fabrício Ferraz Gerardi, Luana dos Santos, and Luan Cabral.

 

Language documentation

See the language documentation page.

Malayalam treebanks

Currently just a small sample of Malayalam grammatical examples.

 

Language documentation

See the language documentation page.

Maltese treebanks

MUDT (Maltese Universal Dependencies Treebank) is a manually annotated treebank of Maltese, a Semitic language of Malta descended from North African Arabic with a significant amount of Italo-Romance influence. MUDT was designed as a balanced corpus with four major genres (see Splitting below) represented roughly equally.

 

Language documentation

See the language documentation page.

Manx treebanks

This is the Cadhan Aonair UD treebank for Manx Gaelic, created by Kevin Scannell.

 

Language documentation

See the language documentation page.

Marathi treebanks

UD Marathi is a manually annotated treebank consisting primarily of stories from Wikisource, and parts of an article on Wikipedia.

 

Language documentation

See the language documentation page.

Mbya Guarani treebanks

UD Mbya_Guarani-Thomas is a corpus of Mbyá Guaraní (Tupian) texts collected by Guillaume Thomas. The current version of the corpus consists of three speeches by Paulina Kerechu Núñez Romero, a Mbyá Guaraní speaker from Ytu, Caazapá Department, Paraguay.

 

UD Mbya_Guarani-Dooley is a corpus of narratives written in Mbyá Guaraní (Tupian) in Brazil, and collected by Robert Dooley. Due to copyright restrictions, the corpus that is distributed as part of UD only contains the annotation (tags, features, relations) while the FORM and LEMMA columns are empty.
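
As a rough illustration of what such annotation-only data looks like to a consumer, the sketch below parses a CoNLL-U fragment and checks whether FORM and LEMMA are withheld. It assumes the empty columns are encoded with the usual CoNLL-U underscore placeholder (`_`); the two-token sample sentence and the helper names `parse_conllu` and `text_is_withheld` are hypothetical and not taken from the treebank or any UD tool.

```python
# The ten CoNLL-U columns, in the order defined by the format.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

# Hypothetical sample (not real treebank data): FORM and LEMMA are
# replaced by "_", while tags and relations are kept.
SAMPLE = (
    "# sent_id = example-1\n"
    "1\t_\t_\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\t_\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)


def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            # Token line: map the tab-separated fields onto column names.
            current.append(dict(zip(COLUMNS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


def text_is_withheld(sentence):
    """True if every token has the "_" placeholder in FORM and LEMMA."""
    return all(tok["FORM"] == "_" and tok["LEMMA"] == "_" for tok in sentence)


sentences = parse_conllu(SAMPLE)
upos_tags = [tok["UPOS"] for tok in sentences[0]]
print(text_is_withheld(sentences[0]), upos_tags)
```

The tags, features and relations remain usable for work that does not need the surface text, such as counting relation types; anything that needs word forms would require the original texts.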

 

See here for comparative statistics of Mbya Guarani treebanks.

Language documentation

See the language documentation page.

Middle French treebanks

UD_Middle_French-PROFITEROLE is the Middle French section of the PROFITEROLE corpus; the Old French section is UD_Old_French-PROFITEROLE.
  • Contributors: Sophie Prévost, Eric Villemonte de la Clergerie, Mathilde Regnault, Loïc Grobol, Benoît Crabbé, Mathieu Dehouck, Alexei Lavrentiev
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

Language documentation

See the language documentation page.

Moksha treebanks

UD_Moksha-JR originates from the Erme Universal Dependencies annotated texts for Moksha, with CoNLL-U annotation for texts in the Moksha language. It originally consists of a sample from a number of fiction authors writing originals in Moksha.

 

Language documentation

See the language documentation page.

Munduruku treebanks

UD_Munduruku-TuDeT is a collection of annotated sentences in [Mundurukú](http://www.endangeredlanguages.com/lang/2981). The project is a work in progress and the treebank is being updated on a regular basis.

 

Language documentation

See the language documentation page.

Naija treebanks

A Universal Dependencies corpus for spoken Naija (Nigerian Pidgin).
  • Contributors: Bernard Caron, Emmett Strickland, Marine Courtin, Kim Gerdes, Bruno Guillaume, Sylvain Kahane, Chika Kennedy Ajede, Emeka Onwuegbuzia, Samson Tella
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

Language documentation

See the language documentation page.

Nayini treebanks

The AHA Nayini Treebank is a small treebank for contemporary Nayini. Its corpus is collected and annotated manually. We have prepared this treebank based on interviews with Nayini speakers.

 

Language documentation

See the language documentation page.

Neapolitan treebanks

This treebank contains example sentences in Neapolitan, translated by a native speaker.

 

Language documentation

See the language documentation page.

Nheengatu treebanks

The [UD_Nheengatu-CompLin](https://doi.org/10.5753/stil.2023.234131) is a treebank of [Nheengatu](https://glottolog.org/resource/languoid/id/nhen1239) (ISO-639: `yrl`), also known, inter alia, as Modern Tupi and *Língua Geral Amazônica*. It comprises sentences from diverse published sources, e.g., spontaneous speech, grammatical descriptions, fables, myths, coursebooks, and dictionaries.

 

Language documentation

See the language documentation page.

North Sami treebanks

This is a North Sámi treebank based on a manually disambiguated and function-labelled gold-standard corpus of North Sámi produced by the Giellatekno team at UiT Norgga árktalaš universitehta.

 

Language documentation

See the language documentation page.

Norwegian treebanks

The Norwegian UD treebank is based on the Bokmål section of the Norwegian Dependency Treebank (NDT), which is a syntactic treebank of Norwegian. The current version of NDT has been automatically converted to the UD scheme by Ingerid Løyning Dale, Per Erik Solberg and Andre Kåsen at the Norwegian Language Bank at the National Library of Norway. This conversion builds to a large extent on previous conversions by Lilja Øvrelid at the University of Oslo.

 

The Norwegian UD treebank is based on the Nynorsk section of the Norwegian Dependency Treebank (NDT), which is a syntactic treebank of Norwegian. NDT has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.

 

See here for comparative statistics of Norwegian treebanks.

Language documentation

See the language documentation page.

Old Church Slavonic treebanks

The Old Church Slavonic (OCS) UD treebank is based on canonical Old Church Slavonic data from the PROIEL and TOROT treebanks.

 

Language documentation

See the language documentation page.

Old East Slavic treebanks

`UD_Old_East_Slavic-RNC` is a sample of the Middle Russian corpus (1300-1700), a part of the Russian National Corpus. The data were originally annotated according to the RNC and extended UD-Russian morphological schemas and UD 2.4 dependency schema.

 

The Ruthenian UD treebank includes texts written in the territories of modern Belarus, Lithuania, Ukraine, and Poland in ca. 1300-1700. A sample of legal and nonfiction texts is drawn from the Ruthenian Corpus.

 

UD_Old_East_Slavic-TOROT is a conversion of a selection of Old East Slavonic and Middle Russian data from the Tromsø Old Russian and OCS Treebank (TOROT), which was originally annotated in PROIEL dependency format.

 

UD Old_East_Slavic-Birchbark is based on the RNC Corpus of Birchbark Letters and includes documents written in 1025-1500 in an East Slavic vernacular (letters, household and business records, records for church services, spells against diseases, and other short inscriptions). The treebank is manually syntactically annotated in the UD 2.0 scheme; the morphological and lexical annotation is a conversion of the original RNC annotation.

 

See here for comparative statistics of Old East Slavic treebanks.

Language documentation

See the language documentation page.

Old French treebanks

UD_Old_French-PROFITEROLE is an expansion of the previous UD_Old_French-SRCMF, which was a conversion of (part of) the SRCMF corpus (Syntactic Reference Corpus of Medieval French, [srcmf.org](http://srcmf.org/)).
  • Contributors: Sophie Prévost, Aurélie Collomb, Kim Gerdes, Isabelle Tellier, Marine Courtin, Alexei Lavrentiev, Céline Guillot-Barbance, Loïc Grobol, Mathilde Regnault
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

Language documentation

See the language documentation page.

Old Irish treebanks

A Universal Dependencies treebank for the Old Irish glosses of St. Gall.

 

A Universal Dependencies treebank for the Old Irish Würzburg glosses.

 

See here for comparative statistics of Old Irish treebanks.

Language documentation

See the language documentation page.

Old Turkish treebanks

This repository contains an [Old Turkish](https://iso639-3.sil.org/code/otk) treebank built upon Old Turkic script texts.

 

Language documentation

See the language documentation page.

Ottoman Turkish treebanks

An Ottoman Turkish dependency treebank annotated in UD style. Created by Enes Yılandiloğlu.

 

An Ottoman Turkish dependency treebank annotated in UD style. Created by [Şaziye Betül Özateş](https://sb-b.github.io/), Tarık Emre Tıraş, Efe Eren Genç from Boğaziçi University, and Esma Fatıma Bilgin Taşdemir from Medeniyet University.

 

See here for comparative statistics of Ottoman Turkish treebanks.

Language documentation

See the language documentation page.

Paumari treebanks

This is a small treebank of Paumari, a low-resource Amazonian language.

 

Language documentation

See the language documentation page.

Persian treebanks

The Persian Universal Dependency Treebank (PerUDT) is the result of automatic conversion of the Persian Dependency Treebank (PerDT) with extensive manual corrections. Please refer to the following work if you use this data: Mohammad Sadegh Rasooli, Pegah Safari, Amirsaeid Moloodi, and Alireza Nourian. "The Persian Dependency Treebank Made Universal". 2020 (to appear).

 

The Persian Universal Dependency Treebank (Seraji) is based on the Uppsala Persian Dependency Treebank (UPDT). The conversion of the UPDT to Universal Dependencies was performed semi-automatically with extensive manual checks and corrections.

 

See here for comparative statistics of Persian treebanks.

Language documentation

See the language documentation page.

Polish treebanks

The Polish PDB-UD treebank is automatically converted from the Polish Dependency Bank 2.0 (PDB 2.0). Both treebanks were created at the [Institute of Computer Science, Polish Academy of Sciences](https://ipipan.waw.pl/en/) in Warsaw (Poland).

 

The LFG Enhanced UD treebank of Polish is based on a corpus of LFG (Lexical Functional Grammar) syntactic structures generated by an LFG grammar of Polish, POLFIE, and manually disambiguated by human annotators.

 

This is the Polish portion of the Parallel Universal Dependencies (PUD) treebanks, created at the [Institute of Computer Science, Polish Academy of Sciences](https://ipipan.waw.pl/en/) in Warsaw (Poland).

 

See here for comparative statistics of Polish treebanks.

Language documentation

See the language documentation page.

Pomak treebanks

The Pomak UD treebank is derived from the Pomak Dependency Treebank, a resource developed and maintained by researchers at the Institute for Language and Speech Processing/Athena R.C. (http://www.ilsp.gr).

 

Language documentation

See the language documentation page.

Portuguese treebanks

UD_Portuguese-PetroGold is a fully revised treebank which consists of academic texts from the oil & gas domain in Brazilian Portuguese.

 

Porttinari-base [(Duran et al., 2023)](https://sol.sbc.org.br/index.php/stil/article/view/25443/25264) is the journalistic portion of Porttinari (which stands for “PORTuguese Treebank”), which shall be a large multigenre treebank for Portuguese [(Pardo et al., 2021)](https://sol.sbc.org.br/index.php/stil/article/view/17778/17612), following the "Universal Dependencies" international grammar framework [(de Marneffe et al., 2021)](https://aclanthology.org/2021.cl-2.11/).

 

This Universal Dependencies (UD) Portuguese treebank is based on the Constraint Grammar converted version of the Bosque, which is part of the Floresta Sintá(c)tica treebank. It contains both European (CETEMPúblico) and Brazilian (CETENFolha) variants.
  • Contributors: Alexandre Rademaker, Cláudia Freitas, Elvis de Souza, Aline Silveira, Tatiana Cavalcanti, Wograine Evelyn, Luisa Rocha, Isabela Soares-Bastos, Eckhard Bick, Fabricio Chalub, Guilherme Paulino-Passos, Livy Real, Valeria de Paiva, Daniel Zeman, Martin Popel, David Mareček, Natalia Silveira, André Martins
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

CINTIL-UDep is a dependency bank of Portuguese that is treebanked with Universal Dependencies. It contains over 38K annotated sentences (and 476K tokens), of mostly newspaper text.
  • Contributors: Mariana Avelãs, António Branco, Marisa Campos, Catarina Carvalheiro, Rita Carvalho, Sérgio Castro, Francisco Costa, Cláudia Martins, Rita Pereira, Sílvia Pereira, Clara Pinto, Andreia Querido, Joana Ramos, João Silva, Sara Silveira
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

The Brazilian Portuguese UD is converted from the [Google Universal Dependency Treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).
  • Contributors: Alexandre Rademaker, Ryan McDonald, Joakim Nivre, Daniel Zeman, Fabricio Chalub, Carlos Ramisch, Juan Belieni, Vanessa Berwanger Wille, Rodrigo Pintucci
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Gustavo Mendonça, Larissa Rinaldi, Martin Popel, Daniel Zeman, Valeria de Paiva, Alexandre Rademaker, Elvis de Souza
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

See here for comparative statistics of Portuguese treebanks.

Language documentation

See the language documentation page.

Romanian treebanks

The Romanian UD treebank (called RoRefTrees) (Barbu Mititelu et al., 2016) is the reference treebank in UD format for standard Romanian.

 

SiMoNERo is a medical corpus of contemporary Romanian.

 

This is a (currently small) Twitter treebank containing a subset of tweets from [CoRoSeOf](https://github.com/DianaHoefels/CoRoSeOf).

 

The UD treebank ArT is a treebank of the Aromanian dialect of the Romanian language in UD format.

 

The Romanian Non-standard UD treebank (called UAIC-RoDia) is based on the UAIC-RoDia Treebank (ISLRN 156-635-615-024-0).
  • Contributors: Cătălina Mărănduc, Cenel-Augusto Perez, Victoria Bobicev, Cătălin Mititelu, Florinel Hociung, Valentin Roșca, Roman Untilov, Petru Rebeja
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

See here for comparative statistics of Romanian treebanks.

Language documentation

See the language documentation page.

Russian treebanks

This Universal Dependencies treebank is based on data samples extracted from the Taiga Corpus and from the MorphoRuEval-2017 and GramEval-2020 shared task collections.

 

UD_Russian-Poetry contains samples of Russian poetry written from the 19th to the early 21st century. The treebank is based on the Poetry Corpus of the Russian National Corpus.

 

Russian data from the SynTagRus corpus.

 

Russian Universal Dependencies Treebank annotated and converted by Google.

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Tatiana Lando, Olga Loginova, Martin Popel, Daniel Zeman, Kira Droganova
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

See here for comparative statistics of Russian treebanks.

Language documentation

See the language documentation page.

Sanskrit treebanks

A small Sanskrit treebank of sentences from Pañcatantra, an ancient Indian collection of interrelated fables by Vishnu Sharma.

 

The Treebank of Vedic Sanskrit contains 4,000 sentences with 27,000 words chosen from metrical and prose passages of the Ṛgveda (RV), the Śaunaka recension of the Atharvaveda (ŚS), the Maitrāyaṇīsaṃhitā (MS), and the Aitareya- (AB) and Śatapatha-Brāhmaṇas (ŚB). Lexical and morpho-syntactic information has been generated using tagging software and manually validated. POS tags have been induced automatically from the morpho-syntactic information of each word.

 

See here for comparative statistics of Sanskrit treebanks.

Language documentation

See the language documentation page.

Scottish Gaelic treebanks

A treebank of Scottish Gaelic based on the [Annotated Reference Corpus Of Scottish Gaelic (ARCOSG)](https://github.com/Gaelic-Algorithmic-Research-Group/ARCOSG).

 

Language documentation

See the language documentation page.

Serbian treebanks

The Serbian UD treebank is based on the [SETimes-SR](http://hdl.handle.net/11356/1200) corpus and additional news documents from the Serbian web.

 

Language documentation

See the language documentation page.

Sinhala treebanks

This treebank consists of contemporary written Sinhala text taken from a 10M corpus maintained by UCSC, Sri Lanka. The corpus contains novels, short stories, Sinhala translations, critiques and Sinhala newspapers.

 

Language documentation

See the language documentation page.

Skolt Sami treebanks

The UD Skolt Sami Giellagas treebank is based almost entirely on spoken Skolt Sami corpora.

 

Language documentation

See the language documentation page.

Slovak treebanks

The Slovak UD treebank is based on data originally annotated as part of the Slovak National Corpus, following the annotation style of the Prague Dependency Treebank.

 

Language documentation

See the language documentation page.

Slovenian treebanks

The SSJ treebank is the reference UD treebank for Slovenian, consisting of approximately 13,000 sentences and 267,097 tokens from fiction, non-fiction, periodical and Wikipedia texts in standard modern Slovenian. As of UD release 2.10 in May 2022, the original version of the SSJ UD treebank has been partially manually revised and extended with new manually annotated data.

 

The Spoken Slovenian Treebank (SST) is a manually annotated collection of transcribed audio recordings featuring spontaneous speech in various everyday situations. It includes 344 unique speech events (documents) amounting to approximately 10 hours of speech, encompassing a total of 6,104 utterances and 76,341 tokens.

 

See here for comparative statistics of Slovenian treebanks.

Language documentation

See the language documentation page.

Soi treebanks

The AHA Soi Treebank is a small treebank for contemporary Soi. Its corpus is collected and annotated manually. We have prepared this treebank based on interviews with Soi speakers.

 

Language documentation

See the language documentation page.

South Levantine Arabic treebanks

The South_Levantine_Arabic-MADAR treebank consists of 100 manually-annotated sentences taken from the [MADAR](https://camel.abudhabi.nyu.edu/madar/) (Multi-Arabic Dialect Applications and Resources) project. TO-DO: Add 20 annotated sentences from CCC as a train set.

 

Language documentation

See the language documentation page.

Spanish treebanks

Spanish data from the [AnCora](http://clic.ub.edu/corpus/) corpus.

 

The Spanish UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Hector Fernandez Alcalde, Laura Moreno Romero, Martin Popel, Daniel Zeman, Héctor Martínez Alonso
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

The COSER UD Treebank (COSER-UD) is the first syntactically annotated corpus of spoken Spanish, based on a sample of the "Corpus Oral y Sonoro del Español Rural" (COSER; Fernández-Ordóñez 2005-present), meaning the "Audible Corpus of Spoken Rural Spanish".

 

See here for comparative statistics of Spanish treebanks.

Language documentation

See the language documentation page.

Swedish treebanks

The Swedish-Talbanken treebank is based on Talbanken, a treebank developed at Lund University in the 1970s.

 

UD Swedish_LinES is the Swedish half of the LinES Parallel Treebank with UD annotations. All segments are translations from English and the sources cover literary genres, online manuals and Europarl data.

 

Swedish-PUD is the Swedish part of the Parallel Universal Dependencies (PUD) treebanks.

 

See here for comparative statistics of Swedish treebanks.

Language documentation

See the language documentation page.

Swedish Sign Language treebanks

The Universal Dependencies treebank for Swedish Sign Language (ISO 639-3: swl) is derived from the Swedish Sign Language Corpus (SSLC) from the Department of Linguistics, Stockholm University.

 

Language documentation

See the language documentation page.

Swiss German treebanks

UD_Swiss_German-UZH is a tiny manually annotated treebank of 100 sentences in different Swiss German dialects and a variety of text genres.

 

Language documentation

See the language documentation page.

Tagalog treebanks

UD_Tagalog-TRG is a UD treebank manually annotated using sentences from a grammar book.

 

Ugnayan is a manually annotated Tagalog treebank currently composed of educational fiction and nonfiction text. The treebank is under development at the University of the Philippines.

 

See here for comparative statistics of Tagalog treebanks.

Language documentation

See the language documentation page.

Tamil treebanks

The UD Tamil treebank is based on the Tamil Dependency Treebank created at the Charles University in Prague by Loganathan Ramasamy.

 

MWTT (Modern Written Tamil Treebank) has sentences taken primarily from a text called "A Grammar of Modern Tamil" by Thomas Lehmann (1993). This initial release has 536 sentences of various lengths, all of which are included in the test set.

 

See here for comparative statistics of Tamil treebanks.

Language documentation

See the language documentation page.

Tatar treebanks

UD Tatar-NMCTT is a manually annotated corpus of the Tatar language based on the text from Tatar-Inform (tatar-inform.tatar), an online news website.

 

Language documentation

See the language documentation page.

Teko treebanks

UD_Teko-TuDeT is a collection of annotated sentences in <a href="https://glottolog.org/resource/languoid/id/emer1243"> Tekó (Emérillon) </a>. The sentences stem from the only grammatical description of the language (Rose, 2011). Sentence annotation and documentation by Uliana Vedenina and Fabrício Ferraz Gerardi.

 

Language documentation

See the language documentation page.

Telugu treebanks

The Telugu UD treebank is created in UD from manual annotations of sentences from a grammar book.

 

Language documentation

See the language documentation page.

Telugu English treebanks

UD Telugu_English-TECT is a Telugu-English code-switching treebank.

 

Language documentation

See the language documentation page.

Thai treebanks

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Rattima Nitisaroj, Yanin Sawanakunanon, Martin Popel, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

Language documentation

See the language documentation page.

Tswana treebanks

UD Tswana-Popapolelo is a translation of the 20 Cairo CICLing sentences (https://github.com/UniversalDependencies/cairo) annotated with XPOS, UPOS and dependency relations.

 

Language documentation

See the language documentation page.

Tupinamba treebanks

UD_Tupinamba-TuDeT is a collection of annotated sentences in [Tupinambá](https://glottolog.org/resource/languoid/id/tupi1273). All known sources in this language are being annotated: catechisms, letters, poems, theater plays, and grammars (sixteenth and seventeenth centuries). Sentence annotation and documentation by [Fabrício Ferraz Gerardi](https://languagestructure.github.io).

 

Language documentation

See the language documentation page.

Turkish treebanks

Turkish-Kenet UD Treebank is the biggest treebank of Turkish. It consists of 18,700 manually annotated sentences and 178,700 tokens. Its corpus consists of dictionary examples.

 

Turkish version of the Penn Treebank. It consists of a total of 9,560 manually annotated sentences and 87,367 tokens. (It only includes sentences up to 15 words long.)
  • Contributors: Neslihan Cesur, Aslı Kuzgun, Olcay Taner Yıldız, Büşra Marşan, Neslihan Kara, Bilge Nas Arıcan, Merve Özçelik, Deniz Baran Aslan
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

Turkish Tourism is a domain-specific treebank consisting of 19,750 manually annotated sentences and 92,200 tokens. These sentences were taken from the original customer reviews of a tourism company.
  • Contributors: Aslı Kuzgun, Neslihan Cesur, Olcay Taner Yıldız, Oğuzhan Kuyrukçu, Büşra Marşan, Bilge Nas Arıcan, Neslihan Kara, Deniz Baran Aslan, Ezgi Sanıyar, Cengiz Asmazoğlu
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

This treebank is a translation of the English ATIS (Airline Travel Information System) corpus (see References). It consists of 5,432 sentences.

 

This is a treebank annotating example sentences from a comprehensive grammar book of Turkish.

 

Turkish FrameNet consists of 2,700 manually annotated example sentences and 19,221 tokens. Its data consists of the sentences taken from the Turkish FrameNet Project. The annotated sentences can be filtered according to the semantic frame category of the root of the sentence.
  • Contributors: Neslihan Cesur, Aslı Kuzgun, Olcay Taner Yıldız, Büşra Marşan, Oğuzhan Kuyrukçu, Bilge Nas Arıcan, Ezgi Sanıyar, Neslihan Kara, Merve Özçelik
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

A Turkish dependency treebank annotated in UD style. Created by the members of [TABILAB](https://tabilab.cmpe.boun.edu.tr/) from Boğaziçi University.
  • Contributors: Büşra Marşan, Salih Furkan Akkurt, Utku Türk, Furkan Atmaca, Şaziye Betül Özateş, Gözde Berk, Seyyit Talha Bedir, Abdullatif Köksal, Balkız Öztürk Başaran, Tunga Güngör, Arzucan Özgür
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

The UD Turkish Treebank, also called the IMST-UD Treebank, is a semi-automatic conversion of the IMST Treebank (Sulubacak & Eryiğit, 2018; Sulubacak et al., 2016).
  • Contributors: Utku Türk, Şaziye Betül Özateş, Büşra Marşan, Salih Furkan Akkurt, Çağrı Çöltekin, Gülşen Cebiroğlu Eryiğit, Memduh Gökırmak, Hüner Kaşıkara, Umut Sulubacak, Francis Tyers
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Savas Cetin, Martin Popel, Daniel Zeman, Francis Tyers, Çağrı Çöltekin, Utku Türk, Furkan Atmaca, Şaziye Betül Özateş, Abdullatif Köksal, Balkız Öztürk Başaran, Tunga Güngör, Arzucan Özgür
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

See here for comparative statistics of Turkish treebanks.

Language documentation

See the language documentation page.

Turkish German treebanks

UD Turkish-German SAGT is a Turkish-German code-switching treebank that is developed as part of the [SAGT](https://www.ims.uni-stuttgart.de/en/research/projects/sagt/) project.

 

Language documentation

See the language documentation page.

Ukrainian treebanks

Gold standard Universal Dependencies corpus for Ukrainian, developed for UD originally, by [Institute for Ukrainian](https://mova.institute), NGO. [[українською](https://mova.institute/золотий_стандарт)]

 

Language documentation

See the language documentation page.

Umbrian treebanks

UD_Umbrian-IKUVINA is a dependency treebank rendering of the Iguvine tablets ([Wikipedia](https://en.wikipedia.org/wiki/Iguvine_Tablets)). The seven bronze tablets describe religious ceremonies performed by the Umbrian people in Italy before the rise of the Roman empire. The corpus will eventually contain all the tablets, but as of May 2022 only tablet I is released, with partial morphological analysis and partial lemmatisation. (POS tagging and dependency trees are complete.)

 

Language documentation

See the language documentation page.

Upper Sorbian treebanks

A small treebank of Upper Sorbian based mostly on Wikipedia.

 

Language documentation

See the language documentation page.

Urdu treebanks

The Urdu Universal Dependency Treebank was automatically converted from the Urdu Dependency Treebank (UDTB), which is part of an ongoing effort to create multi-layered treebanks for Hindi and Urdu.

 

Language documentation

See the language documentation page.

Uyghur treebanks

The Uyghur UD treebank is based on the Uyghur Dependency Treebank (UDT), created at the Xinjiang University in Ürümqi, China.

 

Language documentation

See the language documentation page.

Veps treebanks

UD Veps-VWT is a manually annotated corpus of Veps in the Universal Dependencies annotation scheme. The data is collected from [VepKar corpora](http://dictorpus.krc.karelia.ru/en/corpus/text) and consists mostly of modern news texts written in the Central Veps dialect.

 

Language documentation

See the language documentation page.

Vietnamese treebanks

The Vietnamese UD treebank is a conversion of the constituent treebank created in the VLSP project (https://vlsp.hpda.vn/).

 

This treebank includes a set of sentences from [OPUS](https://opus.nlpl.eu/), sourced from subtitles, talks, and educational videos.

 

See here for comparative statistics of Vietnamese treebanks.

Language documentation

See the language documentation page.

Warlpiri treebanks

A small treebank of grammatical examples in Warlpiri, taken from linguistic literature.

 

Language documentation

See the language documentation page.

Welsh treebanks

UD Welsh-CCG (Corpws Cystrawennol y Gymraeg) is a treebank of Welsh, annotated according to the Universal Dependencies guidelines.

 

Language documentation

See the language documentation page.

Western Armenian treebanks

A Universal Dependencies treebank for Western Armenian developed for UD originally by the ArmTDP team led by Marat M. Yavrumyan at the Yerevan State University.

 

Language documentation

See the language documentation page.

Western Sierra Puebla Nahuatl treebanks

UD Western Sierra Puebla Nahuatl-IU is a treebank consisting of sentences from written fiction and non-fiction, spontaneous speech, and grammar examples.

 

Language documentation

See the language documentation page.

Wolof treebanks

UD_Wolof-WTB is a natively and manually developed treebank for Wolof. Sentences were collected from encyclopedic, fictional, biographical, and religious texts and from news.

 

Language documentation

See the language documentation page.

Xavante treebanks

UD_Xavante-XDT is a collection of annotated sentences in [Xavante](https://glottolog.org/resource/languoid/id/xava1240). Sentence annotation and documentation by [Fabrício Ferraz Gerardi](http://languagestructure.github.io/), Ivan Roksandic.

 

Language documentation

See the language documentation page.

Xibe treebanks

The UD Xibe Treebank is a corpus of the Xibe language (ISO 639-3: *sjo*) containing manually annotated syntactic trees under the Universal Dependencies. Sentences come from three sources: grammar book examples, newspaper (Cabcal News) and Xibe textbooks.

 

Language documentation

See the language documentation page.

Yakut treebanks

UD_Yakut-YKTDT is a collection of Yakut ([Sakha](https://glottolog.org/resource/languoid/id/yaku1245)) sentences. The project is a work in progress and the treebank is being updated on a regular basis.

 

Language documentation

See the language documentation page.

Yoruba treebanks

Parts of the Yoruba Bible and of the Yoruba edition of Wikipedia, hand-annotated natively in Universal Dependencies.

 

Language documentation

See the language documentation page.

Yupik treebanks

UD_Yupik-SLI is a treebank of St. Lawrence Island Yupik (ISO 639-3: ess) that has been manually annotated at the morpheme level, based on a finite-state morphological analyzer by [Chen et al., 2020](https://www.aclweb.org/anthology/2020.lrec-1.326). The word-level annotation, merging multiword expressions, is provided in not-to-release/ess_sli-ud-test.merged.conllu. More information about the treebank can be found in our publication (AmericasNLP, 2021).

 

Language documentation

See the language documentation page.

Zaar treebanks

A Universal Dependencies corpus for Zaar (aka Sayanci), a member of the Chadic branch of the Afro-Asiatic phylum. The language is mainly spoken by about 200,000 speakers in the Bogoro and Tafawa Balewa local governments of Bauchi State, Nigeria.

 

Language documentation

See the language documentation page.

Disclaimer: Our use of flags to symbolise languages is only intended as a visual enhancement of the website and should not be interpreted as a political statement in any way.

Possible Future Extensions

People have expressed interest in providing annotated data for the following languages but no valid data has been provided so far.

Akkadian treebanks

UD_Akkadian-MCONG is a treebank of normalized Akkadian sentences drawn mostly from Neo-Assyrian corpora lemmatized on [Oracc](http://oracc.museum.upenn.edu/). Sentences are annotated for lemma, syntactic dependencies, and morphological features. The treebank contains approximately 112,000 words.

 

See here for comparative statistics of Akkadian treebanks.

Language documentation

See the language documentation page.

Amharic treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

See the language documentation page.

Archaic Irish treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Assamese treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Bengali treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks originally created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).

 

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

See the language documentation page.

Bhojpuri treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

See the language documentation page.