This page pertains to UD version 2.

Universal Dependencies

Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 600 contributors producing over 200 treebanks in over 150 languages. If you are new to UD, you should start by reading the first part of the Short Introduction and then browsing the annotation guidelines.
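The three annotation layers named above (parts of speech, morphological features, and syntactic dependencies) are stored in UD treebanks in the CoNLL-U format, which has ten tab-separated columns per word. The following is a minimal sketch of reading one sentence, not an official UD tool. The sample sentence and its annotation are invented for illustration, and the names `Token`, `parse_feats`, and `parse_sentence` are hypothetical helpers.

```python
from typing import NamedTuple

# The ten CoNLL-U columns, in file order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


class Token(NamedTuple):
    """Hypothetical container for the columns this sketch keeps."""
    id: str
    form: str
    lemma: str
    upos: str
    feats: dict
    head: int
    deprel: str


def parse_feats(raw):
    """Turn 'Mood=Ind|Tense=Pres' into a dict; '_' means no features."""
    if raw == "_":
        return {}
    return dict(pair.split("=", 1) for pair in raw.split("|"))


def parse_sentence(block):
    """Parse one CoNLL-U sentence block into a list of Token objects."""
    tokens = []
    for line in block.splitlines():
        # Skip blank lines and sentence-level comments such as '# text = ...'.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        row = dict(zip(FIELDS, cols))
        # Simplification: skip multiword token ranges (e.g. '1-2')
        # and empty nodes (e.g. '1.1').
        if "-" in row["id"] or "." in row["id"]:
            continue
        tokens.append(Token(
            id=row["id"],
            form=row["form"],
            lemma=row["lemma"],
            upos=row["upos"],
            feats=parse_feats(row["feats"]),
            head=int(row["head"]),  # 0 marks the root of the sentence
            deprel=row["deprel"],
        ))
    return tokens


# Invented example sentence, not taken from any UD treebank.
SAMPLE = "\n".join([
    "# text = Dogs bark.",
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\t_\tMood=Ind|Tense=Pres\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
])

tokens = parse_sentence(SAMPLE)
for tok in tokens:
    print(tok.form, tok.upos, tok.head, tok.deprel)
```

For real work, the tools listed under "Tools for working with UD" are likely a better starting point. This sketch skips multiword tokens and empty nodes and does not validate the annotation.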

💡 Understanding UD	🔍 Using UD	🔨 Contributing to UD
Short introduction to UD (history)	Query UD treebanks online	How to contribute to UD
Annotation guidelines (changes) UPOS tags ▪ feats ▪ deprels ▪ CoNLL-U format	Download UD treebanks: all releases ☞ Release 2.18 (May 15, 2026)	UD mailing list
		Guidelines issue tracker
Tutorials and events	Tools for working with UD
🚀 Projects related to UD
SUD: Surface Syntactic Universal Dependencies ▪ Deep Universal Dependencies ▪ Universal PropBank ▪ CorefUD: Coreference in Universal Dependencies ▪ UNER: Universal Named Entity Recognition ▪ UMR: Uniform Meaning Representation ▪ UniMorph ▪ UDMorph ▪ UDer: Universal Derivations ▪ PARSEME: Multiword expressions ▪ UniDive COST Action ▪ UCxn: Universal Constructions ▪ UD on Hugging Face

📖 Overview Publications

Linguistic framework: Marie-Catherine de Marneffe, Christopher Manning, Joakim Nivre, and Daniel Zeman (2021). Universal Dependencies. Computational Linguistics 47(2): 255–308.
Treebank data: Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman (2020). Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection. Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pp. 4034–4043, Marseille, France.

Current UD Languages

Information about language families (and genera for families with multiple branches) is mostly taken from WALS Online (IE = Indo-European).

Abaza 1 <1K Northwest Caucasian

Abaza treebanks

ATB <1K ⓁⒻ

UD_Abaza-ATB is a treebank based on the [Spoken corpus of Abaza](http://lingconlab.ru/spoken_abaza/).

Contributors: Alexey Koshevoy, Anastasia Panova, Ilya Makarchuk
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Abkhaz 1 13K Northwest Caucasian

Abkhaz treebanks

AbNC 13K ⓁⒻ

UD_Abkhaz-AbNC is a treebank based on texts from the Abkhaz National Corpus, [AbNC](https://clarino.uib.no/abnc).

Contributors: Paul Meurer
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Afrikaans 1 49K IE, Germanic

Afrikaans treebanks

AfriBooms 49K ⓁⒻ

UD Afrikaans-AfriBooms is a conversion of the AfriBooms Dependency Treebank, originally annotated with a simplified PoS set and dependency relations according to a subset of the Stanford tag set. The corpus consists of public government documents.

Contributors: Peter Dirix, Liesbeth Augustinus, Daniel van Niekerk
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Akkadian 2 25K Afro-Asiatic, Semitic

Akkadian treebanks

RIAO 23K ⓁⒻ

162 royal inscriptions of four early Neo-Assyrian kings.

Contributors: Mikko Luukko, Aleksi Sahala, Sam Hardwick, Krister Lindén
Repository master dev
README
Treebank hub page
Download

PISANDUB 1K Ⓛ

A small set of sentences from Babylonian royal inscriptions.

Contributors: Kamil Kopacewicz
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Akkadian treebanks.

Language documentation

See the language documentation page.

Akuntsu 1 1K Tupian, Tupari

Akuntsu treebanks

TuDeT 1K ⓁⒻ

UD_Akuntsu-TuDeT is a collection of annotated sentences in [Akuntsú](https://glottolog.org/resource/languoid/id/akun1241). The sentences stem from the grammatical description by Aragon (2014) and Aragon's field work. Sentence annotation and documentation by Carolina Aragon, Fabrício Ferraz Gerardi, Luana dos Santos.

Contributors: Carolina Aragon, Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Albanian 2 4K IE, Albanian

Albanian treebanks

STAF 3K ⓁⒻ

The UD-Albanian-STAF (Saarbruecken Treebank of Albanian Fiction) is a treebank of the Albanian language, comprising 202 randomly selected sentences from six fictional books published between 1963 and 2004.

Contributors: Luigi Talamo, Edita Luftiu, Nelda Kote, Rozana Rushiti, Anila Çepani
Repository master dev
README
Treebank hub page
Download

TSA <1K ⓁⒻ

The UD Albanian Treebank is a small treebank for Standard Albanian, developed within a project framework at Uppsala University. The data was extracted from Wikipedia.

Contributors: Marsida Toska
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Albanian treebanks.

Language documentation

See the language documentation page.

Alemannic 2 21K IE, Germanic

Alemannic treebanks

UZH 1K

_UD\_Alemannic-UZH_ is a tiny manually annotated treebank of 100 sentences in different Swiss German dialects and a variety of text genres.

Contributors: Noëmi Aepli
Repository master dev
README
Treebank hub page
Download

DIVITAL 19K

UD_Alemannic-DIVITAL is a manually corrected treebank of Alemannic Alsatian consisting of sentences from several genres.

Contributors: Nathanaël Beiner, Barbara Hoff, Delphine Bernhard
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Alemannic treebanks.

Language documentation

See the language documentation page.

Amharic 1 10K Afro-Asiatic, Semitic

Amharic treebanks

ATT 10K ⓁⒻ

UD_Amharic-ATT is a manually developed treebank for Amharic. Sentences were collected from grammar books, fiction, biographies, religious texts, and news.

Contributors: Binyam Ephrem, Gashaw Arutie, Tsegay Woldemariam, Juan Ignacio Navarro Horñiacek
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Ancient Greek 3 456K IE, Greek

Ancient Greek treebanks

PTNK 39K ⓁⒻ Ⓟ

UD Ancient Greek PTNK contains portions of the Septuagint according to the Codex Alexandrinus.

Contributors: Daniel Swanson
Repository master dev
README
Treebank hub page
Download

PROIEL 214K ⓁⒻ

UD_Ancient_Greek-PROIEL is converted from the Ancient Greek data in the PROIEL treebank, and consists of the New Testament plus selections from Herodotus.

Contributors: Dag Haug
Repository master dev
README
Treebank hub page
Download

Perseus 202K ⓁⒻ

This Universal Dependencies Ancient Greek Treebank consists of an automatic conversion of a selection of passages from the Ancient Greek and Latin Dependency Treebank 2.1.

Contributors: Giuseppe G. A. Celano, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Ancient Greek treebanks.

Language documentation

See the language documentation page.

Ancient Hebrew 1 145K Afro-Asiatic, Semitic

Ancient Hebrew treebanks

PTNK 145K ⓁⒻ Ⓟ

UD Ancient Hebrew PTNK contains portions of the Biblia Hebraica Stuttgartensia with morphological annotations from [ETCBC](https://github.com/etcbc/bhsa) and syntactic annotations partially based on [MACULA](https://github.com/Clear-Bible/macula-hebrew/).

Contributors: Daniel Swanson
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Apurina 1 1K Arawakan, Purus

Apurina treebanks

UFPA 1K ⓁⒻ

This is an Apurinã treebank consisting of sentences from a grammatical description of the language by Marília Fernanda.

Contributors: Marília Fernanda, Sidney Facundes, Bruna Lima Padovani, Jack Rueter, Niko Partanen
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Arabic 3 1,042K Afro-Asiatic, Semitic

Arabic treebanks

PUD 20K ⓁⒻ Ⓟ

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Luma Ateyah, Martin Popel, Daniel Zeman, Nizar Habash, Dima Taji
Repository master dev
README
Treebank hub page
Download

PADT 282K ⓁⒻⒺ

The Arabic-PADT UD treebank is based on the [Prague Arabic Dependency Treebank](http://ufal.mff.cuni.cz/padt/) (PADT), created at the Charles University in Prague.

Contributors: Daniel Zeman, Zdeněk Žabokrtský, Shadi Saleh
Repository master dev
README
Treebank hub page
Download

NYUAD 738K ✘ⓁⒻ

The NYUAD Arabic UD treebank is based on the Penn Arabic Treebank (PATB), parts 1, 2, and 3, through conversion to CATiB dependency trees.

Contributors: Nizar Habash, Dima Taji
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Arabic treebanks.

Language documentation

See the language documentation page.

Armenian 2 150K IE, Armenian

Armenian treebanks

ArmTDP 104K ⓁⒻ

A Universal Dependencies treebank for Eastern Armenian developed for UD originally by the ArmTDP team led by Marat M. Yavrumyan at the Yerevan State University.

Contributors: Marat M. Yavrumyan
Repository master dev
README
Treebank hub page
Download

BSUT 46K ⓁⒻ

A Universal Dependencies treebank for Eastern Armenian developed for UD originally by the ArmTDP team led by Marat M. Yavrumyan at the V. Brusov State University in Yerevan.

Contributors: Marat M. Yavrumyan
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Armenian treebanks.

Language documentation

See the language documentation page.

Assamese 1 <1K IE, Indic

Assamese treebanks

AiW <1K ⓁⒻ

The Assamese-AiW treebank is a manually annotated corpus of Assamese. Assamese is an Indo-Aryan language written left to right in the Assamese script. Word order is Subject-Object-Verb (SOV) with relatively free constituent order.

Contributors: Kaushik Sengupta, Luigi Talamo, Annemarie Verkerk
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Assyrian 1 <1K Afro-Asiatic, Semitic

Assyrian treebanks

AS <1K ⓁⒻ

The Uppsala Assyrian Treebank is a small treebank for Modern Standard Assyrian. The corpus is collected and annotated manually. The data was randomly collected from different textbooks and a short translation of The Merchant of Venice.

Contributors: Mary Yako
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Azerbaijani 1 <1K Turkic, Southwestern

Azerbaijani treebanks

TueCL <1K ⓁⒻ Ⓟ

This is a small treebank of grammatical examples for Azerbaijani. The treebank tries to be neutral about the particular variety (North or South Azerbaijani) and hence uses the ISO code for the macrolanguage (`az`).

Contributors: Soudabeh Eslami, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Bambara 1 13K Mande

Bambara treebanks

CRB 13K ⓁⒻ

The UD Bambara treebank is a section of the Corpus Référence du Bambara annotated natively with Universal Dependencies.

Contributors: Katya Aplonova, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Basque 1 121K Basque

Basque treebanks

BDT 121K ⓁⒻ

The Basque UD treebank is based on an automatic conversion from part of the Basque Dependency Treebank (BDT), created at the University of the Basque Country by the IXA NLP research group. The treebank consists of 8,993 sentences (121,443 tokens) and covers mainly literary and journalistic texts.

Contributors: Maria Jesus Aranzabe, Aitziber Atutxa, Kepa Bengoetxea, Arantza Diaz de Ilarraza, Iakes Goenaga, Koldo Gojenola, Larraitz Uria
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Bavarian 1 15K IE, Germanic

Bavarian treebanks

MaiBaam 15K Ⓟ

MaiBaam is manually annotated with part-of-speech tags, syntactic dependencies, and German lemmas. The treebank encompasses diverse text genres (wiki articles and discussions, grammar examples, fiction, and commands for virtual assistants) and dialects from the North, Central and South Bavarian areas as well as the dialectal transition areas in between.

Contributors: Verena Blaschke, Barbara Kovačić, Siyao Peng, Miriam Winkler, Barbara Plank
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Beja 1 11K Afro-Asiatic, Cushitic

Beja treebanks

Autogramm 11K Ⓕ

A Universal Dependencies corpus for Beja, a North Cushitic language of the Afro-Asiatic phylum, mainly spoken in Sudan, Egypt and Eritrea.

Contributors: Martine Vanhove, Rayan Ziane, Sylvain Kahane, Bruno Guillaume
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Belarusian 1 305K IE, Slavic

Belarusian treebanks

HSE 305K ⓁⒻⒺ

The Belarusian UD treebank is based on a sample of the news texts included in the Belarusian-Russian parallel subcorpus of the Russian National Corpus, online search available at: http://ruscorpora.ru/search-para-be.html.

Contributors: Olga Lyashevskaya, Angelika Peljak-Łapińska, Daria Petrova, Yana Shishkina
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Bengali 1 <1K IE, Indic

Bengali treebanks

BRU <1K ⓁⒻ

The BRU Bengali treebank has been created at Begum Rokeya University, Rangpur, by the members of Semantics Lab.

Contributors: Siratun Jannat, Mizanur Rahoman, Shafi Sourov, Jannatul Ferdaousi, Syeda Shahzadi, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Bhojpuri 1 6K IE, Indic

Bhojpuri treebanks

BHTB 6K ⓁⒻ

The [Bhojpuri](https://en.wikipedia.org/wiki/Bhojpuri_language) UD Treebank (BHTB) is a part of the [Universal Dependency treebank](http://universaldependencies.org/) project.

Contributors: Atul Kr. Ojha, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Bokota 1 2K Chibchan, Guaymiic

Bokota treebanks

ChibErgIS 2K Ⓛ

A Universal Dependencies corpus for Bokota, a member of the Chibchan language family. The language is spoken by about 500 speakers in Panama. The variant represented in the Bokota treebank is spoken in the Comarca Ngobe-Bugle, at the border of the Veraguas Province, along the Caribbean coast. The other known variant of the language is called Buglere, and is spoken in the province of Chiriqui, in Panama.

Contributors: Marie Benzerrak, Natalia Cáceres Arandia
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Bororo 1 160K Bororoan

Bororo treebanks

BDT 160K ⓁⒻ

UD_Bororo-BDT is a compilation of annotated sentences in [Bororo](https://glottolog.org/resource/languoid/id/boro1282). The corpus encompasses sentences derived from diverse sources: grammar examples, mythological narratives, fieldwork material, and other sources. Sentence annotation and documentation by [Fabrício Ferraz Gerardi](https://languagestructure.github.io).

Contributors: Fabrício Ferraz Gerardi, Lucas Toribio, Dolores Sollberger
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Brahui 1 <1K Dravidian

Brahui treebanks

Kholum <1K ⓁⒻ

The Kholum treebank is a manually annotated corpus in Brahui.

Contributors: Muhammad Afzal, Luigi Talamo, Helena Vaz, Annemarie Verkerk
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Breton 1 10K IE, Celtic

Breton treebanks

KEB 10K ⓁⒻ

UD Breton-KEB is a treebank of Breton that has been manually annotated according to the Universal Dependencies guidelines. The tokenisation guidelines and morphological annotation come from a finite-state morphological analyser of Breton released as part of the [Apertium project](http://www.apertium.org).

Contributors: Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Bulgarian 1 156K IE, Slavic

Bulgarian treebanks

BTB 156K ⓁⒻⒺ

UD_Bulgarian-BTB is based on the HPSG-based BulTreeBank, created at the Institute of Information and Communication Technologies, Bulgarian Academy of Sciences. The original consists of 215,000 tokens (over 15,000 sentences). All the texts were processed automatically at the tokenization, morphological, and chunk levels. The full syntactic analyses were then performed manually by trained annotators.

Contributors: Kiril Simov, Petya Osenova, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Buryat 1 10K Mongolic

Buryat treebanks

BDT 10K ⓁⒻ Ⓟ

The UD Buryat treebank was annotated manually natively in UD and contains grammar book sentences, along with news and some fiction.

Contributors: Elena Badmaeva, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Cantonese 1 13K Sino-Tibetan, Chinese

Cantonese treebanks

HK 13K Ⓛ Ⓟ

A Cantonese treebank (in Traditional Chinese characters) of film subtitles and of legislative proceedings of Hong Kong, parallel with the Chinese-HK treebank.

Contributors: Kim Gerdes, John Lee, Herman Leung, Tak-sum Wong
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Cappadocian 2 4K IE, Greek

Cappadocian treebanks

AMGiC <1K ⓁⒻ

The "Asia Minor Greek in Contact" treebank (AMGiC, UD_AMGiC) is compiled from sentences entailing contact-induced morphosyntactic phenomena (CIMSP) that are a result of the contact between Greek and Turkish varieties in Anatolia and in adjacent regions. The sentences are traced in Asia Minor Greek (AMG) dialectal sources. In addition to the UD analysis, the AMGiC treebank provides information concerning the sociolinguistic context within which CIMSP arise.

Contributors: Konstantinos Sampanis, Prokopis Prokopidis, Furkan Akkurt, Helin Binici
Repository master dev
README
Treebank hub page
Download

TueCL 4K ⓁⒻ

This is a treebank of Pharasiot, a critically endangered Greek dialect originally spoken near Cappadocia. The source material is fairy tales collected during field study.

Contributors: Eleni Vligouridou, Inessa Iliadou, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Cappadocian treebanks.

Language documentation

See the language documentation page.

Catalan 1 547K IE, Romance

Catalan treebanks

AnCora 547K ⓁⒻⒺ

Catalan data from the [AnCora](http://clic.ub.edu/corpus/) corpus.

Contributors: Héctor Martínez Alonso, Elena Pascual, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Cebuano 1 1K Austronesian, Greater Central Philippine

Cebuano treebanks

GJA 1K ⓁⒻ Ⓟ

UD_Cebuano_GJA is a collection of annotated Cebuano sample sentences randomly taken from three different sources: community-contributed samples from the website Tatoeba, a Cebuano grammar book by Bunye & Yap (1971) and Tanangkinsing's reference grammar on Cebuano (2011). This project is currently a work in progress.

Contributors: Glyd Aranes
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Central Kurdish 1 <1K IE, Iranian

Central Kurdish treebanks

Mukri <1K ⓁⒻ

This treebank contains manually annotated data for Mukri Kurdish, an Indo-European language belonging to the Kurdish language family, following the Universal Dependencies (UD) guidelines. It aims to offer a syntactically and morphologically consistent dataset that helps with Kurdish language processing and cross-linguistic studies. The current release includes texts in Kurdish Roman Alphabet script and provides dependency annotation at the word, phrase, and sentence levels.

Contributors: Hiwa Asadpour, Luigi Talamo, Annemarie Verkerk
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Chinese 7 309K Sino-Tibetan, Chinese

Chinese treebanks

GSDSimp 123K ⓁⒻ Ⓟ

Simplified Chinese Universal Dependencies dataset converted from the GSD (traditional) dataset with manual corrections.

Contributors: Peng Qi, Koichi Yasuoka
Repository master dev
README
Treebank hub page
Download

GSD 123K ⓁⒻ Ⓟ

Traditional Chinese Universal Dependencies Treebank annotated and converted by Google.

Contributors: Mo Shen, Ryan McDonald, Daniel Zeman, Peng Qi
Repository master dev
README
Treebank hub page
Download

Beginner 19K ⓁⒻ

A treebank of Chinese sentences adapted for learners at levels A1 to C1 (HSK 1 to 5), collected on the [Chinese Grammar Wiki](https://resources.allsetlearning.com/chinese/grammar/) (CC BY-NC-SA 3.0 License) website. The treebank was manually annotated by researchers of Paris Nanterre University (Modyco) in the mSUD annotation schema (morpheme level Surface Universal Dependencies).

Contributors: Kirian Guiller, Yidi Huang, Yixuan Li, Qishen Wu, Bruno Guillaume, Sylvain Kahane, Kim Gerdes
Repository master dev
README
Treebank hub page
Download

PUD 21K ⓁⒻ Ⓟ

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Josie Li, Cheuk Ying Li, Martin Popel, Daniel Zeman, Herman Leung
Repository master dev
README
Treebank hub page
Download

HK 9K Ⓛ Ⓟ

A Traditional Chinese treebank of film subtitles and of legislative proceedings of Hong Kong, parallel with the Cantonese-HK treebank.

Contributors: Kim Gerdes, John Lee, Herman Leung, Tak-sum Wong
Repository master dev
README
Treebank hub page
Download

CFL 7K Ⓛ

The Chinese-CFL UD treebank is manually annotated by Keying Li with minor manual revisions by Herman Leung and John Lee at City University of Hong Kong, based on essays written by learners of Mandarin Chinese as a foreign language. The data is in Simplified Chinese.

Contributors: John Lee, Herman Leung, Keying Li
Repository master dev
README
Treebank hub page
Download

PatentChar 4K

A treebank of Chinese patent application texts collected from the Chinese patent office's website CNIPA. The sentences are randomly selected from the patent claims of the IPC section "G" from November 2017 to September 2018.

Contributors: Yixuan Li, Kim Gerdes, Bruno Guillaume
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Chinese treebanks.

Language documentation

See the language documentation page.

Chintang 1 14K Sino-Tibetan, Himalayish

Chintang treebanks

CTNTB 14K ⓁⒻⒺ

UD\_Chintang-CTNTB is a Universal Dependencies (UD) treebank for the Chintang language. The annotation was converted from glosses from "A Grammar of Chintang A Tibeto-Burman Language of Nepal" by Robert Schikowski.

Contributors: Kira Tulchynska, Robert Schikowski, Alena Witzlack-Makarevich
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Chukchi 1 6K Chukotko-Kamchatkan

Chukchi treebanks

HSE 6K Ⓔ

This data is a manual annotation of the multimedia annotated corpus of the [Chuklang](http://chuklang.ru/) project, a dialectal corpus of the Amguema variant of Chukchi.

Contributors: Francis Tyers, Karina Mischenkova
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Classical Armenian 1 99K IE, Armenian

Classical Armenian treebanks

CAVaL 99K ⓁⒻ

The present release includes the Classical Armenian translation of the Gospels and the first book of the "History of the Armenians" by Movses Khorenatsi. The annotation of the Gospels results from a rule-based conversion from the PROIEL annotation, manually corrected and extended with additional information. The annotation of the "History of the Armenians" has been performed by a UDPipe2 annotator and manually corrected.

Contributors: Petr Kocharov, Lilit Kharatyan
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Classical Chinese 2 433K Sino-Tibetan, Chinese

Classical Chinese treebanks

Kyoto 433K ⓁⒻ ?

Classical Chinese Universal Dependencies Treebank annotated and converted by Institute for Research in Humanities, Kyoto University.

Contributors: Koichi Yasuoka, Christian Wittern, Tomohiko Morioka, Takumi Ikeda, Naoki Yamazaki, Yoshihiro Nikaido, Shingo Suzuki, Shigeki Moro, Yuan Li, Hiroyuki Shirasu, Kazunori Fujita
Repository master dev
README
Treebank hub page
Download

TueCL <1K ⓁⒻ

A dependency treebank of "逍遥游 (Enjoyment in Untroubled Ease)" written by Zhuangzi.

Contributors: Yifei Chen, John Wang, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Classical Chinese treebanks.

Language documentation

See the language documentation page.

Coptic 2 91K Afro-Asiatic, Egyptian

Coptic treebanks

Scriptorium 58K ⓁⒻ

UD Coptic contains manually annotated Sahidic Coptic texts, including Biblical texts, sermons, letters, and hagiography.

Contributors: Mitchell Abrams, Elizabeth Davidson, Amir Zeldes
Repository master dev
README
Treebank hub page
Download

Bohairic 32K ⓁⒻ

UD_Coptic-Bohairic contains manually annotated Bohairic Coptic texts, including Biblical narrative and poetic texts, epistles, and hagiography.

Contributors: Amir Zeldes, Nina Speransky
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Coptic treebanks.

Language documentation

See the language documentation page.

Croatian 1 199K IE, Slavic

Croatian treebanks

SET 199K ⓁⒻ Ⓟ

The Croatian UD treebank is based on the extension of the SETimes-HR corpus, the [hr500k](http://hdl.handle.net/11356/1183) corpus.

Contributors: Tanja Samardžić, Aleksandra Miletić, Nikola Ljubešić, Željko Agić, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Czech 6 4,162K IE, Slavic

Czech treebanks

PDTC 3,440K ⓁⒻⒺ

The Czech-PDTC UD treebank is based on the Prague Dependency Treebank – Consolidated (PDT-C) 2.0, created at the Charles University in Prague.

Contributors: Daniel Zeman, Jan Hajič, Alevtina Bémová, Eva Buráňová, Eva Hajičová, Jiří Havelka, Jaroslava Hlaváčová, Jiří Kárník, Veronika Kolářová, Lucie Kučová, Markéta Lopatková, Marie Mikulová, Jiří Mírovský, Anna Nedoluzhko, Michal Novák, Petr Pajas, Jarmila Panevová, Petr Sgall, Milan Straka, Magda Ševčíková, Jan Štěpánek, Barbora Štěpánková, Zdeňka Urešová, Barbora Vidová Hladká, Zdeněk Žabokrtský
Repository master dev
README
Treebank hub page
Download

CAC 494K ⓁⒻⒺ

The UD_Czech-CAC treebank is based on the Czech Academic Corpus 2.0 (CAC; Český akademický korpus; ČAK), created at Charles University in Prague.

Contributors: Barbora Hladká, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

FicTree 167K ⓁⒻⒺ

FicTree is a treebank of Czech fiction, automatically converted into the UD format. The treebank was built at Charles University in Prague.

Contributors: Tomáš Jelínek, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

CLTT 36K ⓁⒻⒺ

The UD_Czech-CLTT treebank is based on the Czech Legal Text Treebank 2.0, created at the Charles University in Prague.

Contributors: Barbora Hladká, Daniel Zeman, Martin Popel
Repository master dev
README
Treebank hub page
Download

PUD 18K ⓁⒻⒺ Ⓟ

Contributors: Václava Kettnerová, Jan Hajič jr., Silvie Cinková, Zdeňka Urešová, Milan Straka, Jan Hajič, Jaroslava Hlaváčová, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Poetry 6K ⓁⒻ

UD_Czech-Poetry contains random samples of Czech 19th-century poetry from the Corpus of Czech Verse parsed with UDPipe2 (trained on UD Czech-PDT 2.11) and manually corrected.

Contributors: Silvie Cinková
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Czech treebanks.

Language documentation

See the language documentation page.

Danish 1 100K IE, Germanic

Danish treebanks

DDT 100K ⓁⒻ

The Danish UD treebank is a conversion of the Danish Dependency Treebank.

Contributors: Anders Johannsen, Héctor Martínez Alonso, Barbara Plank
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Dutch 2 505K IE, Germanic

Dutch treebanks

LassySmall 297K ⓁⒻⒺ

This corpus contains sentences from the Wikipedia section of the Lassy Small Treebank. Universal Dependency annotation was generated automatically from the original annotation in Lassy.

Contributors: Gosse Bouma, Gertjan van Noord
Repository master dev
README
Treebank hub page
Download

Alpino 208K ⓁⒻⒺ

This corpus consists of samples from various treebanks annotated at the University of Groningen using the Alpino annotation tools and guidelines.

Contributors: Daniel Zeman, Zdeněk Žabokrtský, Gosse Bouma, Gertjan van Noord
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Dutch treebanks.

Language documentation

See the language documentation page.

Egyptian 1 34K Afro-Asiatic, Egyptian

Egyptian treebanks

PC 34K ⓁⒻ

Egyptian-PC is the first dependency treebank created for the morphosyntactic annotation of pre-Coptic Egyptian. It is developed at the University of Jaén. Its current state (UD v2.18) consists of 3,089 sentences and 34,234 tokens manually annotated from the Pyramid Texts.

Contributors: Roberto Antonio Díaz Hernández, Bruno Guillaume, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

English 13 1,126K IE, Germanic

English treebanks

GUM 256K ⓁⒻⒺ

Universal Dependencies syntax annotations from the GUM corpus (https://gucorpling.org/gum/)

Contributors: Siyao Peng, Amir Zeldes
Repository master dev
README
Treebank hub page
Download

EWT 254K ⓁⒻⒺ

A Gold Standard Universal Dependencies Corpus for English, built over the source material of the English Web Treebank LDC2012T13 (https://catalog.ldc.upenn.edu/LDC2012T13).

Contributors: Natalia Silveira, Timothy Dozat, Christopher Manning, Sebastian Schuster, Ethan Chi, John Bauer, Miriam Connor, Marie-Catherine de Marneffe, Nathan Schneider, Sam Bowman, Hanzhi Zhu, Daniel Galbraith
Repository master dev
README
Treebank hub page
Download

LinES 106K ⓁⒻ Ⓟ

UD English_LinES is the English half of the LinES Parallel Treebank with the original dependency annotation first automatically converted into Universal Dependencies and then partially reviewed. Its contents cover literature, an online manual and Europarl data.

Contributors: Lars Ahrenberg
Repository master dev
README
Treebank hub page
Download

ParTUT 49K ⓁⒻ

UD_English-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

Contributors: Cristina Bosco, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

Atis 61K ⓁⒻ Ⓟ

UD Atis Treebank is a manually annotated treebank consisting of the sentences in the ATIS (Airline Travel Information Systems) dataset, which contains transcriptions of human speech from people asking automated inquiry systems for flight information.

Contributors: Aslı Kuzgun, Neslihan Cesur, Olcay Taner Yıldız
Repository master dev
README
Treebank hub page
Download

GENTLE 17K ⓁⒻⒺ

Repository for the Genre Tests for Linguistic Evaluation (GENTLE) Corpus

Contributors: Tatsuya Aoyama, Shabnam Behzad, Luke Gessler, Lauren Levine, Yi-Ju Jessica Lin, Yang Janet Liu, Siyao Logan Peng, Yilun Zhu, Amir Zeldes
Repository master dev
README
Treebank hub page
Download

CHILDES 302K ⓁⒺ

This repository contains Universal Dependencies (UD) trees for utterances from child–adult spoken interactions in English, drawn from [CHILDES](https://childes.talkbank.org/) transcripts.

Contributors: Xiulin Yang, Zhuoxuan Ju, Lanni Bu, Zoey Liu, Nathan Schneider
Repository master dev
README
Treebank hub page
Download

LittlePrince 6K ⓁⒻ

This treebank contains manually corrected Universal Dependency annotations for 500 sentences from the English translation of *The Little Prince*.

Contributors: Lori Levin, Annie Zhang, Thomas Palakapilly, Jack Sun, Larry Zhang
Repository master dev
README
Treebank hub page
Download

PUD 21K ⓁⒻⒺ Ⓟ

This is the English portion of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies (http://universaldependencies.org/conll17/).

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Jesse Kirchner, Lorenzo Lambertino, Martin Popel, Daniel Zeman, Christopher Manning, Sebastian Schuster, Siva Reddy
Repository master dev
README
Treebank hub page
Download

CTeTex 9K Ⓛ

UD_English-CTeTex is a technical text corpus annotated in Universal Dependency syntax containing 196 software requirements.

Contributors: Naïma Hassert, Pierre André Ménard, Edith Galy
Repository master dev
README
Treebank hub page
Download

Pronouns 1K ⓁⒻ

UD English-Pronouns is a dataset created to make pronoun identification more accurate and with a more balanced distribution across genders. The dataset initially targets the independent genitive pronouns "hers", (independent) "his", (singular) "theirs", "mine", and (singular) "yours".

Contributors: Robert Munro
Repository master dev
README
Treebank hub page
Download

GUMReddit 16K ✘ⓁⒻⒺ

Universal Dependencies syntax annotations from the Reddit portion of the GUM corpus (https://gucorpling.org/gum/)

Contributors: Siyao Peng, Amir Zeldes
Repository master dev
README
Treebank hub page
Download

ESLSpok 21K

This repository includes the Dependency Treebank of Spoken L2 English (SL2E), which consists of Universal Dependency annotations for a random sample of sentences from the NICT JLE (https://alaginrc.nict.go.jp/nict_jle/index_E.html), a corpus of spoken second language English. The homepage of the project is https://github.com/LCR-ADS-Lab/SL2E-Dependency-Treebank.

Contributors: Kris Kyle, Masaki Eguchi, Aaron Miller, Ted Sither
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of English treebanks.

Language documentation

See the language documentation page.

Erzya 1 20K Uralic, Mordvin

Erzya treebanks

JR 20K ⓁⒻ

UD Erzya is the original annotation (CoNLL-U) for texts in the Erzya language. It originally consisted of a sample from a number of fiction authors writing originals in Erzya.

Contributors: Jack Rueter, Francis Tyers, Elena Klementieva, Olga Erina, Ivan Riabov
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Esperanto 2 3K Constructed

Esperanto treebanks

Prago 3K ⓁⒻ

UD Esperanto-Prago is the Universal Dependencies syntax annotation of Manifesto de Prago (Prague Manifesto) and Deklaracio pri Homaranismo.

Contributors: Masanori Oya
Repository master dev
README
Treebank hub page
Download

Cairo <1K ⓁⒻ Ⓟ

This is an example treebank made to illustrate UD annotation choices made for Esperanto based on the Cairo sample sentences.

Contributors: Masanori Oya
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Esperanto treebanks.

Language documentation

See the language documentation page.

Estonian 2 528K Uralic, Finnic

Estonian treebanks

EDT 437K ⓁⒻⒺ

UD Estonian is a converted version of the Estonian Dependency Treebank (EDT), originally annotated in the Constraint Grammar (CG) annotation scheme, and consisting of genres of fiction, newspaper texts and scientific texts. The treebank contains 30,972 trees, 437,769 tokens.

Contributors: Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen, Andriela Rääbis, Liisi Torga
Repository master dev
README
Treebank hub page
Download

EWT 90K ⓁⒻⒺ

UD EWT treebank consists of different genres of new media. The treebank contains 7,190 trees, 90,585 tokens.

Contributors: Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen, Dage Särg, Sandra Eiche, Andriela Rääbis
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Estonian treebanks.

Language documentation

See the language documentation page.

Faroese 2 50K IE, Germanic

Faroese treebanks

OFT 10K ⓁⒻ

This is a treebank of Faroese based on the Faroese Wikipedia.

Contributors: Daniel Zeman, Bjartur Mortensen, Francis Tyers
Repository master dev
README
Treebank hub page
Download

FarPaHC 40K ⓁⒻ

UD_Faroese-FarPaHC is a conversion of the [Faroese Parsed Historical Corpus (FarPaHC)](https://github.com/einarfs/farpahc) to the Universal Dependencies scheme. The conversion was done using [UDConverter](https://github.com/thorunna/UDConverter).

Contributors: Þórunn Arnardóttir, Hinrik Hafsteinsson, Einar Freyr Sigurðsson, Anton Karl Ingason, Eiríkur Rögnvaldsson, Joel C. Wallenberg
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Faroese treebanks.

Language documentation

See the language documentation page.

Finnish 4 397K Uralic, Finnic

Finnish treebanks

TDT 202K ⓁⒻⒺ

UD_Finnish-TDT is based on the Turku Dependency Treebank (TDT), a broad-coverage dependency treebank of general Finnish covering numerous genres. The conversion to UD was followed by extensive manual checks and corrections, and the treebank closely adheres to the UD guidelines.

Contributors: Filip Ginter, Jenna Kanerva, Veronika Laippala, Niko Miekka, Anna Missilä, Stina Ojala, Sampo Pyysalo
Repository master dev
README
Treebank hub page
Download

FTB 159K ⓁⒻ

FinnTreeBank 1 consists of manually annotated grammatical examples from VISK. The UD version of FinnTreeBank 1 was converted from a native annotation model with a script and later manually revised.

Contributors: Jussi Piitulainen, Hanna Nurmi, Jack Rueter
Repository master dev
README
Treebank hub page
Download

OOD 19K ⓁⒻ

Finnish-OOD is an external out-of-domain test set for Finnish-TDT, annotated natively in the UD scheme.

Contributors: Jenna Kanerva
Repository master dev
README
Treebank hub page
Download

PUD 15K ⓁⒻⒺ Ⓟ

Contributors: Jenna Kanerva, Filip Ginter, Stina Ojala, Anna Missilä
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Finnish treebanks.

Language documentation

See the language documentation page.

French 9 708K IE, Romance

French treebanks

GSD 400K ⓁⒻ

The **UD_French-GSD** was converted in 2015 from the content head version of the universal dependency treebank v2.0 (https://github.com/ryanmcd/uni-dep-tb). It has been updated since 2015 independently of the previous source.

Contributors: Marie-Catherine de Marneffe, Bruno Guillaume, Ryan McDonald, Alane Suhr, Joakim Nivre, Matias Grioni, Carly Dickerson, Guy Perrier
Repository master dev
README
Treebank hub page
Download

ALTS 68K ⓁⒻ

ALTS (AUTOMATED Sixteenth-century corpus) is a treebank of sixteenth-century legal French from Normandy and the Channel Islands.

Contributors: Natalia Romanova, Rayan Ziane, Khensa Daoudi, Théo Brillet
Repository master dev
README
Treebank hub page
Download

Sequoia 70K ⓁⒻ

**UD_French-Sequoia** is an automatic conversion of the [SUD_French-Sequoia](https://github.com/surfacesyntacticud/SUD_French-Sequoia) treebank, which in turn derives from the earlier [French Sequoia corpus](http://deep-sequoia.inria.fr).

Contributors: Marie Candito, Djamé Seddah, Guy Perrier, Bruno Guillaume
Repository master dev
README
Treebank hub page
Download

ParisStories 42K ⓁⒻ

Paris Stories is a corpus of oral French collected and transcribed by Linguistics students from Sorbonne Nouvelle and corrected by students from the Plurital Master's Degree of Computational Linguistics (Inalco, Paris Nanterre, Sorbonne Nouvelle) between 2017 and 2021. It contains monologues and dialogues from speakers living in the Parisian region.

Contributors: Kim Gerdes, Sylvain Kahane, Menel Mahamdi
Repository master dev
README
Treebank hub page
Download

Rhapsodie 44K ⓁⒻ

A Universal Dependencies corpus for spoken French.

Contributors: Kim Gerdes, Sylvain Kahane, Mariam Nakhlé, Chunxiao Yan, Aline Etienne, Marine Courtin
Repository master dev
README
Treebank hub page
Download

ParTUT 28K ⓁⒻ

UD_French-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

Contributors: Cristina Bosco, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

PUD 24K ⓁⒻ Ⓟ

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Jana Strnadová, Gauthier Caron, Martin Popel, Daniel Zeman, Marie-Catherine de Marneffe, Bruno Guillaume
Repository master dev
README
Treebank hub page
Download

FQB 23K ⓁⒻ

The corpus **UD_French-FQB** is an automatic conversion of the [French QuestionBank v1](http://alpage.inria.fr/Treebanks/FQB/), a corpus entirely made of questions.

Contributors: Djamé Seddah, Marie Candito, Bruno Guillaume
Repository master dev
README
Treebank hub page
Download

PoitevinDIVITAL 5K Ⓛ

UD_French-PoitevinDIVITAL is a manually corrected treebank of Poitevin-Saintongeais consisting of sentences from several genres.

Contributors: Marianne Vergez-Couret
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of French treebanks.

Language documentation

See the language documentation page.

Frisian Dutch 1 3K Code switching

Frisian Dutch treebanks

Fame 3K

UD_Frisian_Dutch-Fame is a selection of 400 sentences from the FAME! speech corpus by Yilmaz et al. (2016a, 2016b). The treebank is manually annotated using the UD scheme.

Contributors: Anouck Braggaar, Rob van der Goot
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Galician 3 188K IE, Romance

Galician treebanks

TreeGal 25K ⓁⒻ

The Galician-TreeGal is a treebank for Galician developed at LyS Group (Universidade da Coruña) and at CiTIUS (Universidade de Santiago de Compostela).

Contributors: Marcos Garcia, Xulia Sánchez-Rodríguez, Albina Sarymsakova
Repository master dev
README
Treebank hub page
Download

CTG 139K ⓁⒻ

The Galician UD treebank is based on the automatic parsing of the Galician Technical Corpus (http://sli.uvigo.gal/CTG) created at the University of Vigo by the TALG NLP research group.

Contributors: Xavier Gómez Guinovart
Repository master dev
README
Treebank hub page
Download

PUD 23K ⓁⒻ Ⓟ

The Galician PUD is a treebank for Galician developed at CiTIUS (Universidade de Santiago de Compostela). It follows the annotation guidelines of [Galician-TreeGal](https://github.com/UniversalDependencies/UD_Galician-TreeGal).

Contributors: Albina Sarymsakova, Xulia Sánchez-Rodríguez, Marcos Garcia
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Galician treebanks.

Language documentation

See the language documentation page.

Georgian 2 83K Kartvelian

Georgian treebanks

GNC 23K ⓁⒻ

UD_Georgian-GNC is a treebank based on texts from the Georgian National Corpus, [GNC](https://clarino.uib.no/gnc).

Contributors: Paul Meurer
Repository master dev
README
Treebank hub page
Download

GLC 60K ⓁⒻ

The Georgian UD Treebank (UD_Georgian-GLC) is the first syntactically annotated corpus of Georgian, based on a collection of annotated sentences selected from the Georgian Language Corpus (GLC) available at http://corpora.iliauni.edu.ge/ and sentences selected from Wiki in accordance with the 132 scientific fields.

Contributors: Irina Lobzhanidze
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Georgian treebanks.

Language documentation

See the language documentation page.

German 4 3,810K IE, Germanic

German treebanks

HDT 3,455K ⓁⒻ

UD German-HDT is a conversion of the Hamburg Dependency Treebank, created at the University of Hamburg through manual annotation in conjunction with a standard for morphologically and syntactically annotating sentences as well as a constraint-based parser.

Contributors: Emanuel Borges Völker, Felix Hennig, Arne Köhn, Maximilan Wendt, Verena Blaschke, Nina Böbel, Leonie Weissweiler
Repository master dev
README
Treebank hub page
Download

GSD 292K ⓁⒻ

The German UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

Contributors: Slav Petrov, Wolfgang Seeker, Ryan McDonald, Joakim Nivre, Daniel Zeman, Adriane Boyd, Verena Blaschke
Repository master dev
README
Treebank hub page
Download

PUD 21K ⓁⒻ Ⓟ

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Georg Rehm, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Sebastian Bank, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

LIT 40K ⓁⒻ

This treebank aims to gather texts from German literary history. Currently, it hosts Fragments of early Romanticism, i.e. aphorism-like texts mainly dealing with philosophical issues concerning art, beauty and related topics.

Contributors: Alessio Salomoni
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of German treebanks.

Language documentation

See the language documentation page.

Gheg 1 15K IE, Albanian

Gheg treebanks

GPS 15K ⓁⒻ

UD Gheg Pear Stories (GPS) contains renarrations of Wallace Chafe's Pear Stories video (pearstories.org) by heritage speakers of Gheg Albanian living in Switzerland and speakers from Prishtina.

Contributors: Christian Ebert, Artan Islamaj, Adrian Kuqi, Barbara Sonnenhauser, Paul Widmer, Magdalena Plamada
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Gorontalo 1 <1K Austronesian, Greater Central Philippine

Gorontalo treebanks

BungoLoLombi <1K ⓁⒻ

Bungo lo Lombi is a Universal Dependencies parsed corpus of modern spoken Gorontalo as spoken in Gorontalo City, Gorontalo Province, Indonesia. It comprises fieldwork samples obtained by Colleen Alena O'Brien.

Contributors: Andrew Thomas Dyer, Colleen Alena O'Brien
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Gothic 1 55K IE, Germanic

Gothic treebanks

PROIEL 55K ⓁⒻ

The UD Gothic treebank is based on the Gothic data from the PROIEL treebank, and consists of Wulfila's Bible translation.

Contributors: Dag Haug
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Greek 6 110K IE, Greek

Greek treebanks

GDT 63K ⓁⒻ

The Greek UD treebank (UD_Greek-GDT) is derived from the Greek Dependency Treebank (https://gdt.ilsp.gr), a resource developed and maintained by researchers at the Institute for Language and Speech Processing/Athena R.C. (https://www.ilsp.gr).

Contributors: Prokopis Prokopidis
Repository master dev
README
Treebank hub page
Download

GUD 25K ⓁⒻ

GUD is a resource for EL manually annotated for morphology and syntax. It is an ongoing project led by Stella Markantonatou and Vivian Stamou (hereinafter: the GUD team), both researchers at the [Institute for Language and Speech Processing](http://www.ilsp.gr/) (ILSP/Athena Research Centre).

Contributors: Stella Markantonatou, Vivian Stamou, Socrates Vak
Repository master dev
README
Treebank hub page
Download

Lesbian 6K ⓁⒻ

A Universal Dependencies (UD) treebank for the dialect of Lesbos, a low-resource living Northern variety of Modern Greek. The treebank currently contains 625 sentences with manual annotations following the Universal Dependencies framework, representing the first UD treebank for a Northern Modern Greek dialect.

Contributors: Stavros Bompolas, Stella Markantonatou, Antonios Anastasopoulos, Vivian Stamou
Repository master dev
README
Treebank hub page
Download

GLCII 9K ⓁⒻ

A treebank based on version 2 of the Greek Learner Corpus (GLCII), consisting of written data produced by learners of Modern Greek.

Contributors: Christina Klironomou, Thelka Pasparaki, Arianna Masciolini
Repository master dev
README
Treebank hub page
Download

Messinian <1K ⓁⒻ

Messenian belongs to the Southern group of Modern Greek dialects, to which the main variety (the Standard) also belongs.

Contributors: Stella Markantonatou, Katerina Mouzou, Vivian Stamou
Repository master dev
README
Treebank hub page
Download

Cretan 4K ⓁⒻ

The text of the treebank was transcribed with Whisper (trained on Cretan) from 9 tapes containing folklore narratives by one speaker, Ioannis Anagnostakis, who is responsible for their composition. The narratives are radio broadcasts in digital format, used with permission from the Audiovisual Department of the Vikelaia Municipal Library of Heraklion, Crete (1998-2001). The data were split into training (70%), dev (10%) and test (20%) sets.

Contributors: Socrates Vakirtzian, Stella Markantonatou, Vivian Stamou
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Greek treebanks.

Language documentation

See the language documentation page.

Guajajara 1 9K Tupian, Maweti-Guarani

Guajajara treebanks

TuDeT 9K ⓁⒻ

UD_Guajajara-TuDeT is a collection of annotated sentences in Guajajara (https://glottolog.org/resource/languoid/id/guaj1255). Sentences stem from multiple sources such as descriptions of the language, short stories, dictionaries and translations from the New Testament. Sentence annotation and documentation by Lorena Martín Rodríguez and Fabrício Ferraz Gerardi.

Contributors: Lorena Martín Rodríguez, Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Guarani 1 <1K Tupian, Maweti-Guarani

Guarani treebanks

OldTuDeT <1K ⓁⒻ

UD_Guarani-OldTuDeT is a collection of annotated texts in Old Guaraní (https://glottolog.org/resource/languoid/id/oldp1258). All known sources in this language are being annotated: catechisms, grammars (seventeenth and eighteenth century), sentences from dictionaries, and other texts. Sentence annotation and documentation by Fabrício Ferraz Gerardi and Lorena Martín Rodríguez.

Contributors: Fabrício Ferraz Gerardi, Lorena Martín Rodríguez
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Gujarati 1 1K IE, Indic

Gujarati treebanks

GujTB 1K Ⓛ Ⓟ

GujTB is an in-progress treebank of Gujarati (an Indo-Aryan language) in Gujarati script.

Contributors: Maitrey Mehta, Mayank Jobanputra
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Gwichin 1 1K Na-Dene

Gwichin treebanks

TueCL 1K Ⓛ

UD_Gwichin-TueCL is a small treebank of Alaskan Gwich'in, an endangered Athabascan language, based on material located in the Alaska Native Language Archive.

Contributors: Matthew Andrews, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Haitian Creole 2 75K Creole

Haitian Creole treebanks

Autogramm 3K ⓁⒻ

This is a treebank of Haitian Creole. It contains 144 sentences selected from 3 major genres: Bible, literary texts, newspapers. Kreyòl (Kreyòl Ayisyen, Haitian Creole, iso-639-1: ht) is the main language of Haïti. The dialect described here is the Cap Haïtien dialect, whose lexicon differs slightly from that of the Center and South varieties.

Contributors: Claudel Pierre-Louis, Sandra Jagodzińska, Sylvain Kahane, Agata Savary, Emmanuel Schang
Repository master dev
README
Treebank hub page
Download

Adolphe 71K ⓁⒻ

This is a treebank for Haitian Creole. It contains 3314 sentences and 300,000+ words selected from one Bible-related source and was annotated programmatically. Kreyòl (Kreyòl Ayisyen, Haitian Creole, iso-639-1: ht) is the main language of Haïti.

Contributors: Jephtey Adolphe
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Haitian Creole treebanks.

Language documentation

See the language documentation page.

Hausa 4 53K Afro-Asiatic, West Chadic

Hausa treebanks

EasternAutogramm 9K ⓁⒻ

This treebank contains data of the Autogramm project, for the (Kano) Eastern dialect of Hausa, Nigeria.

Contributors: Bernard Caron
Repository master dev
README
Treebank hub page
Download

WesternAutogramm 13K ⓁⒻ

This treebank contains data of Southern Autogramm, for the (Tibiri) Gobir dialect of Niger Republic (Western Hausa).

Contributors: Bernard Caron
Repository master dev
README
Treebank hub page
Download

NorthernAutogramm 15K ⓁⒻ

This treebank contains data of Northern Autogramm, for the Ader dialect of Niger Republic (Northern Hausa).

Contributors: Bernard Caron
Repository master dev
README
Treebank hub page
Download

SouthernAutogramm 14K ⓁⒻ

This treebank contains data of Southern Autogramm, for the Zaria dialect of Nigeria (Southern Hausa).

Contributors: Bernard Caron
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Hausa treebanks.

Language documentation

See the language documentation page.

Hebrew 4 376K Afro-Asiatic, Semitic

Hebrew treebanks

IAHLTwiki 140K ⓁⒻ

Publicly available subset of the IAHLT UD Hebrew Treebank's Wikipedia section (https://www.iahlt.org/)

Contributors: Amir Zeldes, Avner Algom, Noam Ordan, Yifat Ben Moshe, Shira Wigderson
Repository master dev
README
Treebank hub page
Download

IAHLTknesset 67K ⓁⒻ

Publicly available IAHLT UD Hebrew Treebank's Knesset section (https://www.iahlt.org/)

Contributors: Amir Zeldes, Avner Algom, Noam Ordan, Yifat Ben Moshe, Nick Howell, Shira Wigderson, Omer Strass, Israel Landau, Netanel Dahan, Yael Minerbi, Hilla Merhav, Emmanuelle Kowner, Shuly Wintner, Gili Goldin, Ella Rabinovich, Vladimir Gurevich
Repository master dev
README
Treebank hub page
Download

HTB 160K ⓁⒻ

A Universal Dependencies Corpus for Hebrew.

Contributors: Yoav Goldberg, Reut Tsarfaty, Amir More, Shoval Sadde, Victoria Basmov, Yuval Pinter
Repository master dev
README
Treebank hub page
Download

PostRab 8K ⓁⒻ

A Universal Dependencies treebank of post-Rabbinic historical Hebrew, comprising ~300 sentences (~8000 tokens) annotated for morphology and syntax from diverse pre-modern sources.

Contributors: Rachel Tal, Elisheva Brauner, Shlomit Fuchs, Orly Albek, Avi Shmidman, Yitzchak Lindenbaum, Ephraim Meiri
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Hebrew treebanks.

Language documentation

See the language documentation page.

Highland P. Nahuatl 1 10K Uto-Aztecan

Highland Puebla Nahuatl treebanks

ITML 10K ⓁⒻⒺ

UD_Highland_Puebla_Nahuatl-ITML is a collection of texts in the Highland Puebla variety of Nahuatl (ISO-639: `azz`) spoken in 24 municipalities in the state of Puebla in Mexico. The treebank contains spoken monologue and dialogue, scientific texts translated from Spanish and some miscellaneous grammatical examples from a language course.

Contributors: Robert Pugh, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Hindi 2 375K IE, Indic

Hindi treebanks

HDTB 351K ⓁⒻ

The Hindi UD treebank is based on the Hindi Dependency Treebank (HDTB), created at IIIT Hyderabad, India.

Contributors: Riyaz Ahmad Bhat, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

PUD 23K ⓁⒻ Ⓟ

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Esha Banerjee, Pinkey Nainwani, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Hindi treebanks.

Language documentation

See the language documentation page.

Hittite 1 1K IE, Anatolian

Hittite treebanks

HitTB 1K ⓁⒻ

UD_Hittite-HitTB is a small Universal Dependencies treebank for Hittite, containing original sentences from Hoffner and Melchert's tutorial to A Grammar of the Hittite Language.

Contributors: Erik Andersen, Ben Rozonoyer
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Hungarian 1 42K Uralic, Ugric

Hungarian treebanks

Szeged 42K ⓁⒻ

The Hungarian UD treebank is derived from the Szeged Dependency Treebank (Vincze et al. 2010).

Contributors: Richárd Farkas, Katalin Simkó, Zsolt Szántó, Viktor Varga, Veronika Vincze
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Icelandic 4 1,183K IE, Germanic

Icelandic treebanks

IcePaHC 985K ⓁⒻ

UD_Icelandic-IcePaHC is a conversion of the [Icelandic Parsed Historical Corpus (IcePaHC)](https://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC)) to the Universal Dependencies scheme. The conversion was done using [UDConverter](https://github.com/thorunna/UDConverter).

Contributors: Þórunn Arnardóttir, Hinrik Hafsteinsson, Einar Freyr Sigurðsson, Hildur Jónsdóttir, Kristín Bjarnadóttir, Anton Karl Ingason, Kristján Rúnarsson, Steinþór Steingrímsson, Joel C. Wallenberg, Eiríkur Rögnvaldsson
Repository master dev
README
Treebank hub page
Download

Modern 80K ⓁⒻ

UD_Icelandic-Modern is a conversion of the [modern additions](https://github.com/antonkarl/icecorpus/tree/master/additions2019) to the Icelandic Parsed Historical Corpus (IcePaHC) to the Universal Dependencies scheme.

Contributors: Kristján Rúnarsson, Þórunn Arnardóttir, Hinrik Hafsteinsson, Starkaður Barkarson, Hildur Jónsdóttir, Steinþór Steingrímsson, Einar Freyr Sigurðsson
Repository master dev
README
Treebank hub page
Download

GC 99K ⓁⒻ

UD_Icelandic-GC is a conversion of the gold part of [GreynirCorpus](https://github.com/mideind/GreynirCorpus), which has been manually corrected and verified. The corpus is parsed into full constituency trees, and converted using [UDConverter-GreynirCorpus](https://github.com/thorunna/UDConverter-GreynirCorpus).

Contributors: Vilhjálmur Þorsteinsson, Hulda Óladóttir, Þórunn Arnardóttir, Sveinbjörn Þórðarson, Haukur Barri Símonarson, Katla Ásgeirsdóttir
Repository master dev
README
Treebank hub page
Download

PUD 18K ⓁⒻ Ⓟ

Icelandic-PUD is the Icelandic part of the Parallel Universal Dependencies (PUD) treebanks.

Contributors: Hildur Jónsdóttir
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Icelandic treebanks.

Language documentation

See the language documentation page.

Ika 1 5K Chibchan, Arhuacic

Ika treebanks

ChibErgIS 5K ⓁⒻ

A Universal Dependencies corpus for Ika, a member of the Chibchan language family. The language is spoken by about 25,000 speakers in Colombia.

Contributors: Jana Bajorat, Natalia Cáceres Arandia
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Indonesian 3 169K Austronesian, Malayo-Sumbawan

Indonesian treebanks

PUD 19K ⓁⒻ Ⓟ

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Ruli Manurung, Muh Shohibussirri, Martin Popel, Daniel Zeman, Ika Alfina, Arawinda Dinakaramani, Muhammad Yudistira Hanifmuti, Jessica Naraiswari Arwidarasti, Yogi Lesmana Sulestio
Repository master dev
README
Treebank hub page
Download

GSD 122K ⓁⒻ

The Indonesian-GSD treebank was originally converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb) in 2015. In order to comply with the latest Indonesian annotation guidelines, the treebank has undergone a major revision between UD releases v2.8 and v2.9 (2021).

Contributors: Ryan McDonald, Joakim Nivre, Daniel Zeman, Septina Dian Larasati, Ika Alfina
Repository master dev
README
Treebank hub page
Download

CSUI 28K ⓁⒻ

UD Indonesian-CSUI is a conversion from an Indonesian constituency treebank in the Penn Treebank format named [**Kethu**](https://github.com/ialfina/kethu), which was itself a conversion from a constituency treebank built by [**Dinakaramani et al. (2015)**](https://github.com/famrashel/idn-treebank). We named this treebank **Indonesian-CSUI** because all three versions of the treebank were built at the Faculty of Computer Science, Universitas Indonesia.

Contributors: Ika Alfina, Jessica Naraiswari Arwidarasti, Muhammad Yudistira Hanifmuti, Arawinda Dinakaramani, Ruli Manurung, Fam Rashel, Andry Luthfi
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Indonesian treebanks.

Language documentation

See the language documentation page.

Irish 3 168K IE, Celtic

Irish treebanks

IDT 115K ⓁⒻ

A Universal Dependencies 4910-sentence treebank for modern Irish.

Contributors: Teresa Lynn, Jennifer Foster, Sarah McGuinness, Abigail Walsh, Jason Phelan, Kevin Scannell
Repository master dev
README
Treebank hub page
Download

TwittIrish 47K Ⓛ

A Universal Dependencies treebank of 2596 tweets in modern Irish.

Contributors: Lauren Cassidy, Teresa Lynn, Jennifer Foster, Sarah McGuinness
Repository master dev
README
Treebank hub page
Download

Cadhan 4K ⓁⒻ

This is the Cadhan Aonair UD treebank, consisting of 150 sentences randomly sampled from six pre-standard Irish texts. It was subsequently augmented with a late Early Modern Irish syllabic poem representing 43 sentences.

Contributors: Kevin Scannell, Theodorus Fransen
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Irish treebanks.

Language documentation

See the language documentation page.

Italian 11 1,020K IE, Romance

Italian treebanks

ISDT 298K ⓁⒻⒺ

The Italian corpus annotated according to the UD annotation scheme was obtained by conversion from ISDT (Italian Stanford Dependency Treebank), released for the dependency parsing shared task of Evalita-2014 (Bosco et al. 2014).

Contributors: Cristina Bosco, Alessandro Lenci, Simonetta Montemagni, Maria Simi
Repository master dev
README
Treebank hub page
Download

VIT 280K ⓁⒻ

The UD_Italian-VIT corpus was obtained by conversion from VIT (Venice Italian Treebank), developed at the Laboratory of Computational Linguistics of the Università Ca' Foscari in Venice (Delmonte et al. 2007; Delmonte 2009; http://rondelmo.it/resource/VIT/Browser-VIT/index.htm).

Contributors: Fabio Tamburini, Maria Simi, Cristina Bosco
Repository master dev
README
Treebank hub page
Download

KIParlaForest 18K ⓁⒻ

The KIParla Forest treebank is a treebank of spoken Italian based on the [KIParla Corpus](https://kiparla.it/).

Contributors: Ludovica Pannitto, Eleonora Zucchini, Cristina Bosco, Caterina Mauri, Manuela Sanguinetti, Esther Cocco
Repository master dev
README
Treebank hub page
Download

Valico 6K ⓁⒻ

Manually corrected treebank of Learner Italian, drawn from the Valico corpus, together with the corresponding corrected sentences.

Contributors: Elisa Di Nuovo, Manuela Sanguinetti, Cristina Bosco, Alessandro Mazzei
Repository master dev
README
Treebank hub page
Download

Old 122K ⓁⒻⒺ

Italian-Old is a treebank containing **Dante Alighieri's Comedy** (composed between approximately 1306 and 1321), based on the 1994 Petrocchi edition and taken from the [**DanteSearch corpus**](https://dantesearch.dantenetwork.it), originally created at the University of Pisa, Italy. It is a treebank of Old Italian, specifically Florentine.

Contributors: Claudia Corbetta, Marco Passarotti, Flavio Massimiliano Cecchini, Giovanni Moretti
Repository master dev
README
Treebank hub page
Download

ParlaMint 20K ⓁⒻ

ParlaMint-It is a collection of transcriptions of parliamentary sessions of the Italian Senate annotated in Universal Dependencies. The corpus is part of a larger multilingual collection of parliamentary transcripts built during the ParlaMint project (https://www.clarin.eu/parlamint).

Contributors: Chiara Alzetta, Marta Sartor, Simonetta Montemagni, Giulia Venturi
Repository master dev
README
Treebank hub page
Download

ParTUT 55K ⓁⒻ

UD_Italian-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

Contributors: Cristina Bosco, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

TWITTIRO 29K ⓁⒻ

TWITTIRÒ-UD is a collection of ironic Italian tweets annotated in Universal Dependencies. The treebank can be exploited for the training of NLP systems to enhance their performance on social media texts, and in particular, for irony detection purposes.

Contributors: Alessandra T. Cignarella, Cristina Bosco, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

PoSTWITA 124K ⓁⒻ

PoSTWITA-UD is a collection of Italian tweets annotated in Universal Dependencies that can be exploited for the training of NLP systems to enhance their performance on social media texts.

Contributors: Cristina Bosco, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

PUD 23K ⓁⒻ Ⓟ

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Antonio Stella, Davide Rovati, Martin Popel, Daniel Zeman, Maria Simi, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

MarkIT 40K ⓁⒻ

The MarkIT resource contains around 800 sentences extracted from students' essays and manually annotated with syntactic dependencies. The treebank covers seven types of marked constructions, plus some ambiguous sentences whose syntax can be wrongly classified as marked.

Contributors: Teresa Paccosi, Alessio Palmero Aprosio, Sara Tonelli
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Italian treebanks.

Language documentation

See the language documentation page.

Japanese 6 2,645K Japanese

Japanese treebanks

GSD 193K Ⓛ Ⓟ

This Universal Dependencies (UD) Japanese treebank follows the UD Japanese conventions described in the UD documentation. The original sentences are from Google UDT 2.0.

Contributors: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Ryan McDonald, Joakim Nivre, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

GSDLUW 150K Ⓛ Ⓟ

This Universal Dependencies (UD) Japanese treebank follows the UD Japanese conventions described in the UD documentation. The original sentences are from Google UDT 2.0.

Contributors: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Ryan McDonald, Joakim Nivre, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

PUD 28K Ⓛ Ⓟ

Contributors: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Kaoru Ito, Taishi Chika, Shinsuke Mori, Sumire Uematsu, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Atsuko Shimada, Anna Trukhina, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

PUDLUW 22K Ⓛ Ⓟ

Contributors: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Kaoru Ito, Taishi Chika, Shinsuke Mori, Sumire Uematsu, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Atsuko Shimada, Anna Trukhina, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

BCCWJ 1,253K ✘Ⓛ

Contributors: Mai Omura, Masayuki Asahara, Yusuke Miyao, Takaaki Tanaka, Hiroshi Kanayama, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Yugo Murawaki
Repository master dev
README
Treebank hub page
Download

BCCWJLUW 995K ✘Ⓛ

This Universal Dependencies (UD) Japanese treebank follows the UD Japanese conventions described in the UD documentation. The original sentences are from the "Balanced Corpus of Contemporary Written Japanese" (BCCWJ). UD-Japanese-BCCWJLUW is an alternative word segmentation version of UD-Japanese-BCCWJ that uses the **Long Unit Word (LUW)** as the syntactic word in the UD sense.

Contributors: Mai Omura, Masayuki Asahara, Yusuke Miyao, Takaaki Tanaka, Hiroshi Kanayama, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Yugo Murawaki
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Japanese treebanks.

Language documentation

See the language documentation page.

Javanese 1 14K Austronesian, Javanese

Javanese treebanks

CSUI 14K ⓁⒻ

UD Javanese-CSUI is a dependency treebank in Javanese, a regional language in Indonesia with more than 68 million users. It was developed by Alfina et al. from the Faculty of Computer Science, Universitas Indonesia. The newest version has 1000 sentences and 14K words with manual annotation.

Contributors: Ika Alfina, Arlisa Yuliawati, Dipta Tanaya, Arawinda Dinakaramani, Daniel Zeman, Putri Rizqiyah, Sri Hartati Wijono, Rangga Prangwedana Prangwedana
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Kaapor 1 <1K Tupian, Maweti-Guarani

Kaapor treebanks

TuDeT <1K ⓁⒻ

**UD_Kaapor-TuDeT** is a collection of annotated sentences in [Ka'apor](https://glottolog.org/resource/languoid/id/urub1250). The project is a work in progress and the treebank is being updated on a regular basis.

Contributors: Fabrício Ferraz Gerardi, Carolina Aragon, Gustavo Godoy
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Kadiweu 1 <1K Guaicuruan

Kadiweu treebanks

Unicamp <1K ⓁⒻ

UD_Kadiweu-UNICAMP is a treebank for [Kadiwéu](https://glottolog.org/resource/languoid/id/kadi1248) (ISO-639: `kbc`), an endangered Indigenous language of Brazil. It consists of isolated sentences produced by native speakers.

Contributors: Filomena Spatti Sandalo, Leonel Figueiredo de Alencar, Charlotte Chambelland Galves, Luiz Veronesi, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Kangri 1 2K IE, Indic

Kangri treebanks

KDTB 2K ⓁⒻ

The Kangri UD Treebank (KDTB) is a part of the Universal Dependency treebank project.

Contributors: Shweta Chauhan, Shefali Saxena, Apoorva Jha, Philemon Daniel
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Karelian 1 3K Uralic, Finnic

Karelian treebanks

KKPP 3K ⓁⒻ

UD Karelian-KKPP is a new, manually annotated corpus of Karelian created in the Universal Dependencies annotation scheme. The data is collected from [VepKar corpora](http://dictorpus.krc.karelia.ru/en/corpus/text) and consists mostly of modern news texts, along with some stories and educational texts.

Contributors: Tommi A Pirinen
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Karo 1 2K Tupian, Ramarama

Karo treebanks

TuDeT 2K ⓁⒻ

UD_Karo-TuDeT is a collection of annotated sentences in [Karo](https://glottolog.org/resource/languoid/id/karo1306). The sentences stem from the only grammatical description of the language (Gabas, 1999) and from the sentences in the dictionary by the same author (Gabas, 2007). Sentence annotation and documentation by Fabrício Ferraz Gerardi.

Contributors: Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Kazakh 1 10K Turkic, Northwestern

Kazakh treebanks

KTB 10K ⓁⒻ

The UD Kazakh treebank is a combination of text from various sources, including Wikipedia, some folk tales, sentences from the UDHR, news, and phrasebook sentences. Sentence IDs include partial document identifiers.

Contributors: Aibek Makazhanov, Jonathan North Washington, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Khoekhoe 1 29K Khoe-Kwadi

Khoekhoe treebanks

KDT 29K ⓁⒻⒺ Ⓟ

UD_Khoekhoe-KDT is a Universal Dependencies (UD) treebank for the Khoekhoegowab (Khoekhoe) language. The annotation was performed manually based on glosses. This treebank includes texts from various sources: fiction, grammar, and spoken conversation. The treebank contains **27k tokens**, distributed as follows: **training set**, 15k tokens; **development set**, 2k tokens; **test set**, 10k tokens.

Contributors: Kira Tulchynska, Alena Witzlack-Makarevich, Sylvanus Job, Michael Hahn
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Kiche 1 10K Mayan

Kiche treebanks

IU 10K ⓁⒻ

UD Kʼicheʼ-IU is a treebank consisting of sentences from a variety of text domains but principally dictionary example sentences and linguistic examples.

Contributors: Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Komi Permyak 1 1K Uralic, Permic

Komi Permyak treebanks

UH 1K ⓁⒻ

This is a Komi-Permyak literary language treebank consisting of original and translated texts.

Contributors: Larisa Ponomareva, Niko Partanen, Jack Rueter, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Komi Zyrian 2 10K Uralic, Permic

Komi Zyrian treebanks

Lattice 8K ⓁⒻ

UD Komi-Zyrian Lattice is a treebank of written standard Komi-Zyrian.

Contributors: Niko Partanen, KyungTae Lim, Thierry Poibeau, Jack Rueter
Repository master dev
README
Treebank hub page
Download

IKDP 2K ⓁⒻ

This treebank consists of dialectal transcriptions of spoken Komi-Zyrian. The current texts are short recorded segments from different areas where the Iźva dialect of Komi language is spoken.

Contributors: Niko Partanen, Rogier Blokland, Michael Rießler, Jack Rueter
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Komi Zyrian treebanks.

Language documentation

See the language documentation page.

Korean 5 615K Korean

Korean treebanks

KSL 155K Ⓛ

UD_Korean-KSL is a dependency treebank of second-language (L2) Korean.

Contributors: Hakyung Sung, Gyu-Ho Shin
Repository master dev
README
Treebank hub page
Download

Kaist 350K Ⓛ

The KAIST Korean Universal Dependency Treebank is generated by Chun et al., 2018 from the constituency trees in the [KAIST Tree-Tagging Corpus](http://semanticweb.kaist.ac.kr/home/index.php/Corpus4).

Contributors: Jinho Choi, Na-Rae Han, Jena Hwang, Jayeol Chun
Repository master dev
README
Treebank hub page
Download

GSD 80K Ⓛ

The Google Korean Universal Dependency Treebank is first converted from the [Universal Dependency Treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb), and then enhanced by Chun et al., 2018.

Contributors: Ryan McDonald, Joakim Nivre, Daniel Zeman, Jinho Choi, Na-Rae Han, Jena Hwang, Jayeol Chun
Repository master dev
README
Treebank hub page
Download

PUD 16K ⓁⒻ Ⓟ

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Sookyoung Kwak, Yongseok Cho, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

LittlePrince 13K ⓁⒻⒺ

UD Korean-LittlePrince is a UD adaptation of the k-SNACS dataset [(Hwang et al. 2020)](https://aclanthology.org/2020.dmr-1.6/).

Contributors: Junghyun Min, Jena Hwang, Nathan Schneider
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Korean treebanks.

Language documentation

See the language documentation page.

Kyrgyz 2 25K Turkic, Northwestern

Kyrgyz treebanks

KTMU 24K ⓁⒻ

UD_Kyrgyz-KTMU is a dependency treebank of the Kyrgyz language. The dataset mostly contains headlines from Kyrgyz news websites.

Contributors: İbrahim Benli
Repository master dev
README
Treebank hub page
Download

TueCL 1K ⓁⒻ Ⓟ

This is a small treebank of grammatical examples for Kyrgyz. It is part of a parallel Universal Dependencies corpus containing 148 sentences across four Turkic languages, designed to facilitate cross-linguistic research on these related languages.

Contributors: Bermet Chontaeva, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Kyrgyz treebanks.

Language documentation

See the language documentation page.

Latgalian 1 <1K IE, Baltic

Latgalian treebanks

Cairo <1K ⓁⒻⒺ Ⓟ

UD_Latgalian-Cairo is an example treebank that provides a minimal dataset for Latgalian based on the Cairo sample sentences. Created by [AI Lab](http://ailab.lv) at the Institute of Mathematics and Computer Science, University of Latvia.

Contributors: Lauma Pretkalniņa, Gunta Nešpore-Bērzkalne
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Latin 6 1,012K IE, Italic

Latin treebanks

ITTB 450K ⓁⒻ

Latin data from the _Index Thomisticus_ Treebank. Data are taken from the _Index Thomisticus_ corpus by Roberto Busa SJ, which contains the complete work by Thomas Aquinas (1225–1274; Medieval Latin) and by 61 other authors related to Thomas.

Contributors: Marco Passarotti, Marinella Testori, Daniel Zeman, Berta González Saavedra, Flavio Massimiliano Cecchini
Repository master dev
README
Treebank hub page
Download

LLCT 242K ⓁⒻ

This Universal Dependencies version of the **LLCT** (Late Latin Charter Treebank) consists of an automated conversion of the **LLCT2** treebank from the Latin Dependency Treebank (LDT) format into the Universal Dependencies standard.

Contributors: Timo Korkiakangas, Flavio Massimiliano Cecchini, Marco Passarotti
Repository master dev
README
Treebank hub page
Download

UDante 55K ⓁⒻ

The **UDante** treebank is based on the Latin texts of Dante Alighieri, taken from the [**DanteSearch corpus**](https://dantesearch.dantenetwork.it), originally created at the University of Pisa, Italy. It is a treebank of Latin language, more precisely of **literary Medieval Latin** (XIVth century).

Contributors: Flavio Massimiliano Cecchini, Giovanni Moretti, Marco Passarotti, Rachele Sprugnoli, Daniela Corbetta, Federica Favero, Federica Gamba, Martina de Laurentiis, Giulia Pedonese, Andrea Peverelli, Elena Vagnoni, Mirko Tavoni
Repository master dev
README
Treebank hub page
Download

CIRCSE 29K ⓁⒻ

UD_Latin-CIRCSE is a repository of treebanks featuring Latin texts natively annotated at the CIRCSE Research Centre in Milan (https://centridiricerca.unicatt.it/circse/en.html) following the Universal Dependencies (UD) (https://universaldependencies.org) annotation scheme. The repository includes prose and poetry texts from different periods.

Contributors: Federica Iurescia, Federica Gamba, Flavio Massimiliano Cecchini, Francesco Mambrini, Giovanni Moretti, Marco Passarotti, Paolo Ruffolo
Repository master dev
README
Treebank hub page
Download

Perseus 29K ⓁⒻ

This Universal Dependencies Latin Treebank consists of an automatic conversion of a selection of passages from the Ancient Greek and Latin Dependency Treebank 2.1.

Contributors: Giuseppe G. A. Celano, Daniel Zeman, Federica Gamba
Repository master dev
README
Treebank hub page
Download

PROIEL 205K ⓁⒻ

The Latin PROIEL treebank is based on the Latin data from the PROIEL treebank, and contains most of the Vulgate New Testament translations plus selections from Caesar's Gallic War, Cicero's Letters to Atticus, Palladius' Opus Agriculturae and the first book of Cicero's De officiis.

Contributors: Dag Haug
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Latin treebanks.

Language documentation

See the language documentation page.

Latvian 2 330K IE, Baltic

Latvian treebanks

LVTB 330K ⓁⒻⒺ

Latvian UD Treebank is based on Latvian Treebank ([LVTB](http://sintakse.korpuss.lv)), being created at University of Latvia, Institute of Mathematics and Computer Science, [Artificial Intelligence Laboratory](http://ailab.lv).

Contributors: Lauma Pretkalniņa, Laura Rituma, Gunta Nešpore-Bērzkalne, Baiba Saulīte, Artūrs Znotiņš, Normunds Grūzītis
Repository master dev
README
Treebank hub page
Download

Cairo <1K ⓁⒻⒺ Ⓟ

This is an example treebank made to illustrate the UD annotation choices made for Latvian, based on the Cairo sample sentences. Created by [AI Lab](http://ailab.lv) at the Institute of Mathematics and Computer Science, University of Latvia.

Contributors: Lauma Pretkalniņa, Laura Rituma, Baiba Saulīte, Gunta Nešpore-Bērzkalne
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Latvian treebanks.

Language documentation

See the language documentation page.

Ligurian 1 6K IE, Romance

Ligurian treebanks

GLT 6K ⓁⒻ Ⓟ ?

The Genoese Ligurian Treebank is a small, manually annotated collection of contemporary Ligurian prose. The focus of the treebank is written Genoese, the koiné variety of Ligurian which is associated with today's literary, journalistic and academic ligurophone sphere.

Contributors: Stefano Lusito, Jean Maillard
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Lithuanian 2 75K IE, Baltic

Lithuanian treebanks

ALKSNIS 70K ⓁⒻⒺ

The Lithuanian dependency treebank ALKSNIS v3.0 (Vytautas Magnus University).

Contributors: Andrius Utka, Erika Rimkutė, Agnė Bielinskienė, Jolanta Kovalevskaitė, Loïc Boizou, Gabrielė Aleksandravičiūtė, Kristina Brokaitė, Daniel Zeman, Natalia Perkova, Bernadeta Griciūtė
Repository master dev
README
Treebank hub page
Download

HSE 5K ⓁⒻ

Lithuanian treebank annotated manually (dependencies) using the Morphological Annotator by CCL, Vytautas Magnus University (http://tekstynas.vdu.lt/) and manual disambiguation. A pilot version which includes news and an essay by Tomas Venclova is available here.

Contributors: Olga Lyashevskaya, Dmitri Sitchinava
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Lithuanian treebanks.

Language documentation

See the language documentation page.

Livvi 1 1K Uralic, Finnic

Livvi treebanks

KKPP 1K ⓁⒻ

UD Livvi-KKPP is a new, manually annotated corpus of Livvi-Karelian created directly in the Universal Dependencies annotation scheme. The data is collected from [VepKar corpora](http://dictorpus.krc.karelia.ru/en/corpus/text) and consists mostly of modern news texts, along with some stories and educational texts.

Contributors: Tommi A Pirinen
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Low Saxon 1 22K IE, Germanic

Low Saxon treebanks

LSDC 22K ⓁⒻ

The UD Low Saxon LSDC dataset consists of sentences in 8 major Low Saxon dialect groups from both Germany and the Netherlands. These sentences are (or are to become) part of the LSDC dataset and represent the language from mostly the 19th and early 20th century in genres such as short stories, novels, speeches, letters and fairytales.

Contributors: Janine Siewert
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Luxembourgish 1 <1K IE, Germanic

Luxembourgish treebanks

LuxBank <1K Ⓛ Ⓟ

The LuxBank corpus currently consists of the translated Cairo CICLing examples and will be extended to include examples from a national dataset. It is the first comprehensive treebank dataset for Luxembourgish.

Contributors: Alistair Plum, Christoph Purschke, Caroline Döhmer, Anne-Marie Lutgen, Emilia Milano
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Macedonian 1 1K IE, Slavic

Macedonian treebanks

MTB 1K ⓁⒻ Ⓟ

The Macedonian-MTB treebank is a collection of annotated sentences taken from the Macedonian version of the Cairo CICLing Corpus and from the university textbook in syntax "Contemporary Macedonian Language 4" by Simov Sazdov.

Contributors: Vladimir Cvetkoski
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Madi 1 <1K Arawan

Madi treebanks

Jarawara <1K ⓁⒻ

UD_Madi-Jarawara is a collection of annotated sentences in Madí (Jarawara dialect) from a variety of sources, including grammar examples, oral stories, didactic material, and dictionary examples.

Contributors: Alan Vogel, Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Maghrebi Arabic French 1 19K Code switching

Maghrebi Arabic French treebanks

Arabizi 19K ⓁⒻ

A Universal Dependencies corpus for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent use of code-switching. We added NER annotations to the UD annotations, extending the French Treebank NER scheme (Sagot et al., 2012), as well as offensive language classification, and corrected many of the translations (still ongoing).

Contributors: Arij Riabi, Farah Essaidi, Amal Fethi, Menel Mahamdi, Djamé Seddah
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Makurap 1 <1K Tupian, Tupari

Makurap treebanks

TuDeT <1K ⓁⒻ

UD_Makuráp-TuDeT is a collection of annotated texts in Makuráp. The project is a work in progress and the treebank is being updated on a regular basis. The sentences are being annotated by Carolina Aragon, Fabrício Ferraz Gerardi, Luana dos Santos, and Luan Cabral.

Contributors: Carolina Aragon, Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Malayalam 1 2K Dravidian

Malayalam treebanks

UFAL 2K ⓁⒻ Ⓟ

Currently just a small sample of Malayalam grammatical examples.

Contributors: Abishek Stephen, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Maltese 1 44K Afro-Asiatic, Semitic

Maltese treebanks

MUDT 44K

MUDT (Maltese Universal Dependencies Treebank) is a manually annotated treebank of Maltese, a Semitic language of Malta descended from North African Arabic with a significant amount of Italo-Romance influence. MUDT was designed as a balanced corpus with four major genres (see Splitting below) represented roughly equally.

Contributors: Slavomír Čéplö, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Manx 1 20K IE, Celtic

Manx treebanks

Cadhan 20K ⓁⒻ

This is the Cadhan Aonair UD treebank for Manx Gaelic, created by Kevin Scannell.

Contributors: Kevin Scannell
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Marathi 2 122K IE, Indic

Marathi treebanks

UFAL 3K ⓁⒻ

UD Marathi is a manually annotated treebank consisting primarily of stories from Wikisource, and parts of an article on Wikipedia.

Contributors: Vinit Ravishankar
Repository master dev
README
Treebank hub page
Download

CMUPAN 118K ⓁⒻ

This treebank is a modified version of a semi-automatically created treebank authored by Aditi Chaudhary, which in turn is based on the treebanks released by KCIS, IIIT-Hyderabad. The treebank also contains Marathi-Discourse, a manually annotated 35-sentence corpus covering political discourse.

Contributors: Pranav Kushare, Aditi Chaudhary, Luigi Talamo, Annemarie Verkerk, Helena Vaz
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Marathi treebanks.

Language documentation

See the language documentation page.

Mbya Guarani 1 1K Tupian, Maweti-Guarani

Mbya Guarani treebanks

Thomas 1K ⓁⒻ

UD Mbya_Guarani-Thomas is a corpus of Mbyá Guaraní (Tupian) texts collected by Guillaume Thomas. The current version of the corpus consists of three speeches by Paulina Kerechu Núñez Romero, a Mbyá Guaraní speaker from Ytu, Caazapá Department, Paraguay.

Contributors: Guillaume Thomas
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Mbya Guarani treebanks.

Language documentation

See the language documentation page.

Middle Armenian 1 1K IE, Armenian

Middle Armenian treebanks

ArmTDP 1K ⓁⒻ

A Universal Dependencies treebank for Middle Armenian developed for UD originally by the ArmTDP team led by Marat M. Yavrumyan at the Yerevan State University.

Contributors: Anna S. Danielyan, Marat M. Yavrumyan
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Middle French 2 126K IE, Romance

Middle French treebanks

ALTM 7K ⓁⒻ

Middle-French ALTM (AUTOMATED Legal Texts Medieval) is a treebank of medieval legal French from Normandy. Currently it contains one text, an extract from _Coutume, style et usage au temps des Échiquiers de Normandie_, dated 1425.

Contributors: Natalia Romanova, Rayan Ziane, Khensa Daoudi, Théo Brillet
Repository master dev
README
Treebank hub page
Download

PROFITEROLE 119K ⓁⒻ

UD_Middle_French-PROFITEROLE is the Middle French section of the PROFITEROLE corpus; the Old French section is UD_OLD_FRENCH-PROFITEROLE.

Contributors: Sophie Prévost, Eric Villemonte de la Clergerie, Mathilde Regnault, Loïc Grobol, Benoît Crabbé, Mathieu Dehouck, Alexei Lavrentiev
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Middle French treebanks.

Language documentation

See the language documentation page.

Moksha 1 4K Uralic, Mordvin

Moksha treebanks

JR 4K ⓁⒻ

UD_Moksha-JR originates from the Erme Universal Dependencies annotated texts for Moksha and provides CoNLL-U annotation for texts in the Moksha language. It originally consisted of a sample from a number of fiction authors writing originals in Moksha.

Contributors: Jack Rueter, Maria Levina, Nadezhda Kabaeva, Judit Molnár, Khalid Alnajjar
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Munduruku 1 1K Tupian, Munduruku

Munduruku treebanks

TuDeT 1K ⓁⒻ

UD_Munduruku-TuDeT is a collection of annotated sentences in [Mundurukú](http://www.endangeredlanguages.com/lang/2981). The project is a work in progress and the treebank is being updated on a regular basis.

Contributors: Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Naga 1 3K Sino-Tibetan, Tangkhul-Maring

Naga treebanks

Suansu 3K ⓁⒻⒺ Ⓟ

UD_Naga-Suansu is a Universal Dependencies (UD) treebank for Suansu (Glottocode: suan1234), an endangered Tibeto-Burman language spoken on the Indo-Myanmar border. The annotation was performed manually based on glosses. This treebank includes texts from fiction and grammar. The treebank contains **3.1k** tokens, distributed as follows: **training set**, 2945 tokens; **test set**, 157 tokens.

Contributors: Jessica K. Ivani, Kira Tulchynska
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Naija 1 140K Creole

Naija treebanks

NSC 140K ⓁⒻ

A Universal Dependencies corpus for spoken Naija (Nigerian Pidgin).

Contributors: Bernard Caron, Emmett Strickland, Marine Courtin, Kim Gerdes, Bruno Guillaume, Sylvain Kahane, Chika Kennedy Ajede, Emeka Onwuegbuzia, Samson Tella
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Neapolitan 1 <1K IE, Romance

Neapolitan treebanks

RB <1K ⓁⒻ Ⓟ

This treebank contains example sentences in Neapolitan, translated by a native speaker.

Contributors: Rodolfo Basile, Daniel Zeman, Ludovica Pannitto, Arianna Masciolini
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Nenets 1 1K Uralic, Samoyedic

Nenets treebanks

Tundra 1K ⓁⒻ

The Tundra Nenets UD treebank is converted from the [Tundra Nenets mSUD treebank](https://github.com/surfacesyntacticud/mSUD_Nenets-Tundra). The conversion from mSUD to UD is performed automatically followed by a comprehensive manual revision to ensure compliance with the UD annotation standards.

Contributors: Morgane Bona, Bruno Guillaume, Sylvain Kahane, Aleksandra Miletić, Nikolett Mus, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Nepali 1 <1K IE, Indic

Nepali treebanks

BK <1K ⓁⒻ

UD_Nepali-BK is a manually annotated Universal Dependencies treebank for Nepali, an Indo-Aryan language written in Devanagari. The treebank contains sentences from a fictional narrative story and an argumentative discourse text, and follows the Universal Dependencies v2 guidelines.

Contributors: Samuel BK, Luigi Talamo, Annemarie Verkerk
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Nheengatu 1 26K Tupian, Maweti-Guarani

Nheengatu treebanks

CompLin 26K ⓁⒻ

[UD_Nheengatu-CompLin](https://aclanthology.org/2024.propor-2.8) is a treebank of [Nheengatu](https://glottolog.org/resource/languoid/id/nhen1239), also known as Modern Tupi and *Língua Geral Amazônica* (ISO 639: `yrl`). It comprises sentences drawn from a wide range of published sources, including spontaneous speech, grammatical descriptions, fables, myths, coursebooks, and dictionaries.

Contributors: Leonel Figueiredo de Alencar, Dominick Maia Alexandre
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

North Sami 1 26K Uralic, Sami

North Sami treebanks

Giella 26K ⓁⒻ

This is a North Sámi treebank based on a manually disambiguated and function-labelled gold-standard corpus of North Sámi produced by the Giellatekno team at UiT Norgga árktalaš universitehta.

Contributors: Trond Trosterud, Lene Antonsen, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Northern Kurdish 1 10K IE, Iranian

Northern Kurdish treebanks

Kurmanji 10K ⓁⒻ

The treebank is a corpus of Kurmanji Kurdish. It contains fiction and encyclopaedic texts in roughly equal measure. It has been annotated natively in accordance with the UD annotation scheme.

Contributors: Memduh Gökırmak, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Northwest Gbaya 1 2K Niger-Congo, Gbaya-Manza-Ngbaka

Northwest Gbaya treebanks

Autogramm 2K ⓁⒻ

A Universal Dependencies corpus for Northwest Gbaya, a member of the Gbaya branch of the Atlantic-Congo phylum. The language is spoken by about 250,000 speakers, mainly in the Central African Republic.

Contributors: Paulette Roulon
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Norwegian 2 611K IE, Germanic

Norwegian treebanks

Bokmaal 310K ⓁⒻ

The Norwegian UD treebank is based on the Bokmål section of the Norwegian Dependency Treebank (NDT), which is a syntactic treebank of Norwegian. The current version of NDT has been automatically converted to the UD scheme by Ingerid Løyning Dale, Per Erik Solberg and Andre Kåsen at the Norwegian Language Bank at the National Library of Norway. This conversion builds to a large extent on previous conversions by Lilja Øvrelid at the University of Oslo.

Contributors: Lilja Øvrelid, Fredrik Jørgensen, Petter Hohle, Thea Tollersrud, Ingerid Løyning Dale, Per Erik Solberg, Andre Kåsen
Repository master dev
README
Treebank hub page
Download

Nynorsk 301K ⓁⒻ

The Norwegian UD treebank is based on the Nynorsk section of the Norwegian Dependency Treebank (NDT), which is a syntactic treebank of Norwegian. NDT has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.

Contributors: Lilja Øvrelid, Fredrik Jørgensen, Petter Hohle, Thea Tollersrud, Ingerid Løyning Dale, Per Erik Solberg, Andre Kåsen
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Norwegian treebanks.

Language documentation

See the language documentation page.

Occitan 1 25K IE, Romance

Occitan treebanks

TTB 25K Ⓛ

The Tolosa Treebank was developed as part of the EFA 227/16 LINGUATEC Project, financed by the POCTEFA Interreg European funds. It includes data from literature, newspapers, encyclopedias, scientific papers and web blogs.

Contributors: Aleksandra Miletić, Myriam Bras, Louise Esher, Clamença Poujade, Jean Sibille, Marianne Vergez-Couret
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Odia 1 5K IE, Indic

Odia treebanks

ODTB 5K ⓁⒻ

The Odia UD Treebank (ODTB) is a part of the Universal Dependency treebank project.

Contributors: Shantipriya Parida, Kalyanamalini Sahoo, Atul Kr. Ojha, Saraswati Sahoo, Biswakalpita Mohapatra, Satya Ranjan Dash, Bijayalaxmi Dash, Kusum Lata
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Old Church Slavonic 1 198K IE, Slavic

Old Church Slavonic treebanks

PROIEL 198K ⓁⒻ

The Old Church Slavonic (OCS) UD treebank is based on canonical Old Church Slavonic data from the PROIEL and TOROT treebanks.

Contributors: Dag Haug
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Old East Slavic 4 580K IE, Slavic

Old East Slavic treebanks

RNC 168K ⓁⒻ

`UD_Old_East_Slavic-RNC` is a sample of the Middle Russian corpus (1300-1700), a part of the Russian National Corpus. The data were originally annotated according to the RNC and extended UD-Russian morphological schemas and UD 2.4 dependency schema.

Contributors: Olga Lyashevskaya, Dmitri Sitchinava
Repository master dev
README
Treebank hub page
Download

Ruthenian 137K ⓁⒻ

The Ruthenian UD treebank includes texts written in the territories of modern Belarus, Lithuania, Ukraine, and Poland in ca. 1300-1700. A sample of legal and nonfiction texts is drawn from the Ruthenian Corpus.

Contributors: Olga Lyashevskaya, Dmitri Sitchinava, Maria Shvedova
Repository master dev
README
Treebank hub page
Download

TOROT 246K ⓁⒻ

UD_Old_East_Slavic-TOROT is a conversion of a selection of Old East Slavonic and Middle Russian data from the Tromsø Old Russian and OCS Treebank (TOROT), which was originally annotated in PROIEL dependency format.

Contributors: Hanne Eckhoff
Repository master dev
README
Treebank hub page
Download

Birchbark 27K ⓁⒻ

UD_Old_East_Slavic-Birchbark is based on the RNC Corpus of Birchbark Letters and includes documents written in 1025-1500 in an East Slavic vernacular (letters, household and business records, records for church services, spells against diseases, and other short inscriptions). The treebank is manually syntactically annotated in the UD 2.0 scheme; the morphological and lexical annotation is a conversion of the original RNC annotation.

Contributors: Olga Lyashevskaya
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Old East Slavic treebanks.

Language documentation

See the language documentation page.

Old English 1 <1K IE, Germanic

Old English treebanks

Cairo <1K ⓁⒻ Ⓟ

Old English [Cairo](https://github.com/UniversalDependencies/cairo) sentences with UD and additional annotations

Contributors: Lauren Levine, Junghyun Min, Amir Zeldes
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Old French 2 255K IE, Romance

Old French treebanks

ALTM 15K ⓁⒻ

Old-French ALTM (AUTOMATED Legal Texts Medieval) is a treebank of medieval legal French from Normandy. Currently it contains one text, _Atiremens et jugiés d'eschequiers_, dated 1314.

Contributors: Natalia Romanova, Rayan Ziane, Mathieu Goux, Khensa Daoudi, Pierre Larrivée
Repository master dev
README
Treebank hub page
Download

PROFITEROLE 240K ⓁⒻ

UD_Old_French-PROFITEROLE is an expansion of the previous UD_Old_French-SRCMF, which was a conversion of part of the SRCMF corpus (Syntactic Reference Corpus of Medieval French, [srcmf.org](http://srcmf.org/)).

Contributors: Sophie Prévost, Aurélie Collomb, Kim Gerdes, Isabelle Tellier, Marine Courtin, Alexei Lavrentiev, Céline Guillot-Barbance, Loïc Grobol, Mathilde Regnault, Mathieu Dehouck
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Old French treebanks.

Language documentation

See the language documentation page.

Old Georgian 1 6K Kartvelian

Old Georgian treebanks

GLC 6K ⓁⒻ

The Old Georgian UD Treebank (UD_Old_Georgian-GLC) is the first syntactically annotated corpus of Georgian, based on a collection of annotated sentences selected from the Old Georgian Language Corpus (OGLC) available at https://oge.iliauni.edu.ge/.

Contributors: Irina Lobzhanidze
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Old Irish 2 <1K IE, Celtic

Old Irish treebanks

DipWBG <1K ⓁⒻ

A Universal Dependencies treebank for the Old Irish Würzburg glosses.

Contributors: Adrian Doyle
Repository master dev
README
Treebank hub page
Download

DipSGG <1K ⓁⒻ

A Universal Dependencies treebank for the Old Irish glosses of St. Gall.

Contributors: Adrian Doyle
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Old Irish treebanks.

Language documentation

See the language documentation page.

Old Occitan 1 52K IE, Romance

Old Occitan treebanks

CorAG 52K Ⓕ

UD_Old_Occitan-CorAG (Corpus de l'Ancien Gascon) is a corpus of medieval and early modern legal texts in Gascon, a variety of Old Occitan. The texts were digitized from existing editions and subsequently manually annotated in Universal Dependencies (PoS, functions and some morphological features).

Contributors: Barbara Francioni, Natalia Romanova, Rayan Ziane, Khensa Daoudi, Pierre Larrivée
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Old Turkish 1 <1K Turkic, Northeastern

Old Turkish treebanks

Clausal <1K

This repository contains an [Old Turkish](https://iso639-3.sil.org/code/otk) treebank built upon Old Turkic script texts.

Contributors: Mehmet Oguz Derin, Takahiro Harada
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Ottoman Turkish 3 31K Turkic, Southwestern

Ottoman Turkish treebanks

DUDU 22K ⓁⒻ Ⓟ

An Ottoman Turkish dependency treebank annotated in UD style. Created by Enes Yılandiloğlu.

Contributors: Enes Yılandiloğlu
Repository master dev
README
Treebank hub page
Download

TueCL <1K ⓁⒻ Ⓟ

The Ottoman Turkish-TueCL treebank is part of a parallel Universal Dependencies corpus containing 148 sentences across five Turkic languages (Turkish, Azerbaijani, Kyrgyz, Uzbek, and Ottoman Turkish), designed to facilitate cross-linguistic research on these related languages.

Contributors: Enes Yılandiloğlu
Repository master dev
README
Treebank hub page
Download

BOUN 8K ⓁⒻ

An Ottoman Turkish dependency treebank annotated in UD style. Created by [Şaziye Betül Özateş](https://sb-b.github.io/), Tarık Emre Tıraş, Efe Eren Genç from Boğaziçi University, and Esma Fatıma Bilgin Taşdemir from Medeniyet University.

Contributors: Şaziye Betül Özateş, Tarık Emre Tıraş, Efe Eren Genç, Esma Fatıma Bilgin Taşdemir
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Ottoman Turkish treebanks.

Language documentation

See the language documentation page.

Pashto 2 6K IE, Iranian

Pashto treebanks

Sikaram 5K ⓁⒻ Ⓟ

The Pashto-Sikaram treebank is a native UD treebank with manually annotated texts from various sources.

Contributors: Ján Faryad, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Prince 1K ⓁⒻ

The UD Pashto-Prince treebank contains manually annotated Pashto sentences from two textual sources: 50 sentences from Le Petit Prince, which was then translated and adapted into Northern Pashto, and 14 sentences from a Pashto prose text on Pashtun leadership. All sentences are annotated natively according to Universal Dependencies guidelines.

Contributors: Salwan Aziz, Luigi Talamo, Annemarie Verkerk
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Pashto treebanks.

Language documentation

See the language documentation page.

Paumari 1 <1K Arawan

Paumari treebanks

TueCL <1K

This is a small treebank of Paumari, a low-resource Amazonian language.

Contributors: Annika Ott, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Persian 2 654K IE, Iranian

Persian treebanks

PerDT 501K ⓁⒻ

The Persian Universal Dependency Treebank (PerUDT) is the result of automatic conversion of the Persian Dependency Treebank (PerDT) with extensive manual corrections. Please refer to the following work if you use this data: Mohammad Sadegh Rasooli, Pegah Safari, Amirsaeid Moloodi, and Alireza Nourian. "The Persian Dependency Treebank Made Universal". 2020 (to appear).

Contributors: Mohammad Sadegh Rasooli, Pegah Safari, Amirsaeid Moloodi, Alireza Nourian
Repository master dev
README
Treebank hub page
Download

Seraji 152K ⓁⒻ

The Persian Universal Dependency Treebank (Seraji) is based on the Uppsala Persian Dependency Treebank (UPDT). The conversion of the UPDT to Universal Dependencies was performed semi-automatically with extensive manual checks and corrections.

Contributors: Mojgan Seraji, Filip Ginter, Joakim Nivre, Martin Popel, Daniel Zeman, Minoo Nassajian
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Persian treebanks.

Language documentation

See the language documentation page.

Pesh 1 4K Chibchan, Pesh

Pesh treebanks

ChibErgIS 4K ⓁⒻ

A Universal Dependencies corpus for Pesh (aka Paya), a member of the Chibchan language family. The language is spoken by about 500 speakers in Honduras.

Contributors: Natalia Cáceres Arandia, Claudine Chamoreau, Sylvain Kahane, Bruno Guillaume
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Phrygian 1 1K IE, Greek

Phrygian treebanks

KUL 1K ⓁⒻ

UD Phrygian-KUL started as part of a Master's thesis in linguistics at KU Leuven, annotating the New Phrygian subcorpus of the ancient Phrygian language. It has since expanded to include Old and Middle Phrygian texts.

Contributors: Oggi Peeters
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Polish 4 546K IE, Slavic

Polish treebanks

PDB 349K ⓁⒻⒺ

The Polish PDB-UD treebank is automatically converted from the Polish Dependency Bank 2.0 (PDB 2.0). Both treebanks were created at the [Institute of Computer Science, Polish Academy of Sciences](https://ipipan.waw.pl/en/) in Warsaw (Poland).

Contributors: Alina Wróblewska, Daniel Zeman, Jan Mašek, Rudolf Rosa
Repository master dev
README
Treebank hub page
Download

PUD 18K ⓁⒻⒺ Ⓟ

This is the Polish portion of the Parallel Universal Dependencies (PUD) treebanks, created at the [Institute of Computer Science, Polish Academy of Sciences](https://ipipan.waw.pl/en/) in Warsaw (Poland).

Contributors: Alina Wróblewska
Repository master dev
README
Treebank hub page
Download

LFG 130K ⓁⒻⒺ

The LFG Enhanced UD treebank of Polish is based on a corpus of LFG (Lexical Functional Grammar) syntactic structures generated by an LFG grammar of Polish, POLFIE, and manually disambiguated by human annotators.

Contributors: Agnieszka Patejuk, Adam Przepiórkowski
Repository master dev
README
Treebank hub page
Download

MPDT 47K ⓁⒻⒺ

UD_Polish-MPDT is a treebank of Middle Polish (17th–18th centuries). It is a rule-based conversion of the [Middle Polish Dependency Treebank](https://korba.edu.pl/treebank?lang=en) (Wieczorek, 2025) from its original annotation to the Universal Dependencies format. The MPDT sentences are sourced from the [KorBa corpus](https://korba.edu.pl/overview?lang=en) (Gruszczyński et al., 2022).

Contributors: Kamil Tomaszek, Alina Wróblewska, Aleksandra Wieczorek
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Polish treebanks.

Language documentation

See the language documentation page.

Pomak 1 34K IE, Slavic

Pomak treebanks

Philotis 34K ⓁⒻ

The Pomak UD treebank is derived from the Pomak Dependency Treebank, a resource developed and maintained by researchers at the Institute for Language and Speech Processing/Athena R.C. (http://www.ilsp.gr).

Contributors: Ritván Karahóǧa, Vivian Stamou, Stella Markantonatou
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Portuguese 7 1,545K IE, Romance

Portuguese treebanks

Porttinari 168K ⓁⒻⒺ

Porttinari-base [(Duran et al., 2023)](https://sol.sbc.org.br/index.php/stil/article/view/25443/25264) is the journalistic portion of Porttinari (which stands for “PORTuguese Treebank”), which shall be a large multigenre treebank for Portuguese [(Pardo et al., 2021)](https://sol.sbc.org.br/index.php/stil/article/view/17778/17612), following the "Universal Dependencies" international grammar framework [(de Marneffe et al., 2021)](https://aclanthology.org/2021.cl-2.11/).

Contributors: Magali Sanches Duran, Lucelene Lopes, Maria das Graças Volpe Nunes, Thiago Alexandre Salgueiro Pardo
Repository master dev
README
Treebank hub page
Download

DANTEStocks 80K ⓁⒻ

DANTEStocks (Di Felippo et al., 2024) is a collection of Brazilian Portuguese tweets on the stock market domain that is part of Porttinari (“PORTuguese Treebank”), which shall be a large multigenre treebank for Portuguese (Pardo et al., 2021), following the "Universal Dependencies" framework (de Marneffe et al., 2021).

Contributors: Ariani Di Felippo, Norton Trevisan Roman, Thiago Alexandre Salgueiro Pardo, Bryan Khelven da Silva Barbosa, Maria das Graças Volpe Nunes
Repository master dev
README
Treebank hub page
Download

PetroGold 250K ⓁⒻ

UD_Portuguese-PetroGold is a fully revised treebank which consists of academic texts from the oil & gas domain in Brazilian Portuguese.

Contributors: Elvis de Souza, Cláudia Freitas, Aline Silveira, Tatiana Cavalcanti, Maria Clara Castro, Wograine Evelyn
Repository master dev
README
Treebank hub page
Download

Bosque 227K ⓁⒻ

This Universal Dependencies (UD) Portuguese treebank is based on the Constraint Grammar converted version of the Bosque, which is part of the Floresta Sintá(c)tica treebank. It contains both European (CETEMPúblico) and Brazilian (CETENFolha) variants.

Contributors: Alexandre Rademaker, Cláudia Freitas, Elvis de Souza, Aline Silveira, Tatiana Cavalcanti, Wograine Evelyn, Luisa Rocha, Isabela Soares-Bastos, Eckhard Bick, Fabricio Chalub, Guilherme Paulino-Passos, Livy Real, Valeria de Paiva, Daniel Zeman, Martin Popel, David Mareček, Natalia Silveira, André Martins
Repository master dev
README
Treebank hub page
Download

CINTIL 475K ⓁⒻ

CINTIL-UDep is a dependency bank of Portuguese that is treebanked with Universal Dependencies. It contains over 38K annotated sentences (and 476K tokens), of mostly newspaper text.

Contributors: Mariana Avelãs, António Branco, Marisa Campos, Catarina Carvalheiro, Rita Carvalho, Sérgio Castro, Francisco Costa, Cláudia Martins, Rita Pereira, Sílvia Pereira, Clara Pinto, Andreia Querido, Joana Ramos, João Silva, Sara Silveira
Repository master dev
README
Treebank hub page
Download

GSD 318K ⓁⒻ

The Brazilian Portuguese UD is converted from the [Google Universal Dependency Treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

Contributors: Alexandre Rademaker, Ryan McDonald, Joakim Nivre, Daniel Zeman, Fabricio Chalub, Carlos Ramisch, Juan Belieni, Vanessa Berwanger Wille, Rodrigo Pintucci
Repository master dev
README
Treebank hub page
Download

PUD 23K ⓁⒻ Ⓟ

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Gustavo Mendonça, Larissa Rinaldi, Martin Popel, Daniel Zeman, Valeria de Paiva, Alexandre Rademaker, Elvis de Souza
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Portuguese treebanks.

Language documentation

See the language documentation page.

Punjabi 2 3K IE, Indic

Punjabi treebanks

CS 1K ⓁⒻ

UD_Punjabi_CS is a manually annotated treebank of Punjabi (also called Eastern Punjabi; Gurmukhi script). This Indo-Aryan language is spoken mainly in the Punjab region of India and Pakistan, with strong diasporas in Western countries such as Canada, the UK and the USA. It is written from left to right and has Subject-Object-Verb (SOV) word order.

Contributors: Ali Haider Khan, Luigi Talamo, Helena Vaz, Zarina Begum, Andrew Dyer, Annemarie Verkerk
Repository master dev
README
Treebank hub page
Download

Rang 1K ⓁⒻ

The Punjabi-Rang treebank is a manually annotated corpus in Punjabi (Shahmukhi script).

Contributors: Rimsha Abid, Luigi Talamo, Helena Vaz, Andrew Dyer, Annemarie Verkerk
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Punjabi treebanks.

Language documentation

See the language documentation page.

Romanian 6 942K IE, Romance

Romanian treebanks

ArT <1K ⓁⒻ

The UD treebank ArT is a treebank of the Aromanian dialect of the Romanian language in UD format.

Contributors: Verginica Barbu Mititelu, Mihaela Cristescu, Manuela Nevaci
Repository master dev
README
Treebank hub page
Download

MolDoRo <1K Ⓛ

A small treebank of sentences in Moldovan Romanian, using the Cyrillic writing system (as used in Moldova until 1989).

Contributors: Olesea Caftanatov, Atul Kr. Ojha
Repository master dev
README
Treebank hub page
Download

Nonstandard 572K ⓁⒻ

The Romanian Non-standard UD treebank (called UAIC-RoDia) is based on the UAIC-RoDia Treebank (ISLRN 156-635-615-024-0).

Contributors: Cătălina Mărănduc, Cenel-Augusto Perez, Victoria Bobicev, Cătălin Mititelu, Florinel Hociung, Valentin Roșca, Roman Untilov, Petru Rebeja
Repository master dev
README
Treebank hub page
Download

RRT 218K ⓁⒻ

The Romanian UD treebank (called RoRefTrees) (Barbu Mititelu et al., 2016) is the reference treebank in UD format for standard Romanian.

Contributors: Verginica Barbu Mititelu, Elena Irimia, Cenel-Augusto Perez, Radu Ion, Radu Simionescu, Martin Popel
Repository master dev
README
Treebank hub page
Download

SiMoNERo 146K ⓁⒻ

SiMoNERo is a medical corpus of contemporary Romanian.

Contributors: Maria Mitrofan, Verginica Barbu Mititelu
Repository master dev
README
Treebank hub page
Download

TueCL 4K ⓁⒻ

The Romanian Social Media Sexist Language UD Treebank is a reference treebank in Universal Dependencies (UD) format for Romanian sexist language. Currently small, it comprises a subset of tweets sourced from [CoRoSeOf](https://github.com/DianaHoefels/CoRoSeOf).

Contributors: Diana Hoefels, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Romanian treebanks.

Language documentation

See the language documentation page.

Russian 5 3,455K IE, Slavic

Russian treebanks

Taiga 1,758K ⓁⒻ

This Universal Dependencies treebank is based on data samples extracted from the Taiga Corpus and from the MorphoRuEval-2017 and GramEval-2020 shared task collections.

Contributors: Olga Lyashevskaya, Olga Rudina, Natalia Vlasova, Anna Zhuravleva
Repository master dev
README
Treebank hub page
Download

SynTagRus 1,515K ⓁⒻⒺ

Russian data from the SynTagRus corpus.

Contributors: Kira Droganova, Olga Lyashevskaya, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

GSD 97K ⓁⒻ

Russian Universal Dependencies Treebank annotated and converted by Google.

Contributors: Ryan McDonald, Vitaly Nikolaev, Olga Lyashevskaya
Repository master dev
README
Treebank hub page
Download

Poetry 64K ⓁⒻ

UD_Russian-Poetry contains samples of Russian poetry written from the 19th to the early 21st century. The treebank is based on the Poetry Corpus of the Russian National Corpus.

Contributors: Olga Lyashevskaya, Natalia Vlasova, Dmitri Sitchinava
Repository master dev
README
Treebank hub page
Download

PUD 19K ⓁⒻ Ⓟ

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Tatiana Lando, Olga Loginova, Martin Popel, Daniel Zeman, Kira Droganova
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Russian treebanks.

Language documentation

See the language documentation page.

Ruuli 1 6K Niger-Congo, Bantoid

Ruuli treebanks

RDT 6K ⓁⒻⒺ

UD_Ruuli-RDT is a Universal Dependencies (UD) treebank for the Ruruuli-Lunyala (Ruuli) language. The annotation was converted from interlinear glossed text and manually annotated for syntactic relations. The treebank includes texts from various sources: conversations, oral folktales, biographic monologue, movie subtitles, grammar examples, and factual prose. The treebank contains approximately 6,000 tokens.

Contributors: Kira Tulchynska, Anna Veselovsky, Alena Witzlack-Makarevich
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Sanskrit 2 208K IE, Indic

Sanskrit treebanks

UFAL 1K ⓁⒻ

A small Sanskrit treebank of sentences from Pañcatantra, an ancient Indian collection of interrelated fables by Vishnu Sharma.

Contributors: Puneet Dwivedi, Daniel Zeman, Erica Biagetti
Repository master dev
README
Treebank hub page
Download

Vedic 206K ⓁⒻ

The Treebank of Vedic Sanskrit contains 4,000 sentences with 27,000 words chosen from metrical and prose passages of the Ṛgveda (RV), the Śaunaka recension of the Atharvaveda (ŚS), the Maitrāyaṇīsaṃhitā (MS), and the Aitareya- (AB) and Śatapatha-Brāhmaṇas (ŚB). Lexical and morpho-syntactic information has been generated using tagging software and manually validated. POS tags have been induced automatically from the morpho-syntactic information of each word.

Contributors: Salvatore Scarlata, Elia Ackermann, Oliver Hellwig, Erica Biagetti, Paul Widmer, Sven Sellmer
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Sanskrit treebanks.

Language documentation

See the language documentation page.

Scottish Gaelic 1 90K IE, Celtic

Scottish Gaelic treebanks

ARCOSG 90K ⓁⒻ

A treebank of Scottish Gaelic based on the [Annotated Reference Corpus Of Scottish Gaelic (ARCOSG)](https://github.com/Gaelic-Algorithmic-Research-Group/ARCOSG).

Contributors: Colin Batchelor
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Serbian 1 97K IE, Slavic

Serbian treebanks

SET 97K ⓁⒻ Ⓟ

The Serbian UD treebank is based on the [SETimes-SR](http://hdl.handle.net/11356/1200) corpus and additional news documents from the Serbian web.

Contributors: Tanja Samardžić, Aleksandra Miletić, Nikola Ljubešić
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Shanghainese 1 8K Sino-Tibetan, Chinese

Shanghainese treebanks

ShUD 8K Ⓛ

**UD Shanghainese-ShUD** is the first UD treebank for Shanghainese.

Contributors: Qizhen Yang
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Sicilian 1 11K IE, Romance

Sicilian treebanks

STB 11K ⓁⒻ

The Sicilian Treebank is a small parallel corpus of Sicilian texts, automatically parsed and then manually revised, with Italian translations. It includes both contemporary and folkloric materials. The main focus is documenting typical morphosyntactic features of written Sicilian.

Contributors: Cristina Bosco, Sabrina D'Alì, Elisa Di Nuovo, Mario Guglielmetti, Caterina Maria Cappello
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Sindhi 1 95K IE, Indic

Sindhi treebanks

Isra 95K ⓁⒻ

A UD dataset for Sindhi, based on newswire (primarily Kawish), folk stories from the Adabi forums, handwritten text to demonstrate linguistic features, and a reparsing of the unfinished MazharDootio dataset.

Contributors: Mutee-u Rahman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Sinhala 2 1K IE, Indic

Sinhala treebanks

Appuwa <1K ⓁⒻ

This treebank contains a manually annotated Sinhala narrative based on the folk story "Appuwa", created as part of a project on treebank development for understudied languages within the Universal Dependencies framework.

Contributors: Warangana Sammani, Luigi Talamo, Annemarie Verkerk
Repository master dev
README
Treebank hub page
Download

STB <1K ⓁⒻ

This treebank consists of contemporary written Sinhala text taken from a 10M corpus maintained by UCSC, Sri Lanka. The corpus contains novels, short stories, Sinhala translations, critiques and Sinhala newspapers.

Contributors: Liyanage Chamila, Sarveswaran Kengatharaiyer
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Sinhala treebanks.

Language documentation

See the language documentation page.

Skolt Sami 1 3K Uralic, Sami

Skolt Sami treebanks

Giellagas 3K ⓁⒻ

The UD Skolt Sami Giellagas treebank is based almost entirely on spoken Skolt Sami corpora.

Contributors: Jack Rueter, Markus Juutinen, Francis Tyers, Tommi A Pirinen, Mika Hämäläinen
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Slovak 1 106K IE, Slavic

Slovak treebanks

SNK 106K ⓁⒻⒺ

The Slovak UD treebank is based on data originally annotated as part of the Slovak National Corpus, following the annotation style of the Prague Dependency Treebank.

Contributors: Katarína Gajdošová, Mária Šimková, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Slovenian 2 365K IE, Slavic

Slovenian treebanks

SSJ 267K ⓁⒻ

The SSJ treebank is the reference UD treebank for Slovenian, consisting of approximately 13,000 sentences and 267,097 tokens from fiction, non-fiction, periodical and Wikipedia texts in standard modern Slovenian. As of UD release 2.10 in May 2022, the original version of the SSJ UD treebank has been partially manually revised and extended with new manually annotated data.

Contributors: Kaja Dobrovoljc, Tomaž Erjavec, Simon Krek
Repository master dev
README
Treebank hub page
Download

SST 98K ⓁⒻ

The Spoken Slovenian Treebank (SST) is a manually annotated collection of transcribed audio recordings featuring spontaneous speech in various everyday situations. It includes 344 unique speech events (documents) amounting to approximately 10 hours of speech, encompassing a total of 6,121 utterances and 98,393 tokens.

Contributors: Kaja Dobrovoljc, Joakim Nivre
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Slovenian treebanks.

Language documentation

See the language documentation page.

South Levantine Arabic 1 <1K Afro-Asiatic, Semitic

South Levantine Arabic treebanks

MADAR <1K Ⓛ

The South_Levantine_Arabic-MADAR treebank consists of 100 manually-annotated sentences taken from the [MADAR](https://camel.abudhabi.nyu.edu/madar/) (Multi-Arabic Dialect Applications and Resources) project. TO-DO: Add 20 annotated sentences from CCC as a train set.

Contributors: Shorouq Zahra
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Southern Kurdish 1 1K IE, Iranian

Southern Kurdish treebanks

Garrusi 1K ⓁⒻ

A dependency treebank in Universal Dependencies (UD) format, derived from narrative and questionnaire texts. This document gives an overview of the annotation scheme, linguistic features, and structural patterns seen in the data.

Contributors: Hiwa Asadpour, Luigi Talamo, Helena Vaz, Annemarie Verkerk
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Spanish 4 1,022K IE, Romance

Spanish treebanks

AnCora 560K ⓁⒻⒺ

Spanish data from the [AnCora](http://clic.ub.edu/corpus/) corpus.

Contributors: Héctor Martínez Alonso, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

PUD 23K ⓁⒻ Ⓟ

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Hector Fernandez Alcalde, Laura Moreno Romero, Martin Popel, Daniel Zeman, Héctor Martínez Alonso
Repository master dev
README
Treebank hub page
Download

GSD 431K ⓁⒻ

The Spanish UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

Contributors: Miguel Ballesteros, Héctor Martínez Alonso, Ryan McDonald, Elena Pascual, Natalia Silveira, Daniel Zeman, Joakim Nivre, John Bauer
Repository master dev
README
Treebank hub page
Download

COSER 8K ⓁⒻ

The COSER UD Treebank (COSER-UD) is the first syntactically annotated corpus of spoken Spanish, based on a sample of the "Corpus Oral y Sonoro del Español Rural" (COSER; Fernández-Ordóñez 2005-present), meaning the "Audible Corpus of Spoken Rural Spanish".

Contributors: Johnatan Bonilla
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Spanish treebanks.

Language documentation

See the language documentation page.

Spanish Sign Language 1 1K Sign Language

Spanish Sign Language treebanks

LSE 1K

The Universal Dependency treebank for Spanish Sign Language (Lengua de Signos Española [LSE], ISO 639-3: ssp) was developed by the GRADES group at the University of Vigo.

Contributors: José María García-Miguel, Carmen Cabeza
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Swedish 5 229K IE, Germanic

Swedish treebanks

LinES 102K ⓁⒻ Ⓟ

UD Swedish_LinES is the Swedish half of the LinES Parallel Treebank with UD annotations. All segments are translations from English and the sources cover literary genres, online manuals and Europarl data.

Contributors: Lars Ahrenberg
Repository master dev
README
Treebank hub page
Download

Talbanken 96K ⓁⒻⒺ

The Swedish-Talbanken treebank is based on Talbanken, a treebank developed at Lund University in the 1970s.

Contributors: Joakim Nivre, Aaron Smith, Victor Norrman
Repository master dev
README
Treebank hub page
Download

SweLL 10K ⓁⒻ

A treebank of learner Swedish based on SweLL, the Swedish Learner Language corpus.

Contributors: Arianna Masciolini, Aleksandrs Berdicevskis, Maria Irena Szawerna, Caroline Grand-Clement
Repository master dev
README
Treebank hub page
Download

PUD 19K ⓁⒻⒺ Ⓟ

Swedish-PUD is the Swedish part of the Parallel Universal Dependencies (PUD) treebanks.

Contributors: Joakim Nivre, Bernadeta Griciūtė, Victor Norrman
Repository master dev
README
Treebank hub page
Download

Old <1K ⓁⒻ

UD Swedish-Old is a treebank containing texts from Old Swedish (1225-1526).

Contributors: Lars Ahrenberg, Lars Borin, Astrid Berntsson Ingelstam, Joakim Nivre, Eva Pettersson, Sara Stymne
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Swedish treebanks.

Language documentation

See the language documentation page.

Swedish Sign Language 1 1K Sign Language

Swedish Sign Language treebanks

SSLC 1K

The Universal Dependencies treebank for Swedish Sign Language (ISO 639-3: swl) is derived from the Swedish Sign Language Corpus (SSLC) from the department of linguistics, Stockholm University.

Contributors: Moa Gärdenfors, Carl Börstell, Robert Östling, Lars Wallin, Mats Wirén
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Tagalog 2 1K Austronesian, Greater Central Philippine

Tagalog treebanks

TRG <1K ⓁⒻ

UD_Tagalog-TRG is a UD treebank manually annotated using sentences from a grammar book.

Contributors: Stephanie Samson, Daniel Zeman, Mary Ann C. Tan
Repository master dev
README
Treebank hub page
Download

Ugnayan 1K Ⓛ

Ugnayan is a manually annotated Tagalog treebank currently composed of educational fiction and nonfiction text. The treebank is under development at the University of the Philippines.

Contributors: Angelina Aquino
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Tagalog treebanks.

Language documentation

See the language documentation page.

Tamil 2 12K Dravidian

Tamil treebanks

TTB 9K ⓁⒻⒺ

The UD Tamil treebank is based on the Tamil Dependency Treebank created at the Charles University in Prague by Loganathan Ramasamy.

Contributors: Loganathan Ramasamy, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

MWTT 2K ⓁⒻ

MWTT (Modern Written Tamil Treebank) has sentences taken primarily from the book "A Grammar of Modern Tamil" by Thomas Lehmann (1993). This initial release has 536 sentences of various lengths, all of which are included in the test set.

Contributors: Sarveswaran Kengatharaiyer, Parameswari Krishnamurthy, Keerthana Balasubramani
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Tamil treebanks.

Language documentation

See the language documentation page.

Tatar 1 2K Turkic, Northwestern

Tatar treebanks

NMCTT 2K ⓁⒻ

UD Tatar-NMCTT is a manually annotated corpus of the Tatar language based on the text from Tatar-Inform (tatar-inform.tatar), an online news website.

Contributors: Chihiro Taguchi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Teko 1 2K Tupian, Maweti-Guarani

Teko treebanks

TuDeT 2K ⓁⒻ

UD_Teko-TuDeT is a collection of annotated sentences in [Tekó (Emérillon)](https://glottolog.org/resource/languoid/id/emer1243). The sentences stem from the only grammatical description of the language (Rose, 2011). Sentence annotation and documentation by Uliana Vedenina and Fabrício Ferraz Gerardi.

Contributors: Uliana Vedenina, Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Telugu 1 6K Dravidian

Telugu treebanks

MTG 6K

The Telugu UD treebank was created from manual annotations of sentences from a grammar book.

Contributors: Taraka Rama, Sowmya Vajjala
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Telugu English 1 <1K Code switching

Telugu English treebanks

TECT <1K

UD Telugu_English-TECT is a Telugu-English code-switching treebank.

Contributors: Anishka Vissamsetty
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Thai 2 99K Tai-Kadai

Thai treebanks

TUD 77K

*UD Thai TUD* (Thai Universal Dependency Treebank) is a treebank of 3,627 syntactic trees from the Thai National Corpus and Wikipedia, annotated in Universal Dependencies, covering diverse text types and topics across various domains.

Contributors: Panyut Sriwirote, Wei Qi Leong, Charin Polpanumas, Santhawat Thanyawong, William Chandra Tjhi, Wirote Aroonmanakun, Attapol T. Rutherford, Ratanon Jiamsundutsadee, Punyanuch Maitreenukul
Repository master dev
README
Treebank hub page
Download

PUD 22K ⓁⒻ Ⓟ

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Rattima Nitisaroj, Yanin Sawanakunanon, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Thai treebanks.

Language documentation

See the language documentation page.

Tswana 1 <1K Niger-Congo, Bantoid

Tswana treebanks

Popapolelo <1K Ⓕ Ⓟ

UD Tswana-Popapolelo is a translation of the 20 Cairo CICLing sentences (https://github.com/UniversalDependencies/cairo), annotated with XPOS, UPOS and dependency relations.

Contributors: Ansu Berg, Roald Eiselen, Tanja Gaustad, Rigardt Pretorius
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Tupinamba 1 4K Tupian, Maweti-Guarani

Tupinamba treebanks

TuDeT 4K ⓁⒻ

UD_Tupinamba-TuDeT is a collection of annotated sentences in [Tupinambá](https://glottolog.org/resource/languoid/id/tupi1273). All known sources in this language are being annotated: catechisms, letters, poems, theater plays, and grammars (sixteenth and seventeenth century). Sentence annotation and documentation by [Fabrício Ferraz Gerardi](https://languagestructure.github.io).

Contributors: Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Turkish 10 735K Turkic, Southwestern

Turkish treebanks

BOUN 125K ⓁⒻ

A Turkish dependency treebank annotated in UD style, created by members of [TABILAB](https://tabilab.cmpe.boun.edu.tr/) at Boğaziçi University.

Contributors: Büşra Marşan, Furkan Akkurt, Utku Türk, Furkan Atmaca, Şaziye Betül Özateş, Gözde Berk, Seyyit Talha Bedir, Abdullatif Köksal, Balkız Öztürk Başaran, Tunga Güngör, Arzucan Özgür
Repository master dev
README
Treebank hub page
Download

Kenet 178K ⓁⒻ

Turkish-Kenet UD Treebank is the biggest treebank of Turkish. It consists of 18,700 manually annotated sentences and 178,700 tokens. Its corpus consists of dictionary examples.

Contributors: Aslı Kuzgun, Neslihan Cesur, Olcay Taner Yıldız, Oğuzhan Kuyrukçu, Arife Betül Yenice, Bilge Nas Arıcan, Ezgi Sanıyar
Repository master dev
README
Treebank hub page
Download
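
Sentence and token figures like those quoted in these descriptions can be checked against a downloaded treebank file. The following is a minimal sketch, assuming standard CoNLL-U input: `#` comment lines, a blank line after each sentence, and tab-separated columns with the ID first. The function name `count_conllu` is hypothetical and is not part of any UD tool. Multiword-token ranges (such as `1-2`) and empty nodes (such as `1.1`) are skipped, so only syntactic words are counted; whether that matches the counting used for a given description is an assumption.

```python
def count_conllu(text):
    """Return (sentences, words) for a string of CoNLL-U data.

    Only lines whose ID column is a plain integer are counted as words;
    multiword-token ranges ("1-2") and empty nodes ("1.1") are skipped.
    """
    sentences = 0
    words = 0
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment such as "# sent_id = ..."
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        token_id = line.split("\t", 1)[0]
        if token_id.isdigit():
            words += 1
    return sentences, words


# Tiny hand-made sample (not taken from any treebank).
sample = (
    "# sent_id = 1\n"
    "1-2\tdel\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tde\tde\tADP\t_\t_\t3\tcase\t_\t_\n"
    "2\tel\tel\tDET\t_\t_\t3\tdet\t_\t_\n"
    "3\tlibro\tlibro\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1\tHola\thola\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
)
print(count_conllu(sample))  # (2, 4)
```

To count a real file, read it with `open(path, encoding="utf-8").read()` and pass the result to the function.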

Penn 183K ⓁⒻ

Turkish version of the Penn Treebank. It consists of a total of 9,560 manually annotated sentences and 87,367 tokens. (It only includes sentences up to 15 words long.)

Contributors: Neslihan Cesur, Aslı Kuzgun, Olcay Taner Yıldız, Büşra Marşan, Neslihan Kara, Bilge Nas Arıcan, Merve Özçelik, Deniz Baran Aslan
Repository master dev
README
Treebank hub page
Download

Tourism 91K ⓁⒻ

Turkish Tourism is a domain-specific treebank consisting of 19,750 manually annotated sentences and 92,200 tokens. These sentences were taken from the original customer reviews of a tourism company.

Contributors: Aslı Kuzgun, Neslihan Cesur, Olcay Taner Yıldız, Oğuzhan Kuyrukçu, Büşra Marşan, Bilge Nas Arıcan, Neslihan Kara, Deniz Baran Aslan, Ezgi Sanıyar, Cengiz Asmazoğlu
Repository master dev
README
Treebank hub page
Download

Atis 44K ⓁⒻ Ⓟ

This treebank is a translation of the English ATIS (Airline Travel Information System) corpus (see References). It consists of 5,432 sentences.

Contributors: Mehmet Köse, Olcay Taner Yıldız
Repository master dev
README
Treebank hub page
Download

IMST 58K ⓁⒻ

The UD Turkish Treebank, also called the IMST-UD Treebank, is a semi-automatic conversion of the IMST Treebank (Sulubacak & Eryiğit, 2018; Sulubacak et al., 2016).

Contributors: Utku Türk, Şaziye Betül Özateş, Büşra Marşan, Furkan Akkurt, Çağrı Çöltekin, Gülşen Cebiroğlu Eryiğit, Memduh Gökırmak, Hüner Kaşıkara, Umut Sulubacak, Francis Tyers
Repository master dev
README
Treebank hub page
Download

FrameNet 19K ⓁⒻ

Turkish FrameNet consists of 2,700 manually annotated example sentences and 19,221 tokens. Its data consists of the sentences taken from the Turkish FrameNet Project. The annotated sentences can be filtered according to the semantic frame category of the root of the sentence.

Contributors: Neslihan Cesur, Aslı Kuzgun, Olcay Taner Yıldız, Büşra Marşan, Oğuzhan Kuyrukçu, Bilge Nas Arıcan, Ezgi Sanıyar, Neslihan Kara, Merve Özçelik
Repository master dev
README
Treebank hub page
Download

TueCL <1K ⓁⒻ Ⓟ

The Turkish-TueCL treebank is part of a parallel Universal Dependencies corpus containing 148 sentences across four Turkic languages (Turkish, Azerbaijani, Kyrgyz, and Uzbek), designed to facilitate cross-linguistic research on these related languages.

Contributors: Furkan Akkurt, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

GB 17K ⓁⒻ

This is a treebank annotating example sentences from a comprehensive grammar book of Turkish.

Contributors: Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

PUD 16K ⓁⒻ Ⓟ

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Savas Cetin, Martin Popel, Daniel Zeman, Francis Tyers, Çağrı Çöltekin, Utku Türk, Furkan Atmaca, Şaziye Betül Özateş, Abdullatif Köksal, Balkız Öztürk Başaran, Tunga Güngör, Arzucan Özgür
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Turkish treebanks.

Language documentation

See the language documentation page.

Turkish English 1 <1K Code switching

Turkish English treebanks

BUTR <1K ⓁⒻ

UD_Turkish_English-BUTR is a treebank of Turkish-English code-switched sentences collected from Boğaziçi University students, annotated in the Universal Dependencies framework to provide a standardized resource for analyzing syntactic patterns in Turkish-English code-switching.

Contributors: Furkan Akkurt, Nursena Teker, Helin Binici, Ahmet Demir, Konstantinos Sampanis
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Turkish German 1 37K Code switching

Turkish German treebanks

SAGT 37K ⓁⒻ

UD Turkish-German SAGT is a Turkish-German code-switching treebank that is developed as part of the [SAGT](https://www.ims.uni-stuttgart.de/en/research/projects/sagt/) project.

Contributors: Özlem Çetinoğlu, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Ukrainian 2 231K IE, Slavic

Ukrainian treebanks

ParlaMint 109K ⓁⒻ

UD_Ukrainian-ParlaMint is a collection of Ukrainian parliamentary transcripts annotated in Universal Dependencies. The texts are published on the official website of the Ukrainian parliament (https://www.rada.gov.ua/documents/Stenbul_pz/) and are taken for UD_Ukrainian-ParlaMint from the Ukrainian section of the ParlaMint project (https://www.clarin.eu/parlamint).

Contributors: Maria Shvedova, Arsenii Lukashevskyi
Repository master dev
README
Treebank hub page
Download

IU 122K ⓁⒻⒺ

Gold-standard Universal Dependencies corpus for Ukrainian, originally developed for UD by the [Institute for Ukrainian](https://mova.institute), an NGO. [[in Ukrainian](https://mova.institute/золотий_стандарт)]

Contributors: Natalia Kotsyba, Bohdan Moskalevskyi, Mykhailo Romanenko
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Ukrainian treebanks.

Language documentation

See the language documentation page.

Umbrian 1 1K IE, Italic

Umbrian treebanks

IKUVINA 1K Ⓕ

UD_Umbrian-IKUVINA is a dependency treebank rendering of the Iguvine tablets ([Wikipedia](https://en.wikipedia.org/wiki/Iguvine_Tablets)). The seven bronze tablets describe religious ceremonies performed by the Umbrian people in Italy before the rise of the Roman empire. The corpus will eventually contain all the tablets, but as of May 2022 only Tablet I has been released, with partial morphological analysis and partial lemmatisation (POS tagging and dependency trees are complete).

Contributors: Mathieu Dehouck
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Upper Sorbian 1 11K IE, Slavic

Upper Sorbian treebanks

UFAL 11K ⓁⒻ

A small treebank of Upper Sorbian based mostly on Wikipedia, partly also on other Sorbian websites.

Contributors: Daniel Zeman, Anna Nedoluzhko
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Urdu 1 138K IE, Indic

Urdu treebanks

UDTB 138K ⓁⒻ

The Urdu Universal Dependency Treebank was automatically converted from the Urdu Dependency Treebank (UDTB), which is part of an ongoing effort to create multi-layered treebanks for Hindi and Urdu.

Contributors: Riyaz Ahmad Bhat, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Uyghur 1 40K Turkic, Southeastern

Uyghur treebanks

UDT 40K ⓁⒻ

The Uyghur UD treebank is based on the Uyghur Dependency Treebank (UDT), created at the Xinjiang University in Ürümqi, China.

Contributors: Marhaba Eli, Daniel Zeman, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Uzbek 3 14K Turkic, Southeastern

Uzbek treebanks

UzUDT 7K ⓁⒻ

UD_Uzbek-UzUDT is a manually annotated Universal Dependencies treebank for the Uzbek language.

Contributors: Sanatbek Matlatipov, Elmurod Kuriyozov
Repository master dev
README
Treebank hub page
Download

UT 5K ⓁⒻ

This is the first Uzbek UD treebank.

Contributors: Arofat Akhundjanova
Repository master dev
README
Treebank hub page
Download

TueCL <1K ⓁⒻ Ⓟ

The Uzbek-TueCL treebank is part of a parallel Universal Dependencies corpus containing 148 sentences across four Turkic languages: Turkish, Azerbaijani, Kyrgyz, and Uzbek.

Contributors: Arofat Akhundjanova, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Uzbek treebanks.

Language documentation

See the language documentation page.

Veps 1 1K Uralic, Finnic

Veps treebanks

VWT 1K ⓁⒻ

UD Veps-VWT is a manually annotated corpus of Veps made using the Universal Dependencies annotation scheme. The data is collected from [VepKar corpora](http://dictorpus.krc.karelia.ru/en/corpus/text) and consists mostly of modern news texts written in the Central Veps dialect.

Contributors: Käbi Laan
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Vietnamese 2 59K Austro-Asiatic, Viet-Muong

Vietnamese treebanks

VTB 58K Ⓛ

The Vietnamese UD treebank is a conversion of the constituent treebank created in the VLSP project (https://vlsp.hpda.vn/).

Contributors: Lương Nguyễn Thị, Linh Hà Mỹ, Phương Lê Hồng, Huyền Nguyễn Thị Minh
Repository master dev
README
Treebank hub page
Download

TueCL 1K ⓁⒻ

This treebank includes a set of sentences from [OPUS](https://opus.nlpl.eu/), sourced from subtitles, talks, and educational videos.

Contributors: Hoa Do, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Vietnamese treebanks.

Language documentation

See the language documentation page.

Warlpiri 1 <1K Pama-Nyungan, Western

Warlpiri treebanks

UFAL <1K ⓁⒻ

A small treebank of grammatical examples in Warlpiri, taken from linguistic literature.

Contributors: Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Welsh 1 54K IE, Celtic

Welsh treebanks

CCG 54K ⓁⒻ

UD Welsh-CCG (Corpws Cystrawennol y Gymraeg) is a treebank of Welsh, annotated according to the Universal Dependencies guidelines.

Contributors: Johannes Heinecke, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Western Armenian 1 122K IE, Armenian

Western Armenian treebanks

ArmTDP 122K ⓁⒻ

A Universal Dependencies treebank for Western Armenian was developed for UD originally by the ArmTDP team led by Marat M. Yavrumyan at the Yerevan State University.

Contributors: Marat M. Yavrumyan
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Western S.P. Nahuatl 1 19K Uto-Aztecan

Western Sierra Puebla Nahuatl treebanks

MesoTree 19K ⓁⒻⒺ

UD Western Sierra Puebla Nahuatl-MesoTree is a combination of the existing UD Western Sierra Puebla Nahuatl-IU treebank (ITML) (with some annotation updates reflecting corrected errors or changes in annotation decisions) and new sentences annotated as part of the NSF-funded project, "Syntactically-annotated corpora for endangered languages in areal contact" (MesoTree).

Contributors: Robert Pugh, Marivel Huerta Mendez, Mitsuya Sasaki, Francis Tyers, María Ximena Juarez Huerta, Ángeles Márquez Hernández
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Wolof 1 44K Niger-Congo, Northern Atlantic

Wolof treebanks

WTB 44K ⓁⒻ

UD_Wolof-WTB is a treebank for Wolof, manually developed natively in UD. Sentences were collected from encyclopedic, fictional, biographical, and religious texts and from news.

Contributors: Bamba Dione
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Xavante 1 2K Macro-Je

Xavante treebanks

XDT 2K ⓁⒻ

UD_Xavante-XDT is a collection of annotated sentences in [Xavante](https://glottolog.org/resource/languoid/id/xava1240). Sentence annotation and documentation by [Fabrício Ferraz Gerardi](http://languagestructure.github.io/) and Ivan Roksandic.

Contributors: Ivan Roksandic, Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Xibe 1 15K Tungusic

Xibe treebanks

XDT 15K ⓁⒻ

The UD Xibe Treebank is a corpus of the Xibe language (ISO 639-3: *sjo*) containing syntactic trees manually annotated under the Universal Dependencies framework. Sentences come from three sources: grammar book examples, a newspaper (Cabcal News), and Xibe textbooks.

Contributors: He Zhou, Juyeon Chung, Elena Klyachko, Francis Tyers, Sandra Kübler
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Yakut 1 1K Turkic, Northeastern

Yakut treebanks

YKTDT 1K ⓁⒻ

UD_Yakut-YKTDT is a collection of [Yakut (Sakha)](https://glottolog.org/resource/languoid/id/yaku1245) sentences. The project is a work in progress, and the treebank is updated on a regular basis.

Contributors: Tatiana Merzhevich, Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Yiddish 1 28K IE, Germanic

Yiddish treebanks

YiTB 28K Ⓛ

YiTB is a treebank of linguistically annotated Yiddish data in the Universal Dependencies framework, created via a bootstrapping machine learning method. The treebank currently contains a total of 27,872 tokens from a variety of sources and textual genres.

Contributors: Kirk Andrews
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Yoruba 1 8K Niger-Congo, Defoid

Yoruba treebanks

YTB 8K ⓁⒻ

Parts of the Yoruba Bible and of the Yoruba edition of Wikipedia, hand-annotated natively in Universal Dependencies.

Contributors: Adédayọ̀ Olúòkun, Daniel Zeman, Seyi Williams, Ọlájídé Ishola
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Yupik 1 2K Eskimo-Aleut

Yupik treebanks

SLI 2K ⓁⒻ

UD_Yupik-SLI is a treebank of St. Lawrence Island Yupik (ISO 639-3: ess) that has been manually annotated at the morpheme level, based on a finite-state morphological analyzer by [Chen et al., 2020](https://www.aclweb.org/anthology/2020.lrec-1.326). The word-level annotation, merging multiword expressions, is provided in not-to-release/ess_sli-ud-test.merged.conllu. More information about the treebank can be found in our publication (AmericasNLP, 2021).

Contributors: Hyunji Hayley Park, Lane Schwartz, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Zaar 1 20K Afro-Asiatic, West Chadic

Zaar treebanks

Autogramm 20K ⓁⒻ

A Universal Dependencies corpus for Zaar (aka Sayanci), a member of the Chadic branch of the Afro-Asiatic phylum. The language is mainly spoken by about 200,000 speakers in the Bogoro and Tafawa Balewa local governments of Bauchi State, Nigeria.

Contributors: Sylvain Kahane, Bruno Guillaume, Bernard Caron, Katharine Jiang
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Zazaki 1 1K IE, Iranian

Zazaki treebanks

ZSD 1K ⓁⒻ

A dependency treebank in Universal Dependencies (UD) format for Zazakî (Kirmanckî), based on transcribed spoken interviews recorded in situ in Dêrsim (Turk. Tunceli) for the forthcoming documentary "KUTENE – Last Dots of Dêrsim". It provides manually verified POS tags, morphological features, and dependency relations.

Contributors: Mahîr Dogan, Luigi Talamo, Helena Vaz, Annemarie Verkerk
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Disclaimer: Our use of flags to symbolise languages is only intended as a visual enhancement of the website and should not be interpreted as a political statement in any way.

Possible Future Extensions

People have expressed interest in providing annotated data for the following languages but no valid data has been provided so far.

Akkadian 1 117K Afro-Asiatic, Semitic

Akkadian treebanks

MCONG 117K ⓁⒻ

UD_Akkadian-MCONG is a treebank of normalized Akkadian sentences drawn mostly from Neo-Assyrian corpora lemmatized on [Oracc](http://oracc.museum.upenn.edu/). Sentences are annotated for lemma, syntactic dependencies, and morphological features. The treebank contains approximately 112,000 words.

Contributors: Matthew Ong
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Akkadian treebanks.

Language documentation

See the language documentation page.

Amharic 3 <1K Afro-Asiatic, Semitic

Amharic treebanks

ADT <1K Ⓛ

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Dawit J. Tilahun
Repository master dev
README
Treebank hub page
Download

Inku -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Josiah Solomon
Repository master dev
README
Treebank hub page
Download

SAMTA -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/contributing/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Abnet Shimeles, Michael Gasser, Nazareth Amlesom
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Apalai 1 <1K Cariban

Apalai treebanks

ADT <1K ⓁⒻ

This Apalaí treebank is a collection of sentences from different sources, including the contributors' own fieldwork material.

Contributors: Fernando O. de Carvalho, Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Balatipone 1 - Bororoan

Balatipone treebanks

BDT -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Balochi 1 - IE, Iranian

Balochi treebanks

Kessah -

UD_Balochi-GPS is a treebank of the Balochi language variety spoken in southeastern Iran annotated according to the Universal Dependencies framework.

Contributors: Hiwa Asadpour, Annemarie Verkerk, Muhammad Afzal
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Bengali 3 2K IE, Indic

Bengali treebanks

CMUPAN -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Aditi Chaudhary
Repository master dev
README
Treebank hub page
Download

PUD - Ⓟ

This is a part of the Parallel Universal Dependencies (PUD) treebanks originally created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).

Contributors: Pritha Majumdar, Deepak Alok, Atul Kr. Ojha
Repository master dev
README
Treebank hub page
Download

Sabdakosh 2K ⓁⒻ

UD_Bengali-Sabdakosh is a corpus of parsed sentences from contemporary Bengali prose, consisting of passages of up to 50 sentences from modern fiction.

Contributors: Andrew Thomas Dyer, Riffat Sharmin, Sadia Afrin
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Central Romani 1 - IE, Indic

Central Romani treebanks

Selice -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Lucie Zemanová, Viktor Elšík, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Chechen 1 - Nakh-Daghestanian, Nakh

Chechen treebanks

MottDT -

... 1-2 sentences (see [release checklist](https://universaldependencies.org/contributing/repository_files.html#the-readme-file) for README guidelines) ...

Contributors: Anna Veselovsky, Zarina Molochieva, Alena Witzlack-Makarevich
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Classical Nahuatl 1 - Uto-Aztecan

Classical Nahuatl treebanks

FloCo -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Robert Pugh, Marivel Huerta Mendez, Mitsuya Sasaki, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Corsican 1 - IE, Romance

Corsican treebanks

DIVITAL -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/contributing/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Alice Millour, Daniele Mortato
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Cuicatec 1 - Oto-Manguean

Cuicatec treebanks

MesoTree -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Cusco Quechua 1 - Quechuan

Cusco Quechua treebanks

Squoia -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Annette Rios, Francis Tyers, Trey Jagiella, Josephine Douglas
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Czech 1 - IE, Slavic

Czech treebanks

CzeSL -

Morphosyntactic annotation of 1600 sentences from the Czesl-MAN corpus (Czech as a second language).

Contributors: Jiří Hana, Barbora Hladká, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Czech treebanks.

Language documentation

See the language documentation page.

Dargwa 1 - Nakh-Daghestanian, Lak-Dargwa

Dargwa treebanks

Mehweb -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Sasha Kozhukhar, Olga Lyashevskaya
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Enawene Nawe 1 - Arawakan, Central Arawakan

Enawene Nawe treebanks

ENDT -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/contributing/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Fabrício Ferraz Gerardi, Wayali Iholalare Kaholase Saloma, Walitere Enawene, Xohikwa Lonarese Enawene
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

French 1 - IE, Romance

French treebanks

ParolesDeNormands -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/contributing/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Natalia Romanova, Rayan Ziane, Khensa Daoudi
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of French treebanks.

Language documentation

See the language documentation page.

French Sign Language 1 - Sign Language

French Sign Language treebanks

STKAutogramm -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/contributing/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Sylvain Kahane, Santiago Herrera, Philomène Périn, Caroline Bogliotti, Lou Guillaume
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Frisian 1 51K IE, Germanic

Frisian treebanks

Frysk 51K ⓁⒻ

The UD Frisian-FA-RuG treebank is a West Frisian treebank.

Contributors: Wilbert Heeringa, Gosse Bouma, Hans Van de Velde
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Gedeo 1 <1K Afro-Asiatic, Cushitic

Gedeo treebanks

GDT <1K ⓁⒻ

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Dawit J. Tilahun
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Georgian 1 1K Kartvelian

Georgian treebanks

GEOWIKI 1K ⓁⒻ

**UD\_Georgian-GEOWIKI** is a Universal Dependencies corpus for the Georgian language, derived from randomly selected texts from the Georgian Wikipedia. The corpus currently contains 385 sentences, which were automatically tokenized and morphologically tagged using a TreeTagger model trained on a separate Georgian dataset. You can find the TreeTagger resource [here](https://github.com/SophikoComp/TreeTagger-for-Georgian). The output was semi-automatically converted into the UD format through custom Python scripts, followed by extensive manual correction to ensure conformity with UD guidelines and improve annotation quality. Morphological annotations are complete and validated; syntactic annotations are currently in progress. This corpus contributes to the growing set of low-resource language resources and provides a foundation for future syntactic, morphological, and cross-linguistic research in Georgian NLP.

Contributors: Mate Didebashvili
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Georgian treebanks.

Language documentation

See the language documentation page.

Gilaki 1 - IE, Iranian

Gilaki treebanks

GileDaar -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/contributing/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Luigi Talamo
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Greek 2 - IE, Greek

Greek treebanks

Cypriot -

... 1-2 sentences (see [release checklist](https://universaldependencies.org/contributing/repository_files.html#the-readme-file) for README guidelines) ...

Contributors: Stella Markantonatou, Spyros Armostis, Stavros Bompolas, Vivian Stamou, Maria Apostolidou, Eirini Chalkia, Konstantinos Raftis, Christina-Athanasia Petropoulou, Dionisis Piskopos
Repository master dev
README
Treebank hub page
Download

Griko -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Stella Markantonatou, Emanuela Pinna, Stavros Bompolas, Maria Lekakou, Josep Quer, Vivian Stamou
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Greek treebanks.

Language documentation

See the language documentation page.

Hiligaynon 1 <1K Austronesian, Greater Central Philippine

Hiligaynon treebanks

HTB <1K Ⓛ

UD Hiligaynon-HTB is a UD treebank containing manually annotated sentences from the [PALI Language Texts](https://www.hawaiiopen.org/bookseries/pali-language-texts-philippines/) grammar books, made available by University of Hawaii Press.

Contributors: Mary Ann C. Tan
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Hindi 1 4K IE, Indic

Hindi treebanks

Convers 4K

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Riyaz Ahmad Bhat
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Hindi treebanks.

Language documentation

See the language documentation page.

Huave 1 - Huavean

Huave treebanks

MesoTree -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Italian 1 - IE, Romance

Italian treebanks

TrIttok -

... 1-2 sentences (see [release checklist](https://universaldependencies.org/contributing/repository_files.html#the-readme-file) for README guidelines) ...

Contributors: Luisa Troncone
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Italian treebanks.

Language documentation

See the language documentation page.

Japanese 2 - Japanese

Japanese treebanks

JDD -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Reina Akama, Mai Omura, Masayuki Asahara
Repository master dev
README
Treebank hub page
Download

JDDLUW -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Reina Akama, Mai Omura, Masayuki Asahara
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Japanese treebanks.

Language documentation

See the language documentation page.

Kabyle 1 23K Afro-Asiatic, Berber

Kabyle treebanks

ADPT 23K ⓁⒻ

UD_Kabyle-ADPT (Association pour le Développement et la Promotion de Tamazight) is a treebank of Berber (Kabyle variant), annotated according to the Universal Dependencies guidelines.

Contributors: Lakhdar Aliane
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Kiga 1 - Niger-Congo, Bantoid

Kiga treebanks

EKigaTB -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: David Bamutura
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Komi 1 <1K Uralic, Permic

Komi treebanks

OldPermic <1K ⓁⒻ

This is a Universal Dependencies treebank of Old Permic. The treebank is currently in progress and will be published in full in the next Universal Dependencies release (v2.14), which is scheduled for May 15, 2024 (data freeze on May 1).

Contributors: Niko Partanen, Jack Rueter, Rogier Blokland
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Kullvi 1 - IE, Indic

Kullvi treebanks

KDTB -

... 1-2 sentences (see [release checklist](https://universaldependencies.org/contributing/repository_files.html#the-readme-file) for README guidelines) ...

Contributors: Shweta Chauhan
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Ladino 1 - IE, Romance

Ladino treebanks

BOUN -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Utku Türk
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Laz 1 2K Kartvelian

Laz treebanks

BOUN 2K ⓁⒻ

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Utku Türk, Kaan Bayar, Ayşegül Dilara Özercan, Görkem Yiğit Öztürk, Betül Bilgin
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Magahi 2 7K IE, Indic

Magahi treebanks

MGTB 7K ⓁⒻ

The [Magahi](https://en.wikipedia.org/wiki/Magahi_language) UD Treebank (MGTB) is a part of the [Universal Dependency treebank](http://universaldependencies.org/) project.

Contributors: Mohit Raj, Deepak Alok, Ritesh Kumar, Atul Kr. Ojha, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

PUD - Ⓟ

Contributors: Deepak Alok, Atul Kr. Ojha
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Malagasy 1 - Austronesian, Barito

Malagasy treebanks

Hazo -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/contributing/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Luigi Talamo
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Mandyali 1 2K IE, Indic

Mandyali treebanks

MDTB 2K

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Shweta Chauhan
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Mansi 1 - Uralic, Ugric

Mansi treebanks

CoWS -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Csilla Horváth
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Megrelian 1 - Kartvelian

Megrelian treebanks

MLC -

The Megrelian UD Treebank (UD_Megrelian-MLC) is the first syntactically annotated corpus of Megrelian, based on a collection of annotated sentences selected from the Megrelian Language Corpus (MLC) available at http://xmf.iliauni.edu.ge/.

Contributors: Irina Lobzhanidze
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Middle Irish 2 <1K IE, Celtic

Middle Irish treebanks

CritMITB <1K Ⓛ

Annotation of the classic Scela Mucce Meic Dathó ("The tale of Mac Dathó's pig").

Contributors: Ben Rozonoyer, Erik Andersen
Repository master dev
README
Treebank hub page
Download

DipMITB -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Adrian Ó Dubhghaill
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Middle Persian 1 - IE, Iranian

Middle Persian treebanks

MPCD -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/contributing/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Thomas Jügel, Kianoosh Rezania
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Mongolian 1 - Mongolic

Mongolian treebanks

MTLR -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Siqin Bai
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Nkore 1 - Niger-Congo, Bantoid

Nkore treebanks

ENkoreTB -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: David Bamutura
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Northern Kurdish 1 2K IE, Iranian

Northern Kurdish treebanks

Bezeyni 2K ⓁⒻ

The Bezeyni treebank is a small collection of annotated sentences in Bezeynî, a Kurdish language variety spoken in Turkey. The corpus contains 177 sentences with morphological and syntactic annotations following Universal Dependencies guidelines, providing basic coverage of nominal, verbal, and clause structures.

Contributors: Hiwa Asadpour
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Northern Luri 1 - IE, Iranian

Northern Luri treebanks

Khorramabadi -

UD_NorthernLuri-KHS is a treebank of the Khorramabadi variety of Northern Luri, annotated according to the Universal Dependencies framework.

Contributors: Hiwa Asadpour, Annemarie Verkerk, Zahra Zargarani, Sahar Shahjalaledin
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Occitan 1 - IE, Romance

Occitan treebanks

DIVITAL -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Myriam Bras
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Old English 2 26K IE, Germanic

Old English treebanks

OEDT 26K ⓁⒻ

This is a 25,000-word UD treebank of Old English. The text has been retrieved from Martín Arista, Javier (ed.), et al. 2023. ParCorOEv3 [www.nerthusproject.com]. The treebank is a revised version of the dataset of Domínguez Barragán, S. 2024. Universal Dependencies of Old English. PhD Dissertation, University of La Rioja.

Contributors: Javier Martín Arista, Dario Metola
Repository master dev
README
Treebank hub page
Download

TueCL -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Fanyi Meng, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Old Georgian 3 - Kartvelian

Old Georgian treebanks

GNC -

... 1-2 sentences (see [release checklist](https://universaldependencies.org/contributing/repository_files.html#the-readme-file) for README guidelines) ...

Contributors: Paul Meurer
Repository master dev
README
Treebank hub page
Download

MidGLauRo -

... 1-2 sentences (see [release checklist](https://universaldependencies.org/contributing/repository_files.html#the-readme-file) for README guidelines) ...

Contributors: Chia-Wei Lin, Diego Luinetti
Repository master dev
README
Treebank hub page
Download

OGLauRo - Ⓟ

... 1-2 sentences (see [release checklist](https://universaldependencies.org/contributing/repository_files.html#the-readme-file) for README guidelines) ...

Contributors: Chia-Wei Lin, Diego Luinetti
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Old Japanese 1 3K Japanese

Old Japanese treebanks

LMJ 3K ⓁⒻⒺ

UD_Old_Japanese-LMJ is a collection of annotated texts in Late Middle Japanese, starting with Book 9 from the celebrated gunki monogatari (war tale) *Heike Monogatari*.

Contributors: Stanislav Reichert, Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Old Occitan 1 - IE, Romance

Old Occitan treebanks

OOT -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/contributing/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Mariagrazia Staffieri
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Old Saxon 1 - IE, Germanic

Old Saxon treebanks

ConOS -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/contributing/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Christian Chiarcos, Janine Siewert
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Palenquero 1 - Creole

Palenquero treebanks

COL -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Daniel Casas
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Pali 1 - IE, Indic

Pali treebanks

PaliCanon -

... 1-2 sentences (see [release checklist](https://universaldependencies.org/contributing/repository_files.html#the-readme-file) for README guidelines) ...

Contributors: Chia-Wei Lin, Jasvinder Singh, Khemarato Bhikkhu
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Papiamento 2 - Creole

Papiamento treebanks

AW -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Urso Wieske
Repository master dev
README
Treebank hub page
Download

CW -

If you can read this sentence, then we are still working on our first release.

Contributors: Urso Wieske
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Peripheral Mongolian 1 - Mongolic

Peripheral Mongolian treebanks

Ordos -

... 1-2 sentences (see [release checklist](https://universaldependencies.org/contributing/repository_files.html#the-readme-file) for README guidelines) ...

Contributors: Wenchao Li
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Persian 3 - IE, Iranian

Persian treebanks

HPT -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/contributing/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Fatemeh Abbasi, Amirsaeid Moloodi
Repository master dev
README
Treebank hub page
Download

IPerUDT -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Roya Kabiri
Repository master dev
README
Treebank hub page
Download

PUD - Ⓟ

This is a part of the Parallel Universal Dependencies (PUD) treebanks (the original set of languages was annotated for the CoNLL 2017 shared task; Persian was added later).

Contributors: Ali Basirat
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Persian treebanks.

Language documentation

See the language documentation page.

Pnar 1 - Austro-Asiatic, Khasian

Pnar treebanks

PTB -

UD Pnar-PTB is a conversion from the Ring (2017) dataset ([doi:10.21979/N9/KVFGBZ](http://dx.doi.org/10.21979/N9/KVFGBZ)) that underpins a grammatical description of the Pnar language (Ring 2015, [http://hdl.handle.net/10356/62519](http://hdl.handle.net/10356/62519)). The corpus consists of folktales and interviews transcribed, translated, and interlinearized.

Contributors: Hiram Ring
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Pontic 1 - IE, Greek

Pontic treebanks

BOUN -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Utku Türk
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Portuguese 2 101K IE, Romance

Portuguese treebanks

DHBB -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Alexandre Rademaker
Repository master dev
README
Treebank hub page
Download

PortJur 101K ⓁⒻ

The legal portion of [Porttinari](https://sites.google.com/icmc.usp.br/poetisa/resources-and-tools), which includes public law texts produced by the judiciary (mainly summaries) and the legislature (laws), including widely known Brazilian laws such as the Henry Borel law, the Internet Civil Rights law, the Maria da Penha law, the Copyright law, the Agrarian Reform law, the Elderly Persons statute, and the Child and Adolescent statute.

Contributors: Lucelene Lopes, Maria das Graças Volpe Nunes, Magali Sanches Duran, Thiago Alexandre Salgueiro Pardo
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Portuguese treebanks.

Language documentation

See the language documentation page.

Prakrit 1 <1K IE, Indic

Prakrit treebanks

DIPI <1K ⓁⒻ

**UD Prakrit-DIPI** (*Digitising Imperial Prakrit Inscriptions*) is a UD-annotated corpus of the Ashokan Prakrit inscriptions and edicts (parallel texts written in various dialects) representing an early stage of Middle Indo-Aryan. This corpus aims to facilitate comparative work on the Ashokan dialects with the help of new computational methods.

Contributors: Aryaman Arora, Adam Farris
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Punjabi 1 6K IE, Indic

Punjabi treebanks

PunTB 6K ⒻⒺ

**PunTB** (a very imaginative acronym for **Pun**jabi **T**ree**b**ank) is an in-progress treebank of Punjabi in the Gurmukhi script, aiming to cover a wide range of genres and formats.

Contributors: Aryaman Arora
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Punjabi treebanks.

Language documentation

See the language documentation page.

Puno Quechua 1 - Quechuan

Puno Quechua treebanks

UIBK -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Elwin Huaman
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Sardinian 2 - IE, Romance

Sardinian treebanks

ContSar -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/contributing/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Luigi Talamo, Nicoletta Puddu
Repository master dev
README
Treebank hub page
Download

EModSar -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/contributing/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Luigi Talamo, Nicoletta Puddu
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Serbian 1 45K IE, Slavic

Serbian treebanks

ParCoLab 45K ⓁⒻ

ParCoLab is a treebank of Serbian based on literary texts. It was originally developed between 2014 and 2018 (original corpus available here).

Contributors: Aleksandra Miletić
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Seri 1 - Hokan, Seri

Seri treebanks

MesoTree -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Shipibo Konibo 1 - Pano-Tacanan

Shipibo Konibo treebanks

PUCP -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Ronald Ahmed Cárdenas Acosta
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Sindhi 1 6K IE, Indic

Sindhi treebanks

MazharDootio 6K ⓁⒻ

The Sindhi Universal Dependency Treebank was automatically converted from the Sindhi Dependency Treebank (SDTB), which is part of an ongoing effort to create multi-layered treebanks for Sindhi.

Contributors: Mazhar Dootio
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Spanish 1 <1K IE, Romance

Spanish treebanks

SVarT <1K

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Johnatan Bonilla
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Spanish treebanks.

Language documentation

See the language documentation page.

Spanish English 1 - Code switching

Spanish English treebanks

Miami -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Robert Pugh, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Swahili 1 - Niger-Congo, Bantoid

Swahili treebanks

OPUSGV -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Kenneth Steimel
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Swedish 1 - IE, Germanic

Swedish treebanks

Eukalyptus -

The Swedish-Eukalyptus treebank has been converted from Eukalyptus, a phrase-structure treebank of contemporary written Swedish. As of now, the conversion has not yet been finished, and no manual corrections have been done.

Contributors: Aleksandrs Berdicevskis, Gerlof Bouma
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Swedish treebanks.

Language documentation

See the language documentation page.

Tagabawa 1 <1K Austronesian, Greater Central Philippine

Tagabawa treebanks

GJA <1K Ⓛ Ⓟ

UD_Tagabawa_GJA is a collection of annotated Bagobo-Tagabawa sentences taken from different sources. It is currently under development.

Contributors: Glyd Aranes
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Tagalog 1 340K Austronesian, Greater Central Philippine

Tagalog treebanks

NewsCrawl 340K ⓁⒻ

The Tagalog Universal Dependencies NewsCrawl dataset consists of annotated text extracted from the Leipzig Tagalog Corpus. Data included in the Leipzig Tagalog Corpus were crawled from Tagalog-language online news sites.

Contributors: Elsie Marie Or, Angelina Aquino, Lester James Miranda
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Tagalog treebanks.

Language documentation

See the language documentation page.

Tetun 1 - Austronesian, Malayo-Polynesian

Tetun treebanks

TUDT -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Gabriel de Jesus
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Thai 1 - Tai-Kadai

Thai treebanks

Autogramm -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Siriluck Rattananiyomkul, Sylvain Kahane, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Thai treebanks.

Language documentation

See the language documentation page.

Tigrinya 1 - Afro-Asiatic, Semitic

Tigrinya treebanks

TiTB -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/contributing/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Nazareth Amlesom, Michael Gasser
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Tunisian Arabic 1 - Afro-Asiatic, Semitic

Tunisian Arabic treebanks

NAxLAT -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Rayan Ziane
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Turkish 1 142K Turkic, Southwestern

Turkish treebanks

ULU 142K ⓁⒻ

The UD_Turkish-ULU Treebank is an automatic conversion of the ULU Treebank.

Contributors: Metin Bilgin
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Turkish treebanks.

Language documentation

See the language documentation page.

Turkmen 1 47K Turkic, Southwestern

Turkmen treebanks

TUD 47K ⓁⒻ

UD_Turkmen-TUD is a silver-standard Universal Dependencies treebank for Turkmen, created by translating Turkish UD treebank data into Turkmen and applying silver annotation through cross-lingual transfer.

Contributors: Sherzod Hakimov
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Tuwari 1 - Sepik

Tuwari treebanks

Autogramm -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Sylvain Loiseau
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Uspanteko 1 13K Mayan

Uspanteko treebanks

MesoTree 13K Ⓛ

The MesoTree Uspanteko UD treebank consists of a token-balanced set of trees drawn from dictionary example sentences and spoken narratives.

Contributors: Juan Ajsivinac, Robert Henderson, Tomás Méndez López, Cheyenne Wing
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Uyghur 1 - Turkic, Southeastern

Uyghur treebanks

LDS -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Faruk Mardan
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Disclaimer: Our use of flags to symbolise languages is only intended as a visual enhancement of the website and should not be interpreted as a political statement in any way.

Retired Treebanks

The following treebanks have been part of one or more UD releases in the past but they are no longer maintained and they have been excluded from the most recent release.

English 1 97K IE, Germanic

English treebanks

ESL 97K ✘Ⓛ

UD English-ESL / Treebank of Learner English (TLE) contains manual POS tag and dependency annotations for 5,124 English as a Second Language (ESL) sentences drawn from the Cambridge Learner Corpus First Certificate in English (FCE) dataset.

Contributors: Yevgeni Berzak, Jessica Kenney, Carolyn Spadine, Jing Xian Wang, Lucia Lam, Keiko Sophie Mori, Sebastian Garza, Boris Katz, Margarita Misirpashayeva
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of English treebanks.

Language documentation

See the language documentation page.

French 1 573K IE, Romance

French treebanks

FTB 573K ✘ⓁⒻ

The Universal Dependency version of the French Treebank (Abeillé et al., 2003), hereafter UD_French-FTB, is a treebank of sentences from the newspaper Le Monde, initially manually annotated with morphological information and phrase-structure and then converted to the Universal Dependencies annotation scheme.

Contributors: Marie Candito, Bruno Guillaume, Teresa Lynn, Héctor Martínez Alonso, Benoît Sagot, Djamé Seddah, Eric Villemonte de la Clergerie
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of French treebanks.

Language documentation

See the language documentation page.

Hindi English 1 26K Code switching

Hindi English treebanks

HIENCS 26K ✘Ⓛ

The Hindi-English Code-switching treebank is based on code-switching tweets of Hindi and English multilingual speakers (mostly Indian) on Twitter. The treebank is manually annotated using the UD scheme. The training and evaluation sets were separately annotated by different annotators using UD v2 and v1 guidelines, respectively. The evaluation sets were automatically converted from UD v1 to v2.

Contributors: Riyaz Ahmad Bhat, Irshad Ahmad Bhat
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created.

Japanese 2 204K Japanese

Japanese treebanks

Modern 14K Ⓛ

This Universal Dependencies (UD) Japanese treebank follows the UD Japanese conventions described in the UD documentation. The original sentences are from the "Corpus of Historical Japanese" (CHJ).

Contributors: Mai Omura, Masayuki Asahara, Yuta Takahashi
Repository master dev
README
Treebank hub page
Download

KTC 189K ✘Ⓛ

Please add a summary section to the treebank readme file

Contributors: Masayuki Asahara, Hiroshi Kanayama, Yuji Matsumoto, Yusuke Miyao, Shunsuke Mori, Takaaki Tanaka, Sumire Uematsu
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Japanese treebanks.

Language documentation

See the language documentation page.

Khunsari 1 <1K IE, Iranian

Khunsari treebanks

AHA <1K ⓁⒻ

The AHA Khunsari Treebank is a small treebank for contemporary Khunsari. Its corpus was collected and annotated manually, based on interviews with Khunsari speakers.

Contributors: AmirHossein Mojiri Foroushani, Hamid Aghaei, Amir Ahmadi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Mbya Guarani 1 11K Tupian, Maweti-Guarani

Mbya Guarani treebanks

Dooley 11K ✘ⓁⒻ

UD Mbya_Guarani-Dooley is a corpus of narratives written in Mbyá Guaraní (Tupian) in Brazil, and collected by Robert Dooley. Due to copyright restrictions, the corpus that is distributed as part of UD only contains the annotation (tags, features, relations) while the FORM and LEMMA columns are empty.

Contributors: Guillaume Thomas
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Mbya Guarani treebanks.

Language documentation

See the language documentation page.

Nayini 1 <1K IE, Iranian

Nayini treebanks

AHA <1K ⓁⒻ

The AHA Nayini Treebank is a small treebank for contemporary Nayini. Its corpus was collected and annotated manually, based on interviews with Nayini speakers.

Contributors: AmirHossein Mojiri Foroushani, Hamid Aghaei, Amir Ahmadi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Norwegian 1 55K IE, Germanic

Norwegian treebanks

NynorskLIA 55K ⓁⒻ

This Norwegian treebank is based on the LIA treebank of transcribed spoken Norwegian dialects. The treebank has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.

Contributors: Lilja Øvrelid, Andre Kaasen
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Norwegian treebanks.

Language documentation

See the language documentation page.

Soi 1 <1K IE, Iranian

Soi treebanks

AHA <1K ⓁⒻ

The AHA Soi Treebank is a small treebank for contemporary Soi. Its corpus was collected and annotated manually, based on interviews with Soi speakers.

Contributors: AmirHossein Mojiri Foroushani, Hamid Aghaei, Amir Ahmadi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Diversity of the Latest Release

Languages per Family (Genus)

Words per Family (Genus)

See also separate pages with a list of UD languages and a list of scripts used in UD treebanks.

Current UD Languages

Abaza treebanks

Language documentation

Abkhaz treebanks

Language documentation

Afrikaans treebanks

Language documentation

Akkadian treebanks

Language documentation

Akuntsu treebanks

Language documentation

Albanian treebanks

Language documentation

Alemannic treebanks

Language documentation

Amharic treebanks

Language documentation

Ancient Greek treebanks

Language documentation

Ancient Hebrew treebanks

Language documentation

Apurina treebanks

Language documentation

Arabic treebanks

Language documentation

Armenian treebanks

Language documentation

Assamese treebanks

Language documentation

Assyrian treebanks

Language documentation

Azerbaijani treebanks

Language documentation

Bambara treebanks

Language documentation

Basque treebanks

Language documentation

Bavarian treebanks

Language documentation

Beja treebanks

Language documentation

Belarusian treebanks

Language documentation

Bengali treebanks

Language documentation

Bhojpuri treebanks

Language documentation

Bokota treebanks

Language documentation

Bororo treebanks

Language documentation

Brahui treebanks

Language documentation

Breton treebanks

Language documentation

Bulgarian treebanks

Language documentation

Buryat treebanks

Language documentation

Cantonese treebanks

Language documentation

Cappadocian treebanks

Language documentation

Catalan treebanks

Language documentation

Cebuano treebanks

Language documentation

Central Kurdish treebanks

Language documentation

Chinese treebanks

Language documentation

Chintang treebanks

Language documentation

Chukchi treebanks