
This page pertains to UD version 2.

Universal Dependencies

Universal Dependencies (UD) is a framework for cross-linguistically consistent grammatical annotation and an open community effort with over 200 contributors producing more than 100 treebanks in over 60 languages.

If you want to receive news about Universal Dependencies, you can subscribe to the UD mailing list.

Current UD Languages

Information about language families (and genera for families with multiple branches) is mostly taken from WALS Online (IE = Indo-European).
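Each language entry below opens with a one-line summary: the language name, the number of treebanks, an approximate size, and the family (with genus where given). As a minimal sketch, assuming the size figure is in thousands of tokens (as the Basque entry suggests) and that `<1K` means fewer than a thousand, such a line could be split into fields as follows. The names `LanguageSummary` and `parse_summary` are hypothetical and not part of any UD tooling.

```python
import re
from typing import NamedTuple


class LanguageSummary(NamedTuple):
    name: str
    treebanks: int
    tokens: int        # approximate; an upper bound when the page shows "<1K"
    family: str


# Layout assumed from the listing: name, treebank count,
# size such as "49K", "1,042K" or "<1K", then family (and genus).
_SUMMARY = re.compile(
    r"^(?P<name>.+?) (?P<count>\d+) (?P<size><?[\d,]+K) (?P<family>.+)$"
)


def parse_summary(line: str) -> LanguageSummary:
    """Split one language summary line into its four fields."""
    match = _SUMMARY.match(line.strip())
    if match is None:
        raise ValueError(f"not a language summary line: {line!r}")
    # Drop the "<" marker, the "K" suffix and thousands separators.
    digits = match.group("size").lstrip("<").rstrip("K").replace(",", "")
    return LanguageSummary(
        name=match.group("name"),
        treebanks=int(match.group("count")),
        tokens=int(digits) * 1000,
        family=match.group("family"),
    )
```

For example, `parse_summary("Arabic 3 1,042K Afro-Asiatic, Semitic")` yields the name `Arabic`, three treebanks, roughly 1,042,000 tokens, and the family string `Afro-Asiatic, Semitic`. Multi-word names such as Old Church Slavonic are handled because the name is matched up to the first standalone number.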

Afrikaans 1 49K IE, Germanic

Afrikaans treebanks

Original 49K
UD Afrikaans is a conversion of the AfriBooms Dependency Treebank, originally annotated with a simplified PoS set and dependency relations according to a subset of the Stanford tag set. The corpus consists of public government documents.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Ancient Greek 2 414K IE, Greek

Ancient Greek treebanks

PROIEL 211K
UD_Ancient_Greek-PROIEL is converted from the Ancient Greek data in the PROIEL treebank, and consists of the New Testament plus selections from Herodotus.

 

Original 202K
This Universal Dependencies Ancient Greek Treebank consists of an automatic conversion of a selection of passages from the Ancient Greek and Latin Dependency Treebank 2.1.

 

See here for comparative statistics of Ancient Greek treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Arabic 3 1,042K Afro-Asiatic, Semitic

Arabic treebanks

NYUAD 738K
The NYUAD Arabic UD treebank is based on the Penn Arabic Treebank (PATB), parts 1, 2, and 3, through conversion to CATiB dependency trees.

 

Original 282K
The Arabic UD treebank is based on the [Prague Arabic Dependency Treebank](http://ufal.mff.cuni.cz/padt/) (PADT), created at Charles University in Prague.

 

PUD 20K
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Luma Ateyah, Martin Popel, Daniel Zeman, Nizar Habash, Dima Taji
  • Repository master dev
  • README
  • Treebank hub page

 

See here for comparative statistics of Arabic treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Bambara 1 <1K Mande

Bambara treebanks

Original <1K
The UD Bambara treebank is a section of the Corpus Référence du Bambara annotated natively with Universal Dependencies.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Basque 1 121K Basque

Basque treebanks

Original 121K
The Basque UD treebank is based on an automatic conversion from part of the Basque Dependency Treebank (BDT), created at the University of the Basque Country by the IXA NLP research group. The treebank consists of 8,993 sentences (121,443 tokens) and covers mainly literary and journalistic texts.
  • Contributors: Maria Jesus Aranzabe, Aitziber Atutxa, Kepa Bengoetxea, Arantza Diaz de Ilarraza, Iakes Goenaga, Koldo Gojenola, Larraitz Uria
  • Repository master dev
  • README
  • Treebank hub page

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Belarusian 1 8K IE, Slavic

Belarusian treebanks

Original 8K
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Bulgarian 1 156K IE, Slavic

Bulgarian treebanks

Original 156K
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Buryat 1 10K Mongolic

Buryat treebanks

Original 10K
The UD Buryat treebank was manually annotated natively in UD and contains grammar book sentences, along with news and some fiction.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Cantonese 1 <1K Sino-Tibetan

Cantonese treebanks

Original <1K
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Catalan 1 531K IE, Romance

Catalan treebanks

Original 531K
Catalan data from the AnCora corpus.

 

Language documentation

See the language documentation page.
Chinese 4 153K Sino-Tibetan

Chinese treebanks

Original 123K
Traditional Chinese Universal Dependencies Treebank annotated and converted by Google.

 

PUD 21K
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Josie Li, Cheuk Ying Li, Martin Popel, Daniel Zeman, Herman Leung
  • Repository master dev
  • README
  • Treebank hub page

 

CFL 7K
The Chinese-CFL UD treebank is manually annotated by Keying Li with minor manual revisions by Herman Leung and John Lee at City University of Hong Kong, based on essays written by learners of Mandarin Chinese as a foreign language. The data is in Simplified Chinese.

 

HK 1K
Please add a summary section to the treebank readme file

 

See here for comparative statistics of Chinese treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Coptic 1 11K Afro-Asiatic, Egyptian

Coptic treebanks

Original 11K
UD Coptic contains manually annotated Sahidic Coptic texts, currently from the Gospel of Mark, Shenoute of Atripe's "Not Because a Fox Barks", the Letters of Besa, and several short stories from the Apophthegmata Patrum.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Croatian 1 197K IE, Slavic

Croatian treebanks

Original 197K
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Czech 5 2,222K IE, Slavic

Czech treebanks

Original 1,506K
The Czech UD treebank is based on the Prague Dependency Treebank 3.0 (PDT), created at Charles University in Prague.

 

CAC 494K
The UD_Czech-CAC treebank is based on the Czech Academic Corpus 2.0 (CAC; Český akademický korpus; ČAK), created at Charles University in Prague.

 

FicTree 167K
FicTree is a treebank of Czech fiction, automatically converted into the UD format. The treebank was built at Charles University in Prague.

 

CLTT 35K
The UD_Czech-CLTT treebank is based on the Czech Legal Text Treebank 1.0, created at Charles University in Prague.

 

PUD 18K
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Václava Kettnerová, Jan Hajič jr., Silvie Cinková, Zdeňka Urešová, Milan Straka, Jan Hajič, Jaroslava Hlaváčová, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page

 

See here for comparative statistics of Czech treebanks.

Language documentation

See the language documentation page.
Danish 1 100K IE, Germanic

Danish treebanks

Original 100K
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Dutch 2 310K IE, Germanic

Dutch treebanks

Original 208K
This corpus consists of samples from various treebanks annotated at the University of Groningen using the Alpino annotation tools and guidelines.

 

LassySmall 101K
This corpus contains sentences from the Wikipedia section of the Lassy Small Treebank. Universal Dependency annotation was generated automatically from the original annotation in Lassy.

 

See here for comparative statistics of Dutch treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
English 5 496K IE, Germanic

English treebanks

Original 254K
A Gold Standard Universal Dependencies Corpus for English, built over the source material of the English Web Treebank LDC2012T13 (https://catalog.ldc.upenn.edu/LDC2012T13).
  • Contributors: Natalia Silveira, Timothy Dozat, Christopher Manning, Sebastian Schuster, John Bauer, Miriam Connor, Marie-Catherine de Marneffe, Sam Bowman, Hanzhi Zhu, Daniel Galbraith
  • Repository master dev
  • README
  • Treebank hub page

 

ESL 88K
UD English-ESL / Treebank of Learner English (TLE) contains manual POS tag and dependency annotations for 5,124 English as a Second Language (ESL) sentences drawn from the Cambridge Learner Corpus First Certificate in English (FCE) dataset.
  • Contributors: Yevgeni Berzak, Jessica Kenney, Carolyn Spadine, Jing Xian Wang, Lucia Lam, Keiko Sophie Mori, Sebastian Garza, Boris Katz
  • Repository master dev
  • README
  • Treebank hub page

 

LinES 82K
UD English_LinES is the English half of the LinES Parallel Treebank with the original dependency annotation first automatically converted into Universal Dependencies and then partially reviewed. Its contents cover literature, an online manual and Europarl data.

 

ParTUT 49K
UD_English-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

 

PUD 21K
This is the English portion of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies (http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Jesse Kirchner, Lorenzo Lambertino, Martin Popel, Daniel Zeman, Christopher Manning, Sebastian Schuster, Siva Reddy
  • Repository master dev
  • README
  • Treebank hub page

 

See here for comparative statistics of English treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Erzya 1 <1K Uralic, Mordvin

Erzya treebanks

Original <1K
UD Erzya is the original annotation (CoNLL-U) for texts in the Erzya language. It originally consists of a sample from a number of fiction authors writing originals in Erzya.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Estonian 1 106K Uralic, Finnic

Estonian treebanks

Original 106K
UD Estonian is a conversion of a subpart of the Estonian Dependency Treebank (EDT), originally annotated in the Constraint Grammar (CG) annotation scheme, and consisting of genres of fiction, newspaper texts and scientific texts.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Finnish 3 377K Uralic, Finnic

Finnish treebanks

Original 202K
UD_Finnish is based on the Turku Dependency Treebank (TDT), a broad-coverage dependency treebank of general Finnish covering numerous genres. The conversion to UD was followed by extensive manual checks and corrections, and the treebank closely adheres to the UD guidelines.
  • Contributors: Filip Ginter, Jenna Kanerva, Veronika Laippala, Niko Miekka, Anna Missilä, Stina Ojala, Sampo Pyysalo
  • Repository master dev
  • README
  • Treebank hub page

 

FTB 159K
Please add a summary section to the treebank readme file

 

PUD 15K
Please add a summary section to the treebank readme file

 

See here for comparative statistics of Finnish treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
French 6 1,099K IE, Romance

French treebanks

FTB 573K
The Universal Dependency version of the French Treebank (Abeillé et al., 2003), hereafter UD_French-FTB, is a treebank of sentences from the newspaper Le Monde, initially manually annotated with morphological information and phrase-structure and then converted to the Universal Dependencies annotation scheme.
  • Contributors: Marie Candito, Bruno Guillaume, Teresa Lynn, Héctor Martínez Alonso, Benoît Sagot, Djamé Seddah, Eric Villemonte de la Clergerie
  • Repository master dev
  • README
  • Treebank hub page

 

Original 402K
The French UD was converted in 2015 from the content head version of the universal dependency treebank v2.0 (https://github.com/ryanmcd/uni-dep-tb). It has been updated since 2015 independently of the previous source.
  • Contributors: Marie-Catherine de Marneffe, Bruno Guillaume, Ryan McDonald, Alane Suhr, Joakim Nivre, Matias Grioni
  • Repository master dev
  • README
  • Treebank hub page

 

Sequoia 70K
UD_French-Sequoia is an automatic conversion of the [French Sequoia corpus](http://deep-sequoia.inria.fr).

 

ParTUT 28K
UD_French-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

 

PUD 24K
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Jana Strnadová, Gauthier Caron, Martin Popel, Daniel Zeman, Marie-Catherine de Marneffe
  • Repository master dev
  • README
  • Treebank hub page

 

Spoken -
Please add a summary section to the treebank readme file

 

See here for comparative statistics of French treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Galician 2 164K IE, Romance

Galician treebanks

Original 138K
Please add a summary section to the treebank readme file

 

TreeGal 25K
The Galician-TreeGal is a treebank for Galician developed at LyS Group (Universidade da Coruña).

 

See here for comparative statistics of Galician treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
German 2 313K IE, Germanic

German treebanks

Original 292K
The German UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

 

PUD 21K
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Georg Rehm, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Sebastian Bank, Martin Popel, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page

 

See here for comparative statistics of German treebanks.

Language documentation

See the language documentation page.
Gothic 1 55K IE, Germanic

Gothic treebanks

Original 55K
The UD Gothic treebank is based on the Gothic data from the PROIEL treebank, and consists of Wulfila's Bible translation.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Greek 1 63K IE, Greek

Greek treebanks

Original 63K
The Greek UD treebank is derived from the Greek Dependency Treebank (http://gdt.ilsp.gr), a resource developed and maintained by researchers at the Institute for Language and Speech Processing/Athena R.C. (http://www.ilsp.gr).

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Hebrew 1 161K Afro-Asiatic, Semitic

Hebrew treebanks

Original 161K
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Hindi 2 375K IE, Indic

Hindi treebanks

Original 351K
Please add a summary section to the treebank readme file

 

PUD 23K
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Esha Banerjee, Pinkey Nainwani, Martin Popel, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page

 

See here for comparative statistics of Hindi treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Hungarian 1 42K Uralic, Ugric

Hungarian treebanks

Original 42K
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Indonesian 2 147K Austronesian

Indonesian treebanks

Original 121K
The Indonesian UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

 

PUD 25K
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Ruli Manurung, Muh Shohibussirri, Martin Popel, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page

 

See here for comparative statistics of Indonesian treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Irish 1 23K IE, Celtic

Irish treebanks

Original 23K
A Universal Dependencies 1020-sentence treebank for modern Irish.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Italian 4 436K IE, Romance

Italian treebanks

Original 293K
The Italian corpus annotated according to the UD annotation scheme was obtained by conversion from ISDT (Italian Stanford Dependency Treebank), released for the dependency parsing shared task of Evalita-2014 (Bosco et al. 2014).

 

PoSTWITA 64K
PoSTWITA-UD is a collection of Italian tweets annotated in Universal Dependencies that can be exploited for the training of NLP systems to enhance their performance on social media texts.

 

ParTUT 55K
UD_Italian-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles.

 

PUD 23K
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Antonio Stella, Davide Rovati, Martin Popel, Daniel Zeman, Maria Simi, Manuela Sanguinetti
  • Repository master dev
  • README
  • Treebank hub page

 

See here for comparative statistics of Italian treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Japanese 3 402K Japanese

Japanese treebanks

KTC 189K
Please add a summary section to the treebank readme file
  • Contributors: Masayuki Asahara, Hiroshi Kanayama, Yuji Matsumoto, Yusuke Miyao, Shinsuke Mori, Takaaki Tanaka, Sumire Uematsu
  • Repository master dev
  • README
  • Treebank hub page

 

Original 186K
This Universal Dependencies (UD) Japanese treebank follows the UD Japanese conventions described in the UD documentation. The original sentences are from Google UDT 2.0.
  • Contributors: Hiroshi Kanayama, Masayuki Asahara, Yusuke Miyao, Takaaki Tanaka, Ryan McDonald, Joakim Nivre, Daniel Zeman, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu
  • Repository master dev
  • README
  • Treebank hub page

 

PUD 26K
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Atsuko Shimada, Anna Trukhina, Martin Popel, Daniel Zeman, Hiroshi Kanayama
  • Repository master dev
  • README
  • Treebank hub page

 

See here for comparative statistics of Japanese treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Kazakh 1 11K Turkic, Northwestern

Kazakh treebanks

Original 11K
The UD Kazakh treebank is a combination of text from various sources including Wikipedia, some folk tales, sentences from the UDHR, news and phrasebook sentences. Sentence IDs include partial document identifiers.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Korean 3 97K Korean

Korean treebanks

Original 74K
The Korean UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

 

PUD 22K
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Sookyoung Kwak, Yongseok Cho, Martin Popel, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page

 

Sejong -
Please add a summary section to the treebank readme file

 

See here for comparative statistics of Korean treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Kurmanji 1 10K IE, Iranian

Kurmanji treebanks

Original 10K
The UD Kurmanji corpus is a corpus of Kurmanji Kurdish. It contains fiction and encyclopaedic texts in roughly equal measure. It has been annotated natively in accordance with the UD annotation scheme.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Latin 3 491K IE, Latin

Latin treebanks

ITTB 291K
Please add a summary section to the treebank readme file

 

PROIEL 171K
The Latin PROIEL treebank is based on the Latin data from the PROIEL treebank, and contains most of the Vulgate New Testament translations plus selections from Caesar's Gallic War and Cicero's Letters to Atticus.

 

Original 29K
This Universal Dependencies Latin Treebank consists of an automatic conversion of a selection of passages from the Ancient Greek and Latin Dependency Treebank 2.1.

 

See here for comparative statistics of Latin treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Latvian 1 90K IE, Baltic

Latvian treebanks

Original 90K
The Latvian UD Treebank is based on the Latvian Treebank (http://sintakse.korpuss.lv), which is being created at the University of Latvia, Institute of Mathematics and Computer Science, Artificial Intelligence Laboratory (http://ailab.lv).

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Lithuanian 2 5K IE, Baltic

Lithuanian treebanks

Original 5K
Lithuanian treebank annotated manually (dependencies) using the Morphological Annotator by CCL, Vytautas Magnus University (http://tekstynas.vdu.lt/) and manual disambiguation. A pilot version which includes news and an essay by Tomas Venclova is available here.

 

Alksnis -
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Maltese 1 2K Afro-Asiatic, Semitic

Maltese treebanks

Original 2K
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Marathi 1 3K IE, Indic

Marathi treebanks

Original 3K
UD Marathi is a manually annotated treebank consisting primarily of stories from Wikisource, and parts of an article on Wikipedia.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
North Sami 1 26K Uralic, Sami

North Sami treebanks

Original 26K
This is a North Sámi treebank based on a manually disambiguated and function-labelled gold-standard corpus of North Sámi produced by the Giellatekno team at UiT Norgga árktalaš universitehta.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Norwegian 3 625K IE, Germanic

Norwegian treebanks

Bokmaal 310K
The Norwegian UD treebank is based on the Bokmål section of the Norwegian Dependency Treebank (NDT), which is a syntactic treebank of Norwegian. NDT has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.

 

Nynorsk 301K
The Norwegian UD treebank is based on the Nynorsk section of the Norwegian Dependency Treebank (NDT), which is a syntactic treebank of Norwegian. NDT has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.

 

NynorskLIA 13K
This Norwegian treebank is based on the LIA treebank of transcribed spoken Norwegian dialects. The treebank has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.

 

See here for comparative statistics of Norwegian treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Old Church Slavonic 1 57K IE, Slavic

Old Church Slavonic treebanks

Original 57K
The Old Church Slavonic (OCS) UD treebank is based on the Old Church Slavonic data from the PROIEL treebank and contains the text of the Codex Marianus New Testament translation.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Persian 1 152K IE, Iranian

Persian treebanks

Original 152K
The Persian Universal Dependency Treebank (Persian UD) is based on the Uppsala Persian Dependency Treebank (UPDT). The conversion of the UPDT to Universal Dependencies was performed semi-automatically with extensive manual checks and corrections.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Polish 1 83K IE, Slavic

Polish treebanks

Original 83K
The UD Polish treebank is based on “Składnica zależnościowa” (the Polish dependency treebank) version 0.5.

 

Language documentation

See the language documentation page.
Portuguese 3 570K IE, Romance

Portuguese treebanks

BR 319K
Please add a summary section to the treebank readme file

 

Original 227K
This Universal Dependencies (UD) Portuguese treebank is based on the Constraint Grammar converted version of the Bosque, which is part of the Floresta Sintá(c)tica treebank.
  • Contributors: Alexandre Rademaker, Eckhard Bick, Fabricio Chalub, Cláudia Freitas, Livy Real, Valeria de Paiva, Daniel Zeman, Martin Popel, David Mareček, Natalia Silveira, André Martins
  • Repository master dev
  • README
  • Treebank hub page

 

PUD 23K
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Gustavo Mendonça, Larissa Rinaldi, Martin Popel, Daniel Zeman, Valeria de Paiva
  • Repository master dev
  • README
  • Treebank hub page

 

See here for comparative statistics of Portuguese treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Romanian 2 239K IE, Romance

Romanian treebanks

Original 218K
The Romanian UD treebank (called RoRefTrees) (Barbu Mititelu et al., 2016) is the reference treebank in UD format for standard Romanian.
  • Contributors: Verginica Barbu Mititelu, Elena Irimia, Cenel-Augusto Perez, Radu Ion, Radu Simionescu, Martin Popel
  • Repository master dev
  • README
  • Treebank hub page

 

Nonstandard 20K
The Romanian Non-standard UD treebank (called UAIC-RoDia) is based on the UAIC-RoDia Treebank.
  • Contributors: Cătălina Mărănduc, Cenel-Augusto Perez, Victoria Bobicev, Cătălin Mititelu, Florinel Hociung
  • Repository master dev
  • README
  • Treebank hub page

 

See here for comparative statistics of Romanian treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Russian 3 1,226K IE, Slavic

Russian treebanks

SynTagRus 1,107K
Russian data from the SynTagRus corpus.

 

Original 99K
Please add a summary section to the treebank readme file

 

PUD 19K
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Tatiana Lando, Olga Loginova, Martin Popel, Daniel Zeman, Kira Droganova
  • Repository master dev
  • README
  • Treebank hub page

 

See here for comparative statistics of Russian treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Sanskrit 1 1K IE, Indic

Sanskrit treebanks

Original 1K
A small Sanskrit treebank of sentences from Pañcatantra, an ancient Indian collection of interrelated fables by Vishnu Sharma.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Serbian 1 86K IE, Slavic

Serbian treebanks

Original 86K
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Slovak 1 106K IE, Slavic

Slovak treebanks

Original 106K
The Slovak UD treebank is based on data originally annotated as part of the Slovak National Corpus, following the annotation style of the Prague Dependency Treebank.

 

Language documentation

See the language documentation page.
Slovenian 2 170K IE, Slavic

Slovenian treebanks

Original 140K
The Slovenian UD Treebank is a rule-based conversion of the ssj500k treebank, the largest collection of manually syntactically annotated data in Slovenian, originally annotated in the JOS annotation scheme.

 

SST 29K
The Spoken Slovenian UD Treebank (SST) is the first syntactically annotated corpus of spoken Slovenian, based on a sample of the reference GOS corpus, a collection of transcribed audio recordings of monologic, dialogic and multi-party spontaneous speech in different everyday situations.

 

See here for comparative statistics of Slovenian treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Spanish 3 1,004K IE, Romance

Spanish treebanks

AnCora 549K
Spanish data from the AnCora corpus.

 

Original 431K
The Spanish UD treebank is converted from the content-head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).
  • Contributors: Miguel Ballesteros, Héctor Martínez Alonso, Ryan McDonald, Elena Pascual, Natalia Silveira, Daniel Zeman, Joakim Nivre
  • Repository master dev
  • README
  • Treebank hub page

 

PUD 23K
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Hector Fernandez Alcalde, Laura Moreno Romero, Martin Popel, Daniel Zeman, Héctor Martínez Alonso
  • Repository master dev
  • README
  • Treebank hub page

 

See here for comparative statistics of Spanish treebanks.

Language documentation

See the language documentation page.
Swedish 3 195K IE, Germanic

Swedish treebanks

Original 96K
The Swedish-TP treebank is based on Talbanken, a treebank developed at Lund University in the 1970s.

 

LinES 79K
UD Swedish_LinES is the Swedish half of the LinES Parallel Treebank with UD annotations. All segments are translations from English and the sources cover literary genres, online manuals and Europarl data.

 

PUD 19K
Swedish-PUD is the Swedish part of the Parallel Universal Dependencies (PUD) treebanks.

 

See here for comparative statistics of Swedish treebanks.

Language documentation

See the language documentation page.
Swedish Sign Language 1 1K Sign Language

Swedish Sign Language treebanks

Original 1K
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Tamil 1 9K Dravidian, Southern

Tamil treebanks

Original 9K
The UD Tamil treebank is based on the Tamil Dependency Treebank created at Charles University in Prague by Loganathan Ramasamy.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Telugu 1 6K Dravidian, South Central

Telugu treebanks

Original 6K
The Telugu UD treebank is based on manual annotations of sentences from a grammar book.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Thai 1 23K Tai-Kadai

Thai treebanks

PUD 23K
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Rattima Nitisaroj, Yanin Sawanakunanon, Martin Popel, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Turkish 2 74K Turkic, Southwestern

Turkish treebanks

Original 58K
The UD Turkish Treebank, also called the IMST-UD Treebank, is a semi-automatic conversion of the IMST Treebank (Sulubacak et al., 2016).
  • Contributors: Çağrı Çöltekin, Gülşen Cebiroğlu Eryiğit, Memduh Gökırmak, Hüner Kaşıkara, Umut Sulubacak, Francis Tyers
  • Repository master dev
  • README
  • Treebank hub page

 

PUD 16K
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Savas Cetin, Martin Popel, Daniel Zeman, Francis Tyers, Çağrı Çöltekin
  • Repository master dev
  • README
  • Treebank hub page

 

See here for comparative statistics of Turkish treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Ukrainian 1 100K IE, Slavic

Ukrainian treebanks

Original 100K
Gold-standard Universal Dependencies corpus for Ukrainian, originally developed for UD version 2 by the Institute for Ukrainian, an NGO.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Upper Sorbian 1 10K IE, Slavic

Upper Sorbian treebanks

Original 10K
A small treebank of Upper Sorbian based mostly on Wikipedia.

 

Language documentation

See the language documentation page.
Urdu 1 138K IE, Indic

Urdu treebanks

Original 138K
The Urdu Universal Dependency Treebank was automatically converted from the Urdu Dependency Treebank (UDTB), which is part of an ongoing effort to create multi-layered treebanks for Hindi and Urdu.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Uyghur 1 15K Turkic, Southeastern

Uyghur treebanks

Original 15K
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Vietnamese 1 43K Austro-Asiatic

Vietnamese treebanks

Original 43K
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Upcoming UD Languages

Amharic 1 - Afro-Asiatic, Semitic

Amharic treebanks

Original - ?
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Armenian 1 - IE, Armenian

Armenian treebanks

Original -
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Bengali 2 - IE, Indic

Bengali treebanks

BRU -
Please add a summary section to the treebank readme file

 

DDS - ?
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Dargwa 1 - Nakho-Dagestanian

Dargwa treebanks

Original -
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Faroese 1 - IE, Germanic

Faroese treebanks

Original -
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Kannada 1 - Dravidian, Southern

Kannada treebanks

Original -
... 1-2 sentences (see http://universaldependencies.org/release_checklist.html#the-readme-file for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Kyrgyz 1 - Turkic, Northwestern

Kyrgyz treebanks

Original -
... 1-2 sentences (see http://universaldependencies.org/release_checklist.html#the-readme-file for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Naija 1 - Creole

Naija treebanks

Original -
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Old French 1 - IE, Romance

Old French treebanks

Original -
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Romansh 2 - IE, Romance

Romansh treebanks

Original -
Please add a summary section to the treebank readme file

 

Sursilv -
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Somali 1 - Afro-Asiatic, Cushitic

Somali treebanks

Original -
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.
Sorani 1 - IE, Iranian

Sorani treebanks

Original -
Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Disclaimer: Our use of flags to symbolise languages is only intended as a visual enhancement of the website and should not be interpreted as a political statement in any way.

Download

The data is released through LINDAT/CLARIN.