
This page pertains to UD version 2.

Universal Dependencies

Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 300 contributors producing nearly 200 treebanks in over 100 languages. If you’re new to UD, you should start by reading the first part of the Short Introduction and then browsing the annotation guidelines.
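The three annotation layers named above (parts of speech, morphological features, and syntactic dependencies) can be seen side by side in a single annotated sentence. The sketch below is illustrative only: it assumes the tab-separated, ten-column CoNLL-U file format in which UD treebanks are distributed, and the sample sentence and the helper name `parse_sentence` are made up for this example, not taken from any treebank or official tool.

```python
# Column names of the CoNLL-U format (assumed here; see the UD format documentation).
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]

# A hand-written illustrative sentence, not taken from any treebank.
SAMPLE = "\n".join([
    "# text = Dogs bark.",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "Mood=Ind|Tense=Pres", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
])


def parse_sentence(block):
    """Turn one CoNLL-U sentence block into a list of token dicts (hypothetical helper)."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        token = dict(zip(FIELDS, line.split("\t")))
        # FEATS is a "|"-separated list of Name=Value pairs, or "_" when empty.
        feats = token["feats"]
        token["feats"] = {} if feats == "_" else dict(p.split("=", 1) for p in feats.split("|"))
        tokens.append(token)
    return tokens


tokens = parse_sentence(SAMPLE)
for t in tokens:
    # part of speech, morphological features, then head index and relation
    print(t["form"], t["upos"], t["feats"], t["head"], t["deprel"])
```

Each token carries a universal part-of-speech tag (`upos`), a feature dictionary (`feats`), and a pointer to its syntactic head with a relation label (`head`, `deprel`). The sketch does not treat multiword-token ranges or empty nodes specially, so it is a reading aid rather than a full parser.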

If you want to receive news about Universal Dependencies, you can subscribe to the UD mailing list. If you want to discuss individual annotation questions, use the GitHub issue tracker.

Current UD Languages

Information about language families (and genera for families with multiple branches) is mostly taken from WALS Online (IE = Indo-European).

Abaza treebanks

UD_Abaza-ATB is a treebank based on the [Spoken corpus of Abaza](https://linghub.ru/spoken_abaza/).

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Afrikaans treebanks

UD Afrikaans-AfriBooms is a conversion of the AfriBooms Dependency Treebank, originally annotated with a simplified PoS set and dependency relations according to a subset of the Stanford tag set. The corpus consists of public government documents.

 

Language documentation

See the language documentation page.

Akkadian treebanks

162 royal inscriptions of four early Neo-Assyrian kings.

 

A small set of sentences from Babylonian royal inscriptions.

 

See here for comparative statistics of Akkadian treebanks.

Language documentation

See the language documentation page.

Akuntsu treebanks

UD_Akuntsu-TuDeT is a collection of annotated sentences in [Akuntsú](http://endangeredlanguages.com/lang/1567). The project is a work in progress and the treebank is being updated on a regular basis.

 

Language documentation

See the language documentation page.

Albanian treebanks

The UD Albanian Treebank is a small treebank for Standard Albanian, developed within a project framework at Uppsala University. The data was extracted from Wikipedia.

 

Language documentation

See the language documentation page.

Amharic treebanks

UD_Amharic-ATT is a manually developed treebank for Amharic. Sentences were collected from grammar books, fiction, biographies, religious texts and news.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Ancient Greek treebanks

UD_Ancient_Greek-PROIEL is converted from the Ancient Greek data in the PROIEL treebank, and consists of the New Testament plus selections from Herodotus.

 

This Universal Dependencies Ancient Greek Treebank consists of an automatic conversion of a selection of passages from the Ancient Greek and Latin Dependency Treebank 2.1.

 

See here for comparative statistics of Ancient Greek treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Apurina treebanks

This is an Apurinã treebank consisting of sentences from a grammatical description of the language by Maília Fernanda.

 

Language documentation

See the language documentation page.

Arabic treebanks

The Arabic-PADT UD treebank is based on the [Prague Arabic Dependency Treebank](http://ufal.mff.cuni.cz/padt/) (PADT), created at Charles University in Prague.

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Luma Ateyah, Martin Popel, Daniel Zeman, Nizar Habash, Dima Taji
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

The NYUAD Arabic UD treebank is based on the Penn Arabic Treebank (PATB), parts 1, 2, and 3, through conversion to CATiB dependency trees.

 

See here for comparative statistics of Arabic treebanks.

Language documentation

See the language documentation page.

Armenian treebanks

The Eastern Armenian UD treebank is based on the Eastern Armenian section of the Armenian Dependency Treebank (Հայերենի ծառադարան), developed for UD originally by the ArmTDP team led by Marat M. Yavrumyan at the Yerevan State University.

 

Language documentation

See the language documentation page.

Assyrian treebanks

The Uppsala Assyrian Treebank is a small treebank for Modern Standard Assyrian. The corpus is collected and annotated manually. The data was randomly collected from different textbooks and a short translation of The Merchant of Venice.

 

Language documentation

See the language documentation page.

Bambara treebanks

The UD Bambara treebank is a section of the Corpus Référence du Bambara annotated natively with Universal Dependencies.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Basque treebanks

The Basque UD treebank is based on an automatic conversion from part of the Basque Dependency Treebank (BDT), created at the University of the Basque Country by the IXA NLP research group. The treebank consists of 8,993 sentences (121,443 tokens) and covers mainly literary and journalistic texts.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Beja treebanks

A Universal Dependencies corpus for Beja, a language of the North-Cushitic branch of the Afro-Asiatic phylum, mainly spoken in Sudan, Egypt and Eritrea.

 

Language documentation

See the language documentation page.

Belarusian treebanks

The Belarusian UD treebank is based on a sample of the news texts included in the Belarusian-Russian parallel subcorpus of the Russian National Corpus, online search available at: http://ruscorpora.ru/search-para-be.html.

 

Language documentation

See the language documentation page.

Bhojpuri treebanks

The [Bhojpuri](https://en.wikipedia.org/wiki/Bhojpuri_language) UD Treebank (BHTB) is a part of the [Universal Dependency treebank](http://universaldependencies.org/) project.

 

Language documentation

See the language documentation page.

Breton treebanks

UD Breton-KEB is a treebank of Breton that has been manually annotated according to the Universal Dependencies guidelines. The tokenisation guidelines and morphological annotation come from a finite-state morphological analyser of Breton released as part of the [Apertium project](http://www.apertium.org).

 

Language documentation

See the language documentation page.

Bulgarian treebanks

UD_Bulgarian-BTB is based on the HPSG-based BulTreeBank, created at the Institute of Information and Communication Technologies, Bulgarian Academy of Sciences. The original consists of 215,000 tokens (over 15,000 sentences). All the texts were processed automatically at the tokenization, morphological and chunk levels. Then, the full syntactic analysis was performed manually by trained annotators.

 

Language documentation

See the language documentation page.

Buryat treebanks

The UD Buryat treebank was manually annotated natively in UD and contains grammar book sentences, along with news and some fiction.

 

Language documentation

See the language documentation page.

Cantonese treebanks

A Cantonese treebank (in Traditional Chinese characters) of film subtitles and of legislative proceedings of Hong Kong, parallel with the Chinese-HK treebank.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Catalan treebanks

Catalan data from the [AnCora](http://clic.ub.edu/corpus/) corpus.

 

Language documentation

See the language documentation page.

Chinese treebanks

Simplified Chinese Universal Dependencies dataset converted from the GSD (traditional) dataset with manual corrections.

 

Traditional Chinese Universal Dependencies Treebank annotated and converted by Google.

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Josie Li, Cheuk Ying Li, Martin Popel, Daniel Zeman, Herman Leung
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

A Traditional Chinese treebank of film subtitles and of legislative proceedings of Hong Kong, parallel with the Cantonese-HK treebank.

 

The Chinese-CFL UD treebank is manually annotated by Keying Li with minor manual revisions by Herman Leung and John Lee at City University of Hong Kong, based on essays written by learners of Mandarin Chinese as a foreign language. The data is in Simplified Chinese.

 

See here for comparative statistics of Chinese treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Chukchi treebanks

This data is a manual annotation of the multimedia annotated corpus of the [Chuklang](http://chuklang.ru/) project, a dialectal corpus of the Amguema variant of Chukchi.

 

Language documentation

See the language documentation page.

Classical Chinese treebanks

Classical Chinese Universal Dependencies Treebank annotated and converted by the Institute for Research in Humanities, Kyoto University.
  • Contributors: Koichi Yasuoka, Christian Wittern, Tomohiko Morioka, Takumi Ikeda, Naoki Yamazaki, Yoshihiro Nikaido, Shingo Suzuki, Shigeki Moro, Yuan Li, Hiroyuki Shirasu, Kazunori Fujita
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

Language documentation

See the language documentation page.

Coptic treebanks

UD Coptic contains manually annotated Sahidic Coptic texts, including Biblical texts, sermons, letters, and hagiography.

 

Language documentation

See the language documentation page.

Croatian treebanks

The Croatian UD treebank is based on the extension of the SETimes-HR corpus, the [hr500k](http://hdl.handle.net/11356/1183) corpus.

 

Language documentation

See the language documentation page.

Czech treebanks

The Czech-PDT UD treebank is based on the Prague Dependency Treebank 3.0 (PDT), created at Charles University in Prague.

 

The UD_Czech-CAC treebank is based on the Czech Academic Corpus 2.0 (CAC; Český akademický korpus; ČAK), created at Charles University in Prague.

 

FicTree is a treebank of Czech fiction, automatically converted into the UD format. The treebank was built at Charles University in Prague.

 

The UD_Czech-CLTT treebank is based on the Czech Legal Text Treebank 1.0, created at Charles University in Prague.

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Václava Kettnerová, Jan Hajič jr., Silvie Cinková, Zdeňka Urešová, Milan Straka, Jan Hajič, Jaroslava Hlaváčová, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

The Czech-PCEDT UD treebank is based on the Prague Czech-English Dependency Treebank 2.0 (PCEDT), created at Charles University in Prague.

 

See here for comparative statistics of Czech treebanks.

Language documentation

See the language documentation page.

Danish treebanks

The Danish UD treebank is a conversion of the Danish Dependency Treebank.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Dutch treebanks

This corpus consists of samples from various treebanks annotated at the University of Groningen using the Alpino annotation tools and guidelines.

 

This corpus contains sentences from the Wikipedia section of the Lassy Small Treebank. Universal Dependency annotation was generated automatically from the original annotation in Lassy.

 

See here for comparative statistics of Dutch treebanks.

Language documentation

See the language documentation page.

English treebanks

Universal Dependencies syntax annotations from the GUM corpus (https://corpling.uis.georgetown.edu/gum/)

 

UD_English-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

 

This is the English portion of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies (http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Jesse Kirchner, Lorenzo Lambertino, Martin Popel, Daniel Zeman, Christopher Manning, Sebastian Schuster, Siva Reddy
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

UD English_LinES is the English half of the LinES Parallel Treebank with the original dependency annotation first automatically converted into Universal Dependencies and then partially reviewed. Its contents cover literature, an online manual and Europarl data.

 

UD English-Pronouns is a dataset created to make pronoun identification more accurate, with a more balanced distribution across genders. The dataset initially targets the independent genitive pronouns "hers", (independent) "his", (singular) "theirs", "mine", and (singular) "yours".

 

Universal Dependencies syntax annotations from the Reddit portion of the GUM corpus (https://corpling.uis.georgetown.edu/gum/)

 

A Gold Standard Universal Dependencies Corpus for English, built over the source material of the English Web Treebank LDC2012T13 (https://catalog.ldc.upenn.edu/LDC2012T13).
  • Contributors: Natalia Silveira, Timothy Dozat, Christopher Manning, Sebastian Schuster, Ethan Chi, John Bauer, Miriam Connor, Marie-Catherine de Marneffe, Nathan Schneider, Sam Bowman, Hanzhi Zhu, Daniel Galbraith
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

UD English-ESL / Treebank of Learner English (TLE) contains manual POS tag and dependency annotations for 5,124 English as a Second Language (ESL) sentences drawn from the Cambridge Learner Corpus First Certificate in English (FCE) dataset.
  • Contributors: Yevgeni Berzak, Jessica Kenney, Carolyn Spadine, Jing Xian Wang, Lucia Lam, Keiko Sophie Mori, Sebastian Garza, Boris Katz, Margarita Misirpashayeva
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

The English-PCEDT UD treebank is based on the Prague Czech-English Dependency Treebank 2.0 (PCEDT), created at Charles University in Prague.

 

See here for comparative statistics of English treebanks.

Language documentation

See the language documentation page.

Erzya treebanks

UD Erzya is the original annotation (CoNLL-U) for texts in the Erzya language. It originally consisted of a sample from a number of fiction authors writing originals in Erzya.

 

Language documentation

See the language documentation page.

Estonian treebanks

UD Estonian is a converted version of the Estonian Dependency Treebank (EDT), originally annotated in the Constraint Grammar (CG) annotation scheme, and consisting of genres of fiction, newspaper texts and scientific texts. The treebank contains 30,972 trees, 437,769 tokens.

 

UD EWT treebank consists of different genres of new media. The treebank contains 5,536 trees, 68,868 tokens.

 

See here for comparative statistics of Estonian treebanks.

Language documentation

See the language documentation page.

Faroese treebanks

UD_Faroese-FarPaHC is a conversion of the [Faroese Parsed Historical Corpus (FarPaHC)](https://github.com/einarfs/farpahc) to the Universal Dependencies scheme. The conversion was done using [UDConverter](https://github.com/thorunna/UDConverter).
  • Contributors: Þórunn Arnardóttir, Hinrik Hafsteinsson, Einar Freyr Sigurðsson, Anton Karl Ingason, Eiríkur Rögnvaldsson, Joel C. Wallenberg
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

This is a treebank of Faroese based on the Faroese Wikipedia.

 

See here for comparative statistics of Faroese treebanks.

Language documentation

See the language documentation page.

Finnish treebanks

UD_Finnish-TDT is based on the Turku Dependency Treebank (TDT), a broad-coverage dependency treebank of general Finnish covering numerous genres. The conversion to UD was followed by extensive manual checks and corrections, and the treebank closely adheres to the UD guidelines.

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).

 

Finnish-OOD is an external out-of-domain test set for Finnish-TDT annotated natively into UD scheme.

 

FinnTreeBank 1 consists of manually annotated grammatical examples from VISK. The UD version of FinnTreeBank 1 was converted from a native annotation model with a script and later manually revised.

 

See here for comparative statistics of Finnish treebanks.

Language documentation

See the language documentation page.

French treebanks

**UD_French-GSD** was converted in 2015 from the content head version of the universal dependency treebank v2.0 (https://github.com/ryanmcd/uni-dep-tb). It has been updated since 2015 independently of the previous source.
  • Contributors: Marie-Catherine de Marneffe, Bruno Guillaume, Ryan McDonald, Alane Suhr, Joakim Nivre, Matias Grioni, Carly Dickerson, Guy Perrier
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

UD_French-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

 

**UD_French-Sequoia** is an automatic conversion of the [French Sequoia corpus](http://deep-sequoia.inria.fr).

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Jana Strnadová, Gauthier Caron, Martin Popel, Daniel Zeman, Marie-Catherine de Marneffe, Bruno Guillaume
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

The corpus **UD_French-FQB** is an automatic conversion of the [French QuestionBank v1](http://alpage.inria.fr/Treebanks/FQB/), a corpus entirely made of questions.

 

A Universal Dependencies corpus for spoken French.

 

The Universal Dependency version of the French Treebank (Abeillé et al., 2003), hereafter UD_French-FTB, is a treebank of sentences from the newspaper Le Monde, initially manually annotated with morphological information and phrase-structure and then converted to the Universal Dependencies annotation scheme.
  • Contributors: Marie Candito, Bruno Guillaume, Teresa Lynn, Héctor Martínez Alonso, Benoît Sagot, Djamé Seddah, Eric Villemonte de la Clergerie
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

See here for comparative statistics of French treebanks.

Language documentation

See the language documentation page.

Frisian treebanks

The UD Frisian-FA-RuG treebank is a West Frisian treebank.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Frisian Dutch treebanks

UD_Frisian_Dutch-Fame is a selection of 400 sentences from the FAME! speech corpus by Yilmaz et al. (2016a, 2016b). The treebank is manually annotated using the UD scheme.

 

Language documentation

See the language documentation page.

Galician treebanks

The Galician-TreeGal is a treebank for Galician developed at LyS Group (Universidade da Coruña).

 

The Galician UD treebank is based on the automatic parsing of the Galician Technical Corpus (http://sli.uvigo.gal/CTG) created at the University of Vigo by the TALG NLP research group.

 

See here for comparative statistics of Galician treebanks.

Language documentation

See the language documentation page.

German treebanks

UD German-HDT is a conversion of the Hamburg Dependency Treebank, created at the University of Hamburg through manual annotation in conjunction with a standard for morphologically and syntactically annotating sentences as well as a constraint-based parser.

 

The German UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Georg Rehm, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Sebastian Bank, Martin Popel, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

This treebank aims to gather texts from German literary history. Currently, it hosts Fragments of early Romanticism, i.e. aphorism-like texts mainly dealing with philosophical issues concerning art, beauty and related topics.

 

See here for comparative statistics of German treebanks.

Language documentation

See the language documentation page.

Gothic treebanks

The UD Gothic treebank is based on the Gothic data from the PROIEL treebank, and consists of Wulfila's Bible translation.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Greek treebanks

The Greek UD treebank (UD_Greek-GDT) is derived from the Greek Dependency Treebank (http://gdt.ilsp.gr), a resource developed and maintained by researchers at the Institute for Language and Speech Processing/Athena R.C. (http://www.ilsp.gr).

 

Language documentation

See the language documentation page.

Guajajara treebanks

UD_Guajajara-TuDeT is a collection of annotated sentences in [Guajajara](https://glottolog.org/resource/languoid/id/guaj1255). The project is a work in progress and the treebank is being updated on a regular basis.

 

Language documentation

See the language documentation page.

Hebrew treebanks

A Universal Dependencies Corpus for Hebrew.

 

Language documentation

See the language documentation page.

Hindi treebanks

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Esha Banerjee, Pinkey Nainwani, Martin Popel, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

The Hindi UD treebank is based on the Hindi Dependency Treebank (HDTB), created at IIIT Hyderabad, India.

 

See here for comparative statistics of Hindi treebanks.

Language documentation

See the language documentation page.

Hindi English treebanks

The Hindi-English Code-switching treebank is based on code-switching tweets of Hindi and English multilingual speakers (mostly Indian) on Twitter. The treebank is manually annotated using the UD scheme. The training and evaluation sets were separately annotated by different annotators using UD v2 and v1 guidelines respectively. The evaluation sets are automatically converted from UD v1 to v2.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Hittite treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Hungarian treebanks

The Hungarian UD treebank is derived from the Szeged Dependency Treebank (Vincze et al. 2010).

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Icelandic treebanks

UD_Icelandic-IcePaHC is a conversion of the [Icelandic Parsed Historical Corpus (IcePaHC)](https://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC)) to the Universal Dependencies scheme. The conversion was done using [UDConverter](https://github.com/thorunna/UDConverter).
  • Contributors: Þórunn Arnardóttir, Hinrik Hafsteinsson, Einar Freyr Sigurðsson, Hildur Jónsdóttir, Kristín Bjarnadóttir, Anton Karl Ingason, Kristján Rúnarsson, Steinþór Steingrímsson, Joel C. Wallenberg, Eiríkur Rögnvaldsson
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

UD_Icelandic-Modern is a conversion of the modern additions to the Icelandic Parsed Historical Corpus (IcePaHC) to the Universal Dependencies scheme. The conversion was done using [UDConverter](https://github.com/thorunna/UDConverter).
  • Contributors: Kristján Rúnarsson, Þórunn Arnardóttir, Hinrik Hafsteinsson, Starkaður Barkarson, Hildur Jónsdóttir, Steinþór Steingrímsson, Einar Freyr Sigurðsson
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

Icelandic-PUD is the Icelandic part of the Parallel Universal Dependencies (PUD) treebanks.

 

See here for comparative statistics of Icelandic treebanks.

Language documentation

See the language documentation page.

Indonesian treebanks

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Ruli Manurung, Muh Shohibussirri, Martin Popel, Daniel Zeman, Ika Alfina, Arawinda Dinakaramani, Muhammad Yudistira Hanifmuti, Jessica Naraiswari Arwidarasti, Yogi Lesmana Sulestio
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

UD Indonesian-CSUI is a conversion from an Indonesian constituency treebank in the Penn Treebank format named [**Kethu**](https://github.com/ialfina/kethu) that was also a conversion from a constituency treebank built by [**Dinakaramani et al. (2015)**](https://github.com/famrashel/idn-treebank). We named this treebank **Indonesian-CSUI**, since all the three versions of the treebanks were built at Faculty of Computer Science, Universitas Indonesia.
  • Contributors: Ika Alfina, Jessica Naraiswari Arwidarasti, Muhammad Yudistira Hanifmuti, Arawinda Dinakaramani, Ruli Manurung, Fam Rashel, Andry Luthfi
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

The Indonesian UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

 

See here for comparative statistics of Indonesian treebanks.

Language documentation

See the language documentation page.

Irish treebanks

A Universal Dependencies 4910-sentence treebank for modern Irish.

 

A Universal Dependencies treebank of 866 tweets in modern Irish.

 

See here for comparative statistics of Irish treebanks.

Language documentation

See the language documentation page.

Italian treebanks

The Italian corpus annotated according to the UD annotation scheme was obtained by conversion from ISDT (Italian Stanford Dependency Treebank), released for the dependency parsing shared task of Evalita-2014 (Bosco et al. 2014).

 

The UD_Italian-VIT corpus was obtained by conversion from VIT (Venice Italian Treebank), developed at the Laboratory of Computational Linguistics of the Università Ca' Foscari in Venice (Delmonte et al. 2007; Delmonte 2009; http://rondelmo.it/resource/VIT/Browser-VIT/index.htm).

 

UD_Italian-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

 

TWITTIRÒ-UD is a collection of ironic Italian tweets annotated in Universal Dependencies. The treebank can be exploited for the training of NLP systems to enhance their performance on social media texts, and in particular, for irony detection purposes.

 

PoSTWITA-UD is a collection of Italian tweets annotated in Universal Dependencies that can be exploited for the training of NLP systems to enhance their performance on social media texts.

 

A manually corrected treebank of learner Italian drawn from the Valico corpus, together with the corresponding corrected sentences.

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Antonio Stella, Davide Rovati, Martin Popel, Daniel Zeman, Maria Simi, Manuela Sanguinetti
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

See here for comparative statistics of Italian treebanks.

Language documentation

See the language documentation page.

Japanese treebanks

This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation. The original sentences are from Google UDT 2.0.
  • Contributors: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Ryan McDonald, Joakim Nivre, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Kaoru Ito, Taishi Chika, Shinsuke Mori, Sumire Uematsu, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Atsuko Shimada, Anna Trukhina, Martin Popel, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation. The original sentences are from the 'Balanced Corpus of Contemporary Written Japanese' (BCCWJ).
  • Contributors: Mai Omura, Masayuki Asahara, Yusuke Miyao, Takaaki Tanaka, Hiroshi Kanayama, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Yugo Murawaki
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation. The original sentences are from the 'Corpus of Historical Japanese' (CHJ).

 

Please add a summary section to the treebank readme file

 

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...
  • Contributors: Mai Omura, Masayuki Asahara, Yusuke Miyao, Takaaki Tanaka, Hiroshi Kanayama, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Yugo Murawaki
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...
  • Contributors: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Ryan McDonald, Joakim Nivre, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...
  • Contributors: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Kaoru Ito, Taishi Chika, Shinsuke Mori, Sumire Uematsu, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Atsuko Shimada, Anna Trukhina, Martin Popel, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

See here for comparative statistics of Japanese treebanks.

Language documentation

See the language documentation page.

Kaapor treebanks

**UD_Kaapor-TuDeT** is a collection of annotated sentences in [Ka'apor](https://glottolog.org/resource/languoid/id/urub1250). The project is a work in progress and the treebank is being updated on a regular basis.

 

Language documentation

See the language documentation page.

Kangri treebanks

The Kangri UD Treebank (KDTB) is a part of the Universal Dependency treebank project.

 

Language documentation

See the language documentation page.

Karelian treebanks

UD Karelian-KKPP is a new, manually annotated corpus of Karelian made in the Universal Dependencies annotation scheme. The data is collected from [VepKar corpora](http://dictorpus.krc.karelia.ru/en/corpus/text) and consists mostly of modern news texts, along with some stories and educational texts.

 

Language documentation

See the language documentation page.

Kazakh treebanks

The UD Kazakh treebank is a combination of text from various sources including Wikipedia, some folk tales, sentences from the UDHR, news and phrasebook sentences. Sentence IDs include partial document identifiers.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Khunsari treebanks

The AHA Khunsari Treebank is a small treebank for contemporary Khunsari. Its corpus is collected and annotated manually. We have prepared this treebank based on interviews with Khunsari speakers.

 

Language documentation

See the language documentation page.

Kiche treebanks

UD Kʼicheʼ-IU is a treebank consisting of sentences from a variety of text domains but principally dictionary example sentences and linguistic examples.

 

Language documentation

See the language documentation page.

Komi Permyak treebanks

This is a Komi-Permyak literary language treebank consisting of original and translated texts.

 

Language documentation

See the language documentation page.

Komi Zyrian treebanks

UD Komi-Zyrian Lattice is a treebank of written standard Komi-Zyrian.

 

This treebank consists of dialectal transcriptions of spoken Komi-Zyrian. The current texts are short recorded segments from different areas where the Iźva dialect of Komi language is spoken.

 

See here for comparative statistics of Komi Zyrian treebanks.

Language documentation

See the language documentation page.

Korean treebanks

The KAIST Korean Universal Dependency Treebank was generated by Chun et al. (2018) from the constituency trees in the [KAIST Tree-Tagging Corpus](http://semanticweb.kaist.ac.kr/home/index.php/Corpus4).

 

The Google Korean Universal Dependency Treebank is first converted from the [Universal Dependency Treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb), and then enhanced by Chun et al., 2018.

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Sookyoung Kwak, Yongseok Cho, Martin Popel, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Please add a summary section to the treebank readme file

 

See here for comparative statistics of Korean treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Kurmanji treebanks

The UD Kurmanji corpus is a corpus of Kurmanji Kurdish. It contains fiction and encyclopaedic texts in roughly equal measure. It has been annotated natively in accordance with the UD annotation scheme.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Latin treebanks

This Universal Dependencies version of the **LLCT** (Late Latin Charter Treebank) consists of an automated conversion of the **LLCT2** treebank from the Latin Dependency Treebank (LDT) format into the Universal Dependencies standard.

 

Latin data from the _Index Thomisticus_ Treebank. Data are taken from the _Index Thomisticus_ corpus by Roberto Busa SJ, which contains the complete work by Thomas Aquinas (1225–1274; Medieval Latin) and by 61 other authors related to Thomas.

 

The **UDante** treebank is based on the Latin texts of Dante Alighieri, taken from the [**DanteSearch corpus**](https://dantesearch.dantenetwork.it), originally created at the University of Pisa, Italy. It is a treebank of Latin language, more precisely of **literary Medieval Latin** (XIVth century).

 

The Latin PROIEL treebank is based on the Latin data from the PROIEL treebank, and contains most of the Vulgate New Testament translations plus selections from Caesar's Gallic War, Cicero's Letters to Atticus, Palladius' Opus Agriculturae and the first book of Cicero's De officiis.

 

This Universal Dependencies Latin Treebank consists of an automatic conversion of a selection of passages from the Ancient Greek and Latin Dependency Treebank 2.1.

 

See here for comparative statistics of Latin treebanks.

Language documentation

See the language documentation page.

Latvian treebanks

Latvian UD Treebank is based on Latvian Treebank ([LVTB](http://sintakse.korpuss.lv)), being created at University of Latvia, Institute of Mathematics and Computer Science, [Artificial Intelligence Laboratory](http://ailab.lv).

 

Language documentation

See the language documentation page.

Laz treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Lithuanian treebanks

The Lithuanian dependency treebank ALKSNIS v3.0 (Vytautas Magnus University).
  • Contributors: Andrius Utka, Erika Rimkutė, Agnė Bielinskienė, Jolanta Kovalevskaitė, Loïc Boizou, Gabrielė Aleksandravičiūtė, Kristina Brokaitė, Daniel Zeman, Natalia Perkova, Bernadeta Griciūtė
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

Lithuanian treebank annotated manually (dependencies) using the Morphological Annotator by CCL, Vytautas Magnus University (http://tekstynas.vdu.lt/) and manual disambiguation. A pilot version which includes news and an essay by Tomas Venclova is available here.

 

See here for comparative statistics of Lithuanian treebanks.

Language documentation

See the language documentation page.

Livvi treebanks

UD Livvi-KKPP is a new, manually annotated corpus of Livvi-Karelian made directly in the Universal Dependencies annotation scheme. The data is collected from [VepKar corpora](http://dictorpus.krc.karelia.ru/en/corpus/text) and consists mostly of modern news texts, along with some stories and educational texts.

 

Language documentation

See the language documentation page.

Low Saxon treebanks

The UD Low Saxon LSDC dataset consists of sentences in 18 Low Saxon dialects from both Germany and the Netherlands. These sentences are (or are to become) part of the LSDC dataset and represent the language from the 19th and early 20th century in genres such as short stories, novels, speeches, letters and fairytales.

 

Language documentation

See the language documentation page.

Magahi treebanks

The [Magahi](https://en.wikipedia.org/wiki/Magahi_language) UD Treebank (MGTB) is a part of the [Universal Dependency treebank](http://universaldependencies.org/) project.

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks originally created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Makurap treebanks

UD_Makuráp-TuDeT is a collection of annotated texts in Makuráp. The project is a work in progress and the treebank is being updated on a regular basis.

 

Language documentation

See the language documentation page.

Maltese treebanks

MUDT (Maltese Universal Dependencies Treebank) is a manually annotated treebank of Maltese, a Semitic language of Malta descended from North African Arabic with a significant amount of Italo-Romance influence. MUDT was designed as a balanced corpus with four major genres (see Splitting below) represented roughly equally.

 

Language documentation

See the language documentation page.

Manx treebanks

This is the Cadhan Aonair UD treebank for Manx Gaelic, created by Kevin Scannell.

 

Language documentation

See the language documentation page.

Marathi treebanks

UD Marathi is a manually annotated treebank consisting primarily of stories from Wikisource, and parts of an article on Wikipedia.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Mbya Guarani treebanks

UD Mbya_Guarani-Thomas is a corpus of Mbyá Guaraní (Tupian) texts collected by Guillaume Thomas. The current version of the corpus consists of three speeches by Paulina Kerechu Núñez Romero, a Mbyá Guaraní speaker from Ytu, Caazapá Department, Paraguay.

 

UD Mbya_Guarani-Dooley is a corpus of narratives written in Mbyá Guaraní (Tupian) in Brazil, and collected by Robert Dooley. Due to copyright restrictions, the corpus that is distributed as part of UD only contains the annotation (tags, features, relations) while the FORM and LEMMA columns are empty.
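As a rough illustration of what an annotation-only release looks like to a consumer, the sketch below reads CoNLL-U token lines and checks whether any token carries surface text. It assumes the empty FORM and LEMMA columns are filled with the usual CoNLL-U underscore placeholder; the sample sentence and the helper names `read_tokens` and `has_surface_text` are invented for this example and are not taken from the treebank or its tooling.

```python
# The ten tab-separated fields of a CoNLL-U token line.
CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def read_tokens(conllu_text):
    """Yield one dict per token line, skipping comments and blank lines."""
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        values = line.split("\t")
        if len(values) != len(CONLLU_FIELDS):
            continue  # not a well-formed token line
        yield dict(zip(CONLLU_FIELDS, values))


def has_surface_text(tokens):
    """Return True if any token has a FORM or LEMMA other than the '_' placeholder."""
    return any(tok["FORM"] != "_" or tok["LEMMA"] != "_" for tok in tokens)


# Invented two-token sample with FORM and LEMMA stripped (assumption: '_' marks empty).
SAMPLE = (
    "# sent_id = example-1\n"
    "1\t_\t_\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\t_\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
)

tokens = list(read_tokens(SAMPLE))
print(has_surface_text(tokens))
print([(tok["UPOS"], tok["DEPREL"]) for tok in tokens])
```

Tags, features and relations remain usable in such a file, so statistics over UPOS or DEPREL can still be computed even though the words themselves are absent.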

 

See here for comparative statistics of Mbya Guarani treebanks.

Language documentation

See the language documentation page.

Middle Irish treebanks

Annotation of the classic Scela Mucce Meic Dathó ("The tale of Mac Dathó's pig").

 

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Moksha treebanks

UD_Moksha-JR originates from the resource Erme Universal Dependencies annotated texts Moksha and provides CoNLL-U annotation for texts in the Moksha language. It originally consists of a sample from a number of fiction authors writing originals in Moksha.

 

Language documentation

See the language documentation page.

Munduruku treebanks

UD_Munduruku-TuDeT is a collection of annotated sentences in [Mundurukú](http://www.endangeredlanguages.com/lang/2981). The project is a work in progress and the treebank is being updated on a regular basis.

 

Language documentation

See the language documentation page.

Naija treebanks

A Universal Dependencies corpus for spoken Naija (Nigerian Pidgin).
  • Contributors: Bernard Caron, Emmett Strickland, Marine Courtin, Kim Gerdes, Bruno Guillaume, Sylvain Kahane, Chika Kennedy Ajede, Emeka Onwuegbuzia, Samson Tella
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

Language documentation

See the language documentation page.

Nayini treebanks

The AHA Nayini Treebank is a small treebank for contemporary Nayini. Its corpus is collected and annotated manually. We have prepared this treebank based on interviews with Nayini speakers.

 

Language documentation

See the language documentation page.

North Sami treebanks

This is a North Sámi treebank based on a manually disambiguated and function-labelled gold-standard corpus of North Sámi produced by the Giellatekno team at UiT Norgga árktalaš universitehta.

 

Language documentation

See the language documentation page.

Norwegian treebanks

The Norwegian UD treebank is based on the Bokmål section of the Norwegian Dependency Treebank (NDT), which is a syntactic treebank of Norwegian. NDT has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.

 

The Norwegian UD treebank is based on the Nynorsk section of the Norwegian Dependency Treebank (NDT), which is a syntactic treebank of Norwegian. NDT has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.

 

This Norwegian treebank is based on the LIA treebank of transcribed spoken Norwegian dialects. The treebank has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.

 

See here for comparative statistics of Norwegian treebanks.

Language documentation

See the language documentation page.

Old Church Slavonic treebanks

The Old Church Slavonic (OCS) UD treebank is based on the Old Church Slavonic data from the PROIEL treebank and contains the text of the Codex Marianus New Testament translation.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Old East Slavic treebanks

`UD_Old_East_Slavic-RNC` is a sample of the Middle Russian corpus (1300-1700), a part of the Russian National Corpus. The data were originally annotated according to the RNC and extended UD-Russian morphological schemas and UD 2.4 dependency schema.

 

UD\_Old\_East\_Slavic-TOROT is a conversion of a selection of the Old East Slavonic and Middle Russian data in the Tromsø Old Russian and OCS Treebank (TOROT), which was originally annotated in PROIEL dependency format.

 

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

See here for comparative statistics of Old East Slavic treebanks.

Language documentation

See the language documentation page.

Old French treebanks

UD_Old_French-SRCMF is a conversion of (part of) the SRCMF corpus (Syntactic Reference Corpus of Medieval French [srcmf.org](http://srcmf.org/)).
  • Contributors: Sophie Prévost, Aurélie Collomb, Kim Gerdes, Isabelle Tellier, Marine Courtin, Alexei Lavrentiev, Céline Guillot-Barbance, Loïc Grobol
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

Language documentation

See the language documentation page.

Old Irish treebanks

A Universal Dependencies treebank for the Old Irish glosses of St. Gall.

 

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

See the language documentation page.

Old Japanese treebanks

UD_Old_Japanese-LMJ is a collection of annotated texts in Late Middle Japanese. The texts are being translated by SR, and are annotated by SR and FFG.

 

Language documentation

See the language documentation page.

Old Turkish treebanks

`UD_Old_Turkish-Tonqq` is an [Old Turkish](https://iso639-3.sil.org/code/otk) treebank built upon Turkic script texts or sentences that are trivially convertible.

 

Language documentation

See the language documentation page.

Persian treebanks

The Persian Universal Dependency Treebank (PerUDT) is the result of automatic conversion of the Persian Dependency Treebank (PerDT) with extensive manual corrections. Please refer to the following work if you use this data: Mohammad Sadegh Rasooli, Pegah Safari, Amirsaeid Moloodi, and Alireza Nourian. "The Persian Dependency Treebank Made Universal". 2020 (to appear).

 

The Persian Universal Dependency Treebank (Persian UD) is based on Uppsala Persian Dependency Treebank (UPDT). The conversion of the UPDT to the Universal Dependencies was performed semi-automatically with extensive manual checks and corrections.

 

See here for comparative statistics of Persian treebanks.

Language documentation

See the language documentation page.

Polish treebanks

The Polish PDB-UD treebank is based on the Polish Dependency Bank 2.0 (PDB 2.0), created at the Institute of Computer Science, Polish Academy of Sciences in Warsaw. The PDB-UD treebank is an extended and corrected version of the Polish SZ-UD treebank (the release 1.2 to 2.3).

 

The LFG Enhanced UD treebank of Polish is based on a corpus of LFG (Lexical Functional Grammar) syntactic structures generated by an LFG grammar of Polish, POLFIE, and manually disambiguated by human annotators.

 

This is the Polish portion of the Parallel Universal Dependencies (PUD) treebanks, created at the Institute of Computer Science, Polish Academy of Sciences in Warsaw.

 

See here for comparative statistics of Polish treebanks.

Language documentation

See the language documentation page.

Portuguese treebanks

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Gustavo Mendonça, Larissa Rinaldi, Martin Popel, Daniel Zeman, Valeria de Paiva, Alexandre Rademaker
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

This Universal Dependencies (UD) Portuguese treebank is based on the Constraint Grammar converted version of the Bosque, which is part of the Floresta Sintá(c)tica treebank. It contains both European (CETEMPúblico) and Brazilian (CETENFolha) variants.
  • Contributors: Alexandre Rademaker, Cláudia Freitas, Elvis de Souza, Aline Silveira, Tatiana Cavalcanti, Wograine Evelyn, Luisa Rocha, Isabela Soares-Bastos, Eckhard Bick, Fabricio Chalub, Guilherme Paulino-Passos, Livy Real, Valeria de Paiva, Daniel Zeman, Martin Popel, David Mareček, Natalia Silveira, André Martins
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

The Brazilian Portuguese UD is converted from the [Google Universal Dependency Treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

 

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

See here for comparative statistics of Portuguese treebanks.

Language documentation

See the language documentation page.

Romanian treebanks

The Romanian UD treebank (called RoRefTrees) (Barbu Mititelu et al., 2016) is the reference treebank in UD format for standard Romanian.

 

The Romanian Non-standard UD treebank (called UAIC-RoDia) is based on UAIC-RoDia Treebank. UAIC-RoDia = ISLRN 156-635-615-024-0
  • Contributors: Cătălina Mărănduc, Cenel-Augusto Perez, Victoria Bobicev, Cătălin Mititelu, Florinel Hociung, Valentin Roșca, Roman Untilov, Petru Rebeja
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

SiMoNERo is a medical corpus of contemporary Romanian.

 

The UD treebank ArT is a treebank of the Aromanian dialect of the Romanian language in UD format.

 

See here for comparative statistics of Romanian treebanks.

Language documentation

See the language documentation page.

Russian treebanks

This Universal Dependencies treebank is based on data samples extracted from the Taiga Corpus and from the MorphoRuEval-2017 and GramEval-2020 shared task collections.

 

Russian Universal Dependencies Treebank annotated and converted by Google.

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Tatiana Lando, Olga Loginova, Martin Popel, Daniel Zeman, Kira Droganova
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

Russian data from the SynTagRus corpus.

 

See here for comparative statistics of Russian treebanks.

Language documentation

See the language documentation page.

Sanskrit treebanks

A small Sanskrit treebank of sentences from Pañcatantra, an ancient Indian collection of interrelated fables by Vishnu Sharma.

 

The Treebank of Vedic Sanskrit contains 4,000 sentences with 27,000 words chosen from metrical and prose passages of the Ṛgveda (RV), the Śaunaka recension of the Atharvaveda (ŚS), the Maitrāyaṇīsaṃhitā (MS), and the Aitareya- (AB) and Śatapatha-Brāhmaṇas (ŚB). Lexical and morpho-syntactic information has been generated using tagging software and manually validated. POS tags have been induced automatically from the morpho-syntactic information of each word.

 

See here for comparative statistics of Sanskrit treebanks.

Language documentation

See the language documentation page.

Scottish Gaelic treebanks

A treebank of Scottish Gaelic based on the [Annotated Reference Corpus Of Scottish Gaelic (ARCOSG)](https://github.com/Gaelic-Algorithmic-Research-Group/ARCOSG).

 

Language documentation

See the language documentation page.

Serbian treebanks

The Serbian UD treebank is based on the [SETimes-SR](http://hdl.handle.net/11356/1200) corpus and additional news documents from the Serbian web.

 

Language documentation

See the language documentation page.

Sindhi treebanks

The Sindhi Universal Dependency Treebank was automatically converted from Sindhi Dependency Treebank (SDTB) which is part of an ongoing effort of creating multi-layered treebanks for Sindhi.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Skolt Sami treebanks

The UD Skolt Sami Giellagas treebank is based almost entirely on spoken Skolt Sami corpora.

 

Language documentation

See the language documentation page.

Slovak treebanks

The Slovak UD treebank is based on data originally annotated as part of the Slovak National Corpus, following the annotation style of the Prague Dependency Treebank.

 

Language documentation

See the language documentation page.

Slovenian treebanks

The Slovenian UD Treebank is a rule-based conversion of the ssj500k treebank, the largest collection of manually syntactically annotated data in Slovenian, originally annotated in the JOS annotation scheme.

 

The Spoken Slovenian UD Treebank (SST) is the first syntactically annotated corpus of spoken Slovenian, based on a sample of the reference GOS corpus, a collection of transcribed audio recordings of monologic, dialogic and multi-party spontaneous speech in different everyday situations.

 

See here for comparative statistics of Slovenian treebanks.

Language documentation

See the language documentation page.

Soi treebanks

The AHA Soi Treebank is a small treebank for contemporary Soi. Its corpus is collected and annotated manually. We have prepared this treebank based on interviews with Soi speakers.

 

Language documentation

See the language documentation page.

South Levantine Arabic treebanks

The South_Levantine_Arabic-MADAR treebank consists of 100 manually-annotated sentences taken from the [MADAR](https://camel.abudhabi.nyu.edu/madar/) (Multi-Arabic Dialect Applications and Resources) project. TO-DO: Add 20 annotated sentences from CCC as a train set.

 

Language documentation

See the language documentation page.

Spanish treebanks

Spanish data from the [AnCora](http://clic.ub.edu/corpus/) corpus.

 

The Spanish UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Hector Fernandez Alcalde, Laura Moreno Romero, Martin Popel, Daniel Zeman, Héctor Martínez Alonso
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

See here for comparative statistics of Spanish treebanks.

Language documentation

See the language documentation page.

Swedish treebanks

UD Swedish_LinES is the Swedish half of the LinES Parallel Treebank with UD annotations. All segments are translations from English and the sources cover literary genres, online manuals and Europarl data.

 

Swedish-PUD is the Swedish part of the Parallel Universal Dependencies (PUD) treebanks.

 

The Swedish-Talbanken treebank is based on Talbanken, a treebank developed at Lund University in the 1970s.

 

See here for comparative statistics of Swedish treebanks.

Language documentation

See the language documentation page.

Swedish Sign Language treebanks

The Universal Dependencies treebank for Swedish Sign Language (ISO 639-3: swl) is derived from the Swedish Sign Language Corpus (SSLC) from the department of linguistics, Stockholm University.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Swiss German treebanks

_UD\_Swiss\_German-UZH_ is a tiny manually annotated treebank of 100 sentences in different Swiss German dialects and a variety of text genres.

 

Language documentation

See the language documentation page.

Tagalog treebanks

UD_Tagalog-TRG is a UD treebank manually annotated using sentences from a grammar book.

 

Ugnayan is a manually annotated Tagalog treebank currently composed of educational fiction and nonfiction text. The treebank is under development at the University of the Philippines.

 

See here for comparative statistics of Tagalog treebanks.

Language documentation

See the language documentation page.

Tamil treebanks

The UD Tamil treebank is based on the Tamil Dependency Treebank created at the Charles University in Prague by Loganathan Ramasamy.

 

MWTT - Modern Written Tamil Treebank has sentences taken primarily from a text called "A Grammar of Modern Tamil" by Thomas Lehmann (1993). This initial release has 536 sentences of various lengths, all of which are included in the test set.

 

See here for comparative statistics of Tamil treebanks.

Language documentation

See the language documentation page.

Telugu treebanks

The Telugu UD treebank was created directly in UD, based on manual annotation of sentences from a grammar book.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Thai treebanks

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Rattima Nitisaroj, Yanin Sawanakunanon, Martin Popel, Daniel Zeman
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Tupinamba treebanks

UD_Tupinamba-TuDeT is a collection of annotated texts in Tupi(nambá). The project is a work in progress and the treebank is being updated on a regular basis.

 

Language documentation

See the language documentation page.

Turkish treebanks

This is a treebank annotating example sentences from a comprehensive grammar book of Turkish.

 

Turkish Tourism is a domain specific treebank consisting of 19,750 manually annotated sentences and 92,200 tokens. These sentences were taken from the original customer reviews of a tourism company.
  • Contributors: Aslı Kuzgun, Neslihan Cesur, Olcay Taner Yıldız, Oğuzhan Kuyrukçu, Büşra Marşan, Bilge Nas Arıcan, Neslihan Kara, Deniz Baran Aslan, Ezgi Sanıyar
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

Turkish version of the Penn Treebank. It consists of a total of 9,560 manually annotated sentences and 87,367 tokens. (It only includes sentences up to 15 words long.)
  • Contributors: Neslihan Cesur, Aslı Kuzgun, Olcay Taner Yıldız, Büşra Marşan, Neslihan Kara, Bilge Nas Arıcan, Merve Özçelik, Deniz Baran Aslan
  • Repository master dev
  • README
  • Treebank hub page
  • Download
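
The 15-word cap mentioned in the summary above can be checked on any CoNLL-U file with a short script. The following is a minimal sketch, not part of the treebank's own tooling: it assumes the standard CoNLL-U layout (blank-line-separated sentences, `#` comment lines, ten tab-separated columns), counts every syntactic word including punctuation (the summary does not say whether punctuation counts toward the cap), and uses a synthetic placeholder sample rather than real treebank data. The helper names `read_sentences`, `word_count`, `within_cap`, and `make_row` are hypothetical.

```python
from typing import Iterable, Iterator, List

MAX_WORDS = 15  # length cap stated in the treebank summary


def read_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield each sentence as a list of tab-split CoNLL-U rows."""
    rows: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if rows:
                yield rows
                rows = []
        elif not line.startswith("#"):
            # Skip comment lines such as "# sent_id = ...".
            rows.append(line.split("\t"))
    if rows:
        yield rows


def word_count(sentence: List[List[str]]) -> int:
    """Count syntactic words: rows whose ID is a plain integer.

    Multiword-token ranges ("1-2") and empty nodes ("1.1") are skipped.
    """
    return sum(1 for row in sentence if row[0].isdigit())


def within_cap(sentence: List[List[str]], cap: int = MAX_WORDS) -> bool:
    return word_count(sentence) <= cap


def make_row(index: int, form: str) -> str:
    """Build a placeholder row: ID, FORM, and eight unspecified columns."""
    return "\t".join([str(index), form] + ["_"] * 8)


# Synthetic sample: one 2-word sentence and one 16-word sentence.
short = "\n".join(make_row(i, f"w{i}") for i in range(1, 3))
long_ = "\n".join(make_row(i, f"w{i}") for i in range(1, 17))
sample = f"# sent_id = short\n{short}\n\n# sent_id = long\n{long_}\n"

sentences = list(read_sentences(sample.splitlines()))
kept = [s for s in sentences if within_cap(s)]
print(len(sentences), len(kept))
```

To run the same check on a downloaded treebank, pass an open file object (for example `open(path, encoding="utf-8")`, where `path` points at a `.conllu` file) to `read_sentences` instead of the synthetic sample.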

 

Turkish-Kenet UD Treebank is the biggest treebank of Turkish. It consists of 18,700 manually annotated sentences and 178,700 tokens. Its corpus consists of dictionary examples.

 

Turkish FrameNet consists of 2,700 manually annotated example sentences and 19,221 tokens. Its data consists of the sentences taken from the Turkish FrameNet Project. The annotated sentences can be filtered according to the semantic frame category of the root of the sentence.
  • Contributors: Neslihan Cesur, Aslı Kuzgun, Olcay Taner Yıldız, Büşra Marşan, Oğuzhan Kuyrukçu, Bilge Nas Arıcan, Ezgi Sanıyar, Neslihan Kara, Merve Özçelik
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

The largest Turkish dependency treebank annotated in UD style. Created by the members of [TABILAB](http://tabilab.cmpe.boun.edu.tr/) from Boğaziçi University.
  • Contributors: Utku Türk, Furkan Atmaca, Şaziye Betül Özateş, Gözde Berk, Seyyit Talha Bedir, Abdullatif Köksal, Balkız Öztürk Başaran, Tunga Güngör, Arzucan Özgür
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
  • Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Savas Cetin, Martin Popel, Daniel Zeman, Francis Tyers, Çağrı Çöltekin, Utku Türk, Furkan Atmaca, Şaziye Betül Özateş, Abdullatif Köksal, Balkız Öztürk Başaran, Tunga Güngör, Arzucan Özgür
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

The UD Turkish Treebank, also called the IMST-UD Treebank, is a semi-automatic conversion of the IMST Treebank (Sulubacak et al., 2016).

 

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

See here for comparative statistics of Turkish treebanks.

Language documentation

See the language documentation page.

Turkish German treebanks

UD Turkish-German SAGT is a Turkish-German code-switching treebank that is developed as part of the [SAGT](https://www.ims.uni-stuttgart.de/en/research/projects/sagt/) project.

 

Language documentation

See the language documentation page.

Ukrainian treebanks

Gold standard Universal Dependencies corpus for Ukrainian, originally developed for UD by the [Institute for Ukrainian](https://mova.institute), an NGO. [[українською](https://mova.institute/золотий_стандарт)]

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Upper Sorbian treebanks

A small treebank of Upper Sorbian based mostly on Wikipedia.

 

Language documentation

See the language documentation page.

Urdu treebanks

The Urdu Universal Dependency Treebank was automatically converted from the Urdu Dependency Treebank (UDTB), which is part of an ongoing effort to create multi-layered treebanks for Hindi and Urdu.

 

Language documentation

See the language documentation page.

Uyghur treebanks

The Uyghur UD treebank is based on the Uyghur Dependency Treebank (UDT), created at Xinjiang University in Ürümqi, China.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Vietnamese treebanks

The Vietnamese UD treebank is a conversion of the constituent treebank created in the VLSP project (https://vlsp.hpda.vn/).

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Warlpiri treebanks

A small treebank of grammatical examples in Warlpiri, taken from linguistic literature.

 

Language documentation

See the language documentation page.

Welsh treebanks

UD Welsh-CCG (Corpws Cystrawennol y Gymraeg) is a treebank of Welsh, annotated according to the Universal Dependencies guidelines.

 

Language documentation

See the language documentation page.

Western Armenian treebanks

The Western Armenian UD treebank is based on the Western Armenian section of the Armenian Dependency Treebank (Հայերէնի Ծառադարան), originally developed for UD by the ArmTDP team led by Marat M. Yavrumyan at Yerevan State University.

 

Language documentation

See the language documentation page.

Wolof treebanks

UD_Wolof-WTB is a natively and manually developed treebank for Wolof. Sentences were collected from encyclopedic, fictional, biographical, and religious texts and from news.

 

Language documentation

See the language documentation page.

Yoruba treebanks

Parts of the Yoruba Bible and of the Yoruba edition of Wikipedia, hand-annotated natively in Universal Dependencies.

 

Language documentation

See the language documentation page.

Yupik treebanks

UD_Yupik-SLI is a treebank of St. Lawrence Island Yupik (ISO 639-3: ess) that has been manually annotated at the morpheme level, based on a finite-state morphological analyzer by [Chen et al., 2020](https://www.aclweb.org/anthology/2020.lrec-1.326). The word-level annotation, merging multiword expressions, is provided in not-to-release/ess_sli-ud-test.merged.conllu. More information about the treebank can be found in our publication (AmericasNLP, 2021).

 

Language documentation

See the language documentation page.

Upcoming UD Languages

Archaic Irish treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Assamese treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Bengali treebanks

Please add a summary section to the treebank readme file

 

Please add a summary section to the treebank readme file

 

This is a part of the Parallel Universal Dependencies (PUD) treebanks originally created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Cappadocian treebanks

The “Asia Minor Greek in Contact” treebank (AMGiC, UD_AMGiC) will be compiled from sentences exhibiting contact-induced morphosyntactic phenomena (CIMSP) that result from contact between Greek and Turkish varieties in Anatolia and adjacent regions. The sentences will be identified in Asia Minor Greek (AMG) dialectal sources. In addition to the UD analysis, the AMGiC treebank will provide information about the sociolinguistic context in which CIMSP arise.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Cusco Quechua treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Dargwa treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Georgian treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Hiligaynon treebanks

UD Hiligaynon-HTB is a UD treebank containing manually annotated sentences from the [PALI Language Texts](https://www.hawaiiopen.org/bookseries/pali-language-texts-philippines/) grammar books, made available by University of Hawaii Press.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Javanese treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

See the language documentation page.

Kabyle treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Kannada treebanks

Examples from Modern Kannada Grammar by S. N. Sridhar.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Karo treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

See the language documentation page.

Khoekhoe treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Kiga treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Kyrgyz treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Ladino treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Ligurian treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Macedonian treebanks

The Macedonian-MTB treebank is a collection of annotated sentences based on the raw monolingual corpus [Macedonian Language Digital Resources - MLDR](http://drmj.manu.edu.mk/%D0%B5%D0%BB%D0%B5%D0%BA%D1%82%D1%80%D0%BE%D0%BD%D1%81%D0%BA%D0%B8-%D0%BA%D0%BE%D1%80%D0%BF%D1%83%D1%81-%D0%BD%D0%B0-%D0%BC%D0%B0%D0%BA%D0%B5%D0%B4%D0%BE%D0%BD%D1%81%D0%BA%D0%B8-%D0%BA%D0%BD%D0%B8/), a.k.a. 135 Volumes of Macedonian Literature, published by the Macedonian Academy of Sciences and Arts under the CC Attribution-NonCommercial 4.0 International License. The treebank consists mainly of literary texts, along with a few non-fiction texts.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Maghrebi Arabic French treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Mandyali treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Mongolian treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Ndengeleko treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Neapolitan treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

See the language documentation page.

Nepali treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Nkore treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Occitan treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Odia treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...
  • Contributors: Shantipriya Parida, Kalyanamalini Sahoo, Atul Kr. Ojha, Saraswati Sahoo, Swarnashree Mohanty, Anwesha Swain, Satya Ranjan Dash, Bijayalaxmi Dash
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Pnar treebanks

UD Pnar-PTB is a conversion from the Ring (2017) dataset ([doi:10.21979/N9/KVFGBZ](http://dx.doi.org/10.21979/N9/KVFGBZ)) that underpins a grammatical description of the Pnar language (Ring 2015, [http://hdl.handle.net/10356/62519](http://hdl.handle.net/10356/62519)). The corpus consists of folktales and interviews transcribed, translated, and interlinearized.

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Pomak treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...
  • Contributors: Ritvan Karahotza, Stella Markantonatou, Vivian Stamou, Antonis Anastasopoulos, Vasilis Sevetlidis, George Pavlides, Dimitris Karamatskos, Vasilis Arampatzakis
  • Repository master dev
  • README
  • Treebank hub page
  • Download

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Pontic treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Prakrit treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Romansh treebanks

Please add a summary section to the treebank readme file

 

Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Shipibo Konibo treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Sinhala treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Somali treebanks

Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Sorani treebanks

Please add a summary section to the treebank readme file

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Swahili treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Tatar treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Tigrinya treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Yakut treebanks

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

 

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Disclaimer: Our use of flags to symbolise languages is only intended as a visual enhancement of the website and should not be interpreted as a political statement in any way.

Download

The data is released through LINDAT/CLARIN.