This page pertains to UD version 2.

UD Gujarati GujTB

Language: Gujarati (code: gu)
Family: IE

This treebank has been part of Universal Dependencies since the UD v2.14 release.

The following people have contributed to making this treebank part of UD: Maitrey Mehta, Mayank Jobanputra.

Repository: UD_Gujarati-GujTB
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.18

License: CC BY-SA 4.0

Genre: grammar-examples

Questions, comments? General annotation questions (either Gujarati-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [maitrey (æt) cs • utah • edu]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.

Annotation	Source
Lemmas	annotated manually
UPOS	annotated manually, natively in UD style
XPOS	not available
Features	not available
Relations	annotated manually, natively in UD style

Description

GujTB is an in-progress treebank of Gujarati (an Indo-Aryan language) in Gujarati script.

The treebank currently comprises 187 sentences, of which 100 are doubly annotated by the authors. We plan to add full morphological annotation and features in the upcoming release.

Acknowledgments

References

Please cite the following paper if you use this treebank in your research:

@inproceedings{jobanputra-etal-2024-universal,
title = "A {U}niversal {D}ependencies Treebank for {G}ujarati",
author = {Jobanputra, Mayank and
Mehta, Maitrey and
{\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i}},
editor = {Bhatia, Archna and
Bouma, Gosse and
Do{\u{g}}ru{\"o}z, A. Seza and
Evang, Kilian and
Garcia, Marcos and
Giouli, Voula and
Han, Lifeng and
Nivre, Joakim and
Rademaker, Alexandre},
booktitle = "Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.mwe-1.9",
pages = "56--62",
abstract = "The Universal Dependencies (UD) project has presented itself as a valuable platform to develop various resources for the languages of the world. We present and release a sample treebank for the Indo-Aryan language of Gujarati {--} a widely spoken language with little linguistic resources. This treebank is the first labeled dataset for dependency parsing in the language and the script (the Gujarati script). The treebank contains 187 part-of-speech and dependency annotated sentences from diverse genres. We discuss various idiosyncratic examples, annotation choices and present an elaborate corpus along with agreement statistics. We see this work as a valuable resource and a stepping stone for research in Gujarati Computational Linguistics.",
}

Statistics of UD Gujarati GujTB

POS Tags

ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X

Features

ExtPos – Typo

Relations

acl – acl:relcl – advcl – advcl:relcl – advmod – amod – appos – aux – case – cc – cc:preconj – ccomp – compound – compound:lvc – compound:svc – conj – cop – dep – det – discourse – dislocated – fixed – flat – goeswith – iobj – mark – nmod – nmod:poss – nmod:tmod – nsubj – nsubj:pass – nummod – obj – obl – obl:agent – obl:tmod – orphan – parataxis – punct – reparandum – root – vocative – xcomp

Tokenization and Word Segmentation

This corpus contains 187 sentences, 1801 tokens and 1885 syntactic words.

This corpus contains 360 tokens (20%) that are not followed by a space.

This corpus does not contain words with spaces.

This corpus contains 4 types of words that contain both letters and punctuation. Examples: "મરાયો, ડાબ-, ૩૫-એ, ‘જોકર’

This corpus contains 84 multi-word tokens. On average, one multi-word token consists of 2.00 syntactic words.
There are 80 types of multi-word tokens. Examples: આપવાની, એની, તેની, વધારાની, 2016ના, અઝહરુદ્દીનના, અમલીકરણનો, આવવાની, આસપાસના, ઉત્પત્તિના, એકનો, એના, એનો, કમિશનની, કરવાની, કરવાનો, કાશ્મિરના, કિર્ગિસ્તાનની, કોઈની, ગાંધીજીની, ગામના, ઘરની, જનતાના, જવાનોની, જીવનના, જ્ઞાનની, જ્વાળામુખીઓનો, જ્વાળામુખીનો, ટીમના, ટેસ્ટના, ડીઝલની, ડૉલરની, તામિલનાડુની, તેના, તેનું, તેનો, દૂરબીનની, નીતિના, પરિષદના, પાકિસ્તાનની, પીટરના, પૂછપરછના, પોતાની, પોતાનું, પોતાનો, પ્રકારની, પ્રધાનમંત્રીનું, ફુગનના, ફૂલોનું, ફોનના.
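
The counts above follow the CoNLL-U convention in which a multi-word token occupies one range line (for example ID `1-2`) followed by one line per syntactic word. As a rough illustration, the Python sketch below recomputes sentence, token, word and multi-word-token counts from CoNLL-U text. The function name `conllu_counts` and the toy sample (placeholder forms `AB`, `A`, `B`, `c`) are assumptions made for this example; they are not taken from GujTB or from the tooling that generated this page.

```python
def conllu_counts(text):
    """Return sentence, token, word, multi-word-token and no-space counts."""
    counts = {"sentences": 0, "tokens": 0, "words": 0, "mwts": 0, "no_space": 0}
    in_sentence = False
    covered_until = 0  # highest word ID covered by the latest multi-word token
    for line in text.splitlines():
        if not line.strip():           # blank line ends the sentence
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):       # sentence-level comment
            continue
        cols = line.split("\t")
        if not in_sentence:
            counts["sentences"] += 1
            in_sentence = True
        tok_id = cols[0]
        no_space = "SpaceAfter=No" in cols[9].split("|")
        if "." in tok_id:              # empty node (enhanced graph only)
            continue
        if "-" in tok_id:              # multi-word token range, e.g. "1-2"
            covered_until = int(tok_id.split("-")[1])
            counts["mwts"] += 1
            counts["tokens"] += 1
            counts["no_space"] += no_space
            continue
        counts["words"] += 1
        if int(tok_id) > covered_until:  # word that is its own surface token
            counts["tokens"] += 1
            counts["no_space"] += no_space
    return counts


# Hypothetical toy sentence: one multi-word token "AB" split into "A" + "B".
ROWS = [
    ("1-2", "AB", "_", "_", "_", "_", "_", "_", "_", "_"),
    ("1", "A", "a", "NOUN", "_", "_", "3", "nmod", "_", "_"),
    ("2", "B", "b", "ADP", "_", "_", "1", "case", "_", "_"),
    ("3", "c", "c", "NOUN", "_", "_", "0", "root", "_", "SpaceAfter=No"),
    ("4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"),
]
SAMPLE = "# text = AB c.\n" + "\n".join("\t".join(r) for r in ROWS) + "\n\n"

print(conllu_counts(SAMPLE))
```

The figures reported above are consistent with this convention: with 84 multi-word tokens of two syntactic words each, 1801 tokens correspond to 1801 + 84 = 1885 syntactic words, and 360 of 1801 tokens is about 20%.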

Morphology

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

ExtPos
- ADP
  - ADP: _, ના

Typo
- Yes
  - CCONJ: તેમ
  - NOUN: ઝગડા, પ્રેમ

Syntax

Auxiliary Verbs and Copula

This corpus uses 4 lemmas as copulas (cop). Examples: છે, _, થવું, ન.

This corpus uses 14 lemmas as auxiliaries (aux). Examples: છે, હતું, રહેવું, _, ગયું, શકવું, ન, આવવું, જોઈતું, દેવું, પડવું, શું, હોવું, થવું.

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

nsubj
- VERB--NOUN (39)
- VERB--PRON (52)

obj
- VERB--NOUN (75)
- VERB--PRON (15)
- VERB--PRON-ADP(ની) (1)

iobj
- VERB--NOUN (5)
- VERB--PRON (2)
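
A tally of this kind can be approximated by walking each dependency tree and counting parent and child UPOS pairs for the three core relations. The sketch below is one possible way to do this in Python, under stated assumptions: the helper names (`row`, `read_sentences`, `core_argument_pairs`) and the English toy sentence are invented for illustration, relation labels are matched exactly (so subtypes such as `nsubj:pass` are not folded into `nsubj`), and case-marked patterns such as `VERB--PRON-ADP(ની)` are not distinguished. The script that produced the numbers on this page may differ on all of these points.

```python
from collections import Counter

CORE_RELATIONS = {"nsubj", "obj", "iobj"}
NOMINAL_UPOS = {"NOUN", "PRON"}


def row(idx, form, upos, head, deprel):
    """Build one 10-column CoNLL-U word line (unused columns set to "_")."""
    return "\t".join([str(idx), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])


def read_sentences(text):
    """Yield one {word_id: (upos, head, deprel)} dict per sentence."""
    words = {}
    for line in text.splitlines():
        if not line.strip():
            if words:
                yield words
                words = {}
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multi-word token ranges and empty nodes
        head = int(cols[6]) if cols[6].isdigit() else None
        words[int(cols[0])] = (cols[3], head, cols[7])
    if words:
        yield words


def core_argument_pairs(text):
    """Count VERB parents with NOUN or PRON children per core relation."""
    counts = Counter()
    for words in read_sentences(text):
        for upos, head, deprel in words.values():
            if deprel not in CORE_RELATIONS or upos not in NOMINAL_UPOS:
                continue
            parent = words.get(head)
            if parent is not None and parent[0] == "VERB":
                counts[(deprel, f"VERB--{upos}")] += 1
    return counts


# Invented toy sentence, not from GujTB.
SAMPLE = "\n".join([
    "# text = She reads books .",
    row(1, "She", "PRON", 2, "nsubj"),
    row(2, "reads", "VERB", 0, "root"),
    row(3, "books", "NOUN", 2, "obj"),
    row(4, ".", "PUNCT", 2, "punct"),
    "",
])

for (relation, pattern), n in sorted(core_argument_pairs(SAMPLE).items()):
    print(relation, pattern, n)
```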

Relations Overview

This corpus uses 10 relation subtypes: acl:relcl, advcl:relcl, cc:preconj, compound:lvc, compound:svc, nmod:poss, nmod:tmod, nsubj:pass, obl:agent, obl:tmod.
The following 4 relation types are not used in this corpus at all: csubj, expl, clf, list.
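
Both statements can be checked against the relation inventory listed in the Relations section of this page. The short Python sketch below does so by comparing that inventory with the 37 universal relation types of UD v2; the variable names are chosen for this example only.

```python
# The 37 universal relation types defined in UD v2.
UNIVERSAL = set("""
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj
dep det discourse dislocated expl fixed flat goeswith iobj list mark nmod
nsubj nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split())

# Relation labels listed for GujTB in the Relations section of this page.
USED = """
acl acl:relcl advcl advcl:relcl advmod amod appos aux case cc cc:preconj
ccomp compound compound:lvc compound:svc conj cop dep det discourse
dislocated fixed flat goeswith iobj mark nmod nmod:poss nmod:tmod nsubj
nsubj:pass nummod obj obl obl:agent obl:tmod orphan parataxis punct
reparandum root vocative xcomp
""".split()

# A subtype is any label of the form "relation:subtype".
subtypes = sorted(label for label in USED if ":" in label)

# A universal relation counts as used if it appears bare or with a subtype.
base_types = {label.split(":")[0] for label in USED}
unused = sorted(UNIVERSAL - base_types)

print(len(subtypes), subtypes)
print(len(unused), unused)
```

Under this reading, a universal relation counts as used whenever it occurs either bare or with a subtype, which gives the same 10 subtypes and 4 unused relation types as stated above.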