UD Gujarati GujTB
Language: Gujarati (code: gu)
Family: Indo-European
This treebank has been part of Universal Dependencies since the UD v2.14 release.
The following people have contributed to making this treebank part of UD: Maitrey Mehta, Mayank Jobanputra.
Repository: UD_Gujarati-GujTB
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15
License: CC BY-SA 4.0
Genre: grammar-examples
Questions, comments? General annotation questions (either Gujarati-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [maitrey (æt) cs • utah • edu]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
Annotation | Source |
---|---|
Lemmas | annotated manually |
UPOS | annotated manually, natively in UD style |
XPOS | not available |
Features | annotated manually, natively in UD style |
Relations | annotated manually, natively in UD style |
Description
GujTB is an in-progress treebank of Gujarati (an Indo-Aryan language) in Gujarati script.
The treebank currently comprises 187 sentences, of which 100 are doubly annotated by the authors. We plan to update the treebank with proper morphological annotations and features in the upcoming release.
Acknowledgments
References
Please cite the following paper if you use this treebank in your research:
@inproceedings{jobanputra-etal-2024-universal,
title = "A {U}niversal {D}ependencies Treebank for {G}ujarati",
author = {Jobanputra, Mayank and
Mehta, Maitrey and
{\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i}},
editor = {Bhatia, Archna and
Bouma, Gosse and
Do{\u{g}}ru{\"o}z, A. Seza and
Evang, Kilian and
Garcia, Marcos and
Giouli, Voula and
Han, Lifeng and
Nivre, Joakim and
Rademaker, Alexandre},
booktitle = "Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.mwe-1.9",
pages = "56--62",
abstract = "The Universal Dependencies (UD) project has presented itself as a valuable platform to develop various resources for the languages of the world. We present and release a sample treebank for the Indo-Aryan language of Gujarati {--} a widely spoken language with little linguistic resources. This treebank is the first labeled dataset for dependency parsing in the language and the script (the Gujarati script). The treebank contains 187 part-of-speech and dependency annotated sentences from diverse genres. We discuss various idiosyncratic examples, annotation choices and present an elaborate corpus along with agreement statistics. We see this work as a valuable resource and a stepping stone for research in Gujarati Computational Linguistics.",
}
Statistics of UD Gujarati GujTB
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X
Features
Case – Clusivity – Gender – Mood – Number – Polite – Typo – VerbType
Relations
acl – acl:relcl – advcl – advcl:relcl – advmod – amod – appos – aux – case – cc – cc:preconj – ccomp – compound – compound:lvc – compound:svc – conj – cop – dep – det – discourse – dislocated – fixed – flat – goeswith – iobj – mark – nmod – nmod:poss – nmod:tmod – nsubj – nsubj:pass – nummod – obj – obl – obl:agent – obl:tmod – orphan – parataxis – punct – reparandum – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 187 sentences, 1801 tokens and 1885 syntactic words.
- This corpus contains 360 tokens (20%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 4 types of words that contain both letters and punctuation. Examples: "મરાયો, ડાબ-, ૩૫-એ, ‘જોકર’
- This corpus contains 84 multi-word tokens. On average, one multi-word token consists of 2.00 syntactic words.
- There are 80 types of multi-word tokens. Examples: આપવાની, એની, તેની, વધારાની, 2016ના, અઝહરુદ્દીનના, અમલીકરણનો, આવવાની, આસપાસના, ઉત્પત્તિના, એકનો, એના, એનો, કમિશનની, કરવાની, કરવાનો, કાશ્મિરના, કિર્ગિસ્તાનની, કોઈની, ગાંધીજીની, ગામના, ઘરની, જનતાના, જવાનોની, જીવનના, જ્ઞાનની, જ્વાળામુખીઓનો, જ્વાળામુખીનો, ટીમના, ટેસ્ટના, ડીઝલની, ડૉલરની, તામિલનાડુની, તેના, તેનું, તેનો, દૂરબીનની, નીતિના, પરિષદના, પાકિસ્તાનની, પીટરના, પૂછપરછના, પોતાની, પોતાનું, પોતાનો, પ્રકારની, પ્રધાનમંત્રીનું, ફુગનના, ફૂલોનું, ફોનના.
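The figures above are consistent with one another: 1801 tokens plus one extra syntactic word for each of the 84 two-word multi-word tokens gives 1885 syntactic words, and 360 of 1801 tokens is about 20%. The following is a minimal Python sketch of how such counts can be derived from CoNLL-U data. It is not the script that generated this page; the function name `count_units` and the toy sample (placeholder Latin forms, not treebank data) are assumptions made for illustration.

```python
def count_units(conllu: str) -> dict:
    """Count sentences, surface tokens, syntactic words and multi-word tokens."""
    counts = {"sentences": 0, "tokens": 0, "words": 0, "mwts": 0, "no_space_after": 0}
    covered_until = 0    # highest word ID covered by the latest multi-word token
    in_sentence = False
    for line in conllu.splitlines():
        if not line.strip():              # a blank line ends a sentence
            in_sentence, covered_until = False, 0
            continue
        if line.startswith("#"):          # sentence-level comment
            continue
        cols = line.split("\t")           # CoNLL-U has 10 tab-separated columns
        if not in_sentence:
            counts["sentences"] += 1
            in_sentence = True
        tok_id, misc = cols[0], cols[9].split("|")
        if "." in tok_id:                 # empty node (e.g. 3.1): not counted
            continue
        if "-" in tok_id:                 # multi-word token range such as 1-2
            covered_until = int(tok_id.split("-")[1])
            counts["mwts"] += 1
            is_token = True
        else:
            counts["words"] += 1
            # a word inside a multi-word token is not a surface token itself
            is_token = int(tok_id) > covered_until
        if is_token:
            counts["tokens"] += 1
            counts["no_space_after"] += "SpaceAfter=No" in misc
    return counts


# Toy sample with placeholder forms (hypothetical, NOT taken from the treebank).
ROWS = [
    ["1-2", "ab", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["1", "a", "a", "PRON", "_", "_", "3", "nmod", "_", "_"],
    ["2", "b", "b", "ADP", "_", "_", "1", "case", "_", "_"],
    ["3", "c", "c", "NOUN", "_", "_", "0", "root", "_", "SpaceAfter=No"],
    ["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"],
]
SAMPLE = "# text = ab c.\n" + "\n".join("\t".join(r) for r in ROWS) + "\n"

print(count_units(SAMPLE))
```

In the toy sample, the range line `1-2` counts as one token and one multi-word token, while its two parts count only as syntactic words, so the sample has 3 tokens and 4 words.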
Morphology
Tags
- This corpus uses 17 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
- This corpus contains 9 word types tagged as particles (PART): એ, જ, તો, ન, ના, પણ, ફક્ત, યે, શ્રી
- This corpus contains 48 lemmas tagged as pronouns (PRON): _, અમે, આ, આપણને, આપણા, આપણાં, આપણે, એ, એકબીજાને, એને, એમના, એમને, કાંઈ, કોઇ, કોઈ, કોણે, ક્યારે, જે, જેમાં, જેમાંથી, તને, તમને, તમારા, તમારી, તમે, તારી, તું, તે, તેઓ, તેણી, તેણે, તેના, તેનાં, તેમણે, તેમની, તેમને, પેલી, પોતા, બંનેને, મને, મારા, મારે, મેં, શાની, શાને, શું, સૌથી, હું
- This corpus contains 9 lemmas tagged as determiners (DET): _, આ, એ, એક, કયા, કોઈ, જેમાં, તે, પેલા
- Out of the above, 6 lemmas occurred sometimes as PRON and sometimes as DET: _, આ, એ, કોઈ, જેમાં, તે
- This corpus contains 14 lemmas tagged as auxiliaries (AUX): _, આવવું, ગયું, છે, જોઈતું, થવું, દેવું, ન, પડવું, રહેવું, શકવું, શું, હતું, હોવું
- Out of the above, 3 lemmas occurred sometimes as AUX and sometimes as VERB: _, થવું, હતું
- This corpus does not use the VerbForm feature.
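The overlap counts above (lemmas tagged sometimes as PRON and sometimes as DET, or sometimes as AUX and sometimes as VERB) amount to a set intersection over lemma inventories per UPOS tag. A minimal sketch, assuming CoNLL-U input; the helper name `lemmas_by_upos` and the toy rows with English placeholder lemmas are hypothetical and do not come from the treebank.

```python
from collections import defaultdict


def lemmas_by_upos(conllu: str) -> dict:
    """Map each UPOS tag to the set of lemmas that occur with it."""
    lemmas = defaultdict(set)
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if cols[0].isdigit():             # skip multi-word ranges and empty nodes
            lemmas[cols[3]].add(cols[2])  # column 4 is UPOS, column 3 is LEMMA
    return lemmas


# Toy rows with placeholder lemmas (hypothetical, NOT treebank data).
ROWS = [
    ["1", "this", "this", "DET", "_", "_", "2", "det", "_", "_"],
    ["2", "book", "book", "NOUN", "_", "_", "0", "root", "_", "_"],
    ["", ],                               # blank line between sentences
    ["1", "this", "this", "PRON", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "works", "work", "VERB", "_", "_", "0", "root", "_", "_"],
    ["", ],
    ["1", "we", "we", "PRON", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "work", "work", "VERB", "_", "_", "0", "root", "_", "_"],
]
SAMPLE = "\n".join("\t".join(r) for r in ROWS) + "\n"

inventory = lemmas_by_upos(SAMPLE)
both = sorted(inventory["PRON"] & inventory["DET"])
print(both)
```

The same intersection with `inventory["AUX"] & inventory["VERB"]` gives the AUX/VERB overlap.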
Nominal Features
- Gender
  - Fem
    - NOUN: આશા, ઓળખ, દુનિયા, પોલીસ, _, ઘટના
  - Masc
    - NOUN: બંદોબસ્ત, વિસ્તારમાં
  - Neut
    - NOUN: આપનારું, કાર્યાલય, ગામમાં
- Number
  - Plur
    - NOUN: _, ઉપલબ્ધિઓ, પગલાંમાં, સદસ્યો, સવાલો
    - PRON: આપણને
  - Sing
    - NOUN: ઘટના
- Case
  - Abl
    - PROPN: દિલ્હીથી
  - Abs
    - NOUN: ચોપડી, ફળ
  - Acc
    - NOUN: વિમાનને
    - PRON: તેમને
  - All
    - NOUN: ઘરમાં
  - Cmp
    - PROPN: છગનથી
  - Dat
    - NOUN: ભારતને, લોકોને
    - PRON: આપણને
    - PROPN: મગનને, લિનાને
  - Erg
    - NOUN: પ્રધાનમંત્રીએ
    - PROPN: મિતાએ, ઐયરે, મોદીએ
  - Gen
    - NOUN: _
    - PRON: _
  - Loc
    - PROPN: અમદાવાદમાં, પુણેમાં, બિશ્કેકમાં, સમિટમાં
  - Nom
    - PROPN: રામ
  - Tem
    - NOUN: કાલે, અઠવાડિયામાં
  - Ter
    - NUM: 7એ
Degree and Polarity
Verbal Features
- Mood
  - Nec
    - AUX: પડયું
Pronouns, Determiners, Quantifiers
- Polite
  - Form
    - VERB: આવજે
Other Features
- Clusivity
  - In
    - PRON: આપણને
- Typo
  - Yes
    - ADJ: ઘાસવાળો
    - CCONJ: તેમ
    - NOUN: ઝગડા, પ્રેમ
- VerbType
  - Ideo
    - VERB: ડબુક
Syntax
Auxiliary Verbs and Copula
- This corpus uses 4 lemmas as copulas (cop). Examples: છે, _, થવું, ન.
- This corpus uses 14 lemmas as auxiliaries (aux). Examples: છે, હતું, રહેવું, _, ગયું, શકવું, ન, આવવું, જોઈતું, દેવું, પડવું, શું, હોવું, થવું.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
- VERB--NOUN (38)
- VERB--NOUN-Erg (1)
- VERB--PRON (52)
- obj
- VERB--NOUN (72)
- VERB--NOUN-Abs (2)
- VERB--NOUN-Acc (1)
- VERB--PRON (13)
- VERB--PRON-ADP(ની) (1)
- VERB--PRON-Acc (1)
- VERB--PRON-Dat (1)
- iobj
- VERB--NOUN (4)
- VERB--NOUN-Dat (1)
- VERB--PRON (2)
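The patterns listed above pair a verbal parent with a nominal or pronominal child and append the child's Case value when one is annotated. The sketch below shows one way to tabulate such patterns from CoNLL-U data. It is an illustration under stated assumptions, not the tool behind this page: the names `parse_sentences` and `core_patterns` are hypothetical, the toy rows use placeholder forms rather than treebank data, and the adposition marker seen in `VERB--PRON-ADP(ની)` is not reproduced.

```python
from collections import Counter

CORE = {"nsubj", "obj", "iobj"}          # exact labels only; subtypes are ignored


def parse_sentences(conllu: str):
    """Yield one {word_id: columns} dict per sentence."""
    sentence = {}
    for line in conllu.splitlines() + [""]:   # trailing "" flushes the last sentence
        if not line.strip():
            if sentence:
                yield sentence
            sentence = {}
        elif not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():             # skip multi-word ranges and empty nodes
                sentence[int(cols[0])] = cols


def core_patterns(conllu: str) -> Counter:
    """Count (relation, pattern) pairs for VERB parents with NOUN/PRON children."""
    patterns = Counter()
    for sent in parse_sentences(conllu):
        for cols in sent.values():
            upos, feats, deprel = cols[3], cols[5], cols[7]
            parent = sent.get(int(cols[6])) if cols[6].isdigit() else None
            if deprel not in CORE or upos not in {"NOUN", "PRON"}:
                continue
            if parent is None or parent[3] != "VERB":
                continue
            case = dict(f.split("=", 1) for f in feats.split("|") if "=" in f).get("Case")
            label = f"VERB--{upos}" + (f"-{case}" if case else "")
            patterns[(deprel, label)] += 1
    return patterns


# Toy rows with placeholder forms (hypothetical, NOT treebank data).
ROWS = [
    ["1", "A", "a", "NOUN", "_", "Case=Erg", "3", "nsubj", "_", "_"],
    ["2", "B", "b", "NOUN", "_", "_", "3", "obj", "_", "_"],
    ["3", "C", "c", "VERB", "_", "_", "0", "root", "_", "_"],
    [""],
    ["1", "D", "d", "PRON", "_", "Case=Dat", "2", "obj", "_", "_"],
    ["2", "E", "e", "VERB", "_", "_", "0", "root", "_", "_"],
]
SAMPLE = "\n".join("\t".join(r) for r in ROWS) + "\n"

for (relation, pattern), n in sorted(core_patterns(SAMPLE).items()):
    print(relation, pattern, n)
```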
Relations Overview
- This corpus uses 10 relation subtypes: acl:relcl, advcl:relcl, cc:preconj, compound:lvc, compound:svc, nmod:poss, nmod:tmod, nsubj:pass, obl:agent, obl:tmod
- The following 4 relation types are not used in this corpus at all: csubj, expl, clf, list
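Both statements can be checked against the relation list given under "Relations" earlier on this page. The short Python sketch below does so; the variable names are illustrative, `USED` is copied from that list, and `UNIVERSAL` is the inventory of 37 universal relations defined by UD v2.

```python
# The 37 universal dependency relations of UD v2.
UNIVERSAL = """
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj
dep det discourse dislocated expl fixed flat goeswith iobj list mark nmod
nsubj nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split()

# Relations used in this corpus, as listed in the "Relations" section.
USED = """
acl acl:relcl advcl advcl:relcl advmod amod appos aux case cc cc:preconj
ccomp compound compound:lvc compound:svc conj cop dep det discourse
dislocated fixed flat goeswith iobj mark nmod nmod:poss nmod:tmod nsubj
nsubj:pass nummod obj obl obl:agent obl:tmod orphan parataxis punct
reparandum root vocative xcomp
""".split()

# A subtype is any label of the form base:subtype.
subtypes = sorted(r for r in USED if ":" in r)

# A universal relation counts as used if it occurs bare or as the base of a subtype.
base_used = {r.split(":")[0] for r in USED}
unused = sorted(set(UNIVERSAL) - base_used)

print(len(subtypes))   # 10
print(unused)          # ['clf', 'csubj', 'expl', 'list']
```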