UD Odia ODTB
Language: Odia (code: or)
Family: Indo-European
This treebank has been part of Universal Dependencies since the UD v2.16 release.
The following people have contributed to making this treebank part of UD: Shantipriya Parida, Kalyanamalini Sahoo, Atul Kr. Ojha, Saraswati Sahoo, Biswakalpita Mohapatra, Satya Ranjan Dash, Bijayalaxmi Dash, Kusum Lata.
Repository: UD_Odia-ODTB
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.17
License: CC BY-SA 4.0
Genre: nonfiction, news
Questions, comments? General annotation questions (either Odia-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [shantipriya • parida (æt) gmail • com, shashwatup9k (æt) gmail • com, sdashfca (æt) kiit • ac • in, biswakalpitamohapatra1 (æt) gmail • com, kalyanamalini • shabadi (æt) univ-lille • fr, sahoosaraswati455 (æt) gmail • com, rudrabijayalaxmi (æt) gmail • com, ranapoo (æt) gmail • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
| Annotation | Source |
|---|---|
| Lemmas | annotated manually |
| UPOS | annotated manually, natively in UD style |
| XPOS | annotated manually |
| Features | not available |
| Relations | annotated manually, natively in UD style |
Description
The Odia UD Treebank (ODTB) is part of the Universal Dependencies treebank project.
The ODTB project is a collaboration initiated by the institutions listed below to develop the first Universal Dependencies treebank for the Odia language, suitable for many natural language processing (NLP) tasks.
- AMD SiloAI, Finland
- University of Lille, France
- Insight Research Ireland Centre for Data Analytics, DSI, University of Galway, Galway, Ireland
- Institute of Mathematics & Applications, India
- KIIT University, India
- Ravenshaw University, India
- Sharda University, India
ODTB consists of 100 sentences. The data contains syntactic annotation according to a dependency-constituency schema, as well as morphological/POS tags and lemmas. XPOS is annotated according to the Bureau of Indian Standards (BIS) part-of-speech (POS) tagset.
Acknowledgments
We would like to acknowledge the support of Research Ireland as part of Grant Number SFI/12/RC/2289_P2 Insight_2, Insight Research Ireland Centre for Data Analytics, and of COST Action CA21167 UniDive, supported by COST (European Cooperation in Science and Technology). The authors thank John Bauer (Stanford University) for supporting the Odia Treebank project. The authors also thank Dr. Daniel Zeman (UFAL, Charles University) for his encouragement and support in building ODTB.
References
- If you use this data, please cite:
@inproceedings{parida-etal-2022-universal,
title = "{U}niversal {D}ependency Treebank for {O}dia Language",
author = "Parida, Shantipriya and
Shabadi, Kalyanamalini and
Ojha, Atul Kr. and
Sahoo, Saraswati and
Dash, Satya Ranjan and
Dash, Bijayalaxmi",
editor = "Jha, Girish Nath and
L., Sobha and
Bali, Kalika and
Ojha, Atul Kr.",
booktitle = "Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.wildre-1.15/",
pages = "84--89",
abstract = "This paper presents the first publicly available treebank of Odia, a morphologically rich low resource Indian language. The treebank contains approx. 1082 tokens (100 sentences) in Odia were selected from {\textquotedblleft}Samantar{\textquotedblright}, the largest available parallel corpora collection for Indic languages. All the selected sentences are manually annotated following the {\textquotedblleft}Universal Dependency{\textquotedblright} guidelines. The morphological analysis of the Odia treebank was performed using machine learning techniques. The Odia annotated treebank will enrich the Odia language resource and will help in building language technology tools for cross-lingual learning and typological research. We also build a preliminary Odia parser using a machine learning approach. The accuracy of the parser is 86.6{\%} Tokenization, 64.1{\%} UPOS, 63.78{\%} XPOS, 42.04{\%} UAS and 21.34{\%} LAS. Finally, the paper briefly discusses the linguistic analysis of the Odia UD treebank."
}
Statistics of UD Odia ODTB
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – VERB
Features
Relations
acl – advcl – advmod – amod – appos – aux – case – cc – ccomp – clf – compound – conj – cop – det – discourse – dislocated – flat – iobj – list – mark – nmod – nsubj – nummod – obj – obl – parataxis – punct – root – xcomp
Tokenization and Word Segmentation
- This corpus contains 100 sentences and 1029 tokens.
- This corpus contains 55 tokens (5%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 3 types of words that contain both letters and punctuation. Examples: କ’ଣ, ତା’ନିକଟରୁ, ତା’ପରେ
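Counts like the ones above can be derived directly from a CoNLL-U file. The following is a minimal sketch, assuming the standard ten-column CoNLL-U layout and no multiword tokens (so words and tokens coincide). The function name `tokenization_stats` is hypothetical, and the sample uses placeholder forms (`w1`, `w2`, ...) rather than actual ODTB sentences.

```python
# Toy CoNLL-U fragment with placeholder forms; this is NOT data from ODTB.
SAMPLE = """\
# sent_id = toy-1
1\tw1\tw1\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\tw2\tw2\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_

# sent_id = toy-2
1\tw3\tw3\tPRON\t_\t_\t2\tnsubj\t_\t_
2\tw4\tw4\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No
3\t?\t?\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""


def tokenization_stats(conllu_text):
    """Return (sentences, tokens, tokens not followed by a space)."""
    sentences = tokens = no_space = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False  # a blank line ends the current sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        # MISC (column 10) holds |-separated attributes such as SpaceAfter=No.
        if "SpaceAfter=No" in cols[9].split("|"):
            no_space += 1
    return sentences, tokens, no_space


print(tokenization_stats(SAMPLE))  # (2, 6, 2)
```

With the figures reported on this page, the share of tokens without a following space is `round(100 * 55 / 1029)`, which gives the stated 5%.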
Morphology
Tags
- This corpus uses 14 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB
- This corpus does not use the following tags: INTJ, SYM, X
- This corpus contains 13 word types tagged as particles (PART): ଅତି, ଆଦି, କରିବାକୁ, କି, ଗୋଟିଏ, ଜଣ, ଜଣେ, ନେଇ, ବି, ମଧ୍ଯ, ମଧ୍ୟ, ହିଁ, ହେତୁ
- This corpus contains 4 lemmas tagged as pronouns (PRON): _, ଏ, କୌଣସି, ସେ
- This corpus contains 6 lemmas tagged as determiners (DET): _, ଅନେକ, ଅନ୍ୟ, ଏହି, କୌଣସି, ଜଣେ
- Out of the above, 2 lemmas occurred sometimes as PRON and sometimes as DET: _, କୌଣସି
- This corpus contains 1 lemma tagged as an auxiliary (AUX): _
- Out of the above, 1 lemma occurred sometimes as AUX and sometimes as VERB: _
- This corpus does not use the VerbForm feature.
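The tag inventory and the PRON/DET overlap listed above are plain set operations. The sketch below recomputes them from the values given on this page; the variable names are illustrative only, and the `_` entry is kept exactly as listed (it presumably stands for tokens whose lemma field is empty, which is an assumption).

```python
# The 17 universal POS tags defined by UD.
UD_UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

# Tags reported as used in this corpus (copied from the list above).
USED = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "VERB",
}

unused = sorted(UD_UPOS - USED)

# Lemma lists copied from the PRON and DET bullets above.
PRON_LEMMAS = {"_", "ଏ", "କୌଣସି", "ସେ"}
DET_LEMMAS = {"_", "ଅନେକ", "ଅନ୍ୟ", "ଏହି", "କୌଣସି", "ଜଣେ"}

# Lemmas that occur under both tags.
shared = PRON_LEMMAS & DET_LEMMAS

print(len(USED), unused)  # 14 ['INTJ', 'SYM', 'X']
print(len(shared))        # 2
```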
Nominal Features
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop). Examples: _.
- This corpus uses 1 lemma as an auxiliary (aux). Examples: _.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (49)
  - VERB--NOUN-ADP(_) (1)
  - VERB--PRON (10)
- obj
  - VERB--NOUN (51)
  - VERB--NOUN-ADP(_) (3)
  - VERB--PRON (3)
- iobj
  - VERB--NOUN (7)
  - VERB--PRON (5)
  - VERB--PRON-ADP(_) (1)
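Pattern labels such as `VERB--NOUN-ADP(_)` can be read as: a verb parent, a noun child, and an adposition attached to that child whose lemma is shown in parentheses. The sketch below shows one plausible way to tally such patterns. It is an assumption about how the labels are built, not the actual UD statistics script; the function name `argument_patterns` and the toy sentence are hypothetical and contain no ODTB data.

```python
from collections import Counter

# Hypothetical toy sentence as (id, lemma, upos, head, deprel); NOT ODTB data.
TOY = [
    (1, "w1", "NOUN", 4, "nsubj"),
    (2, "w2", "NOUN", 4, "obj"),
    (3, "_", "ADP", 2, "case"),
    (4, "w4", "VERB", 0, "root"),
    (5, "w5", "PRON", 4, "iobj"),
]


def argument_patterns(sentence, relations=("nsubj", "obj", "iobj")):
    """Count (deprel, pattern) pairs for NOUN/PRON children of VERB parents."""
    by_id = {tok[0]: tok for tok in sentence}

    # Map each word id to the lemma of the first ADP attached to it via 'case'.
    adp_of = {}
    for tid, lemma, upos, head, deprel in sentence:
        if deprel == "case" and upos == "ADP":
            adp_of.setdefault(head, lemma)

    counts = Counter()
    for tid, lemma, upos, head, deprel in sentence:
        if deprel not in relations or upos not in ("NOUN", "PRON"):
            continue
        parent = by_id.get(head)
        if parent is None or parent[2] != "VERB":
            continue  # only verb parents are considered
        pattern = f"VERB--{upos}"
        if tid in adp_of:
            pattern += f"-ADP({adp_of[tid]})"
        counts[(deprel, pattern)] += 1
    return counts


for (deprel, pattern), n in sorted(argument_patterns(TOY).items()):
    print(deprel, pattern, n)
```

Summing the returned counters over all sentences of a treebank would give per-relation totals of the kind listed above.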