UD for Gujarati
Gujarati is an Indo-Aryan language originating from the western Indian state of Gujarat. It is spoken by over 56 million people and is one of the 22 languages with official status in India. Yet the Gujarati computational linguistics community is still in its infancy. Earlier literature places Gujarati in the “Scraping-Bys” category (category 1) of a language-resource taxonomy, indicating scant availability of labeled datasets.
Tokenization and Word Segmentation
Words are generally delimited by whitespace or punctuation in Gujarati. We plan to update this documentation with more details on tokenization and word segmentation in Gujarati in the upcoming release.
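As a rough illustration of whitespace- and punctuation-based splitting, the following Python sketch separates runs of Gujarati script from punctuation. It is not the tokenizer used to build the treebank; the `tokenize` name and the regular expression are assumptions made for this example. The explicit Unicode range is used because Python's `\w` does not match the combining vowel signs of the Gujarati block.

```python
import re

# Assumption for this sketch: a token is either a run of characters from the
# Gujarati Unicode block (U+0A80-U+0AFF, which includes vowel signs and the
# virama), a run of other word characters, or a single punctuation mark.
TOKEN_RE = re.compile(r"[\u0A80-\u0AFF]+|\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text on whitespace and detach punctuation from words."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    for token in tokenize("હું ઘરે જાઉં છું."):
        print(token)
```

A real pipeline would also need rules for clitics, abbreviations, and numerals, which this sketch does not attempt.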
Morphology
Gujarati morphology is agglutinative, with rich inflectional and derivational systems. The language has a complex system of verb conjugation, noun declension, and postpositions. We plan to update this documentation with more details on morphology in Gujarati in the upcoming release.
Features
- Morphological features are currently only partially included in the treebank. We plan to update the treebank with proper morphological annotations and features in the upcoming release.
We plan to update this documentation with more details on features in Gujarati in the upcoming release.
Syntax
Standard deprels are used, except for clf, which is not used in any treebank.
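The statement about relation usage can be checked on any CoNLL-U file by counting the values in the DEPREL column. The sketch below is one way to do that with the standard library only; the function name `deprel_counts` is hypothetical, and the sample sentence is a hand-written toy analysis for illustration, not data taken from the treebank.

```python
from collections import Counter


def deprel_counts(conllu_text: str) -> Counter:
    """Count universal relations (subtypes stripped) in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Keep only regular token lines: 10 columns and an integer ID.
        # This skips comments, blank lines, multiword ranges (1-2),
        # and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        counts[cols[7].split(":")[0]] += 1
    return counts


# Toy example (hand-annotated for illustration only): ID, FORM, UPOS, HEAD, DEPREL.
rows = [
    ("1", "હું", "PRON", "3", "nsubj"),
    ("2", "ઘરે", "NOUN", "3", "obl"),
    ("3", "જાઉં", "VERB", "0", "root"),
    ("4", "છું", "AUX", "3", "aux"),
    ("5", ".", "PUNCT", "3", "punct"),
]
SAMPLE = "\n".join(
    "\t".join([i, form, "_", upos, "_", "_", head, rel, "_", "_"])
    for i, form, upos, head, rel in rows
)

if __name__ == "__main__":
    counts = deprel_counts(SAMPLE)
    print(sorted(counts.items()))
    print("clf used:", "clf" in counts)
```

To run it on a real treebank file, read the file contents as UTF-8 and pass the resulting string to `deprel_counts`.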
- Gujarati is a head-final (left-branching) language.
- Adjectives precede nouns, direct objects come before verbs, and there are postpositions.
- The word order of Gujarati is SOV, and there are three genders and two numbers.
More details on syntax in Gujarati will be added in the upcoming release.
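The word-order properties listed above (adjectives before nouns, objects before verbs) can be measured on CoNLL-U data by checking how often a dependent with a given relation precedes its head. The following is a minimal sketch under stated assumptions: `precedes_head_ratio` is a hypothetical helper, and the sample sentence is a hand-written toy analysis, not a sentence from the treebank.

```python
from typing import Optional


def precedes_head_ratio(conllu_text: str, deprels: set) -> Optional[float]:
    """Fraction of tokens with one of the given relations that precede their head.

    Returns None when no token carries any of the requested relations.
    """
    before = total = 0
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Regular token lines only: 10 columns, integer ID, integer HEAD.
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        if cols[6] == "0" or cols[7].split(":")[0] not in deprels:
            continue
        total += 1
        if int(cols[0]) < int(cols[6]):
            before += 1
    return before / total if total else None


# Toy example (hand-annotated for illustration only): ID, FORM, UPOS, HEAD, DEPREL.
rows = [
    ("1", "મોટો", "ADJ", "2", "amod"),
    ("2", "છોકરો", "NOUN", "4", "nsubj"),
    ("3", "પુસ્તક", "NOUN", "4", "obj"),
    ("4", "વાંચે", "VERB", "0", "root"),
    ("5", "છે", "AUX", "4", "aux"),
    ("6", ".", "PUNCT", "4", "punct"),
]
SAMPLE = "\n".join(
    "\t".join([i, form, "_", upos, "_", "_", head, rel, "_", "_"])
    for i, form, upos, head, rel in rows
)

if __name__ == "__main__":
    print(precedes_head_ratio(SAMPLE, {"amod", "obj", "nsubj"}))
    print(precedes_head_ratio(SAMPLE, {"aux"}))
```

Because UD attaches function words to content heads, relations such as aux and case are expected to follow their heads in a language with postpositions and clause-final auxiliaries, so the ratio is best read per relation rather than over all tokens.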
Treebanks
There is 1 Gujarati UD treebank: