This page pertains to UD version 2.

UD Sanskrit Vedic

Language: Sanskrit (code: sa)
Family: Indo-European, Indic

This treebank has been part of Universal Dependencies since the UD v2.6 release.

The following people have contributed to making this treebank part of UD: Salvatore Scarlata, Elia Ackermann, Oliver Hellwig, Erica Biagetti, Paul Widmer.

Repository: UD_Sanskrit-Vedic
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.13

License: CC BY-SA 4.0

Genre: nonfiction

Questions, comments? General annotation questions (either Sanskrit-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [hellwig7 (æt) gmx • de]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation Source
Lemmas annotated manually in non-UD style, automatically converted to UD
UPOS assigned by a program, with some manual corrections, but not a full manual verification
XPOS annotated manually in non-UD style, automatically converted to UD
Features annotated manually in non-UD style, automatically converted to UD
Relations annotated manually, natively in UD style

Description

The Treebank of Vedic Sanskrit contains 4,000 sentences with 27,000 words chosen from metrical and prose passages of the Ṛgveda (RV), the Śaunaka recension of the Atharvaveda (ŚS), the Maitrāyaṇīsaṃhitā (MS), and the Aitareya- (AB) and Śatapatha-Brāhmaṇas (ŚB).

Lexical and morpho-syntactic information has been generated using tagging software and manually validated. POS tags have been induced automatically from the morpho-syntactic information of each word.

Vedic Sanskrit is an ancient Indo-Aryan language, one of the oldest transmitted Indo-European languages and the precursor of Classical Sanskrit. The relatively large corpus of Vedic poetry and prose is critical for the reconstruction of the early linguistic history of Indo-European and important as a source for socio-cultural developments in South Asia during the second and first millennia BCE.

The composition of the Vedic treebank is motivated by the need for a resource that can be used for data-driven, quantitatively robust diachronic and synchronic investigations of linguistic phenomena in, and starting with, the oldest layers of Vedic Sanskrit.

References

Annotation and composition of this treebank are described in detail in the following paper:

@inproceedings{hellwig-vtb-lrec-2020,
  author    = {Hellwig, Oliver and Scarlata, Salvatore and Ackermann, Elia and Widmer, Paul},
  title     = {The Treebank of {Vedic Sanskrit}},
  booktitle = {Proceedings of the LREC},
  year      = {2020}
}

Train-test split

Following the UD recommendations, documents are kept together when generating the train and test splits. For all texts but the ŚB, a "document" means a hymn (metrical texts) or a chapter (prose texts). As only a few, but rather long, chapters of the ŚB have been annotated, "documents" in the ŚB are text lines separated by double dandas; in most cases each such line contains more than one sentence.
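The idea of keeping documents together can be illustrated with a minimal Python sketch. It is not the procedure actually used for this treebank: the function name `split_by_document`, the `(doc_id, sentence)` input format, the 20% test fraction, and the random seed are all assumptions made for illustration. The sketch only shows that whole documents, never individual sentences, are assigned to one side of the split.

```python
import random
from collections import defaultdict


def split_by_document(sentences, test_fraction=0.2, seed=0):
    """Split (doc_id, sentence) pairs so that no document is divided.

    Hypothetical helper: doc_id stands for a hymn, a chapter, or
    (for the SB) a text line delimited by double dandas.
    """
    # Group sentences by the document they belong to.
    by_doc = defaultdict(list)
    for doc_id, sentence in sentences:
        by_doc[doc_id].append(sentence)

    # Shuffle document ids reproducibly.
    doc_ids = sorted(by_doc)
    random.Random(seed).shuffle(doc_ids)

    # Fill the test set document by document until the target size is reached.
    total = sum(len(sents) for sents in by_doc.values())
    target = test_fraction * total
    train, test, test_count = [], [], 0
    for doc_id in doc_ids:
        pairs = [(doc_id, s) for s in by_doc[doc_id]]
        if test_count < target:
            test.extend(pairs)
            test_count += len(pairs)
        else:
            train.extend(pairs)
    return train, test
```

Because assignment happens per document, the realised test fraction only approximates the requested one; with a few long documents the deviation can be large, which is one plausible reason for using smaller units for the ŚB.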

Acknowledgments

The annotation has been performed by Salvatore Scarlata, Oliver Hellwig, Elia Ackermann, and Erica Biagetti.

Statistics of UD Sanskrit Vedic

POS Tags

ADJ, ADV, AUX, CCONJ, DET, NOUN, NUM, PART, PRON, SCONJ, VERB

Features

Case, Gender, Mood, Number, Person, Tense, VerbForm, Voice

Relations

acl, advcl, advmod, amod, appos, aux, case, cc, ccomp, compound, conj, cop, csubj, det, discourse, dislocated, fixed, flat, iobj, mark, nmod, nsubj, nummod, obj, obl, orphan, parataxis, root, vocative, xcomp

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).
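As a rough illustration of this restriction, the following Python sketch counts dependency relations in CoNLL-U data, keeping only edges whose parent is tagged VERB and whose child is tagged NOUN or PRON. The function name `verb_nominal_relations` and the four-token sample with placeholder forms (`w1` to `w4`) are hypothetical; the sample is not taken from the treebank, and the sketch is not the script that produced the statistics on this page.

```python
from collections import Counter


def verb_nominal_relations(conllu_text):
    """Count deprels for NOUN/PRON children attached to VERB parents."""
    counts = Counter()
    for block in conllu_text.strip().split("\n\n"):
        rows = {}
        for line in block.splitlines():
            if not line or line.startswith("#"):
                continue  # skip comment and blank lines
            cols = line.split("\t")
            # Keep regular tokens only (no multiword ranges or empty nodes).
            if len(cols) != 10 or not cols[0].isdigit():
                continue
            rows[cols[0]] = cols
        for cols in rows.values():
            upos, head_id, deprel = cols[3], cols[6], cols[7]
            head = rows.get(head_id)  # None for the root (HEAD = 0)
            if head and head[3] == "VERB" and upos in {"NOUN", "PRON"}:
                counts[deprel] += 1
    return counts


# Made-up sample: columns are ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.
SAMPLE = "\n".join(
    "\t".join(row)
    for row in [
        ("1", "w1", "_", "NOUN", "_", "_", "2", "nsubj", "_", "_"),
        ("2", "w2", "_", "VERB", "_", "_", "0", "root", "_", "_"),
        ("3", "w3", "_", "NOUN", "_", "_", "2", "obj", "_", "_"),
        ("4", "w4", "_", "ADJ", "_", "_", "3", "amod", "_", "_"),
    ]
)

counts = verb_nominal_relations(SAMPLE)
```

On the sample, only the `nsubj` and `obj` edges are counted: `amod` has a noun parent, and `root` has no parent token.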

Relations Overview