This page pertains to UD version 2.

UD Sanskrit Vedic

Language: Sanskrit (code: sa)
Family: Indo-European, Indic

This treebank has been part of Universal Dependencies since the UD v2.6 release.

The following people have contributed to making this treebank part of UD: Salvatore Scarlata, Elia Ackermann, Oliver Hellwig, Erica Biagetti, Paul Widmer, Sven Sellmer.

Repository: UD_Sanskrit-Vedic
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.14

License: CC BY-SA 4.0

Genre: nonfiction

Questions, comments? General annotation questions (either Sanskrit-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [hellwig7 (æt) gmx • de]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation Source
Lemmas annotated manually in non-UD style, automatically converted to UD
UPOS assigned by a program, with some manual corrections, but not a full manual verification
XPOS annotated manually in non-UD style, automatically converted to UD
Features annotated manually in non-UD style, automatically converted to UD
Relations annotated manually, natively in UD style


The Treebank of Vedic Sanskrit contains 4,000 sentences with 27,000 words chosen from metrical and prose passages of the Ṛgveda (RV), the Śaunaka recension of the Atharvaveda (ŚS), the Maitrāyaṇīsaṃhitā (MS), and the Aitareya- (AB) and Śatapatha-Brāhmaṇas (ŚB).

Lexical and morpho-syntactic information was generated with tagging software and manually validated. POS tags were induced automatically from the morpho-syntactic information of each word.

Vedic Sanskrit is an ancient Indo-Aryan language, one of the oldest transmitted Indo-European languages and the precursor of Classical Sanskrit. The relatively large corpus of Vedic poetry and prose is critical for the reconstruction of the early linguistic history of Indo-European and important as a source for socio-cultural developments in South Asia during the second and first millennia BCE.

The composition of the Vedic treebank is motivated by the need for a resource that can be used for data-driven, quantitatively robust diachronic and synchronic investigations of linguistic phenomena in, and starting with, the oldest layers of Vedic Sanskrit.


Annotation and composition of this treebank are described in detail in the following paper:

author = {Hellwig, Oliver and Scarlata, Salvatore and Ackermann, Elia and Widmer, Paul},
title = {The Treebank of {Vedic Sanskrit}},
booktitle = {Proceedings of the LREC},
year = {2020}

An updated overview of the annotation procedure and coverage can be found here:

title = {Data-driven Dependency Parsing of {V}edic {S}anskrit},
author = {Hellwig, Oliver and Nehrdich, Sebastian and Sellmer, Sven},
journal = {Language Resources \& Evaluation},
volume = {57},
pages = {1173--1206},
year = {2023}

Train-test split

Following the UD recommendations, documents are kept together when generating the train, dev, and test splits. For the Vedic data, a “document” is a hymn (metrical texts) or a chapter (prose texts). Some of these documents are incomplete, meaning that only part of the chapter or hymn was annotated. This happens quite often with the Ṛgveda.
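
The idea of keeping documents together can be illustrated with a minimal Python sketch. It is not the procedure used to build this treebank: the function name `split_by_document`, the document identifiers, the split fractions, and the greedy assignment strategy are all assumptions made for illustration. The only point it demonstrates is that every sentence of a hymn or chapter ends up in the same split.

```python
import random
from collections import defaultdict


def split_by_document(sentences, dev_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole documents to train/dev/test (hypothetical helper).

    sentences: iterable of (doc_id, sentence) pairs, where doc_id names
    a hymn or chapter. The fractions are targets measured in sentences;
    they are only approximated because documents are never broken up.
    """
    by_doc = defaultdict(list)
    for doc_id, sentence in sentences:
        by_doc[doc_id].append(sentence)

    # Shuffle document ids reproducibly, then fill test and dev first.
    doc_ids = sorted(by_doc)
    random.Random(seed).shuffle(doc_ids)
    total = sum(len(sents) for sents in by_doc.values())

    splits = {"train": [], "dev": [], "test": []}
    assignment = {}
    for doc_id in doc_ids:
        if len(splits["test"]) < test_frac * total:
            target = "test"
        elif len(splits["dev"]) < dev_frac * total:
            target = "dev"
        else:
            target = "train"
        assignment[doc_id] = target
        splits[target].extend(by_doc[doc_id])
    return splits, assignment


if __name__ == "__main__":
    # Toy data: 10 invented documents with 5 placeholder sentences each.
    toy = [(f"doc{d}", f"sentence {d}-{i}") for d in range(10) for i in range(5)]
    splits, assignment = split_by_document(toy)
    for name, sents in splits.items():
        print(name, len(sents))
```

Because assignment happens per document rather than per sentence, the resulting split sizes only approximate the requested fractions, and partially annotated documents simply contribute fewer sentences to whichever split they land in.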


The annotation has been performed by Salvatore Scarlata, Oliver Hellwig, Elia Ackermann, Erica Biagetti, and Sven Sellmer.

