This page pertains to UD version 2.

Introduction

The Universal Dependencies (UD) project develops treebanks, collections of sentences annotated for word morphology and sentence syntax, for many languages. It seeks to build cross-linguistically consistent treebanks, with the goal of facilitating multilingual parser development, language analysis, psycholinguistics, and language typology. After more than a decade of development, UD provides treebanks (of varying size and quality) and annotation guidelines for nearly 200 human languages. While there is much more to do to improve the quality and typological distribution of the available resources, UD provides by far the largest body of consistently annotated morphosyntactic data for a wide range of human languages. UD is an open community effort, and we encourage anyone interested to contribute to it!

The general philosophy of UD is to propose a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages, while allowing for typological differences and using language-specific extensions when necessary. This is illustrated in the following parallel examples from English, Bulgarian, Czech, Swedish, and Kʼicheʼ. The main grammatical relations shown in blue (involving a passive verb, a nominal subject and an oblique agent) are the same, but the concrete grammatical realization varies.

# visual-style 4 2 nsubj:pass	color:blue
# visual-style 4 7 obl	color:blue
1	The	the	DET	_	Definite=Def|PronType=Art	2	det	_	_
2	dog	dog	NOUN	_	Gender=Neut|Number=Sing	4	nsubj:pass	_	_
3	was	be	AUX	_	Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin	4	aux:pass	_	_
4	chased	chase	VERB	_	Tense=Past|VerbForm=Part	0	root	_	_
5	by	by	ADP	_	_	7	case	_	_
6	the	the	DET	_	Definite=Def|PronType=Art	7	det	_	_
7	cat	cat	NOUN	_	Gender=Neut|Number=Sing	4	obl	_	_
8	.	.	PUNCT	_	_	4	punct	_	_

# visual-style 3 1 nsubj:pass	color:blue
# visual-style 3 5 obl	color:blue
1	Кучето	куче	NOUN	_	Definite=Def|Gender=Neut|Number=Sing	3	nsubj:pass	_	_
2	се	се	PRON	_	Case=Acc|PronType=Prs|Reflex=Yes	3	expl:pass	_	_
3	преследваше	преследвам	VERB	_	Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	0	root	_	_
4	от	от	ADP	_	_	5	case	_	_
5	котката	котка	NOUN	_	Definite=Def|Gender=Fem|Number=Sing	3	obl	_	_
6	.	.	PUNCT	_	_	3	punct	_	_

# visual-style 3 1 nsubj:pass	color:blue
# visual-style 3 4 obl	color:blue
1	Pes	pes	NOUN	_	Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing	3	nsubj:pass	_	_
2	byl	být	AUX	_	Aspect=Imp|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act	3	aux:pass	_	_
3	honěn	honit	VERB	_	Aspect=Imp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass	0	root	_	_
4	kočkou	kočka	NOUN	_	Case=Ins|Gender=Fem|Number=Sing	3	obl	_	_
5	.	.	PUNCT	_	_	3	punct	_	_

# visual-style 2 1 nsubj:pass	color:blue
# visual-style 2 4 obl	color:blue
1	Hunden	hund	NOUN	_	Definite=Def	2	nsubj:pass	_	_
2	jagades	jaga	VERB	_	Tense=Past|Voice=Pass	0	root	_	_
3	av	av	ADP	_	_	4	case	_	_
4	katten	katt	NOUN	_	Definite=Def	2	obl	_	_
5	.	.	PUNCT	_	_	2	punct	_	_

# visual-style 1 3 nsubj:pass	color:blue
# visual-style 1 6 obl	color:blue
1	Koqatax	_	VERB	_	_	0	root	_	_
2	ri	_	DET	_	Definite=Def	3	det	_	_
3	tz'i'	_	NOUN	_	_	1	nsubj:pass	_	_
4	kumal	_	NOUN	_	_	6	case	_	_
5	ri	_	DET	_	Definite=Def	6	det	_	_
6	me's	_	NOUN	_	_	1	obl	_	_
7	.	.	PUNCT	_	_	1	punct	_	_
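
The examples above use the tab-separated CoNLL-U layout, so the highlighted relations can be read off mechanically. The following is a minimal sketch, not the official UD tooling: the names `Token`, `parse_sentence`, `relations`, and `row` are hypothetical helpers introduced here for illustration, and the reader assumes well-formed lines with all ten columns present.

```python
from typing import Dict, List, NamedTuple, Set, Tuple

# The ten tab-separated CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


class Token(NamedTuple):
    id: int
    form: str
    upos: str
    head: int    # 0 means the token is the root of the sentence
    deprel: str


def parse_sentence(block: str) -> List[Token]:
    """Parse one CoNLL-U sentence, skipping comments and non-integer IDs."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = dict(zip(FIELDS, line.split("\t")))
        if not cols["id"].isdigit():  # multiword ranges ("1-2"), empty nodes ("1.1")
            continue
        tokens.append(Token(int(cols["id"]), cols["form"], cols["upos"],
                            int(cols["head"]), cols["deprel"]))
    return tokens


def relations(tokens: List[Token], wanted: Set[str]) -> Dict[str, Tuple[str, str]]:
    """Map each wanted relation label to a (head form, dependent form) pair."""
    form_of = {t.id: t.form for t in tokens}
    return {t.deprel: (form_of.get(t.head, "<root>"), t.form)
            for t in tokens if t.deprel in wanted}


def row(*cols: str) -> str:
    """Join column values with tabs (hypothetical helper for readability)."""
    return "\t".join(cols)


# The Swedish and Czech sentences from the examples above.
SWEDISH = "\n".join([
    "# visual-style 2 1 nsubj:pass\tcolor:blue",
    row("1", "Hunden", "hund", "NOUN", "_", "Definite=Def", "2", "nsubj:pass", "_", "_"),
    row("2", "jagades", "jaga", "VERB", "_", "Tense=Past|Voice=Pass", "0", "root", "_", "_"),
    row("3", "av", "av", "ADP", "_", "_", "4", "case", "_", "_"),
    row("4", "katten", "katt", "NOUN", "_", "Definite=Def", "2", "obl", "_", "_"),
    row("5", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
])

CZECH = "\n".join([
    row("1", "Pes", "pes", "NOUN", "_", "Case=Nom", "3", "nsubj:pass", "_", "_"),
    row("2", "byl", "být", "AUX", "_", "Voice=Act", "3", "aux:pass", "_", "_"),
    row("3", "honěn", "honit", "VERB", "_", "Voice=Pass", "0", "root", "_", "_"),
    row("4", "kočkou", "kočka", "NOUN", "_", "Case=Ins", "3", "obl", "_", "_"),
    row("5", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"),
])

WANTED = {"nsubj:pass", "obl"}
for name, block in (("Swedish", SWEDISH), ("Czech", CZECH)):
    print(name, relations(parse_sentence(block), WANTED))
```

Both sentences yield the same two relation labels even though Swedish marks the agent with a preposition and Czech with instrumental case. The Czech feature columns are abbreviated in this sketch because the reader ignores them.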

UD adopts a dependency representation of syntax, marking dependents of head words, and organizes sentence analysis around identifying clauses, nominals, and modifiers of these. It is a lexicalist framework that differentiates morphology from syntax, and it emphasizes dependencies between content words to increase cross-linguistic parallelism.
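
To make the head-dependent idea concrete, here is a small sketch over the English example sentence. It groups dependents under their heads and then keeps only the arcs whose dependent is a content word. The set `FUNCTION_UPOS` is an illustrative assumption chosen for this example, not an official UD classification, and the function names are hypothetical.

```python
from collections import defaultdict

# (id, form, upos, head, deprel) for "The dog was chased by the cat."
# Head 0 is the artificial root of the sentence.
ENGLISH = [
    (1, "The", "DET", 2, "det"),
    (2, "dog", "NOUN", 4, "nsubj:pass"),
    (3, "was", "AUX", 4, "aux:pass"),
    (4, "chased", "VERB", 0, "root"),
    (5, "by", "ADP", 7, "case"),
    (6, "the", "DET", 7, "det"),
    (7, "cat", "NOUN", 4, "obl"),
    (8, ".", "PUNCT", 4, "punct"),
]

# Assumption for this sketch: tags treated as function words or punctuation.
FUNCTION_UPOS = {"ADP", "AUX", "CCONJ", "DET", "PART", "SCONJ", "PUNCT"}


def dependents_by_head(tokens):
    """Group (deprel, form) pairs under the ID of their head word."""
    children = defaultdict(list)
    for _tok_id, form, _upos, head, deprel in tokens:
        children[head].append((deprel, form))
    return dict(children)


def content_arcs(tokens):
    """Return (head form, deprel, dependent form) for content-word dependents."""
    form_of = {tok_id: form for tok_id, form, *_ in tokens}
    form_of[0] = "ROOT"
    return [(form_of[head], deprel, form)
            for _tok_id, form, upos, head, deprel in tokens
            if upos not in FUNCTION_UPOS]


print(dependents_by_head(ENGLISH)[7])  # function words attached to "cat"
print(content_arcs(ENGLISH))           # skeleton linking content words
```

In this sentence the preposition and determiner hang off the noun "cat" rather than heading it, so the remaining skeleton connects "chased" directly to "dog" and "cat". That is the arrangement the parallel examples rely on when languages differ in how they mark the agent.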

The best introduction to the design of UD as a linguistic framework is de Marneffe et al. (2021). UD has been crafted to strike a delicate balance between six dimensions:

  1. UD needs to be satisfactory on linguistic analysis grounds for individual languages.
  2. UD needs to be good for linguistic typology, i.e., providing a suitable basis for bringing out parallelism across languages and language families.
  3. UD must be suitable for rapid, consistent annotation by a human annotator.
  4. UD must be easily comprehended and used by a non-linguist, whether a language learner or an engineer with prosaic needs for language processing. We refer to this as seeking a habitable design, and it leads us to favor traditional grammar notions and terminology.
  5. UD must be suitable for computer parsing with high accuracy.
  6. UD must support downstream language understanding tasks (relation extraction, reading comprehension, machine translation, …).

It’s easy to come up with a proposal that improves UD on one of these dimensions. The interesting and difficult part is to improve UD while remaining sensitive to all these dimensions.

Getting started using UD

On a practical level, the UD project specifies common data formats and maintains infrastructure for treebanks:

Project organization

UD is an open collaboration with many project members. The administrative structure is kept at a minimum and currently consists of the following:

List of contributors

History and publications