
This page pertains to UD version 2.


MISC attributes in CoNLL-U

The tenth column in the CoNLL-U format is labeled MISC, standing for “miscellaneous”. It is intended for any additional annotation that data providers want to store at the token level. With the exception of the SpaceAfter attribute, its contents are optional from the UD perspective. Nevertheless, certain types of annotation are used in multiple UD treebanks, and it is desirable that they are annotated in the same fashion as much as possible. This page serves as a notice board to raise awareness about MISC attributes that already exist, their form and purpose. If other treebanks add annotations of a kind described here, it is recommended that they use the same attribute names and values.

Basic format

A single underscore (“_”) in MISC signals that there is no extra annotation. The column cannot be empty and it cannot contain certain characters (TAB, CR, LF, other control characters). It can contain spaces (“ ”) but it cannot start or end with a space.
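The constraints above can be checked mechanically. The following is a minimal sketch, not the official validator: the function name `misc_problems` is hypothetical, and it assumes that "control characters" means the Unicode category `Cc` (which includes TAB, CR and LF).

```python
import unicodedata


def misc_problems(misc: str) -> list:
    """Return a list of format problems found in a raw MISC column value."""
    if misc == "":
        # An empty column is not allowed; "_" marks "no annotation".
        return ["MISC is empty; use '_' when there is no annotation"]
    problems = []
    if misc[0] == " " or misc[-1] == " ":
        problems.append("MISC starts or ends with a space")
    # Assumption: Unicode category "Cc" stands in for "control characters".
    if any(unicodedata.category(ch) == "Cc" for ch in misc):
        problems.append("MISC contains a control character")
    return problems
```

An empty list means that none of the three checks fired; internal spaces, as in `Gloss=in order`, are accepted.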

The vertical bar (“|”) is interpreted as the separator of individual MISC annotations where applicable, so it is not recommended to use it unescaped inside an annotation. Nevertheless, a CoNLL-U file is not considered invalid if MISC contains, for example, multiple consecutive vertical bars (“|||”) or a leading or trailing “|”.

It is recommended that individual annotations separated by vertical bars are Attribute=Value pairs, similar to the FEATS column of CoNLL-U. Attribute names normally consist of English letters and follow the “CamelCase” convention: the name starts with an uppercase letter, each new word or segment within the name also starts with an uppercase letter, and all other letters are lowercase. However, it is not forbidden to have a “|”-delimited annotation that does not start with an attribute name, does not contain “=”, or is even empty (as long as the whole MISC is not empty). Unlike in FEATS, attributes do not have to be sorted alphabetically, and the same attribute may occur multiple times (with the same or with different values) if it makes sense (but it rarely does). Note that tools processing CoNLL-U data may process some MISC annotations and leave others intact; however, it may not be obvious what “leaving intact” means if you have unnamed attributes, or multiple instances of the same attribute where the order of the instances is significant for you. It is thus safer to avoid such practices.
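As an illustration of the rules in this section, here is a small sketch of a tolerant MISC reader. The function name `parse_misc` is hypothetical. It returns an ordered list rather than a dictionary so that unnamed annotations, empty annotations and repeated attributes all survive in their original order.

```python
from __future__ import annotations


def parse_misc(misc: str) -> list[tuple[str | None, str]]:
    """Split a MISC value into (attribute, value) pairs, preserving order."""
    if misc == "_":
        # A single underscore means there is no extra annotation.
        return []
    pairs = []
    for item in misc.split("|"):
        if "=" in item:
            # Split on the first "=" only; the value may itself contain "=".
            name, value = item.split("=", 1)
            pairs.append((name, value))
        else:
            # Unnamed or empty annotation: keep it, with no attribute name.
            pairs.append((None, item))
    return pairs
```

For example, `parse_misc("Entity=142)|SpaceAfter=No")` yields `[("Entity", "142)"), ("SpaceAfter", "No")]`.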

Mandatory MISC attribute: SpaceAfter

Survey of optional attributes

In contrast to FEATS features, the MISC attributes are less standardized. Some are widely used; others are experimental or only included in a small number of treebanks. We encourage treebank developers to review the comprehensive list of MISC attributes and values occurring in data, which can help identify errors. A table linked to searchable instances can be obtained as well.

Attributes documented on this page

A selection of the MISC attributes in use in UD treebanks is summarized below. They are grouped thematically in the following table, then given in alphabetical order with brief documentation. More attributes may be added to this documentation in the future.

Surface format		Data language/source
`Correct{Feature}`	`NewPar`	`Gloss`	`MGloss`	`Vform`
`CorrectForm`	`SpacesAfter`	`Lang`	`OrigLang`
`CorrectSpaceAfter`	`SpacesBefore`	`LGloss`	`Ref`
`ModernForm`	`XML`	`LTranslit`	`Translit`
Morphology		Tense/aspect/modality/polarity
`Analysis` = `Morf`	`LNumValue`	`SType`	`TraditionalTense`
`LDeriv`	`MSeg`	`Tam`	`Vib`
`LId`	`Root`	`TraditionalMood`
Multi-word expressions/named entities		Discourse/coreference		Syntactic relations/constructions
`MWE`	`NamedEntity`	`Bridge`	`PDTB`	`Reduplication`
`MWEPOS`	`Proper`	`Discourse`	`Split`	`Subject`
		`Entity`

Alphabetical list of features

Analysis

See Morf. Used currently in Yupik, the Analysis attribute conveys the kind of information that other treebanks store in the Morf attribute. The two names should be merged across treebanks and languages!

Bridge

Used in conjunction with Entity to indicate bridging anaphora, by creating a pointing relation between two coreference GRP identifiers:

# sent_id = GUM_bio_gordon-32
1	An	a	DET	DT	Definite=Ind|PronType=Art	6	det	_	Entity=(142-abstract
2	incomplete	incomplete	ADJ	JJ	Degree=Pos|Polarity=Neg	6	amod	_	_
3	and	and	CCONJ	CC	_	4	cc	_	_
4	faulty	faulty	ADJ	JJ	Degree=Pos	2	conj	_	_
5	German	German	ADJ	JJ	Degree=Pos	6	amod	_	_
6	translation	translation	NOUN	NN	Number=Sing	21	nsubj:pass	_	SpaceAfter=No
7	,	,	PUNCT	,	_	8	punct	_	_
8	edited	edit	VERB	VBN	Tense=Past|VerbForm=Part	6	acl	_	_
9	by	by	ADP	IN	_	10	case	_	_
10	Dr	Dr	PROPN	NNP	Number=Sing	8	obl	_	Entity=(143-person
11	Moritz	Moritz	PROPN	NNP	Number=Sing	10	flat	_	_
12	Posselt	Posselt	PROPN	NNP	Number=Sing	10	flat	_	Entity=142)143)
13	(	(	PUNCT	-LRB-	_	18	punct	_	SpaceAfter=No
14	Tagebuch	Tagebuch	X	FW	_	18	compound	_	Entity=(142-abstract
15	des	des	X	FW	_	18	compound	_	_
16	Generals	Generals	X	FW	_	18	compound	_	_
17	Patrick	Patrick	PROPN	NNP	Number=Sing	18	compound	_	_
18	Gordon	Gordon	PROPN	NNP	Number=Sing	6	appos	_	Entity=142)|SpaceAfter=No
19	)	)	PUNCT	-RRB-	_	18	punct	_	_
20	was	be	AUX	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	21	aux:pass	_	_
21	published	publish	VERB	VBN	Tense=Past|VerbForm=Part	0	root	_	SpaceAfter=No
22	,	,	PUNCT	,	_	25	punct	_	_
23	the	the	DET	DT	Definite=Def|PronType=Art	25	det	_	Entity=(144-abstract|Bridge=142<144
24	first	first	ADJ	JJ	Degree=Pos|NumType=Ord	25	amod	_	_
25	volume	volume	NOUN	NN	Number=Sing	21	parataxis	_	Entity=144)
26	at	at	ADP	IN	_	27	case	_	_
27	Moscow	Moscow	PROPN	NNP	Number=Sing	25	orphan	_	Entity=(95-place)
28	in	in	ADP	IN	_	29	case	_	_
29	1849	1849	NUM	CD	NumType=Card	25	orphan	_	Entity=(145-time)|SpaceAfter=No
30	,	,	PUNCT	,	_	32	punct	_	_
31	the	the	DET	DT	Definite=Def|PronType=Art	32	det	_	Entity=(146-abstract|Bridge=142<146
32	second	second	ADJ	JJ	Degree=Pos|NumType=Ord	25	conj	_	Entity=146)
33	at	at	ADP	IN	_	34	case	_	_
34	St	St	PROPN	NNP	Number=Sing	32	orphan	_	Entity=(147-place
35	Petersburg	Petersburg	PROPN	NNP	Number=Sing	34	flat	_	Entity=147)

Here “the first” (entity number 144) and “the second” (entity number 146) are volumes of a “translation” (entity number 142), hence we have Bridge=142<144 and Bridge=142<146, indicating that the identity of 144 and 146 is resolvable by reference to entity 142. For more information, see the Entity notation section and the documentation from the Universal Anaphora format specifications.

Correct{FEATURE}

For instance: CorrectCase, CorrectDegree, CorrectGender, CorrectMood, CorrectNumber, CorrectPerson, CorrectTense, CorrectVerbForm…

See also CorrectForm and CorrectSpaceAfter.

Shows the value of a morphological feature that would correspond to the correct form if a typo in the underlying text is fixed (while the actual value of the feature in FEATS should correspond to the actual form that appears in the text, as described in the guidelines for typos).

# text = The cars is produced in Detroit.
 The        the       DET     _   Definite=Def|PronType=Art                               2   det     _   _
 cars       car       NOUN    _   Number=Plur                                             4   nsubj   _   _
 is         be        AUX     _   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4   aux     _   CorrectForm=are|CorrectNumber=Plur
 produced   produce   VERB    _   Tense=Past|VerbForm=Part                                0   root    _   _
 in         in        ADP     _   _                                                       6   case    _   _
 Detroit    Detroit   PROPN   _   Number=Sing                                             4   obl     _   SpaceAfter=No
 .          .         PUNCT   _   _                                                       4   punct   _   _
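To make the relationship between FEATS and the Correct{FEATURE} attributes concrete, the sketch below derives the feature string that the corrected form would carry. The function name `corrected_feats` is hypothetical; the code assumes that every MISC attribute whose name starts with `Correct`, other than CorrectForm and CorrectSpaceAfter, names a morphological feature.

```python
def corrected_feats(feats: str, misc: str) -> str:
    """Return the FEATS string the token would have if the typo were fixed."""
    features = {}
    if feats != "_":
        for item in feats.split("|"):
            name, _, value = item.partition("=")
            features[name] = value
    # These two attributes share the prefix but do not describe features.
    not_features = {"CorrectForm", "CorrectSpaceAfter"}
    prefix = "Correct"
    for item in misc.split("|"):
        name, sep, value = item.partition("=")
        if (sep and name.startswith(prefix) and len(name) > len(prefix)
                and name not in not_features):
            # CorrectNumber=Plur overrides (or adds) Number=Plur.
            features[name[len(prefix):]] = value
    if not features:
        return "_"
    # FEATS is sorted alphabetically, ignoring case.
    ordered = sorted(features.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)
```

Applied to the token “is” in the example above, the result is `Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin`, i.e. the features of “are”.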

CorrectForm

Shows the expected correct word form when there is a typo in the underlying text. The FORM column contains the form from the text including the error, and the FEATS column contains Typo=Yes, as described in the guidelines for typos.

# text = I have two kats.
I	I	PRON	_	_	2	nsubj	_	_
have	have	VERB	_	_	0	root	_	_
two	two	NUM	_	_	4	nummod	_	_
kats	cat	NOUN	_	Typo=Yes	2	obj	_	CorrectForm=cats|SpaceAfter=No
.	.	PUNCT	_	_	2	punct	_	_

CorrectSpaceAfter

See also CorrectForm, Correct{FEATURE} and SpaceAfter.

CorrectSpaceAfter=Yes indicates that a space between two tokens is missing by error (hence it accompanies a SpaceAfter=No). CorrectSpaceAfter=No indicates that a space should not be there (e.g., before a period; it cannot occur together with SpaceAfter=No). More details are discussed in the guidelines for typos.

# text = This spellingis wrong .
This	this	DET	_	_	2	det	_	_
spelling	spelling	NOUN	_	Number=Sing	4	nsubj	_	SpaceAfter=No|CorrectSpaceAfter=Yes
is	be	AUX	_	_	4	cop	_	_
wrong	wrong	ADJ	_	_	0	root	_	CorrectSpaceAfter=No
.	.	PUNCT	_	_	4	punct	_	_
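The interplay of SpaceAfter, CorrectSpaceAfter and CorrectForm can be sketched as a small detokenizer that produces either the text as written or the text with the annotated corrections applied. The function name `detokenize` and the `(form, misc)` input shape are assumptions made for this illustration; multiword tokens and SpacesAfter are ignored.

```python
def detokenize(tokens, corrected=False):
    """Join (form, misc) pairs into a text string.

    corrected=False reproduces the text as written;
    corrected=True applies CorrectForm and CorrectSpaceAfter.
    """
    pieces = []
    for form, misc in tokens:
        attrs = dict(item.split("=", 1) for item in misc.split("|") if "=" in item)
        # By default a space follows every token unless SpaceAfter=No.
        space = attrs.get("SpaceAfter") != "No"
        if corrected:
            form = attrs.get("CorrectForm", form)
            if "CorrectSpaceAfter" in attrs:
                space = attrs["CorrectSpaceAfter"] == "Yes"
        pieces.append(form + (" " if space else ""))
    return "".join(pieces).rstrip()


tokens = [
    ("This", "_"),
    ("spelling", "SpaceAfter=No|CorrectSpaceAfter=Yes"),
    ("is", "_"),
    ("wrong", "CorrectSpaceAfter=No"),
    (".", "_"),
]
print(detokenize(tokens))                  # This spellingis wrong .
print(detokenize(tokens, corrected=True))  # This spelling is wrong.
```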

Discourse

This annotation is used to indicate discourse relations between discourse units, which may or may not span whole sentences. At the beginning of each elementary discourse unit (EDU), the annotation Discourse gives the discourse function of the unit beginning with that token, followed by a colon, the ID of the current unit, and an arrow pointing to the ID of the parent unit in the discourse parse.

For instance, Discourse=purpose:105->104:0 at token 21 in the example below means that this token begins discourse unit 105, which functions as a purpose to unit 104, which begins at token 1 in this sentence (“Padalecki partnered with co-star Jensen Ackles –purpose-> to release a shirt…”). In relations derived from hierarchical discourse trees, as in UD_English-GUM, an additional number follows a second colon: the final :0 indicates that the attachment has a depth of 0, that is, there is no intervening span in the original RST constituent tree (this information allows deterministic reconstruction of the RST constituent discourse tree from the CoNLL-U file).

1	For	for	ADP	IN	_	4	case	4:case	Discourse=sequence_m:104->98:2
2	the	the	DET	DT	Definite=Def|PronType=Art	4	det	4:det	_
3	second	second	ADJ	JJ	Degree=Pos|NumType=Ord	4	amod	4:amod	_
4	campaign	campaign	NOUN	NN	Number=Sing	16	obl	16:obl:for	_
5	in	in	ADP	IN	_	10	case	10:case	_
6	the	the	DET	DT	Definite=Def|PronType=Art	10	det	10:det	_
7	Always	Always	ADV	NNP	Number=Sing	8	advmod	8:advmod	_
8	Keep	Keep	PROPN	NNP	Number=Sing	10	compound	10:compound	_
9	Fighting	Fighting	PROPN	NNP	Number=Sing	8	xcomp	8:xcomp	_
10	series	series	NOUN	NN	Number=Sing	4	nmod	4:nmod:in	_
11	in	in	ADP	IN	_	12	case	12:case	_
12	April	April	PROPN	NNP	Number=Sing	4	nmod	4:nmod:in	_
13	2015	2015	NUM	CD	NumForm=Digit|NumType=Card	12	nmod:tmod	12:nmod:tmod	SpaceAfter=No
14	,	,	PUNCT	,	_	4	punct	4:punct	_
15	Padalecki	Padalecki	PROPN	NNP	Number=Sing	16	nsubj	16:nsubj	_
16	partnered	partner	VERB	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	0	root	0:root	_
17	with	with	ADP	IN	_	18	case	18:case	_
18	co-star	co-star	NOUN	NN	Number=Sing	16	obl	16:obl:with	_
19	Jensen	Jensen	PROPN	NNP	Number=Sing	18	appos	18:appos	_
20	Ackles	Ackles	PROPN	NNP	Number=Sing	19	flat	19:flat	_
21	to	to	PART	TO	_	22	mark	22:mark	Discourse=purpose:105->104:0
22	release	release	VERB	VB	VerbForm=Inf	16	advcl	16:advcl:to	_
23	a	a	DET	DT	Definite=Ind|PronType=Art	24	det	24:det	_
24	shirt	shirt	NOUN	NN	Number=Sing	22	obj	22:obj	_
25	featuring	feature	VERB	VBG	VerbForm=Ger	24	acl	24:acl	Discourse=elaboration:106->105:0
26	both	both	DET	DT	_	25	obj	25:obj	_
27	of	of	ADP	IN	_	29	case	29:case	_
28	their	their	PRON	PRP$	Number=Plur|Person=3|Poss=Yes|PronType=Prs	29	nmod:poss	29:nmod:poss	_
29	faces	face	NOUN	NNS	Number=Plur	26	nmod	26:nmod:of	SpaceAfter=No

The unique ROOT node of the discourse tree has no arrow notation, e.g. Discourse=ROOT:2:0 means that this token begins unit 2, which is the Central Discourse Unit (or discourse root) of the current document.
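The value format described in this section can be read with a single regular expression. This is a sketch under stated assumptions: the names `DISCOURSE_RE` and `parse_discourse` are hypothetical, unit IDs and depths are assumed to be non-negative integers, and any further colon-separated material after the depth is kept as an uninterpreted string.

```python
import re

# relation:unit[->parent][:depth][:anything else]
DISCOURSE_RE = re.compile(
    r"^(?P<relation>[^:]+):(?P<unit>\d+)"
    r"(?:->(?P<parent>\d+))?"
    r"(?::(?P<depth>\d+))?"
    r"(?::(?P<extra>.*))?$"
)


def parse_discourse(value):
    """Parse a Discourse=... value into its components."""
    m = DISCOURSE_RE.match(value)
    if m is None:
        raise ValueError(f"unrecognized Discourse value: {value!r}")
    return {
        "relation": m["relation"],
        "unit": int(m["unit"]),
        # The discourse root has no arrow, hence no parent.
        "parent": int(m["parent"]) if m["parent"] is not None else None,
        "depth": int(m["depth"]) if m["depth"] is not None else None,
        "extra": m["extra"],
    }
```

With this sketch, `parse_discourse("purpose:105->104:0")` gives unit 105 attached to parent 104 at depth 0, while `parse_discourse("ROOT:2:0")` gives unit 2 with `parent` set to `None`.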

Entity

This annotation is used to encode entity types and, if available, entity linking, coreference information, and other information about entities. The span of tokens encompassed by each entity mention is indicated by a pair of Entity annotations in the MISC field, which begin and end the entity span using opening and closing round brackets (or both, for single-token entities). When multiple pieces of information are given for each entity, the values are separated by -, and the key names for these annotations are specified once in a # global.Entity comment at the beginning of the document, in the order in which they appear for each entity. A basic example can look like this, with three keys declared: a coreference group ID GRP, the entity type entity and an entity linking identifier identity:

# newdoc id = GUM_voyage_tulsa
# global.Entity = GRP-entity-identity
1	Tulsa	_	_	_	_	_	_	_	Entity=(1-place-Tulsa)
2	Tulsa	_	_	_	_	_	_	_	Entity=(1-place-Tulsa)
3	is	_	_	_	_	_	_	_	_
4	in	_	_	_	_	_	_	_	_
5	the	_	_	_	_	_	_	_	Entity=(2-place-Green_Country
6	Green	_	_	_	_	_	_	_	_
7	Country	_	_	_	_	_	_	_	_
8	region	_	_	_	_	_	_	_	_
9	of	_	_	_	_	_	_	_	_
10	Oklahoma	_	_	_	_	_	_	_	Entity=(3-place-Oklahoma)2)
11	.	_	_	_	_	_	_	_	_
12	It	_	_	_	_	_	_	_	Entity=(1-place-Tulsa)
13	is	_	_	_	_	_	_	_	_
14	also	_	_	_	_	_	_	_	_
15	called	_	_	_	_	_	_	_	_
16	“	_	_	_	_	_	_	_	_
17	T-town	_	_	_	_	_	_	_	Entity=(1-place-Tulsa)
18	”	_	_	_	_	_	_	_	_

Note that key-value annotations aside from the group ID are not repeated at closing brackets, and that multiple entities can open or close on the same line (see token 10 in the example). A more complex example indicating actual usage in the GUM corpus is as follows:

# global.Entity = GRP-entity-infstat-MIN-coref_type-identity
...
# text = He is said to have had a bad relationship with his father.
1	He	he	PRON	PRP	Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs	3	nsubj:pass	3:nsubj:pass|6:nsubj:xsubj	Entity=(1-person-giv:act-1-ana-Daniel_Bernoulli)
2	is	be	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	3	aux:pass	3:aux:pass	_
3	said	say	VERB	VBN	Tense=Past|VerbForm=Part|Voice=Pass	0	root	0:root	_
4	to	to	PART	TO	_	6	mark	6:mark	_
5	have	have	AUX	VB	VerbForm=Inf	6	aux	6:aux	_
6	had	have	VERB	VBN	Tense=Past|VerbForm=Part	3	xcomp	3:xcomp	_
7	a	a	DET	DT	Definite=Ind|PronType=Art	9	det	9:det	Entity=(54-abstract-new-3-sgl
8	bad	bad	ADJ	JJ	Degree=Pos	9	amod	9:amod	_
9	relationship	relationship	NOUN	NN	Number=Sing	6	obj	6:obj	_
10	with	with	ADP	IN	_	12	case	12:case	_
11	his	his	PRON	PRP$	Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs	12	nmod:poss	12:nmod:poss	Entity=(45-person-giv:inact-2-coref-Johann_Bernoulli(1-person-giv:act-1-ana-Daniel_Bernoulli)
12	father	father	NOUN	NN	Number=Sing	9	nmod	9:nmod:with	Entity=45)54)|SpaceAfter=No
13	.	.	PUNCT	.	_	3	punct	3:punct	_

In this example, each Entity annotation again may contain multiple opening or closing brackets (see token 11, which begins both “his” and “his father”, i.e. we have [[his] father] as a nested entity). Each opening bracket has up to six segments, separated by hyphens, including entity, GRP and identity as before, but also infstat (information status), MIN (the minimal span for fuzzy entity matching, indicating running token numbers inside the entity span) and coref_type, for example ‘ana’ for pronominal anaphora.
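The bracket notation can be split into a sequence of open, close and single-token events. The sketch below is one possible reading and not the reference implementation: `BRACKET_RE` and `parse_entity` are hypothetical names, the key list is assumed to come from the `# global.Entity` comment, and only the last declared key is assumed to have values that may themselves contain a hyphen.

```python
import re

# "(fields" opens a mention, "(fields)" is a single-token mention,
# "id)" closes a mention opened on an earlier token.
BRACKET_RE = re.compile(
    r"\((?P<open>[^()]+)(?P<single>\))?|(?P<close>[^()]+)\)"
)


def parse_entity(value, keys):
    """Return a list of (action, fields) events for one Entity=... value."""
    events = []
    for m in BRACKET_RE.finditer(value):
        if m["close"] is not None:
            # Closing brackets repeat only the group ID (the first key).
            events.append(("close", {keys[0]: m["close"]}))
        else:
            # Limit the split so a hyphen in the last field is preserved.
            parts = m["open"].split("-", len(keys) - 1)
            action = "single" if m["single"] else "open"
            events.append((action, dict(zip(keys, parts))))
    return events


keys = "GRP-entity-infstat-MIN-coref_type-identity".split("-")
for event in parse_entity(
    "(45-person-giv:inact-2-coref-Johann_Bernoulli"
    "(1-person-giv:act-1-ana-Daniel_Bernoulli)",
    keys,
):
    print(event)
```

For token 11 of the example this yields an `open` event for entity 45 followed by a `single` event for entity 1; the value `45)54)` on token 12 yields two `close` events.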

For more details and use cases of the Entity annotation, see the Universal Anaphora documentation.

Gloss

Lang

LDeriv

See also LId, LGloss, LNumValue and Root.

The lemma from which the lemma of the current token is morphologically derived (just the source word form, not a value of LId). In the Czech dependency treebanks, the value of LDeriv is directly embedded in the derived lemma and it has to be separated in UD where the LEMMA column is supposed to contain just the citation form of the current lexeme.

In the example below, the adverb ročně “yearly” is derived from the adjective roční (which, in turn, is derived from the noun rok “year”, but it is not visible on this line; it would be visible on a line where the adjective occurs in the corpus).

# sent_id = 1
# text = Tato politika stojí český stát miliardu ročně.
 Tato       tento      DET     _   _   2   det      _   _
 politika   politika   NOUN    _   _   3   nsubj    _   LGloss=(věda)
 stojí      stát       VERB    _   _   0   root     _   LId=stát-4|LGloss=(něco_stojí_peníze)
 český      český      ADJ     _   _   5   amod     _   _
 stát       stát       NOUN    _   _   3   iobj     _   LId=stát-1|LGloss=(státní_útvar)
 miliardu   miliarda   NOUN    _   _   3   obj      _   LNumValue=1000000000
 ročně      ročně      ADV     _   _   3   advmod   _   SpaceAfter=No|LDeriv=roční
 .          .          PUNCT   _   _   3   punct    _   _

LGloss

See also LId, Gloss, LNumValue and LDeriv.

An explanation of the meaning of the lemma/lexeme. This is particularly useful in connection with the LId attribute, although it can also appear alone as in the example below. While the Gloss attribute could arguably be used to convey this information, the origin and the purpose of the two attributes is slightly different. In the Czech dependency treebanks, the value of LGloss is directly embedded in the lemma and it has to be separated in UD where the LEMMA column is supposed to contain just the citation form of the lexeme. Consequently, the value of LGloss is still in Czech (rather than a translation to English or another language), and it typically uses a longer phrase to describe or illustrate the usage of the word.

# sent_id = 1
# text = Tato politika stojí český stát miliardu ročně.
 Tato       tento      DET     _   _   2   det      _   _
 politika   politika   NOUN    _   _   3   nsubj    _   LGloss=(věda)
 stojí      stát       VERB    _   _   0   root     _   LId=stát-4|LGloss=(něco_stojí_peníze)
 český      český      ADJ     _   _   5   amod     _   _
 stát       stát       NOUN    _   _   3   iobj     _   LId=stát-1|LGloss=(státní_útvar)
 miliardu   miliarda   NOUN    _   _   3   obj      _   LNumValue=1000000000
 ročně      ročně      ADV     _   _   3   advmod   _   SpaceAfter=No|LDeriv=roční
 .          .          PUNCT   _   _   3   punct    _   _

LId

LNumValue

LTranslit

MGloss

ModernForm

Morf

MSeg

MWE

See also MWEPOS, NamedEntity, Proper and Entity.

Multi-word expression. The value serves to preserve a multi-word expression from the source corpus. This attribute is typically used at places where in the source corpus the MWE was treated as one token (possibly with underscores instead of spaces). Various types of MWEs can be annotated this way, ranging from fixed functional expressions, such as multi-word prepositions, to multi-word named entities. (Note however that this attribute is not sufficient for named entity annotation, as it cannot handle nested entities.)

Besides preserving the information that there is a multi-word expression, the attribute may also serve as a warning that the annotation conversion was difficult and may have to be checked by a human here (the annotation of the individual words probably had to be guessed during conversion). It is typically placed either at the first word of the MWE, or at the head word.

It is attested in Catalan AnCora, Portuguese Bosque (as _MWE), Spanish AnCora.

# sent_id = 3LB-CAT-06010100-1-s1
# text = El Tribunal Suprem (TS) ha confirmat la condemna
El          el          DET     _   _   2    det     _   _
Tribunal    tribunal    NOUN    _   _   8    nsubj   _   MWE=Tribunal_Suprem|MWEPOS=PROPN|ClusterId=3LB-CAT-06010100-1-s1.sn.2|ClusterType=Spec.organization|MentionSpan=1-6
Suprem      suprem      ADJ     _   _   2    amod    _   _
(           (           PUNCT   _   _   5    punct   _   SpaceAfter=No
TS          TS          PROPN   _   _   2    appos   _   SpaceAfter=No|ClusterId=3LB-CAT-06010100-1-s1.sn.7|ClusterType=Spec.organization|MentionSpan=4-6
)           )           PUNCT   _   _   5    punct   _   _
ha          haver       AUX     _   _   8    aux     _   _
confirmat   confirmar   VERB    _   _   0    root    _   _
la          el          DET     _   _   10   det     _   _
condemna    condemna    NOUN    _   _   8    obj     _   _

MWEPOS

See also MWE, NamedEntity, Proper and Entity.

The part of speech of a multi-word expression. The value is taken from the set of universal part-of-speech tags (UPOS) but it is not necessarily identical to the UPOS annotation of the current token, as the whole expression may function as a different part of speech than the individual words. For instance, an MWE may function as an adposition while its member words are nouns or adverbs. Or an expression may consist of determiners, prepositions, common nouns and adjectives, but together form a multi-word named entity, hence it acts as a PROPN.

It typically occurs together with the MWE attribute, either at the first word of the MWE, or at the head word.

It is attested in Catalan AnCora, Indonesian CSUI, Indonesian GSD, Indonesian PUD, Ligurian GLT, Portuguese Bosque (as POSMWE), Portuguese GSD, Spanish AnCora. It first appeared at around UD v1.3; it is briefly mentioned in issue #664.

Note: Some treebanks later introduced the attribute ExtPos (possibly under the influence of SUD?), which seems to serve a similar purpose. It appears in Beja NSC, various French treebanks, and Naija NSC, as well as in issues #608, #664, #678, #777 and #807. Ideally, these two attribute names should be merged into one!

# sent_id = 3LB-CAT-06010100-1-s1
# text = El Tribunal Suprem (TS) ha confirmat la condemna
El          el          DET     _   _   2    det     _   _
Tribunal    tribunal    NOUN    _   _   8    nsubj   _   MWE=Tribunal_Suprem|MWEPOS=PROPN|ClusterId=3LB-CAT-06010100-1-s1.sn.2|ClusterType=Spec.organization|MentionSpan=1-6
Suprem      suprem      ADJ     _   _   2    amod    _   _
(           (           PUNCT   _   _   5    punct   _   SpaceAfter=No
TS          TS          PROPN   _   _   2    appos   _   SpaceAfter=No|ClusterId=3LB-CAT-06010100-1-s1.sn.7|ClusterType=Spec.organization|MentionSpan=4-6
)           )           PUNCT   _   _   5    punct   _   _
ha          haver       AUX     _   _   8    aux     _   _
confirmat   confirmar   VERB    _   _   0    root    _   _
la          el          DET     _   _   10   det     _   _
condemna    condemna    NOUN    _   _   8    obj     _   _

NamedEntity

See also MWE, MWEPOS, Proper and Entity.

NamedEntity=Yes preserves the information that the word was tagged PROPN when it was first imported to UD. As is typical of Google-annotated data (GSD and PUD), all words in multi-word named entities are tagged PROPN, which is wrong in UD. Determiners, numerals, adjectives, common nouns and other words should get their usual UPOS tags even inside titles of books or movies, names of organizations etc. However, when we fix the UPOS tag, we may still want to preserve the information that a sequence of words was tagged PROPN, because we could later convert it to genuine annotation of named entities.

Attested in German GSD and Irish IDT.

# sent_id = train-s203
# text = Unser Erlebnis auf dem Leuchtturm Roter Sand war einmalig
 Unser        unser        DET    _   _   2   det     _   _
 Erlebnis     Erlebnis     NOUN   _   _   9   nsubj   _   _
 auf          auf          ADP    _   _   5   case    _   _
 dem          der          DET    _   _   5   det     _   _
 Leuchtturm   Leuchtturm   NOUN   _   _   2   nmod    _   _
 Roter        rot          ADJ    _   _   7   amod    _   NamedEntity=Yes
 Sand         Sand         NOUN   _   _   5   appos   _   NamedEntity=Yes
 war          sein         AUX    _   _   9   cop     _   _
 einmalig     einmalig     ADJ    _   _   0   root    _   _

NewPar

OrigLang

PDTB

Penn Discourse Treebank (PDTB) style shallow discourse relations. For example, some UD_English corpora (GUM, GUMReddit, GENTLE) include shallow discourse relation annotations, which provide information for Explicit, Implicit, AltLex, AltLexC, EntRel, Hypophora and NoRel annotations following the Penn Discourse Treebank (PDTB) v3 guidelines (see Webber et al. 2019 for definitions). For explicit and AltLex relations, annotations are placed on the first token of the connective (words like “but”, “because”, “on the other hand”, etc.) or of the alternative lexicalization (e.g. “this is the reason”) marking the relation; in other cases, they are placed on the first token of the second argument span (arg2). Token ranges for each argument span, the connective and relation label are provided as well. For example, the following line:

21	to	to	PART	TO	_	22	mark	22:mark	Discourse=purpose-goal:105->104:0:syn-inf-963|PDTB=Implicit:Contingency.Purpose.Arg2-as-goal:in order:_:943-962:963-981

indicates an Implicit relation with the label Contingency.Purpose.Arg2-as-goal, with an implicit connective “in order”. Because the connective is implicit, it has no token indices (_), but arg1 spans tokens 943-962 of the document (ignoring decimal ellipsis tokens), and arg2 spans tokens 963-981. If multiple PDTB relations apply at the same token position, they are separated by a semicolon.
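Based on the single example above, a PDTB value can be read as six colon-separated fields per relation. The sketch below relies on that inferred field order; the function name `parse_pdtb` and the dictionary keys are hypothetical, the code assumes that none of the fields contains a colon or a semicolon, and span values are kept as strings because the notation for discontinuous spans is not described here.

```python
def parse_pdtb(value):
    """Parse a PDTB=... value into a list of relation dictionaries."""
    relations = []
    # Multiple relations at the same token are separated by semicolons.
    for relation in value.split(";"):
        rel_type, label, connective, conn_span, arg1, arg2 = relation.split(":")
        relations.append({
            "type": rel_type,
            "label": label,
            # "_" marks a missing value, e.g. no token indices for an
            # implicit connective.
            "connective": None if connective == "_" else connective,
            "connective_span": None if conn_span == "_" else conn_span,
            "arg1_span": arg1,
            "arg2_span": arg2,
        })
    return relations


example = ("Implicit:Contingency.Purpose.Arg2-as-goal:"
           "in order:_:943-962:963-981")
print(parse_pdtb(example)[0]["label"])  # Contingency.Purpose.Arg2-as-goal
```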

Proper

See also MWE, MWEPOS, NamedEntity and Entity.

Proper=True preserves a same-named feature from the original Google annotation in the PUD treebanks, but only for words that could not be tagged PROPN in UD (e.g., adjectives). As is typical of Google-annotated data (GSD and PUD), all words in multi-word named entities are labeled as proper, which is wrong in UD. Determiners, numerals, adjectives, common nouns and other words should get their usual UPOS tags even inside titles of books or movies, names of organizations etc.

The information that a non-proper-noun was annotated as Proper could be later converted to genuine annotation of multi-word named entities.

Attested in the following PUD treebanks: Arabic, English, French, German, Hindi, Chinese, Italian, Korean, Portuguese, Russian, Spanish, Thai, Turkish.

# sent_id = n01035013
# text = Полиция в Британской Колумбии сказала,
# english_text = Police in B.C. said
1   Полиция      полиция      NOUN    _   _   5    nsubj   _   _
2   в            в            ADP     _   _   4    case    _   _
3   Британской   британский   ADJ     _   _   4    amod    _   Proper=True
4   Колумбии     Колумбия     PROPN   _   _   1    nmod    _   _
5   сказала      сказать      VERB    _   _   0    root    _   SpaceAfter=No
6   ,            ,            PUNCT   _   _   15   punct   _   _

Reduplication

This attribute is used in conjunction with the relation flat:redup. It is annotated on the second (or last) element receiving the relation and expresses in terms of UD morphological features the value of the reduplication, with the syntax <Feature>:<Value> (since the = sign is already used).

It is attested experimentally in the Latin treebanks IT-TB, LLCT and UDante.

# sent_id = DVE-71
# text = Quot quot autem exercitii varietates tendebant ad opus, tot tot ydiomatibus tunc genus humanum disiungitur; et quanto excellentius exercebant, tanto rudius nunc barbariusque locuntur.
# citation_hierarchy = Liber_Primus,vii,Paragraphus_7
1	Quot	quot	DET	ai	NumType=Card|PronType=Rel	5	det	_	_
2	quot	quot	DET	ai	NumType=Card|PronType=Rel	1	flat:redup	_	Reduplication=NumType:Dist
3	autem	autem	PART	co	_	6	discourse	_	_
4	exercitii	exercitium	NOUN	sns2g	Case=Gen|Gender=Neut|InflClass=IndEurO|Number=Sing	5	nmod	_	_
5	varietates	uarietas	NOUN	sfp3n	Case=Nom|Gender=Fem|InflClass=IndEurX|Number=Plur	6	nsubj	_	_
6	tendebant	tendo	VERB	va3iip3	Aspect=Imp|InflClass=LatX|Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin|Voice=Act	12	acl:relcl	_	TraditionalMood=Indicativus|TraditionalTense=Imperfectum
7	ad	ad	ADP	e	_	8	case	_	_
8	opus	opus	NOUN	sns3a	Case=Acc|Gender=Neut|InflClass=IndEurX|Number=Sing	6	obl:arg	_	SpaceAfter=No
9	,	,	PUNCT	Pu	_	6	punct	_	_
10	tot	tot	DET	yuip	NumType=Card|PronType=Dem	12	det	_	_
11	tot	tot	DET	yuip	NumType=Card|PronType=Dem	10	flat:redup	_	Reduplication=NumType:Dist
12	ydiomatibus	idioma	NOUN	snp3b	Case=Abl|Gender=Neut|InflClass=IndEurX|Number=Plur	16	obl	_	_
13	tunc	tunc	ADV	r	PronType=Dem	16	advmod	_	_
14	genus	genus	NOUN	sns3n	Case=Nom|Gender=Neut|InflClass=IndEurX|Number=Sing	16	nsubj:pass	_	_
15	humanum	humanus	ADJ	ans1n	Case=Nom|Gender=Neut|InflClass=IndEurO|Number=Sing	14	amod	_	_
16	disiungitur	disiungo	VERB	vp3ips3	Aspect=Imp|InflClass=LatX|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Pass	0	root	_	SpaceAfter=No|TraditionalMood=Indicativus|TraditionalTense=Praesens
17	;	;	PUNCT	Pu	_	18	punct	_	_
18	et	et	CCONJ	co	_	28	cc	_	_
19	quanto	quanto	SCONJ	r	PronType=Rel	21	mark	_	_
20	excellentius	excellenter	ADV	r+	Degree=Cmp|VerbForm=Part	21	advmod	_	_
21	exercebant	exerceo	VERB	va2iip3	Aspect=Imp|InflClass=LatE|Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin|Voice=Act	28	advcl:cmp	_	SpaceAfter=No|TraditionalMood=Indicativus|TraditionalTense=Imperfectum
22	,	,	PUNCT	Pu	_	21	punct	_	_
23	tanto	tanto	ADV	r	PronType=Dem	24	advmod	_	_
24	rudius	rude	ADV	r+	Degree=Cmp	28	advmod	_	_
25	nunc	nunc	ADV	r	AdvType=Tim	28	advmod:tmod	_	_
26-27	barbariusque	_	_	_	_	_	_	_	_
26	barbarius	barbare	ADV	r+	Degree=Cmp	24	conj	_	_
27	que	que	CCONJ	co9	_	26	cc	_	_
28	locuntur	loquor	VERB	vd3ipp3	Aspect=Imp|InflClass=LatX|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Pass	16	conj	_	SpaceAfter=No|TraditionalMood=Indicativus|TraditionalTense=Praesens
29	.	.	PUNCT	Pu	_	16	punct	_	_

Ref

Some standardized reference to the source text as used in classical studies. For example, annotated texts from the Bible have the uppercase abbreviation of the book, followed by an underscore and a decimal reference to the verse. It is a token-level reference (rather than sentence-level) because one sentence may contain parts with different source ids. On the other hand, the same source id may cover multiple sentences or their parts.

Used e.g. in Ancient Greek PROIEL, Latin PROIEL, Gothic PROIEL, Old Church Slavonic PROIEL, Old East Slavic TOROT, Romanian Nonstandard, Yoruba YTB.

# source = Bibeli Mimọ, Jẹ́nẹ́sísì, Chapter 1
# newdoc id = GEN_1
# sent_id = GEN_1.1
# text = Ní ìbẹ̀rẹ̀ ohun gbogbo Ọlọ́run dá àwọn ọ̀run àti ayé.
# text_en = In the beginning God created the heaven and the earth.
1    Ní       ní       ADP     _   _   2    case    _   Gloss=in|Ref=GEN_1.1
2    ìbẹ̀rẹ̀    ìbẹ̀rẹ̀    NOUN    _   _   6    obl     _   Gloss=beginning|Ref=GEN_1.1
3    ohun     ohun     NOUN    _   _   5    nmod    _   Gloss=things|Ref=GEN_1.1
4    gbogbo   gbogbo   DET     _   _   5    det     _   Gloss=all|Ref=GEN_1.1
5    Ọlọ́run   ọlọ́run   NOUN    _   _   6    nsubj   _   Gloss=god|Ref=GEN_1.1
6    dá       dá       VERB    _   _   0    root    _   Gloss=made|Ref=GEN_1.1
7    àwọn     àwọn     DET     _   _   8    det     _   Gloss=the|Ref=GEN_1.1
8    ọ̀run     ọ̀run     NOUN    _   _   6    obj     _   Gloss=heaven|Ref=GEN_1.1
9    àti      àti      CCONJ   _   _   10   cc      _   Gloss=and|Ref=GEN_1.1
10   ayé      ayé      NOUN    _   _   8    conj    _   Gloss=earth|Ref=GEN_1.1|SpaceAfter=No
11   .        .        PUNCT   _   _   6    punct   _   Gloss=.|Ref=GEN_1.1

Root

SpacesAfter

The mandatory attribute SpaceAfter=No only specifies whether there was at least one space between two tokens of a sentence. It cannot truly preserve the untokenized text if there were two spaces between two tokens, or a line break. This can be optionally preserved using the SpacesAfter attribute; in the value, the following C-like escape sequences are used: \s (space), \t (TAB), \r (CR), \n (LF), \p (pipe), \\ (backslash). Note that SpacesAfter should not occur together with SpaceAfter=No on the same line.

This attribute was proposed in issue #332. It is generated by the UDPipe software and occurs in some UD treebanks, e.g., Belarusian HSE, Bhojpuri BHTB or Classical Chinese Kyoto.
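The escape sequences listed above can be decoded with a small lookup table. This is an illustrative sketch: the names `ESCAPES` and `decode_spaces_after` are hypothetical, and unknown escape sequences are left untouched rather than rejected.

```python
import re

# Escape letter -> literal character, as listed for SpacesAfter.
ESCAPES = {"s": " ", "t": "\t", "r": "\r", "n": "\n", "p": "|", "\\": "\\"}


def decode_spaces_after(value):
    """Turn an escaped SpacesAfter value into the literal characters."""
    return re.sub(
        r"\\(.)",
        # Unknown escapes are returned unchanged.
        lambda m: ESCAPES.get(m.group(1), m.group(0)),
        value,
    )
```

For instance, `decode_spaces_after(r"\s\s")` returns two spaces and `decode_spaces_after(r"\r\n")` returns a CR LF line break.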

SpacesBefore

Split

Used in conjunction with Entity to indicate split antecedent anaphora, by creating a pointing relation between multiple entity GRP identifiers and the ID of an anaphor pointing back to them:

15	Padalecki	Padalecki	PROPN	NNP	Number=Sing	16	nsubj	16:nsubj	Entity=(1-person-giv:act-1-coref-Jared_Padalecki)
16	partnered	partner	VERB	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	0	root	0:root	_
17	with	with	ADP	IN	_	18	case	18:case	_
18	co-star	co-star	NOUN	NN	Number=Sing	16	obl	16:obl:with	Entity=(97-person-giv:inact-1,3-coref-Jensen_Ackles
19	Jensen	Jensen	PROPN	NNP	Number=Sing	18	appos	18:appos	XML=<ref target:::"https://en.wikipedia.org/wiki/Jensen_Ackles">
20	Ackles	Ackles	PROPN	NNP	Number=Sing	19	flat	19:flat	Entity=97)|XML=</ref>
21	to	to	PART	TO	_	22	mark	22:mark	Discourse=purpose:105->104:0
22	release	release	VERB	VB	VerbForm=Inf	16	advcl	16:advcl:to	_
23	a	a	DET	DT	Definite=Ind|PronType=Art	24	det	24:det	Entity=(190-object-new-2-coref
24	shirt	shirt	NOUN	NN	Number=Sing	22	obj	22:obj	Entity=190)
25	featuring	feature	VERB	VBG	VerbForm=Ger	24	acl	24:acl	Discourse=elaboration:106->105:0
26	both	both	DET	DT	_	25	obj	25:obj	Entity=(191-object-new-1-sgl
27	of	of	ADP	IN	_	29	case	29:case	_
28	their	their	PRON	PRP$	Number=Plur|Person=3|Poss=Yes|PronType=Prs	29	nmod:poss	29:nmod:poss	Entity=(192-person-acc:aggr-1-coref)|Split=1<192,97<192
29	faces	face	NOUN	NNS	Number=Plur	26	nmod	26:nmod:of	Entity=191)|SpaceAfter=No

Here “their” (entity number 192) refers to both Padalecki (entity number 1) and Jensen Ackles (entity number 97). We therefore have Split=1<192,97<192, indicating that the identity of 192 is resolvable by joint reference to entities 1 and 97. For more information, see the Entity notation section and the documentation of the Universal Anaphora format specifications.
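
A value such as 1<192,97<192 can be read as a comma-separated list of antecedent<anaphor pairs. The sketch below collects the antecedents per anaphor; the function name `parse_split` is hypothetical, and the code assumes every pair contains exactly one “<”, as in the example.

```python
def parse_split(value):
    """Map each anaphor entity id to the antecedent ids that jointly resolve it."""
    antecedents = {}
    for pair in value.split(","):
        antecedent, anaphor = pair.split("<", 1)
        antecedents.setdefault(anaphor, []).append(antecedent)
    return antecedents
```

Applied to the value from the example, `parse_split("1<192,97<192")` maps entity 192 to the antecedents 1 and 97. The ids are kept as strings, since nothing in the notation requires numeric processing.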

Stype

Sentence type (modality). It is annotated at the head of the sentence or the clause. The following values are recognized: Stype=declarative, Stype=imperative, Stype=interrogative, Stype=interjective. The attribute overlaps with the morphological feature Mood of verbs but it is not exactly the same information.

Used in Hindi HDTB and Urdu UDTB.

# sent_id = train-s2
# text = इसे नवाब शाहजेहन ने बनवाया था ।
1	इसे	यह	PRON    _   _   5   obj        _   Vib=को|Tam=ko|ChunkId=NP|ChunkType=head|Translit=ise
2	नवाब	नवाब	NOUN    _   _   3   compound   _   Vib=0|Tam=0|ChunkId=NP2|ChunkType=child|Translit=navāba
3	शाहजेहन	शाहजेहन	PROPN   _   _   5   nsubj      _   Vib=0_ने|Tam=0|ChunkId=NP2|ChunkType=head|Translit=śāhajehana
4	ने	ने	ADP     _   _   3   case       _   ChunkId=NP2|ChunkType=child|Translit=ne
5	बनवाया	बनवा	VERB    _   _   0   root       _   Vib=या_था|Tam=yA|ChunkId=VGF|ChunkType=head|Stype=declarative|Translit=banavāyā
6	था	था	AUX     _   _   5   aux        _   Vib=था|Tam=WA|ChunkId=VGF|ChunkType=child|Translit=thā
7	।	।	PUNCT   _   _   5   punct      _   ChunkId=BLK|ChunkType=head|Translit=.

Subject

The guidelines normally allow at most one subject attached to the same predicate. However, since UD 2.10 (May 2022), multiple subjects are exceptionally allowed when a clause acts as the predicate of an outer clause. It is recommended (and by default expected) that the outer subject(s) is (are) then labeled with the relation subtype nsubj:outer or csubj:outer. However, relation subtypes are optional and there may be a good reason to not use the subtype (e.g., there would be only one instance of the outer subject in the whole corpus, and it would occur in the test data, so no parser would have a chance to learn how to predict it). In such cases the treebank maintainer can opt out of using the :outer subtype. They still need to mark each instance as verified and legitimate, otherwise the UD validator would report it as an error. This is done by adding Subject=Outer to the MISC column on the line where nsubj:outer would be if the subtype were used.

# sent_id = sahidica_1corinthians-1Cor_03_s0004
# text_en = For when one says, 'I follow Paul,' and another, 'I follow Apollos,' aren't you fleshly?
# text = ϩⲟⲧⲁⲛ ⲅⲁⲣ ⲉⲣϣⲁⲛⲟⲩⲁ ϫⲟⲟⲥ ϫⲉⲁⲛⲟⲕ ⲙⲉⲛ ⲁⲛⲅⲡⲁⲡⲁⲩⲗⲟⲥ . ⲕⲉⲟⲩⲁ ⲇⲉ ϫⲉⲁⲛⲅⲡⲁⲁⲡⲟⲗⲗⲱ . ⲙⲏ ⲛⲧⲉⲧⲛ ϩⲉⲛⲣⲱⲙⲉ ⲁⲛ .
14-15	ⲕⲉⲟⲩⲁ	_	_	_	_	_	_	_	_
14	ⲕⲉ	ⲕⲉ	DET	ART	PronType=Art	15	det	_	_
15	ⲟⲩⲁ	ⲟⲩⲁ	NUM	NUM	NumType=Card	20	nsubj	_	Entity=(person)|Subject=Outer
16	ⲇⲉ	ⲇⲉ	PART	PTC	Foreign=Yes	20	advmod	_	OrigLang=grc
17-20	ϫⲉⲁⲛⲅⲡⲁⲁⲡⲟⲗⲗⲱ	_	_	_	_	_	_	_	_
17	ϫⲉ	ϫⲉ	SCONJ	CONJ	_	20	mark	_	_
18	ⲁⲛⲅ	ⲁⲛⲟⲕ	PRON	PPERI	Definite=Def|Number=Sing|Person=1|PronType=Prs	20	nsubj	_	_
19	ⲡⲁ	ⲡⲁ	DET	PPOS	Definite=Def|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs	20	det	_	Entity=(person
20	ⲁⲡⲟⲗⲗⲱ	ⲁⲡⲟⲗⲗⲱ	PROPN	NPROP	Foreign=Yes	12	parataxis	_	Entity=(person-Apollos)person)|OrigLang=grc
21	.	.	PUNCT	PUNCT	_	5	punct	_	_
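
The rule can be illustrated with a simplified check that looks for predicates with more than one subject that is not marked as outer, either by the :outer subtype or by Subject=Outer in MISC. This is only a sketch of the idea and not the UD validator's actual implementation, which may apply further conditions; the function name `unverified_multi_subjects` and the tuple layout of the input are assumptions.

```python
from collections import defaultdict

SUBJECT_RELS = ("nsubj", "csubj")


def unverified_multi_subjects(rows):
    """rows: (id, head, deprel, misc) tuples of one sentence.

    Return {head: [subject ids]} for heads that have more than one
    subject not marked as outer.
    """
    plain = defaultdict(list)
    for tok_id, head, deprel, misc in rows:
        if deprel.split(":", 1)[0] not in SUBJECT_RELS:
            continue
        is_outer = deprel.endswith(":outer") or "Subject=Outer" in misc.split("|")
        if not is_outer:
            plain[head].append(tok_id)
    return {head: ids for head, ids in plain.items() if len(ids) > 1}
```

In the example above, tokens 15 and 18 are both attached to token 20 as nsubj; because token 15 carries Subject=Outer, only one unmarked subject remains and nothing is reported.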

Tam

TraditionalMood

This feature is used in three Latin treebanks (IT-TB, LLCT, UDante) and, together with TraditionalTense, supplies the traditional names of verb forms, in particular of “mood”. This is done for convenience, as the typologically driven decomposition of tenses into UD features can differ from language-specific terminology and sometimes follows a different logic.

First and foremost, we note that “mood”, in traditional literature about Latin, does not correspond only to UD’s Mood, but also covers so-called nonfinite VerbForms. This is possible because of the complementarity of Mood’s distribution in Latin: finite forms express it (Imp, Ind, Sub), while nonfinite forms do not. So, the values for TraditionalMood are, with their “translations” in UD:

Gerundium
- Aspect=Prosp|Gender=Neut|InflClass[nominal]=IndEurO|Number=Sing|VerbForm=Part|Voice=Pass (VerbForm=Ger)
Gerundivum
- Aspect=Prosp|VerbForm=Part|Voice=Pass (VerbForm=Gdv)
Imperativus
- Mood=Imp
Indicativus
- Mood=Ind
Infinitivus
- VerbForm=VNoun (VerbForm=Inf)
Participium
- VerbForm=Part
Subiunctivus
- Mood=Sub
Supinum
- “active”: Aspect=Prosp|VerbForm=Conv|Voice=Act
- “passive”: NOUN with Case=Abl|Gender=Masc|InflClass=IndEurU|Number=Sing|VerbForm=VNoun (VerbForm=Sup)

They are marked only on VERBs and AUXs, apart from the passive supine (NOUN). For further explanations about the correspondences from a morphological and syntactic point of view, see the documentation page about VerbForm.

Traditional moods and tenses are currently annotated only for single forms, and not for periphrastic constructions.

9	impressit	imprimo	VERB	_	Aspect=Perf|InflClass=LatX|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	2	acl:relcl	_	TraditionalMood=Indicativus|TraditionalTense=Perfectum

31	ostensum	ostendo	VERB	_	Aspect=Perf|Case=Nom|Gender=Neut|InflClass=LatX|InflClass[nominal]=IndEurO|Number=Sing|VerbForm=Part|Voice=Pass	26	advcl:cmp	_	TraditionalMood=Participium|TraditionalTense=Perfectum
32	est	sum	AUX	va5ips3	Aspect=Imp|InflClass=LatAnom|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	31	aux:pass	_	TraditionalMood=Indicativus|TraditionalTense=Praesens

TraditionalTense

This feature is used in three Latin treebanks (IT-TB, LLCT, UDante) and, together with TraditionalMood, supplies the traditional names of verb forms, in particular of “tense”. This is done for convenience, as the typologically driven decomposition of tenses into UD features can differ from language-specific terminology and sometimes follows a different logic.

In Latin linguistics, the term “tense” is more general than UD’s Tense, in that it can also mean or encompass Aspect, or refer to a whole periphrastic construction rather than to a single form. The notion of “tense” is extended, with the same terminology, to nonfinite forms as well; since these do not express Tense, no ambiguity arises language-internally.

Imperfectum
- Aspect=Imp|Tense=Past
Futurum
- finite: Aspect=Imp|Tense=Fut
- nonfinite: Aspect=Prosp
FuturumExactum
- Aspect=Perf|Tense=Fut
Perfectum
- finite: Aspect=Perf|Tense=Past (or in some accounts: Tense=Pres)
- nonfinite: Aspect=Perf
Plusquamperfectum
- Aspect=Perf|Tense=Pqp (or Aspect=Perf|Tense=Past if Perfectum has Tense=Pres)
Praesens
- finite: Aspect=Imp|Tense=Pres
- nonfinite: Aspect=Imp

They are marked only on VERBs and AUXs. Gerunds and gerundives are not assigned a tense. For further explanations about the correspondences from a morphological point of view, see the documentation page about Aspect.

Traditional moods and tenses are currently annotated only for single forms, and not for periphrastic constructions.

20	feci	facio	VERB	_	Aspect=Perf|InflClass=LatI2|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin|Voice=Act	26	advcl:cmp	_	TraditionalMood=Indicativus|TraditionalTense=Perfectum

22	facturo	facio	VERB	_	Aspect=Prosp|Case=Dat|Gender=Masc|InflClass=LatI2|InflClass[nominal]=IndEurO|Number=Sing|VerbForm=Part|Voice=Act	15	obl:arg	_	TraditionalMood=Participium|TraditionalTense=Futurum
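
In the list above, each traditional tense corresponds to exactly one Aspect value once finiteness is known, so the two annotations can be cross-checked. The following sketch does that for a single token; it is an illustration only, the names `parse_pairs`, `expected_aspect` and `aspect_matches` are hypothetical, and treating VerbForm=Fin as the sole test for finiteness is an assumption. Tense is deliberately not checked, because the list allows alternative analyses for it.

```python
# Aspect implied by each traditional tense, read off the list above.
ASPECT_BY_TENSE = {
    "Imperfectum": "Imp",
    "FuturumExactum": "Perf",
    "Perfectum": "Perf",
    "Plusquamperfectum": "Perf",
    "Praesens": "Imp",
}


def parse_pairs(column):
    """Parse a FEATS or MISC column into a dict, skipping items without '='."""
    return dict(item.split("=", 1) for item in column.split("|") if "=" in item)


def expected_aspect(traditional_tense, finite):
    """Return the UD Aspect value implied by a TraditionalTense value."""
    if traditional_tense == "Futurum":
        # Futurum is the only tense whose Aspect depends on finiteness.
        return "Imp" if finite else "Prosp"
    return ASPECT_BY_TENSE[traditional_tense]


def aspect_matches(feats, misc):
    """True if Aspect in FEATS agrees with TraditionalTense in MISC (or none is given)."""
    tense = parse_pairs(misc).get("TraditionalTense")
    if tense is None:
        return True
    features = parse_pairs(feats)
    finite = features.get("VerbForm") == "Fin"
    return features.get("Aspect") == expected_aspect(tense, finite)
```

Both example tokens above pass this check: feci is finite with Aspect=Perf and Perfectum, and facturo is a participle with Aspect=Prosp and Futurum.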

Translit

Vform

Vib

XML

The annotation XML is used to encode opening and closing XML/HTML tags in source documents, which are not part of the text that appears in the actual word forms and do not correspond to some other, already existing MISC annotation. For example, because paragraphs are representable in # newpar or NewPar annotations, there is no need to represent XML elements such as <p>. However, some tags represent features other than block elements, and may also have attributes. These are used, for example, in the English GUM corpus:

1	Antonín	Antonín	PROPN	NNP	Number=Sing	31	nsubj	31:nsubj	XML=<hi rend:::"bold">
2	Leopold	Leopold	PROPN	NNP	Number=Sing	1	flat	1:flat	_
3	Dvořák	Dvořák	PROPN	NNP	Number=Sing	1	flat	1:flat	XML=</hi>
4	(	(	PUNCT	-LRB-	_	6	punct	6:punct	SpaceAfter=No
5	/	/	PUNCT	SYM	_	6	punct	6:punct	_
6	d(ə)ˈvɔːrʒɑːk	d(ə)ˈvɔːrʒɑːk	PROPN	NNP	Number=Sing	1	appos	1:appos	XML=<ref target:::"https://en.wikipedia.org/wiki/Help:IPA/English"></ref>
7	,	,	PUNCT	,	_	8	punct	8:punct	_
8	-ʒæk	-ʒæk	PROPN	NNP	Number=Sing	6	conj	1:appos|6:conj	XML=<ref target:::"https://en.wikipedia.org/wiki/Help:IPA/English"></ref>
9	/	/	PUNCT	SYM	_	10	punct	10:punct	_
10	d(ə)-VOR-zha(h)k	d(ə)-VOR-zha(h)k	PROPN	NNP	Number=Sing	1	appos	1:appos	SpaceAfter=No|XML=<hi rend:::"italic"><ref target:::"https://en.wikipedia.org/wiki/Help:IPA/English"></ref></hi>
11	;	;	PUNCT	:	_	12	punct	12:punct	_
12	Czech	Czech	PROPN	NNP	Number=Sing	15	dep	15:dep	SpaceAfter=No
13	:	:	PUNCT	:	_	12	punct	12:punct	_
14	[	[	PUNCT	-LRB-	_	15	punct	15:punct	SpaceAfter=No
15	ˈantoɲiːn	ˈantoɲiːn	PROPN	NNP	Number=Sing	1	parataxis	1:parataxis	XML=<ref target:::"https://en.wikipedia.org/wiki/Help:IPA/Czech">
16	ˈlɛopolt	ˈlɛopolt	PROPN	NNP	Number=Sing	15	flat	15:flat	_
17	ˈdvor̝aːk	ˈdvor̝aːk	PROPN	NNP	Number=Sing	15	flat	15:flat	SpaceAfter=No|XML=</ref>
18	]	]	PUNCT	-RRB-	_	15	punct	15:punct	SpaceAfter=No
19	;	;	PUNCT	:	_	20	punct	20:punct	_
20	8	8	NUM	CD	NumForm=Digit|NumType=Card	15	nmod:tmod	15:nmod:tmod	XML=<date when:::"1841-09-08">
21	September	September	PROPN	NNP	Number=Sing	20	compound	20:compound	_
22	1841	1841	NUM	CD	NumForm=Digit|NumType=Card	20	nmod:tmod	20:nmod:tmod	XML=</date>
23	–	-	SYM	SYM	_	24	case	24:case	_
24	1	1	NUM	CD	NumForm=Digit|NumType=Card	20	nmod	20:nmod:to	XML=<date when:::"1904-05-01">
25	May	May	PROPN	NNP	Number=Sing	24	compound	24:compound	_
26	1904	1904	NUM	CD	NumForm=Digit|NumType=Card	24	nmod:tmod	24:nmod:tmod	SpaceAfter=No|XML=</date>
27	)	)	PUNCT	-RRB-	_	15	punct	15:punct	_
28	was	be	AUX	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	31	cop	31:cop	_
29	a	a	DET	DT	Definite=Ind|PronType=Art	31	det	31:det	_
30	Czech	Czech	ADJ	JJ	Degree=Pos	31	amod	31:amod	XML=<ref target:::"https://en.wikipedia.org/wiki/Czechs"></ref>
31	composer	composer	NOUN	NN	Number=Sing	0	root	0:root	SpaceAfter=No
32	.	.	PUNCT	.	_	31	punct	31:punct	_

This example illustrates several types of tags found in the source data for this document: hyperlinks, resolved date annotations, and rendering markup, such as bold font weight. The convention for XML annotations is to list all tags that open before a token in the MISC field of that token’s line, in order of opening, and all closing tags on the line of the token after which they close (in reverse order of opening). As a result, XML markup around a single token has both the opening and the closing element on the same line (see token 30 in the example, a single-token hyperlink). The XML elements are represented canonically, including their attributes, except that the equals sign is escaped as :::, to avoid confusion with the MISC field’s own = sign. If pipes occur in the value, they must also be escaped using an XML escape (e.g. &#124;).
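
To read the tags back, the ::: escape has to be reversed before the value is split into individual tags. The sketch below shows one way to do this; the function name `xml_tags` is hypothetical, and the regular expression assumes that attribute values contain no literal “<” or “>”. Escaped pipes are left as they are, since an XML parser would resolve them later.

```python
import re

# A tag is anything between '<' and the next '>' (assumes no '<' or '>' inside attribute values).
TAG_RE = re.compile(r"<[^<>]+>")


def xml_tags(misc):
    """Return the XML tags stored in a MISC column, with ':::' restored to '='."""
    for item in misc.split("|"):
        if item.startswith("XML="):
            value = item[len("XML="):].replace(":::", "=")
            return TAG_RE.findall(value)
    return []
```

For token 1 of the example, `xml_tags('XML=<hi rend:::"bold">')` returns a list with the single tag `<hi rend="bold">`; for token 30 it returns the opening and the closing `ref` tag, in that order.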