This page pertains to UD version 2.

CoNLL-U Format in UD v2

Some of the changes adopted in v2 require modifications to the CoNLL-U format. We have tried to make them in the least disruptive way, keeping backward compatibility as far as we can. In particular, we have decided against changing the number of fields or their meaning, for fear of breaking people’s tools. We make the following changes for v2:

Allow spaces in the FORM and LEMMA fields (see word segmentation)
Let the DEPS field contain the entire enhanced dependency graph (not only additional relations on top of the basic trees) (see enhanced dependencies)
Allow empty nodes with an indexing scheme distinct from both tokens and words and with relations in the DEPS field (see ellipsis)
Standardize sentence-level metadata
Restrict the use of the MISC field for ease of processing

Enhanced representation in the DEPS field

The DEPS field holds the enhanced representation. Contrary to what was previously assumed, it seems better for the DEPS field to hold the entire enhanced graph, not only a smaller set of relations added on top of the base layer in HEAD+DEPREL. The primary reason is that some relations of the base layer, at minimum those used for ellipsis and in future undoubtedly many others, are not part of the enhanced graph but are instead replaced by other relations in it. See enhanced dependencies.
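To make the "entire graph in DEPS" idea concrete, here is a minimal Python sketch that reads the enhanced edges of a sentence from the DEPS column alone. It assumes the usual CoNLL-U conventions: ten tab-separated fields, DEPS written as head:relation items joined by |, and _ for an empty value. The function names parse_deps and enhanced_edges are hypothetical and not part of any UD tool.

```python
def parse_deps(deps):
    """Turn a DEPS value such as '2:nsubj|4:nmod:poss' into (head, relation) pairs.

    Heads are kept as strings because they may refer to empty nodes.
    """
    if deps == "_":
        return []
    pairs = []
    for item in deps.split("|"):
        # Split on the first colon only: relation subtypes contain colons too.
        head, _, relation = item.partition(":")
        pairs.append((head, relation))
    return pairs


def enhanced_edges(lines):
    """Collect (head, relation, dependent) triples from the DEPS column."""
    edges = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        # Split on TAB only, since v2 allows spaces inside FORM and LEMMA.
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError("expected 10 tab-separated fields: %r" % line)
        for head, relation in parse_deps(cols[8]):
            edges.append((head, relation, cols[0]))
    return edges
```

Because DEPS carries the whole enhanced graph, a reader like this never has to merge it with HEAD and DEPREL.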

Empty nodes in the enhanced representation

A mechanism is needed for empty nodes in the enhanced representation (the base representation will always be empty-node-free). Each empty node goes on its own line with an index such as 2.1, which means “the first empty node after the syntactic word indexed 2”. This line is placed right after word 2. Naturally, empty nodes can be referred to only from the DEPS field.

Filtering out all lines whose ID contains a decimal point is now sufficient to obtain a CoNLL-U file without empty nodes, whose HEAD and DEPREL fields behave as usual.

1	Mary	_	_	_	_	2	nsubj	2:nsubj	_
2	won	_	_	_	_	0	root	0:root	_
3	silver	_	_	_	_	2	obj	2:obj	_
4	and	_	_	_	_	5	cc	5.1:cc	_
5	Sue	_	_	_	_	2	conj	5.1:nsubj	_
5.1	_	_	_	_	_	2	conj	2:conj	_
6	bronze	_	_	_	_	5	orphan	5.1:obj	_
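The filtering step can be sketched in a few lines of Python. This is only an illustration, assuming that empty nodes are recognizable by a decimal index such as 5.1 in the first column; the names is_empty_node, strip_empty_nodes and make_row are hypothetical.

```python
def is_empty_node(line):
    """Return True if the line's ID field looks like an empty-node index (e.g. 5.1)."""
    if line.startswith("#"):
        return False  # comment lines are never nodes
    node_id = line.split("\t", 1)[0]
    return "." in node_id


def strip_empty_nodes(lines):
    """Keep every line except those that describe empty nodes."""
    return [line for line in lines if not is_empty_node(line)]


def make_row(node_id, form):
    """Build a 10-field row with only ID and FORM filled in (illustration only)."""
    return "\t".join([node_id, form] + ["_"] * 8)


sentence = [make_row("5", "Sue"), make_row("5.1", "_"), make_row("6", "bronze")]
basic_only = strip_empty_nodes(sentence)
```

Note that this touches only whole lines: the DEPS values of the remaining words may still mention the removed node, so a tool that wants a purely basic file would have to blank DEPS as well.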

Sentence-level metadata

As before, sentence-level metadata can be provided and all tools are required to pass it through. No particular requirements are put on these fields other than “no trailing whitespace”. Two recognized key = value pairs should be standardized:

sent_id = global_sentence_id
text = sentence text

sent_id is compulsory and as per #321 it should not contain the / (slash) character.
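A reader for these pairs might look like the following Python sketch. It assumes that metadata appears on comment lines starting with #, which precede the sentence, and that lines without an = sign are free-form comments to be passed through. The function names parse_metadata and check_sent_id are hypothetical.

```python
def parse_metadata(comment_lines):
    """Collect 'key = value' pairs from the comment lines of one sentence."""
    meta = {}
    for raw in comment_lines:
        line = raw.rstrip("\n")
        if not line.startswith("#"):
            continue
        if line != line.rstrip():
            raise ValueError("trailing whitespace in metadata line: %r" % line)
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = line[1:].partition("=")
        if not sep:
            continue  # free-form comment, nothing to extract
        meta[key.strip()] = value.strip()
    return meta


def check_sent_id(meta):
    """Enforce the two rules stated for sent_id: present, and no slash."""
    sent_id = meta.get("sent_id")
    if not sent_id:
        raise ValueError("sent_id is compulsory")
    if "/" in sent_id:
        raise ValueError("sent_id must not contain '/': %r" % sent_id)
    return sent_id
```

A pass-through tool would still write the original comment lines back unchanged; the dictionary is only a convenience for validation.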

MISC field

The only global requirement on the MISC field is that it can be split on the | (bar) character without any complex handling of escaping. The requirement that MISC contain no whitespace characters is dropped. Of course, TAB characters and trailing whitespace are still not allowed. It is likely that spacesBefore and spacesAfter will be standardized as a part of the MISC field.
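The splitting rule is simple enough to sketch in Python. The function name parse_misc is hypothetical, as is the choice to return key/value pairs; the text above only requires that a plain split on | works. The sketch also assumes that _ marks an empty MISC field, as in the other columns.

```python
def parse_misc(misc):
    """Split a MISC value on '|' and return (key, value) pairs.

    Items without '=' are returned with value None.
    """
    if "\t" in misc:
        raise ValueError("MISC must not contain TAB characters")
    if misc != misc.rstrip():
        raise ValueError("MISC must not have trailing whitespace")
    if misc == "_":
        return []
    items = []
    for item in misc.split("|"):
        # No unescaping step: the format guarantees a plain split is enough.
        key, sep, value = item.partition("=")
        items.append((key, value if sep else None))
    return items
```

For instance, parse_misc("spacesAfter=1|note=has inner space") yields two pairs, with the inner spaces preserved; the values shown here are made up for illustration, since the text does not yet define what spacesAfter contains.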