This is part of archived UD v1 documentation. See http://universaldependencies.org/ for the current version.

home v2/v2 issue tracker

CoNLL-U Format in UD v2

Some of the proposed changes will require modifications to the CoNLL-U format. Whenever possible, we want to do this in the least disruptive fashion possible, keeping backward compatibility as far as we can. In particular, we strongly recommend not changing the number of fields and/or their meaning for fear of breaking people’s tools. We recommend the following changes for v2:

Allow spaces in the FORM and LEMMA fields (see word segmentation)
Let the DEPS field contain the entire enhanced dependency graph (not only additional relations on top of the basic trees) (see enhanced dependencies)
Allow empty nodes with an indexing scheme distinct from both tokens and words and with relations in the DEPS field (see ellipsis)
Standardize sentence-level metadata
Restrict the use of the MISC field for ease of processing

Enhanced representation in the DEPS field

The deps field holds the enhanced representation. Unlike previously thought, it seems a better idea for the DEPS field to hold the entire enhanced parsed graph, not only a smaller set of relations on top of the base layer in HEAD+DEPREL. The primary reason is that minimally the ellipsis relations and in future undoubtedly many others are not a part of the enhanced graph but are rather replaced by some relations in it. See enhanced dependencies.

Empty nodes in the enhanced representation

A mechanism is needed for empty nodes in the enhanced representation (the base representation will always be empty-node-free). These will be on their separate lines indexed as E2.1 which means “the first empty node after the syntactic word indexed 2”. This line would be placed right after the word 2. Naturally, these empty nodes can be referred to only from the DEPS field.

There doesn’t seem to be a clear idea of whether the empty nodes should receive any values in the FORM, LEMMA, UPOS, FEAT fields. Probably UPOS could be somehow realistically filled.

Grepping out all lines which start with E is now sufficient to obtain a CoNLL-U file without empty nodes whose HEAD and DEPREL fields behave as usual.

1	Mary	_	_	_	_	2	nsubj	2:nsubj	_
2	won	_	_	_	_	0	root	0:root	_
3	silver	_	_	_	_	2	dobj	2:dobj	_
4	and	_	_	_	_	2	conj>cc	E5.1:cc	_
5	Sue	_	_	_	_	2	conj>nsubj	E5.1:nsubj	_
E5.1	_	_	_	_	_	2	conj	2:conj	_
6	bronze	_	_	_	_	2	conj>dobj	E5.1:dobj	_

Sentence-level metadata

Sentence-level metadata can be provided as before and all tools are required to pass it through, as before. No particular requirements are put on these fields other than “no trailing whitespace”. Several recognized key = value pairs should be standardized:

sent_id = global_sentence_id
text = sentence text

sent_id is compulsory and as per #321 it should not contain the / (slash) character.

MISC field

The only global requirement on the MISC field is that it can be split on the | (bar) character without any complex processing of escaping. The requirement of MISC containing zero whitespace characters is dropped. Of course no TAB characters are allowed and no trailing whitespace is allowed. It is likely that spacesBefore and spacesAfter will be standardized as a part of the MISC field.