home edit page issue tracker

This page pertains to UD version 2.

CoNLL-U Plus Format

In theory, the basic CoNLL-U format can encode any kind of annotation using a combination of the sentence-level comments and the MISC attributes. However, in projects that systematically add a new annotation layer to existing UD treebanks it becomes impractical to pack everything into these two places.

We define a general technique of extending the CoNLL-U format, with the hope that various extensions will remain as compatible with each other as possible. We call this generic format “CoNLL-U Plus,” with the filename extension .conllup.

Note that this page is not about the enhanced UD representation, which, despite being arguably an extension in its own right, is part of the standard CoNLL-U format.

New columns

The extension is actually very simple. While a CoNLL-U file has always precisely ten columns separated by TAB characters, a CoNLL-U Plus file can have any non-zero number of columns. In addition, the first line must be a sentence-level comment (i.e., starting with a # character) that lists the names of the columns used in this file. There are ten predefined column names that identify the original columns of the core CoNLL-U files (see specification). A CoNLL-U Plus file may contain any subset of them (or none of them, although for stand-off annotation of UD files, at least the ID column is typically needed).

A valid CoNLL-U file thus becomes valid CoNLL-U Plus file when the comment listing its columns is inserted as the first line (here the column names are separated by a simple space character instead of a TAB):

# newdoc id = mf920901-001
# newpar id = mf920901-001-p1
# sent_id = mf920901-001-p1s1A
# text = Slovenská ústava: pro i proti
# text_en = Slovak constitution: pros and cons
1   Slovenská   slovenský   ADJ     AAFS1----1A---- Case=Nom|Degree=Pos|Gender=Fem|Number=Sing|Polarity=Pos 2 amod _ _
2   ústava      ústava      NOUN    NNFS1-----A---- Case=Nom|Gender=Fem|Number=Sing|Polarity=Pos 0 root _ SpaceAfter=No
3   :           :           PUNCT   Z:------------- _          2       punct   _       _
4   pro         pro         ADP     RR--4---------- Case=Acc   2       appos   _       LId=pro-1
5   i           i           CCONJ   J^------------- _          6       cc      _       LId=i-1
6   proti       proti       ADP     RR--3---------- Case=Dat   4       conj    _       LId=proti-1

Note that the comment can safely reside in a CoNLL-U file and it stays a valid CoNLL-U file as long as no columns are added, reordered or removed. However, we intentionally do not make the comment mandatory in basic CoNLL-U files. We want to be able to simply split a basic CoNLL-U file on a sentence boundary and get two valid CoNLL-U files—something that is not possible with CoNLL-U Plus.

Any column names other than the original ten are considered project-specific and must use a namespace identifier that distinguishes the columns of the given project from columns defined by other initiatives, should they eventually end up in the same file. A column name consists of uppercase English letters and of a colon as the namespace delimiter:

# global.columns = ID FORM PARSEME:MWE

The internal requirements and semantics of the new column must be documented by the initiative that defined the column. The only global limitation specified by CoNLL-U Plus is that the column does not contain a TAB, CR or LF character (but it may contain other whitespace characters if allowed by the column specification). In addition:

If the file is an extension of an existing UD treebank, it is assumed that sentence-level metadata defined in UD (in particular, the text attribute), as well as the contents of the core UD columns, is identical to the source treebank.

Cross-reference to a UD treebank

In case of stand-off annotation, it becomes crucial to be able to identify the underlying UD file. Any official UD release is identified by its permanent URI in the LINDAT/CLARIN repository, e.g.,

Within a release, each file can be identified by a path. For UD releases, the path consists of the name of the treebank folder (corresponding to the name of the Github repository, e.g., UD_German-PUD) and of the actual name of the file (e.g., de_pud-ud-test.conllu). The delimiter character is always the slash and never the backslash, regardless of operating system: UD_German-PUD/de_pud-ud-test.conllu.

Within a file, each sentence has a unique sentence id (UD actually requires that the sentence id be unique treebank-wide, but for easier processing we keep the filename in the identification string). Within a sentence, the token/word id from the ID column unambiguously points to a tree node or a multi-word token.

In the CoNLL-U Plus file, the reference appears in a sentence-level attribute called source_sent_id, where the id string contains a source format specifier, followed by three of the reference parts discussed above: release id, file path and sentence id. The format specifier is a single word consisting of lowercase English letters. As long as we are referring to a CoNLL-U file, the format specifier is always “conllu”. This allows us to distinguish references to source corpora that are not saved in CoNLL-U; nevertheless, we do not define the other formats here and whoever uses them must also document them.

These four parts are separated by spaces, not slashes, because the release id itself contains slashes.

The source_sent_id attribute is optional and may be omitted if the file belongs to a standalone corpus rather than a stand-off annotation layer. On the other hand, the sent_id attribute is required even if there is a source_sent_id attribute in the same sentence. It is not even required that the value of sent_id is identical to the last part of source_sent_id, although it often makes sense and may be required by individual annotation projects. The only requirement is that the value of sent_id is unique—at least within the file in which it appears, but a project may require that it is unique across a set of files that constitute a corpus. Furthermore, the general requirements of the basic CoNLL-U format also apply (e.g., the sentence id cannot contain whitespace characters).

# source_sent_id = conllu http://hdl.handle.net/11234/1-2837 UD_German-GSD/de_gsd-ud-train.conllu train-s1682
# sent_id = train-s1682
# text = Der CDU-Politiker strebt einen einheitlichen Wohnungsmarkt an, auf dem sich die Preise an der ortsüblichen Vergleichsmiete orientieren.
1	Der	DET	2	det	_	*
2	CDU	PROPN	4	compound	SpaceAfter=No	*
3	-	PUNCT	2	punct	SpaceAfter=No	*
4	Politiker	NOUN	5	nsubj	_	*
5	strebt	VERB	0	root	_	2:VPC.full
6	einen	DET	8	det	_	*
7	einheitlichen	ADJ	8	amod	_	*
8	Wohnungsmarkt	NOUN	5	obj	_	*
9	an	ADP	5	compound:prt	SpaceAfter=No	2
10	,	PUNCT	5	punct	_	*
11	auf	ADP	12	case	_	*
12	dem	PRON	20	obl	_	*
13	sich	PRON	20	obj	_	1:IRV
14	die	DET	15	det	_	*
15	Preise	NOUN	20	nsubj	_	*
16	an	ADP	19	case	_	*
17	der	DET	19	det	_	*
18	ortsüblichen	ADJ	19	amod	_	*
19	Vergleichsmiete	NOUN	20	obl	_	*
20	orientieren	VERB	8	acl	SpaceAfter=No	1
21	.	PUNCT	5	punct	_	*

Cross-reference to another source

Other cross-references are in principle similar to referring to UD treebanks. Instead of UD release handle, another URI identifying the download site can be used; however, it must permanently refer to the same version of the corpus, i.e., the corpus must be immutable. Using LINDAT/CLARIN handles is highly recommended.

If the source treebank is local or if there is no source treebank, the release id is a single period (.).

If the release id identifies just one file, the file path is also a single period (.).

The sentence id part of the reference string must be identical to the sent_id in the CoNLL-U file we are referring to.

Known extensions of UD

A few projects ran before this specification was created; their data probably do not comply with this specification.

One extension was based on an early draft of this specification. It is not 100% compliant with the final version but it is close:

Planned extensions:

Related work: The CoNLL-* file formats date back to the 2006 task (CoNLL-X conference, hence the format was sometimes called CoNLL-X and sometimes CoNLL-ST (for shared task) or simply CoNLL. Before CoNLL-U (universal dependencies) and CoNLL-U Plus, some extensions of the format were proposed by Straňák and Štěpánek (2010) [Pavel Straňák and Jan Štěpánek. 2010. Representing Layered and Structured Data in the CoNLL-ST Format. In: Proceedings of the Second International Conference on Global Interoperability for Language Resources, pp. 143–152, Hong Kong, China, ISBN 978-962-442-323-5.]