CoNLL 2017 Shared Task

The TIRA platform

TIRA will be used to evaluate the systems. Participants will install their systems in dedicated virtual machines provided by TIRA. During the test phase, the systems will be given access to the test data and will process it inside the VM; the evaluation script will run there as well. See the links below for more details.

Data format and evaluation details

The CoNLL-U data format, used for Universal Dependencies treebanks, is described in more detail at http://universaldependencies.org/format.html. It is deliberately similar to the CoNLL-X format that was used in the CoNLL 2006 Shared Task and has become a de facto standard since then. Each word has its own line, and there are tab-separated columns for the word form, lemma, POS tag, etc. For instance, the following snippet encodes the English sentence They buy and sell books.

1	They	they	PRON	PRP	Case=Nom|Number=Plur	2	nsubj	_	_
2	buy	buy	VERB	VBP	Number=Plur|Person=3|Tense=Pres	0	root	_	_
3	and	and	CONJ	CC	_	2	cc	_	_
4	sell	sell	VERB	VBP	Number=Plur|Person=3|Tense=Pres	2	conj	_	_
5	books	book	NOUN	NNS	Number=Plur	2	dobj	_	SpaceAfter=No
6	.	.	PUNCT	.	_	2	punct	_	_
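For illustration, the ten columns of such a file can be read with a few lines of Python. The following is a minimal sketch, not official UD tooling; the column names follow the format specification.

from typing import Dict, List

# The ten tab-separated CoNLL-U columns, in order.
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]

def parse_sentence(lines: List[str]) -> List[Dict[str, str]]:
    """Turn the lines of one CoNLL-U sentence into a list of column dicts."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        rows.append(dict(zip(COLUMNS, line.split("\t"))))
    return rows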

Syntactic words vs. multi-word tokens

However, there are a few important extensions with respect to the CoNLL-X format. Perhaps the most important is the distinction between syntactic words and multi-word tokens. It makes the tokenization step in UD harder than the relatively simple procedure called tokenization in other areas of NLP. For instance, German zum is a contraction of the preposition zu “to” and the article dem “the”. In UD it is a multi-word token consisting of two syntactic words, zu and dem, and it is these syntactic words that serve as nodes in dependency relations. Learning this is harder than separating punctuation from words, because a contraction is not a pure concatenation of the participating words.

The CoNLL-U format uses two different mechanisms here. Punctuation that is conventionally written adjacent to a word is a separate single-“word” token, and an attribute in the last column indicates that there was no whitespace character between the punctuation symbol and the word. A contraction, on the other hand, is a multi-word token, introduced by a separate line giving the range of the following syntactic words that belong to it. Consider the German phrase zur Stadt, zum Haus “to the city, to the house”. The corresponding CoNLL-U section could look like this:

1-2	zur	_	_	_	_	_	_	_	_
1	zu	_	ADP	_	_	3	case	_	_
2	der	_	DET	_	_	3	det	_	_
3	Stadt	_	NOUN	_	_	0	root	_	SpaceAfter=No
4	,	_	PUNCT	_	_	3	punct	_	_
5-6	zum	_	_	_	_	_	_	_	_
5	zu	_	ADP	_	_	7	case	_	_
6	dem	_	DET	_	_	7	det	_	_
7	Haus	_	NOUN	_	_	3	conj	_	_
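Continuing the sketch above, a reader of such a file must recognize range lines by their hyphenated id and group the following syntactic words under them. The function below is one possible way to do that (ids containing a period mark empty nodes, which also exist in CoNLL-U and are skipped here).

def group_tokens(rows):
    """Group syntactic words under the multi-word token (range line)
    covering them; rows are the column dicts from parse_sentence."""
    tokens, i = [], 0
    while i < len(rows):
        wid = rows[i]["id"]
        if "-" in wid:                       # range line such as "1-2"
            start, end = map(int, wid.split("-"))
            n = end - start + 1              # number of syntactic words
            tokens.append((rows[i]["form"], rows[i + 1 : i + 1 + n]))
            i += 1 + n
        elif "." in wid:                     # empty node; not a surface token
            i += 1
        else:                                # ordinary one-word token
            tokens.append((rows[i]["form"], [rows[i]]))
            i += 1
    return tokens

For the example above, group_tokens pairs the surface token zur with the two syntactic words zu and der, while Stadt, the comma, and Haus each stand alone.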

We will not evaluate whether the system correctly generated the range lines (1-2 zur and 5-6 zum, respectively), nor whether it generated the SpaceAfter=No attribute. But we will have to align the nodes (syntactic words) output by the system to those in the gold standard data. Thus if the system fails to recognize zur as a contraction and outputs

1	zur
2	Stadt
3	,

we will treat any relations going to or from the node zur as incorrect. The same will happen with the node “Stadt,”, should the system fail to separate punctuation from the word Stadt.

If the system wrongly splits the word Haus and outputs

7-8	Haus
7	Hau
8	das

relations involving either Hau or das will be considered incorrect.

Even if the system recognizes zur as a contraction but outputs the wrong syntactic word forms, the mismatched words will be considered incorrect:

1-2	zur
1	zur
2	der

Relations involving node 1 are incorrect, but relations involving node 2 may be correct.

Aligning system words with the gold standard

Easy part: suppose there are no multi-word tokens (contractions). Both token sequences (gold and system) then share the same underlying text (minus whitespace), so each token can be represented as a character range within that text. We can intersect the system character ranges with the gold character ranges and find the alignment in a single pass.
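As a rough sketch of this step (assuming, as stated above, that gold and system tokens concatenate to the same whitespace-free text), tokens can be keyed by their character spans; only tokens whose spans coincide exactly get aligned:

def char_spans(forms):
    """Map each token to its character range in the whitespace-free text."""
    spans, start = [], 0
    for form in forms:
        spans.append((start, start + len(form)))
        start += len(form)
    return spans

def align_tokens(gold_forms, sys_forms):
    """Return (gold_index, system_index) pairs for tokens whose character
    ranges are identical; a token whose range merely overlaps a gold range
    has no counterpart, and its relations score zero."""
    gold = {span: i for i, span in enumerate(char_spans(gold_forms))}
    return [(gold[span], j)
            for j, span in enumerate(char_spans(sys_forms))
            if span in gold]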

Now let’s assume there are multi-word tokens. Their syntactic words may contain anything, without any similarity to the original text; however, the data still contains the original surface form, so we know which part of the underlying text each multi-word token corresponds to. We therefore only have to align the individual words between a gold and a system multi-word token, and we use the LCS (longest common subsequence) algorithm to do that.
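A standard dynamic-programming LCS over the word forms, sketched below, recovers the matching pairs; the official evaluation script may differ in details such as case handling.

def lcs_align(gold_words, sys_words):
    """Align words inside corresponding multi-word tokens via the
    longest common subsequence of their forms."""
    n, m = len(gold_words), len(sys_words)
    # lcs[i][j] = length of the LCS of gold_words[i:] and sys_words[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if gold_words[i] == sys_words[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:  # trace back one optimal matching
        if gold_words[i] == sys_words[j]:
            pairs.append((i, j)); i += 1; j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs

For the gold words zu, der against the system output zur, der from the earlier example, the LCS is just der, so only node 2 is aligned, exactly as described above.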

Sentence boundaries will be ignored during token alignment, i.e., the entire test set will be aligned at once. The systems will have to perform sentence segmentation in order to produce valid CoNLL-U files, but the sentence boundaries will be evaluated only indirectly, through dependency relations: a dependency relation that goes across a gold sentence boundary is incorrect. If, on the other hand, the system generates a false sentence break, it will not be penalized directly; but because the gold dependency tree of a sentence is connected, at least one gold relation must cross the point of the false break, and the system cannot produce a relation across its own sentence boundary. Not getting points for such relations is an indirect penalization of wrong sentence segmentation.
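To make the indirect penalty concrete, here is a simplified scoring sketch; it is our illustration, not the official evaluation script. Words carry their heads as indices into the whole test set (None for root), and a system relation counts only if its dependent and head both align to the gold counterparts and the label matches.

def correct_relations(gold, system, alignment):
    """gold, system: lists of (head, deprel) pairs, with heads given as
    global word indices across the whole test set and None for root.
    alignment: dict from system word index to gold word index.
    A relation whose dependent or head failed to align can never count,
    which is how tokenization and segmentation errors are penalized."""
    hits = 0
    for s_idx, g_idx in alignment.items():
        s_head, s_rel = system[s_idx]
        g_head, g_rel = gold[g_idx]
        heads_match = (alignment.get(s_head) == g_head
                       if s_head is not None else g_head is None)
        if heads_match and s_rel == g_rel:
            hits += 1
    return hits

A system relation across a gold sentence boundary fails here because the dependent's gold head lies inside the gold sentence, while a false system break simply leaves the crossing gold relation unmatched.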