CoNLL 2017 Shared Task
Data format and evaluation details
The CoNLL-U data format, used for Universal Dependencies treebanks, is described in more detail at http://universaldependencies.org/format.html. It is deliberately similar to the CoNLL-X format that was used in the CoNLL 2006 Shared Task and has since become a de facto standard. Each word has its own line, with tab-separated columns for word form, lemma, POS tag, etc. For instance, the following snippet encodes the English sentence They buy and sell books.
1	They	they	PRON	PRP	Case=Nom|Number=Plur	2	nsubj	_	_
2	buy	buy	VERB	VBP	Number=Plur|Person=3|Tense=Pres	0	root	_	_
3	and	and	CONJ	CC	_	2	cc	_	_
4	sell	sell	VERB	VBP	Number=Plur|Person=3|Tense=Pres	2	conj	_	_
5	books	book	NOUN	NNS	Number=Plur	2	dobj	_	SpaceAfter=No
6	.	.	PUNCT	.	_	2	punct	_	_
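To make the column layout concrete, here is a minimal Python sketch that splits one token line into named columns. The column names follow the CoNLL-U format documentation; the helper names `parse_token_line` and `parse_feats` are hypothetical and not part of any official tool.

```python
# Column names as listed in the CoNLL-U format documentation.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_token_line(line):
    """Split one tab-separated CoNLL-U token line into a dict keyed by column name."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
    return dict(zip(FIELDS, columns))

def parse_feats(feats):
    """Turn 'Case=Nom|Number=Plur' into a dict; '_' means the column is empty."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

token = parse_token_line("1\tThey\tthey\tPRON\tPRP\tCase=Nom|Number=Plur\t2\tnsubj\t_\t_")
print(token["form"], token["upos"], token["head"], token["deprel"])
print(parse_feats(token["feats"]))
```

The sketch handles only ordinary token lines; comment lines and sentence-separating blank lines would need to be filtered out before calling it.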
Syntactic words vs. multi-word tokens
There are, however, a few important extensions with respect to the CoNLL-X format. Perhaps the most important is the distinction between syntactic words and multi-word tokens. It makes the tokenization step in UD harder than the relatively simple procedure called tokenization in other areas of NLP. For instance, German zum is a contraction of the preposition zu “to” and the article dem “the”. In UD it is a multi-word token consisting of two syntactic words, zu and dem. These syntactic words are the nodes in dependency relations. Learning to split contractions is harder than separating punctuation from words, because a contraction is not a pure concatenation of the participating words. The CoNLL-U format uses two different mechanisms here. Punctuation that is conventionally written adjacent to a word is a separate single-“word” token, and an attribute in the last column indicates that there was no whitespace character between the punctuation symbol and the word. A contraction, on the other hand, is a multi-word token: it has a separate line that starts with the range of the following syntactic words that belong to it. Consider the German phrase zur Stadt, zum Haus “to the city, to the house”. The corresponding CoNLL-U section could look like this:
1-2	zur	_	_	_	_	_	_	_	_
1	zu	_	ADP	_	_	3	case	_	_
2	der	_	DET	_	_	3	det	_	_
3	Stadt	_	NOUN	_	_	0	root	_	SpaceAfter=No
4	,	_	PUNCT	_	_	3	punct	_	_
5-6	zum	_	_	_	_	_	_	_	_
5	zu	_	ADP	_	_	7	case	_	_
6	dem	_	DET	_	_	7	det	_	_
7	Haus	_	NOUN	_	_	3	conj	_	_
We will not evaluate whether the system correctly generated the range lines (1-2 zur and 5-6 zum, respectively), nor whether it generated the SpaceAfter=No attribute. But we will have to align the nodes (syntactic words) output by the system to those in the gold standard data. Thus if the system fails to recognize zur as a contraction and outputs
1	zur
2	Stadt
3	,
we will treat any relations going to or from the node zur as incorrect. The same will happen with the node “Stadt,” should the system fail to separate punctuation from the word Stadt.
If the system wrongly splits the word Haus and outputs
7-8	Haus
7	Hau
8	das
relations involving either Hau or das will be considered incorrect.
Even if the system recognizes zur as a contraction but outputs wrong syntactic word forms, the affected words will be considered incorrect:
1-2	zur
1	zur
2	der
Relations involving node 1 are incorrect but relations involving node 2 may be correct.
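The distinction between range lines and syntactic-word lines described in this section can be sketched in Python as follows. This is only an illustration: the function name `read_words_and_tokens` is hypothetical, and the sketch assumes a single sentence with no comment lines and no other special ID types.

```python
def read_words_and_tokens(lines):
    """Return (words, tokens) for the tab-separated CoNLL-U lines of one sentence.

    words:  list of (id, form) for syntactic words, i.e. the tree nodes
    tokens: list of (surface_form, [ids of the words it covers])
    """
    words, tokens = [], []
    covered_until = 0  # last word ID covered by the most recent range line
    for line in lines:
        cols = line.split("\t")
        if "-" in cols[0]:
            # Range line such as "1-2 zur": a multi-word token.
            start, end = map(int, cols[0].split("-"))
            tokens.append((cols[1], list(range(start, end + 1))))
            covered_until = end
        else:
            word_id = int(cols[0])
            words.append((word_id, cols[1]))
            if word_id > covered_until:
                # A word outside any range is a surface token of its own.
                tokens.append((cols[1], [word_id]))
    return words, tokens

SAMPLE = """
1-2 zur _ _ _ _ _ _ _ _
1 zu _ ADP _ _ 3 case _ _
2 der _ DET _ _ 3 det _ _
3 Stadt _ NOUN _ _ 0 root _ SpaceAfter=No
4 , _ PUNCT _ _ 3 punct _ _
5-6 zum _ _ _ _ _ _ _ _
5 zu _ ADP _ _ 7 case _ _
6 dem _ DET _ _ 7 det _ _
7 Haus _ NOUN _ _ 3 conj _ _
"""
# Real CoNLL-U uses tabs; the sample is typed with spaces and converted here.
LINES = ["\t".join(row.split()) for row in SAMPLE.strip().splitlines()]

words, tokens = read_words_and_tokens(LINES)
print(words)
print(tokens)
```

The seven entries in `words` are the nodes that take part in dependency relations, while the five entries in `tokens` reproduce the surface text zur Stadt, zum Haus.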
Aligning system words with the gold standard
Easy part: suppose there are no multi-word tokens (contractions). Both token sequences (gold, system) share the same underlying text (minus whitespace). Tokens can be represented as character ranges. We can find intersections of system character ranges with gold character ranges and find the alignment in one run.
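A minimal Python sketch of this easy case follows. It is not the official evaluation code: the function names are hypothetical, and it assumes that a system token counts as aligned only when its character range is identical to a gold range.

```python
def char_spans(forms):
    """Map each token form to a (start, end) range in the text with whitespace removed."""
    spans, pos = [], 0
    for form in forms:
        length = len("".join(form.split()))  # ignore any whitespace inside a form
        spans.append((pos, pos + length))
        pos += length
    return spans

def align_by_spans(gold_forms, system_forms):
    """Return (gold_index, system_index) pairs whose character ranges are identical."""
    gold, system = char_spans(gold_forms), char_spans(system_forms)
    pairs, gi, si = [], 0, 0
    # Single pass over both sequences, always advancing the range that ends first.
    while gi < len(gold) and si < len(system):
        if gold[gi] == system[si]:
            pairs.append((gi, si))
            gi += 1
            si += 1
        elif gold[gi][1] <= system[si][1]:
            gi += 1
        else:
            si += 1
    return pairs

gold_forms = ["They", "buy", "and", "sell", "books", "."]
system_forms = ["They", "buy", "and", "sell", "books."]  # punctuation not split off
print(align_by_spans(gold_forms, system_forms))
```

In this example the first four words align, while the gold words books and "." have no counterpart because the system produced a single token covering both.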
Now let’s assume there are multi-word tokens. They may contain anything, without any similarity to the original text; however, the data still contains the original surface form and we know which part of the underlying text they correspond to. So we only have to align the individual words between a gold and a system multi-word token. We use the LCS (longest common subsequence) algorithm to do that.
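The LCS step can be sketched in Python as below. This is a textbook dynamic-programming formulation, not necessarily how the evaluation script implements it; the function name `lcs_align` and the exact, case-sensitive comparison of word forms are assumptions.

```python
def lcs_align(gold_words, system_words):
    """Return (gold_index, system_index) pairs along a longest common subsequence."""
    n, m = len(gold_words), len(system_words)
    # table[i][j] = LCS length of gold_words[i:] and system_words[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if gold_words[i] == system_words[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover the aligned pairs.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if gold_words[i] == system_words[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs

# Gold analysis of "zur" versus a system that output the wrong first word.
print(lcs_align(["zu", "der"], ["zur", "der"]))
```

For the contraction zur, the gold words zu, der and the system words zur, der share only der, so only the second word is aligned and only relations involving it can be scored as correct.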
Sentence boundaries will be ignored during token alignment, i.e. the entire test set will be aligned at once. The systems will have to perform sentence segmentation in order to produce valid CoNLL-U files but the sentence boundaries will be evaluated only indirectly, through dependency relations. A dependency relation that goes across a gold sentence boundary is incorrect. If on the other hand the system generates a false sentence break, it will not be penalized directly, but there will necessarily be at least one gold relation that the system missed; not getting points for such relations will be an indirect penalization for wrong sentence segmentation.