Evaluation

The evaluation script is available for download here. See here for baseline results.

All systems will be required to generate valid output in the CoNLL-U format for all test sets. The definition of “valid” is not as strict as for released UD treebanks: for example, an unknown dependency relation label will only cost one point in the labeled accuracy score, but it will not render the entire file invalid. However, cycles, multiple root nodes, a wrong number of columns or wrong indexing of nodes will make the output invalid. The Labeled Attachment Score (LAS) on the particular test file will then be set to 0 (though the overall macro-score will still be non-zero if the outputs for the other test files are valid).

The systems will know the language and treebank code of each test set, but they must respond even to unknown language/treebank codes (for which no training data are available). The systems will be able to choose either the raw text or a file pre-processed by UDPipe as input. Every system must produce valid output for every test set.

The evaluation will focus on dependency relations, i.e., the index of the head node and the dependency label. POS tags, lemmas and morphological features are not counted in the main evaluation metric, but we will evaluate them as secondary metrics; participants are thus encouraged to include these values in the output if they can predict them. On the other hand, word segmentation must be reflected in the main metric, because the systems do not have access to gold-standard segmentation and identifying the words is a prerequisite for dependency evaluation.

The evaluation starts by aligning the system-produced words to the gold standard ones (see below). Once the words are aligned, we will compute LAS as the main scoring metric. Systems will be ranked by a macro-average over all test sets.
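
To make the ranking concrete, here is a minimal sketch (in Python; it is not the official scoring code) of the macro-average: an unweighted mean of the per-test-set LAS values, where a test set with invalid output contributes 0.

# Sketch: macro-averaged LAS over all test sets (not the official scoring code).
# A test set whose output was invalid contributes 0.0 but does not invalidate
# the overall score.

def macro_average_las(per_test_set_las):
    # per_test_set_las: dict mapping test-set id -> LAS F1 in [0, 1]
    if not per_test_set_las:
        return 0.0
    return sum(per_test_set_las.values()) / len(per_test_set_las)

# Hypothetical example: one invalid output (scored 0.0) lowers but does not
# zero out the macro-average.
print(macro_average_las({"en": 0.82, "de": 0.78, "xx_surprise": 0.0}))  # ~0.533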

Labeled Attachment Score (LAS) is a standard evaluation metric in dependency parsing: the percentage of words that are assigned both the correct syntactic head and the correct dependency label. For scoring purposes, only universal dependency labels will be taken into account, which means that language-specific subtypes such as acl:relcl (relative clause), a subtype of the universal relation acl (adnominal clause), will be truncated to acl both in the gold standard and in the parser output before comparison. (Parsers can still choose to predict language-specific subtypes if it improves accuracy.) In our setting, the standard LAS score also has to be modified to take word segmentation mismatches into account. A dependency is therefore scored as correct only if both nodes of the relation match existing gold-standard nodes. Precision P is the number of correct relations divided by the number of system-produced nodes; recall R is the number of correct relations divided by the number of gold-standard nodes. We then define LAS as the F1 score = 2PR / (P+R).
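
For illustration, the following sketch (again not the official evaluation script) computes this F1 over pre-aligned words; it assumes the word alignment described above has already been applied, so that each system word is paired with its gold counterpart and the predicted head is expressed as a gold node index (or None if the head is an unaligned word).

# Illustrative LAS computation over pre-aligned words (not the official script).
# `aligned` is a list of pairs ((sys_head, sys_deprel), (gold_head, gold_deprel)),
# one pair per system word matched to a gold word; sys_head is the gold index of
# the predicted head after alignment, or None if that head is unaligned, so a
# relation whose head does not match an existing gold node can never be correct.

def universal(deprel):
    # Truncate language-specific subtypes: acl:relcl -> acl.
    return deprel.split(":")[0]

def las_f1(aligned, n_system_words, n_gold_words):
    correct = sum(
        1 for (sys_head, sys_rel), (gold_head, gold_rel) in aligned
        if sys_head == gold_head and universal(sys_rel) == universal(gold_rel)
    )
    precision = correct / n_system_words if n_system_words else 0.0
    recall = correct / n_gold_words if n_gold_words else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0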

Besides the central metric and one overall ranking of the systems, we will evaluate the systems along various other dimensions and we may publish additional rankings for sub-tasks (e.g., performance on the surprise languages).

We use the Tira platform to evaluate the participating systems. Therefore, participants will submit systems, not parsed data, allowing us to keep unreleased test data hidden until after the task has been completed.

Data format and evaluation details

The CoNLL-U data format, used for Universal Dependencies treebanks, is described in more detail at http://universaldependencies.org/format.html. It is deliberately similar to the CoNLL-X format that was used in the CoNLL 2006 Shared Task and has become a de facto standard since then. Each word has its own line, and there are tab-separated columns for the word form, lemma, POS tag, etc. For instance, the following snippet encodes the English sentence They buy and sell books.

1	They	they	PRON	PRP	Case=Nom|Number=Plur	2	nsubj	_	_
2	buy	buy	VERB	VBP	Number=Plur|Person=3|Tense=Pres	0	root	_	_
3	and	and	CONJ	CC	_	2	cc	_	_
4	sell	sell	VERB	VBP	Number=Plur|Person=3|Tense=Pres	2	conj	_	_
5	books	book	NOUN	NNS	Number=Plur	2	dobj	_	SpaceAfter=No
6	.	.	PUNCT	.	_	2	punct	_	_
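
As a rough illustration of how such files can be consumed (a sketch only, not part of the task infrastructure), a minimal reader just splits every non-comment line into its ten tab-separated columns and treats blank lines as sentence boundaries:

# Minimal CoNLL-U reader sketch: yields one sentence at a time as a list of
# rows, each row being the list of ten column values. Comment lines (starting
# with '#') are skipped; multi-word token lines (IDs like '1-2') are kept as
# ordinary rows.

def read_conllu(path):
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:               # blank line = sentence boundary
                if sentence:
                    yield sentence
                    sentence = []
            elif not line.startswith("#"):
                sentence.append(line.split("\t"))
        if sentence:                   # last sentence without trailing blank line
            yield sentence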

Syntactic words vs. multi-word tokens

However, there are a few important extensions with respect to the CoNLL-X format. Perhaps the most important is the distinction between syntactic words and multi-word tokens, which makes the tokenization step in UD harder than the relatively simple procedure called tokenization in other areas of NLP. For instance, German zum is a contraction of the preposition zu “to” and the article dem “the”. In UD it is a multi-word token consisting of two syntactic words, zu and dem, and it is these syntactic words that serve as nodes in dependency relations. Learning this is harder than separating punctuation from words, because a contraction is not a pure concatenation of the participating words. The CoNLL-U format uses two different mechanisms here: punctuation that is conventionally written adjacent to a word is a separate single-“word” token, and an attribute in the last column indicates that there was no whitespace character between the punctuation symbol and the word; a contraction, on the other hand, is a multi-word token, which has a separate line starting with the range of the following syntactic words that belong to it. Consider the German phrase zur Stadt, zum Haus “to the city, to the house”. The corresponding CoNLL-U section could look like this:

1-2	zur	_	_	_	_	_	_	_	_
1	zu	_	ADP	_	_	3	case	_	_
2	der	_	DET	_	_	3	det	_	_
3	Stadt	_	NOUN	_	_	0	root	_	SpaceAfter=No
4	,	_	PUNCT	_	_	3	punct	_	_
5-6	zum	_	_	_	_	_	_	_	_
5	zu	_	ADP	_	_	7	case	_	_
6	dem	_	DET	_	_	7	det	_	_
7	Haus	_	NOUN	_	_	3	conj	_	_

We will not evaluate whether the system correctly generated the SpaceAfter=No attribute. However, the system-produced CoNLL-U file must be a tokenization of the original text: if the system decides to split zum into zu and dem, it must also generate the multi-word token line 5-6 zum.
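
One way to sanity-check this requirement (a sketch under simplifying assumptions, not the official validation) is to detokenize the output and compare it with the raw text, ignoring whitespace: surface forms are taken from multi-word token lines where present and from plain word lines otherwise.

# Sketch: check that the system output is a tokenization of the raw text.
# Concatenate the surface forms of tokens (multi-word token lines, plus word
# lines not covered by a range) and compare with the raw text, with all
# whitespace removed on both sides. Simplified: assumes well-formed IDs.

def surface_text(sentence_rows):
    covered_until = 0
    parts = []
    for row in sentence_rows:
        token_id, form = row[0], row[1]
        if "-" in token_id:                      # multi-word token line, e.g. 5-6
            covered_until = int(token_id.split("-")[1])
            parts.append(form)
        elif "." not in token_id:                # ordinary word line (skip empty nodes)
            if int(token_id) > covered_until:
                parts.append(form)
    return "".join(parts)

def is_tokenization_of(sentence_rows, raw_text):
    return surface_text(sentence_rows) == "".join(raw_text.split())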

We will have to align the nodes (syntactic words) output by the system to those in the gold standard data. Thus if the system fails to recognize zur as a contraction and outputs

1	zur
2	Stadt
3	,

we will treat any relations going to or from the node zur as incorrect. The same will happen with the node “Stadt,”, should the system fail to separate punctuation from the word Stadt.

If the system wrongly splits the word Haus and outputs

7-8	Haus
7	Hau
8	das

relations involving either Hau or das will be considered incorrect.

If the system recognizes zur as a contraction but outputs wrong syntactic word forms, the affected tokens will still be considered incorrect:

1-2	zur
1	zur
2	der

Relations involving node 1 are incorrect, but relations involving node 2 may be correct. Exception: the matching is case-insensitive, so both splitting Zum into zu dem and splitting zum into Zu dem are acceptable.

Aligning system words with the gold standard

The easy part first: suppose there are no multi-word tokens (contractions). Both token sequences (gold and system) share the same underlying text (minus whitespace), so each token can be represented as a character range. We can then intersect the system character ranges with the gold character ranges and find the alignment in a single pass.
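
Illustratively, the easy case could be implemented as follows (a sketch, not the official script): each token is turned into the character range it occupies in the whitespace-stripped text, and the two range lists are walked in a single pass, aligning tokens whose ranges coincide.

# Sketch of the easy case: no multi-word tokens. Each token is mapped to the
# character range it occupies in the whitespace-stripped underlying text.

def char_ranges(forms):
    ranges, start = [], 0
    for form in forms:
        end = start + len(form)
        ranges.append((start, end))
        start = end
    return ranges

def align_simple(system_forms, gold_forms):
    # Walk both range lists in one pass; a pair is aligned when the ranges coincide.
    sys_ranges, gold_ranges = char_ranges(system_forms), char_ranges(gold_forms)
    alignment, i, j = [], 0, 0
    while i < len(sys_ranges) and j < len(gold_ranges):
        if sys_ranges[i] == gold_ranges[j]:
            alignment.append((i, j))
            i += 1
            j += 1
        elif sys_ranges[i][1] <= gold_ranges[j][1]:
            i += 1
        else:
            j += 1
    return alignment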

Now let’s assume there are multi-word tokens. The syntactic words inside them may be arbitrary strings, with no similarity to the original text; however, the data still contain the original surface form, so we know which part of the underlying text each multi-word token corresponds to. We therefore only have to align the individual words between a gold multi-word token and the corresponding system multi-word token, and we use the LCS (longest common subsequence) algorithm to do that.
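
A sketch of this step (illustrative only): the words inside the two overlapping multi-word tokens are aligned by a standard LCS dynamic program over lowercased forms, which also makes the matching case-insensitive as described above.

# Sketch: align the words inside a system multi-word token with the words
# inside the corresponding gold multi-word token via longest common subsequence.

def lcs_align(system_words, gold_words):
    s = [w.lower() for w in system_words]
    g = [w.lower() for w in gold_words]
    # Standard LCS dynamic program over the two word sequences.
    dp = [[0] * (len(g) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) - 1, -1, -1):
        for j in range(len(g) - 1, -1, -1):
            if s[i] == g[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Backtrack to recover the aligned (system index, gold index) pairs.
    pairs, i, j = [], 0, 0
    while i < len(s) and j < len(g):
        if s[i] == g[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs

# Example from above: the system output the forms zur + der instead of zu + der.
print(lcs_align(["zur", "der"], ["zu", "der"]))  # [(1, 1)] - only "der" is aligned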

Sentence boundaries will be ignored during token alignment, i.e., the entire test set will be aligned at once. The systems will have to perform sentence segmentation in order to produce valid CoNLL-U files, but the sentence boundaries will be evaluated only indirectly, through dependency relations: a dependency relation that goes across a gold sentence boundary is incorrect. If, on the other hand, the system generates a false sentence break, it will not be penalized directly, but there will necessarily be at least one gold relation that the system missed; not getting credit for such relations is the indirect penalty for wrong sentence segmentation.