CoNLL 2017 Shared Task

Evaluation

A preliminary version of the evaluation script is available for download. The final version will be released together with the training data at the beginning of March 2017.

All systems will be required to generate valid output in the CoNLL-U format for all test sets. They will know the language and treebank code of each test set, but they must respond even to unknown language/treebank codes (for which there are no training data). Systems will be able to take either raw text or files pre-processed by UDPipe as input.

The evaluation will focus on dependency relations, i.e., the index of the head node and the dependency label. POS tags, lemmas, and morphological features are not counted in the main evaluation metric, but we will evaluate them as secondary metrics; participants are thus encouraged to include these values in the output if they can predict them. Word segmentation, on the other hand, must be reflected in the main metric, because the systems do not have access to gold-standard segmentation and identifying the words is a prerequisite for dependency evaluation.

The evaluation starts by aligning the system-produced words to the gold-standard ones (see details). Once the words are aligned, we will compute the Labeled Attachment Score (LAS) as the main scoring metric. Systems will be ranked by the macro-average of LAS over all test sets.
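As a rough illustration of the alignment step, the sketch below (our own simplification, not the official script) assigns each word a character span over the whitespace-free text and treats a system word as aligned to a gold word when their spans coincide; the official evaluation script additionally handles multiword tokens and partial mismatches.

    # A simplified word-alignment sketch: both tokenizations are assumed to be
    # derived from the same raw text, so every word can be mapped to a span over
    # the whitespace-free character sequence. Words with identical spans align.
    def char_spans(forms):
        """Assign each word form a (start, end) span over the concatenated,
        whitespace-free character sequence."""
        spans, offset = [], 0
        for form in forms:
            form = "".join(form.split())  # ignore any whitespace inside forms
            spans.append((offset, offset + len(form)))
            offset += len(form)
        return spans

    def align_words(gold_forms, system_forms):
        """Return (gold_index, system_index) pairs of words with identical spans."""
        gold_index_of = {span: i for i, span in enumerate(char_spans(gold_forms))}
        return [(gold_index_of[span], j)
                for j, span in enumerate(char_spans(system_forms))
                if span in gold_index_of]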

Labeled Attachment Score (LAS) is a standard evaluation metric in dependency parsing: the percentage of words that are assigned both the correct syntactic head and the correct dependency label. For scoring purposes, only universal dependency labels will be taken into account: language-specific subtypes such as acl:relcl (relative clause), a subtype of the universal relation acl (adnominal clause), will be truncated to acl both in the gold standard and in the parser output. (Parsers can still choose to predict language-specific subtypes if it improves accuracy.) In our setting, the standard LAS score must also be adapted to account for word segmentation mismatches. A dependency is therefore scored as correct only if both nodes of the relation match existing gold-standard nodes. Precision P is the number of correct relations divided by the number of system-produced nodes; recall R is the number of correct relations divided by the number of gold-standard nodes. We then define LAS as the F1 score = 2PR / (P+R).
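To make the definition concrete, here is a minimal sketch of the LAS computation, assuming the words have already been aligned as above; the data representation (head indices with -1 for the root, plain relation strings) and all names are illustrative and not taken from the official evaluation script.

    # LAS as the F1 score over labeled dependency relations, computed on
    # already-aligned words. Each word is a (head_index, deprel) pair, where
    # head_index points into the same word list and -1 marks the root.
    def universal(deprel):
        """Truncate a language-specific subtype such as 'acl:relcl' to 'acl'."""
        return deprel.split(":")[0]

    def las(gold, system, alignment):
        """gold, system: lists of (head_index, deprel); alignment: list of
        (gold_index, system_index) pairs of aligned words."""
        gold_of = {s: g for g, s in alignment}  # system index -> gold index
        correct = 0
        for g, s in alignment:
            g_head, g_rel = gold[g]
            s_head, s_rel = system[s]
            # The predicted head must map to the gold head (or both are root),
            # and the universal part of the relation must match.
            head_ok = (s_head == g_head == -1) or gold_of.get(s_head) == g_head
            if head_ok and universal(s_rel) == universal(g_rel):
                correct += 1
        precision = correct / len(system) if system else 0.0
        recall = correct / len(gold) if gold else 0.0
        return 2 * precision * recall / (precision + recall) if correct else 0.0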

Besides the central metric and one overall ranking of the systems, we will evaluate the systems along various other dimensions and we may publish additional rankings for sub-tasks (e.g., performance on the surprise languages). The evaluation script will be publicly available from the shared task website.

We plan to use the Tira platform (http://www.tira.io/) to evaluate the participating systems. Therefore, participants will submit systems, not parsed data, allowing us to keep unreleased test data hidden until after the task has been completed.