Evaluation

The evaluation script (2018 version) is available for download here.

All systems will be required to generate valid output in the CoNLL-U format for all test sets.

The definition of “valid” is not as strict as for released UD treebanks. For example, an unknown dependency relation label will only cost one point in the labeled accuracy score but it will not render the entire file invalid. However, cycles, multiple root nodes, wrong number of columns or wrong indexing of nodes will be considered invalid output. The score on the particular test file will then be set to 0 (but the overall macro-score will still be non-zero if outputs for other test files are valid).
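
For illustration, the following minimal sketch (in Python; this is not the official evaluation code, and the helper name is hypothetical) shows the kind of structural checks behind these validity requirements. It assumes the basic 10-column word lines of one sentence, already split on tabs, with comment, multi-word-token and empty-node lines removed and integer values in the HEAD column.

def sentence_is_valid(word_lines):
    # word_lines: the basic word lines of one sentence, each already split
    # into columns (no comments, no multi-word token or empty-node lines).
    if any(len(cols) != 10 for cols in word_lines):
        return False                          # wrong number of columns
    ids = [int(cols[0]) for cols in word_lines]
    if ids != list(range(1, len(ids) + 1)):
        return False                          # wrong indexing of nodes
    heads = {i: int(cols[6]) for i, cols in zip(ids, word_lines)}
    if sum(1 for h in heads.values() if h == 0) != 1:
        return False                          # no root or multiple root nodes
    for start in ids:                         # walk each node up to the root
        seen, node = set(), start
        while node != 0:
            if node in seen or node not in heads:
                return False                  # cycle, or head out of range
            seen.add(node)
            node = heads[node]
    return True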

The systems will know the language and treebank code of the test set, but they must respond even to unknown language/treebank codes (for which no training data are available). As input, the systems will be able to choose either the raw text or a version pre-processed by UDPipe. Every system must produce valid output for every test set.

Several different metrics, evaluating different aspects of annotation, will be computed and published for each system output. Three main system rankings will be based on three main metrics. None of them is more important than the others and we will not combine them into a single ranking. Participants who want to decrease task complexity may concentrate on improvements in just one metric; however, all participating systems will be evaluated with all three metrics, and participants are strongly encouraged to output all relevant annotation (syntax + morphology + lemmas), even if they just copy values predicted by the baseline model. The three main metrics are:

  • LAS (labeled attachment score) will be computed the same way as in the 2017 task so that results of the two tasks can be compared.
  • MLAS (morphology-aware labeled attachment score) is inspired by the CLAS metric computed in 2017, and extended with evaluation of POS tags and morphological features.
  • BLEX (bi-lexical dependency score) combines content-word relations with lemmatization (but not with tags and features).

All three metrics reflect word segmentation and relations between content words; this remains the main focus of the shared task. LAS also includes relations between other words but it ignores morphology and lemmas. The other two metrics are closer to content and meaning; MLAS scores should also be more comparable across typologically different languages.

Word segmentation must be reflected in the metrics because the systems do not have access to gold-standard segmentation, and identifying the words is a prerequisite for dependency evaluation.

The evaluation starts by aligning the system-produced words to the gold-standard ones; a relation cannot be counted as correct if either of the connected nodes cannot be aligned to the corresponding gold-standard node. The alignment algorithm requires that the systems preserve the input sequence of non-whitespace characters. If a system uses a tokenizer or morphological analyzer that normalizes or otherwise damages the input characters, then the system must remember the original non-whitespace characters and restore them in a post-processing step. Any multi-word tokens that the system produces must be properly marked as such, and the surface string to which they correspond must be indicated.

Labeled Attachment Score (LAS) is a standard evaluation metric in dependency parsing: the percentage of words that are assigned both the correct syntactic head and the correct dependency label. For scoring purposes, only universal dependency labels will be taken into account: language-specific subtypes such as acl:relcl (relative clause), a subtype of the universal relation acl (adnominal clause), will be truncated to acl in both the gold standard and the parser output during evaluation. (Parsers are still encouraged to output language-specific relations if they can predict them, as it makes them more useful outside the shared task.) As in 2017, all nodes, including punctuation, will be reflected in LAS. The metric also takes word segmentation mismatches into account: a dependency is scored as correct only if both nodes of the relation match existing gold-standard nodes. Precision P is the number of correct relations divided by the number of system-produced nodes; recall R is the number of correct relations divided by the number of gold-standard nodes. We then define LAS as the F1 score = 2PR / (P+R). Systems will be ranked by the macro-average of LAS over all test sets.
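
As a rough illustration (not the official scorer; all names here are hypothetical), the following sketch computes LAS from an already established word alignment. The alignment maps 1-based system word indices to gold word indices, and head index 0 stands for the artificial root.

def universal(deprel):
    # Language-specific subtypes such as acl:relcl are truncated to acl.
    return deprel.split(":", 1)[0]

def las_f1(alignment, sys_heads, sys_deprels, gold_heads, gold_deprels):
    # alignment: {system word index -> gold word index}; *_heads and *_deprels
    # are dictionaries keyed by the 1-based word indices of each side.
    correct = 0
    for s, g in alignment.items():
        heads_match = (
            (sys_heads[s] == 0 and gold_heads[g] == 0)        # both are roots
            or alignment.get(sys_heads[s]) == gold_heads[g]   # heads are aligned
        )
        if heads_match and universal(sys_deprels[s]) == universal(gold_deprels[g]):
            correct += 1
    p = correct / len(sys_heads) if sys_heads else 0.0
    r = correct / len(gold_heads) if gold_heads else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0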

Morphology-Aware Labeled Attachment Score (MLAS) aims at cross-linguistic comparability of the scores. It is an extension of CLAS (published experimentally in 2017), combined with evaluation of UPOS tags and morphological features. The core part is identical to LAS described above: for aligned system and gold nodes, their respective parent nodes are considered; if the system parent is not aligned with the gold parent or if the relation label differs, the word is not counted as correctly attached. Unlike in LAS, certain types of relations are not evaluated directly. Words attached via such relations (in either system or gold data) are not counted as independent words; instead, they are treated as features of the content words they belong to. Therefore, a system-produced word S counts as correct if all the following conditions are met (a code sketch of this check follows the list):

  • It is aligned to a gold-standard word G.
  • Their respective parent nodes are correctly aligned to each other.
  • The universal parts of their dependency relations are on the list of “content relations” and match.
  • The UPOS tags of words S and G are identical.
  • For selected morphological features (see the list below), their values at S and G are identical (a missing feature counts as an empty value). In the case of multi-values (e.g. Gender=Masc,Neut), the entire value strings must be identical. Both S and G may also have other features, but those are ignored in the evaluation.
  • “Functional children” of a node are child nodes attached via a relation that is on the list of “function relations”. If present, they define additional conditions for the S node to be counted as correct:
    • Every functional child of S must be aligned to a functional child of G.
    • Every functional child of G must be aligned to a functional child of S.
    • For every pair of aligned functional children FS and FG:
      • The universal part of the label of their relation to S (resp. G) must match.
      • Their UPOS tags must be identical.
      • The values of the listed morphological features must match, analogously to how the features of the content words are compared.
  • When precision P is computed, the number of correct words is divided by the total number of system-produced content words, i.e., those that are attached via a “content relation”.
  • When recall R is computed, the number of correct words is divided by the total number of gold-standard content words, i.e., those that are attached via a “content relation”.
  • “Content relations” are nsubj, obj, iobj, csubj, ccomp, xcomp, obl, vocative, expl, dislocated, advcl, advmod, discourse, nmod, appos, nummod, acl, amod, conj, fixed, flat, compound, list, parataxis, orphan, goeswith, reparandum, root, dep.
  • “Function relations” are aux, cop, mark, det, clf, case, cc.
  • Note that the relation punct is neither content nor functional. It is ignored in MLAS.
  • “Selected features” are PronType, NumType, Poss, Reflex, Foreign, Abbr, Gender, Animacy, Number, Case, Definite, Degree, VerbForm, Mood, Tense, Aspect, Voice, Evident, Polarity, Person, Polite. Note: All these features are defined as “universal” in the UD v2 guidelines. Nevertheless, the evaluation will reflect all values of these features that appear in the data, including additional language-specific values.
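
The sketch below spells out the per-word MLAS check described above. It is hypothetical code, not the official scorer: s and g are an aligned system/gold word pair, each carrying a UPOS tag, a feature dictionary, the universal part of its relation (udeprel), a parent and a list of children, and aligned(x, y) tests whether two nodes are aligned to each other.

CONTENT_RELATIONS = {
    "nsubj", "obj", "iobj", "csubj", "ccomp", "xcomp", "obl", "vocative",
    "expl", "dislocated", "advcl", "advmod", "discourse", "nmod", "appos",
    "nummod", "acl", "amod", "conj", "fixed", "flat", "compound", "list",
    "parataxis", "orphan", "goeswith", "reparandum", "root", "dep"}
FUNCTION_RELATIONS = {"aux", "cop", "mark", "det", "clf", "case", "cc"}
SELECTED_FEATURES = {
    "PronType", "NumType", "Poss", "Reflex", "Foreign", "Abbr", "Gender",
    "Animacy", "Number", "Case", "Definite", "Degree", "VerbForm", "Mood",
    "Tense", "Aspect", "Voice", "Evident", "Polarity", "Person", "Polite"}

def feats_match(s_feats, g_feats):
    # Only the selected features count; a missing feature equals the empty value,
    # and multi-values such as Gender=Masc,Neut must match as whole strings.
    return all(s_feats.get(f, "") == g_feats.get(f, "") for f in SELECTED_FEATURES)

def functional_children(word):
    # Children attached to the word via one of the function relations.
    return [c for c in word.children if c.udeprel in FUNCTION_RELATIONS]

def mlas_word_correct(s, g, aligned):
    if not (aligned(s.parent, g.parent)
            and s.udeprel == g.udeprel
            and s.udeprel in CONTENT_RELATIONS
            and s.upos == g.upos
            and feats_match(s.feats, g.feats)):
        return False
    # Functional children must be aligned one to one and agree on the universal
    # relation, the UPOS tag and the selected features.
    s_fun, g_fun = functional_children(s), functional_children(g)
    if len(s_fun) != len(g_fun):
        return False
    for fs in s_fun:
        fg = next((c for c in g_fun if aligned(fs, c)), None)
        if (fg is None or fs.udeprel != fg.udeprel or fs.upos != fg.upos
                or not feats_match(fs.feats, fg.feats)):
            return False
    return True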

Bilexical dependency score (BLEX) is similar to MLAS in that it focuses on relations between content words. Instead of morphological features, it incorporates lemmatization in the evaluation. It is thus closer to semantic content and evaluates two aspects of UD annotation that are important for language understanding: dependencies and lexemes. In BLEX, a system-produced word S is correct if it is aligned to a gold-standard word G, their parents are aligned, the universal parts of their relation types are identical and are listed as “content relations” (same as in MLAS), and their lemmas match. “Matching lemmas” normally mean identical strings in the lemma column; however, if the gold lemma is a single underscore character (“_”), any system-produced lemma is considered correct. As with MLAS, the number of correct words is divided by the total number of content words in the respective dataset in order to compute precision, recall and the F1 score. Note that functional children are ignored and don’t contribute to BLEX.
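
Under the same hypothetical word records and CONTENT_RELATIONS set as in the MLAS sketch above, the BLEX per-word check reduces to the following:

def blex_word_correct(s, g, aligned):
    # Same content-relation and attachment conditions as in MLAS, but lemmas
    # replace tags and features, and functional children are ignored.
    if not (aligned(s.parent, g.parent)
            and s.udeprel == g.udeprel
            and s.udeprel in CONTENT_RELATIONS):
        return False
    # A gold lemma consisting of a single underscore accepts any system lemma.
    return g.lemma == "_" or s.lemma == g.lemma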

Besides the metrics described above, we will evaluate the systems along various other dimensions. Participants are encouraged to predict even those morphological features and relation subtypes that are not evaluated in LAS/MLAS/BLEX; such systems are more useful for real deployment, and we therefore plan to evaluate all features in one of the secondary metrics. Furthermore, we may publish additional rankings for sub-tasks (e.g., performance on the low-resource languages).

We use the TIRA platform to evaluate the participating systems. Therefore, participants will submit systems, not parsed data, which increases the verifiability and reproducibility of the results.

Data format and evaluation details

The CoNLL-U data format, used for Universal Dependencies treebanks, is described in more detail at http://universaldependencies.org/format.html. It is deliberately similar to the CoNLL-X format that was used in the CoNLL 2006 Shared Task and has since become a de facto standard. Each word has its own line, and there are tab-separated columns for the word form, lemma, POS tag etc. For instance, the following snippet encodes the English sentence “They buy and sell books.”

1	They	they	PRON	PRP	Case=Nom|Number=Plur	2	nsubj	_	_
2	buy	buy	VERB	VBP	Number=Plur|Person=3|Tense=Pres	0	root	_	_
3	and	and	CCONJ	CC	_	2	cc	_	_
4	sell	sell	VERB	VBP	Number=Plur|Person=3|Tense=Pres	2	conj	_	_
5	books	book	NOUN	NNS	Number=Plur	2	obj	_	SpaceAfter=No
6	.	.	PUNCT	.	_	2	punct	_	_
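
As an illustration of the format (this reader is not part of the evaluation script; the Word record and function names are hypothetical), a minimal routine that loads such a snippet could look as follows:

from dataclasses import dataclass

@dataclass
class Word:
    id: int
    form: str
    lemma: str
    upos: str
    head: int
    deprel: str
    feats: dict

def read_sentence(lines):
    words, mwt_ranges = [], []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue                          # blank or comment line
        cols = line.rstrip("\n").split("\t")
        if "-" in cols[0]:                    # multi-word token line, e.g. 1-2
            start, end = cols[0].split("-")
            mwt_ranges.append((int(start), int(end), cols[1]))
            continue
        if "." in cols[0]:                    # empty node, not used here
            continue
        feats = (dict(kv.split("=", 1) for kv in cols[5].split("|"))
                 if cols[5] != "_" else {})
        words.append(Word(int(cols[0]), cols[1], cols[2], cols[3],
                          int(cols[6]), cols[7], feats))
    return words, mwt_ranges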

Syntactic words vs. multi-word tokens

However, there are a few important extensions with respect to the CoNLL-X format. Perhaps the most important is the notion of syntactic words vs. multi-word tokens. It makes the tokenization step in UD harder than the relatively simple procedure called tokenization in other areas of NLP. For instance, German zum is a contraction of the preposition zu “to” and the article dem “the”. In UD it is a multi-word token consisting of two syntactic words, zu and dem, and it is these syntactic words that are nodes in dependency relations. Learning this is harder than separating punctuation from words, because a contraction is not a pure concatenation of the participating words. The CoNLL-U format uses two different mechanisms here: punctuation that is conventionally written adjacent to a word is a separate single-“word” token, and an attribute in the last column indicates that there was no whitespace character between the punctuation symbol and the word. A contraction, on the other hand, is a multi-word token, which has a separate line starting with the range of the syntactic words that belong to it. Consider the German phrase zur Stadt, zum Haus “to the city, to the house”. The corresponding CoNLL-U section could look like this:

1-2	zur	_	_	_	_	_	_	_	_
1	zu	_	ADP	_	_	3	case	_	_
2	der	_	DET	_	_	3	det	_	_
3	Stadt	_	NOUN	_	_	0	root	_	SpaceAfter=No
4	,	_	PUNCT	_	_	3	punct	_	_
5-6	zum	_	_	_	_	_	_	_	_
5	zu	_	ADP	_	_	7	case	_	_
6	dem	_	DET	_	_	7	det	_	_
7	Haus	_	NOUN	_	_	3	conj	_	_

We will not evaluate whether the system correctly generated the SpaceAfter=No attribute. But the system-produced CoNLL-U file must be a tokenization of the original text; if the system decides to split zum into zu and dem, it must not forget to also generate the multi-word token line 5-6 zum.
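
One way to see why the 5-6 line is required is that the surface token sequence must remain recoverable from the file. A sketch, reusing the hypothetical read_sentence() output from above:

def surface_tokens(words, mwt_ranges):
    # Syntactic words covered by a multi-word token range are replaced by the
    # token's surface form (e.g. words 5-6 "zu dem" by "zum").
    covered = {}
    for start, end, form in mwt_ranges:
        for i in range(start, end + 1):
            covered[i] = (start, form)
    tokens, emitted = [], set()
    for w in words:
        if w.id in covered:
            start, form = covered[w.id]
            if start not in emitted:          # emit each contraction only once
                tokens.append(form)
                emitted.add(start)
        else:
            tokens.append(w.form)
    return tokens

Applied to the example above, this yields zur Stadt , zum Haus, i.e. the original tokens of the underlying text.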

We will have to align the nodes (syntactic words) output by the system to those in the gold standard data. Thus if the system fails to recognize zur as a contraction and outputs

1	zur
2	Stadt
3	,

we will treat any relations going to or from the node zur as incorrect. The same will happen with the node “Stadt,”, should the system fail to separate punctuation from the word Stadt.

If the system wrongly splits the word Haus and outputs

7-8	Haus
7	Hau
8	das

relations involving either Hau or das will be considered incorrect.

Even if the system recognizes zur as a contraction but outputs wrong syntactic word forms, the mismatching words will be considered incorrect:

1-2	zur
1	zur
2	der

Relations involving node 1 are incorrect, but relations involving node 2 may be correct. Exception: the matching is case-insensitive, so splitting Zum into zu dem, or zum into Zu dem, is acceptable.

Aligning system words with the gold standard

Easy part: suppose there are no multi-word tokens (contractions). Both token sequences (gold and system) share the same underlying text (minus whitespace), so tokens can be represented as character ranges. We can intersect the system character ranges with the gold character ranges and compute the alignment in a single pass.

Now let’s assume there are multi-word tokens. The syntactic words inside them may be arbitrary, without any similarity to the original text; however, the data still contain the original surface form, so we know which part of the underlying text each multi-word token corresponds to. We therefore only have to align the individual words between a gold and a system multi-word token, which we do with the LCS (longest common subsequence) algorithm.
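
The easy case without multi-word tokens can be sketched as follows (hypothetical helper names; the LCS step inside multi-word token spans is omitted). Only words whose character ranges coincide exactly are aligned here, which is what ultimately matters for scoring:

def char_ranges(forms):
    # Assign each word a (start, end) range over the underlying text with all
    # whitespace removed; the word forms are assumed to cover that text.
    ranges, offset = [], 0
    for form in forms:
        compact = "".join(form.split())
        ranges.append((offset, offset + len(compact)))
        offset += len(compact)
    return ranges

def align_by_ranges(sys_forms, gold_forms):
    # Map 1-based system word indices to gold word indices with identical ranges.
    gold_index = {r: i + 1 for i, r in enumerate(char_ranges(gold_forms))}
    return {i + 1: gold_index[r]
            for i, r in enumerate(char_ranges(sys_forms))
            if r in gold_index}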

Sentence boundaries will be ignored during token alignment, i.e. the entire test set will be aligned at once. The systems will have to perform sentence segmentation in order to produce valid CoNLL-U files, but the sentence boundaries will be evaluated only indirectly, through dependency relations. A dependency relation that crosses a gold sentence boundary is incorrect. If, on the other hand, the system generates a false sentence break, it will not be penalized directly, but there will necessarily be at least one gold relation that the system missed; not receiving points for such relations is an indirect penalty for wrong sentence segmentation.

Extrinsic Parser Evaluation

We seek to investigate how well different intrinsic evaluation measures correlate with end-to-end performance in downstream NLP applications that are believed to benefit from grammatical analysis: biological event extraction, fine-grained opinion analysis, and negation scope resolution.

We collaborate with the Extrinsic Parser Evaluation (EPE) initiative and will provide our participants with the opportunity to take part in this task with minimal extra effort. There will be an extra ‘test’ set to parse (English only, 1.1 million tokens, same input format as the UD data) available in TIRA. Once participants have successfully processed the standard UD test sets, they are encouraged to invoke another system run on TIRA, processing the EPE input data. So as not to interfere with the core parsing task, EPE system runs have a submission deadline one week later. The EPE organizers will then collect the parser outputs from TIRA and determine extrinsic, downstream results; this step is computationally intensive, as it requires retraining the three EPE applications on each set of parser outputs. Hence, EPE results will only be published shortly before the initial submission deadline for system descriptions.

Please see the EPE 2018 participant information for additional details.