EVALUATION METRICS

As described on the Evaluation page, we provide two sets of scores for each submission.

Coarse F1 Scores, where each score pertains to the combined test set of a language, without distinguishing individual treebanks or the enhancement types they annotate. The last line shows the macro-average over languages.

Qualitative F1 Scores, where the system output for each language is split into parts corresponding to the individual source treebanks. The scores then ignore errors in enhancement types for which a treebank lacks gold-standard annotation. The column LAvgELAS shows the qualitative ELAS averaged over the treebanks of the same language; the final average in bold is taken over languages rather than treebanks. For comparison, the last column shows the coarse ELAS F1 for the given language.
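The ELAS-style scores compare, for each word, the set of predicted enhanced dependencies (head, relation) against the gold set and report the F1 of the resulting precision and recall; coarse scores are then macro-averaged over languages. The sketch below is only an illustration of this computation, not the official evaluation script, and the function names and data layout are assumptions made for the example.

```python
# Minimal sketch of an F1 score over enhanced dependencies and of the
# macro-average over languages. Illustrative only; the official evaluation
# script additionally handles empty nodes, treebank splitting, etc.
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, str]          # (head id, dependency relation)
Sentence = List[Set[Edge]]      # one set of enhanced deps per word


def enhanced_f1(gold: List[Sentence], system: List[Sentence]) -> float:
    """F1 over (head, relation) edges, aggregated over all words."""
    correct = gold_total = sys_total = 0
    for gold_sent, sys_sent in zip(gold, system):
        for gold_edges, sys_edges in zip(gold_sent, sys_sent):
            correct += len(gold_edges & sys_edges)
            gold_total += len(gold_edges)
            sys_total += len(sys_edges)
    precision = correct / sys_total if sys_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def macro_average(per_language_scores: Dict[str, float]) -> float:
    """Macro-average over languages: each language is weighted equally,
    as in the final line of the coarse score table."""
    return sum(per_language_scores.values()) / len(per_language_scores)
```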

OFFICIAL RESULTS

Full results of the submitted systems are available in the list below (ordered by Coarse ELAS F1 scores; see the overview tables for details).

1. TURKUNLP

2. ORANGE DESKIN

3. EMORYNLP

4. FASTPARSE

5. UNIPI

6. SHANGHAITECH ALIBABA

7. CLASP

8. ADAPT

9. KØPSALA

10. ROBERTNLP

OVERVIEW TABLES

In addition to the official shared task scores, we also provide results submitted after the shared task deadline, including some additional, bug-fixed submissions from the participants. If you want your post-shared-task results to be included, submit your system output and contact us.

BASELINES

For comparison, here are some results generated by the organizers:

  • baseline1: gold-standard basic trees copied as enhanced graphs (a minimal sketch of this copy step is given after this list). This should give some idea of how much of the enhanced representation is actually contained in the basic representation. But since the gold standard is involved, it is in a sense an upper bound.

  • baseline2: the same strategy (basic tree copied to enhanced graph) but in a realistic scenario: instead of gold-standard annotation, the basic tree is produced by UDPipe 1.2, using the biggest model for the language trained on UD 2.5 (the officially provided models), with no pretrained embeddings. This is obviously not the state of the art in UD parsing, so one could obtain a better baseline simply by using a better basic parser, still without doing anything EUD-specific.

  • baseline3: again UDPipe, but this time followed by the Stanford Enhancer (which uses language-specific lists of relative pronouns and pretrained word embeddings; we used the embeddings released for the CoNLL 2017 shared task, capped at the 2M most frequent words, except for Tamil, where these embeddings are not available). Our version of the enhancer crashes on some datasets, hence two languages (Finnish and Latvian) are missing from the evaluation.
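For baselines 1 and 2, the enhanced graph is obtained simply by copying each word's basic HEAD and DEPREL into the DEPS column of the CoNLL-U file. The sketch below illustrates that copy step under the assumption of plain CoNLL-U input without empty nodes; the file names and function name are placeholders, not the organizers' actual script.

```python
# Minimal sketch of the "copy basic tree to enhanced graph" baseline:
# for every word, DEPS (column 9) is filled with the single HEAD:DEPREL
# edge from the basic tree. Illustrative only.

def copy_basic_to_enhanced(in_path: str, out_path: str) -> None:
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            stripped = line.rstrip("\n")
            if not stripped or stripped.startswith("#"):
                fout.write(line)              # keep comments and sentence breaks
                continue
            cols = stripped.split("\t")
            if "-" in cols[0]:                # multiword-token range, has no HEAD
                fout.write(line)
                continue
            head, deprel = cols[6], cols[7]   # basic tree columns
            cols[8] = f"{head}:{deprel}"      # DEPS := basic HEAD:DEPREL
            fout.write("\t".join(cols) + "\n")


if __name__ == "__main__":
    copy_basic_to_enhanced("input.conllu", "baseline1.conllu")
```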