As described on the Evaluation page, we provide two sets of scores for each submission:
- Coarse F1 scores: each score pertains to the combined test set of the language, without distinguishing individual treebanks or the enhancement types they annotate. The last line shows the macro-average over languages.
- Qualitative F1 scores: the system output for each language is split into parts corresponding to the individual source treebanks. The scores then ignore errors in enhancement types for which the treebank lacks gold-standard annotation. The column LAvgELAS shows the qualitative ELAS averaged over treebanks of the same language; the final average, shown in bold, is taken over languages rather than over treebanks. For comparison, the last column shows the coarse ELAS F1 for the given language.
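To make the two levels of averaging concrete, here is a minimal sketch. The language and treebank names and all score values are hypothetical placeholders, not actual results, and the function names are illustrative only. It first averages the treebank scores within each language (as in the LAvgELAS column) and then averages those per-language values.

```python
from collections import defaultdict


def language_averages(treebank_scores):
    """Average per-treebank scores within each language.

    treebank_scores maps (language, treebank) -> ELAS F1.
    """
    by_language = defaultdict(list)
    for (language, _treebank), score in treebank_scores.items():
        by_language[language].append(score)
    return {lang: sum(vals) / len(vals) for lang, vals in by_language.items()}


def macro_average(per_language):
    """Average over languages, each language counting once."""
    return sum(per_language.values()) / len(per_language)


# Hypothetical scores, for illustration only.
scores = {
    ("lang_a", "treebank_1"): 80.0,
    ("lang_a", "treebank_2"): 90.0,
    ("lang_a", "treebank_3"): 70.0,
    ("lang_b", "treebank_1"): 60.0,
}

lavg = language_averages(scores)                    # per-language averages
overall = macro_average(lavg)                       # averaged over languages
treebank_avg = sum(scores.values()) / len(scores)   # averaged over treebanks
```

With these placeholder numbers the average over languages is 70.0, while a plain average over the four treebanks would be 75.0. Averaging over languages keeps a language with many treebanks from dominating the final figure.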
Full results for the submitted systems are available in the list below (ordered by coarse ELAS F1 score; see the overview tables for details).
For comparison, here are some results generated by the organizers:
- baseline1: gold-standard basic trees copied as enhanced graphs. This gives some idea of how much of the enhanced representation is already contained in the basic representation. Because gold-standard trees are used, it is in a sense an upper bound for this copying strategy.
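
Copying a basic tree as an enhanced graph can be sketched as follows. This is an illustration only, not the organizers' script; it assumes CoNLL-U input, where HEAD and DEPREL are columns 7 and 8 and the enhanced graph is stored in column 9 (DEPS) as `head:deprel`. The function name `copy_basic_to_enhanced` is hypothetical.

```python
def copy_basic_to_enhanced(conllu_text):
    """Fill the DEPS column of every regular token with its basic HEAD:DEPREL."""
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        # Rewrite only regular token lines: 10 columns and a plain integer ID.
        # Comments, blank lines, multiword-token ranges (e.g. 1-2) and
        # empty nodes (e.g. 1.1) are passed through unchanged.
        if len(cols) == 10 and cols[0].isdigit():
            cols[8] = f"{cols[6]}:{cols[7]}"
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)


# A tiny made-up sentence in CoNLL-U format.
sample = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
)

result = copy_basic_to_enhanced(sample)
deps = [l.split("\t")[8] for l in result.split("\n") if l and not l.startswith("#")]
# deps is ["2:nsubj", "0:root"]
```

The resulting graph contains exactly the basic edges, so any enhanced edge that differs from the basic tree counts as missing when the output is scored.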