Training data released (sent 2017-03-01 22:51 CET)

Dear participants of the CoNLL 2017 shared task,

thank you for registering! We are very happy to announce that the training and development data is now available as part of Universal Dependencies release 2.0, officially published about an hour ago and available for download at http://hdl.handle.net/11234/1-1976.

The release contains two data packages: “ud-treebanks-conll2017.tgz” and “ud-treebanks-v2.0.tgz”. As the name suggests, ud-treebanks-conll2017.tgz contains the training and development data for the shared task, while ud-treebanks-v2.0.tgz is the full UD release. What’s the difference? The full UD release includes six additional treebanks that are not in the shared task. Also, eight treebanks that are in the shared task have only training data in the shared task package, but no dev data. This is now important because the organizing committee decided to modify the rules that were originally announced:
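If you want to check which treebanks in the unpacked shared task package have no dev data, the following is a minimal sketch. It assumes that each treebank sits in its own subdirectory of the unpacked archive and that dev files match the pattern `*-ud-dev.conllu`; both the layout and the function name `treebanks_without_dev` are assumptions for illustration, so adjust the pattern to what you actually find after unpacking.

```python
from pathlib import Path
from typing import List


def treebanks_without_dev(root: str, dev_pattern: str = "*-ud-dev.conllu") -> List[str]:
    """Return names of treebank directories under `root` that contain no dev file.

    Assumption: one subdirectory per treebank, and dev files match `dev_pattern`.
    """
    missing = []
    for treebank_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        # any() over the glob is True as soon as one matching dev file is found.
        if not any(treebank_dir.glob(dev_pattern)):
            missing.append(treebank_dir.name)
    return missing
```

Calling `treebanks_without_dev("ud-treebanks-conll2017")` on the unpacked directory would then list the treebanks for which you need cross-validation instead of a dev set.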

Where dev data is available, we ask you not to use it for training proper. It is OK to use the dev data for testing, development, tuning hyperparameters, doing error analysis etc. By this rule we hope to increase comparability of results, both within the shared task and after it. We are aware that the line between training a model and tuning hyperparameters may not be sharp for all systems. Use your good judgement and document what you did. For small treebanks that have no development set, use cross-validation; feel free to train on everything once you have figured out your hyperparameters that way.
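For the small treebanks, cross-validation could be set up roughly as follows. This is only a sketch under stated assumptions: sentences in a CoNLL-U file are separated by blank lines, and the helper names `read_sentences` and `k_fold` as well as the default of 10 folds are illustrative choices, not part of the shared task rules.

```python
from typing import Iterator, List, Tuple


def read_sentences(conllu_text: str) -> List[str]:
    """Split CoNLL-U text into sentence blocks (blocks are separated by blank lines)."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            sentences.append("\n".join(current))
            current = []
    if current:  # last sentence without a trailing blank line
        sentences.append("\n".join(current))
    return sentences


def k_fold(sentences: List[str], k: int = 10) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train, held_out) pairs; every sentence is held out exactly once."""
    if k < 2 or k > len(sentences):
        raise ValueError("k must be between 2 and the number of sentences")
    for i in range(k):
        # Sentence j goes to fold j % k, so fold sizes differ by at most one.
        held_out = [s for j, s in enumerate(sentences) if j % k == i]
        train = [s for j, s in enumerate(sentences) if j % k != i]
        yield train, held_out
```

Tune hyperparameters on the held-out parts, then train the final model on all sentences. Note that this interleaved split places neighbouring sentences in different folds; if that is a concern for your system, split into contiguous chunks instead.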

The package that has been released today contains only gold-standard annotation and reconstructed plain text. The plain text is in the form you can expect when your system is evaluated, provided you choose to do all levels of processing yourself. We are now working on a variant of the data in which the morphological annotation will be predicted automatically by a baseline system; this should be available very soon.

Stay tuned, more data is coming! On March 15 we plan to release raw data from Common Crawl, and baseline parsing models. We will also finalize the list of additional allowed resources, suggested by some of you (currently under discussion among the organizers; sorry for the delay).

Finally, you are probably wondering how the evaluation in the TIRA platform will work. We will now give the list of registered participants to the TIRA administrators, who will set up a virtual machine for each team, and they will send instructions to your e-mail address. If you want to learn more details in advance, check the shared task website and the links to TIRA materials in the NEWS section.

Best regards and good luck

Dan Zeman

on behalf of the costocom :) (CoNLL shared task organizing committee) http://universaldependencies.org/conll17/