CoNLL 2017 Shared Task

Data

The task will only utilize resources that are publicly available, royalty-free, and under a license that is free at least for non-commercial usage (e.g., CC BY-SA or CC BY-NC-SA).

Treebanks

Training/development data will be taken from the Universal Dependencies release 2.0.

Note: There was a bug in the script that was responsible for extracting raw text from CoNLL-U files. A few .txt files in the original UD release contain errors. The main files in the CoNLL-U format are not affected by the bug. The above link already leads to the new, fixed release. If you downloaded the data from the original URL (http://hdl.handle.net/11234/1-1976), please download the update.

Training data sets are available for 45 languages. Participants are not allowed to use any of the previous UD releases (not only is the UD 1.* annotation style incompatible, it is also not guaranteed that the training-development-test data split remains identical). The UD release 2.0 does not contain the test data, which will only be published at the beginning of the shared task test phase.

If you use trainable, publicly available tools such as UDPipe or SyntaxNet, make sure you do not use them with models pre-trained on previous versions of Universal Dependencies! For your convenience, we have released baseline UDPipe models that can be used because they were trained on UD 2.0. The package also contains UD 2.0 training and development data with morphology predicted by UDPipe. For baseline models for SyntaxNet, see this post at Google Research Blog.

As is usual in UD, there may be more than one training/development treebank for certain languages. Typically, the additional treebanks come from a different source and their text is from a different domain. There will be separate test data for each such treebank, and treebanks will be evaluated separately as if they were different languages. Nevertheless, participants are free to use any or all training treebanks/languages when they train the parser for any target treebank.

In addition, there will be test sets for which no corresponding training/development data sets exist. These additional test sets will be of two types:

1. Parallel test sets for selected known languages. Texts that have not been previously released (not even as part of a UD 1.* treebank) will be manually annotated according to the UD v2 guidelines and used in evaluation. The participating systems will know the language code, so they will be able to pick the model trained on data from the same language; however, the domain will probably be different from their training data. On the other hand, the domain of all these additional test sets (in languages where they are provided) will be identical, as they are parallel texts translated from one source.

2. Surprise languages, which have not been previously released in UD. Names of surprise languages and a small sample of gold-standard data in these languages will be published shortly before the beginning of the evaluation phase. The point of having surprise languages is to encourage participants to pursue truly multilingual approaches to parsing. However, participants who do not want to focus on the surprise languages can run a simple delexicalized parser, as predicted POS tags will be provided.
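The delexicalized fallback mentioned for the surprise languages can be illustrated with a minimal sketch. It assumes the standard 10-column, tab-separated CoNLL-U layout and simply blanks the FORM and LEMMA columns so that a parser sees only POS tags and morphology. The function name `delexicalize` is hypothetical and not part of any shared task tooling.

```python
def delexicalize(conllu_text: str) -> str:
    """Replace FORM and LEMMA with '_' on every token line of a CoNLL-U string."""
    out = []
    for line in conllu_text.splitlines():
        # Comment lines and blank sentence separators pass through unchanged.
        if not line or line.startswith("#"):
            out.append(line)
            continue
        cols = line.split("\t")
        if len(cols) == 10:
            cols[1] = "_"  # FORM
            cols[2] = "_"  # LEMMA
        out.append("\t".join(cols))
    return "\n".join(out) + "\n"
```

Note that comment lines such as `# text = ...` still carry the original words in this sketch; a stricter variant might drop them as well.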

The test set for each treebank will contain at least 10,000 words. There is no upper limit on the test size (the largest test set is currently ~170K words). Gold-standard annotation of the test data will only be published after the evaluation of the shared task.

Participants will receive training+development data with gold-standard tokenization, sentence segmentation, POS tags and dependency relations and, for some languages, also lemmas and/or morphological features. The size of these data sets will vary according to availability. For some languages, the data may be as small as the test set (or even smaller); for others, it may be ten times larger than the test set; and for the surprise languages, it will be close to zero.
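As a rough illustration of how these annotations are laid out, the sketch below reads CoNLL-U text into sentences of token dictionaries. It assumes the standard ten tab-separated columns and keeps only syntactic words, skipping multiword token ranges (IDs like `1-2`) and empty nodes (IDs like `1.1`). The names `FIELDS` and `read_conllu` are illustrative only.

```python
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one list of token dicts per sentence."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Keep only plain integer IDs, i.e. syntactic words.
        if cols[0].isdigit():
            sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence
```

Summing `len(sentence)` over the sentences of a file gives its size in syntactic words, which is one way to compare a training set against the 10,000-word minimum quoted for test sets.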

The UD 2.0 release contains two data packages: “ud-treebanks-conll2017.tgz” and “ud-treebanks-v2.0.tgz”. As the name suggests, ud-treebanks-conll2017.tgz contains the training and development data for the shared task, while ud-treebanks-v2.0.tgz is the full UD release. What’s the difference? The full UD release includes six additional treebanks that are not in the shared task. Also, eight treebanks that are in the shared task have only training data in the shared task package, but no development data.

The train/dev distinction is important because of the following rule (note that this is a change of the rules announced in the original call for participation):

Where there are dev data, we ask you not to use it for training proper. It is OK to use the dev data for testing, development, tuning hyperparameters, doing error analysis etc. In other words, only the training set should be used for training the final submission; the dev set can be used for choosing one of the models trained on the training set. For small treebanks (where there is no development set) use cross-validation, and feel free to train on everything once you have figured out your hyperparameters via cross-validation.
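For treebanks without a development set, the cross-validation mentioned in the rule could look like the following minimal sketch. It assumes sentences are already loaded into a list and assigns them to folds round-robin; the function name `k_fold_splits` and the default `k=10` are illustrative choices, not something the rule prescribes.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")

def k_fold_splits(sentences: Sequence[T],
                  k: int = 10) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, held_out) pairs, one per fold."""
    if not 2 <= k <= len(sentences):
        raise ValueError("k must be between 2 and the number of sentences")
    for fold in range(k):
        # Sentence i goes to the held-out part of fold i % k.
        held_out = [s for i, s in enumerate(sentences) if i % k == fold]
        train = [s for i, s in enumerate(sentences) if i % k != fold]
        yield train, held_out
```

Round-robin assignment spreads neighbouring sentences across folds. If a treebank consists of contiguous documents, splitting by document instead may give a more realistic estimate.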

By this rule we hope to increase comparability of results, both within the shared task and after it. We are aware that the line between training a model and tuning hyperparameters may not be sharp for all systems. Use your good judgement and document what you did.

You may use the extra treebanks from the full UD 2.0 release for cross-treebank learning as long as their training/development portions are treated in conformance with the above rule and their test portions are not used at all. (Five of these treebanks, namely Belarusian, Coptic, Lithuanian, Sanskrit and Tamil, are very small but complete. The sixth treebank, Arabic-NYUAD, is large but does not contain words and lemmas because of licensing issues. In accordance with the shared task’s focus on freely available resources, you may use only the incomplete data from the UD 2.0 release; you must not merge it with the Penn Arabic Treebank, even if you have access to it.)

Raw Data

We provide additional raw data for the languages of the shared task, useful, for example, for producing word embeddings. These data sets were taken from CommonCrawl and Wikipedia and automatically sorted by a language recognizer. For convenience, we provide a variant of the data pre-processed by UDPipe, and also pre-computed word embedding vectors for those participants who want to use them but do not want to tweak their own settings of the word-to-vector software.
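For participants who pick up pre-computed vectors, a loader might look like the sketch below. The file format is an assumption here: it presumes the common word2vec text layout (a header line with vocabulary size and dimension, then one word per line followed by its space-separated values), which this page does not specify. The function name `load_text_vectors` is hypothetical.

```python
from typing import Dict, Iterable, List

def load_text_vectors(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse word vectors from lines in the (assumed) word2vec text format."""
    it = iter(lines)
    # Header: "<vocab_size> <dimension>"
    _, dim = (int(x) for x in next(it).split())
    vectors: Dict[str, List[float]] = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines rather than fail
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors
```

Passing an open file handle works, since it iterates line by line; for large vocabularies a NumPy array would be more memory-efficient than Python lists.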

Parallel Data

To support multi-lingual and cross-lingual approaches and model transfers, participants will be allowed to use data from the OPUS parallel corpus (http://opus.lingfil.uu.se/). We will not redistribute these data sets; participants are simply referred to the OPUS website.

Additional Resources

Participants who registered early were asked to propose additional data they would like to use. The following proposals have been approved by the organizing committee, meaning that these resources can be used by all participants. (Please report what resources your final submission uses so that we can summarize it in the overview paper.)