CoNLL 2017 Shared Task

Data

The task will only utilize resources that are publicly available, royalty-free, and under a license that is free at least for non-commercial usage (e.g., CC BY-SA or CC BY-NC-SA).

Treebanks

Training/development data will be taken from the Universal Dependencies release 2.0. They will be available for many languages, but the exact set of languages will only be known at release time. Participants are not allowed to use any of the previous UD releases (not only is the UD 1.* annotation style incompatible, it is also not guaranteed that the training-development-test data split remains identical). The UD release 2.0 will not contain the test data, which will only be published at the beginning of the shared task test phase.

As is usual in UD, there may be more than one training/development treebank for certain languages. Typically, the additional treebanks come from a different source and their text is from a different domain. There will be separate test data for each such treebank, and treebanks will be evaluated separately as if they were different languages. Nevertheless, participants are free to use any or all training/development treebanks/languages when they train the parser for any target treebank.

In addition, there will be test sets for which no corresponding training/development data sets exist. These additional test sets will be of two types:

1. Parallel test sets for selected known languages. Texts that have not been previously released (not even as part of a UD 1.* treebank) will be manually annotated according to the UD v2 guidelines and used in evaluation. The participating systems will know the language code, so they will be able to pick the model trained on data from the same language; however, the domain will probably differ from that of their training data. On the other hand, the domain of all these additional test sets (in languages where they are provided) will be identical, as they are parallel texts translated from one source.

2. Surprise languages, which have not been previously released in UD. Names of surprise languages and a small sample of gold-standard data in these languages will be published shortly before the beginning of the evaluation phase. The point of having surprise languages is to encourage participants to pursue truly multilingual approaches to parsing. However, participants who do not want to focus on the surprise languages can run a simple delexicalized parser, as predicted POS tags will be provided.
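To illustrate what "delexicalized" means in practice, here is a minimal Python sketch that blanks out the word forms and lemmas of a treebank while keeping POS tags and dependency relations. It assumes the data is in the 10-column, tab-separated CoNLL-U format used by UD releases; the function name `delexicalize` and the sample sentence are hypothetical and are not part of the task materials.

```python
def delexicalize(conllu_text: str) -> str:
    """Replace FORM and LEMMA with "_" on every token line.

    Assumes CoNLL-U input: 10 tab-separated columns per token line,
    comment lines starting with "#", blank lines between sentences.
    """
    out_lines = []
    for line in conllu_text.splitlines():
        # Blank lines and comments pass through unchanged. Note that
        # "# text = ..." comments still contain the original words.
        if not line or line.startswith("#"):
            out_lines.append(line)
            continue
        cols = line.split("\t")
        if len(cols) == 10:
            cols[1] = "_"  # FORM
            cols[2] = "_"  # LEMMA
        out_lines.append("\t".join(cols))
    return "\n".join(out_lines)


# Hypothetical two-word sentence, for illustration only.
DELEX_SAMPLE = (
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
)

if __name__ == "__main__":
    print(delexicalize(DELEX_SAMPLE))
```

A parser trained on such output from one or more known languages relies only on tags and tree structure, which is why it can be applied to a language it has never seen, provided POS tags are available at test time.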

The test set for each treebank will contain at least 10,000 words. There is no upper limit on the test size (the largest test set is currently ~170K words). Gold-standard annotation of the test data will only be published after the evaluation of the shared task.

Participants will receive training and development data with gold-standard tokenization, sentence segmentation, POS tags and dependency relations; for some languages also lemmas and/or morphological features. The size of these data sets will vary according to availability. For some languages, they may be as small as the test set (or even smaller), for others they may be ten times larger than the test set, and for the surprise languages they will be close to zero. One subset of the data will be formally designated as the development set, but participants will be free to use it also for training their final system.
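As a rough sketch of how such annotated data can be read and its size checked, the Python snippet below splits a treebank into sentences and counts syntactic words. It assumes the CoNLL-U format used by UD releases (columns ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); the helper names `read_conllu` and `count_words` and the sample are hypothetical.

```python
from typing import Dict, Iterator, List

# Column names of the CoNLL-U format, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. "1-2") and empty nodes
        # (e.g. "1.1"); only syntactic words are kept.
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


def count_words(text: str) -> int:
    """Count syntactic words over all sentences."""
    return sum(len(sent) for sent in read_conllu(text))


# Hypothetical sample with one multiword token, for illustration only.
READ_SAMPLE = (
    "# text = del mar\n"
    "1-2\tdel\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tde\tde\tADP\t_\t_\t3\tcase\t_\t_\n"
    "2\tel\tel\tDET\t_\t_\t3\tdet\t_\t_\n"
    "3\tmar\tmar\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    print(count_words(READ_SAMPLE))
```

Counting this way makes it easy to compare the size of a training set against its test set before deciding how much to rely on data from other treebanks.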

Raw Data

We will provide additional raw data for the languages of the shared task, useful, for example, for producing word embeddings. These data sets will be taken from CommonCrawl and automatically sorted by a language recognizer. They may not be available for all languages (note that UD also contains some classical languages such as Ancient Greek). For convenience, we will provide a variant of this data pre-processed by UDPipe, and also pre-computed word embedding vectors for those participants who want to use them but do not want to tune their own settings of the word-to-vector software.
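For participants who plan to use pre-computed vectors, the following Python sketch shows one way to load them and compare two words. It assumes the vectors are stored in the common word2vec text layout (a header line with vocabulary size and dimension, then one word and its values per line); the actual file format of the provided vectors is not specified here, and the names `load_vectors` and `cosine` and the toy data are hypothetical.

```python
import math
from typing import Dict, Iterable, List


def load_vectors(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Read vectors from word2vec-style text lines (assumed layout)."""
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])  # header is "<vocabulary size> <dimension>"
    vectors: Dict[str, List[float]] = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy 2-dimensional vectors, for illustration only.
TOY_LINES = ["3 2", "dog 1.0 0.0", "cat 0.8 0.6", "car 0.0 1.0"]

if __name__ == "__main__":
    vecs = load_vectors(TOY_LINES)
    print(cosine(vecs["dog"], vecs["cat"]))
```

With a real file, the lines would come from an open file handle rather than a list, for example `load_vectors(open(path, encoding="utf-8"))`, where `path` is whatever location the vectors were saved to.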

Parallel Data

To support multilingual and cross-lingual approaches and model transfers, participants will be allowed to use data from the OPUS parallel corpus (http://opus.lingfil.uu.se/). We will not redistribute these data sets; participants are simply referred to the OPUS website.

Call for Additional Data

Instead of organizing a separate open track, we encourage participants to report additional data they want to use (see below for deadlines and time schedule). If the data sets are relevant to the task and meet the public availability condition, they will be added to the list of resources available to all participants.