Data
The task uses only resources that are publicly available, royalty-free, and licensed at least for non-commercial use (e.g., CC BY-SA or CC BY-NC-SA). The right to use the data must not be limited to the shared task; the data must remain available for follow-up research too.
Treebanks
The main data sets have been taken from the UD treebanks. Unlike in 2017, there are no “surprise” languages, i.e., languages that participants learn about only shortly before the evaluation. However, as in 2017, there is a category of languages for which little or no training data exist, and participants are encouraged to employ cross-lingual techniques and to use data from other languages.
The task comprises 82 test sets from 57 languages (for comparison, the 2017 evaluation was done on 81 test sets from 49 languages). The test sets contain at least 10,000 words each; the largest test set has about 170,000 words. 61 of the 82 treebanks are large enough to provide training and development data sets of at least 10,000 and 5,000 words, respectively. The other 21 treebanks lack development data, and some of them also lack training data. In the small treebank category, 7 treebanks have training data of a still reasonable size; 5 are extra test sets in languages for which another large treebank exists; and 9 are low-resource languages for which either no training data are available (Breton, Faroese, Naija, Thai) or the file labeled “train” is just a tiny sample (Armenian, Buryat, Kazakh, Kurmanji, Upper Sorbian).
It is not allowed to use any previous release of Universal Dependencies, nor to take data directly from the UD repositories on GitHub. (It cannot be guaranteed that parts of former training data have not ended up in the current test data, or the other way around.) If you use trainable publicly available tools such as UDPipe, make sure you do not use them with models pre-trained on previous versions of Universal Dependencies! For your convenience, we have released baseline UDPipe models that are trained on the approved training data and can be used in the task. The package also contains the training and development data with morphology predicted by UDPipe.
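For illustration, the following is a minimal sketch of applying such a baseline model through the ufal.udpipe Python bindings; the model file name is a placeholder, and the exact baseline settings may differ from what the released models use.

```python
# A minimal sketch, assuming the ufal.udpipe bindings (pip install ufal.udpipe).
# The model file name below is a placeholder; substitute the released baseline
# model for your target treebank.
from ufal.udpipe import Model, Pipeline, ProcessingError

model = Model.load("english-ewt-ud-2.2.udpipe")  # hypothetical file name
if model is None:
    raise RuntimeError("cannot load UDPipe model")

# Tokenize, tag, and parse raw text; emit the result as CoNLL-U.
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
error = ProcessingError()
conllu = pipeline.process("The quick brown fox jumps over the lazy dog.", error)
if error.occurred():
    raise RuntimeError(error.message)
print(conllu)
```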
As is usual in UD, there may be more than one treebank for certain languages. Typically, the additional treebanks come from a different source and their text is from a different domain. There will be separate test data for each such treebank, and treebanks will be evaluated separately as if they were different languages. Nevertheless, participants are free to use any or all training treebanks/languages when they train the parser for any target treebank.
The train/dev distinction is important because of the following rule:
Where there are dev data, we ask you not to use them for training proper. It is OK to use the dev data for testing, development, tuning hyperparameters, doing error analysis, etc. In other words, only the training set should be used for training the final submission; the dev set can be used for choosing one of the models trained on the training set. For small treebanks (where there is no development set), use cross-validation, and feel free to train on everything once you have figured out your hyperparameters via cross-validation.
By this rule we hope to increase comparability of results, both within the shared task and after it. We are aware that the line between training a model and tuning hyperparameters may not be sharp for all systems. Use your good judgement and document what you did.
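As a concrete illustration of the cross-validation suggested above for small treebanks, one could split the training sentences of a CoNLL-U file into k folds and tune hyperparameters on the held-out fold; the file name and fold count below are illustrative, not prescribed by the task.

```python
# A simple sketch of k-fold cross-validation for a small treebank with no dev
# set: split the training sentences into k folds, train on k-1 folds, and
# validate on the held-out one while tuning hyperparameters.
def read_conllu_sentences(path):
    """Return the file's sentences as lists of lines (comments included)."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                current.append(line)
            elif current:
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    return sentences

def folds(sentences, k=5):
    """Yield (train, heldout) sentence lists for each of the k folds."""
    for i in range(k):
        heldout = sentences[i::k]  # every k-th sentence, offset i
        train = [s for j, s in enumerate(sentences) if j % k != i]
        yield train, heldout

sentences = read_conllu_sentences("hsb_ufal-ud-train.conllu")  # hypothetical path
for fold_no, (train, heldout) in enumerate(folds(sentences)):
    print(f"fold {fold_no}: {len(train)} train / {len(heldout)} heldout sentences")
    # ... write the two splits to temporary files and train/evaluate here ...
```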
Note: since the test phase of the shared task is now over, the full UD 2.2 package has been released. It contains the full data (including test sets) of the shared task treebanks exactly as they were used in the task (that is, if the GitHub repository changed between April 15 and July 1, those changes will have to wait for UD 2.3 in November). In addition, non-shared-task treebanks are released in the full UD 2.2 release as they were available on June 15. Furthermore, the test data from both shared tasks (2017 and 2018) are available for download in the form in which they were presented to the participating systems, that is, as plain text files and as CoNLL-U files preprocessed by UDPipe.
Raw Data
We provide additional raw data for some of the languages of the shared task, which are useful, for example, for producing word embeddings. These data sets were taken from CommonCrawl and Wikipedia and automatically sorted by a language recognizer. For convenience, we provide pre-computed word embedding vectors for those participants who want to use them but do not want to tune the word-embedding software themselves. The raw data package was prepared for the 2017 shared task, and it only covers languages that were included in the 2017 evaluation.
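For participants who do want to train their own vectors, the following is a hedged sketch of one way to do it with gensim's word2vec implementation; the corpus path and hyperparameters are illustrative and are not the settings behind the pre-computed vectors.

```python
# A sketch of training word vectors on the raw data with gensim (>= 4.0,
# pip install gensim). Corpus path and hyperparameters are illustrative.
from gensim.models import Word2Vec

class Corpus:
    """Stream a whitespace-tokenized corpus, one sentence per line."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):  # restartable: Word2Vec iterates the corpus repeatedly
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

model = Word2Vec(
    sentences=Corpus("cs.txt"),  # hypothetical raw-text file
    vector_size=100,             # dimensionality of the vectors
    window=5,
    min_count=5,                 # ignore rare words
    workers=4,
)
model.wv.save_word2vec_format("cs.vectors.txt")  # plain-text word2vec format
```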
Parallel Data
To support multilingual and cross-lingual approaches and model transfer, participants are allowed to use data from the OPUS parallel corpora (http://opus.lingfil.uu.se/). We will not redistribute these data sets; participants are simply referred to the OPUS website.
Additional Resources
Instead of organizing a separate open track, we encouraged the participants to report (by February 15) any additional data they wanted to use. The approved freely available resources can be used by all participants. All the data from last year’s list are automatically approved, with the exception of the French TreeBank (nobody used it in 2017, and it would violate the rule above that treebank data are taken only from the officially released training package).
(Please report which resources your final submission uses so that we can summarize them in the overview paper.)
- World Atlas of Language Structures (WALS, http://wals.info/)
- Wikipedia dumps (https://dumps.wikimedia.org/backup-index-bydb.html)
- Word vectors for 90 languages trained on Wikipedia have been released by Facebook (fastText; a minimal loader sketch follows this list)
- Multilingual vectors based on the same data are available from Babylonpartners’ GitHub
- WMT 2016 parallel and monolingual data (http://www.statmt.org/wmt16/translation-task.html)
- Morphological transducers in Apertium and in Giellatekno
- Unimorph (https://unimorph.github.io/)
- CoNLL-UL Universal Morphological Lattices (https://conllul.github.io/)
- The following morphological dictionaries, available from the LINDAT repository:
  - English
  - Italian function words and content words
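As referenced in the Facebook word vectors item above, the released .vec files are plain text: a header line giving the vocabulary size and vector dimensionality, followed by one word and its vector per line. A minimal loader sketch (the file name is a placeholder):

```python
# A sketch of loading fastText-style .vec files into a dict of numpy arrays.
import numpy as np

def load_vec(path, limit=None):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        count, dim = map(int, f.readline().split())  # header: vocab size, dims
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if len(values) == dim:  # skip malformed rows
                vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

embeddings = load_vec("wiki.en.vec", limit=100000)  # placeholder file name
print(len(embeddings), "vectors loaded")
```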