Treebank Inclusion Rules
Test data must contain at least 10,000 syntactic words (possibly after re-splitting). Test data annotated in UD v2 will not be part of the next UD release, and it should not appear in the GitHub repositories until the shared task is over! (Not even in the dev branch.) Send the data to the shared task organizers by e-mail instead.
Development data should also contain 10,000 words or more, but this requirement is less strict. Contact the shared task organizers if you are not able to meet it. There is no size requirement for training data: if you have just 20,000 words, split them into 10K dev + 10K test and leave the training data empty.
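The size requirements above can be checked mechanically. What follows is a minimal sketch, not the organizers' tooling. It assumes the data is in CoNLL-U format and counts a syntactic word as any line whose ID is a plain integer, skipping multiword token ranges such as 1-2 and empty nodes such as 1.1. The function names, and the choice to fill the test set first from contiguous sentences, are illustrative assumptions; a real split would also respect document and genre boundaries.

```python
def count_syntactic_words(conllu_text):
    """Count lines whose ID column is a plain integer (syntactic words)."""
    count = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank line between sentences, or a comment line
        token_id = line.split("\t", 1)[0]
        # "1-2" (multiword token range) and "1.1" (empty node) are not counted.
        if token_id.isascii() and token_id.isdigit():
            count += 1
    return count


def plan_split(sentence_sizes, min_test=10_000, min_dev=10_000):
    """Assign whole sentences to test, then dev, then train (assumed order).

    sentence_sizes holds the number of syntactic words in each sentence.
    Returns the sentence indices per part and the word totals per part.
    """
    split = {"test": [], "dev": [], "train": []}
    totals = {"test": 0, "dev": 0, "train": 0}
    for index, size in enumerate(sentence_sizes):
        if totals["test"] < min_test:
            part = "test"
        elif totals["dev"] < min_dev:
            part = "dev"
        else:
            part = "train"
        split[part].append(index)
        totals[part] += size
    return split, totals


# The 20,000-word case from the text, modelled as four 5,000-word chunks.
split, totals = plan_split([5_000] * 4)
print(totals)  # {'test': 10000, 'dev': 10000, 'train': 0}
```

Filling the test set before the development set mirrors the priorities stated above: the test minimum is strict, the development minimum is negotiable, and training data may be empty.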
The UD validation script will be updated to check conformance with v2 guidelines. A treebank must pass the validation to be included in the shared task. (Test data will be validated only offline.) The validation will probably include a subset of what is now known as “content validation tests” (e.g. check that certain types of relations are left-headed). Lemmas and morphological features are still optional, although treebank owners are strongly encouraged to include them.
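To illustrate what a left-headedness test might look like, here is a hedged sketch. It is not the UD validation script, and the set of relations it checks is an assumption chosen for illustration; the text above does not say which relations the official validator will cover. The sketch reads the ten tab-separated CoNLL-U columns of one sentence and reports every word whose relation is in the assumed set but whose head comes after it.

```python
# Assumed, illustrative set of relations expected to be left-headed.
LEFT_HEADED = frozenset({"conj", "fixed", "flat", "goeswith"})


def find_right_headed(sentence_lines, relations=LEFT_HEADED):
    """Return (word_id, head_id, deprel) for words that violate left-headedness.

    sentence_lines are the CoNLL-U lines of a single sentence.
    """
    violations = []
    for line in sentence_lines:
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        # Only plain syntactic words with a numeric HEAD are checked.
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        word_id, head_id, deprel = int(cols[0]), int(cols[6]), cols[7]
        universal = deprel.split(":", 1)[0]  # "flat:name" -> "flat"
        if universal in relations and head_id > word_id:
            violations.append((word_id, head_id, deprel))
    return violations
```

A coordination such as "apples and pears" passes when "pears" is attached to "apples" as conj, and is reported when the attachment runs the other way.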
The data must be ready and valid by February 15 (see also the timeline). In exceptional cases we may allow a deadline extension for the test data. For example, a treebank is small and annotation is still running, but only 15,000 words are available by the release deadline. The annotation team is confident that they can exceed 20,000 words soon; they will thus ask us to release 10,000 words as development data and wait for the remaining 5,000 words of test data. Obviously we do not want this to become common practice, because we want to announce the set of shared task languages at the beginning of March and we do not want to withdraw a language later should the annotators fail to supply the remaining data.
To give us an idea of how many languages to expect, we ask the teams maintaining individual treebanks to let us know by mid-January that they aim to meet the above conditions and to have their data included in the shared task.
As mentioned in the shared task specification, there will be a parallel test set of about 1,000 sentences in selected shared task languages. DFKI and Google are generously providing translations and annotations in selected languages (English, Spanish, Portuguese, French, Italian, Russian, Japanese, Hindi, Arabic, Indonesian, Chinese, Turkish, German). If you maintain a UD treebank in one of these languages, please let us know whether you can check the annotation quality of the parallel data for us. If your language is not listed above but you are willing to translate these sentences from English into your language and annotate them in UD v2 style, get in touch too (this is already the case for Swedish, Czech, Finnish and Norwegian).