There will be four test sets in languages for which no training or development data were provided. No UD treebank for these languages was released prior to the shared task. Even the names of the languages were announced only one week before the test phase.
For each of the languages we are now providing:
- Its name and ISO 639 code
- A sample CoNLL-U file with gold-standard annotation of a few sentences that will not be included in the test data
- Partial statistics (tags, features, most frequent words and multi-word tokens) in the format known from the stats.xml files of the UD releases
- Language-specific files for validate.py, the UD format validation tool (they include info on permitted “words with spaces”, if applicable)
- Crawled raw data from Wikipedia. Unlike the raw data for the normal languages, these are much smaller and come without any preprocessing except paragraph boundaries
All of the above for all four languages is temporarily available for download here (zipped archive, 11 MB).
Feel free to use all of the above in your model (that is, you can even “train” a parser on the sample sentences if it helps).
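To make use of the sample sentences, the CoNLL-U files first have to be read into some in-memory structure. The following is a minimal sketch of such a reader, not part of the shared task tooling. The names `FIELDS`, `read_conllu`, `syntactic_words` and `SAMPLE` are hypothetical, and the sample sentence is an invented English example, not taken from the surprise language data.

```python
from typing import Dict, List

# The ten tab-separated columns of a CoNLL-U token line.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

Token = Dict[str, str]


def read_conllu(text: str) -> List[List[Token]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comment (e.g. "# text = ..."); skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


def syntactic_words(sentence: List[Token]) -> List[Token]:
    """Keep plain integer IDs; drop multi-word token ranges such as "1-2"
    and empty nodes such as "1.1"."""
    return [tok for tok in sentence if tok["id"].isdigit()]


# Invented example sentence (assumption, for illustration only).
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE):
        print([(t["form"], t["upos"], t["head"]) for t in syntactic_words(sent)])
```

With only a few sentences per language, such a reader is mostly useful for sanity checks or for collecting tag and feature inventories alongside the provided statistics.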
The test data will be available in two input formats, much like for the other languages: plain text, and segmentation and morphology predicted by UDPipe. The main difference is that for the surprise languages, the UDPipe models for segmentation and morphology will be trained on the test data itself, using 10-fold cross-validation, so that each part of the test set is annotated by a model that has not seen it.
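The 10-fold cross-validation setup can be illustrated with a short sketch. It only shows how the data are partitioned so that every sentence is held out exactly once. The function name `k_fold_splits` is hypothetical, the round-robin assignment of sentences to folds is an assumption (the actual split used by the organizers is not specified here), and training and running UDPipe itself is left out.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(items: Sequence[T], k: int = 10) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, held_out) pairs; each item is held out in exactly one fold.

    Items are assigned to folds round-robin by index (an assumption made
    for this sketch).
    """
    if k < 2 or k > len(items):
        raise ValueError("k must be between 2 and the number of items")
    for fold in range(k):
        held_out = [x for i, x in enumerate(items) if i % k == fold]
        train = [x for i, x in enumerate(items) if i % k != fold]
        yield train, held_out


if __name__ == "__main__":
    sentences = [f"sentence-{i}" for i in range(25)]
    for n, (train, held_out) in enumerate(k_fold_splits(sentences, k=10)):
        # In the real setup a model would be trained on `train` and then
        # used to predict segmentation and morphology for `held_out`.
        print(n, len(train), len(held_out))
```

Concatenating the predictions for all ten held-out parts gives automatic annotation for the whole test set without any sentence being annotated by a model that saw its gold-standard analysis.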
The surprise languages are the following:
Buryat (bxr)
An Altaic language spoken in eastern Russia (Buryat Republic), written in Cyrillic script.
Kurmanji (kmr)
Also known as Northern Kurdish. An Indo-European (Iranian) language spoken in eastern Turkey and surrounding areas, written in Latin script.
North Sámi (sme)
A Uralic language spoken in northern Norway, Sweden and Finland, written in Latin script.
Upper Sorbian (hsb)
An Indo-European (Slavic) language spoken in eastern Germany (Lusatia), written in Latin script.