Baseline models and raw data (sent 2017-03-16 16:37 CET)

Dear shared task participants,

good news: we have some more data for you. Actually, a lot of data.

Here is a list:

  • http://hdl.handle.net/11234/1-1989
    • Raw texts in the 45 shared task languages, obtained from CommonCrawl, Wikipedia and Perseus
      • The texts have been automatically segmented, tagged and parsed by baseline UDPipe models, trained on UD 2.0
    • Word embeddings computed on those raw texts
  • http://hdl.handle.net/11234/1-1990
    • UD 2.0 training and development data with morphology predicted by UDPipe (baseline models used for development data, 10-fold cross-validation used for training data). Segmentation and syntax are gold-standard.
    • UDPipe baseline models trained on UD 2.0 training data.
    • UD 2.0 data split in the way used to train the baseline models.
    • Supplementary material and hyperparameter values needed to replicate the baseline.

Be warned that the raw texts are LARGE (well, what else are they supposed to be?). There are over 90 billion words in total. You need 630 GB of disk space to download all the xzipped archives, and about 2 TB to uncompress them. For your convenience, the pre-computed word embeddings can be downloaded separately (22 GB), and the annotated raw texts are provided in separate archives by language. If you want to download them all, instead of clicking you may want to try wget in a loop, something like:

for i in Ancient_Greek Arabic Basque Bulgarian Catalan ChineseT Croatian Czech \
         Danish Dutch English Estonian Finnish French Galician German Greek \
         Hebrew Hindi Hungarian Indonesian Irish Italian Japanese Kazakh Korean \
         Latin Latvian Norwegian-Bokmaal Norwegian-Nynorsk Old_Church_Slavonic \
         Persian Polish Portuguese Romanian Russian Slovak Slovenian Spanish \
         Swedish Turkish Ukrainian Urdu Uyghur Vietnamese ; do
  wget "https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-1989/$i-annotated-conll17.tar"
done
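
Given the sizes involved, a download may well be interrupted, or the disk may fill up. The following is only a rough sketch of how the loop above could be made a bit more careful: it checks free disk space first and lets wget continue partial files. The helper names (archive_url, free_gb, fetch_language) are made up for this illustration and are not part of the release; the URL pattern is simply the one from the loop above, and the GB figure is computed in binary units, so treat it as approximate.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not part of the release.

BASE_URL="https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-1989"

# Build the archive URL for one language, following the pattern of the loop above.
archive_url() {
    printf '%s/%s-annotated-conll17.tar\n' "$BASE_URL" "$1"
}

# Print the free space, in whole GB (1 GB = 1048576 KB here), on the
# filesystem that holds directory $1. Uses the POSIX output format of df.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# Download one archive; -c makes wget continue a partially downloaded file
# instead of starting over.
fetch_language() {
    wget -c "$(archive_url "$1")"
}

# Example use (commented out so that sourcing this file downloads nothing):
# [ "$(free_gb .)" -ge 630 ] || echo "Less than 630 GB free here" >&2
# for i in Czech English; do
#     fetch_language "$i" || echo "FAILED: $i" >&2
# done
```

Running the same loop again after an interruption should then only fetch what is still missing or incomplete, provided the server supports resuming.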

Best regards,

Dan Zeman

on behalf of the costocom :) (CoNLL shared task organizing committee) http://universaldependencies.org/conll17/