Test data sizes (sent 2017-05-08 22:40 CET)
Dear all,
to help you estimate your memory needs, here are some measurements of the test data:
There are 82 test files (language-treebank codes) in total (56 “large” treebanks, 8 “small” treebanks without dev data, 4 surprise languages and 14 test sets from the new parallel treebank).
The largest test file has roughly 150,000 words (whitespace delimited, no tokenization applied). Most test files are significantly smaller than that. The total number of “words” (again, no tokenization) is about 1.36 million.
If you use the segmentation predicted by UDPipe (the *-udpipe.conllu files are your input), some sentences may be very long, because segmentation is harder on some datasets. Expect sentences of almost 300 words.
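If you want to check these figures against your own input, here is a minimal Python sketch that counts sentences, words and the longest sentence in a CoNLL-U file. The function name conllu_stats is only illustrative. It assumes the standard CoNLL-U layout (comment lines start with "#", sentences are separated by blank lines, the first tab-separated column is the ID) and counts only lines with a plain integer ID, so multiword token ranges such as 1-2 and empty nodes such as 1.1 are skipped. The resulting counts will therefore not match the whitespace-delimited counts given above exactly.

```python
def conllu_stats(lines):
    """Return (sentences, words, longest_sentence) for an iterable of CoNLL-U lines."""
    sentences = 0
    words = 0
    longest = 0
    current = 0  # words seen so far in the current sentence
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence (if it had any words).
            if current:
                sentences += 1
                longest = max(longest, current)
                current = 0
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        token_id = line.split("\t", 1)[0]
        # Plain integer IDs only: skips ranges ("1-2") and empty nodes ("1.1").
        if token_id.isdigit():
            words += 1
            current += 1
    # The last sentence may not be followed by a blank line.
    if current:
        sentences += 1
        longest = max(longest, current)
    return sentences, words, longest
```

For example, opening one of your *-udpipe.conllu input files with encoding="utf-8" and passing the file object to conllu_stats gives the three numbers for that file. Reading line by line keeps memory use independent of the file size.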
Best,
Dan Zeman