The TIRA platform
TIRA will be used to evaluate the systems. Participants will install their systems in dedicated virtual machines (VMs) provided by TIRA. During the test phase, the systems will get access to the test data and process it inside the VM. The evaluation script will run there as well.
There is some flexibility in the operating system and hardware resources available in the VM. Once registered, you will be contacted by TIRA administrators and asked about your preferences. You will also be offered a private GitHub repository where you can keep and version-control the source code of your parser (and increase future reproducibility of your results).
Typically, you will train your models on your own hardware. Once ready, you will upload both your parsing system and the models to the VM. It is not forbidden to train the models directly in the VM, but note that the resources there are limited.
When your system and models have been deployed in your VM, proceed to the TIRA web interface (same login as for your VM), register the shell command that runs your system, and run it. Note that your VM will not be accessible while your system is running: it will be “sandboxed”, detached from the internet, and after the run the VM will be restored to the state it had before the run. Your run can then be reviewed and evaluated by the organizers.
Note that your system is expected to read the paths to the input and output folders from the command line. When you register the command to run your system, put variables in the positions where you expect to see these paths. Thus, if your system expects the options -i and -o, followed by the input and output path respectively, the command you register may look like this:
/home/my-user-name/my-software/run.sh -i $inputDataset -o $outputDir
The actually executed command will then look something like this:
/home/my-user-name/my-software/run.sh -i /media/training-datasets/universal-dependency-learning/conll17-ud-development-2017-03-19 -o /tmp/conll17-baseline/2017-03-29-09-35-53/output
See the links below for more details.
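As an illustration of reading the two paths from the command line, here is a minimal Python sketch. It assumes the -i and -o options from the example command above; the function name `parse_args` and the destination names `input_dir` and `output_dir` are hypothetical, and your own system may of course use a different language or option parser.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Read the input and output folders that TIRA substitutes into the command."""
    parser = argparse.ArgumentParser(description="Hypothetical parser entry point")
    parser.add_argument("-i", dest="input_dir", type=Path, required=True,
                        help="folder with the input files")
    parser.add_argument("-o", dest="output_dir", type=Path, required=True,
                        help="folder where all output files must be written")
    # argv=None makes argparse fall back to the real command line (sys.argv).
    return parser.parse_args(argv)
```

A wrapper such as run.sh could then pass its arguments straight through to the Python script, and the script would call `parse_args()` without arguments to pick them up.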
Processing the data on TIRA
Within your VM, you can see the development and trial data mounted read-only at
(trial data is a small subset of development data that you can use for quick debugging, without having your VM sandboxed for too long). First try running your system on these datasets from within your VM (no sandboxing), then try the same through the web interface (where everything works as in the test phase, including sandboxing). When invoked from the web interface, your system will be given the path to the input folder and the path to the output folder, where it is supposed to generate all output files. When you run the system on development or trial data, the input path will lead to the location mentioned above. During the test phase, however, it will be a different path to which you normally don’t have access. And while the TIRA folder with development data also contains the gold standard files, the trial and test folders contain only the two permitted input files: raw text and CoNLL-U preprocessed by UDPipe.
As you will see, the file names are slightly different from the UD release, and files for all languages are in one folder. There are two extra files, metadata.json and README.txt (which documents the fields in metadata.json; don’t let your system rely on the name field, as this field is not available for the test data!). Your system should start by reading metadata.json, which contains the list of input files that must be processed and the names of the corresponding output files that must be generated in the output folder. The metadata will also tell you the language code and treebank code of each input file (although the codes are typically also used in file names, the proper place where your system should read them is the metadata file). For test files that correspond to a UD 2.0 treebank, these codes will match those you know from the UD release. But remember that you are also supposed to process (1) unknown treebank codes for known languages and (2) even unknown language codes (surprise languages). If your system fails to provide a valid CoNLL-U output for an input file, its score on that part will be zero. Even a random tree should be better than zero, so make sure to generate something even if surprise languages are not your focus in this task.
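The metadata-driven loop and the "always output something" advice can be sketched as follows. This is only an illustration under stated assumptions: it assumes that metadata.json is a JSON list of objects and that the fields are called `psegmorfile` (the UDPipe-preprocessed input) and `outfile`; check README.txt for the actual field names. The functions `fallback_parse` and `run` are hypothetical helpers, and the fallback tree (first word attached to the root, every other word attached to the first word with the relation `dep`) is just one trivial choice.

```python
import json
from pathlib import Path


def fallback_parse(conllu_text):
    """Attach the first word of each sentence to the root and all other words to it."""
    out_lines = []
    first_id = None
    for line in conllu_text.splitlines():
        if not line.strip():
            first_id = None          # a blank line ends the sentence
            out_lines.append("")
        elif line.startswith("#"):
            out_lines.append(line)   # keep comment lines unchanged
        else:
            cols = line.split("\t")  # assumes the 10 CoNLL-U columns are present
            if cols[0].isdigit():    # skips multiword token ranges such as 1-2
                if first_id is None:
                    first_id = cols[0]
                    cols[6], cols[7] = "0", "root"
                else:
                    cols[6], cols[7] = first_id, "dep"
            out_lines.append("\t".join(cols))
    return "\n".join(out_lines) + "\n"


def run(input_dir, output_dir, parse=fallback_parse):
    """Process every file listed in metadata.json and always write an output file."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: metadata.json holds a list with one object per input file.
    entries = json.loads((input_dir / "metadata.json").read_text(encoding="utf-8"))
    for entry in entries:
        text = (input_dir / entry["psegmorfile"]).read_text(encoding="utf-8")
        try:
            parsed = parse(text)     # a real system would pick a model from the codes
        except Exception:
            parsed = fallback_parse(text)  # never leave a file without output
        (output_dir / entry["outfile"]).write_text(parsed, encoding="utf-8")
```

Passing your real parser as the `parse` argument keeps the fallback as a safety net for files on which the parser crashes, for example those with an unknown language code.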
If you want to test your system locally, you can download the data with the folder structure used at TIRA, and with the input files preprocessed by UDPipe, from http://ufal.mff.cuni.cz/~zeman/soubory/tira-data-participants.zip.
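After a local run, a quick sanity check can catch the most obvious problems before you use the web interface. The sketch below is not the official evaluator; it only verifies that every expected output file exists and that each token line has the 10 tab-separated columns of CoNLL-U. It again assumes that metadata.json is a JSON list of objects with an `outfile` field (see README.txt for the actual field names), and `check_outputs` is a hypothetical helper name.

```python
import json
from pathlib import Path


def check_outputs(input_dir, output_dir):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    metadata_path = Path(input_dir) / "metadata.json"
    entries = json.loads(metadata_path.read_text(encoding="utf-8"))
    for entry in entries:
        out_path = Path(output_dir) / entry["outfile"]
        if not out_path.is_file():
            problems.append(f"missing output file: {out_path.name}")
            continue
        lines = out_path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, 1):
            # Blank lines separate sentences; lines starting with '#' are comments.
            if line and not line.startswith("#") and len(line.split("\t")) != 10:
                problems.append(
                    f"{out_path.name}:{number}: expected 10 tab-separated columns")
                break  # report only the first malformed line per file
    return problems
```

An empty result does not guarantee that the output is valid CoNLL-U; it only rules out missing files and lines with the wrong number of columns.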
Test phase (extended: May 8 – 14)
On May 8, the test data will become available in TIRA (we will send an announcement when it is ready) and you will be able to run your system on it. Once the run of your system completes, please also run the evaluator on the output of your system. These are two separate actions and both should be invoked through the web interface of TIRA. You don’t have to install the evaluator in your VM. It is already prepared in TIRA. You should see it in the web interface, under your software, labeled “Evaluator”. Before clicking the “Run” button, you will use a drop-down menu to select the “Input run”, i.e. one of the completed runs of your system. The output files from the selected run will be evaluated.
You will see neither the files your system outputs, nor your STDOUT or STDERR. In the evaluator run you will see STDERR, which does not contain the evaluation scores but will tell you if there is a problem with one or more of the output files.
From time to time, the runs will be reviewed by the organizers and you will be able to see the review reports. Typically we will only review the evaluator runs (a “No Errors” review of the system run would only tell you that the run appears to have completed, but nothing about the validity of the output files). Even with the system runs you usually do not need to see the review because the most likely problems, such as a missing output file, can be recognized from the evaluator’s STDERR that is visible to you. However, if the reviewer sees error messages output by your system, they may decide to unblind the run. In your TIRA interface, you can recognize unblinded test runs by the information about runtime and size (normally these fields say “hidden”). You still cannot download the output files but you can now see the STDOUT and STDERR. Alternatively, the reviewer may just copy an error message and send it to you without unblinding the entire run. If you are unsure about what caused your system to fail, write to us and ask for a review of a specific system run.
In general, we will not tell you the score of your system before the test phase closes. The exception is when your score on one or more files is anomalously low, which probably means a bug or another technical problem.
You can register more than one system (“software”) per virtual machine. You can have all of them officially scored and use those numbers in your system description paper. However, you have to decide which system is the primary one that will represent your team in the shared task. (You have to decide that based on your experiments with the development data, without knowing the actual performance on the test data.) TIRA gives systems automatic names “Software 1”, “Software 2” etc. If possible, please make sure that “Software 1” is your primary system. If this is not possible because you deleted “Software 1” or because you changed your decision after completing runs that you do not want to delete, send us a message with the name of your primary system.
You can run one system multiple times (especially if we tell you that there is a problem with your previous run). If there are multiple successful runs of your primary system, the last one will be considered your submission to the shared task. You can delete a run if you want a previous one to be submitted.
If you modify your software between runs, please make sure you can reinstate the version that was used in the previous run. (You can use the private GitHub repository offered by TIRA for this purpose.) This is important for reproducibility of the final results of the shared task. If your last run is not successful and the previous run becomes your official submission, you should be able to return to the version of your system that generated the submission.
There is no time limit on your run other than that it must complete before the end of the test phase, which is (extended to) Sunday May 14, 23:59, Samoa Standard Time (Monday May 15, 6:59 EDT, 10:59 UTC, 12:59 CEST or 19:59 JST).
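If you want to double-check the deadline in your own time zone, the conversion can be sketched as below. The year 2017 is an assumption taken from the dataset names shown earlier, and the fixed UTC offsets (Samoa Standard Time at UTC-11, EDT at UTC-4, Central European Summer Time at UTC+2, JST at UTC+9) are the offsets these zones use in May; `local_deadlines` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Samoa Standard Time is 11 hours behind UTC.
SST = timezone(timedelta(hours=-11), "SST")

# Assumption: the year is 2017, as suggested by the dataset names.
DEADLINE = datetime(2017, 5, 14, 23, 59, tzinfo=SST)

# Fixed UTC offsets (in hours) valid in May for the zones of interest.
OFFSETS = {"EDT": -4, "UTC": 0, "CEST": 2, "JST": 9}


def local_deadlines(moment=DEADLINE):
    """Express the same instant in each zone listed in OFFSETS."""
    return {
        name: moment.astimezone(timezone(timedelta(hours=hours), name))
                    .strftime("%Y-%m-%d %H:%M")
        for name, hours in OFFSETS.items()
    }
```

To check another zone, add its name and UTC offset to `OFFSETS`.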
If your system requires more resources than available in the default VM (memory, disk space, CPUs), please estimate what you need and discuss it with Martin Potthast, the administrator of TIRA. You can get a VM with more resources. Note however that accommodating such requests takes time, so act early. The sooner you complete at least one successful run, the safer you are.
Access to the Virtual Machines and Intellectual Property Rights
The VMs are distributed across different hosts. The only people who have access to the participant VMs are TIRA admins (a very small group of people operating the service) and the organizers of the CoNLL shared task.
We can guarantee that we will never deliberately share your VM or its contents, nor use it for anything else but for the purpose of evaluating your software as part of the shared task, unless you give us written permission. We ask that you give the CoNLL shared task organizers and the TIRA operators usage rights for your software for this purpose only.
However, we cannot guarantee that no content of the VM will leak accidentally, and we shall not be held liable for damages caused by such leaks. In particular, we cannot vouch that the software packages and operating systems TIRA depends on are free of zero-day exploits.
The performance results and output of your software will become part of public record, for which we ask for indefinite, irrevocable, and transferable rights to publish them within any scientific publication as well as on the TIRA web service.
By deploying your system in your VM and running it through the TIRA interface you express your consent to these conditions and give us the rights as described above.
We understand that for industry-based participants, protecting their software is an important matter. If you want to learn more about the TIRA procedures and about your options, please get in touch with the TIRA administrators (tira at webis dot de). TIRA has been used by a number of companies so far, some small ones but also some big ones. The involvement of industry in scientific events should not be foreclosed. If we are to improve reproducibility at large, however, there is no way around venturing more openness on either side.