The TIRA platform
TIRA will be used to evaluate the systems. Participants will install their systems in dedicated virtual machines (VMs) provided by TIRA. During the test phase, the systems will get access to the test data and process it inside the VM. The evaluation script will run there as well.
There is some flexibility in the operating system and hardware resources available in the VM. Once registered, you will be contacted by TIRA administrators and asked about your preferences. You will also be offered a private GitHub repository where you can keep and version-control the source code of your parser (and increase future reproducibility of your results).
Typically, you will train your models on your own hardware. Once ready, you will upload both your parsing system and the models to the VM. It is not forbidden to train the models directly in the VM but note that the resources there are limited.
When your system and models have been deployed in your VM, proceed to the TIRA web interface (same login as for your VM), register the shell command that runs your system, and run it. Note that your VM will not be accessible while your system is running: it will be “sandboxed”, detached from the internet, and after the run the VM will be restored to the state it was in before the run. Your run can then be reviewed and evaluated by the organizers.
Note that your system is expected to read the paths to the input and output
folders from the command line. When you register the command to run your
system, put variables in the positions where you expect to see these paths. Thus
if your system expects to get the options -i and -o, followed by the input
and output path respectively, the command you register may look like this:
/home/my-user-name/my-software/run.sh -i $inputDataset -o $outputDir
The actually executed command will then look something like this:
/home/my-user-name/my-software/run.sh -i /media/training-datasets/universal-dependency-learning/conll18-ud-development-2018-05-06 -o /tmp/my-user-name/2018-06-02-09-35-53/output
See the links below for more details.
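On the system side, the two paths then only need to be read from the command line. The following is a minimal sketch in Python, assuming that run.sh forwards its arguments to a Python entry point; the function name parse_args and the attribute names input_dir and output_dir are illustrative and not part of TIRA.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the -i/-o options that TIRA fills in from $inputDataset and $outputDir."""
    parser = argparse.ArgumentParser(description="Hypothetical parser entry point")
    parser.add_argument("-i", dest="input_dir", type=Path, required=True,
                        help="input folder (substituted for $inputDataset)")
    parser.add_argument("-o", dest="output_dir", type=Path, required=True,
                        help="output folder (substituted for $outputDir)")
    return parser.parse_args(argv)


# Example: the argument list from the executed command shown above.
args = parse_args([
    "-i", "/media/training-datasets/universal-dependency-learning/"
          "conll18-ud-development-2018-05-06",
    "-o", "/tmp/my-user-name/2018-06-02-09-35-53/output",
])
print("input folder: ", args.input_dir)
print("output folder:", args.output_dir)
```

In a real entry point you would call parse_args() without an argument list so that the options are taken from the actual command line.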
Processing the data on TIRA
Within your VM, you can see the development and trial data mounted read-only at
(trial data is a small subset of the development data that you can use for quick debugging, without having your VM sandboxed for too long). First try running your system on these datasets from within your VM (no sandboxing), then try the same through the web interface (where everything works as in the test phase, including sandboxing). When invoked from the web interface, your system will be given the path to the input folder and the path to the output folder, where it is supposed to generate all output files. When you run the system on development or trial data, the input path will lead to the location mentioned above. During the test phase, however, it will be a different path, to which you normally don’t have access. And while the TIRA folder with development data also contains the gold-standard files, the trial and test folders contain only the two permitted input files: raw text and CoNLL-U preprocessed by UDPipe.
As you will see, the file names are slightly different from the UD release,
and files for all languages are in one folder. There are two extra files,
metadata.json and README.txt
(which documents the fields in metadata.json).
Your system should start by reading
metadata.json, which contains the list
of input files that must be processed, and the names of corresponding output
files that must be generated in the output folder. The metadata will also tell
you the language code and treebank code of each input file (although the codes
are typically also used in file names, the proper place where your system
should read them is the metadata file). For test files that correspond to a
training dataset, these codes will match those you know from the UD release.
But remember that you are also supposed to process unknown language/treebank
codes. If your system fails to provide a valid CoNLL-U output for an input
file, its score on that part will be zero. Even a random tree should be better
than zero, so make sure to generate something even if low-resource languages
are not your focus in this task.
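The metadata-driven loop with a last-resort fallback could be sketched as follows. This is only an illustration under stated assumptions: the field names (psegmorfile, outfile, lcode, tcode) and the assumption that metadata.json holds a list of objects are placeholders, so check README.txt in the input folder for the actual layout. The function parse_file stands in for your real parser, and flat_tree is one arbitrary way to produce a trivial tree (first word attached to the root, all other words attached to the first word).

```python
import json
from pathlib import Path

# ASSUMPTION: placeholder field names; README.txt documents the real ones.
INPUT_KEY = "psegmorfile"   # CoNLL-U input preprocessed by UDPipe
OUTPUT_KEY = "outfile"      # name of the file to create in the output folder
LANG_KEY = "lcode"          # language code
TREEBANK_KEY = "tcode"      # treebank code


def flat_tree(conllu_text):
    """Fallback: make the first word of each sentence the root and attach the rest to it."""
    out = []
    first_id = None
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if not line.strip():
            first_id = None  # a blank line ends the sentence
        elif not line.startswith("#") and len(cols) == 10 and cols[0].isdigit():
            # Regular word line; multiword-token ranges and empty nodes are left unchanged.
            if first_id is None:
                first_id = cols[0]
                cols[6], cols[7] = "0", "root"
            else:
                cols[6], cols[7] = first_id, "dep"
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out) + "\n"


def parse_file(text, lcode, tcode):
    """Hypothetical stand-in for the real parser; fails when no model is available."""
    raise NotImplementedError(f"no model for {lcode}_{tcode}")


def process_all(input_dir, output_dir):
    """Produce one output file for every entry listed in metadata.json."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    # ASSUMPTION: metadata.json contains a JSON list with one object per input file.
    entries = json.loads((input_dir / "metadata.json").read_text(encoding="utf-8"))
    for entry in entries:
        text = (input_dir / entry[INPUT_KEY]).read_text(encoding="utf-8")
        try:
            result = parse_file(text, entry[LANG_KEY], entry[TREEBANK_KEY])
        except Exception:
            # Unknown code or parser crash: still emit something valid.
            result = flat_tree(text)
        (output_dir / entry[OUTPUT_KEY]).write_text(result, encoding="utf-8")
```

Catching every exception per file is a deliberate choice here: a failure on one language then costs only that file's score and does not abort the whole run.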
When you have tested your system on the development and trial data and everything works fine, run it on the test data. Once the run of your system completes, please also run the evaluator on the output of your system. These are two separate actions and both should be invoked through the web interface of TIRA. You don’t have to install the evaluator in your VM. It is already prepared in TIRA. You should see it in the web interface, under your software, labeled “Evaluator”. Before clicking the “Run” button, you will use a drop-down menu to select the “Input run”, i.e. one of the completed runs of your system. The output files from the selected run will be evaluated.
You will see neither the files your system outputs, nor your STDOUT or STDERR. In the evaluator run you will see STDERR, which will tell you if one or more of your output files is not valid. This year it will also show you your score rounded to the nearest multiple of 5%. It is not meant for you to select the best-performing system; however, it should help you spot problems that lead to unexpectedly low scores rather than invalid output and complete failure. If you think something went wrong with your run, send us an e-mail. We can unblind your STDOUT and STDERR on demand, after we check that you did not leak the test data in the output.
We may also proactively unblind your runs if we see an invalid run, but do not rely on it. In your TIRA interface, you can recognize unblinded test runs by the information about runtime and size (normally these fields say “hidden”). You still cannot download the output files, but you can now see the STDOUT and STDERR. Alternatively, the reviewer may just copy an error message and send it to you without unblinding the entire run. If you are unsure about what caused your system to fail, write to us and ask for a review of the specific system run.
You can register more than one system (“software”) per virtual machine. You can have all of them officially scored and use those numbers in your system description paper. However, you have to decide which one is the primary system that will represent your team in the shared task. TIRA automatically names systems “Software 1”, “Software 2”, etc. If possible, please make sure that “Software 1” is your primary system. If this is not possible, because you deleted “Software 1” or because you changed your decision after completing runs that you do not want to delete, send us a message with the name of your primary system.
You can run one system multiple times (especially if we tell you that there is a problem with your previous run). If there are multiple successful runs of your primary system, the last one will be considered your submission to the shared task. You can delete a run if you want a previous one to be submitted.
You can also make partial runs that process only selected test files. The advantage is that you need less time to wait for the results (and see whether your system managed or failed to create a valid output); the disadvantage (both for you and for us) is that more human effort is needed to invoke and evaluate the runs. If the last run of your primary software lacks a valid output for a particular test file, we will look for a valid output on that test file in the previous runs, and we will stitch the partial runs as necessary. However, if you manage to parse all test files via partial runs, and there is still time, please try to redo everything in a single run.
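The stitching rule can be pictured with a short sketch. It illustrates the rule as described here and is not the organizers' actual procedure; the run identifiers and file names are made up.

```python
def stitch_runs(runs, test_files):
    """For each test file, pick the most recent run that produced valid output.

    `runs` is a chronologically ordered list of (run_id, valid_files) pairs,
    where valid_files is the set of test files with valid output in that run.
    """
    chosen = {}
    for name in test_files:
        chosen[name] = None  # stays None if no run has valid output for this file
        for run_id, valid_files in reversed(runs):
            if name in valid_files:
                chosen[name] = run_id
                break
    return chosen


# Hypothetical example: two partial runs of the primary software.
runs = [
    ("run-1", {"cs.conllu", "en.conllu"}),
    ("run-2", {"en.conllu", "fi.conllu"}),
]
print(stitch_runs(runs, ["cs.conllu", "en.conllu", "fi.conllu", "ja.conllu"]))
# {'cs.conllu': 'run-1', 'en.conllu': 'run-2', 'fi.conllu': 'run-2', 'ja.conllu': None}
```

A file that no run covers (ja.conllu in the example) ends up without any output, which is why a complete single run is still preferable when time allows.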
If you modify your software between runs, please make sure you can reinstate the version that was used in the previous run. (You can use the private GitHub repository offered by TIRA for this purpose.) This is important for reproducibility of the final results of the shared task. If your last run is not successful and the previous run becomes your official submission, you should be able to return to the version of your system that generated the submission.
There is no time limit on your run other than that it must complete before the end of the test phase, which is Sunday July 1, 23:59, Samoa Standard Time / GMT-11 (Monday July 2, 6:59 EDT, 10:59 UTC, 12:59 CEST or 19:59 JST). While we may be able to include some late arrivals in the final ranking, we do not guarantee it.
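If your time zone is not listed, the deadline is easy to convert yourself. The sketch below assumes the year 2018 (inferred from the dataset names on this page) and uses fixed UTC offsets, so replace the offsets with your own.

```python
from datetime import datetime, timedelta, timezone

# ASSUMPTION: the year 2018 is inferred from the dataset names on this page.
deadline = datetime(2018, 7, 1, 23, 59, tzinfo=timezone(timedelta(hours=-11)))

# Fixed offsets from UTC in hours (Eastern Daylight, Central European Summer, Japan Standard).
offsets = {"UTC": 0, "EDT": -4, "CEST": 2, "JST": 9}
for name, hours in offsets.items():
    local = deadline.astimezone(timezone(timedelta(hours=hours)))
    print(name, local.strftime("%Y-%m-%d %H:%M"))
```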
If your system requires more resources than are available in the default VM (memory, disk space, CPUs), please estimate what you need and discuss it with Martin Potthast, the administrator of TIRA. You can get a VM with more resources. Note, however, that accommodating such requests takes time, so act early. The sooner you complete at least one successful run, the safer you are.
Access to the Virtual Machines and Intellectual Property Rights
The VMs are distributed across different hosts. The only people who have access to the participant VMs are TIRA admins (a very small group of people operating the service) and the organizers of the CoNLL shared task.
We can guarantee that we will never deliberately share your VM or its contents, nor use it for anything else but for the purpose of evaluating your software as part of the shared task, unless you give us written permission. We ask that you give the CoNLL shared task organizers and the TIRA operators usage rights for your software for this purpose only.
However, we cannot guarantee that no content of the VM will leak accidentally, and we shall not be held liable for damages caused by such leaks. In particular, we cannot vouch that the software packages and operating systems TIRA depends on are free of zero-day exploits.
The performance results and output of your software will become part of public record, for which we ask for indefinite, irrevocable, and transferable rights to publish them within any scientific publication as well as on the TIRA web service and on the shared task website.
By deploying your system in your VM and running it through the TIRA interface you express your consent to these conditions and give us the rights as described above.
We understand that for industry-based participants, protecting their software is an important matter. If you want to learn more about the TIRA procedures and about your options, please get in touch with the TIRA administrators (tira at webis dot de). TIRA has been used by a number of companies so far, some small ones but also some big ones. The involvement of industry in scientific events should not be foreclosed. If we are to improve reproducibility at large, however, there is no way around venturing more openness on either side.