How to Create a (UD) Treebank
The purpose of this document is to provide some advice to people who are interested in building a treebank for a new language and do not know where to start. This complements the existing guidelines on how to contribute to Universal Dependencies, which describe the technical steps needed to make the data released, but not the actual annotation process.
Select the Texts
(DISCLAIMER: This text has not been written by a lawyer and should not be taken as legal advice.)
The first thing you should think about is copyright. A treebank is really useful only if it can be freely used, modified and redistributed by the research community. It is not enough that you are willing to put your annotation under a free license. If the underlying text is protected by copyright, then it cannot be distributed together with your annotation unless you have negotiated permission from the copyright holder. Most UD treebanks are distributed with a Creative Commons license. Some of them only permit usage for non-commercial purposes, but if you are not forced to put such a restriction on your data, it is better to leave the data available for commercial use as well. UD contains many treebanks that can be used by commercial entities, so the result of a non-commercial license may be that these users will simply ignore your data.
To sum up, you need texts that are in the public domain, or that are provided with a free license such as Creative Commons, or whose copyright protection has expired, or for which the copyright holder gave you written permission to use the texts in a corpus AND to let others redistribute them. Texts obtained by random web crawling are generally not copyright-free. Some corpora try to circumvent this problem by disassembling the text into individual sentences or short multi-sentence segments, then shuffling and randomly selecting the sentences so that the original copyrighted work cannot be reconstructed. While this approach seems to be relatively safe legally, it has the unfortunate consequence that linguistic research cannot go beyond the sentence boundary. If you are looking for freely available and redistributable text, Wikipedia articles are often low-hanging fruit; however, if at all possible, you should try to mix multiple domains and genres in your corpus.
Look for Preprocessing Tools and Resources
Check whether you are really starting from scratch. If your language does not have a treebank, it does not necessarily mean that it does not have anything else. Is there a morphological analyzer? Tagger? Lemmatizer? At least a partial vocabulary for a spell checker? Parallel corpus or a machine translation model? Or maybe you cannot find a tagged corpus but a tagset has been designed and you can use its documentation to prepare your UD guidelines? All these things may prove helpful. If you can find tools to tag and lemmatize your data, you will want to design automatic conversion of the morphological annotation and then concentrate on syntactic parsing.
If such resources (or even a parser) exist for a closely related language, you may be able to adapt them for your language. Or, if there are parallel data between your language and one or more other languages, you may be able to project annotation across word alignment.
All these techniques are only approximate and will not give you the data you want to release. But they can save you a lot of work. Fixing errors in automatically predicted annotation is usually faster than hand-annotating everything. Even if you start from scratch, it may be useful to bootstrap: you hand-annotate a small sample (e.g. 100 sentences), then train a tokenizer, tagger and parser (e.g. UDPipe), use it to preprocess the next batch of data, manually fix the annotation, train a better model on the larger data and so on. You should be careful though. The better the output of the preprocessing tools, the easier it is to overlook the remaining errors, so more caution is needed to spot them.
Annotation Process
An ever-growing list of UD-related tools is maintained here. Among them are several annotation and visualization tools. None of them is perfect; experiment with several and see what works best for you. As you become more proficient in annotation, you will appreciate a tool that allows you to annotate rapidly, using the keyboard for everything except dependency links, which are typically dragged and dropped using the mouse. Some people have successfully used even generic software such as spreadsheets for the initial stages (e.g. if you manually disambiguate morphological categories, you may take advantage of the program’s ability to guess the value based on your first keystroke and similar values above in the same column). However, you need to be able to convert the table back to the CoNLL-U format used in UD later. Also, for any other tool, check carefully whether it keeps all information that was in your input CoNLL-U file (e.g., sentence-level comments). If the tool discards some information that should not be discarded, and you still want to use this tool, you have to be able to merge your annotation saved by the tool with the lost information from the original file.
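Converting a spreadsheet back to CoNLL-U can be sketched as follows. This is only an illustration under stated assumptions: the table was exported as tab-separated text, each token row holds the ten CoNLL-U columns in order, an empty row separates sentences, and sentence-level comments sit in the first cell of their own row. The function name table_to_conllu is hypothetical and not part of any UD tool.

```python
import csv
import io

CONLLU_COLUMNS = 10  # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC


def table_to_conllu(tsv_text):
    """Convert a tab-separated spreadsheet export to CoNLL-U text.

    Assumed layout (not prescribed by UD): one token per row, an empty
    row between sentences, comments in the first cell of their own row.
    """
    out = []
    # QUOTE_NONE keeps quotation marks in word forms as literal characters.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        # A row without any content marks a sentence boundary.
        if not any(cell.strip() for cell in row):
            if out and out[-1] != "":
                out.append("")
            continue
        first = row[0].strip()
        if first.startswith("#"):
            # Sentence-level comment: keep it, drop the empty cells after it.
            out.append(first)
            continue
        # Empty cells become "_"; rows cut short by the export are padded.
        cells = [cell.strip() or "_" for cell in row[:CONLLU_COLUMNS]]
        cells += ["_"] * (CONLLU_COLUMNS - len(cells))
        out.append("\t".join(cells))
    if not out:
        return ""
    if out[-1] != "":
        out.append("")  # every sentence ends with a blank line
    return "\n".join(out) + "\n"
```

The result should still be checked with the official validator, because a conversion like this only restores the file layout and cannot tell whether the annotation itself is valid.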
Convert your data to CoNLL-U as soon as possible and regularly run the official validation script, validate.py, to spot and fix possible systematic errors. Once you have the first data sample, you should follow the instructions in the release checklist, apply for a GitHub repository for your treebank, upload your data there and then regularly check the on-line validation site to double-check the validity of your data (note that the on-line infrastructure runs additional tests not included in validate.py!). If your data is valid, it will become part of the next UD release. We have a release-early-release-often policy. No problem if you just have 100 sentences at the moment! There are other such small treebanks in UD, and one hundred is always better than zero. Once your data gets out, people will start using it and you may get useful feedback. It may not be perfect but you will be able to improve the quality and quantity for the next release; releases normally happen every six months. And if you do not have time to contribute more, someone else may be able to pick up where you stopped.
While UD does accept treebanks created by solo researchers, you can achieve better results if you have a team of several people. It is then customary that every sentence is annotated independently by two annotators, differences between the annotations are automatically identified and a third annotator makes sure that they are resolved and a consistent approach is taken. Inter-annotator agreement is an interesting measure of annotation complexity which is usually evaluated and reported in scientific papers describing a corpus.
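The automatic comparison of two annotators can be sketched as below. This is a minimal illustration, assuming both annotators worked on identically tokenized CoNLL-U text; the function names read_sentences and compare are hypothetical. It reports raw agreement on HEAD alone and on HEAD plus DEPREL, which is not a chance-corrected measure, and it lists every disagreement so that the third annotator knows where to look.

```python
def read_sentences(conllu_text):
    """Yield each sentence as a list of 10-column word rows."""
    sentence = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            # Keep only syntactic words; skip multiword token ranges
            # (IDs like 1-2) and empty nodes (IDs like 1.1).
            if cols[0].isdigit():
                sentence.append(cols)
    if sentence:
        yield sentence


def compare(text_a, text_b):
    """Return (head agreement, head+deprel agreement, disagreements)."""
    total = same_head = same_both = 0
    diffs = []
    pairs = zip(read_sentences(text_a), read_sentences(text_b))
    for sent_no, (sent_a, sent_b) in enumerate(pairs, 1):
        if len(sent_a) != len(sent_b):
            raise ValueError(f"sentence {sent_no}: tokenization differs")
        for tok_a, tok_b in zip(sent_a, sent_b):
            total += 1
            head_ok = tok_a[6] == tok_b[6]                # HEAD column
            both_ok = head_ok and tok_a[7] == tok_b[7]    # plus DEPREL
            same_head += head_ok
            same_both += both_ok
            if not both_ok:
                diffs.append((sent_no, tok_a[0], tok_a[1],
                              (tok_a[6], tok_a[7]), (tok_b[6], tok_b[7])))
    if total == 0:
        raise ValueError("no words to compare")
    return same_head / total, same_both / total, diffs
```

Each entry in the returned list gives the sentence number, the word ID and form, and the (HEAD, DEPREL) pair chosen by each annotator.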
Document Language-specific Guidelines
You should read the UD guidelines carefully before you start annotating data. Some decisions (such as what is a core / oblique argument) have to be made for every language separately. Make sure to document them; see here for instructions on how to create language-specific documentation. If you are unsure about how a particular universal guideline applies to your language, do not hesitate to create an issue in the UD GitHub Issue Tracker. This is the forum where such questions are discussed by the community.
Besides top-level decisions such as core arguments, there will be numerous small details such as “did I tag this special word as adverb or as conjunction?” Even if it is too detailed for the top page of your language-specific documentation, you should document your decision somewhere so that you can return to it and follow a consistent approach.
One area that deserves your attention early in the project is word segmentation. Do you believe that there are strong reasons to allow words with spaces or multi-word tokens? These are means that UD provides to overcome intricacies of some languages’ orthography but they should not be adopted thoughtlessly because they complicate processing. Nevertheless, if they are justified, you want to take them into account right from the start, as any later changes are likely to influence all other levels of annotation.
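To see how much a treebank relies on these two devices, one can count them in the CoNLL-U file. The sketch below rests on two facts about the format: a multi-word token occupies its own line with a range ID such as 1-2, and a word with spaces is an ordinary word line whose FORM contains a space. The function name segmentation_stats is hypothetical.

```python
import re

WORD_ID = re.compile(r"[0-9]+")          # ordinary syntactic word, e.g. 3
RANGE_ID = re.compile(r"[0-9]+-[0-9]+")  # multi-word token line, e.g. 1-2


def segmentation_stats(conllu_text):
    """Count words, multi-word tokens and words with spaces."""
    stats = {"words": 0, "multiword_tokens": 0, "words_with_spaces": 0}
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and sentence-level comments
        cols = line.split("\t")
        token_id, form = cols[0], cols[1]
        if RANGE_ID.fullmatch(token_id):
            stats["multiword_tokens"] += 1
        elif WORD_ID.fullmatch(token_id):
            stats["words"] += 1
            if " " in form:
                stats["words_with_spaces"] += 1
        # Empty nodes (IDs like 1.1) are deliberately not counted.
    return stats


# Spanish "vámonos" split into the words "vamos" and "nos";
# the syntactic columns are left unspecified here.
SAMPLE = "\n".join([
    "# text = vámonos",
    "1-2\tvámonos\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tvamos\tir\t_\t_\t_\t_\t_\t_\t_",
    "2\tnos\tnosotros\t_\t_\t_\t_\t_\t_\t_",
    "",
])

print(segmentation_stats(SAMPLE))
```

Unexpectedly high counts on a first sample can be a signal to revisit the segmentation decisions before more annotation is built on top of them.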
Glosses and Transliteration
If your language does not use a Latin-based alphabet, you may want to provide an automatic transliteration to help users who cannot read the script. See the description of the CoNLL-U format: it has optional means of encoding romanization of the word form, the lemma and the entire sentence.
Similarly, there are means of providing a fluent English translation of the sentence, as well as glosses of individual words. If a translation is available, it may be quite useful, especially for languages that are less familiar to the average user. Typical situations where an English translation can be available include treebanks based on parallel data (such as the European Parliament proceedings or the Bible) and treebanks based on sentence examples from reference grammars or typological literature.
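The romanization fields mentioned earlier in this section can be filled automatically along the following lines. In CoNLL-U they are the Translit and LTranslit attributes in the MISC column and the sentence-level comment "# translit". Everything else in this sketch is an assumption made for illustration: the function name add_translit, the toy character table (a real project needs a complete romanization scheme for its script), and the premise that the input file has no "# translit" comments yet.

```python
# Hypothetical toy table covering three Cyrillic letters only.
TOY_TABLE = {"д": "d", "о": "o", "м": "m"}


def toy_romanize(text):
    """Replace known characters; leave everything else unchanged."""
    return "".join(TOY_TABLE.get(ch, ch) for ch in text)


def add_translit(conllu_text, romanize=toy_romanize):
    """Add '# translit' comments and Translit/LTranslit MISC attributes."""
    out = []
    for line in conllu_text.split("\n"):
        if line.startswith("# text = "):
            out.append(line)
            # Sentence-level romanization right after the text comment.
            out.append("# translit = " + romanize(line[len("# text = "):]))
            continue
        cols = line.split("\t")
        if len(cols) == 10 and not line.startswith("#"):
            extra = ["Translit=" + romanize(cols[1])]
            if cols[2] != "_":
                extra.append("LTranslit=" + romanize(cols[2]))
            misc = [] if cols[9] == "_" else cols[9].split("|")
            # Drop stale romanization so reruns do not duplicate attributes.
            misc = [m for m in misc
                    if not m.startswith(("Translit=", "LTranslit="))]
            cols[9] = "|".join(misc + extra)
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)
```

Passing the romanization function as a parameter keeps the file handling separate from the language-specific mapping, so the same loop can serve any script.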