
This page pertains to UD version 2.

Star Ranking of Treebanks

Some languages have multiple treebanks in UD. They may differ along multiple dimensions: size, genre, corpus type (learner corpora, translations / parallel treebanks), dialect, time period, annotation detail and quality. All these factors are potentially important when a user is deciding which treebank(s) to use in their research. As a quick (and inevitably imperfect) help, some of the criteria are compiled into a numeric score that is presented for each released treebank on the homepage.

The score is displayed on a scale of 0 to 5 stars (hover over the stars to show the exact score).
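Conceptually, the display is just a numeric score clipped to the 0–5 range and drawn as full, half and empty stars. The following Python sketch shows one way such a rendering could work; it is purely illustrative (the function name and the half-star rounding are assumptions), and it is not the code used by the UD website.

```python
def render_stars(score: float, max_stars: int = 5) -> str:
    """Render a 0-5 numeric score as a star string, e.g. 3.7 -> '***+.'.

    '*' marks a full star, '+' a half star, '.' an empty star.  Rounding to
    the nearest half star is an assumption of this sketch, not a description
    of how the UD homepage actually draws its icons.
    """
    score = max(0.0, min(float(max_stars), score))
    halves = round(score * 2)          # score expressed in half-star units
    full, half = divmod(halves, 2)
    return "*" * full + ("+" if half else "") + "." * (max_stars - full - half)

print(render_stars(3.7))   # ***+.
print(render_stars(4.95))  # *****
```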

The star ratings signal at a glance which treebank(s) are a good starting point as a representative dataset for the language. They are NOT meant to be compared ACROSS languages.

Scoring System

In a nutshell, the scoring algorithm for the star ratings combines the size of the treebank with criteria that (at least indirectly) reflect the completeness and quality of its annotation, drawing both on the metadata provided by the contributors and on statistics collected from the data itself.

The score is computed by the script evaluate_treebank.pl from the tools repository. If you are wondering how the script arrived at the score for a particular treebank, you can inspect the file eval.log in the master branch of that treebank's repository.
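The details live in evaluate_treebank.pl, but the overall shape of the computation is a weighted combination of per-criterion sub-scores, scaled to the 0–5 range. The Python sketch below illustrates that shape only; the criteria, weights, function names and the size saturation point are hypothetical and do not reproduce the actual script.

```python
import math

# Hypothetical criteria and weights -- NOT the ones used by evaluate_treebank.pl;
# they only illustrate the general shape of the computation.
WEIGHTS = {
    "size": 0.4,       # saturating function of the token count
    "split": 0.1,      # train/dev/test split available
    "lemmas": 0.1,     # lemmas annotated manually rather than predicted
    "features": 0.2,   # morphological features annotated manually
    "genres": 0.2,     # coverage of the genres declared in the metadata
}

def score_treebank(tokens: int, split: bool, manual_lemmas: bool,
                   manual_feats: bool, genre_coverage: float) -> float:
    """Combine per-criterion sub-scores (each in 0..1) into a 0-5 score."""
    size_score = min(1.0, math.log10(max(tokens, 1)) / 6)  # saturates at 1M tokens
    subscores = {
        "size": size_score,
        "split": 1.0 if split else 0.0,
        "lemmas": 1.0 if manual_lemmas else 0.0,
        "features": 1.0 if manual_feats else 0.0,
        "genres": max(0.0, min(1.0, genre_coverage)),
    }
    weighted = sum(WEIGHTS[c] * subscores[c] for c in WEIGHTS)
    return round(5 * weighted, 2)

# Example: a mid-sized treebank with a full data split and manual annotation.
print(score_treebank(200_000, True, True, True, genre_coverage=0.5))  # ~4.27
```

Whatever the exact criteria and weights, the authoritative record of how each released treebank was scored is its eval.log file.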

Discussion

The star ranking was implemented in February 2018 in response to frequent user questions about which treebank of a language was best to use. With the growing number of contributed languages and treebanks, it had become difficult for the release team to give a well-founded answer. (Also, around this time it was decided that all treebank names should include an acronym; using the plain UD_Language naming pattern for each language's first contributed treebank had given the incorrect impression that this was the default or recommended treebank for the language.)

Thus, an automatic scoring method was developed, based on information provided by the treebank contributors in the metadata and on statistics collected directly from the data. Training parsers was still considered an important use case for the treebanks in 2018, hence size plays an important role in the score; but we wanted other criteria as well, in particular ones that could at least indirectly reflect the quality of the annotation. (If we only cared about quantity, we could simply order the treebanks by the number of tokens. But in some languages, for example, we knew that one treebank had its morphological features annotated manually while another had them predicted by a tagging model; the former treebank was therefore considered more trustworthy in that respect.) The list of criteria and their weights was debated in the release task force for quite a while; they could certainly be adjusted back and forth, but they will never be perfect.

There are a few points to keep in mind before taking the rating too seriously:

For related discussion, see issue #691.