Tagset conversion tables to universal tags and features
Disclaimer:
The UD tags have substantive definitions and are not simply equivalence classes of the part-of-speech (POS) tags of
language-particular treebanks. That is, the situation is different from the pure equivalence class approach used in
the original 2011 Google Universal POS Tags work. Some tags can only be mapped correctly if we also know the lemma
or the syntactic context. As a concrete example, we now distinguish adpositions (ADP) from subordinating conjunctions (SCONJ).
This distinction is not available in the Penn Treebank English POS tag set (where both are IN), and for words that
can be used in either role, the correct UD POS tag can only be
recovered by looking at the syntactic context in which the word appears. Such information is not used in the
mappings made available here.
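The limitation can be sketched as a plain table lookup. The snippet below is only an illustration: the names `PENN_TO_UPOS`, `CONTEXT_DEPENDENT`, `map_tag` and `map_sentence` are hypothetical, the handful of entries is hand-written rather than copied from the generated tables, and falling back to `X` for unknown tags is an assumption. It shows that a tag-only lookup must pick one answer for IN and can at best flag the token for later review.

```python
# Illustrative subset of a Penn Treebank -> UD POS mapping.
# Hand-written for this example; NOT the generated conversion table.
PENN_TO_UPOS = {
    "NN": "NOUN",
    "NNS": "NOUN",
    "NNP": "PROPN",
    "JJ": "ADJ",
    "RB": "ADV",
    "DT": "DET",
    "IN": "ADP",  # default guess; the word may really be SCONJ
}

# Source tags whose correct UD tag depends on lemma or syntactic context.
CONTEXT_DEPENDENT = {"IN": {"ADP", "SCONJ"}}


def map_tag(penn_tag):
    """Return (upos, needs_review) using the tag alone as input."""
    upos = PENN_TO_UPOS.get(penn_tag, "X")  # assumption: unknown tags become X
    return upos, penn_tag in CONTEXT_DEPENDENT


def map_sentence(tagged_words):
    """Map (word, penn_tag) pairs to (word, upos, needs_review) triples."""
    return [(word, *map_tag(tag)) for word, tag in tagged_words]


if __name__ == "__main__":
    # "before" is IN in both "before lunch" (adposition) and
    # "before we left" (subordinating conjunction); the lookup cannot tell.
    for row in map_sentence([("before", "IN"), ("lunch", "NN")]):
        print(row)
```

Tokens returned with `needs_review` set are the ones where a parser or a human annotator would have to decide between the candidate UD tags.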
These conversion tables were generated automatically via Interset.
The conversion uses only tags (plus features) as input; the result is therefore only an approximation of the correct UD tags.
Both the tables and the output of applying them to POS-tagged text will thus likely require manual postprocessing
in order to provide accurate and complete UD information.
Nevertheless, in practice, most tags in most languages can still be mapped correctly without
such use of syntactic context or morphological analysis, and hence this automatic conversion can be quite useful.
- ar::padt (Arabic)
- ar::conll (Arabic)
- ar::conll2007 (Arabic)
- bg::conll (Bulgarian)
- bn::conll (Bengali)
- ca::conll2009 (Catalan)
- cs::pdt (Czech)
- cs::conll (Czech)
- cs::ajka (Czech)
- cs::multext (Czech)
- da::conll (Danish)
- de::conll2009 (German)
- de::smor (German)
- de::stts (German)
- el::conll (Greek)
- en::penn (English) (see also the Tsurgeon converter on the English POS and Morphology page)
- es::conll2009 (Spanish)
- et::puudepank (Estonian)
- eu::conll (Basque)
- fa::conll (Persian)
- fi::turku (Finnish)
- grc::conll (Ancient Greek)
- he::conll (Hebrew)
- hi::conll (Hindi)
- hr::multext (Croatian)
- hu::conll (Hungarian)
- it::conll (Italian)
- it::isdt (Italian)
- ja::conll (Japanese)
- ja::ipadic (Japanese)
- la::conll (Latin)
- la::itconll (Latin)
- lt::jablonskis (Lithuanian)
- lt::multext (Lithuanian)
- mt::mlss (Maltese)
- nl::conll (Dutch)
- nl::cgn (Dutch)
- pl::ipipan (Polish)
- pt::cintil (Portuguese)
- pt::conll (Portuguese)
- pt::freeling (Portuguese)
- ro::multext (Romanian)
- ro::rdt (Romanian)
- ru::syntagrus (Russian)
- sk::snk (Slovak)
- sl::conll (Slovenian)
- sl::multext (Slovenian)
- sv::mamba (Swedish)
- sv::parole (Swedish)
- sv::suc (Swedish)
- ta::tamiltb (Tamil)
- te::conll (Telugu)
- tr::conll (Turkish)
- zh::conll (Chinese)
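
Each entry in the list above pairs a language code with a tagset name, joined by `::`, followed by the language name in parentheses. As a minimal sketch for scripts that need to process the list, the following splits one entry into those parts. The names `TagsetId` and `parse_entry` are hypothetical and the layout is inferred only from the entries shown here.

```python
from typing import NamedTuple


class TagsetId(NamedTuple):
    language: str       # language code, e.g. "cs"
    tagset: str         # tagset name, e.g. "pdt"
    language_name: str  # human-readable name, e.g. "Czech"


def parse_entry(entry: str) -> TagsetId:
    """Parse a list entry such as '- cs::pdt (Czech)'."""
    text = entry.lstrip("- ").strip()
    ident, _, rest = text.partition(" ")
    language, sep, tagset = ident.partition("::")
    if not sep or not language or not tagset:
        raise ValueError(f"not a language::tagset identifier: {entry!r}")
    name = rest.strip()
    # Keep only the first parenthesised group; later notes are ignored.
    if name.startswith("(") and ")" in name:
        name = name[1:name.index(")")]
    return TagsetId(language, tagset, name)


if __name__ == "__main__":
    print(parse_entry("- cs::pdt (Czech)"))
```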