This page pertains to UD version 2.

UD for Japanese

Tokenization and Word Segmentation

Japanese text has no explicit word boundaries, so a definition of the word is needed. As the word unit for Universal Dependencies (UD), we adopt the short-unit word (SUW) defined by NINJAL [1,3]. SUW is also used to tokenize sentences in the Balanced Corpus of Contemporary Written Japanese (BCCWJ) [2], which contains more than 50,000 sentences in various domains, and the SUW definition has been shown to cover a wide range of language phenomena in real texts.

[1] A Proper Approach to Japanese Morphological Analysis: Dictionary, Model, and Evaluation, Yasuharu Den, Junpei Nakamura, Toshinobu Ogiso, and Hideki Ogura, In Proceedings of the Sixth International Conference on Language Resources and Evaluation, pp. 1019-1024, 2008.

[2] Balanced corpus of contemporary written Japanese, Kikuo Maekawa, Makoto Yamazaki, Toshinobu Ogiso, Takehiko Maruyama, Hideki Ogura, Wakako Kashino, Hanae Koiso, Masaya Yamaguchi, Makiro Tanaka, and Yasuharu Den, Language Resources and Evaluation, Vol. 48, pp. 345-371, May 2014.

[3] 『現代日本語書き言葉均衡コーパス』形態論情報規程集(上)(下) 小椋秀樹, 小磯花絵, 冨士池優美, 宮内佐夜香, 小西光, and 原裕, 独立行政法人国立国語研究所, 2011.

Morphology

Features

No features are provided.

Syntax

Japanese syntactic dependency has the following properties.

Strictly Head Final: Bunsetsu-based dependencies in Japanese are strictly head final except for apposition and anastrophe (倒置).
Projective: Bunsetsu-based dependencies in Japanese are projective except for apposition and non-constituent conjunct coordinations (部分並列).
Arrow from modifier to head: In the Japanese NLP community, dependency arrows are drawn from modifier to head. This is the opposite of the convention used elsewhere in the world.
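
The first two properties can be checked mechanically. The following is a minimal sketch, not part of any UD or BCCWJ tool: it assumes a sentence is given as a list `heads`, where `heads[i]` is the 0-based index of the bunsetsu that unit `i` depends on, and a hypothetical marker `ROOT = -1` stands for the unit without a head. Projectivity is approximated here as the absence of crossing arcs, ignoring any artificial root arc.

```python
from typing import List

ROOT = -1  # hypothetical marker for the unit that has no head


def is_head_final(heads: List[int]) -> bool:
    """Return True if every non-root unit attaches to a unit on its right."""
    return all(h == ROOT or h > i for i, h in enumerate(heads))


def is_projective(heads: List[int]) -> bool:
    """Return True if no two dependency arcs cross each other."""
    # Store each arc as (left end, right end), regardless of direction.
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if h != ROOT]
    for a, b in arcs:
        for c, d in arcs:
            # Two arcs cross when exactly one end of the second arc
            # lies strictly inside the first arc.
            if a < c < b < d:
                return False
    return True


# Three bunsetsu where the first two both depend on the last one.
simple = [2, 2, ROOT]
# Four bunsetsu where the arcs 0 -> 2 and 1 -> 3 cross.
crossing = [2, 3, 3, ROOT]

print(is_head_final(simple), is_projective(simple))
print(is_head_final(crossing), is_projective(crossing))
```

The exceptions named above (apposition, anastrophe, non-constituent conjunct coordination) are exactly the cases where one of these checks would be expected to fail.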

We have several annotation schemas for dependency annotation. They are labelled but contain very limited syntactic information. Some syntactic labels in UD correspond to information found in case frame or semantic role annotation, which is only available in Japanese (see next section).

Conversion from BCCWJ-DepPara schema:

The BCCWJ-DepPara schema has two parts. The first is bunsetsu-based dependency annotation using four labels: D for normal dependency; F for a filler, a unit with no head, or a face mark; Z for a sentence boundary in nested sentences; and B for resolving a discrepancy between bunsetsu units. The second is annotation of nested coordination structure and apposition, as in ‘Coordination Annotation for the Penn Treebank’.
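
As an illustration of the four-label inventory, the sketch below models the labels as a Python enumeration and counts them over a sentence. The `(head, label)` pair representation and the names `DepParaLabel` and `count_labels` are assumptions made for this example; they do not reflect the actual BCCWJ-DepPara file format or any conversion tool.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Tuple


class DepParaLabel(Enum):
    """The four bunsetsu dependency labels described above."""
    D = "normal dependency"
    F = "filler, no head, or face mark"
    Z = "sentence boundary in nested sentences"
    B = "resolution of discrepancy between bunsetsu units"


def count_labels(arcs: Iterable[Tuple[int, str]]) -> Counter:
    """Count labels over (head index, label) pairs.

    Raises KeyError if a label is not one of D, F, Z, B.
    """
    counts: Counter = Counter()
    for _head, label in arcs:
        counts[DepParaLabel[label]] += 1
    return counts


# Hypothetical three-bunsetsu sentence: two normal dependencies,
# and a final unit without a head.
example = [(2, "D"), (2, "D"), (-1, "F")]
for label, n in count_labels(example).items():
    print(label.name, n, label.value)
```

Looking labels up through the enumeration means that an unexpected label fails loudly instead of being silently counted.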

Treebanks

There are five Japanese UD treebanks:
