
This page pertains to UD version 2.


UD Japanese GSD

Language: Japanese (code: ja)
Family: Japanese

This treebank has been part of Universal Dependencies since the UD v1.4 release.

The following people have contributed to making this treebank part of UD: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Ryan McDonald, Joakim Nivre, Daniel Zeman.

Repository: UD_Japanese-GSD
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.18

License: CC BY-SA 4.0

Genre: news, blog

Questions, comments? General annotation questions (either Japanese-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on Github. If you want to collaborate, please contact [hkana (æt) jp • ibm • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.

Annotation	Source
Lemmas	annotated manually in non-UD style, automatically converted to UD
UPOS	annotated manually in non-UD style, automatically converted to UD
XPOS	annotated manually
Features	not available
Relations	annotated manually in non-UD style, automatically converted to UD

Description

This Universal Dependencies (UD) Japanese treebank follows the UD Japanese conventions described in the UD documentation. The original sentences are from Google UDT 2.0.

The Japanese UD treebank contains the sentences from Google Universal Dependency Treebanks v2.0 (legacy): https://github.com/ryanmcd/uni-dep-tb. First, Google UDT v2.0 was converted to UD style with bunsetsu-based word units (referred to below as the “master” corpus).

The word units in the “master” corpus differed significantly from the definition in the documentation, which is based on Short Unit Words (SUW) [1], so the sentences were automatically re-processed by Hiroshi Kanayama in February 2017. The result was UD_Japanese v2.0, which was used in the CoNLL 2017 shared task. In November 2017, UD_Japanese v2.0 was merged with the “master” data so that the manual dependency annotations would be reflected in the corpus. This reduced the errors in the dependency structures and relation labels.

Slight differences in word units remain between UD_Japanese v2.1 and UD_Japanese-KTC 1.3.

In May 2020, a conversion method similar to that of UD_Japanese BCCWJ [3] was introduced for UD_Japanese GSD v2.6.

Acknowledgments

The original treebank was provided by:

Adam LaMontagne
Milan Souček
Timo Järvinen
Alessandra Radici

via

Dan Zeman.

The corpus was converted by:

Mai Omura
Yusuke Miyao
Hiroshi Kanayama
Hiroshi Matsuda

through annotation, discussion and validation with

Aya Wakasa
Kayo Yamashita
Masayuki Asahara
Takaaki Tanaka
Yugo Murawaki
Yuji Matsumoto
Kaoru Ito
Taishi Chika
Shinsuke Mori
Sumire Uematsu

Statistics of UD Japanese GSD

POS Tags

ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB

Features

Polarity

Relations

acl – advcl – advmod – amod – aux – case – cc – ccomp – compound – conj – cop – csubj – csubj:outer – dep – det – discourse – fixed – iobj – mark – nmod – nsubj – nsubj:outer – nummod – obj – obl – punct – root

Tokenization and Word Segmentation

This corpus contains 8100 sentences and 193654 tokens.

This corpus contains 185312 tokens (96%) that are not followed by a space.

This corpus does not contain words with spaces.
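
Counts like these can be reproduced from the CoNLL-U files of the treebank. The following is a minimal sketch, assuming that a missing following space is recorded as SpaceAfter=No in the MISC column (as the CoNLL-U format specifies); the function name `token_stats` and the three-word sample are illustrative and not taken from the treebank.

```python
def token_stats(conllu_text):
    """Return (sentences, words, words marked SpaceAfter=No)."""
    sentences = words = no_space = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False          # a blank line ends a sentence
            continue
        if line.startswith("#"):
            continue                     # sentence-level comment
        cols = line.split("\t")
        # Skip multiword-token ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        words += 1
        if "SpaceAfter=No" in cols[9].split("|"):
            no_space += 1
    return sentences, words, no_space


# Hypothetical one-sentence sample in CoNLL-U layout.
SAMPLE = "\n".join([
    "# text = 猫が寝る。",
    "1\t猫\t猫\tNOUN\t_\t_\t3\tnsubj\t_\tSpaceAfter=No",
    "2\tが\tが\tADP\t_\t_\t1\tcase\t_\tSpaceAfter=No",
    "3\t寝る\t寝る\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "4\t。\t。\tPUNCT\t_\t_\t3\tpunct\t_\tSpaceAfter=No",
    "",
])

print(token_stats(SAMPLE))
# Share of tokens not followed by a space, from the figures reported above.
print(round(100 * 185312 / 193654))
```

Because Japanese is written without spaces between words, a high share of SpaceAfter=No tokens is expected.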

This corpus contains 32 types of words that contain both letters and punctuation. Examples: Wi-Fi, 一、二塁, 80’s, D.C., E.T., IT’S, L.E.D., L’Arc, L’Orateur, MR., No., O’Malley, ぼ・っ・ち・, ウィルダネス・タバーン, エル・ドラード, カハ・デュエロ, カムデン・ヤーズ, パ—マ, パパ’S, ピーク・ウィルダーネス, ブローム・ウント・フォス, ベア・ビュット, ベル・フーシェ, ペルリー・セルトゥ, マハメド・スベール, マリィ・トロステネツ, メ〜テレ, ラ・サール, ル・マン, ロギー・バイユー, ・ふ・た・り・, 一、三塁
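
One plausible way to detect such words is to check Unicode general categories. This sketch assumes that "letter" means any category starting with L and "punctuation" any category starting with P; the page does not state the exact criterion, and the function name is hypothetical.

```python
import unicodedata


def mixes_letters_and_punctuation(word):
    """True if the word has at least one letter and one punctuation character."""
    categories = [unicodedata.category(ch) for ch in word]
    has_letter = any(cat.startswith("L") for cat in categories)
    has_punct = any(cat.startswith("P") for cat in categories)
    return has_letter and has_punct


for word in ["Wi-Fi", "D.C.", "ル・マン", "東京", "..."]:
    print(word, mixes_letters_and_punctuation(word))
```

Note that the prolonged sound mark ー is a letter (category Lm) under this test, so a word such as ピーク alone would not qualify.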

Morphology

Nominal Features

Degree and Polarity

Polarity

Neg
- AUX: ない, ず, ん, なかっ, なく, なけれ, ざる, ぬ, なきゃ, な
- NOUN: 不, 非, 反, 無, なし, 未, ナシ, 異
- SCONJ: ず
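
A listing like the one above can be approximated by grouping word forms by UPOS for every word whose FEATS column contains Polarity=Neg. This is a sketch under the assumption that the listed items are surface forms (column 2 of CoNLL-U); `neg_forms_by_upos` and the sample are illustrative only.

```python
from collections import defaultdict


def neg_forms_by_upos(conllu_text):
    """Map each UPOS tag to the set of forms carrying Polarity=Neg."""
    forms = defaultdict(set)
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue                     # comments, blanks, ranges, empty nodes
        form, upos, feats = cols[1], cols[3], cols[5]
        if "Polarity=Neg" in feats.split("|"):
            forms[upos].add(form)
    return dict(forms)


# Hypothetical two-word sample: 行か + ない.
NEG_SAMPLE = "\n".join([
    "1\t行か\t行く\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "2\tない\tない\tAUX\t_\tPolarity=Neg\t1\taux\t_\tSpaceAfter=No",
    "",
])

print(neg_forms_by_upos(NEG_SAMPLE))
```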

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

This corpus uses 2 lemmas as copulas (cop). Examples: だ, です.

This corpus uses 41 lemmas as auxiliaries (aux). Examples: た, 為る, れる, だ, ます, ない, られる, です, ず, 様, せる, 出来る, たい, てる, そう, 易い, べし, らしい, みたい, 下さる, ちゃう, 難い, 致す, させる, 頂く, なり, つう, 無い, たり, 辛い, ごとし, まじ, や, 為さる, がましい, じゃ, たがる, てく, とく, まい, む.
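
The lemma inventories for cop and aux can be collected by counting the LEMMA column for words attached with the given relation. A minimal sketch, assuming well-formed CoNLL-U input; the helper name `lemma_counts` and the sample sentence are hypothetical.

```python
from collections import Counter


def lemma_counts(conllu_text, relation):
    """Count lemmas of words whose DEPREL equals the given relation."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) == 10 and cols[0].isdigit() and cols[7] == relation:
            counts[cols[2]] += 1
    return counts


# Hypothetical sample: 猫 + だっ (copula) + た (auxiliary).
COP_SAMPLE = "\n".join([
    "1\t猫\t猫\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "2\tだっ\tだ\tAUX\t_\t_\t1\tcop\t_\tSpaceAfter=No",
    "3\tた\tた\tAUX\t_\t_\t1\taux\t_\tSpaceAfter=No",
    "",
])

print(lemma_counts(COP_SAMPLE, "cop").most_common())
print(lemma_counts(COP_SAMPLE, "aux").most_common())
```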

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

nsubj
- VERB--NOUN-ADP(が) (2515)
- VERB--NOUN-ADP(と) (86)
- VERB--NOUN-ADP(は) (1503)
- VERB--NOUN-ADP(も) (557)
- VERB--PRON-ADP(が) (46)
- VERB--PRON-ADP(と) (1)
- VERB--PRON-ADP(は) (91)
- VERB--PRON-ADP(も) (48)

obj
- VERB--NOUN-ADP(か)-ADP(を) (2)
- VERB--NOUN-ADP(だけ)-ADP(を) (2)
- VERB--NOUN-ADP(と) (104)
- VERB--NOUN-ADP(と)-ADP(か)-ADP(を) (1)
- VERB--NOUN-ADP(など)-ADP(を) (48)
- VERB--NOUN-ADP(に)-ADP(を) (2)
- VERB--NOUN-ADP(の)-ADP(の)-ADP(を) (1)
- VERB--NOUN-ADP(の)-ADP(を) (9)
- VERB--NOUN-ADP(のみ)-ADP(を) (3)
- VERB--NOUN-ADP(まで)-ADP(を) (2)
- VERB--NOUN-ADP(も) (1)
- VERB--NOUN-ADP(を) (4427)
- VERB--NOUN-ADP(を)-ADP(も) (2)
- VERB--PRON-ADP(か)-ADP(を) (5)
- VERB--PRON-ADP(まで)-ADP(を) (1)
- VERB--PRON-ADP(を) (78)

iobj
- VERB--NOUN-ADP(も) (1)
- VERB--NOUN-ADP(を) (14)
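
The pattern labels above can be read as: parent UPOS, child UPOS, then the child's case markers in surface order. The sketch below builds such labels under that reading, assuming that the markers are ADP dependents attached with the case relation and that the input is valid CoNLL-U; `argument_patterns` and the sample sentence are illustrative, not the tool used to produce this page.

```python
from collections import Counter


def argument_patterns(conllu_text, relation):
    """Count VERB--NOUN/PRON patterns with their ADP case markers."""
    counts = Counter()
    for block in conllu_text.strip().split("\n\n"):
        words = {}
        for line in block.splitlines():
            cols = line.split("\t")
            if len(cols) == 10 and cols[0].isdigit():
                words[int(cols[0])] = cols
        for wid, cols in words.items():
            if cols[7] != relation or cols[3] not in ("NOUN", "PRON"):
                continue
            parent = words.get(int(cols[6]))
            if parent is None or parent[3] != "VERB":
                continue
            # Case markers of this argument, in surface order.
            markers = [c[1] for _, c in sorted(words.items())
                       if int(c[6]) == wid and c[7] == "case" and c[3] == "ADP"]
            label = "VERB--" + cols[3] + "".join(f"-ADP({m})" for m in markers)
            counts[label] += 1
    return counts


# Hypothetical sample: 猫が魚を食べる。
ARG_SAMPLE = "\n".join([
    "# text = 猫が魚を食べる。",
    "1\t猫\t猫\tNOUN\t_\t_\t5\tnsubj\t_\tSpaceAfter=No",
    "2\tが\tが\tADP\t_\t_\t1\tcase\t_\tSpaceAfter=No",
    "3\t魚\t魚\tNOUN\t_\t_\t5\tobj\t_\tSpaceAfter=No",
    "4\tを\tを\tADP\t_\t_\t3\tcase\t_\tSpaceAfter=No",
    "5\t食べる\t食べる\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "6\t。\t。\tPUNCT\t_\t_\t5\tpunct\t_\tSpaceAfter=No",
    "",
])

print(argument_patterns(ARG_SAMPLE, "nsubj"))
print(argument_patterns(ARG_SAMPLE, "obj"))
```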

Relations Overview

This corpus uses 2 relation subtypes: csubj:outer, nsubj:outer
The following 12 relation types are not used in this corpus at all: xcomp, vocative, expl, dislocated, appos, clf, flat, list, parataxis, orphan, goeswith, reparandum
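
These two figures follow from comparing the relations listed under Relations with the 37 universal relations of UD v2. A short sketch of that comparison; the variable and function names are illustrative.

```python
# The 37 universal dependency relations of UD v2.
UNIVERSAL_RELATIONS = set("""
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj
dep det discourse dislocated expl fixed flat goeswith iobj list mark nmod
nsubj nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split())

# Relations listed for this treebank in the Relations section.
USED_RELATIONS = """
acl advcl advmod amod aux case cc ccomp compound conj cop csubj csubj:outer
dep det discourse fixed iobj mark nmod nsubj nsubj:outer nummod obj obl
punct root
""".split()


def summarize_relations(used):
    """Return (subtypes in use, universal relations never used)."""
    subtypes = sorted(rel for rel in used if ":" in rel)
    base_types = {rel.split(":")[0] for rel in used}
    unused = sorted(UNIVERSAL_RELATIONS - base_types)
    return subtypes, unused


subtypes, unused = summarize_relations(USED_RELATIONS)
print(subtypes)
print(len(unused), unused)
```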