UD Korean GSD
Language: Korean (code: ko)
Family: Korean
This treebank has been part of Universal Dependencies since the UD v2.0 release.
The following people have contributed to making this treebank part of UD: Ryan McDonald, Joakim Nivre, Daniel Zeman, Jinho Choi, Na-Rae Han, Jena Hwang, Jayeol Chun.
Repository: UD_Korean-GSD
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.14
License: CC BY-SA 4.0
Genre: news, blog
Questions, comments? General annotation questions (either Korean-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [jinho • choi (æt) emory • edu]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
| Annotation | Source |
| --- | --- |
| Lemmas | assigned by a program, not checked manually |
| UPOS | annotated manually in non-UD style, automatically converted to UD |
| XPOS | assigned by a program, not checked manually |
| Features | not available |
| Relations | annotated manually in non-UD style, automatically converted to UD |
The Google Korean Universal Dependency Treebank was first converted from the Universal Dependency Treebank v2.0 (legacy) and then enhanced by Chun et al. (2018).
This is a collaborative work by (in alphabetical order):
- Jinho Choi, Emory University
- Jayeol Chun, Emory University
- Na-Rae Han, University of Pittsburgh
- Jena D. Hwang, Institute for Human & Machine Cognition
- Ryan McDonald, Google Research
- Joakim Nivre, Uppsala University
- Daniel Zeman, Institute of Formal and Applied Linguistics
The project repository: https://github.com/emorynlp/ud-korean
Statistics of UD Korean GSD
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SYM – VERB – X
Relations
acl – acl:relcl – advcl – advmod – amod – appos – aux – case – cc – ccomp – compound – conj – cop – csubj – dep – det – det:poss – discourse – fixed – flat – iobj – list – mark – nmod – nmod:poss – nsubj – nsubj:pass – nummod – obj – obl – parataxis – punct – root – xcomp
Tokenization and Word Segmentation
- This corpus contains 6339 sentences and 80322 tokens.
- This corpus contains 12520 tokens (16%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 97 types of words that contain both letters and punctuation. Examples: L-29, L., M&A는, www.dynesketch.co.kr, <, 1-가, 10,057,170.7s.9d에, 3할-30홈런-100타점의, 8골-2도움의, A-B-A의, AFC-CONCACAF, AK-47, All-new, C&C사업부장, Cs-1, Double-Ugly, E-mail, Edmunds.com, FCG-1, Free-to-Play, G.는, HTCSense.com에, Hapaq-Lloyd, Hi-Los, J., K-IFRS, K-파워스, Mash-up, Multi-Layer, N., Next-Generation, P., PPAR-alpha라는, QH270-IPSB, R&D, S&P, S&P는, S-300, S-LCD, S-Oil과, S-Oil의, S., S.E.S.의, SK-II의, Semi-professional, T-850을, T-X와, U-20, U-City, UGH-BOOTS
- This corpus uses 16 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SYM, VERB, X
- This corpus does not use the following tags: SCONJ
- This corpus contains 13 word types tagged as particles (PART): 과는, 과도, 과의, 라는, 로는, 마저를, 보다도, 에게는, 에까지, 에는, 에도, 에서는, 와의
- This corpus contains 112 lemmas tagged as pronouns (PRON): 거기+는, 그, 그+가, 그+는, 그+들, 그+들+은, 그+들+을, 그+들+의, 그+들+이, 그+를, 그+만+의, 그+에게, 그+와+의, 그+의, 그거, 그것+ㄴ, 그것+은, 그것+을, 그것+이, 그녀, 그녀+가, 그녀+는, 그녀+들+의, 그녀+의, 그분+의, 나, 나+ㄴ, 나+가수, 나+는, 나+를, 나+만+의, 나+의, 나+이, 내+가, 너+의, 너희, 누+가, 누구, 누구+가, 누구+나, 누구+를, 누구+이+든지, 당신, 당신+도, 맨유+는, 무엇+을, 뭐, 뭐+가, 뭐+지, 셧다운제+가, 어디, 어디+가, 여기, 여기+ㄴ, 여기+가, 여기+는, 여기+를, 여기+만, 여러분+의, 우리, 우리+가, 우리+나라, 우리+는, 우리+도, 우리+들+의, 우리+땅, 우리+의, 울, 울+ㄹ, 이, 이+는, 이+들, 이+들+과, 이+들+에게, 이+들+은, 이+들+을, 이+들+의, 이+들+이, 이+를, 이+의, 이+조차+도, 이건, 이것+ㄴ, 이것+은, 이것+을, 이것+이, 이곳+은, 이곳+을, 이곳+이, 자기, 자기+부정, 자신+들+의, 자신+들+이, 자신+을, 자신+의, 자신+이, 저, 저+는, 저+도, 저+두, 저+의, 저기, 저도, 저희, 저희+도, 저희+들, 저희+를, 전, 절, 절+는, 제, 제+가
- This corpus contains 29 lemmas tagged as determiners (DET): 각, 그, 그런, 두, 몇, 몇몇, 모, 모든, 무슨, 뭔, 아무, 약, 양, 어느, 어떠+하+ㄴ, 어떤, 여느, 옛, 오+ㄴ, 올, 이, 이+들, 이러+하+ㄴ, 이런, 이번, 전, 제, 한, 현
- Out of the above, 5 lemmas occurred sometimes as PRON and sometimes as DET: 그, 이, 이+들, 전, 제
- This corpus contains 5 lemmas tagged as auxiliaries (AUX): 싶, 않, 이, 있, 하
- This corpus does not use the VerbForm feature.
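Counts like the ones above can be recomputed from a CoNLL-U file. The following is a minimal sketch under stated assumptions, not the script that produced this page: `SAMPLE` is a made-up toy corpus (not treebank data), and `corpus_stats` is a hypothetical helper name. It counts sentences, tokens, tokens marked `SpaceAfter=No`, and lemmas that occur with both the PRON and the DET tag.

```python
from collections import defaultdict

# Made-up toy corpus in CoNLL-U format (10 tab-separated columns per token).
# It is NOT taken from the treebank; it only mimics the "+"-joined lemma style.
SAMPLE = "\n".join([
    "# text = 그는 간다.",
    "1\t그는\t그+는\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\t간다\t가+ㄴ다\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
    "# text = 그 책",
    "1\t그\t그\tDET\t_\t_\t2\tdet\t_\t_",
    "2\t책\t책\tNOUN\t_\t_\t0\troot\t_\t_",
    "",
    "# text = 그 왔다",
    "1\t그\t그\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\t왔다\t오+았+다\tVERB\t_\t_\t0\troot\t_\t_",
    "",
])


def corpus_stats(text):
    """Collect basic tokenization and UPOS statistics from CoNLL-U text."""
    sentences = tokens = no_space_after = 0
    lemmas_by_upos = defaultdict(set)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        tokens += 1
        if cols[0] == "1":
            sentences += 1  # every sentence starts with token ID 1
        if "SpaceAfter=No" in cols[9].split("|"):
            no_space_after += 1
        lemmas_by_upos[cols[3]].add(cols[2])
    return {
        "sentences": sentences,
        "tokens": tokens,
        "no_space_after": no_space_after,
        "pron_and_det": sorted(lemmas_by_upos["PRON"] & lemmas_by_upos["DET"]),
    }


stats = corpus_stats(SAMPLE)
print(stats)

# The rounded share of tokens not followed by a space, from the figures above.
print(round(100 * 12520 / 80322))
```

On the real treebank files, the same function would be applied to the concatenated train, dev and test files; the file names are not given here.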
Nominal Features
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
- Card
  - DET: 한
  - NUM: 한, 두, 첫, 세, 하나, 1, 다섯, 하나는, 하나의, 네
  - PUNCT: ', ), ", 15
Other Features
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop): 이.
- This corpus uses 4 lemmas as auxiliaries (aux). Examples: 있, 하, 않, 싶.
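The copula and auxiliary lemma lists can be derived by grouping lemmas by dependency relation. This sketch assumes the input is already reduced to (lemma, relation) pairs; `ROWS` is a hypothetical hand-built sample that reuses the lemmas listed above, and `lemmas_by_relation` is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical (lemma, deprel) pairs; the lemmas are the ones listed above,
# but the rows themselves are constructed for illustration.
ROWS = [
    ("이", "cop"),
    ("있", "aux"),
    ("하", "aux"),
    ("않", "aux"),
    ("싶", "aux"),
    ("있", "aux"),    # repeated lemmas are counted once
    ("책", "obj"),    # other relations are ignored
]


def lemmas_by_relation(rows, relations=("cop", "aux")):
    """Return the sorted set of lemmas seen with each requested relation."""
    found = defaultdict(set)
    for lemma, deprel in rows:
        base = deprel.split(":")[0]  # fold subtypes such as aux:pass into aux
        if base in relations:
            found[base].add(lemma)
    return {rel: sorted(found[rel]) for rel in relations}


result = lemmas_by_relation(ROWS)
print(result)

# The union of both lists gives the lemmas that carry the AUX tag.
aux_tagged = sorted(set(result["cop"]) | set(result["aux"]))
print(len(aux_tagged), aux_tagged)
```

The union of the one copula lemma and the four auxiliary lemmas accounts for the five lemmas reported as tagged AUX.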
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (5505)
  - VERB--NOUN-ADP(까지) (1)
  - VERB--NOUN-ADP(는) (14)
  - VERB--NOUN-ADP(도) (3)
  - VERB--NOUN-ADP(안+이) (1)
  - VERB--NOUN-ADP(에서+는) (2)
  - VERB--NOUN-ADP(와) (1)
  - VERB--NOUN-ADP(은) (7)
  - VERB--NOUN-ADP(의) (2)
  - VERB--NOUN-ADP(이상+이) (1)
  - VERB--NOUN-ADP(정도+가) (1)
  - VERB--NOUN-ADP(정도+는) (1)
  - VERB--NOUN-ADP(중) (1)
  - VERB--NOUN-ADP(측+은) (1)
  - VERB--NOUN-ADP(측+이) (1)
  - VERB--PRON (285)
- obj
  - VERB--NOUN (5017)
  - VERB--NOUN-ADP(는) (1)
  - VERB--NOUN-ADP(도) (1)
  - VERB--NOUN-ADP(마저+를) (1)
  - VERB--NOUN-ADP(만) (1)
  - VERB--NOUN-ADP(시행+규칙) (1)
  - VERB--NOUN-ADP(어+재연) (1)
  - VERB--NOUN-ADP(을) (1)
  - VERB--NOUN-ADP(의) (3)
  - VERB--NOUN-ADP(이란) (1)
  - VERB--NOUN-ADP(정도+를) (1)
  - VERB--PRON (63)
- iobj
  - VERB--NOUN (93)
  - VERB--PRON (3)
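Pattern labels such as VERB--NOUN-ADP(는) can be approximated by walking each sentence's dependency tree. The page does not state the exact rule, so this sketch assumes that the suffix is the lemma of an ADP-tagged dependent of the child. `TOY_SENTENCE` is a made-up, pre-segmented example (not treebank data), and `argument_patterns` is a hypothetical helper name.

```python
from collections import Counter

# Made-up, pre-segmented toy sentence; each token has the fields needed here.
TOY_SENTENCE = [
    {"id": 1, "lemma": "학생", "upos": "NOUN", "head": 4, "deprel": "nsubj"},
    {"id": 2, "lemma": "는", "upos": "ADP", "head": 1, "deprel": "case"},
    {"id": 3, "lemma": "책", "upos": "NOUN", "head": 4, "deprel": "obj"},
    {"id": 4, "lemma": "읽", "upos": "VERB", "head": 0, "deprel": "root"},
]


def argument_patterns(sentence, relations=("nsubj", "obj", "iobj")):
    """Count (relation, pattern) pairs for VERB parents with NOUN/PRON children."""
    by_id = {tok["id"]: tok for tok in sentence}
    counts = Counter()
    for tok in sentence:
        parent = by_id.get(tok["head"])  # None for the root (head 0)
        if parent is None or tok["deprel"] not in relations:
            continue
        if parent["upos"] != "VERB" or tok["upos"] not in ("NOUN", "PRON"):
            continue
        # Assumption: the marker is the lemma of any ADP-tagged dependent.
        markers = [t["lemma"] for t in sentence
                   if t["head"] == tok["id"] and t["upos"] == "ADP"]
        pattern = f'{parent["upos"]}--{tok["upos"]}'
        if markers:
            pattern += f'-ADP({",".join(markers)})'
        counts[(tok["deprel"], pattern)] += 1
    return counts


counts = argument_patterns(TOY_SENTENCE)
for (relation, pattern), n in sorted(counts.items()):
    print(relation, pattern, n)
```

Summing the resulting counters over all sentences would give a table of the same shape as the one above, provided the assumed rule matches the one actually used.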
Relations Overview
- This corpus uses 4 relation subtypes: acl:relcl, det:poss, nmod:poss, nsubj:pass
- The following 7 relation types are not used in this corpus at all: vocative, expl, dislocated, clf, orphan, goeswith, reparandum
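The two counts above follow from the relation inventory listed for this treebank by simple set operations. In this sketch, `UNIVERSAL_RELATIONS` is the set of 37 universal relations defined by UD v2, and `USED` reproduces the labels listed on this page; the variable names are illustrative.

```python
# The 37 universal dependency relations of UD v2.
UNIVERSAL_RELATIONS = set("""
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj
dep det discourse dislocated expl fixed flat goeswith iobj list mark nmod
nsubj nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split())

# Relation labels listed for this treebank, including subtypes.
USED = """
acl acl:relcl advcl advmod amod appos aux case cc ccomp compound conj cop
csubj dep det det:poss discourse fixed flat iobj list mark nmod nmod:poss
nsubj nsubj:pass nummod obj obl parataxis punct root xcomp
""".split()

# Subtypes are the labels that carry a language-specific extension after ":".
subtypes = sorted(rel for rel in USED if ":" in rel)

# Unused relations: universal relations whose base label never occurs.
used_base = {rel.split(":")[0] for rel in USED}
unused = sorted(UNIVERSAL_RELATIONS - used_base)

print(len(subtypes), subtypes)
print(len(unused), unused)
```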