UD Shanghainese ShUD
Language: Shanghainese (code: wuu)
Family: Sino-Tibetan
This treebank has been part of Universal Dependencies since the UD v2.17 release.
The following people have contributed to making this treebank part of UD: Qizhen Yang.
Repository: UD_Shanghainese-ShUD
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.17
License: CC BY-SA 4.0
Genre: grammar-examples
Questions, comments? General annotation questions (either Shanghainese-specific or cross-linguistic) can be raised in the main UD issue tracker. Bugs in this treebank can be reported in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [qzyang • main (æt) gmail • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
| Annotation | Source |
|---|---|
| Lemmas | annotated manually |
| UPOS | annotated manually, natively in UD style |
| XPOS | not available |
| Features | annotated manually, natively in UD style |
| Relations | annotated manually, natively in UD style |
Description
UD Shanghainese-ShUD is the first UD treebank for Shanghainese, a Wu Chinese variety spoken by approximately 14 million people. The treebank is annotated from a corpus focused on daily-use speech, a representative sample of contemporary Shanghainese. For details on the annotation method and pipelines, see the paper. Sentences are randomly split into train, test, and dev sets at a ratio of 80%, 10%, and 10%, respectively.
Shanghainese includes several geographical and historical variants. The focus of this treebank is on Middle and New Period Urban Shanghainese.
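The 80/10/10 random split mentioned in the description can be illustrated with a short sketch. This is not the authors' script: the function name `split_sentences`, the fixed seed, and the rounding rule are assumptions made for illustration, so the resulting partition sizes need not match the released files.

```python
import random


def split_sentences(sentence_ids, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Shuffle sentence IDs and cut them into train/test/dev partitions.

    Hypothetical helper: the seed and rounding are illustrative choices.
    """
    ids = list(sentence_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle
    n_train = round(len(ids) * ratios[0])
    n_test = round(len(ids) * ratios[1])
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    dev = ids[n_train + n_test:]  # dev receives whatever remains
    return train, test, dev


# 983 is the sentence count reported in the statistics of this treebank.
train, test, dev = split_sentences(range(983))
print(len(train), len(test), len(dev))
```

Giving the remainder to the last partition guarantees that every sentence lands in exactly one set, even when the ratios do not divide the corpus size evenly.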
Acknowledgments
This treebank uses the open-source Scripted Chinese Shanghai Dialect Daily-use Speech Corpus by Magic Data, licensed under Creative Commons BY-NC-ND 4.0. Additional permission for derivative research was granted by Beijing Magic Data Technology Co., Ltd.
Statistics of UD Shanghainese ShUD
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – VERB
Features
Relations
acl – advcl – advmod – amod – appos – aux – case – cc – ccomp – clf – compound – conj – cop – csubj – dep – det – discourse – dislocated – flat – iobj – mark – nmod – nsubj – nummod – obj – obl – orphan – parataxis – punct – reparandum – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 983 sentences and 8584 tokens.
- This corpus contains 8582 tokens (100%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus does not contain words that contain both letters and punctuation.
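Counts like the ones above can be approximated directly from a CoNLL-U file. The sketch below is a simplified illustration, not the script that generated this page: `token_stats` is a hypothetical helper, the sample sentence is a made-up toy and not taken from the treebank, and only basic word lines are counted (multiword token ranges and empty nodes are skipped).

```python
def token_stats(conllu_text):
    """Return (sentences, tokens, tokens not followed by a space)."""
    sentences = tokens = no_space = 0
    for block in conllu_text.strip().split("\n\n"):
        rows = [line.split("\t") for line in block.splitlines()
                if line and not line.startswith("#")]
        # Keep plain integer IDs only: skips ranges (1-2) and empty nodes (1.1).
        rows = [r for r in rows if r[0].isdigit()]
        if not rows:
            continue
        sentences += 1
        tokens += len(rows)
        # MISC is the tenth column; its attributes are separated by "|".
        no_space += sum("SpaceAfter=No" in r[9].split("|") for r in rows)
    return sentences, tokens, no_space


# Toy sentence (invented for illustration, not from the treebank).
SAMPLE = "\n".join([
    "# sent_id = toy-1",
    "# text = 侬好。",
    "\t".join(["1", "侬", "侬", "PRON", "_", "_", "2", "nsubj", "_", "SpaceAfter=No"]),
    "\t".join(["2", "好", "好", "ADJ", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", "。", "。", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

print(token_stats(SAMPLE))
```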
Morphology
Tags
- This corpus uses 15 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB
- This corpus does not use the following tags: SYM, X
- This corpus contains 74 word types tagged as particles (PART): 么, 乖, 了, 事体, 们, 伐, 伐啦, 伐着, 切, 勒, 吖, 吗, 吧, 呀, 呃, 呢, 呵呵, 咯, 哈, 哎呀, 哦, 哪, 哼哼, 唻, 啊, 啦, 喃, 喔, 嗯, 嘛, 嘞, 噢, 声, 大, 好, 妈, 娘, 小, 屋, 屋里相, 市, 强, 得, 德, 心, 恩, 戆, 拉, 拧, 撒, 是伐, 晚, 毒, 毛病, 水, 没, 没有, 浴, 点, 爷, 的, 眼, 纸, 老, 者, 药, 谢谢, 辣, 边, 还是, 闲话, 额, 高, 麽
- This corpus contains 56 lemmas tagged as pronouns (PRON): 一下, 一记, 人家, 伊, 伊拉, 你, 侬, 侬老, 则, 别呃, 吾, 吾呃, 哦, 哪能, 哪里, 哪里的, 啊里, 啊里个, 啊里搭, 啊里的, 啥, 啥么, 埃样, 埃能噶, 埃里的, 埃面搭, 埃面的, 弄, 我, 搿, 搿就, 搿得, 搿搭, 搿歇, 搿的, 搿眼, 搿能嘎, 搿能噶, 搿能葛, 搿那噶, 撒, 撒么, 撒拧, 撒物, 砸样, 自家, 葛, 讲侬, 该, 该哪, 还有天, 那, 那俩, 阿拉, 阿里, 阿里的
- This corpus contains 34 lemmas tagged as determiners (DET): 一个, 一切, 一种, 下趟, 个饿, 二孰, 任何, 其他, 其它, 埃呃, 埃有, 埃饿, 所有, 搿, 搿些, 搿则, 搿副, 搿只, 搿呃, 搿埃, 搿种, 有呃, 某些, 每个, 每刻, 每天, 每时, 每趟, 每躺, 艾饿, 葛, 该, 这, 那
- Out of the above, 4 lemmas occurred sometimes as PRON and sometimes as DET: 搿, 葛, 该, 那
- This corpus contains 69 lemmas tagged as auxiliaries (AUX): 不想, 不该, 之, 也是, 也要, 了, 伐, 伐会, 伐会得, 伐会的, 伐可, 伐可以, 伐好, 伐得, 伐想, 伐敢, 伐是, 伐有, 伐来塞, 伐用, 伐着, 伐能, 伐要, 会, 会得, 会的, 侪会, 侪是, 再, 再能, 勒, 古, 只会得, 可, 可以, 可能, 咯, 哪能, 啊, 啊要, 在能, 好, 好了, 就是, 就要, 应该, 必须, 想, 愿意, 是, 有, 来塞, 歇, 特, 着, 能, 舍伐得, 要, 辣, 辣搿, 辣海, 辣该, 过, 过特, 还好, 还是, 还能, 还要, 非要
- Out of the above, 24 lemmas occurred sometimes as AUX and sometimes as VERB: 也是, 了, 伐, 伐得, 伐想, 伐是, 伐有, 伐用, 伐着, 会, 勒, 好了, 就是, 想, 是, 有, 来塞, 歇, 特, 要, 辣, 辣海, 辣该, 过
- This corpus does not use the VerbForm feature.
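The tag inventory and the overlap lists above (lemmas seen under two different tags) amount to simple set operations over (lemma, UPOS) pairs. The following is a minimal sketch under that reading. `tag_report` is a hypothetical helper, and the toy pairs are a handful of lemma/tag combinations picked from the lists on this page, not the full corpus.

```python
from collections import defaultdict

# The 17 universal part-of-speech tags defined by UD.
UPOS_ALL = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def tag_report(pairs, tag_a, tag_b):
    """Return (unused UPOS tags, lemmas attested with both tag_a and tag_b)."""
    lemmas_by_tag = defaultdict(set)
    for lemma, upos in pairs:
        lemmas_by_tag[upos].add(lemma)
    unused = sorted(UPOS_ALL - set(lemmas_by_tag))
    both = sorted(lemmas_by_tag.get(tag_a, set()) & lemmas_by_tag.get(tag_b, set()))
    return unused, both


# Toy subset of (lemma, UPOS) pairs for illustration.
PAIRS = [("搿", "PRON"), ("搿", "DET"), ("侬", "PRON"), ("一切", "DET"), ("是", "AUX")]

unused, both = tag_report(PAIRS, "PRON", "DET")
print(unused)
print(both)
```

On the full treebank, the same computation would be expected to report SYM and X as unused and to reproduce the PRON/DET and AUX/VERB overlap lists.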
Nominal Features
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop): 是.
- This corpus uses 64 lemmas as auxiliaries (aux). Examples: 了, 要, 想, 伐要, 会, 伐想, 会的, 勒, 好, 伐是, 过, 伐能, 可能, 着, 能, 还要, 伐会的, 可以, 是, 伐, 会得, 就是, 辣该, 还是, 伐会, 伐会得, 伐可, 可, 哪能, 应该, 有, 特, 还好, 不该, 之, 也是, 也要, 伐可以, 伐好, 伐得, 伐敢, 伐用, 伐着, 侪会, 侪是, 再, 再能, 古, 只会得, 咯.
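Lists like these can be derived by grouping lemmas by their dependency relation. The sketch below assumes that subtypes (for example `aux:pass`) should be folded into the base relation and that lemmas are ordered by frequency; both are illustrative assumptions. `lemmas_by_relation` is a hypothetical helper and the toy rows carry invented counts.

```python
from collections import Counter


def lemmas_by_relation(rows, relation):
    """List distinct lemmas attached via `relation`, most frequent first."""
    counts = Counter(
        lemma for lemma, deprel in rows
        if deprel.split(":")[0] == relation  # fold subtypes into the base relation
    )
    return [lemma for lemma, _ in counts.most_common()]


# Toy (lemma, deprel) rows; the frequencies are invented for illustration.
ROWS = [("是", "cop"), ("了", "aux"), ("要", "aux"), ("了", "aux"), ("是", "aux")]

print(lemmas_by_relation(ROWS, "cop"))
print(lemmas_by_relation(ROWS, "aux"))
```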
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
- VERB--NOUN (99)
- VERB--NOUN-ADP(哴向) (1)
- VERB--PRON (553)
- VERB--PRON-ADP(之间) (1)
- VERB--PRON-ADP(对) (1)
- obj
- VERB--NOUN (326)
- VERB--NOUN-ADP(上) (1)
- VERB--NOUN-ADP(呢) (1)
- VERB--NOUN-ADP(大) (1)
- VERB--NOUN-ADP(好) (1)
- VERB--NOUN-ADP(小) (1)
- VERB--NOUN-ADP(老) (1)
- VERB--NOUN-ADP(里向) (1)
- VERB--NOUN-ADP(高) (1)
- VERB--PRON (261)
- VERB--PRON-ADP(呃) (2)
- VERB--PRON-ADP(好) (1)
- iobj
- VERB--NOUN (1)
- VERB--PRON (9)
- VERB--PRON-ADP(呃) (1)
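The pattern labels above combine the parent's tag, the child's tag, and any adposition attached to the child. A minimal sketch of how such labels could be counted for one tree follows. It assumes that adpositions are identified as ADP children of the argument attached via `case`, which may differ from the rule used to generate this page. `argument_patterns` is a hypothetical helper and the toy tree is invented.

```python
from collections import Counter

CORE = {"nsubj", "obj", "iobj"}


def argument_patterns(sentence):
    """Count VERB--NOUN/PRON patterns in one tree.

    `sentence` is a list of (id, lemma, upos, head, deprel) tuples.
    """
    by_id = {tok[0]: tok for tok in sentence}
    patterns = Counter()
    for tid, lemma, upos, head, deprel in sentence:
        parent = by_id.get(head)
        if deprel not in CORE or upos not in {"NOUN", "PRON"}:
            continue
        if parent is None or parent[2] != "VERB":
            continue
        # Assumption: adpositions are ADP children of the argument via `case`.
        adps = sorted(t[1] for t in sentence
                      if t[3] == tid and t[2] == "ADP" and t[4] == "case")
        label = f"VERB--{upos}" + "".join(f"-ADP({a})" for a in adps)
        patterns[(deprel, label)] += 1
    return patterns


# Toy tree (invented for illustration, not from the treebank).
TREE = [
    (1, "吾", "PRON", 2, "nsubj"),
    (2, "吃", "VERB", 0, "root"),
    (3, "饭", "NOUN", 2, "obj"),
    (4, "上", "ADP", 3, "case"),
]

for (deprel, label), n in sorted(argument_patterns(TREE).items()):
    print(deprel, label, n)
```

Summing the counters over all trees would give a table in the same shape as the one above.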